

# Use SerDes
<a name="serde-reference"></a>

Athena supports several SerDe (Serializer/Deserializer) libraries that parse data from a variety of data formats. When you create a table in Athena, you can specify a SerDe that corresponds to the format that your data is in. Athena does not support custom SerDes. 

Athena can use SerDe libraries to create tables from data in CSV, TSV, custom-delimited, and JSON formats; from the Hadoop-related formats ORC, Avro, and Parquet; and from Logstash logs, AWS CloudTrail logs, and Apache WebServer logs. Each of these data formats has one or more SerDe libraries that Athena can use to parse the data.

**Note**  
The formats listed in this section are used by Athena for reading data. For information about formats that Athena uses for writing data when it runs CTAS queries, see [Create a table from query results (CTAS)](ctas.md).

**Topics**
+ [Choose a SerDe for your data](supported-serdes.md)
+ [Use a SerDe to create a table](serde-create-a-table.md)
+ [Amazon Ion Hive SerDe](ion-serde.md)
+ [Avro SerDe](avro-serde.md)
+ [Grok SerDe](grok-serde.md)
+ [JSON SerDe libraries](json-serde.md)
+ [CSV SerDe libraries](serde-csv-choices.md)
+ [ORC SerDe](orc-serde.md)
+ [Parquet SerDe](parquet-serde.md)
+ [Regex SerDe](regex-serde.md)

# Choose a SerDe for your data
<a name="supported-serdes"></a>

The following table lists the data formats supported in Athena and their corresponding SerDe libraries.


**Supported data formats and SerDes**  

| Data format | Description | SerDe types supported in Athena | 
| --- | --- | --- | 
| Amazon Ion | Amazon Ion is a richly-typed, self-describing data format that is a superset of JSON, developed and open-sourced by Amazon. | Use the [Amazon Ion Hive SerDe](ion-serde.md). | 
|  Apache Avro  |  A format for storing data in Hadoop that uses JSON-based schemas for record values.  |  Use the [Avro SerDe](avro-serde.md).  | 
|  Apache Parquet  |  A format for columnar storage of data in Hadoop.  |  Use the [Parquet SerDe](parquet-serde.md) and SNAPPY compression.  | 
|  Apache WebServer logs  |  A format for storing logs in Apache WebServer.  |  Use the [Grok SerDe](grok-serde.md) or [Regex SerDe](regex-serde.md).  | 
|  CloudTrail logs  |  A format for storing logs in CloudTrail.  |  Use the CloudTrail SerDe.  | 
|  CSV (Comma-Separated Values)  |  For data in CSV, each line represents a data record, and each record consists of one or more fields, separated by commas.  |  If your data contains values enclosed in double quotes, use the [Open CSV SerDe for processing CSV](csv-serde.md). Otherwise, use the [Lazy Simple SerDe for CSV, TSV, and custom-delimited files](lazy-simple-serde.md).  | 
|  Custom-Delimited  |  For data in this format, each line represents a data record, and records are separated by a custom single-character delimiter.  |  Use the [Lazy Simple SerDe for CSV, TSV, and custom-delimited files](lazy-simple-serde.md) and specify a custom single-character delimiter.  | 
|  JSON (JavaScript Object Notation)  |  For JSON data, each line represents a data record, and each record consists of attribute-value pairs and arrays, separated by commas.  |  Use the [JSON SerDe libraries](json-serde.md).  | 
|  Logstash logs  |  A format for storing logs in Logstash.  |  Use the [Grok SerDe](grok-serde.md).  | 
|  ORC (Optimized Row Columnar)  |  A format for optimized columnar storage of Hive data.  |  Use the [ORC SerDe](orc-serde.md) and ZLIB compression.  | 
|  TSV (Tab-Separated Values)  |  For data in TSV, each line represents a data record, and each record consists of one or more fields, separated by tabs.  |  Use the [Lazy Simple SerDe for CSV, TSV, and custom-delimited files](lazy-simple-serde.md) and specify the separator character as `FIELDS TERMINATED BY '\t'`.  | 
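As a minimal sketch of the TSV case, the following `CREATE TABLE` statement uses `ROW FORMAT DELIMITED` (which defaults to the LazySimpleSerDe) with a tab separator. The table name, columns, and bucket path are hypothetical; replace them with your own.

```
-- Hypothetical TSV table; replace the columns and LOCATION with your own.
CREATE EXTERNAL TABLE example_tsv (
    id INT,
    name STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LOCATION 's3://amzn-s3-demo-bucket/tsv/'
```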

# Use a SerDe to create a table
<a name="serde-create-a-table"></a>

To use a SerDe when creating a table in Athena, use one of the following methods:
+ Specify `ROW FORMAT DELIMITED` and then use DDL statements to specify field delimiters, as in the following example. When you specify `ROW FORMAT DELIMITED`, Athena uses the LazySimpleSerDe by default.

  ```
  ROW FORMAT DELIMITED 
  FIELDS TERMINATED BY ','
  ESCAPED BY '\\'
  COLLECTION ITEMS TERMINATED BY '|'
  MAP KEYS TERMINATED BY ':'
  ```

  For examples of `ROW FORMAT DELIMITED`, see the following topics:

  [Lazy Simple SerDe for CSV, TSV, and custom-delimited files](lazy-simple-serde.md)

  [Query Amazon CloudFront logs](cloudfront-logs.md)

  [Query Amazon EMR logs](emr-logs.md)

  [Query Amazon VPC flow logs](vpc-flow-logs.md)

  [Use CTAS and INSERT INTO for ETL and data analysis](ctas-insert-into-etl.md)
+ Use `ROW FORMAT SERDE` to explicitly specify the type of SerDe that Athena should use when it reads and writes data to the table. The following example specifies the LazySimpleSerDe. To specify the delimiters, use `WITH SERDEPROPERTIES`. The properties specified by `WITH SERDEPROPERTIES` correspond to the separate statements (like `FIELDS TERMINATED BY`) in the `ROW FORMAT DELIMITED` example.

  ```
  ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
  WITH SERDEPROPERTIES (
  'serialization.format' = ',',
  'field.delim' = ',',
  'collection.delim' = '|',
  'mapkey.delim' = ':',
  'escape.delim' = '\\'
  )
  ```

  For examples of `ROW FORMAT SERDE`, see the following topics:

  [Avro SerDe](avro-serde.md)

  [Grok SerDe](grok-serde.md)

  [JSON SerDe libraries](json-serde.md)

  [Open CSV SerDe for processing CSV](csv-serde.md)

  [Regex SerDe](regex-serde.md)
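Putting the second method together, a complete `CREATE TABLE` statement that specifies the LazySimpleSerDe with `ROW FORMAT SERDE` and `WITH SERDEPROPERTIES` might look like the following sketch. The table name, columns, and bucket path are hypothetical.

```
-- Hypothetical example combining ROW FORMAT SERDE and WITH SERDEPROPERTIES.
CREATE EXTERNAL TABLE example_serde_table (
    id INT,
    tags ARRAY<STRING>,
    attributes MAP<STRING, STRING>
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
    'field.delim' = ',',
    'collection.delim' = '|',
    'mapkey.delim' = ':',
    'escape.delim' = '\\'
)
LOCATION 's3://amzn-s3-demo-bucket/data/'
```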

# Amazon Ion Hive SerDe
<a name="ion-serde"></a>

You can use the Amazon Ion Hive SerDe to query data stored in [Amazon Ion](https://amzn.github.io/ion-docs/guides/cookbook.html) format. Amazon Ion is a richly-typed, self-describing, open source data format. The Amazon Ion format is used in the open source SQL query language [PartiQL](https://partiql.org/).

Amazon Ion has binary and text formats that are interchangeable. This feature combines the ease of use of text with the efficiency of binary encoding.

To query Amazon Ion data from Athena, you can use the [Amazon Ion Hive SerDe](https://github.com/amzn/ion-hive-serde), which serializes and deserializes Amazon Ion data. Deserialization allows you to run queries on the Amazon Ion data or read it for writing out into a different format like Parquet or ORC. Serialization lets you generate data in the Amazon Ion format by using `CREATE TABLE AS SELECT` (CTAS) or `INSERT INTO` queries to copy data from existing tables.

**Note**  
Because Amazon Ion is a superset of JSON, you can use the Amazon Ion Hive SerDe to query non-Amazon Ion JSON datasets. Unlike other [JSON SerDe libraries](json-serde.md), the Amazon Ion SerDe does not expect each row of data to be on a single line. This feature is useful if you want to query JSON datasets that are in "pretty print" format or otherwise break up the fields in a row with newline characters.

For additional information and examples of querying Amazon Ion with Athena, see [Analyze Amazon Ion datasets using Amazon Athena](https://aws.amazon.com/blogs/big-data/analyze-amazon-ion-datasets-using-amazon-athena/).

## Serialization library name
<a name="library-name"></a>

The serialization library name for the Amazon Ion SerDe is `com.amazon.ionhiveserde.IonHiveSerDe`. For source code information, see [Amazon Ion Hive SerDe](https://github.com/amazon-ion/ion-hive-serde) on GitHub.com.

## Considerations and limitations
<a name="ion-serde-considerations-and-limitations"></a>
+ **Duplicated fields** – Amazon Ion structs are ordered and support duplicated fields, while Hive's `STRUCT<>` and `MAP<>` do not. Thus, when you deserialize a duplicated field from an Amazon Ion struct, a single value is chosen non-deterministically and the others are ignored.
+ **External symbol tables unsupported** – Currently, Athena does not support external symbol tables or the following Amazon Ion Hive SerDe properties:
  + `ion.catalog.class`
  + `ion.catalog.file`
  + `ion.catalog.url`
  + `ion.symbol_table_imports`
+ **File extensions** – Amazon Ion uses file extensions to determine which compression codec to use for deserializing Amazon Ion files. As such, compressed files must have the file extension that corresponds to the compression algorithm used. For example, if ZSTD is used, corresponding files should have the extension `.zst`.
+ **Homogeneous data** – Amazon Ion has no restrictions on the data types that can be used for values in particular fields. For example, two different Amazon Ion documents might have a field with the same name that have different data types. However, because Hive uses a schema, all values that you extract to a single Hive column must have the same data type.
+ **Map key type restrictions** – When you serialize data from another format into Amazon Ion, ensure that the map key type is one of `STRING`, `VARCHAR`, or `CHAR`. Although Hive allows you to use any primitive data type as a map key, [Amazon Ion symbols](https://amzn.github.io/ion-docs/docs/symbols.html) must be a string type.
+ **Union type** – Athena does not currently support the Hive [union type](https://cwiki.apache.org/confluence/display/hive/languagemanual+types/#LanguageManualTypes-UnionTypesunionUnionTypes).
+ **Double data type** – Amazon Ion does not currently support the `double` data type.

**Topics**
+ [Serialization library name](#library-name)
+ [Considerations and limitations](#ion-serde-considerations-and-limitations)
+ [Create Amazon Ion tables](ion-serde-using-create-table.md)
+ [Use CTAS and INSERT INTO to create Amazon Ion tables](ion-serde-using-ctas-and-insert-into-to-create-ion-tables.md)
+ [Amazon Ion SerDe property reference](ion-serde-using-ion-serde-properties.md)
+ [Use path extractors](ion-serde-using-path-extractors.md)

# Create Amazon Ion tables
<a name="ion-serde-using-create-table"></a>

To create a table in Athena from data stored in Amazon Ion format, you can use one of the following techniques in a CREATE TABLE statement:
+ Specify `STORED AS ION`. In this usage, you do not have to specify the Amazon Ion Hive SerDe explicitly. This choice is the more straightforward option.
+ Specify the Amazon Ion class paths in the `ROW FORMAT SERDE`, `INPUTFORMAT`, and `OUTPUTFORMAT` fields.

You can also use `CREATE TABLE AS SELECT` (CTAS) statements to create Amazon Ion tables in Athena. For information, see [Use CTAS and INSERT INTO to create Amazon Ion tables](ion-serde-using-ctas-and-insert-into-to-create-ion-tables.md).

## Specify STORED AS ION
<a name="ion-serde-specifying-stored-as-ion"></a>

The following example `CREATE TABLE` statement uses `STORED AS ION` before the `LOCATION` clause to create a table based on flight data in Amazon Ion format. The `LOCATION` clause specifies the bucket or folder where the input files in Ion format are located. All files in the specified location are scanned.

```
CREATE EXTERNAL TABLE flights_ion (
    yr INT,
    quarter INT,
    month INT,
    dayofmonth INT,
    dayofweek INT,
    flightdate STRING,
    uniquecarrier STRING,
    airlineid INT
)
STORED AS ION
LOCATION 's3://amzn-s3-demo-bucket/'
```
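After the table is created, you can query it like any other Athena table. For example, the following sketch selects a few of the columns defined above:

```
SELECT flightdate, uniquecarrier
FROM flights_ion
LIMIT 10
```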

## Specify the Amazon Ion class paths
<a name="ion-serde-specifying-the-ion-class-paths"></a>

Instead of using the `STORED AS ION` syntax, you can explicitly specify the Ion class path values for the `ROW FORMAT SERDE`, `INPUTFORMAT`, and `OUTPUTFORMAT` clauses as follows.



| Parameter | Ion class path | 
| --- | --- | 
| ROW FORMAT SERDE | 'com.amazon.ionhiveserde.IonHiveSerDe' | 
| STORED AS INPUTFORMAT | 'com.amazon.ionhiveserde.formats.IonInputFormat' | 
| OUTPUTFORMAT | 'com.amazon.ionhiveserde.formats.IonOutputFormat' | 

The following DDL query uses this technique to create the same external table as in the previous example.

```
CREATE EXTERNAL TABLE flights_ion (
    yr INT,
    quarter INT,
    month INT,
    dayofmonth INT,
    dayofweek INT,
    flightdate STRING,
    uniquecarrier STRING,
    airlineid INT
)
ROW FORMAT SERDE
 'com.amazon.ionhiveserde.IonHiveSerDe'
STORED AS INPUTFORMAT
 'com.amazon.ionhiveserde.formats.IonInputFormat'
OUTPUTFORMAT
 'com.amazon.ionhiveserde.formats.IonOutputFormat'
LOCATION 's3://amzn-s3-demo-bucket/'
```

For information about the SerDe properties for `CREATE TABLE` statements in Athena, see [Amazon Ion SerDe property reference](ion-serde-using-ion-serde-properties.md).

# Use CTAS and INSERT INTO to create Amazon Ion tables
<a name="ion-serde-using-ctas-and-insert-into-to-create-ion-tables"></a>

You can use the `CREATE TABLE AS SELECT` (CTAS) and `INSERT INTO` statements to copy or insert data from a table into a new table in Amazon Ion format in Athena.

In a CTAS query, specify `format='ION'` in the `WITH` clause, as in the following example.

```
CREATE TABLE new_table
WITH (format='ION')
AS SELECT * from existing_table
```

By default, Athena serializes Amazon Ion results in [Ion binary format](https://amzn.github.io/ion-docs/docs/binary.html), but you can also use text format. To use text format, specify `ion_encoding = 'TEXT'` in the CTAS `WITH` clause, as in the following example.

```
CREATE TABLE new_table
WITH (format='ION', ion_encoding = 'TEXT')
AS SELECT * from existing_table
```
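To append rows to an existing Amazon Ion table with `INSERT INTO`, the statement takes the usual form, as in the following sketch (the table names are hypothetical):

```
INSERT INTO new_table
SELECT * FROM existing_table
```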

For more information about Amazon Ion specific properties in the CTAS `WITH` clause, see [Amazon Ion properties for the CTAS WITH clause](#ion-serde-ctas-with-clause-properties).

## Amazon Ion properties for the CTAS WITH clause
<a name="ion-serde-ctas-with-clause-properties"></a>

In a CTAS query, you can use the `WITH` clause to specify the Amazon Ion format and optionally specify the Amazon Ion encoding and/or write compression algorithm to use.

**format**  
You can specify the `ION` keyword as the format option in the `WITH` clause of a CTAS query. When you do so, the table that you create uses the `IonInputFormat` for reads and serializes data with the `IonOutputFormat`.  
The following example specifies that the CTAS query use Amazon Ion format.  

```
WITH (format='ION')
```

**ion_encoding**  
Optional  
Default: `BINARY`  
Values: `BINARY`, `TEXT`  
Specifies whether data is serialized in Amazon Ion binary format or Amazon Ion text format. The following example specifies Amazon Ion text format.  

```
WITH (format='ION', ion_encoding='TEXT')
```

**write_compression**  
Optional  
Default: `GZIP`  
Values: `GZIP`, `ZSTD`, `BZIP2`, `SNAPPY`, `NONE`  
Specifies the compression algorithm to use to compress output files.  
The following example specifies that the CTAS query write its output in Amazon Ion format using the [Zstandard](https://facebook.github.io/zstd/) compression algorithm.  

```
WITH (format='ION', write_compression = 'ZSTD')       
```
For information about using compression in Athena, see [Use compression in Athena](compression-formats.md). 

For information about other CTAS properties in Athena, see [CTAS table properties](create-table-as.md#ctas-table-properties).

# Amazon Ion SerDe property reference
<a name="ion-serde-using-ion-serde-properties"></a>

This topic contains information about the SerDe properties for `CREATE TABLE` statements in Athena. For more information and examples of Amazon Ion SerDe property usage, see [SerDe properties](https://github.com/amzn/ion-hive-serde/blob/master/docs/serde-properties.md) in the Amazon Ion Hive SerDe documentation on [GitHub](https://github.com/amzn/ion-hive-serde/tree/master/docs).

## How to specify Amazon Ion SerDe properties
<a name="ion-serde-specifying-ion-serde-properties"></a>

To specify properties for the Amazon Ion Hive SerDe in your `CREATE TABLE` statement, use the `WITH SERDEPROPERTIES` clause. Because `WITH SERDEPROPERTIES` is a subfield of the `ROW FORMAT SERDE` clause, you must specify `ROW FORMAT SERDE` and the Amazon Ion Hive SerDe class path first, as the following syntax shows.

```
...
ROW FORMAT SERDE
 'com.amazon.ionhiveserde.IonHiveSerDe'
WITH SERDEPROPERTIES (
 'property' = 'value',
 'property' = 'value',
...
)
```

Note that although the `ROW FORMAT SERDE` clause is required if you want to use `WITH SERDEPROPERTIES`, you can use either `STORED AS ION` or the longer `INPUTFORMAT` and `OUTPUTFORMAT` syntax to specify the Amazon Ion format.

## Amazon Ion SerDe properties
<a name="ion-serde-ion-serde-properties"></a>

Following are the Amazon Ion SerDe properties that can be used in `CREATE TABLE` statements in Athena.

**ion.encoding**  
Optional  
Default: `BINARY`  
Values: `BINARY`, `TEXT`  
This property declares whether new values added are serialized as [Amazon Ion binary](https://amzn.github.io/ion-docs/docs/binary.html) or Amazon Ion text format.  
The following SerDe property example specifies Amazon Ion text format.  

```
'ion.encoding' = 'TEXT'
```

**ion.fail_on_overflow**  
Optional  
Default: `true`  
Values: `true`, `false`  
Amazon Ion allows for arbitrarily large numerical types while Hive does not. By default, the SerDe fails if the Amazon Ion value does not fit the Hive column, but you can use the `fail_on_overflow` configuration option to let the value overflow instead of failing.  
This property can be set at either the table or column level. To specify it at the table level, specify `ion.fail_on_overflow` as in the following example. This sets the default behavior for all columns.  

```
'ion.fail_on_overflow' = 'true'
```
To control a specific column, specify the column name between `ion` and `fail_on_overflow`, delimited by periods, as in the following example.  

```
'ion.<column>.fail_on_overflow' = 'false'
```
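As a sketch, the table-level default and a column-level override can be combined in one `WITH SERDEPROPERTIES` clause. Here the `price` column name is hypothetical: the table fails on overflow by default, but the `price` column is allowed to overflow.

```
ROW FORMAT SERDE
 'com.amazon.ionhiveserde.IonHiveSerDe'
WITH SERDEPROPERTIES (
 'ion.fail_on_overflow' = 'true',
 'ion.price.fail_on_overflow' = 'false'
)
```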

**ion.path_extractor.case_sensitive**  
Optional  
Default: `false`  
Values: `true`, `false`  
Determines whether to treat Amazon Ion field names as case sensitive. When `false`, the SerDe ignores case when parsing Amazon Ion field names.  
For example, suppose you have a Hive table schema that defines a field `alias` in lower case and an Amazon Ion document with both an `alias` field and an `ALIAS` field, as in the following example.  

```
-- Hive Table Schema
alias: STRING

-- Amazon Ion Document
{ 'ALIAS': 'value1'} 
{ 'alias': 'value2'}
```
The following example shows SerDe properties and the resulting extracted table when case sensitivity is set to `false`:  

```
-- Serde properties
'ion.alias.path_extractor' = '(alias)'
'ion.path_extractor.case_sensitive' = 'false'

--Extracted Table
| alias    |
|----------|
| "value1" |
| "value2" |
```
The following example shows SerDe properties and the resulting extracted table when case sensitivity is set to `true`:  

```
-- Serde properties
'ion.alias.path_extractor' = '(alias)'
'ion.path_extractor.case_sensitive' = 'true'

--Extracted Table
| alias    |
|----------|
| "value2" |
```
In the second case, `value1` for the `ALIAS` field is ignored when case sensitivity is set to `true` and the path extractor is specified as `alias`.

**ion.*<column>*.path_extractor**  
Optional  
Default: NA  
Values: String with search path  
Creates a path extractor with the specified search path for the given column. Path extractors map Amazon Ion fields to Hive columns. If no path extractors are specified, Athena dynamically creates path extractors at run time based on column names.  
The following example path extractor maps the `example_ion_field` to the `example_hive_column`.  

```
'ion.example_hive_column.path_extractor' = '(example_ion_field)'
```
For more information about path extractors and search paths, see [Use path extractors](ion-serde-using-path-extractors.md).

**ion.timestamp.serialization_offset**  
Optional  
Default: `'Z'`  
Values: `OFFSET`, where `OFFSET` is represented as `<sign>hh:mm`. Example values: `01:00`, `+01:00`, `-09:30`, `Z` (UTC, the same as `00:00`)  
Unlike Apache Hive [timestamps](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types#LanguageManualTypes-timestamp), which have no built-in time zone and are stored as an offset from the UNIX epoch, Amazon Ion timestamps do have an offset. Use this property to specify the offset when you serialize to Amazon Ion.  
The following example adds an offset of one hour.  

```
'ion.timestamp.serialization_offset' = '+01:00'       
```

**ion.serialize_null**  
Optional  
Default: `OMIT`  
Values: `OMIT`, `UNTYPED`, `TYPED`  
The Amazon Ion SerDe can be configured to either serialize or omit columns that have null values. You can choose to write out strongly typed nulls (`TYPED`) or untyped nulls (`UNTYPED`). Strongly typed nulls are determined based on the default Amazon Ion to Hive type mapping.  
The following example specifies strongly typed nulls.  

```
'ion.serialize_null'='TYPED'
```

**ion.ignore_malformed**  
Optional  
Default: `false`  
Values: `true`, `false`  
When `true`, ignores malformed entries or the whole file if the SerDe is unable to read it. For more information, see [Ignore malformed](https://github.com/amzn/ion-hive-serde/blob/master/docs/serde-properties.md#ignore-malformed) in the documentation on GitHub.

**ion.*<column>*.serialize_as**  
Optional  
Default: Default type for the column.  
Values: String containing Amazon Ion type  
Determines the Amazon Ion data type in which a value is serialized. Because Amazon Ion and Hive types do not always have a direct mapping, a few Hive types have multiple valid data types for serialization. To serialize data as a non-default data type, use this property. For more information about type mapping, see the Amazon Ion [Type mapping](https://github.com/amzn/ion-hive-serde/blob/master/docs/type-mapping.md) page on GitHub.  
By default, binary Hive columns are serialized as Amazon Ion blobs, but they can also be serialized as an [Amazon Ion clob](https://amzn.github.io/ion-docs/docs/stringclob.html#ion-clob) (character large object). The following example serializes the column `example_hive_binary_column` as a clob.  

```
'ion.example_hive_binary_column.serialize_as' = 'clob'       
```

# Use path extractors
<a name="ion-serde-using-path-extractors"></a>

Amazon Ion is a document-style file format, but Apache Hive is a flat columnar format. You can use special Amazon Ion SerDe properties called *path extractors* to map between the two formats. Path extractors flatten the hierarchical Amazon Ion format, map Amazon Ion values to Hive columns, and can be used to rename fields.

Athena can generate the extractors for you, but you can also define your own extractors if necessary.

**Topics**
+ [Use Athena generated path extractors](ion-serde-generated-path-extractors.md)
+ [Specify your own path extractors](ion-serde-specifying-your-own-path-extractors.md)
+ [Use search paths in path extractors](ion-serde-using-search-paths-in-path-extractors.md)
+ [Path extractor examples](ion-serde-examples.md)

# Use Athena generated path extractors
<a name="ion-serde-generated-path-extractors"></a>

By default, Athena searches for top level Amazon Ion values that match Hive column names and creates path extractors at runtime based on these matching values. If your Amazon Ion data format matches the Hive table schema, Athena dynamically generates the extractors for you, and you do not need to add any additional path extractors. These default path extractors are not stored in the table metadata.

The following example shows how Athena generates extractors based on column name.

```
-- Example Amazon Ion Document
{
    identification: {
        name: "John Smith",
        driver_license: "XXXX"
    },
    
    alias: "Johnny"    
}

-- Example DDL
CREATE EXTERNAL TABLE example_schema2 (
    identification MAP<STRING, STRING>,
    alias STRING
)
STORED AS ION
LOCATION 's3://amzn-s3-demo-bucket/path_extraction1/'
```

The following example extractors are generated by Athena. The first extracts the `identification` field to the `identification` column, and the second extracts the `alias` field to the `alias` column.

```
'ion.identification.path_extractor' = '(identification)'
'ion.alias.path_extractor' = '(alias)'
```

The following example shows the extracted table.

```
|                  identification                    |  alias   |
|----------------------------------------------------|----------|
|{["name", "driver_license"],["John Smith", "XXXX"]} | "Johnny" |
```

# Specify your own path extractors
<a name="ion-serde-specifying-your-own-path-extractors"></a>

If your Amazon Ion fields do not map neatly to Hive columns, you can specify your own path extractors. In the `WITH SERDEPROPERTIES` clause of your `CREATE TABLE` statement, use the following syntax.

```
WITH SERDEPROPERTIES (
   "ion.path_extractor.case_sensitive" = "<Boolean>", 
   "ion.<column_name>.path_extractor" = "<path_extractor_expression>"
)
```

**Note**  
By default, path extractors are case insensitive. To override this setting, set the [ion.path_extractor.case_sensitive](ion-serde-using-ion-serde-properties.md#ioncase) SerDe property to `true`.

# Use search paths in path extractors
<a name="ion-serde-using-search-paths-in-path-extractors"></a>

The SerDe property syntax for path extractors contains a *<path_extractor_expression>*:

```
"ion.<column_name>.path_extractor" = "<path_extractor_expression>"         
```

You can use the *<path_extractor_expression>* to specify a search path that parses the Amazon Ion document and finds matching data. The search path is enclosed in parentheses and can contain one or more of the following components, separated by spaces.
+ **Wildcard** – Matches all values.
+ **Index** – Matches the value at the specified numerical index. Indices are zero-based.
+ **Text** – Matches all values whose field names are equivalent to the specified text.
+ **Annotations** – Matches values specified by a wrapped path component that has the annotations specified.

The following example shows an Amazon Ion document and some example search paths.

```
-- Amazon Ion document
{
    foo: ["foo1", "foo2"] ,
    bar: "myBarValue", 
    bar: A::"annotatedValue"
}

-- Example search paths
(foo 0)       # matches "foo1"
(1)           # matches "myBarValue"
(*)           # matches ["foo1", "foo2"], "myBarValue" and A::"annotatedValue"
()            # matches {foo: ["foo1", "foo2"] , bar: "myBarValue", bar: A::"annotatedValue"}
(bar)         # matches "myBarValue" and A::"annotatedValue"
(A::bar)      # matches A::"annotatedValue"
```
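For instance, to extract only the annotated `bar` value from the document above into its own Hive column, you could declare a path extractor like the following sketch (the `annotated_bar` column name is hypothetical):

```
WITH SERDEPROPERTIES (
 'ion.annotated_bar.path_extractor' = '(A::bar)'
)
```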

# Path extractor examples
<a name="ion-serde-examples"></a>

The following path extractor examples show how to flatten and rename fields or extract data as Amazon Ion text.

## Flatten and rename fields
<a name="ion-serde-flattening-and-renaming-fields"></a>

The following example shows a set of search paths that flatten and rename fields. The example uses search paths to do the following:
+ Map the `nickname` column to the `alias` field
+ Map the `name` column to the `name` subfield located in the `identification` struct.

Following is the example Amazon Ion document.

```
-- Example Amazon Ion Document
{
    identification: {
        name: "John Smith",
        driver_license: "XXXX"
    },
    
    alias: "Johnny"    
}
```

The following is the example `CREATE TABLE` statement that defines the path extractors.

```
-- Example DDL Query
CREATE EXTERNAL TABLE example_schema2 (
    name STRING,
    nickname STRING
)
ROW FORMAT SERDE
 'com.amazon.ionhiveserde.IonHiveSerDe'
WITH SERDEPROPERTIES (
 'ion.nickname.path_extractor' = '(alias)',
 'ion.name.path_extractor' = '(identification name)'
 )
STORED AS ION
LOCATION 's3://amzn-s3-demo-bucket/path_extraction2/'
```

The following example shows the extracted data.

```
-- Extracted Table
| name         |   nickname   |
|--------------|--------------|
| "John Smith" |  "Johnny"    |
```

For more information about search paths and additional search path examples, see the [Ion Java Path Extraction](https://github.com/amzn/ion-java-path-extraction) page on GitHub.

## Extract flight data to text format
<a name="ion-serde-extracting-flight-data-to-text-format"></a>

The following example `CREATE TABLE` query uses `WITH SERDEPROPERTIES` to add path extractors to extract flight data and specify the output encoding as Amazon Ion text. The example uses the `STORED AS ION` syntax.

```
CREATE EXTERNAL TABLE flights_ion (
    yr INT,
    quarter INT,
    month INT,
    dayofmonth INT,
    dayofweek INT,
    flightdate STRING,
    uniquecarrier STRING,
    airlineid INT
)
ROW FORMAT SERDE
 'com.amazon.ionhiveserde.IonHiveSerDe'
WITH SERDEPROPERTIES (
 'ion.encoding' = 'TEXT',
 'ion.yr.path_extractor'='(year)',
 'ion.quarter.path_extractor'='(results quarter)',
 'ion.month.path_extractor'='(date month)')
STORED AS ION
LOCATION 's3://amzn-s3-demo-bucket/'
```

# Avro SerDe
<a name="avro-serde"></a>

Use the Avro SerDe to create Athena tables from Avro data.

## Serialization library name
<a name="avro-serde-library-name"></a>

The serialization library name for the Avro SerDe is `org.apache.hadoop.hive.serde2.avro.AvroSerDe`. For technical information, see [AvroSerDe](https://cwiki.apache.org/confluence/display/Hive/AvroSerDe) in the Apache documentation. 

## Use the Avro SerDe
<a name="avro-serde-using"></a>

For security reasons, Athena does not support using `avro.schema.url` to specify the table schema; use `avro.schema.literal` instead.

To extract schema from data in Avro format, use the Apache `avro-tools-<version>.jar` file located in the `java` subdirectory of your installed Avro release. Use the `getschema` parameter to return a schema that you can use in your `WITH SERDEPROPERTIES` statement, as in the following example.

```
java -jar avro-tools-1.8.2.jar getschema my_data.avro
```

To download Avro, see [Apache Avro releases](http://avro.apache.org/releases.html#Download). To download Apache Avro Tools directly, see the [Apache Avro tools Maven repository](https://mvnrepository.com/artifact/org.apache.avro/avro-tools).

After you obtain the schema, use a `CREATE TABLE` statement to create an Athena table based on the underlying Avro data stored in Amazon S3. To specify the Avro SerDe in your `CREATE TABLE` statement, use `ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'`. Specify the schema using the `WITH SERDEPROPERTIES` clause, as in the following example.

**Note**  
Replace *myregion* in `s3://athena-examples-myregion/path/to/data/` with the region identifier where you run Athena, for example, `s3://athena-examples-us-west-1/path/to/data/`.

```
CREATE EXTERNAL TABLE flights_avro_example (
   yr INT,
   flightdate STRING,
   uniquecarrier STRING,
   airlineid INT,
   carrier STRING,
   flightnum STRING,
   origin STRING,
   dest STRING,
   depdelay INT,
   carrierdelay INT,
   weatherdelay INT
)
PARTITIONED BY (year STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
WITH SERDEPROPERTIES ('avro.schema.literal'='
{
   "type" : "record",
   "name" : "flights_avro_subset",
   "namespace" : "default",
   "fields" : [ {
      "name" : "yr",
      "type" : [ "null", "int" ],
      "default" : null
   }, {
      "name" : "flightdate",
      "type" : [ "null", "string" ],
      "default" : null
   }, {
      "name" : "uniquecarrier",
      "type" : [ "null", "string" ],
      "default" : null
   }, {
      "name" : "airlineid",
      "type" : [ "null", "int" ],
      "default" : null
   }, {
      "name" : "carrier",
      "type" : [ "null", "string" ],
      "default" : null
   }, {
      "name" : "flightnum",
      "type" : [ "null", "string" ],
      "default" : null
   }, {
      "name" : "origin",
      "type" : [ "null", "string" ],
      "default" : null
   }, {
      "name" : "dest",
      "type" : [ "null", "string" ],
      "default" : null
   }, {
      "name" : "depdelay",
      "type" : [ "null", "int" ],
      "default" : null
   }, {
      "name" : "carrierdelay",
      "type" : [ "null", "int" ],
      "default" : null
   }, {
      "name" : "weatherdelay",
      "type" : [ "null", "int" ],
      "default" : null
    } ]
}
')
STORED AS AVRO
LOCATION 's3://athena-examples-myregion/flight/avro/';
```

Run the `MSCK REPAIR TABLE` statement on the table to refresh partition metadata.

```
MSCK REPAIR TABLE flights_avro_example;
```

Query the top 10 departure cities by number of total departures.

```
SELECT origin, count(*) AS total_departures
FROM flights_avro_example
WHERE year >= '2000'
GROUP BY origin
ORDER BY total_departures DESC
LIMIT 10;
```

**Note**  
The flight table data comes from [Flights](http://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236&amp;DB_Short_Name=On-Time) provided by US Department of Transportation, [Bureau of Transportation Statistics](http://www.transtats.bts.gov/).

# Grok SerDe
<a name="grok-serde"></a>

The Logstash Grok SerDe is a library with a set of specialized patterns for deserialization of unstructured text data, usually logs. Each Grok pattern is a named regular expression. You can identify and re-use these deserialization patterns as needed. This makes it easier to use Grok compared with using regular expressions. Grok provides a set of [pre-defined patterns](https://github.com/elastic/logstash/blob/v1.4.2/patterns/grok-patterns). You can also create custom patterns.

## Serialization library name
<a name="library-name"></a>

The serialization library name for the Grok SerDe is `com.amazonaws.glue.serde.GrokSerDe`.

## How to use the Grok SerDe
<a name="grok-serde-using"></a>

To specify the Grok SerDe when creating a table in Athena, use the `ROW FORMAT SERDE 'com.amazonaws.glue.serde.GrokSerDe'` clause, followed by the `WITH SERDEPROPERTIES` clause that specifies the patterns to match in your data, where:
+ The `input.format` expression defines the patterns to match in the data. This is required.
+ The `input.grokCustomPatterns` expression defines a named custom pattern, which you can subsequently use within the `input.format` expression. This is optional. To include multiple pattern entries in the `input.grokCustomPatterns` expression, use the newline escape character (`\n`) to separate them, as follows: `'input.grokCustomPatterns'='INSIDE_QS ([^\"]*)\nINSIDE_BRACKETS ([^\\]]*)')`.
+ The `STORED AS INPUTFORMAT` and `OUTPUTFORMAT` clauses are required.
+ The `LOCATION` clause specifies an Amazon S3 bucket, which can contain multiple data objects. All data objects in the bucket are deserialized to create the table.

## Examples
<a name="examples"></a>

The examples in this section rely on the list of predefined Grok patterns. For more information, see [grok-patterns](https://github.com/elastic/logstash/blob/v1.4.2/patterns/grok-patterns) on GitHub.com.

### Example 1
<a name="example-1"></a>

This example uses source data from Postfix maillog entries saved in `s3://amzn-s3-demo-bucket/groksample/`.

```
Feb  9 07:15:00 m4eastmail postfix/smtpd[19305]: B88C4120838: connect from unknown[192.168.55.4]
Feb  9 07:15:00 m4eastmail postfix/smtpd[20444]: B58C4330038: client=unknown[192.168.55.4]
Feb  9 07:15:03 m4eastmail postfix/cleanup[22835]: BDC22A77854: message-id=<31221401257553.5004389LCBF@m4eastmail.example.com>
```

The following statement creates a table in Athena called `mygroktable` from the source data, using a custom pattern and the predefined patterns that you specify:

```
CREATE EXTERNAL TABLE `mygroktable`(
   syslogbase string,
   queue_id string,
   syslog_message string
   )
ROW FORMAT SERDE
   'com.amazonaws.glue.serde.GrokSerDe'
WITH SERDEPROPERTIES (
   'input.grokCustomPatterns' = 'POSTFIX_QUEUEID [0-9A-F]{7,12}',
   'input.format'='%{SYSLOGBASE} %{POSTFIX_QUEUEID:queue_id}: %{GREEDYDATA:syslog_message}'
   )
STORED AS INPUTFORMAT
   'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
   'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
   's3://amzn-s3-demo-bucket/groksample/';
```

Start with a pattern like `%{NOTSPACE:column}` to get the columns mapped first, and then specialize the columns if needed.
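Because each Grok pattern is just a named regular expression, the expansion of `%{PATTERN:column}` references into capture groups can be pictured in plain Python. The following sketch is illustrative only, not the SerDe's implementation: the `patterns` table and `expand` helper are simplified assumptions, and the format string drops the `%{SYSLOGBASE}` prefix from the example above to keep the sketch short.

```python
import re

# Named patterns: the custom POSTFIX_QUEUEID pattern from the example
# above, plus a predefined Grok pattern (GREEDYDATA matches anything).
patterns = {
    "POSTFIX_QUEUEID": r"[0-9A-F]{7,12}",
    "GREEDYDATA": r".*",
}

def expand(fmt: str) -> str:
    # Replace each %{NAME:column} reference with a named capture group.
    return re.sub(
        r"%\{(\w+):(\w+)\}",
        lambda m: f"(?P<{m.group(2)}>{patterns[m.group(1)]})",
        fmt,
    )

regex = expand(r"%{POSTFIX_QUEUEID:queue_id}: %{GREEDYDATA:syslog_message}")
line = "B88C4120838: connect from unknown[192.168.55.4]"
match = re.match(regex, line)
print(match.group("queue_id"))        # B88C4120838
print(match.group("syslog_message"))  # connect from unknown[192.168.55.4]
```

This is also why starting with broad patterns like `%{NOTSPACE:column}` works well: a broad capture group maps the columns first, and you can tighten each group afterward.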

### Example 2
<a name="example-2"></a>

In the following example, you create a query for Log4j logs. The example logs have the entries in this format:

```
2017-09-12 12:10:34,972 INFO  - processType=AZ, processId=ABCDEFG614B6F5E49, status=RUN,
threadId=123:amqListenerContainerPool23P:AJ|ABCDE9614B6F5E49||2017-09-12T12:10:11.172-0700],
executionTime=7290, tenantId=12456, userId=123123f8535f8d76015374e7a1d87c3c, shard=testapp1,
jobId=12312345e5e7df0015e777fb2e03f3c, messageType=REAL_TIME_SYNC,
action=receive, hostname=1.abc.def.com
```

To query this log data:
+ Add the Grok pattern to the `input.format` for each column. For example, for `timestamp`, add `%{TIMESTAMP_ISO8601:timestamp}`. For `loglevel`, add `%{LOGLEVEL:loglevel}`.
+ Make sure the pattern in `input.format` matches the format of the log exactly, by mapping the dashes (`-`) and the commas that separate the entries in the log format.

  ```
  CREATE EXTERNAL TABLE bltest (
   timestamp STRING,
   loglevel STRING,
   processtype STRING,
   processid STRING,
   status STRING,
   threadid STRING,
   executiontime INT,
   tenantid INT,
   userid STRING,
   shard STRING,
   jobid STRING,
   messagetype STRING,
   action STRING,
   hostname STRING
   )
  ROW FORMAT SERDE 'com.amazonaws.glue.serde.GrokSerDe'
  WITH SERDEPROPERTIES (
  "input.grokCustomPatterns" = 'C_ACTION receive|send',
  "input.format" = "%{TIMESTAMP_ISO8601:timestamp} %{LOGLEVEL:loglevel} - processType=%{NOTSPACE:processtype}, processId=%{NOTSPACE:processid}, status=%{NOTSPACE:status}, threadId=%{NOTSPACE:threadid}, executionTime=%{POSINT:executiontime}, tenantId=%{POSINT:tenantid}, userId=%{NOTSPACE:userid}, shard=%{NOTSPACE:shard}, jobId=%{NOTSPACE:jobid}, messageType=%{NOTSPACE:messagetype}, action=%{C_ACTION:action}, hostname=%{HOST:hostname}"
  ) STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
  LOCATION 's3://amzn-s3-demo-bucket/samples/';
  ```

### Example 3
<a name="example-3"></a>

The following example `CREATE TABLE` statement for [Amazon S3 server access logs](https://docs.aws.amazon.com/AmazonS3/latest/userguide/LogFormat.html) shows an `'input.grokCustomPatterns'` expression that contains two pattern entries separated by the newline escape character (`\n`), as shown in this snippet from the example query: `'input.grokCustomPatterns'='INSIDE_QS ([^\"]*)\nINSIDE_BRACKETS ([^\\]]*)')`.

```
CREATE EXTERNAL TABLE `s3_access_auto_raw_02`(
  `bucket_owner` string COMMENT 'from deserializer', 
  `bucket` string COMMENT 'from deserializer', 
  `time` string COMMENT 'from deserializer', 
  `remote_ip` string COMMENT 'from deserializer', 
  `requester` string COMMENT 'from deserializer', 
  `request_id` string COMMENT 'from deserializer', 
  `operation` string COMMENT 'from deserializer', 
  `key` string COMMENT 'from deserializer', 
  `request_uri` string COMMENT 'from deserializer', 
  `http_status` string COMMENT 'from deserializer', 
  `error_code` string COMMENT 'from deserializer', 
  `bytes_sent` string COMMENT 'from deserializer', 
  `object_size` string COMMENT 'from deserializer', 
  `total_time` string COMMENT 'from deserializer', 
  `turnaround_time` string COMMENT 'from deserializer', 
  `referrer` string COMMENT 'from deserializer', 
  `user_agent` string COMMENT 'from deserializer', 
  `version_id` string COMMENT 'from deserializer')
ROW FORMAT SERDE 
  'com.amazonaws.glue.serde.GrokSerDe' 
WITH SERDEPROPERTIES ( 
  'input.format'='%{NOTSPACE:bucket_owner} %{NOTSPACE:bucket} \\[%{INSIDE_BRACKETS:time}\\] %{NOTSPACE:remote_ip} %{NOTSPACE:requester} %{NOTSPACE:request_id} %{NOTSPACE:operation} %{NOTSPACE:key} \"?%{INSIDE_QS:request_uri}\"? %{NOTSPACE:http_status} %{NOTSPACE:error_code} %{NOTSPACE:bytes_sent} %{NOTSPACE:object_size} %{NOTSPACE:total_time} %{NOTSPACE:turnaround_time} \"?%{INSIDE_QS:referrer}\"? \"?%{INSIDE_QS:user_agent}\"? %{NOTSPACE:version_id}', 
  'input.grokCustomPatterns'='INSIDE_QS ([^\"]*)\nINSIDE_BRACKETS ([^\\]]*)') 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
  's3://amzn-s3-demo-bucket'
```

## See also
<a name="grok-serde-see-also"></a>
+ [Understanding Grok Patterns](https://edgedelta.com/company/blog/what-are-grok-patterns) (external website)
+ [Built-in patterns](https://docs.aws.amazon.com/glue/latest/dg/custom-classifier.html#classifier-builtin-patterns) (*AWS Glue User Guide*)

# JSON SerDe libraries
<a name="json-serde"></a>

In Athena, you can use SerDe libraries to deserialize JSON data. Deserialization converts the JSON data so that it can be serialized (written out) into a different format like Parquet or ORC.
+ [Hive JSON SerDe](hive-json-serde.md)
+ [OpenX JSON SerDe](openx-json-serde.md) 
+ [Amazon Ion Hive SerDe](ion-serde.md)

**Note**  
The Hive and OpenX libraries expect JSON data to be on a single line (not formatted), with records separated by a new line character.

Because Amazon Ion is a superset of JSON, you can use the Amazon Ion Hive SerDe to query non-Amazon Ion JSON datasets. Unlike the Hive and OpenX JSON SerDe libraries, the Amazon Ion SerDe does not expect each row of data to be on a single line. This feature is useful if you want to query JSON datasets that are in "pretty print" format or otherwise break up the fields in a row with newline characters.
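If your JSON is pretty printed and you want to query it with the Hive or OpenX SerDe rather than the Amazon Ion SerDe, you can rewrite it into the expected one-record-per-line layout before uploading it to Amazon S3. A minimal sketch, assuming the source file holds a JSON array of records:

```python
import json

# Pretty-printed input: a JSON array of records spanning multiple lines.
pretty = """
[
  {"id": 1, "name": "first"},
  {"id": 2, "name": "second"}
]
"""

# Re-serialize each record onto its own line (JSON Lines), which is the
# layout the Hive and OpenX JSON SerDes expect.
records = json.loads(pretty)
jsonl = "\n".join(json.dumps(r, separators=(",", ":")) for r in records)
print(jsonl)
# {"id":1,"name":"first"}
# {"id":2,"name":"second"}
```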

## Library names
<a name="library-names"></a>

Use one of the following:

 [org.apache.hive.hcatalog.data.JsonSerDe](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-JSON) 

 [org.openx.data.jsonserde.JsonSerDe](https://github.com/rcongiu/Hive-JSON-Serde) 

[com.amazon.ionhiveserde.IonHiveSerDe](https://github.com/amzn/ion-hive-serde)

# Hive JSON SerDe
<a name="hive-json-serde"></a>

The Hive JSON SerDe is commonly used to process JSON data like events. These events are represented as single-line strings of JSON-encoded text separated by a new line. The Hive JSON SerDe does not allow duplicate keys in `map` or `struct` key names.

**Note**  
The SerDe expects each JSON document to be on a single line of text with no line termination characters separating the fields in the record. If the JSON text is in pretty print format, you may receive an error message like HIVE_CURSOR_ERROR: Row is not a valid JSON Object or HIVE_CURSOR_ERROR: JsonParseException: Unexpected end-of-input: expected close marker for OBJECT when you attempt to query the table after you create it. For more information, see [JSON Data Files](https://github.com/rcongiu/Hive-JSON-Serde#json-data-files) in the OpenX SerDe documentation on GitHub. 

The following example DDL statement uses the Hive JSON SerDe to create a table based on sample online advertising data. In the `LOCATION` clause, replace *amzn-s3-demo-bucket* in `s3://amzn-s3-demo-bucket.elasticmapreduce/samples/hive-ads/tables/impressions` with the region identifier where you run Athena (for example, `s3://us-west-2.elasticmapreduce/samples/hive-ads/tables/impressions`).

```
CREATE EXTERNAL TABLE impressions (
    requestbegintime string,
    adid string,
    impressionid string,
    referrer string,
    useragent string,
    usercookie string,
    ip string,
    number string,
    processid string,
    browsercookie string,
    requestendtime string,
    timers struct
                <
                 modellookup:string, 
                 requesttime:string
                >,
    threadid string, 
    hostname string,
    sessionid string
)   
PARTITIONED BY (dt string)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
LOCATION 's3://amzn-s3-demo-bucket.elasticmapreduce/samples/hive-ads/tables/impressions';
```

## Specify timestamp formats with the Hive JSON SerDe
<a name="hive-json-serde-timestamp-formats"></a>

To parse timestamp values from string, you can add the `WITH SERDEPROPERTIES` subfield to the `ROW FORMAT SERDE` clause and use it to specify the `timestamp.formats` parameter. In the parameter, specify a comma-separated list of one or more timestamp patterns, as in the following example:

```
...
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
WITH SERDEPROPERTIES ("timestamp.formats"="yyyy-MM-dd'T'HH:mm:ss.SSS'Z',yyyy-MM-dd'T'HH:mm:ss")
...
```

For more information, see [Timestamps](https://cwiki.apache.org/confluence/display/hive/languagemanual+types#LanguageManualTypes-TimestampstimestampTimestamps) in the Apache Hive documentation.
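The `timestamp.formats` patterns use Java `SimpleDateFormat` syntax. As a rough sanity check before writing the DDL, you can verify sample values against approximate Python `strptime` translations of the same patterns (an illustration only; the translations are assumptions, and `%f` also accepts microsecond precision where `SSS` means milliseconds):

```python
from datetime import datetime

# Approximate strptime equivalents of the Java patterns in the example:
#   yyyy-MM-dd'T'HH:mm:ss.SSS'Z'  ->  %Y-%m-%dT%H:%M:%S.%fZ
#   yyyy-MM-dd'T'HH:mm:ss         ->  %Y-%m-%dT%H:%M:%S
samples = {
    "2024-03-05T17:30:00.123Z": "%Y-%m-%dT%H:%M:%S.%fZ",
    "2024-03-05T17:30:00": "%Y-%m-%dT%H:%M:%S",
}

for value, pattern in samples.items():
    parsed = datetime.strptime(value, pattern)
    print(value, "->", parsed.isoformat())
```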

## Load the table for querying
<a name="hive-json-serde-loading-the-table"></a>

After you create the table, run [MSCK REPAIR TABLE](msck-repair-table.md) to load the table and make it queryable from Athena:

```
MSCK REPAIR TABLE impressions;
```

## Query CloudTrail logs
<a name="hive-json-serde-querying-cloud-trail-logs"></a>

You can use the Hive JSON SerDe to query CloudTrail logs. For more information and example `CREATE TABLE` statements, see [Query AWS CloudTrail logs](cloudtrail-logs.md).

# OpenX JSON SerDe
<a name="openx-json-serde"></a>

Like the Hive JSON SerDe, you can use the OpenX JSON SerDe to process JSON data. The data is also represented as single-line strings of JSON-encoded text separated by a new line. Like the Hive JSON SerDe, the OpenX JSON SerDe does not allow duplicate keys in `map` or `struct` key names. 

## Considerations and limitations
<a name="openx-json-serde-considerations-limitations"></a>
+ When using the OpenX JSON SerDe, the number of results and their values can be non-deterministic. The results can contain more rows than expected, fewer rows than expected, or unexpected null values when none are present in the underlying data. To work around this issue, use the [Hive JSON SerDe](hive-json-serde.md), or rewrite the data to another file format type.
+ The SerDe expects each JSON document to be on a single line of text with no line termination characters separating the fields in the record. If the JSON text is in pretty print format, you may receive an error message like HIVE_CURSOR_ERROR: Row is not a valid JSON Object or HIVE_CURSOR_ERROR: JsonParseException: Unexpected end-of-input: expected close marker for OBJECT when you attempt to query the table after you create it. 

  For more information, see [JSON Data Files](https://github.com/rcongiu/Hive-JSON-Serde#json-data-files) in the OpenX SerDe documentation on GitHub. 

## Optional properties
<a name="openx-json-serde-optional-properties"></a>

Unlike the Hive JSON SerDe, the OpenX JSON SerDe has the following optional SerDe properties that can be useful for addressing inconsistencies in data.

**use.null.for.invalid.data**  
Optional. The default is `FALSE`. When set to `TRUE`, the SerDe uses `NULL` for column values that fail to deserialize into the column's type as defined in the table schema.  
Setting `use.null.for.invalid.data` to `TRUE` can cause incorrect or unexpected results because `NULL` values replace invalid data in columns with schema mismatches. We recommend that you fix the data in your files or table schema rather than enabling this property. When you enable this property, queries will not fail on invalid data, which may prevent you from discovering data quality issues.

**ignore.malformed.json**  
Optional. When set to `TRUE`, lets you skip malformed JSON syntax. The default is `FALSE`.

**dots.in.keys**  
Optional. The default is `FALSE`. When set to `TRUE`, allows the SerDe to replace the dots in key names with underscores. For example, if the JSON dataset contains a key with the name `"a.b"`, you can use this property to define the column name to be `"a_b"` in Athena. By default (without this SerDe), Athena does not allow dots in column names.

**case.insensitive**  
Optional. The default is `TRUE`. When set to `TRUE`, the SerDe converts all uppercase columns to lowercase.   
To use case-sensitive key names in your data, use `WITH SERDEPROPERTIES ("case.insensitive" = "FALSE")`. Then, for every key that is not already all lowercase, provide a mapping from the column name to the property name using the following syntax:  

```
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES ("case.insensitive" = "FALSE", "mapping.userid" = "userId")
```
If you have two keys like `URL` and `Url` that are the same when they are in lowercase, an error like the following can occur:  
HIVE_CURSOR_ERROR: Row is not a valid JSON Object - JSONException: Duplicate key "url"  
To resolve this, set the `case.insensitive` property to `FALSE` and map the keys to different names, as in the following example:  

```
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES ("case.insensitive" = "FALSE", "mapping.url1" = "URL", "mapping.url2" = "Url")
```

**mapping**  
Optional. Maps column names to JSON keys that aren't identical to the column names. The `mapping` parameter is useful when the JSON data contains keys that are [keywords](reserved-words.md). For example, if you have a JSON key named `timestamp`, use the following syntax to map the key to a column named `ts`:  

```
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES ("mapping.ts" = "timestamp")
```
**Mapping nested field names with colons to Hive-compatible names**  
If you have a field name with colons inside a `struct`, you can use the `mapping` property to map the field to a Hive-compatible name. For example, if your column type definitions contain `my:struct:field:string`, you can map the definition to `my_struct_field:string` by including the following entry in `WITH SERDEPROPERTIES`:

```
("mapping.my_struct_field" = "my:struct:field")
```
The following example shows the corresponding `CREATE TABLE` statement.  

```
CREATE EXTERNAL TABLE colon_nested_field (
item struct<my_struct_field:string>)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES ("mapping.my_struct_field" = "my:struct:field")
```

## Example: advertising data
<a name="openx-json-serde-ad-data-example"></a>

The following example DDL statement uses the OpenX JSON SerDe to create a table based on the same sample online advertising data used in the example for the Hive JSON SerDe. In the `LOCATION` clause, replace *amzn-s3-demo-bucket* with the region identifier where you run Athena.

```
CREATE EXTERNAL TABLE impressions (
    requestbegintime string,
    adid string,
    impressionId string,
    referrer string,
    useragent string,
    usercookie string,
    ip string,
    number string,
    processid string,
    browsercookie string,
    requestendtime string,
    timers struct<
       modellookup:string, 
       requesttime:string>,
    threadid string, 
    hostname string,
    sessionid string
)   PARTITIONED BY (dt string)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://amzn-s3-demo-bucket.elasticmapreduce/samples/hive-ads/tables/impressions';
```

## Example: deserializing nested JSON
<a name="nested-json-serde-example"></a>

You can use the JSON SerDes to parse more complex JSON-encoded data. This requires using `CREATE TABLE` statements that use `struct` and `array` elements to represent nested structures. 

The following example creates an Athena table from JSON data that has nested structures. The example has the following structure:

```
{
  "DocId": "AWS",
  "User": {
    "Id": 1234,
    "Username": "carlos_salazar",
    "Name": "Carlos",
    "ShippingAddress": {
      "Address1": "123 Main St.",
      "Address2": null,
      "City": "Anytown",
      "State": "CA"
    },
    "Orders": [
      {
        "ItemId": 6789,
        "OrderDate": "11/11/2022"
      },
      {
        "ItemId": 4352,
        "OrderDate": "12/12/2022"
      }
    ]
  }
}
```

Remember that the OpenX SerDe expects each JSON record to be on a single line of text. When stored in Amazon S3, all of the data in the preceding example should be on a single line, like this:

```
{"DocId":"AWS","User":{"Id":1234,"Username":"carlos_salazar","Name":"Carlos","ShippingAddress" ...
```

The following `CREATE TABLE` statement uses the [OpenX JSON SerDe](https://github.com/rcongiu/Hive-JSON-Serde) with the `struct` and `array` collection data types to establish groups of objects for the example data. 

```
CREATE external TABLE complex_json (
   docid string,
   `user` struct<
               id:INT,
               username:string,
               name:string,
               shippingaddress:struct<
                                      address1:string,
                                      address2:string,
                                      city:string,
                                      state:string
                                      >,
               orders:array<
                            struct<
                                 itemid:INT,
                                  orderdate:string
                                  >
                              >
               >
   )
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://amzn-s3-demo-bucket/myjsondata/';
```

To query the table, use a `SELECT` statement like the following.

```
SELECT 
 user.name as Name, 
 user.shippingaddress.address1 as Address, 
 user.shippingaddress.city as City, 
 o.itemid as Item_ID, o.orderdate as Order_date
FROM complex_json, UNNEST(user.orders) as temp_table (o)
```

To access the data fields inside structs, the sample query uses dot notation (for example, `user.name`). To access data inside an array of structs (as with the `orders` field), you can use the `UNNEST` function. The `UNNEST` function flattens the array into a temporary table (here called `temp_table`) whose single column `o` holds the array elements. This lets you use dot notation, as you do with structs, to access the unnested array elements (for example, `o.itemid`). The name `temp_table`, used in the example for illustrative purposes, is often abbreviated as `t`.

The following table shows the query results.


| # | Name | Address | City | Item_ID | Order_date | 
| --- | --- | --- | --- | --- | --- | 
| 1 | Carlos | 123 Main St. | Anytown | 6789 | 11/11/2022 | 
| 2 | Carlos | 123 Main St. | Anytown | 4352 | 12/12/2022 | 
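The `UNNEST` flattening above can be pictured as producing one output row per array element while repeating the outer fields. A small illustrative sketch over the sample record (plain Python, not Athena's execution engine):

```python
# The nested record from the example, as a Python dict.
record = {
    "DocId": "AWS",
    "User": {
        "Name": "Carlos",
        "ShippingAddress": {"Address1": "123 Main St.", "City": "Anytown"},
        "Orders": [
            {"ItemId": 6789, "OrderDate": "11/11/2022"},
            {"ItemId": 4352, "OrderDate": "12/12/2022"},
        ],
    },
}

# One output row per element of the orders array; the struct fields
# (name, address, city) repeat on every row, just as in the query results.
rows = [
    (
        record["User"]["Name"],
        record["User"]["ShippingAddress"]["Address1"],
        record["User"]["ShippingAddress"]["City"],
        order["ItemId"],
        order["OrderDate"],
    )
    for order in record["User"]["Orders"]
]

for row in rows:
    print(row)
# ('Carlos', '123 Main St.', 'Anytown', 6789, '11/11/2022')
# ('Carlos', '123 Main St.', 'Anytown', 4352, '12/12/2022')
```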

## Additional resources
<a name="json-serdes-additional-resources"></a>

For more information about working with JSON and nested JSON in Athena, see the following resources:
+ [Create tables in Amazon Athena from nested JSON and mappings using JSONSerDe](https://aws.amazon.com/blogs/big-data/create-tables-in-amazon-athena-from-nested-json-and-mappings-using-jsonserde/) (AWS Big Data Blog)
+ [I get errors when I try to read JSON data in Amazon Athena](https://aws.amazon.com/premiumsupport/knowledge-center/error-json-athena/) (AWS Knowledge Center article)
+ [hive-json-schema](https://github.com/quux00/hive-json-schema) (GitHub) – Tool written in Java that generates `CREATE TABLE` statements from example JSON documents. The `CREATE TABLE` statements that are generated use the OpenX JSON Serde.

# CSV SerDe libraries
<a name="serde-csv-choices"></a>

When you create a table for CSV data in Athena, you can use either the Open CSV SerDe or the Lazy Simple SerDe library. To help you decide which to use, consider the following guidelines.
+ If your data contains values enclosed in double quotes (`"`), you can use the [Open CSV SerDe](https://cwiki.apache.org/confluence/display/Hive/CSV+Serde) library to deserialize the values in Athena. If your data does not contain values enclosed in double quotes (`"`), you can omit specifying any SerDe. In this case, Athena uses the default Lazy Simple SerDe. For information, see [Lazy Simple SerDe for CSV, TSV, and custom-delimited files](lazy-simple-serde.md).
+ If your data has UNIX numeric `TIMESTAMP` values (for example, `1579059880000`), use the Open CSV SerDe. If your data uses the `java.sql.Timestamp` format, use the Lazy Simple SerDe.
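To tell the two timestamp styles apart: a UNIX numeric value like `1579059880000` is milliseconds since the epoch, while `java.sql.Timestamp` renders as text like `yyyy-MM-dd HH:mm:ss`. A quick sketch showing that both forms can represent the same instant (the sample values here are illustrative):

```python
from datetime import datetime, timezone

# UNIX numeric timestamp in milliseconds (Open CSV SerDe style).
unix_ms = 1579059880000
from_unix = datetime.fromtimestamp(unix_ms / 1000, tz=timezone.utc)

# java.sql.Timestamp-style text (Lazy Simple SerDe style).
java_sql_style = "2020-01-15 03:44:40"
from_text = datetime.strptime(java_sql_style, "%Y-%m-%d %H:%M:%S").replace(
    tzinfo=timezone.utc
)

print(from_unix == from_text)  # True
```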

**Topics**
+ [Lazy Simple SerDe for CSV, TSV, and custom-delimited files](lazy-simple-serde.md)
+ [Open CSV SerDe for processing CSV](csv-serde.md)

# Lazy Simple SerDe for CSV, TSV, and custom-delimited files
<a name="lazy-simple-serde"></a>

Because this is the default SerDe in Athena for data in CSV, TSV, and custom-delimited formats, specifying it is optional. In your `CREATE TABLE` statement, if you don't specify a SerDe and specify only `ROW FORMAT DELIMITED`, Athena uses this SerDe. Use this SerDe if your data does not have values enclosed in quotes.

For reference documentation about the Lazy Simple SerDe, see the [Hive SerDe](https://cwiki.apache.org/confluence/display/Hive/DeveloperGuide#DeveloperGuide-HiveSerDe) section of the Apache Hive Developer Guide.

## Serialization library name
<a name="lazy-simple-serde-library-name"></a>

The serialization library name for the Lazy Simple SerDe is `org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe`. For source code information, see [LazySimpleSerDe.java](https://github.com/apache/hive/blob/master/serde/src/java/org/apache/hadoop/hive/serde2/lazy/LazySimpleSerDe.java) on GitHub.com. 

## Ignoring headers
<a name="lazy-simple-serde-ignoring-headers"></a>

To ignore headers in your data when you define a table, you can use the `skip.header.line.count` table property, as in the following example.

```
TBLPROPERTIES ("skip.header.line.count"="1")
```

For examples that ignore headers, see the `CREATE TABLE` statements in [Query Amazon VPC flow logs](vpc-flow-logs.md) and [Query Amazon CloudFront logs](cloudfront-logs.md).

## CSV example
<a name="csv-example"></a>

The following example shows how to use the `LazySimpleSerDe` library to create a table in Athena from CSV data. To deserialize custom-delimited files using this SerDe, follow the pattern in the examples but use the `FIELDS TERMINATED BY` clause to specify a different single-character delimiter. Lazy Simple SerDe does not support multi-character delimiters.

**Note**  
Replace *myregion* in `s3://athena-examples-myregion/path/to/data/` with the region identifier where you run Athena, for example, `s3://athena-examples-us-west-1/path/to/data/`.

Use the `CREATE TABLE` statement to create an Athena table from the underlying data in CSV stored in Amazon S3.

```
CREATE EXTERNAL TABLE flight_delays_csv (
    yr INT,
    quarter INT,
    month INT,
    dayofmonth INT,
    dayofweek INT,
    flightdate STRING,
    uniquecarrier STRING,
    airlineid INT,
    carrier STRING,
    tailnum STRING,
    flightnum STRING,
    originairportid INT,
    originairportseqid INT,
    origincitymarketid INT,
    origin STRING,
    origincityname STRING,
    originstate STRING,
    originstatefips STRING,
    originstatename STRING,
    originwac INT,
    destairportid INT,
    destairportseqid INT,
    destcitymarketid INT,
    dest STRING,
    destcityname STRING,
    deststate STRING,
    deststatefips STRING,
    deststatename STRING,
    destwac INT,
    crsdeptime STRING,
    deptime STRING,
    depdelay INT,
    depdelayminutes INT,
    depdel15 INT,
    departuredelaygroups INT,
    deptimeblk STRING,
    taxiout INT,
    wheelsoff STRING,
    wheelson STRING,
    taxiin INT,
    crsarrtime INT,
    arrtime STRING,
    arrdelay INT,
    arrdelayminutes INT,
    arrdel15 INT,
    arrivaldelaygroups INT,
    arrtimeblk STRING,
    cancelled INT,
    cancellationcode STRING,
    diverted INT,
    crselapsedtime INT,
    actualelapsedtime INT,
    airtime INT,
    flights INT,
    distance INT,
    distancegroup INT,
    carrierdelay INT,
    weatherdelay INT,
    nasdelay INT,
    securitydelay INT,
    lateaircraftdelay INT,
    firstdeptime STRING,
    totaladdgtime INT,
    longestaddgtime INT,
    divairportlandings INT,
    divreacheddest INT,
    divactualelapsedtime INT,
    divarrdelay INT,
    divdistance INT,
    div1airport STRING,
    div1airportid INT,
    div1airportseqid INT,
    div1wheelson STRING,
    div1totalgtime INT,
    div1longestgtime INT,
    div1wheelsoff STRING,
    div1tailnum STRING,
    div2airport STRING,
    div2airportid INT,
    div2airportseqid INT,
    div2wheelson STRING,
    div2totalgtime INT,
    div2longestgtime INT,
    div2wheelsoff STRING,
    div2tailnum STRING,
    div3airport STRING,
    div3airportid INT,
    div3airportseqid INT,
    div3wheelson STRING,
    div3totalgtime INT,
    div3longestgtime INT,
    div3wheelsoff STRING,
    div3tailnum STRING,
    div4airport STRING,
    div4airportid INT,
    div4airportseqid INT,
    div4wheelson STRING,
    div4totalgtime INT,
    div4longestgtime INT,
    div4wheelsoff STRING,
    div4tailnum STRING,
    div5airport STRING,
    div5airportid INT,
    div5airportseqid INT,
    div5wheelson STRING,
    div5totalgtime INT,
    div5longestgtime INT,
    div5wheelsoff STRING,
    div5tailnum STRING
)
    PARTITIONED BY (year STRING)
    ROW FORMAT DELIMITED
      FIELDS TERMINATED BY ','
      ESCAPED BY '\\'
      LINES TERMINATED BY '\n'
    LOCATION 's3://athena-examples-myregion/flight/csv/';
```

Run `MSCK REPAIR TABLE` to refresh partition metadata each time a new partition is added to this table:

```
MSCK REPAIR TABLE flight_delays_csv;
```

Query the top 10 routes delayed by more than 1 hour:

```
SELECT origin, dest, count(*) as delays
FROM flight_delays_csv
WHERE depdelayminutes > 60
GROUP BY origin, dest
ORDER BY 3 DESC
LIMIT 10;
```

**Note**  
The flight table data comes from [Flights](http://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236&amp;DB_Short_Name=On-Time) provided by US Department of Transportation, [Bureau of Transportation Statistics](http://www.transtats.bts.gov/).

## TSV example
<a name="tsv-example"></a>

To create an Athena table from TSV data stored in Amazon S3, use `ROW FORMAT DELIMITED` and specify `\t` as the field delimiter, `\n` as the line separator, and `\` as the escape character. The following excerpt shows this syntax. No sample TSV flight data is available in the `athena-examples` location, but as with the CSV table, you would run `MSCK REPAIR TABLE` to refresh partition metadata each time a new partition is added. 

```
...
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
ESCAPED BY '\\'
LINES TERMINATED BY '\n'
...
```

# Open CSV SerDe for processing CSV
<a name="csv-serde"></a>

Use the Open CSV SerDe to create Athena tables from comma-separated value (CSV) data.

## Serialization library name
<a name="csv-serde-library-name"></a>

The serialization library name for the Open CSV SerDe is `org.apache.hadoop.hive.serde2.OpenCSVSerde`. For source code information, see [CSV SerDe](https://cwiki.apache.org/confluence/display/Hive/CSV+Serde) in the Apache documentation.

## Using the Open CSV SerDe
<a name="csv-serde-using"></a>

To use this SerDe, specify its fully qualified class name after `ROW FORMAT SERDE`. Also specify the delimiters inside `SERDEPROPERTIES`, as in the following example.

```
...
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
  "separatorChar" = ",",
  "quoteChar"     = "`",
  "escapeChar"    = "\\"
)
```

### Ignore headers
<a name="csv-serde-opencsvserde-ignoring-headers"></a>

To ignore headers in your data when you define a table, you can use the `skip.header.line.count` table property, as in the following example.

```
TBLPROPERTIES ("skip.header.line.count"="1")
```

For examples, see the `CREATE TABLE` statements in [Query Amazon VPC flow logs](vpc-flow-logs.md) and [Query Amazon CloudFront logs](cloudfront-logs.md).

### Using NULL for invalid data
<a name="csv-serde-opencsvserde-using-null"></a>

To use NULL values for data that fails to deserialize into the column’s defined type, you can use the `use.null.for.invalid.data` table property, as shown in the following example. 

```
TBLPROPERTIES ("use.null.for.invalid.data"="TRUE")
```

**Important**  
Setting `use.null.for.invalid.data` to `TRUE` can cause incorrect or unexpected results because `NULL` values replace invalid data in columns with schema mismatches. We recommend that you fix the data in your files or table schema rather than enabling this property. When you enable this property, queries will not fail on invalid data, which may prevent you from discovering data quality issues.

### Considerations for string data
<a name="csv-serde-opencsvserde-considerations-string"></a>

The Open CSV SerDe has the following characteristics for string data:
+ Uses double quotes (`"`) as the default quote character, and allows you to specify separator, quote, and escape characters, such as: 

  ```
  WITH SERDEPROPERTIES ("separatorChar" = ",", "quoteChar" = "`", "escapeChar" = "\\" )
  ```
+ You cannot escape `\t` or `\n` directly. To escape them, use `"escapeChar" = "\\"`. For an example, see [Example: Escaping \t or \n](#csv-serde-opencsvserde-example-escaping-t-or-n).
+ The Open CSV SerDe does not support embedded line breaks in CSV files.
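
These properties map closely to the options of a typical CSV parser. As a rough sketch using Python's `csv` module (whose dialect handling differs from the Open CSV SerDe in some edge cases), the separator and quote settings correspond to:

```python
import csv
import io

# Comma separator and double-quote quoteChar: the SerDe defaults.
raw = '"1","2","abc","def"\n'
row = next(csv.reader(io.StringIO(raw), delimiter=",", quotechar='"'))
print(row)  # ['1', '2', 'abc', 'def']

# A custom quote character, like "quoteChar" = "`" in SERDEPROPERTIES.
raw2 = "`hello, world`,`x`\n"
row2 = next(csv.reader(io.StringIO(raw2), delimiter=",", quotechar="`"))
print(row2)  # ['hello, world', 'x']
```

Note that in both cases every field is parsed as a string, which matches the SerDe's behavior of treating quoted values as `STRING`.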

### Considerations for non-string data
<a name="csv-serde-opencsvserde-considerations-non-string"></a>

For data types other than `STRING`, the Open CSV SerDe behaves as follows:
+ Recognizes `BOOLEAN`, `BIGINT`, `INT`, and `DOUBLE` data types. 
+ Does not recognize empty or null values in columns defined as a numeric data type, leaving them as `string`. One workaround is to create the column with the null values as `string` and then use `CAST` to convert the field in a query to a numeric data type, supplying a default value of `0` for nulls. For more information, see [When I query CSV data in Athena, I get the error HIVE_BAD_DATA: Error parsing field value](https://aws.amazon.com/premiumsupport/knowledge-center/athena-hive-bad-data-error-csv/) in the AWS Knowledge Center.
+ For columns specified with the `timestamp` data type in your `CREATE TABLE` statement, recognizes `TIMESTAMP` data if it is specified in the UNIX numeric format in milliseconds, such as `1579059880000`. For an example, see [Example: Using the TIMESTAMP type and DATE type specified in the UNIX numeric format](#csv-serde-opencsvserde-example-timestamp-unix).
  + The Open CSV SerDe does not support `TIMESTAMP` in the JDBC-compliant `java.sql.Timestamp` format, such as `"YYYY-MM-DD HH:MM:SS.fffffffff"` (9 decimal place precision).
+ For columns specified with the `DATE` data type in your `CREATE TABLE` statement, recognizes values as dates if the values represent the number of days that elapsed since January 1, 1970. For example, the value `18276` in a column with the `date` data type renders as `2020-01-15` when queried. In this UNIX format, each day is considered to have 86,400 seconds.
  + The Open CSV SerDe does not support `DATE` in any other format directly. To process timestamp data in other formats, you can define the column as `string` and then use time conversion functions to return the desired results in your `SELECT` query. For more information, see the article [When I query a table in Amazon Athena, the TIMESTAMP result is empty](https://aws.amazon.com/premiumsupport/knowledge-center/query-table-athena-timestamp-empty/) in the [AWS Knowledge Center](https://aws.amazon.com/premiumsupport/knowledge-center/).
+ To further convert columns to the desired type in a table, you can [create a view](views.md) over the table and use `CAST` to convert to the desired type.
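
The two epoch conventions above can be sanity-checked outside Athena. The following sketch (Python standard library only) converts the sample values that appear in the examples later in this topic:

```python
from datetime import date, datetime, timedelta, timezone

# DATE columns: the value is a count of days elapsed since 1970-01-01,
# with each day treated as 86,400 seconds.
print(date(1970, 1, 1) + timedelta(days=18276))  # 2020-01-15

# TIMESTAMP columns: the value is milliseconds since the UNIX epoch.
ms = 1579059880000
print(datetime.fromtimestamp(ms / 1000, tz=timezone.utc)
      .strftime("%Y-%m-%d %H:%M:%S"))  # 2020-01-15 03:44:40
```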

## Examples
<a name="csv-serde-opencsvserde-examples"></a>

**Example: Querying simple CSV data**  
The following example assumes you have CSV data saved in the location `s3://amzn-s3-demo-bucket/mycsv/` with the following contents:  

```
"a1","a2","a3","a4"
"1","2","abc","def"
"a","a1","abc3","ab4"
```
Use a `CREATE TABLE` statement to create an Athena table based on the data. Reference `OpenCSVSerde` (note the "d" in lower case) after `ROW FORMAT SERDE` and specify the character separator, quote character, and escape character in `WITH SERDEPROPERTIES`, as in the following example.  

```
CREATE EXTERNAL TABLE myopencsvtable (
   col1 string,
   col2 string,
   col3 string,
   col4 string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
   'separatorChar' = ',',
   'quoteChar' = '"',
   'escapeChar' = '\\'
   )
STORED AS TEXTFILE
LOCATION 's3://amzn-s3-demo-bucket/mycsv/';
```
Query all values in the table:  

```
SELECT * FROM myopencsvtable;
```
The query returns the following values:  

```
col1     col2    col3    col4
-----------------------------
a1       a2      a3      a4
1        2       abc     def
a        a1      abc3    ab4
```

**Example: Using the TIMESTAMP type and DATE type specified in the UNIX numeric format**  
Consider the following three columns of comma-separated data. The values in each column are enclosed in double quotes.  

```
"unixvalue creationdate 18276 creationdatetime 1579059880000","18276","1579059880000"
```
The following statement creates a table in Athena from the specified Amazon S3 bucket location.  

```
CREATE EXTERNAL TABLE IF NOT EXISTS testtimestamp1(
 `profile_id` string,
 `creationdate` date,
 `creationdatetime` timestamp
 )
 ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
 LOCATION 's3://amzn-s3-demo-bucket'
```
Next, run the following query:   

```
SELECT * FROM testtimestamp1
```
The query returns the following result, showing the date and time data:  

```
profile_id                                                        creationdate     creationdatetime
unixvalue creationdate 18276 creationdatetime 1579059880000       2020-01-15       2020-01-15 03:44:40.000
```

**Example: Escaping \t or \n**  
Consider the following test data:  

```
" \\t\\t\\n 123 \\t\\t\\n ",abc
" 456 ",xyz
```
The following statement creates a table in Athena, specifying that `"escapeChar" = "\\"`.   

```
CREATE EXTERNAL TABLE test1 (
f1 string,
s2 string) 
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' 
WITH SERDEPROPERTIES ("separatorChar" = ",", "escapeChar" = "\\") 
LOCATION 's3://amzn-s3-demo-bucket/dataset/test1/'
```
Next, run the following query:   

```
SELECT * FROM test1;
```
It returns this result, correctly escaping `\t` or `\n`:  

```
f1            s2
\t\t\n 123 \t\t\n            abc
456                          xyz
```

# ORC SerDe
<a name="orc-serde"></a>

Use the ORC SerDe to create Athena tables from ORC data.

## Serialization library name
<a name="orc-serde-library-name"></a>

The serialization library for the ORC SerDe is `org.apache.hadoop.hive.ql.io.orc.OrcSerde`, but in your `CREATE TABLE` statements, you specify this with the clause `STORED AS ORC`. For source code information, see [OrcSerde.java](https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/io/orc/OrcSerde.java) on GitHub.com.

## Example: create a table for ORC flight data
<a name="orc-serde-example"></a>

**Note**  
Replace *myregion* in `s3://athena-examples-myregion/path/to/data/` with the region identifier where you run Athena, for example, `s3://athena-examples-us-west-1/path/to/data/`.

The following example creates a table for the flight delays data in ORC format. The table is partitioned by year:

```
DROP TABLE flight_delays_orc;
CREATE EXTERNAL TABLE flight_delays_orc (
    yr INT,
    quarter INT,
    month INT,
    dayofmonth INT,
    dayofweek INT,
    flightdate STRING,
    uniquecarrier STRING,
    airlineid INT,
    carrier STRING,
    tailnum STRING,
    flightnum STRING,
    originairportid INT,
    originairportseqid INT,
    origincitymarketid INT,
    origin STRING,
    origincityname STRING,
    originstate STRING,
    originstatefips STRING,
    originstatename STRING,
    originwac INT,
    destairportid INT,
    destairportseqid INT,
    destcitymarketid INT,
    dest STRING,
    destcityname STRING,
    deststate STRING,
    deststatefips STRING,
    deststatename STRING,
    destwac INT,
    crsdeptime STRING,
    deptime STRING,
    depdelay INT,
    depdelayminutes INT,
    depdel15 INT,
    departuredelaygroups INT,
    deptimeblk STRING,
    taxiout INT,
    wheelsoff STRING,
    wheelson STRING,
    taxiin INT,
    crsarrtime INT,
    arrtime STRING,
    arrdelay INT,
    arrdelayminutes INT,
    arrdel15 INT,
    arrivaldelaygroups INT,
    arrtimeblk STRING,
    cancelled INT,
    cancellationcode STRING,
    diverted INT,
    crselapsedtime INT,
    actualelapsedtime INT,
    airtime INT,
    flights INT,
    distance INT,
    distancegroup INT,
    carrierdelay INT,
    weatherdelay INT,
    nasdelay INT,
    securitydelay INT,
    lateaircraftdelay INT,
    firstdeptime STRING,
    totaladdgtime INT,
    longestaddgtime INT,
    divairportlandings INT,
    divreacheddest INT,
    divactualelapsedtime INT,
    divarrdelay INT,
    divdistance INT,
    div1airport STRING,
    div1airportid INT,
    div1airportseqid INT,
    div1wheelson STRING,
    div1totalgtime INT,
    div1longestgtime INT,
    div1wheelsoff STRING,
    div1tailnum STRING,
    div2airport STRING,
    div2airportid INT,
    div2airportseqid INT,
    div2wheelson STRING,
    div2totalgtime INT,
    div2longestgtime INT,
    div2wheelsoff STRING,
    div2tailnum STRING,
    div3airport STRING,
    div3airportid INT,
    div3airportseqid INT,
    div3wheelson STRING,
    div3totalgtime INT,
    div3longestgtime INT,
    div3wheelsoff STRING,
    div3tailnum STRING,
    div4airport STRING,
    div4airportid INT,
    div4airportseqid INT,
    div4wheelson STRING,
    div4totalgtime INT,
    div4longestgtime INT,
    div4wheelsoff STRING,
    div4tailnum STRING,
    div5airport STRING,
    div5airportid INT,
    div5airportseqid INT,
    div5wheelson STRING,
    div5totalgtime INT,
    div5longestgtime INT,
    div5wheelsoff STRING,
    div5tailnum STRING
)
PARTITIONED BY (year STRING)
STORED AS ORC
LOCATION 's3://athena-examples-myregion/flight/orc/'
tblproperties ("orc.compress"="ZLIB");
```

Run the `MSCK REPAIR TABLE` statement on the table to refresh partition metadata:

```
MSCK REPAIR TABLE flight_delays_orc;
```

Use this query to obtain the top 10 routes delayed by more than 1 hour:

```
SELECT origin, dest, count(*) as delays
FROM flight_delays_orc
WHERE depdelayminutes > 60
GROUP BY origin, dest
ORDER BY 3 DESC
LIMIT 10;
```

# Parquet SerDe
<a name="parquet-serde"></a>

Use the Parquet SerDe to create Athena tables from Parquet data.

The Parquet SerDe is used for data stored in the [Parquet format](https://cwiki.apache.org/confluence/display/Hive/Parquet). To convert data into Parquet format, you can use [CREATE TABLE AS SELECT (CTAS)](create-table-as.md) queries. For more information, see [Create a table from query results (CTAS)](ctas.md), [Examples of CTAS queries](ctas-examples.md) and [Use CTAS and INSERT INTO for ETL and data analysis](ctas-insert-into-etl.md).

## Serialization library name
<a name="parquet-serde-library-name"></a>

The serialization library name for the Parquet SerDe is `org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe`. For source code information, see [Class ParquetHiveSerDe](https://svn.apache.org/repos/infra/websites/production/hive/content/javadocs/r2.1.1/api/org/apache/hadoop/hive/ql/io/parquet/serde/ParquetHiveSerDe.html) in the Apache documentation.

## Example: Query a file stored in parquet
<a name="example-querying-a-file-stored-in-parquet"></a>

**Note**  
Replace *myregion* in `s3://athena-examples-myregion/path/to/data/` with the region identifier where you run Athena, for example, `s3://athena-examples-us-west-1/path/to/data/`.

Use the following `CREATE TABLE` statement to create an Athena table from the underlying data stored in Parquet format in Amazon S3:

```
CREATE EXTERNAL TABLE flight_delays_pq (
    yr INT,
    quarter INT,
    month INT,
    dayofmonth INT,
    dayofweek INT,
    flightdate STRING,
    uniquecarrier STRING,
    airlineid INT,
    carrier STRING,
    tailnum STRING,
    flightnum STRING,
    originairportid INT,
    originairportseqid INT,
    origincitymarketid INT,
    origin STRING,
    origincityname STRING,
    originstate STRING,
    originstatefips STRING,
    originstatename STRING,
    originwac INT,
    destairportid INT,
    destairportseqid INT,
    destcitymarketid INT,
    dest STRING,
    destcityname STRING,
    deststate STRING,
    deststatefips STRING,
    deststatename STRING,
    destwac INT,
    crsdeptime STRING,
    deptime STRING,
    depdelay INT,
    depdelayminutes INT,
    depdel15 INT,
    departuredelaygroups INT,
    deptimeblk STRING,
    taxiout INT,
    wheelsoff STRING,
    wheelson STRING,
    taxiin INT,
    crsarrtime INT,
    arrtime STRING,
    arrdelay INT,
    arrdelayminutes INT,
    arrdel15 INT,
    arrivaldelaygroups INT,
    arrtimeblk STRING,
    cancelled INT,
    cancellationcode STRING,
    diverted INT,
    crselapsedtime INT,
    actualelapsedtime INT,
    airtime INT,
    flights INT,
    distance INT,
    distancegroup INT,
    carrierdelay INT,
    weatherdelay INT,
    nasdelay INT,
    securitydelay INT,
    lateaircraftdelay INT,
    firstdeptime STRING,
    totaladdgtime INT,
    longestaddgtime INT,
    divairportlandings INT,
    divreacheddest INT,
    divactualelapsedtime INT,
    divarrdelay INT,
    divdistance INT,
    div1airport STRING,
    div1airportid INT,
    div1airportseqid INT,
    div1wheelson STRING,
    div1totalgtime INT,
    div1longestgtime INT,
    div1wheelsoff STRING,
    div1tailnum STRING,
    div2airport STRING,
    div2airportid INT,
    div2airportseqid INT,
    div2wheelson STRING,
    div2totalgtime INT,
    div2longestgtime INT,
    div2wheelsoff STRING,
    div2tailnum STRING,
    div3airport STRING,
    div3airportid INT,
    div3airportseqid INT,
    div3wheelson STRING,
    div3totalgtime INT,
    div3longestgtime INT,
    div3wheelsoff STRING,
    div3tailnum STRING,
    div4airport STRING,
    div4airportid INT,
    div4airportseqid INT,
    div4wheelson STRING,
    div4totalgtime INT,
    div4longestgtime INT,
    div4wheelsoff STRING,
    div4tailnum STRING,
    div5airport STRING,
    div5airportid INT,
    div5airportseqid INT,
    div5wheelson STRING,
    div5totalgtime INT,
    div5longestgtime INT,
    div5wheelsoff STRING,
    div5tailnum STRING
)
PARTITIONED BY (year STRING)
STORED AS PARQUET
LOCATION 's3://athena-examples-myregion/flight/parquet/'
tblproperties ("parquet.compression"="SNAPPY");
```

Run the `MSCK REPAIR TABLE` statement on the table to refresh partition metadata:

```
MSCK REPAIR TABLE flight_delays_pq;
```

Query the top 10 routes delayed by more than 1 hour:

```
SELECT origin, dest, count(*) as delays
FROM flight_delays_pq
WHERE depdelayminutes > 60
GROUP BY origin, dest
ORDER BY 3 DESC
LIMIT 10;
```

**Note**  
The flight table data comes from [Flights](http://www.transtats.bts.gov/DL_SelectFields.asp?Table_ID=236&DB_Short_Name=On-Time) provided by the US Department of Transportation, [Bureau of Transportation Statistics](http://www.transtats.bts.gov/). Desaturated from original.

## Ignore Parquet statistics
<a name="parquet-serde-ignoring-parquet-statistics"></a>

When you read Parquet data, you might receive error messages like the following:

```
HIVE_CANNOT_OPEN_SPLIT: Index x out of bounds for length y
HIVE_CURSOR_ERROR: Failed to read x bytes
HIVE_CURSOR_ERROR: FailureException at Malformed input: offset=x
HIVE_CURSOR_ERROR: FailureException at java.io.IOException: 
can not read class org.apache.parquet.format.PageHeader: Socket is closed by peer.
```

To work around this issue, use the [CREATE TABLE](create-table.md) or [ALTER TABLE SET TBLPROPERTIES](alter-table-set-tblproperties.md) statement to set the Parquet SerDe `parquet.ignore.statistics` property to `true`, as in the following examples. 

CREATE TABLE example

```
...
ROW FORMAT SERDE  
'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' 
WITH SERDEPROPERTIES (  
'parquet.ignore.statistics'='true')  
STORED AS PARQUET 
...
```

ALTER TABLE example

```
ALTER TABLE ... SET TBLPROPERTIES ('parquet.ignore.statistics'='true')
```

# Regex SerDe
<a name="regex-serde"></a>

The Regex SerDe uses a regular expression (regex) to deserialize data by extracting regex groups into table columns. 

If a row in the data does not match the regex, all columns in the row are returned as `NULL`. If a row matches the regex but has fewer groups than expected, the missing groups are `NULL`. If a row matches the regex but has more groups than expected, the additional groups are ignored. 
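
As an illustration of this group-to-column mapping, the following Python sketch mimics the behavior for a hypothetical three-column schema (the `parse_row` helper and sample regex are illustrative only, not part of Athena):

```python
import re

# One capture group per table column, as the Regex SerDe expects.
ROW_REGEX = re.compile(r"^(\d+) (\w+) (\w+)$")
COLUMNS = ["id", "code", "name"]

def parse_row(line):
    """Mimic the SerDe: a non-matching row yields NULL (None) for every column."""
    m = ROW_REGEX.match(line)
    if m is None:
        return {col: None for col in COLUMNS}
    return dict(zip(COLUMNS, m.groups()))

print(parse_row("42 ab widget"))  # {'id': '42', 'code': 'ab', 'name': 'widget'}
print(parse_row("not a match!"))  # every column is None
```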

For more information, see [Class RegexSerDe](https://svn.apache.org/repos/infra/websites/production/hive/content/javadocs/r1.2.2/api/org/apache/hadoop/hive/serde2/RegexSerDe.html) in the Apache Hive documentation.

## Serialization library name
<a name="regex-serde-library-name"></a>

The serialization library name for the Regex SerDe is `org.apache.hadoop.hive.serde2.RegexSerDe`. For source code information, see [Class RegexSerDe](https://svn.apache.org/repos/infra/websites/production/hive/content/javadocs/r1.2.2/api/org/apache/hadoop/hive/serde2/RegexSerDe.html) in the Apache documentation. 

## Example
<a name="regex-serde-examples"></a>

The following example creates a table from CloudFront logs using the Regex SerDe. Replace *myregion* in `s3://athena-examples-myregion/cloudfront/plaintext/` with the region identifier where you run Athena (for example, `s3://athena-examples-us-west-1/cloudfront/plaintext/`).

```
CREATE EXTERNAL TABLE IF NOT EXISTS cloudfront_logs (
  `Date` DATE,
  Time STRING,
  Location STRING,
  Bytes INT,
  RequestIP STRING,
  Method STRING,
  Host STRING,
  Uri STRING,
  Status INT,
  Referrer STRING,
  os STRING,
  Browser STRING,
  BrowserVersion STRING
 ) 
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
 WITH SERDEPROPERTIES (
 "input.regex" = "^(?!#)([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+([^ ]+)\\s+[^\(]+[\(]([^\;]+).*\%20([^\/]+)[\/](.*)$"
 ) 
LOCATION 's3://athena-examples-myregion/cloudfront/plaintext/';
```