

# Amazon Ion Hive SerDe
<a name="ion-serde"></a>

You can use the Amazon Ion Hive SerDe to query data stored in [Amazon Ion](https://amzn.github.io/ion-docs/guides/cookbook.html) format. Amazon Ion is a richly-typed, self-describing, open source data format. The Amazon Ion format is used in the open source SQL query language [PartiQL](https://partiql.org/).

Amazon Ion has binary and text formats that are interchangeable. This feature combines the ease of use of text with the efficiency of binary encoding.

To query Amazon Ion data from Athena, you can use the [Amazon Ion Hive SerDe](https://github.com/amzn/ion-hive-serde), which serializes and deserializes Amazon Ion data. Deserialization allows you to run queries on the Amazon Ion data or read it for writing out into a different format like Parquet or ORC. Serialization lets you generate data in the Amazon Ion format by using `CREATE TABLE AS SELECT` (CTAS) or `INSERT INTO` queries to copy data from existing tables.

**Note**  
Because Amazon Ion is a superset of JSON, you can use the Amazon Ion Hive SerDe to query non-Amazon Ion JSON datasets. Unlike other [JSON SerDe libraries](https://docs.aws.amazon.com/athena/latest/ug/json-serde.html), the Amazon Ion SerDe does not expect each row of data to be on a single line. This feature is useful if you want to query JSON datasets that are in "pretty print" format or otherwise break up the fields in a row with newline characters.

For additional information and examples of querying Amazon Ion with Athena, see [Analyze Amazon Ion datasets using Amazon Athena](https://aws.amazon.com/blogs/big-data/analyze-amazon-ion-datasets-using-amazon-athena/).

## Serialization library name
<a name="library-name"></a>

The serialization library name for the Amazon Ion SerDe is `com.amazon.ionhiveserde.IonHiveSerDe`. For source code information, see [Amazon Ion Hive SerDe](https://github.com/amazon-ion/ion-hive-serde) on GitHub.com.

## Considerations and limitations
<a name="ion-serde-considerations-and-limitations"></a>
+ **Duplicated fields** – Amazon Ion structs are ordered and support duplicated fields, while Hive's `STRUCT<>` and `MAP<>` do not. Thus, when you deserialize a duplicated field from an Amazon Ion struct, a single value is chosen non deterministically, and the others are ignored.
+ **External symbol tables unsupported** – Currently, Athena does not support external symbol tables or the following Amazon Ion Hive SerDe properties:
  + `ion.catalog.class`
  + `ion.catalog.file`
  + `ion.catalog.url`
  + `ion.symbol_table_imports`
+ **File extensions** – Amazon Ion uses file extensions to determine which compression codec to use for deserializing Amazon Ion files. As such, compressed files must have the file extension that corresponds to the compression algorithm used. For example, if ZSTD is used, corresponding files should have the extension `.zst`.
+ **Homogeneous data** – Amazon Ion has no restrictions on the data types that can be used for values in particular fields. For example, two different Amazon Ion documents might have a field with the same name that have different data types. However, because Hive uses a schema, all values that you extract to a single Hive column must have the same data type.
+ **Map key type restrictions** – When you serialize data from another format into Amazon Ion, ensure that the map key type is one of `STRING`, `VARCHAR`, or `CHAR`. Although Hive allows you to use any primitive data type as a map key, [Amazon Ion symbols](https://amzn.github.io/ion-docs/docs/symbols.html) must be a string type.
+ **Union type** – Athena does not currently support the Hive [union type](https://cwiki.apache.org/confluence/display/hive/languagemanual+types/#LanguageManualTypes-UnionTypesunionUnionTypes).
+ **Double data type** – Amazon Ion does not currently support the `double` data type.

**Topics**
+ [Serialization library name](#library-name)
+ [Considerations and limitations](#ion-serde-considerations-and-limitations)
+ [Create Amazon Ion tables](ion-serde-using-create-table.md)
+ [Use CTAS and INSERT INTO to create Amazon Ion tables](ion-serde-using-ctas-and-insert-into-to-create-ion-tables.md)
+ [Amazon Ion SerDe property reference](ion-serde-using-ion-serde-properties.md)
+ [Use path extractors](ion-serde-using-path-extractors.md)

# Create Amazon Ion tables
<a name="ion-serde-using-create-table"></a>

To create a table in Athena from data stored in Amazon Ion format, you can use one of the following techniques in a CREATE TABLE statement:
+ Specify `STORED AS ION`. In this usage, you do not have to specify the Amazon Ion Hive SerDe explicitly. This choice is the more straightforward option.
+ Specify the Amazon Ion class paths in the `ROW FORMAT SERDE`, `INPUTFORMAT`, and `OUTPUTFORMAT` fields.

You can also use `CREATE TABLE AS SELECT` (CTAS) statements to create Amazon Ion tables in Athena. For information, see [Use CTAS and INSERT INTO to create Amazon Ion tables](ion-serde-using-ctas-and-insert-into-to-create-ion-tables.md).

## Specify STORED AS ION
<a name="ion-serde-specifying-stored-as-ion"></a>

The following example `CREATE TABLE` statement uses `STORED AS ION` before the `LOCATION` clause to create a table based on flight data in Amazon Ion format. The `LOCATION` clause specifies the bucket or folder where the input files in Ion format are located. All files in the specified location are scanned.

```
CREATE EXTERNAL TABLE flights_ion (
    yr INT,
    quarter INT,
    month INT,
    dayofmonth INT,
    dayofweek INT,
    flightdate STRING,
    uniquecarrier STRING,
    airlineid INT,
)
STORED AS ION
LOCATION 's3://amzn-s3-demo-bucket/'
```

## Specify the Amazon Ion class paths
<a name="ion-serde-specifying-the-ion-class-paths"></a>

Instead of using the `STORED AS ION` syntax, you can explicitly specify the Ion class path values for the `ROW FORMAT SERDE`, `INPUTFORMAT`, and `OUTPUTFORMAT` clauses as follows.


****  

| Parameter | Ion class path | 
| --- | --- | 
| ROW FORMAT SERDE | 'com.amazon.ionhiveserde.IonHiveSerDe' | 
| STORED AS INPUTFORMAT | 'com.amazon.ionhiveserde.formats.IonInputFormat' | 
| OUTPUTFORMAT | 'com.amazon.ionhiveserde.formats.IonOutputFormat' | 

The following DDL query uses this technique to create the same external table as in the previous example.

```
CREATE EXTERNAL TABLE flights_ion (
    yr INT,
    quarter INT,
    month INT,
    dayofmonth INT,
    dayofweek INT,
    flightdate STRING,
    uniquecarrier STRING,
    airlineid INT,
)
ROW FORMAT SERDE
 'com.amazon.ionhiveserde.IonHiveSerDe'
STORED AS INPUTFORMAT
 'com.amazon.ionhiveserde.formats.IonInputFormat'
OUTPUTFORMAT
 'com.amazon.ionhiveserde.formats.IonOutputFormat'
LOCATION 's3://amzn-s3-demo-bucket/'
```

For information about the SerDe properties for `CREATE TABLE` statements in Athena, see [Amazon Ion SerDe property reference](ion-serde-using-ion-serde-properties.md).

# Use CTAS and INSERT INTO to create Amazon Ion tables
<a name="ion-serde-using-ctas-and-insert-into-to-create-ion-tables"></a>

You can use the `CREATE TABLE AS SELECT` (CTAS) and `INSERT INTO` statements to copy or insert data from a table into a new table in Amazon Ion format in Athena.

In a CTAS query, specify `format='ION'` in the `WITH` clause, as in the following example.

```
CREATE TABLE new_table
WITH (format='ION')
AS SELECT * from existing_table
```

By default, Athena serializes Amazon Ion results in [Ion binary format](https://amzn.github.io/ion-docs/docs/binary.html), but you can also use text format. To use text format, specify `ion_encoding = 'TEXT'` in the CTAS `WITH` clause, as in the following example.

```
CREATE TABLE new_table
WITH (format='ION', ion_encoding = 'TEXT')
AS SELECT * from existing_table
```

For more information about Amazon Ion specific properties in the CTAS `WITH` clause, see [Amazon Ion properties for the CTAS WITH clause](#ion-serde-ctas-with-clause-properties).

## Amazon Ion properties for the CTAS WITH clause
<a name="ion-serde-ctas-with-clause-properties"></a>

In a CTAS query, you can use the `WITH` clause to specify the Amazon Ion format and optionally specify the Amazon Ion encoding and/or write compression algorithm to use.

**format**  
You can specify the `ION` keyword as the format option in the `WITH` clause of a CTAS query. When you do so, the table that you create uses the format that you specify for `IonInputFormat` for reads, and it serializes data in the format that you specify for `IonOutputFormat`.  
The following example specifies that the CTAS query use Amazon Ion format.  

```
WITH (format='ION')
```

**ion\$1encoding**  
Optional  
Default: `BINARY`  
Values: `BINARY`, `TEXT`  
Specifies whether data is serialized in Amazon Ion binary format or Amazon Ion text format. The following example specifies Amazon Ion text format.  

```
WITH (format='ION', ion_encoding='TEXT')
```

**write\$1compression**  
Optional  
Default: `GZIP`  
Values: `GZIP`, `ZSTD`, `BZIP2`, `SNAPPY`, `NONE`  
Specifies the compression algorithm to use to compress output files.  
The following example specifies that the CTAS query write its output in Amazon Ion format using the [Zstandard](https://facebook.github.io/zstd/) compression algorithm.  

```
WITH (format='ION', write_compression = 'ZSTD')       
```
For information about using compression in Athena, see [Use compression in Athena](compression-formats.md). 

For information about other CTAS properties in Athena, see [CTAS table properties](create-table-as.md#ctas-table-properties).

# Amazon Ion SerDe property reference
<a name="ion-serde-using-ion-serde-properties"></a>

This topic contains information about the SerDe properties for `CREATE TABLE` statements in Athena. For more information and examples of Amazon Ion SerDe property usage, see [SerDe properties](https://github.com/amzn/ion-hive-serde/blob/master/docs/serde-properties.md) in the Amazon Ion Hive SerDe documentation on [GitHub](https://github.com/amzn/ion-hive-serde/tree/master/docs).

## How to specify Amazon Ion SerDe properties
<a name="ion-serde-specifying-ion-serde-properties"></a>

To specify properties for the Amazon Ion Hive SerDe in your `CREATE TABLE` statement, use the `WITH SERDEPROPERTIES` clause. Because `WITH SERDEPROPERTIES` is a subfield of the `ROW FORMAT SERDE` clause, you must specify `ROW FORMAT SERDE` and the Amazon Ion Hive SerDe class path first, as the following syntax shows.

```
...
ROW FORMAT SERDE
 'com.amazon.ionhiveserde.IonHiveSerDe'
WITH SERDEPROPERTIES (
 'property' = 'value',
 'property' = 'value',
...
)
```

Note that although the `ROW FORMAT SERDE` clause is required if you want to use `WITH SERDEPROPERTIES`, you can use either `STORED AS ION` or the longer `INPUTFORMAT` and `OUTPUTFORMAT` syntax to specify the Amazon Ion format.

## Amazon Ion SerDe properties
<a name="ion-serde-ion-serde-properties"></a>

Following are the Amazon Ion SerDe properties that can be used in `CREATE TABLE` statements in Athena.

**ion.encoding**  
Optional  
Default: `BINARY`  
Values: `BINARY`, `TEXT`  
This property declares whether new values added are serialized as [Amazon Ion binary](https://amzn.github.io/ion-docs/docs/binary.html) or Amazon Ion text format.  
The following SerDe property example specifies Amazon Ion text format.  

```
'ion.encoding' = 'TEXT'
```

**ion.fail\$1on\$1overflow**  
Optional  
Default: `true`  
Values: `true`, `false`  
Amazon Ion allows for arbitrarily large numerical types while Hive does not. By default, the SerDe fails if the Amazon Ion value does not fit the Hive column, but you can use the `fail_on_overflow` configuration option to let the value overflow instead of failing.  
This property can be set at either the table or column level. To specify it at the table level, specify `ion.fail_on_overflow` as in the following example. This sets the default behavior for all columns.  

```
'ion.fail_on_overflow' = 'true'
```
To control a specific column, specify the column name between `ion` and `fail_on_overflow`, delimited by periods, as in the following example.  

```
'ion.<column>.fail_on_overflow' = 'false'
```

**ion.path\$1extractor.case\$1sensitive**  
Optional  
Default: `false`  
Values: `true`, `false`  
Determines whether to treat Amazon Ion field names as case sensitive. When `false`, the SerDe ignores case parsing Amazon Ion field names.  
For example, suppose you have a Hive table schema that defines a field `alias` in lower case and an Amazon Ion document with both an `alias` field and an `ALIAS` field, as in the following example.  

```
-- Hive Table Schema
alias: STRING

-- Amazon Ion Document
{ 'ALIAS': 'value1'} 
{ 'alias': 'value2'}
```
The following example shows SerDe properties and the resulting extracted table when case sensitivity is set to `false`:  

```
-- Serde properties
'ion.alias.path_extractor' = '(alias)'
'ion.path_extractor.case_sensitive' = 'false'

--Extracted Table
| alias    |
|----------|
| "value1" |
| "value2" |
```
The following example shows SerDe properties and the resulting extracted table when case sensitivity is set to `true`:  

```
-- Serde properties
'ion.alias.path_extractor' = '(alias)'
'ion.path_extractor.case_sensitive' = 'true'

--Extracted Table
| alias    |
|----------|
| "value2" |
```
In the second case, `value1` for the `ALIAS` field is ignored when case sensitivity is set to `true` and the path extractor is specified as `alias`.

**ion.*<column>*.path\$1extractor**  
Optional  
Default: NA  
Values: String with search path  
Creates a path extractor with the specified search path for the given column. Path extractors map Amazon Ion fields to Hive columns. If no path extractors are specified, Athena dynamically creates path extractors at run time based on column names.  
The following example path extractor maps the `example_ion_field` to the `example_hive_column`.  

```
'ion.example_hive_column.path_extractor' = '(example_ion_field)'
```
For more information about path extractors and search paths, see [Use path extractors](ion-serde-using-path-extractors.md).

**ion.timestamp.serialization\$1offset**  
Optional  
Default: `'Z'`  
Values: `OFFSET`, where `OFFSET `is represented as `<signal>hh:mm`. Example values: `01:00`, `+01:00`, `-09:30`, `Z` (UTC, same as 00:00)  
Unlike Apache Hive [timestamps](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types#LanguageManualTypes-timestamp), which have no built-in time zone and are stored as an offset from the UNIX epoch, Amazon Ion timestamps do have an offset. Use this property to specify the offset when you serialize to Amazon Ion.  
The following example adds an offset of one hour.  

```
'ion.timestamp.serialization_offset' = '+01:00'       
```

**ion.serialize\$1null**  
Optional  
Default: `OMIT`  
Values: `OMIT`, `UNTYPED`, `TYPED`  
The Amazon Ion SerDe can be configured to either serialize or omit columns that have null values. You can choose to write out strongly typed nulls (`TYPED`) or untyped nulls (`UNTYPED`). Strongly typed nulls are determined based on the default Amazon Ion to Hive type mapping.  
The following example specifies strongly typed nulls.  

```
'ion.serialize_null'='TYPED'
```

**ion.ignore\$1malformed**  
Optional  
Default: `false`  
Values: `true`, `false`  
When `true`, ignores malformed entries or the whole file if the SerDe is unable to read it. For more information, see [Ignore malformed](https://github.com/amzn/ion-hive-serde/blob/master/docs/serde-properties.md#ignore-malformed) in the documentation on GitHub.

**ion.*<column>*.serialize\$1as**  
Optional  
Default: Default type for the column.  
Values: String containing Amazon Ion type  
Determines the Amazon Ion data type in which a value is serialized. Because Amazon Ion and Hive types do not always have a direct mapping, a few Hive types have multiple valid data types for serialization. To serialize data as a non-default data type, use this property. For more information about type mapping, see the Amazon Ion [Type mapping](https://github.com/amzn/ion-hive-serde/blob/master/docs/type-mapping.md) page on GitHub.  
By default, binary Hive columns are serialized as Amazon Ion blobs, but they can also be serialized as an [Amazon Ion clob](https://amzn.github.io/ion-docs/docs/stringclob.html#ion-clob) (character large object). The following example serializes the column `example_hive_binary_column` as a clob.  

```
'ion.example_hive_binary_column.serialize_as' = 'clob'       
```

# Use path extractors
<a name="ion-serde-using-path-extractors"></a>

Amazon Ion is a document style file format, but Apache Hive is a flat columnar format. You can use special Amazon Ion SerDe properties called `path extractors` to map between the two formats. Path extractors flatten the hierarchical Amazon Ion format, map Amazon Ion values to Hive columns, and can be used to rename fields.

Athena can generate the extractors for you, but you can also define your own extractors if necessary.

**Topics**
+ [Use Athena generated path extractors](ion-serde-generated-path-extractors.md)
+ [Specify your own path extractors](ion-serde-specifying-your-own-path-extractors.md)
+ [Use search paths in path extractors](ion-serde-using-search-paths-in-path-extractors.md)
+ [Path extractor examples](ion-serde-examples.md)

# Use Athena generated path extractors
<a name="ion-serde-generated-path-extractors"></a>

By default, Athena searches for top level Amazon Ion values that match Hive column names and creates path extractors at runtime based on these matching values. If your Amazon Ion data format matches the Hive table schema, Athena dynamically generates the extractors for you, and you do not need to add any additional path extractors. These default path extractors are not stored in the table metadata.

The following example shows how Athena generates extractors based on column name.

```
-- Example Amazon Ion Document
{
    identification: {
        name: "John Smith",
        driver_license: "XXXX"
    },
    
    alias: "Johnny"    
}

-- Example DDL
CREATE EXTERNAL TABLE example_schema2 (
    identification MAP<STRING, STRING>,
    alias STRING
)
STORED AS ION
LOCATION 's3://amzn-s3-demo-bucket/path_extraction1/'
```

The following example extractors are generated by Athena. The first extracts the `identification` field to the `identification` column, and the second extracts the `alias` field to the `alias` column.

```
'ion.identification.path_extractor' = '(identification)'
'ion.alias.path_extractor' = '(alias)'
```

The following example shows the extracted table.

```
|                  identification                    |  alias   |
|----------------------------------------------------|----------|
|{["name", "driver_license"],["John Smith", "XXXX"]} | "Johnny" |
```

# Specify your own path extractors
<a name="ion-serde-specifying-your-own-path-extractors"></a>

If your Amazon Ion fields do not map neatly to Hive columns, you can specify your own path extractors. In the `WITH SERDEPROPERTIES` clause of your `CREATE TABLE` statement, use the following syntax.

```
WITH SERDEPROPERTIES (
   "ion.path_extractor.case_sensitive" = "<Boolean>", 
   "ion.<column_name>.path_extractor" = "<path_extractor_expression>"
)
```

**Note**  
By default, path extractors are case insensitive. To override this setting, set the [ion.path_extractor.case_sensitive](ion-serde-using-ion-serde-properties.md#ioncase) SerDe property to `true`.

# Use search paths in path extractors
<a name="ion-serde-using-search-paths-in-path-extractors"></a>

The SerDe property syntax for path extractor contains a *<path\$1extractor\$1expression>*:

```
"ion.<column_name>.path_extractor" = "<path_extractor_expression>"         
```

You can use the *<path\$1extractor\$1expression>* to specify a search path that parses the Amazon Ion document and finds matching data. The search path is enclosed in parenthesis and can contain one or more of the following components separated by spaces.
+ **Wild card** – Matches all values.
+ **Index** – Matches the value at the specified numerical index. Indices are zero-based.
+ **Text** – Matches all values whose field names match are equivalent to the specified text.
+ **Annotations** – Matches values specified by a wrapped path component that has the annotations specified.

The following example shows an Amazon Ion document and some example search paths.

```
-- Amazon Ion document
{
    foo: ["foo1", "foo2"] ,
    bar: "myBarValue", 
    bar: A::"annotatedValue"
}

-- Example search paths
(foo 0)       # matches "foo1"
(1)           # matches "myBarValue"
(*)           # matches ["foo1", "foo2"], "myBarValue" and A::"annotatedValue"
()            # matches {foo: ["foo1", "foo2"] , bar: "myBarValue", bar: A::"annotatedValue"}
(bar)         # matches "myBarValue" and A::"annotatedValue"
(A::bar)      # matches A::"annotatedValue"
```

# Path extractor examples
<a name="ion-serde-examples"></a>

The following path extractor examples show how to flatten and rename fields or extract data as Amazon Ion text.

## Flatten and rename fields
<a name="ion-serde-flattening-and-renaming-fields"></a>

The following example shows a set of search paths that flatten and rename fields. The example uses search paths to do the following:
+ Map the `nickname` column to the `alias` field
+ Map the `name` column to the `name` subfield located in the `identification` struct.

Following is the example Amazon Ion document.

```
-- Example Amazon Ion Document
{
    identification: {
        name: "John Smith",
        driver_license: "XXXX"
    },
    
    alias: "Johnny"    
}
```

The following is the example `CREATE TABLE` statement that defines the path extractors.

```
-- Example DDL Query
CREATE EXTERNAL TABLE example_schema2 (
    name STRING,
    nickname STRING
)
ROW FORMAT SERDE
 'com.amazon.ionhiveserde.IonHiveSerDe'
WITH SERDEPROPERTIES (
 'ion.nickname.path_extractor' = '(alias)',
 'ion.name.path_extractor' = '(identification name)'
 )
STORED AS ION
LOCATION 's3://amzn-s3-demo-bucket/path_extraction2/'
```

The following example shows the extracted data.

```
-- Extracted Table
| name         |   nickname   |
|--------------|--------------|
| "John Smith" |  "Johnny"    |
```

For more information about search paths and additional search path examples, see the [Ion Java Path Extraction](https://github.com/amzn/ion-java-path-extraction) page on GitHub.

## Extract flight data to text format
<a name="ion-serde-extracting-flight-data-to-text-format"></a>

The following example `CREATE TABLE` query uses `WITH SERDEPROPERTIES` to add path extractors to extract flight data and specify the output encoding as Amazon Ion text. The example uses the `STORED AS ION` syntax.

```
CREATE EXTERNAL TABLE flights_ion (
    yr INT,
    quarter INT,
    month INT,
    dayofmonth INT,
    dayofweek INT,
    flightdate STRING,
    uniquecarrier STRING,
    airlineid INT,
)
ROW FORMAT SERDE
 'com.amazon.ionhiveserde.IonHiveSerDe'
WITH SERDEPROPERTIES (
 'ion.encoding' = 'TEXT',
 'ion.yr.path_extractor'='(year)',
 'ion.quarter.path_extractor'='(results quarter)',
 'ion.month.path_extractor'='(date month)')
STORED AS ION
LOCATION 's3://amzn-s3-demo-bucket/'
```