

# Use Amazon Athena Federated Query

If you have data in sources other than Amazon S3, you can use Athena Federated Query to query the data in place or build pipelines that extract data from multiple data sources and store them in Amazon S3. With Athena Federated Query, you can run SQL queries across data stored in relational, non-relational, object, and custom data sources.

Athena uses *data source connectors* that run on AWS Lambda to run federated queries. A data source connector is a piece of code that can translate between your target data source and Athena. You can think of a connector as an extension of Athena's query engine. Prebuilt Athena data source connectors exist for data sources like Amazon CloudWatch Logs, Amazon DynamoDB, Amazon DocumentDB, and Amazon RDS, and JDBC-compliant relational data sources such as MySQL and PostgreSQL under the Apache 2.0 license. You can also use the Athena Query Federation SDK to write custom connectors. To choose, configure, and deploy a data source connector to your account, you can use the Athena and Lambda consoles or the AWS Serverless Application Repository. After you deploy a data source connector, it is associated with a catalog that you can specify in SQL queries. You can combine SQL statements from multiple catalogs and span multiple data sources with a single query.
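For example, assuming a DynamoDB connector deployed under the hypothetical catalog name `ddbcatalog`, a single query could join that federated catalog with a table registered in the default `AwsDataCatalog` (all database, table, and column names here are hypothetical):

```
SELECT o.order_id, c.customer_name
FROM ddbcatalog.default.orders o
JOIN awsdatacatalog.sales_db.customers c
    ON o.customer_id = c.customer_id
```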

When a query is submitted against a data source, Athena invokes the corresponding connector to identify the parts of the tables that need to be read, manage parallelism, and push down filter predicates. Based on the user submitting the query, connectors can provide or restrict access to specific data elements. Connectors use Apache Arrow as the format for returning data requested in a query, which enables connectors to be implemented in languages such as C, C++, Java, Python, and Rust. Because connectors run in Lambda, they can be used to access data from any data source, in the cloud or on premises, that is accessible from Lambda.

To write your own data source connector, you can use the Athena Query Federation SDK to customize one of the prebuilt connectors that Amazon Athena provides and maintains. You can modify a copy of the source code from the [GitHub repository](https://github.com/awslabs/aws-athena-query-federation/wiki/Available-Connectors) and then use the [Connector publish tool](https://github.com/awslabs/aws-athena-query-federation/wiki/Connector_Publish_Tool) to create your own AWS Serverless Application Repository package. 

**Note**  
Third-party developers may have used the Athena Query Federation SDK to write data source connectors. For support or licensing issues with these data source connectors, work with your connector provider. These connectors are not tested or supported by AWS. 

For a list of data source connectors written and tested by Athena, see [Available data source connectors](connectors-available.md).

For information about writing your own data source connector, see [Example Athena connector](https://github.com/awslabs/aws-athena-query-federation/tree/master/athena-example) on GitHub.

## Considerations and limitations

+ **Engine versions** – Athena Federated Query is supported only on Athena engine version 2 and later. For information about Athena engine versions, see [Athena engine versioning](engine-versions.md).
+ **Views** – You can create and query views on federated data sources. Federated views are stored in AWS Glue, not the underlying data source. For more information, see [Query federated views](running-federated-queries.md#running-federated-queries-federated-views).
+ **Delimited identifiers** – Delimited identifiers (also known as quoted identifiers) begin and end with double quotation marks ("). Currently, delimited identifiers are not supported for federated queries in Athena.
+ **Write operations** – Write operations like [INSERT INTO](insert-into.md) are not supported. Attempting to do so may result in the error message *This operation is currently not supported for external catalogs*.
+  **Pricing** – For pricing information, see [Amazon Athena pricing](https://aws.amazon.com/athena/pricing/).
+ **JDBC driver** – To use the JDBC driver with federated queries or an [external Hive metastore](connect-to-data-source-hive.md), include `MetadataRetrievalMethod=ProxyAPI` in your JDBC connection string. For information about the JDBC driver, see [Connect to Amazon Athena with JDBC](connect-with-jdbc.md). 
+ **Secrets Manager** – To use the Athena Federated Query feature with AWS Secrets Manager, you must configure an Amazon VPC private endpoint for Secrets Manager. For more information, see [Create a Secrets Manager VPC private endpoint](https://docs.aws.amazon.com/secretsmanager/latest/userguide/vpc-endpoint-overview.html#vpc-endpoint-create) in the *AWS Secrets Manager User Guide*.
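As a sketch of the JDBC driver setting described above, a connection string for version 2.x of the Athena JDBC driver might look like the following. The Region and S3 output location are placeholders; the exact property names depend on your driver version.

```
jdbc:awsathena://AwsRegion=us-east-1;S3OutputLocation=s3://amzn-s3-demo-bucket/athena-results/;MetadataRetrievalMethod=ProxyAPI;
```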

## Permissions required


Data source connectors might require access to the following resources to function correctly. If you use a prebuilt connector, check the information for the connector to ensure that you have configured your VPC correctly. Also, ensure that IAM principals running queries and creating connectors have privileges to required actions. For more information, see [Allow access to Athena Federated Query: Example policies](federated-query-iam-access.md).
+ **Amazon S3** – In addition to writing query results to the Athena query results location in Amazon S3, data connectors also write to a spill bucket in Amazon S3. Connectivity and permissions to this Amazon S3 location are required. We recommend using spill to disk encryption for each connector and [S3 lifecycle configuration](https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-lifecycle-mgmt.html) to expire spilled data that is no longer needed.
+ **Athena** – Data sources need connectivity to Athena and vice versa for checking query status and preventing overscan.
+ **AWS Glue Data Catalog** – Connectivity and permissions are required if your connector uses Data Catalog for supplemental or primary metadata.
+ **Amazon ECR** – Data source connector Lambda functions use an Amazon ECR image from an Amazon ECR repository. The user that deploys the connector must have the permissions `ecr:BatchGetImage` and `ecr:GetDownloadUrlForLayer`. For more information, see [Amazon ECR permissions](https://docs.aws.amazon.com/lambda/latest/dg/images-create.html#gettingstarted-images-permissions) in the *AWS Lambda Developer Guide*.

## Videos


Watch the following videos to learn more about using Athena Federated Query.

**Video: Analyze Results of Federated Query in Amazon Athena in Amazon QuickSight**  
The following video demonstrates how to analyze results of an Athena federated query in Amazon QuickSight.

[![AWS Videos](http://img.youtube.com/vi/HyM5d0TmwAQ/0.jpg)](http://www.youtube.com/watch?v=HyM5d0TmwAQ)


**Video: Game Analytics Pipeline**  
The following video shows how to deploy a scalable serverless data pipeline to ingest, store, and analyze telemetry data from games and services using Amazon Athena federated queries.

[![AWS Videos](http://img.youtube.com/vi/xcS-flUMVbs/0.jpg)](http://www.youtube.com/watch?v=xcS-flUMVbs)


# Available data source connectors

This section lists prebuilt Athena data source connectors that you can use to query a variety of data sources external to Amazon S3. To use a connector in your Athena queries, configure it and deploy it to your account. 

## Considerations and limitations

+ Some prebuilt connectors require that you create a VPC and a security group before you can use the connector. For information about creating VPCs, see [Create a VPC for a data source connector or AWS Glue connection](athena-connectors-vpc-creation.md). 
+ To use the Athena Federated Query feature with AWS Secrets Manager, you must configure an Amazon VPC private endpoint for Secrets Manager. For more information, see [Create a Secrets Manager VPC private endpoint](https://docs.aws.amazon.com/secretsmanager/latest/userguide/vpc-endpoint-overview.html#vpc-endpoint-create) in the *AWS Secrets Manager User Guide*. 
+ For connectors that do not support predicate pushdown, queries that include a predicate take longer to execute. For small datasets, very little data is scanned, and queries take an average of about 2 minutes. However, for large datasets, many queries can time out.
+ Some federated data sources use terminology for data objects that differs from Athena's. For more information, see [Understand federated table name qualifiers](tables-qualifiers.md).
+ We update our connectors periodically based on upgrades from the database or data source provider. We do not support data sources that have reached end of life.
+ For connectors that do not support pagination when listing tables, the web service can time out if your database contains many tables or a large amount of metadata. The following connectors provide pagination support for listing tables:
  + DocumentDB
  + DynamoDB
  + MySQL
  + OpenSearch
  + Oracle
  + PostgreSQL
  + Redshift
  + SQL Server

## Case resolver modes in Federation SDK


The Federation SDK supports the following standardized case resolver modes for schema and table names:
+ `NONE` – Does not change case of the given schema and table names.
+ `LOWER` – Lower case all given schema and table names.
+ `UPPER` – Upper case all given schema and table names.
+ `ANNOTATION` – This mode is maintained for backward compatibility only and is supported exclusively by existing Snowflake and SAP HANA connectors.
+ `CASE_INSENSITIVE_SEARCH` – Perform case insensitive searches against schema and table names.

## Connector support for case resolver modes


### Basic mode support


All JDBC connectors support the following basic modes:
+ `NONE`
+ `LOWER`
+ `UPPER`

### Annotation mode support


Only the following connectors support the `ANNOTATION` mode:
+ Snowflake
+ SAP HANA

**Note**  
We recommend using `CASE_INSENSITIVE_SEARCH` instead of `ANNOTATION`.

### Case-insensitive search support


The following connectors support `CASE_INSENSITIVE_SEARCH`:
+ DataLake Gen2
+ Snowflake
+ Oracle
+ Synapse
+ MySQL
+ PostgreSQL
+ Redshift
+ ClickHouse
+ SQL Server
+ DB2

## Case resolver limitations


Be aware of the following limitations when using case resolver modes:
+ When using `LOWER` mode, your schema name and all tables within the schema must be in lowercase.
+ When using `UPPER` mode, your schema name and all tables within the schema must be in uppercase.
+ When using `CASE_INSENSITIVE_SEARCH`:
  + Schema names must be unique
  + Table names within a schema must be unique (for example, you cannot have both "Apple" and "APPLE")
+ Glue integration limitations:
  + Glue only supports lowercase names
  + Only `NONE` or `LOWER` modes will work when registering your Lambda function with GlueDataCatalog/LakeFormation

## Additional information

+ For information about deploying an Athena data source connector, see [Use Amazon Athena Federated Query](federated-queries.md). 
+ For information about queries that use Athena data source connectors, see [Run federated queries](running-federated-queries.md).

**Topics**
+ [Considerations and limitations](#connectors-available-considerations)
+ [Case resolver modes in Federation SDK](#case-resolver-modes)
+ [Connector support for case resolver modes](#connector-support-matrix)
+ [Case resolver limitations](#case-resolver-limitations)
+ [Additional information](#connectors-available-additional-resources)
+ [Azure Data Lake Storage](connectors-adls-gen2.md)
+ [Azure Synapse](connectors-azure-synapse.md)
+ [Cloudera Hive](connectors-cloudera-hive.md)
+ [Cloudera Impala](connectors-cloudera-impala.md)
+ [CloudWatch](connectors-cloudwatch.md)
+ [CloudWatch metrics](connectors-cwmetrics.md)
+ [CMDB](connectors-cmdb.md)
+ [Db2](connectors-ibm-db2.md)
+ [Db2 iSeries](connectors-ibm-db2-as400.md)
+ [DocumentDB](connectors-docdb.md)
+ [DynamoDB](connectors-dynamodb.md)
+ [Google BigQuery](connectors-bigquery.md)
+ [Google Cloud Storage](connectors-gcs.md)
+ [HBase](connectors-hbase.md)
+ [Hortonworks](connectors-hortonworks.md)
+ [Kafka](connectors-kafka.md)
+ [MSK](connectors-msk.md)
+ [MySQL](connectors-mysql.md)
+ [Neptune](connectors-neptune.md)
+ [OpenSearch](connectors-opensearch.md)
+ [Oracle](connectors-oracle.md)
+ [PostgreSQL](connectors-postgresql.md)
+ [Redis OSS](connectors-redis.md)
+ [Redshift](connectors-redshift.md)
+ [SAP HANA](connectors-sap-hana.md)
+ [Snowflake](connectors-snowflake.md)
+ [SQL Server](connectors-microsoft-sql-server.md)
+ [Teradata](connectors-teradata.md)
+ [Timestream](connectors-timestream.md)
+ [TPC-DS](connectors-tpcds.md)
+ [Vertica](connectors-vertica.md)

**Note**  
The [AthenaJdbcConnector](https://serverlessrepo.aws.amazon.com/applications/us-east-1/292517598671/AthenaJdbcConnector) (latest version 2022.4.1) has been deprecated. Instead, use a database-specific connector like those for [MySQL](connectors-mysql.md), [Redshift](connectors-redshift.md), or [PostgreSQL](connectors-postgresql.md).

# Amazon Athena Azure Data Lake Storage (ADLS) Gen2 connector

The Amazon Athena connector for [Azure Data Lake Storage (ADLS) Gen2](https://docs.microsoft.com/en-us/azure/databricks/data/data-sources/azure/adls-gen2/) enables Amazon Athena to run SQL queries on data stored on ADLS. Athena cannot access the files stored in the data lake directly. 

This connector can be registered with Glue Data Catalog as a federated catalog. It supports data access controls defined in Lake Formation at the catalog, database, table, column, row, and tag levels. This connector uses Glue Connections to centralize configuration properties in Glue.
+ **Workflow** – The connector implements the JDBC interface, which uses the `com.microsoft.sqlserver.jdbc.SQLServerDriver` driver. The connector passes queries to the Azure Synapse engine, which then accesses the data lake. 
+ **Data handling and S3** – Normally, the Lambda connector queries data directly without transfer to Amazon S3. However, when data returned by the Lambda function exceeds Lambda limits, the data is written to the Amazon S3 spill bucket that you specify so that Athena can read the excess.
+ **AAD authentication** – AAD can be used as an authentication method for the Azure Synapse connector. In order to use AAD, the JDBC connection string that the connector uses must contain the URL parameters `authentication=ActiveDirectoryServicePrincipal`, `AADSecurePrincipalId`, and `AADSecurePrincipalSecret`. These parameters can either be passed in directly or by Secrets Manager.

## Prerequisites

+ Deploy the connector to your AWS account using the Athena console or the AWS Serverless Application Repository. For more information, see [Create a data source connection](connect-to-a-data-source.md) or [Use the AWS Serverless Application Repository to deploy a data source connector](connect-data-source-serverless-app-repo.md).

## Limitations

+ Write DDL operations are not supported.
+ In a multiplexer setup, the spill bucket and prefix are shared across all database instances.
+ Any relevant Lambda limits. For more information, see [Lambda quotas](https://docs.aws.amazon.com/lambda/latest/dg/gettingstarted-limits.html) in the *AWS Lambda Developer Guide*.
+ Date and timestamp data types in filter conditions must be cast to appropriate data types.
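To illustrate the date and timestamp limitation above, a filter might cast the literal explicitly. The catalog, schema, table, and column names in this sketch are hypothetical:

```
SELECT *
FROM mydatalakegentwocatalog.dbo.orders
WHERE order_date = CAST('2024-01-15' AS DATE)
```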

## Terms


The following terms relate to the Azure Data Lake Storage Gen2 connector.
+ **Database instance** – Any instance of a database deployed on premises, on Amazon EC2, or on Amazon RDS.
+ **Handler** – A Lambda handler that accesses your database instance. A handler can be for metadata or for data records.
+ **Metadata handler** – A Lambda handler that retrieves metadata from your database instance.
+ **Record handler** – A Lambda handler that retrieves data records from your database instance.
+ **Composite handler** – A Lambda handler that retrieves both metadata and data records from your database instance.
+ **Property or parameter** – A database property used by handlers to extract database information. You configure these properties as Lambda environment variables.
+ **Connection String** – A string of text used to establish a connection to a database instance.
+ **Catalog** – A non-AWS Glue catalog registered with Athena that is a required prefix for the `connection_string` property.
+ **Multiplexing handler** – A Lambda handler that can accept and use multiple database connections.

## Parameters


Use the parameters in this section to configure the Azure Data Lake Storage Gen2 connector.

**Note**  
Athena data source connectors created on December 3, 2024 and later use AWS Glue connections.  
The parameter names and definitions listed below are for Athena data source connectors created prior to December 3, 2024. These can differ from their corresponding [AWS Glue connection properties](https://docs.aws.amazon.com/glue/latest/dg/connection-properties.html). Starting December 3, 2024, use the parameters below only when you [manually deploy](connect-data-source-serverless-app-repo.md) an earlier version of an Athena data source connector.

### Glue connections (recommended)


We recommend that you configure an Azure Data Lake Storage Gen2 connector by using a Glue connections object. To do this, set the `glue_connection` environment variable of the Azure Data Lake Storage Gen2 connector Lambda to the name of the Glue connection to use.

**Glue connections properties**

Use the following command to get the schema for a Glue connection object. This schema contains all the parameters that you can use to control your connection.

```
aws glue describe-connection-type --connection-type DATALAKEGEN2
```

**Lambda environment properties**
+ `glue_connection` – Specifies the name of the Glue connection associated with the federated connector. 
+ `casing_mode` – (Optional) Specifies how to handle casing for schema and table names. The `casing_mode` parameter uses the following values to specify the behavior of casing:
  + **none** – Do not change the case of the given schema and table names. This is the default for connectors that have an associated Glue connection. 
  + **upper** – Upper case all given schema and table names.
  + **lower** – Lower case all given schema and table names.
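As a sketch, and assuming a hypothetical connection name, the resulting environment variables on the connector's Lambda function might look like the following:

```
glue_connection=my-adls-gen2-connection
casing_mode=lower
```

You can set these values on the connector's Lambda function in the Lambda console under **Configuration**, **Environment variables**.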

**Note**  
All connectors that use Glue connections must use AWS Secrets Manager to store credentials.
The Azure Data Lake Storage Gen2 connector created using Glue connections does not support the use of a multiplexing handler.
The Azure Data Lake Storage Gen2 connector created using Glue connections only supports `ConnectionSchemaVersion` 2.

### Legacy connections


#### Connection string


Use a JDBC connection string in the following format to connect to a database instance.

```
datalakegentwo://${jdbc_connection_string}
```

#### Using a multiplexing handler


You can use a multiplexer to connect to multiple database instances with a single Lambda function. Requests are routed by catalog name. Use the following classes in Lambda.


****  

| Handler | Class | 
| --- | --- | 
| Composite handler | DataLakeGen2MuxCompositeHandler | 
| Metadata handler | DataLakeGen2MuxMetadataHandler | 
| Record handler | DataLakeGen2MuxRecordHandler | 

##### Multiplexing handler parameters



****  

| Parameter | Description | 
| --- | --- | 
| `catalog_connection_string` | Required. A database instance connection string. Prefix the environment variable with the name of the catalog used in Athena. For example, if the catalog registered with Athena is mydatalakegentwocatalog, then the environment variable name is `mydatalakegentwocatalog_connection_string`. | 
| default | Required. The default connection string. This string is used when the catalog is `lambda:${AWS_LAMBDA_FUNCTION_NAME}`. | 

The following example properties are for a DataLakeGen2 MUX Lambda function that supports two database instances: `datalakegentwo1` (the default), and `datalakegentwo2`.


****  

| Property | Value | 
| --- | --- | 
| default | `datalakegentwo://jdbc:sqlserver://adlsgentwo1.hostname:port;databaseName=database_name;${secret1_name}` | 
| `datalakegentwo_catalog1_connection_string` | `datalakegentwo://jdbc:sqlserver://adlsgentwo1.hostname:port;databaseName=database_name;${secret1_name}` | 
| `datalakegentwo_catalog2_connection_string` | `datalakegentwo://jdbc:sqlserver://adlsgentwo2.hostname:port;databaseName=database_name;${secret2_name}` | 

##### Providing credentials


To provide a user name and password for your database in your JDBC connection string, you can use connection string properties or AWS Secrets Manager.
+ **Connection String** – A user name and password can be specified as properties in the JDBC connection string.
**Important**  
As a security best practice, don't use hardcoded credentials in your environment variables or connection strings. For information about moving your hardcoded secrets to AWS Secrets Manager, see [Move hardcoded secrets to AWS Secrets Manager](https://docs.aws.amazon.com/secretsmanager/latest/userguide/hardcoded.html) in the *AWS Secrets Manager User Guide*.
+ **AWS Secrets Manager** – To use the Athena Federated Query feature with AWS Secrets Manager, the VPC connected to your Lambda function should have [internet access](https://aws.amazon.com/premiumsupport/knowledge-center/internet-access-lambda-function/) or a [VPC endpoint](https://docs.aws.amazon.com/secretsmanager/latest/userguide/vpc-endpoint-overview.html) to connect to Secrets Manager.

  You can put the name of a secret in AWS Secrets Manager in your JDBC connection string. The connector replaces the secret name with the `username` and `password` values from Secrets Manager.

  For Amazon RDS database instances, this support is tightly integrated. If you use Amazon RDS, we highly recommend using AWS Secrets Manager and credential rotation. If your database does not use Amazon RDS, store the credentials as JSON in the following format:

  ```
  {"username": "${username}", "password": "${password}"}
  ```

**Example connection string with secret name**  
The following string has the secret name `${secret1_name}`.

```
datalakegentwo://jdbc:sqlserver://hostname:port;databaseName=database_name;${secret1_name}
```

The connector uses the secret name to retrieve secrets and provide the user name and password, as in the following example.

```
datalakegentwo://jdbc:sqlserver://hostname:port;databaseName=database_name;user=user_name;password=password
```

#### Using a single connection handler


You can use the following single connection metadata and record handlers to connect to a single Azure Data Lake Storage Gen2 instance.


****  

| Handler type | Class | 
| --- | --- | 
| Composite handler | DataLakeGen2CompositeHandler | 
| Metadata handler | DataLakeGen2MetadataHandler | 
| Record handler | DataLakeGen2RecordHandler | 

##### Single connection handler parameters



****  

| Parameter | Description | 
| --- | --- | 
| default | Required. The default connection string. | 

The single connection handlers support one database instance and must provide a `default` connection string parameter. All other connection strings are ignored.

The following example property is for a single Azure Data Lake Storage Gen2 instance supported by a Lambda function.


****  

| Property | Value | 
| --- | --- | 
| default | `datalakegentwo://jdbc:sqlserver://hostname:port;databaseName=;${secret_name}` | 

#### Spill parameters


The Lambda SDK can spill data to Amazon S3. All database instances accessed by the same Lambda function spill to the same location.


****  

| Parameter | Description | 
| --- | --- | 
| `spill_bucket` | Required. Spill bucket name. | 
| `spill_prefix` | Required. Spill bucket key prefix. | 
| `spill_put_request_headers` | (Optional) A JSON encoded map of request headers and values for the Amazon S3 putObject request that is used for spilling (for example, `{"x-amz-server-side-encryption" : "AES256"}`). For other possible headers, see [PutObject](https://docs.aws.amazon.com/AmazonS3/latest/API/API_PutObject.html) in the *Amazon Simple Storage Service API Reference*. | 

## Data type support


The following table shows the corresponding data types for ADLS Gen2 and Arrow.


****  

| ADLS Gen2 | Arrow | 
| --- | --- | 
| bit | TINYINT | 
| tinyint | SMALLINT | 
| smallint | SMALLINT | 
| int | INT | 
| bigint | BIGINT | 
| decimal | DECIMAL | 
| numeric | FLOAT8 | 
| smallmoney | FLOAT8 | 
| money | DECIMAL | 
| float[24] | FLOAT4 | 
| float[53] | FLOAT8 | 
| real | FLOAT4 | 
| datetime | Date(MILLISECOND) | 
| datetime2 | Date(MILLISECOND) | 
| smalldatetime | Date(MILLISECOND) | 
| date | Date(DAY) | 
| time | VARCHAR | 
| datetimeoffset | Date(MILLISECOND) | 
| char[n] | VARCHAR | 
| varchar[n/max] | VARCHAR | 

## Partitions and splits


Azure Data Lake Storage Gen2 uses Hadoop-compatible Gen2 blob storage for storing data files. The data from these files is queried from the Azure Synapse engine, which treats Gen2 data stored in file systems as external tables. Partitions are implemented based on the type of data. If the data has already been partitioned and distributed within the Gen2 storage system, the connector retrieves the data as a single split.

## Performance


The Azure Data Lake Storage Gen2 connector shows slower query performance when running multiple queries at once, and is subject to throttling.

The Athena Azure Data Lake Storage Gen2 connector performs predicate pushdown to decrease the data scanned by the query. Simple predicates and complex expressions are pushed down to the connector to reduce the amount of data scanned and decrease query execution run time.

### Predicates


A predicate is an expression in the `WHERE` clause of a SQL query that evaluates to a Boolean value and filters rows based on multiple conditions. The Athena Azure Data Lake Storage Gen2 connector can combine these expressions and push them directly to Azure Data Lake Storage Gen2 for enhanced functionality and to reduce the amount of data scanned.

The following Athena Azure Data Lake Storage Gen2 connector operators support predicate pushdown:
+ **Boolean:** AND, OR, NOT
+ **Equality:** EQUAL, NOT_EQUAL, LESS_THAN, LESS_THAN_OR_EQUAL, GREATER_THAN, GREATER_THAN_OR_EQUAL, NULL_IF, IS_NULL
+ **Arithmetic:** ADD, SUBTRACT, MULTIPLY, DIVIDE, MODULUS, NEGATE
+ **Other:** LIKE_PATTERN, IN

### Combined pushdown example


For enhanced querying capabilities, combine the pushdown types, as in the following example:

```
SELECT * 
FROM my_table 
WHERE col_a > 10 
    AND ((col_a + col_b) > (col_c % col_d)) 
    AND (col_e IN ('val1', 'val2', 'val3') OR col_f LIKE '%pattern%');
```

## Passthrough queries


The Azure Data Lake Storage Gen2 connector supports [passthrough queries](federated-query-passthrough.md). Passthrough queries use a table function to push your full query down to the data source for execution.

To use passthrough queries with Azure Data Lake Storage Gen2, you can use the following syntax:

```
SELECT * FROM TABLE(
        system.query(
            query => 'query string'
        ))
```

The following example query pushes down a query to a data source in Azure Data Lake Storage Gen2. The query selects all columns in the `customer` table, limiting the results to 10.

```
SELECT * FROM TABLE(
        system.query(
            query => 'SELECT * FROM customer LIMIT 10'
        ))
```

## License information


By using this connector, you acknowledge the inclusion of third party components, a list of which can be found in the [pom.xml](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-datalakegen2/pom.xml) file for this connector, and agree to the terms in the respective third party licenses provided in the [LICENSE.txt](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-datalakegen2/LICENSE.txt) file on GitHub.com.

## Additional resources


For the latest JDBC driver version information, see the [pom.xml](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-datalakegen2/pom.xml) file for the Azure Data Lake Storage Gen2 connector on GitHub.com.

For additional information about this connector, visit [the corresponding site](https://github.com/awslabs/aws-athena-query-federation/tree/master/athena-datalakegen2) on GitHub.com.

# Amazon Athena Azure Synapse connector

The Amazon Athena connector for [Azure Synapse analytics](https://docs.microsoft.com/en-us/azure/synapse-analytics/overview-what-is) enables Amazon Athena to run SQL queries on your Azure Synapse databases using JDBC.

This connector can be registered with Glue Data Catalog as a federated catalog. It supports data access controls defined in Lake Formation at the catalog, database, table, column, row, and tag levels. This connector uses Glue Connections to centralize configuration properties in Glue.

## Prerequisites

+ Deploy the connector to your AWS account using the Athena console or the AWS Serverless Application Repository. For more information, see [Create a data source connection](connect-to-a-data-source.md) or [Use the AWS Serverless Application Repository to deploy a data source connector](connect-data-source-serverless-app-repo.md).

## Limitations

+ Write DDL operations are not supported.
+ In a multiplexer setup, the spill bucket and prefix are shared across all database instances.
+ Any relevant Lambda limits. For more information, see [Lambda quotas](https://docs.aws.amazon.com/lambda/latest/dg/gettingstarted-limits.html) in the *AWS Lambda Developer Guide*.
+ In filter conditions, you must cast the `Date` and `Timestamp` data types to the appropriate data type.
+ To search for negative values of type `Real` and `Float`, use the `<=` or `>=` operator.
+ The `binary`, `varbinary`, `image`, and `rowversion` data types are not supported.
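To illustrate the negative-value limitation above, a query matching a negative `Real` value could combine `>=` and `<=` rather than using `=`. The catalog, schema, table, and column names in this sketch are hypothetical:

```
SELECT *
FROM mysynapsecatalog.dbo.measurements
WHERE reading >= -1.5 AND reading <= -1.5
```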

## Terms


The following terms relate to the Synapse connector.
+ **Database instance** – Any instance of a database deployed on premises, on Amazon EC2, or on Amazon RDS.
+ **Handler** – A Lambda handler that accesses your database instance. A handler can be for metadata or for data records.
+ **Metadata handler** – A Lambda handler that retrieves metadata from your database instance.
+ **Record handler** – A Lambda handler that retrieves data records from your database instance.
+ **Composite handler** – A Lambda handler that retrieves both metadata and data records from your database instance.
+ **Property or parameter** – A database property used by handlers to extract database information. You configure these properties as Lambda environment variables.
+ **Connection String** – A string of text used to establish a connection to a database instance.
+ **Catalog** – A non-AWS Glue catalog registered with Athena that is a required prefix for the `connection_string` property.
+ **Multiplexing handler** – A Lambda handler that can accept and use multiple database connections.

## Parameters


Use the parameters in this section to configure the Synapse connector.

**Note**  
Athena data source connectors created on December 3, 2024 and later use AWS Glue connections.  
The parameter names and definitions listed below are for Athena data source connectors created prior to December 3, 2024. These can differ from their corresponding [AWS Glue connection properties](https://docs.aws.amazon.com/glue/latest/dg/connection-properties.html). Starting December 3, 2024, use the parameters below only when you [manually deploy](connect-data-source-serverless-app-repo.md) an earlier version of an Athena data source connector.

### Glue connections (recommended)


We recommend that you configure the Synapse connector by using an AWS Glue connection object. To do this, set the `glue_connection` environment variable of the connector's Lambda function to the name of the Glue connection to use.

**Glue connections properties**

Use the following command to get the schema for a Glue connection object. This schema contains all the parameters that you can use to control your connection.

```
aws glue describe-connection-type --connection-type SYNAPSE
```

**Lambda environment properties**
+ **glue_connection** – Specifies the name of the Glue connection associated with the federated connector. 
+ **casing_mode** – (Optional) Specifies how to handle casing for schema and table names. The `casing_mode` parameter uses the following values to specify the behavior of casing:
  + **none** – Do not change the case of the given schema and table names. This is the default for connectors that have an associated Glue connection. 
  + **upper** – Uppercase all given schema and table names.
  + **lower** – Lowercase all given schema and table names.

**Note**  
All connectors that use Glue connections must use AWS Secrets Manager to store credentials.
The Synapse connector created using Glue connections does not support the use of a multiplexing handler.
The Synapse connector created using Glue connections only supports `ConnectionSchemaVersion` 2.
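As an illustration of the `casing_mode` values described above, the following minimal sketch shows the normalization each setting implies. This is a hypothetical helper, not the connector's actual implementation.

```python
def apply_casing_mode(name: str, casing_mode: str = "none") -> str:
    """Normalize a schema or table name according to casing_mode."""
    if casing_mode == "upper":
        return name.upper()
    if casing_mode == "lower":
        return name.lower()
    # "none" (the default for connectors with a Glue connection): unchanged
    return name

print(apply_casing_mode("SalesSchema", "lower"))  # salesschema
print(apply_casing_mode("SalesSchema"))           # SalesSchema
```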

### Legacy connections


#### Connection string


Use a JDBC connection string in the following format to connect to a database instance.

```
synapse://${jdbc_connection_string}
```

#### Using a multiplexing handler


You can use a multiplexer to connect to multiple database instances with a single Lambda function. Requests are routed by catalog name. Use the following classes in Lambda.


****  

| Handler | Class | 
| --- | --- | 
| Composite handler | SynapseMuxCompositeHandler | 
| Metadata handler | SynapseMuxMetadataHandler | 
| Record handler | SynapseMuxRecordHandler | 

##### Multiplexing handler parameters



****  

| Parameter | Description | 
| --- | --- | 
| <catalog>_connection_string | Required. A database instance connection string. Prefix the environment variable with the name of the catalog used in Athena. For example, if the catalog registered with Athena is mysynapsecatalog, then the environment variable name is mysynapsecatalog_connection_string. | 
| default | Required. The default connection string. This string is used when the catalog is lambda:${AWS_LAMBDA_FUNCTION_NAME}. | 

The following example properties are for a Synapse MUX Lambda function that supports two database instances: `synapse1` (the default), and `synapse2`.


****  

| Property | Value | 
| --- | --- | 
| default | synapse://jdbc:sqlserver://synapse1.hostname:port;databaseName=<database_name>;${secret1_name} | 
| synapse_catalog1_connection_string | synapse://jdbc:sqlserver://synapse1.hostname:port;databaseName=<database_name>;${secret1_name} | 
| synapse_catalog2_connection_string | synapse://jdbc:sqlserver://synapse2.hostname:port;databaseName=<database_name>;${secret2_name} | 
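The catalog-based routing that a multiplexing handler performs can be sketched as follows: look up the `<catalog>_connection_string` environment variable, and fall back to `default` when no catalog-specific variable exists. This is an illustrative sketch with made-up values, not the connector's code.

```python
import os

def resolve_connection_string(catalog: str) -> str:
    """Resolve a catalog name to a connection string, mimicking MUX routing."""
    conn = os.environ.get(f"{catalog}_connection_string")
    if conn is None:
        # Fall back to the default connection string
        conn = os.environ["default"]
    return conn

# Hypothetical environment configuration for the Lambda function
os.environ["default"] = "synapse://jdbc:sqlserver://synapse1.hostname:1433;databaseName=db;${secret1_name}"
os.environ["synapse_catalog2_connection_string"] = "synapse://jdbc:sqlserver://synapse2.hostname:1433;databaseName=db;${secret2_name}"

print(resolve_connection_string("synapse_catalog2"))  # routes to synapse2
print(resolve_connection_string("unknown_catalog"))   # falls back to default
```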

##### Providing credentials


To provide a user name and password for your database in your JDBC connection string, you can use connection string properties or AWS Secrets Manager.
+ **Connection String** – A user name and password can be specified as properties in the JDBC connection string.
**Important**  
As a security best practice, don't use hardcoded credentials in your environment variables or connection strings. For information about moving your hardcoded secrets to AWS Secrets Manager, see [Move hardcoded secrets to AWS Secrets Manager](https://docs.aws.amazon.com/secretsmanager/latest/userguide/hardcoded.html) in the *AWS Secrets Manager User Guide*.
+ **AWS Secrets Manager** – To use the Athena Federated Query feature with AWS Secrets Manager, the VPC connected to your Lambda function should have [internet access](https://aws.amazon.com/premiumsupport/knowledge-center/internet-access-lambda-function/) or a [VPC endpoint](https://docs.aws.amazon.com/secretsmanager/latest/userguide/vpc-endpoint-overview.html) to connect to Secrets Manager.

  You can put the name of a secret in AWS Secrets Manager in your JDBC connection string. The connector replaces the secret name with the `username` and `password` values from Secrets Manager.

  For Amazon RDS database instances, this support is tightly integrated. If you use Amazon RDS, we highly recommend using AWS Secrets Manager and credential rotation. If your database does not use Amazon RDS, store the credentials as JSON in the following format:

  ```
  {"username": "${username}", "password": "${password}"}
  ```

**Example connection string with secret name**  
The following string has the secret name `${secret_name}`.

```
synapse://jdbc:sqlserver://hostname:port;databaseName=<database_name>;${secret_name}
```

The connector uses the secret name to retrieve secrets and provide the user name and password, as in the following example.

```
synapse://jdbc:sqlserver://hostname:port;databaseName=<database_name>;user=<user>;password=<password>
```
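The secret-name substitution described above can be illustrated with a short sketch: a `${secret_name}` placeholder is replaced with `user` and `password` properties taken from a Secrets Manager-style JSON secret. This is a hypothetical helper; the connector performs this substitution internally.

```python
import json
import re

def inject_credentials(connection_string: str, secret_json: str) -> str:
    """Replace a ${...} placeholder with user/password from a JSON secret."""
    secret = json.loads(secret_json)
    replacement = f"user={secret['username']};password={secret['password']}"
    return re.sub(r"\$\{[^}]+\}", replacement, connection_string)

secret = '{"username": "admin", "password": "p4ss"}'
conn = "synapse://jdbc:sqlserver://hostname:1433;databaseName=mydb;${my_secret}"
print(inject_credentials(conn, secret))
# synapse://jdbc:sqlserver://hostname:1433;databaseName=mydb;user=admin;password=p4ss
```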

#### Using a single connection handler


You can use the following single connection metadata and record handlers to connect to a single Synapse instance.


****  

| Handler type | Class | 
| --- | --- | 
| Composite handler | SynapseCompositeHandler | 
| Metadata handler | SynapseMetadataHandler | 
| Record handler | SynapseRecordHandler | 

##### Single connection handler parameters



****  

| Parameter | Description | 
| --- | --- | 
| default | Required. The default connection string. | 

The single connection handlers support one database instance and must provide a `default` connection string parameter. All other connection strings are ignored.

The following example property is for a single Synapse instance supported by a Lambda function.


****  

| Property | Value | 
| --- | --- | 
| default | synapse://jdbc:sqlserver://hostname:port;databaseName=<database_name>;${secret_name} | 

#### Configuring Active Directory authentication


The Amazon Athena Azure Synapse connector supports Microsoft Active Directory Authentication. Before you begin, you must configure an administrative user in the Microsoft Azure portal and then use AWS Secrets Manager to create a secret.

**To set the Active Directory administrative user**

1. Using an account that has administrative privileges, sign in to the Microsoft Azure portal at [https://portal.azure.com/](https://portal.azure.com/).

1. In the search box, enter **Azure Synapse Analytics**, and then choose **Azure Synapse Analytics**.  
![\[Choose Azure Synapse Analytics.\]](http://docs.aws.amazon.com/athena/latest/ug/images/connectors-azure-synapse-1.png)

1. Open the menu on the left.  
![\[Choose the Azure portal menu.\]](http://docs.aws.amazon.com/athena/latest/ug/images/connectors-azure-synapse-2.png)

1. In the navigation pane, choose **Azure Active Directory**.

1. On the **Set admin** tab, set **Active Directory admin** to a new or existing user.  
![\[Use the Set admin tab\]](http://docs.aws.amazon.com/athena/latest/ug/images/connectors-azure-synapse-3.png)

1. In AWS Secrets Manager, store the admin username and password credentials. For information on creating a secret in Secrets Manager, see [Create an AWS Secrets Manager secret](https://docs.aws.amazon.com/secretsmanager/latest/userguide/create_secret.html).

**To view your secret in Secrets Manager**

1. Open the Secrets Manager console at [https://console.aws.amazon.com/secretsmanager/](https://console.aws.amazon.com/secretsmanager/).

1. In the navigation pane, choose **Secrets**.

1. On the **Secrets** page, choose the link to your secret.

1. On the details page for your secret, choose **Retrieve secret value**.  
![\[Viewing secrets in AWS Secrets Manager.\]](http://docs.aws.amazon.com/athena/latest/ug/images/connectors-azure-synapse-4.png)

##### Modifying the connection string


To enable Active Directory Authentication for the connector, modify the connection string using the following syntax:

```
synapse://jdbc:sqlserver://hostname:port;databaseName=database_name;authentication=ActiveDirectoryPassword;${secret_name}
```

##### Using ActiveDirectoryServicePrincipal


The Amazon Athena Azure Synapse connector also supports `ActiveDirectoryServicePrincipal`. To enable this, modify the connection string as follows.

```
synapse://jdbc:sqlserver://hostname:port;databaseName=database_name;authentication=ActiveDirectoryServicePrincipal;${secret_name}
```

For `secret_name`, specify the application or client ID as the user name and the secret of the service principal identity as the password.
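A secret for service principal authentication, following the JSON format shown earlier, might look like the sketch below. The ID and secret values are placeholders, not real credentials.

```python
import json

# Sketch of a Secrets Manager secret value for ActiveDirectoryServicePrincipal:
# the application (client) ID goes in "username" and the service principal's
# secret goes in "password". Both values here are placeholders.
secret_value = json.dumps({
    "username": "11111111-2222-3333-4444-555555555555",  # application/client ID
    "password": "service-principal-secret",
})
print(secret_value)
```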

#### Spill parameters


The Lambda SDK can spill data to Amazon S3. All database instances accessed by the same Lambda function spill to the same location.


****  

| Parameter | Description | 
| --- | --- | 
| spill_bucket | Required. Spill bucket name. | 
| spill_prefix | Required. Spill bucket key prefix. | 
| spill_put_request_headers | (Optional) A JSON encoded map of request headers and values for the Amazon S3 putObject request that is used for spilling (for example, {"x-amz-server-side-encryption" : "AES256"}). For other possible headers, see [PutObject](https://docs.aws.amazon.com/AmazonS3/latest/API/API_PutObject.html) in the *Amazon Simple Storage Service API Reference*. | 
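Because `spill_put_request_headers` is a JSON encoded map, its value can be parsed into the header dictionary applied to the S3 putObject spill request. A minimal sketch:

```python
import json

# Example value of the spill_put_request_headers environment variable
raw = '{"x-amz-server-side-encryption" : "AES256"}'

# Parse the JSON map into request headers for the putObject call
headers = json.loads(raw)
print(headers["x-amz-server-side-encryption"])  # AES256
```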

## Data type support


The following table shows the corresponding data types for Synapse and Apache Arrow.


****  

| Synapse | Arrow | 
| --- | --- | 
| bit | TINYINT | 
| tinyint | SMALLINT | 
| smallint | SMALLINT | 
| int | INT | 
| bigint | BIGINT | 
| decimal | DECIMAL | 
| numeric | FLOAT8 | 
| smallmoney | FLOAT8 | 
| money | DECIMAL | 
| float[24] | FLOAT4 | 
| float[53] | FLOAT8 | 
| real | FLOAT4 | 
| datetime | Date(MILLISECOND) | 
| datetime2 | Date(MILLISECOND) | 
| smalldatetime | Date(MILLISECOND) | 
| date | Date(DAY) | 
| time | VARCHAR | 
| datetimeoffset | Date(MILLISECOND) | 
| char[n] | VARCHAR | 
| varchar[n/max] | VARCHAR | 
| nchar[n] | VARCHAR | 
| nvarchar[n/max] | VARCHAR | 

## Partitions and splits


A partition is represented by a single partition column of type `varchar`. Synapse supports range partitioning, so partitioning is implemented by extracting the partition column and partition range from Synapse metadata tables. These range values are used to create the splits.

## Performance


Selecting a subset of columns significantly speeds up query runtime and reduces the data scanned. The connector shows significant throttling due to concurrency.

The Athena Synapse connector performs predicate pushdown to decrease the data scanned by the query. Simple predicates and complex expressions are pushed down to the connector to reduce the amount of data scanned and decrease query execution run time.

### Predicates


A predicate is an expression in the `WHERE` clause of a SQL query that evaluates to a Boolean value and filters rows based on multiple conditions. The Athena Synapse connector can combine these expressions and push them directly to Synapse for enhanced functionality and to reduce the amount of data scanned.

The following Athena Synapse connector operators support predicate pushdown:
+ **Boolean:** AND, OR, NOT
+ **Equality:** EQUAL, NOT_EQUAL, LESS_THAN, LESS_THAN_OR_EQUAL, GREATER_THAN, GREATER_THAN_OR_EQUAL, NULL_IF, IS_NULL
+ **Arithmetic:** ADD, SUBTRACT, MULTIPLY, DIVIDE, MODULUS, NEGATE
+ **Other:** LIKE_PATTERN, IN

### Combined pushdown example


For enhanced querying capabilities, combine the pushdown types, as in the following example:

```
SELECT * 
FROM my_table 
WHERE col_a > 10 
    AND ((col_a + col_b) > (col_c % col_d)) 
    AND (col_e IN ('val1', 'val2', 'val3') OR col_f LIKE '%pattern%');
```

## Passthrough queries


The Synapse connector supports [passthrough queries](federated-query-passthrough.md). Passthrough queries use a table function to push your full query down to the data source for execution.

To use passthrough queries with Synapse, you can use the following syntax:

```
SELECT * FROM TABLE(
        system.query(
            query => 'query string'
        ))
```

The following example query pushes down a query to a data source in Synapse. The query selects all columns in the `customer` table, limiting the results to 10.

```
SELECT * FROM TABLE(
        system.query(
            query => 'SELECT * FROM customer LIMIT 10'
        ))
```
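When you build the `query string` for `system.query` programmatically, single quotes inside the pushed-down query must be doubled, per standard SQL string-literal escaping. The following sketch (a hypothetical helper, not part of the connector) wraps a native query in the passthrough syntax shown above:

```python
def passthrough(native_query: str) -> str:
    """Wrap a native query in the system.query passthrough syntax."""
    escaped = native_query.replace("'", "''")  # escape embedded single quotes
    return (
        "SELECT * FROM TABLE(\n"
        "        system.query(\n"
        f"            query => '{escaped}'\n"
        "        ))"
    )

print(passthrough("SELECT name FROM customer WHERE region = 'EU'"))
```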

## License information


By using this connector, you acknowledge the inclusion of third party components, a list of which can be found in the [pom.xml](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-synapse/pom.xml) file for this connector, and agree to the terms in the respective third party licenses provided in the [LICENSE.txt](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-synapse/LICENSE.txt) file on GitHub.com.

## Additional resources

+ For an article that shows how to use Amazon QuickSight and Amazon Athena Federated Query to build dashboards and visualizations on data stored in Microsoft Azure Synapse databases, see [Perform multi-cloud analytics using Amazon QuickSight, Amazon Athena Federated Query, and Microsoft Azure Synapse](https://aws.amazon.com/blogs/business-intelligence/perform-multi-cloud-analytics-using-amazon-quicksight-amazon-athena-federated-query-and-microsoft-azure-synapse/) in the *AWS Big Data Blog*.
+ For the latest JDBC driver version information, see the [pom.xml](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-synapse/pom.xml) file for the Synapse connector on GitHub.com.
+ For additional information about this connector, visit [the corresponding site](https://github.com/awslabs/aws-athena-query-federation/tree/master/athena-synapse) on GitHub.com.

# Amazon Athena Cloudera Hive connector
Cloudera Hive

The Amazon Athena connector for Cloudera Hive enables Athena to run SQL queries on the [Cloudera Hive](https://www.cloudera.com/products/open-source/apache-hadoop/apache-hive.html) Hadoop distribution. The connector transforms your Athena SQL queries to their equivalent HiveQL syntax. 


## Prerequisites

+ Deploy the connector to your AWS account using the Athena console or the AWS Serverless Application Repository. For more information, see [Create a data source connection](connect-to-a-data-source.md) or [Use the AWS Serverless Application Repository to deploy a data source connector](connect-data-source-serverless-app-repo.md).
+ Set up a VPC and a security group before you use this connector. For more information, see [Create a VPC for a data source connector or AWS Glue connection](athena-connectors-vpc-creation.md).

## Limitations

+ Write DDL operations are not supported.
+ In a multiplexer setup, the spill bucket and prefix are shared across all database instances.
+ Any relevant Lambda limits. For more information, see [Lambda quotas](https://docs.aws.amazon.com/lambda/latest/dg/gettingstarted-limits.html) in the *AWS Lambda Developer Guide*.

## Terms


The following terms relate to the Cloudera Hive connector.
+ **Database instance** – Any instance of a database deployed on premises, on Amazon EC2, or on Amazon RDS.
+ **Handler** – A Lambda handler that accesses your database instance. A handler can be for metadata or for data records.
+ **Metadata handler** – A Lambda handler that retrieves metadata from your database instance.
+ **Record handler** – A Lambda handler that retrieves data records from your database instance.
+ **Composite handler** – A Lambda handler that retrieves both metadata and data records from your database instance.
+ **Property or parameter** – A database property used by handlers to extract database information. You configure these properties as Lambda environment variables.
+ **Connection String** – A string of text used to establish a connection to a database instance.
+ **Catalog** – A non-AWS Glue catalog registered with Athena that is a required prefix for the `connection_string` property.
+ **Multiplexing handler** – A Lambda handler that can accept and use multiple database connections.

## Parameters


Use the parameters in this section to configure the Cloudera Hive connector.

### Glue connections (recommended)


We recommend that you configure the Cloudera Hive connector by using an AWS Glue connection object. To do this, set the `glue_connection` environment variable of the connector's Lambda function to the name of the Glue connection to use.

**Glue connections properties**

Use the following command to get the schema for a Glue connection object. This schema contains all the parameters that you can use to control your connection.

```
aws glue describe-connection-type --connection-type CLOUDERAHIVE
```

**Lambda environment properties**
+ **glue_connection** – Specifies the name of the Glue connection associated with the federated connector. 
+ **casing_mode** – (Optional) Specifies how to handle casing for schema and table names. The `casing_mode` parameter uses the following values to specify the behavior of casing:
  + **none** – Do not change the case of the given schema and table names. This is the default for connectors that have an associated Glue connection. 
  + **upper** – Uppercase all given schema and table names.
  + **lower** – Lowercase all given schema and table names.

**Note**  
All connectors that use Glue connections must use AWS Secrets Manager to store credentials.
The Cloudera Hive connector created using Glue connections does not support the use of a multiplexing handler.
The Cloudera Hive connector created using Glue connections only supports `ConnectionSchemaVersion` 2.

### Legacy connections


#### Connection string


Use a JDBC connection string in the following format to connect to a database instance.

```
hive://${jdbc_connection_string}
```

#### Using a multiplexing handler


You can use a multiplexer to connect to multiple database instances with a single Lambda function. Requests are routed by catalog name. Use the following classes in Lambda.


****  

| Handler | Class | 
| --- | --- | 
| Composite handler | HiveMuxCompositeHandler | 
| Metadata handler | HiveMuxMetadataHandler | 
| Record handler | HiveMuxRecordHandler | 

##### Multiplexing handler parameters



****  

| Parameter | Description | 
| --- | --- | 
| <catalog>_connection_string | Required. A database instance connection string. Prefix the environment variable with the name of the catalog used in Athena. For example, if the catalog registered with Athena is myhivecatalog, then the environment variable name is myhivecatalog_connection_string. | 
| default | Required. The default connection string. This string is used when the catalog is lambda:${AWS_LAMBDA_FUNCTION_NAME}. | 

The following example properties are for a Hive MUX Lambda function that supports two database instances: `hive1` (the default), and `hive2`.


****  

| Property | Value | 
| --- | --- | 
| default | hive://jdbc:hive2://hive1:10000/default;${Test/RDS/hive1} | 
| hive1_catalog1_connection_string | hive://jdbc:hive2://hive1:10000/default;${Test/RDS/hive1} | 
| hive2_catalog2_connection_string | hive://jdbc:hive2://hive2:10000/default;UID=sample&PWD=sample | 

##### Providing credentials


To provide a user name and password for your database in your JDBC connection string, the Cloudera Hive connector requires a secret from AWS Secrets Manager. To use the Athena Federated Query feature with AWS Secrets Manager, the VPC connected to your Lambda function should have [internet access](https://aws.amazon.com/premiumsupport/knowledge-center/internet-access-lambda-function/) or a [VPC endpoint](https://docs.aws.amazon.com/secretsmanager/latest/userguide/vpc-endpoint-overview.html) to connect to Secrets Manager.

Put the name of a secret in AWS Secrets Manager in your JDBC connection string. The connector replaces the secret name with the `username` and `password` values from Secrets Manager.

**Example connection string with secret name**  
The following string has the secret name `${Test/RDS/hive1}`.

```
hive://jdbc:hive2://hive1:10000/default;...&${Test/RDS/hive1}&...
```

The connector uses the secret name to retrieve secrets and provide the user name and password, as in the following example.

```
hive://jdbc:hive2://hive1:10000/default;...&UID=sample2&PWD=sample2&...
```

Currently, the Cloudera Hive connector recognizes the `UID` and `PWD` JDBC properties.
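The Hive-style substitution described above, where the secret name becomes `UID` and `PWD` JDBC properties, can be sketched as follows. This is a hypothetical helper for illustration; the connector performs this internally.

```python
import re

def inject_hive_credentials(connection_string: str, username: str, password: str) -> str:
    """Replace a ${...} placeholder with UID and PWD JDBC properties."""
    return re.sub(r"\$\{[^}]+\}", f"UID={username}&PWD={password}", connection_string)

conn = "hive://jdbc:hive2://hive1:10000/default;...&${Test/RDS/hive1}&..."
print(inject_hive_credentials(conn, "sample2", "sample2"))
# hive://jdbc:hive2://hive1:10000/default;...&UID=sample2&PWD=sample2&...
```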

#### Using a single connection handler


You can use the following single connection metadata and record handlers to connect to a single Cloudera Hive instance.


****  

| Handler type | Class | 
| --- | --- | 
| Composite handler | HiveCompositeHandler | 
| Metadata handler | HiveMetadataHandler | 
| Record handler | HiveRecordHandler | 

##### Single connection handler parameters



****  

| Parameter | Description | 
| --- | --- | 
| default | Required. The default connection string. | 

The single connection handlers support one database instance and must provide a `default` connection string parameter. All other connection strings are ignored.

The following example property is for a single Cloudera Hive instance supported by a Lambda function.


****  

| Property | Value | 
| --- | --- | 
| default | hive://jdbc:hive2://hive1:10000/default;secret=${Test/RDS/hive1} | 

#### Spill parameters


The Lambda SDK can spill data to Amazon S3. All database instances accessed by the same Lambda function spill to the same location.


****  

| Parameter | Description | 
| --- | --- | 
| spill_bucket | Required. Spill bucket name. | 
| spill_prefix | Required. Spill bucket key prefix. | 
| spill_put_request_headers | (Optional) A JSON encoded map of request headers and values for the Amazon S3 putObject request that is used for spilling (for example, {"x-amz-server-side-encryption" : "AES256"}). For other possible headers, see [PutObject](https://docs.aws.amazon.com/AmazonS3/latest/API/API_PutObject.html) in the *Amazon Simple Storage Service API Reference*. | 

## Data type support


The following table shows the corresponding data types for JDBC, Cloudera Hive, and Arrow.


****  

| JDBC | Cloudera Hive | Arrow | 
| --- | --- | --- | 
| Boolean | Boolean | Bit | 
| Integer | TINYINT | Tiny | 
| Short | SMALLINT | Smallint | 
| Integer | INT | Int | 
| Long | BIGINT | Bigint | 
| float | float4 | Float4 | 
| Double | float8 | Float8 | 
| Date | date | DateDay | 
| Timestamp | timestamp | DateMilli | 
| String | VARCHAR | Varchar | 
| Bytes | bytes | Varbinary | 
| BigDecimal | Decimal | Decimal | 
| ARRAY | N/A (see note) | List | 

**Note**  
Currently, Cloudera Hive does not support the aggregate types `ARRAY`, `MAP`, `STRUCT`, or `UNIONTYPE`. Columns of aggregate types are treated as `VARCHAR` columns in SQL.

## Partitions and splits


Partitions are used to determine how to generate splits for the connector. Athena constructs a synthetic column of type `varchar` that represents the partitioning scheme for the table to help the connector generate splits. The connector does not modify the actual table definition.

## Performance


Cloudera Hive supports static partitions. The Athena Cloudera Hive connector can retrieve data from these partitions in parallel. If you want to query very large datasets with uniform partition distribution, static partitioning is highly recommended. The Cloudera Hive connector is resilient to throttling due to concurrency.

The Athena Cloudera Hive connector performs predicate pushdown to decrease the data scanned by the query. `LIMIT` clauses, simple predicates, and complex expressions are pushed down to the connector to reduce the amount of data scanned and decrease query execution run time.

### LIMIT clauses


A `LIMIT N` statement reduces the data scanned by the query. With `LIMIT N` pushdown, the connector returns only `N` rows to Athena.

### Predicates


A predicate is an expression in the `WHERE` clause of a SQL query that evaluates to a Boolean value and filters rows based on multiple conditions. The Athena Cloudera Hive connector can combine these expressions and push them directly to Cloudera Hive for enhanced functionality and to reduce the amount of data scanned.

The following Athena Cloudera Hive connector operators support predicate pushdown:
+ **Boolean:** AND, OR, NOT
+ **Equality:** EQUAL, NOT_EQUAL, LESS_THAN, LESS_THAN_OR_EQUAL, GREATER_THAN, GREATER_THAN_OR_EQUAL, IS_NULL
+ **Arithmetic:** ADD, SUBTRACT, MULTIPLY, DIVIDE, MODULUS, NEGATE
+ **Other:** LIKE_PATTERN, IN

### Combined pushdown example


For enhanced querying capabilities, combine the pushdown types, as in the following example:

```
SELECT * 
FROM my_table 
WHERE col_a > 10 
    AND ((col_a + col_b) > (col_c % col_d))
    AND (col_e IN ('val1', 'val2', 'val3') OR col_f LIKE '%pattern%') 
LIMIT 10;
```

## Passthrough queries


The Cloudera Hive connector supports [passthrough queries](federated-query-passthrough.md). Passthrough queries use a table function to push your full query down to the data source for execution.

To use passthrough queries with Cloudera Hive, you can use the following syntax:

```
SELECT * FROM TABLE(
        system.query(
            query => 'query string'
        ))
```

The following example query pushes down a query to a data source in Cloudera Hive. The query selects all columns in the `customer` table, limiting the results to 10.

```
SELECT * FROM TABLE(
        system.query(
            query => 'SELECT * FROM customer LIMIT 10'
        ))
```

## License information


By using this connector, you acknowledge the inclusion of third party components, a list of which can be found in the [pom.xml](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-cloudera-hive/pom.xml) file for this connector, and agree to the terms in the respective third party licenses provided in the [LICENSE.txt](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-cloudera-hive/LICENSE.txt) file on GitHub.com.

## Additional resources


For the latest JDBC driver version information, see the [pom.xml](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-cloudera-hive/pom.xml) file for the Cloudera Hive connector on GitHub.com.

For additional information about this connector, visit [the corresponding site](https://github.com/awslabs/aws-athena-query-federation/tree/master/athena-cloudera-hive) on GitHub.com.

# Amazon Athena Cloudera Impala connector
Cloudera Impala

The Amazon Athena Cloudera Impala connector enables Athena to run SQL queries on the [Cloudera Impala](https://docs.cloudera.com/cdw-runtime/cloud/impala-overview/topics/impala-overview.html) distribution. The connector transforms your Athena SQL queries to the equivalent Impala syntax.


## Prerequisites

+ Deploy the connector to your AWS account using the Athena console or the AWS Serverless Application Repository. For more information, see [Create a data source connection](connect-to-a-data-source.md) or [Use the AWS Serverless Application Repository to deploy a data source connector](connect-data-source-serverless-app-repo.md).
+ Set up a VPC and a security group before you use this connector. For more information, see [Create a VPC for a data source connector or AWS Glue connection](athena-connectors-vpc-creation.md).

## Limitations

+ Write DDL operations are not supported.
+ In a multiplexer setup, the spill bucket and prefix are shared across all database instances.
+ Any relevant Lambda limits. For more information, see [Lambda quotas](https://docs.aws.amazon.com/lambda/latest/dg/gettingstarted-limits.html) in the *AWS Lambda Developer Guide*.

## Terms


The following terms relate to the Cloudera Impala connector.
+ **Database instance** – Any instance of a database deployed on premises, on Amazon EC2, or on Amazon RDS.
+ **Handler** – A Lambda handler that accesses your database instance. A handler can be for metadata or for data records.
+ **Metadata handler** – A Lambda handler that retrieves metadata from your database instance.
+ **Record handler** – A Lambda handler that retrieves data records from your database instance.
+ **Composite handler** – A Lambda handler that retrieves both metadata and data records from your database instance.
+ **Property or parameter** – A database property used by handlers to extract database information. You configure these properties as Lambda environment variables.
+ **Connection String** – A string of text used to establish a connection to a database instance.
+ **Catalog** – A non-AWS Glue catalog registered with Athena that is a required prefix for the `connection_string` property.
+ **Multiplexing handler** – A Lambda handler that can accept and use multiple database connections.

## Parameters


Use the parameters in this section to configure the Cloudera Impala connector.

### Glue connections (recommended)


We recommend that you configure the Cloudera Impala connector by using an AWS Glue connection object. To do this, set the `glue_connection` environment variable of the connector's Lambda function to the name of the Glue connection to use.

**Glue connections properties**

Use the following command to get the schema for a Glue connection object. This schema contains all the parameters that you can use to control your connection.

```
aws glue describe-connection-type --connection-type CLOUDERAIMPALA
```

**Lambda environment properties**
+ **glue_connection** – Specifies the name of the Glue connection associated with the federated connector. 
+ **casing_mode** – (Optional) Specifies how to handle casing for schema and table names. The `casing_mode` parameter uses the following values to specify the behavior of casing:
  + **none** – Do not change the case of the given schema and table names. This is the default for connectors that have an associated Glue connection. 
  + **upper** – Uppercase all given schema and table names.
  + **lower** – Lowercase all given schema and table names.

**Note**  
All connectors that use Glue connections must use AWS Secrets Manager to store credentials.
The Cloudera Impala connector created using Glue connections does not support the use of a multiplexing handler.
The Cloudera Impala connector created using Glue connections only supports `ConnectionSchemaVersion` 2.

### Legacy connections


#### Connection string


Use a JDBC connection string in the following format to connect to an Impala cluster.

```
impala://${jdbc_connection_string}
```

#### Using a multiplexing handler


You can use a multiplexer to connect to multiple database instances with a single Lambda function. Requests are routed by catalog name. Use the following classes in Lambda.


****  

| Handler | Class | 
| --- | --- | 
| Composite handler | ImpalaMuxCompositeHandler | 
| Metadata handler | ImpalaMuxMetadataHandler | 
| Record handler | ImpalaMuxRecordHandler | 

##### Multiplexing handler parameters



****  

| Parameter | Description | 
| --- | --- | 
| <catalog>_connection_string | Required. An Impala cluster connection string for an Athena catalog. Prefix the environment variable with the name of the catalog used in Athena. For example, if the catalog registered with Athena is myimpalacatalog, then the environment variable name is myimpalacatalog_connection_string. | 
| default | Required. The default connection string. This string is used when the catalog is lambda:${AWS_LAMBDA_FUNCTION_NAME}. | 

The following example properties are for an Impala MUX Lambda function that supports two database instances: `impala1` (the default), and `impala2`.


| Property | Value | 
| --- | --- | 
| default | impala://jdbc:impala://some.impala.host.name:21050/?${Test/impala1} | 
| impala\_catalog1\_connection\_string | impala://jdbc:impala://someother.impala.host.name:21050/?${Test/impala1} | 
| impala\_catalog2\_connection\_string | impala://jdbc:impala://another.impala.host.name:21050/?UID=sample&PWD=sample | 
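At request time, the multiplexer resolves the Athena catalog name to one of these environment variables. The following Python sketch illustrates that lookup under the naming convention in the table above; the helper function and environment values are illustrative, not connector code.

```python
import os

# Illustrative environment mirroring the example properties above.
os.environ["default"] = "impala://jdbc:impala://some.impala.host.name:21050/?${Test/impala1}"
os.environ["impala_catalog2_connection_string"] = (
    "impala://jdbc:impala://another.impala.host.name:21050/?UID=sample&PWD=sample"
)

def resolve_connection_string(catalog: str) -> str:
    """Sketch of how a multiplexing handler might route a catalog name to
    its connection string, falling back to the required 'default' value."""
    return os.environ.get(f"{catalog}_connection_string", os.environ["default"])

assert "another.impala.host.name" in resolve_connection_string("impala_catalog2")
assert resolve_connection_string("no_such_catalog") == os.environ["default"]
```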

##### Providing credentials


To provide a user name and password for your database in your JDBC connection string, you can use connection string properties or AWS Secrets Manager.
+ **Connection String** – A user name and password can be specified as properties in the JDBC connection string.
**Important**  
As a security best practice, don't use hardcoded credentials in your environment variables or connection strings. For information about moving your hardcoded secrets to AWS Secrets Manager, see [Move hardcoded secrets to AWS Secrets Manager](https://docs.aws.amazon.com/secretsmanager/latest/userguide/hardcoded.html) in the *AWS Secrets Manager User Guide*.
+ **AWS Secrets Manager** – To use the Athena Federated Query feature with AWS Secrets Manager, the VPC connected to your Lambda function should have [internet access](https://aws.amazon.com/premiumsupport/knowledge-center/internet-access-lambda-function/) or a [VPC endpoint](https://docs.aws.amazon.com/secretsmanager/latest/userguide/vpc-endpoint-overview.html) to connect to Secrets Manager.

  You can put the name of a secret in AWS Secrets Manager in your JDBC connection string. The connector replaces the secret name with the `username` and `password` values from Secrets Manager.

  For Amazon RDS database instances, this support is tightly integrated. If you use Amazon RDS, we highly recommend using AWS Secrets Manager and credential rotation. If your database does not use Amazon RDS, store the credentials as JSON in the following format:

  ```
  {"username": "${username}", "password": "${password}"}
  ```
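For example, the following Python sketch builds that credential document with the standard `json` module; the user name and password values are placeholders for illustration only.

```python
import json

def make_secret_payload(username: str, password: str) -> str:
    """Build the JSON credential document in the format expected above."""
    return json.dumps({"username": username, "password": password})

# Placeholder credentials for illustration only.
payload = make_secret_payload("sample", "sample-password")
assert json.loads(payload) == {"username": "sample", "password": "sample-password"}
```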

**Example connection string with secret name**  
The following string has the secret name `${Test/impala1host}`.

```
impala://jdbc:impala://Impala1host:21050/?...&${Test/impala1host}&...
```

The connector uses the secret name to retrieve secrets and provide the user name and password, as in the following example.

```
impala://jdbc:impala://Impala1host:21050/?...&UID=sample2&PWD=sample2&...
```

Currently, Cloudera Impala recognizes the `UID` and `PWD` JDBC properties.
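The placeholder substitution described above can be approximated in a few lines of Python. The regex and helper below are illustrative assumptions about the behavior, not the connector's implementation.

```python
import re

def inject_credentials(conn_str: str, secrets: dict) -> str:
    """Replace each ${secret_name} placeholder with UID/PWD properties,
    mimicking the substitution described above (illustrative only)."""
    def repl(match: re.Match) -> str:
        creds = secrets[match.group(1)]
        return f"UID={creds['username']}&PWD={creds['password']}"
    return re.sub(r"\$\{([^}]+)\}", repl, conn_str)

# Simulated Secrets Manager lookup result for the secret Test/impala1host.
secrets = {"Test/impala1host": {"username": "sample2", "password": "sample2"}}
before = "impala://jdbc:impala://Impala1host:21050/?${Test/impala1host}&..."
after = inject_credentials(before, secrets)
assert after == "impala://jdbc:impala://Impala1host:21050/?UID=sample2&PWD=sample2&..."
```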

#### Using a single connection handler


You can use the following single connection metadata and record handlers to connect to a single Cloudera Impala instance.


| Handler type | Class | 
| --- | --- | 
| Composite handler | ImpalaCompositeHandler | 
| Metadata handler | ImpalaMetadataHandler | 
| Record handler | ImpalaRecordHandler | 

##### Single connection handler parameters



| Parameter | Description | 
| --- | --- | 
| default | Required. The default connection string. | 

The single connection handlers support one database instance and must provide a `default` connection string parameter. All other connection strings are ignored.

The following example property is for a single Cloudera Impala instance supported by a Lambda function.


| Property | Value | 
| --- | --- | 
| default | impala://jdbc:impala://Impala1host:21050/?secret=${Test/impala1host} | 

#### Spill parameters


The Lambda SDK can spill data to Amazon S3. All database instances accessed by the same Lambda function spill to the same location.


| Parameter | Description | 
| --- | --- | 
| spill\_bucket | Required. Spill bucket name. | 
| spill\_prefix | Required. Spill bucket key prefix. | 
| spill\_put\_request\_headers | (Optional) A JSON encoded map of request headers and values for the Amazon S3 putObject request that is used for spilling (for example, {"x-amz-server-side-encryption" : "AES256"}). For other possible headers, see [PutObject](https://docs.aws.amazon.com/AmazonS3/latest/API/API_PutObject.html) in the *Amazon Simple Storage Service API Reference*. | 
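Because the header parameter takes a JSON-encoded map, you can build and validate the value before setting it. A minimal Python sketch:

```python
import json

# Compose the optional spill request headers value: a JSON-encoded map of
# Amazon S3 putObject headers applied to every spill write.
headers = {"x-amz-server-side-encryption": "AES256"}
env_value = json.dumps(headers)

# Round-trip to confirm the value is valid JSON before configuring it.
assert json.loads(env_value)["x-amz-server-side-encryption"] == "AES256"
```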

## Data type support


The following table shows the corresponding data types for JDBC, Cloudera Impala, and Arrow.


| JDBC | Cloudera Impala | Arrow | 
| --- | --- | --- | 
| Boolean | Boolean | Bit | 
| Integer | TINYINT | Tiny | 
| Short | SMALLINT | Smallint | 
| Integer | INT | Int | 
| Long | BIGINT | Bigint | 
| float | float4 | Float4 | 
| Double | float8 | Float8 | 
| Date | date | DateDay | 
| Timestamp | timestamp | DateMilli | 
| String | VARCHAR | Varchar | 
| Bytes | bytes | Varbinary | 
| BigDecimal | Decimal | Decimal | 
| ARRAY | N/A (see note) | List | 

**Note**  
Currently, Cloudera Impala does not support the aggregate types `ARRAY`, `MAP`, `STRUCT`, or `UNIONTYPE`. Columns of aggregate types are treated as `VARCHAR` columns in SQL.

## Partitions and splits


Partitions are used to determine how to generate splits for the connector. Athena constructs a synthetic column of type `varchar` that represents the partitioning scheme for the table to help the connector generate splits. The connector does not modify the actual table definition.

## Performance


Cloudera Impala supports static partitions. The Athena Cloudera Impala connector can retrieve data from these partitions in parallel. If you want to query very large datasets with uniform partition distribution, static partitioning is highly recommended. The Cloudera Impala connector is resilient to throttling due to concurrency.

The Athena Cloudera Impala connector performs predicate pushdown to decrease the data scanned by the query. `LIMIT` clauses, simple predicates, and complex expressions are pushed down to the connector to reduce the amount of data scanned and decrease query execution run time.

### LIMIT clauses


A `LIMIT N` statement reduces the data scanned by the query. With `LIMIT N` pushdown, the connector returns only `N` rows to Athena.

### Predicates


A predicate is an expression in the `WHERE` clause of a SQL query that evaluates to a Boolean value and filters rows based on multiple conditions. The Athena Cloudera Impala connector can combine these expressions and push them directly to Cloudera Impala for enhanced functionality and to reduce the amount of data scanned.

The following Athena Cloudera Impala connector operators support predicate pushdown:
+ **Boolean:** AND, OR, NOT
+ **Equality:** EQUAL, NOT\_EQUAL, LESS\_THAN, LESS\_THAN\_OR\_EQUAL, GREATER\_THAN, GREATER\_THAN\_OR\_EQUAL, IS\_DISTINCT\_FROM, NULL\_IF, IS\_NULL
+ **Arithmetic:** ADD, SUBTRACT, MULTIPLY, DIVIDE, MODULUS, NEGATE
+ **Other:** LIKE\_PATTERN, IN

### Combined pushdown example


For enhanced querying capabilities, combine the pushdown types, as in the following example:

```
SELECT * 
FROM my_table 
WHERE col_a > 10 
    AND ((col_a + col_b) > (col_c % col_d))
    AND (col_e IN ('val1', 'val2', 'val3') OR col_f LIKE '%pattern%') 
LIMIT 10;
```

## Passthrough queries


The Cloudera Impala connector supports [passthrough queries](federated-query-passthrough.md). Passthrough queries use a table function to push your full query down to the data source for execution.

To use passthrough queries with Cloudera Impala, you can use the following syntax:

```
SELECT * FROM TABLE(
        system.query(
            query => 'query string'
        ))
```

The following example query pushes down a query to a data source in Cloudera Impala. The query selects all columns in the `customer` table, limiting the results to 10.

```
SELECT * FROM TABLE(
        system.query(
            query => 'SELECT * FROM customer LIMIT 10'
        ))
```

## License information


By using this connector, you acknowledge the inclusion of third party components, a list of which can be found in the [pom.xml](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-cloudera-impala/pom.xml) file for this connector, and agree to the terms in the respective third party licenses provided in the [LICENSE.txt](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-cloudera-impala/LICENSE.txt) file on GitHub.com.

## Additional resources


For the latest JDBC driver version information, see the [pom.xml](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-cloudera-impala/pom.xml) file for the Cloudera Impala connector on GitHub.com.

For additional information about this connector, visit [the corresponding site](https://github.com/awslabs/aws-athena-query-federation/tree/master/athena-cloudera-impala) on GitHub.com.

# Amazon Athena CloudWatch connector
CloudWatch

The Amazon Athena CloudWatch connector enables Amazon Athena to communicate with CloudWatch so that you can query your log data with SQL.

This connector does not use Glue Connections to centralize configuration properties in Glue. Connection configuration is done through Lambda.

The connector maps your LogGroups as schemas and each LogStream as a table. The connector also maps a special `all_log_streams` view that contains all LogStreams in the LogGroup. This view enables you to query all the logs in a LogGroup at once instead of searching through each LogStream individually.

## Prerequisites

+ Deploy the connector to your AWS account using the Athena console or the AWS Serverless Application Repository. For more information, see [Create a data source connection](connect-to-a-data-source.md) or [Use the AWS Serverless Application Repository to deploy a data source connector](connect-data-source-serverless-app-repo.md).

## Parameters


Use the parameters in this section to configure the CloudWatch connector.

### Glue connections (recommended)


We recommend that you configure a CloudWatch connector by using a Glue connections object. To do this, set the `glue_connection` environment variable of the CloudWatch connector Lambda to the name of the Glue connection to use.

**Glue connections properties**

Use the following command to get the schema for a Glue connection object. This schema contains all the parameters that you can use to control your connection.

```
aws glue describe-connection-type --connection-type CLOUDWATCH
```

**Lambda environment properties**
+ **glue\_connection** – Specifies the name of the Glue connection associated with the federated connector.

**Note**  
All connectors that use Glue connections must use AWS Secrets Manager to store credentials.
The CloudWatch connector created using Glue connections does not support the use of a multiplexing handler.
The CloudWatch connector created using Glue connections only supports `ConnectionSchemaVersion` 2.

### Legacy connections

+ **spill\_bucket** – Specifies the Amazon S3 bucket for data that exceeds Lambda function limits.
+ **spill\_prefix** – (Optional) Defaults to a subfolder in the specified `spill_bucket` called `athena-federation-spill`. We recommend that you configure an Amazon S3 [storage lifecycle](https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-lifecycle-mgmt.html) on this location to delete spills older than a predetermined number of days or hours.
+ **spill\_put\_request\_headers** – (Optional) A JSON encoded map of request headers and values for the Amazon S3 `putObject` request that is used for spilling (for example, `{"x-amz-server-side-encryption" : "AES256"}`). For other possible headers, see [PutObject](https://docs.aws.amazon.com/AmazonS3/latest/API/API_PutObject.html) in the *Amazon Simple Storage Service API Reference*.
+ **kms\_key\_id** – (Optional) By default, any data that is spilled to Amazon S3 is encrypted using the AES-GCM authenticated encryption mode and a randomly generated key. To have your Lambda function use stronger encryption keys generated by KMS like `a7e63k4b-8loc-40db-a2a1-4d0en2cd8331`, you can specify a KMS key ID.
+ **disable\_spill\_encryption** – (Optional) When set to `True`, disables spill encryption. Defaults to `False` so that data that is spilled to S3 is encrypted using AES-GCM – either using a randomly generated key or KMS to generate keys. Disabling spill encryption can improve performance, especially if your spill location uses [server-side encryption](https://docs.aws.amazon.com/AmazonS3/latest/userguide/serv-side-encryption.html).

The connector also supports [AIMD congestion control](https://en.wikipedia.org/wiki/Additive_increase/multiplicative_decrease) for handling throttling events from CloudWatch through the [Amazon Athena Query Federation SDK](https://github.com/awslabs/aws-athena-query-federation/tree/master/athena-federation-sdk) `ThrottlingInvoker` construct. You can tweak the default throttling behavior by setting any of the following optional environment variables:
+ **throttle\_initial\_delay\_ms** – The initial call delay applied after the first congestion event. The default is 10 milliseconds.
+ **throttle\_max\_delay\_ms** – The maximum delay between calls. You can derive TPS by dividing it into 1000 ms. The default is 1000 milliseconds.
+ **throttle\_decrease\_factor** – The factor by which Athena reduces the call rate. The default is 0.5.
+ **throttle\_increase\_ms** – The rate at which Athena decreases the call delay. The default is 10 milliseconds.
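The four variables above implement additive-increase/multiplicative-decrease backoff. The following Python sketch models that behavior under the stated defaults; it is a simplified illustration, not the SDK's `ThrottlingInvoker` code.

```python
def next_delay_after_throttle(delay_ms: float, initial_ms: float = 10,
                              max_ms: float = 1000,
                              decrease_factor: float = 0.5) -> float:
    """Multiplicative decrease: a congestion event cuts the call rate by
    decrease_factor, which grows the delay, capped at max_ms."""
    if delay_ms == 0:
        return initial_ms
    return min(delay_ms / decrease_factor, max_ms)

def next_delay_after_success(delay_ms: float, increase_ms: float = 10) -> float:
    """Additive increase: each successful call shrinks the delay by increase_ms."""
    return max(delay_ms - increase_ms, 0)

delay = next_delay_after_throttle(0)      # first congestion event: 10 ms
delay = next_delay_after_throttle(delay)  # rate halves, so delay doubles: 20 ms
assert delay == 20
assert next_delay_after_success(delay) == 10
```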

## Databases and tables


The Athena CloudWatch connector maps your LogGroups as schemas (that is, databases) and each LogStream as a table. The connector also maps a special `all_log_streams` view that contains all LogStreams in the LogGroup. This view enables you to query all the logs in a LogGroup at once instead of searching through each LogStream individually.

Every table mapped by the Athena CloudWatch connector has the following schema. This schema matches the fields provided by CloudWatch Logs.
+ **log\_stream** – A `VARCHAR` that contains the name of the LogStream that the row is from.
+ **time** – An `INT64` that contains the epoch time of when the log line was generated.
+ **message** – A `VARCHAR` that contains the log message.
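For illustration, projecting a CloudWatch Logs event into this three-column row shape might look like the following. The input field names follow the CloudWatch Logs `GetLogEvents` response; the helper is a hypothetical sketch, not connector code.

```python
def to_row(log_stream: str, event: dict) -> dict:
    """Project a CloudWatch Logs event into the connector's row schema:
    log_stream (VARCHAR), time (INT64 epoch), message (VARCHAR)."""
    return {
        "log_stream": log_stream,
        "time": event["timestamp"],
        "message": event["message"],
    }

row = to_row("my-stream", {"timestamp": 1710918615308, "message": "hello"})
assert row == {"log_stream": "my-stream", "time": 1710918615308, "message": "hello"}
```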

**Examples**  
The following example shows how to perform a `SELECT` query on a specified LogStream.

```
SELECT * 
FROM "lambda:cloudwatch_connector_lambda_name"."log_group_path"."log_stream_name" 
LIMIT 100
```

The following example shows how to use the `all_log_streams` view to perform a query on all LogStreams in a specified LogGroup. 

```
SELECT * 
FROM "lambda:cloudwatch_connector_lambda_name"."log_group_path"."all_log_streams" 
LIMIT 100
```

## Required Permissions


For full details on the IAM policies that this connector requires, review the `Policies` section of the [athena-cloudwatch.yaml](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-cloudwatch/athena-cloudwatch.yaml) file. The following list summarizes the required permissions.
+ **Amazon S3 write access** – The connector requires write access to a location in Amazon S3 in order to spill results from large queries.
+ **Athena GetQueryExecution** – The connector uses this permission to fast-fail when the upstream Athena query has terminated.
+ **CloudWatch Logs Read/Write** – The connector uses this permission to read your log data and to write its diagnostic logs.

## Performance


The Athena CloudWatch connector attempts to optimize queries against CloudWatch by parallelizing scans of the log streams required for your query. For certain time period filters, predicate pushdown is performed both within the Lambda function and within CloudWatch Logs.

For best performance, use only lowercase for your log group names and log stream names. Using mixed casing causes the connector to perform a case-insensitive search that is more computationally intensive.

**Note**  
 The CloudWatch connector does not support uppercase database names. 

## Passthrough queries


The CloudWatch connector supports [passthrough queries](federated-query-passthrough.md) that use [CloudWatch Logs Insights query syntax](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/CWL_QuerySyntax.html). For more information about CloudWatch Logs Insights, see [Analyzing log data with CloudWatch Logs Insights](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/AnalyzingLogData.html) in the *Amazon CloudWatch Logs User Guide*.

To create passthrough queries with CloudWatch, use the following syntax:

```
SELECT * FROM TABLE(
        system.query(
            STARTTIME => 'start_time',
            ENDTIME => 'end_time',
            QUERYSTRING => 'query_string',
            LOGGROUPNAMES => 'log_group-names',
            LIMIT => 'max_number_of_results'
        ))
```

The following example CloudWatch passthrough query filters for the `duration` field when it does not equal 1000.

```
SELECT * FROM TABLE(
        system.query(
            STARTTIME => '1710918615308',
            ENDTIME => '1710918615972',
            QUERYSTRING => 'fields @duration | filter @duration != 1000',
            LOGGROUPNAMES => '/aws/lambda/cloudwatch-test-1',
            LIMIT => '2'
            ))
```

## License information


The Amazon Athena CloudWatch connector project is licensed under the [Apache-2.0 License](https://www.apache.org/licenses/LICENSE-2.0.html).

## Additional resources


For additional information about this connector, visit [the corresponding site](https://github.com/awslabs/aws-athena-query-federation/tree/master/athena-cloudwatch) on GitHub.com.

# Amazon Athena CloudWatch Metrics connector
CloudWatch metrics

The Amazon Athena CloudWatch Metrics connector enables Amazon Athena to query CloudWatch Metrics data with SQL.

This connector does not use Glue Connections to centralize configuration properties in Glue. Connection configuration is done through Lambda.

For information on publishing query metrics to CloudWatch from Athena itself, see [Use CloudWatch and EventBridge to monitor queries and control costs](workgroups-control-limits.md).

## Prerequisites

+ Deploy the connector to your AWS account using the Athena console or the AWS Serverless Application Repository. For more information, see [Create a data source connection](connect-to-a-data-source.md) or [Use the AWS Serverless Application Repository to deploy a data source connector](connect-data-source-serverless-app-repo.md).

## Parameters


Use the parameters in this section to configure the CloudWatch Metrics connector.

### Glue connections (recommended)


We recommend that you configure a CloudWatch Metrics connector by using a Glue connections object. To do this, set the `glue_connection` environment variable of the CloudWatch Metrics connector Lambda to the name of the Glue connection to use.

**Glue connections properties**

Use the following command to get the schema for a Glue connection object. This schema contains all the parameters that you can use to control your connection.

```
aws glue describe-connection-type --connection-type CLOUDWATCHMETRICS
```

**Lambda environment properties**
+ **glue\_connection** – Specifies the name of the Glue connection associated with the federated connector.

**Note**  
All connectors that use Glue connections must use AWS Secrets Manager to store credentials.
The CloudWatch Metrics connector created using Glue connections does not support the use of a multiplexing handler.
The CloudWatch Metrics connector created using Glue connections only supports `ConnectionSchemaVersion` 2.

### Legacy connections

+ **spill\_bucket** – Specifies the Amazon S3 bucket for data that exceeds Lambda function limits.
+ **spill\_prefix** – (Optional) Defaults to a subfolder in the specified `spill_bucket` called `athena-federation-spill`. We recommend that you configure an Amazon S3 [storage lifecycle](https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-lifecycle-mgmt.html) on this location to delete spills older than a predetermined number of days or hours.
+ **spill\_put\_request\_headers** – (Optional) A JSON encoded map of request headers and values for the Amazon S3 `putObject` request that is used for spilling (for example, `{"x-amz-server-side-encryption" : "AES256"}`). For other possible headers, see [PutObject](https://docs.aws.amazon.com/AmazonS3/latest/API/API_PutObject.html) in the *Amazon Simple Storage Service API Reference*.
+ **kms\_key\_id** – (Optional) By default, any data that is spilled to Amazon S3 is encrypted using the AES-GCM authenticated encryption mode and a randomly generated key. To have your Lambda function use stronger encryption keys generated by KMS like `a7e63k4b-8loc-40db-a2a1-4d0en2cd8331`, you can specify a KMS key ID.
+ **disable\_spill\_encryption** – (Optional) When set to `True`, disables spill encryption. Defaults to `False` so that data that is spilled to S3 is encrypted using AES-GCM – either using a randomly generated key or KMS to generate keys. Disabling spill encryption can improve performance, especially if your spill location uses [server-side encryption](https://docs.aws.amazon.com/AmazonS3/latest/userguide/serv-side-encryption.html).

The connector also supports [AIMD congestion control](https://en.wikipedia.org/wiki/Additive_increase/multiplicative_decrease) for handling throttling events from CloudWatch through the [Amazon Athena Query Federation SDK](https://github.com/awslabs/aws-athena-query-federation/tree/master/athena-federation-sdk) `ThrottlingInvoker` construct. You can tweak the default throttling behavior by setting any of the following optional environment variables:
+ **throttle\_initial\_delay\_ms** – The initial call delay applied after the first congestion event. The default is 10 milliseconds.
+ **throttle\_max\_delay\_ms** – The maximum delay between calls. You can derive TPS by dividing it into 1000 ms. The default is 1000 milliseconds.
+ **throttle\_decrease\_factor** – The factor by which Athena reduces the call rate. The default is 0.5.
+ **throttle\_increase\_ms** – The rate at which Athena decreases the call delay. The default is 10 milliseconds.

## Databases and tables


The Athena CloudWatch Metrics connector maps your namespaces, dimensions, metrics, and metric values into two tables in a single schema called `default`.

### The metrics table


The `metrics` table contains the available metrics as uniquely defined by a combination of namespace, set, and name. The `metrics` table contains the following columns.
+ **namespace** – A `VARCHAR` containing the namespace.
+ **metric\_name** – A `VARCHAR` containing the metric name.
+ **dimensions** – A `LIST` of `STRUCT` objects composed of `dim_name (VARCHAR)` and `dim_value (VARCHAR)`.
+ **statistic** – A `LIST` of `VARCHAR` statistics (for example, `p90`, `AVERAGE`, ...) available for the metric.

### The metric\_samples table


The `metric_samples` table contains the available metric samples for each metric in the `metrics` table. The `metric_samples` table contains the following columns.
+ **namespace** – A `VARCHAR` that contains the namespace.
+ **metric\_name** – A `VARCHAR` that contains the metric name.
+ **dimensions** – A `LIST` of `STRUCT` objects composed of `dim_name (VARCHAR)` and `dim_value (VARCHAR)`.
+ **dim\_name** – A `VARCHAR` convenience field that you can use to easily filter on a single dimension name.
+ **dim\_value** – A `VARCHAR` convenience field that you can use to easily filter on a single dimension value.
+ **period** – An `INT` field that represents the "period" of the metric in seconds (for example, a 60 second metric).
+ **timestamp** – A `BIGINT` field that represents the epoch time in seconds that the metric sample is for.
+ **value** – A `FLOAT8` field that contains the value of the sample.
+ **statistic** – A `VARCHAR` that contains the statistic type of the sample (for example, `AVERAGE` or `p90`).
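Because `dim_name` and `dim_value` are per-row convenience fields, a sample with multiple dimensions corresponds to one row per dimension. The following Python sketch illustrates that flattening; the input shape and helper are assumptions for illustration, not connector code.

```python
def flatten_sample(sample: dict) -> list[dict]:
    """Emit one row per dimension so dim_name/dim_value can be filtered
    directly, mirroring the convenience columns described above."""
    rows = []
    for dim in sample["dimensions"] or [{"dim_name": None, "dim_value": None}]:
        row = dict(sample)
        row["dim_name"] = dim["dim_name"]
        row["dim_value"] = dim["dim_value"]
        rows.append(row)
    return rows

sample = {
    "namespace": "AWS/Lambda",
    "metric_name": "Invocations",
    "dimensions": [{"dim_name": "FunctionName", "dim_value": "my-fn"}],
    "period": 60,
    "timestamp": 1710918615,
    "value": 3.0,
    "statistic": "Sum",
}
rows = flatten_sample(sample)
assert rows[0]["dim_name"] == "FunctionName" and rows[0]["dim_value"] == "my-fn"
```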

## Required Permissions


For full details on the IAM policies that this connector requires, review the `Policies` section of the [athena-cloudwatch-metrics.yaml](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-cloudwatch-metrics/athena-cloudwatch-metrics.yaml) file. The following list summarizes the required permissions.
+ **Amazon S3 write access** – The connector requires write access to a location in Amazon S3 in order to spill results from large queries.
+ **Athena GetQueryExecution** – The connector uses this permission to fast-fail when the upstream Athena query has terminated.
+ **CloudWatch Metrics ReadOnly** – The connector uses this permission to query your metrics data.
+ **CloudWatch Logs Write** – The connector uses this access to write its diagnostic logs.

## Performance


The Athena CloudWatch Metrics connector attempts to optimize queries against CloudWatch Metrics by parallelizing scans of the metrics required for your query. For certain time period, metric, namespace, and dimension filters, predicate pushdown is performed both within the Lambda function and within CloudWatch Metrics.

## License information


The Amazon Athena CloudWatch Metrics connector project is licensed under the [Apache-2.0 License](https://www.apache.org/licenses/LICENSE-2.0.html).

## Additional resources


For additional information about this connector, visit [the corresponding site](https://github.com/awslabs/aws-athena-query-federation/tree/master/athena-cloudwatch-metrics) on GitHub.com.

# Amazon Athena AWS CMDB connector
CMDB

The Amazon Athena AWS CMDB connector enables Athena to communicate with various AWS services so that you can query them with SQL.

This connector can be registered with Glue Data Catalog as a federated catalog. It supports data access controls defined in Lake Formation at the catalog, database, table, column, row, and tag levels. This connector uses Glue Connections to centralize configuration properties in Glue.

## Prerequisites

+ Deploy the connector to your AWS account using the Athena console or the AWS Serverless Application Repository. For more information, see [Create a data source connection](connect-to-a-data-source.md) or [Use the AWS Serverless Application Repository to deploy a data source connector](connect-data-source-serverless-app-repo.md).

## Parameters


Use the parameters in this section to configure the AWS CMDB connector.

### Glue connections (recommended)


We recommend that you configure an AWS CMDB connector by using a Glue connections object. To do this, set the `glue_connection` environment variable of the AWS CMDB connector Lambda to the name of the Glue connection to use.

**Glue connections properties**

Use the following command to get the schema for a Glue connection object. This schema contains all the parameters that you can use to control your connection.

```
aws glue describe-connection-type --connection-type CMDB
```

**Lambda environment properties**

**glue\_connection** – Specifies the name of the Glue connection associated with the federated connector.

**Note**  
All connectors that use Glue connections must use AWS Secrets Manager to store credentials.
The AWS CMDB connector created using Glue connections does not support the use of a multiplexing handler.
The AWS CMDB connector created using Glue connections only supports `ConnectionSchemaVersion` 2.

### Legacy connections


**Note**  
Athena data source connectors created on December 3, 2024 and later use AWS Glue connections.

The parameter names and definitions listed below are for Athena data source connectors created without an associated Glue connection. Use the following parameters only when you [manually deploy](connect-data-source-serverless-app-repo.md) an earlier version of an Athena data source connector or when the `glue_connection` environment property is not specified.

**Lambda environment properties**
+ **spill\_bucket** – Specifies the Amazon S3 bucket for data that exceeds Lambda function limits.
+ **spill\_prefix** – (Optional) Defaults to a subfolder in the specified `spill_bucket` called `athena-federation-spill`. We recommend that you configure an Amazon S3 [storage lifecycle](https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-lifecycle-mgmt.html) on this location to delete spills older than a predetermined number of days or hours.
+ **spill\_put\_request\_headers** – (Optional) A JSON encoded map of request headers and values for the Amazon S3 `putObject` request that is used for spilling (for example, `{"x-amz-server-side-encryption" : "AES256"}`). For other possible headers, see [PutObject](https://docs.aws.amazon.com/AmazonS3/latest/API/API_PutObject.html) in the *Amazon Simple Storage Service API Reference*.
+ **kms\_key\_id** – (Optional) By default, any data that is spilled to Amazon S3 is encrypted using the AES-GCM authenticated encryption mode and a randomly generated key. To have your Lambda function use stronger encryption keys generated by KMS like `a7e63k4b-8loc-40db-a2a1-4d0en2cd8331`, you can specify a KMS key ID.
+ **disable\_spill\_encryption** – (Optional) When set to `True`, disables spill encryption. Defaults to `False` so that data that is spilled to S3 is encrypted using AES-GCM – either using a randomly generated key or KMS to generate keys. Disabling spill encryption can improve performance, especially if your spill location uses [server-side encryption](https://docs.aws.amazon.com/AmazonS3/latest/userguide/serv-side-encryption.html).
+ **default\_ec2\_image\_owner** – (Optional) When set, controls the default Amazon EC2 image owner that filters [Amazon Machine Images (AMI)](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/AMIs.html). If you do not set this value and your query against the EC2 images table does not include a filter for owner, your results will include all public images.

## Databases and tables


The Athena AWS CMDB connector makes the following databases and tables available for querying your AWS resource inventory. For more information on the columns available in each table, run a `DESCRIBE database.table` statement using the Athena console or API.
+ **ec2** – This database contains Amazon EC2 related resources, including the following.
  + **ebs\_volumes** – Contains details of your Amazon EBS volumes.
  + **ec2\_instances** – Contains details of your EC2 instances.
  + **ec2\_images** – Contains details of your EC2 instance images.
  + **routing\_tables** – Contains details of your VPC routing tables.
  + **security\_groups** – Contains details of your security groups.
  + **subnets** – Contains details of your VPC subnets.
  + **vpcs** – Contains details of your VPCs.
+ **emr** – This database contains Amazon EMR related resources, including the following.
  + **emr\_clusters** – Contains details of your EMR clusters.
+ **rds** – This database contains Amazon RDS related resources, including the following.
  + **rds\_instances** – Contains details of your RDS instances.
+ **s3** – This database contains Amazon S3 related resources, including the following.
  + **buckets** – Contains details of your Amazon S3 buckets.
  + **objects** – Contains details of your Amazon S3 objects, excluding their contents.

## Required Permissions


For full details on the IAM policies that this connector requires, review the `Policies` section of the [athena-aws-cmdb.yaml](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-aws-cmdb/athena-aws-cmdb.yaml) file. The following list summarizes the required permissions.
+ **Amazon S3 write access** – The connector requires write access to a location in Amazon S3 in order to spill results from large queries.
+ **Athena GetQueryExecution** – The connector uses this permission to fast-fail when the upstream Athena query has terminated.
+ **S3 List** – The connector uses this permission to list your Amazon S3 buckets and objects.
+ **EC2 Describe** – The connector uses this permission to describe resources such as your Amazon EC2 instances, security groups, VPCs, and Amazon EBS volumes.
+ **EMR Describe / List** – The connector uses this permission to describe your EMR clusters.
+ **RDS Describe** – The connector uses this permission to describe your RDS instances.

## Performance


Currently, the Athena AWS CMDB connector does not support parallel scans. Predicate pushdown is performed within the Lambda function. Where possible, partial predicates are pushed to the services being queried. For example, a query for the details of a specific Amazon EC2 instance calls the EC2 API with the specific instance ID to run a targeted describe operation.
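
As a sketch of that behavior, a query like the following lets the connector push the instance ID down to a targeted describe call instead of scanning all instances. The function name `cmdb` and the filter column name are illustrative; check your deployment and `DESCRIBE` output for the actual values.

```
-- Filter on a specific instance so the connector can make
-- a targeted EC2 describe call instead of a full scan
SELECT *
FROM "lambda:cmdb".ec2.ec2_instances
WHERE instance_id = 'i-0123456789abcdef0';
```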

## License information


The Amazon Athena AWS CMDB connector project is licensed under the [Apache-2.0 License](https://www.apache.org/licenses/LICENSE-2.0.html).

## Additional resources


For additional information about this connector, visit [the corresponding site](https://github.com/awslabs/aws-athena-query-federation/tree/master/athena-aws-cmdb) on GitHub.com.

# Amazon Athena IBM Db2 connector
Db2

The Amazon Athena connector for Db2 enables Amazon Athena to run SQL queries on your IBM Db2 databases using JDBC.

This connector can be registered with Glue Data Catalog as a federated catalog. It supports data access controls defined in Lake Formation at the catalog, database, table, column, row, and tag levels. This connector uses Glue Connections to centralize configuration properties in Glue.

## Prerequisites

+ Deploy the connector to your AWS account using the Athena console or the AWS Serverless Application Repository. For more information, see [Create a data source connection](connect-to-a-data-source.md) or [Use the AWS Serverless Application Repository to deploy a data source connector](connect-data-source-serverless-app-repo.md).
+ Set up a VPC and a security group before you use this connector. For more information, see [Create a VPC for a data source connector or AWS Glue connection](athena-connectors-vpc-creation.md).

## Limitations

+ Write DDL operations are not supported.
+ In a multiplexer setup, the spill bucket and prefix are shared across all database instances.
+ Any relevant Lambda limits. For more information, see [Lambda quotas](https://docs.aws.amazon.com/lambda/latest/dg/gettingstarted-limits.html) in the *AWS Lambda Developer Guide*.
+ Date and timestamp data types in filter conditions must be cast to appropriate data types.

## Terms


The following terms relate to the Db2 connector.
+ **Database instance** – Any instance of a database deployed on premises, on Amazon EC2, or on Amazon RDS.
+ **Handler** – A Lambda handler that accesses your database instance. A handler can be for metadata or for data records.
+ **Metadata handler** – A Lambda handler that retrieves metadata from your database instance.
+ **Record handler** – A Lambda handler that retrieves data records from your database instance.
+ **Composite handler** – A Lambda handler that retrieves both metadata and data records from your database instance.
+ **Property or parameter** – A database property used by handlers to extract database information. You configure these properties as Lambda environment variables.
+ **Connection String** – A string of text used to establish a connection to a database instance.
+ **Catalog** – A non-AWS Glue catalog registered with Athena that is a required prefix for the `connection_string` property.
+ **Multiplexing handler** – A Lambda handler that can accept and use multiple database connections.

## Parameters


Use the parameters in this section to configure the Db2 connector.

**Note**  
Athena data source connectors created on December 3, 2024 and later use AWS Glue connections.  
The parameter names and definitions listed below are for Athena data source connectors created prior to December 3, 2024. These can differ from their corresponding [AWS Glue connection properties](https://docs.aws.amazon.com/glue/latest/dg/connection-properties.html). Starting December 3, 2024, use the parameters below only when you [manually deploy](connect-data-source-serverless-app-repo.md) an earlier version of an Athena data source connector.

### Glue connections (recommended)


We recommend that you configure a Db2 connector by using a Glue connections object. To do this, set the `glue_connection` environment variable of the Db2 connector Lambda to the name of the Glue connection to use.

**Glue connections properties**

Use the following command to get the schema for a Glue connection object. This schema contains all the parameters that you can use to control your connection.

```
aws glue describe-connection-type --connection-type DB2
```

**Lambda environment properties**
+ **glue_connection** – Specifies the name of the Glue connection associated with the federated connector.
+ **casing_mode** – (Optional) Specifies how to handle casing for schema and table names. The `casing_mode` parameter uses the following values to specify the behavior of casing:
  + **none** – Do not change the case of the given schema and table names. This is the default for connectors that have an associated Glue connection.
  + **upper** – Upper case all given schema and table names.
  + **lower** – Lower case all given schema and table names.
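
For example, you might point an already-deployed connector Lambda function at a Glue connection and force lower-casing with the AWS CLI. The function and connection names below are placeholders. Note that `update-function-configuration` replaces the function's entire environment, so include any existing variables (such as spill settings) in the `Variables` map.

```
aws lambda update-function-configuration \
    --function-name my-db2-connector \
    --environment "Variables={glue_connection=my-db2-glue-connection,casing_mode=lower}"
```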

**Note**  
All connectors that use Glue connections must use AWS Secrets Manager to store credentials.
The Db2 connector created using Glue connections does not support the use of a multiplexing handler.
The Db2 connector created using Glue connections only supports `ConnectionSchemaVersion` 2.

### Legacy connections


#### Connection string


Use a JDBC connection string in the following format to connect to a database instance.

```
dbtwo://${jdbc_connection_string}
```

#### Using a multiplexing handler


You can use a multiplexer to connect to multiple database instances with a single Lambda function. Requests are routed by catalog name. Use the following classes in Lambda.


****  

| Handler | Class | 
| --- | --- | 
| Composite handler | Db2MuxCompositeHandler | 
| Metadata handler | Db2MuxMetadataHandler | 
| Record handler | Db2MuxRecordHandler | 

##### Multiplexing handler parameters



****  

| Parameter | Description | 
| --- | --- | 
| catalog_connection_string | Required. A database instance connection string. Prefix the environment variable with the name of the catalog used in Athena. For example, if the catalog registered with Athena is mydbtwocatalog, then the environment variable name is mydbtwocatalog_connection_string. | 
| default | Required. The default connection string. This string is used when the catalog is lambda:${AWS_LAMBDA_FUNCTION_NAME}. | 

The following example properties are for a Db2 MUX Lambda function that supports two database instances: `dbtwo1` (the default), and `dbtwo2`.


****  

| Property | Value | 
| --- | --- | 
| default | dbtwo://jdbc:db2://dbtwo1.hostname:port/database_name:${secret1_name} | 
| dbtwo_catalog1_connection_string | dbtwo://jdbc:db2://dbtwo1.hostname:port/database_name:${secret1_name} | 
| dbtwo_catalog2_connection_string | dbtwo://jdbc:db2://dbtwo2.hostname:port/database_name:${secret2_name} | 

##### Providing credentials


To provide a user name and password for your database in your JDBC connection string, you can use connection string properties or AWS Secrets Manager.
+ **Connection String** – A user name and password can be specified as properties in the JDBC connection string.
**Important**  
As a security best practice, don't use hardcoded credentials in your environment variables or connection strings. For information about moving your hardcoded secrets to AWS Secrets Manager, see [Move hardcoded secrets to AWS Secrets Manager](https://docs.aws.amazon.com/secretsmanager/latest/userguide/hardcoded.html) in the *AWS Secrets Manager User Guide*.
+ **AWS Secrets Manager** – To use the Athena Federated Query feature with AWS Secrets Manager, the VPC connected to your Lambda function should have [internet access](https://aws.amazon.com/premiumsupport/knowledge-center/internet-access-lambda-function/) or a [VPC endpoint](https://docs.aws.amazon.com/secretsmanager/latest/userguide/vpc-endpoint-overview.html) to connect to Secrets Manager.

  You can put the name of a secret in AWS Secrets Manager in your JDBC connection string. The connector replaces the secret name with the `username` and `password` values from Secrets Manager.

  For Amazon RDS database instances, this support is tightly integrated. If you use Amazon RDS, we highly recommend using AWS Secrets Manager and credential rotation. If your database does not use Amazon RDS, store the credentials as JSON in the following format:

  ```
  {"username": "${username}", "password": "${password}"}
  ```
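
  For example, you could create such a secret with the AWS CLI before deploying the connector. The secret name and credential values below are placeholders:

  ```
  aws secretsmanager create-secret \
      --name db2-credentials \
      --secret-string '{"username": "db2user", "password": "example-password"}'
  ```

  You then reference the secret by name in the connection string, as in `dbtwo://jdbc:db2://hostname:port/database_name:${db2-credentials}`.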

**Example connection string with secret name**  
The following string has the secret name `${secret_name}`.

```
dbtwo://jdbc:db2://hostname:port/database_name:${secret_name}
```

The connector uses the secret name to retrieve secrets and provide the user name and password, as in the following example.

```
dbtwo://jdbc:db2://hostname:port/database_name:user=user_name;password=password;
```

#### Using a single connection handler


You can use the following single connection metadata and record handlers to connect to a single Db2 instance.


****  

| Handler type | Class | 
| --- | --- | 
| Composite handler | Db2CompositeHandler | 
| Metadata handler | Db2MetadataHandler | 
| Record handler | Db2RecordHandler | 

##### Single connection handler parameters



****  

| Parameter | Description | 
| --- | --- | 
| default | Required. The default connection string. | 

The single connection handlers support one database instance and must provide a `default` connection string parameter. All other connection strings are ignored.

The following example property is for a single Db2 instance supported by a Lambda function.


****  

| Property | Value | 
| --- | --- | 
| default | dbtwo://jdbc:db2://hostname:port/database_name:${secret_name} | 

#### Spill parameters


The Lambda SDK can spill data to Amazon S3. All database instances accessed by the same Lambda function spill to the same location.


****  

| Parameter | Description | 
| --- | --- | 
| spill_bucket | Required. Spill bucket name. | 
| spill_prefix | Required. Spill bucket key prefix. | 
| spill_put_request_headers | (Optional) A JSON encoded map of request headers and values for the Amazon S3 putObject request that is used for spilling (for example, {"x-amz-server-side-encryption" : "AES256"}). For other possible headers, see [PutObject](https://docs.aws.amazon.com/AmazonS3/latest/API/API_PutObject.html) in the *Amazon Simple Storage Service API Reference*. | 

## Data type support


The following table shows the corresponding data types for JDBC and Arrow.


****  

| Db2 | Arrow | 
| --- | --- | 
| CHAR | VARCHAR | 
| VARCHAR | VARCHAR | 
| DATE | DATEDAY | 
| TIME | VARCHAR | 
| TIMESTAMP | DATEMILLI | 
| DATETIME | DATEMILLI | 
| BOOLEAN | BOOL | 
| SMALLINT | SMALLINT | 
| INTEGER | INT | 
| BIGINT | BIGINT | 
| DECIMAL | DECIMAL | 
| REAL | FLOAT8 | 
| DOUBLE | FLOAT8 | 
| DECFLOAT | FLOAT8 | 

## Partitions and splits


A partition is represented by one or more partition columns of type `varchar`. The Db2 connector creates partitions using the following organization schemes.
+ Distribute by hash
+ Partition by range
+ Organize by dimensions

The connector retrieves partition details such as the number of partitions and column name from one or more Db2 metadata tables. Splits are created based upon the number of partitions identified. 

## Performance


The Athena Db2 connector performs predicate pushdown to decrease the data scanned by the query. `LIMIT` clauses, simple predicates, and complex expressions are pushed down to the connector to reduce the amount of data scanned and decrease query execution run time.

### LIMIT clauses


A `LIMIT N` statement reduces the data scanned by the query. With `LIMIT N` pushdown, the connector returns only `N` rows to Athena.

### Predicates


A predicate is an expression in the `WHERE` clause of a SQL query that evaluates to a Boolean value and filters rows based on multiple conditions. The Athena Db2 connector can combine these expressions and push them directly to Db2 for enhanced functionality and to reduce the amount of data scanned.

The following Athena Db2 connector operators support predicate pushdown:
+ **Boolean:** AND, OR, NOT
+ **Equality:** EQUAL, NOT_EQUAL, LESS_THAN, LESS_THAN_OR_EQUAL, GREATER_THAN, GREATER_THAN_OR_EQUAL, IS_DISTINCT_FROM, IS_NULL
+ **Arithmetic:** ADD, SUBTRACT, MULTIPLY, DIVIDE, MODULUS, NEGATE
+ **Other:** LIKE_PATTERN, IN

### Combined pushdown example


For enhanced querying capabilities, combine the pushdown types, as in the following example:

```
SELECT * 
FROM my_table 
WHERE col_a > 10 
    AND ((col_a + col_b) > (col_c % col_d))
    AND (col_e IN ('val1', 'val2', 'val3') OR col_f LIKE '%pattern%') 
LIMIT 10;
```

## Passthrough queries


The Db2 connector supports [passthrough queries](federated-query-passthrough.md). Passthrough queries use a table function to push your full query down to the data source for execution.

To use passthrough queries with Db2, you can use the following syntax:

```
SELECT * FROM TABLE(
        system.query(
            query => 'query string'
        ))
```

The following example query pushes down a query to a data source in Db2. The query selects all columns in the `customer` table, limiting the results to 10.

```
SELECT * FROM TABLE(
        system.query(
            query => 'SELECT * FROM customer LIMIT 10'
        ))
```

## License information


By using this connector, you acknowledge the inclusion of third party components, a list of which can be found in the [pom.xml](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-db2/pom.xml) file for this connector, and agree to the terms in the respective third party licenses provided in the [LICENSE.txt](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-db2/LICENSE.txt) file on GitHub.com.

## Additional resources


For the latest JDBC driver version information, see the [pom.xml](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-db2/pom.xml) file for the Db2 connector on GitHub.com.

For additional information about this connector, visit [the corresponding site](https://github.com/awslabs/aws-athena-query-federation/tree/master/athena-db2) on GitHub.com.

# Amazon Athena IBM Db2 AS/400 (Db2 iSeries) connector
Db2 iSeries

The Amazon Athena connector for Db2 AS/400 enables Amazon Athena to run SQL queries on your IBM Db2 AS/400 (Db2 iSeries) databases using JDBC.

This connector can be registered with Glue Data Catalog as a federated catalog. It supports data access controls defined in Lake Formation at the catalog, database, table, column, row, and tag levels. This connector uses Glue Connections to centralize configuration properties in Glue.

## Prerequisites

+ Deploy the connector to your AWS account using the Athena console or the AWS Serverless Application Repository. For more information, see [Create a data source connection](connect-to-a-data-source.md) or [Use the AWS Serverless Application Repository to deploy a data source connector](connect-data-source-serverless-app-repo.md).
+ Set up a VPC and a security group before you use this connector. For more information, see [Create a VPC for a data source connector or AWS Glue connection](athena-connectors-vpc-creation.md).

## Limitations

+ Write DDL operations are not supported.
+ In a multiplexer setup, the spill bucket and prefix are shared across all database instances.
+ Any relevant Lambda limits. For more information, see [Lambda quotas](https://docs.aws.amazon.com/lambda/latest/dg/gettingstarted-limits.html) in the *AWS Lambda Developer Guide*.
+ Date and timestamp data types in filter conditions must be cast to appropriate data types.

## Terms


The following terms relate to the Db2 AS/400 connector.
+ **Database instance** – Any instance of a database deployed on premises, on Amazon EC2, or on Amazon RDS.
+ **Handler** – A Lambda handler that accesses your database instance. A handler can be for metadata or for data records.
+ **Metadata handler** – A Lambda handler that retrieves metadata from your database instance.
+ **Record handler** – A Lambda handler that retrieves data records from your database instance.
+ **Composite handler** – A Lambda handler that retrieves both metadata and data records from your database instance.
+ **Property or parameter** – A database property used by handlers to extract database information. You configure these properties as Lambda environment variables.
+ **Connection String** – A string of text used to establish a connection to a database instance.
+ **Catalog** – A non-AWS Glue catalog registered with Athena that is a required prefix for the `connection_string` property.
+ **Multiplexing handler** – A Lambda handler that can accept and use multiple database connections.

## Parameters


Use the parameters in this section to configure the Db2 AS/400 connector.

**Note**  
Athena data source connectors created on December 3, 2024 and later use AWS Glue connections.  
The parameter names and definitions listed below are for Athena data source connectors created prior to December 3, 2024. These can differ from their corresponding [AWS Glue connection properties](https://docs.aws.amazon.com/glue/latest/dg/connection-properties.html). Starting December 3, 2024, use the parameters below only when you [manually deploy](connect-data-source-serverless-app-repo.md) an earlier version of an Athena data source connector.

### Glue connections (recommended)


We recommend that you configure a Db2 AS/400 connector by using a Glue connections object. To do this, set the `glue_connection` environment variable of the Db2 AS/400 connector Lambda to the name of the Glue connection to use.

**Glue connections properties**

Use the following command to get the schema for a Glue connection object. This schema contains all the parameters that you can use to control your connection.

```
aws glue describe-connection-type --connection-type DB2AS400
```

**Lambda environment properties**
+ **glue_connection** – Specifies the name of the Glue connection associated with the federated connector.
+ **casing_mode** – (Optional) Specifies how to handle casing for schema and table names. The `casing_mode` parameter uses the following values to specify the behavior of casing:
  + **none** – Do not change the case of the given schema and table names. This is the default for connectors that have an associated Glue connection.
  + **upper** – Upper case all given schema and table names.
  + **lower** – Lower case all given schema and table names.

**Note**  
All connectors that use Glue connections must use AWS Secrets Manager to store credentials.
The Db2 AS/400 connector created using Glue connections does not support the use of a multiplexing handler.
The Db2 AS/400 connector created using Glue connections only supports `ConnectionSchemaVersion` 2.

### Legacy connections


#### Connection string


Use a JDBC connection string in the following format to connect to a database instance.

```
db2as400://${jdbc_connection_string}
```

#### Using a multiplexing handler


You can use a multiplexer to connect to multiple database instances with a single Lambda function. Requests are routed by catalog name. Use the following classes in Lambda.


****  

| Handler | Class | 
| --- | --- | 
| Composite handler | Db2MuxCompositeHandler | 
| Metadata handler | Db2MuxMetadataHandler | 
| Record handler | Db2MuxRecordHandler | 

##### Multiplexing handler parameters



****  

| Parameter | Description | 
| --- | --- | 
| catalog_connection_string | Required. A database instance connection string. Prefix the environment variable with the name of the catalog used in Athena. For example, if the catalog registered with Athena is mydb2as400catalog, then the environment variable name is mydb2as400catalog_connection_string. | 
| default | Required. The default connection string. This string is used when the catalog is lambda:${AWS_LAMBDA_FUNCTION_NAME}. | 

The following example properties are for a Db2 MUX Lambda function that supports two database instances: `db2as4001` (the default), and `db2as4002`.


****  

| Property | Value | 
| --- | --- | 
| default | db2as400://jdbc:as400://<ip_address>;<properties>;:${<secret_name>}; | 
| db2as400_catalog1_connection_string | db2as400://jdbc:as400://db2as4001.hostname/:${secret1_name} | 
| db2as400_catalog2_connection_string | db2as400://jdbc:as400://db2as4002.hostname/:${secret2_name} | 
| db2as400_catalog3_connection_string | db2as400://jdbc:as400://<ip_address>;user=<username>;password=<password>;<properties>; | 

##### Providing credentials


To provide a user name and password for your database in your JDBC connection string, you can use connection string properties or AWS Secrets Manager.
+ **Connection String** – A user name and password can be specified as properties in the JDBC connection string.
**Important**  
As a security best practice, don't use hardcoded credentials in your environment variables or connection strings. For information about moving your hardcoded secrets to AWS Secrets Manager, see [Move hardcoded secrets to AWS Secrets Manager](https://docs.aws.amazon.com/secretsmanager/latest/userguide/hardcoded.html) in the *AWS Secrets Manager User Guide*.
+ **AWS Secrets Manager** – To use the Athena Federated Query feature with AWS Secrets Manager, the VPC connected to your Lambda function should have [internet access](https://aws.amazon.com/premiumsupport/knowledge-center/internet-access-lambda-function/) or a [VPC endpoint](https://docs.aws.amazon.com/secretsmanager/latest/userguide/vpc-endpoint-overview.html) to connect to Secrets Manager.

  You can put the name of a secret in AWS Secrets Manager in your JDBC connection string. The connector replaces the secret name with the `username` and `password` values from Secrets Manager.

  For Amazon RDS database instances, this support is tightly integrated. If you use Amazon RDS, we highly recommend using AWS Secrets Manager and credential rotation. If your database does not use Amazon RDS, store the credentials as JSON in the following format:

  ```
  {"username": "${username}", "password": "${password}"}
  ```

**Example connection string with secret name**  
The following string has the secret name `${secret_name}`.

```
db2as400://jdbc:as400://<ip_address>;<properties>;:${<secret_name>};
```

The connector uses the secret name to retrieve secrets and provide the user name and password, as in the following example.

```
db2as400://jdbc:as400://<ip_address>;user=<username>;password=<password>;<properties>;
```

#### Using a single connection handler


You can use the following single connection metadata and record handlers to connect to a single Db2 AS/400 instance.


****  

| Handler type | Class | 
| --- | --- | 
| Composite handler | Db2CompositeHandler | 
| Metadata handler | Db2MetadataHandler | 
| Record handler | Db2RecordHandler | 

##### Single connection handler parameters



****  

| Parameter | Description | 
| --- | --- | 
| default | Required. The default connection string. | 

The single connection handlers support one database instance and must provide a `default` connection string parameter. All other connection strings are ignored.

The following example property is for a single Db2 AS/400 instance supported by a Lambda function.


****  

| Property | Value | 
| --- | --- | 
| default | db2as400://jdbc:as400://<ip_address>;<properties>;:${<secret_name>}; | 

#### Spill parameters


The Lambda SDK can spill data to Amazon S3. All database instances accessed by the same Lambda function spill to the same location.


****  

| Parameter | Description | 
| --- | --- | 
| spill_bucket | Required. Spill bucket name. | 
| spill_prefix | Required. Spill bucket key prefix. | 
| spill_put_request_headers | (Optional) A JSON encoded map of request headers and values for the Amazon S3 putObject request that is used for spilling (for example, {"x-amz-server-side-encryption" : "AES256"}). For other possible headers, see [PutObject](https://docs.aws.amazon.com/AmazonS3/latest/API/API_PutObject.html) in the *Amazon Simple Storage Service API Reference*. | 

## Data type support


The following table shows the corresponding data types for JDBC and Apache Arrow.


****  

| Db2 AS/400 | Arrow | 
| --- | --- | 
| CHAR | VARCHAR | 
| VARCHAR | VARCHAR | 
| DATE | DATEDAY | 
| TIME | VARCHAR | 
| TIMESTAMP | DATEMILLI | 
| DATETIME | DATEMILLI | 
| BOOLEAN | BOOL | 
| SMALLINT | SMALLINT | 
| INTEGER | INT | 
| BIGINT | BIGINT | 
| DECIMAL | DECIMAL | 
| REAL | FLOAT8 | 
| DOUBLE | FLOAT8 | 
| DECFLOAT | FLOAT8 | 

## Partitions and splits


A partition is represented by one or more partition columns of type `varchar`. The Db2 AS/400 connector creates partitions using the following organization schemes.
+ Distribute by hash
+ Partition by range
+ Organize by dimensions

The connector retrieves partition details such as the number of partitions and column name from one or more Db2 AS/400 metadata tables. Splits are created based upon the number of partitions identified. 

## Performance


For improved performance, use predicate pushdown to query from Athena, as in the following examples.

```
SELECT * FROM "lambda:<LAMBDA_NAME>"."<SCHEMA_NAME>"."<TABLE_NAME>" 
 WHERE integercol = 2147483647
```

```
SELECT * FROM "lambda:<LAMBDA_NAME>"."<SCHEMA_NAME>"."<TABLE_NAME>" 
 WHERE timestampcol >= TIMESTAMP '2018-03-25 07:30:58.878'
```

## Passthrough queries


The Db2 AS/400 connector supports [passthrough queries](federated-query-passthrough.md). Passthrough queries use a table function to push your full query down to the data source for execution.

To use passthrough queries with Db2 AS/400, you can use the following syntax:

```
SELECT * FROM TABLE(
        system.query(
            query => 'query string'
        ))
```

The following example query pushes down a query to a data source in Db2 AS/400. The query selects all columns in the `customer` table, limiting the results to 10.

```
SELECT * FROM TABLE(
        system.query(
            query => 'SELECT * FROM customer LIMIT 10'
        ))
```

## License information


By using this connector, you acknowledge the inclusion of third party components, a list of which can be found in the [pom.xml](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-db2-as400/pom.xml) file for this connector, and agree to the terms in the respective third party licenses provided in the [LICENSE.txt](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-db2-as400/LICENSE.txt) file on GitHub.com.

## Additional resources


For the latest JDBC driver version information, see the [pom.xml](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-db2-as400/pom.xml) file for the Db2 AS/400 connector on GitHub.com.

For additional information about this connector, visit [the corresponding site](https://github.com/awslabs/aws-athena-query-federation/tree/master/athena-db2-as400) on GitHub.com.

# Amazon Athena DocumentDB connector
DocumentDB

The Amazon Athena DocumentDB connector enables Athena to communicate with your DocumentDB instances so that you can query your DocumentDB data with SQL. The connector also works with any endpoint that is compatible with MongoDB.

Unlike traditional relational data stores, Amazon DocumentDB collections do not have a set schema, and DocumentDB does not have a metadata store. Each entry in a DocumentDB collection can have different fields and data types.

The DocumentDB connector supports two mechanisms for generating table schema information: basic schema inference and AWS Glue Data Catalog metadata.

Schema inference is the default. This option scans a small number of documents in your collection, forms a union of all fields, and coerces fields that have non-overlapping data types. This option works well for collections that have mostly uniform entries.

For collections with a greater variety of data types, the connector supports retrieving metadata from the AWS Glue Data Catalog. If the connector finds an AWS Glue database and table that match your DocumentDB database and collection names, it gets its schema information from the corresponding AWS Glue table. When you create your AWS Glue table, we recommend that you make it a superset of all fields that you might want to access from your DocumentDB collection.

If you have Lake Formation enabled in your account, the IAM role for your Athena federated Lambda connector that you deployed in the AWS Serverless Application Repository must have read access in Lake Formation to the AWS Glue Data Catalog.

This connector can be registered with Glue Data Catalog as a federated catalog. It supports data access controls defined in Lake Formation at the catalog, database, table, column, row, and tag levels. This connector uses Glue Connections to centralize configuration properties in Glue.
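
After deployment, you query a DocumentDB collection like any other Athena table. In the following sketch, the Lambda function, database, and collection names are placeholders for your own values:

```
-- Query a DocumentDB collection through the connector
-- (assumes a connector Lambda function named "docdb")
SELECT * FROM "lambda:docdb".mydatabase.mycollection LIMIT 10;
```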

## Prerequisites

+ Deploy the connector to your AWS account using the Athena console or the AWS Serverless Application Repository. For more information, see [Create a data source connection](connect-to-a-data-source.md) or [Use the AWS Serverless Application Repository to deploy a data source connector](connect-data-source-serverless-app-repo.md).

## Parameters


Use the parameters in this section to configure the DocumentDB connector.

**Note**  
Athena data source connectors created on December 3, 2024 and later use AWS Glue connections.  
The parameter names and definitions listed below are for Athena data source connectors created prior to December 3, 2024. These can differ from their corresponding [AWS Glue connection properties](https://docs.aws.amazon.com/glue/latest/dg/connection-properties.html). Starting December 3, 2024, use the parameters below only when you [manually deploy](connect-data-source-serverless-app-repo.md) an earlier version of an Athena data source connector.

### Glue connections (recommended)


We recommend that you configure a DocumentDB connector by using a Glue connections object. To do this, set the `glue_connection` environment variable of the DocumentDB connector Lambda to the name of the Glue connection to use.

**Glue connections properties**

Use the following command to get the schema for a Glue connection object. This schema contains all the parameters that you can use to control your connection.

```
aws glue describe-connection-type --connection-type DOCUMENTDB
```

**Lambda environment properties**
+ **glue_connection** – Specifies the name of the Glue connection associated with the federated connector.

**Note**  
All connectors that use Glue connections must use AWS Secrets Manager to store credentials.
The DocumentDB connector created using Glue connections does not support the use of a multiplexing handler.
The DocumentDB connector created using Glue connections only supports `ConnectionSchemaVersion` 2.

### Legacy connections

+ **spill\_bucket** – Specifies the Amazon S3 bucket for data that exceeds Lambda function limits.
+ **spill\_prefix** – (Optional) Defaults to a subfolder in the specified `spill_bucket` called `athena-federation-spill`. We recommend that you configure an Amazon S3 [storage lifecycle](https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-lifecycle-mgmt.html) on this location to delete spills older than a predetermined number of days or hours.
+ **spill\_put\_request\_headers** – (Optional) A JSON encoded map of request headers and values for the Amazon S3 `putObject` request that is used for spilling (for example, `{"x-amz-server-side-encryption" : "AES256"}`). For other possible headers, see [PutObject](https://docs.aws.amazon.com/AmazonS3/latest/API/API_PutObject.html) in the *Amazon Simple Storage Service API Reference*.
+ **kms\_key\_id** – (Optional) By default, any data that is spilled to Amazon S3 is encrypted using the AES-GCM authenticated encryption mode and a randomly generated key. To have your Lambda function use stronger encryption keys generated by KMS like `a7e63k4b-8loc-40db-a2a1-4d0en2cd8331`, you can specify a KMS key ID.
+ **disable\_spill\_encryption** – (Optional) When set to `True`, disables spill encryption. Defaults to `False` so that data that is spilled to S3 is encrypted using AES-GCM – either using a randomly generated key or KMS to generate keys. Disabling spill encryption can improve performance, especially if your spill location uses [server-side encryption](https://docs.aws.amazon.com/AmazonS3/latest/userguide/serv-side-encryption.html).
+ **disable\_glue** – (Optional) If present and set to true, the connector does not attempt to retrieve supplemental metadata from AWS Glue.
+ **glue\_catalog** – (Optional) Use this option to specify a [cross-account AWS Glue catalog](data-sources-glue-cross-account.md). By default, the connector attempts to get metadata from its own AWS Glue account.
+ **default\_docdb** – If present, specifies a DocumentDB connection string to use when no catalog-specific environment variable exists.
+ **disable\_projection\_and\_casing** – (Optional) Disables projection and casing. Use if you want to query Amazon DocumentDB tables that use case sensitive column names. The `disable_projection_and_casing` parameter uses the following values to specify the behavior of casing and column mapping:
  + **false** – This is the default setting. Projection is enabled, and the connector expects all column names to be in lower case. 
  + **true** – Disables projection and casing. When using the `disable_projection_and_casing` parameter, keep in mind the following points: 
    + Use of the parameter can result in higher bandwidth usage. Additionally, if your Lambda function is not in the same AWS Region as your data source, you will incur higher standard AWS cross-region transfer costs as a result of the higher bandwidth usage. For more information about cross-region transfer costs, see [AWS Data Transfer Charges for Server and Serverless Architectures](https://aws.amazon.com/blogs/apn/aws-data-transfer-charges-for-server-and-serverless-architectures/) in the AWS Partner Network Blog.
    + Because a larger number of bytes is transferred and because the larger number of bytes requires a higher deserialization time, overall latency can increase. 
+ **enable\_case\_insensitive\_match** – (Optional) When `true`, performs case insensitive searches against schema and table names in Amazon DocumentDB. The default is `false`. Use if your query contains uppercase schema or table names.
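
As an illustration of the format that `spill_put_request_headers` expects, the following sketch builds the JSON-encoded header map using the example header from the parameter description above:

```python
import json

# Illustrative sketch: build the JSON-encoded map expected by the
# spill_put_request_headers environment variable. The header shown
# is the example from the parameter description above.
spill_headers = json.dumps({"x-amz-server-side-encryption": "AES256"})
print(spill_headers)  # → {"x-amz-server-side-encryption": "AES256"}
```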

#### Specifying connection strings


You can provide one or more properties that define the DocumentDB connection details for the DocumentDB instances that you use with the connector. To do this, set a Lambda environment variable that corresponds to the catalog name that you want to use in Athena. For example, suppose you want to use the following queries to query two different DocumentDB instances from Athena:

```
SELECT * FROM "docdb_instance_1".database.table
```

```
SELECT * FROM "docdb_instance_2".database.table
```

Before you can use these two SQL statements, you must add two environment variables to your Lambda function: `docdb_instance_1` and `docdb_instance_2`. The value for each should be a DocumentDB connection string in the following format:

```
mongodb://<username>:<password>@<hostname>:<port>/?ssl=true&ssl_ca_certs=rds-combined-ca-bundle.pem&replicaSet=rs0
```
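
For illustration, a small helper that assembles a connection string in this format might look like the following (a sketch only; the function name and the host, port, and credential values are hypothetical):

```python
# Illustrative helper (not part of the connector): assemble a DocumentDB
# connection string in the format shown above.
def build_docdb_connection_string(user, password, host, port):
    return (
        f"mongodb://{user}:{password}@{host}:{port}/"
        "?ssl=true&ssl_ca_certs=rds-combined-ca-bundle.pem&replicaSet=rs0"
    )

# Hypothetical values for demonstration.
print(build_docdb_connection_string("athena", "s3cret", "docdb.example.com", 27017))
```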

##### Using secrets


You can optionally use AWS Secrets Manager for part or all of the value for your connection string details. To use the Athena Federated Query feature with Secrets Manager, the VPC connected to your Lambda function should have [internet access](https://aws.amazon.com/premiumsupport/knowledge-center/internet-access-lambda-function/) or a [VPC endpoint](https://docs.aws.amazon.com/secretsmanager/latest/userguide/vpc-endpoint-overview.html) to connect to Secrets Manager.

If you use the syntax `${my_secret}` to put the name of a secret from Secrets Manager in your connection string, the connector replaces `${my_secret}` with its plain text value from Secrets Manager. Secrets should be stored as plain text in the form `<username>:<password>`. Secrets stored as `{username:<username>,password:<password>}` are not passed to the connection string properly.

A secret can also be used for the entire connection string, with the username and password defined within the secret.

For example, suppose you set the Lambda environment variable for `docdb_instance_1` to the following value:

```
mongodb://${docdb_instance_1_creds}@myhostname.com:123/?ssl=true&ssl_ca_certs=rds-combined-ca-bundle.pem&replicaSet=rs0         
```

The Athena Query Federation SDK automatically attempts to retrieve a secret named `docdb_instance_1_creds` from Secrets Manager and inject that value in place of `${docdb_instance_1_creds}`. Any part of the connection string that is enclosed by the `${ }` character combination is interpreted as a secret from Secrets Manager. If you specify a secret name that the connector cannot find in Secrets Manager, the connector does not replace the text.
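
The substitution behavior described above can be sketched as follows (illustrative only; `lookup` stands in for the Secrets Manager call, and names that cannot be resolved are left unchanged, matching the connector's documented behavior):

```python
import re

# Sketch of the ${secret_name} substitution behavior described above.
# lookup is a stand-in for a Secrets Manager retrieval; names it cannot
# resolve are left untouched.
def resolve_secrets(connection_string, lookup):
    def replace(match):
        value = lookup(match.group(1))
        return value if value is not None else match.group(0)
    return re.sub(r"\$\{([^}]+)\}", replace, connection_string)

secrets = {"docdb_instance_1_creds": "athena:s3cret"}  # hypothetical secret
print(resolve_secrets("mongodb://${docdb_instance_1_creds}@myhostname.com:123/",
                      secrets.get))
# → mongodb://athena:s3cret@myhostname.com:123/
```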

## Retrieving supplemental metadata


To retrieve supplemental metadata, follow these steps to configure your Glue database and table.

### Set up the Glue database


1. Create a Glue database with the same name as your DocumentDB collection.

1. In the Location URI field, enter `docdb-metadata-flag`.

### Configure the Glue table


Add the following parameters to your Glue table:
+ `docdb-metadata-flag = true`
+ `columnMapping = apple=APPLE`

  In this example, `apple` represents the lowercase column name in Glue, and `APPLE` represents the actual case-sensitive column name in your DocumentDB collection.

### Verify metadata retrieval


1. Run your query.

1. Check the Lambda function's CloudWatch logs for successful metadata retrieval. A successful retrieval will show the following log entry:

   ```
   doGetTable: Retrieved schema for table[TableName{schemaName=test, tableName=profiles}] from AWS Glue.
   ```

**Note**  
If your table already has a `columnMapping` field configured, you only need to add the `docdb-metadata-flag = true` parameter to the table properties.

## Setting up databases and tables in AWS Glue


Because the connector's built-in schema inference capability scans a limited number of documents and supports only a subset of data types, you might want to use AWS Glue for metadata instead.

To enable an AWS Glue table for use with Amazon DocumentDB, you must have an AWS Glue database and table for the DocumentDB database and collection that you want to supply supplemental metadata for.

**To use an AWS Glue table for supplemental metadata**

1. Use the AWS Glue console to create an AWS Glue database that has the same name as your Amazon DocumentDB database name.

1. Set the URI property of the database to include **docdb-metadata-flag**.

1. (Optional) Add the **sourceTable** table property. This property defines the source table name in Amazon DocumentDB. Use this property if your AWS Glue table has a different name from the table name in Amazon DocumentDB. Differences in naming rules between AWS Glue and Amazon DocumentDB can make this necessary. For example, capital letters are not permitted in AWS Glue table names, but they are permitted in Amazon DocumentDB table names.

1. (Optional) Add the **columnMapping** table property. This property defines column name mappings. Use this property if AWS Glue column naming rules prevent you from creating an AWS Glue table that has the same column names as those in your Amazon DocumentDB table. This can be useful because capital letters are permitted in Amazon DocumentDB column names but are not permitted in AWS Glue column names.

   The `columnMapping` property value is expected to be a set of mappings in the format `col1=Col1,col2=Col2`.
**Note**  
 Column mapping applies only to top level column names and not to nested fields. 

   After you add the AWS Glue `columnMapping` table property, you can remove the `disable_projection_and_casing` Lambda environment variable.

1. Make sure that you use the data types appropriate for AWS Glue as listed in this document.
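
The `columnMapping` property format described in the steps above (`col1=Col1,col2=Col2`) can be parsed into a mapping as in this illustrative sketch:

```python
# Sketch: parse a columnMapping value of the form "col1=Col1,col2=Col2"
# into a lowercase-Glue-name -> actual-source-name dict. Illustrative only.
def parse_column_mapping(value):
    pairs = (item.split("=", 1) for item in value.split(",") if item)
    return {glue_name: source_name for glue_name, source_name in pairs}

print(parse_column_mapping("apple=APPLE,pear=Pear"))
```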

## Data type support


This section lists the data types that the DocumentDB connector uses for schema inference, and the data types when AWS Glue metadata is used.

### Schema inference data types


The schema inference feature of the DocumentDB connector attempts to infer values as belonging to one of the following data types. The table shows the corresponding data types for Amazon DocumentDB, Java, and Apache Arrow.



| Apache Arrow | Java or DocDB | 
| --- | --- | 
| VARCHAR | String | 
| INT | Integer | 
| BIGINT | Long | 
| BIT | Boolean | 
| FLOAT4 | Float | 
| FLOAT8 | Double | 
| TIMESTAMPSEC | Date | 
| VARCHAR | ObjectId | 
| LIST | List | 
| STRUCT | Document | 
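
If you need to check inferred schemas programmatically, the table above can be expressed as a simple lookup (an illustrative sketch; the connector's internal representation may differ):

```python
# The schema-inference mapping from the table above, as a lookup table.
# Illustrative only; not the connector's actual implementation.
DOCDB_TO_ARROW = {
    "String": "VARCHAR",
    "Integer": "INT",
    "Long": "BIGINT",
    "Boolean": "BIT",
    "Float": "FLOAT4",
    "Double": "FLOAT8",
    "Date": "TIMESTAMPSEC",
    "ObjectId": "VARCHAR",
    "List": "LIST",
    "Document": "STRUCT",
}
print(DOCDB_TO_ARROW["Double"])  # → FLOAT8
```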

### AWS Glue data types


If you use AWS Glue for supplemental metadata, you can configure the following data types. The table shows the corresponding data types for AWS Glue and Apache Arrow.



| AWS Glue | Apache Arrow | 
| --- | --- | 
| int | INT | 
| bigint | BIGINT | 
| double | FLOAT8 | 
| float | FLOAT4 | 
| boolean | BIT | 
| binary | VARBINARY | 
| string | VARCHAR | 
| List | LIST | 
| Struct | STRUCT | 

## Required Permissions


For full details on the IAM policies that this connector requires, review the `Policies` section of the [athena-docdb.yaml](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-docdb/athena-docdb.yaml) file. The following list summarizes the required permissions.
+ **Amazon S3 write access** – The connector requires write access to a location in Amazon S3 in order to spill results from large queries.
+ **Athena GetQueryExecution** – The connector uses this permission to fast-fail when the upstream Athena query has terminated.
+ **AWS Glue Data Catalog** – The DocumentDB connector requires read only access to the AWS Glue Data Catalog to obtain schema information.
+ **CloudWatch Logs** – The connector requires access to CloudWatch Logs for storing logs.
+ **AWS Secrets Manager read access** – If you choose to store DocumentDB endpoint details in Secrets Manager, you must grant the connector access to those secrets.
+ **VPC access** – The connector requires the ability to attach and detach interfaces to your VPC so that it can connect to it and communicate with your DocumentDB instances.

## Performance


The Athena Amazon DocumentDB connector does not currently support parallel scans, but it attempts to push down predicates as part of its DocumentDB queries. Predicates against indexes on your DocumentDB collection result in significantly less data scanned.

The Lambda function performs projection pushdown to decrease the data scanned by the query. However, selecting a subset of columns sometimes results in a longer query execution runtime. `LIMIT` clauses reduce the amount of data scanned, but if you don't provide a predicate, you should expect `SELECT` queries with a `LIMIT` clause to scan at least 16 MB of data.

## Passthrough queries


The Athena Amazon DocumentDB connector supports [passthrough queries](federated-query-passthrough.md). Passthrough queries for this connector are NoSQL based. For information about querying Amazon DocumentDB, see [Querying](https://docs.aws.amazon.com/documentdb/latest/developerguide/querying.html) in the *Amazon DocumentDB Developer Guide*.

To use passthrough queries with Amazon DocumentDB, use the following syntax:

```
SELECT * FROM TABLE(
        system.query(
            database => 'database_name',
            collection => 'collection_name',
            filter => '{query_syntax}'
        ))
```

The following example queries the `tpcds` collection in the `example` database, filtering for all books with the title *Bill of Rights*.

```
SELECT * FROM TABLE(
        system.query(
            database => 'example',
            collection => 'tpcds',
            filter => '{title: "Bill of Rights"}'
        ))
```

## Additional resources

+ For an article on using [Amazon Athena Federated Query](federated-queries.md) to connect a MongoDB database to [Amazon QuickSight](https://aws.amazon.com/quicksight/) to build dashboards and visualizations, see [Visualize MongoDB data from Amazon QuickSight using Amazon Athena Federated Query](https://aws.amazon.com/blogs/big-data/visualize-mongodb-data-from-amazon-quicksight-using-amazon-athena-federated-query/) in the *AWS Big Data Blog*.
+ For additional information about this connector, visit [the corresponding site](https://github.com/awslabs/aws-athena-query-federation/tree/master/athena-docdb) on GitHub.com.

# Amazon Athena DynamoDB connector
DynamoDB

The Amazon Athena DynamoDB connector enables Amazon Athena to communicate with DynamoDB so that you can query your tables with SQL. Write operations like [INSERT INTO](insert-into.md) are not supported.

This connector can be registered with Glue Data Catalog as a federated catalog. It supports data access controls defined in Lake Formation at the catalog, database, table, column, row, and tag levels. This connector uses Glue Connections to centralize configuration properties in Glue.

If you have Lake Formation enabled in your account, the IAM role for your Athena federated Lambda connector that you deployed in the AWS Serverless Application Repository must have read access in Lake Formation to the AWS Glue Data Catalog.

## Prerequisites

+ Deploy the connector to your AWS account using the Athena console or the AWS Serverless Application Repository. For more information, see [Create a data source connection](connect-to-a-data-source.md) or [Use the AWS Serverless Application Repository to deploy a data source connector](connect-data-source-serverless-app-repo.md).

## Limitations


If you migrate your DynamoDB connections to the Glue Data Catalog and Lake Formation, only lowercase table and column names are recognized.

## Parameters


Use the parameters in this section to configure the DynamoDB connector.

### Glue connections (recommended)


We recommend that you configure a DynamoDB connector by using a Glue connections object. To do this, set the `glue_connection` environment variable of the DynamoDB connector Lambda to the name of the Glue connection to use.

**Glue connections properties**

Use the following command to get the schema for a Glue connection object. This schema contains all the parameters that you can use to control your connection.

```
aws glue describe-connection-type --connection-type DYNAMODB
```

**Lambda environment properties**

**glue\_connection** – Specifies the name of the Glue connection associated with the federated connector.

**Note**  
All connectors that use Glue connections must use AWS Secrets Manager to store credentials.
The DynamoDB connector created using Glue connections does not support the use of a multiplexing handler.
The DynamoDB connector created using Glue connections only supports `ConnectionSchemaVersion` 2.

### Legacy connections


**Note**  
Athena data source connectors created on December 3, 2024 and later use AWS Glue connections.

The parameter names and definitions listed below are for Athena data source connectors created without an associated Glue connection. Use the following parameters only when you [manually deploy](connect-data-source-serverless-app-repo.md) an earlier version of an Athena data source connector or when the `glue_connection` environment property is not specified.

**Lambda environment properties**
+ **spill\_bucket** – Specifies the Amazon S3 bucket for data that exceeds Lambda function limits.
+ **spill\_prefix** – (Optional) Defaults to a subfolder in the specified `spill_bucket` called `athena-federation-spill`. We recommend that you configure an Amazon S3 [storage lifecycle](https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-lifecycle-mgmt.html) on this location to delete spills older than a predetermined number of days or hours.
+ **spill\_put\_request\_headers** – (Optional) A JSON encoded map of request headers and values for the Amazon S3 `putObject` request that is used for spilling (for example, `{"x-amz-server-side-encryption" : "AES256"}`). For other possible headers, see [PutObject](https://docs.aws.amazon.com/AmazonS3/latest/API/API_PutObject.html) in the *Amazon Simple Storage Service API Reference*.
+ **kms\_key\_id** – (Optional) By default, any data that is spilled to Amazon S3 is encrypted using the AES-GCM authenticated encryption mode and a randomly generated key. To have your Lambda function use stronger encryption keys generated by KMS like `a7e63k4b-8loc-40db-a2a1-4d0en2cd8331`, you can specify a KMS key ID.
+ **disable\_spill\_encryption** – (Optional) When set to `True`, disables spill encryption. Defaults to `False` so that data that is spilled to S3 is encrypted using AES-GCM – either using a randomly generated key or KMS to generate keys. Disabling spill encryption can improve performance, especially if your spill location uses [server-side encryption](https://docs.aws.amazon.com/AmazonS3/latest/userguide/serv-side-encryption.html).
+ **disable\_glue** – (Optional) If present and set to true, the connector does not attempt to retrieve supplemental metadata from AWS Glue.
+ **glue\_catalog** – (Optional) Use this option to specify a [cross-account AWS Glue catalog](data-sources-glue-cross-account.md). By default, the connector attempts to get metadata from its own AWS Glue account.
+ **disable\_projection\_and\_casing** – (Optional) Disables projection and casing. Use if you want to query DynamoDB tables that have casing in their column names and you do not want to specify a `columnMapping` property on your AWS Glue table.

  The `disable_projection_and_casing` parameter uses the following values to specify the behavior of casing and column mapping:
  + **auto** – Disables projection and casing when a previously unsupported type is detected and column name mapping is not set on the table. This is the default setting.
  + **always** – Disables projection and casing unconditionally. This is useful when you have casing in your DynamoDB column names but do not want to specify any column name mapping.

  When using the `disable_projection_and_casing` parameter, keep in mind the following points:
  + Use of the parameter can result in higher bandwidth usage. Additionally, if your Lambda function is not in the same AWS Region as your data source, you will incur higher standard AWS cross-region transfer costs as a result of the higher bandwidth usage. For more information about cross-region transfer costs, see [AWS Data Transfer Charges for Server and Serverless Architectures](https://aws.amazon.com/blogs/apn/aws-data-transfer-charges-for-server-and-serverless-architectures/) in the AWS Partner Network Blog.
  + Because a larger number of bytes is transferred and because the larger number of bytes requires a higher deserialization time, overall latency can increase. 

## Setting up databases and tables in AWS Glue


Because the connector's built-in schema inference capability is limited, you might want to use AWS Glue for metadata. To do this, you must have a database and table in AWS Glue. To enable them for use with DynamoDB, you must edit their properties.

**To edit database properties in the AWS Glue console**

1. Sign in to the AWS Management Console and open the AWS Glue console at [https://console.aws.amazon.com/glue/](https://console.aws.amazon.com/glue/).

1. In the navigation pane, expand **Data Catalog**, and then choose **Databases**.

   On the **Databases** page, you can edit an existing database, or choose **Add database** to create one.

1. In the list of databases, choose the link for the database that you want to edit.

1. Choose **Edit**.

1. On the **Update a database** page, under **Database settings**, for **Location**, add the string **dynamo-db-flag**. This keyword indicates that the database contains tables that the Athena DynamoDB connector is using for supplemental metadata and is required for AWS Glue databases other than `default`. The `dynamo-db-flag` property is useful for filtering out databases in accounts with many databases.

1. Choose **Update Database**.

**To edit table properties in the AWS Glue console**

1. Sign in to the AWS Management Console and open the AWS Glue console at [https://console.aws.amazon.com/glue/](https://console.aws.amazon.com/glue/).

1. In the navigation pane, expand **Data Catalog**, and then choose **Tables**.

1. On the **Tables** page, in the list of tables, choose the linked name of the table that you want to edit.

1. Choose **Actions**, **Edit table**.

1. On the **Edit table** page, in the **Table properties** section, add the following table properties as required. If you use the AWS Glue DynamoDB crawler, these properties are automatically set.
   + **dynamodb** – String that indicates to the Athena DynamoDB connector that the table can be used for supplemental metadata. Enter the `dynamodb` string in the table properties under a field called **classification** (exact match).
**Note**  
The **Set table properties** page that is part of the table creation process in the AWS Glue console has a **Data format** section with a **Classification** field. You cannot enter or choose `dynamodb` here. Instead, after you create your table, follow the steps to edit the table and to enter `classification` and `dynamodb` as a key-value pair in the **Table properties** section.
   + **sourceTable** – Optional table property that defines the source table name in DynamoDB. Use this if AWS Glue table naming rules prevent you from creating an AWS Glue table with the same name as your DynamoDB table. For example, capital letters are not permitted in AWS Glue table names, but they are permitted in DynamoDB table names.
   + **columnMapping** – Optional table property that defines column name mappings. Use this if AWS Glue column naming rules prevent you from creating an AWS Glue table with the same column names as your DynamoDB table. For example, capital letters are not permitted in AWS Glue column names but are permitted in DynamoDB column names. The property value is expected to be in the format `col1=Col1,col2=Col2`. Note that column mapping applies only to top level column names and not to nested fields.
   + **defaultTimeZone** – Optional table property that is applied to `date` or `datetime` values that do not have an explicit time zone. Setting this value is a good practice to avoid discrepancies between the data source default time zone and the Athena session time zone.
   + **datetimeFormatMapping** – Optional table property that specifies the `date` or `datetime` format to use when parsing data from a column of the AWS Glue `date` or `timestamp` data type. If this property is not specified, the connector attempts to [infer](https://commons.apache.org/proper/commons-lang/apidocs/org/apache/commons/lang3/time/DateFormatUtils.html) an ISO-8601 format. If the connector cannot infer the `date` or `datetime` format or parse the raw string, then the value is omitted from the result. 

     The `datetimeFormatMapping` value should be in the format `col1=someformat1,col2=someformat2`. Following are some example formats:

     ```
     yyyyMMdd'T'HHmmss 
     ddMMyyyy'T'HH:mm:ss
     ```

     If your column has `date` or `datetime` values without a time zone and you want to use the column in the `WHERE` clause, set the `datetimeFormatMapping` property for the column.

1. If you define your columns manually, make sure that you use the appropriate data types. If you used a crawler, validate the columns and types that the crawler discovered.

1. Choose **Save**.
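
Putting the steps above together, the table properties might be assembled as follows before being applied, for example, as the `Parameters` map of a Glue `TableInput` (an illustrative sketch; all names and values are hypothetical):

```python
# Illustrative sketch: the table properties described in the steps above,
# as a parameters dict. All values are hypothetical examples.
table_parameters = {
    "classification": "dynamodb",            # marks the table for the connector
    "sourceTable": "MyDynamoDBTable",        # actual DynamoDB table name
    "columnMapping": "col1=Col1,col2=Col2",  # Glue name = DynamoDB name pairs
    "defaultTimeZone": "UTC",
    "datetimeFormatMapping": "col1=yyyyMMdd'T'HHmmss",
}
print(sorted(table_parameters))
```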

## Required Permissions


For full details on the IAM policies that this connector requires, review the `Policies` section of the [athena-dynamodb.yaml](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-dynamodb/athena-dynamodb.yaml) file. The following list summarizes the required permissions.
+ **Amazon S3 write access** – The connector requires write access to a location in Amazon S3 in order to spill results from large queries.
+ **Athena GetQueryExecution** – The connector uses this permission to fast-fail when the upstream Athena query has terminated.
+ **AWS Glue Data Catalog** – The DynamoDB connector requires read only access to the AWS Glue Data Catalog to obtain schema information.
+ **CloudWatch Logs** – The connector requires access to CloudWatch Logs for storing logs.
+ **DynamoDB read access** – The connector uses the `DescribeTable`, `ListSchemas`, `ListTables`, `Query`, and `Scan` API operations.

## Performance


The Athena DynamoDB connector supports parallel scans and attempts to push down predicates as part of its DynamoDB queries. A hash key predicate with `X` distinct values results in `X` query calls to DynamoDB. All other predicate scenarios result in `Y` number of scan calls, where `Y` is heuristically determined based on the size of your table and its provisioned throughput. However, selecting a subset of columns sometimes results in a longer query execution runtime.

`LIMIT` clauses and simple predicates are pushed down, which can reduce the amount of data scanned and decrease query execution run time.

### LIMIT clauses


A `LIMIT N` statement reduces the data scanned by the query. With `LIMIT N` pushdown, the connector returns only `N` rows to Athena.

### Predicates


A predicate is an expression in the `WHERE` clause of a SQL query that evaluates to a Boolean value and filters rows based on multiple conditions. For enhanced functionality, and to reduce the amount of data scanned, the Athena DynamoDB connector can combine these expressions and push them directly to DynamoDB.

The following Athena DynamoDB connector operators support predicate pushdown:
+ **Boolean:** AND
+ **Equality:** EQUAL, NOT\_EQUAL, LESS\_THAN, LESS\_THAN\_OR\_EQUAL, GREATER\_THAN, GREATER\_THAN\_OR\_EQUAL, IS\_NULL

### Combined pushdown example


For enhanced querying capabilities, combine the pushdown types, as in the following example:

```
SELECT *
FROM my_table
WHERE col_a > 10 and col_b < 10
LIMIT 10
```

For an article on using predicate pushdown to improve performance in federated queries, including DynamoDB, see [Improve federated queries with predicate pushdown in Amazon Athena](https://aws.amazon.com/blogs/big-data/improve-federated-queries-with-predicate-pushdown-in-amazon-athena/) in the *AWS Big Data Blog*.

## Passthrough queries


The DynamoDB connector supports [passthrough queries](federated-query-passthrough.md) and uses PartiQL syntax. The DynamoDB [GetItem](https://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_GetItem.html) API operation is not supported. For information about querying DynamoDB using PartiQL, see [PartiQL select statements for DynamoDB](https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/ql-reference.select.html) in the *Amazon DynamoDB Developer Guide*.

To use passthrough queries with DynamoDB, use the following syntax:

```
SELECT * FROM TABLE(
        system.query(
            query => 'query_string'
        ))
```

The following DynamoDB passthrough query example uses PartiQL to return a list of Fire TV Stick devices that have a `DateWatched` property later than 12/24/22.

```
SELECT * FROM TABLE(
        system.query(
           query => 'SELECT Devices 
                       FROM WatchList 
                       WHERE Devices.FireStick.DateWatched[0] > ''12/24/22'''
        ))
```

## Troubleshooting


### Multiple filters on a sort key column


**Error message**: KeyConditionExpressions must only contain one condition per key

**Cause**: This issue can occur in Athena engine version 3 in queries that have both a lower and upper bounded filter on a DynamoDB sort key column. Because DynamoDB does not support more than one filter condition on a sort key, an error is thrown when the connector attempts to push down a query that has both conditions applied.

**Solution**: Update the connector to version 2023.11.1 or later. For instructions on updating a connector, see [Update a data source connector](connectors-updating.md).

## Costs


The costs for use of the connector depend on the underlying AWS resources that are used. Because queries that use scans can consume a large number of [read capacity units (RCUs)](https://aws.amazon.com/dynamodb/pricing/provisioned/), consider the information for [Amazon DynamoDB pricing](https://aws.amazon.com/dynamodb/pricing/) carefully.

## Additional resources

+ For an introduction to using the Amazon Athena DynamoDB connector, see [Access, query, and join Amazon DynamoDB tables using Athena](https://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/access-query-and-join-amazon-dynamodb-tables-using-athena.html) in the *AWS Prescriptive Guidance Patterns* guide. 
+ For an article on how to use the Athena DynamoDB connector to query data in DynamoDB with SQL and visualize insights in Amazon QuickSight, see the *AWS Big Data Blog* post [Visualize Amazon DynamoDB insights in Amazon QuickSight using the Amazon Athena DynamoDB connector and AWS Glue](https://aws.amazon.com/blogs/big-data/visualize-amazon-dynamodb-insights-in-amazon-quicksight-using-the-amazon-athena-dynamodb-connector-and-aws-glue/).
+ For an article on using the Amazon Athena DynamoDB connector with Amazon DynamoDB, Athena, and Amazon QuickSight to create a simple governance dashboard, see the *AWS Big Data Blog* post [Query cross-account Amazon DynamoDB tables using Amazon Athena Federated Query](https://aws.amazon.com/blogs/big-data/query-cross-account-amazon-dynamodb-tables-using-amazon-athena-federated-query/).
+ For additional information about this connector, visit [the corresponding site](https://github.com/awslabs/aws-athena-query-federation/tree/master/athena-dynamodb) on GitHub.com.

# Amazon Athena Google BigQuery connector
Google BigQuery

The Amazon Athena connector for Google [BigQuery](https://cloud.google.com/bigquery/) enables Amazon Athena to run SQL queries on your Google BigQuery data.

This connector can be registered with Glue Data Catalog as a federated catalog. It supports data access controls defined in Lake Formation at the catalog, database, table, column, row, and tag levels. This connector uses Glue Connections to centralize configuration properties in Glue.

## Prerequisites

+ Deploy the connector to your AWS account using the Athena console or the AWS Serverless Application Repository. For more information, see [Create a data source connection](connect-to-a-data-source.md) or [Use the AWS Serverless Application Repository to deploy a data source connector](connect-data-source-serverless-app-repo.md).

## Limitations

+ Lambda functions have a maximum timeout value of 15 minutes. Each split executes a query on BigQuery and must finish with enough time to store the results for Athena to read. If the Lambda function times out, the query fails.
+ Google BigQuery is case sensitive. The connector attempts to correct the case of dataset names, table names, and project IDs because Athena lowercases all metadata. These corrections make many extra calls to Google BigQuery.
+ Binary data types are not supported.
+ Because of Google BigQuery concurrency and quota limits, the connector may encounter Google quota limit issues. To avoid these issues, push as many constraints to Google BigQuery as feasible. For information about BigQuery quotas, see [Quotas and limits](https://cloud.google.com/bigquery/quotas) in the Google BigQuery documentation.

## Parameters


Use the parameters in this section to configure the Google BigQuery connector.

### Glue connections (recommended)


We recommend that you configure a Google BigQuery connector by using a Glue connections object. To do this, set the `glue_connection` environment variable of the Google BigQuery connector Lambda to the name of the Glue connection to use.

**Glue connections properties**

Use the following command to get the schema for a Glue connection object. This schema contains all the parameters that you can use to control your connection.

```
aws glue describe-connection-type --connection-type BIGQUERY
```

**Lambda environment properties**

**glue_connection** – Specifies the name of the Glue connection associated with the federated connector. 

**Note**  
All connectors that use Glue connections must use AWS Secrets Manager to store credentials.
The Google BigQuery connector created using Glue connections does not support the use of a multiplexing handler.
The Google BigQuery connector created using Glue connections only supports `ConnectionSchemaVersion` 2.

### Legacy connections


**Note**  
Athena data source connectors created on December 3, 2024 and later use AWS Glue connections.

The parameter names and definitions listed below are for Athena data source connectors created without an associated Glue connection. Use the following parameters only when you [manually deploy](connect-data-source-serverless-app-repo.md) an earlier version of an Athena data source connector or when the `glue_connection` environment property is not specified.

**Lambda environment properties**
+ **spill_bucket** – Specifies the Amazon S3 bucket for data that exceeds Lambda function limits.
+ **spill_prefix** – (Optional) Defaults to a subfolder in the specified `spill_bucket` called `athena-federation-spill`. We recommend that you configure an Amazon S3 [storage lifecycle](https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-lifecycle-mgmt.html) on this location to delete spills older than a predetermined number of days or hours.
+ **spill_put_request_headers** – (Optional) A JSON encoded map of request headers and values for the Amazon S3 `putObject` request that is used for spilling (for example, `{"x-amz-server-side-encryption" : "AES256"}`). For other possible headers, see [PutObject](https://docs.aws.amazon.com/AmazonS3/latest/API/API_PutObject.html) in the *Amazon Simple Storage Service API Reference*.
+ **kms_key_id** – (Optional) By default, any data that is spilled to Amazon S3 is encrypted using the AES-GCM authenticated encryption mode and a randomly generated key. To have your Lambda function use stronger encryption keys generated by KMS like `a7e63k4b-8loc-40db-a2a1-4d0en2cd8331`, you can specify a KMS key ID.
+ **disable_spill_encryption** – (Optional) When set to `True`, disables spill encryption. Defaults to `False` so that data that is spilled to S3 is encrypted using AES-GCM – either using a randomly generated key or KMS to generate keys. Disabling spill encryption can improve performance, especially if your spill location uses [server-side encryption](https://docs.aws.amazon.com/AmazonS3/latest/userguide/serv-side-encryption.html).
+ **gcp_project_id** – The project ID (not project name) that contains the datasets that the connector should read from (for example, `semiotic-primer-1234567`).
+ **secret_manager_gcp_creds_name** – The name of the secret within AWS Secrets Manager that contains your BigQuery credentials in JSON format (for example, `GoogleCloudPlatformCredentials`).
+ **big_query_endpoint** – (Optional) The URL of a BigQuery private endpoint. Use this parameter when you want to access BigQuery over a private endpoint.
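Because a malformed `spill_put_request_headers` value causes spilling to fail at query time, it can help to confirm that a candidate value is valid JSON before you set it on the Lambda function. The following Python snippet is an illustrative check that reuses the example value from above:

```python
import json

# Candidate value for the spill_put_request_headers environment property.
candidate = '{"x-amz-server-side-encryption" : "AES256"}'

# json.loads raises json.JSONDecodeError if the value is not valid JSON.
headers = json.loads(candidate)
print(headers["x-amz-server-side-encryption"])
# AES256
```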

## Splits and views


Because the BigQuery connector uses the BigQuery Storage Read API to query tables, and the BigQuery Storage API does not support views, the connector uses the BigQuery client with a single split for views.

## Performance


To query tables, the BigQuery connector uses the BigQuery Storage Read API, which uses an RPC-based protocol that provides fast access to BigQuery managed storage. For more information about the BigQuery Storage Read API, see [Use the BigQuery Storage Read API to read table data](https://cloud.google.com/bigquery/docs/reference/storage) in the Google Cloud documentation.

Selecting a subset of columns significantly speeds up query runtime and reduces data scanned. The connector is subject to query failures as concurrency increases and is generally slow.

The Athena Google BigQuery connector performs predicate pushdown to decrease the data scanned by the query. `LIMIT` clauses, `ORDER BY` clauses, simple predicates, and complex expressions are pushed down to the connector to reduce the amount of data scanned and decrease query execution run time. 

### LIMIT clauses


A `LIMIT N` statement reduces the data scanned by the query. With `LIMIT N` pushdown, the connector returns only `N` rows to Athena.

### Top N queries


A top `N` query specifies an ordering of the result set and a limit on the number of rows returned. You can use this type of query to determine the top `N` max values or top `N` min values for your datasets. With top `N` pushdown, the connector returns only `N` ordered rows to Athena.

### Predicates


A predicate is an expression in the `WHERE` clause of a SQL query that evaluates to a Boolean value and filters rows based on multiple conditions. The Athena Google BigQuery connector can combine these expressions and push them directly to Google BigQuery for enhanced functionality and to reduce the amount of data scanned.

The following Athena Google BigQuery connector operators support predicate pushdown:
+ **Boolean:** AND, OR, NOT
+ **Equality:** EQUAL, NOT_EQUAL, LESS_THAN, LESS_THAN_OR_EQUAL, GREATER_THAN, GREATER_THAN_OR_EQUAL, IS_DISTINCT_FROM, NULL_IF, IS_NULL
+ **Arithmetic:** ADD, SUBTRACT, MULTIPLY, DIVIDE, MODULUS, NEGATE
+ **Other:** LIKE_PATTERN, IN

### Combined pushdown example


For enhanced querying capabilities, combine the pushdown types, as in the following example:

```
SELECT * 
FROM my_table 
WHERE col_a > 10 
    AND ((col_a + col_b) > (col_c % col_d)) 
    AND (col_e IN ('val1', 'val2', 'val3') OR col_f LIKE '%pattern%') 
ORDER BY col_a DESC 
LIMIT 10;
```

## Passthrough queries


The Google BigQuery connector supports [passthrough queries](federated-query-passthrough.md). Passthrough queries use a table function to push your full query down to the data source for execution.

To use passthrough queries with Google BigQuery, you can use the following syntax:

```
SELECT * FROM TABLE(
        system.query(
            query => 'query string'
        ))
```

The following example query pushes down a query to a data source in Google BigQuery. The query selects all columns in the `customer` table, limiting the results to 10.

```
SELECT * FROM TABLE(
        system.query(
            query => 'SELECT * FROM customer LIMIT 10'
        ))
```

## License information


The Amazon Athena Google BigQuery connector project is licensed under the [Apache-2.0 License](https://www.apache.org/licenses/LICENSE-2.0.html).

By using this connector, you acknowledge the inclusion of third party components, a list of which can be found in the [pom.xml](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-google-bigquery/pom.xml) file for this connector, and agree to the terms in the respective third party licenses provided in the [LICENSE.txt](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-google-bigquery/LICENSE.txt) file on GitHub.com.

## Additional resources


For additional information about this connector, visit [the corresponding site](https://github.com/awslabs/aws-athena-query-federation/tree/master/athena-google-bigquery) on GitHub.com.

# Amazon Athena Google Cloud Storage connector
Google Cloud Storage

The Amazon Athena Google Cloud Storage connector enables Amazon Athena to run queries on Parquet and CSV files stored in a Google Cloud Storage (GCS) bucket. After you group one or more Parquet or CSV files in an unpartitioned or partitioned folder in a GCS bucket, you can organize them in an [AWS Glue](https://aws.amazon.com/glue/) database table.

This connector can be registered with Glue Data Catalog as a federated catalog. It supports data access controls defined in Lake Formation at the catalog, database, table, column, row, and tag levels. This connector uses Glue Connections to centralize configuration properties in Glue.

If you have Lake Formation enabled in your account, the IAM role for your Athena federated Lambda connector that you deployed in the AWS Serverless Application Repository must have read access in Lake Formation to the AWS Glue Data Catalog. 

For an article that shows how to use Athena to run queries on Parquet or CSV files in a GCS bucket, see the AWS Big Data Blog post [Use Amazon Athena to query data stored in Google Cloud Platform](https://aws.amazon.com/blogs/big-data/use-amazon-athena-to-query-data-stored-in-google-cloud-platform/).

## Prerequisites

+ Set up an AWS Glue database and table that correspond to your bucket and folders in Google Cloud Storage. For the steps, see [Setting up databases and tables in AWS Glue](#connectors-gcs-setting-up-databases-and-tables-in-glue) later in this document.
+ Deploy the connector to your AWS account using the Athena console or the AWS Serverless Application Repository. For more information, see [Create a data source connection](connect-to-a-data-source.md) or [Use the AWS Serverless Application Repository to deploy a data source connector](connect-data-source-serverless-app-repo.md).

## Limitations

+ Write DDL operations are not supported.
+ Any relevant Lambda limits. For more information, see [Lambda quotas](https://docs.aws.amazon.com/lambda/latest/dg/gettingstarted-limits.html) in the *AWS Lambda Developer Guide*.
+ Currently, the connector supports only the `VARCHAR` type for partition columns (`string` or `varchar` in an AWS Glue table schema). Other partition field types raise errors when you query them in Athena.

## Terms


The following terms relate to the GCS connector.
+ **Handler** – A Lambda handler that accesses your GCS bucket. A handler can be for metadata or for data records.
+ **Metadata handler** – A Lambda handler that retrieves metadata from your GCS bucket.
+ **Record handler** – A Lambda handler that retrieves data records from your GCS bucket.
+ **Composite handler** – A Lambda handler that retrieves both metadata and data records from your GCS bucket.

## Supported file types


The GCS connector supports the Parquet and CSV file types.

**Note**  
Make sure you do not place both CSV and Parquet files in the same GCS bucket or path. Doing so can cause a runtime error when the connector attempts to read Parquet files as CSV, or CSV files as Parquet. 

## Parameters


Use the parameters in this section to configure the GCS connector.

**Note**  
Athena data source connectors created on December 3, 2024 and later use AWS Glue connections.  
The parameter names and definitions listed below are for Athena data source connectors created prior to December 3, 2024. These can differ from their corresponding [AWS Glue connection properties](https://docs.aws.amazon.com/glue/latest/dg/connection-properties.html). Starting December 3, 2024, use the parameters below only when you [manually deploy](connect-data-source-serverless-app-repo.md) an earlier version of an Athena data source connector.

### Glue connections (recommended)


We recommend that you configure a GCS connector by using a Glue connections object. To do this, set the `glue_connection` environment variable of the GCS connector Lambda to the name of the Glue connection to use.

**Glue connections properties**

Use the following command to get the schema for a Glue connection object. This schema contains all the parameters that you can use to control your connection.

```
aws glue describe-connection-type --connection-type GOOGLECLOUDSTORAGE
```

**Lambda environment properties**
+ **glue_connection** – Specifies the name of the Glue connection associated with the federated connector.

**Note**  
All connectors that use Glue connections must use AWS Secrets Manager to store credentials.
The GCS connector created using Glue connections does not support the use of a multiplexing handler.
The GCS connector created using Glue connections only supports `ConnectionSchemaVersion` 2.

### Legacy connections

+ **spill_bucket** – Specifies the Amazon S3 bucket for data that exceeds Lambda function limits.
+ **spill_prefix** – (Optional) Defaults to a subfolder in the specified `spill_bucket` called `athena-federation-spill`. We recommend that you configure an Amazon S3 [storage lifecycle](https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-lifecycle-mgmt.html) on this location to delete spills older than a predetermined number of days or hours.
+ **spill_put_request_headers** – (Optional) A JSON encoded map of request headers and values for the Amazon S3 `putObject` request that is used for spilling (for example, `{"x-amz-server-side-encryption" : "AES256"}`). For other possible headers, see [PutObject](https://docs.aws.amazon.com/AmazonS3/latest/API/API_PutObject.html) in the *Amazon Simple Storage Service API Reference*.
+ **kms_key_id** – (Optional) By default, any data that is spilled to Amazon S3 is encrypted using the AES-GCM authenticated encryption mode and a randomly generated key. To have your Lambda function use stronger encryption keys generated by KMS like `a7e63k4b-8loc-40db-a2a1-4d0en2cd8331`, you can specify a KMS key ID.
+ **disable_spill_encryption** – (Optional) When set to `True`, disables spill encryption. Defaults to `False` so that data that is spilled to S3 is encrypted using AES-GCM – either using a randomly generated key or KMS to generate keys. Disabling spill encryption can improve performance, especially if your spill location uses [server-side encryption](https://docs.aws.amazon.com/AmazonS3/latest/userguide/serv-side-encryption.html).
+ **secret_manager_gcp_creds_name** – The name of the secret in AWS Secrets Manager that contains your GCS credentials in JSON format (for example, `GoogleCloudPlatformCredentials`).

## Setting up databases and tables in AWS Glue


Because the built-in schema inference capability of the GCS connector is limited, we recommend that you use AWS Glue for your metadata. The following procedures show how to create a database and table in AWS Glue that you can access from Athena.

### Creating a database in AWS Glue


You can use the AWS Glue console to create a database for use with the GCS connector.

**To create a database in AWS Glue**

1. Sign in to the AWS Management Console and open the AWS Glue console at [https://console.aws.amazon.com/glue/](https://console.aws.amazon.com/glue/).

1. From the navigation pane, choose **Databases**.

1. Choose **Add database**.

1. For **Name**, enter a name for the database that you want to use with the GCS connector.

1. For **Location**, specify `google-cloud-storage-flag`. This location tells the GCS connector that the AWS Glue database contains tables for GCS data to be queried in Athena. The connector recognizes databases in Athena that have this flag and ignores databases that do not.

1. Choose **Create database**.

### Creating a table in AWS Glue


Now you can create a table for the database. When you create an AWS Glue table to use with the GCS connector, you must specify additional metadata.

**To create a table in the AWS Glue console**

1. In the AWS Glue console, from the navigation pane, choose **Tables**.

1. On the **Tables** page, choose **Add table**.

1. On the **Set table properties** page, enter the following information.
   + **Name** – A unique name for the table.
   + **Database** – Choose the AWS Glue database that you created for the GCS connector.
   + **Include path** – In the **Data store** section, for **Include path**, enter the URI location for GCS prefixed by `gs://` (for example, `gs://gcs_table/data/`). If you have one or more partition folders, don't include them in the path.
**Note**  
When you enter a table path that does not begin with `s3://`, the AWS Glue console shows an error. You can ignore this error; the table is created successfully.
   + **Data format** – For **Classification**, select **CSV** or **Parquet**.

1. Choose **Next.**

1. On the **Choose or define schema** page, defining a table schema is highly recommended, but not mandatory. If you do not define a schema, the GCS connector attempts to infer a schema for you.

   Do one of the following:
   + If you want the GCS connector to attempt to infer a schema for you, choose **Next**, and then choose **Create**.
   + To define a schema yourself, follow the steps in the next section.

### Defining a table schema in AWS Glue


Defining a table schema in AWS Glue requires more steps but gives you greater control over the table creation process.

**To define a schema for your table in AWS Glue**

1. On the **Choose or define schema** page, choose **Add**.

1. Use the **Add schema entry** dialog box to provide a column name and data type.

1. To designate the column as a partition column, select the **Set as partition key** option.

1. Choose **Save** to save the column.

1. Choose **Add** to add another column.

1. When you are finished adding columns, choose **Next**.

1. On the **Review and create** page, review the table, and then choose **Create**.

1. If your schema contains partition information, follow the steps in the next section to add a partition pattern to the table's properties in AWS Glue.

### Adding a partition pattern to table properties in AWS Glue


If your GCS buckets have partitions, you must add the partition pattern to the properties of the table in AWS Glue.

**To add partition information to table properties in AWS Glue**

1. On the details page for the table that you created in AWS Glue, choose **Actions**, **Edit table**.

1. On the **Edit table** page, scroll down to the **Table properties** section.

1. Choose **Add** to add a partition key.

1. For **Key**, enter **partition.pattern**. This key defines the folder path pattern.

1. For **Value**, enter a folder path pattern like **StateName={statename}/ZipCode={zipcode}/**, where **statename** and **zipcode** enclosed by braces (**{}**) are partition column names. The GCS connector supports both Hive and non-Hive partition schemes.

1. When you are finished, choose **Save**.

1. To view the table properties that you just created, choose the **Advanced properties** tab.

At this point, you can navigate to the Athena console. The database and table that you created in AWS Glue are available for querying in Athena.
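To make the role of the `partition.pattern` property concrete, the following Python sketch shows how partition column values from a query predicate map onto a single GCS folder path, so the connector can avoid scanning unrelated folders. This is an illustration only, not the connector's implementation; the base URI, pattern, and values are hypothetical:

```python
import re

# Hypothetical partition.pattern table property value.
PATTERN = "StateName={statename}/ZipCode={zipcode}/"

def resolve_partition_path(base_uri: str, pattern: str, values: dict) -> str:
    """Substitute partition column values into a partition.pattern string.

    Approximates how a predicate such as
    WHERE statename = 'WA' AND zipcode = '98101'
    lets the connector narrow the scan to one folder under the table path.
    """
    path = re.sub(r"\{(\w+)\}", lambda m: str(values[m.group(1)]), pattern)
    return base_uri + path

print(resolve_partition_path("gs://gcs_table/data/", PATTERN,
                             {"statename": "WA", "zipcode": "98101"}))
# gs://gcs_table/data/StateName=WA/ZipCode=98101/
```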

## Data type support


The following tables show the supported data types for CSV and for Parquet.

### CSV




| **Nature of data** | **Inferred Data Type** | 
| --- | --- | 
| Data looks like a number | BIGINT | 
| Data looks like a string | VARCHAR | 
| Data looks like a floating point (float, double, or decimal) | DOUBLE | 
| Data looks like a Date | Timestamp | 
| Data that contains true/false values | BOOL | 

### Parquet




| **PARQUET** | **Athena (Arrow)** | 
| --- | --- | 
| BINARY | VARCHAR | 
| BOOLEAN | BOOL | 
| DOUBLE | DOUBLE | 
| ENUM | VARCHAR | 
| FIXED_LEN_BYTE_ARRAY | DECIMAL | 
| FLOAT | FLOAT (32-bit) | 
| INT32 |  [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/athena/latest/ug/connectors-gcs.html)  | 
| INT64 |  [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/athena/latest/ug/connectors-gcs.html)  | 
| INT96 | Timestamp | 
| MAP | MAP | 
| STRUCT | STRUCT | 
| LIST | LIST | 

## Required Permissions


For full details on the IAM policies that this connector requires, review the `Policies` section of the [athena-gcs.yaml](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-gcs/athena-gcs.yaml) file. The following list summarizes the required permissions.
+ **Amazon S3 write access** – The connector requires write access to a location in Amazon S3 in order to spill results from large queries.
+ **Athena GetQueryExecution** – The connector uses this permission to fast-fail when the upstream Athena query has terminated.
+ **AWS Glue Data Catalog** – The GCS connector requires read only access to the AWS Glue Data Catalog to obtain schema information.
+ **CloudWatch Logs** – The connector requires access to CloudWatch Logs for storing logs.

## Performance


When the table schema contains partition fields and the `partition.pattern` table property is configured correctly, you can include the partition field in the `WHERE` clause of your queries. For such queries, the GCS connector uses the partition columns to refine the GCS folder path and avoid scanning unneeded files in GCS folders.

For Parquet datasets, selecting a subset of columns results in less data being scanned. This usually results in a shorter query runtime when column projection is applied. 

For CSV datasets, column projection is not supported and does not reduce the amount of data being scanned. 

`LIMIT` clauses reduce the amount of data scanned, but if you don't provide a predicate, you should expect `SELECT` queries with a `LIMIT` clause to scan at least 16 MB of data. The GCS connector scans more data for larger datasets than for smaller datasets, regardless of the `LIMIT` clause applied. For example, the query `SELECT * LIMIT 10000` scans more data for a larger underlying dataset than a smaller one.

### License information


By using this connector, you acknowledge the inclusion of third party components, a list of which can be found in the [pom.xml](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-gcs/pom.xml) file for this connector, and agree to the terms in the respective third party licenses provided in the [LICENSE.txt](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-gcs/LICENSE.txt) file on GitHub.com.

### Additional resources


For additional information about this connector, visit [the corresponding site](https://github.com/awslabs/aws-athena-query-federation/tree/master/athena-gcs) on GitHub.com.

# Amazon Athena HBase connector
HBase

The Amazon Athena HBase connector enables Amazon Athena to communicate with your Apache HBase instances so that you can query your HBase data with SQL.

Unlike traditional relational data stores, HBase collections do not have a set schema, and HBase does not have a metadata store. Each entry in an HBase collection can have different fields and data types.

The HBase connector supports two mechanisms for generating table schema information: basic schema inference and AWS Glue Data Catalog metadata.

Schema inference is the default. This option scans a small number of entries in your collection, forms a union of all fields, and coerces fields that have non-overlapping data types. This option works well for collections that have mostly uniform entries.
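The union-and-coerce behavior can be modeled with a short Python sketch. This is a simplified illustration under stated assumptions, not the connector's code: field names are hypothetical, and the real connector maps values to Apache Arrow types rather than Python type names.

```python
def infer_schema(sampled_entries):
    """Union the fields of sampled entries; coerce type conflicts to strings.

    Simplified model of basic schema inference over schemaless records:
    a field seen with conflicting types falls back to a string (VARCHAR).
    """
    schema = {}
    for entry in sampled_entries:
        for field, value in entry.items():
            inferred = type(value).__name__
            if field not in schema:
                schema[field] = inferred
            elif schema[field] != inferred:
                schema[field] = "str"  # non-overlapping types -> VARCHAR
    return schema

entries = [
    {"id": 1, "name": "a"},
    {"id": 2, "zip": "98101"},
    {"id": "3"},  # same field, different type -> coerced to string
]
print(infer_schema(entries))
# {'id': 'str', 'name': 'str', 'zip': 'str'}
```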

For collections with a greater variety of data types, the connector supports retrieving metadata from the AWS Glue Data Catalog. If the connector sees an AWS Glue database and table that match your HBase namespace and collection names, it gets its schema information from the corresponding AWS Glue table. When you create your AWS Glue table, we recommend that you make it a superset of all fields that you might want to access from your HBase collection.

If you have Lake Formation enabled in your account, the IAM role for your Athena federated Lambda connector that you deployed in the AWS Serverless Application Repository must have read access in Lake Formation to the AWS Glue Data Catalog. 

This connector can be registered with Glue Data Catalog as a federated catalog. It supports data access controls defined in Lake Formation at the catalog, database, table, column, row, and tag levels. This connector uses Glue Connections to centralize configuration properties in Glue.

## Prerequisites

+ Deploy the connector to your AWS account using the Athena console or the AWS Serverless Application Repository. For more information, see [Create a data source connection](connect-to-a-data-source.md) or [Use the AWS Serverless Application Repository to deploy a data source connector](connect-data-source-serverless-app-repo.md).

## Parameters


Use the parameters in this section to configure the HBase connector.

**Note**  
Athena data source connectors created on December 3, 2024 and later use AWS Glue connections.  
The parameter names and definitions listed below are for Athena data source connectors created prior to December 3, 2024. These can differ from their corresponding [AWS Glue connection properties](https://docs.aws.amazon.com/glue/latest/dg/connection-properties.html). Starting December 3, 2024, use the parameters below only when you [manually deploy](connect-data-source-serverless-app-repo.md) an earlier version of an Athena data source connector.

### Glue connections (recommended)


We recommend that you configure an HBase connector by using a Glue connections object. To do this, set the `glue_connection` environment variable of the HBase connector Lambda to the name of the Glue connection to use.

**Glue connections properties**

Use the following command to get the schema for a Glue connection object. This schema contains all the parameters that you can use to control your connection.

```
aws glue describe-connection-type --connection-type HBASE
```

**Lambda environment properties**
+ **glue_connection** – Specifies the name of the Glue connection associated with the federated connector.

**Note**  
All connectors that use Glue connections must use AWS Secrets Manager to store credentials.
The HBase connector created using Glue connections does not support the use of a multiplexing handler.
The HBase connector created using Glue connections only supports `ConnectionSchemaVersion` 2.

### Legacy connections

+ **spill_bucket** – Specifies the Amazon S3 bucket for data that exceeds Lambda function limits.
+ **spill_prefix** – (Optional) Defaults to a subfolder in the specified `spill_bucket` called `athena-federation-spill`. We recommend that you configure an Amazon S3 [storage lifecycle](https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-lifecycle-mgmt.html) on this location to delete spills older than a predetermined number of days or hours.
+ **spill_put_request_headers** – (Optional) A JSON encoded map of request headers and values for the Amazon S3 `putObject` request that is used for spilling (for example, `{"x-amz-server-side-encryption" : "AES256"}`). For other possible headers, see [PutObject](https://docs.aws.amazon.com/AmazonS3/latest/API/API_PutObject.html) in the *Amazon Simple Storage Service API Reference*.
+ **kms_key_id** – (Optional) By default, any data that is spilled to Amazon S3 is encrypted using the AES-GCM authenticated encryption mode and a randomly generated key. To have your Lambda function use stronger encryption keys generated by KMS like `a7e63k4b-8loc-40db-a2a1-4d0en2cd8331`, you can specify a KMS key ID.
+ **disable_spill_encryption** – (Optional) When set to `True`, disables spill encryption. Defaults to `False` so that data that is spilled to S3 is encrypted using AES-GCM – either using a randomly generated key or KMS to generate keys. Disabling spill encryption can improve performance, especially if your spill location uses [server-side encryption](https://docs.aws.amazon.com/AmazonS3/latest/userguide/serv-side-encryption.html).
+ **disable_glue** – (Optional) If present and set to true, the connector does not attempt to retrieve supplemental metadata from AWS Glue.
+ **glue_catalog** – (Optional) Use this option to specify a [cross-account AWS Glue catalog](data-sources-glue-cross-account.md). By default, the connector attempts to get metadata from its own AWS Glue account.
+ **default_hbase** – If present, specifies an HBase connection string to use when no catalog-specific environment variable exists.
+ **enable_case_insensitive_match** – (Optional) When `true`, performs case insensitive searches against table names in HBase. The default is `false`. Use this option if your query contains uppercase table names.

#### Specifying connection strings


You can provide one or more properties that define the HBase connection details for the HBase instances that you use with the connector. To do this, set a Lambda environment variable that corresponds to the catalog name that you want to use in Athena. For example, suppose you want to use the following queries to query two different HBase instances from Athena:

```
SELECT * FROM "hbase_instance_1".database.table
```

```
SELECT * FROM "hbase_instance_2".database.table
```

Before you can use these two SQL statements, you must add two environment variables to your Lambda function: `hbase_instance_1` and `hbase_instance_2`. The value for each should be an HBase connection string in the following format:

```
master_hostname:hbase_port:zookeeper_port
```
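A connection string in this format splits cleanly into its three components, as the following Python sketch illustrates (the host name and port numbers are hypothetical):

```python
# Hypothetical connection string in master_hostname:hbase_port:zookeeper_port form.
conn = "hbase-master.example.internal:16000:2181"

# Split into the three colon-separated components.
master_hostname, hbase_port, zookeeper_port = conn.split(":")
print(master_hostname)   # hbase-master.example.internal
print(hbase_port)        # 16000
print(zookeeper_port)    # 2181
```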

##### Using secrets


You can optionally use AWS Secrets Manager for part or all of the value for your connection string details. To use the Athena Federated Query feature with Secrets Manager, the VPC connected to your Lambda function should have [internet access](https://aws.amazon.com/premiumsupport/knowledge-center/internet-access-lambda-function/) or a [VPC endpoint](https://docs.aws.amazon.com/secretsmanager/latest/userguide/vpc-endpoint-overview.html) to connect to Secrets Manager.

If you use the syntax `${my_secret}` to put the name of a secret from Secrets Manager in your connection string, the connector replaces the secret name with your user name and password values from Secrets Manager.

For example, suppose you set the Lambda environment variable for `hbase_instance_1` to the following value:

```
${hbase_host_1}:${hbase_master_port_1}:${hbase_zookeeper_port_1}
```

The Athena Query Federation SDK automatically attempts to retrieve a secret named `hbase_host_1` from Secrets Manager and inject its value in place of `${hbase_host_1}`, and does the same for the other placeholders. Any part of the connection string that is enclosed by the `${ }` character combination is interpreted as a secret from Secrets Manager. If you specify a secret name that the connector cannot find in Secrets Manager, the connector does not replace the text.
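The substitution behavior can be modeled in a few lines of Python. This is a sketch of the behavior described above, not the SDK's implementation; the secret names and values are hypothetical, and the third secret is deliberately missing to show that unresolved placeholders are left unchanged:

```python
import re

def resolve_connection_string(template: str, secrets: dict) -> str:
    """Replace each ${name} placeholder with its value from a secrets store.

    Placeholders whose secret cannot be found are left unchanged.
    """
    return re.sub(r"\$\{([^}]+)\}",
                  lambda m: secrets.get(m.group(1), m.group(0)),
                  template)

template = "${hbase_host_1}:${hbase_master_port_1}:${hbase_zookeeper_port_1}"
secrets = {"hbase_host_1": "host1.example.com",
           "hbase_master_port_1": "16000"}  # zookeeper port secret missing
print(resolve_connection_string(template, secrets))
# host1.example.com:16000:${hbase_zookeeper_port_1}
```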

## Setting up databases and tables in AWS Glue


The connector's built-in schema inference supports only values that are serialized in HBase as strings (for example, `String.valueOf(int)`). Because this built-in schema inference capability is limited, you might want to use AWS Glue for metadata instead. To enable an AWS Glue table for use with HBase, you must have an AWS Glue database and table with names that match the HBase namespace and table that you want to supply supplemental metadata for. The use of HBase column family naming conventions is optional.

**To use an AWS Glue table for supplemental metadata**

1. When you edit the table and database in the AWS Glue console, add the following table properties:
   + **hbase-metadata-flag** – This property indicates to the HBase connector that the connector can use the table for supplemental metadata. You can provide any value for `hbase-metadata-flag` as long as the `hbase-metadata-flag` property is present in the list of table properties.
   + **hbase-native-storage-flag** – Use this flag to toggle the two value serialization modes supported by the connector. By default, when this field is not present, the connector assumes all values are stored in HBase as strings. As such, it attempts to parse data types such as `INT`, `BIGINT`, and `DOUBLE` from HBase as strings. If this field is set with any value on the table in AWS Glue, the connector switches to "native" storage mode and attempts to read `INT`, `BIGINT`, `BIT`, and `DOUBLE` as bytes by using the following functions:

     ```
     ByteBuffer.wrap(value).getInt() 
     ByteBuffer.wrap(value).getLong() 
     ByteBuffer.wrap(value).get() 
     ByteBuffer.wrap(value).getDouble()
     ```

1. Make sure that you use the data types appropriate for AWS Glue as listed in this document.
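
The difference between the two serialization modes can be sketched in Python, using the `struct` module to mimic Java's big-endian `ByteBuffer` reads. The helper below is illustrative only, not connector code:

```python
import struct

def decode_value(raw: bytes, glue_type: str, native: bool):
    """Decode an HBase cell value. In the default (string) mode, the bytes
    are a UTF-8 string that is parsed into the target type. In "native"
    mode, the bytes are read big-endian, like ByteBuffer.getInt() etc."""
    if not native:
        caster = {"int": int, "bigint": int, "double": float}.get(glue_type, str)
        return caster(raw.decode("utf-8"))
    fmt = {"int": ">i", "bigint": ">q", "boolean": ">b", "double": ">d"}[glue_type]
    return struct.unpack(fmt, raw)[0]
```

Note that Java's `ByteBuffer` defaults to big-endian byte order, which is why the sketch uses the `>` format prefix.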

### Modeling column families


The Athena HBase connector supports two ways to model HBase column families: fully qualified (flattened) naming like `family:column`, or using `STRUCT` objects.

In the `STRUCT` model, the name of the `STRUCT` field should match the column family, and the children of the `STRUCT` should match the names of the columns in that family. However, because predicate pushdown and columnar reads are not yet fully supported for complex types like `STRUCT`, using `STRUCT` is currently not advised.

The following image shows a table configured in AWS Glue that uses a combination of the two approaches.

![\[Modeling column families in AWS Glue for Apache HBase.\]](http://docs.aws.amazon.com/athena/latest/ug/images/connectors-hbase-1.png)


## Data type support


The connector retrieves all HBase values as the basic byte type. Then, based on how you defined your tables in the AWS Glue Data Catalog, it maps the values to one of the Apache Arrow data types in the following table.



| AWS Glue data type | Apache Arrow data type | 
| --- | --- | 
| int | INT | 
| bigint | BIGINT | 
| double | FLOAT8 | 
| float | FLOAT4 | 
| boolean | BIT | 
| binary | VARBINARY | 
| string | VARCHAR | 

**Note**  
If you do not use AWS Glue to supplement your metadata, the connector's schema inferencing uses only the data types `BIGINT`, `FLOAT8`, and `VARCHAR`.

## Required permissions


For full details on the IAM policies that this connector requires, review the `Policies` section of the [athena-hbase.yaml](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-hbase/athena-hbase.yaml) file. The following list summarizes the required permissions.
+ **Amazon S3 write access** – The connector requires write access to a location in Amazon S3 in order to spill results from large queries.
+ **Athena GetQueryExecution** – The connector uses this permission to fast-fail when the upstream Athena query has terminated.
+ **AWS Glue Data Catalog** – The HBase connector requires read only access to the AWS Glue Data Catalog to obtain schema information.
+ **CloudWatch Logs** – The connector requires access to CloudWatch Logs for storing logs.
+ **AWS Secrets Manager read access** – If you choose to store HBase endpoint details in Secrets Manager, you must grant the connector access to those secrets.
+ **VPC access** – The connector requires the ability to attach and detach interfaces to your VPC so that it can connect to it and communicate with your HBase instances.

## Performance


The Athena HBase connector attempts to parallelize queries against your HBase instance by reading each region server in parallel. The Athena HBase connector performs predicate pushdown to decrease the data scanned by the query.

The Lambda function also performs *projection* pushdown to decrease the data scanned by the query. However, selecting a subset of columns sometimes results in a longer query execution runtime. `LIMIT` clauses reduce the amount of data scanned, but if you don't provide a predicate, you should expect `SELECT` queries with a `LIMIT` clause to scan at least 16 MB of data.

HBase is prone to query failures and variable query execution times. You might have to retry your queries multiple times for them to succeed. The HBase connector is resilient to throttling due to concurrency.

## Passthrough queries


The HBase connector supports [passthrough queries](federated-query-passthrough.md). Because HBase is a NoSQL data source, passthrough queries use the HBase filter language rather than SQL. For information about querying Apache HBase with filters, see [Filter language](https://hbase.apache.org/book.html#thrift.filter_language) in the Apache documentation.

To use passthrough queries with HBase, use the following syntax:

```
SELECT * FROM TABLE(
        system.query(
            database => 'database_name',
            collection => 'collection_name',
            filter => '{query_syntax}'
        ))
```

The following example HBase passthrough query filters for employees aged 24 or 30 within the `employee` collection of the `default` database.

```
SELECT * FROM TABLE(
        system.query(
            DATABASE => 'default',
            COLLECTION => 'employee',
            FILTER => 'SingleColumnValueFilter(''personaldata'', ''age'', =, ''binary:30'')' ||
                       ' OR SingleColumnValueFilter(''personaldata'', ''age'', =, ''binary:24'')'
        ))
```

## License information


The Amazon Athena HBase connector project is licensed under the [Apache-2.0 License](https://www.apache.org/licenses/LICENSE-2.0.html).

## Additional resources


For additional information about this connector, visit [the corresponding site](https://github.com/awslabs/aws-athena-query-federation/tree/master/athena-hbase) on GitHub.com.

# Amazon Athena Hortonworks connector
Hortonworks

The Amazon Athena connector for Hortonworks enables Amazon Athena to run SQL queries on the Cloudera [Hortonworks](https://www.cloudera.com/products/hdp.html) data platform. The connector transforms your Athena SQL queries to their equivalent HiveQL syntax.

This connector does not use AWS Glue connections to centralize configuration properties. Connection configuration is done through Lambda environment variables.

## Prerequisites

+ Deploy the connector to your AWS account using the Athena console or the AWS Serverless Application Repository. For more information, see [Create a data source connection](connect-to-a-data-source.md) or [Use the AWS Serverless Application Repository to deploy a data source connector](connect-data-source-serverless-app-repo.md).

## Limitations

+ Write DDL operations are not supported.
+ In a multiplexer setup, the spill bucket and prefix are shared across all database instances.
+ Any relevant Lambda limits. For more information, see [Lambda quotas](https://docs.aws.amazon.com/lambda/latest/dg/gettingstarted-limits.html) in the *AWS Lambda Developer Guide*.

## Terms


The following terms relate to the Hortonworks Hive connector.
+ **Database instance** – Any instance of a database deployed on premises, on Amazon EC2, or on Amazon RDS.
+ **Handler** – A Lambda handler that accesses your database instance. A handler can be for metadata or for data records.
+ **Metadata handler** – A Lambda handler that retrieves metadata from your database instance.
+ **Record handler** – A Lambda handler that retrieves data records from your database instance.
+ **Composite handler** – A Lambda handler that retrieves both metadata and data records from your database instance.
+ **Property or parameter** – A database property used by handlers to extract database information. You configure these properties as Lambda environment variables.
+ **Connection String** – A string of text used to establish a connection to a database instance.
+ **Catalog** – A non-AWS Glue catalog registered with Athena that is a required prefix for the `connection_string` property.
+ **Multiplexing handler** – A Lambda handler that can accept and use multiple database connections.

## Parameters


Use the parameters in this section to configure the Hortonworks Hive connector.

### Connection string


Use a JDBC connection string in the following format to connect to a database instance.

```
hive://${jdbc_connection_string}
```

### Using a multiplexing handler


You can use a multiplexer to connect to multiple database instances with a single Lambda function. Requests are routed by catalog name. Use the following classes in Lambda.



| Handler | Class | 
| --- | --- | 
| Composite handler | HiveMuxCompositeHandler | 
| Metadata handler | HiveMuxMetadataHandler | 
| Record handler | HiveMuxRecordHandler | 

#### Multiplexing handler parameters




| Parameter | Description | 
| --- | --- | 
| catalog_connection_string | Required. A database instance connection string. Prefix the environment variable with the name of the catalog used in Athena. For example, if the catalog registered with Athena is myhivecatalog, then the environment variable name is myhivecatalog_connection_string. | 
| default | Required. The default connection string. This string is used when the catalog is lambda:${AWS_LAMBDA_FUNCTION_NAME}. | 

The following example properties are for a Hive MUX Lambda function that supports two database instances: `hive1` (the default), and `hive2`.



| Property | Value | 
| --- | --- | 
| default | hive://jdbc:hive2://hive1:10000/default?${Test/RDS/hive1} | 
| hive_catalog1_connection_string | hive://jdbc:hive2://hive1:10000/default?${Test/RDS/hive1} | 
| hive_catalog2_connection_string | hive://jdbc:hive2://hive2:10000/default?UID=sample&PWD=sample | 
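
The catalog-to-connection-string routing described above can be sketched in Python. The `connection_for_catalog` helper is hypothetical; it only mirrors the lookup-by-catalog-name-with-default-fallback behavior of the multiplexing handler:

```python
import os

def connection_for_catalog(catalog: str, env=os.environ) -> str:
    """Resolve the connection string for an Athena catalog name: look for
    <catalog>_connection_string first, then fall back to 'default'."""
    conn = env.get(f"{catalog}_connection_string") or env.get("default")
    if conn is None:
        raise KeyError(f"no connection string configured for catalog {catalog!r}")
    return conn
```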

#### Providing credentials


To provide a user name and password for your database in your JDBC connection string, you can use connection string properties or AWS Secrets Manager.
+ **Connection String** – A user name and password can be specified as properties in the JDBC connection string.
**Important**  
As a security best practice, don't use hardcoded credentials in your environment variables or connection strings. For information about moving your hardcoded secrets to AWS Secrets Manager, see [Move hardcoded secrets to AWS Secrets Manager](https://docs.aws.amazon.com/secretsmanager/latest/userguide/hardcoded.html) in the *AWS Secrets Manager User Guide*.
+ **AWS Secrets Manager** – To use the Athena Federated Query feature with AWS Secrets Manager, the VPC connected to your Lambda function should have [internet access](https://aws.amazon.com/premiumsupport/knowledge-center/internet-access-lambda-function/) or a [VPC endpoint](https://docs.aws.amazon.com/secretsmanager/latest/userguide/vpc-endpoint-overview.html) to connect to Secrets Manager.

  You can put the name of a secret in AWS Secrets Manager in your JDBC connection string. The connector replaces the secret name with the `username` and `password` values from Secrets Manager.

  For Amazon RDS database instances, this support is tightly integrated. If you use Amazon RDS, we highly recommend using AWS Secrets Manager and credential rotation. If your database does not use Amazon RDS, store the credentials as JSON in the following format:

  ```
  {"username": "${username}", "password": "${password}"}
  ```

**Example connection string with secret name**  
The following string has the secret name `${Test/RDS/hive1host}`.

```
hive://jdbc:hive2://hive1host:10000/default?...&${Test/RDS/hive1host}&...
```

The connector uses the secret name to retrieve secrets and provide the user name and password, as in the following example.

```
hive://jdbc:hive2://hive1host:10000/default?...&UID=sample2&PWD=sample2&...
```

Currently, the Hortonworks Hive connector recognizes the `UID` and `PWD` JDBC properties.
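
The replacement behavior can be sketched with a short Python helper. This is illustrative only, assuming the secret is stored as the JSON payload shown earlier; the helper name and lookup callable are not part of the connector:

```python
import json
import re

def inject_jdbc_credentials(conn: str, get_secret) -> str:
    """Replace each ${secret_name} in a JDBC connection string with
    UID=...&PWD=... built from the secret's JSON payload."""
    def sub(match):
        payload = json.loads(get_secret(match.group(1)))
        return f"UID={payload['username']}&PWD={payload['password']}"
    return re.sub(r"\$\{([^}]+)\}", sub, conn)
```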

### Using a single connection handler


You can use the following single connection metadata and record handlers to connect to a single Hortonworks Hive instance.



| Handler type | Class | 
| --- | --- | 
| Composite handler | HiveCompositeHandler | 
| Metadata handler | HiveMetadataHandler | 
| Record handler | HiveRecordHandler | 

#### Single connection handler parameters




| Parameter | Description | 
| --- | --- | 
| default | Required. The default connection string. | 

The single connection handlers support one database instance and must provide a `default` connection string parameter. All other connection strings are ignored.

The following example property is for a single Hortonworks Hive instance supported by a Lambda function.



| Property | Value | 
| --- | --- | 
| default | hive://jdbc:hive2://hive1host:10000/default?secret=${Test/RDS/hive1host} | 

### Spill parameters


The Lambda SDK can spill data to Amazon S3. All database instances accessed by the same Lambda function spill to the same location.



| Parameter | Description | 
| --- | --- | 
| spill_bucket | Required. Spill bucket name. | 
| spill_prefix | Required. Spill bucket key prefix. | 
| spill_put_request_headers | (Optional) A JSON encoded map of request headers and values for the Amazon S3 putObject request that is used for spilling (for example, `{"x-amz-server-side-encryption" : "AES256"}`). For other possible headers, see [PutObject](https://docs.aws.amazon.com/AmazonS3/latest/API/API_PutObject.html) in the *Amazon Simple Storage Service API Reference*. | 
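
Because the header map is passed as a JSON-encoded string, it can be validated before deployment. The following Python sketch is illustrative, not part of the connector:

```python
import json

def parse_spill_headers(raw: str) -> dict:
    """Parse a spill_put_request_headers value: a JSON-encoded flat map
    of header names to values applied to the S3 putObject spill request."""
    headers = json.loads(raw)
    if not all(isinstance(k, str) and isinstance(v, str) for k, v in headers.items()):
        raise ValueError("headers must be a flat map of strings")
    return headers
```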

## Data type support


The following table shows the corresponding data types for JDBC, Hortonworks Hive, and Arrow.



| JDBC | Hortonworks Hive | Arrow | 
| --- | --- | --- | 
| Boolean | Boolean | Bit | 
| Integer | TINYINT | Tiny | 
| Short | SMALLINT | Smallint | 
| Integer | INT | Int | 
| Long | BIGINT | Bigint | 
| float | float4 | Float4 | 
| Double | float8 | Float8 | 
| Date | date | DateDay | 
| Timestamp | timestamp | DateMilli | 
| String | VARCHAR | Varchar | 
| Bytes | bytes | Varbinary | 
| BigDecimal | Decimal | Decimal | 
| ARRAY | N/A (see note) | List | 

**Note**  
Currently, Hortonworks Hive does not support the aggregate types `ARRAY`, `MAP`, `STRUCT`, or `UNIONTYPE`. Columns of aggregate types are treated as `VARCHAR` columns in SQL.

## Partitions and splits


Partitions are used to determine how to generate splits for the connector. Athena constructs a synthetic column of type `varchar` that represents the partitioning scheme for the table to help the connector generate splits. The connector does not modify the actual table definition.

## Performance


Hortonworks Hive supports static partitions. The Athena Hortonworks Hive connector can retrieve data from these partitions in parallel. If you want to query very large datasets with uniform partition distribution, static partitioning is highly recommended. Selecting a subset of columns significantly speeds up query runtime and reduces data scanned. The Hortonworks Hive connector is resilient to throttling due to concurrency.

The Athena Hortonworks Hive connector performs predicate pushdown to decrease the data scanned by the query. `LIMIT` clauses, simple predicates, and complex expressions are pushed down to the connector to reduce the amount of data scanned and decrease query execution run time. 

### LIMIT clauses


A `LIMIT N` statement reduces the data scanned by the query. With `LIMIT N` pushdown, the connector returns only `N` rows to Athena.

### Predicates


A predicate is an expression in the `WHERE` clause of a SQL query that evaluates to a Boolean value and filters rows based on multiple conditions. The Athena Hortonworks Hive connector can combine these expressions and push them directly to Hortonworks Hive for enhanced functionality and to reduce the amount of data scanned.

The following Athena Hortonworks Hive connector operators support predicate pushdown:
+ **Boolean:** AND, OR, NOT
+ **Equality:** EQUAL, NOT_EQUAL, LESS_THAN, LESS_THAN_OR_EQUAL, GREATER_THAN, GREATER_THAN_OR_EQUAL, IS_NULL
+ **Arithmetic:** ADD, SUBTRACT, MULTIPLY, DIVIDE, MODULUS, NEGATE
+ **Other:** LIKE_PATTERN, IN

### Combined pushdown example


For enhanced querying capabilities, combine the pushdown types, as in the following example:

```
SELECT * 
FROM my_table 
WHERE col_a > 10 
    AND ((col_a + col_b) > (col_c % col_d))
    AND (col_e IN ('val1', 'val2', 'val3') OR col_f LIKE '%pattern%') 
LIMIT 10;
```

## Passthrough queries


The Hortonworks Hive connector supports [passthrough queries](federated-query-passthrough.md). Passthrough queries use a table function to push your full query down to the data source for execution.

To use passthrough queries with Hortonworks Hive, you can use the following syntax:

```
SELECT * FROM TABLE(
        system.query(
            query => 'query string'
        ))
```

The following example query pushes down a query to a data source in Hortonworks Hive. The query selects all columns in the `customer` table, limiting the results to 10.

```
SELECT * FROM TABLE(
        system.query(
            query => 'SELECT * FROM customer LIMIT 10'
        ))
```

## License information


By using this connector, you acknowledge the inclusion of third party components, a list of which can be found in the [pom.xml](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-hortonworks-hive/pom.xml) file for this connector, and agree to the terms in the respective third party licenses provided in the [LICENSE.txt](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-hortonworks-hive/LICENSE.txt) file on GitHub.com.

## Additional resources


For the latest JDBC driver version information, see the [pom.xml](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-hortonworks-hive/pom.xml) file for the Hortonworks Hive connector on GitHub.com.

For additional information about this connector, visit [the corresponding site](https://github.com/awslabs/aws-athena-query-federation/tree/master/athena-hortonworks-hive) on GitHub.com.

# Amazon Athena Apache Kafka connector
Kafka

The Amazon Athena connector for Apache Kafka enables Amazon Athena to run SQL queries on your Apache Kafka topics. Use this connector to view [Apache Kafka](https://kafka.apache.org/) topics as tables and messages as rows in Athena.

This connector does not use AWS Glue connections to centralize configuration properties. Connection configuration is done through Lambda environment variables.

## Prerequisites


Deploy the connector to your AWS account using the Athena console or the AWS Serverless Application Repository. For more information, see [Create a data source connection](connect-to-a-data-source.md) or [Use the AWS Serverless Application Repository to deploy a data source connector](connect-data-source-serverless-app-repo.md).

## Limitations

+ Write DDL operations are not supported.
+ Any relevant Lambda limits. For more information, see [Lambda quotas](https://docs.aws.amazon.com/lambda/latest/dg/gettingstarted-limits.html) in the *AWS Lambda Developer Guide*.
+ Date and timestamp data types in filter conditions must be cast to appropriate data types.
+ Date and timestamp data types are not supported for the CSV file type and are treated as varchar values.
+ Mapping into nested JSON fields is not supported. The connector maps top-level fields only.
+ The connector does not support complex types. Complex types are interpreted as strings.
+ To extract or work with complex JSON values, use the JSON-related functions available in Athena. For more information, see [Extract JSON data from strings](extracting-data-from-JSON.md).
+ The connector does not support access to Kafka message metadata.

## Terms

+ **Metadata handler** – A Lambda handler that retrieves metadata from your database instance.
+ **Record handler** – A Lambda handler that retrieves data records from your database instance.
+ **Composite handler** – A Lambda handler that retrieves both metadata and data records from your database instance.
+ **Kafka endpoint** – A text string that establishes a connection to a Kafka instance.

## Cluster compatibility


The Kafka connector can be used with the following cluster types.
+ **Standalone Kafka** – A direct connection to Kafka (authenticated or unauthenticated).
+ **Confluent** – A direct connection to Confluent Kafka. For information about using Athena with Confluent Kafka data, see [Visualize Confluent data in Amazon QuickSight using Amazon Athena](https://aws.amazon.com/blogs/business-intelligence/visualize-confluent-data-in-amazon-quicksight-using-amazon-athena/) in the *AWS Business Intelligence Blog*. 

### Connecting to Confluent


Connecting to Confluent requires the following steps:

1. Generate an API key from Confluent.

1. Store the username and password for the Confluent API key into AWS Secrets Manager.

1. Provide the secret name for the `secrets_manager_secret` environment variable in the Kafka connector.

1. Follow the steps in the [Setting up the Kafka connector](#connectors-kafka-setup) section of this document.

## Supported authentication methods


The connector supports the following authentication methods.
+ [SSL](https://kafka.apache.org/documentation/#security_ssl)
+ [SASL/SCRAM](https://kafka.apache.org/documentation/#security_sasl_scram)
+ SASL/PLAIN
+ SASL/PLAINTEXT
+ NO_AUTH
+ **Self-managed Kafka and Confluent Platform** – SSL, SASL/SCRAM, SASL/PLAINTEXT, NO_AUTH
+ **Self-managed Kafka and Confluent Cloud** – SASL/PLAIN

For more information, see [Configuring authentication for the Athena Kafka connector](#connectors-kafka-setup-configuring-authentication).

## Supported input data formats


The connector supports the following input data formats.
+ JSON
+ CSV
+ AVRO
+ PROTOBUF (PROTOCOL BUFFERS)

## Parameters


Use the parameters in this section to configure the Athena Kafka connector.
+ **auth_type** – Specifies the authentication type of the cluster. The connector supports the following types of authentication:
  + **NO_AUTH** – Connect directly to Kafka (for example, to a Kafka cluster deployed over an EC2 instance that does not use authentication).
  + **SASL_SSL_PLAIN** – This method uses the `SASL_SSL` security protocol and the `PLAIN` SASL mechanism. For more information, see [SASL configuration](https://kafka.apache.org/documentation/#security_sasl_config) in the Apache Kafka documentation.
  + **SASL_PLAINTEXT_PLAIN** – This method uses the `SASL_PLAINTEXT` security protocol and the `PLAIN` SASL mechanism. For more information, see [SASL configuration](https://kafka.apache.org/documentation/#security_sasl_config) in the Apache Kafka documentation.
  + **SASL_SSL_SCRAM_SHA512** – You can use this authentication type to control access to your Apache Kafka clusters. This method stores the user name and password in AWS Secrets Manager. The secret must be associated with the Kafka cluster. For more information, see [Authentication using SASL/SCRAM](https://kafka.apache.org/documentation/#security_sasl_scram) in the Apache Kafka documentation.
  + **SASL_PLAINTEXT_SCRAM_SHA512** – This method uses the `SASL_PLAINTEXT` security protocol and the `SCRAM_SHA512` SASL mechanism. This method uses your user name and password stored in AWS Secrets Manager. For more information, see the [SASL configuration](https://kafka.apache.org/documentation/#security_sasl_config) section of the Apache Kafka documentation.
  + **SSL** – SSL authentication uses key store and trust store files to connect with the Apache Kafka cluster. You must generate the trust store and key store files, upload them to an Amazon S3 bucket, and provide the reference to Amazon S3 when you deploy the connector. The key store, trust store, and SSL key are stored in AWS Secrets Manager. Your client must provide the AWS secret key when the connector is deployed. For more information, see [Encryption and Authentication using SSL](https://kafka.apache.org/documentation/#security_ssl) in the Apache Kafka documentation.

    For more information, see [Configuring authentication for the Athena Kafka connector](#connectors-kafka-setup-configuring-authentication).
+ **certificates_s3_reference** – The Amazon S3 location that contains the certificates (the key store and trust store files).
+ **disable_spill_encryption** – (Optional) When set to `True`, disables spill encryption. Defaults to `False` so that data that is spilled to S3 is encrypted using AES-GCM, either with a randomly generated key or with keys generated by KMS. Disabling spill encryption can improve performance, especially if your spill location uses [server-side encryption](https://docs.aws.amazon.com/AmazonS3/latest/userguide/serv-side-encryption.html).
+ **kafka_endpoint** – The endpoint details to provide to Kafka.
+ **schema_registry_url** – The URL address for the schema registry (for example, `http://schema-registry.example.org:8081`). Applies to the `AVRO` and `PROTOBUF` data formats. Athena supports only the Confluent schema registry.
+ **secrets_manager_secret** – The name of the AWS secret in which the credentials are saved.
+ **Spill parameters** – Lambda functions temporarily store ("spill") data that does not fit into memory to Amazon S3. All database instances accessed by the same Lambda function spill to the same location. Use the parameters in the following table to specify the spill location.

  | Parameter | Description | 
  | --- | --- | 
  | spill_bucket | Required. Spill bucket name. | 
  | spill_prefix | Required. Spill bucket key prefix. | 
  | spill_put_request_headers | (Optional) A JSON encoded map of request headers and values for the Amazon S3 putObject request that is used for spilling (for example, `{"x-amz-server-side-encryption" : "AES256"}`). For other possible headers, see [PutObject](https://docs.aws.amazon.com/AmazonS3/latest/API/API_PutObject.html) in the *Amazon Simple Storage Service API Reference*. | 
+ **Subnet IDs** – One or more subnet IDs that correspond to the subnet that the Lambda function can use to access your data source.
  + **Public Kafka cluster or standard Confluent Cloud cluster** – Associate the connector with a private subnet that has a NAT Gateway.
  + **Confluent Cloud cluster with private connectivity** – Associate the connector with a private subnet that has a route to the Confluent Cloud cluster.
    + For [AWS Transit Gateway](https://docs.confluent.io/cloud/current/networking/aws-transit-gateway.html), the subnets must be in a VPC that is attached to the same transit gateway that Confluent Cloud uses.
    + For [VPC Peering](https://docs.confluent.io/cloud/current/networking/peering/aws-peering.html), the subnets must be in a VPC that is peered to Confluent Cloud VPC.
    + For [AWS PrivateLink](https://docs.confluent.io/cloud/current/networking/private-links/aws-privatelink.html), the subnets must be in a VPC that has a route to the VPC endpoints that connect to Confluent Cloud.

**Note**  
If you deploy the connector into a VPC in order to access private resources and also want to connect to a publicly accessible service like Confluent, you must associate the connector with a private subnet that has a NAT Gateway. For more information, see [NAT gateways](https://docs.aws.amazon.com/vpc/latest/userguide/vpc-nat-gateway.html) in the Amazon VPC User Guide.

## Data type support


The following table shows the corresponding data types supported for Kafka and Apache Arrow.



| Kafka | Arrow | 
| --- | --- | 
| CHAR | VARCHAR | 
| VARCHAR | VARCHAR | 
| TIMESTAMP | MILLISECOND | 
| DATE | DAY | 
| BOOLEAN | BOOL | 
| SMALLINT | SMALLINT | 
| INTEGER | INT | 
| BIGINT | BIGINT | 
| DECIMAL | FLOAT8 | 
| DOUBLE | FLOAT8 | 

## Partitions and splits


Kafka topics are split into partitions. Each partition is ordered. Each message in a partition has an incremental ID called an *offset*. Each Kafka partition is further divided into multiple splits for parallel processing. Data is available for the retention period configured in Kafka clusters.
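
The idea of dividing a partition's offset range into splits can be sketched in Python. The `plan_splits` helper is hypothetical and ignores connector details such as retention and consumer configuration:

```python
def plan_splits(start_offset: int, end_offset: int, max_rows_per_split: int):
    """Divide the offset range [start_offset, end_offset) into contiguous
    splits of at most max_rows_per_split messages each, so that splits
    can be read in parallel."""
    splits = []
    lo = start_offset
    while lo < end_offset:
        hi = min(lo + max_rows_per_split, end_offset)
        splits.append((lo, hi))
        lo = hi
    return splits
```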

## Best practices


As a best practice, use predicate pushdown when you query Athena, as in the following examples.

```
SELECT * 
FROM "kafka_catalog_name"."glue_schema_registry_name"."glue_schema_name" 
WHERE integercol = 2147483647
```

```
SELECT * 
FROM "kafka_catalog_name"."glue_schema_registry_name"."glue_schema_name" 
WHERE timestampcol >= TIMESTAMP '2018-03-25 07:30:58.878'
```

## Setting up the Kafka connector


Before you can use the connector, you must set up your Apache Kafka cluster, use the [AWS Glue Schema Registry](https://docs.aws.amazon.com/glue/latest/dg/schema-registry.html) to define your schema, and configure authentication for the connector.

When working with the AWS Glue Schema Registry, note the following points:
+ Make sure that the text in the **Description** field of the AWS Glue Schema Registry includes the string `{AthenaFederationKafka}`. This marker string is required for AWS Glue Registries that you use with the Amazon Athena Kafka connector.
+ For best performance, use only lowercase for your database names and table names. Using mixed casing causes the connector to perform a case-insensitive search that is more computationally intensive.

**To set up your Apache Kafka environment and AWS Glue Schema Registry**

1. Set up your Apache Kafka environment.

1. Upload the Kafka topic description file (that is, its schema) in JSON format to the AWS Glue Schema Registry. For more information, see [Integrating with AWS Glue Schema Registry](https://docs.aws.amazon.com/glue/latest/dg/schema-registry-integrations.html) in the AWS Glue Developer Guide.

1. To use the `AVRO` or `PROTOBUF` data format when you define the schema in the AWS Glue Schema Registry:
   + For **Schema name**, enter the Kafka topic name in the same casing as the original.
   + For **Data format**, choose **Apache Avro** or **Protocol Buffers**.

    For example schemas, see the following section.

### Schema examples for the AWS Glue Schema Registry


Use the format of the examples in this section when you upload your schema to the [AWS Glue Schema Registry](https://docs.aws.amazon.com/glue/latest/dg/schema-registry.html).

#### JSON type schema example


In the following example, the schema to be created in the AWS Glue Schema Registry specifies `json` as the value for `dataFormat` and uses `datatypejson` for `topicName`.

**Note**  
The value for `topicName` should use the same casing as the topic name in Kafka. 

```
{
  "topicName": "datatypejson",
  "message": {
    "dataFormat": "json",
    "fields": [
      {
        "name": "intcol",
        "mapping": "intcol",
        "type": "INTEGER"
      },
      {
        "name": "varcharcol",
        "mapping": "varcharcol",
        "type": "VARCHAR"
      },
      {
        "name": "booleancol",
        "mapping": "booleancol",
        "type": "BOOLEAN"
      },
      {
        "name": "bigintcol",
        "mapping": "bigintcol",
        "type": "BIGINT"
      },
      {
        "name": "doublecol",
        "mapping": "doublecol",
        "type": "DOUBLE"
      },
      {
        "name": "smallintcol",
        "mapping": "smallintcol",
        "type": "SMALLINT"
      },
      {
        "name": "tinyintcol",
        "mapping": "tinyintcol",
        "type": "TINYINT"
      },
      {
        "name": "datecol",
        "mapping": "datecol",
        "type": "DATE",
        "formatHint": "yyyy-MM-dd"
      },
      {
        "name": "timestampcol",
        "mapping": "timestampcol",
        "type": "TIMESTAMP",
        "formatHint": "yyyy-MM-dd HH:mm:ss.SSS"
      }
    ]
  }
}
```
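As a quick sanity check before uploading, the structural rules above (a `topicName` whose casing matches the Kafka topic, and `name`, `mapping`, and `type` on every field) can be verified with a short script. This is an illustrative helper, not part of the connector:

```python
import json

def validate_topic_schema(doc: str, kafka_topic: str) -> list:
    """Return a list of problems found in a Schema Registry JSON document."""
    schema = json.loads(doc)
    problems = []
    # topicName must match the Kafka topic name, including casing.
    if schema.get("topicName") != kafka_topic:
        problems.append("topicName does not match the Kafka topic (check casing)")
    for field in schema.get("message", {}).get("fields", []):
        for key in ("name", "mapping", "type"):
            if key not in field:
                problems.append(f"field {field.get('name', '?')} is missing '{key}'")
    return problems

doc = ('{"topicName": "datatypejson", "message": {"dataFormat": "json", '
       '"fields": [{"name": "intcol", "mapping": "intcol", "type": "INTEGER"}]}}')
print(validate_topic_schema(doc, "datatypejson"))  # no problems
print(validate_topic_schema(doc, "DataTypeJSON"))  # casing mismatch reported
```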

#### CSV type schema example


In the following example, the schema to be created in the AWS Glue Schema Registry specifies `csv` as the value for `dataFormat` and uses `datatypecsvbulk` for `topicName`. The value for `topicName` should use the same casing as the topic name in Kafka.

```
{
  "topicName": "datatypecsvbulk",
  "message": {
    "dataFormat": "csv",
    "fields": [
      {
        "name": "intcol",
        "type": "INTEGER",
        "mapping": "0"
      },
      {
        "name": "varcharcol",
        "type": "VARCHAR",
        "mapping": "1"
      },
      {
        "name": "booleancol",
        "type": "BOOLEAN",
        "mapping": "2"
      },
      {
        "name": "bigintcol",
        "type": "BIGINT",
        "mapping": "3"
      },
      {
        "name": "doublecol",
        "type": "DOUBLE",
        "mapping": "4"
      },
      {
        "name": "smallintcol",
        "type": "SMALLINT",
        "mapping": "5"
      },
      {
        "name": "tinyintcol",
        "type": "TINYINT",
        "mapping": "6"
      },
      {
        "name": "floatcol",
        "type": "DOUBLE",
        "mapping": "7"
      }
    ]
  }
}
```
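In the CSV case, each `mapping` value is the zero-based position of the column in the message. A hypothetical decoder using the first few fields of the schema above shows how the ordinal mapping works; the converter functions are assumptions chosen for illustration, not the connector's actual parsers:

```python
import csv
import io

# Field definitions taken from the schema above.
FIELDS = [
    {"name": "intcol",     "type": "INTEGER", "mapping": "0"},
    {"name": "varcharcol", "type": "VARCHAR", "mapping": "1"},
    {"name": "booleancol", "type": "BOOLEAN", "mapping": "2"},
]
# Illustrative converters from CSV text to typed values.
CONVERTERS = {"INTEGER": int, "VARCHAR": str, "BOOLEAN": lambda v: v.lower() == "true"}

def decode_row(line: str) -> dict:
    cells = next(csv.reader(io.StringIO(line)))
    return {
        f["name"]: CONVERTERS[f["type"]](cells[int(f["mapping"])])
        for f in FIELDS
    }

print(decode_row("42,hello,true"))
# {'intcol': 42, 'varcharcol': 'hello', 'booleancol': True}
```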

#### AVRO type schema example


The following example is used to create an AVRO-based schema in the AWS Glue Schema Registry. When you define the schema in the AWS Glue Schema Registry, for **Schema name**, you enter the Kafka topic name in the same casing as the original, and for **Data format**, you choose **Apache Avro**. Because you specify this information directly in the registry, the `dataFormat` and `topicName` fields are not required.

```
{
    "type": "record",
    "name": "avrotest",
    "namespace": "example.com",
    "fields": [{
            "name": "id",
            "type": "int"
        },
        {
            "name": "name",
            "type": "string"
        }
    ]
}
```

#### PROTOBUF type schema example


The following example is used to create a PROTOBUF-based schema in the AWS Glue Schema Registry. When you define the schema in the AWS Glue Schema Registry, for **Schema name**, you enter the Kafka topic name in the same casing as the original, and for **Data format**, you choose **Protocol Buffers**. Because you specify this information directly in the registry, the `dataFormat` and `topicName` fields are not required. The first line defines the schema as PROTOBUF.

```
syntax = "proto3";
message protobuftest {
string name = 1;
int64 calories = 2;
string colour = 3;
}
```

For more information about adding a registry and schemas in the AWS Glue Schema Registry, see [Getting started with Schema Registry](https://docs.aws.amazon.com/glue/latest/dg/schema-registry-gs.html) in the AWS Glue documentation.

### Configuring authentication for the Athena Kafka connector


You can use a variety of methods to authenticate to your Apache Kafka cluster, including SSL, SASL/SCRAM, SASL/PLAIN, and SASL/PLAINTEXT.

The following table shows the authentication types for the connector and the security protocol and SASL mechanism for each. For more information, see the [Security](https://kafka.apache.org/documentation/#security) section of the Apache Kafka documentation.


****  

| auth_type | security.protocol | sasl.mechanism | Cluster type compatibility | 
| --- | --- | --- | --- | 
| SASL_SSL_PLAIN | SASL_SSL | PLAIN |  [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/athena/latest/ug/connectors-kafka.html)  | 
| SASL_PLAINTEXT_PLAIN | SASL_PLAINTEXT | PLAIN |  [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/athena/latest/ug/connectors-kafka.html)  | 
| SASL_SSL_SCRAM_SHA512 | SASL_SSL | SCRAM-SHA-512 |  [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/athena/latest/ug/connectors-kafka.html)  | 
| SASL_PLAINTEXT_SCRAM_SHA512 | SASL_PLAINTEXT | SCRAM-SHA-512 |  [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/athena/latest/ug/connectors-kafka.html)  | 
| SSL | SSL | N/A |  [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/athena/latest/ug/connectors-kafka.html)  | 

#### SSL


If the cluster is SSL authenticated, you must generate the trust store and key store files and upload them to an Amazon S3 bucket. You must provide this Amazon S3 reference when you deploy the connector. The key store, trust store, and SSL key are stored in AWS Secrets Manager. You provide the AWS secret key when you deploy the connector.

For information on creating a secret in Secrets Manager, see [Create an AWS Secrets Manager secret](https://docs.aws.amazon.com/secretsmanager/latest/userguide/create_secret.html).

To use this authentication type, set the environment variables as shown in the following table.


****  

| Parameter | Value | 
| --- | --- | 
| auth_type | SSL | 
| certificates_s3_reference | The Amazon S3 location that contains the certificates. | 
| secrets_manager_secret | The name of your AWS secret key. | 

After you create a secret in Secrets Manager, you can view it in the Secrets Manager console.

**To view your secret in Secrets Manager**

1. Open the Secrets Manager console at [https://console.aws.amazon.com/secretsmanager/](https://console.aws.amazon.com/secretsmanager/).

1. In the navigation pane, choose **Secrets**.

1. On the **Secrets** page, choose the link to your secret.

1. On the details page for your secret, choose **Retrieve secret value**.

   The following image shows an example secret with three key/value pairs: `keystore_password`, `truststore_password`, and `ssl_key_password`.  
![\[Retrieving an SSL secret in Secrets Manager\]](http://docs.aws.amazon.com/athena/latest/ug/images/connectors-kafka-setup-1.png)
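The three key/value pairs can be assembled into a single JSON secret string before you store it in Secrets Manager. A minimal sketch follows; the key names match the example secret above, while the values and the secret name in the comment are placeholders:

```python
import json

# Placeholder values -- substitute your real store passwords.
secret_string = json.dumps({
    "keystore_password": "example-keystore-pass",
    "truststore_password": "example-truststore-pass",
    "ssl_key_password": "example-key-pass",
})

print(secret_string)
# Store the string with the AWS CLI, for example:
#   aws secretsmanager create-secret --name my-kafka-ssl --secret-string "$SECRET"
```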

For more information about using SSL with Kafka, see [Encryption and Authentication using SSL](https://kafka.apache.org/documentation/#security_ssl) in the Apache Kafka documentation.

#### SASL/SCRAM


If your cluster uses SCRAM authentication, provide the Secrets Manager key that is associated with the cluster when you deploy the connector. The user name and password stored in the secret are used to authenticate with the cluster.

Set the environment variables as shown in the following table.


****  

| Parameter | Value | 
| --- | --- | 
| auth_type | SASL_SSL_SCRAM_SHA512 | 
| secrets_manager_secret | The name of your AWS secret key. | 

The following image shows an example secret in the Secrets Manager console with two key/value pairs: one for `username`, and one for `password`.

![\[Retrieving a SCRAM secret in Secrets Manager\]](http://docs.aws.amazon.com/athena/latest/ug/images/connectors-kafka-setup-2.png)


For more information about using SASL/SCRAM with Kafka, see [Authentication using SASL/SCRAM](https://kafka.apache.org/documentation/#security_sasl_scram) in the Apache Kafka documentation.

## License information


By using this connector, you acknowledge the inclusion of third party components, a list of which can be found in the [pom.xml](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-kafka/pom.xml) file for this connector, and agree to the terms in the respective third party licenses provided in the [LICENSE.txt](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-kafka/LICENSE.txt) file on GitHub.com.

## Additional resources


For additional information about this connector, visit [the corresponding site](https://github.com/awslabs/aws-athena-query-federation/tree/master/athena-kafka) on GitHub.com.

# Amazon Athena MSK connector
MSK

The Amazon Athena connector for [Amazon MSK](https://aws.amazon.com/msk/) enables Amazon Athena to run SQL queries on your Apache Kafka topics. Use this connector to view [Apache Kafka](https://kafka.apache.org/) topics as tables and messages as rows in Athena. For additional information, see [Analyze real-time streaming data in Amazon MSK with Amazon Athena](https://aws.amazon.com/blogs/big-data/analyze-real-time-streaming-data-in-amazon-msk-with-amazon-athena/) in the AWS Big Data Blog.

This connector does not use Glue Connections to centralize configuration properties in Glue. Connection configuration is done through Lambda.

## Prerequisites


Deploy the connector to your AWS account using the Athena console or the AWS Serverless Application Repository. For more information, see [Create a data source connection](connect-to-a-data-source.md) or [Use the AWS Serverless Application Repository to deploy a data source connector](connect-data-source-serverless-app-repo.md).

## Limitations

+ Write DDL operations are not supported.
+ The connector is subject to relevant Lambda limits. For more information, see [Lambda quotas](https://docs.aws.amazon.com/lambda/latest/dg/gettingstarted-limits.html) in the *AWS Lambda Developer Guide*.
+ Date and timestamp data types in filter conditions must be cast to appropriate data types.
+ Date and timestamp data types are not supported for the CSV file type and are treated as varchar values.
+ Mapping into nested JSON fields is not supported. The connector maps top-level fields only.
+ The connector does not support complex types. Complex types are interpreted as strings.
+ To extract or work with complex JSON values, use the JSON-related functions available in Athena. For more information, see [Extract JSON data from strings](extracting-data-from-JSON.md).
+ The connector does not support access to Kafka message metadata.

## Terms

+ **Metadata handler** – A Lambda handler that retrieves metadata from your database instance.
+ **Record handler** – A Lambda handler that retrieves data records from your database instance.
+ **Composite handler** – A Lambda handler that retrieves both metadata and data records from your database instance.
+ **Kafka endpoint** – A text string that establishes a connection to a Kafka instance.

## Cluster compatibility


The MSK connector can be used with the following cluster types.
+ **MSK Provisioned cluster** – You manually specify, monitor, and scale cluster capacity.
+ **MSK Serverless cluster** – Provides on-demand capacity that scales automatically as application I/O scales.
+ **Standalone Kafka** – A direct connection to Kafka (authenticated or unauthenticated).

## Supported authentication methods


The connector supports the following authentication methods.
+ [SASL/IAM](https://docs.aws.amazon.com/msk/latest/developerguide/iam-access-control.html) 
+ [SSL](https://docs.aws.amazon.com/msk/latest/developerguide/msk-authentication.html)
+ [SASL/SCRAM](https://docs.aws.amazon.com/msk/latest/developerguide/msk-password.html)
+ SASL/PLAIN
+ SASL/PLAINTEXT
+ NO_AUTH

  For more information, see [Configuring authentication for the Athena MSK connector](#connectors-msk-setup-configuring-authentication).

## Supported input data formats


The connector supports the following input data formats.
+ JSON
+ CSV

## Parameters


Use the parameters in this section to configure the Athena MSK connector.
+ **auth_type** – Specifies the authentication type of the cluster. The connector supports the following types of authentication:
  + **NO_AUTH** – Connect directly to Kafka with no authentication (for example, to a Kafka cluster deployed over an EC2 instance that does not use authentication).
  + **SASL_SSL_PLAIN** – This method uses the `SASL_SSL` security protocol and the `PLAIN` SASL mechanism.
  + **SASL_PLAINTEXT_PLAIN** – This method uses the `SASL_PLAINTEXT` security protocol and the `PLAIN` SASL mechanism.
**Note**  
The `SASL_SSL_PLAIN` and `SASL_PLAINTEXT_PLAIN` authentication types are supported by Apache Kafka but not by Amazon MSK.
  + **SASL_SSL_AWS_MSK_IAM** – IAM access control for Amazon MSK enables you to handle both authentication and authorization for your MSK cluster. Your user's AWS credentials (secret key and access key) are used to connect with the cluster. For more information, see [IAM access control](https://docs.aws.amazon.com/msk/latest/developerguide/iam-access-control.html) in the Amazon Managed Streaming for Apache Kafka Developer Guide.
  + **SASL_SSL_SCRAM_SHA512** – You can use this authentication type to control access to your Amazon MSK clusters. This method stores the user name and password in AWS Secrets Manager. The secret must be associated with the Amazon MSK cluster. For more information, see [Setting up SASL/SCRAM authentication for an Amazon MSK cluster](https://docs.aws.amazon.com/msk/latest/developerguide/msk-password.html#msk-password-tutorial) in the Amazon Managed Streaming for Apache Kafka Developer Guide.
  + **SSL** – SSL authentication uses key store and trust store files to connect with the Amazon MSK cluster. You must generate the trust store and key store files, upload them to an Amazon S3 bucket, and provide the reference to Amazon S3 when you deploy the connector. The key store, trust store, and SSL key are stored in AWS Secrets Manager. Your client must provide the AWS secret key when the connector is deployed. For more information, see [Mutual TLS authentication](https://docs.aws.amazon.com/msk/latest/developerguide/msk-authentication.html) in the Amazon Managed Streaming for Apache Kafka Developer Guide.

    For more information, see [Configuring authentication for the Athena MSK connector](#connectors-msk-setup-configuring-authentication).
+ **certificates_s3_reference** – The Amazon S3 location that contains the certificates (the key store and trust store files).
+ **disable_spill_encryption** – (Optional) When set to `True`, disables spill encryption. Defaults to `False`, so that data spilled to Amazon S3 is encrypted using AES-GCM, with either a randomly generated key or AWS KMS used to generate keys. Disabling spill encryption can improve performance, especially if your spill location uses [server-side encryption](https://docs.aws.amazon.com/AmazonS3/latest/userguide/serv-side-encryption.html).
+ **kafka_endpoint** – The endpoint details to provide to Kafka. For example, for an Amazon MSK cluster, you provide a [bootstrap URL](https://docs.aws.amazon.com/msk/latest/developerguide/msk-get-bootstrap-brokers.html) for the cluster.
+ **secrets_manager_secret** – The name of the AWS secret in which the credentials are saved. This parameter is not required for IAM authentication.
+ **Spill parameters** – Lambda functions temporarily store ("spill") data that does not fit in memory to Amazon S3. All database instances accessed by the same Lambda function spill to the same location. Use the parameters in the following table to specify the spill location.  
****    
[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/athena/latest/ug/connectors-msk.html)

## Data type support


The following table shows the corresponding data types supported for Kafka and Apache Arrow.


****  

| Kafka | Arrow | 
| --- | --- | 
| CHAR | VARCHAR | 
| VARCHAR | VARCHAR | 
| TIMESTAMP | MILLISECOND | 
| DATE | DAY | 
| BOOLEAN | BOOL | 
| SMALLINT | SMALLINT | 
| INTEGER | INT | 
| BIGINT | BIGINT | 
| DECIMAL | FLOAT8 | 
| DOUBLE | FLOAT8 | 
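The mapping above can be expressed as a simple lookup table. The following sketch resolves a declared Kafka type to its Arrow type; the helper function is illustrative, not part of the connector:

```python
# Kafka-to-Arrow type mapping, copied from the table above.
KAFKA_TO_ARROW = {
    "CHAR": "VARCHAR", "VARCHAR": "VARCHAR",
    "TIMESTAMP": "MILLISECOND", "DATE": "DAY", "BOOLEAN": "BOOL",
    "SMALLINT": "SMALLINT", "INTEGER": "INT", "BIGINT": "BIGINT",
    "DECIMAL": "FLOAT8", "DOUBLE": "FLOAT8",
}

def arrow_type(kafka_type: str) -> str:
    """Look up the Arrow type for a Kafka type, case-insensitively."""
    try:
        return KAFKA_TO_ARROW[kafka_type.upper()]
    except KeyError:
        raise ValueError(f"unsupported Kafka type: {kafka_type}")

# Note that DECIMAL maps to FLOAT8, a floating-point representation.
print(arrow_type("decimal"))  # FLOAT8
```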

## Partitions and splits


Kafka topics are split into partitions. Each partition is ordered. Each message in a partition has an incremental ID called an *offset*. Each Kafka partition is further divided into multiple splits for parallel processing. Data is available for the retention period configured in Kafka clusters.
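The offset-range idea can be sketched as follows: given a partition's start and end offsets, the range is divided into contiguous splits that can be read in parallel. The function name and split size here are illustrative assumptions, not the connector's internals:

```python
def plan_splits(start_offset: int, end_offset: int, max_split_size: int):
    """Divide one partition's offset range [start, end) into contiguous splits."""
    splits = []
    pos = start_offset
    while pos < end_offset:
        splits.append((pos, min(pos + max_split_size, end_offset)))
        pos += max_split_size
    return splits

# A partition holding offsets 0..999, read in splits of up to 400 messages.
print(plan_splits(0, 1000, 400))  # [(0, 400), (400, 800), (800, 1000)]
```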

## Best practices


As a best practice, use predicate pushdown when you query Athena, as in the following examples.

```
SELECT * 
FROM "msk_catalog_name"."glue_schema_registry_name"."glue_schema_name" 
WHERE integercol = 2147483647
```

```
SELECT * 
FROM "msk_catalog_name"."glue_schema_registry_name"."glue_schema_name" 
WHERE timestampcol >= TIMESTAMP '2018-03-25 07:30:58.878'
```

## Setting up the MSK connector


Before you can use the connector, you must set up your Amazon MSK cluster, use the [AWS Glue Schema Registry](https://docs.aws.amazon.com/glue/latest/dg/schema-registry.html) to define your schema, and configure authentication for the connector.

**Note**  
If you deploy the connector into a VPC in order to access private resources and also want to connect to a publicly accessible service like Confluent, you must associate the connector with a private subnet that has a NAT Gateway. For more information, see [NAT gateways](https://docs.aws.amazon.com/vpc/latest/userguide/vpc-nat-gateway.html) in the Amazon VPC User Guide.

When working with the AWS Glue Schema Registry, note the following points:
+ Make sure that the text in the **Description** field of the AWS Glue Schema Registry includes the string `{AthenaFederationMSK}`. This marker string is required for AWS Glue Registries that you use with the Amazon Athena MSK connector.
+ For best performance, use only lowercase for your database names and table names. Using mixed casing causes the connector to perform a case-insensitive search that is more computationally intensive.

**To set up your Amazon MSK environment and AWS Glue Schema Registry**

1. Set up your Amazon MSK environment. For information and steps, see [Setting up Amazon MSK](https://docs.aws.amazon.com/msk/latest/developerguide/before-you-begin.html) and [Getting started using Amazon MSK](https://docs.aws.amazon.com/msk/latest/developerguide/getting-started.html) in the Amazon Managed Streaming for Apache Kafka Developer Guide.

1. Upload the Kafka topic description file (that is, its schema) in JSON format to the AWS Glue Schema Registry. For more information, see [Integrating with AWS Glue Schema Registry](https://docs.aws.amazon.com/glue/latest/dg/schema-registry-integrations.html) in the AWS Glue Developer Guide. For example schemas, see the following section.

### Schema examples for the AWS Glue Schema Registry


Use the format of the examples in this section when you upload your schema to the [AWS Glue Schema Registry](https://docs.aws.amazon.com/glue/latest/dg/schema-registry.html).

#### JSON type schema example


In the following example, the schema to be created in the AWS Glue Schema Registry specifies `json` as the value for `dataFormat` and uses `datatypejson` for `topicName`.

**Note**  
The value for `topicName` should use the same casing as the topic name in Kafka. 

```
{
  "topicName": "datatypejson",
  "message": {
    "dataFormat": "json",
    "fields": [
      {
        "name": "intcol",
        "mapping": "intcol",
        "type": "INTEGER"
      },
      {
        "name": "varcharcol",
        "mapping": "varcharcol",
        "type": "VARCHAR"
      },
      {
        "name": "booleancol",
        "mapping": "booleancol",
        "type": "BOOLEAN"
      },
      {
        "name": "bigintcol",
        "mapping": "bigintcol",
        "type": "BIGINT"
      },
      {
        "name": "doublecol",
        "mapping": "doublecol",
        "type": "DOUBLE"
      },
      {
        "name": "smallintcol",
        "mapping": "smallintcol",
        "type": "SMALLINT"
      },
      {
        "name": "tinyintcol",
        "mapping": "tinyintcol",
        "type": "TINYINT"
      },
      {
        "name": "datecol",
        "mapping": "datecol",
        "type": "DATE",
        "formatHint": "yyyy-MM-dd"
      },
      {
        "name": "timestampcol",
        "mapping": "timestampcol",
        "type": "TIMESTAMP",
        "formatHint": "yyyy-MM-dd HH:mm:ss.SSS"
      }
    ]
  }
}
```

#### CSV type schema example


In the following example, the schema to be created in the AWS Glue Schema Registry specifies `csv` as the value for `dataFormat` and uses `datatypecsvbulk` for `topicName`. The value for `topicName` should use the same casing as the topic name in Kafka.

```
{
  "topicName": "datatypecsvbulk",
  "message": {
    "dataFormat": "csv",
    "fields": [
      {
        "name": "intcol",
        "type": "INTEGER",
        "mapping": "0"
      },
      {
        "name": "varcharcol",
        "type": "VARCHAR",
        "mapping": "1"
      },
      {
        "name": "booleancol",
        "type": "BOOLEAN",
        "mapping": "2"
      },
      {
        "name": "bigintcol",
        "type": "BIGINT",
        "mapping": "3"
      },
      {
        "name": "doublecol",
        "type": "DOUBLE",
        "mapping": "4"
      },
      {
        "name": "smallintcol",
        "type": "SMALLINT",
        "mapping": "5"
      },
      {
        "name": "tinyintcol",
        "type": "TINYINT",
        "mapping": "6"
      },
      {
        "name": "floatcol",
        "type": "DOUBLE",
        "mapping": "7"
      }
    ]
  }
}
```

### Configuring authentication for the Athena MSK connector


You can use a variety of methods to authenticate to your Amazon MSK cluster, including IAM, SSL, SCRAM, and standalone Kafka.

The following table shows the authentication types for the connector and the security protocol and SASL mechanism for each. For more information, see [Authentication and authorization for Apache Kafka APIs](https://docs.aws.amazon.com/msk/latest/developerguide/kafka_apis_iam.html) in the Amazon Managed Streaming for Apache Kafka Developer Guide.


****  

| auth_type | security.protocol | sasl.mechanism | 
| --- | --- | --- | 
| SASL_SSL_PLAIN | SASL_SSL | PLAIN | 
| SASL_PLAINTEXT_PLAIN | SASL_PLAINTEXT | PLAIN | 
| SASL_SSL_AWS_MSK_IAM | SASL_SSL | AWS_MSK_IAM | 
| SASL_SSL_SCRAM_SHA512 | SASL_SSL | SCRAM-SHA-512 | 
| SSL | SSL | N/A | 

**Note**  
The `SASL_SSL_PLAIN` and `SASL_PLAINTEXT_PLAIN` authentication types are supported by Apache Kafka but not by Amazon MSK.

#### SASL/IAM


If the cluster uses IAM authentication, you must configure the IAM policy for the user when you set up the cluster. For more information, see [IAM access control](https://docs.aws.amazon.com/msk/latest/developerguide/IAM-access-control.html) in the Amazon Managed Streaming for Apache Kafka Developer Guide.

To use this authentication type, set the `auth_type` Lambda environment variable for the connector to `SASL_SSL_AWS_MSK_IAM`. 

#### SSL


If the cluster is SSL authenticated, you must generate the trust store and key store files and upload them to an Amazon S3 bucket. You must provide this Amazon S3 reference when you deploy the connector. The key store, trust store, and SSL key are stored in AWS Secrets Manager. You provide the AWS secret key when you deploy the connector.

For information on creating a secret in Secrets Manager, see [Create an AWS Secrets Manager secret](https://docs.aws.amazon.com/secretsmanager/latest/userguide/create_secret.html).

To use this authentication type, set the environment variables as shown in the following table.


****  

| Parameter | Value | 
| --- | --- | 
| auth_type | SSL | 
| certificates_s3_reference | The Amazon S3 location that contains the certificates. | 
| secrets_manager_secret | The name of your AWS secret key. | 

After you create a secret in Secrets Manager, you can view it in the Secrets Manager console.

**To view your secret in Secrets Manager**

1. Open the Secrets Manager console at [https://console.aws.amazon.com/secretsmanager/](https://console.aws.amazon.com/secretsmanager/).

1. In the navigation pane, choose **Secrets**.

1. On the **Secrets** page, choose the link to your secret.

1. On the details page for your secret, choose **Retrieve secret value**.

   The following image shows an example secret with three key/value pairs: `keystore_password`, `truststore_password`, and `ssl_key_password`.  
![\[Retrieving an SSL secret in Secrets Manager\]](http://docs.aws.amazon.com/athena/latest/ug/images/connectors-msk-setup-1.png)

#### SASL/SCRAM


If your cluster uses SCRAM authentication, provide the Secrets Manager key that is associated with the cluster when you deploy the connector. The user name and password stored in the secret are used to authenticate with the cluster.

Set the environment variables as shown in the following table.


****  

| Parameter | Value | 
| --- | --- | 
| auth_type | SASL_SSL_SCRAM_SHA512 | 
| secrets_manager_secret | The name of your AWS secret key. | 

The following image shows an example secret in the Secrets Manager console with two key/value pairs: one for `username`, and one for `password`.

![\[Retrieving a SCRAM secret in Secrets Manager\]](http://docs.aws.amazon.com/athena/latest/ug/images/connectors-msk-setup-2.png)


## License information


By using this connector, you acknowledge the inclusion of third party components, a list of which can be found in the [pom.xml](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-msk/pom.xml) file for this connector, and agree to the terms in the respective third party licenses provided in the [LICENSE.txt](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-msk/LICENSE.txt) file on GitHub.com.

## Additional resources


For additional information about this connector, visit [the corresponding site](https://github.com/awslabs/aws-athena-query-federation/tree/master/athena-msk) on GitHub.com.

# Amazon Athena MySQL connector
MySQL

The Amazon Athena Lambda MySQL connector enables Amazon Athena to access MySQL databases.

This connector can be registered with Glue Data Catalog as a federated catalog. It supports data access controls defined in Lake Formation at the catalog, database, table, column, row, and tag levels. This connector uses Glue Connections to centralize configuration properties in Glue.

## Prerequisites

+ Deploy the connector to your AWS account using the Athena console or the AWS Serverless Application Repository. For more information, see [Create a data source connection](connect-to-a-data-source.md) or [Use the AWS Serverless Application Repository to deploy a data source connector](connect-data-source-serverless-app-repo.md).
+ Set up a VPC and a security group before you use this connector. For more information, see [Create a VPC for a data source connector or AWS Glue connection](athena-connectors-vpc-creation.md).

## Limitations

+ Write DDL operations are not supported.
+ In a multiplexer setup, the spill bucket and prefix are shared across all database instances.
+ The connector is subject to relevant Lambda limits. For more information, see [Lambda quotas](https://docs.aws.amazon.com/lambda/latest/dg/gettingstarted-limits.html) in the *AWS Lambda Developer Guide*.
+ Because Athena converts queries to lower case, MySQL table names must be in lower case. For example, Athena queries against a table named `myTable` will fail.
+ If you migrate your MySQL connections to Glue Catalog and Lake Formation, only the lowercase table and column names will be recognized. 

## Terms


The following terms relate to the MySQL connector.
+ **Database instance** – Any instance of a database deployed on premises, on Amazon EC2, or on Amazon RDS.
+ **Handler** – A Lambda handler that accesses your database instance. A handler can be for metadata or for data records.
+ **Metadata handler** – A Lambda handler that retrieves metadata from your database instance.
+ **Record handler** – A Lambda handler that retrieves data records from your database instance.
+ **Composite handler** – A Lambda handler that retrieves both metadata and data records from your database instance.
+ **Property or parameter** – A database property used by handlers to extract database information. You configure these properties as Lambda environment variables.
+ **Connection String** – A string of text used to establish a connection to a database instance.
+ **Catalog** – A non-AWS Glue catalog registered with Athena that is a required prefix for the `connection_string` property.
+ **Multiplexing handler** – A Lambda handler that can accept and use multiple database connections.

## Parameters


Use the parameters in this section to configure the MySQL connector.

**Note**  
Athena data source connectors created on December 3, 2024 and later use AWS Glue connections.

### Glue connections (recommended)


We recommend that you configure a MySQL connector by using a Glue connection object.

To do this, set the `glue_connection` environment variable of the MySQL connector Lambda to the name of the Glue connection to use.

Use the following command to get the schema for a Glue connection object. This schema contains all the parameters that you can use to control your connection.

```
aws glue describe-connection-type --connection-type MYSQL
```

**Lambda environment properties**

**glue_connection** – Specifies the name of the Glue connection associated with the federated connector. 

**Note**  
All connectors that use Glue connections must use AWS Secrets Manager to store credentials.
The MySQL connector created using Glue connections does not support the use of a multiplexing handler.
The MySQL connector created using Glue connections only supports `ConnectionSchemaVersion` 2.

### Legacy connections


The parameter names and definitions listed below are for Athena data source connectors created without an associated Glue connection. Use the following parameters only when you [manually deploy](connect-data-source-serverless-app-repo.md) an earlier version of an Athena data source connector or when the `glue_connection` environment property is not specified.

#### Connection string


Use a JDBC connection string in the following format to connect to a database instance.

```
mysql://${jdbc_connection_string}
```

**Note**  
If you receive the error `java.sql.SQLException: Zero date value prohibited` when running a `SELECT` query on a MySQL table, add the following parameter to your connection string:  

```
zeroDateTimeBehavior=convertToNull
```
For more information, see [Error 'Zero date value prohibited' while trying to select from MySQL table](https://github.com/awslabs/aws-athena-query-federation/issues/760) on GitHub.com.
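Appending the parameter is simple string handling; the following sketch adds a query parameter to a JDBC-style connection string (the host and credentials shown are placeholders):

```python
def add_jdbc_param(connection_string: str, param: str) -> str:
    """Append a query parameter to a JDBC-style connection string."""
    separator = "&" if "?" in connection_string else "?"
    return connection_string + separator + param

base = "mysql://jdbc:mysql://mysql1.host:3306/default?user=sample&password=sample"
print(add_jdbc_param(base, "zeroDateTimeBehavior=convertToNull"))
```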

#### Using a multiplexing handler


You can use a multiplexer to connect to multiple database instances with a single Lambda function. Requests are routed by catalog name. Use the following classes in Lambda.


****  

| Handler | Class | 
| --- | --- | 
| Composite handler | MySqlMuxCompositeHandler | 
| Metadata handler | MySqlMuxMetadataHandler | 
| Record handler | MySqlMuxRecordHandler | 

##### Multiplexing handler parameters



****  

| Parameter | Description | 
| --- | --- | 
| <catalog>_connection_string | Required. A database instance connection string. Prefix the environment variable with the name of the catalog used in Athena. For example, if the catalog registered with Athena is mymysqlcatalog, then the environment variable name is mymysqlcatalog_connection_string. | 
| default | Required. The default connection string. This string is used when the catalog is lambda:${AWS_LAMBDA_FUNCTION_NAME}. | 

The following example properties are for a MySql MUX Lambda function that supports two database instances: `mysql1` (the default), and `mysql2`.


****  

| Property | Value | 
| --- | --- | 
| default | mysql://jdbc:mysql://mysql2.host:3333/default?user=sample2&password=sample2 | 
| mysql_catalog1_connection_string | mysql://jdbc:mysql://mysql1.host:3306/default?${Test/RDS/MySql1} | 
| mysql_catalog2_connection_string | mysql://jdbc:mysql://mysql2.host:3333/default?user=sample2&password=sample2 | 

##### Providing credentials


To provide a user name and password for your database in your JDBC connection string, you can use connection string properties or AWS Secrets Manager.
+ **Connection String** – A user name and password can be specified as properties in the JDBC connection string.
**Important**  
As a security best practice, don't use hardcoded credentials in your environment variables or connection strings. For information about moving your hardcoded secrets to AWS Secrets Manager, see [Move hardcoded secrets to AWS Secrets Manager](https://docs.aws.amazon.com/secretsmanager/latest/userguide/hardcoded.html) in the *AWS Secrets Manager User Guide*.
+ **AWS Secrets Manager** – To use the Athena Federated Query feature with AWS Secrets Manager, the VPC connected to your Lambda function should have [internet access](https://aws.amazon.com/premiumsupport/knowledge-center/internet-access-lambda-function/) or a [VPC endpoint](https://docs.aws.amazon.com/secretsmanager/latest/userguide/vpc-endpoint-overview.html) to connect to Secrets Manager.

  You can put the name of a secret in AWS Secrets Manager in your JDBC connection string. The connector replaces the secret name with the `username` and `password` values from Secrets Manager.

  For Amazon RDS database instances, this support is tightly integrated. If you use Amazon RDS, we highly recommend using AWS Secrets Manager and credential rotation. If your database does not use Amazon RDS, store the credentials as JSON in the following format:

  ```
  {"username": "${username}", "password": "${password}"}
  ```

**Example connection string with secret name**  
The following string has the secret name `${Test/RDS/MySql1}`.

```
mysql://jdbc:mysql://mysql1.host:3306/default?...&${Test/RDS/MySql1}&...
```

The connector uses the secret name to retrieve secrets and provide the user name and password, as in the following example.

```
mysql://jdbc:mysql://mysql1.host:3306/default?...&user=sample2&password=sample2&...
```

Currently, the MySQL connector recognizes the `user` and `password` JDBC properties.
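
Conceptually, the secret-name substitution works like the following sketch. This is illustrative Python, not the connector's actual implementation; the secret name, credentials, and the `secrets` lookup table are hypothetical stand-ins for AWS Secrets Manager.

```python
import re

def resolve_secret(connection_string, secrets):
    """Replace each ${SecretName} placeholder with user/password properties.

    `secrets` stands in for a Secrets Manager lookup and maps secret
    names to {"username": ..., "password": ...} dictionaries.
    """
    def substitute(match):
        secret = secrets[match.group(1)]
        return "user={username}&password={password}".format(**secret)

    return re.sub(r"\$\{([^}]+)\}", substitute, connection_string)

# Hypothetical secret that mirrors the example above.
secrets = {"Test/RDS/MySql1": {"username": "sample2", "password": "sample2"}}
print(resolve_secret(
    "mysql://jdbc:mysql://mysql1.host:3306/default?${Test/RDS/MySql1}",
    secrets))
# mysql://jdbc:mysql://mysql1.host:3306/default?user=sample2&password=sample2
```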

#### Using a single connection handler


You can use the following single connection metadata and record handlers to connect to a single MySQL instance.



| Handler type | Class | 
| --- | --- | 
| Composite handler | MySqlCompositeHandler | 
| Metadata handler | MySqlMetadataHandler | 
| Record handler | MySqlRecordHandler | 

##### Single connection handler parameters




| Parameter | Description | 
| --- | --- | 
| default | Required. The default connection string. | 

The single connection handlers support one database instance and must provide a `default` connection string parameter. All other connection strings are ignored.

The following example property is for a single MySQL instance supported by a Lambda function.



| Property | Value | 
| --- | --- | 
| default | mysql://mysql1.host:3306/default?secret=Test/RDS/MySql1 | 

#### Spill parameters


The Lambda SDK can spill data to Amazon S3. All database instances accessed by the same Lambda function spill to the same location.


| Parameter | Description | 
| --- | --- | 
| spill_bucket | Required. Spill bucket name. | 
| spill_prefix | Required. Spill bucket key prefix. | 
| spill_put_request_headers | (Optional) A JSON encoded map of request headers and values for the Amazon S3 putObject request that is used for spilling (for example, {"x-amz-server-side-encryption" : "AES256"}). For other possible headers, see [PutObject](https://docs.aws.amazon.com/AmazonS3/latest/API/API_PutObject.html) in the *Amazon Simple Storage Service API Reference*. | 

## Data type support


The following table shows the corresponding data types for JDBC and Arrow.



| JDBC | Arrow | 
| --- | --- | 
| Boolean | Bit | 
| Integer | Tiny | 
| Short | Smallint | 
| Integer | Int | 
| Long | Bigint | 
| float | Float4 | 
| Double | Float8 | 
| Date | DateDay | 
| Timestamp | DateMilli | 
| String | Varchar | 
| Bytes | Varbinary | 
| BigDecimal | Decimal | 
| ARRAY | List | 

## Partitions and splits


Partitions are used to determine how to generate splits for the connector. Athena constructs a synthetic column of type `varchar` that represents the partitioning scheme for the table to help the connector generate splits. The connector does not modify the actual table definition.

## Performance


MySQL supports native partitions. The Athena MySQL connector can retrieve data from these partitions in parallel. If you want to query very large datasets with uniform partition distribution, native partitioning is highly recommended.

The Athena MySQL connector performs predicate pushdown to decrease the data scanned by the query. `LIMIT` clauses, simple predicates, and complex expressions are pushed down to the connector to reduce the amount of data scanned and decrease query execution run time. 

### LIMIT clauses


A `LIMIT N` statement reduces the data scanned by the query. With `LIMIT N` pushdown, the connector returns only `N` rows to Athena.

### Predicates


A predicate is an expression in the `WHERE` clause of a SQL query that evaluates to a Boolean value and filters rows based on multiple conditions. The Athena MySQL connector can combine these expressions and push them directly to MySQL for enhanced functionality and to reduce the amount of data scanned.

The following Athena MySQL connector operators support predicate pushdown:
+ **Boolean:** AND, OR, NOT
+ **Equality:** EQUAL, NOT_EQUAL, LESS_THAN, LESS_THAN_OR_EQUAL, GREATER_THAN, GREATER_THAN_OR_EQUAL, IS_DISTINCT_FROM, NULL_IF, IS_NULL
+ **Arithmetic:** ADD, SUBTRACT, MULTIPLY, DIVIDE, MODULUS, NEGATE
+ **Other:** LIKE_PATTERN, IN

### Combined pushdown example


For enhanced querying capabilities, combine the pushdown types, as in the following example:

```
SELECT * 
FROM my_table 
WHERE col_a > 10 
    AND ((col_a + col_b) > (col_c % col_d))
    AND (col_e IN ('val1', 'val2', 'val3') OR col_f LIKE '%pattern%') 
LIMIT 10;
```

For an article on using predicate pushdown to improve performance in federated queries, including MySQL, see [Improve federated queries with predicate pushdown in Amazon Athena](https://aws.amazon.com/blogs/big-data/improve-federated-queries-with-predicate-pushdown-in-amazon-athena/) in the *AWS Big Data Blog*.

## Passthrough queries


The MySQL connector supports [passthrough queries](federated-query-passthrough.md). Passthrough queries use a table function to push your full query down to the data source for execution.

To use passthrough queries with MySQL, you can use the following syntax:

```
SELECT * FROM TABLE(
        system.query(
            query => 'query string'
        ))
```

The following example query pushes down a query to a data source in MySQL. The query selects all columns in the `customer` table, limiting the results to 10.

```
SELECT * FROM TABLE(
        system.query(
            query => 'SELECT * FROM customer LIMIT 10'
        ))
```

## License information


By using this connector, you acknowledge the inclusion of third party components, a list of which can be found in the [pom.xml](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-mysql/pom.xml) file for this connector, and agree to the terms in the respective third party licenses provided in the [LICENSE.txt](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-mysql/LICENSE.txt) file on GitHub.com.

## Additional resources


For the latest JDBC driver version information, see the [pom.xml](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-mysql/pom.xml) file for the MySQL connector on GitHub.com.

For additional information about this connector, visit [the corresponding site](https://github.com/awslabs/aws-athena-query-federation/tree/master/athena-mysql) on GitHub.com.

# Amazon Athena Neptune connector
Neptune

Amazon Neptune is a fast, reliable, fully managed graph database service that makes it easy to build and run applications that work with highly connected datasets. Neptune's purpose-built, high-performance graph database engine stores billions of relationships optimally and queries graphs with a latency of only milliseconds. For more information, see the [Neptune User Guide](https://docs.aws.amazon.com/neptune/latest/userguide/intro.html).

The Amazon Athena Neptune Connector enables Athena to communicate with your Neptune graph database instance, making your Neptune graph data accessible by SQL queries.

This connector does not use Glue Connections to centralize configuration properties in Glue. Connection configuration is done through Lambda.

If you have Lake Formation enabled in your account, the IAM role for your Athena federated Lambda connector that you deployed in the AWS Serverless Application Repository must have read access in Lake Formation to the AWS Glue Data Catalog.

## Prerequisites


Using the Neptune connector requires the following three steps.
+ Setting up a Neptune cluster
+ Setting up an AWS Glue Data Catalog
+ Deploying the connector to your AWS account. For more information, see [Create a data source connection](connect-to-a-data-source.md) or [Use the AWS Serverless Application Repository to deploy a data source connector](connect-data-source-serverless-app-repo.md). For additional details specific to deploying the Neptune connector, see [Deploy the Amazon Athena Neptune Connector](https://github.com/awslabs/aws-athena-query-federation/tree/master/athena-neptune/docs/neptune-connector-setup) on GitHub.com.

## Limitations


Currently, the Neptune Connector has the following limitation.
+ Projecting columns, including the primary key (ID), is not supported. 

## Setting up a Neptune cluster


If you don't have an existing Amazon Neptune cluster and property graph dataset in it that you would like to use, you must set one up.

Make sure that the VPC that hosts your Neptune cluster has an internet gateway and a NAT gateway. The private subnets that the Neptune connector Lambda function uses must have a route to the internet through this NAT gateway. The Neptune connector Lambda function uses the NAT gateway to communicate with AWS Glue.

For instructions on setting up a new Neptune cluster and loading it with a sample dataset, see [Sample Neptune Cluster Setup](https://github.com/awslabs/aws-athena-query-federation/tree/master/athena-neptune/docs/neptune-cluster-setup) on GitHub.com.

## Setting up an AWS Glue Data Catalog


Unlike traditional relational data stores, Neptune graph DB nodes and edges do not use a set schema. Each entry can have different fields and data types. However, because the Neptune connector retrieves metadata from the AWS Glue Data Catalog, you must create an AWS Glue database that has tables with the required schema. After you create the AWS Glue database and tables, the connector can populate the list of tables available to query from Athena.

### Enabling case insensitive column matching


To resolve column names from your Neptune table with the correct casing even when the column names are all lower cased in AWS Glue, you can configure the Neptune connector for case insensitive matching.

To enable this feature, set the Neptune connector Lambda function environment variable `enable_caseinsensitivematch` to `true`. 

### Specifying the AWS Glue glabel table parameter for cased table names


Because AWS Glue supports only lowercase table names, you must specify the `glabel` AWS Glue table parameter when you create an AWS Glue table for a Neptune table whose name contains uppercase characters. 

In your AWS Glue table definition, include the `glabel` parameter and set its value to your table name with its original casing. This ensures that the correct casing is preserved when AWS Glue interacts with your Neptune table. The following example sets the value of `glabel` to the table name `Airport`.

```
glabel = Airport
```

![Setting the glabel AWS Glue table property to preserve table name casing for a Neptune table](http://docs.aws.amazon.com/athena/latest/ug/images/connectors-neptune-1.png)


For more information on setting up an AWS Glue Data Catalog to work with Neptune, see [Set up AWS Glue Catalog](https://github.com/awslabs/aws-athena-query-federation/tree/master/athena-neptune/docs/aws-glue-sample-scripts) on GitHub.com.

## Performance


The Athena Neptune connector performs predicate pushdown to decrease the data scanned by the query. However, predicates using the primary key result in query failure. `LIMIT` clauses reduce the amount of data scanned, but if you don't provide a predicate, you should expect `SELECT` queries with a `LIMIT` clause to scan at least 16 MB of data. The Neptune connector is resilient to throttling due to concurrency.

## Passthrough queries


The Neptune connector supports [passthrough queries](federated-query-passthrough.md). You can use this feature to run Gremlin queries on property graphs and to run SPARQL queries on RDF data.

To create passthrough queries with Neptune, use the following syntax:

```
SELECT * FROM TABLE(
        system.query(
            DATABASE => 'database_name',
            COLLECTION => 'collection_name',
            QUERY => 'query_string'
        ))
```

The following example Neptune passthrough query filters for airports with the code `ATL`. The doubled single quotes are for escaping.

```
SELECT * FROM TABLE(
        system.query(
            DATABASE => 'graph-database',
            COLLECTION => 'airport',
            QUERY => 'g.V().has(''airport'', ''code'', ''ATL'').valueMap()' 
        ))
```

## Additional resources


For additional information about this connector, visit [the corresponding site](https://github.com/awslabs/aws-athena-query-federation/tree/master/athena-neptune) on GitHub.com.

# Amazon Athena OpenSearch connector
OpenSearch

OpenSearch Service

The Amazon Athena OpenSearch connector enables Amazon Athena to communicate with your OpenSearch instances so that you can use SQL to query your OpenSearch data.

This connector can be registered with Glue Data Catalog as a federated catalog. It supports data access controls defined in Lake Formation at the catalog, database, table, column, row, and tag levels. This connector uses Glue Connections to centralize configuration properties in Glue.

**Note**  
Due to a known issue, the OpenSearch connector cannot be used with a VPC.

If you have Lake Formation enabled in your account, the IAM role for your Athena federated Lambda connector that you deployed in the AWS Serverless Application Repository must have read access in Lake Formation to the AWS Glue Data Catalog.

## Prerequisites

+ Deploy the connector to your AWS account using the Athena console or the AWS Serverless Application Repository. For more information, see [Create a data source connection](connect-to-a-data-source.md) or [Use the AWS Serverless Application Repository to deploy a data source connector](connect-data-source-serverless-app-repo.md).

## Terms


The following terms relate to the OpenSearch connector.
+ **Domain** – A name that this connector associates with the endpoint of your OpenSearch instance. The domain is also used as the database name. For OpenSearch instances defined within the Amazon OpenSearch Service, the domain is auto-discoverable. For other instances, you must provide a mapping between the domain name and endpoint.
+ **Index** – A database table defined in your OpenSearch instance.
+ **Mapping** – If an index is a database table, then the mapping is its schema (that is, the definitions of its fields and attributes).

  This connector supports metadata retrieval from both the OpenSearch instance and the AWS Glue Data Catalog. If the connector finds an AWS Glue database and table that match your OpenSearch domain and index names, the connector attempts to use them for schema definition. We recommend that you create your AWS Glue table so that it is a superset of all fields in your OpenSearch index.
+ **Document** – A record within a database table.
+ **Data stream** – Time based data that is composed of multiple backing indices. For more information, see [Data streams](https://opensearch.org/docs/latest/dashboards/im-dashboards/datastream/) in the OpenSearch documentation and [Getting started with data streams](https://docs.aws.amazon.com/opensearch-service/latest/developerguide/data-streams.html#data-streams-example) in the *Amazon OpenSearch Service Developer Guide*.
**Note**  
Because data stream indices are created and managed internally by OpenSearch, the connector chooses the schema mapping from the first available index. For this reason, we strongly recommend setting up an AWS Glue table as a supplemental metadata source. For more information, see [Setting up databases and tables in AWS Glue](#connectors-opensearch-setting-up-databases-and-tables-in-aws-glue). 

## Parameters


Use the parameters in this section to configure the OpenSearch connector.

**Note**  
Athena data source connectors created on December 3, 2024 and later use AWS Glue connections.  
The parameter names and definitions listed below are for Athena data source connectors created prior to December 3, 2024. These can differ from their corresponding [AWS Glue connection properties](https://docs.aws.amazon.com/glue/latest/dg/connection-properties.html). Starting December 3, 2024, use the parameters below only when you [manually deploy](connect-data-source-serverless-app-repo.md) an earlier version of an Athena data source connector.

### Glue connections (recommended)


We recommend that you configure an OpenSearch connector by using a Glue connection object. To do this, set the `glue_connection` environment variable of the OpenSearch connector Lambda function to the name of the Glue connection to use.

**Glue connections properties**

Use the following command to get the schema for a Glue connection object. This schema contains all the parameters that you can use to control your connection.

```
aws glue describe-connection-type --connection-type OPENSEARCH
```

**Lambda environment properties**
+ **glue_connection** – Specifies the name of the Glue connection associated with the federated connector. 

**Note**  
All connectors that use Glue connections must use AWS Secrets Manager to store credentials.
The OpenSearch connector created using Glue connections does not support the use of a multiplexing handler.
The OpenSearch connector created using Glue connections only supports `ConnectionSchemaVersion` 2.

### Legacy connections

+ **spill_bucket** – Specifies the Amazon S3 bucket for data that exceeds Lambda function limits.
+ **spill_prefix** – (Optional) Defaults to a subfolder in the specified `spill_bucket` called `athena-federation-spill`. We recommend that you configure an Amazon S3 [storage lifecycle](https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-lifecycle-mgmt.html) on this location to delete spills older than a predetermined number of days or hours.
+ **spill_put_request_headers** – (Optional) A JSON encoded map of request headers and values for the Amazon S3 `putObject` request that is used for spilling (for example, `{"x-amz-server-side-encryption" : "AES256"}`). For other possible headers, see [PutObject](https://docs.aws.amazon.com/AmazonS3/latest/API/API_PutObject.html) in the *Amazon Simple Storage Service API Reference*.
+ **kms_key_id** – (Optional) By default, any data that is spilled to Amazon S3 is encrypted using the AES-GCM authenticated encryption mode and a randomly generated key. To have your Lambda function use stronger encryption keys generated by KMS (for example, `a7e63k4b-8loc-40db-a2a1-4d0en2cd8331`), you can specify a KMS key ID.
+ **disable_spill_encryption** – (Optional) When set to `True`, disables spill encryption. Defaults to `False` so that data that is spilled to S3 is encrypted using AES-GCM, either with a randomly generated key or with keys generated by KMS. Disabling spill encryption can improve performance, especially if your spill location uses [server-side encryption](https://docs.aws.amazon.com/AmazonS3/latest/userguide/serv-side-encryption.html).
+ **disable_glue** – (Optional) If present and set to `true`, the connector does not attempt to retrieve supplemental metadata from AWS Glue.
+ **query_timeout_cluster** – The timeout period, in seconds, for cluster health queries used in the generation of parallel scans.
+ **query_timeout_search** – The timeout period, in seconds, for search queries used in the retrieval of documents from an index.
+ **auto_discover_endpoint** – Boolean. The default is `true`. When you use the Amazon OpenSearch Service and set this parameter to `true`, the connector can auto-discover your domains and endpoints by calling the appropriate describe or list API operations on the OpenSearch Service. For any other type of OpenSearch instance (for example, self-hosted), you must specify the associated domain endpoints in the `domain_mapping` variable. If `auto_discover_endpoint=true`, the connector uses AWS credentials to authenticate to the OpenSearch Service. Otherwise, the connector retrieves user name and password credentials from AWS Secrets Manager through the `domain_mapping` variable.
+ **domain_mapping** – Used only when `auto_discover_endpoint` is set to `false`. Defines the mapping between domain names and their associated endpoints. The `domain_mapping` variable can accommodate multiple OpenSearch endpoints in the following format:

  ```
  domain1=endpoint1,domain2=endpoint2,domain3=endpoint3,...       
  ```

  For the purpose of authenticating to an OpenSearch endpoint, the connector supports substitution strings injected using the format `${SecretName}` with user name and password retrieved from AWS Secrets Manager. The secret should be stored in the following JSON format:

  ```
  { "username": "your_username", "password": "your_password" }
  ```

  The connector automatically parses this JSON structure to retrieve the credentials.
**Important**  
As a security best practice, don't use hardcoded credentials in your environment variables or connection strings. For information about moving your hardcoded secrets to AWS Secrets Manager, see [Move hardcoded secrets to AWS Secrets Manager](https://docs.aws.amazon.com/secretsmanager/latest/userguide/hardcoded.html) in the *AWS Secrets Manager User Guide*.

  The following example uses the `opensearch-creds` secret.

  ```
  movies=https://${opensearch-creds}:search-movies-ne...qu---us-east-1---es.amazonaws.com     
  ```

  At runtime, `${opensearch-creds}` is rendered as the user name and password, as in the following example.

  ```
  movies=https://myusername@mypassword:search-movies-ne...qu---us-east-1---es.amazonaws.com
  ```

  In the `domain_mapping` parameter, each domain-endpoint pair can use a different secret. The secret itself must be specified in the format *user_name*@*password*. Although the password may contain embedded `@` signs, the first `@` serves as the separator from *user_name*.

  It is also important to note that this connector uses the comma (,) and equal sign (=) as separators for the domain-endpoint pairs. For this reason, you should not use them anywhere inside the stored secret.
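
To illustrate the format (the domains and endpoints are hypothetical, and this is an illustrative sketch rather than the connector's actual code), a `domain_mapping` value can be parsed as follows:

```python
def parse_domain_mapping(value):
    """Split a domain_mapping string into a {domain: endpoint} dict.

    Pairs are separated by ',' and each pair is split on its first '=',
    which is why neither character may appear inside a stored secret.
    """
    mapping = {}
    for pair in value.split(","):
        domain, _, endpoint = pair.partition("=")
        mapping[domain] = endpoint
    return mapping

# Two hypothetical domains mapped to their endpoints.
print(parse_domain_mapping(
    "movies=https://host1.example.com,logs=https://host2.example.com"))
# {'movies': 'https://host1.example.com', 'logs': 'https://host2.example.com'}
```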

## Setting up databases and tables in AWS Glue


The connector obtains metadata information by using AWS Glue or OpenSearch. You can set up an AWS Glue table as a supplemental metadata definition source. To enable this feature, define an AWS Glue database and table that match the domain and index of the source that you are supplementing. The connector can also take advantage of metadata definitions stored in the OpenSearch instance by retrieving the mapping for the specified index.

### Defining metadata for arrays in OpenSearch


OpenSearch does not have a dedicated array data type. Any field can contain zero or more values so long as they are of the same data type. If you want to use OpenSearch as your metadata definition source, you must define a `_meta` property for all indices used with Athena, for the fields that are to be considered a list or array. If you fail to complete this step, queries return only the first element in the list field. When you specify the `_meta` property, field names should be fully qualified for nested JSON structures (for example, `address.street`, where `street` is a nested field inside an `address` structure).

The following example defines `actor` and `genre` lists in the `movies` table.

```
PUT movies/_mapping 
{ 
  "_meta": { 
    "actor": "list", 
    "genre": "list" 
  } 
}
```

### Data types


The OpenSearch connector can extract metadata definitions from either AWS Glue or the OpenSearch instance. The connector uses the mapping in the following table to convert the definitions to Apache Arrow data types, including the points noted in the section that follows.


| OpenSearch | Apache Arrow | AWS Glue | 
| --- | --- | --- | 
| text, keyword, binary | VARCHAR | string | 
| long | BIGINT | bigint | 
| scaled_float | BIGINT | SCALED_FLOAT(...) | 
| integer | INT | int | 
| short | SMALLINT | smallint | 
| byte | TINYINT | tinyint | 
| double | FLOAT8 | double | 
| float, half_float | FLOAT4 | float | 
| boolean | BIT | boolean | 
| date, date_nanos | DATEMILLI | timestamp | 
| JSON structure | STRUCT | STRUCT | 
| _meta (for information, see the section [Defining metadata for arrays in OpenSearch](#connectors-opensearch-defining-metadata-for-arrays-in-opensearch).) | LIST | ARRAY | 

#### Notes on data types

+ Currently, the connector supports only the OpenSearch and AWS Glue data types listed in the preceding table.
+ A `scaled_float` is a floating-point number scaled by a fixed double scaling factor and represented as a `BIGINT` in Apache Arrow. For example, 0.756 with a scaling factor of 100 is rounded to 76.
+ To define a `scaled_float` in AWS Glue, you must select the `array` column type and declare the field using the format SCALED_FLOAT(*scaling_factor*).

  The following examples are valid:

  ```
  SCALED_FLOAT(10.51) 
  SCALED_FLOAT(100) 
  SCALED_FLOAT(100.0)
  ```

  The following examples are not valid:

  ```
  SCALED_FLOAT(10.) 
  SCALED_FLOAT(.5)
  ```
+ When converting from `date_nanos` to `DATEMILLI`, nanoseconds are rounded to the nearest millisecond. Valid values for `date` and `date_nanos` include, but are not limited to, the following formats:

  ```
  "2020-05-18T10:15:30.123456789" 
  "2020-05-15T06:50:01.123Z" 
  "2020-05-15T06:49:30.123-05:00" 
  1589525370001 (epoch milliseconds)
  ```
+ An OpenSearch `binary` is a string representation of a binary value encoded using `Base64` and is converted to a `VARCHAR`.
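
The `scaled_float` conversion described above can be sketched as follows (an illustrative calculation only, assuming simple rounding to the nearest integer):

```python
def scaled_float_to_bigint(value, scaling_factor):
    """Represent a scaled_float as a BIGINT by scaling and rounding."""
    return round(value * scaling_factor)

print(scaled_float_to_bigint(0.756, 100))  # 76
```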

## Running SQL queries


The following are examples of DDL queries that you can use with this connector. In the examples, *function_name* corresponds to the name of your Lambda function, *domain* is the name of the domain that you want to query, and *index* is the name of your index.

```
SHOW DATABASES in `lambda:function_name`
```

```
SHOW TABLES in `lambda:function_name`.domain
```

```
DESCRIBE `lambda:function_name`.domain.index
```

## Performance


The Athena OpenSearch connector supports shard-based parallel scans. The connector uses cluster health information retrieved from the OpenSearch instance to generate multiple requests for a document search query. The requests are split for each shard and run concurrently.

The connector also pushes down predicates as part of its document search queries. The following example query and predicate shows how the connector uses predicate push down.

**Query**

```
SELECT * FROM "lambda:elasticsearch".movies.movies 
WHERE year >= 1955 AND year <= 1962 OR year = 1996
```

**Predicate**

```
(_exists_:year) AND year:([1955 TO 1962] OR 1996)
```

## Passthrough queries


The OpenSearch connector supports [passthrough queries](federated-query-passthrough.md) and uses the Query DSL language. For more information about querying with Query DSL, see [Query DSL](https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl.html) in the Elasticsearch documentation or [Query DSL](https://opensearch.org/docs/latest/query-dsl/) in the OpenSearch documentation.

To use passthrough queries with the OpenSearch connector, use the following syntax:

```
SELECT * FROM TABLE(
        system.query(
            schema => 'schema_name',
            index => 'index_name',
            query => '{query_string}'
        ))
```

The following OpenSearch example passthrough query filters for employees with active employment status in the `employee` index of the `default` schema.

```
SELECT * FROM TABLE(
        system.query(
            schema => 'default',
            index => 'employee',
            query => '{"bool": {"filter": {"term": {"status": "active"}}}}'
        ))
```

## Additional resources

+ For an article on using the Amazon Athena OpenSearch connector to query data in Amazon OpenSearch Service and Amazon S3 in a single query, see [Query data in Amazon OpenSearch Service using SQL from Amazon Athena](https://aws.amazon.com/blogs/big-data/query-data-in-amazon-opensearch-service-using-sql-from-amazon-athena/) in the *AWS Big Data Blog*.
+ For additional information about this connector, visit [the corresponding site](https://github.com/awslabs/aws-athena-query-federation/tree/master/athena-elasticsearch) on GitHub.com.

# Amazon Athena Oracle connector
Oracle

The Amazon Athena connector for Oracle enables Amazon Athena to run SQL queries on data stored in Oracle running on-premises or on Amazon EC2 or Amazon RDS. You can also use the connector to query data on [Oracle Exadata](https://www.oracle.com/engineered-systems/exadata/).

This connector can be registered with Glue Data Catalog as a federated catalog. It supports data access controls defined in Lake Formation at the catalog, database, table, column, row, and tag levels. This connector uses Glue Connections to centralize configuration properties in Glue.

## Prerequisites

+ Deploy the connector to your AWS account using the Athena console or the AWS Serverless Application Repository. For more information, see [Create a data source connection](connect-to-a-data-source.md) or [Use the AWS Serverless Application Repository to deploy a data source connector](connect-data-source-serverless-app-repo.md).

## Limitations

+ Write DDL operations are not supported.
+ In a multiplexer setup, the spill bucket and prefix are shared across all database instances.
+ Any relevant Lambda limits. For more information, see [Lambda quotas](https://docs.aws.amazon.com/lambda/latest/dg/gettingstarted-limits.html) in the *AWS Lambda Developer Guide*.
+ Only version 12.1.0.2 Oracle Databases are supported.
+ If the Oracle connector does not use a Glue connection, it converts database, table, and column names to upper case. 

  If the Oracle connector uses a Glue connection, it does not change the casing of database, table, and column names by default. To change this behavior, set the Lambda environment variable `casing_mode` to `upper` or `lower` as needed.

  An Oracle connector that uses a Glue connection does not support the use of a multiplexing handler.
+ When you use the Oracle `NUMBER` type without precision and scale defined, Athena treats it as `BIGINT`. To get the required decimal places in Athena, specify `default_scale=<number of decimal places>` in your Lambda environment variables.

## Terms


The following terms relate to the Oracle connector.
+ **Database instance** – Any instance of a database deployed on premises, on Amazon EC2, or on Amazon RDS.
+ **Handler** – A Lambda handler that accesses your database instance. A handler can be for metadata or for data records.
+ **Metadata handler** – A Lambda handler that retrieves metadata from your database instance.
+ **Record handler** – A Lambda handler that retrieves data records from your database instance.
+ **Composite handler** – A Lambda handler that retrieves both metadata and data records from your database instance.
+ **Property or parameter** – A database property used by handlers to extract database information. You configure these properties as Lambda environment variables.
+ **Connection String** – A string of text used to establish a connection to a database instance.
+ **Catalog** – A non-AWS Glue catalog registered with Athena that is a required prefix for the `connection_string` property.
+ **Multiplexing handler** – A Lambda handler that can accept and use multiple database connections.

## Parameters


Use the parameters in this section to configure the Oracle connector.

### Glue connections (recommended)


We recommend that you configure an Oracle connector by using a Glue connection object. To do this, set the `glue_connection` environment variable of the Oracle connector Lambda function to the name of the Glue connection to use.

**Glue connections properties**

Use the following command to get the schema for a Glue connection object. This schema contains all the parameters that you can use to control your connection.

```
aws glue describe-connection-type --connection-type ORACLE
```

**Lambda environment properties**
+ **glue_connection** – Specifies the name of the Glue connection associated with the federated connector.
+ **is_fips_enabled** – (Optional) Set to true when FIPS mode is enabled. The default is false.
+ **casing_mode** – (Optional) Specifies how to handle casing for schema and table names. The `casing_mode` parameter uses the following values to specify the behavior of casing:
  + **lower** – Lower case all given schema and table names. This is the default for connectors that have an associated Glue connection.
  + **upper** – Upper case all given schema and table names. This is the default for connectors that do not have an associated Glue connection.
  + **case_insensitive_search** – Perform case insensitive searches against schema and table names in Oracle. Use this value if your query contains schema or table names that do not match the default casing for your connector.
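A minimal sketch of the three `casing_mode` behaviors (illustrative Python only; `apply_casing_mode` is a hypothetical helper, not the connector's implementation):

```python
def apply_casing_mode(name, casing_mode):
    # Sketch of how a schema or table name might be normalized before it
    # is sent to Oracle (hypothetical; the real logic lives in the connector).
    if casing_mode == "lower":
        return name.lower()
    if casing_mode == "upper":
        return name.upper()
    # case_insensitive_search: keep the name as given and let the
    # connector match it case-insensitively on the Oracle side.
    return name

print(apply_casing_mode("MyTable", "upper"))  # MYTABLE
print(apply_casing_mode("MyTable", "lower"))  # mytable
```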

**Note**  
All connectors that use Glue connections must use AWS Secrets Manager to store credentials.
The Oracle connector created using Glue connections does not support the use of a multiplexing handler.
The Oracle connector created using Glue connections only supports `ConnectionSchemaVersion` 2.

### Legacy connections


**Note**  
Athena data source connectors created on December 3, 2024 and later use AWS Glue connections.

The parameter names and definitions listed below are for Athena data source connectors created without an associated Glue connection. Use the following parameters only when you [manually deploy](connect-data-source-serverless-app-repo.md) an earlier version of an Athena data source connector or when the `glue_connection` environment property is not specified.

**Lambda environment properties**
+ **default** – The JDBC connection string to use to connect to the Oracle database instance. For example, `oracle://${jdbc_connection_string}`
+ **catalog_connection_string** – Used by the multiplexing handler (not supported when using a Glue connection). A database instance connection string. Prefix the environment variable with the name of the catalog used in Athena. For example, if the catalog registered with Athena is myoraclecatalog, then the environment variable name is myoraclecatalog_connection_string.
+ **spill_bucket** – Specifies the Amazon S3 bucket for data that exceeds Lambda function limits.
+ **spill_prefix** – (Optional) Defaults to a subfolder in the specified `spill_bucket` called `athena-federation-spill`. We recommend that you configure an Amazon S3 [storage lifecycle](https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-lifecycle-mgmt.html) on this location to delete spills older than a predetermined number of days or hours.
+ **spill_put_request_headers** – (Optional) A JSON encoded map of request headers and values for the Amazon S3 `putObject` request that is used for spilling (for example, `{"x-amz-server-side-encryption" : "AES256"}`). For other possible headers, see [PutObject](https://docs.aws.amazon.com/AmazonS3/latest/API/API_PutObject.html) in the *Amazon Simple Storage Service API Reference*.
+ **kms_key_id** – (Optional) By default, any data that is spilled to Amazon S3 is encrypted using the AES-GCM authenticated encryption mode and a randomly generated key. To have your Lambda function use stronger encryption keys generated by KMS, specify a KMS key ID (for example, `a7e63k4b-8loc-40db-a2a1-4d0en2cd8331`).
+ **disable_spill_encryption** – (Optional) When set to `True`, disables spill encryption. Defaults to `False` so that data that is spilled to S3 is encrypted using AES-GCM – either using a randomly generated key or KMS to generate keys. Disabling spill encryption can improve performance, especially if your spill location uses [server-side encryption](https://docs.aws.amazon.com/AmazonS3/latest/userguide/serv-side-encryption.html).
+ **is_fips_enabled** – (Optional) Set to true when FIPS mode is enabled. The default is false.
+ **casing_mode** – (Optional) Specifies how to handle casing for schema and table names. The `casing_mode` parameter uses the following values to specify the behavior of casing:
  + **lower** – Lower case all given schema and table names. This is the default for connectors that have an associated Glue connection.
  + **upper** – Upper case all given schema and table names. This is the default for connectors that do not have an associated Glue connection.
  + **case_insensitive_search** – Perform case insensitive searches against schema and table names in Oracle. Use this value if your query contains schema or table names that do not match the default casing for your connector.

#### Connection string


Use a JDBC connection string in the following format to connect to a database instance.

```
oracle://${jdbc_connection_string}
```

**Note**  
If your password contains special characters (for example, `some.password`), enclose your password in double quotes when you pass it to the connection string (for example, `"some.password"`). Failure to do so can result in an Invalid Oracle URL specified error.
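The quoting rule above could be applied programmatically before the connection string is assembled. The helper below and its set of "special" characters are assumptions for illustration only:

```python
def quote_if_special(password, special_chars=".@/:&"):
    # Wrap the password in double quotes when it contains characters that
    # would otherwise break connection string parsing (illustrative rule;
    # the exact character set the connector treats as special is assumed).
    if any(c in password for c in special_chars):
        return f'"{password}"'
    return password

print(quote_if_special("some.password"))  # "some.password"
print(quote_if_special("plainpassword"))  # plainpassword
```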

#### Using a single connection handler


You can use the following single connection metadata and record handlers to connect to a single Oracle instance.


****  

| Handler type | Class | 
| --- | --- | 
| Composite handler | OracleCompositeHandler | 
| Metadata handler | OracleMetadataHandler | 
| Record handler | OracleRecordHandler | 

##### Single connection handler parameters



****  

| Parameter | Description | 
| --- | --- | 
| default | Required. The default connection string. | 
| IsFIPSEnabled | Optional. Set to true when FIPS mode is enabled. The default is false.  | 

The single connection handlers support one database instance and must provide a `default` connection string parameter. All other connection strings are ignored.

The connector supports SSL-based connections to Amazon RDS instances. Support is limited to the Transport Layer Security (TLS) protocol and to authentication of the server by the client. Mutual authentication is not supported in Amazon RDS. The second row in the following table shows the syntax for using SSL.

The following example property is for a single Oracle instance supported by a Lambda function.


****  

| Property | Value | 
| --- | --- | 
| default | oracle://jdbc:oracle:thin:${Test/RDS/Oracle}@//hostname:port/servicename | 
|  | oracle://jdbc:oracle:thin:${Test/RDS/Oracle}@(DESCRIPTION=(ADDRESS=(PROTOCOL=TCPS)(HOST=<HOST_NAME>)(PORT=))(CONNECT_DATA=(SID=))(SECURITY=(SSL_SERVER_CERT_DN=))) | 

#### Providing credentials


To provide a user name and password for your database in your JDBC connection string, you can use connection string properties or AWS Secrets Manager.
+ **Connection String** – A user name and password can be specified as properties in the JDBC connection string.
**Important**  
As a security best practice, don't use hardcoded credentials in your environment variables or connection strings. For information about moving your hardcoded secrets to AWS Secrets Manager, see [Move hardcoded secrets to AWS Secrets Manager](https://docs.aws.amazon.com/secretsmanager/latest/userguide/hardcoded.html) in the *AWS Secrets Manager User Guide*.
+ **AWS Secrets Manager** – To use the Athena Federated Query feature with AWS Secrets Manager, the VPC connected to your Lambda function should have [internet access](https://aws.amazon.com/premiumsupport/knowledge-center/internet-access-lambda-function/) or a [VPC endpoint](https://docs.aws.amazon.com/secretsmanager/latest/userguide/vpc-endpoint-overview.html) to connect to Secrets Manager.

  You can put the name of a secret in AWS Secrets Manager in your JDBC connection string. The connector replaces the secret name with the `username` and `password` values from Secrets Manager.

  For Amazon RDS database instances, this support is tightly integrated. If you use Amazon RDS, we highly recommend using AWS Secrets Manager and credential rotation. If your database does not use Amazon RDS, store the credentials as JSON in the following format:

  ```
  {"username": "${username}", "password": "${password}"}
  ```

**Note**  
If your password contains special characters (for example, `some.password`), enclose your password in double quotes when you store it in Secrets Manager (for example, `"some.password"`). Failure to do so can result in an Invalid Oracle URL specified error.

**Example connection string with secret name**  
The following string has the secret name `${Test/RDS/Oracle}`.

```
oracle://jdbc:oracle:thin:${Test/RDS/Oracle}@//hostname:port/servicename 
```

The connector uses the secret name to retrieve secrets and provide the user name and password, as in the following example.

```
oracle://jdbc:oracle:thin:username/password@//hostname:port/servicename
```

Currently, the Oracle connector recognizes the `UID` and `PWD` JDBC properties.
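The substitution above can be sketched as a small Python function. `resolve_secret` and the fake secret store are hypothetical stand-ins for the connector's internal logic and the Secrets Manager call:

```python
import json
import re

def resolve_secret(conn_str, get_secret):
    # Replace ${secret-name} with username/password from the secret's JSON
    # payload, mimicking the substitution described above (illustrative only).
    def substitute(match):
        creds = json.loads(get_secret(match.group(1)))
        return f"{creds['username']}/{creds['password']}"
    return re.sub(r"\$\{([^}]+)\}", substitute, conn_str)

# Fake in-memory store standing in for AWS Secrets Manager:
secrets = {"Test/RDS/Oracle": '{"username": "user1", "password": "pw1"}'}

print(resolve_secret(
    "oracle://jdbc:oracle:thin:${Test/RDS/Oracle}@//hostname:1521/servicename",
    secrets.__getitem__))
# oracle://jdbc:oracle:thin:user1/pw1@//hostname:1521/servicename
```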

#### Using a multiplexing handler


You can use a multiplexer to connect to multiple database instances with a single Lambda function. Requests are routed by catalog name. Use the following classes in Lambda.


****  

| Handler | Class | 
| --- | --- | 
| Composite handler | OracleMuxCompositeHandler | 
| Metadata handler | OracleMuxMetadataHandler | 
| Record handler | OracleMuxRecordHandler | 

##### Multiplexing handler parameters



****  

| Parameter | Description | 
| --- | --- | 
| catalog_connection_string | Required. A database instance connection string. Prefix the environment variable with the name of the catalog used in Athena. For example, if the catalog registered with Athena is myoraclecatalog, then the environment variable name is myoraclecatalog_connection_string. | 
| default | Required. The default connection string. This string is used when the catalog is lambda:${AWS_LAMBDA_FUNCTION_NAME}. | 

The following example properties are for an Oracle MUX Lambda function that supports two database instances: `oracle1` (the default) and `oracle2`.


****  

| Property | Value | 
| --- | --- | 
| default | oracle://jdbc:oracle:thin:${Test/RDS/Oracle1}@//oracle1.hostname:port/servicename | 
| oracle_catalog1_connection_string | oracle://jdbc:oracle:thin:${Test/RDS/Oracle1}@//oracle1.hostname:port/servicename | 
| oracle_catalog2_connection_string | oracle://jdbc:oracle:thin:${Test/RDS/Oracle2}@//oracle2.hostname:port/servicename | 
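The catalog-to-connection routing shown in the table above follows a simple naming convention that can be sketched as an environment variable lookup. This is an illustration of the convention, not the handler's actual code:

```python
import os

def connection_for_catalog(catalog):
    # Try <catalog>_connection_string first, then fall back to "default",
    # mirroring the multiplexing convention described above (illustrative).
    return os.environ.get(f"{catalog}_connection_string") or os.environ["default"]

# Hypothetical environment mirroring the example table:
os.environ["default"] = (
    "oracle://jdbc:oracle:thin:${Test/RDS/Oracle1}@//oracle1.hostname:1521/servicename"
)
os.environ["oracle_catalog2_connection_string"] = (
    "oracle://jdbc:oracle:thin:${Test/RDS/Oracle2}@//oracle2.hostname:1521/servicename"
)

print(connection_for_catalog("oracle_catalog2"))  # routed to oracle2
print(connection_for_catalog("somethingelse"))    # falls back to default
```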

## Data type support


The following table shows the corresponding data types for JDBC, Oracle, and Arrow.


****  

| JDBC | Oracle | Arrow | 
| --- | --- | --- | 
| Boolean | boolean | Bit | 
| Integer | N/A | Tiny | 
| Short | smallint | Smallint | 
| Integer | integer | Int | 
| Long | bigint | Bigint | 
| float | float4 | Float4 | 
| Double | float8 | Float8 | 
| Date | date | DateDay | 
| Timestamp | timestamp | DateMilli | 
| String | text | Varchar | 
| Bytes | bytes | Varbinary | 
| BigDecimal | numeric(p,s) | Decimal | 
| ARRAY | N/A | List | 

## Partitions and splits


Partitions are used to determine how to generate splits for the connector. Athena constructs a synthetic column of type `varchar` that represents the partitioning scheme for the table to help the connector generate splits. The connector does not modify the actual table definition.

## Performance


Oracle supports native partitions. The Athena Oracle connector can retrieve data from these partitions in parallel. If you want to query very large datasets with uniform partition distribution, native partitioning is highly recommended. Selecting a subset of columns significantly speeds up query runtime and reduces data scanned. The Oracle connector is resilient to throttling due to concurrency. However, query runtimes tend to be long.

The Athena Oracle connector performs predicate pushdown to decrease the data scanned by the query. Simple predicates and complex expressions are pushed down to the connector to reduce the amount of data scanned and decrease query execution run time. 

### Predicates


A predicate is an expression in the `WHERE` clause of a SQL query that evaluates to a Boolean value and filters rows based on multiple conditions. The Athena Oracle connector can combine these expressions and push them directly to Oracle for enhanced functionality and to reduce the amount of data scanned.

The following Athena Oracle connector operators support predicate pushdown:
+ **Boolean:** AND, OR, NOT
+ **Equality:** EQUAL, NOT_EQUAL, LESS_THAN, LESS_THAN_OR_EQUAL, GREATER_THAN, GREATER_THAN_OR_EQUAL, IS_NULL
+ **Arithmetic:** ADD, SUBTRACT, MULTIPLY, DIVIDE, NEGATE
+ **Other:** LIKE_PATTERN, IN

### Combined pushdown example


For enhanced querying capabilities, combine the pushdown types, as in the following example:

```
SELECT * 
FROM my_table 
WHERE col_a > 10 
    AND ((col_a + col_b) > (col_c % col_d)) 
    AND (col_e IN ('val1', 'val2', 'val3') OR col_f LIKE '%pattern%');
```

## Passthrough queries


The Oracle connector supports [passthrough queries](federated-query-passthrough.md). Passthrough queries use a table function to push your full query down to the data source for execution.

To use passthrough queries with Oracle, you can use the following syntax:

```
SELECT * FROM TABLE(
        system.query(
            query => 'query string'
        ))
```

The following example query pushes down a query to a data source in Oracle. The query selects all columns in the `customer` table.

```
SELECT * FROM TABLE(
        system.query(
            query => 'SELECT * FROM customer'
        ))
```

## License information


By using this connector, you acknowledge the inclusion of third party components, a list of which can be found in the [pom.xml](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-oracle/pom.xml) file for this connector, and agree to the terms in the respective third party licenses provided in the [LICENSE.txt](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-oracle/LICENSE.txt) file on GitHub.com.

## Additional resources


For the latest JDBC driver version information, see the [pom.xml](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-oracle/pom.xml) file for the Oracle connector on GitHub.com.

For additional information about this connector, visit [the corresponding site](https://github.com/awslabs/aws-athena-query-federation/tree/master/athena-oracle) on GitHub.com.

# Amazon Athena PostgreSQL connector
PostgreSQL

The Amazon Athena PostgreSQL connector enables Athena to access your PostgreSQL databases.

This connector can be registered with Glue Data Catalog as a federated catalog. It supports data access controls defined in Lake Formation at the catalog, database, table, column, row, and tag levels. This connector uses Glue Connections to centralize configuration properties in Glue.

## Prerequisites

+ Deploy the connector to your AWS account using the Athena console or the AWS Serverless Application Repository. For more information, see [Create a data source connection](connect-to-a-data-source.md) or [Use the AWS Serverless Application Repository to deploy a data source connector](connect-data-source-serverless-app-repo.md).

## Limitations

+ Write DDL operations are not supported.
+ In a multiplexer setup, the spill bucket and prefix are shared across all database instances.
+ Any relevant Lambda limits. For more information, see [Lambda quotas](https://docs.aws.amazon.com/lambda/latest/dg/gettingstarted-limits.html) in the *AWS Lambda Developer Guide*.
+ Like PostgreSQL, Athena treats trailing spaces in the PostgreSQL `CHAR` type as semantically insignificant for length and comparison purposes. This applies only to `CHAR`; for the `VARCHAR` type, Athena treats trailing spaces as significant.
+ When you use the [citext](https://www.postgresql.org/docs/current/citext.html) case-insensitive character string data type, PostgreSQL performs case-insensitive data comparisons that differ from Athena's. This difference creates a data discrepancy during SQL `JOIN` operations. To work around this issue, use the PostgreSQL connector passthrough query feature. For more information, see the [passthrough queries](#connectors-postgres-passthrough-queries) section later in this document. 
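The `CHAR` trailing-space behavior described above can be illustrated with a small comparison sketch (`char_equal` is a hypothetical helper, not Athena code):

```python
def char_equal(a, b):
    # CHAR semantics: trailing spaces are insignificant for comparison.
    return a.rstrip(" ") == b.rstrip(" ")

# CHAR comparison ignores trailing spaces...
print(char_equal("abc   ", "abc"))  # True
# ...but VARCHAR comparison (plain equality) treats them as significant.
print("abc   " == "abc")            # False
```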

## Terms


The following terms relate to the PostgreSQL connector.
+ **Database instance** – Any instance of a database deployed on premises, on Amazon EC2, or on Amazon RDS.
+ **Handler** – A Lambda handler that accesses your database instance. A handler can be for metadata or for data records.
+ **Metadata handler** – A Lambda handler that retrieves metadata from your database instance.
+ **Record handler** – A Lambda handler that retrieves data records from your database instance.
+ **Composite handler** – A Lambda handler that retrieves both metadata and data records from your database instance.
+ **Property or parameter** – A database property used by handlers to extract database information. You configure these properties as Lambda environment variables.
+ **Connection String** – A string of text used to establish a connection to a database instance.
+ **Catalog** – A non-AWS Glue catalog registered with Athena that is a required prefix for the `connection_string` property.
+ **Multiplexing handler** – A Lambda handler that can accept and use multiple database connections.

## Parameters


Use the parameters in this section to configure the PostgreSQL connector.

**Note**  
Athena data source connectors created on December 3, 2024 and later use AWS Glue connections.

### Glue connections (recommended)


We recommend that you configure a PostgreSQL connector by using a Glue connection object. 

To do this, set the `glue_connection` environment variable of the PostgreSQL connector Lambda to the name of the Glue connection to use.

Use the following command to get the schema for a Glue connection object. This schema contains all the parameters that you can use to control your connection.

```
aws glue describe-connection-type --connection-type POSTGRESQL
```

**Lambda environment properties**

**glue_connection** – Specifies the name of the Glue connection associated with the federated connector. 

**Note**  
All connectors that use Glue connections must use AWS Secrets Manager to store credentials.
The PostgreSQL connector created using Glue connections does not support the use of a multiplexing handler.
The PostgreSQL connector created using Glue connections only supports `ConnectionSchemaVersion` 2.

### Legacy connections


The parameter names and definitions listed below are for Athena data source connectors created without an associated Glue connection. Use the following parameters only when you [manually deploy](connect-data-source-serverless-app-repo.md) an earlier version of an Athena data source connector or when the `glue_connection` environment property is not specified.

#### Connection string


Use a JDBC connection string in the following format to connect to a database instance.

```
postgres://${jdbc_connection_string}
```

#### Using a multiplexing handler


You can use a multiplexer to connect to multiple database instances with a single Lambda function. Requests are routed by catalog name. Use the following classes in Lambda.


****  

| Handler | Class | 
| --- | --- | 
| Composite handler | PostGreSqlMuxCompositeHandler | 
| Metadata handler | PostGreSqlMuxMetadataHandler | 
| Record handler | PostGreSqlMuxRecordHandler | 

##### Multiplexing handler parameters



****  

| Parameter | Description | 
| --- | --- | 
| catalog_connection_string | Required. A database instance connection string. Prefix the environment variable with the name of the catalog used in Athena. For example, if the catalog registered with Athena is mypostgrescatalog, then the environment variable name is mypostgrescatalog_connection_string. | 
| default | Required. The default connection string. This string is used when the catalog is lambda:${AWS_LAMBDA_FUNCTION_NAME}. | 

The following example properties are for a PostgreSQL MUX Lambda function that supports two database instances: `postgres1` (the default) and `postgres2`.


****  

| Property | Value | 
| --- | --- | 
| default | postgres://jdbc:postgresql://postgres1.host:5432/default?${Test/RDS/PostGres1} | 
| postgres_catalog1_connection_string | postgres://jdbc:postgresql://postgres1.host:5432/default?${Test/RDS/PostGres1} | 
| postgres_catalog2_connection_string | postgres://jdbc:postgresql://postgres2.host:5432/default?user=sample&password=sample | 

##### Providing credentials


To provide a user name and password for your database in your JDBC connection string, you can use connection string properties or AWS Secrets Manager.
+ **Connection String** – A user name and password can be specified as properties in the JDBC connection string.
**Important**  
As a security best practice, don't use hardcoded credentials in your environment variables or connection strings. For information about moving your hardcoded secrets to AWS Secrets Manager, see [Move hardcoded secrets to AWS Secrets Manager](https://docs.aws.amazon.com/secretsmanager/latest/userguide/hardcoded.html) in the *AWS Secrets Manager User Guide*.
+ **AWS Secrets Manager** – To use the Athena Federated Query feature with AWS Secrets Manager, the VPC connected to your Lambda function should have [internet access](https://aws.amazon.com/premiumsupport/knowledge-center/internet-access-lambda-function/) or a [VPC endpoint](https://docs.aws.amazon.com/secretsmanager/latest/userguide/vpc-endpoint-overview.html) to connect to Secrets Manager.

  You can put the name of a secret in AWS Secrets Manager in your JDBC connection string. The connector replaces the secret name with the `username` and `password` values from Secrets Manager.

  For Amazon RDS database instances, this support is tightly integrated. If you use Amazon RDS, we highly recommend using AWS Secrets Manager and credential rotation. If your database does not use Amazon RDS, store the credentials as JSON in the following format:

  ```
  {"username": "${username}", "password": "${password}"}
  ```

**Example connection string with secret name**  
The following string has the secret name `${Test/RDS/PostGres1}`.

```
postgres://jdbc:postgresql://postgres1.host:5432/default?...&${Test/RDS/PostGres1}&...
```

The connector uses the secret name to retrieve secrets and provide the user name and password, as in the following example.

```
postgres://jdbc:postgresql://postgres1.host:5432/default?...&user=sample2&password=sample2&...
```

Currently, the PostgreSQL connector recognizes the `user` and `password` JDBC properties.

##### Enabling SSL


To support SSL in your PostgreSQL connection, append the following to your connection string:

```
&sslmode=verify-ca&sslfactory=org.postgresql.ssl.DefaultJavaSSLFactory
```

**Example**  
The following example connection string does not use SSL.

```
postgres://jdbc:postgresql://example-asdf-aurora-postgres-endpoint:5432/asdf?user=someuser&password=somepassword
```

To enable SSL, modify the string as follows.

```
postgres://jdbc:postgresql://example-asdf-aurora-postgres-endpoint:5432/asdf?user=someuser&password=somepassword&sslmode=verify-ca&sslfactory=org.postgresql.ssl.DefaultJavaSSLFactory
```
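Appending the SSL parameters can be done mechanically before deployment. A sketch (`enable_ssl` is a hypothetical helper, not part of the connector):

```python
SSL_PARAMS = "&sslmode=verify-ca&sslfactory=org.postgresql.ssl.DefaultJavaSSLFactory"

def enable_ssl(conn_str):
    # Append the SSL parameters shown above if they are not already present.
    return conn_str if SSL_PARAMS in conn_str else conn_str + SSL_PARAMS

conn = "postgres://jdbc:postgresql://example-host:5432/asdf?user=someuser&password=somepassword"
print(enable_ssl(conn).endswith(SSL_PARAMS))              # True
print(enable_ssl(enable_ssl(conn)) == enable_ssl(conn))   # True: idempotent
```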

#### Using a single connection handler


You can use the following single connection metadata and record handlers to connect to a single PostgreSQL instance.


****  

| Handler type | Class | 
| --- | --- | 
| Composite handler | PostGreSqlCompositeHandler | 
| Metadata handler | PostGreSqlMetadataHandler | 
| Record handler | PostGreSqlRecordHandler | 

##### Single connection handler parameters



****  

| Parameter | Description | 
| --- | --- | 
| default | Required. The default connection string. | 

The single connection handlers support one database instance and must provide a `default` connection string parameter. All other connection strings are ignored.

The following example property is for a single PostgreSQL instance supported by a Lambda function.


****  

| Property | Value | 
| --- | --- | 
| default | postgres://jdbc:postgresql://postgres1.host:5432/default?secret=${Test/RDS/PostgreSQL1} | 

#### Spill parameters


The Lambda SDK can spill data to Amazon S3. All database instances accessed by the same Lambda function spill to the same location.


****  

| Parameter | Description | 
| --- | --- | 
| spill_bucket | Required. Spill bucket name. | 
| spill_prefix | Required. Spill bucket key prefix. | 
| spill_put_request_headers | (Optional) A JSON encoded map of request headers and values for the Amazon S3 putObject request that is used for spilling (for example, {"x-amz-server-side-encryption" : "AES256"}). For other possible headers, see [PutObject](https://docs.aws.amazon.com/AmazonS3/latest/API/API_PutObject.html) in the Amazon Simple Storage Service API Reference. | 

## Data type support


The following table shows the corresponding data types for JDBC, PostgreSQL, and Arrow.


****  

| JDBC | PostgreSQL | Arrow | 
| --- | --- | --- | 
| Boolean | Boolean | Bit | 
| Integer | N/A | Tiny | 
| Short | smallint | Smallint | 
| Integer | integer | Int | 
| Long | bigint | Bigint | 
| float | float4 | Float4 | 
| Double | float8 | Float8 | 
| Date | date | DateDay | 
| Timestamp | timestamp | DateMilli | 
| String | text | Varchar | 
| Bytes | bytes | Varbinary | 
| BigDecimal | numeric(p,s) | Decimal | 
| ARRAY | N/A (see note) | List | 

**Note**  
The `ARRAY` type is supported for the PostgreSQL connector with the following constraints: Multidimensional arrays (`<data_type>[][]` or nested arrays) are not supported. Columns with unsupported `ARRAY` data-types are converted to an array of string elements (`array<varchar>`).

## Partitions and splits


Partitions are used to determine how to generate splits for the connector. Athena constructs a synthetic column of type `varchar` that represents the partitioning scheme for the table to help the connector generate splits. The connector does not modify the actual table definition.

## Performance


PostgreSQL supports native partitions. The Athena PostgreSQL connector can retrieve data from these partitions in parallel. If you want to query very large datasets with uniform partition distribution, native partitioning is highly recommended.

The Athena PostgreSQL connector performs predicate pushdown to decrease the data scanned by the query. `LIMIT` clauses, simple predicates, and complex expressions are pushed down to the connector to reduce the amount of data scanned and decrease query execution run time. However, selecting a subset of columns sometimes results in a longer query execution runtime.

### LIMIT clauses


A `LIMIT N` statement reduces the data scanned by the query. With `LIMIT N` pushdown, the connector returns only `N` rows to Athena.

### Predicates


A predicate is an expression in the `WHERE` clause of a SQL query that evaluates to a Boolean value and filters rows based on multiple conditions. The Athena PostgreSQL connector can combine these expressions and push them directly to PostgreSQL for enhanced functionality and to reduce the amount of data scanned.

The following Athena PostgreSQL connector operators support predicate pushdown:
+ **Boolean:** AND, OR, NOT
+ **Equality:** EQUAL, NOT_EQUAL, LESS_THAN, LESS_THAN_OR_EQUAL, GREATER_THAN, GREATER_THAN_OR_EQUAL, IS_DISTINCT_FROM, NULL_IF, IS_NULL
+ **Arithmetic:** ADD, SUBTRACT, MULTIPLY, DIVIDE, MODULUS, NEGATE
+ **Other:** LIKE_PATTERN, IN

### Combined pushdown example


For enhanced querying capabilities, combine the pushdown types, as in the following example:

```
SELECT * 
FROM my_table 
WHERE col_a > 10 
    AND ((col_a + col_b) > (col_c % col_d))
    AND (col_e IN ('val1', 'val2', 'val3') OR col_f LIKE '%pattern%') 
LIMIT 10;
```

## Passthrough queries


The PostgreSQL connector supports [passthrough queries](federated-query-passthrough.md). Passthrough queries use a table function to push your full query down to the data source for execution.

To use passthrough queries with PostgreSQL, you can use the following syntax:

```
SELECT * FROM TABLE(
        system.query(
            query => 'query string'
        ))
```

The following example query pushes down a query to a data source in PostgreSQL. The query selects all columns in the `customer` table, limiting the results to 10.

```
SELECT * FROM TABLE(
        system.query(
            query => 'SELECT * FROM customer LIMIT 10'
        ))
```

## Additional resources


For the latest JDBC driver version information, see the [pom.xml](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-postgresql/pom.xml) file for the PostgreSQL connector on GitHub.com.

For additional information about this connector, visit [the corresponding site](https://github.com/awslabs/aws-athena-query-federation/tree/master/athena-postgresql) on GitHub.com.

# Amazon Athena Redis OSS connector
Redis OSS

The Amazon Athena Redis OSS connector enables Amazon Athena to communicate with your Redis OSS instances so that you can query your Redis OSS data with SQL. You can use the AWS Glue Data Catalog to map your Redis OSS key-value pairs into virtual tables.

Unlike traditional relational data stores, Redis OSS does not have the concept of a table or a column. Instead, Redis OSS offers key-value access patterns where the key is essentially a `string` and the value is a `string`, `z-set`, or `hmap`.

You can use the AWS Glue Data Catalog to create schema and configure virtual tables. Special table properties tell the Athena Redis OSS connector how to map your Redis OSS keys and values into a table. For more information, see [Setting up databases and tables in AWS Glue](#connectors-redis-setting-up-databases-and-tables-in-glue) later in this document.
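Conceptually, the connector projects a Redis OSS hash (`hmap`) onto a row of the virtual table. The sketch below is illustrative only; the real column list and mapping come from the AWS Glue table properties:

```python
def hash_to_row(redis_hash, columns):
    # Project a Redis OSS hash (hmap) onto the virtual table's columns;
    # fields absent from the hash become NULLs (hypothetical mapping sketch).
    return {col: redis_hash.get(col) for col in columns}

row = hash_to_row({"name": "ada", "age": "36"}, ["name", "age", "city"])
print(row)  # {'name': 'ada', 'age': '36', 'city': None}
```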

This connector does not use Glue Connections to centralize configuration properties in Glue. Connection configuration is done through Lambda.

If you have Lake Formation enabled in your account, the IAM role for your Athena federated Lambda connector that you deployed in the AWS Serverless Application Repository must have read access in Lake Formation to the AWS Glue Data Catalog.

The Amazon Athena Redis OSS connector supports Amazon MemoryDB and Amazon ElastiCache (Redis OSS).

## Prerequisites

+ Deploy the connector to your AWS account using the Athena console or the AWS Serverless Application Repository. For more information, see [Create a data source connection](connect-to-a-data-source.md) or [Use the AWS Serverless Application Repository to deploy a data source connector](connect-data-source-serverless-app-repo.md).
+ Set up a VPC and a security group before you use this connector. For more information, see [Create a VPC for a data source connector or AWS Glue connection](athena-connectors-vpc-creation.md).

## Parameters


Use the parameters in this section to configure the Redis connector.
+ **spill_bucket** – Specifies the Amazon S3 bucket for data that exceeds Lambda function limits.
+ **spill_prefix** – (Optional) Defaults to a subfolder in the specified `spill_bucket` called `athena-federation-spill`. We recommend that you configure an Amazon S3 [storage lifecycle](https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-lifecycle-mgmt.html) on this location to delete spills older than a predetermined number of days or hours.
+ **spill_put_request_headers** – (Optional) A JSON encoded map of request headers and values for the Amazon S3 `putObject` request that is used for spilling (for example, `{"x-amz-server-side-encryption" : "AES256"}`). For other possible headers, see [PutObject](https://docs.aws.amazon.com/AmazonS3/latest/API/API_PutObject.html) in the *Amazon Simple Storage Service API Reference*.
+ **kms_key_id** – (Optional) By default, any data that is spilled to Amazon S3 is encrypted using the AES-GCM authenticated encryption mode and a randomly generated key. To have your Lambda function use stronger encryption keys generated by KMS like `a7e63k4b-8loc-40db-a2a1-4d0en2cd8331`, you can specify a KMS key ID.
+ **disable_spill_encryption** – (Optional) When set to `True`, disables spill encryption. Defaults to `False` so that data that is spilled to S3 is encrypted using AES-GCM – either using a randomly generated key or KMS to generate keys. Disabling spill encryption can improve performance, especially if your spill location uses [server-side encryption](https://docs.aws.amazon.com/AmazonS3/latest/userguide/serv-side-encryption.html).
+ **glue_catalog** – (Optional) Use this option to specify a [cross-account AWS Glue catalog](data-sources-glue-cross-account.md). By default, the connector attempts to get metadata from its own AWS Glue account.
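For example, a minimal Lambda environment configuration for this connector might look like the following fragment (a sketch only; the bucket name is hypothetical):

```
{
  "spill_bucket": "my-athena-spill-bucket",
  "spill_prefix": "athena-federation-spill",
  "disable_spill_encryption": "False"
}
```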

## Setting up databases and tables in AWS Glue


To enable an AWS Glue table for use with Redis OSS, you can set the following table properties on the table: `redis-endpoint`, `redis-value-type`, and either `redis-keys-zset` or `redis-key-prefix`.

In addition, any AWS Glue database that contains Redis OSS tables must have a `redis-db-flag` in the URI property of the database. To set the `redis-db-flag` URI property, use the AWS Glue console to edit the database.

The following list describes the table properties.
+ **redis-endpoint** – (Required) The *hostname*`:`*port*`:`*password* of the Redis OSS server that contains data for this table (for example, `athena-federation-demo.cache.amazonaws.com:6379`). Alternatively, you can store the endpoint, or part of the endpoint, in AWS Secrets Manager by using `${Secret_Name}` as the table property value.

**Note**  
To use the Athena Federated Query feature with AWS Secrets Manager, the VPC connected to your Lambda function should have [internet access](https://aws.amazon.com/premiumsupport/knowledge-center/internet-access-lambda-function/) or a [VPC endpoint](https://docs.aws.amazon.com/secretsmanager/latest/userguide/vpc-endpoint-overview.html) to connect to Secrets Manager.
+ **redis-keys-zset** – (Required if `redis-key-prefix` is not used) A comma-separated list of keys whose value is a [zset](https://redis.com/ebook/part-2-core-concepts/chapter-3-commands-in-redis/3-5-sorted-sets/) (for example, `active-orders,pending-orders`). Each of the values in the zset is treated as a key that is part of the table. Either the `redis-keys-zset` property or the `redis-key-prefix` property must be set.
+ **redis-key-prefix** – (Required if `redis-keys-zset` is not used) A comma-separated list of key prefixes to scan for values in the table (for example, `accounts-*,acct-`). Either the `redis-key-prefix` property or the `redis-keys-zset` property must be set.
+ **redis-value-type** – (Required) Defines how the values for the keys defined by either `redis-key-prefix` or `redis-keys-zset` are mapped to your table (valid values: `hash`, `literal`, or `zset`). A literal maps to a single column. A zset also maps to a single column, but each key can store many rows. A hash enables each key to be a row with multiple columns.
+ **redis-ssl-flag** – (Optional) When `True`, creates a Redis connection that uses SSL/TLS. The default is `False`.
+ **redis-cluster-flag** – (Optional) When `True`, enables support for clustered Redis instances. The default is `False`.
+ **redis-db-number** – (Optional) Applies only to standalone, non-clustered instances. Set this number (for example, 1, 2, or 3) to read from a non-default Redis database. The default is Redis logical database 0. This number does not refer to a database in Athena or AWS Glue, but to a Redis logical database. For more information, see [SELECT index](https://redis.io/commands/select) in the Redis documentation.
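To illustrate, the properties above might be set as AWS Glue table parameters like the following fragment (a sketch only; the secret name and key prefix are hypothetical, and storing the endpoint in Secrets Manager avoids placing the password in plain text):

```
{
  "redis-endpoint": "${my-redis-endpoint-secret}",
  "redis-key-prefix": "orders-*",
  "redis-value-type": "hash",
  "redis-ssl-flag": "True"
}
```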

## Data types


The Redis OSS connector supports the following data types. Redis OSS streams are not supported.
+ [String](https://redis.com/ebook/part-1-getting-started/chapter-1-getting-to-know-redis/1-2-what-redis-data-structures-look-like/1-2-1-strings-in-redis/)
+ [Hash](https://redis.com/ebook/part-1-getting-started/chapter-1-getting-to-know-redis/1-2-what-redis-data-structures-look-like/1-2-4-hashes-in-redis/)
+ Sorted Set ([ZSet](https://redis.com/ebook/part-2-core-concepts/chapter-3-commands-in-redis/3-5-sorted-sets/))

All Redis OSS values are retrieved as the `string` data type. Then they are converted to one of the following Apache Arrow data types based on how your tables are defined in the AWS Glue Data Catalog.


****  

| AWS Glue data type | Apache Arrow data type | 
| --- | --- | 
| int | INT | 
| string | VARCHAR | 
| bigint | BIGINT | 
| double | FLOAT8 | 
| float | FLOAT4 | 
| smallint | SMALLINT | 
| tinyint | TINYINT | 
| boolean | BIT | 
| binary | VARBINARY | 

## Required permissions


For full details on the IAM policies that this connector requires, review the `Policies` section of the [athena-redis.yaml](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-redis/athena-redis.yaml) file. The following list summarizes the required permissions.
+ **Amazon S3 write access** – The connector requires write access to a location in Amazon S3 in order to spill results from large queries.
+ **Athena GetQueryExecution** – The connector uses this permission to fast-fail when the upstream Athena query has terminated.
+ **AWS Glue Data Catalog** – The Redis connector requires read only access to the AWS Glue Data Catalog to obtain schema information.
+ **CloudWatch Logs** – The connector requires access to CloudWatch Logs for storing logs.
+ **AWS Secrets Manager read access** – If you choose to store Redis endpoint details in Secrets Manager, you must grant the connector access to those secrets.
+ **VPC access** – The connector requires the ability to attach and detach interfaces to your VPC so that it can connect to it and communicate with your Redis instances.

## Performance


The Athena Redis OSS connector attempts to parallelize queries against your Redis OSS instance according to the type of table that you have defined (for example, zset keys or prefix keys).

The Athena Redis connector performs predicate pushdown to decrease the data scanned by the query. `LIMIT` clauses also reduce the amount of data scanned, but if you don't provide a predicate, you should expect `SELECT` queries with a `LIMIT` clause to scan at least 16 MB of data. The Redis connector is resilient to throttling due to concurrency.

## Passthrough queries


The Redis connector supports [passthrough queries](federated-query-passthrough.md). You can use this feature to run queries that use Lua scripts on Redis databases.

To create passthrough queries with Redis, use the following syntax:

```
SELECT * FROM TABLE(
        system.script(
            script => 'return redis.[call|pcall](query_script)',
            keys => '[key_pattern]',
            argv => '[script_arguments]'
))
```

The following example runs a Lua script to get the value at key `l:a`.

```
SELECT * FROM TABLE(
        system.script(
            script => 'return redis.call("GET", KEYS[1])',
            keys => '[l:a]',
            argv => '[]'
))
```

## License information


The Amazon Athena Redis connector project is licensed under the [Apache-2.0 License](https://www.apache.org/licenses/LICENSE-2.0.html).

## Additional resources


For additional information about this connector, visit [the corresponding site](https://github.com/awslabs/aws-athena-query-federation/tree/master/athena-redis) on GitHub.com.

# Amazon Athena Redshift connector
Redshift

The Amazon Athena Redshift connector enables Amazon Athena to access your Amazon Redshift and Amazon Redshift Serverless databases, including Redshift Serverless views. You can connect to either service using the JDBC connection string configuration settings described on this page.

This connector can be registered with Glue Data Catalog as a federated catalog. It supports data access controls defined in Lake Formation at the catalog, database, table, column, row, and tag levels. This connector uses Glue Connections to centralize configuration properties in Glue.

## Prerequisites

+ Deploy the connector to your AWS account using the Athena console or the AWS Serverless Application Repository. For more information, see [Create a data source connection](connect-to-a-data-source.md) or [Use the AWS Serverless Application Repository to deploy a data source connector](connect-data-source-serverless-app-repo.md).

## Limitations

+ Write DDL operations are not supported.
+ In a multiplexer setup, the spill bucket and prefix are shared across all database instances.
+ Any relevant Lambda limits. For more information, see [Lambda quotas](https://docs.aws.amazon.com/lambda/latest/dg/gettingstarted-limits.html) in the *AWS Lambda Developer Guide*.
+ Because Redshift does not support external partitions, all data specified by a query is retrieved every time.
+ Like Redshift, Athena treats trailing spaces in Redshift `CHAR` values as semantically insignificant for length and comparison purposes. This applies only to the `CHAR` type; Athena treats trailing spaces as significant for `VARCHAR` values.

## Terms


The following terms relate to the Redshift connector.
+ **Database instance** – Any instance of a database deployed on premises, on Amazon EC2, or on Amazon RDS.
+ **Handler** – A Lambda handler that accesses your database instance. A handler can be for metadata or for data records.
+ **Metadata handler** – A Lambda handler that retrieves metadata from your database instance.
+ **Record handler** – A Lambda handler that retrieves data records from your database instance.
+ **Composite handler** – A Lambda handler that retrieves both metadata and data records from your database instance.
+ **Property or parameter** – A database property used by handlers to extract database information. You configure these properties as Lambda environment variables.
+ **Connection String** – A string of text used to establish a connection to a database instance.
+ **Catalog** – A non-AWS Glue catalog registered with Athena that is a required prefix for the `connection_string` property.
+ **Multiplexing handler** – A Lambda handler that can accept and use multiple database connections.

## Parameters


Use the parameters in this section to configure the Redshift connector.

### Glue connections (recommended)


We recommend that you configure a Redshift connector by using a Glue connections object. To do this, set the `glue_connection` environment variable of the Amazon Redshift connector Lambda to the name of the Glue connection to use.

**Glue connections properties**

Use the following command to get the schema for a Glue connection object. This schema contains all the parameters that you can use to control your connection.

```
aws glue describe-connection-type --connection-type REDSHIFT
```

**Lambda environment properties**

**glue_connection** – Specifies the name of the Glue connection associated with the federated connector.

**Note**  
All connectors that use Glue connections must use AWS Secrets Manager to store credentials.
The Redshift connector created using Glue connections does not support the use of a multiplexing handler.
The Redshift connector created using Glue connections only supports `ConnectionSchemaVersion` 2.

### Legacy connections


**Note**  
Athena data source connectors created on December 3, 2024 and later use AWS Glue connections.

The parameter names and definitions listed below are for Athena data source connectors created without an associated Glue connection. Use the following parameters only when you [manually deploy](connect-data-source-serverless-app-repo.md) an earlier version of an Athena data source connector or when the `glue_connection` environment property is not specified.

**Lambda environment properties**
+ **spill_bucket** – Specifies the Amazon S3 bucket for data that exceeds Lambda function limits.
+ **spill_prefix** – (Optional) Defaults to a subfolder in the specified `spill_bucket` called `athena-federation-spill`. We recommend that you configure an Amazon S3 [storage lifecycle](https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-lifecycle-mgmt.html) on this location to delete spills older than a predetermined number of days or hours.
+ **spill_put_request_headers** – (Optional) A JSON encoded map of request headers and values for the Amazon S3 `putObject` request that is used for spilling (for example, `{"x-amz-server-side-encryption" : "AES256"}`). For other possible headers, see [PutObject](https://docs.aws.amazon.com/AmazonS3/latest/API/API_PutObject.html) in the *Amazon Simple Storage Service API Reference*.
+ **kms_key_id** – (Optional) By default, any data that is spilled to Amazon S3 is encrypted using the AES-GCM authenticated encryption mode and a randomly generated key. To have your Lambda function use stronger encryption keys generated by KMS like `a7e63k4b-8loc-40db-a2a1-4d0en2cd8331`, you can specify a KMS key ID.
+ **disable_spill_encryption** – (Optional) When set to `True`, disables spill encryption. Defaults to `False` so that data that is spilled to S3 is encrypted using AES-GCM – either using a randomly generated key or KMS to generate keys. Disabling spill encryption can improve performance, especially if your spill location uses [server-side encryption](https://docs.aws.amazon.com/AmazonS3/latest/userguide/serv-side-encryption.html).
+ **disable_glue** – (Optional) If present and set to true, the connector does not attempt to retrieve supplemental metadata from AWS Glue.
+ **glue_catalog** – (Optional) Use this option to specify a [cross-account AWS Glue catalog](data-sources-glue-cross-account.md). By default, the connector attempts to get metadata from its own AWS Glue account.

#### Connection string


Use a JDBC connection string in the following format to connect to a database instance.

```
redshift://${jdbc_connection_string}
```

#### Using a multiplexing handler


You can use a multiplexer to connect to multiple database instances with a single Lambda function. Requests are routed by catalog name. Use the following classes in Lambda.


****  

| Handler | Class | 
| --- | --- | 
| Composite handler | RedshiftMuxCompositeHandler | 
| Metadata handler | RedshiftMuxMetadataHandler | 
| Record handler | RedshiftMuxRecordHandler | 

##### Multiplexing handler parameters



****  

| Parameter | Description | 
| --- | --- | 
| *catalog*_connection_string | Required. A database instance connection string. Prefix the environment variable with the name of the catalog used in Athena. For example, if the catalog registered with Athena is myredshiftcatalog, then the environment variable name is myredshiftcatalog_connection_string. | 
| default | Required. The default connection string. This string is used when the catalog is lambda:${AWS_LAMBDA_FUNCTION_NAME}. | 

The following example properties are for a Redshift MUX Lambda function that supports two database instances: `redshift1` (the default), and `redshift2`.


****  

| Property | Value | 
| --- | --- | 
| default | redshift://jdbc:redshift://redshift1.host:5439/dev?user=sample2&password=sample2 | 
| redshift_catalog1_connection_string | redshift://jdbc:redshift://redshift1.host:3306/default?${Test/RDS/Redshift1} | 
| redshift_catalog2_connection_string | redshift://jdbc:redshift://redshift2.host:3333/default?user=sample2&password=sample2 | 

##### Providing credentials


To provide a user name and password for your database in your JDBC connection string, you can use connection string properties or AWS Secrets Manager.
+ **Connection String** – A user name and password can be specified as properties in the JDBC connection string.
**Important**  
As a security best practice, don't use hardcoded credentials in your environment variables or connection strings. For information about moving your hardcoded secrets to AWS Secrets Manager, see [Move hardcoded secrets to AWS Secrets Manager](https://docs.aws.amazon.com/secretsmanager/latest/userguide/hardcoded.html) in the *AWS Secrets Manager User Guide*.
+ **AWS Secrets Manager** – To use the Athena Federated Query feature with AWS Secrets Manager, the VPC connected to your Lambda function should have [internet access](https://aws.amazon.com/premiumsupport/knowledge-center/internet-access-lambda-function/) or a [VPC endpoint](https://docs.aws.amazon.com/secretsmanager/latest/userguide/vpc-endpoint-overview.html) to connect to Secrets Manager.

  You can put the name of a secret in AWS Secrets Manager in your JDBC connection string. The connector replaces the secret name with the `username` and `password` values from Secrets Manager.

  For Amazon RDS database instances, this support is tightly integrated. If you use Amazon RDS, we highly recommend using AWS Secrets Manager and credential rotation. If your database does not use Amazon RDS, store the credentials as JSON in the following format:

  ```
  {"username": "${username}", "password": "${password}"}
  ```

**Example connection string with secret name**  
The following string has the secret name `${Test/RDS/Redshift1}`.

```
redshift://jdbc:redshift://redshift1.host:3306/default?...&${Test/RDS/Redshift1}&...
```

The connector uses the secret name to retrieve secrets and provide the user name and password, as in the following example.

```
redshift://jdbc:redshift://redshift1.host:3306/default?...&user=sample2&password=sample2&...
```

Currently, the Redshift connector recognizes the `user` and `password` JDBC properties.
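To illustrate the substitution described above, the following sketch (not the connector's actual code, and with a hypothetical helper name) replaces a `${secret-name}` placeholder in a connection string with `user` and `password` JDBC properties, as the connector does after fetching the secret from Secrets Manager:

```python
import re

def resolve_secret(connection_string: str, secrets: dict) -> str:
    """Replace each ${secret-name} placeholder with user/password JDBC
    properties, mimicking (in simplified form) what the connector does."""
    def substitute(match):
        # Secret values follow the documented {"username": ..., "password": ...} shape.
        secret = secrets[match.group(1)]
        return "user={username}&password={password}".format(**secret)
    return re.sub(r"\$\{([^}]+)\}", substitute, connection_string)

secrets = {"Test/RDS/Redshift1": {"username": "sample2", "password": "sample2"}}
print(resolve_secret(
    "redshift://jdbc:redshift://redshift1.host:3306/default?${Test/RDS/Redshift1}",
    secrets,
))
# redshift://jdbc:redshift://redshift1.host:3306/default?user=sample2&password=sample2
```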

## Data type support


The following table shows the corresponding data types for JDBC and Apache Arrow.


****  

| JDBC | Arrow | 
| --- | --- | 
| Boolean | Bit | 
| Integer | Tiny | 
| Short | Smallint | 
| Integer | Int | 
| Long | Bigint | 
| float | Float4 | 
| Double | Float8 | 
| Date | DateDay | 
| Timestamp | DateMilli | 
| String | Varchar | 
| Bytes | Varbinary | 
| BigDecimal | Decimal | 
| ARRAY | List | 

## Partitions and splits


Redshift does not support external partitions. For information about performance related issues, see [Performance](#connectors-redshift-performance).

## Performance


The Athena Redshift connector performs predicate pushdown to decrease the data scanned by the query. `LIMIT` clauses, `ORDER BY` clauses, simple predicates, and complex expressions are pushed down to the connector to reduce the amount of data scanned and decrease query execution run time. However, selecting a subset of columns sometimes results in a longer query execution runtime. Amazon Redshift is particularly susceptible to query execution slowdown when you run multiple queries concurrently.

### LIMIT clauses


A `LIMIT N` statement reduces the data scanned by the query. With `LIMIT N` pushdown, the connector returns only `N` rows to Athena.

### Top N queries


A top `N` query specifies an ordering of the result set and a limit on the number of rows returned. You can use this type of query to determine the top `N` max values or top `N` min values for your datasets. With top `N` pushdown, the connector returns only `N` ordered rows to Athena.
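For example, the following query against a hypothetical `orders` table qualifies for top N pushdown, so Redshift performs the sort and returns only the 10 ordered rows to Athena:

```
SELECT order_id, order_total
FROM orders
ORDER BY order_total DESC
LIMIT 10
```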

### Predicates


A predicate is an expression in the `WHERE` clause of a SQL query that evaluates to a Boolean value and filters rows based on multiple conditions. The Athena Redshift connector can combine these expressions and push them directly to Redshift for enhanced functionality and to reduce the amount of data scanned.

The following Athena Redshift connector operators support predicate pushdown:
+ **Boolean:** AND, OR, NOT
+ **Equality:** EQUAL, NOT_EQUAL, LESS_THAN, LESS_THAN_OR_EQUAL, GREATER_THAN, GREATER_THAN_OR_EQUAL, IS_DISTINCT_FROM, NULL_IF, IS_NULL
+ **Arithmetic:** ADD, SUBTRACT, MULTIPLY, DIVIDE, MODULUS, NEGATE
+ **Other:** LIKE_PATTERN, IN

### Combined pushdown example


For enhanced querying capabilities, combine the pushdown types, as in the following example:

```
SELECT * 
FROM my_table 
WHERE col_a > 10 
    AND ((col_a + col_b) > (col_c % col_d)) 
    AND (col_e IN ('val1', 'val2', 'val3') OR col_f LIKE '%pattern%') 
ORDER BY col_a DESC 
LIMIT 10;
```

For an article on using predicate pushdown to improve performance in federated queries, including Amazon Redshift, see [Improve federated queries with predicate pushdown in Amazon Athena](https://aws.amazon.com/blogs/big-data/improve-federated-queries-with-predicate-pushdown-in-amazon-athena/) in the *AWS Big Data Blog*.

## Passthrough queries


The Redshift connector supports [passthrough queries](federated-query-passthrough.md). Passthrough queries use a table function to push your full query down to the data source for execution.

To use passthrough queries with Redshift, you can use the following syntax:

```
SELECT * FROM TABLE(
        system.query(
            query => 'query string'
        ))
```

The following example query pushes down a query to a data source in Redshift. The query selects all columns in the `customer` table, limiting the results to 10.

```
SELECT * FROM TABLE(
        system.query(
            query => 'SELECT * FROM customer LIMIT 10'
        ))
```

## Additional resources


For the latest JDBC driver version information, see the [pom.xml](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-redshift/pom.xml) file for the Redshift connector on GitHub.com.

For additional information about this connector, visit [the corresponding site](https://github.com/awslabs/aws-athena-query-federation/tree/master/athena-redshift) on GitHub.com.

# Amazon Athena SAP HANA connector
SAP HANA

The Amazon Athena SAP HANA connector enables Amazon Athena to run SQL queries on data in your SAP HANA databases.

This connector can be registered with Glue Data Catalog as a federated catalog. It supports data access controls defined in Lake Formation at the catalog, database, table, column, row, and tag levels. This connector uses Glue Connections to centralize configuration properties in Glue.

## Prerequisites

+ Deploy the connector to your AWS account using the Athena console or the AWS Serverless Application Repository. For more information, see [Create a data source connection](connect-to-a-data-source.md) or [Use the AWS Serverless Application Repository to deploy a data source connector](connect-data-source-serverless-app-repo.md).

## Limitations

+ Write DDL operations are not supported.
+ In a multiplexer setup, the spill bucket and prefix are shared across all database instances.
+ Any relevant Lambda limits. For more information, see [Lambda quotas](https://docs.aws.amazon.com/lambda/latest/dg/gettingstarted-limits.html) in the *AWS Lambda Developer Guide*.
+ In SAP HANA, object names are converted to uppercase when they are stored in the SAP HANA database. However, because names in quotation marks are case sensitive, it is possible for two tables to have the same name in lower and upper case (for example, `EMPLOYEE` and `employee`).

  In Athena Federated Query, schema table names are provided to the Lambda function in lower case. To work around this issue, you can provide `@schemaCase` query hints to retrieve the data from the tables that have case sensitive names. Following are two sample queries with query hints.

  ```
  SELECT * 
  FROM "lambda:saphanaconnector".SYSTEM."MY_TABLE@schemaCase=upper&tableCase=upper"
  ```

  ```
  SELECT * 
  FROM "lambda:saphanaconnector".SYSTEM."MY_TABLE@schemaCase=upper&tableCase=lower"
  ```

## Terms


The following terms relate to the SAP HANA connector.
+ **Database instance** – Any instance of a database deployed on premises, on Amazon EC2, or on Amazon RDS.
+ **Handler** – A Lambda handler that accesses your database instance. A handler can be for metadata or for data records.
+ **Metadata handler** – A Lambda handler that retrieves metadata from your database instance.
+ **Record handler** – A Lambda handler that retrieves data records from your database instance.
+ **Composite handler** – A Lambda handler that retrieves both metadata and data records from your database instance.
+ **Property or parameter** – A database property used by handlers to extract database information. You configure these properties as Lambda environment variables.
+ **Connection String** – A string of text used to establish a connection to a database instance.
+ **Catalog** – A non-AWS Glue catalog registered with Athena that is a required prefix for the `connection_string` property.
+ **Multiplexing handler** – A Lambda handler that can accept and use multiple database connections.

## Parameters


Use the parameters in this section to configure the SAP HANA connector.

**Note**  
Athena data source connectors created on December 3, 2024 and later use AWS Glue connections.  
The parameter names and definitions listed below are for Athena data source connectors created prior to December 3, 2024. These can differ from their corresponding [AWS Glue connection properties](https://docs.aws.amazon.com/glue/latest/dg/connection-properties.html). Starting December 3, 2024, use the parameters below only when you [manually deploy](connect-data-source-serverless-app-repo.md) an earlier version of an Athena data source connector.

### Glue connections (recommended)


We recommend that you configure a SAP HANA connector by using a Glue connections object. To do this, set the `glue_connection` environment variable of the SAP HANA connector Lambda to the name of the Glue connection to use.

**Glue connections properties**

Use the following command to get the schema for a Glue connection object. This schema contains all the parameters that you can use to control your connection.

```
aws glue describe-connection-type --connection-type SAPHANA
```

**Lambda environment properties**
+ **glue_connection** – Specifies the name of the Glue connection associated with the federated connector.
+ **casing_mode** – (Optional) Specifies how to handle casing for schema and table names. The `casing_mode` parameter uses the following values to specify the behavior of casing:
  + **none** – Do not change the case of the given schema and table names. This is the default for connectors that have an associated Glue connection.
  + **upper** – Upper case all given schema and table names.
  + **lower** – Lower case all given schema and table names.

**Note**  
All connectors that use Glue connections must use AWS Secrets Manager to store credentials.
The SAP HANA connector created using Glue connections does not support the use of a multiplexing handler.
The SAP HANA connector created using Glue connections only supports `ConnectionSchemaVersion` 2.

### Legacy connections


#### Connection string


Use a JDBC connection string in the following format to connect to a database instance.

```
saphana://${jdbc_connection_string}
```

#### Using a multiplexing handler


You can use a multiplexer to connect to multiple database instances with a single Lambda function. Requests are routed by catalog name. Use the following classes in Lambda.


****  

| Handler | Class | 
| --- | --- | 
| Composite handler | SaphanaMuxCompositeHandler | 
| Metadata handler | SaphanaMuxMetadataHandler | 
| Record handler | SaphanaMuxRecordHandler | 

##### Multiplexing handler parameters



****  

| Parameter | Description | 
| --- | --- | 
| *catalog*_connection_string | Required. A database instance connection string. Prefix the environment variable with the name of the catalog used in Athena. For example, if the catalog registered with Athena is mysaphanacatalog, then the environment variable name is mysaphanacatalog_connection_string. | 
| default | Required. The default connection string. This string is used when the catalog is lambda:${AWS_LAMBDA_FUNCTION_NAME}. | 

The following example properties are for a Saphana MUX Lambda function that supports two database instances: `saphana1` (the default), and `saphana2`.


****  

| Property | Value | 
| --- | --- | 
| default | saphana://jdbc:sap://saphana1.host:port/?${Test/RDS/Saphana1} | 
| saphana_catalog1_connection_string | saphana://jdbc:sap://saphana1.host:port/?${Test/RDS/Saphana1} | 
| saphana_catalog2_connection_string | saphana://jdbc:sap://saphana2.host:port/?user=sample2&password=sample2 | 

##### Providing credentials


To provide a user name and password for your database in your JDBC connection string, you can use connection string properties or AWS Secrets Manager.
+ **Connection String** – A user name and password can be specified as properties in the JDBC connection string.
**Important**  
As a security best practice, don't use hardcoded credentials in your environment variables or connection strings. For information about moving your hardcoded secrets to AWS Secrets Manager, see [Move hardcoded secrets to AWS Secrets Manager](https://docs.aws.amazon.com/secretsmanager/latest/userguide/hardcoded.html) in the *AWS Secrets Manager User Guide*.
+ **AWS Secrets Manager** – To use the Athena Federated Query feature with AWS Secrets Manager, the VPC connected to your Lambda function should have [internet access](https://aws.amazon.com/premiumsupport/knowledge-center/internet-access-lambda-function/) or a [VPC endpoint](https://docs.aws.amazon.com/secretsmanager/latest/userguide/vpc-endpoint-overview.html) to connect to Secrets Manager.

  You can put the name of a secret in AWS Secrets Manager in your JDBC connection string. The connector replaces the secret name with the `username` and `password` values from Secrets Manager.

  For Amazon RDS database instances, this support is tightly integrated. If you use Amazon RDS, we highly recommend using AWS Secrets Manager and credential rotation. If your database does not use Amazon RDS, store the credentials as JSON in the following format:

  ```
  {"username": "${username}", "password": "${password}"}
  ```

**Example connection string with secret name**  
The following string has the secret name `${Test/RDS/Saphana1}`.

```
saphana://jdbc:sap://saphana1.host:port/?${Test/RDS/Saphana1}&...
```

The connector uses the secret name to retrieve secrets and provide the user name and password, as in the following example.

```
saphana://jdbc:sap://saphana1.host:port/?user=sample2&password=sample2&...
```

Currently, the SAP HANA connector recognizes the `user` and `password` JDBC properties.
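The substitution described above can be sketched in a few lines. The following is an illustrative Python sketch, not connector code; the Secrets Manager lookup is stubbed with a local dictionary, and the secret and host names are the placeholder values from the examples above.

```python
import re

def resolve_secrets(connection_string, get_secret):
    """Replace each ${secretName} placeholder with user=...&password=...
    values retrieved from a secret store (a stand-in for what the
    connector does with AWS Secrets Manager)."""
    def substitute(match):
        secret = get_secret(match.group(1))
        return "user={username}&password={password}".format(**secret)
    return re.sub(r"\$\{([^}]+)\}", substitute, connection_string)

# Local stand-in for Secrets Manager
secrets = {"Test/RDS/Saphana1": {"username": "sample2", "password": "sample2"}}

resolved = resolve_secrets(
    "saphana://jdbc:sap://saphana1.host:port/?${Test/RDS/Saphana1}",
    secrets.__getitem__,
)
print(resolved)
# saphana://jdbc:sap://saphana1.host:port/?user=sample2&password=sample2
```

The real connector retrieves the secret at query time, so rotated credentials are picked up without redeploying the Lambda function.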

#### Using a single connection handler


You can use the following single connection metadata and record handlers to connect to a single SAP HANA instance.


****  

| Handler type | Class | 
| --- | --- | 
| Composite handler | SaphanaCompositeHandler | 
| Metadata handler | SaphanaMetadataHandler | 
| Record handler | SaphanaRecordHandler | 

##### Single connection handler parameters



****  

| Parameter | Description | 
| --- | --- | 
| default | Required. The default connection string. | 

The single connection handlers support one database instance and must provide a `default` connection string parameter. All other connection strings are ignored.

The following example property is for a single SAP HANA instance supported by a Lambda function.


****  

| Property | Value | 
| --- | --- | 
| default | saphana://jdbc:sap://saphana1.host:port/?secret=Test/RDS/Saphana1 | 

#### Spill parameters


The Lambda SDK can spill data to Amazon S3. All database instances accessed by the same Lambda function spill to the same location.


****  

| Parameter | Description | 
| --- | --- | 
| spill_bucket | Required. Spill bucket name. | 
| spill_prefix | Required. Spill bucket key prefix. | 
| spill_put_request_headers | (Optional) A JSON encoded map of request headers and values for the Amazon S3 putObject request that is used for spilling (for example, {"x-amz-server-side-encryption" : "AES256"}). For other possible headers, see [PutObject](https://docs.aws.amazon.com/AmazonS3/latest/API/API_PutObject.html) in the *Amazon Simple Storage Service API Reference*. | 

## Data type support


The following table shows the corresponding data types for JDBC and Apache Arrow.


****  

| JDBC | Arrow | 
| --- | --- | 
| Boolean | Bit | 
| Integer | Tiny | 
| Short | Smallint | 
| Integer | Int | 
| Long | Bigint | 
| float | Float4 | 
| Double | Float8 | 
| Date | DateDay | 
| Timestamp | DateMilli | 
| String | Varchar | 
| Bytes | Varbinary | 
| BigDecimal | Decimal | 
| ARRAY | List | 

## Data type conversions


In addition to the JDBC to Arrow conversions, the connector performs certain other conversions to make the SAP HANA source and Athena data types compatible. These conversions help ensure that queries get executed successfully. The following table shows these conversions.


****  

| Source data type (SAP HANA) | Converted data type (Athena) | 
| --- | --- | 
| DECIMAL | BIGINT | 
| INTEGER | INT | 
| DATE | DATEDAY | 
| TIMESTAMP | DATEMILLI | 

All other unsupported data types are converted to `VARCHAR`.
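The conversion table above amounts to a lookup with a `VARCHAR` fallback. The following Python sketch is illustrative only (the connector itself performs this mapping in Java):

```python
# SAP HANA source type -> Athena type, per the conversion table above
CONVERSIONS = {
    "DECIMAL": "BIGINT",
    "INTEGER": "INT",
    "DATE": "DATEDAY",
    "TIMESTAMP": "DATEMILLI",
}

def to_athena_type(saphana_type):
    # Any type not in the table falls back to VARCHAR
    return CONVERSIONS.get(saphana_type.upper(), "VARCHAR")

print(to_athena_type("DECIMAL"))   # BIGINT
print(to_athena_type("NVARCHAR"))  # VARCHAR
```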

## Partitions and splits


A partition is represented by a single partition column of type `Integer`. The column contains partition names of the partitions defined on an SAP HANA table. For a table that does not have partition names, `*` is returned, which is equivalent to a single partition. A partition is equivalent to a split.


****  

| Name | Type | Description | 
| --- | --- | --- | 
| PART_ID | Integer | Named partition in SAP HANA. | 

## Performance


SAP HANA supports native partitions. The Athena SAP HANA connector can retrieve data from these partitions in parallel. Native partitioning is highly recommended when you query very large datasets that have uniform partition distribution. Selecting a subset of columns significantly speeds up query runtime and reduces the data scanned. Under high concurrency, the connector can experience significant throttling and occasional query failures.

The Athena SAP HANA connector performs predicate pushdown to decrease the data scanned by the query. `LIMIT` clauses, simple predicates, and complex expressions are pushed down to the connector to reduce the amount of data scanned and decrease query execution run time. 

### LIMIT clauses


A `LIMIT N` statement reduces the data scanned by the query. With `LIMIT N` pushdown, the connector returns only `N` rows to Athena.

### Predicates


A predicate is an expression in the `WHERE` clause of a SQL query that evaluates to a Boolean value and filters rows based on multiple conditions. The Athena SAP HANA connector can combine these expressions and push them directly to SAP HANA for enhanced functionality and to reduce the amount of data scanned.

The following Athena SAP HANA connector operators support predicate pushdown:
+ **Boolean**: AND, OR, NOT
+ **Equality**: EQUAL, NOT_EQUAL, LESS_THAN, LESS_THAN_OR_EQUAL, GREATER_THAN, GREATER_THAN_OR_EQUAL, IS_DISTINCT_FROM, NULL_IF, IS_NULL
+ **Arithmetic**: ADD, SUBTRACT, MULTIPLY, DIVIDE, MODULUS, NEGATE
+ **Other**: LIKE_PATTERN, IN

### Combined pushdown example


For enhanced querying capabilities, combine the pushdown types, as in the following example:

```
SELECT * 
FROM my_table 
WHERE col_a > 10 
    AND ((col_a + col_b) > (col_c % col_d))
    AND (col_e IN ('val1', 'val2', 'val3') OR col_f LIKE '%pattern%') 
LIMIT 10;
```

## Passthrough queries


The SAP HANA connector supports [passthrough queries](federated-query-passthrough.md). Passthrough queries use a table function to push your full query down to the data source for execution.

To use passthrough queries with SAP HANA, you can use the following syntax:

```
SELECT * FROM TABLE(
        system.query(
            query => 'query string'
        ))
```

The following example query pushes down a query to a data source in SAP HANA. The query selects all columns in the `customer` table, limiting the results to 10.

```
SELECT * FROM TABLE(
        system.query(
            query => 'SELECT * FROM customer LIMIT 10'
        ))
```
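A small helper can wrap any native query in this syntax. The following Python sketch is illustrative only (it is not part of the connector or the Query Federation SDK); note that single quotes inside the native query must be doubled for the SQL string literal.

```python
def passthrough_query(native_query):
    """Wrap a native SAP HANA query in the Athena passthrough syntax.

    Single quotes in the native query are doubled so the result
    remains a valid SQL string literal.
    """
    escaped = native_query.replace("'", "''")
    return (
        "SELECT * FROM TABLE(\n"
        "        system.query(\n"
        f"            query => '{escaped}'\n"
        "        ))"
    )

print(passthrough_query("SELECT * FROM customer LIMIT 10"))
```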

## License information


By using this connector, you acknowledge the inclusion of third party components, a list of which can be found in the [pom.xml](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-saphana/pom.xml) file for this connector, and agree to the terms in the respective third party licenses provided in the [LICENSE.txt](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-saphana/LICENSE.txt) file on GitHub.com.

## Additional resources


For the latest JDBC driver version information, see the [pom.xml](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-saphana/pom.xml) file for the SAP HANA connector on GitHub.com.

For additional information about this connector, visit [the corresponding site](https://github.com/awslabs/aws-athena-query-federation/tree/master/athena-saphana) on GitHub.com.

# Amazon Athena Snowflake connector
Snowflake

The Amazon Athena connector for [Snowflake](https://www.snowflake.com/) enables Amazon Athena to run SQL queries on data stored in your Snowflake SQL database or RDS instances using JDBC.

This connector can be registered with the AWS Glue Data Catalog as a federated catalog. It supports data access controls defined in AWS Lake Formation at the catalog, database, table, column, row, and tag levels. This connector uses AWS Glue connections to centralize configuration properties in AWS Glue.

## Prerequisites


Deploy the connector to your AWS account using the Athena console or the `CreateDataCatalog` API operation. For more information, see [Create a data source connection](connect-to-a-data-source.md).

## Limitations

+ Write DDL operations are not supported.
+ In a multiplexer setup, the spill bucket and prefix are shared across all database instances.
+ Any relevant Lambda limits. For more information, see [Lambda quotas](https://docs.aws.amazon.com/lambda/latest/dg/gettingstarted-limits.html) in the *AWS Lambda Developer Guide*.
+ Only legacy connections support a multiplexer setup. 
+ Currently, Snowflake views are supported with a single split only. 
+ In Snowflake, object names are case-sensitive. Athena accepts mixed case in DDL and DML queries, but by default it [lower cases](https://docs.aws.amazon.com/athena/latest/ug/tables-databases-columns-names.html#table-names-and-table-column-names-in-athena-must-be-lowercase) object names when it executes the query. The Snowflake connector supports only lower case when AWS Glue Data Catalog or Lake Formation is used. When the Athena catalog is used, you can control the casing behavior with the `casing_mode` Lambda environment variable, whose possible values are listed in the [Parameters](#connectors-snowflake-parameters) section (for example, `key=casing_mode, value=CASE_INSENSITIVE_SEARCH`). 

## Terms


The following terms relate to the Snowflake connector.
+ **Database instance** – Any instance of a database deployed on premises, on Amazon EC2, or on Amazon RDS.
+ **Handler** – A Lambda handler that accesses your database instance. A handler can be for metadata or for data records.
+ **Metadata handler** – A Lambda handler that retrieves metadata from your database instance.
+ **Record handler** – A Lambda handler that retrieves data records from your database instance.
+ **Composite handler** – A Lambda handler that retrieves both metadata and data records from your database instance.
+ **Property or parameter** – A database property used by handlers to extract database information. You configure these properties as Lambda environment variables.
+ **Connection String** – A string of text used to establish a connection to a database instance.
+ **Catalog** – A non-AWS Glue catalog registered with Athena that is a required prefix for the `connection_string` property.
+ **Multiplexing handler** – A Lambda handler that can accept and use multiple database connections.

## Parameters


Use the parameters in this section to configure the Snowflake connector.

### Glue connections (recommended)


We recommend that you configure the Snowflake connector by using an AWS Glue connection object. To do this, set the `glue_connection` environment variable of the Snowflake connector Lambda function to the name of the Glue connection to use.

**Glue connections properties**

Use the following command to get the schema for a Glue connection object. This schema contains all the parameters that you can use to control your connection.

```
aws glue describe-connection-type --connection-type SNOWFLAKE
```

**Lambda environment properties**
+ **glue_connection** – Specifies the name of the Glue connection associated with the federated connector. 
+ **casing_mode** – (Optional) Specifies how to handle casing for schema and table names. The `casing_mode` parameter uses the following values to specify the behavior of casing:
  + **NONE** – Do not change the case of the given schema and table names (run the query as is against Snowflake). This is the default value when **casing_mode** is not specified. 
  + **UPPER** – Upper case all given schema and table names in the query before running it against Snowflake.
  + **LOWER** – Lower case all given schema and table names in the query before running it against Snowflake.
  + **CASE_INSENSITIVE_SEARCH** – Perform case insensitive searches against schema and table names in Snowflake. For example, you can use this mode when you have a query like `SELECT * FROM EMPLOYEE` and Snowflake contains a table called `Employee`. However, in the presence of name collisions, such as a table called `EMPLOYEE` and another table called `Employee` in Snowflake, the query fails.
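The `CASE_INSENSITIVE_SEARCH` behavior, including the collision failure described above, can be sketched as follows. This is an illustrative Python sketch of the resolution logic, not connector code:

```python
def case_insensitive_search(existing_tables, requested):
    """Sketch of CASE_INSENSITIVE_SEARCH name resolution: match a table
    name ignoring case, and fail when the match is ambiguous."""
    matches = [t for t in existing_tables if t.lower() == requested.lower()]
    if len(matches) > 1:
        # Name collision, e.g. both EMPLOYEE and Employee exist
        raise ValueError(f"Ambiguous name {requested!r}: matches {matches}")
    if not matches:
        raise LookupError(f"No table matches {requested!r}")
    return matches[0]

print(case_insensitive_search(["Employee", "Orders"], "EMPLOYEE"))  # Employee
```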

**Note**  
The Snowflake connector created using Glue connections does not support the use of a multiplexing handler.
The Snowflake connector created using Glue connections only supports `ConnectionSchemaVersion` 2.

**Storing credentials**

All connectors that use Glue connections must use AWS Secrets Manager to store credentials. For more information, see [Authenticate with Snowflake](connectors-snowflake-authentication.md).

### Legacy connections


**Note**  
Athena data source connectors created on December 3, 2024 and later use AWS Glue connections.

The parameter names and definitions listed below are for Athena data source connectors created without an associated Glue connection. Use the following parameters only when you [manually deploy](connect-data-source-serverless-app-repo.md) an earlier version of an Athena data source connector or when the `glue_connection` environment property is not specified.

**Lambda environment properties**
+ **default** – The JDBC connection string to use to connect to the Snowflake database instance. For example, `snowflake://${jdbc_connection_string}`
+ **catalog_connection_string** – Used by the multiplexing handler (not supported when using a Glue connection). A database instance connection string. Prefix the environment variable with the name of the catalog used in Athena. For example, if the catalog registered with Athena is mysnowflakecatalog, then the environment variable name is mysnowflakecatalog_connection_string.
+ **casing_mode** – (Optional) Specifies how to handle casing for schema and table names. The `casing_mode` parameter uses the following values to specify the behavior of casing:
  + **NONE** – Do not change the case of the given schema and table names (run the query as is against Snowflake). This is the default value when **casing_mode** is not specified. 
  + **UPPER** – Upper case all given schema and table names in the query before running it against Snowflake.
  + **LOWER** – Lower case all given schema and table names in the query before running it against Snowflake.
  + **CASE_INSENSITIVE_SEARCH** – Perform case insensitive searches against schema and table names in Snowflake. For example, you can use this mode when you have a query like `SELECT * FROM EMPLOYEE` and Snowflake contains a table called `Employee`. However, in the presence of name collisions, such as a table called `EMPLOYEE` and another table called `Employee` in Snowflake, the query fails.
+ **spill_bucket** – Specifies the Amazon S3 bucket for data that exceeds Lambda function limits.
+ **spill_prefix** – (Optional) Defaults to a subfolder in the specified `spill_bucket` called `athena-federation-spill`. We recommend that you configure an Amazon S3 [storage lifecycle](https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-lifecycle-mgmt.html) on this location to delete spills older than a predetermined number of days or hours.
+ **spill_put_request_headers** – (Optional) A JSON encoded map of request headers and values for the Amazon S3 `putObject` request that is used for spilling (for example, `{"x-amz-server-side-encryption" : "AES256"}`). For other possible headers, see [PutObject](https://docs.aws.amazon.com/AmazonS3/latest/API/API_PutObject.html) in the *Amazon Simple Storage Service API Reference*.
+ **kms_key_id** – (Optional) By default, any data that is spilled to Amazon S3 is encrypted using the AES-GCM authenticated encryption mode and a randomly generated key. To have your Lambda function use stronger encryption keys generated by KMS, such as `a7e63k4b-8loc-40db-a2a1-4d0en2cd8331`, you can specify a KMS key ID.
+ **disable_spill_encryption** – (Optional) When set to `True`, disables spill encryption. Defaults to `False` so that data spilled to S3 is encrypted using AES-GCM, either with a randomly generated key or with keys generated by KMS. Disabling spill encryption can improve performance, especially if your spill location uses [server-side encryption](https://docs.aws.amazon.com/AmazonS3/latest/userguide/serv-side-encryption.html).
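As an illustration, a legacy configuration built from the properties above might look like the following. This is a hedged Python sketch only; the bucket name and connection string are hypothetical placeholders, and in practice you set these as Lambda environment variables rather than building them in code.

```python
import json

# Hypothetical values; substitute your own bucket, prefix, and endpoint.
lambda_environment = {
    "default": "snowflake://jdbc:snowflake://snowflake1.host:port/"
               "?warehouse=warehousename&db=db1&schema=schema1"
               "&${Test/RDS/Snowflake1}",
    "spill_bucket": "my-athena-spill-bucket",
    "spill_prefix": "athena-federation-spill",
    # JSON-encoded headers applied to the S3 putObject spill request
    "spill_put_request_headers": json.dumps(
        {"x-amz-server-side-encryption": "AES256"}
    ),
}

print(lambda_environment["spill_put_request_headers"])
```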

#### Connection string


Use a JDBC connection string in the following format to connect to a database instance.

```
snowflake://${jdbc_connection_string}
```
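The pieces of the JDBC portion, warehouse, database, schema, and a Secrets Manager secret-name placeholder, can be assembled as follows. This Python helper is illustrative only (it is not part of the connector or the SDK), and the host, port, and secret name are placeholder values:

```python
def snowflake_connection_string(host, port, warehouse, db, schema, secret):
    """Assemble a connector connection string of the form
    snowflake://${jdbc_connection_string}, ending in a Secrets Manager
    secret-name placeholder that the connector resolves at query time."""
    jdbc = (
        f"jdbc:snowflake://{host}:{port}/?warehouse={warehouse}"
        f"&db={db}&schema={schema}&${{{secret}}}"
    )
    return f"snowflake://{jdbc}"

print(snowflake_connection_string(
    "snowflake1.host", 443, "warehousename", "db1", "schema1",
    "Test/RDS/Snowflake1"))
```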

#### Using a multiplexing handler


You can use a multiplexer to connect to multiple database instances with a single Lambda function. Requests are routed by catalog name. Use the following classes in Lambda.


****  

| Handler | Class | 
| --- | --- | 
| Composite handler | SnowflakeMuxCompositeHandler | 
| Metadata handler | SnowflakeMuxMetadataHandler | 
| Record handler | SnowflakeMuxRecordHandler | 

##### Multiplexing handler parameters



****  

| Parameter | Description | 
| --- | --- | 
| *catalog*_connection_string | Required. A database instance connection string. Prefix the environment variable with the name of the catalog used in Athena. For example, if the catalog registered with Athena is mysnowflakecatalog, then the environment variable name is mysnowflakecatalog_connection_string. | 
| default | Required. The default connection string. This string is used when the catalog is lambda:${AWS_LAMBDA_FUNCTION_NAME}. | 

The following example properties are for a Snowflake MUX Lambda function that supports two database instances: `snowflake1` (the default), and `snowflake2`.


****  

| Property | Value | 
| --- | --- | 
| default | snowflake://jdbc:snowflake://snowflake1.host:port/?warehouse=warehousename&db=db1&schema=schema1&${Test/RDS/Snowflake1} | 
| snowflake_catalog1_connection_string | snowflake://jdbc:snowflake://snowflake1.host:port/?warehouse=warehousename&db=db1&schema=schema1&${Test/RDS/Snowflake1} | 
| snowflake_catalog2_connection_string | snowflake://jdbc:snowflake://snowflake2.host:port/?warehouse=warehousename&db=db1&schema=schema1&user=sample2&password=sample2 | 

##### Providing credentials


To provide a user name and password for your database in your JDBC connection string, you can use connection string properties or AWS Secrets Manager.
+ **Connection String** – A user name and password can be specified as properties in the JDBC connection string.
**Important**  
As a security best practice, don't use hardcoded credentials in your environment variables or connection strings. For information about moving your hardcoded secrets to AWS Secrets Manager, see [Move hardcoded secrets to AWS Secrets Manager](https://docs.aws.amazon.com/secretsmanager/latest/userguide/hardcoded.html) in the *AWS Secrets Manager User Guide*.
+ **AWS Secrets Manager** – To use the Athena Federated Query feature with AWS Secrets Manager, the VPC connected to your Lambda function should have [internet access](https://aws.amazon.com/premiumsupport/knowledge-center/internet-access-lambda-function/) or a [VPC endpoint](https://docs.aws.amazon.com/secretsmanager/latest/userguide/vpc-endpoint-overview.html) to connect to Secrets Manager.

  You can put the name of a secret in AWS Secrets Manager in your JDBC connection string. The connector replaces the secret name with the `username` and `password` values from Secrets Manager.

  For Amazon RDS database instances, this support is tightly integrated. If you use Amazon RDS, we highly recommend using AWS Secrets Manager and credential rotation. If your database does not use Amazon RDS, store the credentials as JSON in the following format:

  ```
  {"username": "${username}", "password": "${password}"}
  ```

**Example connection string with secret name**  
The following string has the secret name `${Test/RDS/Snowflake1}`.

```
snowflake://jdbc:snowflake://snowflake1.host:port/?warehouse=warehousename&db=db1&schema=schema1${Test/RDS/Snowflake1}&... 
```

The connector uses the secret name to retrieve secrets and provide the user name and password, as in the following example.

```
snowflake://jdbc:snowflake://snowflake1.host:port/?warehouse=warehousename&db=db1&schema=schema1&user=sample2&password=sample2&... 
```

Currently, Snowflake recognizes the `user` and `password` JDBC properties. It also accepts the user name and password in the format *username*`/`*password* without the keys `user` or `password`.

#### Using a single connection handler


You can use the following single connection metadata and record handlers to connect to a single Snowflake instance.


****  

| Handler type | Class | 
| --- | --- | 
| Composite handler | SnowflakeCompositeHandler | 
| Metadata handler | SnowflakeMetadataHandler | 
| Record handler | SnowflakeRecordHandler | 

##### Single connection handler parameters



****  

| Parameter | Description | 
| --- | --- | 
| default | Required. The default connection string. | 

The single connection handlers support one database instance and must provide a `default` connection string parameter. All other connection strings are ignored.

The following example property is for a single Snowflake instance supported by a Lambda function.


****  

| Property | Value | 
| --- | --- | 
| default | snowflake://jdbc:snowflake://snowflake1.host:port/?secret=Test/RDS/Snowflake1 | 

#### Spill parameters


The Lambda SDK can spill data to Amazon S3. All database instances accessed by the same Lambda function spill to the same location.


****  

| Parameter | Description | 
| --- | --- | 
| spill_bucket | Required. Spill bucket name. | 
| spill_prefix | Required. Spill bucket key prefix. | 
| spill_put_request_headers | (Optional) A JSON encoded map of request headers and values for the Amazon S3 putObject request that is used for spilling (for example, {"x-amz-server-side-encryption" : "AES256"}). For other possible headers, see [PutObject](https://docs.aws.amazon.com/AmazonS3/latest/API/API_PutObject.html) in the *Amazon Simple Storage Service API Reference*. | 

## Data type support


The following table shows the corresponding data types for JDBC and Apache Arrow.


****  

| JDBC | Arrow | 
| --- | --- | 
| Boolean | Bit | 
| Integer | Tiny | 
| Short | Smallint | 
| Integer | Int | 
| Long | Bigint | 
| float | Float4 | 
| Double | Float8 | 
| Date | DateDay | 
| Timestamp | DateMilli | 
| String | Varchar | 
| Bytes | Varbinary | 
| BigDecimal | Decimal | 
| ARRAY | List | 

## Data type conversions


In addition to the JDBC to Arrow conversions, the connector performs certain other conversions to make the Snowflake source and Athena data types compatible. These conversions help ensure that queries get executed successfully. The following table shows these conversions.


****  

| Source data type (Snowflake) | Converted data type (Athena) | 
| --- | --- | 
| TIMESTAMP | TIMESTAMPMILLI | 
| DATE | TIMESTAMPMILLI | 
| INTEGER | INT | 
| DECIMAL | BIGINT | 
| TIMESTAMP_NTZ | TIMESTAMPMILLI | 

All other unsupported data types are converted to `VARCHAR`.

## Partitions and splits


Partitions are used to determine how to generate splits for the connector. Athena constructs a synthetic column of type `varchar` that represents the partitioning scheme for the table to help the connector generate splits. The connector does not modify the actual table definition.

To create this synthetic column and the partitions, Athena requires a primary key to be defined. However, because Snowflake does not enforce primary key constraints, you must enforce uniqueness yourself. Failure to do so causes Athena to default to a single split.

## Performance


For optimal performance, use filters in queries whenever possible. In addition, we highly recommend native partitioning to retrieve huge datasets that have uniform partition distribution. Selecting a subset of columns significantly speeds up query runtime and reduces data scanned. The Snowflake connector is resilient to throttling due to concurrency.

The Athena Snowflake connector performs predicate pushdown to decrease the data scanned by the query. `LIMIT` clauses, simple predicates, and complex expressions are pushed down to the connector to reduce the amount of data scanned and decrease query execution run time.

### LIMIT clauses


A `LIMIT N` statement reduces the data scanned by the query. With `LIMIT N` pushdown, the connector returns only `N` rows to Athena.

### Predicates


A predicate is an expression in the `WHERE` clause of a SQL query that evaluates to a Boolean value and filters rows based on multiple conditions. The Athena Snowflake connector can combine these expressions and push them directly to Snowflake for enhanced functionality and to reduce the amount of data scanned.

The following Athena Snowflake connector operators support predicate pushdown:
+ **Boolean**: AND, OR, NOT
+ **Equality**: EQUAL, NOT_EQUAL, LESS_THAN, LESS_THAN_OR_EQUAL, GREATER_THAN, GREATER_THAN_OR_EQUAL, IS_DISTINCT_FROM, NULL_IF, IS_NULL
+ **Arithmetic**: ADD, SUBTRACT, MULTIPLY, DIVIDE, MODULUS, NEGATE
+ **Other**: LIKE_PATTERN, IN

### Combined pushdown example


For enhanced querying capabilities, combine the pushdown types, as in the following example:

```
SELECT * 
FROM my_table 
WHERE col_a > 10 
    AND ((col_a + col_b) > (col_c % col_d))
    AND (col_e IN ('val1', 'val2', 'val3') OR col_f LIKE '%pattern%') 
LIMIT 10;
```

# Authenticate with Snowflake
Snowflake authentication

You can configure the Amazon Athena Snowflake connector to use either the key-pair authentication or OAuth authentication method to connect to your Snowflake data warehouse. Both methods provide secure access to Snowflake and eliminate the need to store passwords in connection strings.
+ **Key-pair authentication** – This method uses RSA public or private key pairs to authenticate with Snowflake. The private key digitally signs authentication requests while the corresponding public key is registered in Snowflake for verification. This method eliminates password storage.
+ **OAuth authentication** – This method uses authorization token and refresh token to authenticate with Snowflake. It supports automatic token refresh, making it suitable for long-running applications.

For more information, see [Key-pair authentication](https://docs.snowflake.com/en/user-guide/key-pair-auth) and [OAuth authentication](https://docs.snowflake.com/en/user-guide/oauth-custom) in the Snowflake documentation.

## Prerequisites


Before you begin, complete the following prerequisites:
+ Snowflake account access with administrative privileges.
+ Snowflake user account dedicated for the Athena connector.
+ OpenSSL or equivalent key generation tools for key-pair authentication.
+ AWS Secrets Manager access to create and manage secrets.
+ Web browser to complete the OAuth flow for the OAuth authentication.

## Configure key-pair authentication


This process involves generating an RSA key-pair, configuring your Snowflake account with the public key, and securely storing the private key in AWS Secrets Manager. The following steps will guide you through creating the cryptographic keys, setting up the necessary Snowflake permissions, and configuring AWS credentials for seamless authentication. 

1. **Generate RSA key-pair**

   Generate a private and public key pair using OpenSSL.
   + To generate an unencrypted version, use the following command in your local command line application.

     ```
     openssl genrsa 2048 | openssl pkcs8 -topk8 -inform PEM -out rsa_key.p8 -nocrypt
     ```
   + To generate an encrypted version, use the following command, which omits `-nocrypt`.

     ```
     openssl genrsa 2048 | openssl pkcs8 -topk8 -v2 des3 -inform PEM -out rsa_key.p8
     ```
   + To generate a public key from the private key, use the following command.

     ```
     openssl rsa -in rsa_key.p8 -pubout -out rsa_key.pub
     # Set appropriate permissions (Unix/Linux)
     chmod 600 rsa_key.p8
     chmod 644 rsa_key.pub
     ```
**Note**  
Do not share your private key. The private key should only be accessible to the application that needs to authenticate with Snowflake.

1. **Extract public key content without delimiters for Snowflake**

   ```
   # Extract public key content (remove BEGIN/END lines and newlines)
   cat rsa_key.pub | grep -v "BEGIN\|END" | tr -d '\n'
   ```

   Save this output; you will need it in the next step.

1. **Configure Snowflake user**

   Follow these steps to configure a Snowflake user.

   1. Create a dedicated user for the Athena connector if it doesn't already exist.

      ```
      -- Create a user and role for the Athena connector
      CREATE USER athena_connector_user;
      CREATE ROLE athena_connector_role;
      GRANT ROLE athena_connector_role TO USER athena_connector_user;
      
      -- Grant necessary privileges
      GRANT USAGE ON WAREHOUSE your_warehouse TO ROLE athena_connector_role;
      GRANT USAGE ON DATABASE your_database TO ROLE athena_connector_role;
      GRANT SELECT ON ALL TABLES IN DATABASE your_database TO ROLE athena_connector_role;
      ```

   1. Grant authentication privileges. To assign a public key to a user, you must have one of the following roles or privileges.
      + The `MODIFY PROGRAMMATIC AUTHENTICATION METHODS` or `OWNERSHIP` privilege on the user.
      + The `SECURITYADMIN` role or higher.

      Grant the necessary privileges to assign public keys with the following command.

      ```
      GRANT MODIFY PROGRAMMATIC AUTHENTICATION METHODS ON USER athena_connector_user TO ROLE your_admin_role;
      ```

   1. Assign the public key to the Snowflake user with the following command.

      ```
      ALTER USER athena_connector_user SET RSA_PUBLIC_KEY='RSAkey';
      ```

      Verify that the public key is successfully assigned to the user with the following command.

      ```
      DESC USER athena_connector_user;
      ```

1. **Store private key in AWS Secrets Manager**

   1. Convert your private key to the format required by the connector.

      ```
      # Read private key content
      cat rsa_key.p8
      ```

   1. Create a secret in AWS Secrets Manager with the following structure.

      ```
      {
        "sfUser": "your_snowflake_user",
        "pem_private_key": "-----BEGIN PRIVATE KEY-----\n...\n-----END PRIVATE KEY-----",
        "pem_private_key_passphrase": "passphrase_in_case_of_encrypted_private_key(optional)"
      }
      ```
**Note**  
The PEM header and footer lines are optional.
The lines of the private key must be separated by `\n`.
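The secret value above can be assembled programmatically. The following Python sketch is illustrative only; the key content is a dummy placeholder, and in practice you would pass the resulting JSON string to Secrets Manager (for example, as the `SecretString` of a `create-secret` call).

```python
import json

def build_keypair_secret(sf_user, pem_private_key, passphrase=None):
    """Build the Secrets Manager secret value for key-pair authentication.
    json.dumps encodes the key's newlines as \\n, which is the separator
    the connector expects between private key lines."""
    secret = {"sfUser": sf_user, "pem_private_key": pem_private_key}
    if passphrase:
        # Only needed when the private key was generated encrypted
        secret["pem_private_key_passphrase"] = passphrase
    return json.dumps(secret)

# Dummy key content for illustration; use the contents of rsa_key.p8
dummy_key = "-----BEGIN PRIVATE KEY-----\nMIIEvQ...\n-----END PRIVATE KEY-----"
print(build_keypair_secret("athena_connector_user", dummy_key))
```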

## Configure OAuth authentication


This authentication method enables secure, token-based access to Snowflake with automatic credential refresh capabilities. The configuration process involves creating a security integration in Snowflake, retrieving OAuth client credentials, completing the authorization flow to obtain an access code, and storing the OAuth credentials in AWS Secrets Manager for the connector to use. 

1. **Create a security integration in Snowflake**

   Execute the following SQL command in Snowflake to create a Snowflake OAuth security integration.

   ```
   CREATE SECURITY INTEGRATION my_snowflake_oauth_integration_a
     TYPE = OAUTH
     ENABLED = TRUE
     OAUTH_CLIENT = CUSTOM
     OAUTH_CLIENT_TYPE = 'CONFIDENTIAL'
     OAUTH_REDIRECT_URI = 'https://localhost:8080/oauth/callback'
     OAUTH_ISSUE_REFRESH_TOKENS = TRUE
     OAUTH_REFRESH_TOKEN_VALIDITY = 7776000;
   ```

   **Configuration parameters**
   + `TYPE = OAUTH` – Specifies OAuth authentication type.
   + `ENABLED = TRUE` – Enables the security integration.
   + `OAUTH_CLIENT = CUSTOM` – Uses custom OAuth client configuration.
   + `OAUTH_CLIENT_TYPE = 'CONFIDENTIAL'` – Sets client type for secure applications.
   + `OAUTH_REDIRECT_URI` – The callback URL for OAuth flow. It can be localhost for testing.
   + `OAUTH_ISSUE_REFRESH_TOKENS = TRUE` – Enables refresh token generation.
   + `OAUTH_REFRESH_TOKEN_VALIDITY = 7776000` – Sets refresh token validity (90 days in seconds).

1. **Retrieve OAuth client secrets**

   1. Run the following SQL command to view the integration details.

      ```
      DESC SECURITY INTEGRATION MY_SNOWFLAKE_OAUTH_INTEGRATION_A;
      ```

   1. Retrieve the OAuth client secrets.

      ```
      SELECT SYSTEM$SHOW_OAUTH_CLIENT_SECRETS('MY_SNOWFLAKE_OAUTH_INTEGRATION_A');
      ```

      **Example response**

      ```
      {
        "OAUTH_CLIENT_SECRET_2": "wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY",
        "OAUTH_CLIENT_SECRET": "je7MtGbClwBF/2Zp9Utk/h3yCo8nvbEXAMPLEKEY",
        "OAUTH_CLIENT_ID": "AIDACKCEVSQ6C2EXAMPLE"
      }
      ```
**Note**  
Keep these credentials secure and do not share them. These will be used to configure the OAuth client.

1. **Authorize user and retrieve authorization code**

   1. Open the following URL in a browser.

      ```
      https://<your_account>.snowflakecomputing.com/oauth/authorize?client_id=<OAUTH_CLIENT_ID>&response_type=code&redirect_uri=https://localhost:8080/oauth/callback
      ```

   1. Complete the authorization flow.

      1. Sign in using your Snowflake credentials.

      1. Grant the requested permissions. You will be redirected to the callback URI with an authorization code.

   1. Extract the authorization code by copying the code parameter from the redirect URL.

      ```
      https://localhost:8080/oauth/callback?code=<authorizationcode>
      ```
**Note**  
The authorization code is valid for a limited time and can only be used once.
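Extracting the `code` parameter can be done with any URL parser. The following Python snippet is an illustrative helper, not connector code; the URL and code value are placeholders.

```python
from urllib.parse import urlparse, parse_qs

def extract_auth_code(redirect_url: str) -> str:
    """Pull the one-time authorization code out of the OAuth
    callback URL's query string."""
    query = parse_qs(urlparse(redirect_url).query)
    return query["code"][0]

url = "https://localhost:8080/oauth/callback?code=ABC123example"
print(extract_auth_code(url))  # ABC123example
```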

1. **Store OAuth credentials in AWS Secrets Manager**

   Create a secret in AWS Secrets Manager with the following structure.

   ```
   {
     "redirect_uri": "https://localhost:8080/oauth/callback",
     "client_secret": "je7MtGbClwBF/2Zp9Utk/h3yCo8nvbEXAMPLEKEY",
     "token_url": "https://<your_account>.snowflakecomputing.com/oauth/token-request",
     "client_id": "AIDACKCEVSQ6C2EXAMPLE",
     "username": "your_snowflake_username",
     "auth_code": "authorizationcode"
   }
   ```

   **Required fields**
   + `redirect_uri` – OAuth redirect URI that you obtained from Step 1.
   + `client_secret` – OAuth client secret that you obtained from Step 2.
   + `token_url` – The Snowflake OAuth token endpoint.
   + `client_id` – The OAuth client ID from Step 2.
   + `username` – The Snowflake username for the connector.
   + `auth_code` – The authorization code that you obtained from Step 3.
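Behind the scenes, the connector exchanges `auth_code` at `token_url` using the standard OAuth 2.0 authorization-code grant. The following Python sketch shows only the core form-encoded grant parameters; Snowflake's token endpoint also requires client authentication (for example, the client secret), so the exact request the connector sends may differ.

```python
from urllib.parse import urlencode

def build_token_request_body(client_id: str, redirect_uri: str,
                             auth_code: str) -> str:
    """Form-encode the core OAuth 2.0 authorization-code grant
    parameters for a token-endpoint request."""
    return urlencode({
        "grant_type": "authorization_code",
        "code": auth_code,
        "client_id": client_id,
        "redirect_uri": redirect_uri,
    })

body = build_token_request_body(
    "AIDACKCEVSQ6C2EXAMPLE",
    "https://localhost:8080/oauth/callback",
    "authorizationcode",
)
print(body)
```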

After you create a secret, you get a secret ARN that you can use in your Glue connection when you [create a data source connection](connect-to-a-data-source.md). 

## Passthrough queries


The Snowflake connector supports [passthrough queries](federated-query-passthrough.md). Passthrough queries use a table function to push your full query down to the data source for execution.

To use passthrough queries with Snowflake, you can use the following syntax:

```
SELECT * FROM TABLE(
        system.query(
            query => 'query string'
        ))
```

The following example query pushes down a query to a data source in Snowflake. The query selects all columns in the `customer` table, limiting the results to 10.

```
SELECT * FROM TABLE(
        system.query(
            query => 'SELECT * FROM customer LIMIT 10'
        ))
```

## License information


By using this connector, you acknowledge the inclusion of third party components, a list of which can be found in the [pom.xml](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-snowflake/pom.xml) file for this connector, and agree to the terms in the respective third party licenses provided in the [LICENSE.txt](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-snowflake/LICENSE.txt) file on GitHub.com.

## Additional resources


For the latest JDBC driver version information, see the [pom.xml](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-snowflake/pom.xml) file for the Snowflake connector on GitHub.com.

For additional information about this connector, visit [the corresponding site](https://github.com/awslabs/aws-athena-query-federation/tree/master/athena-snowflake) on GitHub.com.

# Amazon Athena Microsoft SQL Server connector
SQL Server

The Amazon Athena connector for [Microsoft SQL Server](https://docs.microsoft.com/en-us/sql/?view=sql-server-ver15) enables Amazon Athena to run SQL queries on your data stored in Microsoft SQL Server using JDBC.

This connector can be registered with Glue Data Catalog as a federated catalog. It supports data access controls defined in Lake Formation at the catalog, database, table, column, row, and tag levels. This connector uses Glue Connections to centralize configuration properties in Glue.

## Prerequisites

+ Deploy the connector to your AWS account using the Athena console or the AWS Serverless Application Repository. For more information, see [Create a data source connection](connect-to-a-data-source.md) or [Use the AWS Serverless Application Repository to deploy a data source connector](connect-data-source-serverless-app-repo.md).

## Limitations

+ Write DDL operations are not supported.
+ In a multiplexer setup, the spill bucket and prefix are shared across all database instances.
+ Any relevant Lambda limits. For more information, see [Lambda quotas](https://docs.aws.amazon.com/lambda/latest/dg/gettingstarted-limits.html) in the *AWS Lambda Developer Guide*.
+ In filter conditions, you must cast the `Date` and `Timestamp` data types to the appropriate data type.
+ To search for negative values of type `Real` and `Float`, use the `<=` or `>=` operator.
+ The `binary`, `varbinary`, `image`, and `rowversion` data types are not supported.

## Terms


The following terms relate to the SQL Server connector.
+ **Database instance** – Any instance of a database deployed on premises, on Amazon EC2, or on Amazon RDS.
+ **Handler** – A Lambda handler that accesses your database instance. A handler can be for metadata or for data records.
+ **Metadata handler** – A Lambda handler that retrieves metadata from your database instance.
+ **Record handler** – A Lambda handler that retrieves data records from your database instance.
+ **Composite handler** – A Lambda handler that retrieves both metadata and data records from your database instance.
+ **Property or parameter** – A database property used by handlers to extract database information. You configure these properties as Lambda environment variables.
+ **Connection String** – A string of text used to establish a connection to a database instance.
+ **Catalog** – A non-AWS Glue catalog registered with Athena that is a required prefix for the `connection_string` property.
+ **Multiplexing handler** – A Lambda handler that can accept and use multiple database connections.

## Parameters


Use the parameters in this section to configure the SQL Server connector.

**Note**  
Athena data source connectors created on December 3, 2024 and later use AWS Glue connections.  
The parameter names and definitions listed below are for Athena data source connectors created prior to December 3, 2024. These can differ from their corresponding [AWS Glue connection properties](https://docs.aws.amazon.com/glue/latest/dg/connection-properties.html). Starting December 3, 2024, use the parameters below only when you [manually deploy](connect-data-source-serverless-app-repo.md) an earlier version of an Athena data source connector.

### Glue connections (recommended)


We recommend that you configure a SQL Server connector by using a Glue connections object. To do this, set the `glue_connection` environment variable of the SQL Server connector Lambda to the name of the Glue connection to use.

**Glue connections properties**

Use the following command to get the schema for a Glue connection object. This schema contains all the parameters that you can use to control your connection.

```
aws glue describe-connection-type --connection-type SQLSERVER
```

**Lambda environment properties**
+ **glue_connection** – Specifies the name of the Glue connection associated with the federated connector.
+ **casing_mode** – (Optional) Specifies how to handle casing for schema and table names. The `casing_mode` parameter uses the following values to specify the behavior of casing:
  + **none** – Do not change the case of the given schema and table names. This is the default for connectors that have an associated Glue connection.
  + **upper** – Upper case all given schema and table names.
  + **lower** – Lower case all given schema and table names.
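The effect of each `casing_mode` value can be illustrated with a small Python sketch. This is a hypothetical helper, not connector code.

```python
def apply_casing_mode(name: str, mode: str = "none") -> str:
    """Transform a schema or table name according to the
    casing_mode value before matching it against the database."""
    if mode == "upper":
        return name.upper()
    if mode == "lower":
        return name.lower()
    return name  # "none": leave the name unchanged (the default)

print(apply_casing_mode("MySchema", "upper"))  # MYSCHEMA
print(apply_casing_mode("MySchema", "lower"))  # myschema
print(apply_casing_mode("MySchema"))           # MySchema
```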

**Note**  
All connectors that use Glue connections must use AWS Secrets Manager to store credentials.
The SQL Server connector created using Glue connections does not support the use of a multiplexing handler.
The SQL Server connector created using Glue connections only supports `ConnectionSchemaVersion` 2.

### Legacy connections


#### Connection string


Use a JDBC connection string in the following format to connect to a database instance.

```
sqlserver://${jdbc_connection_string}
```

#### Using a multiplexing handler


You can use a multiplexer to connect to multiple database instances with a single Lambda function. Requests are routed by catalog name. Use the following classes in Lambda.


****  

| Handler | Class | 
| --- | --- | 
| Composite handler | SqlServerMuxCompositeHandler | 
| Metadata handler | SqlServerMuxMetadataHandler | 
| Record handler | SqlServerMuxRecordHandler | 

##### Multiplexing handler parameters



****  

| Parameter | Description | 
| --- | --- | 
| <catalog>_connection_string | Required. A database instance connection string. Prefix the environment variable with the name of the catalog used in Athena. For example, if the catalog registered with Athena is mysqlservercatalog, then the environment variable name is mysqlservercatalog_connection_string. | 
| default | Required. The default connection string. This string is used when the catalog is lambda:${AWS_LAMBDA_FUNCTION_NAME}. | 

The following example properties are for a SqlServer MUX Lambda function that supports two database instances: `sqlserver1` (the default), and `sqlserver2`.


****  

| Property | Value | 
| --- | --- | 
| default | sqlserver://jdbc:sqlserver://sqlserver1.hostname:port;databaseName=<database_name>;${secret1_name} | 
| sqlserver_catalog1_connection_string | sqlserver://jdbc:sqlserver://sqlserver1.hostname:port;databaseName=<database_name>;${secret1_name} | 
| sqlserver_catalog2_connection_string | sqlserver://jdbc:sqlserver://sqlserver2.hostname:port;databaseName=<database_name>;${secret2_name} | 

##### Providing credentials


To provide a user name and password for your database in your JDBC connection string, you can use connection string properties or AWS Secrets Manager.
+ **Connection String** – A user name and password can be specified as properties in the JDBC connection string.
**Important**  
As a security best practice, don't use hardcoded credentials in your environment variables or connection strings. For information about moving your hardcoded secrets to AWS Secrets Manager, see [Move hardcoded secrets to AWS Secrets Manager](https://docs.aws.amazon.com/secretsmanager/latest/userguide/hardcoded.html) in the *AWS Secrets Manager User Guide*.
+ **AWS Secrets Manager** – To use the Athena Federated Query feature with AWS Secrets Manager, the VPC connected to your Lambda function should have [internet access](https://aws.amazon.com/premiumsupport/knowledge-center/internet-access-lambda-function/) or a [VPC endpoint](https://docs.aws.amazon.com/secretsmanager/latest/userguide/vpc-endpoint-overview.html) to connect to Secrets Manager.

  You can put the name of a secret in AWS Secrets Manager in your JDBC connection string. The connector replaces the secret name with the `username` and `password` values from Secrets Manager.

  For Amazon RDS database instances, this support is tightly integrated. If you use Amazon RDS, we highly recommend using AWS Secrets Manager and credential rotation. If your database does not use Amazon RDS, store the credentials as JSON in the following format:

  ```
  {"username": "${username}", "password": "${password}"}
  ```

**Example connection string with secret name**  
The following string has the secret name `${secret_name}`.

```
sqlserver://jdbc:sqlserver://hostname:port;databaseName=<database_name>;${secret_name}
```

The connector uses the secret name to retrieve secrets and provide the user name and password, as in the following example.

```
sqlserver://jdbc:sqlserver://hostname:port;databaseName=<database_name>;user=<user>;password=<password>
```
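The substitution the connector performs can be sketched in Python as follows. This is an illustrative sketch with hypothetical names; the real connector resolves the secret through AWS Secrets Manager rather than an in-memory dictionary.

```python
import re

def resolve_secret(connection_string: str, secrets: dict) -> str:
    """Replace a ${secret_name} placeholder with user= and
    password= properties looked up from a secret store."""
    def substitute(match):
        creds = secrets[match.group(1)]
        return "user={username};password={password}".format(**creds)
    return re.sub(r"\$\{([^}]+)\}", substitute, connection_string)

secrets = {"my_secret": {"username": "admin", "password": "p4ss"}}
conn = "sqlserver://jdbc:sqlserver://hostname:1433;databaseName=db;${my_secret}"
print(resolve_secret(conn, secrets))
# sqlserver://jdbc:sqlserver://hostname:1433;databaseName=db;user=admin;password=p4ss
```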

#### Using a single connection handler


You can use the following single connection metadata and record handlers to connect to a single SQL Server instance.


****  

| Handler type | Class | 
| --- | --- | 
| Composite handler | SqlServerCompositeHandler | 
| Metadata handler | SqlServerMetadataHandler | 
| Record handler | SqlServerRecordHandler | 

##### Single connection handler parameters



****  

| Parameter | Description | 
| --- | --- | 
| default | Required. The default connection string. | 

The single connection handlers support one database instance and must provide a `default` connection string parameter. All other connection strings are ignored.

The following example property is for a single SQL Server instance supported by a Lambda function.


****  

| Property | Value | 
| --- | --- | 
| default | sqlserver://jdbc:sqlserver://hostname:port;databaseName=<database_name>;${secret_name} | 

#### Spill parameters


The Lambda SDK can spill data to Amazon S3. All database instances accessed by the same Lambda function spill to the same location.


****  

| Parameter | Description | 
| --- | --- | 
| spill_bucket | Required. Spill bucket name. | 
| spill_prefix | Required. Spill bucket key prefix. | 
| spill_put_request_headers | (Optional) A JSON encoded map of request headers and values for the Amazon S3 putObject request that is used for spilling (for example, {"x-amz-server-side-encryption" : "AES256"}). For other possible headers, see [PutObject](https://docs.aws.amazon.com/AmazonS3/latest/API/API_PutObject.html) in the Amazon Simple Storage Service API Reference. | 
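For example, the `spill_put_request_headers` value is a JSON-encoded map that must parse into header name/value pairs before it can be applied to the putObject request. A minimal Python illustration, using the header from the table above:

```python
import json

# The spill_put_request_headers variable holds a JSON-encoded map
# of header names to values; it is parsed before the S3 putObject call.
raw = '{"x-amz-server-side-encryption": "AES256"}'
headers = json.loads(raw)
print(headers["x-amz-server-side-encryption"])  # AES256
```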

## Data type support


The following table shows the corresponding data types for SQL Server and Apache Arrow.


****  

| SQL Server | Arrow | 
| --- | --- | 
| bit | TINYINT | 
| tinyint | SMALLINT | 
| smallint | SMALLINT | 
| int | INT | 
| bigint | BIGINT | 
| decimal | DECIMAL | 
| numeric | FLOAT8 | 
| smallmoney | FLOAT8 | 
| money | DECIMAL | 
| float[24] | FLOAT4 | 
| float[53] | FLOAT8 | 
| real | FLOAT4 | 
| datetime | Date(MILLISECOND) | 
| datetime2 | Date(MILLISECOND) | 
| smalldatetime | Date(MILLISECOND) | 
| date | Date(DAY) | 
| time | VARCHAR | 
| datetimeoffset | Date(MILLISECOND) | 
| char[n] | VARCHAR | 
| varchar[n/max] | VARCHAR | 
| nchar[n] | VARCHAR | 
| nvarchar[n/max] | VARCHAR | 
| text | VARCHAR | 
| ntext | VARCHAR | 

## Partitions and splits


A partition is represented by a single partition column of type `varchar`. For the SQL Server connector, a partition function determines how partitions are applied to the table. The partition function and column name information are retrieved from the SQL Server metadata table. A custom query then retrieves the partitions. Splits are created based on the number of distinct partitions received.

## Performance


Selecting a subset of columns significantly speeds up query runtime and reduces data scanned. The SQL Server connector is resilient to throttling due to concurrency.

The Athena SQL Server connector performs predicate pushdown to decrease the data scanned by the query. Simple predicates and complex expressions are pushed down to the connector to reduce the amount of data scanned and decrease query execution run time. 

### Predicates


A predicate is an expression in the `WHERE` clause of a SQL query that evaluates to a Boolean value and filters rows based on multiple conditions. The Athena SQL Server connector can combine these expressions and push them directly to SQL Server for enhanced functionality and to reduce the amount of data scanned.

The following Athena SQL Server connector operators support predicate pushdown:
+ **Boolean:** AND, OR, NOT
+ **Equality:** EQUAL, NOT_EQUAL, LESS_THAN, LESS_THAN_OR_EQUAL, GREATER_THAN, GREATER_THAN_OR_EQUAL, IS_DISTINCT_FROM, NULL_IF, IS_NULL
+ **Arithmetic:** ADD, SUBTRACT, MULTIPLY, DIVIDE, MODULUS, NEGATE
+ **Other:** LIKE_PATTERN, IN

### Combined pushdown example


For enhanced querying capabilities, combine the pushdown types, as in the following example:

```
SELECT * 
FROM my_table 
WHERE col_a > 10 
    AND ((col_a + col_b) > (col_c % col_d)) 
    AND (col_e IN ('val1', 'val2', 'val3') OR col_f LIKE '%pattern%');
```

## Passthrough queries


The SQL Server connector supports [passthrough queries](federated-query-passthrough.md). Passthrough queries use a table function to push your full query down to the data source for execution.

To use passthrough queries with SQL Server, you can use the following syntax:

```
SELECT * FROM TABLE(
        system.query(
            query => 'query string'
        ))
```

The following example query pushes down a query to a data source in SQL Server. The query selects all columns in the `customer` table, limiting the results to 10.

```
SELECT * FROM TABLE(
        system.query(
            query => 'SELECT * FROM customer LIMIT 10'
        ))
```

## License information


By using this connector, you acknowledge the inclusion of third party components, a list of which can be found in the [pom.xml](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-sqlserver/pom.xml) file for this connector, and agree to the terms in the respective third party licenses provided in the [LICENSE.txt](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-sqlserver/LICENSE.txt) file on GitHub.com.

## Additional resources


For the latest JDBC driver version information, see the [pom.xml](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-sqlserver/pom.xml) file for the SQL Server connector on GitHub.com.

For additional information about this connector, visit [the corresponding site](https://github.com/awslabs/aws-athena-query-federation/tree/master/athena-sqlserver) on GitHub.com.

# Amazon Athena Teradata connector
Teradata

 The Amazon Athena connector for Teradata enables Athena to run SQL queries on data stored in your Teradata databases. 

This connector does not use Glue Connections to centralize configuration properties in Glue. Connection configuration is done through Lambda.

## Prerequisites

+ Deploy the connector to your AWS account using the Athena console or the AWS Serverless Application Repository. For more information, see [Create a data source connection](connect-to-a-data-source.md) or [Use the AWS Serverless Application Repository to deploy a data source connector](connect-data-source-serverless-app-repo.md).

## Limitations

+ Write DDL operations are not supported.
+ In a multiplexer setup, the spill bucket and prefix are shared across all database instances.
+ Any relevant Lambda limits. For more information, see [Lambda quotas](https://docs.aws.amazon.com/lambda/latest/dg/gettingstarted-limits.html) in the *AWS Lambda Developer Guide*.

## Terms


The following terms relate to the Teradata connector.
+ **Database instance** – Any instance of a database deployed on premises, on Amazon EC2, or on Amazon RDS.
+ **Handler** – A Lambda handler that accesses your database instance. A handler can be for metadata or for data records.
+ **Metadata handler** – A Lambda handler that retrieves metadata from your database instance.
+ **Record handler** – A Lambda handler that retrieves data records from your database instance.
+ **Composite handler** – A Lambda handler that retrieves both metadata and data records from your database instance.
+ **Property or parameter** – A database property used by handlers to extract database information. You configure these properties as Lambda environment variables.
+ **Connection String** – A string of text used to establish a connection to a database instance.
+ **Catalog** – A non-AWS Glue catalog registered with Athena that is a required prefix for the `connection_string` property.
+ **Multiplexing handler** – A Lambda handler that can accept and use multiple database connections.

## Parameters


Use the parameters in this section to configure the Teradata connector.

### Glue connections (recommended)


We recommend that you configure a Teradata connector by using a Glue connections object. To do this, set the `glue_connection` environment variable of the Teradata connector Lambda to the name of the Glue connection to use.

**Glue connections properties**

Use the following command to get the schema for a Glue connection object. This schema contains all the parameters that you can use to control your connection.

```
aws glue describe-connection-type --connection-type TERADATA
```

**Lambda environment properties**
+ **glue_connection** – Specifies the name of the Glue connection associated with the federated connector.
+ **casing_mode** – (Optional) Specifies how to handle casing for schema and table names. The `casing_mode` parameter uses the following values to specify the behavior of casing:
  + **none** – Do not change the case of the given schema and table names. This is the default for connectors that have an associated Glue connection.
  + **upper** – Upper case all given schema and table names.
  + **lower** – Lower case all given schema and table names.

**Note**  
All connectors that use Glue connections must use AWS Secrets Manager to store credentials.
The Teradata connector created using Glue connections does not support the use of a multiplexing handler.
The Teradata connector created using Glue connections only supports `ConnectionSchemaVersion` 2.

### Legacy connections


#### Connection string


Use a JDBC connection string in the following format to connect to a database instance.

```
teradata://${jdbc_connection_string}
```

#### Using a multiplexing handler


You can use a multiplexer to connect to multiple database instances with a single Lambda function. Requests are routed by catalog name. Use the following classes in Lambda.


****  

| Handler | Class | 
| --- | --- | 
| Composite handler | TeradataMuxCompositeHandler | 
| Metadata handler | TeradataMuxMetadataHandler | 
| Record handler | TeradataMuxRecordHandler | 

##### Multiplexing handler parameters



****  

| Parameter | Description | 
| --- | --- | 
| <catalog>_connection_string | Required. A database instance connection string. Prefix the environment variable with the name of the catalog used in Athena. For example, if the catalog registered with Athena is myteradatacatalog, then the environment variable name is myteradatacatalog_connection_string. | 
| default | Required. The default connection string. This string is used when the catalog is lambda:${AWS_LAMBDA_FUNCTION_NAME}. | 

The following example properties are for a Teradata MUX Lambda function that supports two database instances: `teradata1` (the default), and `teradata2`.


****  

| Property | Value | 
| --- | --- | 
| default | teradata://jdbc:teradata://teradata1.host/TMODE=ANSI,CHARSET=UTF8,DATABASE=TEST,${Test/RDS/Teradata1} | 
| teradata_catalog1_connection_string | teradata://jdbc:teradata://teradata1.host/TMODE=ANSI,CHARSET=UTF8,DATABASE=TEST,${Test/RDS/Teradata1} | 
| teradata_catalog2_connection_string | teradata://jdbc:teradata://teradata2.host/TMODE=ANSI,CHARSET=UTF8,DATABASE=TEST,user=sample2&password=sample2 | 

##### Providing credentials


To provide a user name and password for your database in your JDBC connection string, you can use connection string properties or AWS Secrets Manager.
+ **Connection String** – A user name and password can be specified as properties in the JDBC connection string.
**Important**  
As a security best practice, don't use hardcoded credentials in your environment variables or connection strings. For information about moving your hardcoded secrets to AWS Secrets Manager, see [Move hardcoded secrets to AWS Secrets Manager](https://docs.aws.amazon.com/secretsmanager/latest/userguide/hardcoded.html) in the *AWS Secrets Manager User Guide*.
+ **AWS Secrets Manager** – To use the Athena Federated Query feature with AWS Secrets Manager, the VPC connected to your Lambda function should have [internet access](https://aws.amazon.com/premiumsupport/knowledge-center/internet-access-lambda-function/) or a [VPC endpoint](https://docs.aws.amazon.com/secretsmanager/latest/userguide/vpc-endpoint-overview.html) to connect to Secrets Manager.

  You can put the name of a secret in AWS Secrets Manager in your JDBC connection string. The connector replaces the secret name with the `username` and `password` values from Secrets Manager.

  For Amazon RDS database instances, this support is tightly integrated. If you use Amazon RDS, we highly recommend using AWS Secrets Manager and credential rotation. If your database does not use Amazon RDS, store the credentials as JSON in the following format:

  ```
  {"username": "${username}", "password": "${password}"}
  ```

**Example connection string with secret name**  
The following string has the secret name `${Test/RDS/Teradata1}`.

```
teradata://jdbc:teradata://teradata1.host/TMODE=ANSI,CHARSET=UTF8,DATABASE=TEST,${Test/RDS/Teradata1}&...
```

The connector uses the secret name to retrieve secrets and provide the user name and password, as in the following example.

```
teradata://jdbc:teradata://teradata1.host/TMODE=ANSI,CHARSET=UTF8,DATABASE=TEST,...&user=sample2&password=sample2&...
```

Currently, Teradata recognizes the `user` and `password` JDBC properties. It also accepts the user name and password in the format *username*`/`*password* without the keys `user` or `password`.

#### Using a single connection handler


You can use the following single connection metadata and record handlers to connect to a single Teradata instance.


****  

| Handler type | Class | 
| --- | --- | 
| Composite handler | TeradataCompositeHandler | 
| Metadata handler | TeradataMetadataHandler | 
| Record handler | TeradataRecordHandler | 

##### Single connection handler parameters



****  

| Parameter | Description | 
| --- | --- | 
| default | Required. The default connection string. | 

The single connection handlers support one database instance and must provide a `default` connection string parameter. All other connection strings are ignored.

The following example property is for a single Teradata instance supported by a Lambda function.


****  

| Property | Value | 
| --- | --- | 
| default | teradata://jdbc:teradata://teradata1.host/TMODE=ANSI,CHARSET=UTF8,DATABASE=TEST,${Test/RDS/Teradata1} | 

#### Spill parameters


The Lambda SDK can spill data to Amazon S3. All database instances accessed by the same Lambda function spill to the same location.


****  

| Parameter | Description | 
| --- | --- | 
| spill_bucket | Required. Spill bucket name. | 
| spill_prefix | Required. Spill bucket key prefix. | 
| spill_put_request_headers | (Optional) A JSON encoded map of request headers and values for the Amazon S3 putObject request that is used for spilling (for example, {"x-amz-server-side-encryption" : "AES256"}). For other possible headers, see [PutObject](https://docs.aws.amazon.com/AmazonS3/latest/API/API_PutObject.html) in the Amazon Simple Storage Service API Reference. | 

## Data type support


The following table shows the corresponding data types for JDBC and Apache Arrow.


****  

| JDBC | Arrow | 
| --- | --- | 
| Boolean | Bit | 
| Integer | Tiny | 
| Short | Smallint | 
| Integer | Int | 
| Long | Bigint | 
| float | Float4 | 
| Double | Float8 | 
| Date | DateDay | 
| Timestamp | DateMilli | 
| String | Varchar | 
| Bytes | Varbinary | 
| BigDecimal | Decimal | 
| ARRAY | List | 

## Partitions and splits


A partition is represented by a single partition column of type `Integer`. The column contains partition names of the partitions defined on a Teradata table. For a table that does not have partition names, `*` is returned, which is equivalent to a single partition. A partition is equivalent to a split.


****  

| Name | Type | Description | 
| --- | --- | --- | 
| partition | Integer | Named partition in Teradata. | 

## Performance


Teradata supports native partitions. The Athena Teradata connector can retrieve data from these partitions in parallel. If you want to query very large datasets with uniform partition distribution, native partitioning is highly recommended. Selecting a subset of columns significantly speeds up query runtime and reduces the data scanned. The connector shows some throttling due to concurrency.

The Athena Teradata connector performs predicate pushdown to decrease the data scanned by the query. Simple predicates and complex expressions are pushed down to the connector to reduce the amount of data scanned and decrease query execution run time.

### Predicates


A predicate is an expression in the `WHERE` clause of a SQL query that evaluates to a Boolean value and filters rows based on multiple conditions. The Athena Teradata connector can combine these expressions and push them directly to Teradata for enhanced functionality and to reduce the amount of data scanned.

The following Athena Teradata connector operators support predicate pushdown:
+ **Boolean:** AND, OR, NOT
+ **Equality:** EQUAL, NOT_EQUAL, LESS_THAN, LESS_THAN_OR_EQUAL, GREATER_THAN, GREATER_THAN_OR_EQUAL, NULL_IF, IS_NULL
+ **Arithmetic:** ADD, SUBTRACT, MULTIPLY, DIVIDE, MODULUS, NEGATE
+ **Other:** LIKE_PATTERN, IN

### Combined pushdown example


For enhanced querying capabilities, combine the pushdown types, as in the following example:

```
SELECT * 
FROM my_table 
WHERE col_a > 10 
    AND ((col_a + col_b) > (col_c % col_d)) 
    AND (col_e IN ('val1', 'val2', 'val3') OR col_f LIKE '%pattern%');
```

## Passthrough queries


The Teradata connector supports [passthrough queries](federated-query-passthrough.md). Passthrough queries use a table function to push your full query down to the data source for execution.

To use passthrough queries with Teradata, you can use the following syntax:

```
SELECT * FROM TABLE(
        system.query(
            query => 'query string'
        ))
```

The following example query pushes down a query to a data source in Teradata. The query selects all columns in the `customer` table.

```
SELECT * FROM TABLE(
        system.query(
            query => 'SELECT * FROM customer'
        ))
```

## License information


By using this connector, you acknowledge the inclusion of third party components, a list of which can be found in the [pom.xml](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-teradata/pom.xml) file for this connector, and agree to the terms in the respective third party licenses provided in the [LICENSE.txt](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-teradata/LICENSE.txt) file on GitHub.com.

## Additional resources


For the latest JDBC driver version information, see the [pom.xml](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-teradata/pom.xml) file for the Teradata connector on GitHub.com.

For additional information about this connector, visit [the corresponding site](https://github.com/awslabs/aws-athena-query-federation/tree/master/athena-teradata) on GitHub.com.

# Amazon Athena Timestream connector
Timestream

The Amazon Athena Timestream connector enables Amazon Athena to communicate with [Amazon Timestream](https://aws.amazon.com/timestream/), making your time series data accessible through Amazon Athena. You can optionally use AWS Glue Data Catalog as a source of supplemental metadata.

Amazon Timestream is a fast, scalable, fully managed, purpose-built time series database that makes it easy to store and analyze trillions of time series data points per day. Timestream saves you time and cost in managing the lifecycle of time series data by keeping recent data in memory and moving historical data to a cost-optimized storage tier based on user-defined policies.

This connector can be registered with Glue Data Catalog as a federated catalog. It supports data access controls defined in Lake Formation at the catalog, database, table, column, row, and tag levels. This connector uses Glue Connections to centralize configuration properties in Glue.

If you have Lake Formation enabled in your account, the IAM role for your Athena federated Lambda connector that you deployed in the AWS Serverless Application Repository must have read access in Lake Formation to the AWS Glue Data Catalog.

## Prerequisites

+ Deploy the connector to your AWS account using the Athena console or the AWS Serverless Application Repository. For more information, see [Create a data source connection](connect-to-a-data-source.md) or [Use the AWS Serverless Application Repository to deploy a data source connector](connect-data-source-serverless-app-repo.md).

## Parameters


Use the parameters in this section to configure the Timestream connector.

### Glue connections (recommended)


We recommend that you configure a Timestream connector by using a Glue connections object. To do this, set the `glue_connection` environment variable of the Timestream connector Lambda to the name of the Glue connection to use.
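For example, assuming a connector function named `athena-timestream-connector` and a Glue connection named `my-timestream-connection` (both hypothetical names), you could set the variable with the AWS CLI:

```
aws lambda update-function-configuration \
    --function-name athena-timestream-connector \
    --environment 'Variables={glue_connection=my-timestream-connection}'
```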

**Glue connections properties**

Use the following command to get the schema for a Glue connection object. This schema contains all the parameters that you can use to control your connection.

```
aws glue describe-connection-type --connection-type TIMESTREAM
```

**Lambda environment properties**

**glue_connection** – Specifies the name of the Glue connection associated with the federated connector.

**Note**  
All connectors that use Glue connections must use AWS Secrets Manager to store credentials.
The Timestream connector created using Glue connections does not support the use of a multiplexing handler.
The Timestream connector created using Glue connections only supports `ConnectionSchemaVersion` 2.

### Legacy connections


**Note**  
Athena data source connectors created on December 3, 2024 and later use AWS Glue connections.

The parameter names and definitions listed below are for Athena data source connectors created without an associated Glue connection. Use the following parameters only when you [manually deploy](connect-data-source-serverless-app-repo.md) an earlier version of an Athena data source connector or when the `glue_connection` environment property is not specified.

**Lambda environment properties**
+ **spill_bucket** – Specifies the Amazon S3 bucket for data that exceeds Lambda function limits.
+ **spill_prefix** – (Optional) Defaults to a subfolder in the specified `spill_bucket` called `athena-federation-spill`. We recommend that you configure an Amazon S3 [storage lifecycle](https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-lifecycle-mgmt.html) on this location to delete spills older than a predetermined number of days or hours.
+ **spill_put_request_headers** – (Optional) A JSON encoded map of request headers and values for the Amazon S3 `putObject` request that is used for spilling (for example, `{"x-amz-server-side-encryption" : "AES256"}`). For other possible headers, see [PutObject](https://docs.aws.amazon.com/AmazonS3/latest/API/API_PutObject.html) in the *Amazon Simple Storage Service API Reference*.
+ **kms_key_id** – (Optional) By default, any data that is spilled to Amazon S3 is encrypted using the AES-GCM authenticated encryption mode and a randomly generated key. To have your Lambda function use stronger encryption keys generated by KMS like `a7e63k4b-8loc-40db-a2a1-4d0en2cd8331`, you can specify a KMS key ID.
+ **disable_spill_encryption** – (Optional) When set to `True`, disables spill encryption. Defaults to `False` so that data that is spilled to S3 is encrypted using AES-GCM – either using a randomly generated key or KMS to generate keys. Disabling spill encryption can improve performance, especially if your spill location uses [server-side encryption](https://docs.aws.amazon.com/AmazonS3/latest/userguide/serv-side-encryption.html).
+ **glue_catalog** – (Optional) Use this option to specify a [cross-account AWS Glue catalog](data-sources-glue-cross-account.md). By default, the connector attempts to get metadata from its own AWS Glue account.
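Since `spill_put_request_headers` must be a JSON-encoded map, one quick way to produce a valid value (illustrative only) is to serialize a dictionary:

```
import json

# JSON-encode the header map for the spill_put_request_headers variable.
headers = {"x-amz-server-side-encryption": "AES256"}
print(json.dumps(headers))  # prints {"x-amz-server-side-encryption": "AES256"}
```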

## Setting up databases and tables in AWS Glue


You can optionally use the AWS Glue Data Catalog as a source of supplemental metadata. To enable an AWS Glue table for use with Timestream, you must have an AWS Glue database and table with names that match the Timestream database and table that you want to supply supplemental metadata for.

**Note**  
For best performance, use only lowercase for your database names and table names. Using mixed casing causes the connector to perform a case-insensitive search that is more computationally intensive.

To configure an AWS Glue table for use with Timestream, you must set its table properties in AWS Glue.

**To use an AWS Glue table for supplemental metadata**

1. Edit the table in the AWS Glue console to add the following table properties:
   + **timestream-metadata-flag** – This property indicates to the Timestream connector that the connector can use the table for supplemental metadata. You can provide any value for `timestream-metadata-flag` as long as the `timestream-metadata-flag` property is present in the list of table properties.
   + **_view_template** – When you use AWS Glue for supplemental metadata, you can use this table property and specify any Timestream SQL as the view. The Athena Timestream connector uses the SQL from the view together with your SQL from Athena to run your query. This is useful if you want to use a feature of Timestream SQL that is not otherwise available in Athena.

1. Make sure that you use the data types appropriate for AWS Glue as listed in this document.

### Data types


Currently, the Timestream connector supports only a subset of the data types available in Timestream, specifically the scalar types `varchar`, `double`, and `timestamp`.

To query the `timeseries` data type, you must configure a view in AWS Glue table properties that uses the Timestream `CREATE_TIME_SERIES` function. You also need to provide a schema for the view that uses the syntax `ARRAY<STRUCT<time:timestamp,measure_value::double:double>>` as the type for any of your time series columns. Be sure to replace `double` with the appropriate scalar type for your table.
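For illustration, a hypothetical `_view_template` value (the database, table, and measure names below are assumptions) might use `CREATE_TIME_SERIES` like this, paired with a Glue column of type `ARRAY<STRUCT<time:timestamp,measure_value::double:double>>`:

```
SELECT az, hostname,
       CREATE_TIME_SERIES(time, measure_value::double) AS cpu_utilization
FROM "my_timestream_db"."host_metrics"
WHERE measure_name = 'cpu_utilization'
GROUP BY az, hostname
```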

The following image shows an example of AWS Glue table properties configured to set up a view over a time series.

![\[Configuring table properties in AWS Glue to set up a view over a time series.\]](http://docs.aws.amazon.com/athena/latest/ug/images/connectors-timestream-1.png)


## Required Permissions


For full details on the IAM policies that this connector requires, review the `Policies` section of the [athena-timestream.yaml](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-timestream/athena-timestream.yaml) file. The following list summarizes the required permissions.
+ **Amazon S3 write access** – The connector requires write access to a location in Amazon S3 in order to spill results from large queries.
+ **Athena GetQueryExecution** – The connector uses this permission to fast-fail when the upstream Athena query has terminated.
+ **AWS Glue Data Catalog** – The Timestream connector requires read only access to the AWS Glue Data Catalog to obtain schema information.
+ **CloudWatch Logs** – The connector requires access to CloudWatch Logs for storing logs.
+ **Timestream Access** – For running Timestream queries.

## Performance


We recommend that you use the `LIMIT` clause to limit the data returned (not the data scanned) to less than 256 MB to ensure that interactive queries are performant.

The Athena Timestream connector performs predicate pushdown to decrease the data scanned by the query. `LIMIT` clauses reduce the amount of data scanned, but if you don't provide a predicate, you should expect `SELECT` queries with a `LIMIT` clause to scan at least 16 MB of data. Selecting a subset of columns significantly speeds up query runtime and reduces data scanned. The Timestream connector is resilient to throttling due to concurrency.
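As a sketch of these recommendations (the catalog, database, table, and column names are hypothetical), the following query combines a time predicate, a column subset, and a `LIMIT` clause to keep the scan small:

```
SELECT hostname, avg("measure_value::double") AS avg_cpu
FROM "lambda:timestream".my_timestream_db.host_metrics
WHERE time > current_timestamp - interval '1' hour
GROUP BY hostname
LIMIT 100
```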

## Passthrough queries


The Timestream connector supports [passthrough queries](federated-query-passthrough.md). Passthrough queries use a table function to push your full query down to the data source for execution.

To use passthrough queries with Timestream, you can use the following syntax:

```
SELECT * FROM TABLE(
        system.query(
            query => 'query string'
        ))
```

The following example query pushes down a query to a data source in Timestream. The query selects all columns in the `customer` table, limiting the results to 10.

```
SELECT * FROM TABLE(
        system.query(
            query => 'SELECT * FROM customer LIMIT 10'
        ))
```

## License information


The Amazon Athena Timestream connector project is licensed under the [Apache-2.0 License](https://www.apache.org/licenses/LICENSE-2.0.html).

## Additional resources


For additional information about this connector, visit [the corresponding site](https://github.com/awslabs/aws-athena-query-federation/tree/master/athena-timestream) on GitHub.com.

# Amazon Athena TPC benchmark DS (TPC-DS) connector
TPC-DS

The Amazon Athena TPC-DS connector enables Amazon Athena to communicate with a source of randomly generated TPC Benchmark DS data for use in benchmarking and functional testing of Athena Federation. The Athena TPC-DS connector generates a TPC-DS compliant database at one of several scale factors. We do not recommend the use of this connector as an alternative to Amazon S3-based data lake performance tests.

This connector can be registered with Glue Data Catalog as a federated catalog. It supports data access controls defined in Lake Formation at the catalog, database, table, column, row, and tag levels. This connector uses Glue Connections to centralize configuration properties in Glue.

## Prerequisites

+ Deploy the connector to your AWS account using the Athena console or the AWS Serverless Application Repository. For more information, see [Create a data source connection](connect-to-a-data-source.md) or [Use the AWS Serverless Application Repository to deploy a data source connector](connect-data-source-serverless-app-repo.md).

## Parameters


Use the parameters in this section to configure the TPC-DS connector.

**Note**  
Athena data source connectors created on December 3, 2024 and later use AWS Glue connections.  
The parameter names and definitions listed below are for Athena data source connectors created prior to December 3, 2024. These can differ from their corresponding [AWS Glue connection properties](https://docs.aws.amazon.com/glue/latest/dg/connection-properties.html). Starting December 3, 2024, use the parameters below only when you [manually deploy](connect-data-source-serverless-app-repo.md) an earlier version of an Athena data source connector.

### Glue connections (recommended)


We recommend that you configure a TPC-DS connector by using a Glue connections object. To do this, set the `glue_connection` environment variable of the TPC-DS connector Lambda to the name of the Glue connection to use.

**Glue connections properties**

Use the following command to get the schema for a Glue connection object. This schema contains all the parameters that you can use to control your connection.

```
aws glue describe-connection-type --connection-type TPCDS
```

**Lambda environment properties**
+ **glue_connection** – Specifies the name of the Glue connection associated with the federated connector.

**Note**  
All connectors that use Glue connections must use AWS Secrets Manager to store credentials.
The TPC-DS connector created using Glue connections does not support the use of a multiplexing handler.
The TPC-DS connector created using Glue connections only supports `ConnectionSchemaVersion` 2.

### Legacy connections

+ **spill_bucket** – Specifies the Amazon S3 bucket for data that exceeds Lambda function limits.
+ **spill_prefix** – (Optional) Defaults to a subfolder in the specified `spill_bucket` called `athena-federation-spill`. We recommend that you configure an Amazon S3 [storage lifecycle](https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-lifecycle-mgmt.html) on this location to delete spills older than a predetermined number of days or hours.
+ **spill_put_request_headers** – (Optional) A JSON encoded map of request headers and values for the Amazon S3 `putObject` request that is used for spilling (for example, `{"x-amz-server-side-encryption" : "AES256"}`). For other possible headers, see [PutObject](https://docs.aws.amazon.com/AmazonS3/latest/API/API_PutObject.html) in the *Amazon Simple Storage Service API Reference*.
+ **kms_key_id** – (Optional) By default, any data that is spilled to Amazon S3 is encrypted using the AES-GCM authenticated encryption mode and a randomly generated key. To have your Lambda function use stronger encryption keys generated by KMS like `a7e63k4b-8loc-40db-a2a1-4d0en2cd8331`, you can specify a KMS key ID.
+ **disable_spill_encryption** – (Optional) When set to `True`, disables spill encryption. Defaults to `False` so that data that is spilled to S3 is encrypted using AES-GCM – either using a randomly generated key or KMS to generate keys. Disabling spill encryption can improve performance, especially if your spill location uses [server-side encryption](https://docs.aws.amazon.com/AmazonS3/latest/userguide/serv-side-encryption.html).

## Test databases and tables


The Athena TPC-DS connector generates a TPC-DS compliant database at one of the following scale factors: `tpcds1`, `tpcds10`, `tpcds100`, `tpcds250`, or `tpcds1000`.

### Summary of tables


For a complete list of the test data tables and columns, run the `SHOW TABLES` or `DESCRIBE TABLE` queries. The following summary of tables is provided for convenience.

1. call_center

1. catalog_page

1. catalog_returns

1. catalog_sales

1. customer

1. customer_address

1. customer_demographics

1. date_dim

1. dbgen_version

1. household_demographics

1. income_band

1. inventory

1. item

1. promotion

1. reason

1. ship_mode

1. store

1. store_returns

1. store_sales

1. time_dim

1. warehouse

1. web_page

1. web_returns

1. web_sales

1. web_site

For TPC-DS queries that are compatible with this generated schema and data, see the [athena-tpcds/src/main/resources/queries/](https://github.com/awslabs/aws-athena-query-federation/tree/master/athena-tpcds/src/main/resources/queries) directory on GitHub.

### Example query


The following `SELECT` query example queries the `tpcds` catalog for customer demographics in specific counties.

```
SELECT
  cd_gender,
  cd_marital_status,
  cd_education_status,
  count(*) cnt1,
  cd_purchase_estimate,
  count(*) cnt2,
  cd_credit_rating,
  count(*) cnt3,
  cd_dep_count,
  count(*) cnt4,
  cd_dep_employed_count,
  count(*) cnt5,
  cd_dep_college_count,
  count(*) cnt6
FROM
  "lambda:tpcds".tpcds1.customer c, "lambda:tpcds".tpcds1.customer_address ca, "lambda:tpcds".tpcds1.customer_demographics
WHERE
  c.c_current_addr_sk = ca.ca_address_sk AND
    ca_county IN ('Rush County', 'Toole County', 'Jefferson County',
                  'Dona Ana County', 'La Porte County') AND
    cd_demo_sk = c.c_current_cdemo_sk AND
    exists(SELECT *
           FROM "lambda:tpcds".tpcds1.store_sales, "lambda:tpcds".tpcds1.date_dim
           WHERE c.c_customer_sk = ss_customer_sk AND
             ss_sold_date_sk = d_date_sk AND
             d_year = 2002 AND
             d_moy BETWEEN 1 AND 1 + 3) AND
    (exists(SELECT *
            FROM "lambda:tpcds".tpcds1.web_sales, "lambda:tpcds".tpcds1.date_dim
            WHERE c.c_customer_sk = ws_bill_customer_sk AND
              ws_sold_date_sk = d_date_sk AND
              d_year = 2002 AND
              d_moy BETWEEN 1 AND 1 + 3) OR
      exists(SELECT *
             FROM "lambda:tpcds".tpcds1.catalog_sales, "lambda:tpcds".tpcds1.date_dim
             WHERE c.c_customer_sk = cs_ship_customer_sk AND
               cs_sold_date_sk = d_date_sk AND
               d_year = 2002 AND
               d_moy BETWEEN 1 AND 1 + 3))
GROUP BY cd_gender,
  cd_marital_status,
  cd_education_status,
  cd_purchase_estimate,
  cd_credit_rating,
  cd_dep_count,
  cd_dep_employed_count,
  cd_dep_college_count
ORDER BY cd_gender,
  cd_marital_status,
  cd_education_status,
  cd_purchase_estimate,
  cd_credit_rating,
  cd_dep_count,
  cd_dep_employed_count,
  cd_dep_college_count
LIMIT 100
```

## Required Permissions


For full details on the IAM policies that this connector requires, review the `Policies` section of the [athena-tpcds.yaml](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-tpcds/athena-tpcds.yaml) file. The following list summarizes the required permissions.
+ **Amazon S3 write access** – The connector requires write access to a location in Amazon S3 in order to spill results from large queries.
+ **Athena GetQueryExecution** – The connector uses this permission to fast-fail when the upstream Athena query has terminated.

## Performance


The Athena TPC-DS connector attempts to parallelize queries based on the scale factor that you choose. Predicate pushdown is performed within the Lambda function.

## License information


The Amazon Athena TPC-DS connector project is licensed under the [Apache-2.0 License](https://www.apache.org/licenses/LICENSE-2.0.html).

## Additional resources


For additional information about this connector, visit [the corresponding site](https://github.com/awslabs/aws-athena-query-federation/tree/master/athena-tpcds) on GitHub.com.

# Amazon Athena Vertica connector
Vertica

Vertica is a columnar database platform that can be deployed in the cloud or on premises that supports exabyte scale data warehouses. You can use the Amazon Athena Vertica connector in federated queries to query Vertica data sources from Athena. For example, you can run analytical queries over a data warehouse on Vertica and a data lake in Amazon S3.

This connector does not use Glue Connections to centralize configuration properties in Glue. Connection configuration is done through Lambda.

## Prerequisites

+ Deploy the connector to your AWS account using the Athena console or the AWS Serverless Application Repository. For more information, see [Create a data source connection](connect-to-a-data-source.md) or [Use the AWS Serverless Application Repository to deploy a data source connector](connect-data-source-serverless-app-repo.md).
+ Set up a VPC and a security group before you use this connector. For more information, see [Create a VPC for a data source connector or AWS Glue connection](athena-connectors-vpc-creation.md).

## Limitations

+ Because the Athena Vertica connector reads exported Parquet files from Amazon S3, performance of the connector can be slow. When you query large tables, we recommend that you use a [CREATE TABLE AS (SELECT ...)](ctas.md) query and SQL predicates.
+ Currently, due to a known issue in Athena Federated Query, the connector causes Vertica to export all columns of the queried table to Amazon S3, but only the queried columns are visible in the results on the Athena console.
+ Write DDL operations are not supported.
+ The connector is subject to applicable Lambda limits. For more information, see [Lambda quotas](https://docs.aws.amazon.com/lambda/latest/dg/gettingstarted-limits.html) in the *AWS Lambda Developer Guide*.
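To illustrate the CTAS recommendation above, a sketch like the following (the catalog, schema, table, and column names are assumptions) materializes a filtered subset into Amazon S3 before further analysis:

```
CREATE TABLE orders_2024
WITH (format = 'PARQUET') AS
SELECT order_id, customer_id, order_total
FROM "lambda:vertica".public.orders
WHERE order_date >= DATE '2024-01-01'
```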

## Workflow


The following diagram shows the workflow of a query that uses the Vertica connector.

![\[Workflow of a Vertica query from Amazon Athena\]](http://docs.aws.amazon.com/athena/latest/ug/images/connectors-vertica-1.png)


1. A SQL query is issued against one or more tables in Vertica.

1. The connector parses the SQL query to send the relevant portion to Vertica through the JDBC connection.

1. The connection strings use the user name and password stored in AWS Secrets Manager to gain access to Vertica.

1. The connector wraps the SQL query with a Vertica `EXPORT` command, as in the following example.

   ```
   EXPORT TO PARQUET (directory = 's3://amzn-s3-demo-bucket/folder_name', 
      Compression='Snappy', fileSizeMB=64) OVER() as 
   SELECT
   PATH_ID,
   ...
   SOURCE_ITEMIZED,
   SOURCE_OVERRIDE
   FROM DELETED_OBJECT_SCHEMA.FORM_USAGE_DATA
   WHERE PATH_ID <= 5;
   ```

1. Vertica processes the SQL query and sends the result set to an Amazon S3 bucket. For better throughput, Vertica uses the `EXPORT` option to parallelize the write operation of multiple Parquet files.

1. Athena scans the Amazon S3 bucket to determine the number of files to read for the result set.

1. Athena makes multiple calls to the Lambda function and uses an Apache `ArrowReader` to read the Parquet files from the resulting data set. Multiple calls enable Athena to parallelize the reading of the Amazon S3 files and achieve a throughput of up to 100 GB per second.

1. Athena processes the data returned from Vertica with data scanned from the data lake and returns the result.

## Terms


The following terms relate to the Vertica connector.
+ **Database instance** – Any instance of a Vertica database deployed on Amazon EC2.
+ **Handler** – A Lambda handler that accesses your database instance. A handler can be for metadata or for data records.
+ **Metadata handler** – A Lambda handler that retrieves metadata from your database instance.
+ **Record handler** – A Lambda handler that retrieves data records from your database instance.
+ **Composite handler** – A Lambda handler that retrieves both metadata and data records from your database instance.
+ **Property or parameter** – A database property used by handlers to extract database information. You configure these properties as Lambda environment variables.
+ **Connection String** – A string of text used to establish a connection to a database instance.
+ **Catalog** – A non-AWS Glue catalog registered with Athena that is a required prefix for the `connection_string` property.

## Parameters


Use the parameters in this section to configure the Vertica connector.

### Glue connections (recommended)


We recommend that you configure a Vertica connector by using a Glue connections object. To do this, set the `glue_connection` environment variable of the Vertica connector Lambda to the name of the Glue connection to use.

**Glue connections properties**

Use the following command to get the schema for a Glue connection object. This schema contains all the parameters that you can use to control your connection.

```
aws glue describe-connection-type --connection-type VERTICA
```

**Lambda environment properties**
+ **glue_connection** – Specifies the name of the Glue connection associated with the federated connector.
+ **casing_mode** – (Optional) Specifies how to handle casing for schema and table names. The `casing_mode` parameter uses the following values to specify the behavior of casing:
  + **none** – Do not change the case of the given schema and table names. This is the default for connectors that have an associated Glue connection.
  + **upper** – Upper case all given schema and table names.
  + **lower** – Lower case all given schema and table names.

**Note**  
All connectors that use Glue connections must use AWS Secrets Manager to store credentials.
The Vertica connector created using Glue connections does not support the use of a multiplexing handler.
The Vertica connector created using Glue connections only supports `ConnectionSchemaVersion` 2.

### Legacy connections


The Amazon Athena Vertica connector exposes configuration options through Lambda environment variables. You can use the following Lambda environment variables to configure the connector. 
+  **AthenaCatalogName** – The name of the Lambda function. 
+  **ExportBucket** – The Amazon S3 bucket where the Vertica query results are exported. 
+  **SpillBucket** – The name of the Amazon S3 bucket where this function can spill data. 
+  **SpillPrefix** – The prefix for the `SpillBucket` location where this function can spill data. 
+  **SecurityGroupIds** – One or more IDs that correspond to the security group that should be applied to the Lambda function (for example, `sg1`, `sg2`, or `sg3`). 
+  **SubnetIds** – One or more subnet IDs that correspond to the subnet that the Lambda function can use to access your data source (for example, `subnet1`, or `subnet2`). 
+  **SecretNameOrPrefix** – The name or prefix of a set of names in Secrets Manager that this function has access to (for example, `vertica-*`) 
+  **VerticaConnectionString** – The Vertica connection details to use by default if no catalog-specific connection is defined. The string can optionally use AWS Secrets Manager syntax (for example, `${secret_name}`). 
+  **VPC ID** – The VPC ID to be attached to the Lambda function. 

#### Connection string


Use a JDBC connection string in the following format to connect to a database instance.

```
vertica://jdbc:vertica://host_name:port/database?user=vertica-username&password=vertica-password
```

#### Using a single connection handler


You can use the following single connection metadata and record handlers to connect to a single Vertica instance.



| Handler type | Class | 
| --- | --- | 
| Composite handler | VerticaCompositeHandler | 
| Metadata handler | VerticaMetadataHandler | 
| Record handler | VerticaRecordHandler | 

#### Single connection handler parameters




| Parameter | Description | 
| --- | --- | 
| default | Required. The default connection string. | 

The single connection handlers support one database instance and must provide a `default` connection string parameter. All other connection strings are ignored.

#### Providing credentials


To provide a user name and password for your database in your JDBC connection string, you can use connection string properties or AWS Secrets Manager.
+ **Connection String** – A user name and password can be specified as properties in the JDBC connection string.
**Important**  
As a security best practice, don't use hardcoded credentials in your environment variables or connection strings. For information about moving your hardcoded secrets to AWS Secrets Manager, see [Move hardcoded secrets to AWS Secrets Manager](https://docs.aws.amazon.com/secretsmanager/latest/userguide/hardcoded.html) in the *AWS Secrets Manager User Guide*.
+ **AWS Secrets Manager** – To use the Athena Federated Query feature with AWS Secrets Manager, the VPC connected to your Lambda function should have [internet access](https://aws.amazon.com/premiumsupport/knowledge-center/internet-access-lambda-function/) or a [VPC endpoint](https://docs.aws.amazon.com/secretsmanager/latest/userguide/vpc-endpoint-overview.html) to connect to Secrets Manager.

  You can put the name of a secret in AWS Secrets Manager in your JDBC connection string. The connector replaces the secret name with the `username` and `password` values from Secrets Manager.

  For Amazon RDS database instances, this support is tightly integrated. If you use Amazon RDS, we highly recommend using AWS Secrets Manager and credential rotation. If your database does not use Amazon RDS, store the credentials as JSON in the following format:

  ```
  {"username": "${username}", "password": "${password}"}
  ```
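For example, a secret in this format could be created with the AWS CLI (the secret name `vertica-creds` and the credential values here are hypothetical):

```
aws secretsmanager create-secret \
    --name vertica-creds \
    --secret-string '{"username": "example-user", "password": "example-password"}'
```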

**Example connection string with secret names**  
The following string has the secret names `${vertica-username}` and `${vertica-password}`. 

```
vertica://jdbc:vertica://host_name:port/database?user=${vertica-username}&password=${vertica-password}
```

The connector uses the secret name to retrieve secrets and provide the user name and password, as in the following example.

```
vertica://jdbc:vertica://host_name:port/database?user=sample-user&password=sample-password
```

Currently, the Vertica connector recognizes the `vertica-username` and `vertica-password` JDBC properties. 

#### Spill parameters


The Lambda SDK can spill data to Amazon S3. All database instances accessed by the same Lambda function spill to the same location.



| Parameter | Description | 
| --- | --- | 
| spill_bucket | Required. Spill bucket name. | 
| spill_prefix | Required. Spill bucket key prefix. | 
| spill_put_request_headers | (Optional) A JSON encoded map of request headers and values for the Amazon S3 putObject request that is used for spilling (for example, `{"x-amz-server-side-encryption" : "AES256"}`). For other possible headers, see [PutObject](https://docs.aws.amazon.com/AmazonS3/latest/API/API_PutObject.html) in the Amazon Simple Storage Service API Reference. | 

## Data type support


The following table shows the supported data types for the Vertica connector.


| Supported data types | 
| --- | 
| Boolean | 
| BigInt | 
| Short | 
| Integer | 
| Long | 
| Float | 
| Double | 
| Date | 
| Varchar | 
| Bytes | 
| BigDecimal | 
| TimeStamp as Varchar | 

## Performance


The Lambda function performs projection pushdown to decrease the data scanned by the query. `LIMIT` clauses reduce the amount of data scanned, but if you don't provide a predicate, you should expect `SELECT` queries with a `LIMIT` clause to scan at least 16 MB of data. The Vertica connector is resilient to throttling due to concurrency.

## Passthrough queries


The Vertica connector supports [passthrough queries](federated-query-passthrough.md). Passthrough queries use a table function to push your full query down to the data source for execution.

To use passthrough queries with Vertica, you can use the following syntax:

```
SELECT * FROM TABLE(
        system.query(
            query => 'query string'
        ))
```

The following example query pushes down a query to a data source in Vertica. The query selects all columns in the `customer` table, limiting the results to 10.

```
SELECT * FROM TABLE(
        system.query(
            query => 'SELECT * FROM customer LIMIT 10'
        ))
```

## License information


By using this connector, you acknowledge the inclusion of third party components, a list of which can be found in the [pom.xml](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-vertica/pom.xml) file for this connector, and agree to the terms in the respective third party licenses provided in the [LICENSE.txt](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-vertica/LICENSE.txt) file on GitHub.com.

## Additional resources


For the latest JDBC driver version information, see the [pom.xml](https://github.com/awslabs/aws-athena-query-federation/blob/master/athena-vertica/pom.xml) file for the Vertica connector on GitHub.com.

For additional information about this connector, see [the corresponding site](https://github.com/awslabs/aws-athena-query-federation/tree/master/athena-vertica) on GitHub.com and [Querying a Vertica data source in Amazon Athena using the Athena Federated Query SDK](https://aws.amazon.com/blogs/big-data/querying-a-vertica-data-source-in-amazon-athena-using-the-athena-federated-query-sdk/) in the *AWS Big Data Blog*.

# Create a data source connection


To use an Athena data source connector, you create an AWS Glue connection that stores connection information for the connector and your data source. When you create the connection, you give the data source a name that you use to reference it in your SQL queries.

You can create and configure a data source connection in Athena by using the [console](connect-to-a-data-source-console-steps.md) or the [CreateDataCatalog](https://docs.aws.amazon.com/athena/latest/APIReference/API_CreateDataCatalog.html) API operation.
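For example, the following AWS CLI sketch registers an already-deployed connector Lambda function as a data catalog. The catalog name, account ID, and function name are placeholders, not values from this guide.

```shell
# Sketch (hypothetical names): register a deployed connector Lambda
# function as an Athena data catalog that you can reference in SQL.
aws athena create-data-catalog \
    --name mydatasource \
    --type LAMBDA \
    --parameters function=arn:aws:lambda:us-east-1:111122223333:function:myconnector
```

After the catalog is created, queries can reference it by name, as in `SELECT * FROM mydatasource.database_name.table_name`.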

**Topics**
+ [Permissions to create and use a data source in Athena](connect-to-a-data-source-permissions.md)
+ [Use the Athena console to connect to a data source](connect-to-a-data-source-console-steps.md)
+ [Use the AWS Serverless Application Repository to deploy a data source connector](connect-data-source-serverless-app-repo.md)
+ [Create a VPC for a data source connector or AWS Glue connection](athena-connectors-vpc-creation.md)
+ [Pull ECR images to your AWS account](pull-ecr-customer-account.md)
+ [Register your connection as a Glue Data Catalog](register-connection-as-gdc.md)
+ [Enable cross-account federated queries](xacct-fed-query-enable.md)
+ [Update a data source connector](connectors-updating.md)

# Permissions to create and use a data source in Athena
Permissions

To create and use a data source, you need the following sets of permissions.
+ The AmazonAthenaFullAccess managed policy, which provides full access to Amazon Athena and scoped access to the dependencies needed to enable querying, writing results, and data management. For more information, see [AmazonAthenaFullAccess](https://docs.aws.amazon.com/aws-managed-policy/latest/reference/AmazonAthenaFullAccess.html) in the *AWS Managed Policy Reference Guide*.
+ Permissions to call the CreateDataCatalog API. These permissions are needed only when you create a data source that integrates with Glue connections. For an example policy, see [Permissions required to create connector and Athena catalog](athena-catalog-access.md).
+ If you want to use Lake Formation fine-grained access control, you need the following permissions in addition to those listed above.

------
#### [ JSON ]

****  

  ```
  {
    "Version":"2012-10-17",		 	 	 
    "Statement": [
      {
        "Effect": "Allow",
        "Action": [
          "lakeformation:RegisterResource",
          "iam:ListRoles",
          "glue:CreateCatalog",
          "glue:GetCatalogs",
          "glue:GetCatalog"
        ],
        "Resource": "*"
      }
    ]
  }
  ```

------

# Use the Athena console to connect to a data source
Use the Athena console

You can use the Athena console to create and configure a data source connection.

**To create a connection to a data source**

1. Open the Athena console at [https://console.aws.amazon.com/athena/](https://console.aws.amazon.com/athena/home).

1. If the console navigation pane is not visible, choose the expansion menu on the left.  
![\[Choose the expansion menu.\]](http://docs.aws.amazon.com/athena/latest/ug/images/nav-pane-expansion.png)

1. In the navigation pane, choose **Data sources and catalogs**.

1. On the **Data sources and catalogs** page, choose **Create data source**.

1. For **Choose a data source**, choose the data source that you want Athena to query, considering the following guidelines:
   + Choose a connection option that corresponds to your data source. Athena has prebuilt data source connectors that you can configure for sources including MySQL, Amazon DocumentDB, and PostgreSQL.
   + Choose **S3 - AWS Glue Data Catalog** if you want to query data in Amazon S3 and you are not using an Apache Hive metastore or one of the other federated query data source options on this page. Athena uses the AWS Glue Data Catalog to store metadata and schema information for data sources in Amazon S3. This is the default (non-federated) option. For more information, see [Use AWS Glue Data Catalog to connect to your data](data-sources-glue.md). For steps using this workflow, see [Register and use data catalogs in Athena](gdc-register.md).
   + Choose **S3 - Apache Hive metastore** to query data sets in Amazon S3 that use an Apache Hive metastore. For more information about this option, see [Connect Athena to an Apache Hive metastore](connect-to-data-source-hive-connecting-athena-to-an-apache-hive-metastore.md).
   + Choose **Custom or shared connector** if you want to create your own data source connector for use with Athena. For information about writing a data source connector, see [Develop a data source connector using the Athena Query Federation SDK](connect-data-source-federation-sdk.md).

1. Choose **Next**.

1. On the **Enter data source details** page, for **Data source name**, use the autogenerated name, or enter a unique name that you want to use in your SQL statements when you query the data source from Athena. The name can be up to 127 characters and must be unique within your account. It cannot be changed after you create it. Valid characters are a-z, A-Z, 0-9, _ (underscore), @ (at sign), and - (hyphen). The names `awsdatacatalog`, `hive`, `jmx`, and `system` are reserved by Athena and cannot be used for data source names. 

1. If the data source that you choose integrates with AWS Glue connections, do the following:

   1. For **AWS Glue connection details**, enter the information required. A connection contains the properties that are required to connect to a particular data source. The properties required vary depending on the connection type. For more information on properties related to your connector, see [Available data source connectors](connectors-available.md). For information about additional connection properties, see [AWS Glue connection properties](https://docs.aws.amazon.com/glue/latest/dg/connection-properties.html) in the *AWS Glue User Guide*.
**Note**  
When you update the Glue connection properties, the Lambda connector must be restarted to pick up the updated properties. To do this, edit the function's environment variables and save them without making any changes. 
When you update a Glue connection, the following properties are not automatically updated in the corresponding Lambda function. You must update your Lambda function manually for these properties.  
Lambda VPC configuration – `security_group_ids`, `subnet_ids`
Lambda execution role – `spill_bucket`, `secret_name`, `spill_kms_key_id`

   1. For **Lambda execution IAM role**, choose one of the following:
      + **Create and use a new execution role** – (Default) Athena creates an execution role that it will then use to access resources in AWS Lambda on your behalf. Athena requires this role to create your federated data source.
      + **Use an existing execution role** – Use this option to choose an existing execution role. For this option, choose the execution role that you want to use from the **Execution role** drop-down list.

1. If the data source that you choose does not integrate with AWS Glue connections, do the following:

   1. For **Lambda function**, choose **Create Lambda function**. The function page for the connector that you chose opens in the AWS Lambda console. The page includes detailed information about the connector.

   1. Under **Application settings**, read the description for each application setting carefully, and then enter values that correspond to your requirements.

      The application settings that you see vary depending on the connector for your data source. The minimum required settings include:
      + **AthenaCatalogName** – A name, in lower case, for the Lambda function that indicates the data source that it targets, such as `cloudwatchlogs`.
      + **SpillBucket** – An Amazon S3 bucket in your account to store data that exceeds Lambda function response size limits.
**Note**  
Spilled data is not reused in subsequent executions and can be safely deleted. Athena does not delete this data for you. To manage these objects, consider adding an object lifecycle policy that deletes old data from your Amazon S3 spill bucket. For more information, see [Managing your storage lifecycle](https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-lifecycle-mgmt.html) in the Amazon S3 User Guide.

   1. Select **I acknowledge that this app creates custom IAM roles and resource policies**. For more information, choose the **Info** link.

   1. Choose **Deploy**. When the deployment is complete, the Lambda function appears in the **Resources** section in the Lambda console.

      After you deploy the data source connector to your account, you can connect Athena to it.

   1. Return to the **Enter data source details** page of the Athena console.

   1. In the **Connection details** section, choose the refresh icon next to the **Select or enter a Lambda function** search box.

   1. Choose the name of the function that you just created in the Lambda console. The ARN of the Lambda function displays.

1. (Optional) For **Tags**, add key-value pairs to associate with this data source. For more information about tags, see [Tag Athena resources](tags.md).

1. Choose **Next**.

1. On the **Review and create** page, review the data source details. To make changes, choose **Edit**. 

1. Read the information in **Athena will create resources in your account**. If you agree, select **I acknowledge that Athena will create resources on my behalf**.

1. Choose **Create data source**. Athena will create the following resources for you.
   + Lambda execution IAM role
   + AWS Glue connection (only if the data source is compatible with AWS Glue connections)
   + Lambda function

The **Data source details** section of the page for your data source shows information about your new connector. You can now use the connector in your Athena queries. 

For information about using data connectors in queries, see [Run federated queries](running-federated-queries.md).
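The note earlier in this section mentions that the Lambda connector must be restarted after you update Glue connection properties, which you can do by re-saving its environment without changes. This is a sketch of scripting that workaround; the function name is hypothetical.

```shell
# Sketch (hypothetical function name): force the connector Lambda to
# pick up updated Glue connection properties. Requires AWS credentials.
FUNCTION_NAME="athena-myconnector"

# Read the function's current environment variables as JSON.
ENV_JSON=$(aws lambda get-function-configuration \
    --function-name "$FUNCTION_NAME" \
    --query 'Environment' --output json)

# Write the same environment back unchanged. The configuration update
# causes new execution environments on subsequent invocations.
aws lambda update-function-configuration \
    --function-name "$FUNCTION_NAME" \
    --environment "$ENV_JSON"
```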

# Use the AWS Serverless Application Repository to deploy a data source connector
Use the SAR

To deploy a data source connector, you can use the [AWS Serverless Application Repository](https://aws.amazon.com/serverless/serverlessrepo/) instead of an AWS Glue connection.

**Note**  
We recommend that you use the AWS Serverless Application Repository only if you have a custom connector or require the use of an older connector. Otherwise, we recommend that you use the Athena console. 

You can use the AWS Serverless Application Repository to find the connector that you want to use, provide the parameters that the connector requires, and then deploy the connector to your account. Then, after you deploy the connector, you use the Athena console to make the data source available to Athena.

## Deploying the connector to your account


**To use the AWS Serverless Application Repository to deploy a data source connector to your account**

1. Sign in to the AWS Management Console and open the **AWS Serverless Application Repository**.

1. In the navigation pane, choose **Available applications**.

1. Select the option **Show apps that create custom IAM roles or resource policies**.

1. In the search box, type the name of the connector. For a list of prebuilt Athena data connectors, see [Available data source connectors](connectors-available.md).

1. Choose the name of the connector. Choosing a connector opens the Lambda function's **Application details** page in the AWS Lambda console.

1. On the right side of the details page, for **Application settings**, fill in the required information. The minimum required settings include the following. For information about the remaining configurable options for data connectors built by Athena, see the corresponding [Available connectors](https://github.com/awslabs/aws-athena-query-federation/wiki/Available-Connectors) topic on GitHub.
   + **AthenaCatalogName** – A name for the Lambda function in lower case that indicates the data source that it targets, such as `cloudwatchlogs`.
   + **SpillBucket** – Specify an Amazon S3 bucket in your account to receive data from any large response payloads that exceed Lambda function response size limits.

1. Select **I acknowledge that this app creates custom IAM roles and resource policies**. For more information, choose the **Info** link.

1. At the bottom right of the **Application settings** section, choose **Deploy**. When the deployment is complete, the Lambda function appears in the **Resources** section in the Lambda console.

## Making the connector available in Athena


Now you are ready to use the Athena console to make the data source connector available to Athena.

**To make the data source connector available to Athena**

1. Open the Athena console at [https://console.aws.amazon.com/athena/](https://console.aws.amazon.com/athena/home).

1. If the console navigation pane is not visible, choose the expansion menu on the left.  
![\[Choose the expansion menu.\]](http://docs.aws.amazon.com/athena/latest/ug/images/nav-pane-expansion.png)

1. In the navigation pane, choose **Data sources and catalogs**.

1. On the **Data sources and catalogs** page, choose **Create data source**.

1. For **Choose a data source**, choose the data source for which you created a connector in the AWS Serverless Application Repository. This tutorial uses **Amazon CloudWatch Logs** as the federated data source.

1. Choose **Next**.

1. On the **Enter data source details** page, for **Data source name**, enter the name that you want to use in your SQL statements when you query the data source from Athena (for example, `CloudWatchLogs`). The name can be up to 127 characters and must be unique within your account. It cannot be changed after you create it. Valid characters are a-z, A-Z, 0-9, _ (underscore), @ (at sign), and - (hyphen). The names `awsdatacatalog`, `hive`, `jmx`, and `system` are reserved by Athena and cannot be used for data source names. 

1. In the **Connection details** section, use the **Select or enter a Lambda function** box to choose the name of the function that you just created. The ARN of the Lambda function displays.

1. (Optional) For **Tags**, add key-value pairs to associate with this data source. For more information about tags, see [Tag Athena resources](tags.md).

1. Choose **Next**.

1. On the **Review and create** page, review the data source details, and then choose **Create data source**. 

1. The **Data source details** section of the page for your data source shows information about your new connector. You can now use the connector in your Athena queries. 

   For information about using data connectors in queries, see [Run federated queries](running-federated-queries.md).

# Create a VPC for a data source connector or AWS Glue connection
Create a VPC

Some Athena data source connectors and AWS Glue connections require a VPC and a security group. This topic shows you how to create a VPC with a subnet and a security group for the VPC. As part of this process, you retrieve the IDs for the VPC, subnet, and security group that you create. These IDs are required when you configure your AWS Glue connection or data source connector for use with Athena.

**To create a VPC for an Athena data source connector**

1. Sign in to the AWS Management Console and open the Amazon VPC console at [https://console.aws.amazon.com/vpc/](https://console.aws.amazon.com/vpc/).

1. Choose **Create VPC**.

1. On the **Create VPC** page, under **VPC Settings**, for **Resources to create**, choose **VPC and more**.

1. Under **Name tag auto-generation**, for **Auto-generate**, enter a value that will be used to generate name tags for all resources in your VPC.

1. Choose **Create VPC**.

1. When the process completes, choose **View VPC**.

1. In the **Details** section, for **VPC ID**, copy your VPC ID for later reference.

Now you are ready to retrieve the subnet ID for the VPC that you just created.

**To retrieve your VPC subnet ID**

1. In the VPC console navigation pane, choose **Subnets**.

1. Select the name of a subnet whose **VPC** column has the VPC ID that you noted.

1. In the **Details** section, for **Subnet ID**, copy your subnet ID for later reference.

Next, you create a security group for your VPC.

**To create a security group for your VPC**

1. In the VPC console navigation pane, choose **Security**, **Security Groups**.

1. Choose **Create security group**.

1. On the **Create security group** page, enter the following information:
   + For **Security group name**, enter a name for your security group.
   + For **Description**, enter a description for the security group. A description is required.
   + For **VPC**, choose the VPC ID of the VPC that you created for your data source connector.
   + For **Inbound rules** and **Outbound rules**, add any inbound and outbound rules that you require.

1. Choose **Create security group**.

1. On the **Details** page for the security group, copy the **Security group ID** for later reference.

## Important considerations for using VPC with Athena connectors


The following considerations apply to all Athena connectors, because all connectors can use a VPC.

**Note**  
When using a VPC with AWS Glue connections, you will need to set up the following PrivateLink endpoints:  
Amazon S3
AWS Glue
AWS Secrets Manager

Alternatively, you can use public internet access, though this is not recommended for security reasons.

**Warning**  
Using public internet access may expose your resources to additional security risks. It is strongly recommended to use PrivateLink endpoints for enhanced security in your VPC configuration.

# Pull ECR images to your AWS account
Pull ECR images

Athena federation connector Lambda functions use container images that are stored in Athena-managed Amazon ECR repositories. To perform security scans on these container images, you must first copy them to an Amazon ECR repository in your account. This section provides step-by-step instructions on how to copy an image to your repository and configure your Lambda function to use the image.

## Prerequisites

+ An Athena Federation Connector – The connector can be created through any source, provided it uses a container image.
**Note**  
To verify image deployment, check the **Image** tab of your Athena federation connector Lambda function.
+ Docker installed and running
+ AWS CLI installed
+ Account credentials with appropriate pull permissions

## How to transfer an image


1. Locate the image URI of your Athena federation connector Lambda function.  
**Example**  

   ```
   account_id_1.dkr.ecr.us-east-1.amazonaws.com/athena-federation-repository:2025.15.1
   ```

1. Generate a Docker authentication token for the Athena-managed account:

   ```
   aws ecr get-login-password --region regionID | docker login --username AWS --password-stdin athena-managed-registry
   ```

   Where:
   + *regionID* is your deployment region (e.g., us-east-1)
   + *athena-managed-registry* is the registry portion of the image URI (e.g., account_id_1.dkr.ecr.us-east-1.amazonaws.com)

1. Pull the image from the Athena managed account:

   ```
   docker pull athenaImageURI
   ```

1. Authenticate Docker to your registry:

   ```
   aws ecr get-login-password --region regionID | docker login --username AWS --password-stdin customer-registry
   ```

   Where *customer-registry* is your ECR registry (e.g., account_id_2.dkr.ecr.us-east-1.amazonaws.com)

1. Tag the pulled image for your repository:

   ```
   docker tag athenaImageURI yourImageURI
   ```

1. Push the image to your repository:

   ```
   docker push yourImageURI
   ```

1. Update your Athena Federation Connector:

   1. Navigate to your Lambda function

   1. Select **Deploy New Image**

   1. Enter your new image URI

   The Athena federated connector image is now located in your account, which allows you to perform CVE scans on the image.
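The transfer steps above can be combined into a single script. This is a sketch that reuses the placeholder URIs from the examples; adjust the region, account IDs, and repository names for your environment.

```shell
# Sketch: copy a connector image from the Athena-managed registry to
# your own Amazon ECR repository. All URIs below are placeholders.
REGION="us-east-1"
ATHENA_IMAGE_URI="account_id_1.dkr.ecr.us-east-1.amazonaws.com/athena-federation-repository:2025.15.1"
YOUR_IMAGE_URI="account_id_2.dkr.ecr.us-east-1.amazonaws.com/athena-federation-repository:2025.15.1"

# The registry host is the part of each URI before the first "/".
ATHENA_REGISTRY="${ATHENA_IMAGE_URI%%/*}"
YOUR_REGISTRY="${YOUR_IMAGE_URI%%/*}"

# Authenticate to the Athena-managed registry and pull the image.
aws ecr get-login-password --region "$REGION" |
    docker login --username AWS --password-stdin "$ATHENA_REGISTRY"
docker pull "$ATHENA_IMAGE_URI"

# Authenticate to your own registry, then retag and push the image.
aws ecr get-login-password --region "$REGION" |
    docker login --username AWS --password-stdin "$YOUR_REGISTRY"
docker tag "$ATHENA_IMAGE_URI" "$YOUR_IMAGE_URI"
docker push "$YOUR_IMAGE_URI"
```

After the push completes, point your connector Lambda function at `YOUR_IMAGE_URI` as described in the last step above.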

# Register your connection as a Glue Data Catalog


After you create your data source, you can use the Athena console to register your connection as a Glue Data Catalog. Once registered, you can manage your federated data catalog and enable fine-grained access control using Lake Formation. For more information, see [Creating a federated catalog](https://docs.aws.amazon.com/lake-formation/latest/dg/create-fed-catalog-data-source.html).

You can register the following connectors to integrate with AWS Glue for fine-grained access control.
+ Redshift
+ BigQuery
+ DynamoDB (Preview)
+ Snowflake (Preview)
+ MySQL
+ PostgreSQL
+ AWS CMDB
+ Timestream
+ Azure Data Lake Storage
+ Azure Synapse
+ IBM Db2
+ IBM Db2 AS/400 (Db2 iSeries)
+ DocumentDB
+ Google Cloud Storage
+ HBase
+ OpenSearch
+ Oracle
+ SAP HANA
+ SQL Server
+ TPC-DS
+ Cloudera Hive
+ CloudWatch
+ CloudWatch Metrics
+ Teradata
+ Vertica

## Prerequisites


Before you begin, you must complete the following prerequisites.
+ Ensure that you have the roles and permissions needed to register locations. For more information, see the [Requirements for roles](https://docs.aws.amazon.com/lake-formation/latest/dg/registration-role.html) in the AWS Lake Formation Developer Guide.
+ Ensure that you have the required Lake Formation roles. For more information, see [Prerequisites for connecting the Data Catalog to external data sources](https://docs.aws.amazon.com/lake-formation/latest/dg/federated-catalog-data-connection.html) in the AWS Lake Formation Developer Guide.
+ The role that you register in Glue must have the permissions as listed in the following example.

------
#### [ JSON ]

****  

  ```
  {
      "Version":"2012-10-17",		 	 	 
      "Statement": [
          {
              "Effect": "Allow",
              "Action": [
                  "s3:ListBucket",
                  "s3:GetObject"
              ],
              "Resource": [
      "arn:aws:s3:::amzn-s3-demo-bucket/spill-prefix/*",
      "arn:aws:s3:::amzn-s3-demo-bucket/spill-prefix"
              ]
          },
          {
              "Sid": "lambdainvoke",
              "Effect": "Allow",
              "Action": "lambda:InvokeFunction",
              "Resource": "arn:aws:lambda:us-east-1:111122223333:function:lambda_function_name"
          },
          {
              "Sid": "gluepolicy",
              "Effect": "Allow",
              "Action": "glue:*",
              "Resource": [
              "arn:aws:glue:us-east-1:111122223333:connection/<connection_name>",
      "arn:aws:glue:us-east-1:111122223333:catalog"
              ]
          }
      ]
  }
  ```

------
+ You are responsible for determining and managing appropriate data access. With fine-grained access control on federated queries, we recommend that you use the [AmazonAthenaFullAccess](https://docs.aws.amazon.com/aws-managed-policy/latest/reference/AmazonAthenaFullAccess.html) managed policy. If you want to use your own policy, you must ensure that the users executing federated queries do not have access to the following resources.
  + `lambda:InvokeFunction` on the Lambda connector that is specified in the Glue connection
  + Spill bucket location access in IAM
  + Access to the Glue connection associated with your federated catalog
  + The Lake Formation role in IAM

## Register your connection using the console


**To register your connection as a Glue Data Catalog**

1. Open the Athena console at [https://console.aws.amazon.com/athena/](https://console.aws.amazon.com/athena/home).

1. In the navigation pane, choose **Data sources and catalogs**.

1. From the **Data sources** list, choose the data source that you created to open the **Data source details** page. 

1. Choose **Get started with AWS Lake Formation**.
**Note**  
After you choose this option, you must manage your Lambda function on your own. Athena will not delete your Lambda function.

1. For **Data catalog name**, enter a unique name for your catalog.

1. Choose the **Lake Formation IAM role** that grants Lake Formation permission to invoke the Lambda function. Make sure the role has the permissions shown in [the example](#register-connection-as-gdc-pre).

1. In the text box, type **confirm** to confirm that you want to delete the Athena data source and replace it with a Glue Data Catalog registration.
**Note**  
This action will delete your Athena data source and create a new Glue Data Catalog in its place. After this process is completed, you may need to update queries that access the data source to refer to the newly created Glue data catalog instead.

1. Choose **Create catalog and go to Lake Formation**. This opens the Lake Formation console, where you can manage the catalog and grant permissions to users on catalogs, databases, and tables.

# Enable cross-account federated queries


Federated query allows you to query data sources other than Amazon S3 using data source connectors deployed on AWS Lambda. The cross-account federated query feature allows the Lambda function and the data sources that are to be queried to be located in different accounts.

**Note**  
Use this method only if you have not registered your federated data source with the AWS Glue Data Catalog. If you have registered your data source with the AWS Glue Data Catalog, use the AWS Glue Data Catalog cross account features and permissions model. For more information, see [Granting cross-account access](https://docs.aws.amazon.com/glue/latest/dg/cross-account-access.html) in the *AWS Glue User Guide*.

As a data administrator, you can enable cross-account federated queries by sharing your data connector with a data analyst's account or, as a data analyst, by using a shared Lambda ARN from a data administrator to add to your account. When configuration changes are made to a connector in the originating account, the updated configuration is automatically applied to the shared instances of the connector in other users' accounts.

## Considerations and limitations

+ The cross-account federated query feature is available for non-Hive metastore data connectors that use a Lambda-based data source.
+ The feature is not available for the AWS Glue Data Catalog data source type. For information about cross-account access to AWS Glue Data Catalogs, see [Configure cross-account access to AWS Glue data catalogs](security-iam-cross-account-glue-catalog-access.md).
+ If the response from your connector's Lambda function exceeds the Lambda response size limit of 6 MB, Athena automatically encrypts, batches, and spills the response to an Amazon S3 bucket that you configure. The entity running the Athena query must have access to the spill location in order for Athena to read the spilled data. We recommend that you set an Amazon S3 lifecycle policy to delete objects from the spill location, because the data is not needed after the query completes. 
+ Using federated queries across AWS Regions is not supported. 
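The spill-bucket lifecycle recommendation above can be implemented as follows. This is a sketch: the bucket name is a placeholder, and the one-day expiration is an assumption to tune for your workload.

```shell
# Sketch: a lifecycle rule that expires spilled objects after one day.
# The bucket name is a placeholder; tune the prefix and expiration.
cat > spill-lifecycle.json <<'EOF'
{
  "Rules": [
    {
      "ID": "expire-athena-spill",
      "Status": "Enabled",
      "Filter": { "Prefix": "" },
      "Expiration": { "Days": 1 }
    }
  ]
}
EOF

# Apply it to your spill bucket (requires AWS credentials):
#   aws s3api put-bucket-lifecycle-configuration \
#       --bucket amzn-s3-demo-bucket \
#       --lifecycle-configuration file://spill-lifecycle.json
```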

## Required permissions


To set up the required permissions, actions must be taken in both Account A (*444455556666*) and Account B (*111122223333*).

### Actions for Account A


For data administrator Account A to share a Lambda function with data analyst Account B, Account B requires Lambda invoke function and spill bucket access. Accordingly, Account A should add a [resource-based policy](https://docs.aws.amazon.com/lambda/latest/dg/access-control-resource-based.html) to the Lambda function and grant the [principal](https://docs.aws.amazon.com/AmazonS3/latest/userguide/access-policy-language-overview.html) in Account B access to its spill bucket in Amazon S3.

1. The following policy grants Lambda invoke function permissions to Account B on a Lambda function in Account A.

------
#### [ JSON ]

****  

   ```
   {
       "Version":"2012-10-17",		 	 	 
       "Statement": [
           {
               "Sid": "CrossAccountInvocationStatement",
               "Effect": "Allow",
               "Principal": {
                   "AWS": [
                       "arn:aws:iam::111122223333:user/username"
                   ]
               },
               "Action": "lambda:InvokeFunction",
               "Resource": "arn:aws:lambda:us-east-1:444455556666:function:lambda-function-name"
           }
       ]
   }
   ```

------

1. The following policy allows spill bucket access to the principal in Account B.

------
#### [ JSON ]

****  

   ```
   {
       "Version":"2012-10-17",		 	 	 
       "Statement": [
           {
               "Effect": "Allow",
               "Principal": {
               "AWS": ["arn:aws:iam::111122223333:user/username"]
               },
               "Action": [
                   "s3:GetObject",
                   "s3:ListBucket"
                ],
               "Resource": [
                   "arn:aws:s3:::spill-bucket",
                   "arn:aws:s3:::spill-bucket/*"
               ]
           }
       ]
   }
   ```

------

1. If the Lambda function encrypts the spill bucket with an AWS KMS key instead of the default encryption offered by the federation SDK, the AWS KMS key policy in Account A must grant access to the user in Account B, as in the following example.

   ```
   { 
       "Sid": "Allow use of the key", 
       "Effect": "Allow", 
       "Principal": 
       { 
          "AWS": ["arn:aws:iam::account-B-id:user/username"] 
       }, 
       "Action": [ "kms:Decrypt" ], 
       "Resource": "*" // Resource policy that gets placed on the KMS key. 
    }
   ```
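The invoke permission from step 1 can also be granted with the AWS CLI. This sketch uses the example function name from above, but grants the permission to Account B's account ID rather than a specific user; Account B can then delegate invoke access to individual principals through IAM.

```shell
# Sketch: add a resource-based policy statement to the connector
# function in Account A, allowing Account B to invoke it.
aws lambda add-permission \
    --function-name lambda-function-name \
    --statement-id CrossAccountInvocationStatement \
    --action lambda:InvokeFunction \
    --principal 111122223333
```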

### Actions for Account B


For Account A to share its connector with Account B, Account B must create a role called `AthenaCrossAccountCreate-account-A-id` that Account A assumes by calling the AWS Security Token Service [AssumeRole](https://docs.aws.amazon.com/STS/latest/APIReference/API_AssumeRole.html) API action.

1. Use the IAM console or the AWS CLI to create the `AthenaCrossAccountCreate-account-A-id` role in Account B as a custom trust policy role. A custom trust policy delegates access and allows others to perform actions in your AWS account. For steps, see [Create a role using custom trust policies](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_create_for-custom.html) in the *IAM User Guide*.

   The trust relationship should have a principal object in which the key is `AWS` and the value is the ARN of Account A, as in the following example.

   ```
   ...
   "Principal": 
   { 
      "AWS": ["arn:aws:iam::account-A-id:user/username"]
   }, 
   ...
   ```

1. Also in Account B, create a policy like the following that allows the `CreateDataCatalog` action. 

   ```
   {
    "Effect": "Allow",
    "Action": "athena:CreateDataCatalog",
    "Resource": "arn:aws:athena:*:account-B-id:datacatalog/*"
   }
   ```

1. Add the policy that allows the `CreateDataCatalog` action to the `AthenaCrossAccountCreate-account-A-id` role that you created using Account B. 
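The steps above can be sketched with the AWS CLI, using the example account IDs. The trust policy here names Account A's root principal for brevity; scope it to a specific user or role as needed.

```shell
# Sketch (example account IDs): create the role in Account B that
# Account A assumes, then attach the CreateDataCatalog policy.
cat > trust-policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "AWS": "arn:aws:iam::444455556666:root" },
      "Action": "sts:AssumeRole"
    }
  ]
}
EOF

aws iam create-role \
    --role-name AthenaCrossAccountCreate-444455556666 \
    --assume-role-policy-document file://trust-policy.json

aws iam put-role-policy \
    --role-name AthenaCrossAccountCreate-444455556666 \
    --policy-name AllowCreateDataCatalog \
    --policy-document '{
      "Version": "2012-10-17",
      "Statement": [{
        "Effect": "Allow",
        "Action": "athena:CreateDataCatalog",
        "Resource": "arn:aws:athena:*:111122223333:datacatalog/*"
      }]
    }'
```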

## Sharing a data source in Account A with Account B


After permissions are in place, you can use the **Data sources and catalogs** page in the Athena console to share a data connector in your account (Account A) with another account (Account B). Account A retains full control and ownership of the connector. When Account A makes configuration changes to the connector, the updated configuration applies to the shared connector in Account B.

**Note**  
You can only share a Lambda type data source and cannot share data sources that use AWS Glue connections. For more information, see [Available data source connectors](connectors-available.md).

**To share a Lambda data source in Account A with Account B**

1. Open the Athena console at [https://console.aws.amazon.com/athena/](https://console.aws.amazon.com/athena/home).

1. If the console navigation pane is not visible, choose the expansion menu on the left.  
![\[Choose the expansion menu.\]](http://docs.aws.amazon.com/athena/latest/ug/images/nav-pane-expansion.png)

1. Choose **Data sources and catalogs**.

1. On the **Data sources and catalogs** page, choose the link of the connector that you want to share.

1. On the details page for a Lambda data source, from the **Actions** menu in the top right corner, choose **Share**.

1. In the **Share *Lambda-name* with another account?** dialog box, enter the required information.
   + For **Data source name**, enter the name of the copied data source as you want it to appear in the other account.
   + For **Account ID**, enter the ID of the account with which you want to share your data source (in this case, Account B).

1. Choose **Share**. The shared data connector that you specified is created in Account B. Configuration changes to the connector in Account A apply to the connector in Account B.

## Adding a shared data source from Account A to Account B


As a data analyst, you may be given the ARN of a connector to add to your account from a data administrator. You can use the **Data sources and catalogs** page of the Athena console to add the Lambda ARN provided by your administrator to your account.

**To add the Lambda ARN of a shared data connector to your account**

1. Open the Athena console at [https://console.aws.amazon.com/athena/](https://console.aws.amazon.com/athena/home).

1. If the navigation pane is not visible, choose the expansion menu on the left.

1. Choose **Data sources and catalogs**.

1. On the **Data sources and catalogs** page, choose **Create data source**.

1. On the **Choose a data source** page, choose **Custom or shared connector**.

1. Choose **Next**.

1. On the **Enter data source details** page, in the **Connection details** section, for **Select or enter a Lambda function**, enter the Lambda ARN of Account A.

1. Choose **Next**.

1. On the **Review and create** page, choose **Create data source**.

## Troubleshooting


If you receive an error message that Account A does not have the permissions to assume a role in Account B, make sure that the name of the role created in Account B is spelled correctly and that it has the proper policy attached.

# Update a data source connector


Athena recommends that you regularly update the data source connectors that you use to the latest version to take advantage of new features and enhancements. Updating a data source connector includes the following steps:

# Glue connections (recommended)


## Find the latest Athena Query Federation version


The latest version number of Athena data source connectors corresponds to the latest Athena Query Federation version. In certain cases, the GitHub releases can be slightly newer than what is available on the AWS Serverless Application Repository (SAR).

**To find the latest Athena Query Federation version number**

1. Visit the GitHub URL [https://github.com/awslabs/aws-athena-query-federation/releases/latest](https://github.com/awslabs/aws-athena-query-federation/releases/latest).

1. Note the release number in the main page heading in the following format:

   **Release v** *year*.*week\_of\_year*.*iteration\_of\_week* **of Athena Query Federation**

   For example, the release number for **Release v2023.8.3 of Athena Query Federation** is 2023.8.3.

## Finding your connector version


Follow these steps to determine which version of your connector you are currently using.

**To find your connector version**

1. On the Lambda console page for your Lambda application, choose the **Image** tab.

1. On the **Image** tab, locate the image URI. The URI follows this format:

   ```
   Image_location_account.dkr.ecr.us-west-2.amazonaws.com/athena-federation-repository:Version
   ```

1. The version number in the Image URI follows the format `year.week_of_year.iteration_of_week` (for example, `2021.42.1`). This number represents your connector version.
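
The version can be pulled out of the image URI programmatically. The following Python sketch (the example URI is illustrative) extracts the `year.week_of_year.iteration_of_week` tag:

```python
import re

def connector_version(image_uri: str) -> str:
    """Extract the year.week_of_year.iteration_of_week tag from an image URI."""
    match = re.search(r":(\d{4}\.\d{1,2}\.\d+)$", image_uri)
    if not match:
        raise ValueError(f"no version tag found in {image_uri!r}")
    return match.group(1)

uri = ("123456789012.dkr.ecr.us-west-2.amazonaws.com/"
       "athena-federation-repository:2021.42.1")  # illustrative URI
print(connector_version(uri))  # 2021.42.1
```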

## Deploying a new connector version


Follow these steps to deploy a new version of your connector.

**To deploy a new connector version**

1. Find the desired version by following the procedure to find the latest Athena Query Federation version.

1. In the federated connector Lambda function, locate the ImageURI and update the tag to the desired version. For example:

   From:

   ```
   509399631660.dkr.ecr.us-east-1.amazonaws.com/athena-federation-repository:2025.15.1
   ```

   To:

   ```
   509399631660.dkr.ecr.us-east-1.amazonaws.com/athena-federation-repository:2025.26.1
   ```

**Note**  
If your current version is older than 2025.15.1, be aware of these important changes:  
+ The repository name has been updated to `athena-federation-repository`.
+ For versions earlier than this update, the command override might not be set. You must set it to the composite handler.

# Legacy connections


## Find the latest Athena Query Federation version


The latest version number of Athena data source connectors corresponds to the latest Athena Query Federation version. In certain cases, the GitHub releases can be slightly newer than what is available on the AWS Serverless Application Repository (SAR).

**To find the latest Athena Query Federation version number**

1. Visit the GitHub URL [https://github.com/awslabs/aws-athena-query-federation/releases/latest](https://github.com/awslabs/aws-athena-query-federation/releases/latest).

1. Note the release number in the main page heading in the following format:

   **Release v** *year*.*week\_of\_year*.*iteration\_of\_week* **of Athena Query Federation**

   For example, the release number for **Release v2023.8.3 of Athena Query Federation** is 2023.8.3.

## Find and note resource names


In preparation for the upgrade, you must find and note the following information:

1. The Lambda function name for the connector.

1. The Lambda function environment variables.

1. The Lambda application name, which manages the Lambda function for the connector.

**To find resource names from the Athena console**

1. Open the Athena console at [https://console.aws.amazon.com/athena/](https://console.aws.amazon.com/athena/home).

1. If the console navigation pane is not visible, choose the expansion menu on the left.  
![\[Choose the expansion menu.\]](http://docs.aws.amazon.com/athena/latest/ug/images/nav-pane-expansion.png)

1. In the navigation pane, choose **Data sources and catalogs**.

1. In the **Data source name** column, choose the link to the data source for your connector.

1. In the **Data source details** section, under **Lambda function**, choose the link to your Lambda function.  
![\[Choose the link to your Lambda function.\]](http://docs.aws.amazon.com/athena/latest/ug/images/connectors-updating-1.png)

1. On the **Functions** page, in the **Function name** column, note the function name for your connector.  
![\[Note the function name.\]](http://docs.aws.amazon.com/athena/latest/ug/images/connectors-updating-2.png)

1. Choose the function name link.

1. Under the **Function overview** section, choose the **Configuration** tab.

1. In the pane on the left, choose **Environment variables**.

1. In the **Environment variables** section, make a note of the keys and their corresponding values.

1. Scroll to the top of the page.

1. In the message **This function belongs to an application. Click here to manage it**, choose the **Click here** link.

1. On the **serverlessrepo-*your\_application\_name*** page, make a note of your application name without **serverlessrepo**. For example, if the application name is **serverlessrepo-DynamoDbTestApp**, then your application name is **DynamoDbTestApp**.

1. Stay on the Lambda console page for your application, and then continue with the steps in **Find the version of the connector that you are using**.

## Find the version of the connector that you are using


Follow these steps to find the version of the connector that you are using.

**To find the version of the connector that you are using**

1. On the Lambda console page for your Lambda application, choose the **Deployments** tab.

1. On the **Deployments** tab, expand **SAM template**.

1. Search for **CodeUri**.

1. In the **Key** field under **CodeUri**, find the following string:

   ```
   applications-connector_name-versions-year.week_of_year.iteration_of_week/hash_number
   ```

   The following example shows a string for the CloudWatch connector:

   ```
   applications-AthenaCloudwatchConnector-versions-2021.42.1/15151159...
   ```

1. Record the value for *year*.*week\_of\_year*.*iteration\_of\_week* (for example, **2021.42.1**). This is the version for your connector.

## Deploy the new version of your connector


Follow these steps to deploy a new version of your connector.

**To deploy a new version of your connector**

1. Open the Athena console at [https://console.aws.amazon.com/athena/](https://console.aws.amazon.com/athena/home).

1. If the console navigation pane is not visible, choose the expansion menu on the left.  
![\[Choose the expansion menu.\]](http://docs.aws.amazon.com/athena/latest/ug/images/nav-pane-expansion.png)

1. In the navigation pane, choose **Data sources and catalogs**.

1. On the **Data sources and catalogs** page, choose **Create data source**.

1. Choose the data source that you want to upgrade, and then choose **Next**.

1. In the **Connection details** section, choose **Create Lambda function**. This opens the Lambda console where you will be able to deploy your updated application.  
![\[Connector page in the AWS Lambda console.\]](http://docs.aws.amazon.com/athena/latest/ug/images/connectors-updating-3.png)

1. Because you are not actually creating a new data source, you can close the Athena console tab.

1. On the Lambda console page for the connector, perform the following steps:

   1. Ensure that you have removed the **serverlessrepo-** prefix from your application name, and then copy the application name to the **Application name** field.

   1. Copy your Lambda function name to the **AthenaCatalogName** field. Some connectors call this field **LambdaFunctionName**.

   1. Copy the environment variables that you recorded into their corresponding fields.

1. Select the option **I acknowledge that this app creates custom IAM roles and resource policies**, and then choose **Deploy**.

1. To verify that your application has been updated, choose the **Deployments** tab.

   The **Deployment history** section shows that your update is complete.  
![\[Connector update completed.\]](http://docs.aws.amazon.com/athena/latest/ug/images/connectors-updating-4.png)

1. To confirm the new version number, you can expand **SAM template** as before, find **CodeUri**, and check the connector version number in the **Key** field.

You can now use your updated connector to create Athena federated queries.

# Edit or delete a data source connection


You can use the Athena console to update the description, host, port, database, and other properties for an existing connection. You can also delete data sources from the Athena console.

## Edit a data source connection


**To edit a data source connection**

1. Open the Athena console at [https://console.aws.amazon.com/athena/](https://console.aws.amazon.com/athena/home).

1. If the console navigation pane is not visible, choose the expansion menu on the left.  
![\[Choose the expansion menu.\]](http://docs.aws.amazon.com/athena/latest/ug/images/nav-pane-expansion.png)

1. In the navigation pane, choose **Data sources and catalogs**.

1. On the **Data sources and catalogs** page, choose the data source connection that you want to edit.

1. For **AWS Glue connection details**, choose **Edit**.

1. Choose **Next**.

1. On the **Edit <connection-name>** page, update the information as required. Available properties depend on the connection type.
**Note**  
When you update connection properties for secrets, spill location, or AWS KMS key ID, make sure the Lambda execution role still has access to the updated resources. For more information, see [Viewing and updating permissions in the execution role](https://docs.aws.amazon.com/lambda/latest/dg/permissions-executionrole-update.html) in the AWS Lambda Developer Guide.
   + **Description** – Edit the description for your connection.
   + **Host** – Edit the host name for your database.
   + **Port** – Edit the port number for your database.
   + **Database** – Edit the name of your database.
   + **JDBC parameters** – Edit any additional JDBC parameters required for your connection. 
   + **Secret** – Choose or create a secret from AWS Secrets Manager. Use AWS secrets to avoid hardcoding sensitive information in your JDBC connection string. For more information, see [What is AWS Secrets Manager?](https://docs.aws.amazon.com/secretsmanager/latest/userguide/intro.html) For information about creating a secret in Secrets Manager, see [Create an AWS Secrets Manager secret](https://docs.aws.amazon.com/secretsmanager/latest/userguide/create_secret.html) in the *AWS Secrets Manager User Guide*.

     To use AWS Secrets Manager with Athena federated queries, you must configure an Amazon VPC private endpoint for Secrets Manager. For more information, see [Create a Secrets Manager VPC private endpoint](https://docs.aws.amazon.com/secretsmanager/latest/userguide/vpc-endpoint-overview.html#vpc-endpoint-create) in the *AWS Secrets Manager User Guide*.
   + **Spill location in Amazon S3** – Choose or create an Amazon S3 bucket location in your account to store data that exceeds Lambda function response size limits.
**Note**  
Spilled data is not reused in subsequent executions and can be safely deleted after 12 hours. Athena does not delete this data for you. To manage these objects, consider adding an object lifecycle policy that deletes old data from your Amazon S3 spill bucket. For more information, see [Managing your storage lifecycle](https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-lifecycle-mgmt.html) in the *Amazon S3 User Guide*.
   + **Encryption for query results in S3** – Choose one of the following:
     + (Default) **Use a randomly generated key** – Data that is spilled to Amazon S3 is encrypted using the AES-GCM authenticated encryption mode and a randomly generated key.
     + **Use an AWS KMS key** – Choose or create a stronger, AWS KMS generated encryption key. For more information, see [Creating keys](https://docs.aws.amazon.com/kms/latest/developerguide/create-keys.html) in the *AWS Key Management Service Developer Guide*.
     + **Turn off** – Do not encrypt spill data.
   + **Networking settings** – Some connections require a virtual private cloud (VPC). Choose or create a VPC that has the data store that you want to access, a subnet, and one or more security groups. For more information, see [Create a VPC for a data source connector or AWS Glue connection](athena-connectors-vpc-creation.md).
**Note**  
After you update connection properties for resources such as secrets, spill location, or AWS KMS key ID, make sure that the Lambda execution role continues to have access to the updated resources.
After you update the network settings for your connection, make sure you update the Lambda function with the same settings to make your connection compatible with the data source.

   For information about additional connection properties, see [AWS Glue connection properties](https://docs.aws.amazon.com/glue/latest/dg/connection-properties.html) in the *AWS Glue User Guide* or [Available data source connectors](connectors-available.md) in the *Amazon Athena User Guide*.

1. Choose **Save**.

The **AWS Glue connection details** section of the page for your data source shows the updated information for your connector.
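
The spill location note above recommends a lifecycle policy for the spill bucket. The following S3 lifecycle configuration is one possible sketch (the `athena-spill/` prefix is an assumption about where your connector spills) that expires spill objects one day after creation:

```
{
    "Rules": [
        {
            "ID": "ExpireAthenaSpillData",
            "Filter": { "Prefix": "athena-spill/" },
            "Status": "Enabled",
            "Expiration": { "Days": 1 }
        }
    ]
}
```

You could apply a configuration like this with `aws s3api put-bucket-lifecycle-configuration --bucket your-spill-bucket --lifecycle-configuration file://lifecycle.json`.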

## Delete a data source


When you delete a data source, it only deletes the Athena data source and does not delete resources like the Glue connections, IAM execution role, and Lambda function.

**To delete a data source**

1. Open the Athena console at [https://console.aws.amazon.com/athena/](https://console.aws.amazon.com/athena/home).

1. In the navigation pane, choose **Data sources and catalogs**.

1. On the **Data sources and catalogs** page, choose the data source that you want to delete.

1. Choose **Delete**.

1. On the **Delete data source** page, type *confirm* to confirm deletion, and then choose **Delete**. It might take some time for the deletion to complete. A success alert appears when the data source is deleted.

# Run federated queries


After you have configured one or more data connectors and deployed them to your account, you can use them in your Athena queries. 

## Query a single data source


The examples in this section assume that you have configured and deployed the [Amazon Athena CloudWatch connector](connectors-cloudwatch.md) to your account. Use the same approach to query when you use other connectors.

**To create an Athena query that uses the CloudWatch connector**

1. Open the Athena console at [https://console.aws.amazon.com/athena/](https://console.aws.amazon.com/athena/home).

1. In the Athena query editor, create a SQL query that uses the following syntax in the `FROM` clause.

   ```
   MyCloudwatchCatalog.database_name.table_name       
   ```

### Examples


The following example uses the Athena CloudWatch connector to connect to the `all_log_streams` view in the `/var/ecommerce-engine/order-processor` CloudWatch Logs [Log group](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/Working-with-log-groups-and-streams.html). The `all_log_streams` view is a view of all the log streams in the log group. The example query limits the number of rows returned to 100.

```
SELECT * 
FROM "MyCloudwatchCatalog"."/var/ecommerce-engine/order-processor".all_log_streams 
LIMIT 100;
```

The following example parses information from the same view as the previous example. The example extracts the order ID and log level and filters out any message that has the level `INFO`.

```
SELECT 
    log_stream as ec2_instance, 
    Regexp_extract(message, '.*orderId=(\d+) .*', 1) AS orderId, 
    message AS order_processor_log, 
    Regexp_extract(message, '(.*):.*', 1) AS log_level 
FROM MyCloudwatchCatalog."/var/ecommerce-engine/order-processor".all_log_streams 
WHERE Regexp_extract(message, '(.*):.*', 1) != 'INFO'
```
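
The two `Regexp_extract` calls above behave like ordinary regular-expression searches. The following Python sketch applies the same patterns to a made-up log message to show what the query extracts:

```python
import re

# Fabricated sample log message in the order-processor format.
message = "ERROR: orderId=10310 could not be shipped "

# Same patterns as the Regexp_extract calls in the query above.
order_id = re.search(r".*orderId=(\d+) .*", message)
log_level = re.search(r"(.*):.*", message)

print(order_id.group(1))   # 10310
print(log_level.group(1))  # ERROR
```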

## Query multiple data sources


As a more complex example, imagine an e-commerce company that uses the following data sources to store data related to customer purchases:
+ [Amazon RDS for MySQL](https://aws.amazon.com/rds/mysql/) to store product catalog data
+ [Amazon DocumentDB](https://aws.amazon.com/documentdb/) to store customer account data such as email and shipping addresses
+ [Amazon DynamoDB](https://aws.amazon.com/dynamodb/) to store order shipment and tracking data

Imagine that a data analyst for this e-commerce application learns that shipping time in some regions has been impacted by local weather conditions. The analyst wants to know how many orders are delayed, where the affected customers are located, and which products are most affected. Instead of investigating the sources of information separately, the analyst uses Athena to join the data together in a single federated query.

**Example**  

```
SELECT 
     t2.product_name AS product, 
     t2.product_category AS category, 
     t3.customer_region AS region, 
     count(t1.order_id) AS impacted_orders 
FROM my_dynamodb.default.orders t1 
JOIN my_mysql.products.catalog t2 ON t1.product_id = t2.product_id 
JOIN my_documentdb.default.customers t3 ON t1.customer_id = t3.customer_id 
WHERE 
     t1.order_status = 'PENDING'
     AND t1.order_date between '2022-01-01' AND '2022-01-05' 
GROUP BY 1, 2, 3 
ORDER BY 4 DESC
```

## Query federated views


When querying federated sources, you can use views to obfuscate the underlying data sources or hide complex joins from other analysts who query the data.

### Considerations and limitations

+ Federated views require Athena engine version 3. 
+ Federated views are stored in AWS Glue, not with the underlying data source.
+ Federated views are not supported on data sources that are [registered as a Glue Data Catalog](register-connection-as-gdc.md).
+ Views created with federated catalogs must use fully qualified name syntax, as in the following example:

  ```
  "ddbcatalog"."default"."customers"
  ```
+ Users who run queries on federated sources must have permission to query the federated sources.
+ The `athena:GetDataCatalog` permission is required for federated views. For more information, see [Allow access to Athena Federated Query: Example policies](federated-query-iam-access.md).

### Examples


The following example creates a view called `customers` on data stored in a federated data source.

**Example**  

```
CREATE VIEW customers AS
SELECT *
FROM my_federated_source.default.table
```

The following example query references the `customers` view instead of the underlying federated data source.

**Example**  

```
SELECT id, SUM(order_amount)
FROM customers
GROUP by 1
ORDER by 2 DESC
LIMIT 50
```

The following example creates a view called `order_summary` that combines data from a federated data source and from an Amazon S3 data source. From the federated source, which has already been created in Athena, the view uses the `person` and `profile` tables. From Amazon S3, the view uses the `purchase` and `payment` tables. To refer to Amazon S3, the statement uses the keyword `awsdatacatalog`. Note that the federated data source uses the fully qualified name syntax *federated\_source\_name*.*federated\_source\_database*.*federated\_source\_table*.

**Example**  

```
CREATE VIEW default.order_summary AS
SELECT *
FROM federated_source_name.federated_source_database."person" p
    JOIN federated_source_name.federated_source_database."profile" pr ON pr.id = p.id
    JOIN awsdatacatalog.default.purchase i ON p.id = i.id
    JOIN awsdatacatalog.default.payment pay ON pay.id = p.id
```

### Additional resources

+ For an example of a federated view that is decoupled from its originating source and is available for on-demand analysis in a multi-user model, see [Extend your data mesh with Amazon Athena and federated views](https://aws.amazon.com/blogs/big-data/extend-your-data-mesh-with-amazon-athena-and-federated-views/) in the *AWS Big Data Blog*. 
+ For more information about working with views in Athena, see [Work with views](views.md).

# Use federated passthrough queries
Use passthrough queries

In Athena, you can run queries on federated data sources using the query language of the data source itself and push the full query down to the data source for execution. These queries are called passthrough queries. To run passthrough queries, you use a table function in your Athena query. You include the passthrough query to run on the data source as one of the arguments to the table function. Passthrough queries return a table that you can analyze using Athena SQL.

## Supported connectors


The following Athena data source connectors support passthrough queries.
+ [Azure Data Lake Storage](connectors-adls-gen2.md)
+ [Azure Synapse](connectors-azure-synapse.md)
+ [Cloudera Hive](connectors-cloudera-hive.md)
+ [Cloudera Impala](connectors-cloudera-impala.md)
+ [CloudWatch](connectors-cloudwatch.md)
+ [Db2](connectors-ibm-db2.md)
+ [Db2 iSeries](connectors-ibm-db2-as400.md)
+ [DocumentDB](connectors-docdb.md) 
+ [DynamoDB](connectors-dynamodb.md) 
+ [HBase](connectors-hbase.md)
+ [Google BigQuery](connectors-bigquery.md)
+ [Hortonworks](connectors-hortonworks.md)
+ [MySQL](connectors-mysql.md)
+ [Neptune](connectors-neptune.md)
+ [OpenSearch](connectors-opensearch.md) 
+ [Oracle](connectors-oracle.md)
+ [PostgreSQL](connectors-postgresql.md)
+ [Redshift](connectors-redshift.md)
+ [SAP HANA](connectors-sap-hana.md)
+ [Snowflake](connectors-snowflake.md)
+ [SQL Server](connectors-microsoft-sql-server.md)
+ [Teradata](connectors-teradata.md)
+ [Timestream](connectors-timestream.md)
+ [Vertica](connectors-vertica.md)

## Considerations and limitations


When using passthrough queries in Athena, consider the following points:
+ Query passthrough is supported only for Athena `SELECT` statements or read operations.
+ Query performance can vary depending on the configuration of the data source.
+ Query passthrough does not support Lake Formation fine-grained access control.
+ Passthrough queries are not supported on data sources that are [registered as a Glue Data Catalog](register-connection-as-gdc.md).

## Syntax


The general Athena query passthrough syntax is as follows.

```
SELECT * FROM TABLE(catalog.system.function_name(arg1 => 'arg1Value'[, arg2 => 'arg2Value', ...]))
```

Note the following:
+ **catalog** – The target Athena federated connector name or data catalog name.
+ **system** – The namespace that contains the function. All Athena connector implementations use this namespace.
+ **function\_name** – The name of the function that pushes the passthrough query down to the data source. This is often called `query`. The combination `catalog.system.function_name` is the full resolution path for the function.
+ **arg1, arg2, and so on** – Function arguments. The user must pass these to the function. In most cases, this is the query string that is passed down to the data source.

For most data sources, the first and only argument is `query` followed by the arrow operator `=>` and the query string.

```
SELECT * FROM TABLE(catalog.system.query(query => 'query string'))
```

For simplicity, you can omit the optional named argument `query` and the arrow operator `=>`.

```
SELECT * FROM TABLE(catalog.system.query('query string'))
```

You can further simplify the query by removing the `catalog` name if the query is run within the context of the target catalog.

```
SELECT * FROM TABLE(system.query('query string'))
```

If the data source requires more than the query string, use named arguments in the order expected by the data source. For example, the expression `arg1 => 'arg1Value'` contains the first argument and its value. The name *arg1* is specific to the data source and can differ from connector to connector.

```
SELECT * FROM TABLE(
        system.query(
            arg1 => 'arg1Value',
            arg2 => 'arg2Value',
            arg3 => 'arg3Value'
        ));
```

The above can also be simplified by omitting the argument names. However, you must follow the order of the method's signature. See each connector's documentation for more information about the function's signature.

```
SELECT * FROM TABLE(catalog.system.query('arg1Value', 'arg2Value', 'arg3Value'))
```

You can run multiple passthrough queries across different Athena connectors by using the full function resolution path, as in the following example.

```
SELECT c_customer_sk 
    FROM TABLE (postgresql.system.query('select * from customer limit 10'))
UNION
SELECT c_customer_sk 
    FROM TABLE(dynamodb.system.query('select * from customer')) LIMIT 10
```

You can use passthrough queries as part of a federated view. The same limitations apply. For more information, see [Query federated views](https://docs.aws.amazon.com/athena/latest/ug/running-federated-queries.html#running-federated-queries-federated-views).

```
CREATE VIEW catalog.database.ViewName AS
    SELECT * FROM TABLE (
        catalog.system.query('query')
    )
```

For information about the exact syntax to use with a particular connector, see the individual connector documentation.

### Quotation mark usage


Argument values, including the query string that you pass, must be enclosed in single quotes, as in the following example.

```
SELECT * FROM TABLE(system.query(query => 'SELECT * FROM testdb.persons LIMIT 10'))
```

When the query string is surrounded by double quotes, the query fails. The following query fails with the error message COLUMN\_NOT\_FOUND: line 1:43: Column 'select \* from testdb.persons limit 10' cannot be resolved.

```
SELECT * FROM TABLE(system.query(query => "SELECT * FROM testdb.persons LIMIT 10"))
```

To escape a single quote, add a single quote to the original (for example, `terry's_group` to `terry''s_group`).
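
If you build passthrough query strings programmatically, you can apply this escaping rule with a small helper. The following Python sketch (the helper name and sample query are illustrative) doubles embedded single quotes before wrapping the string in the `query` argument:

```python
def escape_single_quotes(text: str) -> str:
    """Double each single quote so the text can be embedded in a
    single-quoted SQL string literal."""
    return text.replace("'", "''")

# Illustrative inner query that contains single quotes.
passthrough = "SELECT * FROM testdb.groups WHERE name = 'terry's_group'"

athena_sql = ("SELECT * FROM TABLE(system.query(query => '"
              + escape_single_quotes(passthrough) + "'))")
print(athena_sql)
```

When Athena parses the outer query, each doubled quote is read back as a single quote, so the data source receives the original passthrough string.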

## Examples


The following example query pushes down a query to a data source. The query selects all columns in the `customer` table, limiting the results to 10.

```
SELECT * FROM TABLE(
        catalog.system.query(
            query => 'SELECT * FROM customer LIMIT 10;'
        ))
```

The following statement runs the same query, but eliminates the optional named argument `query` and the arrow operator `=>`.

```
SELECT * FROM TABLE(
        catalog.system.query(
            'SELECT * FROM customer LIMIT 10;'
        ))
```

This can also be encapsulated within a federated view for ease of reuse. When used with a view, you must use the full function resolution path.

```
CREATE VIEW AwsDataCatalog.default.example_view AS
    SELECT * FROM TABLE (
        catalog.system.query('SELECT * FROM customer LIMIT 10;')
    )
```

## Opt out of query passthrough


To disable passthrough queries, add a Lambda environment variable named `enable_query_passthrough` and set it to `false`.
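
Note that the Lambda `update_function_configuration` API replaces the entire `Environment` map, so preserve the connector's existing variables when you add this one. The following Python sketch (the function name is a placeholder) shows one hedged way to do that:

```python
def with_passthrough_disabled(variables: dict) -> dict:
    """Return a copy of a Lambda environment-variables map with
    query passthrough turned off."""
    updated = dict(variables)  # copy so existing variables are preserved
    updated["enable_query_passthrough"] = "false"
    return updated

# With boto3 and credentials, the connector function could be updated like:
# import boto3
# lam = boto3.client("lambda")
# cfg = lam.get_function_configuration(FunctionName="my-connector")
# lam.update_function_configuration(
#     FunctionName="my-connector",
#     Environment={"Variables": with_passthrough_disabled(
#         cfg.get("Environment", {}).get("Variables", {}))})
```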

# Understand federated table name qualifiers
Understand federated table name qualifiers

Athena uses the following terms to refer to hierarchies of data objects:
+ **Data source** – a group of databases
+ **Database** – a group of tables
+ **Table** – data organized as a group of rows or columns

Sometimes these objects are also referred to with alternate but equivalent names such as the following:
+ A data source is sometimes referred to as a catalog.
+ A database is sometimes referred to as a schema.

## Terms in federated data sources


When you query federated data sources, note that the underlying data source might not use the same terminology as Athena. Keep this distinction in mind when you write your federated queries. The following sections describe how data object terms in Athena correspond to those in federated data sources.

### Amazon Redshift


An Amazon Redshift *database* is a group of Redshift *schemas* that contains a group of Redshift *tables*.


****  

| Athena | Redshift | 
| --- | --- | 
| Redshift data source | A Redshift connector Lambda function configured to point to a Redshift database. | 
| data\_source.database.table | database.schema.table | 

Example query

```
SELECT * FROM 
Athena_Redshift_connector_data_source.Redshift_schema_name.Redshift_table_name
```

For more information about this connector, see [Amazon Athena Redshift connector](connectors-redshift.md).

### Cloudera Hive


A Cloudera Hive *server* or *cluster* is a group of Cloudera Hive *databases* that contains a group of Cloudera Hive *tables*.


****  

| Athena | Hive | 
| --- | --- | 
| Cloudera Hive data source | Cloudera Hive connector Lambda function configured to point to a Cloudera Hive server. | 
| data\_source.database.table | server.database.table | 

Example query

```
SELECT * FROM 
Athena_Cloudera_Hive_connector_data_source.Cloudera_Hive_database_name.Cloudera_Hive_table_name
```

For more information about this connector, see [Amazon Athena Cloudera Hive connector](connectors-cloudera-hive.md).

### Cloudera Impala


An Impala *server* or *cluster* is a group of Impala *databases* that contains a group of Impala *tables*.



| Athena | Impala | 
| --- | --- | 
| Impala data source | Impala connector Lambda function configured to point to an Impala server. | 
| data\_source.database.table | server.database.table | 

Example query

```
SELECT * FROM 
Athena_Impala_connector_data_source.Impala_database_name.Impala_table_name
```

For more information about this connector, see [Amazon Athena Cloudera Impala connector](connectors-cloudera-impala.md).

### MySQL


A MySQL *server* is a group of MySQL *databases* that contains a group of MySQL *tables*.



| Athena | MySQL | 
| --- | --- | 
| MySQL data source | MySQL connector Lambda function configured to point to a MySQL server. | 
| data\_source.database.table | server.database.table | 

Example query

```
SELECT * FROM 
Athena_MySQL_connector_data_source.MySQL_database_name.MySQL_table_name
```

For more information about this connector, see [Amazon Athena MySQL connector](connectors-mysql.md).

### Oracle


An Oracle *server* (or *database*) is a group of Oracle *schemas* that contains a group of Oracle *tables*.



| Athena | Oracle | 
| --- | --- | 
| Oracle data source | Oracle connector Lambda function configured to point to an Oracle server. | 
| data\_source.database.table | server.schema.table | 

Example query

```
SELECT * FROM 
Athena_Oracle_connector_data_source.Oracle_schema_name.Oracle_table_name
```

For more information about this connector, see [Amazon Athena Oracle connector](connectors-oracle.md).

### Postgres


A Postgres *server* (or *cluster*) is a group of Postgres *databases*. A Postgres *database* is a group of Postgres *schemas* that contains a group of Postgres *tables*.



| Athena | Postgres | 
| --- | --- | 
| Postgres data source | Postgres connector Lambda function configured to point to a Postgres server and database. | 
| data\_source.database.table | server.database.schema.table | 

Example query

```
SELECT * FROM 
Athena_Postgres_connector_data_source.Postgres_schema_name.Postgres_table_name
```

For more information about this connector, see [Amazon Athena PostgreSQL connector](connectors-postgresql.md).
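The per-source mappings above can be summarized in a small sketch. The mapping strings come directly from the tables in the preceding sections; the helper function simply assembles the three-part name that Athena expects:

```python
# Where each level of an Athena data_source.database.table name lands in the
# underlying federated source, per the tables above.
FEDERATED_NAME_MAPPING = {
    "redshift": "database.schema.table",
    "cloudera_hive": "server.database.table",
    "cloudera_impala": "server.database.table",
    "mysql": "server.database.table",
    "oracle": "server.schema.table",
    "postgres": "server.database.schema.table",
}

def athena_table_reference(data_source, database, table):
    """Build the three-part name used in an Athena federated query."""
    return f"{data_source}.{database}.{table}"

print(athena_table_reference(
    "Athena_Redshift_connector_data_source",
    "Redshift_schema_name",
    "Redshift_table_name"))
```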

# Develop a data source connector using the Athena Query Federation SDK
Develop a data source connector

To write your own data source connectors, you can use the [Athena Query Federation SDK](https://github.com/awslabs/aws-athena-query-federation/tree/master/athena-federation-sdk). The Athena Query Federation SDK defines a set of interfaces and wire protocols that you can use to enable Athena to delegate portions of its query execution plan to code that you write and deploy. The SDK includes a connector suite and an example connector.

Custom connectors do not use AWS Glue connections to centralize configuration properties in AWS Glue. Instead, you configure connections through the Lambda function itself.
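As a hedged sketch, assuming your connector reads its configuration from Lambda environment variables (the function name and property value below are placeholders; `spill_bucket` is a common connector property):

```python
def lambda_environment(variables):
    """Build the Environment payload for update-function-configuration."""
    return {"Variables": variables}

# Illustrative only; requires AWS credentials and an existing Lambda function:
# import boto3
# boto3.client("lambda").update_function_configuration(
#     FunctionName="my-custom-connector",  # placeholder
#     Environment=lambda_environment({"spill_bucket": "my-spill-bucket"}))
```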

You can also customize Amazon Athena's [prebuilt connectors](https://github.com/awslabs/aws-athena-query-federation/wiki/Available-Connectors) for your own use. You can modify a copy of the source code from GitHub and then use the [Connector publish tool](https://github.com/awslabs/aws-athena-query-federation/wiki/Connector_Publish_Tool) to create your own AWS Serverless Application Repository package. After you deploy your connector in this way, you can use it in your Athena queries.

For information about how to download the SDK and detailed instructions for writing your own connector, see [Example Athena connector](https://github.com/awslabs/aws-athena-query-federation/tree/master/athena-example) on GitHub.

# Work with data source connectors for Apache Spark
Work with connectors for Spark

Some Athena data source connectors are available as Spark DSV2 connectors. The Spark DSV2 connector names have a `-dsv2` suffix (for example, `athena-dynamodb-dsv2`).

Following are the currently available DSV2 connectors, their Spark `.format()` class name, and links to their corresponding Amazon Athena Federated Query documentation:


| DSV2 connector | Spark .format() class name | Documentation | 
| --- | --- | --- | 
| athena-cloudwatch-dsv2 | com.amazonaws.athena.connectors.dsv2.cloudwatch.CloudwatchTableProvider | [CloudWatch](connectors-cloudwatch.md) | 
| athena-cloudwatch-metrics-dsv2 | com.amazonaws.athena.connectors.dsv2.cloudwatch.metrics.CloudwatchMetricsTableProvider | [CloudWatch metrics](connectors-cwmetrics.md) | 
| athena-aws-cmdb-dsv2 | com.amazonaws.athena.connectors.dsv2.aws.cmdb.AwsCmdbTableProvider | [CMDB](connectors-cmdb.md) | 
| athena-dynamodb-dsv2 | com.amazonaws.athena.connectors.dsv2.dynamodb.DDBTableProvider | [DynamoDB](connectors-dynamodb.md) | 

To download `.jar` files for the DSV2 connectors, visit the [Amazon Athena Query Federation DSV2](https://github.com/awslabs/aws-athena-query-federation-dsv2) GitHub page and see the **Releases**, **Release *<version>***, **Assets** section.
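The release assets follow a predictable URL pattern. The following sketch builds such a URL; the version string is hypothetical, so verify the exact asset name on the **Releases** page before downloading:

```python
def dsv2_jar_url(connector, version):
    """Build the GitHub release URL for a DSV2 connector .jar file.

    Assumes the asset naming pattern <connector>-<version>.jar used on the
    aws-athena-query-federation-dsv2 Releases page.
    """
    base = "https://github.com/awslabs/aws-athena-query-federation-dsv2"
    return f"{base}/releases/download/{version}/{connector}-{version}.jar"

# Hypothetical version string for illustration:
print(dsv2_jar_url("athena-dynamodb-dsv2", "some_version"))
```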

## Specify the jar to Spark


To use the Athena DSV2 connectors with Spark, you submit the `.jar` file for the connector to the Spark environment that you are using. The following sections describe specific cases.

### Athena for Spark


For information on adding custom `.jar` files and custom configuration to Amazon Athena for Apache Spark, see [Use Spark properties to specify custom configuration](notebooks-spark-custom-jar-cfg.md).

### General Spark


To pass in the connector `.jar` file to Spark, use the `spark-submit` command and specify the `.jar` file in the `--jars` option, as in the following example:

```
spark-submit \
  --deploy-mode cluster \
  --jars https://github.com/awslabs/aws-athena-query-federation-dsv2/releases/download/some_version/athena-dynamodb-dsv2-some_version.jar
```

### Amazon EMR Spark


To run a `spark-submit` command with the `--jars` parameter on Amazon EMR, you must add a step to your Amazon EMR Spark cluster. For details on how to use `spark-submit` on Amazon EMR, see [Add a Spark step](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-submit-step.html) in the *Amazon EMR Release Guide*.
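As a sketch, such a step can also be constructed programmatically and submitted with boto3. The cluster ID and application path below are placeholders, and the step wraps the same `spark-submit` invocation shown earlier:

```python
def build_spark_submit_step(jar_url, app_path):
    """Build an EMR step dict that wraps a spark-submit invocation."""
    return {
        "Name": "Spark job with Athena DSV2 connector",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",  # EMR helper that runs arbitrary commands
            "Args": [
                "spark-submit",
                "--deploy-mode", "cluster",
                "--jars", jar_url,
                app_path,  # your Spark application
            ],
        },
    }

# Illustrative only; requires AWS credentials and a running cluster:
# import boto3
# boto3.client("emr").add_job_flow_steps(
#     JobFlowId="j-XXXXXXXXXXXXX",  # placeholder
#     Steps=[build_spark_submit_step(jar_url, "s3://amzn-s3-demo-bucket/job.py")])
```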

### AWS Glue ETL Spark


For AWS Glue ETL, you can pass in the `.jar` file's GitHub.com URL to the `--extra-jars` argument of the `aws glue start-job-run` command. The AWS Glue documentation describes the `--extra-jars` parameter as taking an Amazon S3 path, but the parameter can also take an HTTPS URL. For more information, see [Job parameter reference](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-glue-arguments.html#w5aac32c13c11) in the *AWS Glue Developer Guide*.
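A minimal sketch of the equivalent call through boto3 follows; the job name is a placeholder, and `--extra-jars` takes the HTTPS URL as described above:

```python
def build_glue_arguments(jar_url):
    """Build the Arguments map for start-job-run; --extra-jars accepts an HTTPS URL."""
    return {"--extra-jars": jar_url}

# Illustrative only; requires AWS credentials and an existing Glue job:
# import boto3
# boto3.client("glue").start_job_run(
#     JobName="my-etl-job",  # placeholder
#     Arguments=build_glue_arguments(
#         "https://github.com/awslabs/aws-athena-query-federation-dsv2"
#         "/releases/download/some_version/athena-dynamodb-dsv2-some_version.jar"))
```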

## Query the connector on Spark


To submit the equivalent of your existing Athena federated query on Apache Spark, use the `spark.sql()` function. For example, suppose you have the following Athena query that you want to use on Apache Spark.

```
SELECT somecola, somecolb, somecolc 
FROM ddb_datasource.some_schema_or_glue_database.some_ddb_or_glue_table 
WHERE somecola > 1
```

To perform the same query on Spark using the Amazon Athena DynamoDB DSV2 connector, use the following code:

```
dynamoDf = (spark.read 
    .option("athena.connectors.schema", "some_schema_or_glue_database") 
    .option("athena.connectors.table", "some_ddb_or_glue_table") 
    .format("com.amazonaws.athena.connectors.dsv2.dynamodb.DDBTableProvider") 
    .load()) 
 
dynamoDf.createOrReplaceTempView("ddb_spark_table") 
 
spark.sql(''' 
SELECT somecola, somecolb, somecolc 
FROM ddb_spark_table 
WHERE somecola > 1 
''')
```

## Specify parameters


The DSV2 versions of the Athena data source connectors use the same parameters as the corresponding Athena data source connectors. For parameter information, refer to the documentation for the corresponding Athena data source connector.

In your PySpark code, use the following syntax to configure your parameters.

```
spark.read.option("athena.connectors.conf.parameter", "value")
```

For example, the following code sets the Amazon Athena DynamoDB connector `disable_projection_and_casing` parameter to `always`.

```
dynamoDf = (spark.read 
    .option("athena.connectors.schema", "some_schema_or_glue_database") 
    .option("athena.connectors.table", "some_ddb_or_glue_table") 
    .option("athena.connectors.conf.disable_projection_and_casing", "always") 
    .format("com.amazonaws.athena.connectors.dsv2.dynamodb.DDBTableProvider") 
    .load())
```