Work with data source connectors for Apache Spark
Some Athena data source connectors are available as Spark DSV2 connectors. The Spark DSV2
connector names have a -dsv2 suffix (for example,
athena-dynamodb-dsv2).
Following are the currently available DSV2 connectors, their Spark .format()
class names, and links to their corresponding Amazon Athena Federated Query documentation:
| DSV2 connector | Spark .format() class name | Documentation |
|---|---|---|
| athena-cloudwatch-dsv2 | com.amazonaws.athena.connectors.dsv2.cloudwatch.CloudwatchTableProvider | CloudWatch |
| athena-cloudwatch-metrics-dsv2 | com.amazonaws.athena.connectors.dsv2.cloudwatch.metrics.CloudwatchMetricsTableProvider | CloudWatch metrics |
| athena-aws-cmdb-dsv2 | com.amazonaws.athena.connectors.dsv2.aws.cmdb.AwsCmdbTableProvider | CMDB |
| athena-dynamodb-dsv2 | com.amazonaws.athena.connectors.dsv2.dynamodb.DDBTableProvider | DynamoDB |
To download .jar files for the DSV2 connectors, visit the Amazon Athena Query
Federation DSV2 page on GitHub and look in Releases, Release <version>,
Assets section.
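If you want to script the download instead, the release assets are plain HTTPS downloads. The following is a minimal sketch in Python; the version and file name are placeholders based on the URL pattern shown later in this topic.

# Minimal sketch: fetch a connector .jar from the GitHub release assets.
# "some_version" is a placeholder -- substitute the release that you want.
import urllib.request

jar_url = ("https://github.com/awslabs/aws-athena-query-federation-dsv2"
           "/releases/download/some_version/athena-dynamodb-dsv2-some_version.jar")
urllib.request.urlretrieve(jar_url, "athena-dynamodb-dsv2.jar")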
Specify the jar to Spark
To use the Athena DSV2 connectors with Spark, you submit the .jar
file for the connector to the Spark environment that you are using. The following
sections describe specific cases.
Athena for Spark
For information on adding custom .jar files and custom
configuration to Amazon Athena for Apache Spark, see Use Spark properties to specify custom
configuration.
General Spark
To pass in the connector .jar file to Spark, use the
spark-submit command and specify the .jar file
in the --jars option, as in the following example:
spark-submit \
    --deploy-mode cluster \
    --jars https://github.com/awslabs/aws-athena-query-federation-dsv2/releases/download/some_version/athena-dynamodb-dsv2-some_version.jar
Amazon EMR Spark
To run a spark-submit command with the --jars parameter on
Amazon EMR, you must add a step to your Amazon EMR Spark cluster. For details on
how to use spark-submit on Amazon EMR, see Add a Spark
step in the Amazon EMR Release Guide.
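If you manage your cluster programmatically rather than through the console, the step can also be added with the AWS SDK for Python. The following is a hedged sketch that uses the boto3 add_job_flow_steps operation; the cluster ID and the Amazon S3 script location are placeholders.

# Hypothetical sketch: add a spark-submit step that loads the DSV2 connector.
# Replace the cluster ID and the script location with your own values.
import boto3

emr = boto3.client("emr")
emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",          # placeholder cluster ID
    Steps=[{
        "Name": "Spark job with Athena DynamoDB DSV2 connector",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",  # runs the command on the primary node
            "Args": [
                "spark-submit",
                "--deploy-mode", "cluster",
                "--jars", "https://github.com/awslabs/aws-athena-query-federation-dsv2/releases/download/some_version/athena-dynamodb-dsv2-some_version.jar",
                "s3://amzn-s3-demo-bucket/scripts/my_spark_job.py",  # placeholder script
            ],
        },
    }],
)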
AWS Glue ETL Spark
For AWS Glue ETL, you can pass in the .jar file's GitHub.com URL
to the --extra-jars argument of the aws glue start-job-run
command. The AWS Glue documentation describes the --extra-jars parameter
as taking an Amazon S3 path, but the parameter can also take an HTTPS URL. For more
information, see Job parameter reference in the AWS Glue Developer Guide.
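The same job parameter can be set through the AWS SDK for Python, where the Arguments map of start_job_run carries --extra-jars. A minimal sketch follows; the job name is a placeholder.

# Hypothetical sketch: start an AWS Glue job run with the connector .jar.
# "my-glue-job" is a placeholder job name.
import boto3

glue = boto3.client("glue")
glue.start_job_run(
    JobName="my-glue-job",
    Arguments={
        "--extra-jars": "https://github.com/awslabs/aws-athena-query-federation-dsv2/releases/download/some_version/athena-dynamodb-dsv2-some_version.jar",
    },
)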
Query the connector on Spark
To submit the equivalent of your existing Athena federated query on Apache Spark, use
the spark.sql() function. For example, suppose you have the following Athena
query that you want to use on Apache Spark.
SELECT somecola, somecolb, somecolc
FROM ddb_datasource.some_schema_or_glue_database.some_ddb_or_glue_table
WHERE somecola > 1
To perform the same query on Spark using the Amazon Athena DynamoDB DSV2 connector, use the following code:
dynamoDf = (spark.read
    .option("athena.connectors.schema", "some_schema_or_glue_database")
    .option("athena.connectors.table", "some_ddb_or_glue_table")
    .format("com.amazonaws.athena.connectors.dsv2.dynamodb.DDBTableProvider")
    .load())

dynamoDf.createOrReplaceTempView("ddb_spark_table")

spark.sql('''
SELECT somecola, somecolb, somecolc
FROM ddb_spark_table
WHERE somecola > 1
''')
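The spark.sql() call returns a standard Spark DataFrame, so you can keep working with the result through the usual DataFrame API. For example, capture the result and print the first rows:

# Capture the query result as a DataFrame and display the first 10 rows.
result_df = spark.sql('''
    SELECT somecola, somecolb, somecolc
    FROM ddb_spark_table
    WHERE somecola > 1
''')
result_df.show(10)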
Specify parameters
The DSV2 versions of the Athena data source connectors use the same parameters as the corresponding Athena data source connectors. For parameter information, refer to the documentation for the corresponding Athena data source connector.
In your PySpark code, use the following syntax to configure your parameters.
spark.read.option("athena.connectors.conf.parameter", "value")
For example, the following code sets the Amazon Athena DynamoDB connector
disable_projection_and_casing parameter to always.
dynamoDf = (spark.read
    .option("athena.connectors.schema", "some_schema_or_glue_database")
    .option("athena.connectors.table", "some_ddb_or_glue_table")
    .option("athena.connectors.conf.disable_projection_and_casing", "always")
    .format("com.amazonaws.athena.connectors.dsv2.dynamodb.DDBTableProvider")
    .load())