

# Use Apache Spark in Amazon Athena
<a name="notebooks-spark"></a>

Amazon Athena makes it easy to interactively run data analytics and exploration using Apache Spark without the need to plan for, configure, or manage resources. Running Apache Spark applications on Athena means submitting Spark code for processing and receiving the results directly without the need for additional configuration. Apache Spark on Amazon Athena is serverless and provides automatic, on-demand scaling that delivers instant-on compute to meet changing data volumes and processing requirements.

In release version [PySpark engine version 3](notebooks-spark-release-versions.md#notebooks-spark-release-versions-pyspark-3), you can use the simplified notebook experience in the Amazon Athena console to develop Apache Spark applications using Python or the Athena notebook APIs.

In release version [Apache Spark version 3.5](notebooks-spark-release-versions.md#notebooks-spark-release-versions-spark-35), you can run Spark code from Amazon SageMaker Unified Studio notebooks or your preferred Spark Connect compatible clients.

Amazon Athena offers the following features:
+ **Console usage** – Submit your Spark applications from the Amazon Athena console (PySpark engine version 3 only).
+ **Scripting** – Quickly and interactively build and debug Apache Spark applications in Python.
+ **Dynamic scaling** – Amazon Athena automatically determines the compute and memory resources needed to run a job and continuously scales those resources accordingly up to the maximums that you specify. This dynamic scaling reduces cost without affecting speed.
+ **Notebook experience** – Use Amazon SageMaker Unified Studio notebooks to create, edit, and run computations using a familiar interface. In PySpark engine version 3, you can use Athena in-console notebooks that are compatible with Jupyter notebooks and contain a list of cells that are executed in order as calculations. Cell content can include code, text, Markdown, mathematics, plots, and rich media.

For additional information, see [Run Spark SQL on Amazon Athena Spark](https://aws.amazon.com/blogs/big-data/run-spark-sql-on-amazon-athena-spark/) and [Explore your data lake using Amazon Athena for Apache Spark](https://aws.amazon.com/blogs/big-data/explore-your-data-lake-using-amazon-athena-for-apache-spark/) in the *AWS Big Data Blog*. 

**Topics**
+ [Release versions](notebooks-spark-release-versions.md)
+ [Considerations and limitations](notebooks-spark-considerations-and-limitations.md)
+ [Get started](notebooks-spark-getting-started.md)
+ [Manage notebook files](notebooks-spark-managing.md)
+ [Notebook editor](notebooks-spark-editor.md)
+ [Non-Hive table formats](notebooks-spark-table-formats.md)
+ [Python library support](notebooks-spark-python-library-support.md)
+ [Specify custom configuration](notebooks-spark-custom-jar-cfg.md)
+ [Supported data and storage formats](notebooks-spark-data-and-storage-formats.md)
+ [Monitor Apache Spark](notebooks-spark-metrics.md)
+ [Cost attribution](notebooks-spark-cost-attribution.md)
+ [Logging and monitoring](notebooks-spark-logging-monitoring.md)
+ [Spark UI access](notebooks-spark-ui-access.md)
+ [Spark Connect](notebooks-spark-connect.md)
+ [Enable requester pays buckets](notebooks-spark-requester-pays.md)
+ [Lake Formation integration](notebooks-spark-lakeformation.md)
+ [Enable Spark encryption](notebooks-spark-encryption.md)
+ [Cross-account catalog access](spark-notebooks-cross-account-glue.md)
+ [Service quotas](notebooks-spark-quotas.md)
+ [Athena Spark APIs](notebooks-spark-api-list.md)
+ [Troubleshoot](notebooks-spark-troubleshooting.md)

# Release versions
<a name="notebooks-spark-release-versions"></a>

Amazon Athena for Apache Spark offers the following release versions:

## PySpark engine version 3
<a name="notebooks-spark-release-versions-pyspark-3"></a>

PySpark engine version 3 includes Apache Spark version 3.2.1. With this version, you can execute Spark code in Athena in-console notebooks.

## Apache Spark version 3.5
<a name="notebooks-spark-release-versions-spark-35"></a>

Apache Spark version 3.5 is based on Amazon EMR 7.12 and packages Apache Spark version 3.5.6. With this version, you can run Spark code from Amazon SageMaker Unified Studio notebooks or your preferred Spark Connect compatible clients. This version adds key features that deliver an improved experience for interactive workloads:
+ **Secure Spark Connect** – Adds Spark Connect as an authenticated and authorized AWS Endpoint.
+ **Session level cost attribution** – Users can track the costs per interactive session in AWS Cost Explorer or Cost and Usage reports. For more information, see [Session level cost attribution](notebooks-spark-cost-attribution.md).
+ **Advanced debugging capabilities** – Adds live Spark UI and Spark History Server support for debugging workloads both from the APIs as well as from notebooks. For more information, see [Accessing the Spark UI](notebooks-spark-ui-access.md#notebooks-spark-ui-access-methods).
+ **Unfiltered access support** – Access protected AWS Glue Data Catalog tables where you have full table permissions. For more information, see [Using Lake Formation with Athena Spark workgroups](notebooks-spark-lakeformation.md).

### Spark default properties
<a name="notebooks-spark-release-versions-spark-35-default-properties"></a>

The following table lists the Spark properties and their default values that are applied to Athena Spark Connect sessions.


| Key | Default value | Description | 
| --- | --- | --- | 
|  `spark.app.id`  |  `<Athena SessionId>`  |  This is not modifiable.  | 
|  `spark.app.name`  |  `default`  |    | 
|  `spark.driver.cores`  |  `4`  |  The number of cores the driver uses. This is not modifiable during the initial launch.  | 
|  `spark.driver.memory`  |  `10g`  |  Amount of memory that each driver uses. This is not modifiable during the initial launch.  | 
|  `spark.driver.memoryOverhead`  |  `6g`  |  Amount of memory overhead assigned for Python workloads and other processes running on driver. This is not modifiable during the initial launch.  | 
|  `spark.cortex.driver.disk`  |  `64g`  |  The Spark driver disk. This is not modifiable during the initial launch.  | 
|  `spark.executor.cores`  |  `4`  |  The number of cores that each executor uses. This is not modifiable during the initial launch.  | 
|  `spark.executor.memory`  |  `10g`  |  Amount of memory that each executor uses.  | 
|  `spark.executor.memoryOverhead`  |  `6g`  |  Amount of memory overhead assigned for Python workloads and other processes running on executor. This is not modifiable during the initial launch.  | 
|  `spark.cortex.executor.disk`  |  `64g`  |  The Spark executor disk. This is not modifiable during the initial launch.  | 
|  `spark.cortex.executor.architecture`  |  `AARCH_64`  |  Architecture of the executor.  | 
|  `spark.driver.extraJavaOptions`  |  `-Djava.net.preferIPv6Addresses=false -XX:+IgnoreUnrecognizedVMOptions --add-opens=java.base/java.lang=ALL-UNNAMED --add-opens=java.base/java.lang.invoke=ALL-UNNAMED --add-opens=java.base/java.lang.reflect=ALL-UNNAMED --add-opens=java.base/java.io=ALL-UNNAMED --add-opens=java.base/java.net=ALL-UNNAMED --add-opens=java.base/java.nio=ALL-UNNAMED --add-opens=java.base/java.util=ALL-UNNAMED --add-opens=java.base/java.util.concurrent=ALL-UNNAMED --add-opens=java.base/java.util.concurrent.atomic=ALL-UNNAMED --add-opens=java.base/jdk.internal.ref=ALL-UNNAMED --add-opens=java.base/sun.nio.ch=ALL-UNNAMED --add-opens=java.base/sun.nio.cs=ALL-UNNAMED --add-opens=java.base/sun.security.action=ALL-UNNAMED --add-opens=java.base/sun.util.calendar=ALL-UNNAMED --add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED -Djdk.reflect.useDirectMethodHandle=false`  |  Extra Java options for the Spark driver. This is not modifiable during the initial launch.  | 
|  `spark.executor.extraJavaOptions`  |  `-Djava.net.preferIPv6Addresses=false -XX:+IgnoreUnrecognizedVMOptions --add-opens=java.base/java.lang=ALL-UNNAMED --add-opens=java.base/java.lang.invoke=ALL-UNNAMED --add-opens=java.base/java.lang.reflect=ALL-UNNAMED --add-opens=java.base/java.io=ALL-UNNAMED --add-opens=java.base/java.net=ALL-UNNAMED --add-opens=java.base/java.nio=ALL-UNNAMED --add-opens=java.base/java.util=ALL-UNNAMED --add-opens=java.base/java.util.concurrent=ALL-UNNAMED --add-opens=java.base/java.util.concurrent.atomic=ALL-UNNAMED --add-opens=java.base/jdk.internal.ref=ALL-UNNAMED --add-opens=java.base/sun.nio.ch=ALL-UNNAMED --add-opens=java.base/sun.nio.cs=ALL-UNNAMED --add-opens=java.base/sun.security.action=ALL-UNNAMED --add-opens=java.base/sun.util.calendar=ALL-UNNAMED --add-opens=java.security.jgss/sun.security.krb5=ALL-UNNAMED -Djdk.reflect.useDirectMethodHandle=false`  |  Extra Java options for the Spark executor. This is not modifiable during the initial launch.  | 
|  `spark.executor.instances`  |  `1`  |  The number of Spark executor containers to allocate.  | 
|  `spark.dynamicAllocation.enabled`  |  `TRUE`  |  Option that turns on dynamic resource allocation. This option scales up or down the number of executors registered with the application, based on the workload.  | 
|  `spark.dynamicAllocation.minExecutors`  |  `0`  |  The lower bound for the number of executors if you turn on dynamic allocation.  | 
|  `spark.dynamicAllocation.maxExecutors`  |  `59`  |  The upper bound for the number of executors if you turn on dynamic allocation.  | 
|  `spark.dynamicAllocation.initialExecutors`  |  `1`  |  The initial number of executors to run if you turn on dynamic allocation.  | 
|  `spark.dynamicAllocation.executorIdleTimeout`  |  `60s`  |  The length of time that an executor can remain idle before Spark removes it. This only applies if you turn on dynamic allocation.  | 
|  `spark.dynamicAllocation.shuffleTracking.enabled`  |  `TRUE`  |  Dynamic resource allocation requires shuffle tracking to be enabled.  | 
|  `spark.dynamicAllocation.sustainedSchedulerBacklogTimeout`  |  `1s`  |  Defines how long the Spark scheduler must observe a sustained backlog of pending tasks before it requests new executors from the cluster manager.  | 
|  `spark.sql.catalogImplementation`  |  `hive`  |    | 
|  `spark.hadoop.hive.metastore.client.factory.class`  |  `com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory`  |  The AWS Glue metastore implementation class.  | 
|  `spark.hadoop.hive.metastore.glue.catalogid`  |  `<accountId>`  |  AWS Glue catalog accountId.  | 
|  `spark.sql.hive.metastore.sharedPrefixes`  |  `software.amazon.awssdk.services.dynamodb`  |  Specifies a comma-separated list of package prefixes for classes that should be loaded by the application ClassLoader rather than the isolated ClassLoader created for the Hive metastore client code.  | 
|  `spark.hadoop.fs.s3.impl`  |  `org.apache.hadoop.fs.s3a.S3AFileSystem`  |  Defines the implementation for the S3 client to use S3A.  | 
|  `spark.hadoop.fs.s3a.impl`  |  `org.apache.hadoop.fs.s3a.S3AFileSystem`  |  Defines the implementation for the S3A client (S3A).  | 
|  `spark.hadoop.fs.s3n.impl`  |  `org.apache.hadoop.fs.s3a.S3AFileSystem`  |  Defines the implementation for the Native S3 client (S3N) to use S3A.  | 
|  `spark.hadoop.fs.AbstractFileSystem.s3.impl`  |  `org.apache.hadoop.fs.s3a.S3A`  |    | 
|  `spark.hadoop.fs.s3a.aws.credentials.provider`  |  `software.amazon.awssdk.auth.credentials.DefaultCredentialsProvider`  |    | 
|  `spark.hadoop.fs.s3.customAWSCredentialsProvider`  |  `com.amazonaws.auth.DefaultAWSCredentialsProviderChain`  |    | 
|  `spark.hadoop.mapreduce.output.fs.optimized.committer.enabled`  |  `TRUE`  |  This property enables an optimized commit protocol for Spark jobs when writing data to Amazon S3. When set to true, it helps Spark avoid costly file rename operations, resulting in faster and more reliable atomic writes compared to the default Hadoop committer.  | 
|  `spark.hadoop.fs.s3a.endpoint.region`  |  `<REGION>`  |  This configuration explicitly sets the AWS region for the Amazon S3 bucket accessed via the S3A client.  | 
|  `spark.hadoop.fs.s3.getObject.initialSocketTimeoutMilliseconds`  |  `2000`  |  This specifies the socket connection timeout in milliseconds.  | 
|  `spark.hadoop.fs.s3a.committer.magic.enabled`  |  `TRUE`  |  This enables the S3A "Magic" Committer, a highly performant but specific commit protocol that relies on the underlying cluster manager's support for special paths.  | 
|  `spark.hadoop.fs.s3a.committer.magic.track.commits.in.memory.enabled`  |  `TRUE`  |  Relevant only when the Magic Committer is enabled, this specifies whether the list of files committed by a task should be tracked in memory instead of being written to temporary disk files.  | 
|  `spark.hadoop.fs.s3a.committer.name`  |  `magicv2`  |  This setting explicitly selects the specific S3A Output Committer algorithm to be used (e.g., directory, partitioned, or magic). By specifying the name, you choose the strategy that manages temporary data, handles task failures, and performs the final atomic commit to the target Amazon S3 path.  | 
|  `spark.hadoop.fs.s3.s3AccessGrants.enabled`  |  `FALSE`  |  Enables support for Amazon S3 Access Grants when accessing Amazon S3 data through the S3A/EMRFS filesystem client.  | 
|  `spark.hadoop.fs.s3.s3AccessGrants.fallbackToIAM`  |  `FALSE`  |  When Amazon S3 Access Grants are enabled, this property controls whether the Amazon S3 client should fall back to traditional IAM credentials if the Access Grants lookup fails or does not provide sufficient permissions.  | 
|  `spark.pyspark.driver.python`  |  `/usr/bin/python3.11`  |  Python path for driver.  | 
|  `spark.pyspark.python`  |  `/usr/bin/python3.11`  |  Python path for executor.  | 
|  `spark.python.use.daemon`  |  `TRUE`  |  This configuration controls whether Spark utilizes a Python worker daemon process on each executor. When enabled (true, the default), the executor keeps the Python worker alive between tasks to avoid the overhead of repeatedly launching and initializing a new Python interpreter for every task, significantly improving the performance of PySpark applications.  | 
|  `spark.sql.execution.arrow.pyspark.enabled`  |  `TRUE`  |  Enables the use of Apache Arrow to optimize data transfer between the JVM and Python processes in PySpark.  | 
|  `spark.sql.execution.arrow.pyspark.fallback.enabled`  |  `TRUE`  |  Configuration property that controls Spark's behavior when an error occurs during the data transfer between the JVM and Python using the Apache Arrow optimization.  | 
|  `spark.sql.parquet.fs.optimized.committer.optimization-enabled`  |  `TRUE`  |  Configuration property that controls whether Spark uses an optimized file committer when writing Parquet files to certain file systems, specifically cloud storage systems like Amazon S3.  | 
|  `spark.sql.parquet.output.committer.class`  |  `com.amazon.emr.committer.EmrOptimizedSparkSqlParquetOutputCommitter`  |  Spark configuration property that specifies the fully qualified class name of the Hadoop OutputCommitter to be used when writing Parquet files.  | 
|  `spark.resourceManager.cleanupExpiredHost`  |  `TRUE`  |  This property controls whether the Driver actively cleans up Spark application resources associated with executors that were running on nodes that have been deleted or expired.  | 
|  `spark.blacklist.decommissioning.enabled`  |  `TRUE`  |  Property enables Spark's logic to automatically blacklist executors that are currently undergoing decommissioning (graceful shutdown) by the cluster manager. This prevents the scheduler from sending new tasks to executors that are about to exit, improving job stability during resource scaling down.  | 
|  `spark.blacklist.decommissioning.timeout`  |  `1h`  |  Maximum time Spark will wait for a task to be successfully migrated off a decommissioning executor before blacklisting the host.  | 
|  `spark.stage.attempt.ignoreOnDecommissionFetchFailure`  |  `TRUE`  |  Tells Spark to be lenient and not fail an entire stage attempt if a fetch failure occurs when reading shuffle data from a decommissioning executor. The fetch failure is considered recoverable, and Spark will re-fetch the data from a different location (potentially requiring re-computation), prioritizing job completion over strict error handling during graceful shutdowns.  | 
|  `spark.decommissioning.timeout.threshold`  |  `20`  |  This property is typically used internally or in specific cluster manager setups to define the maximum total duration Spark expects a host's decommissioning process to take. If the actual decommissioning time exceeds this threshold, Spark may take aggressive action, like blacklisting the host or requesting forced termination, to free up the resource.  | 
|  `spark.files.fetchFailure.unRegisterOutputOnHost`  |  `TRUE`  |  When a task fails to fetch shuffle or RDD data from a specific host, setting this to true instructs Spark to unregister all output blocks associated with the failing application on that host. This prevents future tasks from attempting to fetch data from the unreliable host, forcing Spark to re-calculate the necessary blocks elsewhere and increasing job robustness against intermittent network issues.  | 
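When you create a Spark Connect session, you can override the modifiable defaults above; properties the table marks as not modifiable keep their defaults for the life of the session. The following is a minimal client-side sketch of screening overrides before submitting them. The `NON_MODIFIABLE` set is an illustrative subset drawn from the table, not an exhaustive list:

```python
# Sketch: screening Spark Connect session property overrides against the
# table above. NON_MODIFIABLE is an illustrative subset of the properties
# the table marks as not modifiable during the initial launch.
NON_MODIFIABLE = {
    "spark.app.id",
    "spark.driver.cores",
    "spark.driver.memory",
    "spark.driver.memoryOverhead",
    "spark.executor.cores",
    "spark.executor.memoryOverhead",
}

def validate_overrides(overrides):
    """Reject any override that targets a non-modifiable property."""
    rejected = sorted(k for k in overrides if k in NON_MODIFIABLE)
    if rejected:
        raise ValueError(f"not modifiable during the session: {rejected}")
    return overrides

# Raising the dynamic allocation ceiling is allowed, for example:
conf = validate_overrides({"spark.dynamicAllocation.maxExecutors": "40"})
print(conf)
```

You would then pass the validated properties as `config` entries when building your Spark Connect session.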

# Considerations and limitations
<a name="notebooks-spark-considerations-and-limitations"></a>

## Apache Spark version 3.5
<a name="notebooks-spark-considerations-spark-35"></a>

The following are the considerations and limitations for the release version Apache Spark version 3.5:
+ This release version is available in the following AWS Regions:
  + Asia Pacific (Mumbai)
  + Asia Pacific (Seoul)
  + Asia Pacific (Singapore)
  + Asia Pacific (Sydney)
  + Asia Pacific (Tokyo)
  + Canada (Central)
  + Europe (Frankfurt)
  + Europe (Ireland)
  + Europe (London)
  + Europe (Paris)
  + Europe (Stockholm)
  + South America (São Paulo)
  + US East (N. Virginia)
  + US East (Ohio)
  + US West (Oregon)
+ This engine version does not support Athena in-console notebooks or notebook APIs. Instead, this version is integrated with Amazon SageMaker Unified Studio notebooks. You can also use compatible Spark Connect clients.
+ The calculation APIs (`StartCalculationExecution`, `ListCalculationExecutions`, and `GetCalculationExecution`) are not supported in this release.
+ You cannot upgrade a workgroup from PySpark engine version 3 to Apache Spark version 3.5.

## PySpark engine version 3
<a name="notebooks-spark-considerations-pyspark-3"></a>

The following are the considerations and limitations for the release version PySpark engine version 3:
+ This release version is available in the following AWS Regions:
  + Asia Pacific (Mumbai)
  + Asia Pacific (Singapore)
  + Asia Pacific (Sydney)
  + Asia Pacific (Tokyo)
  + Europe (Frankfurt)
  + Europe (Ireland)
  + US East (N. Virginia)
  + US East (Ohio)
  + US West (Oregon)
+ AWS Lake Formation is not supported.
+ Tables that use partition projection are not supported.
+ Apache Spark enabled workgroups can use the Athena notebook editor, but not the Athena query editor. Only Athena SQL workgroups can use the Athena query editor.
+ Cross-engine view queries are not supported. Views created by Athena SQL are not queryable by Athena for Spark. Because views for the two engines are implemented differently, they are not compatible for cross-engine use.
+ MLlib (Apache Spark machine learning library) and the `pyspark.ml` package are not supported. For a list of supported Python libraries, see the [List of preinstalled Python libraries](notebooks-spark-preinstalled-python-libraries.md).
+ Currently, `pip install` is not supported in Athena for Spark sessions. 
+ Only one active session per notebook is allowed. 
+ When multiple users use the console to open an existing session in a workgroup, they access the same notebook. To avoid confusion, only open sessions that you create yourself.
+ The hosting domains for Apache Spark applications that you might use with Amazon Athena (for example, `analytics-gateway.us-east-1.amazonaws.com`) are registered in the internet [Public Suffix List (PSL)](https://publicsuffix.org/list/public_suffix_list.dat). If you ever need to set sensitive cookies in your domains, we recommend that you use cookies with a `__Host-` prefix to help defend your domain against cross-site request forgery (CSRF) attempts. For more information, see the [Set-Cookie](https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Set-Cookie#cookie_prefixes) page in the Mozilla.org developer documentation.
+ For information on troubleshooting Spark notebooks, sessions, and workgroups in Athena, see [Troubleshoot Athena for Spark](notebooks-spark-troubleshooting.md).
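A `__Host-` prefixed cookie, as recommended above, is only valid when it is set with the `Secure` attribute, scoped to `Path=/`, and sent without a `Domain` attribute. The following is a minimal sketch using Python's standard library; the cookie name and value are placeholders:

```python
from http.cookies import SimpleCookie

# A __Host- prefixed cookie must be marked Secure, scoped to Path=/,
# and must not carry a Domain attribute.
cookie = SimpleCookie()
cookie["__Host-session"] = "opaque-token"  # placeholder value
cookie["__Host-session"]["secure"] = True
cookie["__Host-session"]["path"] = "/"

header = cookie.output()
print(header)
```

Browsers reject `__Host-` cookies that omit `Secure` or `Path=/`, or that set a `Domain`, which is what makes the prefix useful as a CSRF defense.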

# Get started with Apache Spark on Amazon Athena
<a name="notebooks-spark-getting-started"></a>

**Note**  
For the release version Apache Spark version 3.5, follow the getting started guide in [SageMaker Notebooks](https://docs.aws.amazon.com/sagemaker-unified-studio/latest/userguide/what-is-sagemaker-unified-studio.html). Use this guide for the release version PySpark engine version 3.

To get started with Apache Spark on Amazon Athena, you must first create a Spark enabled workgroup. After you switch to the workgroup, you can create a notebook or open an existing notebook. When you open a notebook in Athena, a new session is started for it automatically and you can work with it directly in the Athena notebook editor.

**Note**  
Make sure that you create a Spark enabled workgroup before you attempt to create a notebook.

## Step 1: Create a Spark enabled workgroup in Athena
<a name="notebooks-spark-getting-started-creating-a-spark-enabled-workgroup"></a>

You can use [workgroups](workgroups-manage-queries-control-costs.md) in Athena to group users, teams, applications, or workloads, and to track costs. To use Apache Spark in Amazon Athena, you create an Amazon Athena workgroup that uses a Spark engine.

**Note**  
Apache Spark enabled workgroups can use the Athena notebook editor, but not the Athena query editor. Only Athena SQL workgroups can use the Athena query editor.

**To create a Spark enabled workgroup in Athena**

1. Open the Athena console at [https://console.aws.amazon.com/athena/](https://console.aws.amazon.com/athena/home).

1. If the console navigation pane is not visible, choose the expansion menu on the left.  
![\[Choose the expansion menu.\]](http://docs.aws.amazon.com/athena/latest/ug/images/nav-pane-expansion.png)

1. In the navigation pane, choose **Workgroups**.

1. On the **Workgroups** page, choose **Create workgroup**.

1. For **Workgroup name**, enter a name for your Apache Spark workgroup.

1. (Optional) For **Description**, enter a description for your workgroup.

1. For **Analytics engine**, choose **Apache Spark**.
**Note**  
After you create a workgroup, the workgroup's type of analytics engine cannot be changed. For example, an Athena engine version 3 workgroup cannot be changed to a PySpark engine version 3 workgroup. 

1. For the purposes of this tutorial, select **Turn on example notebook**. This optional feature adds an example notebook with the name `example-notebook-random_string` to your workgroup and adds AWS Glue-related permissions that the notebook uses to create, show, and delete specific databases and tables in your account, and read permissions in Amazon S3 for the sample dataset. To see the added permissions, choose **View additional permissions details**.
**Note**  
 Running the example notebook may incur some additional cost. 

1. For **Calculation result settings**, choose from the following options:
   + **Create a new S3 bucket** – This option creates an Amazon S3 bucket in your account for your calculation results. The bucket name has the format `account_id-region-athena-results-bucket-alphanumeric_id` and uses the settings ACLs disabled, public access blocked, versioning disabled, and bucket owner enforced.
   + **Choose an existing S3 location** – For this option, do the following:
     + Enter the S3 path to an existing location in the search box, or choose **Browse S3** to choose a bucket from a list.
**Note**  
When you select an existing location in Amazon S3, do not append a forward slash (`/`) to the location. Doing so causes the link to the calculation results location on the [calculation details page](#notebooks-spark-getting-started-viewing-session-and-calculation-details) to point to the incorrect directory. If this occurs, edit the workgroup's results location to remove the trailing forward slash. 
     + (Optional) Choose **View** to open the **Buckets** page of the Amazon S3 console where you can view more information about the existing bucket that you chose.
     + (Optional) For **Expected bucket owner**, enter the AWS account ID that you expect to be the owner of your query results output location bucket. We recommend that you choose this option as an added security measure whenever possible. If the account ID of the bucket owner does not match the ID that you specify, attempts to output to the bucket will fail. For in-depth information, see [Verifying bucket ownership with bucket owner condition](https://docs.aws.amazon.com/AmazonS3/latest/userguide/bucket-owner-condition.html) in the *Amazon S3 User Guide*. 
     + (Optional) Select **Assign bucket owner full control over query results** if your calculation result location is owned by another account and you want to grant full control over your query results to the other account.

1. (Optional) Choose **Encrypt query results** if you want your query results to be encrypted.
   + For **Encryption type**, choose one of the following options:
     + **SSE-S3** – This option uses server-side encryption (SSE) with Amazon S3-managed encryption keys.
     + **SSE-KMS** – This option uses server-side encryption (SSE) with AWS KMS-managed keys. 

       For **Choose an AWS KMS key**, choose one of the following options.
       + **Use AWS owned key** – The AWS KMS key is owned and managed by AWS. You are not charged an additional fee for using this key.
       + **Choose a different AWS KMS key (advanced)** – For this option, do one of the following:
         + To use an existing key, use the search box to choose an AWS KMS key or enter a key ARN.
         + To create a key in the AWS KMS console, choose **Create an AWS KMS key**. Your execution role must have permission to use the key that you create. After you finish creating the key in the KMS console, return to the **Create workgroup** page in the Athena console, and then use the **Choose an AWS KMS key or enter an ARN** search box to choose the key that you just created.
**Important**  
When you change the [AWS KMS key](https://docs.aws.amazon.com/kms/latest/developerguide/concepts.html) for a workgroup, notebooks managed before the update still reference the old KMS key. Notebooks managed after the update use the new KMS key. To update the old notebooks to reference the new KMS key, export and then import each of the old notebooks. If you delete the old KMS key before you update the old notebook references to the new KMS key, the old notebooks are no longer decryptable and cannot be recovered.  
This behavior also applies to updates to [aliases](https://docs.aws.amazon.com/kms/latest/developerguide/kms-alias.html), which are friendly names for KMS keys. When you update a KMS key alias to point to a new KMS key, notebooks managed before the alias update still reference the old KMS key, and notebooks managed after the alias update use the new KMS key. Consider these points before updating your KMS keys or aliases. 

1. For **Additional configurations**, choose **Use defaults**. This option helps you get started with your Spark-enabled workgroup. When you use the defaults, Athena creates an IAM role and calculation results location in Amazon S3 for you. The name of the IAM role and the S3 bucket location to be created are displayed in the box below the **Additional configurations** heading.

   If you do not want to use the defaults, continue with the steps in the [(Optional) Specify your own workgroup configurations](#notebooks-spark-getting-started-workgroup-configuration) section to configure your workgroup manually.

1. (Optional) **Tags** – Use this option to add tags to your workgroup. For more information, see [Tag Athena resources](tags.md).

1. Choose **Create workgroup**. A message informs you that the workgroup was created successfully, and the workgroup shows in the list of workgroups.
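The note above about not appending a forward slash to an existing Amazon S3 location can be enforced programmatically if you build calculation results locations in scripts. The following is a small sketch; the bucket name and prefix are placeholders:

```python
# Sketch: normalizing a calculation results location so it carries no
# trailing slash, per the note above about existing Amazon S3 locations.
def normalize_results_location(s3_uri):
    """Strip trailing slashes from an s3:// results location."""
    if not s3_uri.startswith("s3://"):
        raise ValueError("expected an s3:// URI")
    return s3_uri.rstrip("/")

# The bucket and prefix below are placeholders.
print(normalize_results_location("s3://amzn-s3-demo-bucket/athena-results/"))
```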

### (Optional) Specify your own workgroup configurations
<a name="notebooks-spark-getting-started-workgroup-configuration"></a>

If you want to specify your own IAM role and calculation results location for your notebook, follow the steps in this section. If you chose **Use defaults** for the **Additional configurations** option, skip this section and go directly to [Step 2: Open notebook explorer and switch workgroups](#notebooks-spark-getting-started-switching-workgroups-and-opening-notebook-explorer).

The following procedure assumes you have completed steps 1 to 9 of the **To create a Spark enabled workgroup in Athena** procedure in the previous section.

**To specify your own workgroup configurations**

1. If you want to create or use your own IAM role or configure notebook encryption, expand **IAM role configuration**.
   + For **Service Role to authorize Athena**, choose one of the following:
     + **Create and use a new service role** – Choose this option to have Athena create a service role for you. To see the permissions the role grants, choose **View permission details**.
     + **Use an existing service role** – From the drop down menu, choose an existing role. The role that you choose must include the permissions in the first option. For more information about permissions for notebook-enabled workgroups, see [Troubleshoot Spark-enabled workgroups](notebooks-spark-troubleshooting-workgroups.md).
   + For **Notebook and calculation code encryption key management**, choose one of the following options:
     + **Encrypt using AWS owned key (Default)** – The AWS KMS key is owned and managed by AWS. You are not charged an additional fee for using this key.
     + **Encrypt using your own AWS KMS key** – For this option, do one of the following:
        + To use an existing key, use the search box to choose an AWS KMS key or enter a key ARN.
        + To create a key in the AWS KMS console, choose **Create an AWS KMS key**. Your execution role must have permission to use the key that you create. After you finish creating the key in the KMS console, return to the **Create workgroup** page in the Athena console, and then use the **Choose an AWS KMS key or enter an ARN** search box to choose the key that you just created.
**Important**  
When you change the [AWS KMS key](https://docs.aws.amazon.com/kms/latest/developerguide/concepts.html) for a workgroup, notebooks managed before the update still reference the old KMS key. Notebooks managed after the update use the new KMS key. To update the old notebooks to reference the new KMS key, export and then import each of the old notebooks. If you delete the old KMS key before you update the old notebook references to the new KMS key, the old notebooks are no longer decryptable and cannot be recovered.  
This behavior also applies to updates to [aliases](https://docs.aws.amazon.com/kms/latest/developerguide/kms-alias.html), which are friendly names for KMS keys. When you update a KMS key alias to point to a new KMS key, notebooks managed before the alias update still reference the old KMS key, and notebooks managed after the alias update use the new KMS key. Consider these points before updating your KMS keys or aliases. 

1. <a name="notebook-gs-metrics"></a>(Optional) **Other settings** – Expand this option to enable or disable the **Publish CloudWatch metrics** option for the workgroup. This field is selected by default. For more information, see [Monitor Apache Spark with CloudWatch metrics](notebooks-spark-metrics.md).

1. (Optional) **Tags** – Use this option to add tags to your workgroup. For more information, see [Tag Athena resources](tags.md).

1. Choose **Create workgroup**. A message informs you that the workgroup was created successfully, and the workgroup shows in the list of workgroups.

## Step 2: Open notebook explorer and switch workgroups
<a name="notebooks-spark-getting-started-switching-workgroups-and-opening-notebook-explorer"></a>

Before you can use the Spark enabled workgroup that you just created, you must switch to the workgroup. To switch Spark enabled workgroups, you can use the **Workgroup** option in Notebook explorer or Notebook editor.

**Note**  
Before you start, check that your browser does not block third-party cookies. Any browser that blocks third-party cookies either by default or as a user-enabled setting will prevent notebooks from launching. For more information on managing cookies, see:  
[Chrome](https://support.alertlogic.com/hc/en-us/articles/360018127132-Turn-Off-Block-Third-Party-Cookies-in-Chrome-for-Windows)
[Firefox](https://support.mozilla.org/en-US/kb/third-party-cookies-firefox-tracking-protection)
[Safari](https://support.apple.com/guide/safari/manage-cookies-sfri11471/mac)

**To open notebook explorer and switch workgroups**

1. In the navigation pane, choose **Notebook explorer**.

1. Use the **Workgroup** option on the upper right of the console to choose the Spark enabled workgroup that you created. The example notebook is shown in the list of notebooks.

   You can use the notebook explorer in the following ways:
   + Choose the linked name of a notebook to open the notebook in a new session.
   + To rename, delete, or export your notebook, use the **Actions** menu.
   + To import a notebook file, choose **Import file**.
   + To create a notebook, choose **Create notebook**.

## Step 3: Run the example notebook
<a name="notebooks-spark-getting-started-running-the-example-notebook"></a>

The sample notebook queries data from a publicly available New York City taxi trip dataset. The notebook has examples that show how to work with Spark DataFrames, Spark SQL, and the AWS Glue Data Catalog.

**To run the example notebook**

1. In Notebook explorer, choose the linked name of the example notebook.

   This starts a notebook session with the default parameters and opens the notebook in the notebook editor. A message informs you that a new Apache Spark session has been started using the default parameters (20 maximum DPUs).

1. To run the cells in order and observe the results, choose the **Run** button once for each cell in the notebook. 
   + Scroll down to see the results and bring new cells into view.
   + For the cells that have a calculation, a progress bar shows the percentage completed, the time elapsed, and the time remaining.
   + The example notebook creates a sample database and table in your account. The final cell removes these as a clean-up step.

**Note**  
If you change folder, table, or database names in the example notebook, make sure those changes are reflected in the IAM roles that you use. Otherwise, the notebook can fail to run due to insufficient permissions. 

## Step 4: Edit session details
<a name="notebooks-spark-getting-started-editing-session-details"></a>

After you start a notebook session, you can edit session details like table format, encryption, session idle timeout, and the maximum concurrent number of data processing units (DPUs) that you want to use. A DPU is a relative measure of processing power that consists of 4 vCPUs of compute capacity and 16 GB of memory.
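Because the DPU definition is fixed, you can translate a session's DPU settings into concrete capacity. A small illustrative helper (not part of any Athena SDK) based on the definition above:

```python
DPU_VCPUS = 4        # each DPU provides 4 vCPUs of compute capacity
DPU_MEMORY_GB = 16   # and 16 GB of memory

def session_capacity(max_concurrent_dpus):
    """Return the peak (vCPUs, memory in GB) a session can scale to."""
    return (max_concurrent_dpus * DPU_VCPUS, max_concurrent_dpus * DPU_MEMORY_GB)

# The default maximum of 20 DPUs corresponds to 80 vCPUs and 320 GB of memory.
```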

**To edit session details**

1. In the notebook editor, from the **Session** menu on the upper right, choose **Edit session**.

1. In the **Edit session details** dialog box, in the **Spark properties** section, choose or enter values for the following options:
   + **Additional table format** – Choose **Linux Foundation Delta Lake**, **Apache Hudi**, **Apache Iceberg**, or **Custom**.
     + For the **Delta**, **Hudi**, or **Iceberg** table options, the required table properties for the corresponding table format are automatically provided for you in the **Edit in table** and **Edit in JSON** options. For more information about using these table formats, see [Use non-Hive table formats in Athena for Spark](notebooks-spark-table-formats.md).
     + To add or remove table properties for the **Custom** or other table types, use the **Edit in table** and **Edit in JSON** options.
     + For the **Edit in table** option, choose **Add property** to add a property, or **Remove** to remove a property. To enter property names and their values, use the **Key** and **Value** boxes.
     + For the **Edit in JSON** option, use the JSON text editor to edit the configuration directly.
       + To copy the JSON text to the clipboard, choose **Copy**.
       + To remove all text from the JSON editor, choose **Clear**.
       + To configure line wrapping or choose a color theme for the JSON editor, choose the settings (gear) icon.
   + **Turn on Spark encryption** – Select this option to encrypt data that is written to disk and sent through Spark network nodes. For more information, see [Enable Apache Spark encryption](notebooks-spark-encryption.md).
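For example, when you choose **Apache Iceberg**, the **Edit in JSON** view contains Spark properties along these lines (shown for illustration only; the console supplies the exact property set, and the values here follow standard Iceberg-on-Spark configuration with the AWS Glue catalog):

```json
{
  "spark.sql.catalog.spark_catalog": "org.apache.iceberg.spark.SparkSessionCatalog",
  "spark.sql.catalog.spark_catalog.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog",
  "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions"
}
```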

1. In the **Session parameters** section, choose or enter values for the following options:
   + **Session idle timeout** - Choose or enter a value between 1 and 480 minutes. The default is 20.
   + **Coordinator size** - A *coordinator* is a special executor that orchestrates processing work and manages other executors in a notebook session. Currently, 1 DPU is the default and only possible value.
   + **Executor size** - An *executor* is the smallest unit of compute that a notebook session can request from Athena. Currently, 1 DPU is the default and only possible value.
   + **Max concurrent value** - The maximum number of DPUs that can run concurrently. The default is 20, the minimum is 3, and the maximum is 60. Increasing this value does not automatically allocate additional resources, but Athena will attempt to allocate up to the maximum specified when the compute load requires it and when resources are available.
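These session parameters can also be set programmatically through the Athena `StartSession` API. The helper below only builds the `EngineConfiguration` structure; the commented-out call sketches intended usage (the workgroup name is a placeholder, and `boto3` plus AWS credentials are required to actually run it):

```python
def engine_configuration(max_dpus=20, coordinator_dpus=1, executor_dpus=1):
    """Build an EngineConfiguration dict for the Athena StartSession API."""
    return {
        "CoordinatorDpuSize": coordinator_dpus,
        "DefaultExecutorDpuSize": executor_dpus,
        "MaxConcurrentDpus": max_dpus,
    }

# import boto3
# athena = boto3.client("athena")
# resp = athena.start_session(
#     WorkGroup="my-spark-workgroup",                  # placeholder name
#     EngineConfiguration=engine_configuration(max_dpus=60),
#     SessionIdleTimeoutInMinutes=30,
# )
```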

1. Choose **Save**.

1. At the **Confirm edit** prompt, choose **Confirm**.

   Athena saves your notebook and starts a new session with the parameters that you specified. A banner in the notebook editor informs you that a new session has started with the modified parameters.
**Note**  
Athena remembers your session settings for the notebook. If you edit a session's parameters and then terminate the session, Athena uses the session parameters that you configured the next time you start a session for the notebook. 

## Step 5: View session and calculation details
<a name="notebooks-spark-getting-started-viewing-session-and-calculation-details"></a>

After you run the notebook, you can view your session and calculation details.

**To view session and calculation details**

1. From the **Session** menu on the upper right, choose **View details**.
   + The **Current session** tab shows information about the current session, including session ID, creation time, status, and workgroup.
   + The **History** tab lists the session IDs for previous sessions. To view details for a previous session, choose the **History** tab, and then choose a session ID in the list.
   + The **Calculations** section shows a list of calculations that ran in the session.

1. To view the details for a calculation, choose the calculation ID.

1. On the **Calculation details** page, you can do the following:
   + To view the code for the calculation, see the **Code** section.
   + To see the results for the calculation, choose the **Results** tab.
   + To download the results that you see in text format, choose **Download results**.
   + To view information about the calculation results in Amazon S3, choose **View in S3**.

## Step 6: Terminate the session
<a name="notebooks-spark-getting-started-terminating-a-session"></a>

**To end the notebook session**

1. In the notebook editor, from the **Session** menu on the upper right, choose **Terminate**.

1. At the **Confirm session termination** prompt, choose **Confirm**. Your notebook is saved and you are returned to the notebook editor.

**Note**  
Closing a notebook tab in the notebook editor does not by itself terminate the session for an active notebook. If you want to ensure that the session is terminated, use the **Session**, **Terminate** option.

## Step 7: Create your own notebook
<a name="notebooks-spark-getting-started-creating-your-own-notebook"></a>

After you have created a Spark enabled Athena workgroup, you can create your own notebook.

**To create a notebook**

1. If the console navigation pane is not visible, choose the expansion menu on the left.

1. In the Athena console navigation pane, choose **Notebook explorer** or **Notebook editor**.

1. Do one of the following:
   + In **Notebook explorer**, choose **Create notebook**.
   + In **Notebook editor**, choose **Create notebook**, or choose the plus icon (**+**) to add a notebook.

1. In the **Create notebook** dialog box, for **Notebook name**, enter a name.

1. (Optional) Expand **Spark properties**, and then choose or enter values for the following options:
   + **Additional table format** – Choose **Linux Foundation Delta Lake**, **Apache Hudi**, **Apache Iceberg**, or **Custom**.
     + For the **Delta**, **Hudi**, or **Iceberg** table options, the required table properties for the corresponding table format are automatically provided for you in the **Edit in table** and **Edit in JSON** options. For more information about using these table formats, see [Use non-Hive table formats in Athena for Spark](notebooks-spark-table-formats.md).
     + To add or remove table properties for the **Custom** or other table types, use the **Edit in table** and **Edit in JSON** options.
     + For the **Edit in table** option, choose **Add property** to add a property, or **Remove** to remove a property. To enter property names and their values, use the **Key** and **Value** boxes.
     + For the **Edit in JSON** option, use the JSON text editor to edit the configuration directly.
       + To copy the JSON text to the clipboard, choose **Copy**.
       + To remove all text from the JSON editor, choose **Clear**.
       + To configure line wrapping or choose a color theme for the JSON editor, choose the settings (gear) icon.
   + **Turn on Spark encryption** – Select this option to encrypt data that is written to disk and sent through Spark network nodes. For more information, see [Enable Apache Spark encryption](notebooks-spark-encryption.md).

1. (Optional) Expand **Session parameters**, and then choose or enter values for the following options:
   + **Session idle timeout** - Choose or enter a value between 1 and 480 minutes. The default is 20.
   + **Coordinator size** - A *coordinator* is a special executor that orchestrates processing work and manages other executors in a notebook session. Currently, 1 DPU is the default and only possible value. A DPU (data processing unit) is a relative measure of processing power that consists of 4 vCPUs of compute capacity and 16 GB of memory.
   + **Executor size** - An *executor* is the smallest unit of compute that a notebook session can request from Athena. Currently, 1 DPU is the default and only possible value.
   + **Max concurrent value** - The maximum number of DPUs that can run concurrently. The default is 20, the minimum is 3, and the maximum is 60. Increasing this value does not automatically allocate additional resources, but Athena will attempt to allocate up to the maximum specified when the compute load requires it and when resources are available.

1. Choose **Create**. Your notebook opens in a new session in the notebook editor.

For information about managing your notebook files, see [Manage notebook files](notebooks-spark-managing.md).

# Manage notebook files
<a name="notebooks-spark-managing"></a>

**Note**  
The Athena notebook editor is supported in PySpark engine version 3. To use notebooks with Apache Spark version 3.5, see [SageMaker Notebooks](https://docs.aws.amazon.com/sagemaker-unified-studio/latest/userguide/what-is-sagemaker-unified-studio.html).

Besides using the notebook explorer to [create](notebooks-spark-getting-started.md#notebooks-spark-getting-started-creating-your-own-notebook) notebooks, you can also use it to open, rename, delete, export, or import notebooks, or view the session history for a notebook.

**To open a previously created notebook**

1. If the console navigation pane is not visible, choose the expansion menu on the left.

1. In the Athena console navigation pane, choose **Notebook editor** or **Notebook explorer**.

1. Do one of the following:
   + In **Notebook editor**, choose a notebook in the **Recent notebooks** or **Saved notebooks** list. The notebook opens in a new session.
   + In **Notebook explorer**, choose the name of a notebook in the list. The notebook opens in a new session.

**To rename a notebook**

1. [Terminate](notebooks-spark-getting-started.md#notebooks-spark-getting-started-terminating-a-session) any active sessions for the notebook that you want to rename. The notebook's active sessions must be terminated before you can rename the notebook.

1. Open **Notebook explorer**.

1. In the **Notebooks** list, select the option button for the notebook that you want to rename.

1. From the **Actions** menu, choose **Rename**.

1. At the **Rename notebook** prompt, enter the new name, and then choose **Save**. The new notebook name appears in the list of notebooks.

**To delete a notebook**

1. [Terminate](notebooks-spark-getting-started.md#notebooks-spark-getting-started-terminating-a-session) any active sessions for the notebook that you want to delete. The notebook's active sessions must be terminated before you can delete the notebook.

1. Open **Notebook explorer**.

1. In the **Notebooks** list, select the option button for the notebook that you want to delete.

1. From the **Actions** menu, choose **Delete**.

1. At the **Delete notebook?** prompt, enter the name of the notebook, and then choose **Delete** to confirm the deletion. The notebook name is removed from the list of notebooks.

**To export a notebook**

1. Open **Notebook explorer**.

1. In the **Notebooks** list, select the option button for the notebook that you want to export.

1. From the **Actions** menu, choose **Export file**.

**To import a notebook**

1. Open **Notebook explorer**.

1. Choose **Import file**.

1. Browse to the location on your local computer of the file that you want to import, and then choose **Open**. The imported notebook appears in the list of notebooks.

**To view the session history for a notebook**

1. Open **Notebook explorer**.

1. In the **Notebooks** list, select the option button for the notebook whose session history you want to view.

1. From the **Actions** menu, choose **Session history**.

1. On the **History** tab, choose a **Session ID** to view information about the session and its calculations.

# Use the Athena notebook editor
<a name="notebooks-spark-editor"></a>

**Note**  
The Athena notebook editor is supported in PySpark engine version 3. To use notebooks with Apache Spark version 3.5, see [SageMaker Notebooks](https://docs.aws.amazon.com/sagemaker-unified-studio/latest/userguide/what-is-sagemaker-unified-studio.html).

You manage your notebooks in the Athena notebook explorer and edit and run them in sessions using the Athena notebook editor. You can configure DPU usage for your notebook sessions according to your requirements.

When you stop a notebook, you terminate the associated session. All notebook files are saved, but the in-memory state of declared variables, functions, and classes is lost. When you restart the notebook, Athena reloads the notebook files and you can run your code again.

The Athena notebook editor is an interactive environment for writing and running code. The following sections describe the features of the environment.

## Understand notebook sessions and calculations
<a name="notebooks-spark-sessions-and-calculations"></a>

Each notebook is associated with a single Python kernel and runs Python code. A notebook can have one or more cells that contain commands. To run the cells in a notebook, you first create a session for the notebook. Sessions keep track of the variables and state of notebooks. 

Running a cell in a notebook means running a calculation in the current session. Calculations progress the state of the notebook and may perform tasks like reading from Amazon S3 or writing to other data stores. As long as a session is running, calculations use and modify the state that is maintained for the notebook.

When you no longer need the state, you can end a session. When you end a session, the notebook remains, but the variables and other state information are destroyed. If you need to work on multiple projects at the same time, you can create a session for each project, and the sessions will be independent from each other.

Sessions have dedicated compute capacity, measured in DPU. When you create a session, you can assign the session a number of DPUs. Different sessions can have different capacities depending on the requirements of the task.
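Session lifecycle management is also exposed through the Athena API. The following sketch terminates every idle session in a workgroup; the operation names (`ListSessions`, `TerminateSession`) follow the AWS SDK, and the client is passed in as a parameter so the function can be exercised without AWS access:

```python
def terminate_idle_sessions(athena, workgroup):
    """Terminate sessions in the workgroup whose state is IDLE; return their IDs."""
    terminated = []
    resp = athena.list_sessions(WorkGroup=workgroup)
    for session in resp["Sessions"]:
        if session["Status"]["State"] == "IDLE":
            athena.terminate_session(SessionId=session["SessionId"])
            terminated.append(session["SessionId"])
    return terminated
```

In practice you would pass `boto3.client("athena")` and handle pagination via `NextToken` for workgroups with many sessions.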

## Switch between command mode and edit mode
<a name="notebooks-spark-command-mode-vs-edit-mode"></a>

The notebook editor has a modal user interface: an edit mode for entering text into a cell, and a command mode for issuing commands to the editor itself like copy, paste, or run.

To use edit mode and command mode, you can perform the following tasks:
+ To enter edit mode, press **ENTER**, or choose a cell. When a cell is in edit mode, the cell has a green left margin.
+ To enter command mode, press **ESC**, or click outside of a cell. Note that commands typically apply only to the currently selected cell, not to all cells. When the editor is in command mode, the cell has a blue left margin.
+ In command mode, you can use keyboard shortcuts and the menu above the editor, but not enter text into individual cells.
+ To select a cell, choose the cell.
+ To select all cells, press **Ctrl+A** (Windows) or **Cmd+A** (Mac).

## Use actions in the notebook editor menu
<a name="notebooks-spark-notebook-editor-menu"></a>

The icons in the menu at the top of the notebook editor offer the following options:
+ **Save** – Saves the current state of the notebook.
+ **Insert cell below** – Adds a new (empty) cell below the currently selected one.
+ **Cut selected cells** – Removes the selected cell from its current location and copies the cell to memory.
+ **Copy selected cells** – Copies the selected cell to memory.
+ **Paste cells below** – Pastes the copied cell below the current cell.
+ **Move selected cells up** – Moves the current cell up one position.
+ **Move selected cells down** – Moves the current cell down one position.
+ **Run** – Runs the current (selected) cell. The output displays immediately below the current cell.
+ **Run all** – Runs all cells in the notebook. The output for each cell displays immediately below the cell.
+ **Stop (Interrupt the kernel)** – Stops the current notebook by interrupting the kernel.
+ **Format option** – Selects the cell format, which can be one of the following:
  + **Code** – Use for Python code (the default).
  + **Markdown** – Use for entering text in [GitHub-style markdown](https://docs.github.com/en/get-started/writing-on-github) format. To render the markdown, run the cell.
  + **Raw NBConvert** – Use for entering text in unmodified form. Cells marked as **Raw NBConvert** can be converted into a different format like HTML by the Jupyter [nbconvert](https://nbconvert.readthedocs.io/en/latest/usage.html) command line tool.
+ **Heading** – Use to change the heading level of the cell.
+ **Command palette** – Contains Jupyter notebook commands and their keyboard shortcuts. For more information about the keyboard shortcuts, see the sections later in this document.
+ **Session** – Use options in this menu to [view](notebooks-spark-getting-started.md#notebooks-spark-getting-started-viewing-session-and-calculation-details) the details for a session, [edit session parameters](notebooks-spark-getting-started.md#notebooks-spark-getting-started-editing-session-details), or [terminate](notebooks-spark-getting-started.md#notebooks-spark-getting-started-terminating-a-session) the session. 

## Use command mode keyboard shortcuts for productivity
<a name="notebooks-spark-command-mode-keyboard-shortcuts"></a>

The following are some common notebook editor command mode keyboard shortcuts. These shortcuts are available after pressing **ESC** to enter command mode. To see a full list of commands available in the editor, press **ESC+H**.


****  

| Key | Action | 
| --- | --- | 
| 1 - 6 | Change the cell type to markdown and set the heading level to the number typed | 
| a | Create a cell above the current cell | 
| b | Create a cell below the current cell | 
| c | Copy the current cell to memory | 
| d d | Delete the current cell | 
| h | Display the keyboard shortcut help screen | 
| j | Go one cell down | 
| k | Go one cell up | 
| m | Change the current cell format to markdown | 
| r | Change the current cell format to raw | 
| s | Save the notebook | 
| v | Paste memory contents under the current cell | 
| x | Cut the selected cell or cells | 
| y | Change the cell format to code | 
| z | Undo | 
| Ctrl+Enter | Run the current cell and enter command mode | 
| Shift+Enter or Alt+Enter | Run the current cell and create a new cell below the output, and enter the new cell in edit mode | 
| Space | Page down | 
| Shift+Space | Page up | 
| Shift+L | Toggle the visibility of line numbers in cells | 

## Customize command mode shortcuts
<a name="notebooks-spark-editing-command-mode-shortcuts"></a>

The notebook editor has an option to customize command mode keyboard shortcuts.

**To edit command mode shortcuts**

1. From the notebook editor menu, choose the **Command palette**.

1. From the command palette, choose the **Edit command mode keyboard shortcuts** command.

1. Use the **Edit command mode shortcuts** interface to map or remap commands to the keys that you want.

   To see instructions for editing command mode shortcuts, scroll to the bottom of the **Edit command mode shortcuts** screen.

For information about using magic commands in Athena for Apache Spark, see [Use magic commands](notebooks-spark-magics.md).

**Topics**
+ [Understand notebook sessions and calculations](#notebooks-spark-sessions-and-calculations)
+ [Switch between command mode and edit mode](#notebooks-spark-command-mode-vs-edit-mode)
+ [Use actions in the notebook editor menu](#notebooks-spark-notebook-editor-menu)
+ [Use command mode keyboard shortcuts for productivity](#notebooks-spark-command-mode-keyboard-shortcuts)
+ [Customize command mode shortcuts](#notebooks-spark-editing-command-mode-shortcuts)
+ [Use magic commands](notebooks-spark-magics.md)

# Use magic commands
<a name="notebooks-spark-magics"></a>

Magic commands, or magics, are special commands that you can run in a notebook cell. For example, `%env` shows the environment variables in a notebook session. Athena supports the magic functions in IPython 6.0.3. 

This section shows some key magic commands in Athena for Apache Spark.
+  To see a list of magic commands in Athena, run the command **%lsmagic** in a notebook cell. 
+ For information about using magics to create graphs in Athena notebooks, see [Use magics to create data graphs](notebooks-spark-magics-graphs.md).
+ For information about additional magic commands, see [Built-in magic commands](https://ipython.readthedocs.io/en/stable/interactive/magics.html) in the IPython documentation.

**Note**  
Currently, the `%pip` command fails when executed. This is a known issue. 

**Topics**
+ [Cell magics](notebooks-spark-magics-cell-magics.md)
+ [Line magics](notebooks-spark-magics-line-magics.md)
+ [Graph magics](notebooks-spark-magics-graphs.md)

# Use cell magics
<a name="notebooks-spark-magics-cell-magics"></a>

Magics that are written on several lines are preceded by a double percent sign (`%%`) and are called cell magic functions or cell magics.

## %%sql
<a name="notebooks-spark-magics-sql"></a>

This cell magic lets you run SQL statements directly, without wrapping them in a `spark.sql` call. The command also displays the output by implicitly calling `.show()` on the returned DataFrame.

![\[Using %%sql.\]](http://docs.aws.amazon.com/athena/latest/ug/images/notebooks-spark-magics-1.png)


The `%%sql` command automatically truncates column output to a width of 20 characters. Currently, this setting is not configurable. To work around this limitation, use the following full syntax and modify the parameters of the `show` method accordingly. 

```
spark.sql("""YOUR_SQL""").show(n=number, truncate=number, vertical=bool)
```
+ **n** – `int`, optional. The number of rows to show.
+ **truncate** – `bool` or `int`, optional. If `true`, truncates strings longer than 20 characters. When set to a number greater than 1, truncates long strings to the specified length and right aligns cells.
+ **vertical** – `bool`, optional. If `true`, prints output rows vertically (one line per column value).

# Use line magics
<a name="notebooks-spark-magics-line-magics"></a>

Magics that are on a single line are preceded by a percent sign (`%`) and are called line magic functions or line magics.

## %help
<a name="notebooks-spark-magics-help"></a>

Displays descriptions of the available magic commands.

![\[Using %help.\]](http://docs.aws.amazon.com/athena/latest/ug/images/notebooks-spark-magics-2.png)


## %list\_sessions
<a name="notebooks-spark-magics-list_sessions"></a>

Lists the sessions associated with the notebook. The information for each session includes the session ID, session status, and the date and time that the session started and ended.

![\[Using %list_sessions.\]](http://docs.aws.amazon.com/athena/latest/ug/images/notebooks-spark-magics-3.png)


## %session\_id
<a name="notebooks-spark-magics-session_id"></a>

Retrieves the current session ID.

![\[Using session_id.\]](http://docs.aws.amazon.com/athena/latest/ug/images/notebooks-spark-magics-4.png)


## %set\_log\_level
<a name="notebooks-spark-magics-set_log_level"></a>

Sets or resets the logger to use the specified log level. Possible values are `DEBUG`, `ERROR`, `FATAL`, `INFO`, and `WARN` or `WARNING`. Values must be uppercase and must not be enclosed in single or double quotes.
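These constraints are easy to get wrong in a notebook cell. A small illustrative checker (not part of Athena; the function name is hypothetical) for values you plan to pass to `%set_log_level`:

```python
VALID_LOG_LEVELS = {"DEBUG", "ERROR", "FATAL", "INFO", "WARN", "WARNING"}

def is_valid_log_level(value):
    """True if value is an uppercase, unquoted, supported log level."""
    unquoted = not (value.startswith(("'", '"')) or value.endswith(("'", '"')))
    return unquoted and value == value.upper() and value in VALID_LOG_LEVELS

# is_valid_log_level("INFO") is True; "info" and "'INFO'" are rejected.
```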

![\[Using %set_log_level.\]](http://docs.aws.amazon.com/athena/latest/ug/images/notebooks-spark-magics-5.png)


## %status
<a name="notebooks-spark-magics-status"></a>

Describes the current session. The output includes the session ID, session state, workgroup name, PySpark engine version, and session start time. This magic command requires an active session to retrieve session details.

Following are the possible values for status:

**CREATING** – The session is being started, including acquiring resources.

**CREATED** – The session has been started.

**IDLE** – The session is able to accept a calculation.

**BUSY** – The session is processing another task and is unable to accept a calculation.

**TERMINATING** – The session is in the process of shutting down.

**TERMINATED** – The session and its resources are no longer running.

**DEGRADED** – The session has no healthy coordinators.

**FAILED** – Due to a failure, the session and its resources are no longer running.

![\[Using %status.\]](http://docs.aws.amazon.com/athena/latest/ug/images/notebooks-spark-magics-6.png)
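If you poll session status programmatically (for example, through the Athena `GetSessionStatus` API), it can help to classify these states. A small illustrative helper based on the list above (the grouping into terminal states is my reading of the descriptions, not an official API field):

```python
TERMINAL_STATES = {"TERMINATED", "FAILED"}
ACTIVE_STATES = {"CREATING", "CREATED", "IDLE", "BUSY", "TERMINATING", "DEGRADED"}

def is_terminal(state):
    """True if the session has stopped and its resources are no longer running."""
    if state not in TERMINAL_STATES | ACTIVE_STATES:
        raise ValueError(f"unknown session state: {state}")
    return state in TERMINAL_STATES
```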


# Use magics to create data graphs
<a name="notebooks-spark-magics-graphs"></a>

The line magics in this section render particular types of data or work in conjunction with graphing libraries.

## %table
<a name="notebooks-spark-magics-graphs-table"></a>

You can use the `%table` magic command to display dataframe data in table format.

The following example creates a dataframe with two columns and three rows of data, then displays the data in table format.

![\[Using the %table magic command.\]](http://docs.aws.amazon.com/athena/latest/ug/images/notebooks-spark-magics-graphs-1.png)


## %matplot
<a name="notebooks-spark-magics-graphs-matplot"></a>

[Matplotlib](https://matplotlib.org/) is a comprehensive library for creating static, animated, and interactive visualizations in Python. You can use the `%matplot` magic command to create a graph after you import the matplotlib library into a notebook cell.

The following example imports the matplotlib library, creates a set of x and y coordinates, and then uses the `%matplot` magic command to create a graph of the points.

```
import matplotlib.pyplot as plt

x = [3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
y = [9, 16, 25, 36, 49, 64, 81, 100, 121, 144]
plt.plot(x, y)
%matplot plt
```

![\[Using the %matplot magic command.\]](http://docs.aws.amazon.com/athena/latest/ug/images/notebooks-spark-magics-graphs-2.png)


### Use the matplotlib and seaborn libraries together
<a name="notebooks-spark-magics-graphs-using-the-matplotlib-and-seaborn-libraries-together"></a>

[Seaborn](https://seaborn.pydata.org/tutorial/introduction) is a library for making statistical graphics in Python. It builds on top of matplotlib and integrates closely with [pandas](https://pandas.pydata.org/) (Python data analysis) data structures. You can also use the `%matplot` magic command to render seaborn data.

The following example uses both the matplotlib and seaborn libraries to create a simple bar graph.

```
import matplotlib.pyplot as plt
import seaborn as sns

x = ['A', 'B', 'C']
y = [1, 5, 3]

sns.barplot(x=x, y=y)
%matplot plt
```

![\[Using %matplot to render seaborn data.\]](http://docs.aws.amazon.com/athena/latest/ug/images/notebooks-spark-magics-graphs-3.png)


## %plotly
<a name="notebooks-spark-magics-graphs-plotly"></a>

[Plotly](https://plotly.com/python/) is an open source graphing library for Python that you can use to make interactive graphs. You use the `%plotly` magic command to render plotly data.

The following example uses the [StringIO](https://docs.python.org/3.13/library/io.html#io.StringIO), plotly, and pandas libraries on stock price data to create a graph of stock activity from February and March of 2015.

```
from io import StringIO 
csvString = """ 
Date,AAPL.Open,AAPL.High,AAPL.Low,AAPL.Close,AAPL.Volume,AAPL.Adjusted,dn,mavg,up,direction 
2015-02-17,127.489998,128.880005,126.919998,127.830002,63152400,122.905254,106.7410523,117.9276669,129.1142814,Increasing 
2015-02-18,127.629997,128.779999,127.449997,128.720001,44891700,123.760965,107.842423,118.9403335,130.0382439,Increasing 
2015-02-19,128.479996,129.029999,128.330002,128.449997,37362400,123.501363,108.8942449,119.8891668,130.8840887,Decreasing 
2015-02-20,128.619995,129.5,128.050003,129.5,48948400,124.510914,109.7854494,120.7635001,131.7415509,Increasing 
2015-02-23,130.020004,133,129.660004,133,70974100,127.876074,110.3725162,121.7201668,133.0678174,Increasing 
2015-02-24,132.940002,133.600006,131.169998,132.169998,69228100,127.078049,111.0948689,122.6648335,134.2347981,Decreasing 
2015-02-25,131.559998,131.600006,128.149994,128.789993,74711700,123.828261,113.2119183,123.6296667,134.0474151,Decreasing 
2015-02-26,128.789993,130.869995,126.610001,130.419998,91287500,125.395469,114.1652991,124.2823333,134.3993674,Increasing 
2015-02-27,130,130.570007,128.240005,128.460007,62014800,123.510987,114.9668484,124.8426669,134.7184854,Decreasing 
2015-03-02,129.25,130.279999,128.300003,129.089996,48096700,124.116706,115.8770904,125.4036668,134.9302432,Decreasing 
2015-03-03,128.960007,129.520004,128.089996,129.360001,37816300,124.376308,116.9535132,125.9551669,134.9568205,Increasing 
2015-03-04,129.100006,129.559998,128.320007,128.539993,31666300,123.587892,118.0874253,126.4730002,134.8585751,Decreasing 
2015-03-05,128.580002,128.75,125.760002,126.410004,56517100,121.539962,119.1048311,126.848667,134.5925029,Decreasing 
2015-03-06,128.399994,129.369995,126.260002,126.599998,72842100,121.722637,120.190797,127.2288335,134.26687,Decreasing 
2015-03-09,127.959999,129.570007,125.059998,127.139999,88528500,122.241834,121.6289771,127.631167,133.6333568,Decreasing 
2015-03-10,126.410004,127.220001,123.800003,124.510002,68856600,119.71316,123.1164763,127.9235004,132.7305246,Decreasing 
""" 
csvStringIO = StringIO(csvString) 
 
import plotly.graph_objects as go 
import pandas as pd 

df = pd.read_csv(csvStringIO) 
fig = go.Figure(data=[go.Candlestick(x=df['Date'], 
open=df['AAPL.Open'], 
high=df['AAPL.High'], 
low=df['AAPL.Low'], 
close=df['AAPL.Close'])]) 
%plotly fig
```

![\[Using the %plotly magic command.\]](http://docs.aws.amazon.com/athena/latest/ug/images/notebooks-spark-magics-graphs-4.png)


# Use non-Hive table formats in Athena for Spark
<a name="notebooks-spark-table-formats"></a>

**Note**  
This page refers to using non-Hive table formats in the release version PySpark engine version 3. For the release version Apache Spark version 3.5, refer to [Amazon EMR 7.12](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-7120-release.html) for supported open table format versions.

When you work with sessions and notebooks in Athena for Spark, you can use Linux Foundation Delta Lake, Apache Hudi, and Apache Iceberg tables, in addition to Apache Hive tables.

## Considerations and limitations
<a name="notebooks-spark-table-formats-considerations-and-limitations"></a>

When you use table formats other than Apache Hive with Athena for Spark, consider the following points:
+ Each notebook supports only one table format in addition to Apache Hive. To use multiple table formats in Athena for Spark, create a separate notebook for each table format. For information about creating notebooks in Athena for Spark, see [Step 7: Create your own notebook](notebooks-spark-getting-started.md#notebooks-spark-getting-started-creating-your-own-notebook).
+ The Delta Lake, Hudi, and Iceberg table formats have been tested on Athena for Spark by using AWS Glue as the metastore. You might be able to use other metastores, but such usage is not currently supported.
+ To use the additional table formats, override the default `spark_catalog` property, as indicated in the Athena console and in this documentation. These non-Hive catalogs can read Hive tables, in addition to their own table formats.
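To confirm which catalog implementation is active in a session, you can read the property back from the session configuration. The following is a minimal sketch rather than part of the console-generated configuration; `spark` is the SparkSession that Athena notebooks provide, and the value returned depends on the table format that you chose.

```
# Read back the catalog override for the current session.
# The second argument is a default returned when the property is not set.
catalog_impl = spark.conf.get("spark.sql.catalog.spark_catalog", "not set")
print(catalog_impl)
```

If the property is still `not set`, the session uses the default Apache Hive behavior.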

## Table versions
<a name="notebooks-spark-table-formats-versions"></a>

The following table shows supported non-Hive table versions in Amazon Athena for Apache Spark.


| Table format | Supported version | 
| --- | --- | 
| Apache Iceberg | 1.2.1 | 
| Apache Hudi | 0.13 | 
| Linux Foundation Delta Lake | 2.0.2 | 

In Athena for Spark, these table format `.jar` files and their dependencies are loaded onto the classpath for Spark drivers and executors.

For an *AWS Big Data Blog* post that shows how to work with Iceberg, Hudi, and Delta Lake table formats using Spark SQL in Amazon Athena notebooks, see [Use Amazon Athena with Spark SQL for your open-source transactional table formats](https://aws.amazon.com/blogs/big-data/use-amazon-athena-with-spark-sql-for-your-open-source-transactional-table-formats/).

**Topics**
+ [Considerations and limitations](#notebooks-spark-table-formats-considerations-and-limitations)
+ [Table versions](#notebooks-spark-table-formats-versions)
+ [Iceberg](notebooks-spark-table-formats-apache-iceberg.md)
+ [Hudi](notebooks-spark-table-formats-apache-hudi.md)
+ [Delta Lake](notebooks-spark-table-formats-linux-foundation-delta-lake.md)

# Use Apache Iceberg tables in Athena for Spark
<a name="notebooks-spark-table-formats-apache-iceberg"></a>

[Apache Iceberg](https://iceberg.apache.org/) is an open table format for large datasets in Amazon Simple Storage Service (Amazon S3). It provides you with fast query performance over large tables, atomic commits, concurrent writes, and SQL-compatible table evolution.

To use Apache Iceberg tables in Athena for Spark, configure the following Spark properties. These properties are configured for you by default in the Athena for Spark console when you choose Apache Iceberg as the table format. For steps, see [Step 4: Edit session details](notebooks-spark-getting-started.md#notebooks-spark-getting-started-editing-session-details) or [Step 7: Create your own notebook](notebooks-spark-getting-started.md#notebooks-spark-getting-started-creating-your-own-notebook).

```
"spark.sql.catalog.spark_catalog": "org.apache.iceberg.spark.SparkSessionCatalog",
"spark.sql.catalog.spark_catalog.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog",
"spark.sql.catalog.spark_catalog.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
"spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions"
```

The following procedure shows you how to use an Apache Iceberg table in an Athena for Spark notebook. Run each step in a new cell in the notebook.

**To use an Apache Iceberg table in Athena for Spark**

1. Define the constants to use in the notebook.

   ```
   DB_NAME = "NEW_DB_NAME"
   TABLE_NAME = "NEW_TABLE_NAME"
   TABLE_S3_LOCATION = "s3://amzn-s3-demo-bucket"
   ```

1. Create an Apache Spark [DataFrame](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html).

   ```
   columns = ["language","users_count"]
   data = [("Golang", 3000)]
   df = spark.createDataFrame(data, columns)
   ```

1. Create a database.

   ```
   spark.sql("CREATE DATABASE {} LOCATION '{}'".format(DB_NAME, TABLE_S3_LOCATION))
   ```

1. Create an empty Apache Iceberg table.

   ```
   spark.sql("""
   CREATE TABLE {}.{} (
   language string,
   users_count int
   ) USING ICEBERG
   """.format(DB_NAME, TABLE_NAME))
   ```

1. Insert a row of data into the table.

   ```
   spark.sql("""INSERT INTO {}.{} VALUES ('Golang', 3000)""".format(DB_NAME, TABLE_NAME))
   ```

1. Confirm that you can query the new table.

   ```
   spark.sql("SELECT * FROM {}.{}".format(DB_NAME, TABLE_NAME)).show()
   ```

For more information and examples on working with Spark DataFrames and Iceberg tables, see [Spark Queries](https://iceberg.apache.org/docs/latest/spark-queries/) in the Apache Iceberg documentation.
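As an alternative to the SQL `INSERT` in step 5, you can append the DataFrame from step 2 to the table directly. The following sketch assumes the table created in the preceding steps and uses the Spark 3 `DataFrameWriterV2` API; it is illustrative rather than a required part of the procedure.

```
# Append the DataFrame rows to the Iceberg table without SQL.
df.writeTo("{}.{}".format(DB_NAME, TABLE_NAME)).append()

# Read the table back to verify the write.
spark.table("{}.{}".format(DB_NAME, TABLE_NAME)).show()
```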

# Use Apache Hudi tables in Athena for Spark
<a name="notebooks-spark-table-formats-apache-hudi"></a>

[Apache Hudi](https://hudi.apache.org/) is an open source data management framework that simplifies incremental data processing. Record-level insert, update, upsert, and delete actions are processed with greater precision, which reduces overhead.

To use Apache Hudi tables in Athena for Spark, configure the following Spark properties. These properties are configured for you by default in the Athena for Spark console when you choose Apache Hudi as the table format. For steps, see [Step 4: Edit session details](notebooks-spark-getting-started.md#notebooks-spark-getting-started-editing-session-details) or [Step 7: Create your own notebook](notebooks-spark-getting-started.md#notebooks-spark-getting-started-creating-your-own-notebook).

```
"spark.sql.catalog.spark_catalog": "org.apache.spark.sql.hudi.catalog.HoodieCatalog",
"spark.serializer": "org.apache.spark.serializer.KryoSerializer",
"spark.sql.extensions": "org.apache.spark.sql.hudi.HoodieSparkSessionExtension"
```

The following procedure shows you how to use an Apache Hudi table in an Athena for Spark notebook. Run each step in a new cell in the notebook.

**To use an Apache Hudi table in Athena for Spark**

1. Define the constants to use in the notebook.

   ```
   DB_NAME = "NEW_DB_NAME"
   TABLE_NAME = "NEW_TABLE_NAME"
   TABLE_S3_LOCATION = "s3://amzn-s3-demo-bucket"
   ```

1. Create an Apache Spark [DataFrame](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html).

   ```
   columns = ["language","users_count"]
   data = [("Golang", 3000)]
   df = spark.createDataFrame(data, columns)
   ```

1. Create a database.

   ```
   spark.sql("CREATE DATABASE {} LOCATION '{}'".format(DB_NAME, TABLE_S3_LOCATION))
   ```

1. Create an empty Apache Hudi table.

   ```
   spark.sql("""
   CREATE TABLE {}.{} (
   language string,
   users_count int
   ) USING HUDI
   TBLPROPERTIES (
   primaryKey = 'language',
   type = 'mor'
   );
   """.format(DB_NAME, TABLE_NAME))
   ```

1. Insert a row of data into the table.

   ```
   spark.sql("""INSERT INTO {}.{} VALUES ('Golang', 3000)""".format(DB_NAME,TABLE_NAME))
   ```

1. Confirm that you can query the new table.

   ```
   spark.sql("SELECT * FROM {}.{}".format(DB_NAME, TABLE_NAME)).show()
   ```
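Because Hudi supports record-level updates, you can also modify the inserted row with Spark SQL. The following sketch assumes the table created in the preceding steps; Hudi resolves the record to change through the `primaryKey` property that the table defines.

```
# Update the row whose primary key is 'Golang', then verify the change.
spark.sql("UPDATE {}.{} SET users_count = 4000 WHERE language = 'Golang'".format(DB_NAME, TABLE_NAME))
spark.sql("SELECT * FROM {}.{}".format(DB_NAME, TABLE_NAME)).show()
```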

# Use Linux Foundation Delta Lake tables in Athena for Spark
<a name="notebooks-spark-table-formats-linux-foundation-delta-lake"></a>

[Linux Foundation Delta Lake](https://delta.io/) is a table format that you can use for big data analytics. You can use Athena for Spark to read Delta Lake tables stored in Amazon S3 directly.

To use Delta Lake tables in Athena for Spark, configure the following Spark properties. These properties are configured for you by default in the Athena for Spark console when you choose Delta Lake as the table format. For steps, see [Step 4: Edit session details](notebooks-spark-getting-started.md#notebooks-spark-getting-started-editing-session-details) or [Step 7: Create your own notebook](notebooks-spark-getting-started.md#notebooks-spark-getting-started-creating-your-own-notebook).

```
"spark.sql.catalog.spark_catalog" : "org.apache.spark.sql.delta.catalog.DeltaCatalog", 
"spark.sql.extensions" : "io.delta.sql.DeltaSparkSessionExtension"
```

The following procedure shows you how to use a Delta Lake table in an Athena for Spark notebook. Run each step in a new cell in the notebook.

**To use a Delta Lake table in Athena for Spark**

1. Define the constants to use in the notebook.

   ```
   DB_NAME = "NEW_DB_NAME" 
   TABLE_NAME = "NEW_TABLE_NAME" 
   TABLE_S3_LOCATION = "s3://amzn-s3-demo-bucket"
   ```

1. Create an Apache Spark [DataFrame](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html).

   ```
   columns = ["language","users_count"] 
   data = [("Golang", 3000)] 
   df = spark.createDataFrame(data, columns)
   ```

1. Create a database.

   ```
   spark.sql("CREATE DATABASE {} LOCATION '{}'".format(DB_NAME, TABLE_S3_LOCATION))
   ```

1. Create an empty Delta Lake table.

   ```
   spark.sql("""
   CREATE TABLE {}.{} ( 
     language string, 
     users_count int 
   ) USING DELTA 
   """.format(DB_NAME, TABLE_NAME))
   ```

1. Insert a row of data into the table.

   ```
   spark.sql("""INSERT INTO {}.{} VALUES ('Golang', 3000)""".format(DB_NAME, TABLE_NAME))
   ```

1. Confirm that you can query the new table.

   ```
   spark.sql("SELECT * FROM {}.{}".format(DB_NAME, TABLE_NAME)).show()
   ```
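Because Athena for Spark can read Delta Lake tables stored in Amazon S3 directly, you can also load a table by its S3 path instead of through the metastore. The following sketch uses a placeholder path; replace it with the location of an existing Delta Lake table.

```
# Load a Delta Lake table directly from its Amazon S3 location.
delta_df = spark.read.format("delta").load("s3://amzn-s3-demo-bucket/delta-table/")
delta_df.show()
```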

# Use Python libraries in Athena for Spark
<a name="notebooks-spark-python-library-support"></a>

**Note**  
This page refers to using Python libraries in the release version PySpark engine version 3. The release version Apache Spark version 3.5 is based on [Amazon EMR 7.12](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-7120-release.html). Refer to the Amazon EMR 7.12 release notes for the libraries included in that version.

This page describes the terminology used and lifecycle management followed for the runtimes, libraries, and packages used in Amazon Athena for Apache Spark.

## Definitions
<a name="notebooks-spark-python-library-support-definitions"></a>
+ **Amazon Athena for Apache Spark** is a customized version of open source Apache Spark. To see the current version, run the command `print(f'{spark.version}')` in a notebook cell. 
+ The **Athena runtime** is the environment in which your code runs. The environment includes a Python interpreter and PySpark libraries.
+ An **external library or package** is a Java or Scala JAR or Python library that is not part of the Athena runtime but can be included in Athena for Spark jobs. External packages can be built by Amazon or by you.
+ A **convenience package** is a collection of external packages selected by Athena that you can choose to include in your Spark applications.
+ A **bundle** combines the Athena runtime and a convenience package.
+ A **user library** is an external library or package that you explicitly add to your Athena for Spark job.
  + A user library is an external package that is not part of a convenience package, so you must load and install it yourself. For example, you might write some `.py` files, zip them, and then add the `.zip` file to your application.
+ An **Athena for Spark application** is a job or query that you submit to Athena for Spark.

## Lifecycle management
<a name="notebooks-spark-python-library-support-lifecycle-management"></a>

The following sections describe the versioning and deprecation policies regarding the runtime and convenience packages used in Athena for Spark.

### Runtime versioning and deprecation
<a name="notebooks-spark-python-library-support-runtime-versioning-and-deprecation"></a>

The main component in the Athena runtime is the Python interpreter. Because Python is an evolving language, new versions are released regularly and support for older versions is removed. We recommend that you do not run programs with deprecated Python interpreter versions, and that you use the latest Athena runtime whenever possible.

The Athena runtime deprecation schedule is as follows:

1. After Athena provides a new runtime, Athena will continue to support the previous runtime for 6 months. During that time, Athena will apply security patches and updates to the previous runtime.

1. After 6 months, Athena will end support for the previous runtime. Athena will no longer apply security patches and other updates to the previous runtime. Spark applications using the previous runtime will no longer be eligible for technical support.

1. After 12 months, you will no longer be able to update or edit Spark applications in a workgroup that uses the previous runtime. We recommend that you update your Spark applications before this time period ends. After the time period ends, you can still run existing notebooks, but any notebooks that still use the previous runtime will log a warning to that effect.

1. After 18 months, you will no longer be able to run jobs in the workgroup using the previous runtime.

### Convenience package versioning and deprecation
<a name="notebooks-spark-python-library-support-convenience-package-versioning-and-deprecation"></a>

The contents of convenience packages change over time. Athena occasionally adds, removes, or upgrades these convenience packages. 

Athena uses the following guidelines for convenience packages:
+ Convenience packages have a simple versioning scheme such as 1, 2, 3.
+ Each convenience package version includes specific versions of external packages. After Athena creates a convenience package, the convenience package's set of external packages and their corresponding versions do not change.
+ Athena creates a new convenience package version when it includes a new external package, removes an external package, or upgrades the version of one or more external packages.

Athena deprecates a convenience package when it deprecates the Athena runtime that the package uses. Athena can deprecate packages sooner to limit the number of bundles that it supports.

The convenience package deprecation schedule follows the Athena runtime deprecation schedule.

# List of preinstalled Python libraries
<a name="notebooks-spark-preinstalled-python-libraries"></a>

Preinstalled Python libraries include the following.

```
boto3==1.24.31
botocore==1.27.31
certifi==2022.6.15
charset-normalizer==2.1.0
cycler==0.11.0
cython==0.29.30
docutils==0.19
fonttools==4.34.4
idna==3.3
jmespath==1.0.1
joblib==1.1.0
kiwisolver==1.4.4
matplotlib==3.5.2
mpmath==1.2.1
numpy==1.23.1
packaging==21.3
pandas==1.4.3
patsy==0.5.2
pillow==9.2.0
plotly==5.9.0
pmdarima==1.8.5
pyathena==2.9.6
pyparsing==3.0.9
python-dateutil==2.8.2
pytz==2022.1
requests==2.28.1
s3transfer==0.6.0
scikit-learn==1.1.1
scipy==1.8.1
seaborn==0.11.2
six==1.16.0
statsmodels==0.13.2
sympy==1.10.1
tenacity==8.0.1
threadpoolctl==3.1.0
urllib3==1.26.10
pyarrow==9.0.0
```

## Notes
<a name="notebooks-spark-preinstalled-python-libraries-notes"></a>
+ MLlib (Apache Spark machine learning library) and the `pyspark.ml` package are not supported.
+ Currently, `pip install` is not supported in Athena for Spark sessions. 
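Because `pip install` is not available in sessions, it can be useful to confirm which versions of the preinstalled libraries your runtime provides before you depend on version-specific behavior. The following sketch uses only the Python standard library; the package names shown are examples.

```
import importlib.metadata

# Collect the installed versions of a few preinstalled libraries.
# PackageNotFoundError is raised when a package is absent.
versions = {}
for pkg in ("pandas", "numpy", "seaborn"):
    try:
        versions[pkg] = importlib.metadata.version(pkg)
    except importlib.metadata.PackageNotFoundError:
        versions[pkg] = None
print(versions)
```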

For information on importing Python libraries to Amazon Athena for Apache Spark, see [Import files and Python libraries to Athena for Spark](notebooks-import-files-libraries.md).

# Import files and Python libraries to Athena for Spark
<a name="notebooks-import-files-libraries"></a>

This document provides examples of how to import files and Python libraries to Amazon Athena for Apache Spark.

## Considerations and Limitations
<a name="notebooks-import-files-libraries-considerations-limitations"></a>
+ **Python version** – Currently, Athena for Spark uses Python version 3.9.16. Note that Python packages are sensitive to minor Python versions.
+ **Athena for Spark architecture** – Athena for Spark uses Amazon Linux 2 on ARM64 architecture. Note that some Python libraries do not distribute binaries for this architecture.
+ **Binary shared objects (SOs)** – Because the SparkContext [addPyFile](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.SparkContext.addPyFile.html) method does not detect binary shared objects, it cannot be used in Athena for Spark to add Python packages that depend on shared objects.
+ **Resilient Distributed Datasets (RDDs)** – [RDDs](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.RDD.html) are not supported.
+ **Dataframe.foreach** – The PySpark [DataFrame.foreach](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.foreach.html) method is not supported.

## Examples
<a name="notebooks-import-files-libraries-examples"></a>

The examples use the following conventions.
+ The placeholder Amazon S3 location `s3://amzn-s3-demo-bucket`. Replace this with your own S3 bucket location.
+ All code blocks that execute from a Unix shell are shown with the current directory followed by a `$` prompt. For example, the command `ls` in the directory `/tmp` and its output are displayed as follows:

  ```
  /tmp $ ls
  ```

  **Output**

  ```
  file1 file2
  ```

## Import text files for use in calculations
<a name="notebooks-import-files-libraries-importing-text-files"></a>

The examples in this section show how to import text files for use in calculations in your notebooks in Athena for Spark.

### Add a file to a notebook after you write it to local temporary directory
<a name="notebooks-import-files-libraries-adding-a-file-to-a-notebook-temporary-directory"></a>

The following example shows how to write a file to a local temporary directory, add it to a notebook, and test it.

```
import os
from pyspark import SparkFiles
tempdir = '/tmp/'
path = os.path.join(tempdir, "test.txt")
with open(path, "w") as testFile:
    _ = testFile.write("5")
sc.addFile(path)

def func(iterator):
    with open(SparkFiles.get("test.txt")) as testFile:
        fileVal = int(testFile.readline())
        return [x * fileVal for x in iterator]

#Test the file
from pyspark.sql.functions import udf
from pyspark.sql.functions import col

udf_with_import = udf(func)
df = spark.createDataFrame([(1, "a"), (2, "b")])
df.withColumn("col", udf_with_import(col('_2'))).show()
```

**Output**

```
Calculation completed.
+---+---+-------+
| _1| _2|    col|
+---+---+-------+
|  1|  a|[aaaaa]|
|  2|  b|[bbbbb]|
+---+---+-------+
```

### Import a file from Amazon S3
<a name="notebooks-import-files-libraries-importing-a-file-from-s3"></a>

The following example shows how to import a file from Amazon S3 into a notebook and test it.

**To import a file from Amazon S3 into a notebook**

1. Create a file named `test.txt` that has a single line that contains the value `5`.

1. Add the file to a bucket in Amazon S3. This example uses the location `s3://amzn-s3-demo-bucket`.

1. Use the following code to import the file to your notebook and test the file.

   ```
   from pyspark import SparkFiles
   sc.addFile('s3://amzn-s3-demo-bucket/test.txt')
   
   def func(iterator):
      with open(SparkFiles.get("test.txt")) as testFile:
          fileVal = int(testFile.readline())
          return [x * fileVal for x in iterator]
          
   #Test the file
   from pyspark.sql.functions import udf
   from pyspark.sql.functions import col
   
   udf_with_import = udf(func)
   df = spark.createDataFrame([(1, "a"), (2, "b")])
   df.withColumn("col", udf_with_import(col('_2'))).show()
   ```

   **Output**

   ```
   Calculation completed.
   +---+---+-------+
   | _1| _2|    col|
   +---+---+-------+
   |  1|  a|[aaaaa]|
   |  2|  b|[bbbbb]|
   +---+---+-------+
   ```

## Add Python files
<a name="notebooks-import-files-libraries-adding-python-files"></a>

The examples in this section show how to add Python files and libraries to your Spark notebooks in Athena.

### Add Python files and register a UDF
<a name="notebooks-import-files-libraries-adding-python-files-and-registering-a-udf"></a>

The following example shows how to add Python files from Amazon S3 to your notebook and register a UDF.

**To add Python files to your notebook and register a UDF**

1. Using your own Amazon S3 location, create the file `s3://amzn-s3-demo-bucket/file1.py` with the following contents:

   ```
   def xyz(input):
       return 'xyz - udf ' + str(input)
   ```

1. In the same S3 location, create the file `s3://amzn-s3-demo-bucket/file2.py` with the following contents:

   ```
   from file1 import xyz
   def uvw(input):
       return 'uvw -> ' + xyz(input)
   ```

1. In your Athena for Spark notebook, run the following commands.

   ```
   sc.addPyFile('s3://amzn-s3-demo-bucket/file1.py')
   sc.addPyFile('s3://amzn-s3-demo-bucket/file2.py')
   
   def func(iterator):
       from file2 import uvw
       return [uvw(x) for x in iterator]
   
   from pyspark.sql.functions import udf
   from pyspark.sql.functions import col
   
   udf_with_import = udf(func)
   
   df = spark.createDataFrame([(1, "a"), (2, "b")])
   
   df.withColumn("col", udf_with_import(col('_2'))).show(10)
   ```

   **Output**

   ```
   Calculation started (calculation_id=1ec09e01-3dec-a096-00ea-57289cdb8ce7) in (session=c8c09e00-6f20-41e5-98bd-4024913d6cee). Checking calculation status...
   Calculation completed.
   +---+---+--------------------+
   | _1| _2|                 col|
   +---+---+--------------------+
   | 1 |  a|[uvw -> xyz - ud... |
   | 2 |  b|[uvw -> xyz - ud... |
   +---+---+--------------------+
   ```

### Import a Python .zip file
<a name="notebooks-import-files-libraries-importing-a-python-zip-file"></a>

You can use the SparkContext `addPyFile` method and the Python `import` statement to import a Python `.zip` file into your notebook.

**Note**  
The `.zip` files that you import to Athena for Spark can include only pure Python packages. Packages that include C-based files are not supported.

**To import a Python `.zip` file to your notebook**

1. On your local computer, in a directory such as `/tmp`, create a directory called `moduletest`.

1. In the `moduletest` directory, create a file named `hello.py` with the following contents:

   ```
   def hi(input):
       return 'hi ' + str(input)
   ```

1. In the same directory, add an empty file with the name `__init__.py`.

   If you list the directory contents, they should now look like the following.

   ```
   /tmp $ ls moduletest
   __init__.py       hello.py
   ```

1. Use the `zip` command to place the two module files into a file called `moduletest.zip`.

   ```
   moduletest $ zip -r9 ../moduletest.zip *
   ```

1. Upload the `.zip` file to your bucket in Amazon S3.

1. Use the following code to import the Python `.zip` file into your notebook.

   ```
   sc.addPyFile('s3://amzn-s3-demo-bucket/moduletest.zip')
   
   from moduletest.hello import hi
   
   from pyspark.sql.functions import udf
   from pyspark.sql.functions import col
   
   hi_udf = udf(hi)
   
   df = spark.createDataFrame([(1, "a"), (2, "b")])
   
   df.withColumn("col", hi_udf(col('_2'))).show()
   ```

   **Output**

   ```
   Calculation started (calculation_id=6ec09e8c-6fe0-4547-5f1b-6b01adb2242c) in (session=dcc09e8c-3f80-9cdc-bfc5-7effa1686b76). Checking calculation status...
   Calculation completed.
   +---+---+----+
   | _1| _2| col|
   +---+---+----+
   |  1|  a|hi a|
   |  2|  b|hi b|
   +---+---+----+
   ```

### Import two versions of a Python library as separate modules
<a name="notebooks-import-files-libraries-importing-two-library-versions"></a>

The following code examples show how to add and import two different versions of a Python library from a location in Amazon S3 as two separate modules. The code adds each library file from Amazon S3, imports it, and then prints the library version to verify the import.

```
sc.addPyFile('s3://amzn-s3-demo-bucket/python-third-party-libs-test/simplejson_v3_15.zip')
sc.addPyFile('s3://amzn-s3-demo-bucket/python-third-party-libs-test/simplejson_v3_17_6.zip')

import simplejson_v3_15
print(simplejson_v3_15.__version__)
```

**Output**

```
3.15.0
```

```
import simplejson_v3_17_6
print(simplejson_v3_17_6.__version__)
```

**Output**

```
3.17.6
```

### Import a Python .zip file from PyPI
<a name="notebooks-import-files-libraries-importing-a-python-zip-file-from-a-github-project"></a>

This example uses the `pip` command to download a Python .zip file of the [bpabel/piglatin](https://github.com/bpabel/piglatin) project from the [Python Package Index (PyPI)](https://pypi.org/).

**To import a Python .zip file from PyPI**

1. On your local desktop, use the following commands to create a directory called `testpiglatin` and create a virtual environment.

   ```
   /tmp $ mkdir testpiglatin
   /tmp $ cd testpiglatin
   testpiglatin $ virtualenv .
   ```

   **Output**

   ```
   created virtual environment CPython3.9.6.final.0-64 in 410ms
   creator CPython3Posix(dest=/private/tmp/testpiglatin, clear=False, no_vcs_ignore=False, global=False)
   seeder FromAppData(download=False, pip=bundle, setuptools=bundle, wheel=bundle, via=copy, app_data_dir=/Users/user1/Library/Application Support/virtualenv)
   added seed packages: pip==22.0.4, setuptools==62.1.0, wheel==0.37.1
   activators BashActivator,CShellActivator,FishActivator,NushellActivator,PowerShellActivator,PythonActivator
   ```

1. Create a subdirectory named `unpacked` to hold the project.

   ```
   testpiglatin $ mkdir unpacked
   ```

1. Use the `pip` command to install the project into the `unpacked` directory.

   ```
   testpiglatin $ bin/pip install -t $PWD/unpacked piglatin
   ```

   **Output**

   ```
   Collecting piglatin
   Using cached piglatin-1.0.6-py2.py3-none-any.whl (3.1 kB)
   Installing collected packages: piglatin
   Successfully installed piglatin-1.0.6
   ```

1. Check the contents of the directory.

   ```
   testpiglatin $ ls
   ```

   **Output**

   ```
   bin lib pyvenv.cfg unpacked
   ```

1. Change to the `unpacked` directory and display the contents.

   ```
   testpiglatin $ cd unpacked
   unpacked $ ls
   ```

   **Output**

   ```
   piglatin piglatin-1.0.6.dist-info
   ```

1. Use the `zip` command to place the contents of the piglatin project into a file called `library.zip`.

   ```
   unpacked $ zip -r9 ../library.zip *
   ```

   **Output**

   ```
   adding: piglatin/ (stored 0%)
   adding: piglatin/__init__.py (deflated 56%)
   adding: piglatin/__pycache__/ (stored 0%)
   adding: piglatin/__pycache__/__init__.cpython-39.pyc (deflated 31%)
   adding: piglatin-1.0.6.dist-info/ (stored 0%)
   adding: piglatin-1.0.6.dist-info/RECORD (deflated 39%)
   adding: piglatin-1.0.6.dist-info/LICENSE (deflated 41%)
   adding: piglatin-1.0.6.dist-info/WHEEL (deflated 15%)
   adding: piglatin-1.0.6.dist-info/REQUESTED (stored 0%)
   adding: piglatin-1.0.6.dist-info/INSTALLER (stored 0%)
   adding: piglatin-1.0.6.dist-info/METADATA (deflated 48%)
   ```

1. (Optional) Use the following commands to test the import locally.

   1. Set the Python path to the `library.zip` file location and start Python.

      ```
      /home $ export PYTHONPATH=/tmp/testpiglatin/library.zip
      /home $ python3
      ```

      **Output**

      ```
      Python 3.9.6 (default, Jun 29 2021, 06:20:32)
      [Clang 12.0.0 (clang-1200.0.32.29)] on darwin
      Type "help", "copyright", "credits" or "license" for more information.
      ```

   1. Import the library and run a test command.

      ```
      >>> import piglatin
      >>> piglatin.translate('hello')
      ```

      **Output**

      ```
      'ello-hay'
      ```

1. Use commands like the following to add the `.zip` file from Amazon S3, import it into your notebook in Athena, and test it.

   ```
   sc.addPyFile('s3://amzn-s3-demo-bucket/library.zip')
   
   import piglatin
   piglatin.translate('hello')
   
   from pyspark.sql.functions import udf
   from pyspark.sql.functions import col
   
   hi_udf = udf(piglatin.translate)
   
   df = spark.createDataFrame([(1, "hello"), (2, "world")])
   
   df.withColumn("col", hi_udf(col('_2'))).show()
   ```

   **Output**

   ```
   Calculation started (calculation_id=e2c0a06e-f45d-d96d-9b8c-ff6a58b2a525) in (session=82c0a06d-d60e-8c66-5d12-23bcd55a6457). Checking calculation status...
   Calculation completed.
   +---+-----+--------+
   | _1|   _2|     col|
   +---+-----+--------+
   |  1|hello|ello-hay|
   |  2|world|orld-way|
   +---+-----+--------+
   ```

### Import a Python .zip file from PyPI that has dependencies
<a name="notebooks-import-files-libraries-importing-a-python-zip-file-with-dependencies"></a>

This example imports the [md2gemini](https://github.com/makeworld-the-better-one/md2gemini) package, which converts text in markdown to [Gemini](https://gemini.circumlunar.space/) text format, from PyPI. The package has the following [dependencies](https://libraries.io/pypi/md2gemini):

```
cjkwrap
mistune
wcwidth
```

**To import a Python .zip file that has dependencies**

1. On your local computer, use the following commands to create a directory called `testmd2gemini` and create a virtual environment.

   ```
   /tmp $ mkdir testmd2gemini
   /tmp $ cd testmd2gemini
   testmd2gemini$ virtualenv .
   ```

1. Create a subdirectory named `unpacked` to hold the project.

   ```
   testmd2gemini $ mkdir unpacked
   ```

1. Use the `pip` command to install the project into the `unpacked` directory.

   ```
   /testmd2gemini $ bin/pip install -t $PWD/unpacked md2gemini
   ```

   **Output**

   ```
   Collecting md2gemini
     Downloading md2gemini-1.9.0-py3-none-any.whl (31 kB)
   Collecting wcwidth
     Downloading wcwidth-0.2.5-py2.py3-none-any.whl (30 kB)
   Collecting mistune<3,>=2.0.0
     Downloading mistune-2.0.2-py2.py3-none-any.whl (24 kB)
   Collecting cjkwrap
     Downloading CJKwrap-2.2-py2.py3-none-any.whl (4.3 kB)
   Installing collected packages: wcwidth, mistune, cjkwrap, md2gemini
   Successfully installed cjkwrap-2.2 md2gemini-1.9.0 mistune-2.0.2 wcwidth-0.2.5
   ...
   ```

1. Change to the `unpacked` directory and check the contents.

   ```
   testmd2gemini $ cd unpacked
   unpacked $ ls -lah
   ```

   **Output**

   ```
   total 16
   drwxr-xr-x  13 user1  wheel   416B Jun  7 18:43 .
   drwxr-xr-x   8 user1  wheel   256B Jun  7 18:44 ..
   drwxr-xr-x   9 user1  staff   288B Jun  7 18:43 CJKwrap-2.2.dist-info
   drwxr-xr-x   3 user1  staff    96B Jun  7 18:43 __pycache__
   drwxr-xr-x   3 user1  staff    96B Jun  7 18:43 bin
   -rw-r--r--   1 user1  staff   5.0K Jun  7 18:43 cjkwrap.py
   drwxr-xr-x   7 user1  staff   224B Jun  7 18:43 md2gemini
   drwxr-xr-x  10 user1  staff   320B Jun  7 18:43 md2gemini-1.9.0.dist-info
   drwxr-xr-x  12 user1  staff   384B Jun  7 18:43 mistune
   drwxr-xr-x   8 user1  staff   256B Jun  7 18:43 mistune-2.0.2.dist-info
   drwxr-xr-x  16 user1  staff   512B Jun  7 18:43 tests
   drwxr-xr-x  10 user1  staff   320B Jun  7 18:43 wcwidth
   drwxr-xr-x   9 user1  staff   288B Jun  7 18:43 wcwidth-0.2.5.dist-info
   ```

1. Use the `zip` command to place the contents of the md2gemini project into a file called `md2gemini.zip`.

   ```
   unpacked $ zip -r9 ../md2gemini *
   ```

   **Output**

   ```
     adding: CJKwrap-2.2.dist-info/ (stored 0%)
     adding: CJKwrap-2.2.dist-info/RECORD (deflated 37%)
     ....
     adding: wcwidth-0.2.5.dist-info/INSTALLER (stored 0%)
     adding: wcwidth-0.2.5.dist-info/METADATA (deflated 62%)
   ```

1. (Optional) Use the following commands to test that the library works on your local computer.

   1. Set the Python path to the `md2gemini.zip` file location and start Python.

      ```
      /home $ PYTHONPATH=/tmp/testmd2gemini/md2gemini.zip python3
      ```

   1. Import the library and run a test.

      ```
      >>> from md2gemini import md2gemini
      >>> print(md2gemini('[abc](https://abc.def)'))
      ```

      **Output**

      ```
      https://abc.def abc
      ```

1. Use the following commands to add the `.zip` file from Amazon S3, import it into your notebook in Athena, and perform a non-UDF test.

   ```
   # (non udf test)
   sc.addPyFile('s3://amzn-s3-demo-bucket/md2gemini.zip')
   from md2gemini import md2gemini
   print(md2gemini('[abc](https://abc.def)'))
   ```

   **Output**

   ```
   Calculation started (calculation_id=0ac0a082-6c3f-5a8f-eb6e-f8e9a5f9bc44) in (session=36c0a082-5338-3755-9f41-0cc954c55b35). Checking calculation status...
   Calculation completed.
   => https://abc.def (https://abc.def/) abc
   ```

1. Use the following commands to perform a UDF test.

   ```
   # (udf test)
   
   from pyspark.sql.functions import udf
   from pyspark.sql.functions import col
   from md2gemini import md2gemini
   
   
   hi_udf = udf(md2gemini)
   df = spark.createDataFrame([(1, "[first website](https://abc.def)"), (2, "[second website](https://aws.com)")])
   df.withColumn("col", hi_udf(col('_2'))).show()
   ```

   **Output**

   ```
   Calculation started (calculation_id=60c0a082-f04d-41c1-a10d-d5d365ef5157) in (session=36c0a082-5338-3755-9f41-0cc954c55b35). Checking calculation status...
   Calculation completed.
   +---+--------------------+--------------------+
   | _1|                  _2|                 col|
   +---+--------------------+--------------------+
   |  1|[first website](h...|=> https://abc.de...|
   |  2|[second website](...|=> https://aws.co...|
   +---+--------------------+--------------------+
   ```

# Use Spark properties to specify custom configuration
<a name="notebooks-spark-custom-jar-cfg"></a>

When you create or edit a session in Amazon Athena for Apache Spark, you can use [Spark properties](https://spark.apache.org/docs/latest/configuration.html#spark-properties) to specify `.jar` files, packages, or another custom configuration for the session. To specify your Spark properties, you can use the Athena console, the AWS CLI, or the Athena API.

## Use the Athena console to specify Spark properties
<a name="notebooks-spark-custom-jar-cfg-console"></a>

In the Athena console, you can specify your Spark properties when you [create a notebook](notebooks-spark-getting-started.md#notebooks-spark-getting-started-creating-your-own-notebook) or [edit a current session](notebooks-spark-getting-started.md#notebooks-spark-getting-started-editing-session-details).

**To add properties in the Create notebook or Edit session details dialog box**

1. Expand **Spark properties**.

1. To add your properties, use the **Edit in table** or **Edit in JSON** option.
   + For the **Edit in table** option, choose **Add property** to add a property, or choose **Remove** to remove a property. Use the **Key** and **Value** boxes to enter property names and their values.
     + To add a custom `.jar` file, use the `spark.jars` property.
     + To specify a package file, use the `spark.jars.packages` property.
   + To enter and edit your configuration directly, choose the **Edit in JSON** option. In the JSON text editor, you can perform the following tasks:
     + Choose **Copy** to copy the JSON text to the clipboard.
     + Choose **Clear** to remove all text from the JSON editor.
     + Choose the settings (gear) icon to configure line wrapping or choose a color theme for the JSON editor.

### Notes
<a name="notebooks-spark-custom-jar-cfg-notes"></a>
+ Setting properties in Athena for Spark is equivalent to setting [Spark properties](https://spark.apache.org/docs/latest/configuration.html#spark-properties) directly on a [SparkConf](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.SparkConf.html) object.
+ Start all Spark properties with the `spark.` prefix. Properties with other prefixes are ignored.
+ Not all Spark properties are available for custom configuration on Athena. If you submit a `StartSession` request that has a restricted configuration, the session fails to start.
  + You cannot use the `spark.athena.` prefix because it is reserved.

## Use the AWS CLI or Athena API to provide custom configuration
<a name="notebooks-spark-custom-jar-cfg-cli-or-api"></a>

To use the AWS CLI or Athena API to provide your session configuration, use the [StartSession](https://docs.aws.amazon.com/athena/latest/APIReference/API_StartSession.html) API action or the [start-session](https://awscli.amazonaws.com/v2/documentation/api/latest/reference/athena/start-session.html) CLI command. In your `StartSession` request, use the `SparkProperties` field of [EngineConfiguration](https://docs.aws.amazon.com/athena/latest/APIReference/API_EngineConfiguration.html) object to pass your configuration information in JSON format. This starts a session with your specified configuration.

To specify custom Spark properties from the AWS CLI, use the `engine-configuration` configuration when you start an interactive session.

```
aws athena start-session \
--region "REGION" \
--work-group "WORKGROUP" \
--engine-configuration '{
    "Classifications": [{
      "Name": "spark-defaults",
      "Properties": {
        "spark.dynamicAllocation.minExecutors": "1",
        "spark.dynamicAllocation.initialExecutors": "2",
        "spark.dynamicAllocation.maxExecutors": "10",
        "spark.dynamicAllocation.executorIdleTimeout": "300"
      }
    }]
  }'
```

You can also specify configuration defaults at the workgroup level by using the `CreateWorkGroup` or `UpdateWorkGroup` API actions. Configuration defaults defined at the workgroup level apply to all sessions started in that workgroup.

To specify default Spark properties from the AWS CLI for a workgroup, use the `engine-configuration` configuration when you create a workgroup:

```
aws athena create-work-group \
  --region "REGION" \
  --name "WORKGROUP_NAME" \
  --configuration '{
    "EngineVersion": {
      "SelectedEngineVersion": "Apache Spark version 3.5"
    },
    "ExecutionRole": "EXECUTION_ROLE",
    "EngineConfiguration": {
      "Classifications": [
        {
          "Name": "spark-defaults",
          "Properties": {
            "spark.dynamicAllocation.minExecutors": "1",
            "spark.dynamicAllocation.initialExecutors": "2",
            "spark.dynamicAllocation.maxExecutors": "10",
            "spark.dynamicAllocation.executorIdleTimeout": "300"
          }
        }
      ]
    }
  }'
```

To modify default Spark properties from the AWS CLI for a workgroup, use the `engine-configuration` configuration when you update the workgroup. The changes apply to new interactive sessions going forward.

```
aws athena update-work-group \
  --region "REGION" \
  --work-group "WORKGROUP_NAME" \
  --configuration-updates '{
    "EngineVersion": {
      "SelectedEngineVersion": "Apache Spark version 3.5"
    },
    "ExecutionRole": "EXECUTION_ROLE",
    "EngineConfiguration": {
      "Classifications": [
        {
          "Name": "spark-defaults",
          "Properties": {
            "spark.dynamicAllocation.minExecutors": "1",
            "spark.dynamicAllocation.initialExecutors": "2",
            "spark.dynamicAllocation.maxExecutors": "12",
            "spark.dynamicAllocation.executorIdleTimeout": "300"
          }
        }
      ]
    }
  }'
```
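The dynamic allocation values in these examples must be mutually consistent: the minimum must not exceed the initial count, which must not exceed the maximum. The following sketch builds the `spark-defaults` classification and checks that ordering before you pass it to the CLI or API; the helper function is illustrative, not part of any AWS SDK:

```python
import json

def dynamic_allocation_properties(min_executors, initial_executors,
                                  max_executors, idle_timeout_seconds=300):
    """Build the spark-defaults Properties map used in the examples above."""
    if not (min_executors <= initial_executors <= max_executors):
        raise ValueError("require min <= initial <= max executors")
    return {
        "spark.dynamicAllocation.minExecutors": str(min_executors),
        "spark.dynamicAllocation.initialExecutors": str(initial_executors),
        "spark.dynamicAllocation.maxExecutors": str(max_executors),
        "spark.dynamicAllocation.executorIdleTimeout": str(idle_timeout_seconds),
    }

engine_configuration = {
    "Classifications": [
        {"Name": "spark-defaults",
         "Properties": dynamic_allocation_properties(1, 2, 10)}
    ]
}
# The CLI expects this structure serialized as a JSON string.
print(json.dumps(engine_configuration))
```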

# Supported data and storage formats
<a name="notebooks-spark-data-and-storage-formats"></a>

The following table shows formats that are supported natively in Athena for Apache Spark.


****  

| **Data format** | **Read** | **Write** | **Write compression** | 
| --- | --- | --- | --- | 
| parquet | yes | yes | none, uncompressed, snappy, gzip | 
| orc | yes | yes | none, snappy, zlib, lzo | 
| json | yes | yes | bzip2, gzip, deflate | 
| csv | yes | yes | bzip2, gzip, deflate | 
| text | yes | yes | none, bzip2, gzip, deflate | 
| binary file | yes | N/A | N/A | 

# Monitor Apache Spark with CloudWatch metrics
<a name="notebooks-spark-metrics"></a>

Athena publishes calculation-related metrics to Amazon CloudWatch when the **[Publish CloudWatch metrics](notebooks-spark-getting-started.md#notebook-gs-metrics)** option for your Spark-enabled workgroup is selected. You can create custom dashboards and set alarms and triggers on these metrics in the CloudWatch console.

Athena publishes the following metric to the CloudWatch console under the `AmazonAthenaForApacheSpark` namespace:
+ `DPUCount` – The number of DPUs consumed during the session to run the calculations.

This metric has the following dimensions:
+ `SessionId` – The ID of the session in which the calculations are submitted.
+ `WorkGroup` – Name of the workgroup.

**To view metrics for Spark-enabled workgroups in the Amazon CloudWatch console**

1. Open the CloudWatch console at [https://console.aws.amazon.com/cloudwatch/](https://console.aws.amazon.com/cloudwatch/).

1. In the navigation pane, choose **Metrics**, **All metrics**.

1. Select the **AmazonAthenaForApacheSpark** namespace.

**To view metrics with the CLI**
+ Do one of the following:
  + To list the metrics for Athena Spark-enabled workgroups, open a command prompt, and use the following command:

    ```
    aws cloudwatch list-metrics --namespace "AmazonAthenaForApacheSpark"
    ```
  + To list all available metrics, use the following command:

    ```
    aws cloudwatch list-metrics
    ```

## List of CloudWatch metrics and dimensions for Apache Spark calculations in Athena
<a name="notebooks-spark-metrics-metrics-table"></a>

If you've enabled CloudWatch metrics in your Spark-enabled Athena workgroup, Athena sends the following metric to CloudWatch per workgroup. The metric uses the `AmazonAthenaForApacheSpark` namespace.


****  

| Metric name | Description | 
| --- | --- | 
| DPUCount  | Number of DPUs (data processing units) consumed during the session to execute the calculations. A DPU is a relative measure of processing power that consists of 4 vCPUs of compute capacity and 16 GB of memory. | 

This metric has the following dimensions.


| Dimension | Description | 
| --- | --- | 
| SessionId |  The ID of the session in which the calculations are submitted.  | 
| WorkGroup |  The name of the workgroup.  | 
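Because a DPU is defined as 4 vCPUs of compute and 16 GB of memory, you can translate the `DPUCount` metric into raw capacity. A minimal sketch (the helper is illustrative):

```python
def dpus_to_capacity(dpu_count):
    """Convert a DPU count to total vCPUs and memory in GB.

    One DPU = 4 vCPUs of compute capacity and 16 GB of memory.
    """
    return {"vcpus": dpu_count * 4, "memory_gb": dpu_count * 16}

# A session that consumed 3 DPUs
print(dpus_to_capacity(3))  # {'vcpus': 12, 'memory_gb': 48}
```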

## List of CloudWatch metrics and dimensions for Athena Spark interactive sessions
<a name="notebooks-spark-metrics-interactive-sessions"></a>

In the release version Apache Spark version 3.5, if you've enabled CloudWatch metrics in your Athena Spark workgroup, Athena sends the following metric to CloudWatch. The metric uses the `AmazonAthenaForApacheSpark` namespace.


****  

| Name | Description | 
| --- | --- | 
| DPUConsumed | The number of DPUs actively consumed by queries in a RUNNING state at a given time in the workgroup. | 

This metric has the following dimensions.


| Dimension | Description | 
| --- | --- | 
| Account |  The AWS account ID.  | 
| WorkGroup |  The name of the workgroup.  | 

# Session level cost attribution
<a name="notebooks-spark-cost-attribution"></a>

Starting with the Apache Spark version 3.5 release, Athena lets you track costs for each session. You can define cost allocation tags when you start a session, and the reported costs for the session appear in Cost Explorer and in AWS Billing cost allocation reports. You can also apply cost allocation tags at the workgroup level; those tags are copied to any sessions started in that workgroup.

## Using session level cost attribution
<a name="notebooks-spark-cost-attribution-usage"></a>

By default, any cost allocation tags specified at the workgroup level are copied to interactive sessions started in that workgroup.

To prevent workgroup tags from being copied when you start an interactive session from the AWS CLI, use the `--no-copy-work-group-tags` option:

```
aws athena start-session \
  --region "REGION" \
  --work-group "WORKGROUP" \
  --tags '[
    {
      "Key": "tag_key",
      "Value": "tag_value"
    }
  ]' \
  --no-copy-work-group-tags
```

To copy workgroup tags when you start an interactive session from the AWS CLI, use the `--copy-work-group-tags` option:

```
aws athena start-session \
  --region "REGION" \
  --work-group "WORKGROUP" \
  --copy-work-group-tags
```

## Considerations and limitations
<a name="notebooks-spark-cost-attribution-considerations"></a>
+ Session tags override workgroup tags that have the same keys.
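The override behavior can be pictured as a simple map merge in which session tags win on key collisions. The following sketch of the effective-tag computation is illustrative only; the service performs this merge, not your client:

```python
def effective_session_tags(workgroup_tags, session_tags,
                           copy_workgroup_tags=True):
    """Compute the tags attributed to an interactive session.

    Workgroup tags are copied by default (unless copying is disabled);
    session tags with the same key override them.
    """
    merged = dict(workgroup_tags) if copy_workgroup_tags else {}
    merged.update(session_tags)
    return merged

workgroup = {"team": "analytics", "env": "prod"}
session = {"env": "dev", "job": "nightly-etl"}
print(effective_session_tags(workgroup, session))
# {'team': 'analytics', 'env': 'dev', 'job': 'nightly-etl'}
```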

# Logging and monitoring for Apache Spark sessions
<a name="notebooks-spark-logging-monitoring"></a>

Starting with the Apache Spark version 3.5 release, you can choose managed logging, Amazon S3 logging, or CloudWatch logging.

The following table lists the log locations and application UI availability that you can expect with managed logging and Amazon S3 logging.


****  

| Option | Event logs | Container logs | Application UI | 
| --- | --- | --- | --- | 
| Managed logging (default) | Stored in managed S3 bucket | Stored in managed S3 bucket | Supported | 
| Both managed logging and S3 bucket | Stored in both places | Stored in S3 bucket | Supported | 
| Amazon S3 bucket | Stored in S3 bucket | Stored in S3 bucket | Not supported¹ | 

¹ We suggest that you keep the **Managed logging** option selected. Otherwise, you can't use the built-in application UIs.

## Managed logging
<a name="notebooks-spark-logging-monitoring-managed"></a>

By default, Athena Spark workgroups store application logs securely in service-managed S3 buckets for a maximum of 30 days.

You can optionally provide a KMS key (key ID, ARN, alias, or alias ARN) that the service uses to encrypt the managed logs.

```
aws athena start-session \
  --work-group "WORKGROUP" \
  --monitoring-configuration '{
    "ManagedLoggingConfiguration": {
        "Enabled": true,
        "KmsKey": "KMS_KEY"
    }
  }' \
  --engine-configuration ''
```

**Note**  
If you turn off managed logging, Athena can't troubleshoot your sessions on your behalf. For example, you won't be able to access the Spark UI from Amazon SageMaker AI Studio notebooks or by using the `GetResourceDashboard` API.

To turn off this option from the AWS CLI, use the `ManagedLoggingConfiguration` configuration when you start an interactive session.

```
aws athena start-session \
  --work-group "WORKGROUP" \
  --monitoring-configuration '{
    "ManagedLoggingConfiguration": {
      "Enabled": false
    }
  }' \
  --engine-configuration ''
```

### Required permissions for managed logging
<a name="notebooks-spark-logging-monitoring-managed-permissions"></a>

If you provide a KMS key, include the following permissions in the permissions policy for the execution role.

```
{
    "Action": [
        "kms:Encrypt",
        "kms:Decrypt",
        "kms:ReEncrypt*",
        "kms:GenerateDataKey*",
        "kms:DescribeKey"
    ],
    "Resource": "*",
    "Effect": "Allow"
}
```

## Amazon S3 logging
<a name="notebooks-spark-logging-monitoring-s3"></a>

You can configure log delivery to Amazon S3 buckets.

To enable S3 log delivery from the AWS CLI, use the `S3LoggingConfiguration` configuration when you start an interactive session.

```
aws athena start-session \
  --work-group "WORKGROUP" \
  --monitoring-configuration '{
    "S3LoggingConfiguration": {
      "Enabled": true,
      "LogLocation": "s3://bucket/"
    }
  }' \
  --engine-configuration ''
```

You can optionally provide a KMS key (key ID, ARN, alias, or alias ARN) that the service uses to encrypt the S3 logs.

```
aws athena start-session \
  --work-group "WORKGROUP" \
  --monitoring-configuration '{
    "S3LoggingConfiguration": {
      "Enabled": true,
      "LogLocation": "s3://bucket/",
      "KmsKey": "KMS_KEY"
    }
  }' \
  --engine-configuration ''
```

### Required permissions for log delivery to Amazon S3
<a name="notebooks-spark-logging-monitoring-s3-permissions"></a>

Before your sessions can deliver logs to Amazon S3 buckets, include the following permissions in the permissions policy for the execution role.

```
{
    "Action": "s3:*",
    "Resource": "*",
    "Effect": "Allow"
}
```

If you provide a KMS key, also include the following permissions in the permissions policy for the execution role.

```
{
    "Action": [
        "kms:Encrypt",
        "kms:Decrypt",
        "kms:ReEncrypt*",
        "kms:GenerateDataKey*",
        "kms:DescribeKey"
    ],
    "Resource": "*",
    "Effect": "Allow"
}
```

If the KMS key and the bucket are not in the same account, the KMS key policy must allow the Amazon S3 service principal.

```
{
  "Effect": "Allow",
  "Principal": { "Service": "s3.amazonaws.com" },
  "Action": [
    "kms:Encrypt",
    "kms:Decrypt",
    "kms:GenerateDataKey*"
  ],
  "Resource": "*",
  "Condition": {
    "StringEquals": {
      "aws:SourceAccount": "ACCOUNT_HAVING_KMS_KEY"
    }
  }
}
```

## CloudWatch logging
<a name="notebooks-spark-logging-monitoring-cloudwatch"></a>

You can configure log delivery to CloudWatch log groups.

To enable CloudWatch log delivery from the AWS CLI, use the `CloudWatchLoggingConfiguration` configuration when you start an interactive session.

```
aws athena start-session \
  --work-group "WORKGROUP" \
  --monitoring-configuration '{
    "CloudWatchLoggingConfiguration": {
      "Enabled": true,
      "LogGroup": "/aws/athena/sessions/${workgroup}",
      "LogStreamNamePrefix": "session-"
    }
  }' \
  --engine-configuration ''
```

All logs will be delivered by default, but you can optionally specify which log types to include.

```
aws athena start-session \
  --work-group "WORKGROUP" \
  --monitoring-configuration '{
    "CloudWatchLoggingConfiguration": {
      "Enabled": true,
      "LogGroup": "/aws/athena/sessions/${workgroup}",
      "LogStreamNamePrefix": "session-",
      "LogTypes": {
          "SPARK_DRIVER": [
              "STDOUT",
              "STDERR"
          ],
          "SPARK_EXECUTOR": [
              "STDOUT",
              "STDERR"
          ]
       }
    }
  }' \
  --engine-configuration ''
```

### Required permissions for log delivery to CloudWatch
<a name="notebooks-spark-logging-monitoring-cloudwatch-permissions"></a>

Before your sessions can deliver logs to CloudWatch log groups, include the following permissions in the permissions policy for the execution role.

```
{
    "Action": [
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:PutLogEvents",
        "logs:DescribeLogGroups",
        "logs:DescribeLogStreams"
    ],
    "Resource": "*",
    "Effect": "Allow"
}
```

If you use a KMS key, add the following permission to the KMS key resource policy.

```
{
  "Effect": "Allow",
  "Principal": {
    "Service": "logs.<region>.amazonaws.com"
  },
  "Action": [
    "kms:Encrypt",
    "kms:Decrypt",
    "kms:ReEncrypt*",
    "kms:GenerateDataKey*",
    "kms:DescribeKey"
  ],
  "Resource": "*"
}
```

## Configuring logging defaults at the workgroup level
<a name="notebooks-spark-logging-monitoring-workgroup-defaults"></a>

You can also specify default logging options at the workgroup level.

To specify default logging options from the AWS CLI for a workgroup, use the `monitoring-configuration` configuration when you create a workgroup:

```
aws athena create-work-group \
  --region "REGION" \
  --name "WORKGROUP_NAME" \
  --monitoring-configuration '{
      "CloudWatchLoggingConfiguration": {
          "Enabled": true,
          "LogGroup": "/aws/athena/sessions/${workgroup}",
          "LogStreamNamePrefix": "session-",
          "LogTypes": {
              "SPARK_DRIVER": [
                  "STDOUT",
                  "STDERR"
              ],
              "SPARK_EXECUTOR": [
                  "STDOUT",
                  "STDERR"
              ]
          }
        },
        "ManagedLoggingConfiguration": {
            "Enabled": true,
            "KmsKey": "KMS_KEY"
        },
        "S3LoggingConfiguration": {
            "Enabled": true,
            "KmsKey": "KMS_KEY",
            "LogLocation": "s3://bucket/",
            "LogTypes": {
                "SPARK_DRIVER": [
                    "STDOUT",
                    "STDERR"
                ],
                "SPARK_EXECUTOR": [
                    "STDOUT",
                    "STDERR"
                ]
            }
        }
    }'
```

To modify default logging options from the AWS CLI for a workgroup, use the `monitoring-configuration` configuration when you update the workgroup. The changes apply to new interactive sessions going forward.

```
aws athena update-work-group \
  --region "REGION" \
  --work-group "WORKGROUP_NAME" \
  --monitoring-configuration '{
      "CloudWatchLoggingConfiguration": {
          "Enabled": true,
          "LogGroup": "/aws/athena/sessions/${workgroup}",
          "LogStreamNamePrefix": "session-",
          "LogTypes": {
              "SPARK_DRIVER": [
                  "STDOUT",
                  "STDERR"
              ],
              "SPARK_EXECUTOR": [
                  "STDOUT",
                  "STDERR"
              ]
          }
        },
        "ManagedLoggingConfiguration": {
            "Enabled": true,
            "KmsKey": "KMS_KEY"
        },
        "S3LoggingConfiguration": {
            "Enabled": true,
            "KmsKey": "KMS_KEY",
            "LogLocation": "s3://bucket/",
            "LogTypes": {
                "SPARK_DRIVER": [
                    "STDOUT",
                    "STDERR"
                ],
                "SPARK_EXECUTOR": [
                    "STDOUT",
                    "STDERR"
                ]
            }
        }
    }'
```

# Accessing the Spark UI
<a name="notebooks-spark-ui-access"></a>

The Apache Spark UIs present visual interfaces with detailed information about your running and completed Spark jobs. You can monitor and debug interactive sessions in Athena Spark using native Apache Spark UIs, where you can dive into job-specific metrics and information about event timelines, stages, tasks, and executors for each Spark job.

## Accessing the Spark UI
<a name="notebooks-spark-ui-access-methods"></a>

After you start an Athena Spark interactive session, you can view the real-time Spark UI for running sessions from Amazon SageMaker AI Unified Studio notebooks, or request a secure URL by using the `GetResourceDashboard` API. For completed sessions, you can view the Spark History Server from Amazon SageMaker AI Unified Studio notebooks, the Amazon Athena console, or the same API.

```
aws athena get-resource-dashboard \
  --region "REGION" \
  --session-id "SESSION_ID"
```

## Required permissions for accessing the Spark UI
<a name="notebooks-spark-ui-access-permissions"></a>

Before you can access the Spark UI, include the following permissions in the permissions policy for the user or role.

```
{
    "Action": "athena:GetResourceDashboard",
    "Resource": "WORKGROUP",
    "Effect": "Allow"
}
```

# Spark Connect support
<a name="notebooks-spark-connect"></a>

Spark Connect is a client-server architecture for Apache Spark that decouples the application client from the Spark cluster's driver process, allowing remote connectivity to Spark from supported clients. Spark Connect also enables interactive debugging during development directly from your favorite IDEs/clients.

Starting with the Apache Spark version 3.5 release, Athena supports Spark Connect through an AWS endpoint that you access by using the `GetSessionEndpoint` API.

## API/CLI examples (GetSessionEndpoint)
<a name="notebooks-spark-connect-api-examples"></a>

You can use the `GetSessionEndpoint` API to get the Spark Connect endpoint for an interactive session.

```
aws athena get-session-endpoint \
  --region "REGION" \
  --session-id "SESSION_ID"
```

This API returns the Spark Connect endpoint URL for that session.

```
{
  "EndpointUrl": "ENDPOINT_URL",
  "AuthToken": "AUTH_TOKEN",
  "AuthTokenExpirationTime": "AUTH_TOKEN_EXPIRY_TIME"
}
```
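The `EndpointUrl` and `AuthToken` fields are combined into a Spark Connect remote URL, as shown in the self-managed client script later in this section. The following sketch isolates that string transformation; the function name and sample values are illustrative:

```python
def spark_connect_remote_url(endpoint_url, auth_token):
    """Turn a GetSessionEndpoint response into a Spark Connect remote URL.

    The https endpoint is rewritten to the sc:// scheme on port 443 with
    SSL enabled, and the auth token is passed as a proxy-auth header.
    """
    base = endpoint_url.replace("https", "sc") + ":443/;use_ssl=true;"
    return f"{base}x-aws-proxy-auth={auth_token}"

url = spark_connect_remote_url("https://example.athena.amazonaws.com", "TOKEN123")
print(url)
# sc://example.athena.amazonaws.com:443/;use_ssl=true;x-aws-proxy-auth=TOKEN123
```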

## Connecting from self-managed clients
<a name="notebooks-spark-connect-self-managed"></a>

You can connect to an Athena Spark interactive session from self-managed clients.

### Prerequisites
<a name="notebooks-spark-connect-prerequisites"></a>

Install PySpark 3.5.6 with Spark Connect support and the AWS SDK for Python (Boto3).

```
pip install --user pyspark[connect]==3.5.6
pip install --user boto3
```

The following is a sample Python script to send requests directly to the session endpoint:

```
import boto3
import time
from pyspark.sql import SparkSession

client = boto3.client('athena', region_name='<REGION>')

# start the session
response = client.start_session(
    WorkGroup='<WORKGROUP_NAME>',
    EngineConfiguration={}
)
session_id = response['SessionId']

# wait for the session endpoint to be ready
time.sleep(5)
response = client.get_session_endpoint(SessionId=session_id)

# construct the authenticated remote url
authtoken=response['AuthToken']
endpoint_url=response['EndpointUrl']
endpoint_url=endpoint_url.replace("https", "sc")+":443/;use_ssl=true;"
url_with_headers = (
    f"{endpoint_url}"
    f"x-aws-proxy-auth={authtoken}"
)

# start the Spark session
spark = SparkSession.builder\
    .remote(url_with_headers)\
    .getOrCreate()
 
print(spark.version)

#
# Enter your spark code here
#

# stop the Spark session
spark.stop()
```

The following is a sample Python script to access the live Spark UI or Spark History Server for a session:

```
import boto3

Region='<REGION>'
WorkGroupName='<WORKGROUP_NAME>'
SessionId='<SESSION_ID>'
Partition='aws'
Account='<ACCOUNT_NUMBER>'

client = boto3.client('athena', region_name=Region)

SessionARN=f"arn:{Partition}:athena:{Region}:{Account}:workgroup/{WorkGroupName}/session/{SessionId}"

# invoke the API to get the live UI/persistence UI for a session
response = client.get_resource_dashboard(
    ResourceARN=SessionARN
)
print(response['Url'])
```

# Enable requester pays Amazon S3 buckets in Athena for Spark
<a name="notebooks-spark-requester-pays"></a>

When an Amazon S3 bucket is configured as requester pays, the account of the user running the query is charged for data access and data transfer fees associated with the query. For more information, see [Using Requester Pays buckets for storage transfers and usage](https://docs.aws.amazon.com/AmazonS3/latest/userguide/RequesterPaysBuckets.html) in the *Amazon S3 User Guide*.

In Athena for Spark, requester pays buckets are enabled per session, not per workgroup. At a high level, enabling requester pays buckets includes the following steps:

1. In the Amazon S3 console, enable requester pays on the properties for the bucket and add a bucket policy to specify access.

1. In the IAM console, create an IAM policy to allow access to the bucket, and then attach the policy to the IAM role that will be used to access the requester pays bucket.

1. In Athena for Spark, add a session property to enable the requester pays feature.

## Step 1: Enable requester pays on an Amazon S3 bucket and add a bucket policy
<a name="notebooks-spark-requester-pays-enable-requester-pays-on-an-amazon-s3-bucket"></a>

**To enable requester pays on an Amazon S3 bucket**

1. Open the Amazon S3 console at [https://console.aws.amazon.com/s3/](https://console.aws.amazon.com/s3/).

1. In the list of buckets, choose the link for the bucket that you want to enable requester pays for.

1. On the bucket page, choose the **Properties** tab.

1. Scroll down to the **Requester pays** section, and then choose **Edit**.

1. On the **Edit requester pays** page, choose **Enable**, and then choose **Save changes**.

1. Choose the **Permissions** tab.

1. In the **Bucket policy** section, choose **Edit**.

1. On the **Edit bucket policy** page, apply the bucket policy that you want to the source bucket. The following example policy grants access to the root user of a specific AWS account (`"AWS": "arn:aws:iam::111122223333:root"`), but your access can be broader or more granular. For example, you might want to specify only a specific IAM role in another account.

------
#### [ JSON ]

****  

   ```
   {
       "Version": "2012-10-17",
       "Statement": [
           {
               "Sid": "Statement1",
               "Effect": "Allow",
               "Principal": { "AWS": "arn:aws:iam::111122223333:root" },
               "Action": "s3:*",
               "Resource": [
                   "arn:aws:s3:::111122223333-us-east-1-amzn-s3-demo-bucket",
                   "arn:aws:s3:::555555555555-us-east-1-amzn-s3-demo-bucket/*"
               ]
           }
       ]
   }
   ```

------

## Step 2: Create an IAM policy and attach it to an IAM role
<a name="notebooks-spark-requester-pays-create-an-iam-policy-and-attach-it-to-an-iam-role"></a>

Next, you create an IAM policy to allow access to the bucket. Then you attach the policy to the role that will be used to access the requester pays bucket.

**To create an IAM policy for the requester pays bucket and attach the policy to a role**

1. Open the IAM console at [https://console.aws.amazon.com/iam/](https://console.aws.amazon.com/iam/).

1. In the IAM console navigation pane, choose **Policies**.

1. Choose **Create policy**.

1. Choose **JSON**.

1. In the **Policy editor**, add a policy like the following:

------
#### [ JSON ]

****  

   ```
   {
       "Version": "2012-10-17",
       "Statement": [
           {
               "Action": [ "s3:*" ],
               "Effect": "Allow",
               "Resource": [
                   "arn:aws:s3:::111122223333-us-east-1-amzn-s3-demo-bucket",
                   "arn:aws:s3:::111122223333-us-east-1-amzn-s3-demo-bucket/*"
               ]
           }
       ]
   }
   ```

------

1. Choose **Next**.

1. On the **Review and create** page, enter a name for the policy and an optional description, and then choose **Create policy**.

1. In the navigation pane, choose **Roles**.

1. On the **Roles** page, find the role that you want to use, and then choose the role name link.

1. In the **Permissions policies** section, choose **Add permissions**, **Attach policies**.

1. In the **Other permissions policies** section, select the check box for the policy that you created, and then choose **Add permissions**.

## Step 3: Add an Athena for Spark session property
<a name="notebooks-spark-requester-pays-add-a-session-property"></a>

After you have configured the Amazon S3 bucket and associated permissions for requester pays, you can enable the feature in an Athena for Spark session.

**To enable requester pays buckets in an Athena for Spark session**

1. In the notebook editor, from the **Session** menu on the upper right, choose **Edit session**.

1. Expand **Spark properties**. 

1. Choose **Edit in JSON**. 

1. In the JSON text editor, enter the following:

   ```
   {
     "spark.hadoop.fs.s3.useRequesterPaysHeader":"true"
   }
   ```

1. Choose **Save**.
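
After you save the property, reads from the requester pays bucket in the same session work as they would for any other bucket. The following sketch assumes an active Athena for Spark session (the `spark` object is provided by the session) and uses a hypothetical bucket name and prefix:

```python
# Read CSV data from a requester pays bucket (bucket name and prefix are hypothetical).
# The useRequesterPaysHeader session property set above causes Spark to include the
# requester pays header on its Amazon S3 requests.
df = spark.read.option("header", "true").csv("s3://amzn-s3-demo-requester-pays-bucket/data/")
df.show(5)
```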

# Using Lake Formation with Athena Spark workgroups
<a name="notebooks-spark-lakeformation"></a>

With the release version Apache Spark version 3.5, you can use AWS Lake Formation with the AWS Glue Data Catalog when the session execution role has full table permissions. This capability allows you to read from and write to tables that are protected by Lake Formation from your Athena Spark interactive sessions. See the following sections to learn more about Lake Formation and how to use it with Athena Spark.

## Step 1: Enable Full Table Access in Lake Formation
<a name="notebooks-spark-lakeformation-enable-fta"></a>

To use Full Table Access (FTA) mode, you must allow Athena Spark to access data without IAM session tag validation in AWS Lake Formation. To enable this mode, follow the steps in [Application integration for full table access](https://docs.aws.amazon.com//lake-formation/latest/dg/fta-app-integration.html).

### Step 1.1: Register data locations in Lake Formation using user defined role
<a name="notebooks-spark-lakeformation-register-locations"></a>

You must use a user-defined role to register data locations in AWS Lake Formation. See [Requirements for roles used to register locations](https://docs.aws.amazon.com//lake-formation/latest/dg/registration-role.html) for details.

## Step 2: Set up IAM permissions for the session execution role
<a name="notebooks-spark-lakeformation-iam-permissions"></a>

For read or write access to underlying data, in addition to Lake Formation permissions, the execution role needs the `lakeformation:GetDataAccess` IAM permission. With this permission, Lake Formation grants the request for temporary credentials to access the data.

The execution role also needs IAM permissions to access scripts in Amazon S3, upload logs to Amazon S3, call AWS Glue APIs, and access Lake Formation.

### Step 2.1: Configure Lake Formation permissions
<a name="notebooks-spark-lakeformation-configure-permissions"></a>
+ Spark jobs that read data from Amazon S3 require the Lake Formation `SELECT` permission.
+ Spark jobs that write or delete data in Amazon S3 require the Lake Formation `ALL (SUPER)` permission.
+ Spark jobs that interact with the AWS Glue Data Catalog require the `DESCRIBE`, `ALTER`, and `DROP` permissions as appropriate.
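
For example, you can grant the `SELECT` permission on a table to the session execution role with the AWS CLI. The role ARN, database name, and table name below are placeholders:

```shell
# Grant SELECT on a Lake Formation managed table to the session execution role.
# All identifiers here are placeholders; substitute your own values.
aws lakeformation grant-permissions \
    --principal DataLakePrincipalIdentifier=arn:aws:iam::111122223333:role/AthenaSparkExecutionRole \
    --permissions "SELECT" \
    --resource '{"Table": {"DatabaseName": "mydatabase", "Name": "cloudfront_logs"}}'
```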

## Step 3: Initialize a Spark session for Full Table Access using Lake Formation
<a name="notebooks-spark-lakeformation-initialize-session"></a>

### Prerequisites
<a name="notebooks-spark-lakeformation-prerequisites"></a>

AWS Glue Data Catalog must be configured as a metastore to access Lake Formation tables.

Set the following settings to configure AWS Glue catalog as a metastore:

```
{
  "spark.hadoop.glue.catalogid": "ACCOUNT_ID",
  "spark.hadoop.hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory",
  "spark.hadoop.hive.metastore.glue.catalogid": "ACCOUNT_ID",
  "spark.sql.catalogImplementation": "hive"
}
```
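
After a session starts with these settings, you can confirm that Spark is connected to the AWS Glue Data Catalog by listing its databases. Run this in an active Athena for Spark session:

```python
# Lists the databases defined in the AWS Glue Data Catalog for the configured account
spark.sql("SHOW DATABASES").show()
```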

To access tables registered with AWS Lake Formation, set the following configurations during Spark initialization so that Spark uses AWS Lake Formation credentials.

### Hive
<a name="notebooks-spark-lakeformation-hive-config"></a>

```
{
  "spark.hadoop.fs.s3.credentialsResolverClass": "com.amazonaws.glue.accesscontrol.AWSLakeFormationCredentialResolver",
  "spark.hadoop.fs.s3.useDirectoryHeaderAsFolderObject": "true",
  "spark.hadoop.fs.s3.folderObject.autoAction.disabled": "true",
  "spark.sql.catalog.skipLocationValidationOnCreateTable.enabled": "true",
  "spark.sql.catalog.createDirectoryAfterTable.enabled": "true",
  "spark.sql.catalog.dropDirectoryBeforeTable.enabled": "true"
}
```

### Apache Iceberg
<a name="notebooks-spark-lakeformation-iceberg-config"></a>

```
{
  "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
  "spark.sql.catalog.spark_catalog": "org.apache.iceberg.spark.SparkSessionCatalog",
  "spark.sql.catalog.spark_catalog.warehouse": "s3://your-bucket/warehouse/",
  "spark.sql.catalog.spark_catalog.client.region": "REGION",
  "spark.sql.catalog.spark_catalog.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog",
  "spark.sql.catalog.spark_catalog.glue.account-id": "ACCOUNT_ID",
  "spark.sql.catalog.spark_catalog.glue.lakeformation-enabled": "true"
}
```
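
With the Iceberg catalog configured, you can query Lake Formation protected Iceberg tables through the session catalog. The following sketch assumes an active session; the database and table names are hypothetical:

```python
# Read from a Lake Formation protected Iceberg table (names are hypothetical).
# Reads require the Lake Formation SELECT permission on the table.
df = spark.sql("SELECT * FROM spark_catalog.mydatabase.my_iceberg_table LIMIT 10")
df.show()

# Writes require the Lake Formation ALL (SUPER) permission on the target table.
df.writeTo("spark_catalog.mydatabase.my_iceberg_table_copy").createOrReplace()
```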

### Amazon S3 Tables
<a name="notebooks-spark-lakeformation-s3tables-config"></a>

```
{
  "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
  "spark.sql.catalog.{catalogName}": "org.apache.iceberg.spark.SparkCatalog",
  "spark.sql.catalog.{catalogName}.warehouse": "arn:aws:s3tables:{region}:{accountId}:bucket/{bucketName}",
  "spark.sql.catalog.{catalogName}.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog",
  "spark.sql.catalog.{catalogName}.glue.id": "{accountId}:s3tablescatalog/{bucketName}",
  "spark.sql.catalog.{catalogName}.glue.lakeformation-enabled": "true",
  "spark.sql.catalog.{catalogName}.client.region": "REGION",
  "spark.sql.catalog.{catalogName}.glue.account-id": "ACCOUNT_ID"
}
```

### Delta Lake
<a name="notebooks-spark-lakeformation-deltalake-config"></a>

```
{
  "spark.sql.extensions": "io.delta.sql.DeltaSparkSessionExtension",
  "spark.sql.catalog.spark_catalog": "org.apache.spark.sql.delta.catalog.DeltaCatalog",
  "spark.hadoop.fs.s3.credentialsResolverClass": "com.amazonaws.glue.accesscontrol.AWSLakeFormationCredentialResolver",
  "spark.hadoop.fs.s3.useDirectoryHeaderAsFolderObject": "true",
  "spark.hadoop.fs.s3.folderObject.autoAction.disabled": "true",
  "spark.sql.catalog.skipLocationValidationOnCreateTable.enabled": "true",
  "spark.sql.catalog.createDirectoryAfterTable.enabled": "true",
  "spark.sql.catalog.dropDirectoryBeforeTable.enabled": "true"
}
```

## Considerations and Limitations
<a name="notebooks-spark-lakeformation-considerations"></a>
+ Full Table Access is supported for Hive, Iceberg, Amazon S3 Tables, and Delta tables. Hudi tables do not support Full Table Access.
+ To add new catalogs to an active session, use `spark.conf.set` with the new catalog configurations.
+ Catalog configurations are immutable. To update a catalog configuration, create a new catalog using `spark.conf.set`.
+ Add only the catalogs that you need to the Spark session.
+ To change the default catalog, use `spark.catalog.setCurrentCatalog("s3tablesbucket")`.
+ If your catalog name contains special characters such as `-`, enclose the name in backticks in your query, as in the following example:

  ```
  SELECT sales_amount as nums FROM `my-s3-tables-bucket`.`s3namespace`.`daily_sales` LIMIT 100
  ```
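
If you build queries programmatically, a small helper can apply this backtick quoting automatically. This is an illustrative sketch, not part of any AWS SDK:

```python
def quote_ident(name: str) -> str:
    """Backtick-quote a Spark SQL identifier if it contains special characters."""
    return name if name.isidentifier() else f"`{name}`"

def qualified_name(*parts: str) -> str:
    """Join catalog, namespace, and table parts into a safely quoted Spark SQL name."""
    return ".".join(quote_ident(p) for p in parts)

# The catalog name contains hyphens, so it gets backticks; the other parts do not.
print(qualified_name("my-s3-tables-bucket", "s3namespace", "daily_sales"))
```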

# Enable Apache Spark encryption
<a name="notebooks-spark-encryption"></a>

You can enable Apache Spark encryption on Athena. Doing so encrypts data in transit between Spark nodes and also encrypts data at rest stored locally by Spark. To enhance security for this data, Athena uses the following encryption configuration:

```
spark.io.encryption.keySizeBits="256" 
spark.io.encryption.keygen.algorithm="HmacSHA384"
```

To enable Spark encryption, you can use the Athena console, the AWS CLI, or the Athena API.

## Use the Athena console to enable Spark encryption in a new notebook
<a name="notebooks-spark-encryption-athena-console-new-notebook"></a>

**To create a new notebook that has Spark encryption enabled**

1. Open the Athena console at [https://console.aws.amazon.com/athena/](https://console.aws.amazon.com/athena/home).

1. If the console navigation pane is not visible, choose the expansion menu on the left.

1. Do one of the following:
   + In **Notebook explorer**, choose **Create notebook**.
   + In **Notebook editor**, choose **Create notebook**, or choose the plus icon to add a notebook.

1. For **Notebook name**, enter a name for the notebook.

1. Expand the **Spark properties** option.

1. Select **Turn on Spark encryption**.

1. Choose **Create**.

The notebook session that you create is encrypted. Use the new notebook as you normally would. When you later launch new sessions that use the notebook, the new sessions will also be encrypted.

## Use the Athena console to enable Spark encryption for an existing notebook
<a name="notebooks-spark-encryption-athena-console-existing-notebook"></a>

You can also use the Athena console to enable Spark encryption for an existing notebook.

**To enable encryption for an existing notebook**

1. [Open a new session](notebooks-spark-managing.md#opening-a-previously-created-notebook) for a previously created notebook.

1. In the notebook editor, from the **Session** menu on the upper right, choose **Edit session**.

1. In the **Edit session details** dialog box, expand **Spark properties**.

1. Select **Turn on Spark encryption**.

1. Choose **Save**.

The console launches a new session that has encryption enabled. Later sessions that you create for this notebook will also have encryption enabled.

## Use the AWS CLI to enable Spark encryption
<a name="notebooks-spark-encryption-cli"></a>

You can use the AWS CLI to enable encryption when you launch a session by specifying the appropriate Spark properties.

**To use the AWS CLI to enable Spark encryption**

1. Use a command like the following to create an engine configuration JSON object that specifies Spark encryption properties.

   ```
   ENGINE_CONFIGURATION_JSON=$( 
     cat <<EOF 
   { 
       "CoordinatorDpuSize": 1, 
       "MaxConcurrentDpus": 20, 
       "DefaultExecutorDpuSize": 1, 
       "SparkProperties": { 
         "spark.authenticate": "true", 
         "spark.io.encryption.enabled": "true", 
         "spark.network.crypto.enabled": "true" 
       } 
   } 
   EOF 
   )
   ```

1. In the AWS CLI, use the `athena start-session` command and pass in the JSON object that you created to the `--engine-configuration` argument, as in the following example:

   ```
   aws athena start-session \ 
      --region "region" \ 
      --work-group "your-work-group" \ 
      --engine-configuration "$ENGINE_CONFIGURATION_JSON"
   ```

## Use the Athena API to enable Spark encryption
<a name="notebooks-spark-encryption-api"></a>

To enable Spark encryption with the Athena API, use the [StartSession](https://docs.aws.amazon.com/athena/latest/APIReference/API_StartSession.html) action and its [EngineConfiguration](https://docs.aws.amazon.com/athena/latest/APIReference/API_EngineConfiguration.html) `SparkProperties` parameter to specify the encryption configuration in your `StartSession` request.
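
For example, with the AWS SDK for Python (Boto3), you can pass the same engine configuration shown in the CLI example to `start_session`. The workgroup and Region values below are placeholders, and the Boto3 call requires AWS credentials:

```python
import json

# Engine configuration with Spark encryption enabled (mirrors the CLI example)
engine_configuration = {
    "CoordinatorDpuSize": 1,
    "MaxConcurrentDpus": 20,
    "DefaultExecutorDpuSize": 1,
    "SparkProperties": {
        "spark.authenticate": "true",
        "spark.io.encryption.enabled": "true",
        "spark.network.crypto.enabled": "true",
    },
}

def start_encrypted_session(workgroup: str, region: str) -> dict:
    """Start an encrypted Athena for Spark session (requires AWS credentials)."""
    import boto3  # deferred import so the configuration above is usable without boto3
    client = boto3.client("athena", region_name=region)
    return client.start_session(
        WorkGroup=workgroup,
        EngineConfiguration=engine_configuration,
    )

print(json.dumps(engine_configuration, indent=2))
```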

# Configure cross-account AWS Glue access in Athena for Spark
<a name="spark-notebooks-cross-account-glue"></a>

This topic shows how consumer account *666666666666* and owner account *999999999999* can be configured for cross-account AWS Glue access. When the accounts are configured, the consumer account can run queries from Athena for Spark on the owner's AWS Glue databases and tables.

## Step 1: In AWS Glue, provide access to consumer roles
<a name="spark-notebooks-cross-account-glue-in-aws-glue-provide-access-to-the-consumer-account"></a>

In AWS Glue, the owner creates a policy that provides the consumer's roles access to the owner's AWS Glue data catalog.

**To add an AWS Glue policy that allows a consumer role to access the owner's data catalog**

1. Using the catalog owner's account, sign in to the AWS Management Console.

1. Open the AWS Glue console at [https://console.aws.amazon.com/glue/](https://console.aws.amazon.com/glue/).

1. In the navigation pane, expand **Data Catalog**, and then choose **Catalog settings**.

1. On the **Data catalog settings** page, in the **Permissions** section, add a policy like the following. This policy provides roles for the consumer account *666666666666* access to the data catalog in the owner account *999999999999*.

------
#### [ JSON ]

****  

   ```
   {
       "Version": "2012-10-17",
       "Statement": [
           {
               "Sid": "Cataloguers",
               "Effect": "Allow",
               "Principal": {
                   "AWS": [
                       "arn:aws:iam::666666666666:role/Admin",
                       "arn:aws:iam::666666666666:role/AWSAthenaSparkExecutionRole"
                   ]
               },
               "Action": "glue:*",
               "Resource": [
                   "arn:aws:glue:us-east-1:999999999999:catalog",
                   "arn:aws:glue:us-east-1:999999999999:database/*",
                   "arn:aws:glue:us-east-1:999999999999:table/*"
               ]
           }
       ]
   }
   ```

------

## Step 2: Configure the consumer account for access
<a name="spark-notebooks-cross-account-glue-configure-the-consumer-account-for-access"></a>

In the consumer account, create a policy to allow access to the owner's AWS Glue Data Catalog, databases, and tables, and attach the policy to a role. The following example uses consumer account *666666666666*.

**To create an AWS Glue policy for access to the owner's AWS Glue Data Catalog**

1. Using the consumer account, sign in to the AWS Management Console.

1. Open the IAM console at [https://console.aws.amazon.com/iam/](https://console.aws.amazon.com/iam/).

1. In the navigation pane, expand **Access management**, and then choose **Policies**.

1. Choose **Create policy**.

1. On the **Specify permissions** page, choose **JSON**.

1. In the **Policy editor**, enter a JSON statement like the following that allows AWS Glue actions on the owner account's data catalog.

------
#### [ JSON ]

****  

   ```
   {
       "Version": "2012-10-17",
       "Statement": [
           {
               "Effect": "Allow",
               "Action": "glue:*",
               "Resource": [
                   "arn:aws:glue:us-east-1:999999999999:catalog",
                   "arn:aws:glue:us-east-1:999999999999:database/*",
                   "arn:aws:glue:us-east-1:999999999999:table/*"
               ]
           }
       ]
   }
   ```

------

1. Choose **Next**.

1. On the **Review and create** page, for **Policy name**, enter a name for the policy.

1. Choose **Create policy**.

Next, you use the IAM console in the consumer account to attach the policy that you just created to the IAM role or roles that the consumer account will use to access the owner's data catalog.

**To attach the AWS Glue policy to the roles in the consumer account**

1. In the consumer account IAM console navigation pane, choose **Roles**.

1. On the **Roles** page, find the role that you want to attach the policy to.

1. Choose **Add permissions**, and then choose **Attach policies**.

1. Find the policy that you just created.

1. Select the check box for the policy, and then choose **Add permissions**.

1. Repeat the steps to add the policy to other roles that you want to use.

## Step 3: Configure a session and create a query
<a name="spark-notebooks-cross-account-glue-configure-a-session-and-create-a-query"></a>

In Athena Spark, in the requester account, using the role specified, create a session to test access by [creating a notebook](notebooks-spark-getting-started.md#notebooks-spark-getting-started-creating-your-own-notebook) or [editing a current session](notebooks-spark-getting-started.md#notebooks-spark-getting-started-editing-session-details). When you [configure the session properties](notebooks-spark-custom-jar-cfg.md#notebooks-spark-custom-jar-cfg-console), specify one of the following:
+ **The AWS Glue catalog separator** – With this approach, you include the owner account ID in your queries. Use this method if you are going to use the session to query data catalogs from different owners.
+ **The AWS Glue catalog ID** – With this approach, you query the database directly. This method is more convenient if you are going to use the session to query only a single owner's data catalog.

### Use the AWS Glue catalog separator
<a name="spark-notebooks-cross-account-glue-using-the-glue-catalog-separator-approach"></a>

When you edit the session properties, add the following:

```
{ 
    "spark.hadoop.aws.glue.catalog.separator": "/" 
}
```

When you run a query in a cell, use syntax like that in the following example. Note that in the `FROM` clause, the catalog ID and separator are required before the database name.

```
df = spark.sql('SELECT requestip, uri, method, status FROM `999999999999/mydatabase`.cloudfront_logs LIMIT 5') 
df.show()
```

### Use the AWS Glue catalog ID
<a name="spark-notebooks-cross-account-glue-using-the-glue-catalog-id-approach"></a>

When you edit the session properties, enter the following property. Replace *999999999999* with the owner account ID.

```
{ 
    "spark.hadoop.hive.metastore.glue.catalogid": "999999999999" 
}
```

When you run a query in a cell, use syntax like the following. Note that in the `FROM` clause, the catalog ID and separator are not required before the database name.

```
df = spark.sql('SELECT * FROM mydatabase.cloudfront_logs LIMIT 10') 
df.show()
```

## Additional resources
<a name="spark-notebooks-cross-account-glue-additional-resources"></a>

[Configure cross-account access to AWS Glue data catalogs](security-iam-cross-account-glue-catalog-access.md)

[Managing cross-account permissions using both AWS Glue and Lake Formation](https://docs.aws.amazon.com/lake-formation/latest/dg/hybrid-cross-account.html) in the *AWS Lake Formation Developer Guide*.

[Configure cross-account access to a shared AWS Glue Data Catalog using Amazon Athena](https://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/configure-cross-account-access-to-a-shared-aws-glue-data-catalog-using-amazon-athena.html) in *AWS Prescriptive Guidance Patterns*.

# Understand service quotas for Athena for Spark
<a name="notebooks-spark-quotas"></a>

*Service quotas*, also known as *limits*, are the maximum number of service resources or operations that your AWS account can use. For more information about the service quotas for other AWS services that you can use with Amazon Athena for Spark, see [AWS service quotas](https://docs.aws.amazon.com/general/latest/gr/aws_service_limits.html) in the *Amazon Web Services General Reference*.

**Note**  
The default values are the initial quotas set by AWS, which are separate from the actual applied quota value and the maximum possible service quota. New AWS accounts might have lower initial quotas that can increase over time. Amazon Athena for Apache Spark monitors account usage within each AWS Region, and then automatically increases the quotas based on your usage. If your requirements exceed the stated limits, contact customer support.

The following table lists the service quotas for Amazon Athena for Apache Spark.


****  

| Name | Default | Adjustable | Version | Description | 
| --- | --- | --- | --- | --- | 
| Apache Spark DPU concurrency | 160 | No | PySpark Version 3 | The maximum number of data processing units (DPUs) that you can consume concurrently for Apache Spark calculations for a single account in the current AWS Region. A DPU is a relative measure of processing power that consists of 4 vCPUs of compute capacity and 16 GB of memory. | 
| Apache Spark session DPU concurrency | 60 | No | PySpark Version 3 | The maximum number of DPUs you can consume concurrently for an Apache Spark calculation within a session. | 
| On-Demand DPUs | 4 | No | Apache Spark Version 3.5 | The maximum number of data processing units (DPUs) that you can consume concurrently for Apache Spark interactive sessions in the current AWS Region. | 

# Use Athena Spark APIs
<a name="notebooks-spark-api-list"></a>

**Note**  
Athena notebook and calculation APIs are available in the release version PySpark engine version 3. They are not supported in the release version Apache Spark version 3.5.

The following list contains reference links to the Athena notebook API actions. For data structures and other Athena API actions, see the [Amazon Athena API Reference](https://docs.aws.amazon.com/athena/latest/APIReference/).
+  [CreateNotebook](https://docs.aws.amazon.com/athena/latest/APIReference/API_CreateNotebook.html) 
+  [CreatePresignedNotebookUrl](https://docs.aws.amazon.com/athena/latest/APIReference/API_CreatePresignedNotebookUrl.html) 
+  [DeleteNotebook](https://docs.aws.amazon.com/athena/latest/APIReference/API_DeleteNotebook.html) 
+  [ExportNotebook](https://docs.aws.amazon.com/athena/latest/APIReference/API_ExportNotebook.html) 
+  [GetCalculationExecution](https://docs.aws.amazon.com/athena/latest/APIReference/API_GetCalculationExecution.html) 
+  [GetCalculationExecutionCode](https://docs.aws.amazon.com/athena/latest/APIReference/API_GetCalculationExecutionCode.html) 
+  [GetCalculationExecutionStatus](https://docs.aws.amazon.com/athena/latest/APIReference/API_GetCalculationExecutionStatus.html) 
+  [GetNotebookMetadata](https://docs.aws.amazon.com/athena/latest/APIReference/API_GetNotebookMetadata.html) 
+  [GetSession](https://docs.aws.amazon.com/athena/latest/APIReference/API_GetSession.html) 
+  [GetSessionStatus](https://docs.aws.amazon.com/athena/latest/APIReference/API_GetSessionStatus.html) 
+  [ImportNotebook](https://docs.aws.amazon.com/athena/latest/APIReference/API_ImportNotebook.html) 
+  [ListApplicationDPUSizes](https://docs.aws.amazon.com/athena/latest/APIReference/API_ListApplicationDPUSizes.html) 
+  [ListCalculationExecutions](https://docs.aws.amazon.com/athena/latest/APIReference/API_ListCalculationExecutions.html) 
+  [ListExecutors](https://docs.aws.amazon.com/athena/latest/APIReference/API_ListExecutors.html) 
+  [ListNotebookMetadata](https://docs.aws.amazon.com/athena/latest/APIReference/API_ListNotebookMetadata.html) 
+  [ListNotebookSessions](https://docs.aws.amazon.com/athena/latest/APIReference/API_ListNotebookSessions.html) 
+  [ListSessions](https://docs.aws.amazon.com/athena/latest/APIReference/API_ListSessions.html) 
+  [StartCalculationExecution](https://docs.aws.amazon.com/athena/latest/APIReference/API_StartCalculationExecution.html) 
+  [StartSession](https://docs.aws.amazon.com/athena/latest/APIReference/API_StartSession.html) 
+  [StopCalculationExecution](https://docs.aws.amazon.com/athena/latest/APIReference/API_StopCalculationExecution.html) 
+  [TerminateSession](https://docs.aws.amazon.com/athena/latest/APIReference/API_TerminateSession.html) 
+  [UpdateNotebook](https://docs.aws.amazon.com/athena/latest/APIReference/API_UpdateNotebook.html) 
+  [UpdateNotebookMetadata](https://docs.aws.amazon.com/athena/latest/APIReference/API_UpdateNotebookMetadata.html) 

# Troubleshoot Athena for Spark
<a name="notebooks-spark-troubleshooting"></a>

Use the following information to troubleshoot issues you may have when using notebooks and sessions on Athena.

**Topics**
+ [Learn about known issues in Athena for Spark](notebooks-spark-known-issues.md)
+ [Troubleshoot Spark-enabled workgroups](notebooks-spark-troubleshooting-workgroups.md)
+ [Use the Spark EXPLAIN statement to troubleshoot Spark SQL](notebooks-spark-troubleshooting-explain.md)
+ [Log Spark application events in Athena](notebooks-spark-logging.md)
+ [Use CloudTrail to troubleshoot Athena notebook API calls](notebooks-spark-troubleshooting-cloudtrail.md)
+ [Overcome the 68k code block size limit](notebooks-spark-troubleshooting-code-block-size-limit.md)
+ [Troubleshoot session errors](notebooks-spark-troubleshooting-sessions.md)
+ [Troubleshoot table errors](notebooks-spark-troubleshooting-tables.md)
+ [Get support](notebooks-spark-troubleshooting-support.md)

# Learn about known issues in Athena for Spark
<a name="notebooks-spark-known-issues"></a>

This page documents some of the known issues in Athena for Apache Spark.

## Illegal argument exception when creating a table
<a name="notebooks-spark-known-issues-illegal-argument-exception"></a>

Although Spark does not allow databases to be created with an empty location property, databases in AWS Glue can have an empty `LOCATION` property if they are created outside of Spark.

If you create a table and specify an AWS Glue database that has an empty `LOCATION` field, an exception like the following can occur: `IllegalArgumentException: Cannot create a path from an empty string`.

For example, the following command throws an exception if the default database in AWS Glue contains an empty `LOCATION` field:

```
spark.sql("create table testTable (firstName STRING)")
```

**Suggested solution A** – Use AWS Glue to add a location to the database that you are using.

**To add a location to an AWS Glue database**

1. Sign in to the AWS Management Console and open the AWS Glue console at [https://console.aws.amazon.com/glue/](https://console.aws.amazon.com/glue/).

1. In the navigation pane, choose **Databases**.

1. In the list of databases, choose the database that you want to edit.

1. On the details page for the database, choose **Edit**.

1. On the **Update a database** page, for **Location**, enter an Amazon S3 location.

1. Choose **Update Database**.

**Suggested solution B** – Use a different AWS Glue database that has an existing, valid location in Amazon S3. For example, if you have a database named `dbWithLocation`, use the command `spark.sql("use dbWithLocation")` to switch to that database.

**Suggested solution C** – When you use Spark SQL to create the table, specify a value for `location`, as in the following example.

```
spark.sql("create table testTable (firstName STRING) 
       location 's3://amzn-s3-demo-bucket/'")
```

**Suggested solution D** – If you specified a location when you created the table, but the issue still occurs, make sure the Amazon S3 path you provide has a trailing forward slash. For example, the following command throws an illegal argument exception:

```
spark.sql("create table testTable (firstName STRING) 
       location 's3://amzn-s3-demo-bucket'")
```

To correct this, add a trailing slash to the location (for example, `'s3://amzn-s3-demo-bucket/'`).
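
If you build table locations programmatically, a small helper can guarantee the trailing slash before the location reaches your DDL. This is an illustrative sketch:

```python
def normalize_s3_location(location: str) -> str:
    """Ensure an Amazon S3 location ends with a slash, as the Spark DDL expects."""
    return location if location.endswith("/") else location + "/"

# Build the CREATE TABLE statement with a normalized location
ddl = (
    "create table testTable (firstName STRING) "
    f"location '{normalize_s3_location('s3://amzn-s3-demo-bucket')}'"
)
print(ddl)
```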

## Database created in a workgroup location
<a name="notebooks-spark-known-issues-database-created-in-a-workgroup-location"></a>

If you use a command like `spark.sql('create database db')` to create a database and do not specify a location for the database, Athena creates a subdirectory in your workgroup location and uses that location for the newly created database.

## Issues with Hive managed tables in the AWS Glue default database
<a name="notebooks-spark-known-issues-managed-tables"></a>

If the `Location` property of your default database in AWS Glue is nonempty and specifies a valid location in Amazon S3, and you use Athena for Spark to create a Hive managed table in your AWS Glue default database, data is written to the Amazon S3 location specified in your Athena Spark workgroup instead of to the location specified by the AWS Glue database.

This issue occurs because of how Apache Hive handles its default database. Apache Hive creates table data in the Hive warehouse root location, which can be different from the actual default database location.

When you use Athena for Spark to create a Hive managed table under the default database in AWS Glue, the AWS Glue table metadata can point to two different locations. This can cause unexpected behavior when you attempt an `INSERT` or `DROP TABLE` operation.

The steps to reproduce the issue are the following:

1. In Athena for Spark, you use one of the following methods to create or save a Hive managed table:
   + A SQL statement like `CREATE TABLE $tableName`
   + A PySpark command like `df.write.mode("overwrite").saveAsTable($tableName)` that does not specify the `path` option in the DataFrame API.

   At this point, the AWS Glue console may show an incorrect location in Amazon S3 for the table.

1. In Athena for Spark, you use the `DROP TABLE $table_name` statement to drop the table that you created.

1. After you run the `DROP TABLE` statement, you notice that the underlying files in Amazon S3 are still present.

To resolve this issue, do one of the following:

**Solution A** – Use a different AWS Glue database when you create Hive managed tables.

**Solution B** – Specify an empty location for the default database in AWS Glue. Then, create your managed tables in the default database.

## CSV and JSON file format incompatibility between Athena for Spark and Athena SQL
<a name="notebooks-spark-known-issues-csv-and-json-file-format-incompatibility"></a>

Due to a known issue with open source Spark, when you create a table in Athena for Spark on CSV or JSON data, the table might not be readable from Athena SQL, and vice versa. 

For example, you might create a table in Athena for Spark in one of the following ways: 
+ With the following `USING csv` syntax: 

  ```
  spark.sql('''CREATE EXTERNAL TABLE $tableName ( 
  $colName1 $colType1, 
  $colName2 $colType2, 
  $colName3 $colType3) 
  USING csv 
  PARTITIONED BY ($colName1) 
  LOCATION $s3_location''')
  ```
+  With the following [DataFrame](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html) API syntax: 

  ```
  df.write.format('csv').saveAsTable($table_name)
  ```

Due to the known issue with open source Spark, queries from Athena SQL on the resulting tables might not succeed. 

**Suggested solution** – Try creating the table in Athena for Spark using Apache Hive syntax. For more information, see [CREATE HIVEFORMAT TABLE](https://spark.apache.org/docs/latest/sql-ref-syntax-ddl-create-table-hiveformat.html) in the Apache Spark documentation. 

# Troubleshoot Spark-enabled workgroups
<a name="notebooks-spark-troubleshooting-workgroups"></a>

Use the following information to troubleshoot Spark-enabled workgroups in Athena.

## Session stops responding when using an existing IAM role
<a name="notebooks-spark-troubleshooting-workgroups-existing-role"></a>

If you did not create a new `AWSAthenaSparkExecutionRole` for your Spark-enabled workgroup and instead updated or chose an existing IAM role, your session might stop responding. In this case, you may need to add the following trust and permissions policies to your Spark-enabled workgroup execution role.

Add the following example trust policy. The policy includes a confused deputy check for the execution role. Replace the values for `111122223333`, `us-east-1`, and `workgroup-name` with the AWS account ID, AWS Region, and workgroup that you are using.

------
#### [ JSON ]

****  

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": "athena.amazonaws.com"
            },
            "Action": "sts:AssumeRole",
            "Condition": {
                "StringEquals": {
                    "aws:SourceAccount": "111122223333"
                },
                "ArnLike": {
                    "aws:SourceArn": "arn:aws:athena:us-east-1:111122223333:workgroup/workgroup-name"
                }
            }
        }
    ]
}
```

------

Add a permissions policy like the following default policy for notebook-enabled workgroups. Replace the values for `amzn-s3-demo-bucket`, `us-east-1`, `111122223333`, and `workgroup-name` with the Amazon S3 bucket, AWS Region, AWS account ID, and workgroup that you are using.


```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:ListBucket",
                "s3:DeleteObject",
                "s3:GetObject"
            ],
            "Resource": [
                "arn:aws:s3:::amzn-s3-demo-bucket/*",
                "arn:aws:s3:::amzn-s3-demo-bucket"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "athena:GetWorkGroup",
                "athena:CreatePresignedNotebookUrl",
                "athena:TerminateSession",
                "athena:GetSession",
                "athena:GetSessionStatus",
                "athena:ListSessions",
                "athena:StartCalculationExecution",
                "athena:GetCalculationExecutionCode",
                "athena:StopCalculationExecution",
                "athena:ListCalculationExecutions",
                "athena:GetCalculationExecution",
                "athena:GetCalculationExecutionStatus",
                "athena:ListExecutors",
                "athena:ExportNotebook",
                "athena:UpdateNotebook"
            ],
            "Resource": "arn:aws:athena:us-east-1:111122223333:workgroup/workgroup-name"
        },
        {
            "Effect": "Allow",
            "Action": [
                "logs:CreateLogStream",
                "logs:DescribeLogStreams",
                "logs:CreateLogGroup",
                "logs:PutLogEvents"
            ],
            "Resource": [
                "arn:aws:logs:us-east-1:111122223333:log-group:/aws-athena:*",
                "arn:aws:logs:us-east-1:111122223333:log-group:/aws-athena*:log-stream:*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": "logs:DescribeLogGroups",
            "Resource": "arn:aws:logs:us-east-1:111122223333:log-group:*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "cloudwatch:PutMetricData"
            ],
            "Resource": "*",
            "Condition": {
                "StringEquals": {
                    "cloudwatch:namespace": "AmazonAthenaForApacheSpark"
                }
            }
        }
    ]
}
```

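After you fill in the trust policy placeholders, a quick consistency check catches a mismatched account ID between the two condition keys of the confused deputy check. This is a plain-Python sketch; the function name is illustrative.

```
def confused_deputy_check_is_consistent(trust_policy: dict) -> bool:
    """Return False if aws:SourceArn and aws:SourceAccount name different accounts."""
    for stmt in trust_policy.get("Statement", []):
        cond = stmt.get("Condition", {})
        account = cond.get("StringEquals", {}).get("aws:SourceAccount")
        arn = cond.get("ArnLike", {}).get("aws:SourceArn", "")
        # ARN fields: arn:partition:service:region:account-id:resource
        if account and arn and arn.split(":")[4] != account:
            return False
    return True
```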

# Use the Spark EXPLAIN statement to troubleshoot Spark SQL
<a name="notebooks-spark-troubleshooting-explain"></a>

You can use the Spark `EXPLAIN` statement with Spark SQL to troubleshoot your Spark code. The following code and output examples show this usage.

**Example – Spark SELECT statement**  

```
spark.sql("select * from select_taxi_table").explain(True)
```
**Output**  

```
Calculation started (calculation_id=20c1ebd0-1ccf-ef14-db35-7c1844876a7e) in 
(session=24c1ebcb-57a8-861e-1023-736f5ae55386). 
Checking calculation status...

Calculation completed.
== Parsed Logical Plan ==
'Project [*]
+- 'UnresolvedRelation [select_taxi_table], [], false

== Analyzed Logical Plan ==
VendorID: bigint, passenger_count: bigint, count: bigint
Project [VendorID#202L, passenger_count#203L, count#204L]
+- SubqueryAlias spark_catalog.spark_demo_database.select_taxi_table
   +- Relation spark_demo_database.select_taxi_table[VendorID#202L,
       passenger_count#203L,count#204L] csv

== Optimized Logical Plan ==
Relation spark_demo_database.select_taxi_table[VendorID#202L,
passenger_count#203L,count#204L] csv

== Physical Plan ==
FileScan csv spark_demo_database.select_taxi_table[VendorID#202L,
passenger_count#203L,count#204L] 
Batched: false, DataFilters: [], Format: CSV, 
Location: InMemoryFileIndex(1 paths)
[s3://amzn-s3-demo-bucket/select_taxi], 
PartitionFilters: [], PushedFilters: [], 
ReadSchema: struct<VendorID:bigint,passenger_count:bigint,count:bigint>
```

**Example – Spark DataFrame**  
The following example shows how to use `EXPLAIN` with a Spark DataFrame.  

```
taxi1_df=taxi_df.groupBy("VendorID", "passenger_count").count()
taxi1_df.explain("extended")
```
**Output**  

```
Calculation started (calculation_id=d2c1ebd1-f9f0-db25-8477-3effc001b309) in 
(session=24c1ebcb-57a8-861e-1023-736f5ae55386). 
Checking calculation status...

Calculation completed.
== Parsed Logical Plan ==
'Aggregate ['VendorID, 'passenger_count], 
['VendorID, 'passenger_count, count(1) AS count#321L]
+- Relation [VendorID#49L,tpep_pickup_datetime#50,tpep_dropoff_datetime#51,
passenger_count#52L,trip_distance#53,RatecodeID#54L,store_and_fwd_flag#55,
PULocationID#56L,DOLocationID#57L,payment_type#58L,fare_amount#59,
extra#60,mta_tax#61,tip_amount#62,tolls_amount#63,improvement_surcharge#64,
total_amount#65,congestion_surcharge#66,airport_fee#67] parquet

== Analyzed Logical Plan ==
VendorID: bigint, passenger_count: bigint, count: bigint
Aggregate [VendorID#49L, passenger_count#52L], 
[VendorID#49L, passenger_count#52L, count(1) AS count#321L]
+- Relation [VendorID#49L,tpep_pickup_datetime#50,tpep_dropoff_datetime#51,
passenger_count#52L,trip_distance#53,RatecodeID#54L,store_and_fwd_flag#55,
PULocationID#56L,DOLocationID#57L,payment_type#58L,fare_amount#59,extra#60,
mta_tax#61,tip_amount#62,tolls_amount#63,improvement_surcharge#64,
total_amount#65,congestion_surcharge#66,airport_fee#67] parquet

== Optimized Logical Plan ==
Aggregate [VendorID#49L, passenger_count#52L], 
[VendorID#49L, passenger_count#52L, count(1) AS count#321L]
+- Project [VendorID#49L, passenger_count#52L]
   +- Relation [VendorID#49L,tpep_pickup_datetime#50,tpep_dropoff_datetime#51,
passenger_count#52L,trip_distance#53,RatecodeID#54L,store_and_fwd_flag#55,
PULocationID#56L,DOLocationID#57L,payment_type#58L,fare_amount#59,extra#60,
mta_tax#61,tip_amount#62,tolls_amount#63,improvement_surcharge#64,
total_amount#65,congestion_surcharge#66,airport_fee#67] parquet

== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[VendorID#49L, passenger_count#52L], functions=[count(1)], 
output=[VendorID#49L, passenger_count#52L, count#321L])
   +- Exchange hashpartitioning(VendorID#49L, passenger_count#52L, 1000), 
      ENSURE_REQUIREMENTS, [id=#531]
      +- HashAggregate(keys=[VendorID#49L, passenger_count#52L], 
         functions=[partial_count(1)], output=[VendorID#49L, 
         passenger_count#52L, count#326L])
         +- FileScan parquet [VendorID#49L,passenger_count#52L] Batched: true, 
            DataFilters: [], Format: Parquet, 
            Location: InMemoryFileIndex(1 paths)[s3://amzn-s3-demo-bucket/
            notebooks/yellow_tripdata_2016-01.parquet], PartitionFilters: [], 
            PushedFilters: [], 
            ReadSchema: struct<VendorID:bigint,passenger_count:bigint>
```

# Log Spark application events in Athena
<a name="notebooks-spark-logging"></a>

The Athena notebook editor allows for standard Jupyter, Spark, and Python logging. You can use `df.show()` to display PySpark DataFrame contents or use `print("Output")` to display values in the cell output. The `stdout`, `stderr`, and `results` outputs for your calculations are written to your query results bucket location in Amazon S3.

## Log Spark application events to Amazon CloudWatch
<a name="notebooks-spark-logging-logging-spark-application-events-to-amazon-cloudwatch"></a>

Your Athena sessions can also write logs to [Amazon CloudWatch](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/WhatIsCloudWatchLogs.html) in the account that you are using.

### Understand log streams and log groups
<a name="notebooks-spark-logging-understanding-log-streams-and-log-groups"></a>

CloudWatch organizes log activity into log streams and log groups.

**Log streams** – A CloudWatch log stream is a sequence of log events that share the same source. Each separate source of logs in CloudWatch Logs makes up a separate log stream.

**Log groups** – In CloudWatch Logs, a log group is a group of log streams that share the same retention, monitoring, and access control settings.

There is no limit on the number of log streams that can belong to one log group.

In Athena, when you start a notebook session for the first time, Athena creates a log group in CloudWatch that uses the name of your Spark-enabled workgroup, as in the following example.

```
/aws-athena/workgroup-name
```

This log group receives one log stream for each executor in your session that produces at least one log event. An executor is the smallest unit of compute that a notebook session can request from Athena. In CloudWatch, the name of the log stream begins with the session ID and executor ID.

For more information about CloudWatch log groups and log streams, see [Working with log groups and log streams](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/Working-with-log-groups-and-streams.html) in the Amazon CloudWatch Logs User Guide.
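The naming convention described above can be captured in two small helpers (the function names are illustrative, not part of any Athena API):

```
def athena_spark_log_group(workgroup: str) -> str:
    # Athena creates one CloudWatch log group per Spark-enabled workgroup
    return f"/aws-athena/{workgroup}"

def log_stream_prefix(session_id: str, executor_id: str) -> str:
    # Log stream names in the group begin with the session ID and executor ID
    return f"{session_id}/{executor_id}"

print(athena_spark_log_group("athena-spark-example"))  # /aws-athena/athena-spark-example
```

You can pass the prefix to the CloudWatch Logs `DescribeLogStreams` API (the `logStreamNamePrefix` parameter) to list the streams that belong to a particular session.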

### Use standard logger objects in Athena for Spark
<a name="notebooks-spark-logging-using-standard-logger-objects-in-athena-for-spark"></a>

In an Athena for Spark session, you can use the following two global standard logger objects to write logs to Amazon CloudWatch:
+ **athena\_user\_logger** – Sends logs to CloudWatch only. Use this object when you want to log information from your Spark applications directly to CloudWatch, as in the following example.

  ```
  athena_user_logger.info("CloudWatch log line.")
  ```

  The example writes a log event to CloudWatch like the following:

  ```
  AthenaForApacheSpark: 2022-01-01 12:00:00,000 INFO builtins: CloudWatch log line.
  ```
+ **athena\_shared\_logger** – Sends the same log both to CloudWatch and to AWS for support purposes. You can use this object to share logs with AWS service teams for troubleshooting, as in the following example.

  ```
  athena_shared_logger.info("Customer debug line.")
  var = [...some variable holding customer data...]
  athena_shared_logger.info(var)
  ```

  The example logs the `debug` line and the value of the `var` variable to CloudWatch Logs and sends a copy of each line to Support.
**Note**  
For your privacy, your calculation code and results are not shared with AWS. Make sure that your calls to `athena_shared_logger` write only the information that you want to make visible to Support.

The provided loggers write events through [Apache Log4j](https://logging.apache.org/log4j/) and inherit the logging levels of this interface. Possible log level values are `DEBUG`, `ERROR`, `FATAL`, `INFO`, and `WARN` or `WARNING`. You can use the corresponding named function on the logger to produce these values.

**Note**  
Do not rebind the names `athena_user_logger` or `athena_shared_logger`. Doing so makes the logging objects unable to write to CloudWatch for the remainder of the session.

### Example: Log notebook events to CloudWatch
<a name="notebooks-spark-logging-example-logging-notebook-events-to-cloudwatch"></a>

The following procedure shows how to log Athena notebook events to Amazon CloudWatch Logs.

**To log Athena notebook events to Amazon CloudWatch Logs**

1. Follow [Get started with Apache Spark on Amazon Athena](notebooks-spark-getting-started.md) to create a Spark enabled workgroup in Athena with a unique name. This tutorial uses the workgroup name `athena-spark-example`.

1. Follow the steps in [Step 7: Create your own notebook](notebooks-spark-getting-started.md#notebooks-spark-getting-started-creating-your-own-notebook) to create a notebook and launch a new session.

1. In the Athena notebook editor, in a new notebook cell, enter the following command:

   ```
   athena_user_logger.info("Hello world.")         
   ```

1. Run the cell.

1. Retrieve the current session ID by doing one of the following:
   + View the cell output (for example, `... session=72c24e73-2c24-8b22-14bd-443bdcd72de4`).
   + In a new cell, run the [magic](notebooks-spark-magics.md) command `%session_id`.

1. Save the session ID.

1. With the same AWS account that you are using to run the notebook session, open the CloudWatch console at [https://console.aws.amazon.com/cloudwatch/](https://console.aws.amazon.com/cloudwatch/).

1. In the CloudWatch console navigation pane, choose **Log groups**.

1. In the list of log groups, choose the log group that has the name of your Spark-enabled Athena workgroup, as in the following example.

   ```
   /aws-athena/athena-spark-example
   ```

   The **Log streams** section contains a list of one or more log stream links for the workgroup. Each log stream name contains the session ID, executor ID, and a UUID separated by forward slash characters.

   For example, if the session ID is `5ac22d11-9fd8-ded7-6542-0412133d3177` and the executor ID is `f8c22d11-9fd8-ab13-8aba-c4100bfba7e2`, the name of the log stream resembles the following example.

   ```
   5ac22d11-9fd8-ded7-6542-0412133d3177/f8c22d11-9fd8-ab13-8aba-c4100bfba7e2/f012d7cb-cefd-40b1-90b9-67358f003d0b
   ```

1. Choose the log stream link for your session.

1. On the **Log events** page, view the **Message** column.

   The log event for the cell that you ran resembles the following:

   ```
   AthenaForApacheSpark: 2022-01-01 12:00:00,000 INFO builtins: Hello world.
   ```

1. Return to the Athena notebook editor.

1. In a new cell, enter the following code. The code logs a variable to CloudWatch:

   ```
   x = 6
   athena_user_logger.warn(x)
   ```

1. Run the cell.

1. Return to the CloudWatch console **Log events** page for the same log stream.

1. The log stream now contains a log event entry with a message like the following:

   ```
   AthenaForApacheSpark: 2022-01-01 12:00:00,000 WARN builtins: 6
   ```

# Use CloudTrail to troubleshoot Athena notebook API calls
<a name="notebooks-spark-troubleshooting-cloudtrail"></a>

To troubleshoot notebook API calls, you can examine Athena CloudTrail logs to investigate anomalies or discover actions initiated by users. For detailed information about using CloudTrail with Athena, see [Log Amazon Athena API calls with AWS CloudTrail](monitor-with-cloudtrail.md).

The following examples demonstrate CloudTrail log entries for Athena notebook APIs.
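When you review raw CloudTrail log files, a short script can pull out the fields that matter most for troubleshooting. The following sketch assumes only the record shape shown in the examples that follow; the helper name is illustrative.

```
import json

def summarize_notebook_event(record):
    """Collect the fields most useful for troubleshooting a notebook API call."""
    return {
        "eventName": record.get("eventName"),
        "eventTime": record.get("eventTime"),
        "caller": record.get("userIdentity", {}).get("arn"),
        # responseElements can be null (for example, for UpdateNotebook)
        "sessionId": (record.get("responseElements") or {}).get("sessionId")
                     or (record.get("requestParameters") or {}).get("sessionId"),
        "errorCode": record.get("errorCode"),  # present only on failed calls
    }

# CloudTrail log files wrap individual entries in a top-level "Records" array.
sample = json.loads("""
{"Records": [{"eventName": "StartSession",
              "eventTime": "2022-10-14T17:05:36Z",
              "userIdentity": {"arn": "arn:aws:sts::123456789012:assumed-role/Admin/alias"},
              "responseElements": {"sessionId": "a2c1ebba-ad01-865f-ed2d-a142b7451f7e"}}]}
""")
events = [summarize_notebook_event(r) for r in sample["Records"]
          if r.get("eventName") in {"StartSession", "TerminateSession"}]
```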

## StartSession
<a name="notebooks-spark-troubleshooting-cloudtrail-startsession"></a>

The following example shows the CloudTrail log for a notebook [StartSession](https://docs.aws.amazon.com/athena/latest/APIReference/API_StartSession.html) event.

```
{
    "eventVersion": "1.08",
    "userIdentity": {
        "type": "AssumedRole",
        "principalId": "EXAMPLE_PRINCIPAL_ID:alias",
        "arn": "arn:aws:sts::123456789012:assumed-role/Admin/alias",
        "accountId": "123456789012",
        "accessKeyId": "EXAMPLE_KEY_ID",
        "sessionContext": {
            "sessionIssuer": {
                "type": "Role",
                "principalId": "EXAMPLE_PRINCIPAL_ID",
                "arn": "arn:aws:iam::123456789012:role/Admin",
                "accountId": "123456789012",
                "userName": "Admin"
            },
            "webIdFederationData": {},
            "attributes": {
                "creationDate": "2022-10-14T16:41:51Z",
                "mfaAuthenticated": "false"
            }
        }
    },
    "eventTime": "2022-10-14T17:05:36Z",
    "eventSource": "athena.amazonaws.com",
    "eventName": "StartSession",
    "awsRegion": "us-east-1",
    "sourceIPAddress": "203.0.113.10",
    "userAgent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36",
    "requestParameters": {
        "workGroup": "notebook-workgroup",
        "engineConfiguration": {
            "coordinatorDpuSize": 1,
            "maxConcurrentDpus": 20,
            "defaultExecutorDpuSize": 1,
            "additionalConfigs": {
                "NotebookId": "b8f5854b-1042-4b90-9d82-51d3c2fd5c04",
                "NotebookIframeParentUrl": "https://us-east-1.console.aws.amazon.com"
            }
        },
        "notebookVersion": "KeplerJupyter-1.x",
        "sessionIdleTimeoutInMinutes": 20,
        "clientRequestToken": "d646ff46-32d2-42f0-94d1-d060ec3e5d78"
    },
    "responseElements": {
        "sessionId": "a2c1ebba-ad01-865f-ed2d-a142b7451f7e",
        "state": "CREATED"
    },
    "requestID": "d646ff46-32d2-42f0-94d1-d060ec3e5d78",
    "eventID": "b58ce998-eb89-43e9-8d67-d3d8e30561c9",
    "readOnly": false,
    "eventType": "AwsApiCall",
    "managementEvent": true,
    "recipientAccountId": "123456789012",
    "eventCategory": "Management",
    "tlsDetails": {
        "tlsVersion": "TLSv1.2",
        "cipherSuite": "ECDHE-RSA-AES128-GCM-SHA256",
        "clientProvidedHostHeader": "athena.us-east-1.amazonaws.com"
    },
    "sessionCredentialFromConsole": "true"
}
```

## TerminateSession
<a name="notebooks-spark-troubleshooting-cloudtrail-terminatesession"></a>

The following example shows the CloudTrail log for a notebook [TerminateSession](https://docs.aws.amazon.com/athena/latest/APIReference/API_TerminateSession.html) event.

```
{
    "eventVersion": "1.08",
    "userIdentity": {
        "type": "AssumedRole",
        "principalId": "EXAMPLE_PRINCIPAL_ID:alias",
        "arn": "arn:aws:sts::123456789012:assumed-role/Admin/alias",
        "accountId": "123456789012",
        "accessKeyId": "EXAMPLE_KEY_ID",
        "sessionContext": {
            "sessionIssuer": {
                "type": "Role",
                "principalId": "EXAMPLE_PRINCIPAL_ID",
                "arn": "arn:aws:iam::123456789012:role/Admin",
                "accountId": "123456789012",
                "userName": "Admin"
            },
            "webIdFederationData": {},
            "attributes": {
                "creationDate": "2022-10-14T16:41:51Z",
                "mfaAuthenticated": "false"
            }
        }
    },
    "eventTime": "2022-10-14T17:21:03Z",
    "eventSource": "athena.amazonaws.com",
    "eventName": "TerminateSession",
    "awsRegion": "us-east-1",
    "sourceIPAddress": "203.0.113.11",
    "userAgent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36",
    "requestParameters": {
        "sessionId": "a2c1ebba-ad01-865f-ed2d-a142b7451f7e"
    },
    "responseElements": {
        "state": "TERMINATING"
    },
    "requestID": "438ea37e-b704-4cb3-9a76-391997cf42ee",
    "eventID": "49026c5a-bf58-4cdb-86ca-978e711ad238",
    "readOnly": false,
    "eventType": "AwsApiCall",
    "managementEvent": true,
    "recipientAccountId": "123456789012",
    "eventCategory": "Management",
    "tlsDetails": {
        "tlsVersion": "TLSv1.2",
        "cipherSuite": "ECDHE-RSA-AES128-GCM-SHA256",
        "clientProvidedHostHeader": "athena.us-east-1.amazonaws.com"
    },
    "sessionCredentialFromConsole": "true"
}
```

## ImportNotebook
<a name="notebooks-spark-troubleshooting-cloudtrail-importnotebook"></a>

The following example shows the CloudTrail log for a notebook [ImportNotebook](https://docs.aws.amazon.com/athena/latest/APIReference/API_ImportNotebook.html) event. For security, some content is hidden.

```
{
    "eventVersion": "1.08",
    "userIdentity": {
        "type": "AssumedRole",
        "principalId": "EXAMPLE_PRINCIPAL_ID:alias",
        "arn": "arn:aws:sts::123456789012:assumed-role/Admin/alias",
        "accountId": "123456789012",
        "accessKeyId": "EXAMPLE_KEY_ID",
        "sessionContext": {
            "sessionIssuer": {
                "type": "Role",
                "principalId": "EXAMPLE_PRINCIPAL_ID",
                "arn": "arn:aws:iam::123456789012:role/Admin",
                "accountId": "123456789012",
                "userName": "Admin"
            },
            "webIdFederationData": {},
            "attributes": {
                "creationDate": "2022-10-14T16:41:51Z",
                "mfaAuthenticated": "false"
            }
        }
    },
    "eventTime": "2022-10-14T17:08:54Z",
    "eventSource": "athena.amazonaws.com",
    "eventName": "ImportNotebook",
    "awsRegion": "us-east-1",
    "sourceIPAddress": "203.0.113.12",
    "userAgent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36",
    "requestParameters": {
        "workGroup": "notebook-workgroup",
        "name": "example-notebook-name",
        "payload": "HIDDEN_FOR_SECURITY_REASONS",
        "type": "IPYNB",
        "contentMD5": "HIDDEN_FOR_SECURITY_REASONS"
    },
    "responseElements": {
        "notebookId": "05f6225d-bdcc-4935-bc25-a8e19434652d"
    },
    "requestID": "813e777f-6dac-41f4-82a7-e99b7b33f319",
    "eventID": "4abec837-143b-4458-9c1f-fa9fb88ab69b",
    "readOnly": false,
    "eventType": "AwsApiCall",
    "managementEvent": true,
    "recipientAccountId": "123456789012",
    "eventCategory": "Management",
    "tlsDetails": {
        "tlsVersion": "TLSv1.2",
        "cipherSuite": "ECDHE-RSA-AES128-GCM-SHA256",
        "clientProvidedHostHeader": "athena.us-east-1.amazonaws.com"
    },
    "sessionCredentialFromConsole": "true"
}
```

## UpdateNotebook
<a name="notebooks-spark-troubleshooting-cloudtrail-updatenotebook"></a>

The following example shows the CloudTrail log for a notebook [UpdateNotebook](https://docs.aws.amazon.com/athena/latest/APIReference/API_UpdateNotebook.html) event. For security, some content is hidden.

```
{
    "eventVersion": "1.08",
    "userIdentity": {
        "type": "AssumedRole",
        "principalId": "EXAMPLE_PRINCIPAL_ID:AthenaExecutor-9cc1ebb2-aac5-b1ca-8247-5d827bd8232f",
        "arn": "arn:aws:sts::123456789012:assumed-role/AWSAthenaSparkExecutionRole-om0yj71w5l/AthenaExecutor-9cc1ebb2-aac5-b1ca-8247-5d827bd8232f",
        "accountId": "123456789012",
        "accessKeyId": "EXAMPLE_KEY_ID",
        "sessionContext": {
            "sessionIssuer": {
                "type": "Role",
                "principalId": "EXAMPLE_PRINCIPAL_ID",
                "arn": "arn:aws:iam::123456789012:role/service-role/AWSAthenaSparkExecutionRole-om0yj71w5l",
                "accountId": "123456789012",
                "userName": "AWSAthenaSparkExecutionRole-om0yj71w5l"
            },
            "webIdFederationData": {},
            "attributes": {
                "creationDate": "2022-10-14T16:48:06Z",
                "mfaAuthenticated": "false"
            }
        }
    },
    "eventTime": "2022-10-14T16:52:22Z",
    "eventSource": "athena.amazonaws.com",
    "eventName": "UpdateNotebook",
    "awsRegion": "us-east-1",
    "sourceIPAddress": "203.0.113.13",
    "userAgent": "Boto3/1.24.84 Python/3.8.14 Linux/4.14.225-175.364.amzn2.aarch64 Botocore/1.27.84",
    "requestParameters": {
        "notebookId": "c87553ff-e740-44b5-884f-a70e575e08b9",
        "payload": "HIDDEN_FOR_SECURITY_REASONS",
        "type": "IPYNB",
        "contentMD5": "HIDDEN_FOR_SECURITY_REASONS",
        "sessionId": "9cc1ebb2-aac5-b1ca-8247-5d827bd8232f"
    },
    "responseElements": null,
    "requestID": "baaba1d2-f73d-4df1-a82b-71501e7374f1",
    "eventID": "745cdd6f-645d-4250-8831-d0ffd2fe3847",
    "readOnly": false,
    "eventType": "AwsApiCall",
    "managementEvent": true,
    "recipientAccountId": "123456789012",
    "eventCategory": "Management",
    "tlsDetails": {
        "tlsVersion": "TLSv1.2",
        "cipherSuite": "ECDHE-RSA-AES128-GCM-SHA256",
        "clientProvidedHostHeader": "athena.us-east-1.amazonaws.com"
    }
}
```

## StartCalculationExecution
<a name="notebooks-spark-troubleshooting-cloudtrail-startcalculationexecution"></a>

The following example shows the CloudTrail log for a notebook [StartCalculationExecution](https://docs.aws.amazon.com/athena/latest/APIReference/API_StartCalculationExecution.html) event. For security, some content is hidden.

```
{
    "eventVersion": "1.08",
    "userIdentity": {
        "type": "AssumedRole",
        "principalId": "EXAMPLE_PRINCIPAL_ID:AthenaExecutor-9cc1ebb2-aac5-b1ca-8247-5d827bd8232f",
        "arn": "arn:aws:sts::123456789012:assumed-role/AWSAthenaSparkExecutionRole-om0yj71w5l/AthenaExecutor-9cc1ebb2-aac5-b1ca-8247-5d827bd8232f",
        "accountId": "123456789012",
        "accessKeyId": "EXAMPLE_KEY_ID",
        "sessionContext": {
            "sessionIssuer": {
                "type": "Role",
                "principalId": "EXAMPLE_PRINCIPAL_ID",
                "arn": "arn:aws:iam::123456789012:role/service-role/AWSAthenaSparkExecutionRole-om0yj71w5l",
                "accountId": "123456789012",
                "userName": "AWSAthenaSparkExecutionRole-om0yj71w5l"
            },
            "webIdFederationData": {},
            "attributes": {
                "creationDate": "2022-10-14T16:48:06Z",
                "mfaAuthenticated": "false"
            }
        }
    },
    "eventTime": "2022-10-14T16:52:37Z",
    "eventSource": "athena.amazonaws.com",
    "eventName": "StartCalculationExecution",
    "awsRegion": "us-east-1",
    "sourceIPAddress": "203.0.113.14",
    "userAgent": "Boto3/1.24.84 Python/3.8.14 Linux/4.14.225-175.364.amzn2.aarch64 Botocore/1.27.84",
    "requestParameters": {
        "sessionId": "9cc1ebb2-aac5-b1ca-8247-5d827bd8232f",
        "description": "Calculation started via Jupyter notebook",
        "codeBlock": "HIDDEN_FOR_SECURITY_REASONS",
        "clientRequestToken": "0111cd63-4fd0-4ad8-a738-fd350115fc21"
    },
    "responseElements": {
        "calculationExecutionId": "82c1ebb4-bd08-e4c3-5631-a662fb2ff2c5",
        "state": "CREATING"
    },
    "requestID": "1a107461-3f1b-481e-b8a2-7fbd524e2373",
    "eventID": "b74dbd00-e839-4bd1-a1da-b75fbc70ab9a",
    "readOnly": false,
    "eventType": "AwsApiCall",
    "managementEvent": true,
    "recipientAccountId": "123456789012",
    "eventCategory": "Management",
    "tlsDetails": {
        "tlsVersion": "TLSv1.2",
        "cipherSuite": "ECDHE-RSA-AES128-GCM-SHA256",
        "clientProvidedHostHeader": "athena.us-east-1.amazonaws.com"
    }
}
```

# Overcome the 68k code block size limit
<a name="notebooks-spark-troubleshooting-code-block-size-limit"></a>

Athena for Spark has a known calculation code block size limit of 68000 characters. When you run a calculation with a code block over this limit, you can receive the following error message:

 '...' at 'codeBlock' failed to satisfy constraint: Member must have length less than or equal to 68000

The following image shows this error in the Athena console notebook editor.

![\[Code block size error message in the Athena notebook editor\]](http://docs.aws.amazon.com/athena/latest/ug/images/notebooks-spark-troubleshooting-code-block-size-limit-1.png)


The same error can occur when you use the AWS CLI to run a calculation that has a large code block, as in the following example.

```
aws athena start-calculation-execution \ 
    --session-id "{SESSION_ID}" \ 
    --description "{SESSION_DESCRIPTION}" \ 
    --code-block "{LARGE_CODE_BLOCK}"
```

The command gives the following error message:

 '{LARGE_CODE_BLOCK}' at 'codeBlock' failed to satisfy constraint: Member must have length less than or equal to 68000
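Because the limit applies before the calculation runs, a client-side length check can catch the problem without a round trip to the service. The constant name below is illustrative; the value comes from the error message above.

```
ATHENA_SPARK_CODE_BLOCK_LIMIT = 68000  # characters, per the error message above

def fits_code_block_limit(code_block: str) -> bool:
    # Check before calling StartCalculationExecution to avoid the validation error
    return len(code_block) <= ATHENA_SPARK_CODE_BLOCK_LIMIT
```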

## Workaround
<a name="notebooks-spark-troubleshooting-code-block-size-limit-workaround"></a>

To work around this issue, upload the file that has your query or calculation code to Amazon S3. Then, use boto3 to read the file and run your SQL or code.

The following examples assume that you have already uploaded the file that has your SQL query or Python code to Amazon S3.

### SQL example
<a name="notebooks-spark-troubleshooting-code-block-size-limit-sql-example"></a>

The following example code reads the `large_sql_query.sql` file from an Amazon S3 bucket and then runs the large query that the file contains.

```
import boto3

s3 = boto3.resource('s3')

def read_s3_content(bucket_name, key):
    # Read the object and decode its bytes into a string
    return s3.Object(bucket_name, key).get()['Body'].read().decode('utf-8')

# SQL
sql = read_s3_content('bucket_name', 'large_sql_query.sql')
df = spark.sql(sql)
```

### PySpark example
<a name="notebooks-spark-troubleshooting-code-block-size-limit-pyspark-example"></a>

The following code example reads the `large_py_spark.py` file from Amazon S3 and then runs the large code block that is in the file.

```
import boto3

s3 = boto3.resource('s3')

def read_s3_content(bucket_name, key):
    # Read the object and decode its bytes into a string
    return s3.Object(bucket_name, key).get()['Body'].read().decode('utf-8')

# PySpark
py_spark_code = read_s3_content('bucket_name', 'large_py_spark.py')
exec(py_spark_code)
```

# Troubleshoot session errors
<a name="notebooks-spark-troubleshooting-sessions"></a>

Use the information in this section to troubleshoot session issues.

When a custom configuration error occurs during a session start, the Athena for Spark console shows an error message banner. To troubleshoot session start errors, you can check session state change or logging information.

## View session state change information
<a name="notebooks-spark-troubleshooting-sessions-viewing-session-state-change"></a>

You can get details about a session state change from the Athena notebook editor or from the Athena API.

**To view session state information in the Athena console**

1. In the Athena notebook editor, from the **Session** menu on the upper right, choose **View details**.

1. View the **Current session** tab. The **Session information** section shows you information like session ID, workgroup, status, and state change reason.

   The following screen capture example shows information in the **State change reason** section of the **Session information** dialog box for a Spark session error in Athena.  
![\[Viewing session state change information in the Athena for Spark console.\]](http://docs.aws.amazon.com/athena/latest/ug/images/notebooks-spark-custom-jar-cfg-1.jpeg)

**To view session state information using the Athena API**
+ In the Athena API, you can find session state change information in the `StateChangeReason` field of the [SessionStatus](https://docs.aws.amazon.com/athena/latest/APIReference/API_SessionStatus.html) object.

**Note**  
After you manually stop a session, or if the session stops after an idle timeout (the default is 20 minutes), the value of **StateChangeReason** changes to `Session was terminated per request`.
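As a minimal sketch, assuming the `GetSessionStatus` response shape from the Athena API reference (a top-level `Status` object that contains `State` and `StateChangeReason`), you can surface the reason like this; the helper name is illustrative:

```
def state_change_reason(get_session_status_response: dict):
    # get_session_status_response is the dict returned by
    # athena.get_session_status(SessionId=...) with boto3
    status = get_session_status_response.get("Status", {})
    return status.get("State"), status.get("StateChangeReason")
```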

## Use logging to troubleshoot session start errors
<a name="notebooks-spark-troubleshooting-sessions-using-logging"></a>

Custom configuration errors that occur during session start are logged to [Amazon CloudWatch](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/WhatIsCloudWatch.html). In your CloudWatch Logs, search for error messages from `AthenaSparkSessionErrorLogger` to troubleshoot a failed session start.
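If you export or download the log events, a one-line filter finds the relevant entries. This is a plain-Python sketch; the function name is illustrative.

```
def find_session_start_errors(log_lines):
    # Keep only events emitted by AthenaSparkSessionErrorLogger
    return [line for line in log_lines if "AthenaSparkSessionErrorLogger" in line]
```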

For more information about Spark logging, see [Log Spark application events in Athena](notebooks-spark-logging.md).

## Specific session issues
<a name="notebooks-spark-troubleshooting-sessions-specific-error-messages"></a>

Use the information in this section to troubleshoot some specific session issues.

### Session in unhealthy state
<a name="notebooks-spark-troubleshooting-sessions-unhealthy"></a>

If you receive the error message *Session in unhealthy state. Please create a new session*, terminate your existing session, and then create a new one.

### A connection to the notebook server could not be established
<a name="notebooks-spark-troubleshooting-sessions-wss-blocked"></a>

When you open a notebook, you may see the following error message:

```
A connection to the notebook server could not be established.  
The notebook will continue trying to reconnect.  
Check your network connection or notebook server configuration.
```

#### Cause
<a name="notebooks-spark-troubleshooting-sessions-wss-blocked-cause"></a>

When Athena opens a notebook, Athena creates a session and connects to the notebook using a pre-signed notebook URL. The connection to the notebook uses the WSS ([WebSocket Secure](https://en.wikipedia.org/wiki/WebSocket)) protocol.

The error can occur for the following reasons:
+ A local firewall (for example, a company-wide firewall) is blocking WSS traffic.
+ Proxy or anti-virus software on your local computer is blocking the WSS connection.

#### Solution
<a name="notebooks-spark-troubleshooting-sessions-wss-blocked-solution"></a>

Assume you have a WSS connection in the `us-east-1` Region like the following:

```
wss://94c2bcdf-66f9-4d17-9da6-7e7338060183.analytics-gateway.us-east-1.amazonaws.com/
api/kernels/33c78c82-b8d2-4631-bd22-1565dc6ec152/channels?session_id=
7f96a3a048ab4917b6376895ea8d7535
```

To resolve the error, use one of the following strategies.
+ Use wildcard pattern syntax to allow list WSS traffic on port `443` across AWS Regions and AWS accounts.

  ```
  wss://*amazonaws.com
  ```
+ Use wildcard pattern syntax to allow list WSS traffic on port `443` across AWS accounts in a single AWS Region that you specify. The following example uses `us-east-1`.

  ```
  wss://*analytics-gateway.us-east-1.amazonaws.com
  ```
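To sanity-check which allow-list pattern covers a given notebook endpoint, you can sketch the match locally with Python's `fnmatch` module (a hypothetical check, not an AWS tool; the UUID prefix in the hostname varies per session):

```python
from fnmatch import fnmatch
from urllib.parse import urlparse

# Hypothetical pre-signed notebook URL like the one shown above.
url = ("wss://94c2bcdf-66f9-4d17-9da6-7e7338060183"
       ".analytics-gateway.us-east-1.amazonaws.com"
       "/api/kernels/abc/channels")
host = urlparse(url).hostname

# Region-scoped pattern matches only us-east-1 endpoints.
print(fnmatch(host, "*analytics-gateway.us-east-1.amazonaws.com"))  # True

# Cross-Region pattern matches any amazonaws.com endpoint.
print(fnmatch(host, "*amazonaws.com"))  # True
```

A firewall's wildcard syntax may differ from `fnmatch`; this only illustrates which hostnames each pattern is meant to cover.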

# Troubleshoot table errors
<a name="notebooks-spark-troubleshooting-tables"></a>

Use the information in this section to troubleshoot Athena for Spark table errors.

## Cannot create a path error when creating a table
<a name="notebooks-spark-troubleshooting-tables-illegal-argument-exception"></a>

**Error message**: IllegalArgumentException: Cannot create a path from an empty string.

**Cause**: This error can occur when you use Apache Spark in Athena to create a table in an AWS Glue database, and the database has an empty `LOCATION` property. 

**Suggested Solution**: For more information and solutions, see [Illegal argument exception when creating a table](notebooks-spark-known-issues.md#notebooks-spark-known-issues-illegal-argument-exception).
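One way to avoid the empty path is to give the table an explicit Amazon S3 location when you create it, so that Spark does not derive the path from the database's empty `LOCATION` property (a sketch; the database, table, and bucket names are placeholders):

```
spark.sql("""
CREATE TABLE mydatabase.mytable (id INT, value STRING)
LOCATION 's3://amzn-s3-demo-bucket/mytable/'
""")
```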

## AccessDeniedException when querying AWS Glue tables
<a name="notebooks-spark-troubleshooting-tables-glue-access-denied"></a>

**Error message**: pyspark.sql.utils.AnalysisException: Unable to verify existence of default database: com.amazonaws.services.glue.model.AccessDeniedException: User: arn:aws:sts::*aws-account-id*:assumed-role/AWSAthenaSparkExecutionRole-*unique-identifier*/AthenaExecutor-*unique-identifier* is not authorized to perform: glue:GetDatabase on resource: arn:aws:glue:*aws-region*:*aws-account-id*:catalog because no identity-based policy allows the glue:GetDatabase action (Service: AWSGlue; Status Code: 400; Error Code: AccessDeniedException; Request ID: *request-id*; Proxy: null)

**Cause**: The execution role for your Spark-enabled workgroup is missing permissions to access AWS Glue resources.

**Suggested Solution**: To resolve this issue, grant your execution role access to AWS Glue resources, and then edit your Amazon S3 bucket policy to grant access to your execution role.

The following procedure describes these steps in greater detail.

**To grant your execution role access to AWS Glue resources**

1. Open the Athena console at [https://console.aws.amazon.com/athena/](https://console.aws.amazon.com/athena/home).

1. If the console navigation pane is not visible, choose the expansion menu on the left.  
![Choose the expansion menu.](https://docs.aws.amazon.com/athena/latest/ug/images/nav-pane-expansion.png)

1. In the Athena console navigation pane, choose **Workgroups**.

1. On the **Workgroups** page, choose the link of the workgroup that you want to view.

1. On the **Overview Details** page for the workgroup, choose the **Role ARN** link. The link opens the Spark execution role in the IAM console.

1. In the **Permissions policies** section, choose the linked role policy name.

1. Choose **Edit policy**, and then choose **JSON**.

1. Add AWS Glue access to the role. Typically, you add permissions for the `glue:GetDatabase` and `glue:GetTable` actions. For more information on configuring IAM roles, see [Adding and removing IAM identity permissions](https://docs.aws.amazon.com/IAM/latest/UserGuide/access_policies_manage-attach-detach.html) in the IAM User Guide. 

1. Choose **Review policy**, and then choose **Save changes**.

1. Edit your Amazon S3 bucket policy to grant access to the execution role. Note that you must grant the role access to both the bucket and the objects in the bucket. For steps, see [Adding a bucket policy using the Amazon S3 console](https://docs.aws.amazon.com/AmazonS3/latest/userguide/add-bucket-policy.html) in the Amazon Simple Storage Service User Guide.
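The AWS Glue statement you add in the procedure above might look like the following (a minimal sketch; the Region, account ID, and database name are placeholders, and your workload may need additional Glue actions):

```
{
    "Effect": "Allow",
    "Action": [
        "glue:GetDatabase",
        "glue:GetTable"
    ],
    "Resource": [
        "arn:aws:glue:us-east-1:111122223333:catalog",
        "arn:aws:glue:us-east-1:111122223333:database/mydatabase",
        "arn:aws:glue:us-east-1:111122223333:table/mydatabase/*"
    ]
}
```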

# Get support
<a name="notebooks-spark-troubleshooting-support"></a>

For assistance from AWS, choose **Support**, **Support Center** from the AWS Management Console. To expedite your support case, have the following information ready:
+ Athena query ID
+ Session ID
+ Calculation ID