

# Spark Compute
<a name="notebooks-spark-connect"></a>

## Overview
<a name="spark-connect-overview"></a>

Amazon SageMaker Unified Studio notebooks support Spark Connect, a protocol that lets you run PySpark and Spark SQL commands directly in notebook cells on your choice of Spark compute engine. You select a Spark runtime from the notebook side panel and then write Spark commands with no additional setup or special syntax.

The following Spark runtimes are available:
+ **[Amazon Athena for Apache Spark](https://docs.aws.amazon.com/athena/latest/ug/notebooks-spark.html)** – The default Spark runtime. Amazon Athena for Apache Spark makes it easy to interactively run data analytics using Apache Spark without the need to plan for, configure, or manage resources. Available automatically in notebooks.
+ **[Amazon EMR Serverless](https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/emr-serverless.html)** – An alternate Spark runtime. Amazon EMR Serverless lets you run Apache Spark applications without configuring, managing, or scaling clusters. You pay only for the resources your application uses. Requires an EMR Serverless application with Interactive Sessions enabled.

**Note**  
Spark Connect on Amazon EMR Serverless is available with Amazon EMR release 7.13.0 (release label `emr-7.13.0`) and above.
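
As a rough illustration of what writing Spark commands with no additional setup might look like, the following sketch wraps a few PySpark calls in a function. It assumes that the notebook exposes an active session object (commonly named `spark`). That name, the `orders` view, the `total_by_category` helper, and the sample rows are illustrative assumptions and are not taken from this guide. The sketch uses only standard PySpark session and DataFrame methods: `createDataFrame`, `createOrReplaceTempView`, and `sql`.

```python
def total_by_category(spark):
    """Build a tiny DataFrame, register it as a view, and aggregate it with Spark SQL.

    `spark` is assumed to be the session object that the notebook provides
    for the runtime selected in the side panel.
    """
    # Hypothetical sample data; replace with a real table in practice.
    rows = [("books", 12.5), ("books", 7.5), ("games", 60.0)]
    df = spark.createDataFrame(rows, schema=["category", "amount"])

    # Register the DataFrame so that it can be queried with Spark SQL.
    df.createOrReplaceTempView("orders")

    return spark.sql(
        "SELECT category, SUM(amount) AS total "
        "FROM orders GROUP BY category ORDER BY category"
    )


# In a notebook Python cell (assuming `spark` is predefined):
# total_by_category(spark).show()
```

Because the function takes the session as an argument, the same code should run unchanged against either runtime, which is the point of selecting the runtime in the side panel rather than in code.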

## Select a Spark runtime
<a name="spark-connect-select-runtime"></a>

To select or switch your Spark runtime:

1. Open a notebook in your project.

1. Open the compute pane in the notebook side panel.

1. Expand the **Spark** section to see available connections.

1. Select a Spark Connect connection.

1. Choose **Apply**.

The selected runtime applies to both Python and SQL cells. When you switch to a different Spark runtime, the current session restarts and any unsaved variables or state are lost. A confirmation dialog appears before the switch.

![The notebook Compute environment side panel showing the Spark section with a Connection dropdown where you select your EMR Serverless Spark Connect connection and choose Apply.](http://docs.aws.amazon.com/sagemaker-unified-studio/latest/userguide/images/spark-connect/add-connection-final.png)


In SQL cells, the connection dropdown shows the Spark runtime selected at the notebook level.

## Spark UI monitoring
<a name="spark-connect-spark-ui"></a>

The Spark UI is available for all supported Spark engines. After you run Spark code in a Python or SQL cell, links to the Spark UI and the Spark driver logs appear in the kernel footer at the bottom of the notebook. Choose a link to open it in a separate tab, where you can inspect job execution and performance.

## Use EMR Serverless with Spark Connect
<a name="spark-connect-emr-serverless"></a>

Amazon EMR Serverless provides a pay-per-use Spark compute option with no cluster management. When configured for Spark Connect, an EMR Serverless application is provisioned with pre-initialized capacity of 1 driver and 3 executors, which keeps session start times to approximately 30 seconds. The pre-initialized capacity launches when a session starts and shuts down after 15 minutes of idle time.

**Note**  
If your domain uses network isolation, ensure that the required Amazon VPC endpoints for Amazon EMR are configured. For more information, see [Network isolation](https://docs.aws.amazon.com/sagemaker-unified-studio/latest/adminguide/network-isolation.html) in the *Amazon SageMaker Unified Studio Admin Guide*.

### Prerequisites
<a name="spark-connect-emr-serverless-prereqs"></a>
+ EMR Serverless application with release label `emr-7.13.0` or above
+ Interactive Sessions enabled on the EMR Serverless application
+ Compatibility permission mode (fine-grained access control and trusted identity propagation are not supported)
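
The prerequisites above can be expressed as a small preflight check. This is a minimal sketch, not part of any AWS SDK: the function names `parse_release_label` and `check_prerequisites`, their parameters, and the message strings are hypothetical. The only facts taken from this guide are the minimum release label, the Interactive Sessions requirement, and the compatibility permission mode.

```python
import re

MIN_RELEASE = (7, 13, 0)  # from the prerequisite: emr-7.13.0 or above


def parse_release_label(label):
    """Turn a label such as 'emr-7.13.0' into a comparable tuple (7, 13, 0)."""
    match = re.fullmatch(r"emr-(\d+)\.(\d+)\.(\d+)", label)
    if match is None:
        raise ValueError(f"Unrecognized release label: {label!r}")
    return tuple(int(part) for part in match.groups())


def check_prerequisites(release_label, interactive_sessions, permission_mode):
    """Return a list of unmet prerequisites; an empty list means all are met."""
    problems = []
    if parse_release_label(release_label) < MIN_RELEASE:
        problems.append("release label must be emr-7.13.0 or above")
    if not interactive_sessions:
        problems.append("Interactive Sessions must be enabled")
    if permission_mode.lower() != "compatibility":
        problems.append("permission mode must be compatibility")
    return problems
```

The label is compared as a tuple of integers rather than as a string, because a string comparison would rank `emr-7.9.0` above `emr-7.13.0`. Labels that do not match the `emr-X.Y.Z` shape raise a `ValueError` instead of being guessed at.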

### Set up EMR Serverless in IAM domains
<a name="spark-connect-emr-serverless-iam"></a>

In IAM domains, you set up EMR Serverless Spark Connect by creating the application yourself and adding it as a connection.

1. Create an EMR Serverless application with Interactive Sessions enabled and release label `emr-7.13.0` or above. You can create the application from Amazon EMR Studio or the AWS CLI. Make sure to select **Enable Interactive Sessions** to enable Spark Connect endpoints. For more information, see [Interactive workloads](https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/interactive-workloads.html) in the *Amazon EMR Serverless User Guide*.

1. In your Amazon SageMaker Unified Studio project, navigate to **Connections** > **Create Connection** > **EMR Serverless**. Provide a connection name, select the EMR Serverless application ARN, and choose **Compatibility** permission mode.  
![The Create connection page in SageMaker Unified Studio showing the EMR Serverless type, connection name field, EMR Serverless application ARN field, Access role ARN field, and Permission mode with Compatibility selected.](http://docs.aws.amazon.com/sagemaker-unified-studio/latest/userguide/images/spark-connect/createconnection.png)

If the connection does not appear or is not working, verify the following:
+ The EMR Serverless application has Interactive Sessions enabled.
+ Permission mode is set to compatibility.
+ The domain does not have trusted identity propagation enabled.

### Set up EMR Serverless in IAM Identity Center domains (new projects)
<a name="spark-connect-emr-serverless-idc-new"></a>

In IAM Identity Center domains, you can use the EMR Serverless blueprint to automatically provision EMR Serverless Spark Connect applications.

1. As an administrator, update your project profile to include the `EmrServerless` blueprint. Set the blueprint to `ON-DEMAND` or `ON-CREATE` based on your preference. For more information, see [Update project profiles](https://docs.aws.amazon.com/sagemaker-unified-studio/latest/adminguide/update-project-profiles.html) in the *Amazon SageMaker Unified Studio Admin Guide*.

1. Create a new project using the project profile. Set the release label to `emr-7.13.0` or above and the permission mode to compatibility.

1. Verify the EMR Serverless application is created with Interactive Sessions enabled.

### Set up EMR Serverless in IAM Identity Center domains (existing projects)
<a name="spark-connect-emr-serverless-idc-existing"></a>

1. As an administrator, update the project profile to add the `EmrServerless` blueprint in `ON-DEMAND` mode.

1. As a user, go to the project and use **Add Compute** to add an EMR Serverless environment.

1. Select a release label of `emr-7.13.0` or above and choose compatibility permission mode.

1. Verify the EMR Serverless application is created with Interactive Sessions enabled.
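
The verification step in either IAM Identity Center flow can also be scripted. The following sketch assumes that you have the AWS SDK for Python (Boto3) and suitable credentials, and it relies on the `get_application` operation of the `emr-serverless` client. The helper name `summarize_application` and the application ID placeholder are hypothetical. Because this guide does not state which response field reflects Interactive Sessions, the sketch returns the `interactiveConfiguration` block unmodified for you to inspect instead of interpreting it.

```python
def summarize_application(client, application_id):
    """Fetch an EMR Serverless application and return the fields worth checking.

    `client` is expected to behave like boto3.client("emr-serverless").
    """
    app = client.get_application(applicationId=application_id)["application"]
    return {
        "state": app.get("state"),
        "releaseLabel": app.get("releaseLabel"),
        # Returned as-is: which key reflects Interactive Sessions is not assumed here.
        "interactiveConfiguration": app.get("interactiveConfiguration", {}),
    }


# Example usage (requires boto3 and AWS credentials; the ID is a placeholder):
# import boto3
# print(summarize_application(boto3.client("emr-serverless"), "<application-id>"))
```

Passing the client in as an argument keeps the helper free of credentials and Region handling, and makes it straightforward to exercise with a stub.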

### Limitations
<a name="spark-connect-emr-serverless-limitations"></a>
+ Fine-grained access control (FGAC) is not supported. Only full-table access is available.
+ Trusted identity propagation (TIP) is not supported.
+ Switching runtimes restarts the session. Any unsaved variables or state are lost.