Spark Compute

Overview

Amazon SageMaker Unified Studio notebooks support Spark Connect, a protocol that enables running PySpark and Spark SQL commands directly in notebook cells using different Spark compute engines. You can select your Spark runtime from the notebook side panel and start writing Spark commands with no additional setup or special syntax.
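As an illustration of what a Python cell might contain once a runtime is selected, here is a minimal sketch. It assumes the notebook exposes the active Spark Connect session as a variable named `spark` (the usual PySpark convention, not something this page states); the `sales` view, its columns, and the sample rows are invented for the example.

```python
def total_by_category(spark):
    """Build a tiny DataFrame and aggregate it with Spark SQL.

    `spark` is expected to be a SparkSession. In a notebook cell it is
    assumed (not confirmed by this page) to be predefined for you.
    """
    # Hypothetical sample data: (category, amount) pairs.
    rows = [("books", 12.5), ("games", 30.0), ("books", 7.5)]
    df = spark.createDataFrame(rows, ["category", "amount"])

    # Register a temporary view so the data can be queried with Spark SQL.
    df.createOrReplaceTempView("sales")

    # Aggregate with a SQL statement issued through the same session.
    return spark.sql(
        "SELECT category, SUM(amount) AS total "
        "FROM sales GROUP BY category ORDER BY category"
    )


# In a notebook cell, assuming `spark` is defined:
# total_by_category(spark).show()
```

The function takes the session as a parameter so the same code works regardless of which Spark runtime is selected in the compute pane.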

The following Spark runtimes are available:

  • Amazon Athena for Apache Spark – The default Spark runtime. It runs interactive Apache Spark analytics without requiring you to plan for, configure, or manage resources, and it is available automatically in notebooks.

  • Amazon EMR Serverless – An alternate Spark runtime. It runs Apache Spark applications without requiring you to configure, manage, or scale clusters, and you pay only for the resources your application uses. It requires an EMR Serverless application with Interactive Sessions enabled.

Note

Spark Connect on Amazon EMR Serverless is available with EMR release label emr-7.13.0 and above.

Select a Spark runtime

To select or switch your Spark runtime:

  1. Open a notebook in your project.

  2. Open the compute pane in the notebook side panel.

  3. Expand the Spark section to see available connections.

  4. Select a Spark Connect connection.

  5. Choose Apply.

The selected runtime applies to both Python and SQL cells. When you switch to a different Spark runtime, the current session restarts and any unsaved variables or state are lost. A confirmation dialog appears before the switch.

Figure: The notebook Compute environment side panel, showing the Spark section with a Connection dropdown for selecting an EMR Serverless Spark Connect connection and the Apply button.

In SQL cells, the connection dropdown shows the Spark runtime selected at the notebook level.

Spark UI monitoring

Spark UI is available for all supported Spark engines. After executing Spark code in a Python or SQL cell, you can access the Spark UI and Spark Driver Logs from the kernel footer at the bottom of the notebook. Click the links to open them in a separate tab for visibility into job execution and performance.

Use EMR Serverless with Spark Connect

Amazon EMR Serverless provides a pay-per-use Spark compute option with no cluster management. When configured, EMR Serverless applications are provisioned with pre-initialized capacity (1 driver and 3 executors), which brings session start times to approximately 30 seconds. The pre-initialized capacity launches when a session starts and shuts down after 15 minutes of idle time.

Note

If your domain uses network isolation, ensure that the required Amazon VPC endpoints for Amazon EMR are configured. For more information, see Network isolation in the Amazon SageMaker Unified Studio Admin Guide.

Prerequisites

  • EMR Serverless application with release label emr-7.13.0 or above

  • Interactive Sessions enabled on the EMR Serverless application

  • Compatibility permission mode (fine-grained access control and trusted identity propagation are not supported)
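The prerequisites above can be checked programmatically before you create a connection. The following is a minimal sketch, not an official tool: the function names are hypothetical, and the `emr-X.Y.Z` label pattern is an assumption that does not cover labels with extra suffixes.

```python
import re

MIN_RELEASE = (7, 13, 0)  # from the prerequisite "emr-7.13.0 or above"


def parse_release_label(label):
    """Turn 'emr-7.13.0' into (7, 13, 0); raise ValueError otherwise."""
    match = re.fullmatch(r"emr-(\d+)\.(\d+)\.(\d+)", label)
    if match is None:
        raise ValueError(f"unrecognized release label: {label!r}")
    return tuple(int(part) for part in match.groups())


def check_prerequisites(release_label, interactive_sessions, permission_mode):
    """Return a list of human-readable problems (empty if all checks pass)."""
    problems = []
    # Compare numerically: as plain strings, "emr-7.9.0" sorts after "emr-7.13.0".
    if parse_release_label(release_label) < MIN_RELEASE:
        problems.append(f"release label {release_label} is below emr-7.13.0")
    if not interactive_sessions:
        problems.append("Interactive Sessions is not enabled")
    if permission_mode.lower() != "compatibility":
        problems.append(f"permission mode is {permission_mode}, not Compatibility")
    return problems


print(check_prerequisites("emr-7.9.0", True, "Compatibility"))
# prints ['release label emr-7.9.0 is below emr-7.13.0']
```

Parsing the label into a tuple of integers avoids the string-comparison trap in which a single-digit minor version such as 7.9 appears newer than 7.13.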

Set up EMR Serverless in IAM domains

In IAM domains, you set up EMR Serverless Spark Connect by creating the application yourself and adding it as a connection.

  1. Create an EMR Serverless application with release label emr-7.13.0 or above, using either Amazon EMR Studio or the AWS CLI. Select Enable Interactive Sessions so that the application exposes Spark Connect endpoints. For more information, see Interactive workloads in the Amazon EMR Serverless User Guide.

  2. In your Amazon SageMaker Unified Studio project, navigate to Connections > Create Connection > EMR Serverless. Provide a connection name, select the EMR Serverless application ARN, and choose Compatibility permission mode.

    Figure: The Create connection page in SageMaker Unified Studio, showing the EMR Serverless type, connection name field, EMR Serverless application ARN field, Access role ARN field, and Permission mode with Compatibility selected.
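Step 1 above can also be scripted. The following is a rough sketch using the boto3 `emr-serverless` client's `create_application` call, under several assumptions that this page does not confirm: that the `interactiveConfiguration` flags shown correspond to the console's Enable Interactive Sessions option for Spark Connect, that the 1 driver / 3 executor capacity and 15-minute idle timeout map to `initialCapacity` and `autoStopConfiguration`, and that the worker sizes (placeholders) suit your workload. Check the EMR Serverless API reference before relying on it.

```python
def build_create_application_request(name, release_label="emr-7.13.0"):
    """Assemble keyword arguments for the EMR Serverless CreateApplication API."""
    # Placeholder worker size; choose values that fit your workload.
    worker = {"cpu": "4vCPU", "memory": "16GB"}
    return {
        "name": name,
        "type": "SPARK",
        "releaseLabel": release_label,
        # ASSUMPTION: these existing API flags are what "Enable Interactive
        # Sessions" sets; the exact switch for Spark Connect may differ.
        "interactiveConfiguration": {
            "studioEnabled": True,
            "livyEndpointEnabled": True,
        },
        # ASSUMPTION: mirrors the "1 driver, 3 executors" described on this page.
        "initialCapacity": {
            "DRIVER": {"workerCount": 1, "workerConfiguration": worker},
            "EXECUTOR": {"workerCount": 3, "workerConfiguration": worker},
        },
        # ASSUMPTION: mirrors the 15-minute idle shutdown described on this page.
        "autoStopConfiguration": {"enabled": True, "idleTimeoutMinutes": 15},
    }


def create_application(request):
    """Send the request with boto3 (needs AWS credentials and permissions)."""
    import boto3  # imported lazily so building the request needs no AWS SDK

    client = boto3.client("emr-serverless")
    response = client.create_application(**request)
    return response["arn"]  # the ARN to select when creating the connection


request = build_create_application_request("my-spark-connect-app")  # hypothetical name
print(request["releaseLabel"])
```

Separating request construction from the API call lets you review or log the parameters before anything is created in your account.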

If the connection does not appear or is not working, verify the following:

  • The EMR Serverless application has Interactive Sessions enabled.

  • Permission mode is set to Compatibility.

  • The domain does not have trusted identity propagation enabled.

Set up EMR Serverless in IAM Identity Center domains (new projects)

In IAM Identity Center domains, you can use the EMR Serverless blueprint to automatically provision EMR Serverless Spark Connect applications.

  1. As an administrator, update your project profile to include the EmrServerless blueprint. Set the blueprint to ON-DEMAND or ON-CREATE based on your preference. For more information, see Update project profiles in the Amazon SageMaker Unified Studio Admin Guide.

  2. Create a new project using the project profile. Set the release label to emr-7.13.0 and the permission mode to Compatibility.

  3. Verify the EMR Serverless application is created with Interactive Sessions enabled.

Set up EMR Serverless in IAM Identity Center domains (existing projects)

  1. As an administrator, update the project profile to add the EmrServerless blueprint in ON-DEMAND mode.

  2. As a user, go to the project and use Add Compute to add an EMR Serverless environment.

  3. Select a release label of emr-7.13.0 or above and choose Compatibility permission mode.

  4. Verify the EMR Serverless application is created with Interactive Sessions enabled.

Limitations

  • Fine-grained access control (FGAC) is not supported. Only full-table access is available.

  • Trusted identity propagation (TIP) is not supported.

  • Switching runtimes restarts the session. Any unsaved variables or state are lost.