Spark Utilities - Amazon SageMaker Unified Studio

Spark Utilities

The Spark utilities module provides a simple interface for working with Spark Connect sessions and managing Spark configurations for various data sources within Amazon SageMaker Unified Studio. When no connection is specified, a Spark Connect session is created using the default Amazon Athena Spark connection.

Basic Usage

Import the Spark utilities:

from sagemaker_studio import sparkutils

Initialize Spark Session

Supported connection types:

  • Spark Connect

Optional Parameters:

  • connection_name (str): Name of the Spark Connect connection to create the session against (e.g., "my_spark_connection")

When no connection is specified, a default Amazon Athena Spark session is created:

# Default session
spark = sparkutils.init()

# Session with specific connection
spark = sparkutils.init(connection_name="my_spark_connection")

Working with Spark Options

Supported connection types:

  • Amazon DocumentDB

  • Amazon DynamoDB

  • Amazon Redshift

  • Aurora MySQL

  • Aurora PostgreSQL

  • Azure SQL

  • Google BigQuery

  • Microsoft SQL Server

  • MySQL

  • PostgreSQL

  • Oracle

  • Snowflake

Required Inputs:

  • connection_name (str): Name of the connection to get Spark options for (e.g., "my_redshift_connection")

Get formatted Spark options for connecting to data sources:

# Get options for Redshift connection
options = sparkutils.get_spark_options("my_redshift_connection")
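The returned value is a plain Python dict, so it can be unpacked with ** into .options() or inspected directly. A standalone sketch of that unpacking pattern follows; the stand-in dict contents and the ReaderStub class are illustrative assumptions, not part of the module (the real keys and values come from the project connection):

```python
# Stand-in for the dict returned by get_spark_options(); the real contents
# are derived from the named project connection.
spark_options = {
    "url": "jdbc:redshift://example-cluster:5439/dev",  # assumed value
    "user": "example_user",                             # assumed value
}

# .options(**spark_options) is equivalent to chaining one .option(key, value)
# call per entry, as this minimal reader stub illustrates:
class ReaderStub:
    def __init__(self):
        self.opts = {}

    def option(self, key, value):
        self.opts[key] = value
        return self

    def options(self, **kwargs):
        for key, value in kwargs.items():
            self.option(key, value)
        return self

reader = ReaderStub().options(**spark_options).option("dbtable", "sample_table")
print(sorted(reader.opts))  # → ['dbtable', 'url', 'user']
```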

Examples by Operation Type

Reading and Writing Data

# Create sample DataFrame
df_to_write = spark.createDataFrame(
    [(1, "Alice"), (2, "Bob")],
    ["id", "name"]
)

# Get spark options for Redshift connection
spark_options = sparkutils.get_spark_options("my_redshift_connection")

# Write DataFrame using JDBC
df_to_write.write \
    .format("jdbc") \
    .options(**spark_options) \
    .option("dbtable", "sample_table") \
    .save()

# Read DataFrame using JDBC
df_to_read = spark.read \
    .format("jdbc") \
    .options(**spark_options) \
    .option("dbtable", "sample_table") \
    .load()

# Display results
df_to_read.show()

Notes

  • Spark sessions are automatically configured for Amazon Athena Spark compute

  • Connection credentials are managed through Amazon SageMaker Unified Studio project connections

  • The module handles session management and cleanup automatically

  • Spark options are formatted appropriately for each supported data source

  • When get_spark_options is used on EMR Serverless or EMR on EC2 compute and the connection has EnforceSSL enabled, the formatted Spark options will not include the sslrootcert value, so it must be passed explicitly.
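One way to handle the last point is to add sslrootcert to the returned dict before unpacking it into the reader or writer. The helper function and certificate path below are hypothetical illustrations, not part of the module:

```python
def with_sslrootcert(options, cert_path):
    """Return a copy of the Spark options dict with sslrootcert set."""
    merged = dict(options)  # copy so the original dict is left untouched
    merged["sslrootcert"] = cert_path
    return merged

# Stand-in options dict; the real one comes from
# sparkutils.get_spark_options("my_redshift_connection").
base = {"url": "jdbc:redshift://example-cluster:5439/dev"}
opts = with_sslrootcert(base, "/etc/pki/tls/certs/ca-bundle.crt")  # assumed path
```

The merged dict can then be passed as .options(**opts) exactly as in the read/write example above.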