Spark Utilities - Amazon SageMaker Unified Studio

Spark Utilities

The Spark utilities module provides a simple interface for working with Spark Connect sessions and managing Spark configurations for various data sources within Amazon SageMaker Unified Studio. When no connection is specified, a Spark Connect session is created using the default Amazon Athena Spark connection.

Basic Usage

Import the Spark utilities:

from sagemaker_studio import sparkutils

Initialize Spark Session

Supported connection types:

  • Spark Connect

Optional Parameters:

  • connection_name (str): Name of the Spark Connect connection to create the session against (e.g., "my_spark_connection")

When no connection is specified, a default Amazon Athena Spark session is created:

# Default session
spark = sparkutils.init()

# Session with specific connection
spark = sparkutils.init(connection_name="my_spark_connection")

Working with Spark Options

Supported connection types:

  • Amazon DocumentDB

  • Amazon DynamoDB

  • Amazon Redshift

  • Aurora MySQL

  • Aurora PostgreSQL

  • Azure SQL

  • Google BigQuery

  • Microsoft SQL Server

  • MySQL

  • PostgreSQL

  • Oracle

  • Snowflake

Required Inputs:

  • connection_name (str): Name of the connection to get Spark options for (e.g., "my_redshift_connection")

Get formatted Spark options for connecting to data sources:

# Get options for Redshift connection
options = sparkutils.get_spark_options("my_redshift_connection")
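The returned value is a plain Python dict, so it can be unpacked with ** into .options() or inspected directly. A standalone sketch of that unpacking pattern follows; the stand-in dict contents and the ReaderStub class are illustrative assumptions, not part of the module (the real keys and values come from the project connection):

```python
# Stand-in for the dict returned by get_spark_options(); the real contents
# are derived from the named project connection.
spark_options = {
    "url": "jdbc:redshift://example-cluster:5439/dev",  # assumed value
    "user": "example_user",                             # assumed value
}

# .options(**spark_options) is equivalent to chaining one .option(key, value)
# call per entry, as this minimal reader stub illustrates:
class ReaderStub:
    def __init__(self):
        self.opts = {}

    def option(self, key, value):
        self.opts[key] = value
        return self

    def options(self, **kwargs):
        for key, value in kwargs.items():
            self.option(key, value)
        return self

reader = ReaderStub().options(**spark_options).option("dbtable", "sample_table")
print(sorted(reader.opts))  # → ['dbtable', 'url', 'user']
```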

Examples by Operation Type

Reading and Writing Data

# Create sample DataFrame
df_to_write = spark.createDataFrame(
    [(1, "Alice"), (2, "Bob")],
    ["id", "name"]
)

# Get spark options for Redshift connection
spark_options = sparkutils.get_spark_options("my_redshift_connection")

# Write DataFrame using JDBC
df_to_write.write \
    .format("jdbc") \
    .options(**spark_options) \
    .option("dbtable", "sample_table") \
    .save()

# Read DataFrame using JDBC
df_to_read = spark.read \
    .format("jdbc") \
    .options(**spark_options) \
    .option("dbtable", "sample_table") \
    .load()

# Display results
df_to_read.show()

Notes

  • Spark sessions are automatically configured for Amazon Athena Spark compute

  • Connection credentials are managed through Amazon SageMaker Unified Studio project connections

  • The module handles session management and cleanup automatically

  • Spark options are formatted appropriately for each supported data source

  • When get_spark_options is used on EMR Serverless or EMR on EC2 compute and the connection has EnforceSSL enabled, the formatted Spark options will not include the sslrootcert value, so it must be passed explicitly.
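One way to handle the last point is to add sslrootcert to the returned dict before unpacking it into the reader or writer. The helper function and certificate path below are hypothetical illustrations, not part of the module:

```python
def with_sslrootcert(options, cert_path):
    """Return a copy of the Spark options dict with sslrootcert set."""
    merged = dict(options)  # copy so the original dict is left untouched
    merged["sslrootcert"] = cert_path
    return merged

# Stand-in options dict; the real one comes from
# sparkutils.get_spark_options("my_redshift_connection").
base = {"url": "jdbc:redshift://example-cluster:5439/dev"}
opts = with_sslrootcert(base, "/etc/pki/tls/certs/ca-bundle.crt")  # assumed path
```

The merged dict can then be passed as .options(**opts) exactly as in the read/write example above.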