

# Spark Utilities
<a name="spark-utilities"></a>

The Spark utilities module provides a simple interface for working with Spark Connect sessions and managing Spark configurations for various data sources within Amazon SageMaker Unified Studio. When no connection is specified, a Spark Connect session is created using the default Amazon Athena Spark connection.

## Basic Usage
<a name="spark-basic-usage"></a>

Import the Spark utilities:

```
from sagemaker_studio import sparkutils
```

## Initialize Spark Session
<a name="initialize-spark-session"></a>

Supported connection types:
+ Spark Connect

Optional Parameters:
+ connection\_name (str): Name of the connection to use for the Spark session (e.g., "my\_spark\_connection")

When no connection is specified, a default Amazon Athena Spark session is created:

```
# Default session
spark = sparkutils.init()

# Session with specific connection
spark = sparkutils.init(connection_name="my_spark_connection")
```
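
The returned session is a standard Spark Connect session, so the usual DataFrame and SQL APIs apply. As a quick smoke test (a minimal sketch, assuming the session above was created successfully):

```
# Run a trivial SQL query to confirm the session is live
spark.sql("SELECT 1 AS ok").show()

# Standard DataFrame APIs work as usual
spark.range(5).show()
```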

## Working with Spark Options
<a name="working-with-spark-options"></a>

Supported connection types:
+ Amazon DocumentDB
+ Amazon DynamoDB
+ Amazon Redshift
+ Aurora MySQL
+ Aurora PostgreSQL
+ Azure SQL
+ Google BigQuery
+ Microsoft SQL Server
+ MySQL
+ PostgreSQL
+ Oracle
+ Snowflake

Required Inputs:
+ connection\_name (str): Name of the connection to get Spark options for (e.g., "my\_redshift\_connection")

Get formatted Spark options for connecting to data sources:

```
# Get options for Redshift connection
options = sparkutils.get_spark_options("my_redshift_connection")
```
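
The returned value is a dictionary of option key/value pairs meant to be passed to a DataFrame reader or writer via `.options(**options)`. To see which keys were formatted for a given connection, you can iterate over them (a sketch; the exact keys vary by data source, and the values may include credentials, so avoid printing them):

```
options = sparkutils.get_spark_options("my_redshift_connection")

# List the option keys only; values may contain credentials
for key in options:
    print(key)
```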

## Examples by Operation Type
<a name="spark-examples-by-operation"></a>

### Reading and Writing Data
<a name="reading-and-writing-data"></a>

```
# Create sample DataFrame
df_to_write = spark.createDataFrame(
    [(1, "Alice"), (2, "Bob")],
    ["id", "name"]
)

# Get Spark options for the Redshift connection
spark_options = sparkutils.get_spark_options("my_redshift_connection")

# Write DataFrame using JDBC
df_to_write.write \
    .format("jdbc") \
    .options(**spark_options) \
    .option("dbtable", "sample_table") \
    .save()

# Read DataFrame using JDBC
df_to_read = spark.read \
    .format("jdbc") \
    .options(**spark_options) \
    .option("dbtable", "sample_table") \
    .load()

# Display results
df_to_read.show()
```
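
By default, `save()` fails if the target table already exists. Spark's standard JDBC save modes and query pushdown apply here as well; a sketch using the same connection and table names as above:

```
# Append to an existing table instead of failing on save
df_to_write.write \
    .format("jdbc") \
    .options(**spark_options) \
    .option("dbtable", "sample_table") \
    .mode("append") \
    .save()

# Push a query down to the source instead of reading the whole table
df_filtered = spark.read \
    .format("jdbc") \
    .options(**spark_options) \
    .option("query", "SELECT id, name FROM sample_table WHERE id > 1") \
    .load()
df_filtered.show()
```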

## Notes
<a name="spark-notes"></a>
+ Spark sessions are automatically configured for Amazon Athena Spark compute
+ Connection credentials are managed through Amazon SageMaker Unified Studio project connections
+ The module handles session management and cleanup automatically
+ Spark options are formatted appropriately for each supported data source
+ When get\_spark\_options is used on EMR Serverless or EMR on EC2 compute and the connection has EnforceSSL enabled, the formatted Spark options will not include the sslrootcert value, so it must be passed explicitly (see the sketch below).
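
For the EnforceSSL case above, the certificate location can be supplied alongside the formatted options. A minimal sketch, assuming a certificate already staged on the cluster at a hypothetical path:

```
spark_options = sparkutils.get_spark_options("my_redshift_connection")

# On EMR Serverless or EMR on EC2 with EnforceSSL, sslrootcert is not
# included in the formatted options, so pass it explicitly
# (the certificate path below is hypothetical)
df = spark.read \
    .format("jdbc") \
    .options(**spark_options) \
    .option("sslrootcert", "/home/hadoop/certs/redshift-ca.pem") \
    .option("dbtable", "sample_table") \
    .load()
```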