

# Using Amazon Redshift integration for Apache Spark on Amazon EMR on EKS

With Amazon EMR release 6.9.0 and later, every release image includes a connector between [Apache Spark](https://aws.amazon.com/emr/features/spark/) and Amazon Redshift. This way, you can use Spark on Amazon EMR on EKS to process data stored in Amazon Redshift. The integration is based on the [`spark-redshift` open-source connector](https://github.com/spark-redshift-community/spark-redshift#readme). For Amazon EMR on EKS, the [Amazon Redshift integration for Apache Spark](https://docs.aws.amazon.com/redshift/latest/mgmt/spark-redshift-connector.html) is included as a native integration.

**Topics**
+ [Launching a Spark application using the Amazon Redshift integration for Apache Spark](emr-spark-redshift-launch.md)
+ [Authenticating with the Amazon Redshift integration for Apache Spark](emr-spark-redshift-auth.md)
+ [Reading and writing from and to Amazon Redshift](emr-spark-redshift-readwrite.md)
+ [Considerations and limitations when using the Spark connector](emr-spark-redshift-considerations.md)

# Launching a Spark application using the Amazon Redshift integration for Apache Spark

To use the integration, you must pass the required Spark Redshift dependencies with your Spark job. Use `--jars` to include the following connector-related libraries. For other file locations that the `--jars` option supports, see the [Advanced Dependency Management](https://spark.apache.org/docs/latest/submitting-applications.html#advanced-dependency-management) section of the Apache Spark documentation.
+ `spark-redshift.jar`
+ `spark-avro.jar`
+ `RedshiftJDBC.jar`
+ `minimal-json.jar`

To launch a Spark application with the Amazon Redshift integration for Apache Spark on Amazon EMR on EKS release 6.9.0 or later, use the following example command. Note that the paths listed with the `--conf spark.jars` option are the default paths for the JAR files.

```
aws emr-containers start-job-run \
  --virtual-cluster-id cluster_id \
  --execution-role-arn arn \
  --release-label emr-6.9.0-latest \
  --job-driver '{
    "sparkSubmitJobDriver": {
      "entryPoint": "s3://script_path",
      "sparkSubmitParameters": "--conf spark.kubernetes.file.upload.path=s3://upload_path --conf spark.jars=/usr/share/aws/redshift/jdbc/RedshiftJDBC.jar,/usr/share/aws/redshift/spark-redshift/lib/spark-redshift.jar,/usr/share/aws/redshift/spark-redshift/lib/spark-avro.jar,/usr/share/aws/redshift/spark-redshift/lib/minimal-json.jar"
    }
  }'
```
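If you submit jobs with the AWS SDK for Python instead of the AWS CLI, the same payload can be assembled programmatically. The sketch below is an illustration, not part of the official documentation: `build_job_driver` is a hypothetical helper, the cluster ID, role ARN, and S3 paths are placeholders, and the jar list mirrors the default paths shown above.

```python
# Default locations of the connector jars on EMR on EKS 6.9.0+ images
REDSHIFT_JARS = [
    "/usr/share/aws/redshift/jdbc/RedshiftJDBC.jar",
    "/usr/share/aws/redshift/spark-redshift/lib/spark-redshift.jar",
    "/usr/share/aws/redshift/spark-redshift/lib/spark-avro.jar",
    "/usr/share/aws/redshift/spark-redshift/lib/minimal-json.jar",
]

def build_job_driver(entry_point, upload_path):
    """Assemble the sparkSubmitJobDriver payload with the Redshift jars.

    Hypothetical helper: equivalent to the --job-driver JSON in the
    CLI example above.
    """
    params = (
        f"--conf spark.kubernetes.file.upload.path={upload_path} "
        f"--conf spark.jars={','.join(REDSHIFT_JARS)}"
    )
    return {
        "sparkSubmitJobDriver": {
            "entryPoint": entry_point,
            "sparkSubmitParameters": params,
        }
    }

# Submitting the job requires valid AWS credentials and resources, e.g.:
# import boto3
# emr = boto3.client("emr-containers")
# emr.start_job_run(
#     virtualClusterId="cluster_id",
#     executionRoleArn="arn",
#     releaseLabel="emr-6.9.0-latest",
#     jobDriver=build_job_driver("s3://script_path", "s3://upload_path"),
# )
```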

# Authenticating with the Amazon Redshift integration for Apache Spark

The following sections describe your authentication options with Amazon Redshift when you integrate with Apache Spark: how to retrieve sign-in credentials, and how to use the JDBC driver with IAM authentication.

## Use AWS Secrets Manager to retrieve credentials and connect to Amazon Redshift

You can store credentials in Secrets Manager to authenticate securely to Amazon Redshift. You can have your Spark job call the `GetSecretValue` API to fetch the credentials:

```
import os
import json

import boto3
from pyspark.sql import SQLContext

sc = ... # existing SparkContext
sql_context = SQLContext(sc)

secretsmanager_client = boto3.client('secretsmanager', region_name=os.getenv('AWS_REGION'))
secret_manager_response = secretsmanager_client.get_secret_value(
    SecretId='string'  # VersionId and VersionStage are optional
)

# The keys depend on how the secret was stored; Redshift secrets
# typically hold a JSON document with "username" and "password"
secret = json.loads(secret_manager_response['SecretString'])
username = secret['username']
password = secret['password']
url = "jdbc:redshift://redshifthost:5439/database?user=" + username + "&password=" + password

# Use this URL to access the Redshift cluster with Spark
```

## Use IAM based authentication with Amazon EMR on EKS job execution role

Starting with Amazon EMR on EKS release 6.9.0, Amazon Redshift JDBC driver version 2.1 or later is packaged into the environment. With JDBC driver 2.1 and later, you can specify the JDBC URL without including a raw username and password. Instead, specify the `jdbc:redshift:iam://` scheme, which instructs the JDBC driver to use your Amazon EMR on EKS job execution role to fetch the credentials automatically.

See [Configure a JDBC or ODBC connection to use IAM credentials](https://docs.aws.amazon.com/redshift/latest/mgmt/generating-iam-credentials-configure-jdbc-odbc.html) in the *Amazon Redshift Management Guide* for more information.

The following example URL uses a `jdbc:redshift:iam://` scheme.

```
jdbc:redshift:iam://examplecluster.abc123xyz789.us-west-2.redshift.amazonaws.com:5439/dev
```
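Because the IAM URL only varies by endpoint, port, and database, it is easy to build programmatically. The helper below is a hypothetical illustration (not part of any AWS SDK), assembling the URL from those three pieces:

```python
def build_iam_jdbc_url(endpoint, port, database):
    """Build a jdbc:redshift:iam:// URL; the JDBC driver then uses the
    job execution role to obtain temporary credentials automatically."""
    return f"jdbc:redshift:iam://{endpoint}:{port}/{database}"

url = build_iam_jdbc_url(
    "examplecluster.abc123xyz789.us-west-2.redshift.amazonaws.com", 5439, "dev"
)
```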

Your job execution role needs the following permissions when the corresponding conditions apply.


| Permission | Conditions when required for job execution role | 
| --- | --- | 
|  redshift:GetClusterCredentials  | Required for JDBC driver to fetch the credentials from Amazon Redshift | 
|  redshift:DescribeCluster  | Required if you specify the Amazon Redshift cluster and AWS Region in the JDBC URL instead of endpoint | 
|  redshift-serverless:GetCredentials  | Required for JDBC driver to fetch the credentials from Amazon Redshift Serverless | 
|  redshift-serverless:GetWorkgroup  | Required if you are using Amazon Redshift Serverless and you specify the URL in terms of workgroup name and Region | 

Your job execution role policy should have the following permissions.

```
{
    "Effect": "Allow",
    "Action": [
        "redshift:GetClusterCredentials",
        "redshift:DescribeCluster",
        "redshift-serverless:GetCredentials",
        "redshift-serverless:GetWorkgroup"
    ],
    "Resource": [
        "arn:aws:redshift:AWS_REGION:ACCOUNT_ID:dbname:CLUSTER_NAME/DATABASE_NAME",
        "arn:aws:redshift:AWS_REGION:ACCOUNT_ID:dbuser:CLUSTER_NAME/USER_NAME"
    ]
}
```

## Authenticate to Amazon Redshift with a JDBC driver

**Set username and password inside the JDBC URL**

To authenticate a Spark job to an Amazon Redshift cluster, you can specify the Amazon Redshift database username and password in the JDBC URL.

**Note**  
If you pass the database credentials in the URL, anyone who has access to the URL can also access the credentials. This method isn't generally recommended because it's not a secure option.

If security isn't a concern for your application, you can use the following format to set the username and password in the JDBC URL:

```
jdbc:redshift://redshifthost:5439/database?user=username&password=password
```
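Reserved characters in a password (for example `&` or `@`) will break URL parsing, so if you do embed credentials this way, percent-encode them first. A minimal sketch using the Python standard library; the host, database, and credentials are placeholders:

```python
from urllib.parse import quote

def build_jdbc_url(host, port, database, username, password):
    """Embed credentials in a JDBC URL, percent-encoding reserved
    characters so they can't corrupt the query string."""
    return (
        f"jdbc:redshift://{host}:{port}/{database}"
        f"?user={quote(username, safe='')}&password={quote(password, safe='')}"
    )

url = build_jdbc_url("redshifthost", 5439, "database", "admin", "p@ss&word")
```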

# Reading and writing from and to Amazon Redshift

The following code examples use PySpark to read and write sample data from and to an Amazon Redshift database with a data source API and with SparkSQL.

------
#### [ Data source API ]

Use PySpark to read and write sample data from and to an Amazon Redshift database with a data source API.

```
from pyspark.sql import SQLContext

sc = ... # existing SparkContext
sql_context = SQLContext(sc)

url = "jdbc:redshift:iam://redshifthost:5439/database"
aws_iam_role_arn = "arn:aws:iam::accountID:role/roleName"

df = sql_context.read \
    .format("io.github.spark_redshift_community.spark.redshift") \
    .option("url", url) \
    .option("dbtable", "tableName") \
    .option("tempdir", "s3://path/for/temp/data") \
    .option("aws_iam_role", aws_iam_role_arn) \
    .load()

df.write \
    .format("io.github.spark_redshift_community.spark.redshift") \
    .option("url", url) \
    .option("dbtable", "tableName_copy") \
    .option("tempdir", "s3://path/for/temp/data") \
    .option("aws_iam_role", aws_iam_role_arn) \
    .mode("error") \
    .save()
```

------
#### [ SparkSQL ]

Use PySpark to read and write sample data from and to an Amazon Redshift database using SparkSQL.

```
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .enableHiveSupport() \
    .getOrCreate()
    
url = "jdbc:redshift:iam://redshifthost:5439/database"
aws_iam_role_arn = "arn:aws:iam::accountID:role/roleName"
    
bucket = "s3://path/for/temp/data"
tableName = "tableName" # Redshift table name

s = f"""CREATE TABLE IF NOT EXISTS {tableName} (country string, data string) 
    USING io.github.spark_redshift_community.spark.redshift 
    OPTIONS (dbtable '{tableName}', tempdir '{bucket}', url '{url}', aws_iam_role '{aws_iam_role_arn}' ); """

spark.sql(s)
         
columns = ["country" ,"data"]
data = [("test-country","test-data")]
df = spark.sparkContext.parallelize(data).toDF(columns)

# Insert data into table
df.write.insertInto(tableName, overwrite=False)
df = spark.sql(f"SELECT * FROM {tableName}")
df.show()
```

------

# Considerations and limitations when using the Spark connector

The Spark connector supports a variety of ways to manage credentials, configure security, and connect with other AWS services. Review the recommendations in this list to configure a functional and resilient connection.
+ We recommend that you activate SSL for the JDBC connection from Spark on Amazon EMR to Amazon Redshift.
+ We recommend that you manage the credentials for the Amazon Redshift cluster in AWS Secrets Manager as a best practice. See [Using AWS Secrets Manager to retrieve credentials for connecting to Amazon Redshift](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-redshift-secrets.html) for an example.
+ We recommend that you pass an IAM role with the parameter `aws_iam_role` for the Amazon Redshift authentication parameter.
+ The parameter `tempformat` currently doesn't support the Parquet format.
+ The `tempdir` URI points to an Amazon S3 location. This temp directory isn't cleaned up automatically and therefore could add additional cost.
+ Consider the following recommendations for Amazon Redshift:
  + We recommend that you block public access to the Amazon Redshift cluster.
  + We recommend that you turn on [Amazon Redshift audit logging](https://docs.aws.amazon.com/redshift/latest/mgmt/db-auditing.html).
  + We recommend that you turn on [Amazon Redshift at-rest encryption](https://docs.aws.amazon.com/redshift/latest/mgmt/security-server-side-encryption.html).
+ Consider the following recommendations for Amazon S3:
  + We recommend [blocking public access to Amazon S3 buckets](https://docs.aws.amazon.com/AmazonS3/latest/userguide/access-control-block-public-access.html).
  + We recommend that you use [Amazon S3 server-side encryption](https://docs.aws.amazon.com/AmazonS3/latest/userguide/serv-side-encryption.html) to encrypt the S3 buckets that you use.
  + We recommend that you use [Amazon S3 lifecycle policies](https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-lifecycle-mgmt.html) to define the retention rules for the S3 bucket.
  + Amazon EMR always verifies code imported from open-source into the image. For security, we don't support encoding AWS access keys in the `tempdir` URI as an authentication method from Spark to Amazon S3.
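Because the connector never deletes the files it stages under `tempdir`, one way to keep that cost bounded is an S3 lifecycle rule that expires objects under the staging prefix. A hypothetical configuration sketch; the prefix and the 7-day retention window are illustrative, not recommendations from the connector documentation:

```json
{
    "Rules": [
        {
            "ID": "ExpireSparkRedshiftTempData",
            "Filter": { "Prefix": "path/for/temp/data/" },
            "Status": "Enabled",
            "Expiration": { "Days": 7 }
        }
    ]
}
```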

For more information on using the connector and its supported parameters, see the following resources:
+ [Amazon Redshift integration for Apache Spark](https://docs.aws.amazon.com/redshift/latest/mgmt/spark-redshift-connector.html) in the *Amazon Redshift Management Guide*
+ The [`spark-redshift` community repository](https://github.com/spark-redshift-community/spark-redshift#readme) on GitHub