

# Setting up IAM roles for service accounts (IRSA) for spark-submit
<a name="spark-submit-security-irsa"></a>

The following sections explain how to set up IAM roles for service accounts (IRSA) to authenticate and authorize Kubernetes service accounts so you can run Spark applications stored in Amazon S3.

## Prerequisites
<a name="spark-submit-security-irsa-prereqs"></a>

Before trying any of the examples in this documentation, make sure that you have completed the following prerequisites:
+ [Finished setting up spark-submit](https://docs.aws.amazon.com/emr/latest/EMR-on-EKS-DevelopmentGuide/spark-submit-setup.html)
+ [Created an S3 bucket](https://docs.aws.amazon.com/AmazonS3/latest/userguide/creating-bucket.html) and [uploaded](https://docs.aws.amazon.com/AmazonS3/latest/userguide/uploading-an-object-bucket.html) the Spark application JAR

## Configuring a Kubernetes service account to assume an IAM role
<a name="spark-submit-security-irsa-configure-kubernetes"></a>

The following steps cover how to configure a Kubernetes service account to assume an AWS Identity and Access Management (IAM) role. After you configure the pods to use the service account, they can then access any AWS service that the role has permissions to access.

1. Create a policy file to allow read-only access to the Amazon S3 object you [uploaded](https://docs.aws.amazon.com/AmazonS3/latest/userguide/uploading-an-object-bucket.html):

   ```
   cat >my-policy.json <<EOF
   {
       "Version": "2012-10-17",
       "Statement": [
           {
               "Effect": "Allow",
               "Action": [
                   "s3:GetObject",
                   "s3:ListBucket"
               ],
               "Resource": [
                   "arn:aws:s3:::my-spark-jar-bucket",
                   "arn:aws:s3:::my-spark-jar-bucket/*"
               ]
           }
       ]
   }
   EOF
   ```

1. Create the IAM policy.

   ```
   aws iam create-policy --policy-name my-policy --policy-document file://my-policy.json
   ```

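1. Create an IAM role and associate it with a Kubernetes service account for the Spark driver. Replace `111122223333` with your AWS account ID and `my-cluster` with the name of your Amazon EKS cluster: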

   ```
   eksctl create iamserviceaccount --name emr-containers-sa-spark --namespace default \
   --cluster my-cluster --role-name "my-role" \
   --attach-policy-arn arn:aws:iam::111122223333:policy/my-policy --approve
   ```

1. Create a YAML file with the required [permissions](https://docs.aws.amazon.com/emr/latest/EMR-on-EKS-DevelopmentGuide/spark-submit-security.html) for the Spark driver service account:

   ```
   cat >spark-rbac.yaml <<EOF
   apiVersion: rbac.authorization.k8s.io/v1
   kind: Role
   metadata:
     namespace: default
     name: emr-containers-role-spark
   rules:
   - apiGroups:
     - ""
     resources:
     - pods
     verbs:
     - "*"
   - apiGroups:
     - ""
     resources:
     - services
     verbs:
     - "*"
   - apiGroups:
     - ""
     resources:
     - configmaps
     verbs:
     - "*"
   - apiGroups:
     - ""
     resources:
     - persistentvolumeclaims
     verbs:
     - "*"
   ---
   apiVersion: rbac.authorization.k8s.io/v1
   kind: RoleBinding
   metadata:
     name: spark-role-binding
     namespace: default
   roleRef:
     apiGroup: rbac.authorization.k8s.io
     kind: Role
     name: emr-containers-role-spark
   subjects:
   - kind: ServiceAccount
     name: emr-containers-sa-spark
     namespace: default
   EOF
   ```

1. Apply the role and role binding configuration.

   ```
   kubectl apply -f spark-rbac.yaml
   ```

1. The `kubectl` command should return confirmation of the created role and role binding.

   ```
   role.rbac.authorization.k8s.io/emr-containers-role-spark created
   rolebinding.rbac.authorization.k8s.io/spark-role-binding created
   ```
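
Before you submit a job, it can help to confirm that the service account is wired up as expected. The following is a minimal sketch, assuming `kubectl` is already configured for the cluster. The function name `check_irsa` is hypothetical and not part of any AWS tooling. It reads the `eks.amazonaws.com/role-arn` annotation that `eksctl` adds to the service account, then uses `kubectl auth can-i` to ask whether the service account can create pods in its namespace.

```shell
#!/usr/bin/env bash
# Hypothetical helper: check that a service account carries an IAM role
# annotation and is allowed to create pods in its namespace.
# Usage: check_irsa <namespace> <service-account>
check_irsa() {
  local namespace="$1" service_account="$2" role_arn

  # eksctl records the IAM role ARN in this annotation on the service account.
  role_arn="$(kubectl get serviceaccount "$service_account" -n "$namespace" \
    -o jsonpath='{.metadata.annotations.eks\.amazonaws\.com/role-arn}')"

  if [ -z "$role_arn" ]; then
    echo "no eks.amazonaws.com/role-arn annotation on $namespace/$service_account" >&2
    return 1
  fi
  echo "role: $role_arn"

  # Prints "yes" or "no" depending on what the RBAC rules allow.
  kubectl auth can-i create pods -n "$namespace" \
    --as "system:serviceaccount:$namespace:$service_account"
}
```

For example, `check_irsa default emr-containers-sa-spark` checks the service account that the `spark-submit` command in the next section uses. A missing annotation or a `no` answer suggests that the service account name or namespace differs between the steps.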

## Running the Spark application
<a name="spark-submit-security-irsa-app-run"></a>

Amazon EMR releases 6.10.0 and higher support `spark-submit` for running Spark applications on an Amazon EKS cluster. To run the Spark application, follow these steps:

1. Make sure that you have completed the steps in [Setting up spark-submit for Amazon EMR on EKS](https://docs.aws.amazon.com/emr/latest/EMR-on-EKS-DevelopmentGuide/spark-submit-setup.html).

1. Set the following environment variables, replacing the placeholder values with your Spark installation directory and your Amazon EKS cluster endpoint:

   ```
   export SPARK_HOME=spark-home
   export MASTER_URL=k8s://Amazon-EKS-cluster-endpoint
   ```

1. Submit the Spark application with the following command:

   ```
   $SPARK_HOME/bin/spark-submit \
    --class org.apache.spark.examples.SparkPi \
    --master $MASTER_URL \
    --conf spark.kubernetes.container.image=895885662937.dkr.ecr.us-west-2.amazonaws.com/spark/emr-6.15.0:latest \
    --conf spark.kubernetes.authenticate.driver.serviceAccountName=emr-containers-sa-spark \
    --deploy-mode cluster \
    --conf spark.kubernetes.namespace=default \
    --conf "spark.driver.extraClassPath=/usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*:/usr/share/aws/hmclient/lib/aws-glue-datacatalog-spark-client.jar:/usr/share/java/Hive-JSON-Serde/hive-openx-serde.jar:/usr/share/aws/sagemaker-spark-sdk/lib/sagemaker-spark-sdk.jar:/home/hadoop/extrajars/*" \
    --conf "spark.driver.extraLibraryPath=/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:/docker/usr/lib/hadoop/lib/native:/docker/usr/lib/hadoop-lzo/lib/native" \
    --conf "spark.executor.extraClassPath=/usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*:/usr/share/aws/hmclient/lib/aws-glue-datacatalog-spark-client.jar:/usr/share/java/Hive-JSON-Serde/hive-openx-serde.jar:/usr/share/aws/sagemaker-spark-sdk/lib/sagemaker-spark-sdk.jar:/home/hadoop/extrajars/*" \
    --conf "spark.executor.extraLibraryPath=/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:/docker/usr/lib/hadoop/lib/native:/docker/usr/lib/hadoop-lzo/lib/native" \
    --conf spark.hadoop.fs.s3.customAWSCredentialsProvider=com.amazonaws.auth.WebIdentityTokenCredentialsProvider \
    --conf spark.hadoop.fs.s3.impl=com.amazon.ws.emr.hadoop.fs.EmrFileSystem \
    --conf spark.hadoop.fs.AbstractFileSystem.s3.impl=org.apache.hadoop.fs.s3.EMRFSDelegate \
    --conf spark.hadoop.fs.s3.buffer.dir=/mnt/s3 \
    --conf spark.hadoop.fs.s3.getObject.initialSocketTimeoutMilliseconds="2000" \
    --conf spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version.emr_internal_use_only.EmrFileSystem="2" \
    --conf spark.hadoop.mapreduce.fileoutputcommitter.cleanup-failures.ignored.emr_internal_use_only.EmrFileSystem="true" \
    s3://my-spark-jar-bucket/spark-examples.jar 20
   ```

1. After the Spark driver finishes the Spark job, you should see a log line at the end of the submission output indicating that the Spark job has finished.

   ```
   23/11/24 17:02:14 INFO LoggingPodStatusWatcherImpl: Application org.apache.spark.examples.SparkPi with submission ID default:org-apache-spark-examples-sparkpi-4980808c03ff3115-driver finished
   23/11/24 17:02:14 INFO ShutdownHookManager: Shutdown hook called
   ```
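
The submission ID in that log line has the form `namespace:driver-pod-name`, which is enough to locate the driver pod and read the job's result. The following is a minimal sketch, assuming you saved the `spark-submit` output to a file. The function name `driver_pod_from_log` and the file name `submit.log` are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read spark-submit output on stdin and print
# "<namespace> <driver-pod-name>" taken from the final
# "with submission ID <namespace>:<pod> finished" line.
driver_pod_from_log() {
  sed -n 's/.*with submission ID \([^:]*\):\([^ ]*\) finished.*/\1 \2/p' | tail -n 1
}

# Example usage (assumes the submission output was saved to submit.log):
#   read -r ns pod <<EOF
#   $(driver_pod_from_log < submit.log)
#   EOF
#   kubectl logs "$pod" -n "$ns" | grep "Pi is roughly"
```

The SparkPi example prints a line that starts with `Pi is roughly`, so finding it in the driver log suggests that the driver pulled the JAR from Amazon S3 and ran the job.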

## Cleanup
<a name="spark-submit-security-irsa-cleanup"></a>

When you're done running your applications, you can delete the role and role binding with the following command.

```
kubectl delete -f spark-rbac.yaml
```
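
The command above removes only the Kubernetes role and role binding. If you also want to remove the service account, IAM role, and IAM policy created earlier, something like the following sketch might be a starting point. The `run` and `cleanup_irsa` function names and the `DRY_RUN` variable are hypothetical. The sketch assumes that `eksctl` and the AWS CLI are installed and configured for the same account and cluster.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: with DRY_RUN=1, print each command instead of running it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Usage: cleanup_irsa <cluster> <namespace> <service-account> <policy-arn>
cleanup_irsa() {
  local cluster="$1" namespace="$2" service_account="$3" policy_arn="$4"

  # Removes the service account together with the IAM role that eksctl created for it.
  run eksctl delete iamserviceaccount --name "$service_account" \
    --namespace "$namespace" --cluster "$cluster"

  # A managed policy can be deleted only after it is detached from every role,
  # so this call can fail until the role above has been removed.
  run aws iam delete-policy --policy-arn "$policy_arn"
}
```

Pass the same cluster, namespace, and service account names that you used with `eksctl create iamserviceaccount`, along with the ARN of the policy you created. Setting `DRY_RUN=1` first lets you review the commands before anything is deleted.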