# Setting up IAM roles for service accounts (IRSA) for spark-submit

The following sections explain how to set up IAM roles for service accounts (IRSA) to authenticate and authorize Kubernetes service accounts so that you can run Spark applications stored in Amazon S3.

## Prerequisites

Before trying any of the examples in this documentation, make sure that you have completed the following prerequisites:
+ [Finished setting up spark-submit](https://docs.aws.amazon.com/emr/latest/EMR-on-EKS-DevelopmentGuide/spark-submit-setup.html)
+ [Created an S3 bucket](https://docs.aws.amazon.com/AmazonS3/latest/userguide/creating-bucket.html) and [uploaded](https://docs.aws.amazon.com/AmazonS3/latest/userguide/uploading-an-object-bucket.html) the Spark application JAR

## Configuring a Kubernetes service account to assume an IAM role

The following steps cover how to configure a Kubernetes service account to assume an AWS Identity and Access Management (IAM) role. After you configure the pods to use the service account, they can access any AWS service that the role has permissions to access.

1. Create a policy file to allow read-only access to the Amazon S3 object you [uploaded](https://docs.aws.amazon.com/AmazonS3/latest/userguide/uploading-an-object-bucket.html):

   ```
   cat >my-policy.json <<EOF
   {
       "Version": "2012-10-17",
       "Statement": [
           {
               "Effect": "Allow",
               "Action": [
                   "s3:GetObject",
                   "s3:ListBucket"
               ],
               "Resource": [
                   "arn:aws:s3:::<my-spark-jar-bucket>",
                   "arn:aws:s3:::<my-spark-jar-bucket>/*"
               ]
           }
       ]
   }
   EOF
   ```

1. Create the IAM policy:

   ```
   aws iam create-policy --policy-name my-policy --policy-document file://my-policy.json
   ```

1. Create an IAM role and associate it with a Kubernetes service account for the Spark driver:

   ```
   eksctl create iamserviceaccount --name my-spark-driver-sa --namespace spark-operator \
   --cluster my-cluster --role-name "my-role" \
   --attach-policy-arn arn:aws:iam::111122223333:policy/my-policy --approve
   ```

1. Create a YAML file with the required [permissions](https://docs.aws.amazon.com/emr/latest/EMR-on-EKS-DevelopmentGuide/spark-submit-security.html) for the Spark driver service account:

   ```
   cat >spark-rbac.yaml <<EOF
   ...
   EOF
   ```

## Running the Spark application

Amazon EMR 6.10.0 and higher supports spark-submit for running Spark applications on an Amazon EKS cluster. To run the Spark application, follow these steps:

1. Make sure that you have completed the steps in [Setting up spark-submit for Amazon EMR on EKS](https://docs.aws.amazon.com/emr/latest/EMR-on-EKS-DevelopmentGuide/spark-submit-setup.html).

1. Set the values for the following environment variables:

   ```
   export SPARK_HOME="<spark-home>"
   export MASTER_URL="k8s://<Amazon-EKS-cluster-endpoint>"
   ```

1. Submit the Spark application with the following command:

   ```
   $SPARK_HOME/bin/spark-submit \
    --class org.apache.spark.examples.SparkPi \
    --master $MASTER_URL \
    --conf spark.kubernetes.container.image=895885662937.dkr.ecr.us-west-2.amazonaws.com/spark/emr-6.15.0:latest \
    --conf spark.kubernetes.authenticate.driver.serviceAccountName=emr-containers-sa-spark \
    --deploy-mode cluster \
    --conf spark.kubernetes.namespace=default \
    --conf "spark.driver.extraClassPath=/usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*:/usr/share/aws/hmclient/lib/aws-glue-datacatalog-spark-client.jar:/usr/share/java/Hive-JSON-Serde/hive-openx-serde.jar:/usr/share/aws/sagemaker-spark-sdk/lib/sagemaker-spark-sdk.jar:/home/hadoop/extrajars/*" \
    --conf "spark.driver.extraLibraryPath=/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:/docker/usr/lib/hadoop/lib/native:/docker/usr/lib/hadoop-lzo/lib/native" \
    --conf "spark.executor.extraClassPath=/usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*:/usr/share/aws/hmclient/lib/aws-glue-datacatalog-spark-client.jar:/usr/share/java/Hive-JSON-Serde/hive-openx-serde.jar:/usr/share/aws/sagemaker-spark-sdk/lib/sagemaker-spark-sdk.jar:/home/hadoop/extrajars/*" \
    --conf "spark.executor.extraLibraryPath=/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:/docker/usr/lib/hadoop/lib/native:/docker/usr/lib/hadoop-lzo/lib/native" \
    --conf spark.hadoop.fs.s3.customAWSCredentialsProvider=com.amazonaws.auth.WebIdentityTokenCredentialsProvider \
    --conf spark.hadoop.fs.s3.impl=com.amazon.ws.emr.hadoop.fs.EmrFileSystem \
    --conf spark.hadoop.fs.AbstractFileSystem.s3.impl=org.apache.hadoop.fs.s3.EMRFSDelegate \
    --conf spark.hadoop.fs.s3.buffer.dir=/mnt/s3 \
    --conf spark.hadoop.fs.s3.getObject.initialSocketTimeoutMilliseconds="2000" \
    --conf spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version.emr_internal_use_only.EmrFileSystem="2" \
    --conf spark.hadoop.mapreduce.fileoutputcommitter.cleanup-failures.ignored.emr_internal_use_only.EmrFileSystem="true" \
    s3://my-pod-bucket/spark-examples.jar 20
   ```

1. After the Spark driver finishes the Spark job, you should see a log line at the end of the submission indicating that the job has finished:

   ```
   23/11/24 17:02:14 INFO LoggingPodStatusWatcherImpl: Application org.apache.spark.examples.SparkPi with submission ID default:org-apache-spark-examples-sparkpi-4980808c03ff3115-driver finished
   23/11/24 17:02:14 INFO ShutdownHookManager: Shutdown hook called
   ```

## Cleanup

When you're done running your applications, clean up with the following command:

```
kubectl delete -f spark-rbac.yaml
```