

# Using Lake Formation with EMR Serverless


You can configure EMR Serverless applications to use Lake Formation with either full table access or fine-grained access control. For details on supported features in each access mode, review the following table.

## Feature availability




| Feature | Available from | 
| --- | --- | 
| Read operations (SELECT, DESCRIBE) for Hive, Iceberg tables | EMR 7.2.0 | 
| Multi-dialect views | EMR 7.6.0 | 
| Read operations (SELECT, DESCRIBE) for Delta Lake and Hudi tables | EMR 7.6.0 | 
| Full table access for Hive, Iceberg | EMR 7.9.0 | 
| Full table access for Delta Lake | EMR 7.11.0 | 
| Write operations (DDL, DML) for Hive, Iceberg and Delta Lake tables | EMR 7.12.0 | 
| Full table access for Hudi | EMR 7.12.0 | 

# Lake Formation full table access for EMR Serverless


With Amazon EMR releases 7.8.0 and higher, you can use AWS Lake Formation with the AWS Glue Data Catalog when the job runtime role has full table permissions, without the limitations of fine-grained access control. This capability lets you read and write tables that are protected by Lake Formation from your EMR Serverless Spark batch and interactive jobs. See the following sections to learn more about Lake Formation and how to use it with EMR Serverless.

## Using Lake Formation with full table access


You can access AWS Lake Formation protected Glue Data Catalog tables from EMR Serverless Spark jobs or interactive sessions when the job's runtime role has full table access. You do not need to enable AWS Lake Formation on the EMR Serverless application. When a Spark job is configured for Full Table Access (FTA), AWS Lake Formation credentials are used to read and write S3 data for AWS Lake Formation registered tables, while the job's runtime role credentials are used to read and write tables that are not registered with AWS Lake Formation.

**Important**  
Do not enable AWS Lake Formation for fine-grained access control. A job cannot simultaneously run Full Table Access (FTA) and Fine-Grained Access Control (FGAC) on the same EMR cluster or application.

### Step 1: Enable Full Table Access in Lake Formation


To use Full Table Access (FTA) mode, you must allow third-party query engines to access data without the IAM session tag validation in AWS Lake Formation. To enable, follow the steps in [Application integration for full table access](https://docs.aws.amazon.com/lake-formation/latest/dg/full-table-credential-vending.html).

**Note**  
When accessing cross-account tables, full table access must be enabled in both the producer and consumer accounts. Similarly, when accessing cross-Region tables, this setting must be enabled in both the producer and consumer Regions.
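As a configuration sketch, the console steps in the linked page correspond to updating the Lake Formation data lake settings. With the AWS CLI this might look like the following; the `AllowFullTableExternalDataAccess` field name is stated here from memory, so verify it against the linked documentation before use:

```shell
# Fetch the current data lake settings.
aws lakeformation get-data-lake-settings \
    --query DataLakeSettings > settings.json

# Edit settings.json so that it contains:
#   "AllowFullTableExternalDataAccess": true

# Write the updated settings back.
aws lakeformation put-data-lake-settings \
    --data-lake-settings file://settings.json
```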

### Step 2: Set up IAM permissions for the job runtime role


For read or write access to underlying data, in addition to Lake Formation permissions, a job runtime role needs the `lakeformation:GetDataAccess` IAM permission. With this permission, Lake Formation grants the request for temporary credentials to access the data.

The following example policy shows how to provide IAM permissions to access a script in Amazon S3, upload logs to Amazon S3, call AWS Glue APIs, and access Lake Formation.

------
#### [ JSON ]


```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ScriptAccess",
      "Effect": "Allow",
      "Action": [
        "s3:GetObject"
      ],
      "Resource": [
        "arn:aws:s3:::*.amzn-s3-demo-bucket/scripts"
      ]
    },
    {
      "Sid": "LoggingAccess",
      "Effect": "Allow",
      "Action": [
        "s3:PutObject"
      ],
      "Resource": [
        "arn:aws:s3:::amzn-s3-demo-bucket/logs/*"
      ]
    },
    {
      "Sid": "GlueCatalogAccess",
      "Effect": "Allow",
      "Action": [
        "glue:Get*",
        "glue:Create*",
        "glue:Update*"
      ],
      "Resource": [
        "*"
      ]
    },
    {
      "Sid": "LakeFormationAccess",
      "Effect": "Allow",
      "Action": [
        "lakeformation:GetDataAccess"
      ],
      "Resource": [
        "*"
      ]
    }
  ]
}
```

------

#### Step 2.1 Configure Lake Formation permissions

+ Spark jobs that read data from S3 require Lake Formation SELECT permission.
+ Spark jobs that write/delete data in S3 require Lake Formation ALL (SUPER) permission.
+ Spark jobs that interact with the Glue Data Catalog require DESCRIBE, ALTER, and DROP permissions as appropriate.

For more information, refer to [Granting permissions on Data Catalog resources](https://docs.aws.amazon.com/lake-formation/latest/dg/granting-catalog-permissions.html).
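These grants can be made in the Lake Formation console or with the AWS CLI. The following sketch grants SELECT on a single table to the job runtime role; the account ID, role name, database name, and table name are placeholders:

```shell
# Grant the job runtime role read access to one table
# (use ALL/SUPER instead of SELECT for jobs that write or delete data).
aws lakeformation grant-permissions \
    --principal DataLakePrincipalIdentifier=arn:aws:iam::111122223333:role/emr-serverless-job-role \
    --permissions SELECT \
    --resource '{"Table": {"DatabaseName": "my_db", "Name": "my_table"}}'
```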

### Step 3: Initialize a Spark session for Full Table Access using Lake Formation


#### Prerequisites


AWS Glue Data Catalog must be configured as a metastore to access Lake Formation tables.

Set the following settings to configure Glue catalog as a metastore:

```
--conf spark.sql.catalogImplementation=hive
--conf spark.hive.metastore.client.factory.class=com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory
```

For more information on enabling Data Catalog for EMR Serverless, refer to [Metastore configuration for EMR Serverless](metastore-config.html).

To access tables registered with AWS Lake Formation, set the following configurations during Spark initialization so that Spark uses AWS Lake Formation credentials.

------
#### [ Hive ]

```
--conf spark.hadoop.fs.s3.credentialsResolverClass=com.amazonaws.glue.accesscontrol.AWSLakeFormationCredentialResolver
--conf spark.hadoop.fs.s3.useDirectoryHeaderAsFolderObject=true 
--conf spark.hadoop.fs.s3.folderObject.autoAction.disabled=true
--conf spark.sql.catalog.skipLocationValidationOnCreateTable.enabled=true
--conf spark.sql.catalog.createDirectoryAfterTable.enabled=true
--conf spark.sql.catalog.dropDirectoryBeforeTable.enabled=true
```

------
#### [ Iceberg ]

```
--conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog
--conf spark.sql.catalog.spark_catalog.warehouse=S3_DATA_LOCATION
--conf spark.sql.catalog.spark_catalog.client.region=REGION
--conf spark.sql.catalog.spark_catalog.type=glue
--conf spark.sql.catalog.spark_catalog.glue.account-id=ACCOUNT_ID
--conf spark.sql.catalog.spark_catalog.glue.lakeformation-enabled=true
--conf spark.sql.catalog.dropDirectoryBeforeTable.enabled=true
```

------
#### [ Delta Lake ]

```
--conf spark.hadoop.fs.s3.credentialsResolverClass=com.amazonaws.glue.accesscontrol.AWSLakeFormationCredentialResolver
--conf spark.hadoop.fs.s3.useDirectoryHeaderAsFolderObject=true 
--conf spark.hadoop.fs.s3.folderObject.autoAction.disabled=true
--conf spark.sql.catalog.skipLocationValidationOnCreateTable.enabled=true
--conf spark.sql.catalog.createDirectoryAfterTable.enabled=true
--conf spark.sql.catalog.dropDirectoryBeforeTable.enabled=true
```

------
#### [ Hudi ]

```
--conf spark.hadoop.fs.s3.credentialsResolverClass=com.amazonaws.glue.accesscontrol.AWSLakeFormationCredentialResolver
--conf spark.hadoop.fs.s3.useDirectoryHeaderAsFolderObject=true 
--conf spark.hadoop.fs.s3.folderObject.autoAction.disabled=true
--conf spark.sql.catalog.skipLocationValidationOnCreateTable.enabled=true
--conf spark.sql.catalog.createDirectoryAfterTable.enabled=true
--conf spark.sql.catalog.dropDirectoryBeforeTable.enabled=true
--conf spark.jars=/usr/lib/hudi/hudi-spark-bundle.jar
--conf spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension
--conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer
```

------
+ `spark.hadoop.fs.s3.credentialsResolverClass=com.amazonaws.glue.accesscontrol.AWSLakeFormationCredentialResolver`: Configure EMR Filesystem (EMRFS) or EMR S3A to use AWS Lake Formation S3 credentials for Lake Formation registered tables. If the table is not registered, the job's runtime role credentials are used.
+ `spark.hadoop.fs.s3.useDirectoryHeaderAsFolderObject=true` and `spark.hadoop.fs.s3.folderObject.autoAction.disabled=true`: Configure EMRFS to use the content type header `application/x-directory` instead of the `$folder$` suffix when creating S3 folders. This is required when reading Lake Formation tables, because Lake Formation credentials do not allow reading table folders that have the `$folder$` suffix.
+ `spark.sql.catalog.skipLocationValidationOnCreateTable.enabled=true`: Configure Spark to skip validating the table location's emptiness before creation. This is necessary for Lake Formation registered tables, as Lake Formation credentials to verify the empty location are available only after Glue Data Catalog table creation. Without this configuration, the job's runtime role credentials will validate the empty table location.
+ `spark.sql.catalog.createDirectoryAfterTable.enabled=true`: Configure Spark to create the Amazon S3 folder after table creation in the Hive metastore. This is required for Lake Formation registered tables, as Lake Formation credentials to create the S3 folder are available only after Glue Data Catalog table creation.
+ `spark.sql.catalog.dropDirectoryBeforeTable.enabled=true`: Configure Spark to drop the S3 folder before table deletion in the Hive metastore. This is necessary for Lake Formation registered tables, as Lake Formation credentials to drop the S3 folder are not available after table deletion from the Glue Data Catalog.
+ `spark.sql.catalog.<catalog>.glue.lakeformation-enabled=true`: Configure Iceberg catalog to use AWS Lake Formation S3 credentials for Lake Formation registered tables. If the table is not registered, use default environment credentials.
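Taken together, a Hive batch job with full table access might be submitted as follows. This is an illustrative sketch: the application ID, role ARN, and script path are placeholders, and the configuration list combines the metastore prerequisites with the Hive settings above.

```shell
# Sketch: submit an FTA batch job with metastore and Lake Formation settings.
# Placeholders: application ID, execution role ARN, and entry point script.
aws emr-serverless start-job-run \
    --application-id 00fabcdefghijk01 \
    --execution-role-arn arn:aws:iam::111122223333:role/emr-serverless-job-role \
    --job-driver '{
      "sparkSubmit": {
        "entryPoint": "s3://amzn-s3-demo-bucket/scripts/my-job.py",
        "sparkSubmitParameters": "--conf spark.sql.catalogImplementation=hive --conf spark.hive.metastore.client.factory.class=com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory --conf spark.hadoop.fs.s3.credentialsResolverClass=com.amazonaws.glue.accesscontrol.AWSLakeFormationCredentialResolver --conf spark.hadoop.fs.s3.useDirectoryHeaderAsFolderObject=true --conf spark.hadoop.fs.s3.folderObject.autoAction.disabled=true --conf spark.sql.catalog.skipLocationValidationOnCreateTable.enabled=true --conf spark.sql.catalog.createDirectoryAfterTable.enabled=true --conf spark.sql.catalog.dropDirectoryBeforeTable.enabled=true"
      }
    }'
```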

#### Configure full table access mode in SageMaker Unified Studio


To access Lake Formation registered tables from interactive Spark sessions in JupyterLab notebooks, use compatibility permission mode. Use the %%configure magic command to set up your Spark configuration. Choose the configuration based on your table type:

------
#### [ For Hive tables ]

```
%%configure -f
{
    "conf": {
        "spark.hadoop.fs.s3.credentialsResolverClass": "com.amazonaws.glue.accesscontrol.AWSLakeFormationCredentialResolver",
        "spark.hadoop.fs.s3.useDirectoryHeaderAsFolderObject": true,
        "spark.hadoop.fs.s3.folderObject.autoAction.disabled": true,
        "spark.sql.catalog.skipLocationValidationOnCreateTable.enabled": true,
        "spark.sql.catalog.createDirectoryAfterTable.enabled": true,
        "spark.sql.catalog.dropDirectoryBeforeTable.enabled": true
    }
}
```

------
#### [ For Iceberg tables ]

```
%%configure -f
{
    "conf": {
        "spark.sql.catalog.spark_catalog": "org.apache.iceberg.spark.SparkSessionCatalog",
        "spark.sql.catalog.spark_catalog.warehouse": "S3_DATA_LOCATION",
        "spark.sql.catalog.spark_catalog.client.region": "REGION",
        "spark.sql.catalog.spark_catalog.type": "glue",
        "spark.sql.catalog.spark_catalog.glue.account-id": "ACCOUNT_ID",
        "spark.sql.catalog.spark_catalog.glue.lakeformation-enabled": "true",
        "spark.sql.catalog.dropDirectoryBeforeTable.enabled": "true"
    }
}
```

------
#### [ For Delta Lake tables ]

```
%%configure -f
{
    "conf": {
        "spark.hadoop.fs.s3.credentialsResolverClass": "com.amazonaws.glue.accesscontrol.AWSLakeFormationCredentialResolver",
        "spark.hadoop.fs.s3.useDirectoryHeaderAsFolderObject": true,
        "spark.hadoop.fs.s3.folderObject.autoAction.disabled": true,
        "spark.sql.catalog.skipLocationValidationOnCreateTable.enabled": true,
        "spark.sql.catalog.createDirectoryAfterTable.enabled": true,
        "spark.sql.catalog.dropDirectoryBeforeTable.enabled": true
    }
}
```

------
#### [ For Hudi tables ]

```
%%configure -f
{
    "conf": {
        "spark.hadoop.fs.s3.credentialsResolverClass": "com.amazonaws.glue.accesscontrol.AWSLakeFormationCredentialResolver",
        "spark.hadoop.fs.s3.useDirectoryHeaderAsFolderObject": true,
        "spark.hadoop.fs.s3.folderObject.autoAction.disabled": true,
        "spark.sql.catalog.skipLocationValidationOnCreateTable.enabled": true,
        "spark.sql.catalog.createDirectoryAfterTable.enabled": true,
        "spark.sql.catalog.dropDirectoryBeforeTable.enabled": true,
        "spark.jars": "/usr/lib/hudi/hudi-spark-bundle.jar",
        "spark.sql.extensions": "org.apache.spark.sql.hudi.HoodieSparkSessionExtension",
        "spark.sql.catalog.spark_catalog": "org.apache.spark.sql.hudi.catalog.HoodieCatalog",
        "spark.serializer": "org.apache.spark.serializer.KryoSerializer"
    }
}
```

------

Replace the placeholders:
+ `S3_DATA_LOCATION`: Your S3 bucket path
+ `REGION`: AWS region (e.g., us-east-1)
+ `ACCOUNT_ID`: Your AWS account ID

**Note**  
You must set these configurations before executing any Spark operations in your notebook.

#### Supported Operations


These operations will use AWS Lake Formation credentials to access the table data.
+ CREATE TABLE
+ ALTER TABLE
+ INSERT INTO
+ INSERT OVERWRITE
+ UPDATE
+ MERGE INTO
+ DELETE FROM
+ ANALYZE TABLE
+ REPAIR TABLE
+ DROP TABLE
+ Spark datasource queries
+ Spark datasource writes

**Note**  
Operations not listed above will continue to use IAM permissions to access table data.

#### Considerations

+ If a Hive table is created using a job that doesn't have full table access enabled, and no records are inserted, subsequent reads or writes from a job with full table access will fail. This is because EMR Spark without full table access adds the `$folder$` suffix to the table folder name. To resolve this, do one of the following:
  + Insert at least one row into the table from a job that does not have FTA enabled.
  + Configure the job that does not have FTA enabled to not use the `$folder$` suffix in S3 folder names, by setting the Spark configuration `spark.hadoop.fs.s3.useDirectoryHeaderAsFolderObject=true`.
  + Create an S3 folder at the table location `s3://path/to/table/table_name` using the Amazon S3 console or the AWS CLI.
+ Full Table Access is supported with the EMR Filesystem (EMRFS) starting in Amazon EMR release 7.8.0, and with the S3A filesystem starting in Amazon EMR release 7.10.0.
+ Full Table Access is supported for Hive, Iceberg, Delta, and Hudi tables.
+ **Hudi FTA Write Support considerations:**
  + Hudi FTA writes require using HoodieCredentialedHadoopStorage for credential vending during job execution. Set the following configuration when running Hudi jobs: `hoodie.storage.class=org.apache.spark.sql.hudi.storage.HoodieCredentialedHadoopStorage`
  + Full Table Access (FTA) write support for Hudi is available starting from Amazon EMR release 7.12.
  + Hudi FTA write support currently works only with the default Hudi configurations. Custom or non-default Hudi settings may not be fully supported and could result in unexpected behavior.
  + Clustering for Hudi Merge-On-Read (MOR) tables is not supported at this point under FTA write mode.
+ Jobs that reference tables with Lake Formation Fine-Grained Access Control (FGAC) rules or AWS Glue Data Catalog views will fail. To query a table with FGAC rules or a Glue Data Catalog view, you must use FGAC mode. You can enable FGAC mode by following the steps outlined in the AWS documentation: [Using EMR Serverless with AWS Lake Formation for fine-grained access control](emr-serverless-lf-enable.html).
+ Full table access does not support Spark Streaming.
+ When writing Spark DataFrame to a Lake Formation table, only APPEND mode is supported for Hive and Iceberg tables: `df.write.mode("append").saveAsTable(table_name)`
+ Creating external tables requires IAM permissions.
+ Because Lake Formation temporarily caches credentials within a Spark job, a Spark batch job or interactive session that is currently running might not reflect permission changes.
+ You must use a user-defined role, not a service-linked role. For more information, see [Lake Formation requirements for roles](https://docs.aws.amazon.com/lake-formation/latest/dg/registration-role.html).
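The manual-folder workaround in the considerations above can be sketched with the AWS CLI. The bucket and key here are placeholders, and the content type matches the `application/x-directory` convention described earlier:

```shell
# Create a zero-byte directory marker at the table location so that an
# FTA-enabled job can read the (empty) table. Bucket and key are placeholders.
aws s3api put-object \
    --bucket amzn-s3-demo-bucket \
    --key path/to/table/table_name/ \
    --content-type application/x-directory
```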

#### Hudi FTA Write Support - Supported Operations


The following table shows the supported write operations for Hudi Copy-On-Write (COW) and Merge-On-Read (MOR) tables under Full Table Access mode:


**Hudi FTA Supported Write Operations**  

| Table Type | Operation | SQL Write Command | Status | 
| --- | --- | --- | --- | 
| COW | INSERT | INSERT INTO TABLE | Supported | 
| COW | INSERT | INSERT INTO TABLE - PARTITION (Static, Dynamic) | Supported | 
| COW | INSERT | INSERT OVERWRITE | Supported | 
| COW | INSERT | INSERT OVERWRITE - PARTITION (Static, Dynamic) | Supported | 
| COW | UPDATE | UPDATE TABLE | Supported | 
| COW | UPDATE | UPDATE TABLE - Change Partition | Not Supported | 
| COW | DELETE | DELETE FROM TABLE | Supported | 
| COW | ALTER | ALTER TABLE - RENAME TO | Not Supported | 
| COW | ALTER | ALTER TABLE - SET TBLPROPERTIES | Supported | 
| COW | ALTER | ALTER TABLE - UNSET TBLPROPERTIES | Supported | 
| COW | ALTER | ALTER TABLE - ALTER COLUMN | Supported | 
| COW | ALTER | ALTER TABLE - ADD COLUMNS | Supported | 
| COW | ALTER | ALTER TABLE - ADD PARTITION | Supported | 
| COW | ALTER | ALTER TABLE - DROP PARTITION | Supported | 
| COW | ALTER | ALTER TABLE - RECOVER PARTITIONS | Supported | 
| COW | ALTER | REPAIR TABLE SYNC PARTITIONS | Supported | 
| COW | DROP | DROP TABLE | Supported | 
| COW | DROP | DROP TABLE - PURGE | Supported | 
| COW | CREATE | CREATE TABLE - Managed | Supported | 
| COW | CREATE | CREATE TABLE - PARTITION BY | Supported | 
| COW | CREATE | CREATE TABLE IF NOT EXISTS | Supported | 
| COW | CREATE | CREATE TABLE LIKE | Supported | 
| COW | CREATE | CREATE TABLE AS SELECT | Supported | 
| COW | CREATE | CREATE TABLE with LOCATION - External Table | Not Supported | 
| COW | DATAFRAME(INSERT) | saveAsTable.Overwrite | Supported | 
| COW | DATAFRAME(INSERT) | saveAsTable.Append | Not Supported | 
| COW | DATAFRAME(INSERT) | saveAsTable.Ignore | Supported | 
| COW | DATAFRAME(INSERT) | saveAsTable.ErrorIfExists | Supported | 
| COW | DATAFRAME(INSERT) | saveAsTable - External table (Path) | Not Supported | 
| COW | DATAFRAME(INSERT) | save(path) - DF v1 | Not Supported | 
| MOR | INSERT | INSERT INTO TABLE | Supported | 
| MOR | INSERT | INSERT INTO TABLE - PARTITION (Static, Dynamic) | Supported | 
| MOR | INSERT | INSERT OVERWRITE | Supported | 
| MOR | INSERT | INSERT OVERWRITE - PARTITION (Static, Dynamic) | Supported | 
| MOR | UPDATE | UPDATE TABLE | Supported | 
| MOR | UPDATE | UPDATE TABLE - Change Partition | Not Supported | 
| MOR | DELETE | DELETE FROM TABLE | Supported | 
| MOR | ALTER | ALTER TABLE - RENAME TO | Not Supported | 
| MOR | ALTER | ALTER TABLE - SET TBLPROPERTIES | Supported | 
| MOR | ALTER | ALTER TABLE - UNSET TBLPROPERTIES | Supported | 
| MOR | ALTER | ALTER TABLE - ALTER COLUMN | Supported | 
| MOR | ALTER | ALTER TABLE - ADD COLUMNS | Supported | 
| MOR | ALTER | ALTER TABLE - ADD PARTITION | Supported | 
| MOR | ALTER | ALTER TABLE - DROP PARTITION | Supported | 
| MOR | ALTER | ALTER TABLE - RECOVER PARTITIONS | Supported | 
| MOR | ALTER | REPAIR TABLE SYNC PARTITIONS | Supported | 
| MOR | DROP | DROP TABLE | Supported | 
| MOR | DROP | DROP TABLE - PURGE | Supported | 
| MOR | CREATE | CREATE TABLE - Managed | Supported | 
| MOR | CREATE | CREATE TABLE - PARTITION BY | Supported | 
| MOR | CREATE | CREATE TABLE IF NOT EXISTS | Supported | 
| MOR | CREATE | CREATE TABLE LIKE | Supported | 
| MOR | CREATE | CREATE TABLE AS SELECT | Supported | 
| MOR | CREATE | CREATE TABLE with LOCATION - External Table | Not Supported | 
| MOR | DATAFRAME(UPSERT) | saveAsTable.Overwrite | Supported | 
| MOR | DATAFRAME(UPSERT) | saveAsTable.Append | Not Supported | 
| MOR | DATAFRAME(UPSERT) | saveAsTable.Ignore | Supported | 
| MOR | DATAFRAME(UPSERT) | saveAsTable.ErrorIfExists | Supported | 
| MOR | DATAFRAME(UPSERT) | saveAsTable - External table (Path) | Not Supported | 
| MOR | DATAFRAME(UPSERT) | save(path) - DF v1 | Not Supported | 
| MOR | DATAFRAME(DELETE) | saveAsTable.Append | Not Supported | 
| MOR | DATAFRAME(DELETE) | saveAsTable - External table (Path) | Not Supported | 
| MOR | DATAFRAME(DELETE) | save(path) - DF v1 | Not Supported | 
| MOR | DATAFRAME(BULK_INSERT) | saveAsTable.Overwrite | Supported | 
| MOR | DATAFRAME(BULK_INSERT) | saveAsTable.Append | Not Supported | 
| MOR | DATAFRAME(BULK_INSERT) | saveAsTable.Ignore | Supported | 
| MOR | DATAFRAME(BULK_INSERT) | saveAsTable.ErrorIfExists | Supported | 
| MOR | DATAFRAME(BULK_INSERT) | saveAsTable - External table (Path) | Not Supported | 
| MOR | DATAFRAME(BULK_INSERT) | save(path) - DF v1 | Not Supported | 

# Using EMR Serverless with AWS Lake Formation for fine-grained access control

## Overview


With Amazon EMR releases 7.2.0 and higher, you can use AWS Lake Formation to apply fine-grained access controls on Data Catalog tables that are backed by Amazon S3. This capability lets you configure table-, row-, column-, and cell-level access controls for read queries within your Amazon EMR Serverless Spark jobs. To configure fine-grained access control for Apache Spark batch jobs and interactive sessions, use EMR Studio. See the following sections to learn more about Lake Formation and how to use it with EMR Serverless.

Using Amazon EMR Serverless with AWS Lake Formation incurs additional charges. For more information, refer to [Amazon EMR pricing](https://aws.amazon.com/emr/pricing/).

## How EMR Serverless works with AWS Lake Formation

Using EMR Serverless with Lake Formation enforces a layer of permissions on each Spark job, applying Lake Formation permission control when EMR Serverless runs jobs. EMR Serverless uses [Spark resource profiles](https://spark.apache.org/docs/latest/api/java/org/apache/spark/resource/ResourceProfile.html) to create two profiles that together execute a job. The user profile executes user-supplied code, while the system profile enforces Lake Formation policies. For more information, refer to [What is AWS Lake Formation](https://docs.aws.amazon.com/lake-formation/latest/dg/what-is-lake-formation.html) and [Considerations and limitations](https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/emr-serverless-lf-enable-considerations.html).

When you use pre-initialized capacity with Lake Formation, we suggest that you have a minimum of two Spark drivers. Each Lake Formation-enabled job utilizes two Spark drivers, one for the user profile and one for the system profile. For the best performance, use double the number of drivers for Lake Formation-enabled jobs compared to if you don't use Lake Formation.

When you run Spark jobs on EMR Serverless, also consider the impact of dynamic allocation on resource management and cluster performance. The `spark.dynamicAllocation.maxExecutors` setting, which caps the number of executors per resource profile, applies to both user and system executors. If you set it equal to the maximum allowed number of executors, your job run might get stuck because one type of executor consumes all available resources and prevents the other type from launching.

So that you don't run out of resources, EMR Serverless sets the default maximum number of executors per resource profile to 90% of the `spark.dynamicAllocation.maxExecutors` value. You can override this by setting `spark.dynamicAllocation.maxExecutorsRatio` to a value between 0 and 1. Additionally, configure the following properties to optimize resource allocation and overall performance:
+ `spark.dynamicAllocation.cachedExecutorIdleTimeout`
+ `spark.dynamicAllocation.shuffleTracking.timeout`
+ `spark.cleaner.periodicGC.interval`
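For example, a job might pass values such as the following at submission time. The specific numbers are illustrative only, not recommendations; tune them for your workload:

```
--conf spark.dynamicAllocation.maxExecutorsRatio=0.5
--conf spark.dynamicAllocation.cachedExecutorIdleTimeout=30s
--conf spark.dynamicAllocation.shuffleTracking.timeout=30s
--conf spark.cleaner.periodicGC.interval=5min
```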

The following is a high-level overview of how EMR Serverless gets access to data protected by Lake Formation security policies.

![\[How Amazon EMR accesses data protected by Lake Formation security policies.\]](http://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/images/lf-emr-s-architecture.png)


1. A user submits a Spark job to an AWS Lake Formation-enabled EMR Serverless application. 

1. EMR Serverless sends the job to a user driver and runs the job in the user profile. The user driver runs a lean version of Spark that cannot launch tasks, request executors, or access Amazon S3 or the Glue Catalog. It builds a job plan.

1. EMR Serverless sets up a second driver, called the system driver, and runs it in the system profile (with a privileged identity). EMR Serverless sets up an encrypted TLS channel between the two drivers for communication. The user driver uses the channel to send the job plan to the system driver. The system driver does not run user-submitted code. It runs full Spark and communicates with Amazon S3 and the Data Catalog for data access. It requests executors and compiles the job plan into a sequence of execution stages. 

1. EMR Serverless then runs the stages on executors associated with the user driver or system driver. User code in any stage is run exclusively on user profile executors.

1. Stages that read data from Data Catalog tables protected by AWS Lake Formation or those that apply security filters are delegated to system executors.

## Enabling Lake Formation in Amazon EMR

To enable Lake Formation, set `spark.emr-serverless.lakeformation.enabled` to `true` under the `spark-defaults` classification in the runtime-configuration parameter when [creating an EMR Serverless application](https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/getting-started.html#gs-application-console).

```
aws emr-serverless create-application \
    --release-label emr-7.12.0 \
    --runtime-configuration '{
     "classification": "spark-defaults", 
     "properties": {
      "spark.emr-serverless.lakeformation.enabled": "true"
      }
    }' \
    --type "SPARK"
```

You can also enable Lake Formation when you create a new application in EMR Studio. Choose **Use Lake Formation for fine-grained access control**, available under **Additional configurations**.

[Inter-worker encryption](https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/interworker-encryption.html) is enabled by default when you use Lake Formation with EMR Serverless, so you do not need to explicitly enable inter-worker encryption again.

**Enabling Lake Formation for Spark jobs**

To enable Lake Formation for individual Spark jobs, set `spark.emr-serverless.lakeformation.enabled` to true when using `spark-submit`.

```
--conf spark.emr-serverless.lakeformation.enabled=true
```
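For example, a job submission that enables Lake Formation for a single run might look like the following sketch; the application ID, role ARN, and script path are placeholders:

```shell
# Sketch: enable Lake Formation FGAC for one job at submission time.
# Placeholders: application ID, execution role ARN, and entry point script.
aws emr-serverless start-job-run \
    --application-id 00fabcdefghijk01 \
    --execution-role-arn arn:aws:iam::111122223333:role/emr-serverless-job-role \
    --job-driver '{
      "sparkSubmit": {
        "entryPoint": "s3://amzn-s3-demo-bucket/scripts/my-job.py",
        "sparkSubmitParameters": "--conf spark.emr-serverless.lakeformation.enabled=true"
      }
    }'
```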

## Job runtime role IAM permissions

Lake Formation permissions control access to AWS Glue Data Catalog resources, Amazon S3 locations, and the underlying data at those locations. IAM permissions control access to the Lake Formation and AWS Glue APIs and resources. Although you might have the Lake Formation permission to access a table in the Data Catalog (SELECT), your operation fails if you don’t have the IAM permission on the `glue:Get*` API operation. 

The following example policy shows how to provide IAM permissions to access a script in Amazon S3, upload logs to Amazon S3, call AWS Glue APIs, and access Lake Formation.

------
#### [ JSON ]


```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ScriptAccess",
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::*.amzn-s3-demo-bucket/scripts",
        "arn:aws:s3:::*.amzn-s3-demo-bucket/*"
      ]
    },
    {
      "Sid": "LoggingAccess",
      "Effect": "Allow",
      "Action": [
        "s3:PutObject"
      ],
      "Resource": [
        "arn:aws:s3:::amzn-s3-demo-bucket/logs/*"
      ]
    },
    {
      "Sid": "GlueCatalogAccess",
      "Effect": "Allow",
      "Action": [
        "glue:Get*",
        "glue:Create*",
        "glue:Update*"
      ],
      "Resource": [
        "*"
      ]
    },
    {
      "Sid": "LakeFormationAccess",
      "Effect": "Allow",
      "Action": [
        "lakeformation:GetDataAccess"
      ],
      "Resource": [
        "*"
      ]
    }
  ]
}
```

------

## Setting up Lake Formation permissions for job runtime role

First, register the location of your Hive table with Lake Formation. Then create permissions for your job runtime role on your desired table. For more details about Lake Formation, refer to [ What is AWS Lake Formation?](https://docs.aws.amazon.com/lake-formation/latest/dg/what-is-lake-formation.html) in the *AWS Lake Formation Developer Guide*.

After you set up the Lake Formation permissions, submit Spark jobs on Amazon EMR Serverless. For more information about Spark jobs, refer to [Spark examples](https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/jobs-spark.html#spark-examples).
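As a sketch of these two steps with the AWS CLI (all ARNs and names are placeholders, and the column-level grant is just one example of a fine-grained permission):

```shell
# 1. Register the table's S3 location with Lake Formation, using a
#    user-defined role that Lake Formation can assume.
aws lakeformation register-resource \
    --resource-arn arn:aws:s3:::amzn-s3-demo-bucket/path/to/table \
    --role-arn arn:aws:iam::111122223333:role/lf-registration-role

# 2. Grant the job runtime role read access to selected columns only.
aws lakeformation grant-permissions \
    --principal DataLakePrincipalIdentifier=arn:aws:iam::111122223333:role/emr-serverless-job-role \
    --permissions SELECT \
    --resource '{"TableWithColumns": {"DatabaseName": "my_db", "Name": "my_table", "ColumnNames": ["id", "name"]}}'
```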

## Submitting a job run


After you finish setting up the Lake Formation grants, you can [ submit Spark jobs on EMR Serverless.](https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/jobs-spark.html#spark-examples) The section that follows shows examples of how to configure and submit job run properties.

## Permission requirements


### Tables not registered in AWS Lake Formation


For tables not registered with AWS Lake Formation, the job runtime role accesses both the AWS Glue Data Catalog and the underlying table data in Amazon S3. This requires the job runtime role to have appropriate IAM permissions for both AWS Glue and Amazon S3 operations. 

### Tables registered in AWS Lake Formation


For tables registered with AWS Lake Formation, the job runtime role accesses the AWS Glue Data Catalog metadata, while temporary credentials vended by Lake Formation access the underlying table data in Amazon S3. The Lake Formation permissions required to execute an operation depend on the AWS Glue Data Catalog and Amazon S3 API calls that the Spark job initiates and can be summarized as follows:
+ **DESCRIBE** permission allows the runtime role to read table or database metadata in the Data Catalog
+ **ALTER** permission allows the runtime role to modify table or database metadata in the Data Catalog
+ **DROP** permission allows the runtime role to delete table or database metadata from the Data Catalog
+ **SELECT** permission allows the runtime role to read table data from Amazon S3
+ **INSERT** permission allows the runtime role to write table data to Amazon S3
+ **DELETE** permission allows the runtime role to delete table data from Amazon S3

**Note**  
Lake Formation evaluates permissions lazily when a Spark job calls AWS Glue to retrieve table metadata and Amazon S3 to retrieve table data. Jobs that use a runtime role with insufficient permissions will not fail until Spark makes an AWS Glue or Amazon S3 call that requires the missing permission.

**Note**  
In the following matrices of supported operations:   
Operations marked as **Supported** exclusively use Lake Formation credentials to access table data for tables registered with Lake Formation. If Lake Formation permissions are insufficient, the operation will not fall back to runtime role credentials. For tables not registered with Lake Formation, the job runtime role credentials access the table data.
Operations marked as **Supported with IAM permissions on Amazon S3 location** do not use Lake Formation credentials to access underlying table data in Amazon S3. To run these operations, the job runtime role must have the necessary Amazon S3 IAM permissions to access the table data, regardless of whether the table is registered with Lake Formation.

------
#### [ Hive ]

[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/emr-serverless-lf-enable.html)

------
#### [ Iceberg ]

[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/emr-serverless-lf-enable.html)

**Spark configuration for Iceberg:** The following sample shows how to configure Spark with Iceberg. To run Iceberg jobs, provide the following `spark-submit` properties.

```
--conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog
--conf spark.sql.catalog.spark_catalog.warehouse=<S3_DATA_LOCATION>
--conf spark.sql.catalog.spark_catalog.glue.account-id=<ACCOUNT_ID>
--conf spark.sql.catalog.spark_catalog.client.region=<REGION>
--conf spark.sql.catalog.spark_catalog.glue.endpoint=https://glue.<REGION>.amazonaws.com
```
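If you generate these properties programmatically (for example, when submitting jobs for several accounts or Regions), a small templating helper keeps the catalog prefix consistent. The helper name and structure below are illustrative, not part of any AWS tooling:

```python
# Hypothetical helper: renders the spark-submit --conf lines shown above
# for a given warehouse location, account, and Region.
def iceberg_conf(warehouse: str, account_id: str, region: str) -> list:
    catalog = "spark.sql.catalog.spark_catalog"
    return [
        f"--conf {catalog}=org.apache.iceberg.spark.SparkSessionCatalog",
        f"--conf {catalog}.warehouse={warehouse}",
        f"--conf {catalog}.glue.account-id={account_id}",
        f"--conf {catalog}.client.region={region}",
        f"--conf {catalog}.glue.endpoint=https://glue.{region}.amazonaws.com",
    ]
```

The rendered lines can then be joined into the `sparkSubmitParameters` string of a job run request.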

------
#### [ Hudi ]

[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/emr-serverless-lf-enable.html)

The following samples configure Spark with Hudi, specifying file locations and other properties necessary for use.

**Spark config for Hudi:** This snippet, when used in a notebook, specifies the path to the Hudi Spark bundle JAR file, which enables Hudi functionality in Spark. It also configures Spark to use the AWS Glue Data Catalog as the metastore.

```
%%configure -f
{
    "conf": {
        "spark.jars": "/usr/lib/hudi/hudi-spark-bundle.jar",
        "spark.hadoop.hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory",
        "spark.serializer": "org.apache.spark.serializer.JavaSerializer",
        "spark.sql.catalog.spark_catalog": "org.apache.spark.sql.hudi.catalog.HoodieCatalog",
        "spark.sql.extensions": "org.apache.spark.sql.hudi.HoodieSparkSessionExtension"
    }
}
```

**Spark config for Hudi with AWS Glue:** This snippet, when used in a notebook, enables Hudi as a supported data-lake format and ensures that Hudi libraries and dependencies are available.

```
%%configure
{
    "--conf": "spark.serializer=org.apache.spark.serializer.JavaSerializer --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog --conf spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension",
    "--datalake-formats": "hudi",
    "--enable-glue-datacatalog": "true",
    "--enable-lakeformation-fine-grained-access": "true"
}
```

------
#### [ Delta Lake ]

[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/emr-serverless-lf-enable.html)

**EMR Serverless with Delta Lake:** To use Delta Lake with Lake Formation on EMR Serverless, run the following command:

```
spark-sql \
  --conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension,com.amazonaws.emr.recordserver.connector.spark.sql.RecordServerSQLExtension \
  --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog
```

------

# Debugging jobs

**Note**  
With this feature, you can access **stdout** and **stderr** logs for the system profile workers, which may contain sensitive, unfiltered information. Use the following permission only for accessing non-production data. For applications created for use with production jobs, we strongly suggest that you grant these permissions only to administrators or users with elevated data access.

With Amazon EMR 7.3.0 and later, EMR Serverless enables a self-debugging capability for Lake Formation-enabled batch jobs. To use it, set the new **accessSystemProfileLogs** parameter in the [GetDashboardForJobRun](https://docs.aws.amazon.com/emr-serverless/latest/APIReference/API_GetDashboardForJobRun.html) API. When **accessSystemProfileLogs** is set to **true**, you can access the **stdout** and **stderr** logs for the system profile workers, which you can use to debug a Lake Formation-enabled EMR Serverless batch job.

```
aws emr-serverless get-dashboard-for-job-run \
  --application-id application-id \
  --job-run-id job-run-id \
  --access-system-profile-logs
```
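The same call can be made from the AWS SDK for Python (Boto3). The sketch below shows only the parameter shape; parameter casing follows the API reference linked above, and the client call is commented out because it requires AWS credentials, so verify it against your SDK version before relying on it.

```python
# Sketch: GetDashboardForJobRun parameters with system profile log access.
# The values here are placeholders copied from the CLI example above.
params = {
    "applicationId": "application-id",
    "jobRunId": "job-run-id",
    "accessSystemProfileLogs": True,
}

# import boto3
# client = boto3.client("emr-serverless")
# response = client.get_dashboard_for_job_run(**params)
# The response contains a pre-signed dashboard URL.
```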

## Required permissions

The principal who wants to debug Lake Formation-enabled batch jobs using **GetDashboardForJobRun** must have the following additional permissions:

```
{
    "Sid": "AccessSystemProfileLogs",
    "Effect": "Allow",
    "Action": [
        "emr-serverless:GetDashboardForJobRun",
        "emr-serverless:AccessSystemProfileLogs",
        "glue:GetDatabases",
        "glue:SearchTables"
    ],
    "Resource": [
        "arn:aws:emr-serverless:region:account-id:/applications/applicationId/jobruns/jobid",
        "arn:aws:glue:region:account-id:catalog",
        "arn:aws:glue:region:account-id:database/*",
        "arn:aws:glue:region:account-id:table/*/*"
    ]
}
```
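To scope the first `Resource` entry to a single job run instead of a wildcard, the ARN can be assembled from its parts. The helper name below is illustrative, and the region, account, and IDs are placeholders:

```python
# Hypothetical helper: builds the EMR Serverless job run ARN used in the
# policy's Resource list above.
def job_run_arn(region: str, account: str, app_id: str, job_run_id: str) -> str:
    return (f"arn:aws:emr-serverless:{region}:{account}"
            f":/applications/{app_id}/jobruns/{job_run_id}")
```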

## Considerations

System profile logs for debugging are visible for jobs that access databases or tables in Lake Formation within the same account as the job. They are not visible in the following scenarios:
+ If the data catalog managed using Lake Formation permissions has cross-account databases and tables
+ If the data catalog managed using Lake Formation permissions has resource links

# Working with Glue Data Catalog views


You can create and manage views in the AWS Glue Data Catalog for use with EMR Serverless, commonly known as AWS Glue Data Catalog views. These views are useful because they support multiple SQL query engines, so you can access the same view across different AWS services, such as EMR Serverless, Amazon Athena, and Amazon Redshift.

After you create a view in the Data Catalog, you can use resource grants and tag-based access controls in AWS Lake Formation to grant access to it. With this method of access control, you don't need to configure additional access to the tables that the view references. This method of granting permissions is called definer semantics, and these views are called definer views. For more information about access control in Lake Formation, see [Granting and revoking permissions on Data Catalog resources](https://docs.aws.amazon.com/lake-formation/latest/dg/granting-catalog-permissions.html) in the AWS Lake Formation Developer Guide.

Data Catalog views are useful for the following use cases:
+ **Granular access control** – You can create a view that restricts data access based on the permissions the user needs. For example, you can use views in the Data Catalog to prevent employees who don’t work in the HR department from seeing personally identifiable information (PII).
+ **Complete view definition** – By applying filters on your view in the Data Catalog, you make sure that data records available in a view in the Data Catalog are always complete.
+ **Enhanced security** – The query definition used to create the view must be complete, which makes views in the Data Catalog less susceptible to SQL commands from malicious actors.
+ **Simplified data sharing** – Share data with other AWS accounts without moving it. For more information, refer to [Cross-account data sharing in Lake Formation](https://docs.aws.amazon.com/lake-formation/latest/dg/cross-account-permissions.html).

## Creating a Data Catalog view


There are different ways to create a Data Catalog view. These include using the AWS CLI or Spark SQL. A few examples follow.

------
#### [ Using SQL ]

The following demonstrates the syntax for creating a Data Catalog view. Note the `MULTI DIALECT` view type. This distinguishes the Data Catalog view from other views. The `SECURITY` predicate is specified as `DEFINER`. This indicates a Data Catalog view with `DEFINER` semantics.

```
CREATE [ OR REPLACE ] PROTECTED MULTI DIALECT VIEW [IF NOT EXISTS] view_name
[(column_name [COMMENT column_comment], ...) ]
[ COMMENT view_comment ]
[TBLPROPERTIES (property_name = property_value, ... )]
SECURITY DEFINER
AS query;
```

The following is a sample `CREATE` statement, following the syntax:

```
CREATE PROTECTED MULTI DIALECT VIEW catalog_view
SECURITY DEFINER
AS
SELECT order_date, sum(totalprice) AS price
FROM source_table
GROUP BY order_date
```

You can also create a view in dry-run mode using SQL to test view creation without actually creating the resource. This option validates the input and, if the validation succeeds, returns the JSON of the AWS Glue table object that would represent the view. The actual view isn't created.

```
CREATE [ OR REPLACE ] PROTECTED MULTI DIALECT VIEW view_name
SECURITY DEFINER 
[ SHOW VIEW JSON ]
AS view-sql
```

------
#### [ Using the AWS CLI ]

**Note**  
When you use the CLI command, the SQL used to create the view isn't parsed. This can result in a case where the view is created, but queries aren't successful. Be sure to test your SQL syntax prior to creating the view.

You use the following CLI command to create a view:

```
aws glue create-table --cli-input-json '{
  "DatabaseName": "database",
  "TableInput": {
    "Name": "view",
    "StorageDescriptor": {
      "Columns": [
        {
          "Name": "col1",
          "Type": "data-type"
        },
        ...
        {
          "Name": "col_n",
          "Type": "data-type"
        }
      ],
      "SerdeInfo": {}
    },
    "ViewDefinition": {
      "SubObjects": [
        "arn:aws:glue:aws-region:aws-account-id:table/database/referenced-table1",
        ...
        "arn:aws:glue:aws-region:aws-account-id:table/database/referenced-tableN",
       ],
      "IsProtected": true,
      "Representations": [
        {
          "Dialect": "SPARK",
          "DialectVersion": "1.0",
          "ViewOriginalText": "Spark-SQL",
          "ViewExpandedText": "Spark-SQL"
        }
      ]
    }
  }
}'
```
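If you generate the `--cli-input-json` payload from code, a builder function keeps the nesting consistent. The sketch below mirrors the field names in the CLI example above; verify them against the AWS Glue `CreateTable` API before relying on it, and note that the function name is illustrative:

```python
# Illustrative builder for the Glue create-table payload shown above.
def view_payload(database, view_name, columns, referenced_tables, spark_sql):
    """columns: list of (name, type) pairs; referenced_tables: list of table ARNs."""
    return {
        "DatabaseName": database,
        "TableInput": {
            "Name": view_name,
            "StorageDescriptor": {
                "Columns": [{"Name": n, "Type": t} for n, t in columns],
                "SerdeInfo": {},
            },
            "ViewDefinition": {
                "SubObjects": referenced_tables,
                "IsProtected": True,
                "Representations": [{
                    "Dialect": "SPARK",
                    "DialectVersion": "1.0",
                    "ViewOriginalText": spark_sql,
                    "ViewExpandedText": spark_sql,
                }],
            },
        },
    }
```

Serializing the result with `json.dumps` produces the string to pass to `aws glue create-table --cli-input-json`.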

------

## Supported view operations


The following command fragments show you various ways to work with Data Catalog views:
+ **CREATE VIEW**

  Creates a data-catalog view. The following is a sample that shows creating a view from an existing table:

  ```
  CREATE PROTECTED MULTI DIALECT VIEW catalog_view 
  SECURITY DEFINER AS SELECT * FROM my_catalog.my_database.source_table
  ```
+ **ALTER VIEW**

  Available syntax:
  + `ALTER VIEW view_name [FORCE] ADD DIALECT AS query`
  + `ALTER VIEW view_name [FORCE] UPDATE DIALECT AS query`
  + `ALTER VIEW view_name DROP DIALECT`

  You can use the `FORCE ADD DIALECT` option to force an update of the schema and subobjects to match the new engine dialect. Note that doing this can result in query errors if you don't also use `FORCE` to update the other engine dialects. The following demonstrates a sample:

  ```
  ALTER VIEW catalog_view FORCE ADD DIALECT
  AS
  SELECT order_date, sum(totalprice) AS price
  FROM source_table
  GROUP BY order_date;
  ```

  The following demonstrates how to alter a view to update the dialect:

  ```
  ALTER VIEW catalog_view UPDATE DIALECT AS 
  SELECT count(*) FROM my_catalog.my_database.source_table;
  ```
+ **DESCRIBE VIEW**

  Available syntax for describing a view:
  + `SHOW COLUMNS {FROM|IN} view_name [{FROM|IN} database_name]` – If the user has the required AWS Glue and Lake Formation permissions to describe the view, they can list the columns. The following demonstrates a couple sample commands for showing columns:

    ```
    SHOW COLUMNS FROM my_database.catalog_view;
    SHOW COLUMNS IN my_database.catalog_view;
    ```
  + `DESCRIBE view_name` – If the user has the required AWS Glue and Lake Formation permissions to describe the view, they can list the columns in the view along with its metadata.
+ **DROP VIEW**

  Available syntax:
  + `DROP VIEW [ IF EXISTS ] view_name`

    The following sample shows a `DROP` statement that tests if a view exists prior to dropping it:

    ```
    DROP VIEW IF EXISTS catalog_view;
    ```
+ **SHOW CREATE VIEW**
  + `SHOW CREATE VIEW view_name` – Shows the SQL statement that creates the specified view. The following is a sample that shows creating a data-catalog view:

    ```
    SHOW CREATE VIEW my_database.catalog_view;
    CREATE PROTECTED MULTI DIALECT VIEW my_catalog.my_database.catalog_view (
      net_profit,
      customer_id,
      item_id,
      sold_date)
    TBLPROPERTIES (
      'transient_lastDdlTime' = '1736267222')
    SECURITY DEFINER AS SELECT * FROM
    my_database.store_sales_partitioned_lf WHERE customer_id IN (SELECT customer_id from source_table limit 10)
    ```
+ **SHOW VIEWS**

  Lists all views in the catalog, such as regular views, multi-dialect views (MDV), and MDVs without a Spark dialect. Available syntax is the following:
  + `SHOW VIEWS [{ FROM | IN } database_name] [LIKE regex_pattern]`:

    The following demonstrates a sample command to show views:

    ```
    SHOW VIEWS IN marketing_analytics LIKE 'catalog_view*';
    ```

For more information about creating and configuring data-catalog views, refer to [Building AWS Glue Data Catalog views](https://docs.aws.amazon.com/lake-formation/latest/dg/working-with-views.html) in the AWS Lake Formation Developer Guide.

## Querying a Data Catalog view


 After creating a Data Catalog view, you can query it using an Amazon EMR Serverless Spark job that has AWS Lake Formation fine-grained access control enabled. The job runtime role must have the Lake Formation `SELECT` permission on the Data Catalog view. You don't need to grant access to the underlying tables referenced in the view. 

Once you have everything set up, you can query your view. For example, after creating an EMR Serverless application in EMR Studio, run the following query to access a view.

```
SELECT * from my_database.catalog_view LIMIT 10;
```

A helpful function is `invoker_principal()`, which returns the unique identifier of the EMR Serverless job runtime role. You can use it to control the view output based on the invoking principal, for example by adding a condition in your view that refines query results based on the calling role. The job runtime role must have permission for the `LakeFormation:GetDataLakePrincipal` IAM action to use this function.

```
select invoker_principal();
```

You can add this function to a `WHERE` clause, for instance, to refine query results.
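For example, a definer view can filter rows to those owned by the calling role. The table and column names below (`sales_by_owner`, `owner_arn`) are hypothetical, shown only to illustrate placing `invoker_principal()` in a `WHERE` clause:

```python
# Sketch: a Data Catalog view definition that scopes rows to the calling
# principal. Submit this SQL from a Spark job, for example with spark.sql().
view_sql = """
CREATE PROTECTED MULTI DIALECT VIEW sales_scoped_view
SECURITY DEFINER AS
SELECT order_date, price, owner_arn
FROM my_database.sales_by_owner
WHERE owner_arn = invoker_principal()
"""
```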

## Considerations and limitations


When you create Data Catalog views, the following apply:
+ You can only create Data Catalog views with Amazon EMR 7.6 and above.
+ The Data Catalog view definer must have `SELECT` access to the underlying base tables accessed by the view. Creating the Data Catalog view fails if a specific base table has any Lake Formation filters imposed on the definer role.
+ Base tables must not have the `IAMAllowedPrincipals` data lake permission in Lake Formation. If present, the error *Multi Dialect views may only reference tables without IAMAllowedPrincipals permissions* occurs.
+ The table's Amazon S3 location must be registered as a Lake Formation data lake location. If the table isn't registered, the error *Multi Dialect views may only reference Lake Formation managed tables* occurs. For information about how to register Amazon S3 locations in Lake Formation, refer to [Registering an Amazon S3 location](https://docs.aws.amazon.com/lake-formation/latest/dg/register-location.html) in the AWS Lake Formation Developer Guide.
+ You can only create `PROTECTED` Data Catalog views. `UNPROTECTED` views aren't supported.
+ You can't reference tables in another AWS account in a Data Catalog view definition. You also can't reference a table in the same account that's in a separate region.
+ To share data across an account or region, the entire view must be shared cross account and cross region, using Lake Formation resource links.
+ User-defined functions (UDFs) aren't supported.
+ You can use views based on Iceberg tables. The open-table formats Apache Hudi and Delta Lake are also supported.
+ You can't reference other views in Data Catalog views.
+ An AWS Glue Data Catalog view schema is always stored using lowercase. For example, if you use a DDL statement to create a Glue Data Catalog view with a column named `Castle`, the column created in the Glue Data Catalog will be made lowercase, to `castle`. If you then specify the column name in a DML query as `Castle` or `CASTLE`, EMR Spark will make the name lowercase for you to run the query. But the column heading displays using the casing that you specified in the query. 

  If you want a query to fail in a case where a column name specified in the DML query does not match the column name in the Glue Data Catalog, set `spark.sql.caseSensitive=true`.
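The lowercase resolution behavior described above can be sketched as a small function. This is an explanatory model only, not EMR Spark's actual resolver, and the helper name is hypothetical:

```python
# Sketch of the name-resolution behavior described above: the Glue Data
# Catalog stores column names in lowercase, and EMR Spark lowercases DML
# references unless spark.sql.caseSensitive=true.
def resolve_column(query_name, catalog_columns, case_sensitive=False):
    name = query_name if case_sensitive else query_name.lower()
    return name if name in catalog_columns else None
```

With the default case-insensitive setting, a query for `Castle` resolves to the stored `castle` column; with `spark.sql.caseSensitive=true`, the mismatch causes resolution to fail instead.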

# Open-table format support


EMR Serverless supports SELECT queries on Apache Hive, Apache Iceberg, Delta Lake (7.6.0 and higher), and Apache Hudi (7.6.0 and higher). Starting with EMR 7.12, DML and DDL operations that modify table data are supported for Apache Hive, Apache Iceberg, and Delta Lake tables using Lake Formation vended credentials. 

# Considerations and limitations


## General


Review the following limitations when using Lake Formation with EMR Serverless.

**Note**  
When you enable Lake Formation for a Spark job on EMR Serverless, the job launches a system driver and a user driver. If you specified pre-initialized capacity at launch, the drivers provision from the pre-initialized capacity, and the number of system drivers is equal to the number of user drivers that you specify. If you choose On Demand capacity, EMR Serverless launches a system driver in addition to a user driver. To estimate the costs associated with your EMR Serverless with Lake Formation job, use the [AWS Pricing Calculator](https://calculator.aws/#/addService/EMR).
+ Amazon EMR Serverless with Lake Formation is available in all supported [EMR Serverless Regions](https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/considerations.html).
+ Lake Formation-enabled applications don’t support usage of [ customized EMR Serverless images](https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/application-custom-image.html).
+ You can't turn off `DynamicResourceAllocation` for Lake Formation jobs.
+ You can only use Lake Formation with Spark jobs.
+ EMR Serverless with Lake Formation only supports a single Spark session throughout a job.
+ EMR Serverless with Lake Formation only supports cross-account table queries shared through resource links.
+ The following aren't supported:
  + Resilient distributed datasets (RDD)
  + Spark streaming
  + Access control for nested columns
+ EMR Serverless blocks functionality that might undermine the complete isolation of the system driver, including the following:
  + UDTs, HiveUDFs, and any user-defined function that involves custom classes
  + Custom data sources
  + Supply of additional jars for Spark extension, connector, or metastore
  + `ANALYZE TABLE` command
+ If your EMR Serverless application is in a private subnet with VPC endpoints for Amazon S3, and you attach an endpoint policy to control access, include the permissions detailed in [Managed storage](logging.html#jobs-log-storage-managed-storage) in the endpoint policy for your S3 gateway endpoint so that your jobs can send log data to AWS managed Amazon S3. For help troubleshooting requests, contact AWS Support.
+ Starting with Amazon EMR 7.9.0, Spark FGAC supports S3AFileSystem when used with the s3a:// scheme.
+ Amazon EMR 7.11 supports creating managed tables using CTAS.
+ Amazon EMR 7.12 supports creating managed and external tables using CTAS.

## Permissions

+ To enforce access controls, EXPLAIN PLAN and DDL operations such as DESCRIBE TABLE don't expose restricted information.
+ When you register a table location with Lake Formation, data access uses Lake Formation stored credentials instead of the EMR Serverless job runtime role's IAM permissions. Jobs will fail if the registered role for table location is misconfigured, even when the runtime role has S3 IAM permissions for that location.
+ Starting with Amazon EMR 7.12, you can write to existing Hive and Iceberg tables using DataFrameWriter (V2) with Lake Formation credentials in append mode. For overwrite operations or when creating new tables, EMR uses the runtime role credentials to modify table data.
+ The following limitations apply when using views or cached tables as source data (these limitations do not apply to AWS Glue Data Catalog views):
  + For MERGE, DELETE, and UPDATE operations
    + Supported: Using views and cached tables as source tables.
    + Not supported: Using views and cached tables in assignment and condition clauses.
  + For CREATE OR REPLACE and REPLACE TABLE AS SELECT operations:
    + Not supported: Using views and cached tables as source tables.
+ Delta Lake tables with UDFs in source data support MERGE, DELETE, and UPDATE operations only when deletion vector is enabled.

## Logs and debugging

+ EMR Serverless restricts access to system driver Spark logs on Lake Formation-enabled applications. Since the system driver runs with elevated permissions, events and logs that the system driver generates can include sensitive information. To prevent unauthorized users or code from accessing this sensitive data, EMR Serverless disables access to system driver logs.
+ System profile logs are always persisted in managed storage – this is a mandatory setting that cannot be disabled. These logs are stored securely and encrypted using either a customer managed KMS key or an AWS managed KMS key.

## Iceberg


Review the following considerations when using Apache Iceberg:
+ You can only use Apache Iceberg with session catalog and not arbitrarily named catalogs.
+ Iceberg tables that are registered in Lake Formation only support the metadata tables `history`, `metadata_log_entries`, `snapshots`, `files`, `manifests`, and `refs`. Amazon EMR hides the columns that might have sensitive data, such as `partitions`, `path`, and `summaries`. This limitation doesn't apply to Iceberg tables that aren't registered in Lake Formation.
+ Tables that are not registered in Lake Formation support all Iceberg stored procedures. The `register_table` and `migrate` procedures aren't supported for any tables.
+ We suggest that you use Iceberg DataFrameWriterV2 instead of V1.

# Troubleshooting


See the following sections for troubleshooting solutions.

## Logging


EMR Serverless uses Spark resource profiles to split job execution: the user profile runs the code that you supply, while the system profile enforces Lake Formation policies. You can access the logs for tasks that run under the user profile.

For more information about debugging Lake Formation-enabled jobs, refer to [Debugging jobs](emr-serverless-lf-enable-debugging.html).

## Live UI and Spark History Server


The Live UI and the Spark History Server have all Spark events generated from the user profile and redacted events generated from the system driver.

You can see all of the tasks from the user and system drivers in the **Executors** tab. However, log links are available only for the user profile. Also, some information is redacted from Live UI, such as the number of output records.

## Job failed with insufficient Lake Formation permissions


Make sure that your job runtime role has the permissions to run SELECT and DESCRIBE on the table that you are accessing.

## Job with RDD execution failed


EMR Serverless currently doesn't support resilient distributed dataset (RDD) operations on Lake Formation-enabled jobs.

## Unable to access data files in Amazon S3


Make sure you have registered the location of the data lake in Lake Formation.

## Security validation exception


EMR Serverless detected a security validation error. Contact AWS support for assistance.

## Sharing AWS Glue Data Catalog and tables across accounts


You can share databases and tables across accounts and still use Lake Formation. For more information, refer to [Cross-account data sharing in Lake Formation](https://docs.aws.amazon.com/lake-formation/latest/dg/cross-account-permissions.html) and [How do I share AWS Glue Data Catalog and tables cross-account using AWS Lake Formation?](https://repost.aws/knowledge-center/glue-lake-formation-cross-account).

# Spark native fine-grained access control allowlisted PySpark API


To maintain security and data access controls, Spark fine-grained access control (FGAC) restricts certain PySpark functions. These restrictions are enforced through:
+ Explicit blocking that prevents function execution
+ Architecture incompatibilities that make functions non-functional
+ Functions that may throw errors, return access denied messages, or do nothing when called

The following PySpark features aren't supported in Spark FGAC:
+ RDD operations (blocked with SparkRDDUnsupportedException)
+ Spark Connect (unsupported)
+ Spark Streaming (unsupported)

While we've tested the listed functions in a Native Spark FGAC environment and confirmed they work as expected, our testing typically covers only basic usage of each API. Functions with multiple input types or complex logic paths may have untested scenarios.

For any functions not listed here and not clearly part of the unsupported categories above, we recommend:
+ Testing them first in a non-production environment or small-scale deployment
+ Verifying their behavior before using them in production

**Note**  
If you see a class method listed but not its base class, the method should still work—it just means we haven't explicitly verified the base class constructor.

The PySpark API is organized into modules. General support for methods within each module is detailed in the table below.


| Module name | Status | Notes | 
| --- | --- | --- | 
|  pyspark.core  |  Supported  |  This module contains the main RDD classes, and these functions are mostly unsupported.  | 
|  pyspark.sql  |  Supported  |  | 
|  pyspark.testing  |  Supported  |  | 
|  pyspark.resource  |  Supported  |  | 
|  pyspark.streaming  |  Blocked  |  Streaming usage is blocked in Spark FGAC.  | 
|  pyspark.mllib  |  Experimental  |  This module contains RDD-based ML operations, and these functions are mostly unsupported. This module isn't thoroughly tested.  | 
|  pyspark.ml  |  Experimental  |  This module contains DataFrame-based ML operations, and these functions are mostly supported. This module isn't thoroughly tested.  | 
|  pyspark.pandas  |  Supported  |    | 
|  pyspark.pandas.slow  |  Supported  |    | 
| pyspark.connect |  Blocked  |  Spark Connect usage is blocked in Spark FGAC.  | 
| pyspark.pandas.connect |  Blocked  |  Spark Connect usage is blocked in Spark FGAC.  | 
| pyspark.pandas.slow.connect |  Blocked  |  Spark Connect usage is blocked in Spark FGAC.  | 
| pyspark.errors |  Experimental  |  This module isn't thoroughly tested. Custom error classes can't be utilized.  | 
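The module statuses above can be expressed as a lookup, for example to warn before a job imports a blocked module. The dictionary below copies the table's statuses with module names in dotted form; the helper itself is illustrative and not an AWS API:

```python
# Illustrative lookup of the module-status table for Spark FGAC.
FGAC_MODULE_STATUS = {
    "pyspark.core": "Supported",
    "pyspark.sql": "Supported",
    "pyspark.testing": "Supported",
    "pyspark.resource": "Supported",
    "pyspark.streaming": "Blocked",
    "pyspark.mllib": "Experimental",
    "pyspark.ml": "Experimental",
    "pyspark.pandas": "Supported",
    "pyspark.connect": "Blocked",
    "pyspark.errors": "Experimental",
}

def is_blocked(module: str) -> bool:
    """True if the module's usage is blocked under Spark FGAC."""
    return FGAC_MODULE_STATUS.get(module) == "Blocked"
```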

**API Allowlist**

For a downloadable and easier to search list, a file with the modules and classes is available at [Python functions allowed in Native FGAC](samples/Python functions allowed in Native FGAC.zip).