

# Using EMR Serverless with AWS Lake Formation for fine-grained access control
<a name="emr-serverless-lf-enable"></a>

## Overview
<a name="emr-serverless-lf-enable-overview"></a>

With Amazon EMR releases 7.2.0 and higher, you can use AWS Lake Formation to apply fine-grained access controls on Data Catalog tables that are backed by Amazon S3. This capability lets you configure table-, row-, column-, and cell-level access controls for read queries within your Amazon EMR Serverless Spark jobs. To configure fine-grained access control for Apache Spark batch jobs and interactive sessions, use EMR Studio. See the following sections to learn more about Lake Formation and how to use it with EMR Serverless.

Using Amazon EMR Serverless with AWS Lake Formation incurs additional charges. For more information, refer to [Amazon EMR pricing](https://aws.amazon.com/emr/pricing/).

## How EMR Serverless works with AWS Lake Formation
<a name="emr-serverless-lf-enable-how-it-works"></a>

Using EMR Serverless with Lake Formation lets you enforce Lake Formation permissions on each Spark job that EMR Serverless runs. EMR Serverless uses [Spark resource profiles](https://spark.apache.org/docs/latest/api/java/org/apache/spark/resource/ResourceProfile.html) to create two profiles to execute jobs. The user profile executes user-supplied code, while the system profile enforces Lake Formation policies. For more information, refer to [What is AWS Lake Formation](https://docs.aws.amazon.com/lake-formation/latest/dg/what-is-lake-formation.html) and [Considerations and limitations](https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/emr-serverless-lf-enable-considerations.html).

When you use pre-initialized capacity with Lake Formation, we suggest that you provision a minimum of two Spark drivers. Each Lake Formation-enabled job uses two Spark drivers: one for the user profile and one for the system profile. For the best performance, provision double the number of drivers for Lake Formation-enabled jobs compared to jobs that don't use Lake Formation.

When you run Spark jobs on EMR Serverless, also consider the impact of dynamic allocation on resource management and cluster performance. The `spark.dynamicAllocation.maxExecutors` configuration, which sets the maximum number of executors per resource profile, applies to both user and system executors. If you set that value equal to the maximum allowed number of executors, your job run might get stuck because one type of executor can consume all available resources and block the other.

To keep jobs from running out of resources, EMR Serverless sets the default maximum number of executors per resource profile to 90% of the `spark.dynamicAllocation.maxExecutors` value. You can override this behavior by setting `spark.dynamicAllocation.maxExecutorsRatio` to a value between 0 and 1. You can also configure the following properties to optimize resource allocation and overall performance:
+ `spark.dynamicAllocation.cachedExecutorIdleTimeout`
+ `spark.dynamicAllocation.shuffleTracking.timeout`
+ `spark.cleaner.periodicGC.interval`
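The properties above can be combined in a job's Spark submit parameters. The following is an illustrative sketch; the values are placeholders, not tuned recommendations:

```
--conf spark.dynamicAllocation.maxExecutorsRatio=0.5
--conf spark.dynamicAllocation.cachedExecutorIdleTimeout=30m
--conf spark.dynamicAllocation.shuffleTracking.timeout=30m
--conf spark.cleaner.periodicGC.interval=5min
```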

The following is a high-level overview of how EMR Serverless gets access to data protected by Lake Formation security policies.

![\[How Amazon EMR accesses data protected by Lake Formation security policies.\]](http://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/images/lf-emr-s-architecture.png)


1. A user submits a Spark job to a Lake Formation-enabled EMR Serverless application.

1. EMR Serverless sends the job to a user driver and runs the job in the user profile. The user driver runs a lean version of Spark that can't launch tasks, request executors, or access Amazon S3 or the AWS Glue Data Catalog. The user driver builds a job plan.

1. EMR Serverless sets up a second driver called the system driver and runs it in the system profile (with a privileged identity). EMR Serverless sets up an encrypted TLS channel between the two drivers for communication. The user driver uses the channel to send the job plans to the system driver. The system driver doesn't run user-submitted code. It runs full Spark and communicates with Amazon S3 and the Data Catalog for data access. It requests executors and compiles the job plan into a sequence of execution stages.

1. EMR Serverless then runs the stages on executors that work with the user driver or the system driver. User code in any stage runs exclusively on user profile executors.

1. Stages that read data from Data Catalog tables protected by AWS Lake Formation or those that apply security filters are delegated to system executors.

## Enabling Lake Formation in Amazon EMR
<a name="emr-serverless-lf-enable-config"></a>

To enable Lake Formation, set `spark.emr-serverless.lakeformation.enabled` to `true` under the `spark-defaults` classification in the `runtime-configuration` parameter when [creating an EMR Serverless application](https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/getting-started.html#gs-application-console).

```
aws emr-serverless create-application \
    --release-label emr-7.12.0 \
    --runtime-configuration '{
     "classification": "spark-defaults", 
     "properties": {
      "spark.emr-serverless.lakeformation.enabled": "true"
      }
    }' \
    --type "SPARK"
```

You can also enable Lake Formation when you create a new application in EMR Studio. Choose **Use Lake Formation for fine-grained access control**, available under **Additional configurations**.

[Inter-worker encryption](https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/interworker-encryption.html) is enabled by default when you use Lake Formation with EMR Serverless, so you do not need to explicitly enable inter-worker encryption again.

**Enabling Lake Formation for Spark jobs**

To enable Lake Formation for individual Spark jobs, set `spark.emr-serverless.lakeformation.enabled` to `true` when using `spark-submit`:

```
--conf spark.emr-serverless.lakeformation.enabled=true
```

## Job runtime role IAM permissions
<a name="emr-serverless-lf-enable-permissions"></a>

Lake Formation permissions control access to AWS Glue Data Catalog resources, Amazon S3 locations, and the underlying data at those locations. IAM permissions control access to the Lake Formation and AWS Glue APIs and resources. Although you might have the Lake Formation permission to access a table in the Data Catalog (SELECT), your operation fails if you don’t have the IAM permission on the `glue:Get*` API operation. 

The following example policy grants IAM permissions to access a script in Amazon S3, upload logs to Amazon S3, call AWS Glue API operations, and access Lake Formation.

------
#### [ JSON ]


```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ScriptAccess",
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::amzn-s3-demo-bucket",
        "arn:aws:s3:::amzn-s3-demo-bucket/scripts/*"
      ]
    },
    {
      "Sid": "LoggingAccess",
      "Effect": "Allow",
      "Action": [
        "s3:PutObject"
      ],
      "Resource": [
        "arn:aws:s3:::amzn-s3-demo-bucket/logs/*"
      ]
    },
    {
      "Sid": "GlueCatalogAccess",
      "Effect": "Allow",
      "Action": [
        "glue:Get*",
        "glue:Create*",
        "glue:Update*"
      ],
      "Resource": [
        "*"
      ]
    },
    {
      "Sid": "LakeFormationAccess",
      "Effect": "Allow",
      "Action": [
        "lakeformation:GetDataAccess"
      ],
      "Resource": [
        "*"
      ]
    }
  ]
}
```

------

## Setting up Lake Formation permissions for job runtime role
<a name="emr-serverless-lf-enable-set-up-grants-for-role"></a>

First, register the location of your Hive table with Lake Formation. Then create permissions for your job runtime role on your desired table. For more details about Lake Formation, refer to [ What is AWS Lake Formation?](https://docs.aws.amazon.com/lake-formation/latest/dg/what-is-lake-formation.html) in the *AWS Lake Formation Developer Guide*.
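As a sketch of these two steps with the AWS CLI (the account ID, role names, bucket, database, and table names are placeholders), you might register the location and grant `SELECT` as follows:

```
# Register the Amazon S3 location with Lake Formation
aws lakeformation register-resource \
    --resource-arn arn:aws:s3:::amzn-s3-demo-bucket/data \
    --role-arn arn:aws:iam::111122223333:role/LakeFormationRegistrationRole

# Grant the job runtime role SELECT on the table
aws lakeformation grant-permissions \
    --principal DataLakePrincipalIdentifier=arn:aws:iam::111122223333:role/EMRServerlessRuntimeRole \
    --permissions SELECT \
    --resource '{"Table": {"DatabaseName": "my_database", "Name": "source_table"}}'
```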

After you set up the Lake Formation permissions, submit Spark jobs on Amazon EMR Serverless. For more information about Spark jobs, refer to [Spark examples](https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/jobs-spark.html#spark-examples).

## Submitting a job run
<a name="emr-serverless-lf-enable-submit-job"></a>

After you finish setting up the Lake Formation grants, you can [ submit Spark jobs on EMR Serverless.](https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/jobs-spark.html#spark-examples) The section that follows shows examples of how to configure and submit job run properties.
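For example, a minimal job submission might look like the following sketch. The application ID, role ARN, and script location are placeholders, and the `spark.emr-serverless.lakeformation.enabled` property is only needed if you didn't already enable Lake Formation at the application level:

```
aws emr-serverless start-job-run \
    --application-id application-id \
    --execution-role-arn arn:aws:iam::111122223333:role/EMRServerlessRuntimeRole \
    --job-driver '{
        "sparkSubmit": {
            "entryPoint": "s3://amzn-s3-demo-bucket/scripts/my-job.py",
            "sparkSubmitParameters": "--conf spark.emr-serverless.lakeformation.enabled=true"
        }
    }'
```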

## Permission requirements
<a name="emr-serverless-lf-enable-otf-permissions"></a>

### Tables not registered in AWS Lake Formation
<a name="emr-s-lf-otf-permissions"></a>

For tables not registered with AWS Lake Formation, the job runtime role accesses both the AWS Glue Data Catalog and the underlying table data in Amazon S3. This requires the job runtime role to have appropriate IAM permissions for both AWS Glue and Amazon S3 operations. 

### Tables registered in AWS Lake Formation
<a name="emr-s-lf-otf-permissions-tables-lf-registered"></a>

For tables registered with AWS Lake Formation, the job runtime role accesses the AWS Glue Data Catalog metadata, while temporary credentials vended by Lake Formation access the underlying table data in Amazon S3. The Lake Formation permissions required to execute an operation depend on the AWS Glue Data Catalog and Amazon S3 API calls that the Spark job initiates and can be summarized as follows:
+ **DESCRIBE** permission allows the runtime role to read table or database metadata in the Data Catalog
+ **ALTER** permission allows the runtime role to modify table or database metadata in the Data Catalog
+ **DROP** permission allows the runtime role to delete table or database metadata from the Data Catalog
+ **SELECT** permission allows the runtime role to read table data from Amazon S3
+ **INSERT** permission allows the runtime role to write table data to Amazon S3
+ **DELETE** permission allows the runtime role to delete table data from Amazon S3
**Note**  
Lake Formation evaluates permissions lazily when a Spark job calls AWS Glue to retrieve table metadata and Amazon S3 to retrieve table data. Jobs that use a runtime role with insufficient permissions will not fail until Spark makes an AWS Glue or Amazon S3 call that requires the missing permission.

**Note**  
In the following supported table matrix:   
Operations marked as **Supported** exclusively use Lake Formation credentials to access table data for tables registered with Lake Formation. If Lake Formation permissions are insufficient, the operation will not fall back to runtime role credentials. For tables not registered with Lake Formation, the job runtime role credentials access the table data.
Operations marked as **Supported with IAM permissions on Amazon S3 location** do not use Lake Formation credentials to access underlying table data in Amazon S3. To run these operations, the job runtime role must have the necessary Amazon S3 IAM permissions to access the table data, regardless of whether the table is registered with Lake Formation.

------
#### [ Hive ]

[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/emr-serverless-lf-enable.html)

------
#### [ Iceberg ]

[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/emr-serverless-lf-enable.html)

**Spark configuration for Iceberg:** The following sample shows how to configure Spark with Iceberg. To run Iceberg jobs, provide the following `spark-submit` properties.

```
--conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog
--conf spark.sql.catalog.spark_catalog.warehouse=<S3_DATA_LOCATION>
--conf spark.sql.catalog.spark_catalog.glue.account-id=<ACCOUNT_ID>
--conf spark.sql.catalog.spark_catalog.client.region=<REGION>
--conf spark.sql.catalog.spark_catalog.glue.endpoint=https://glue.<REGION>.amazonaws.com
```

------
#### [ Hudi ]

[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/emr-serverless-lf-enable.html)

The following samples configure Spark with Hudi, specifying file locations and other properties necessary for use.

**Spark config for Hudi:** This snippet, when used in a notebook, specifies the path to the Hudi Spark bundle JAR file, which enables Hudi functionality in Spark. It also configures Spark to use the AWS Glue Data Catalog as the metastore.

```
%%configure -f
{
    "conf": {
        "spark.jars": "/usr/lib/hudi/hudi-spark-bundle.jar",
        "spark.hadoop.hive.metastore.client.factory.class": "com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory",
        "spark.serializer": "org.apache.spark.serializer.JavaSerializer",
        "spark.sql.catalog.spark_catalog": "org.apache.spark.sql.hudi.catalog.HoodieCatalog",
        "spark.sql.extensions": "org.apache.spark.sql.hudi.HoodieSparkSessionExtension"
    }
}
```

**Spark config for Hudi with AWS Glue:** This snippet, when used in a notebook, enables Hudi as a supported data lake format and ensures that Hudi libraries and dependencies are available.

```
%%configure
{
    "--conf": "spark.serializer=org.apache.spark.serializer.JavaSerializer --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog --conf spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension",
    "--datalake-formats": "hudi",
    "--enable-glue-datacatalog": "true",
    "--enable-lakeformation-fine-grained-access": "true"
}
```

------
#### [ Delta Lake ]

[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/emr-serverless-lf-enable.html)

**EMR Serverless with Delta Lake:** To use Delta Lake with Lake Formation on EMR Serverless, run the following command:

```
spark-sql \
  --conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension,com.amazonaws.emr.recordserver.connector.spark.sql.RecordServerSQLExtension \
  --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog
```

------

# Debugging jobs
<a name="emr-serverless-lf-enable-debugging"></a>

**Note**  
With this feature, you can access **stdout** and **stderr** logs for the system profile workers, which may contain sensitive, unfiltered information. Use the following permission only for accessing non-production data. For applications created for use with production jobs, we strongly suggest that you grant these permissions only to administrators or users with elevated data access.

With Amazon EMR releases 7.3.0 and higher, EMR Serverless supports a self-debugging capability for Lake Formation-enabled batch jobs. To use it, set the **accessSystemProfileLogs** parameter in the [GetDashboardForJobRun](https://docs.aws.amazon.com/emr-serverless/latest/APIReference/API_GetDashboardForJobRun.html) API. If **accessSystemProfileLogs** is set to **true**, you can access the **stdout** and **stderr** logs for the system profile workers, which you can use to debug a Lake Formation-enabled EMR Serverless batch job.

```
aws emr-serverless get-dashboard-for-job-run \
  --application-id application-id \
  --job-run-id job-run-id \
  --access-system-profile-logs
```

## Required permissions
<a name="emr-serverless-lf-enable-debugging-perms"></a>

The principal who wants to debug Lake Formation-enabled batch jobs using **GetDashboardForJobRun** must have the following additional permissions:

```
{
    "Sid": "AccessSystemProfileLogs",
    "Effect": "Allow",
    "Action": [
        "emr-serverless:GetDashboardForJobRun",
        "emr-serverless:AccessSystemProfileLogs",
        "glue:GetDatabases",
        "glue:SearchTables"
    ],
    "Resource": [
        "arn:aws:emr-serverless:region:account-id:/applications/applicationId/jobruns/jobid",
        "arn:aws:glue:region:account-id:catalog",
        "arn:aws:glue:region:account-id:database/*",
        "arn:aws:glue:region:account-id:table/*/*"
    ]
}
```

## Considerations
<a name="emr-serverless-lf-enable-debugging-considerations"></a>

System profile logs for debugging are visible for jobs that access databases or tables in Lake Formation within the same account as the job. They are not visible in the following scenarios:
+ If the data catalog managed using Lake Formation permissions has cross-account databases and tables
+ If the data catalog managed using Lake Formation permissions has resource links

# Working with Glue Data Catalog views
<a name="SECTION-jobs-glue-data-catalog-views"></a>

You can create and manage views in the AWS Glue Data Catalog for use with EMR Serverless. These are commonly known as AWS Glue Data Catalog views. These views are useful because they support multiple SQL query engines, so you can access the same view across different AWS services, such as EMR Serverless, Amazon Athena, and Amazon Redshift.

After you create a view in the Data Catalog, you can use resource grants and tag-based access controls in AWS Lake Formation to grant access to it. With this method of access control, you don't need to configure additional access to the tables that you referenced when creating the view. This method of granting permissions is called definer semantics, and these views are called definer views. For more information about access control in Lake Formation, see [Granting and revoking permissions on Data Catalog resources](https://docs.aws.amazon.com/lake-formation/latest/dg/granting-catalog-permissions.html) in the AWS Lake Formation Developer Guide.

Data Catalog views are useful for the following use cases:
+ **Granular access control** – You can create a view that restricts data access based on the permissions the user needs. For example, you can use views in the Data Catalog to prevent employees who don’t work in the HR department from seeing personally identifiable information (PII).
+ **Complete view definition** – By applying filters on your view in the Data Catalog, you ensure that the data records available in the view are always complete.
+ **Enhanced security** – The query definition used to create the view must be complete. This means that views in the Data Catalog are less susceptible to SQL commands from malicious actors.
+ **Simple data sharing** – Share data with other AWS accounts without moving data. For more information, refer to [Cross-account data sharing in Lake Formation](https://docs.aws.amazon.com/lake-formation/latest/dg/cross-account-permissions.html).

## Creating a Data Catalog view
<a name="SECTION-jobs-glue-data-catalog-views-create"></a>

There are different ways to create a Data Catalog view. These include using the AWS CLI or Spark SQL. A few examples follow.

------
#### [ Using SQL ]

The following demonstrates the syntax for creating a Data Catalog view. Note the `MULTI DIALECT` view type, which distinguishes the Data Catalog view from other views. The `SECURITY` clause is specified as `DEFINER`, which indicates a Data Catalog view with definer semantics.

```
CREATE [ OR REPLACE ] PROTECTED MULTI DIALECT VIEW [IF NOT EXISTS] view_name
[(column_name [COMMENT column_comment], ...) ]
[ COMMENT view_comment ]
[TBLPROPERTIES (property_name = property_value, ... )]
SECURITY DEFINER
AS query;
```

The following is a sample `CREATE` statement, following the syntax:

```
CREATE PROTECTED MULTI DIALECT VIEW catalog_view
SECURITY DEFINER
AS
SELECT order_date, sum(totalprice) AS price
FROM source_table
GROUP BY order_date
```

You can also use SQL to create a view in dry-run mode, which tests view creation without actually creating the resource. A dry run validates the input and, if the validation succeeds, returns the JSON of the AWS Glue table object that would represent the view. The actual view isn't created.

```
CREATE [ OR REPLACE ] PROTECTED MULTI DIALECT VIEW view_name
SECURITY DEFINER 
[ SHOW VIEW JSON ]
AS view-sql
```
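For example, a dry run of the earlier sample view would look like the following. If validation succeeds, the statement returns the JSON of the table object instead of creating `catalog_view`:

```
CREATE PROTECTED MULTI DIALECT VIEW catalog_view
SECURITY DEFINER
SHOW VIEW JSON
AS
SELECT order_date, sum(totalprice) AS price
FROM source_table
GROUP BY order_date;
```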

------
#### [ Using the AWS CLI ]

**Note**  
When you use the CLI command, the SQL used to create the view isn't parsed. This can result in a case where the view is created, but queries aren't successful. Be sure to test your SQL syntax prior to creating the view.

You use the following CLI command to create a view:

```
aws glue create-table --cli-input-json '{
  "DatabaseName": "database",
  "TableInput": {
    "Name": "view",
    "StorageDescriptor": {
      "Columns": [
        {
          "Name": "col1",
          "Type": "data-type"
        },
        ...
        {
          "Name": "col_n",
          "Type": "data-type"
        }
      ],
      "SerdeInfo": {}
    },
    "ViewDefinition": {
      "SubObjects": [
        "arn:aws:glue:aws-region:aws-account-id:table/database/referenced-table1",
        ...
        "arn:aws:glue:aws-region:aws-account-id:table/database/referenced-tableN",
       ],
      "IsProtected": true,
      "Representations": [
        {
          "Dialect": "SPARK",
          "DialectVersion": "1.0",
          "ViewOriginalText": "Spark-SQL",
          "ViewExpandedText": "Spark-SQL"
        }
      ]
    }
  }
}'
```

------

## Supported view operations
<a name="SECTION-jobs-glue-data-catalog-views-supported-operations"></a>

The following command fragments show you various ways to work with Data Catalog views:
+ **CREATE VIEW**

  Creates a data-catalog view. The following is a sample that shows creating a view from an existing table:

  ```
  CREATE PROTECTED MULTI DIALECT VIEW catalog_view 
  SECURITY DEFINER AS SELECT * FROM my_catalog.my_database.source_table
  ```
+ **ALTER VIEW**

  Available syntax:
  + `ALTER VIEW view_name [FORCE] ADD DIALECT AS query`
  + `ALTER VIEW view_name [FORCE] UPDATE DIALECT AS query`
  + `ALTER VIEW view_name DROP DIALECT`

  You can use the `FORCE ADD DIALECT` option to force-update the schema and sub-objects to match the new engine dialect. Note that doing this can result in query errors if you don't also use `FORCE` to update other engine dialects. The following demonstrates a sample:

  ```
  ALTER VIEW catalog_view FORCE ADD DIALECT
  AS
  SELECT order_date, sum(totalprice) AS price
  FROM source_table
  GROUP BY order_date;
  ```

  The following demonstrates how to alter a view to update the dialect:

  ```
  ALTER VIEW catalog_view UPDATE DIALECT AS 
  SELECT count(*) FROM my_catalog.my_database.source_table;
  ```
+ **DESCRIBE VIEW**

  Available syntax for describing a view:
  + `SHOW COLUMNS {FROM|IN} view_name [{FROM|IN} database_name]` – If the user has the required AWS Glue and Lake Formation permissions to describe the view, they can list the columns. The following demonstrates a couple sample commands for showing columns:

    ```
    SHOW COLUMNS FROM my_database.catalog_view;
    SHOW COLUMNS IN my_database.catalog_view;
    ```
  + `DESCRIBE view_name` – If the user has the required AWS Glue and Lake Formation permissions to describe the view, they can list the columns in the view along with its metadata.
+ **DROP VIEW**

  Available syntax:
  + `DROP VIEW [ IF EXISTS ] view_name`

    The following sample shows a `DROP` statement that tests if a view exists prior to dropping it:

    ```
    DROP VIEW IF EXISTS catalog_view;
    ```
+ **SHOW CREATE VIEW**
  + `SHOW CREATE VIEW view_name` – Shows the SQL statement that creates the specified view. The following sample shows the statement for a Data Catalog view:

    ```
    SHOW CREATE TABLE my_database.catalog_view;
    CREATE PROTECTED MULTI DIALECT VIEW my_catalog.my_database.catalog_view (
      net_profit,
      customer_id,
      item_id,
      sold_date)
    TBLPROPERTIES (
      'transient_lastDdlTime' = '1736267222')
    SECURITY DEFINER AS SELECT * FROM
    my_database.store_sales_partitioned_lf WHERE customer_id IN (SELECT customer_id from source_table limit 10)
    ```
+ **SHOW VIEWS**

  Lists all views in the catalog, such as regular views, multi-dialect views (MDV), and MDVs without a Spark dialect. The available syntax is the following:
  + `SHOW VIEWS [{ FROM | IN } database_name] [LIKE regex_pattern]`:

    The following demonstrates a sample command to show views:

    ```
    SHOW VIEWS IN marketing_analytics LIKE 'catalog_view*';
    ```

For more information about creating and configuring Data Catalog views, refer to [Building AWS Glue Data Catalog views](https://docs.aws.amazon.com/lake-formation/latest/dg/working-with-views.html) in the AWS Lake Formation Developer Guide.

## Querying a Data Catalog view
<a name="SECTION-jobs-glue-data-catalog-views-querying"></a>

 After creating a Data Catalog view, you can query it using an Amazon EMR Serverless Spark job that has AWS Lake Formation fine-grained access control enabled. The job runtime role must have the Lake Formation `SELECT` permission on the Data Catalog view. You don't need to grant access to the underlying tables referenced in the view. 

Once you have everything set up, you can query your view. For example, after creating an EMR Serverless application in EMR Studio, run the following query to access a view.

```
SELECT * from my_database.catalog_view LIMIT 10;
```

A helpful function is `invoker_principal()`, which returns the unique identifier of the EMR Serverless job runtime role. You can use it to add a condition in your view that refines query results based on the calling role. To use this function, the job runtime role must have permission for the `lakeformation:GetDataLakePrincipal` IAM action.

```
select invoker_principal();
```

You can add this function to a `WHERE` clause, for instance, to refine query results.
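As an illustration (the table and the `owner_role_arn` column are hypothetical), a view definition could filter rows to those that belong to the calling role:

```
CREATE PROTECTED MULTI DIALECT VIEW my_orders_view
SECURITY DEFINER
AS
SELECT order_id, order_date, totalprice
FROM my_database.orders
WHERE owner_role_arn = invoker_principal();
```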

## Considerations and limitations
<a name="SECTION-jobs-glue-data-catalog-views-considerations"></a>

When you create Data Catalog views, the following apply:
+ You can only create Data Catalog views with Amazon EMR 7.6 and above.
+ The Data Catalog view definer must have `SELECT` access to the underlying base tables accessed by the view. Creating the Data Catalog view fails if a specific base table has any Lake Formation filters imposed on the definer role.
+ Base tables must not have the `IAMAllowedPrincipals` data lake permission in Lake Formation. If present, the error *Multi Dialect views may only reference tables without IAMAllowedPrincipals permissions* occurs.
+ The table's Amazon S3 location must be registered as a Lake Formation data lake location. If the table isn't registered, the error *Multi Dialect views may only reference Lake Formation managed tables* occurs. For information about how to register Amazon S3 locations in Lake Formation, refer to [Registering an Amazon S3 location](https://docs.aws.amazon.com/lake-formation/latest/dg/register-location.html) in the AWS Lake Formation Developer Guide.
+ You can only create `PROTECTED` Data Catalog views. `UNPROTECTED` views aren't supported.
+ You can't reference tables in another AWS account in a Data Catalog view definition. You also can't reference a table in the same account that's in a separate region.
+ To share data across accounts or Regions, the entire view must be shared cross-account and cross-Region using Lake Formation resource links.
+ User-defined functions (UDFs) aren't supported.
+ You can use views based on Iceberg tables. The open-table formats Apache Hudi and Delta Lake are also supported.
+ You can't reference other views in Data Catalog views.
+ An AWS Glue Data Catalog view schema is always stored using lowercase. For example, if you use a DDL statement to create a Glue Data Catalog view with a column named `Castle`, the column created in the Glue Data Catalog will be made lowercase, to `castle`. If you then specify the column name in a DML query as `Castle` or `CASTLE`, EMR Spark will make the name lowercase for you to run the query. But the column heading displays using the casing that you specified in the query. 

  If you want a query to fail in a case where a column name specified in the DML query does not match the column name in the Glue Data Catalog, set `spark.sql.caseSensitive=true`.

# Open-table format support
<a name="emr-serverless-lf-enable-open-table-format-support"></a>

EMR Serverless supports SELECT queries on Apache Hive, Apache Iceberg, Delta Lake (Amazon EMR 7.6.0 and higher), and Apache Hudi (Amazon EMR 7.6.0 and higher) tables. Starting with Amazon EMR 7.12, DML and DDL operations that modify table data are supported for Apache Hive, Apache Iceberg, and Delta Lake tables using Lake Formation vended credentials.

# Considerations and limitations
<a name="emr-serverless-lf-enable-considerations"></a>

## General
<a name="emr-s-lf-considerations"></a>

Review the following limitations when using Lake Formation with EMR Serverless.

**Note**  
When you enable Lake Formation for a Spark job on EMR Serverless, the job launches a system driver and a user driver. If you specified pre-initialized capacity at launch, the drivers provision from the pre-initialized capacity, and the number of system drivers is equal to the number of user drivers that you specify. If you choose On Demand capacity, EMR Serverless launches a system driver in addition to a user driver. To estimate the costs associated with your EMR Serverless with Lake Formation job, use the [AWS Pricing Calculator](https://calculator.aws/#/addService/EMR).
+ Amazon EMR Serverless with Lake Formation is available in all supported [EMR Serverless Regions](https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/considerations.html).
+ Lake Formation-enabled applications don’t support usage of [ customized EMR Serverless images](https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/application-custom-image.html).
+ You can't turn off `DynamicResourceAllocation` for Lake Formation jobs.
+ You can only use Lake Formation with Spark jobs.
+ EMR Serverless with Lake Formation only supports a single Spark session throughout a job.
+ EMR Serverless with Lake Formation only supports cross-account table queries shared through resource links.
+ The following aren't supported:
  + Resilient distributed datasets (RDD)
  + Spark streaming
  + Access control for nested columns
+ EMR Serverless blocks functionalities that might undermine the complete isolation of system driver, including the following:
  + UDTs, HiveUDFs, and any user-defined function that involves custom classes
  + Custom data sources
  + Supply of additional jars for Spark extension, connector, or metastore
  + `ANALYZE TABLE` command
+ If your EMR Serverless application is in a private subnet with VPC endpoints for Amazon S3 and you attach an endpoint policy to control access, include the permissions detailed in [Managed storage](logging.html#jobs-log-storage-managed-storage) in the policy for your S3 gateway endpoint so that your jobs can send log data to AWS managed Amazon S3 storage. For help troubleshooting requests, contact AWS Support.
+ Starting with Amazon EMR 7.9.0, Spark FGAC supports S3AFileSystem when used with the s3a:// scheme.
+ Amazon EMR 7.11 supports creating managed tables using CTAS.
+ Amazon EMR 7.12 supports creating managed and external tables using CTAS.
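As a sketch of the CTAS support noted above (the table, bucket, and column names are placeholders), creating an external table on Amazon EMR 7.12 might look like the following:

```
CREATE TABLE my_database.order_summary
USING PARQUET
LOCATION 's3://amzn-s3-demo-bucket/order-summary/'
AS SELECT order_date, sum(totalprice) AS price
FROM my_database.source_table
GROUP BY order_date;
```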

## Permissions
<a name="emr-s-lf-reads-writes"></a>
+ To enforce access controls, EXPLAIN PLAN and DDL operations such as DESCRIBE TABLE don't expose restricted information.
+ When you register a table location with Lake Formation, data access uses Lake Formation stored credentials instead of the EMR Serverless job runtime role's IAM permissions. Jobs will fail if the role registered for the table location is misconfigured, even when the runtime role has S3 IAM permissions for that location.
+ Starting with Amazon EMR 7.12, you can write to existing Hive and Iceberg tables using DataFrameWriter (V2) with Lake Formation credentials in append mode. For overwrite operations or when creating new tables, EMR uses the runtime role credentials to modify table data.
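As a minimal sketch of the append path described above (the table name and schema are hypothetical, and `spark` is assumed to be the SparkSession of a Lake Formation-enabled job):

```python
# Sketch: appending to an existing Data Catalog table with DataFrameWriterV2.
# The table name and schema are hypothetical; `spark` is assumed to be the
# SparkSession of a Lake Formation-enabled EMR Serverless job.

def append_to_existing_table(spark, rows):
    df = spark.createDataFrame(rows, schema="id INT, name STRING")
    # append() on an existing table uses Lake Formation credentials;
    # overwrite or table-creation paths use the runtime role's credentials.
    df.writeTo("sales_db.orders").append()
```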
+ The following limitations apply when using views or cached tables as source data (these limitations do not apply to AWS Glue Data Catalog views):
  + For MERGE, DELETE, and UPDATE operations:
    + Supported: Using views and cached tables as source tables.
    + Not supported: Using views and cached tables in assignment and condition clauses.
  + For CREATE OR REPLACE and REPLACE TABLE AS SELECT operations:
    + Not supported: Using views and cached tables as source tables.
+ Delta Lake tables with UDFs in source data support MERGE, DELETE, and UPDATE operations only when deletion vectors are enabled.
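The view limitations above can be sketched as follows: a MERGE statement may use a view as its source table, as long as the assignment and condition clauses reference only table columns. All names here are hypothetical placeholders.

```python
# Sketch: MERGE with a view as the *source* table (supported), while the
# assignment and condition clauses reference only base-table columns.
# All database, table, and view names are hypothetical placeholders.

def merge_from_view_sql():
    return (
        "MERGE INTO sales_db.orders AS t "
        "USING sales_db.recent_orders_view AS s "  # view as source: supported
        "ON t.id = s.id "  # condition references columns, not views
        "WHEN MATCHED THEN UPDATE SET t.amount = s.amount "
        "WHEN NOT MATCHED THEN INSERT (id, amount) VALUES (s.id, s.amount)"
    )
```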

## Logs and debugging
<a name="emr-s-lf-debug-logs"></a>
+ EMR Serverless restricts access to system driver Spark logs on Lake Formation-enabled applications. Since the system driver runs with elevated permissions, events and logs that the system driver generates can include sensitive information. To prevent unauthorized users or code from accessing this sensitive data, EMR Serverless disables access to system driver logs.
+ System profile logs are always persisted in managed storage; this setting is mandatory and can't be disabled. These logs are stored securely and encrypted with either a customer managed KMS key or an AWS managed KMS key.

## Iceberg
<a name="emr-s-lf-iceberg"></a>

Review the following considerations when using Apache Iceberg:
+ You can only use Apache Iceberg with the session catalog, not arbitrarily named catalogs.
+ Iceberg tables that are registered in Lake Formation only support the metadata tables `history`, `metadata_log_entries`, `snapshots`, `files`, `manifests`, and `refs`. Amazon EMR hides the columns that might have sensitive data, such as `partitions`, `path`, and `summaries`. This limitation doesn't apply to Iceberg tables that aren't registered in Lake Formation.
+ Tables that aren't registered in Lake Formation support all Iceberg stored procedures. The `register_table` and `migrate` procedures aren't supported for any table.
+ We suggest that you use Iceberg DataFrameWriterV2 instead of V1.
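To illustrate the suggestion above, this hedged sketch contrasts the two writer APIs; `df` is assumed to be an existing DataFrame and the table name is hypothetical.

```python
# Sketch: preferring DataFrameWriterV2 over V1 for Iceberg tables.
# `df` is assumed to be an existing DataFrame; the table name is hypothetical.

def write_with_v2(df):
    # V2 (preferred): explicit verbs such as append() and overwritePartitions()
    df.writeTo("sales_db.events").append()

def write_with_v1(df):
    # V1 (not suggested here): behavior driven by format/mode string combinations
    df.write.format("iceberg").mode("append").save("sales_db.events")
```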

# Troubleshooting
<a name="emr-serverless-lf-troubleshooting"></a>

See the following sections for troubleshooting solutions.

## Logging
<a name="emr-serverless-lf-troubleshooting-logging"></a>

EMR Serverless uses Spark resource profiles to split job execution. The user profile runs the code that you supply, while the system profile enforces Lake Formation policies. You can access logs only for tasks that run under the user profile.

For more information about debugging Lake Formation-enabled jobs, refer to [Debugging jobs](emr-serverless-lf-enable-debugging.html).

## Live UI and Spark History Server
<a name="emr-serverless-lf-troubleshooting-live-ui"></a>

The Live UI and the Spark History Server show all Spark events generated by the user profile and redacted events generated by the system driver.

You can see all of the tasks from the user and system drivers on the **Executors** tab. However, log links are available only for the user profile. Also, some information, such as the number of output records, is redacted from the Live UI.

## Job failed with insufficient Lake Formation permissions
<a name="emr-serverless-lf-troubleshooting-insufficient-lf-permissions"></a>

Make sure that your job runtime role has permissions to run `SELECT` and `DESCRIBE` on the table that you're accessing.
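As a hedged sketch, the request below grants those permissions with the Lake Formation `GrantPermissions` API. The account ID, role name, database, and table are hypothetical placeholders, and the API call itself is commented out.

```python
# Sketch: request for the Lake Formation GrantPermissions API, granting SELECT
# and DESCRIBE on a table to the job runtime role. The account ID, role name,
# database, and table are hypothetical placeholders.
grant_request = {
    "Principal": {
        "DataLakePrincipalIdentifier": "arn:aws:iam::111122223333:role/EMRServerlessJobRole"
    },
    "Resource": {
        "Table": {
            "DatabaseName": "sales_db",
            "Name": "orders",
        }
    },
    "Permissions": ["SELECT", "DESCRIBE"],
}

# To send the request (boto3 assumed available):
# import boto3
# boto3.client("lakeformation").grant_permissions(**grant_request)
```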

## Job with RDD execution failed
<a name="emr-serverless-lf-troubleshooting-rdd-execution"></a>

EMR Serverless currently doesn't support resilient distributed dataset (RDD) operations on Lake Formation-enabled jobs.

## Unable to access data files in Amazon S3
<a name="emr-serverless-lf-troubleshooting-s3-access-failure"></a>

Make sure you have registered the location of the data lake in Lake Formation.
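As a hedged sketch, registration can be done with the Lake Formation `RegisterResource` API. The bucket, prefix, and role ARN below are hypothetical placeholders, and the API call itself is commented out.

```python
# Sketch: request for the Lake Formation RegisterResource API, registering an
# S3 data lake location. The bucket, prefix, and role ARN are hypothetical.
register_request = {
    "ResourceArn": "arn:aws:s3:::amzn-s3-demo-bucket/data/",
    "RoleArn": "arn:aws:iam::111122223333:role/LakeFormationRegistrationRole",
}

# To send the request (boto3 assumed available):
# import boto3
# boto3.client("lakeformation").register_resource(**register_request)
```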

## Security validation exception
<a name="emr-serverless-lf-troubleshooting-security-validation"></a>

EMR Serverless detected a security validation error. Contact AWS support for assistance.

## Sharing AWS Glue Data Catalog and tables across accounts
<a name="emr-serverless-lf-troubleshooting-cross-account"></a>

You can share databases and tables across accounts and still use Lake Formation. For more information, refer to [Cross-account data sharing in Lake Formation](https://docs.aws.amazon.com/lake-formation/latest/dg/cross-account-permissions.html) and [How do I share AWS Glue Data Catalog and tables cross-account using AWS Lake Formation?](https://repost.aws/knowledge-center/glue-lake-formation-cross-account).

# Spark native fine-grained access control allow-listed PySpark API
<a name="clean-rooms-spark-fgac-pyspark-api-allowlist"></a>

To maintain security and data access controls, Spark fine-grained access control (FGAC) restricts certain PySpark functions. These restrictions are enforced through:
+ Explicit blocking that prevents function execution
+ Architectural incompatibilities that make functions non-functional

When called, a restricted function may throw an error, return an access denied message, or do nothing.

The following PySpark features aren't supported in Spark FGAC:
+ RDD operations (blocked with `SparkRDDUnsupportedException`)
+ Spark Connect (unsupported)
+ Spark Streaming (unsupported)

While we've tested the listed functions in a Native Spark FGAC environment and confirmed they work as expected, our testing typically covers only basic usage of each API. Functions with multiple input types or complex logic paths may have untested scenarios.

For any functions not listed here and not clearly part of the unsupported categories above, we recommend:
+ Testing them first in a staging environment or small-scale deployment
+ Verifying their behavior before using them in production

**Note**  
If you see a class method listed but not its base class, the method should still work; it just means we haven't explicitly verified the base class constructor.

The PySpark API is organized into modules. General support for methods within each module is detailed in the table below.


| Module name | Status | Notes | 
| --- | --- | --- | 
| pyspark.core | Supported | This module contains the main RDD classes; the RDD functions are mostly unsupported. | 
| pyspark.sql | Supported |  | 
| pyspark.testing | Supported |  | 
| pyspark.resource | Supported |  | 
| pyspark.streaming | Blocked | Streaming usage is blocked in Spark FGAC. | 
| pyspark.mllib | Experimental | This module contains RDD-based ML operations, and these functions are mostly unsupported. This module isn't thoroughly tested. | 
| pyspark.ml | Experimental | This module contains DataFrame-based ML operations, and these functions are mostly supported. This module isn't thoroughly tested. | 
| pyspark.pandas | Supported |  | 
| pyspark.pandas.slow | Supported |  | 
| pyspark.connect | Blocked | Spark Connect usage is blocked in Spark FGAC. | 
| pyspark.pandas.connect | Blocked | Spark Connect usage is blocked in Spark FGAC. | 
| pyspark.pandas.slow.connect | Blocked | Spark Connect usage is blocked in Spark FGAC. | 
| pyspark.errors | Experimental | This module isn't thoroughly tested. Custom error classes can't be used. | 

**API Allowlist**

For a downloadable, easier-to-search list of the modules and classes, see [Python functions allowed in Native FGAC](samples/Python functions allowed in Native FGAC.zip).