

# Integrate Amazon EMR with AWS Lake Formation

AWS Lake Formation is a managed service that helps you discover, catalog, cleanse, and secure data in an Amazon Simple Storage Service (Amazon S3) data lake. Lake Formation provides fine-grained access at the column, row, or cell level to databases and tables in the AWS Glue Data Catalog. For more information, see [What is AWS Lake Formation?](https://docs.aws.amazon.com/lake-formation/latest/dg/what-is-lake-formation.html)

With Amazon EMR release 6.7.0 and later, you can apply Lake Formation based access control to Spark, Hive, and Presto jobs that you submit to Amazon EMR clusters. To integrate with Lake Formation, you must create an EMR cluster with a *runtime role*. A runtime role is an AWS Identity and Access Management (IAM) role that you associate with Amazon EMR jobs or queries. Amazon EMR then uses this role to access AWS resources. For more information, see [Runtime roles for Amazon EMR steps](emr-steps-runtime-roles.md).

## How Amazon EMR works with Lake Formation


After you integrate Amazon EMR with Lake Formation, you can submit queries to Amazon EMR clusters with the [`Step` API](https://docs.aws.amazon.com/emr/latest/APIReference/API_Step.html) or with SageMaker AI Studio. Lake Formation then provides access to data through temporary credentials for Amazon EMR. This process is called credential vending. For more information, see [What is AWS Lake Formation?](https://docs.aws.amazon.com/lake-formation/latest/dg/what-is-lake-formation.html)

The following is a high-level overview of how Amazon EMR gets access to data protected by Lake Formation security policies.

![\[How Amazon EMR accesses data protected by Lake Formation security policies\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/images/lf-emr-security.png)


1. A user submits an Amazon EMR query for data in Lake Formation.

1. Amazon EMR requests temporary credentials from Lake Formation to give the user data access.

1. Lake Formation returns temporary credentials.

1. Amazon EMR sends the query request to retrieve data from Amazon S3.

1. Amazon EMR receives the data from Amazon S3, filters it, and returns results based on the permissions that the user defined in Lake Formation.

For more information about adding users and groups to Lake Formation policies, see [Granting Data Catalog permissions](https://docs.aws.amazon.com/lake-formation/latest/dg/granting-catalog-permissions.html).

## Prerequisites


You must meet the following requirements before you integrate Amazon EMR and Lake Formation:
+ Turn on runtime role authorization on your Amazon EMR cluster.
+ Use the AWS Glue Data Catalog as your metadata store.
+ Define and manage permissions in Lake Formation to access databases, tables, and columns in AWS Glue Data Catalog. For more information, see [What is AWS Lake Formation?](https://docs.aws.amazon.com/lake-formation/latest/dg/what-is-lake-formation.html)
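As an illustration of the last prerequisite, granting a runtime role access to a Data Catalog table might look like the following AWS CLI sketch. The database name, table name, and role ARN here are hypothetical placeholders.

```shell
# Example only: grant Lake Formation SELECT on a table to a runtime role.
# my_db, my_table, and the role ARN are placeholder values.
aws lakeformation grant-permissions \
  --principal DataLakePrincipalIdentifier=arn:aws:iam::111122223333:role/my-emr-runtime-role \
  --permissions "SELECT" \
  --resource '{"Table": {"DatabaseName": "my_db", "Name": "my_table"}}'
```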

# Fine-grained access with Lake Formation

Amazon EMR releases 6.15.0 and higher include support for fine-grained access control at the row, column, or cell level based on AWS Lake Formation. The topics in this section cover how you can access Lake Formation protected Glue Data catalog tables from EMR Spark jobs or interactive sessions with fine-grained access control.

# Enable Lake Formation with Amazon EMR


With Amazon EMR 6.15.0 and higher, when you run Spark jobs on Amazon EMR on EC2 clusters that access data in the AWS Glue Data Catalog, you can use AWS Lake Formation to apply table, row, column, and cell level permissions on Hudi, Iceberg, or Delta Lake based tables.

In this section, we cover how to create a security configuration and set up Lake Formation to work with Amazon EMR. We also go over how to launch a cluster with the security configuration that you created for Lake Formation. 

## Step 1: Set up a runtime role for your EMR cluster

To use a runtime role for your EMR cluster, you must create a security configuration. With a security configuration, you can apply consistent security, authorization, and authentication options across your clusters. 

1. Create a file called `lf-runtime-roles-sec-cfg.json` with the following security configuration.

   ```
   {
       "AuthorizationConfiguration": {
           "IAMConfiguration": {
               "EnableApplicationScopedIAMRole": true,
               "ApplicationScopedIAMRoleConfiguration": {
                   "PropagateSourceIdentity": true
               }
           },
           "LakeFormationConfiguration": {
               "AuthorizedSessionTagValue": "Amazon EMR"
           }
       },
       "EncryptionConfiguration": {
           "EnableAtRestEncryption": false,
           "EnableInTransitEncryption": true,
           "InTransitEncryptionConfiguration": {
               "TLSCertificateConfiguration": {<certificate-configuration>}
           }
       }
   }
   ```

   The following example uses a zip file with certificates in Amazon S3 as the key provider. (See [Providing certificates for encrypting data in transit with Amazon EMR encryption](emr-encryption-enable.md#emr-encryption-certificates) for certificate requirements.)

   ```
   "TLSCertificateConfiguration": {
       "CertificateProviderType": "PEM",
       "S3Object": "s3://MyConfigStore/artifacts/MyCerts.zip"
   }
   ```

   The following example uses a custom key provider. (See [Providing certificates for encrypting data in transit with Amazon EMR encryption](emr-encryption-enable.md#emr-encryption-certificates) for certificate requirements.)

   ```
   "TLSCertificateConfiguration": {
       "CertificateProviderType": "Custom",
       "S3Object": "s3://MyConfig/artifacts/MyCerts.jar",
       "CertificateProviderClass": "com.mycompany.MyCertProvider"
   }
   ```

1. Next, to ensure that Lake Formation can authorize the session tag, set the `LakeFormationConfiguration/AuthorizedSessionTagValue` property to `Amazon EMR`.

1. Use the following command to create the Amazon EMR security configuration.

   ```
   aws emr create-security-configuration \
   --name 'iamconfig-with-iam-lf' \
   --security-configuration file://lf-runtime-roles-sec-cfg.json
   ```

   Alternatively, you can use the [Amazon EMR console](https://console.aws.amazon.com//emr) to create a security configuration with custom settings.

## Step 2: Launch an Amazon EMR cluster

Now you’re ready to launch an EMR cluster with the security configuration that you created in the previous step. For more information on security configurations, see [Use security configurations to set up Amazon EMR cluster security](emr-security-configurations.md) and [Runtime roles for Amazon EMR steps](emr-steps-runtime-roles.md).
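As a hedged sketch, launching a cluster with that security configuration from the AWS CLI might look like the following. The release label, instance settings, subnet, and key pair are placeholder values; adjust them for your environment.

```shell
# Example only: launch a cluster with the security configuration from Step 1.
# Subnet ID, key pair, and sizing values are placeholders.
aws emr create-cluster \
  --name "emr-lakeformation-cluster" \
  --release-label emr-6.15.0 \
  --applications Name=Spark \
  --instance-type m5.xlarge \
  --instance-count 3 \
  --use-default-roles \
  --ec2-attributes SubnetId=subnet-22XXXX01,KeyName=my-key-pair \
  --security-configuration iamconfig-with-iam-lf
```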

## Step 3: Set up Lake Formation-based column, row, or cell-level permissions with Amazon EMR runtime roles

To apply fine-grained access control at the column, row, or cell level with Lake Formation, the data lake administrator for Lake Formation must set `Amazon EMR` as the value for the session tag configuration, `AuthorizedSessionTagValue`. Lake Formation uses this session tag to authorize callers and provide access to the data lake. You can set this session tag in the **Application integration settings** section of the Lake Formation console.
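If you prefer the AWS CLI over the console, the session tag can also be set through the data lake settings. This is a sketch, not the console procedure: `put-data-lake-settings` replaces the entire settings object, so retrieve the current settings first and edit only the tag list.

```shell
# Example only: set the authorized session tag from the AWS CLI.
# Export the current settings, then edit settings.json so it contains:
#   "AuthorizedSessionTagValueList": ["Amazon EMR"]
aws lakeformation get-data-lake-settings --query DataLakeSettings > settings.json
aws lakeformation put-data-lake-settings --data-lake-settings file://settings.json
```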

## Step 4: Configure AWS Glue and Lake Formation grants for Amazon EMR runtime roles

To continue with your setup of Lake Formation based access control with Amazon EMR runtime roles, you must configure AWS Glue and Lake Formation grants for Amazon EMR runtime roles. To allow your IAM runtime roles to interact with Lake Formation, grant them access with `lakeformation:GetDataAccess` and `glue:Get*`.

Lake Formation permissions control access to AWS Glue Data Catalog resources, Amazon S3 locations, and the underlying data at those locations. IAM permissions control access to the Lake Formation and AWS Glue APIs and resources. Although you might have the Lake Formation permission to access a table in the data catalog (SELECT), your operation fails if you don’t have the IAM permission on the `glue:Get*` API. For more details about Lake Formation access control, see [Lake Formation access control overview](https://docs.aws.amazon.com/lake-formation/latest/dg/lf-permissions-overview.html).

1.  Create the `emr-runtime-roles-lake-formation-policy.json` file with the following content. 

------
#### [ JSON ]

   ```
   {
     "Version": "2012-10-17",
     "Statement": [
       {
         "Sid": "LakeFormationManagedAccess",
         "Effect": "Allow",
         "Action": [
           "lakeformation:GetDataAccess",
           "glue:Get*",
           "glue:Create*",
           "glue:Update*"
         ],
         "Resource": [
           "*"
         ]
       }
     ]
   }
   ```

------

1. Create the related IAM policy.

   ```
   aws iam create-policy \
   --policy-name emr-runtime-roles-lake-formation-policy \
   --policy-document file://emr-runtime-roles-lake-formation-policy.json
   ```

1. To assign this policy to your IAM runtime roles, follow the steps in [Managing AWS Lake Formation permissions](https://docs.aws.amazon.com/lake-formation/latest/dg/managing-permissions.html).

You can now use runtime roles and Lake Formation to apply table and column level permissions. You can also use a source identity to control actions and monitor operations with AWS CloudTrail.

For each IAM role that you plan to use as a runtime role, set the following trust policy, replacing `EMR_EC2_DefaultRole` with your instance profile role. To modify the trust policy of an IAM role, see [Modifying a role trust policy](https://docs.aws.amazon.com//IAM/latest/UserGuide/roles-managingrole-editing-console.html).

```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowAssumeRole",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::<AWS_ACCOUNT_ID>:role/EMR_EC2_DefaultRole"
      },
      "Action": [
        "sts:AssumeRole",
        "sts:TagSession"
      ]
    }
  ]
}
```
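With the trust policy in place, you can pass a runtime role when you submit a step. The following hedged sketch uses `aws emr add-steps` with the `--execution-role-arn` option; the cluster ID, role ARN, and script location are placeholder values.

```shell
# Example only: submit a Spark step that runs with a runtime role.
# The cluster ID, role ARN, and S3 script path are placeholders.
aws emr add-steps \
  --cluster-id j-XXXXXXXXXXXXX \
  --execution-role-arn arn:aws:iam::111122223333:role/my-emr-runtime-role \
  --steps '[{
    "Type": "CUSTOM_JAR",
    "Name": "SparkStep",
    "Jar": "command-runner.jar",
    "ActionOnFailure": "CONTINUE",
    "Args": ["spark-submit", "s3://amzn-s3-demo-bucket/scripts/my-job.py"]
  }]'
```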

For a detailed, end-to-end example, see [Introducing runtime roles for Amazon EMR steps](https://aws.amazon.com/blogs/big-data/introducing-runtime-roles-for-amazon-emr-steps-use-iam-roles-and-aws-lake-formation-for-access-control-with-amazon-emr/).

For information about how to integrate with Iceberg and AWS Glue Data Catalog for a multi-catalog hierarchy, see [Configure Spark to access a multi-catalog hierarchy in AWS Glue Data Catalog](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-multi-catalog.html#emr-lakehouse-using-spark-access).

# Open-table format support


Amazon EMR releases 6.15.0 and higher include support for fine-grained access control at the table, row, column, and cell level based on AWS Lake Formation with Hive tables, Apache Iceberg, Apache Hudi, and Delta Lake when you read and write data with Spark SQL. Starting with EMR 7.12, DML and DDL operations that modify table data are supported for Apache Hive, Apache Iceberg, and Delta Lake tables using Lake Formation vended credentials.

The topics in this section cover how you can access Lake Formation registered tables in open table formats from EMR Spark jobs or interactive sessions with fine-grained access control.

## Permission requirements


### Tables not registered in AWS Lake Formation


For tables not registered with AWS Lake Formation, the job runtime role accesses both the AWS Glue Data Catalog and the underlying table data in Amazon S3. This requires the job runtime role to have appropriate IAM permissions for both AWS Glue and Amazon S3 operations. 

### Tables registered in AWS Lake Formation


For tables registered with AWS Lake Formation, the job runtime role accesses the AWS Glue Data Catalog metadata, while temporary credentials vended by Lake Formation access the underlying table data in Amazon S3. The Lake Formation permissions required to execute an operation depend on the AWS Glue Data Catalog and Amazon S3 API calls that the Spark job initiates and can be summarized as follows:
+ **DESCRIBE** permission allows the runtime role to read table or database metadata in the Data Catalog
+ **ALTER** permission allows the runtime role to modify table or database metadata in the Data Catalog
+ **DROP** permission allows the runtime role to delete table or database metadata from the Data Catalog
+ **SELECT** permission allows the runtime role to read table data from Amazon S3
+ **INSERT** permission allows the runtime role to write table data to Amazon S3
+ **DELETE** permission allows the runtime role to delete table data from Amazon S3
**Note**  
Lake Formation evaluates permissions lazily when a Spark job calls AWS Glue to retrieve table metadata and Amazon S3 to retrieve table data. Jobs that use a runtime role with insufficient permissions will not fail until Spark makes an AWS Glue or Amazon S3 call that requires the missing permission.
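As an illustration of how these permissions map to operations, the following hypothetical Spark SQL statements note the Lake Formation permissions each one typically requires. The database and table names are placeholders.

```sql
-- Typically requires DESCRIBE and SELECT on my_db.my_table
SELECT * FROM my_db.my_table LIMIT 10;

-- Typically requires DESCRIBE and INSERT on my_db.my_table
INSERT INTO my_db.my_table VALUES (1, 'example');

-- Typically requires ALTER on my_db.my_table
ALTER TABLE my_db.my_table ADD COLUMNS (note STRING);

-- Typically requires DROP on my_db.my_table
DROP TABLE my_db.my_table;
```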

**Note**  
In the following supported table matrix:   
Operations marked as **Supported** exclusively use Lake Formation credentials to access table data for tables registered with Lake Formation. If Lake Formation permissions are insufficient, the operation will not fall back to runtime role credentials. For tables not registered with Lake Formation, the job runtime role credentials access the table data.
Operations marked as **Supported with IAM permissions on Amazon S3 location** do not use Lake Formation credentials to access underlying table data in Amazon S3. To run these operations, the job runtime role must have the necessary Amazon S3 IAM permissions to access the table data, regardless of whether the table is registered with Lake Formation.

------
#### [ Hive ]

[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-lf-fgac1.html)

------
#### [ Iceberg ]

[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-lf-fgac1.html)

**Spark configuration for Iceberg:** If you want to use Iceberg format, set the following configurations. Replace `DB_LOCATION` with the Amazon S3 path where your Iceberg tables are located, and replace the region and account ID placeholders with your own values.

```
spark-sql \
--conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
--conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog \
--conf spark.sql.catalog.spark_catalog.warehouse=s3://DB_LOCATION \
--conf spark.sql.catalog.spark_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog \
--conf spark.sql.catalog.spark_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO \
--conf spark.sql.catalog.spark_catalog.glue.account-id=ACCOUNT_ID \
--conf spark.sql.catalog.spark_catalog.glue.id=ACCOUNT_ID \
--conf spark.sql.catalog.spark_catalog.client.region=AWS_REGION
```

If you want to use Iceberg format on earlier EMR versions, use the following command instead:

```
spark-sql \
--conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,com.amazonaws.emr.recordserver.connector.spark.sql.RecordServerSQLExtension \
--conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkCatalog \
--conf spark.sql.catalog.spark_catalog.warehouse=s3://DB_LOCATION \
--conf spark.sql.catalog.spark_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog \
--conf spark.sql.catalog.spark_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO \
--conf spark.sql.catalog.spark_catalog.glue.account-id=ACCOUNT_ID \
--conf spark.sql.catalog.spark_catalog.glue.id=ACCOUNT_ID \
--conf spark.sql.catalog.spark_catalog.client.assume-role.region=AWS_REGION \
--conf spark.sql.catalog.spark_catalog.lf.managed=true
```

**Examples:**

Here are some examples of working with Iceberg tables:

```
-- Create an Iceberg table
CREATE TABLE my_iceberg_table (
    id BIGINT,
    name STRING,
    created_at TIMESTAMP
) USING ICEBERG;

-- Insert data
INSERT INTO my_iceberg_table VALUES (1, 'Alice', current_timestamp());

-- Query the table
SELECT * FROM my_iceberg_table;
```

------
#### [ Hudi ]

[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-lf-fgac1.html)

**Spark configuration for Hudi:**

To start the Spark shell on EMR 7.10 or higher versions, use the following command:

```
spark-sql \
--jars /usr/lib/hudi/hudi-spark-bundle.jar \
--conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog \
--conf spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension
```

To start the Spark shell on earlier EMR versions, use the following command instead:

```
spark-sql \
--jars /usr/lib/hudi/hudi-spark-bundle.jar \
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
--conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog \
--conf spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension,com.amazonaws.emr.recordserver.connector.spark.sql.RecordServerSQLExtension \
--conf spark.sql.catalog.spark_catalog.lf.managed=true
```

**Examples:**

Here are some examples of working with Hudi tables:

```
-- Create a Hudi table
CREATE TABLE my_hudi_table (
    id BIGINT,
    name STRING,
    created_at TIMESTAMP
) USING HUDI
TBLPROPERTIES (
    'type' = 'cow',
    'primaryKey' = 'id'
);

-- Insert data
INSERT INTO my_hudi_table VALUES (1, 'Alice', current_timestamp());

-- Query the latest snapshot
SELECT * FROM my_hudi_table;
```

To query the latest snapshot of copy-on-write tables:

```
SELECT * FROM my_hudi_cow_table
```

```
spark.read.table("my_hudi_cow_table")
```

To query the latest compacted data of `MOR` tables, you can query the read-optimized table that is suffixed with `_ro`:

```
SELECT * FROM my_hudi_mor_table_ro
```

```
spark.read.table("my_hudi_mor_table_ro")
```

------
#### [ Delta Lake ]

[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-lf-fgac1.html)

**Spark configuration for Delta Lake:**

To use Delta Lake with Lake Formation on EMR 7.10 and higher, run the following command:

```
spark-sql \
  --conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension \
  --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog
```

To use Delta Lake with Lake Formation on EMR 6.15 to 7.9, run the following command:

```
spark-sql \
  --conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension,com.amazonaws.emr.recordserver.connector.spark.sql.RecordServerSQLExtension \
  --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog \
  --conf spark.sql.catalog.spark_catalog.lf.managed=true
```

If you want Lake Formation to use record server to manage your Spark catalog, set `spark.sql.catalog.<managed_catalog_name>.lf.managed` to true.

**Examples:**

Here are some examples of working with Delta Lake tables:

```
-- Create a Delta Lake table
CREATE TABLE my_delta_table (
    id BIGINT,
    name STRING,
    created_at TIMESTAMP
) USING DELTA;

-- Insert data
INSERT INTO my_delta_table VALUES (1, 'Alice', current_timestamp());

-- Query the table
SELECT * FROM my_delta_table;

-- Update data
UPDATE my_delta_table SET name = 'Alice Smith' WHERE id = 1;

-- Merge data
MERGE INTO my_delta_table AS target
USING (SELECT 2 as id, 'Bob' as name, current_timestamp() as created_at) AS source
ON target.id = source.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
```

**Creating a Delta Lake table in AWS Glue Data Catalog**

Amazon EMR with Lake Formation doesn't support DDL commands and Delta table creation in EMR releases earlier than 7.12. Follow these steps to create tables in the AWS Glue Data Catalog.

1. Use the following example to create a Delta table. Make sure that your S3 location exists.

   ```
   spark-sql \
   --conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" \
   --conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"
   
   > CREATE DATABASE if not exists <DATABASE_NAME> LOCATION 's3://<S3_LOCATION>/transactionaldata/native-delta/<DATABASE_NAME>/';
   > CREATE TABLE <TABLE_NAME> (x INT, y STRING, z STRING) USING delta;
   > INSERT INTO <TABLE_NAME> VALUES (1, 'a1', 'b1');
   ```

1. To see the details of your table, go to [https://console.aws.amazon.com/glue/](https://console.aws.amazon.com/glue/).

1. In the left navigation, expand **Data Catalog**, choose **Tables**, then choose the table you created. Under **Schema**, you should see that the Delta table you created with Spark stores all columns in a data type of `array<string>` in AWS Glue.

1. To define column and cell-level filters in Lake Formation, remove the `col` column from your schema, and then add the columns that are in your table schema. In this example, add the columns `x`, `y`, and `z`.

------

With this feature, you can run snapshot queries on copy-on-write tables to query the latest snapshot of the table at a given commit or compaction instant. Currently, a Lake Formation-enabled Amazon EMR cluster must retrieve Hudi's commit time column to perform incremental queries and time travel queries. It doesn't support Spark's `timestamp as of` syntax or the `spark.read()` function. The correct syntax is `select * from table where _hoodie_commit_time <= point_in_time`. For more information, see [Point in time Time-Travel queries on Hudi table](https://cwiki.apache.org/confluence/display/HUDI/RFC+-+07+%3A+Point+in+time+Time-Travel+queries+on+Hudi+table).
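For instance, a point-in-time query on a copy-on-write table can filter on the commit time column directly. The table name and commit timestamp below are placeholders.

```sql
-- Query rows as of a specific Hudi commit time (placeholder timestamp)
SELECT * FROM my_hudi_cow_table
WHERE _hoodie_commit_time <= '20240801000000';
```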

**Note**  
The performance of reads on Lake Formation clusters might be slower because of optimizations that are not supported. These features include file listing based on Hudi metadata, and data skipping. We recommend that you test your application performance to ensure that it meets your requirements.

# Working with Glue Data Catalog views in Amazon EMR


**Note**  
Creating and managing AWS Glue Data Catalog views for use with EMR on EC2 is available with [Amazon EMR release 7.10.0](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-7100-release.html) and later.

You can create and manage views in the AWS Glue Data Catalog for use with EMR on EC2. These are commonly known as AWS Glue Data Catalog views. These views are useful because they support multiple SQL query engines, so you can access the same view across different AWS services, such as EMR on EC2, Amazon Athena, and Amazon Redshift.

By creating a view in the Data Catalog, you can use resource grants and tag-based access controls in AWS Lake Formation to grant access to it. Using this method of access control, you don't have to configure additional access to the tables you referenced when creating the view. This method of granting permissions is called definer semantics, and these views are called definer views. For more information about access control in Lake Formation, see [Granting and revoking permissions on Data Catalog resources](https://docs.aws.amazon.com/lake-formation/latest/dg/granting-catalog-permissions.html) in the AWS Lake Formation Developer Guide.

Data Catalog views are useful for the following use cases:
+ **Granular access control** – You can create a view that restricts data access based on the permissions the user needs. For example, you can use views in the Data Catalog to prevent employees who don’t work in the HR department from seeing personally identifiable information (PII).
+ **Complete view definition** – By applying filters on your view in the Data Catalog, you make sure that data records available in a view in the Data Catalog are always complete.
+ **Enhanced security** – The query definition used to create the view must be complete, which makes views in the Data Catalog less susceptible to SQL commands from malicious actors.
+ **Simple data sharing** – Share data with other AWS accounts without moving the data. For more information, see [Cross-account data sharing in Lake Formation](https://docs.aws.amazon.com/lake-formation/latest/dg/cross-account-permissions.html).

## Creating a Data Catalog view


There are different ways to create a Data Catalog view. These include using the AWS CLI or Spark SQL. A few examples follow.

------
#### [ Using SQL ]

The following shows the syntax for creating a Data Catalog view. Note the `MULTI DIALECT` view type. This distinguishes the Data Catalog view from other views. The `SECURITY` predicate is specified as `DEFINER`. This indicates a Data Catalog view with `DEFINER` semantics.

```
CREATE [ OR REPLACE ] PROTECTED MULTI DIALECT VIEW [IF NOT EXISTS] view_name
[(column_name [COMMENT column_comment], ...) ]
[ COMMENT view_comment ]
[TBLPROPERTIES (property_name = property_value, ... )]
SECURITY DEFINER
AS query;
```

The following is a sample `CREATE` statement, following the syntax:

```
CREATE PROTECTED MULTI DIALECT VIEW catalog_view
SECURITY DEFINER
AS
SELECT order_date, sum(totalprice) AS price
FROM source_table
GROUP BY order_date
```

You can also create a view in dry-run mode, using SQL, to test view creation without actually creating the resource. The dry run validates the input and, if validation succeeds, returns the JSON of the AWS Glue table object that would represent the view. The actual view isn't created.

```
CREATE [ OR REPLACE ] PROTECTED MULTI DIALECT VIEW view_name
SECURITY DEFINER 
[ SHOW VIEW JSON ]
AS view-sql
```

------
#### [ Using the AWS CLI ]

**Note**  
When you use the CLI command, the SQL used to create the view isn't parsed. This can result in a case where the view is created, but queries aren't successful. Be sure to test your SQL syntax prior to creating the view.

You use the following CLI command to create a view:

```
aws glue create-table --cli-input-json '{
  "DatabaseName": "database",
  "TableInput": {
    "Name": "view",
    "StorageDescriptor": {
      "Columns": [
        {
          "Name": "col1",
          "Type": "data-type"
        },
        ...
        {
          "Name": "col_n",
          "Type": "data-type"
        }
      ],
      "SerdeInfo": {}
    },
    "ViewDefinition": {
      "SubObjects": [
        "arn:aws:glue:aws-region:aws-account-id:table/database/referenced-table1",
        ...
        "arn:aws:glue:aws-region:aws-account-id:table/database/referenced-tableN",
       ],
      "IsProtected": true,
      "Representations": [
        {
          "Dialect": "SPARK",
          "DialectVersion": "1.0",
          "ViewOriginalText": "Spark-SQL",
          "ViewExpandedText": "Spark-SQL"
        }
      ]
    }
  }
}'
```

------

## Supported view operations


The following command fragments show you various ways to work with Data Catalog views:
+ **CREATE VIEW**

  Creates a Data Catalog view. The following sample shows creating a view from an existing table:

  ```
  CREATE PROTECTED MULTI DIALECT VIEW catalog_view 
  SECURITY DEFINER AS SELECT * FROM my_catalog.my_database.source_table
  ```
+ **ALTER VIEW**

  Available syntax:
  + `ALTER VIEW view_name [FORCE] ADD DIALECT AS query`
  + `ALTER VIEW view_name [FORCE] UPDATE DIALECT AS query`
  + `ALTER VIEW view_name DROP DIALECT`

  You can use the `FORCE ADD DIALECT` option to force update the schema and sub objects as per the new engine dialect. Note that doing this can result in query errors if you don't also use `FORCE` to update other engine dialects. The following shows a sample:

  ```
  ALTER VIEW catalog_view FORCE ADD DIALECT
  AS
  SELECT order_date, sum(totalprice) AS price
  FROM source_table
  GROUP BY order_date;
  ```

  The following shows how to alter a view in order to update the dialect:

  ```
  ALTER VIEW catalog_view UPDATE DIALECT AS 
  SELECT count(*) FROM my_catalog.my_database.source_table;
  ```
+ **DESCRIBE VIEW**

  Available syntax for describing a view:
  + `SHOW COLUMNS {FROM|IN} view_name [{FROM|IN} database_name]` – If the user has the required AWS Glue and Lake Formation permissions to describe the view, they can list the columns. The following shows a couple sample commands for showing columns:

    ```
    SHOW COLUMNS FROM my_database.source_table;    
    SHOW COLUMNS IN my_database.source_table;
    ```
  + `DESCRIBE view_name` – If the user has the required AWS Glue and Lake Formation permissions to describe the view, they can list the columns in the view along with its metadata.
+ **DROP VIEW**

  Available syntax:
  + `DROP VIEW [ IF EXISTS ] view_name`

    The following sample shows a `DROP` statement that tests if a view exists prior to dropping it:

    ```
    DROP VIEW IF EXISTS catalog_view;
    ```
+ **SHOW CREATE VIEW**
  + `SHOW CREATE VIEW view_name` – Shows the SQL statement that creates the specified view. The following sample shows the output for a Data Catalog view:

    ```
    SHOW CREATE TABLE my_database.catalog_view;
    CREATE PROTECTED MULTI DIALECT VIEW my_catalog.my_database.catalog_view (
      net_profit,
      customer_id,
      item_id,
      sold_date)
    TBLPROPERTIES (
      'transient_lastDdlTime' = '1736267222')
    SECURITY DEFINER AS SELECT * FROM
    my_database.store_sales_partitioned_lf WHERE customer_id IN (SELECT customer_id from source_table limit 10)
    ```
+ **SHOW VIEWS**

  Lists all views in the catalog, such as regular views, multi-dialect views (MDV), and MDVs without Spark dialect. Available syntax is the following:
  + `SHOW VIEWS [{ FROM | IN } database_name] [LIKE regex_pattern]`:

    The following shows a sample command to show views:

    ```
    SHOW VIEWS IN marketing_analytics LIKE 'catalog_view*';
    ```

For more information about creating and configuring data-catalog views, see [Building AWS Glue Data Catalog views](https://docs.aws.amazon.com/lake-formation/latest/dg/working-with-views.html) in the AWS Lake Formation Developer Guide.

## Querying a Data Catalog view


 After creating a Data Catalog view, you can query it using an Amazon EMR Spark job that has AWS Lake Formation fine-grained access control enabled. The job runtime role must have the Lake Formation `SELECT` permission on the Data Catalog view. You don't need to grant access to the underlying tables referenced in the view. 
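As a hedged sketch, granting the runtime role `SELECT` on the view might look like the following AWS CLI command. Data Catalog views are granted as table resources; the database name, view name, and role ARN here are placeholders.

```shell
# Example only: grant Lake Formation SELECT on a Data Catalog view.
# The database, view, and role ARN are placeholder values.
aws lakeformation grant-permissions \
  --principal DataLakePrincipalIdentifier=arn:aws:iam::111122223333:role/my-emr-runtime-role \
  --permissions "SELECT" \
  --resource '{"Table": {"DatabaseName": "my_database", "Name": "catalog_view"}}'
```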

Once you have everything set up, you can query your view. For example, after creating an Amazon EMR application in EMR Studio, you can run the following query to access a view.

```
SELECT * from my_database.catalog_view LIMIT 10;
```

A helpful function is `invoker_principal()`. It returns the unique identifier of the EMR job runtime role, which you can use to control the view output based on the invoking principal. For example, you can add a condition in your view that refines query results based on the calling role. The job runtime role must have permission for the `lakeformation:GetDataLakePrincipal` IAM action to use this function.

```
select invoker_principal();
```

You can add this function to a `WHERE` clause, for instance, to refine query results.
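For example, a view definition might use the function to return only rows tagged for the calling role. The view name, table, and `allowed_role` column below are hypothetical.

```sql
-- Hypothetical view that filters rows by the invoking runtime role
CREATE PROTECTED MULTI DIALECT VIEW role_scoped_view
SECURITY DEFINER
AS
SELECT order_date, totalprice
FROM source_table
WHERE allowed_role = invoker_principal();
```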

## Considerations and limitations


When you create Data Catalog views, the following apply:
+ You can only create Data Catalog views with Amazon EMR 7.10 and above.
+ The Data Catalog view definer must have `SELECT` access to the underlying base tables accessed by the view. Creating the Data Catalog view fails if a specific base table has any Lake Formation filters imposed on the definer role.
+ Base tables must not have the `IAMAllowedPrincipals` data lake permission in Lake Formation. If present, the error *Multi Dialect views may only reference tables without IAMAllowedPrincipals permissions* occurs.
+ The table's Amazon S3 location must be registered as a Lake Formation data lake location. If the table isn't registered, the error *Multi Dialect views may only reference Lake Formation managed tables* occurs. For information about how to register Amazon S3 locations in Lake Formation, see [Registering an Amazon S3 location](https://docs.aws.amazon.com/lake-formation/latest/dg/register-location.html) in the AWS Lake Formation Developer Guide.
+ You can only create `PROTECTED` Data Catalog views. `UNPROTECTED` views aren't supported.
+ You can't reference tables in another AWS account in a Data Catalog view definition. You also can't reference a table in the same account that's in a separate region.
+ To share data across accounts or Regions, the entire view must be shared cross-account and cross-Region using Lake Formation resource links.
+ User-defined functions (UDFs) aren't supported.
+ You can use views based on Iceberg tables. The open-table formats Apache Hudi and Delta Lake are also supported.
+ You can't reference other views in Data Catalog views.
+ An AWS Glue Data Catalog view schema is always stored in lowercase. For example, if you use a DDL statement to create a Glue Data Catalog view with a column named `Castle`, the column is created in the Glue Data Catalog as lowercase `castle`. If you then specify the column name in a DML query as `Castle` or `CASTLE`, EMR Spark lowercases the name for you in order to run the query, but the column heading displays with the casing that you specified in the query.

  If you want a query to fail when a column name specified in the DML query doesn't match the column name in the Glue Data Catalog, you can set `spark.sql.caseSensitive=true`.
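  For example, the following is a minimal sketch in Spark SQL, using the `Castle`/`castle` column from the example above:

  ```
  SET spark.sql.caseSensitive=true;
  -- Fails: the Data Catalog stores the column as lowercase `castle`
  SELECT Castle FROM my_database.catalog_view;
  -- Succeeds
  SELECT castle FROM my_database.catalog_view;
  ```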

# Considerations for Amazon EMR with Lake Formation

Amazon EMR with Lake Formation is available in all [Amazon EMR Regions](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-region.html).

## Considerations for Amazon EMR with Lake Formation for version 7.9 and earlier

Consider the following when using AWS Lake Formation on EMR 7.9 and earlier versions.
+ [Fine-grained access control](emr-lf-enable.md#emr-lf-fgac-perms) at row, column, and cell level is available on clusters with Amazon EMR releases 6.15 and higher.
+ Users with access to a table can access all the properties of that table. If you have Lake Formation based access control on a table, review the table to make sure that the properties don't contain any sensitive data or information.
+ Amazon EMR clusters with Lake Formation don't support Spark's fallback to HDFS when Spark collects table statistics, which ordinarily helps optimize query performance.
+ Operations that support access controls based on Lake Formation with non-governed Apache Spark tables include `INSERT INTO` and `INSERT OVERWRITE`.
+ Operations that support access controls based on Lake Formation with Apache Spark and Apache Hive include `SELECT`, `DESCRIBE`, `SHOW DATABASE`, `SHOW TABLE`, `SHOW COLUMN`, and `SHOW PARTITION`.
+ Amazon EMR doesn't support access control for the following Lake Formation based operations: 
  + Writes to governed tables
  + `CREATE TABLE` statements. Amazon EMR 6.10.0 and higher supports `ALTER TABLE`.
  + DML statements other than `INSERT` commands
+ There are performance differences between the same query with and without Lake Formation based access control.
+ You can only use Amazon EMR with Lake Formation for Spark jobs.
+ Trusted identity propagation isn't supported with the multi-catalog hierarchy in the AWS Glue Data Catalog. For more information, see [Working with a multi-catalog hierarchy in AWS Glue Data Catalog](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-multi-catalog.html).

## Considerations for Amazon EMR with Lake Formation for version 7.10 and later

Consider the following when using Amazon EMR with AWS Lake Formation on EMR 7.10 and later versions.
+ Amazon EMR supports fine-grained access control through Lake Formation only for Apache Hive, Apache Iceberg, Delta Lake, and Apache Hudi tables. Apache Hive formats include Parquet, ORC, and xSV (delimiter-separated values such as CSV).
+ For Lake Formation–enabled applications, Spark logs are written to Amazon S3 in two groups: system space logs and user space logs. System space logs may contain sensitive information such as the full table schema. To safeguard this data, Amazon EMR stores system space logs in a separate location from user space logs. It is strongly recommended that account administrators do not grant users access to system space logs.
+ If you register a table location with Lake Formation, data access will be controlled exclusively by the permissions of the role used for registration, rather than by the Amazon EMR job runtime role. If the registration role is misconfigured, jobs that attempt to access the table will fail.
+ You can't turn off `DynamicResourceAllocation` for Lake Formation jobs.
+ You can only use Lake Formation with Spark jobs.
+ Amazon EMR with Lake Formation only supports a single Spark session throughout a job.
+ Amazon EMR with Lake Formation only supports cross-account table queries shared through resource links.
+ The following aren't supported:
  + Resilient distributed datasets (RDD)
  + Spark streaming
  + Write with Lake Formation granted permissions
  + Access control for nested columns
+ Amazon EMR blocks functionality that might undermine the complete isolation of the system driver, including the following:
  + UDTs, HiveUDFs, and any user-defined function that involves custom classes
  + Custom data sources
  + Supplying additional JARs for Spark extensions, connectors, or metastores
  + `ANALYZE TABLE` command
+ To enforce access controls, `EXPLAIN PLAN` and DDL operations such as `DESCRIBE TABLE` don't expose restricted information.
+ Amazon EMR restricts access to system driver Spark logs on Lake Formation-enabled applications. Since the system driver runs with elevated permissions, events and logs that the system driver generates can include sensitive information. To prevent unauthorized users or code from accessing this sensitive data, Amazon EMR disables access to system driver logs.

  System profile logs are always persisted in managed storage. This is a mandatory setting that can't be disabled. These logs are stored securely and encrypted using either a customer managed KMS key or an AWS managed KMS key.

  If your Amazon EMR application is in a private subnet with VPC endpoints for Amazon S3 and you attach an endpoint policy to control access, then before your jobs can send log data to AWS managed Amazon S3, you must include the permissions detailed in [Managed storage](logging.html#jobs-log-storage-managed-storage) in the VPC endpoint policy for your S3 gateway endpoint. For troubleshooting requests, contact AWS Support.
+ If you registered a table location with Lake Formation, data access goes through the Lake Formation stored credentials, regardless of the IAM permissions of the Amazon EMR job runtime role. If you misconfigure the role registered with the table location, submitted jobs fail even when the runtime role has S3 IAM permissions to the table location.
+ Writing to a Lake Formation table uses IAM permissions rather than Lake Formation granted permissions. If your job runtime role has the necessary S3 permissions, you can use it to run write operations.

The following are considerations and limitations when using Apache Iceberg:
+ You can only use Apache Iceberg with session catalog and not arbitrarily named catalogs.
+ Iceberg tables that are registered in Lake Formation only support the metadata tables `history`, `metadata_log_entries`, `snapshots`, `files`, `manifests`, and `refs`. Amazon EMR hides the columns that might have sensitive data, such as `partitions`, `path`, and `summaries`. This limitation doesn't apply to Iceberg tables that aren't registered in Lake Formation.
+ Tables that you don't register in Lake Formation support all Iceberg stored procedures. The `register_table` and `migrate` procedures aren't supported for any tables.
+ We recommend that you use Iceberg DataFrameWriterV2 instead of V1.
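The metadata-table and stored-procedure behavior described above can be sketched as follows; the database, table, and snapshot ID are hypothetical:

```
-- Query a supported metadata table (available even for Lake Formation registered tables)
SELECT * FROM my_database.my_iceberg_table.history;

-- Call a stored procedure (supported only for tables not registered in Lake Formation)
CALL spark_catalog.system.rollback_to_snapshot('my_database.my_iceberg_table', 8744736658442914487);
```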

## Considerations for Amazon EMR with Lake Formation for version 7.12 and later


### General


Review the following limitations when using Lake Formation with Amazon EMR.
+ You can't turn off `DynamicResourceAllocation` for Lake Formation jobs.
+ You can only use Lake Formation with Spark jobs.
+ Amazon EMR with Lake Formation only supports a single Spark session throughout a job.
+ Amazon EMR with Lake Formation only supports cross-account table queries shared through resource links.
+ The following aren't supported:
  + Resilient distributed datasets (RDD)
  + Spark streaming
  + Access control for nested columns
+ Amazon EMR blocks functionality that might undermine the complete isolation of the system driver, including the following:
  + UDTs, HiveUDFs, and any user-defined function that involves custom classes
  + Custom data sources
  + Supplying additional JARs for Spark extensions, connectors, or metastores
  + `ANALYZE TABLE` command
+ If your Amazon EMR application is in a private subnet with VPC endpoints for Amazon S3 and you attach an endpoint policy to control access, then before your jobs can send log data to AWS managed Amazon S3, you must include the permissions detailed in [Managed storage](logging.html#jobs-log-storage-managed-storage) in the VPC endpoint policy for your S3 gateway endpoint. For troubleshooting requests, contact AWS Support.
+ Starting with Amazon EMR 7.9.0, Spark FGAC supports S3AFileSystem when used with the `s3a://` scheme.
+ Amazon EMR 7.11 supports creating managed tables using CTAS.
+ Amazon EMR 7.12 supports creating managed and external tables using CTAS.

### Permissions

+ To enforce access controls, `EXPLAIN PLAN` and DDL operations such as `DESCRIBE TABLE` don't expose restricted information.
+ When you register a table location with Lake Formation, data access uses Lake Formation stored credentials instead of the job runtime role's IAM permissions. Jobs will fail if the role registered for the table location is misconfigured, even when the runtime role has S3 IAM permissions for that location.
+ Starting with Amazon EMR 7.12, you can write to existing Hive and Iceberg tables using DataFrameWriter (V2) with Lake Formation credentials in append mode. For overwrite operations or when creating new tables, EMR uses the runtime role credentials to modify table data.
+ The following limitations apply when using views or cached tables as source data (these limitations don't apply to AWS Glue Data Catalog views):
  + For `MERGE`, `DELETE`, and `UPDATE` operations:
    + Supported: Using views and cached tables as source tables.
    + Not supported: Using views and cached tables in assignment and condition clauses.
  + For `CREATE OR REPLACE` and `REPLACE TABLE AS SELECT` operations:
    + Not supported: Using views and cached tables as source tables.
+ Delta Lake tables with UDFs in source data support `MERGE`, `DELETE`, and `UPDATE` operations only when deletion vectors are enabled.

### Logs and debugging

+ Amazon EMR restricts access to system driver Spark logs on Lake Formation-enabled applications. Since the system driver runs with elevated permissions, events and logs that the system driver generates can include sensitive information. To prevent unauthorized users or code from accessing this sensitive data, Amazon EMR disables access to system driver logs.

  System profile logs are always persisted in managed storage. This is a mandatory setting that can't be disabled. These logs are stored securely and encrypted using either a customer managed KMS key or an AWS managed KMS key.

### Iceberg


Review the following considerations when using Apache Iceberg:
+ You can only use Apache Iceberg with session catalog and not arbitrarily named catalogs.
+ Iceberg tables that are registered in Lake Formation only support the metadata tables `history`, `metadata_log_entries`, `snapshots`, `files`, `manifests`, and `refs`. Amazon EMR hides the columns that might have sensitive data, such as `partitions`, `path`, and `summaries`. This limitation doesn't apply to Iceberg tables that aren't registered in Lake Formation.
+ Tables that aren't registered in Lake Formation support all Iceberg stored procedures. The `register_table` and `migrate` procedures aren't supported for any tables.
+ We suggest that you use Iceberg DataFrameWriterV2 instead of V1.

# Spark native fine-grained access control allowlisted PySpark API


To maintain security and data access controls, Spark fine-grained access control (FGAC) restricts certain PySpark functions. These restrictions are enforced either through explicit blocking that prevents function execution, or through architecture incompatibilities that make functions non-functional. Restricted functions may throw errors, return access denied messages, or do nothing when called.

The following PySpark features aren't supported in Spark FGAC:
+ RDD operations (blocked with SparkRDDUnsupportedException)
+ Spark Connect (unsupported)
+ Spark Streaming (unsupported)

While we've tested the listed functions in a Native Spark FGAC environment and confirmed they work as expected, our testing typically covers only basic usage of each API. Functions with multiple input types or complex logic paths may have untested scenarios.

For any functions not listed here and not clearly part of the unsupported categories above, we recommend:
+ Testing them first in a non-production environment or a small-scale deployment
+ Verifying their behavior before using them in production

**Note**  
If you see a class method listed but not its base class, the method should still work—it just means we haven't explicitly verified the base class constructor.

The PySpark API is organized into modules. General support for methods within each module is detailed in the table below.


| Module name | Status | Notes | 
| --- | --- | --- | 
|  pyspark.core  |  Supported  |  This module contains the main RDD classes, and these functions are mostly unsupported.  | 
|  pyspark.sql  |  Supported  |  | 
|  pyspark.testing  |  Supported  |  | 
|  pyspark.resource  |  Supported  |  | 
|  pyspark.streaming  |  Blocked  |  Streaming usage is blocked in Spark FGAC.  | 
|  pyspark.mllib  |  Experimental  |  This module contains RDD-based ML operations, and these functions are mostly unsupported. This module isn't thoroughly tested.  | 
|  pyspark.ml  |  Experimental  |  This module contains DataFrame-based ML operations, and these functions are mostly supported. This module isn't thoroughly tested.  | 
|  pyspark.pandas  |  Supported  |    | 
|  pyspark.pandas.slow  |  Supported  |    | 
| pyspark.connect |  Blocked  |  Spark Connect usage is blocked in Spark FGAC.  | 
| pyspark.pandas.connect |  Blocked  |  Spark Connect usage is blocked in Spark FGAC.  | 
| pyspark.pandas.slow.connect |  Blocked  |  Spark Connect usage is blocked in Spark FGAC.  | 
| pyspark.errors |  Experimental  |  This module isn't thoroughly tested. Custom error classes can't be utilized.  | 

**API Allowlist**

For a downloadable list that's easier to search, a file with the modules and classes is available at [Python functions allowed in Native FGAC](samples/Python functions allowed in Native FGAC.zip).

# Lake Formation full table access for Amazon EMR on EC2


With Amazon EMR releases 7.8.0 and higher, you can use AWS Lake Formation with the Glue Data Catalog where the job runtime role has full table permissions, without the limitations of fine-grained access control. This capability allows you to read and write to tables that are protected by Lake Formation from your Amazon EMR on EC2 Spark batch and interactive jobs. See the following sections to learn more about Lake Formation and how to use it with Amazon EMR on EC2.

## Using Lake Formation with full table access


You can access AWS Lake Formation protected Glue Data Catalog tables from Amazon EMR on EC2 Spark jobs or interactive sessions where the job's runtime role has full table access. You don't need to enable AWS Lake Formation on the Amazon EMR on EC2 cluster. When a Spark job is configured for Full Table Access (FTA), AWS Lake Formation credentials are used to read and write S3 data for AWS Lake Formation registered tables, while the job's runtime role credentials are used to read and write tables that aren't registered with AWS Lake Formation.

**Important**  
Do not enable AWS Lake Formation for fine-grained access control. A job cannot simultaneously run Full Table Access (FTA) and Fine-Grained Access Control (FGAC) on the same EMR cluster or application.

### Step 1: Enable Full Table Access in Lake Formation


To use Full Table Access (FTA) mode, you must allow third-party query engines to access data without IAM session tag validation in AWS Lake Formation. To enable this, follow the steps in [Application integration for full table access](https://docs.aws.amazon.com/lake-formation/latest/dg/full-table-credential-vending.html).

**Note**  
When accessing cross-account tables, full-table access must be enabled in both the producer and consumer accounts. Similarly, when accessing cross-Region tables, this setting must be enabled in both the producer and consumer Regions.

### Step 2: Set up IAM permissions for the job runtime role


For read or write access to underlying data, in addition to Lake Formation permissions, a job runtime role needs the `lakeformation:GetDataAccess` IAM permission. With this permission, Lake Formation grants the request for temporary credentials to access the data.

A job runtime role typically needs IAM permissions to access a script in Amazon S3, upload logs to S3, call AWS Glue API operations, and access Lake Formation.
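The following is a minimal sketch of such a policy. The bucket name and paths are hypothetical placeholders; scope the resources to your own buckets and the Glue actions to the API operations your jobs call:

```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ScriptAndLogAccess",
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject"],
      "Resource": [
        "arn:aws:s3:::amzn-s3-demo-bucket/scripts/*",
        "arn:aws:s3:::amzn-s3-demo-bucket/logs/*"
      ]
    },
    {
      "Sid": "GlueCatalogAccess",
      "Effect": "Allow",
      "Action": ["glue:GetDatabase", "glue:GetTable", "glue:GetPartitions"],
      "Resource": "*"
    },
    {
      "Sid": "LakeFormationDataAccess",
      "Effect": "Allow",
      "Action": "lakeformation:GetDataAccess",
      "Resource": "*"
    }
  ]
}
```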

#### Step 2.1: Configure Lake Formation permissions

+ Spark jobs that read data from S3 require the Lake Formation `SELECT` permission.
+ Spark jobs that write or delete data in S3 require the Lake Formation `ALL` (`SUPER`) permission.
+ Spark jobs that interact with the Glue Data Catalog require the `DESCRIBE`, `ALTER`, and `DROP` permissions as appropriate.

For more information, refer to [Granting permissions on Data Catalog resources](https://docs.aws.amazon.com/lake-formation/latest/dg/granting-catalog-permissions.html).
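As an illustration, granting the `SELECT` permission on a table to a job runtime role might look like the following AWS CLI call; the account ID, role name, database, and table are hypothetical:

```
aws lakeformation grant-permissions \
  --principal DataLakePrincipalIdentifier=arn:aws:iam::111122223333:role/emr-job-runtime-role \
  --permissions "SELECT" \
  --resource '{"Table": {"DatabaseName": "my_database", "Name": "store_sales_partitioned_lf"}}'
```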

### Step 3: Initialize a Spark session for Full Table Access using Lake Formation


#### Prerequisites


The AWS Glue Data Catalog must be configured as the metastore to access Lake Formation tables.

Set the following configurations to use the Glue Data Catalog as the metastore:

```
--conf spark.sql.catalogImplementation=hive
--conf spark.hive.metastore.client.factory.class=com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory
```

For more information on enabling Data Catalog for Amazon EMR on EC2, refer to [Metastore configuration for Amazon EMR on EC2](metastore-config.html).

To access tables registered with AWS Lake Formation, set the following configurations during Spark initialization so that Spark uses AWS Lake Formation credentials.

------
#### [ Hive ]

```
--conf spark.hadoop.fs.s3.credentialsResolverClass=com.amazonaws.glue.accesscontrol.AWSLakeFormationCredentialResolver
--conf spark.hadoop.fs.s3.useDirectoryHeaderAsFolderObject=true 
--conf spark.hadoop.fs.s3.folderObject.autoAction.disabled=true
--conf spark.sql.catalog.skipLocationValidationOnCreateTable.enabled=true
--conf spark.sql.catalog.createDirectoryAfterTable.enabled=true
--conf spark.sql.catalog.dropDirectoryBeforeTable.enabled=true
```

------
#### [ Iceberg ]

```
--conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog
--conf spark.sql.catalog.spark_catalog.warehouse=S3_DATA_LOCATION
--conf spark.sql.catalog.spark_catalog.client.region=REGION
--conf spark.sql.catalog.spark_catalog.type=glue
--conf spark.sql.catalog.spark_catalog.glue.account-id=ACCOUNT_ID
--conf spark.sql.catalog.spark_catalog.glue.lakeformation-enabled=true
--conf spark.sql.catalog.dropDirectoryBeforeTable.enabled=true
```

------
#### [ Delta Lake ]

```
--conf spark.hadoop.fs.s3.credentialsResolverClass=com.amazonaws.glue.accesscontrol.AWSLakeFormationCredentialResolver
--conf spark.hadoop.fs.s3.useDirectoryHeaderAsFolderObject=true 
--conf spark.hadoop.fs.s3.folderObject.autoAction.disabled=true
--conf spark.sql.catalog.skipLocationValidationOnCreateTable.enabled=true
--conf spark.sql.catalog.createDirectoryAfterTable.enabled=true
--conf spark.sql.catalog.dropDirectoryBeforeTable.enabled=true
```

------
#### [ Hudi ]

```
--conf spark.hadoop.fs.s3.credentialsResolverClass=com.amazonaws.glue.accesscontrol.AWSLakeFormationCredentialResolver
--conf spark.hadoop.fs.s3.useDirectoryHeaderAsFolderObject=true 
--conf spark.hadoop.fs.s3.folderObject.autoAction.disabled=true
--conf spark.sql.catalog.skipLocationValidationOnCreateTable.enabled=true
--conf spark.sql.catalog.createDirectoryAfterTable.enabled=true
--conf spark.sql.catalog.dropDirectoryBeforeTable.enabled=true
--conf spark.jars=/usr/lib/hudi/hudi-spark-bundle.jar
--conf spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension
--conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer
```

------
+ `spark.hadoop.fs.s3.credentialsResolverClass=com.amazonaws.glue.accesscontrol.AWSLakeFormationCredentialResolver`: Configures EMR Filesystem (EMRFS) or EMR S3A to use AWS Lake Formation S3 credentials for Lake Formation registered tables. If the table isn't registered, the job's runtime role credentials are used. 
+ `spark.hadoop.fs.s3.useDirectoryHeaderAsFolderObject=true` and `spark.hadoop.fs.s3.folderObject.autoAction.disabled=true`: Configure EMRFS to use the content type header `application/x-directory` instead of the `$folder$` suffix when creating S3 folders. This is required when reading Lake Formation tables, because Lake Formation credentials don't allow reading table folders that have the `$folder$` suffix.
+ `spark.sql.catalog.skipLocationValidationOnCreateTable.enabled=true`: Configures Spark to skip validating that the table location is empty before creation. This is necessary for Lake Formation registered tables, because the Lake Formation credentials needed to verify the empty location are available only after the Glue Data Catalog table is created. Without this configuration, the job's runtime role credentials validate the empty table location.
+ `spark.sql.catalog.createDirectoryAfterTable.enabled=true`: Configures Spark to create the Amazon S3 folder after the table is created in the Hive metastore. This is required for Lake Formation registered tables, because the Lake Formation credentials needed to create the S3 folder are available only after the Glue Data Catalog table is created.
+ `spark.sql.catalog.dropDirectoryBeforeTable.enabled=true`: Configures Spark to drop the S3 folder before the table is deleted from the Hive metastore. This is necessary for Lake Formation registered tables, because the Lake Formation credentials needed to drop the S3 folder are no longer available after the table is deleted from the Glue Data Catalog.
+ `spark.sql.catalog.<catalog>.glue.lakeformation-enabled=true`: Configures the Iceberg catalog to use AWS Lake Formation S3 credentials for Lake Formation registered tables. If the table isn't registered, the default environment credentials are used.
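Putting the prerequisite metastore settings and the Hive tab's settings together, a batch job submission could look like the following sketch; the script path is a hypothetical placeholder:

```
spark-submit \
  --conf spark.sql.catalogImplementation=hive \
  --conf spark.hive.metastore.client.factory.class=com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory \
  --conf spark.hadoop.fs.s3.credentialsResolverClass=com.amazonaws.glue.accesscontrol.AWSLakeFormationCredentialResolver \
  --conf spark.hadoop.fs.s3.useDirectoryHeaderAsFolderObject=true \
  --conf spark.hadoop.fs.s3.folderObject.autoAction.disabled=true \
  --conf spark.sql.catalog.skipLocationValidationOnCreateTable.enabled=true \
  --conf spark.sql.catalog.createDirectoryAfterTable.enabled=true \
  --conf spark.sql.catalog.dropDirectoryBeforeTable.enabled=true \
  s3://amzn-s3-demo-bucket/scripts/my_fta_job.py
```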

#### Configure full table access mode in SageMaker Unified Studio


To access Lake Formation registered tables from interactive Spark sessions in JupyterLab notebooks, use compatibility permission mode. Use the `%%configure` magic command to set up your Spark configuration. Choose the configuration based on your table type:

------
#### [ For Hive tables ]

```
%%configure -f
{
    "conf": {
        "spark.hadoop.fs.s3.credentialsResolverClass": "com.amazonaws.glue.accesscontrol.AWSLakeFormationCredentialResolver",
        "spark.hadoop.fs.s3.useDirectoryHeaderAsFolderObject": true,
        "spark.hadoop.fs.s3.folderObject.autoAction.disabled": true,
        "spark.sql.catalog.skipLocationValidationOnCreateTable.enabled": true,
        "spark.sql.catalog.createDirectoryAfterTable.enabled": true,
        "spark.sql.catalog.dropDirectoryBeforeTable.enabled": true
    }
}
```

------
#### [ For Iceberg tables ]

```
%%configure -f
{
    "conf": {
        "spark.sql.catalog.spark_catalog": "org.apache.iceberg.spark.SparkSessionCatalog",
        "spark.sql.catalog.spark_catalog.warehouse": "S3_DATA_LOCATION",
        "spark.sql.catalog.spark_catalog.client.region": "REGION",
        "spark.sql.catalog.spark_catalog.type": "glue",
        "spark.sql.catalog.spark_catalog.glue.account-id": "ACCOUNT_ID",
        "spark.sql.catalog.spark_catalog.glue.lakeformation-enabled": "true",
        "spark.sql.catalog.dropDirectoryBeforeTable.enabled": "true", 
    }
}
```

------
#### [ For Delta Lake tables ]

```
%%configure -f
{
    "conf": {
        "spark.hadoop.fs.s3.credentialsResolverClass": "com.amazonaws.glue.accesscontrol.AWSLakeFormationCredentialResolver",
        "spark.hadoop.fs.s3.useDirectoryHeaderAsFolderObject": true,
        "spark.hadoop.fs.s3.folderObject.autoAction.disabled": true,
        "spark.sql.catalog.skipLocationValidationOnCreateTable.enabled": true,
        "spark.sql.catalog.createDirectoryAfterTable.enabled": true,
        "spark.sql.catalog.dropDirectoryBeforeTable.enabled": true
    }
}
```

------
#### [ For Hudi tables ]

```
%%configure -f
{
    "conf": {
        "spark.hadoop.fs.s3.credentialsResolverClass": "com.amazonaws.glue.accesscontrol.AWSLakeFormationCredentialResolver",
        "spark.hadoop.fs.s3.useDirectoryHeaderAsFolderObject": true,
        "spark.hadoop.fs.s3.folderObject.autoAction.disabled": true,
        "spark.sql.catalog.skipLocationValidationOnCreateTable.enabled": true,
        "spark.sql.catalog.createDirectoryAfterTable.enabled": true,
        "spark.sql.catalog.dropDirectoryBeforeTable.enabled": true,
        "spark.jars": "/usr/lib/hudi/hudi-spark-bundle.jar",
        "spark.sql.extensions": "org.apache.spark.sql.hudi.HoodieSparkSessionExtension",
        "spark.sql.catalog.spark_catalog": "org.apache.spark.sql.hudi.catalog.HoodieCatalog",
        "spark.serializer": "org.apache.spark.serializer.KryoSerializer"
    }
}
```

------

Replace the placeholders:
+ `S3_DATA_LOCATION`: Your S3 bucket path
+ `REGION`: AWS region (e.g., us-east-1)
+ `ACCOUNT_ID`: Your AWS account ID

**Note**  
You must set these configurations before executing any Spark operations in your notebook.

#### Supported Operations


These operations use AWS Lake Formation credentials to access the table data.
+ `CREATE TABLE`
+ `ALTER TABLE`
+ `INSERT INTO`
+ `INSERT OVERWRITE`
+ `UPDATE`
+ `MERGE INTO`
+ `DELETE FROM`
+ `ANALYZE TABLE`
+ `REPAIR TABLE`
+ `DROP TABLE`
+ Spark datasource queries
+ Spark datasource writes

**Note**  
Operations not listed above will continue to use IAM permissions to access table data.

#### Considerations

+ If a Hive table is created using a job that doesn't have full table access enabled, and no records are inserted, subsequent reads or writes from a job with full table access will fail. This is because EMR Spark without full table access adds the `$folder$` suffix to the table folder name. To resolve this, do one of the following:
  + Insert at least one row into the table from a job that doesn't have FTA enabled.
  + Configure the job that doesn't have FTA enabled to not use the `$folder$` suffix in the S3 folder name. You can do this by setting the Spark configuration `spark.hadoop.fs.s3.useDirectoryHeaderAsFolderObject=true`.
  + Create an S3 folder at the table location `s3://path/to/table/table_name` using the Amazon S3 console or the AWS CLI.
+ Full Table Access is supported with the EMR Filesystem (EMRFS) starting in Amazon EMR release 7.8.0, and with the S3A filesystem starting in Amazon EMR release 7.10.0.
+ Full Table Access is supported for Hive, Iceberg, Delta, and Hudi tables.
+ **Hudi FTA Write Support considerations:**
  + Hudi FTA writes require using HoodieCredentialedHadoopStorage for credential vending during job execution. Set the following configuration when running Hudi jobs: `hoodie.storage.class=org.apache.spark.sql.hudi.storage.HoodieCredentialedHadoopStorage`
  + Full Table Access (FTA) write support for Hudi is available starting from Amazon EMR release 7.12.
  + Hudi FTA write support currently works only with the default Hudi configurations. Custom or non-default Hudi settings may not be fully supported and could result in unexpected behavior.
  + Clustering for Hudi Merge-On-Read (MOR) tables is not supported at this point under FTA write mode.
+ Jobs referencing tables with Lake Formation Fine-Grained Access Control (FGAC) rules or Glue Data Catalog views will fail. To query a table with FGAC rules or a Glue Data Catalog view, you must use FGAC mode. You can enable FGAC mode by following the steps in [Using Amazon EMR on EC2 with AWS Lake Formation for fine-grained access control](emr-serverless-lf-enable.html).
+ Full table access does not support Spark Streaming.
+ When writing Spark DataFrame to a Lake Formation table, only APPEND mode is supported for Hive and Iceberg tables: `df.write.mode("append").saveAsTable(table_name)`
+ Creating external tables requires IAM permissions.
+ Because Lake Formation temporarily caches credentials within a Spark job, a Spark batch job or interactive session that is currently running might not reflect permission changes.
+ You must use a user-defined role, not a service-linked role. For more information, see [Lake Formation requirements for roles](https://docs.aws.amazon.com/lake-formation/latest/dg/registration-role.html).

#### Hudi FTA Write Support - Supported Operations


The following table shows the supported write operations for Hudi Copy-On-Write (COW) and Merge-On-Read (MOR) tables under Full Table Access mode:


**Hudi FTA Supported Write Operations**  

| Table Type | Operation | SQL Write Command | Status | 
| --- | --- | --- | --- | 
| COW | INSERT | INSERT INTO TABLE | Supported | 
| COW | INSERT | INSERT INTO TABLE - PARTITION (Static, Dynamic) | Supported | 
| COW | INSERT | INSERT OVERWRITE | Supported | 
| COW | INSERT | INSERT OVERWRITE - PARTITION (Static, Dynamic) | Supported | 
| COW | UPDATE | UPDATE TABLE | Supported | 
| COW | UPDATE | UPDATE TABLE - Change Partition | Not Supported | 
| COW | DELETE | DELETE FROM TABLE | Supported | 
| COW | ALTER | ALTER TABLE - RENAME TO | Not Supported | 
| COW | ALTER | ALTER TABLE - SET TBLPROPERTIES | Supported | 
| COW | ALTER | ALTER TABLE - UNSET TBLPROPERTIES | Supported | 
| COW | ALTER | ALTER TABLE - ALTER COLUMN | Supported | 
| COW | ALTER | ALTER TABLE - ADD COLUMNS | Supported | 
| COW | ALTER | ALTER TABLE - ADD PARTITION | Supported | 
| COW | ALTER | ALTER TABLE - DROP PARTITION | Supported | 
| COW | ALTER | ALTER TABLE - RECOVER PARTITIONS | Supported | 
| COW | ALTER | REPAIR TABLE SYNC PARTITIONS | Supported | 
| COW | DROP | DROP TABLE | Supported | 
| COW | DROP | DROP TABLE - PURGE | Supported | 
| COW | CREATE | CREATE TABLE - Managed | Supported | 
| COW | CREATE | CREATE TABLE - PARTITION BY | Supported | 
| COW | CREATE | CREATE TABLE IF NOT EXISTS | Supported | 
| COW | CREATE | CREATE TABLE LIKE | Supported | 
| COW | CREATE | CREATE TABLE AS SELECT | Supported | 
| COW | CREATE | CREATE TABLE with LOCATION - External Table | Not Supported | 
| COW | DATAFRAME(INSERT) | saveAsTable.Overwrite | Supported | 
| COW | DATAFRAME(INSERT) | saveAsTable.Append | Not Supported | 
| COW | DATAFRAME(INSERT) | saveAsTable.Ignore | Supported | 
| COW | DATAFRAME(INSERT) | saveAsTable.ErrorIfExists | Supported | 
| COW | DATAFRAME(INSERT) | saveAsTable - External table (Path) | Not Supported | 
| COW | DATAFRAME(INSERT) | save(path) - DF v1 | Not Supported | 
| MOR | INSERT | INSERT INTO TABLE | Supported | 
| MOR | INSERT | INSERT INTO TABLE - PARTITION (Static, Dynamic) | Supported | 
| MOR | INSERT | INSERT OVERWRITE | Supported | 
| MOR | INSERT | INSERT OVERWRITE - PARTITION (Static, Dynamic) | Supported | 
| MOR | UPDATE | UPDATE TABLE | Supported | 
| MOR | UPDATE | UPDATE TABLE - Change Partition | Not Supported | 
| MOR | DELETE | DELETE FROM TABLE | Supported | 
| MOR | ALTER | ALTER TABLE - RENAME TO | Not Supported | 
| MOR | ALTER | ALTER TABLE - SET TBLPROPERTIES | Supported | 
| MOR | ALTER | ALTER TABLE - UNSET TBLPROPERTIES | Supported | 
| MOR | ALTER | ALTER TABLE - ALTER COLUMN | Supported | 
| MOR | ALTER | ALTER TABLE - ADD COLUMNS | Supported | 
| MOR | ALTER | ALTER TABLE - ADD PARTITION | Supported | 
| MOR | ALTER | ALTER TABLE - DROP PARTITION | Supported | 
| MOR | ALTER | ALTER TABLE - RECOVER PARTITIONS | Supported | 
| MOR | ALTER | REPAIR TABLE SYNC PARTITIONS | Supported | 
| MOR | DROP | DROP TABLE | Supported | 
| MOR | DROP | DROP TABLE - PURGE | Supported | 
| MOR | CREATE | CREATE TABLE - Managed | Supported | 
| MOR | CREATE | CREATE TABLE - PARTITION BY | Supported | 
| MOR | CREATE | CREATE TABLE IF NOT EXISTS | Supported | 
| MOR | CREATE | CREATE TABLE LIKE | Supported | 
| MOR | CREATE | CREATE TABLE AS SELECT | Supported | 
| MOR | CREATE | CREATE TABLE with LOCATION - External Table | Not Supported | 
| MOR | DATAFRAME(UPSERT) | saveAsTable.Overwrite | Supported | 
| MOR | DATAFRAME(UPSERT) | saveAsTable.Append | Not Supported | 
| MOR | DATAFRAME(UPSERT) | saveAsTable.Ignore | Supported | 
| MOR | DATAFRAME(UPSERT) | saveAsTable.ErrorIfExists | Supported | 
| MOR | DATAFRAME(UPSERT) | saveAsTable - External table (Path) | Not Supported | 
| MOR | DATAFRAME(UPSERT) | save(path) - DF v1 | Not Supported | 
| MOR | DATAFRAME(DELETE) | saveAsTable.Append | Not Supported | 
| MOR | DATAFRAME(DELETE) | saveAsTable - External table (Path) | Not Supported | 
| MOR | DATAFRAME(DELETE) | save(path) - DF v1 | Not Supported | 
| MOR | DATAFRAME(BULK_INSERT) | saveAsTable.Overwrite | Supported | 
| MOR | DATAFRAME(BULK_INSERT) | saveAsTable.Append | Not Supported | 
| MOR | DATAFRAME(BULK_INSERT) | saveAsTable.Ignore | Supported | 
| MOR | DATAFRAME(BULK_INSERT) | saveAsTable.ErrorIfExists | Supported | 
| MOR | DATAFRAME(BULK_INSERT) | saveAsTable - External table (Path) | Not Supported | 
| MOR | DATAFRAME(BULK_INSERT) | save(path) - DF v1 | Not Supported | 
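As a programmatic summary of the DataFrame rows in the table above, the following hypothetical helper (the function name and signature are not part of any AWS or Spark API) encodes which Spark save modes succeed for Hudi COW and MOR tables under Full Table Access mode:

```python
# Per the table above: under Full Table Access, saveAsTable with the
# "overwrite", "ignore", and "error"/"errorifexists" save modes is
# supported, while "append" and path-based writes are not.
SUPPORTED_SAVE_MODES = {"overwrite", "ignore", "error", "errorifexists"}

def is_supported_hudi_fta_write(save_mode: str, path_based: bool = False) -> bool:
    """Return True if a Spark DataFrame write is supported for Hudi
    tables under Lake Formation Full Table Access mode."""
    if path_based:
        # save(path) and external-table (path) writes are not supported
        return False
    return save_mode.lower() in SUPPORTED_SAVE_MODES
```

For example, `is_supported_hudi_fta_write("Overwrite")` returns `True`, while `is_supported_hudi_fta_write("append")` returns `False`.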