Open-table format support
Amazon EMR releases 6.15.0 and higher include support for fine-grained access control
based on AWS Lake Formation with Hive tables, Apache Iceberg, Apache Hudi, and Delta Lake when
you read and write data with Spark SQL. This includes access control at the table, row,
column, and cell level. Starting with EMR 7.12, DML and DDL operations that modify table data
are supported for Apache Hive, Apache Iceberg, and Delta Lake tables using Lake
Formation vended credentials.
The topics in this section cover how you can access Lake Formation registered tables in open
table formats from EMR Spark jobs or interactive sessions with fine-grained access
control.
Permission requirements
Tables not registered in AWS Lake Formation
For tables not registered with AWS Lake Formation, the job runtime role accesses both
the AWS Glue Data Catalog and the underlying table data in Amazon S3. This requires the
job runtime role to have appropriate IAM permissions for both AWS Glue and Amazon
S3 operations.
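As a rough sketch, a runtime role policy for tables not registered with Lake Formation might combine AWS Glue read actions with Amazon S3 data access like the following. The bucket, and the scope of each statement, are placeholders; your job may need additional actions (for example, glue:CreateTable or s3:DeleteObject for write workloads).

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "GlueCatalogAccess",
      "Effect": "Allow",
      "Action": ["glue:GetDatabase", "glue:GetTable", "glue:GetPartitions"],
      "Resource": "*"
    },
    {
      "Sid": "TableDataAccess",
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::amzn-example-bucket",
        "arn:aws:s3:::amzn-example-bucket/*"
      ]
    }
  ]
}
```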
Tables registered in AWS Lake Formation
For tables registered with AWS Lake Formation, the job runtime role accesses the
AWS Glue Data Catalog metadata, while temporary credentials vended by Lake Formation
access the underlying table data in Amazon S3. The Lake Formation permissions
required to execute an operation depend on the AWS Glue Data Catalog and Amazon S3 API
calls that the Spark job initiates and can be summarized as follows:
- DROP permission allows the runtime role to delete table or database metadata from the Data Catalog.
- SELECT permission allows the runtime role to read table data from Amazon S3.
- INSERT permission allows the runtime role to write table data to Amazon S3.
- DELETE permission allows the runtime role to delete table data from Amazon S3.
Lake Formation evaluates permissions lazily when a Spark job calls
AWS Glue to retrieve table metadata and Amazon S3 to retrieve table
data. Jobs that use a runtime role with insufficient permissions
will not fail until Spark makes an AWS Glue or Amazon S3 call that
requires the missing permission.
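For example, a role granted only DESCRIBE on a table can run metadata-only statements, and the missing permission surfaces only when a statement actually reads data. The database and table names below are illustrative.

```sql
-- Succeeds immediately: needs only the granted DESCRIBE permission
DESCRIBE TABLE sales_db.orders;

-- Fails only at execution time, when the scan triggers an Amazon S3
-- read that requires the missing SELECT permission
SELECT * FROM sales_db.orders LIMIT 10;
```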
In the following support matrices:
- Operations marked as Supported exclusively use Lake Formation credentials to access table data for tables registered with Lake Formation. If Lake Formation permissions are insufficient, the operation will not fall back to runtime role credentials. For tables not registered with Lake Formation, the job runtime role credentials access the table data.
- Operations marked as Supported with IAM permissions on Amazon S3 location do not use Lake Formation credentials to access underlying table data in Amazon S3. To run these operations, the job runtime role must have the necessary Amazon S3 IAM permissions to access the table data, regardless of whether the table is registered with Lake Formation.
- Hive
-
| Operation |
AWS Lake Formation permissions |
Support status |
| SELECT |
SELECT |
Supported |
| CREATE TABLE |
CREATE_TABLE |
Supported |
| CREATE TABLE LIKE |
CREATE_TABLE |
Supported with IAM permissions on
Amazon S3 location |
| CREATE TABLE AS SELECT |
CREATE_TABLE |
Supported with IAM permissions on
Amazon S3 location |
| DESCRIBE TABLE |
DESCRIBE |
Supported |
| SHOW TBLPROPERTIES |
DESCRIBE |
Supported |
| SHOW COLUMNS |
DESCRIBE |
Supported |
| SHOW PARTITIONS |
DESCRIBE |
Supported |
| SHOW CREATE TABLE |
DESCRIBE |
Supported |
| ALTER TABLE tablename |
SELECT and ALTER |
Supported |
| ALTER TABLE tablename SET LOCATION |
- |
Not supported |
| ALTER TABLE tablename ADD PARTITION |
SELECT, INSERT and ALTER |
Supported |
| REPAIR TABLE |
SELECT and ALTER |
Supported |
| LOAD DATA |
- |
Not supported |
| INSERT |
INSERT and ALTER |
Supported |
| INSERT OVERWRITE |
SELECT, INSERT, DELETE and ALTER |
Supported |
| DROP TABLE |
SELECT, DROP, DELETE and ALTER |
Supported |
| TRUNCATE TABLE |
SELECT, INSERT, DELETE and ALTER |
Supported |
| DataFrame Writer V1 |
Same as corresponding SQL operation |
Supported when appending data to an existing
table. Refer to considerations and limitations for more
information.
|
| DataFrame Writer V2 |
Same as corresponding SQL operation |
Supported when appending data to an existing
table. Refer to considerations and limitations for more
information.
|
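Examples: As a minimal sketch, a Hive-table session using the permissions above might look like this. The table and column names are illustrative.

```sql
-- Create a Hive table (requires CREATE_TABLE)
CREATE TABLE my_hive_table (
    id BIGINT,
    name STRING
) STORED AS PARQUET;

-- Insert data (requires INSERT and ALTER)
INSERT INTO my_hive_table VALUES (1, 'Alice');

-- Query the table (requires SELECT)
SELECT * FROM my_hive_table;
```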
- Iceberg
-
| Operation |
AWS Lake Formation permissions |
Support status |
| SELECT |
SELECT |
Supported |
| CREATE TABLE |
CREATE_TABLE |
Supported |
| CREATE TABLE LIKE |
CREATE_TABLE |
Supported with IAM permissions on
Amazon S3 location |
| CREATE TABLE AS SELECT |
CREATE_TABLE |
Supported with IAM permissions on
Amazon S3 location |
| REPLACE TABLE AS SELECT |
SELECT, INSERT and ALTER |
Supported |
| DESCRIBE TABLE |
DESCRIBE |
Supported with IAM permissions on
Amazon S3 location |
| SHOW TBLPROPERTIES |
DESCRIBE |
Supported with IAM permissions on
Amazon S3 location |
| SHOW CREATE TABLE |
DESCRIBE |
Supported with IAM permissions on
Amazon S3 location |
| ALTER TABLE |
SELECT, INSERT and ALTER |
Supported |
| ALTER TABLE SET LOCATION |
SELECT, INSERT and ALTER |
Supported with IAM permissions on
Amazon S3 location |
| ALTER TABLE WRITE ORDERED BY |
SELECT, INSERT and ALTER |
Supported with IAM permissions on
Amazon S3 location |
| ALTER TABLE WRITE DISTRIBUTED BY |
SELECT, INSERT and ALTER |
Supported with IAM permissions on
Amazon S3 location |
| ALTER TABLE RENAME TABLE |
CREATE_TABLE and DROP |
Supported |
| INSERT INTO |
SELECT, INSERT and ALTER |
Supported |
| INSERT OVERWRITE |
SELECT, INSERT and ALTER |
Supported |
| DELETE |
SELECT, INSERT and ALTER |
Supported |
| UPDATE |
SELECT, INSERT and ALTER |
Supported |
| MERGE INTO |
SELECT, INSERT and ALTER |
Supported |
| DROP TABLE |
SELECT, DELETE and DROP |
Supported |
| DataFrame Writer V1 |
- |
Not supported |
| DataFrame Writer V2 |
Same as corresponding SQL operation |
Supported when appending data to an existing
table. Refer to considerations and limitations for more
information.
|
| Metadata tables |
SELECT |
Supported. Certain tables are hidden.
Refer to considerations and limitations for more
information. |
| Stored procedures |
- |
Supported for tables that meet the
following conditions:
- Tables not registered in AWS Lake Formation
- Tables that do not use register_table and migrate
Refer to considerations and limitations for more
information.
|
Spark configuration for Iceberg: If
you want to use the Iceberg format, set the following configurations.
Replace DB_LOCATION with the
Amazon S3 path where your Iceberg tables are located, and replace the account ID
and Region placeholders with your own values.
spark-sql \
--conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
--conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog \
--conf spark.sql.catalog.spark_catalog.warehouse=s3://DB_LOCATION \
--conf spark.sql.catalog.spark_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog \
--conf spark.sql.catalog.spark_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO \
--conf spark.sql.catalog.spark_catalog.glue.account-id=ACCOUNT_ID \
--conf spark.sql.catalog.spark_catalog.glue.id=ACCOUNT_ID \
--conf spark.sql.catalog.spark_catalog.client.region=AWS_REGION
If you want to use the Iceberg format on earlier EMR versions, use the
following command instead:
spark-sql \
--conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,com.amazonaws.emr.recordserver.connector.spark.sql.RecordServerSQLExtension \
--conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkCatalog \
--conf spark.sql.catalog.spark_catalog.warehouse=s3://DB_LOCATION \
--conf spark.sql.catalog.spark_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog \
--conf spark.sql.catalog.spark_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO \
--conf spark.sql.catalog.spark_catalog.glue.account-id=ACCOUNT_ID \
--conf spark.sql.catalog.spark_catalog.glue.id=ACCOUNT_ID \
--conf spark.sql.catalog.spark_catalog.client.assume-role.region=AWS_REGION \
--conf spark.sql.catalog.spark_catalog.lf.managed=true
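The two command variants above differ only in a handful of settings. As a convenience sketch (this helper is hypothetical, not part of EMR, and it assumes the newer variant applies to EMR 7.10 and higher, matching the Hudi and Delta Lake sections), the --conf flags could be assembled programmatically:

```python
def iceberg_confs(db_location, account_id, region, emr_7_10_or_later=True):
    """Build spark-sql --conf flags for Iceberg with Lake Formation.

    The boundary between the two variants is assumed to be EMR 7.10;
    verify against your release before relying on it.
    """
    catalog = "spark.sql.catalog.spark_catalog"
    extensions = "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions"
    confs = {
        f"{catalog}.warehouse": f"s3://{db_location}",
        f"{catalog}.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog",
        f"{catalog}.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
        f"{catalog}.glue.account-id": account_id,
        f"{catalog}.glue.id": account_id,
    }
    if emr_7_10_or_later:
        confs["spark.sql.extensions"] = extensions
        confs[catalog] = "org.apache.iceberg.spark.SparkSessionCatalog"
        confs[f"{catalog}.client.region"] = region
    else:
        # Earlier releases add the RecordServer extension, an assume-role
        # region, and the lf.managed flag
        confs["spark.sql.extensions"] = (
            extensions
            + ",com.amazonaws.emr.recordserver.connector.spark.sql.RecordServerSQLExtension"
        )
        confs[catalog] = "org.apache.iceberg.spark.SparkCatalog"
        confs[f"{catalog}.client.assume-role.region"] = region
        confs[f"{catalog}.lf.managed"] = "true"
    return [f"--conf {key}={value}" for key, value in sorted(confs.items())]
```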
Examples:
Here are some examples of working with Iceberg tables:
-- Create an Iceberg table
CREATE TABLE my_iceberg_table (
id BIGINT,
name STRING,
created_at TIMESTAMP
) USING ICEBERG;
-- Insert data
INSERT INTO my_iceberg_table VALUES (1, 'Alice', current_timestamp());
-- Query the table
SELECT * FROM my_iceberg_table;
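Because metadata tables are supported (with certain tables hidden, per the considerations and limitations), you can also inspect a registered table's history and snapshots with standard Iceberg metadata queries; the table name is illustrative.

```sql
-- Inspect commit history and snapshots via Iceberg metadata tables
SELECT * FROM my_iceberg_table.history;
SELECT * FROM my_iceberg_table.snapshots;
```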
- Hudi
-
| Operation |
AWS Lake Formation permissions |
Support status |
| SELECT |
SELECT |
Supported |
| CREATE TABLE |
CREATE_TABLE |
Supported with IAM permissions on
Amazon S3 location |
| CREATE TABLE LIKE |
CREATE_TABLE |
Supported with IAM permissions on
Amazon S3 location |
| CREATE TABLE AS SELECT |
- |
Not supported |
| DESCRIBE TABLE |
DESCRIBE |
Supported with IAM permissions on
Amazon S3 location |
| SHOW TBLPROPERTIES |
DESCRIBE |
Supported with IAM permissions on
Amazon S3 location |
| SHOW COLUMNS |
DESCRIBE |
Supported with IAM permissions on
Amazon S3 location |
| SHOW CREATE TABLE |
DESCRIBE |
Supported with IAM permissions on
Amazon S3 location |
| ALTER TABLE |
SELECT |
Supported with IAM permissions on
Amazon S3 location |
| INSERT INTO |
SELECT and ALTER |
Supported with IAM permissions on
Amazon S3 location |
| INSERT OVERWRITE |
SELECT and ALTER |
Supported with IAM permissions on
Amazon S3 location |
| DELETE |
- |
Not supported |
| UPDATE |
- |
Not supported |
| MERGE INTO |
- |
Not supported |
| DROP TABLE |
SELECT and DROP |
Supported with IAM permissions on
Amazon S3 location |
| DataFrame Writer V1 |
- |
Not supported |
| DataFrame Writer V2 |
Same as corresponding SQL operation |
Supported with IAM permissions on
Amazon S3 location |
| Metadata tables |
- |
Not supported |
| Table maintenance and utility
features |
- |
Not supported |
Spark configuration for Hudi:
To start the Spark shell on EMR 7.10 or higher versions, use the
following command:
spark-sql \
--jars /usr/lib/hudi/hudi-spark-bundle.jar \
--conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog \
--conf spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension
To start the Spark shell on earlier EMR versions, use the following
command instead:
spark-sql \
--jars /usr/lib/hudi/hudi-spark-bundle.jar \
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
--conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog \
--conf spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension,com.amazonaws.emr.recordserver.connector.spark.sql.RecordServerSQLExtension \
--conf spark.sql.catalog.spark_catalog.lf.managed=true
Examples:
Here are some examples of working with Hudi tables:
-- Create a Hudi table
CREATE TABLE my_hudi_table (
id BIGINT,
name STRING,
created_at TIMESTAMP
) USING HUDI
TBLPROPERTIES (
'type' = 'cow',
'primaryKey' = 'id'
);
-- Insert data
INSERT INTO my_hudi_table VALUES (1, 'Alice', current_timestamp());
-- Query the latest snapshot
SELECT * FROM my_hudi_table;
To query the latest snapshot of copy-on-write tables, use either Spark SQL or the DataFrame API:
SELECT * FROM my_hudi_cow_table
spark.read.table("my_hudi_cow_table")
To query the latest compacted data of merge-on-read (MOR) tables, query the read-optimized table that is suffixed with _ro:
SELECT * FROM my_hudi_mor_table_ro
spark.read.table("my_hudi_mor_table_ro")
- Delta Lake
-
| Operation |
AWS Lake Formation permissions |
Support status |
| SELECT |
SELECT |
Supported |
| CREATE TABLE |
CREATE_TABLE |
Supported |
| CREATE TABLE LIKE |
- |
Not supported |
| CREATE TABLE AS SELECT |
CREATE_TABLE |
Supported |
| REPLACE TABLE AS SELECT |
SELECT, INSERT and ALTER |
Supported |
| DESCRIBE TABLE |
DESCRIBE |
Supported with IAM permissions on
Amazon S3 location |
| SHOW TBLPROPERTIES |
DESCRIBE |
Supported with IAM permissions on
Amazon S3 location |
| SHOW COLUMNS |
DESCRIBE |
Supported with IAM permissions on
Amazon S3 location |
| SHOW CREATE TABLE |
DESCRIBE |
Supported with IAM permissions on
Amazon S3 location |
| ALTER TABLE |
SELECT and INSERT |
Supported |
| ALTER TABLE SET LOCATION |
SELECT and INSERT |
Supported with IAM permissions on
Amazon S3 location |
| ALTER TABLE tablename CLUSTER BY |
SELECT and INSERT |
Supported with IAM permissions on
Amazon S3 location |
| ALTER TABLE tablename ADD CONSTRAINT |
SELECT and INSERT |
Supported with IAM permissions on
Amazon S3 location |
| ALTER TABLE tablename DROP CONSTRAINT |
SELECT and INSERT |
Supported with IAM permissions on
Amazon S3 location |
| INSERT INTO |
SELECT and INSERT |
Supported |
| INSERT OVERWRITE |
SELECT and INSERT |
Supported |
| DELETE |
SELECT and INSERT |
Supported |
| UPDATE |
SELECT and INSERT |
Supported |
| MERGE INTO |
SELECT and INSERT |
Supported |
| DROP TABLE |
SELECT, DELETE and DROP |
Supported |
| DataFrame Writer V1 |
- |
Not supported |
| DataFrame Writer V2 |
Same as corresponding SQL operation |
Supported
|
| Table maintenance and utility
features |
- |
Not supported |
Spark configuration for Delta
Lake:
To use Delta Lake with Lake Formation on EMR 7.10 and higher, run the
following command:
spark-sql \
--conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension \
--conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog
To use Delta Lake with Lake Formation on EMR 6.15 to 7.9, run the
following command:
spark-sql \
--conf spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension,com.amazonaws.emr.recordserver.connector.spark.sql.RecordServerSQLExtension \
--conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog \
--conf spark.sql.catalog.spark_catalog.lf.managed=true
If you want Lake Formation to use the record server to manage your Spark catalog,
set spark.sql.catalog.<managed_catalog_name>.lf.managed
to true.
Examples:
Here are some examples of working with Delta Lake tables:
-- Create a Delta Lake table
CREATE TABLE my_delta_table (
id BIGINT,
name STRING,
created_at TIMESTAMP
) USING DELTA;
-- Insert data
INSERT INTO my_delta_table VALUES (1, 'Alice', current_timestamp());
-- Query the table
SELECT * FROM my_delta_table;
-- Update data
UPDATE my_delta_table SET name = 'Alice Smith' WHERE id = 1;
-- Merge data
MERGE INTO my_delta_table AS target
USING (SELECT 2 as id, 'Bob' as name, current_timestamp() as created_at) AS source
ON target.id = source.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
Creating a Delta Lake table in
AWS Glue Data Catalog
Amazon EMR with Lake Formation doesn't support DDL commands or Delta table creation
in EMR releases earlier than 7.12. Follow these steps to create tables
in the AWS Glue Data Catalog.
-
Use the following example to create a Delta table. Make sure
that your S3 location exists.
spark-sql \
--conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" \
--conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog"
> CREATE DATABASE IF NOT EXISTS <DATABASE_NAME> LOCATION 's3://<S3_LOCATION>/transactionaldata/native-delta/<DATABASE_NAME>/';
> CREATE TABLE <TABLE_NAME> (x INT, y STRING, z STRING) USING delta;
> INSERT INTO <TABLE_NAME> VALUES (1, 'a1', 'b1');
-
To see the details of your table, go to the AWS Glue console at
https://console.aws.amazon.com/glue/.
-
In the left navigation, expand Data Catalog,
choose Tables, then choose the table you
created. Under Schema, you should see that
the Delta table you created with Spark stores all columns in a
data type of array<string> in AWS Glue.
-
To define column and cell-level filters in Lake Formation, remove the
col column from your schema, and then add the
columns that are in your table schema. In this example, add the
columns x, y, and
z.
With this feature, you can run snapshot queries on copy-on-write tables to query
the latest snapshot of the table at a given commit or compaction instant. Currently,
a Lake Formation-enabled Amazon EMR cluster must retrieve Hudi's commit time column to perform
incremental queries and time travel queries. It doesn't support Spark's
timestamp as of syntax or the spark.read() function.
The correct syntax is select * from table where _hoodie_commit_time <=
point_in_time. For more information, see Point in time Time-Travel queries on Hudi table.
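A sketch of that filter pattern, with an illustrative table name and commit timestamp:

```sql
-- Point-in-time read on a Hudi table using the commit time column
SELECT * FROM my_hudi_table
WHERE _hoodie_commit_time <= '20240801120000';
```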
The performance of reads on Lake Formation clusters might be slower because of
optimizations that are not supported. These features include file listing based
on Hudi metadata, and data skipping. We recommend that you test your application
performance to ensure that it meets your requirements.