Accessing Amazon S3 tables with Amazon EMR
Amazon EMR (previously called Amazon Elastic MapReduce) is a managed cluster platform that simplifies running big data frameworks, such as Apache Hadoop and Apache Spark, on AWS to process and analyze vast amounts of data. Using these frameworks and related open-source projects, you can process data for analytics purposes and business intelligence workloads. Amazon EMR also lets you transform and move large amounts of data into and out of other AWS data stores and databases.
You can use Apache Iceberg clusters in Amazon EMR to work with S3 tables by connecting to table buckets in a Spark session. To connect to table buckets in Amazon EMR, you can use the AWS analytics services integration through AWS Glue Data Catalog, or you can use the open source Amazon S3 Tables Catalog for Apache Iceberg client catalog.
Connecting to S3 table buckets with Spark on an Amazon EMR Iceberg cluster
In this procedure, you set up an Amazon EMR cluster configured for Apache Iceberg and then launch a Spark session that connects to your table buckets. You can set this up using the AWS analytics services integration through AWS Glue, or you can use the open source Amazon S3 Tables Catalog for Apache Iceberg client catalog. For information about the client catalog, see Accessing tables using the Amazon S3 Tables Iceberg REST endpoint.

Choose your method of using tables with Amazon EMR from the following options.
        
Amazon S3 Tables Catalog for Apache Iceberg

The following prerequisites are required to query tables with Spark on Amazon EMR using the Amazon S3 Tables Catalog for Apache Iceberg.

To set up an Amazon EMR cluster to query tables with Spark

1. Create a cluster with the following configuration. To use this example, replace the user input placeholders with your own information.
                            aws emr create-cluster --release-label emr-7.5.0 \
--applications Name=Spark \
--configurations file://configurations.json \
--region us-east-1 \
--name My_Spark_Iceberg_Cluster \
--log-uri s3://amzn-s3-demo-bucket/ \
--instance-type m5.xlarge \
--instance-count 2 \
--service-role EMR_DefaultRole \
--ec2-attributes \
InstanceProfile=EMR_EC2_DefaultRole,SubnetId=subnet-1234567890abcdef0,KeyName=my-key-pair
                            configurations.json:
                            
                            [{
"Classification":"iceberg-defaults",
"Properties":{"iceberg.enabled":"true"}
}]
2. Connect to the Spark primary node using SSH.
3. To initialize a Spark session for Iceberg that connects to your table bucket, enter the following command. Replace the user input placeholders with your table bucket ARN.
                            spark-shell \
--packages software.amazon.s3tables:s3-tables-catalog-for-iceberg-runtime:0.1.3 \
--conf spark.sql.catalog.s3tablesbucket=org.apache.iceberg.spark.SparkCatalog \
--conf spark.sql.catalog.s3tablesbucket.catalog-impl=software.amazon.s3tables.iceberg.S3TablesCatalog \
--conf spark.sql.catalog.s3tablesbucket.warehouse=arn:aws:s3tables:us-east-1:111122223333:bucket/amzn-s3-demo-bucket1 \
--conf spark.sql.defaultCatalog=s3tablesbucket \
--conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions           
4. Query your tables with Spark SQL. For example queries, see Querying S3 tables with Spark SQL.
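As a minimal sketch of that final step, the following Spark SQL statements create a namespace and an Iceberg table in the table bucket and read data back. The names example_namespace and example_table are placeholders for illustration; the catalog name s3tablesbucket matches the spark-shell configuration above.

```sql
-- Create a namespace in the table bucket (placeholder name)
CREATE NAMESPACE IF NOT EXISTS s3tablesbucket.example_namespace;

-- Create an Iceberg table in that namespace
CREATE TABLE IF NOT EXISTS s3tablesbucket.example_namespace.example_table (
    id   INT,
    name STRING
) USING iceberg;

-- Insert a row and read it back
INSERT INTO s3tablesbucket.example_namespace.example_table VALUES (1, 'example');
SELECT * FROM s3tablesbucket.example_namespace.example_table;
```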
                         
 
                 
            
AWS analytics services integration

The following prerequisites are required to query tables with Spark on Amazon EMR using the AWS analytics services integration.

To set up an Amazon EMR cluster to query tables with Spark

1. Create a cluster with the following configuration. To use this example, replace the user input placeholder values with your own information.
                            aws emr create-cluster --release-label emr-7.5.0 \
--applications Name=Spark \
--configurations file://configurations.json \
--region us-east-1 \
--name My_Spark_Iceberg_Cluster \
--log-uri s3://amzn-s3-demo-bucket/ \
--instance-type m5.xlarge \
--instance-count 2 \
--service-role EMR_DefaultRole \
--ec2-attributes \
InstanceProfile=EMR_EC2_DefaultRole,SubnetId=subnet-1234567890abcdef0,KeyName=my-key-pair
                            configurations.json:
                            
                            [{
"Classification":"iceberg-defaults",
"Properties":{"iceberg.enabled":"true"}
}]
2. Connect to the Spark primary node using SSH.
3. Enter the following command to initialize a Spark session for Iceberg that connects to your tables. Replace the user input placeholders for the Region, account ID, and table bucket name with your own information.
                            spark-shell \
--conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
--conf spark.sql.defaultCatalog=s3tables \
--conf spark.sql.catalog.s3tables=org.apache.iceberg.spark.SparkCatalog \
--conf spark.sql.catalog.s3tables.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog \
--conf spark.sql.catalog.s3tables.client.region=us-east-1 \
--conf spark.sql.catalog.s3tables.glue.id=111122223333:s3tablescatalog/amzn-s3-demo-table-bucket
4. Query your tables with Spark SQL. For example queries, see Querying S3 tables with Spark SQL.
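For instance, the statements below list namespaces and read from a table through the s3tables catalog configured in the previous step; example_namespace and example_table are placeholder names for illustration.

```sql
-- List namespaces in the table bucket through the Glue-backed catalog
SHOW NAMESPACES IN s3tables;

-- Read from a table (placeholder names)
SELECT * FROM s3tables.example_namespace.example_table LIMIT 10;
```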
                         
 
                 
            
        
        
If you are using the DROP TABLE PURGE command with Amazon EMR:

- Amazon EMR version 7.5: Set the Spark config spark.sql.catalog.your-catalog-name.cache-enabled to false. If this config is set to true, run the command in a new session or application so that the table cache is not activated.

- Amazon EMR versions higher than 7.5: DROP TABLE is not supported. You can use the S3 Tables DeleteTable REST API to delete a table.
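One way to call the DeleteTable operation is through the AWS CLI, assuming a recent CLI version with the s3tables commands available. This is a sketch; replace the ARN, namespace, and table name with your own values.

```shell
# Delete a table through the S3 Tables DeleteTable API (placeholder values)
aws s3tables delete-table \
    --table-bucket-arn arn:aws:s3tables:us-east-1:111122223333:bucket/amzn-s3-demo-bucket1 \
    --namespace example_namespace \
    --name example_table
```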