

# Best practices for optimizing Apache Iceberg workloads
<a name="best-practices"></a>

Iceberg is a table format that's designed to simplify data lake management and enhance workload performance. Different use cases might prioritize different aspects such as cost, read performance, write performance, or data retention, so Iceberg offers configuration options to manage these trade-offs. This section provides insights for optimizing and fine-tuning your Iceberg workloads to meet your requirements.

**Topics**
+ [General best practices](best-practices-general.md)
+ [Optimizing read performance](best-practices-read.md)
+ [Optimizing write performance](best-practices-write.md)
+ [Optimizing storage](best-practices-storage.md)
+ [Maintaining tables by using compaction](best-practices-compaction.md)
+ [Using Iceberg workloads in Amazon S3](best-practices-workloads.md)

# General best practices
<a name="best-practices-general"></a>

Regardless of your use case, when you use Apache Iceberg on AWS, we recommend that you follow these general best practices.
+ **Use Iceberg format version 2.**

  Athena uses Iceberg format version 2 by default.

  When you use Spark on Amazon EMR or AWS Glue to create Iceberg tables, specify the format version as described in the [Iceberg documentation](https://iceberg.apache.org/docs/nightly/configuration/#reserved-table-properties).
+ **Use the AWS Glue Data Catalog as your data catalog.**

  Athena uses the AWS Glue Data Catalog by default.

  When you use Spark on Amazon EMR or AWS Glue to work with Iceberg, add the following configuration to your Spark session to use the AWS Glue Data Catalog. For more information, see the section [Spark configurations for Iceberg in AWS Glue](iceberg-glue.md#glue-spark-config) earlier in this guide.

  ```
  "spark.sql.catalog.<your_catalog_name>.type": "glue"
  ```
+ **Use the AWS Glue Data Catalog as lock manager.**

  Athena uses the AWS Glue Data Catalog as lock manager by default for Iceberg tables.

  When you use Spark on Amazon EMR or AWS Glue to work with Iceberg, make sure to configure your Spark session configuration to use the AWS Glue Data Catalog as lock manager. For more information, see [Optimistic Locking](https://iceberg.apache.org/docs/latest/aws/#optimistic-locking) in the Iceberg documentation.
+ **Use Zstandard (ZSTD) compression.**

  The default compression codec of Iceberg is gzip, which can be modified by using the table property `write.<file_type>.compression-codec`. Athena already uses ZSTD as the default compression codec for Iceberg tables.

  In general, we recommend using the ZSTD compression codec because it strikes a balance between GZIP and Snappy, and offers good read/write performance without compromising the compression ratio. Additionally, compression levels can be adjusted to suit your needs. For more information, see [ZSTD compression levels in Athena](https://docs.aws.amazon.com/athena/latest/ug/compression-support-zstd-levels.html) in the Athena documentation.

  Snappy might provide the best overall read and write performance but has a lower compression ratio than GZIP and ZSTD. If you prioritize performance—even if it means storing larger data volumes in Amazon S3—Snappy might be the optimal choice.
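These general practices can be combined in your Spark session and table DDL. The following is a minimal sketch; the catalog name `glue_catalog`, the warehouse path, and the table `db.events` are placeholder names, not values from this guide.

```python
# Hypothetical Spark session settings (catalog and bucket names are placeholders).
spark_conf = {
    # Register an Iceberg catalog backed by the AWS Glue Data Catalog.
    "spark.sql.catalog.glue_catalog": "org.apache.iceberg.spark.SparkCatalog",
    "spark.sql.catalog.glue_catalog.type": "glue",
    "spark.sql.catalog.glue_catalog.warehouse": "s3://amzn-s3-demo-bucket/warehouse/",
    # Enable Iceberg's SQL extensions (for example, ALTER TABLE ... WRITE ORDERED BY).
    "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
}

# Table-level properties: format version 2 and ZSTD compression for Parquet files.
create_ddl = """
CREATE TABLE glue_catalog.db.events (
    id bigint,
    ts timestamp)
USING iceberg
TBLPROPERTIES (
  'format-version' = '2',
  'write.parquet.compression-codec' = 'zstd')
"""
# With an active SparkSession: spark.sql(create_ddl)
```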

# Optimizing read performance
<a name="best-practices-read"></a>

This section discusses table properties that you can tune to optimize read performance, independent of the engine.

## Partitioning
<a name="read-partitioning"></a>

As with Hive tables, Iceberg uses partitions as the primary layer of indexing to avoid reading unnecessary metadata files and data files. Column statistics are also taken into consideration as a secondary layer of indexing to further improve query planning, which leads to better overall execution time.

### Partition your data
<a name="read-partitioning-data"></a>

To reduce the amount of data that's scanned when querying Iceberg tables, choose a balanced partition strategy that aligns with your expected read patterns:
+ Identify columns that are frequently used in queries. These are ideal partitioning candidates. For example, if you typically query data from a particular day, a natural example of a partition column would be a date column.
+ Choose a low cardinality partition column to avoid creating an excessive number of partitions. Too many partitions can increase the number of files in the table, which can negatively impact query performance. As a rule of thumb, "too many partitions" can be defined as a scenario where the data size in the majority of partitions is less than 2-5 times the value set by `target-file-size-bytes`.

**Note**  
If you typically query by using filters on a high cardinality column (for example, an `id` column that can have thousands of values), use Iceberg's hidden partitioning feature with bucket transforms, as explained in the next section.

### Use hidden partitioning
<a name="read-partitioning-hidden"></a>

If your queries commonly filter on a derivative of a table column, use hidden partitions instead of explicitly creating new columns to work as partitions. For more information about this feature, see the [Iceberg documentation](https://iceberg.apache.org/docs/latest/partitioning/#icebergs-hidden-partitioning).

For example, in a dataset that has a timestamp column (for example, `2023-01-01 09:00:00`), instead of creating a new column with the parsed date (for example, `2023-01-01`), use partition transforms to extract the date part from the timestamp and create these partitions on the fly.
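For example, the `day` transform applied to a timestamp column can be declared at table creation. The following sketch uses hypothetical table and column names:

```python
# Hidden partitioning: partition by the day of a timestamp column without
# adding a separate date column. Table and column names are hypothetical.
create_orders_ddl = """
CREATE TABLE glue_catalog.db.orders (
    order_id bigint,
    event_time timestamp)
USING iceberg
PARTITIONED BY (day(event_time))
"""
# With an active SparkSession: spark.sql(create_orders_ddl)
```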

The most common use cases for hidden partitioning are:
+ **Partitioning on date or time**, when the data has a timestamp column. Iceberg offers multiple transforms to extract the date or time parts of a timestamp.
+ **Partitioning on a hash function of a column**, when the partitioning column has high cardinality and would result in too many partitions. Iceberg's bucket transform groups multiple partition values together into fewer, hidden (bucket) partitions by using hash functions on the partitioning column.

See [partition transforms](https://iceberg.apache.org/spec/#partition-transforms) in the Iceberg documentation for an overview of all available partition transforms.

Columns that are used for hidden partitioning can become part of query predicates through the use of regular SQL functions such as `year()` and `month()`. Predicates can also be combined with operators such as `BETWEEN` and `AND`.

**Note**  
Iceberg cannot perform partition pruning for functions that yield a different data type; for example, `substring(event_time, 1, 10) = '2022-01-01'`.
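For a table partitioned by `day(event_time)` as in the earlier example, a plain range predicate on the timestamp column is enough for Iceberg to prune partitions. The following query sketch assumes the hypothetical `db.orders` table:

```python
# A range predicate on the raw timestamp column; Iceberg maps it onto the
# hidden day() partitions and skips partitions outside the range.
pruning_query = """
SELECT order_id
FROM glue_catalog.db.orders
WHERE event_time BETWEEN TIMESTAMP '2023-01-01 00:00:00'
                     AND TIMESTAMP '2023-01-07 23:59:59'
"""
# With an active SparkSession: spark.sql(pruning_query)
```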

### Use partition evolution
<a name="read-partitioning-evolution"></a>

Use [Iceberg's partition evolution](https://iceberg.apache.org/docs/latest/evolution/#partition-evolution) when the existing partition strategy isn't optimal. For example, if you choose hourly partitions that turn out to be too small (just a few megabytes each), consider shifting to daily or monthly partitions.

You can use this approach when the best partition strategy for a table is initially unclear, and you want to refine your partitioning strategy as you gain more insights. Another effective use of partition evolution is when data volumes change and the current partitioning strategy becomes less effective over time.

For instructions on how to evolve partitions, see [ALTER TABLE SQL extensions](https://iceberg.apache.org/docs/latest/spark-ddl/#alter-table-sql-extensions) in the Iceberg documentation. 
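For the hourly-to-daily example above, the evolution consists of dropping the old partition field and adding a new one. This sketch assumes a hypothetical table with an `event_time` column; existing data files keep the old layout, and new writes use the new spec:

```python
# Evolve an hourly partition spec to a daily one. Only future writes are
# affected; existing files remain partitioned by hour until compacted.
drop_field_ddl = """
ALTER TABLE glue_catalog.db.orders DROP PARTITION FIELD hour(event_time)
"""
add_field_ddl = """
ALTER TABLE glue_catalog.db.orders ADD PARTITION FIELD day(event_time)
"""
# With an active SparkSession: spark.sql(drop_field_ddl); spark.sql(add_field_ddl)
```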

## Tuning file sizes
<a name="read-file-size"></a>

Optimizing query performance involves minimizing the number of small files in your tables. For good query performance, we generally recommend keeping Parquet and ORC files larger than 100 MB.

File size also impacts query planning for Iceberg tables. As the number of files in a table increases, so does the size of the metadata files. Larger metadata files can result in slower query planning. Therefore, as the table grows, increase the file size to limit the expansion of metadata.

Use the following best practices to create properly sized files in Iceberg tables.

### Set target file and row group size
<a name="read-file-size-target"></a>

Iceberg offers the following key configuration parameters for tuning the data file layout. We recommend that you use these parameters to set the target file size and the row group or stripe size.


| **Parameter** | **Default value** | **Comment** | 
| --- |--- |--- |
| `write.target-file-size-bytes` | 512 MB | This parameter specifies the maximum file size that Iceberg will create. However, certain files might be written with a smaller size than this limit. | 
| `write.parquet.row-group-size-bytes` | 128 MB | Both Parquet and ORC store data in chunks so that engines can avoid reading the entire file for some operations. | 
| `write.orc.stripe-size-bytes` | 64 MB | See the previous comment. | 
| `write.distribution-mode` | `none` for Iceberg version 1.1 and earlier; `hash` starting with Iceberg version 1.2 | Iceberg requests Spark to sort data between its tasks before writing to storage. | 
+ Based on your expected table size, follow these general guidelines:
  + **Small tables** (up to few gigabytes) – Reduce the target file size to 128 MB. Also reduce the row group or stripe size (for example, to 8 or 16 MB).
  + **Medium to large tables** (from a few gigabytes to hundreds of gigabytes) – The default values are a good starting point for these tables. If your queries are very selective, adjust the row group or stripe size (for example, to 16 MB).
  + **Very large tables** (hundreds of gigabytes or terabytes) – Increase the target file size to 1024 MB or more, and consider increasing the row group or stripe size if your queries usually pull large sets of data.
+ To ensure that Spark applications that write to Iceberg tables create appropriately sized files, set the `write.distribution-mode` property to either `hash` or `range`. For a detailed explanation of the difference between these modes, see [Writing Distribution Modes](https://iceberg.apache.org/docs/latest/spark-writes/#writing-distribution-modes) in the Iceberg documentation.

These are general guidelines. We recommend that you run tests to identify the most suitable values for your specific tables and workloads.
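As a sketch of the small-table guideline above, the following statement reduces the target file size and row group size on an existing table. The table name is a placeholder, and the byte values correspond to 128 MB and 16 MB:

```python
# Tune target file size and Parquet row-group size for a small table.
# Values are in bytes: 134217728 = 128 MB, 16777216 = 16 MB.
file_size_props = """
ALTER TABLE glue_catalog.db.small_table SET TBLPROPERTIES (
  'write.target-file-size-bytes' = '134217728',
  'write.parquet.row-group-size-bytes' = '16777216',
  'write.distribution-mode' = 'hash')
"""
# With an active SparkSession: spark.sql(file_size_props)
```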

### Run regular compaction
<a name="read-file-size-compaction"></a>

The configurations in the previous table set a maximum file size that write tasks can create, but do not guarantee that files will have that size. To ensure proper file sizes, run compaction regularly to combine small files into larger files. For detailed guidance on running compaction, see [Iceberg compaction](best-practices-compaction.md) later in this guide.

## Optimize column statistics
<a name="read-column-statistics"></a>

Iceberg uses column statistics to perform file pruning, which improves query performance by reducing the amount of data that's scanned by queries. To benefit from column statistics, make sure that Iceberg collects statistics for all columns that are frequently used in query filters.

By default, Iceberg collects statistics only for the [first 100 columns in each table](https://github.com/apache/iceberg/blob/ae15c7e36973501b40443e75816d3eac39eddc90/core/src/main/java/org/apache/iceberg/TableProperties.java#L276), as defined by the table property `write.metadata.metrics.max-inferred-column-defaults`. If your table has more than 100 columns and your queries frequently reference columns outside of the first 100 columns (for example, you might have queries that filter on column 132), make sure that Iceberg collects statistics on those columns. There are two options to achieve this:
+ When you create the Iceberg table, reorder columns so that the columns you need for queries fall within the column range set by `write.metadata.metrics.max-inferred-column-defaults` (default is 100).

  Note: If you don't need statistics on 100 columns, you can lower the `write.metadata.metrics.max-inferred-column-defaults` configuration to a desired value (for example, 20) and reorder the columns so that the columns your read and write queries use fall within the first 20 columns on the left side of the dataset.
+ If you use only a few columns in query filters, you can disable the overall property for metrics collection and selectively choose individual columns to collect statistics for, as shown in this example:

  ```
  .tableProperty("write.metadata.metrics.default", "none")
  .tableProperty("write.metadata.metrics.column.my_col_a", "full")
  .tableProperty("write.metadata.metrics.column.my_col_b", "full")
  ```

Note: Column statistics are most effective when data is sorted on those columns. For more information, see the [Set the sort order](#read-sort-order) section later in this guide.

## Choose the right update strategy
<a name="read-update"></a>

Use a copy-on-write strategy to optimize read performance when slower write operations are acceptable for your use case. This is the default strategy used by Iceberg.

Copy-on-write results in better read performance, because files are directly written to storage in a read-optimized fashion. However, compared with merge-on-read, each write operation takes longer and consumes more compute resources. This presents a classic trade-off between read and write latency. Typically, copy-on-write is ideal for use cases where most updates are collocated in the same table partitions (for example, for daily batch loads).

Copy-on-write configurations (`write.update.mode`, `write.delete.mode`, and `write.merge.mode`) can be set at the table level or independently on the application side.
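Setting all three properties to copy-on-write at the table level can be sketched as follows; the table name is a placeholder:

```python
# Pin copy-on-write for updates, deletes, and merges on a hypothetical table.
cow_props = """
ALTER TABLE glue_catalog.db.daily_batch SET TBLPROPERTIES (
  'write.update.mode' = 'copy-on-write',
  'write.delete.mode' = 'copy-on-write',
  'write.merge.mode' = 'copy-on-write')
"""
# With an active SparkSession: spark.sql(cow_props)
```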

## Use ZSTD compression
<a name="read-compression"></a>

You can modify the compression codec used by Iceberg by using the table property `write.<file_type>.compression-codec`. We recommend that you use the ZSTD compression codec to improve overall performance on tables.

By default, Iceberg versions 1.3 and earlier use GZIP compression, which provides slower read/write performance compared with ZSTD.

Note: Some engines might use different default values. This is the case for [Iceberg tables that are created with Athena](https://docs.aws.amazon.com/athena/latest/ug/compression-support-iceberg.html) or Amazon EMR version 7.x.
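Switching an existing table to ZSTD can be sketched as follows. The table name is a placeholder, and the compression-level value shown is an example, not a recommendation:

```python
# Set ZSTD as the Parquet compression codec on a hypothetical table.
# 'write.parquet.compression-level' is optional; 3 is an illustrative value.
zstd_props = """
ALTER TABLE glue_catalog.db.events SET TBLPROPERTIES (
  'write.parquet.compression-codec' = 'zstd',
  'write.parquet.compression-level' = '3')
"""
# With an active SparkSession: spark.sql(zstd_props)
```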

## Set the sort order
<a name="read-sort-order"></a>

To improve read performance on Iceberg tables, we recommend that you sort your table based on one or more columns that are frequently used in query filters. Sorting, combined with Iceberg's column statistics, can make file pruning significantly more efficient, which results in faster read operations. Sorting also reduces the number of Amazon S3 requests for queries that use the sort columns in query filters.

You can set a hierarchical sort order at the table level by running a data definition language (DDL) statement with Spark. For available options, see the [Iceberg documentation](https://iceberg.apache.org/docs/latest/spark-ddl/#alter-table--write-ordered-by). After you set the sort order, writers will apply this sorting to subsequent data write operations in the Iceberg table.

For example, in tables that are partitioned by date (`yyyy-mm-dd`) where most of the queries filter by `uuid`, you can use the DDL option `Write Distributed By Partition Locally Ordered` to make sure that Spark writes files with non-overlapping ranges.
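That DDL option can be sketched as follows, assuming a hypothetical table with a `uuid` column and Iceberg's SQL extensions enabled in the Spark session:

```python
# Distribute writes by partition and sort locally by uuid within each task,
# so files within a partition cover non-overlapping uuid ranges.
sort_order_ddl = """
ALTER TABLE glue_catalog.db.events
WRITE DISTRIBUTED BY PARTITION LOCALLY ORDERED BY uuid
"""
# With an active SparkSession: spark.sql(sort_order_ddl)
```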

The following diagram illustrates how the efficiency of column statistics improves when tables are sorted. In the example, the sorted table needs to open only a single file, and maximally benefits from Iceberg's partition and file pruning. In the unsorted table, any `uuid` can potentially exist in any data file, so the query has to open all data files.

![\[Setting sort order in Iceberg tables\]](http://docs.aws.amazon.com/prescriptive-guidance/latest/apache-iceberg-on-aws/images/setting-sort-order.png)


Changing the sort order doesn't affect existing data files. You can use Iceberg compaction to apply the sort order on those.

Using Iceberg sorted tables might decrease costs for your workload, as illustrated in the following graph.

![\[Comparison costs for Iceberg and Parquet tables\]](http://docs.aws.amazon.com/prescriptive-guidance/latest/apache-iceberg-on-aws/images/cost-graph.png)


These graphs summarize the results of running the TPC-H benchmark for Hive (Parquet) tables compared with Iceberg sorted tables. However, the results might be different for other datasets or workloads.

![\[Results of TPC-H benchmark for Parquet vs. Iceberg tables\]](http://docs.aws.amazon.com/prescriptive-guidance/latest/apache-iceberg-on-aws/images/s3-api-calls.png)


# Optimizing write performance
<a name="best-practices-write"></a>

This section discusses table properties that you can tune to optimize write performance on Iceberg tables, independent of the engine.

## Set the table distribution mode
<a name="write-distribution-mode"></a>

Iceberg offers multiple write distribution modes that define how write data is distributed across Spark tasks. For an overview of the available modes, see [Writing Distribution Modes](https://iceberg.apache.org/docs/latest/spark-writes/#writing-distribution-modes) in the Iceberg documentation.

For use cases that prioritize write speed, especially in streaming workloads, set `write.distribution-mode` to `none`. This ensures that Iceberg doesn't request additional Spark shuffling and that data is written as it becomes available in Spark tasks. This mode is particularly suitable for Spark Structured Streaming applications.

**Note**  
Setting the write distribution mode to `none` tends to produce numerous small files, which degrades read performance. We recommend regular compaction to consolidate these small files into properly sized files for query performance.
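For a streaming sink table, the property can be set as follows; the table name is a placeholder:

```python
# Disable write-time shuffling for a hypothetical streaming sink table.
# Expect many small files; pair this setting with regular compaction.
stream_props = """
ALTER TABLE glue_catalog.db.stream_sink SET TBLPROPERTIES (
  'write.distribution-mode' = 'none')
"""
# With an active SparkSession: spark.sql(stream_props)
```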

## Choose the right update strategy
<a name="write-update-strategy"></a>

Use a merge-on-read strategy to optimize write performance when slower read operations on the latest data are acceptable for your use case.

When you use merge-on-read, Iceberg writes updates and deletes to storage as separate small files. When the table is read, the reader has to merge these changes with the base files to return the latest view of the data. This results in a performance penalty for read operations, but speeds up the writing of updates and deletes. Typically, merge-on-read is ideal for streaming workloads with updates or jobs with few updates that are spread across many table partitions.

You can set merge-on-read configurations (`write.update.mode`, `write.delete.mode`, and `write.merge.mode`) at the table level or independently on the application side.

Using merge-on-read requires running regular compaction to prevent read performance from degrading over time. Compaction reconciles updates and deletes with existing data files to create a new set of data files, thereby eliminating the performance penalty incurred on the read side. By default, Iceberg's compaction doesn't merge delete files unless you change the default of the `delete-file-threshold` property to a smaller value (see the [Iceberg documentation](https://iceberg.apache.org/docs/latest/spark-procedures/#rewrite_data_files)). To learn more about compaction, see the section [Iceberg compaction](best-practices-compaction.md) later in this guide.
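Setting all three write modes to merge-on-read at the table level can be sketched as follows; the table name is a placeholder:

```python
# Pin merge-on-read for updates, deletes, and merges on a hypothetical table.
mor_props = """
ALTER TABLE glue_catalog.db.cdc_events SET TBLPROPERTIES (
  'write.update.mode' = 'merge-on-read',
  'write.delete.mode' = 'merge-on-read',
  'write.merge.mode' = 'merge-on-read')
"""
# With an active SparkSession: spark.sql(mor_props)
```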

## Choose the right file format
<a name="write-file-format"></a>

Iceberg supports writing data in Parquet, ORC, and Avro formats. Parquet is the default format. Parquet and ORC are columnar formats that offer superior read performance but are generally slower to write. This represents the typical trade-off between read and write performance.

If write speed is important for your use case, such as in streaming workloads, consider writing in Avro format by setting `write-format` to `Avro` in the writer's options. Because Avro is a row-based format, it provides faster write times at the cost of slower read performance.

To improve read performance, run regular compaction to merge and transform small Avro files into larger Parquet files. The outcome of the compaction process is governed by the `write.format.default` table setting. The default format for Iceberg is Parquet, so if you write in Avro and then run compaction, Iceberg will transform the Avro files into Parquet files. Here's an example:

```
spark.sql(f"""
    CREATE TABLE IF NOT EXISTS glue_catalog.{DB_NAME}.{TABLE_NAME} (
        Col_1 float,
        <<…other columns…>>
        ts timestamp)
    USING iceberg
    PARTITIONED BY (days(ts))
    OPTIONS (
      'format-version'='2',
      'write.format.default'='parquet')
""")

query = df \
    .writeStream \
    .format("iceberg") \
    .option("write-format", "avro") \
    .outputMode("append") \
    .trigger(processingTime='60 seconds') \
    .option("path", f"glue_catalog.{DB_NAME}.{TABLE_NAME}") \
    .option("checkpointLocation", f"s3://{BUCKET_NAME}/checkpoints/iceberg/") \
    .start()
```

# Optimizing storage
<a name="best-practices-storage"></a>

Updating or deleting data in an Iceberg table increases the number of copies of your data, as illustrated in the following diagram. The same is true for running compaction: It increases the number of data copies in Amazon S3. That's because Iceberg treats the files underlying all tables as immutable.

![\[Results of updating or deleting data in an Iceberg table\]](http://docs.aws.amazon.com/prescriptive-guidance/latest/apache-iceberg-on-aws/images/optimizing-storage.png)


Follow the best practices in this section to manage storage costs.

## Enable S3 Intelligent-Tiering
<a name="storage-s3-intelligent-tiering"></a>

Use the [Amazon S3 Intelligent-Tiering](https://docs.aws.amazon.com/AmazonS3/latest/userguide/intelligent-tiering-overview.html) storage class to automatically move data to the most cost-effective access tier when access patterns change. This option has no operational overhead or impact on performance.  

Note: Don't use the optional tiers (such as Archive Access and Deep Archive Access) in S3 Intelligent-Tiering with Iceberg tables. To archive data, see the guidelines in the next section.

You can also use [Amazon S3 Lifecycle rules](https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-lifecycle-mgmt.html) to set your own rules for moving objects to another Amazon S3 storage class, such as S3 Standard-IA or S3 One Zone-IA (see [Supported transitions and related constraints](https://docs.aws.amazon.com/AmazonS3/latest/userguide/lifecycle-transition-general-considerations.html#lifecycle-general-considerations-transition-sc) in the Amazon S3 documentation).

## Archive or delete historic snapshots
<a name="storage-snapshots"></a>

For every committed transaction (insert, update, merge into, compaction) to an Iceberg table, a new version or snapshot of the table is created. Over time, the number of versions and the number of metadata files in Amazon S3 accumulate.

Keeping snapshots of a table is required for features such as snapshot isolation, table rollback, and time travel queries. However, storage costs grow with the number of versions that you retain.

The following table describes the design patterns you can implement to manage costs based on your data retention requirements.


| **Design pattern** | **Solution** | **Use cases** | 
| --- |--- |--- |
| **Delete old snapshots** |   Use the [VACUUM statement](https://docs.aws.amazon.com/athena/latest/ug/vacuum-statement.html) in Athena to remove old snapshots. This operation doesn't incur any compute cost.    Alternatively, you can use Spark on Amazon EMR or AWS Glue to remove snapshots. For more information, see [expire\_snapshots](https://iceberg.apache.org/docs/latest/spark-procedures/#expire_snapshots) in the Iceberg documentation.   | This approach deletes snapshots that are no longer needed to reduce storage costs. You can configure how many snapshots should be retained or for how long, based on your data retention requirements. This option performs a hard delete of the snapshots. You can't roll back or time travel to expired snapshots. | 
| **Set retention policies for specific snapshots** |   Use tags to mark specific snapshots and define a retention policy in Iceberg. For more information, see [Historical Tags](https://iceberg.apache.org/docs/latest/branching/#historical-tags) in the Iceberg documentation. For example, you can retain one snapshot per month for one year by using the following SQL statement in Spark on Amazon EMR: <pre>ALTER TABLE glue_catalog.db.table <br />CREATE TAG 'EOM-01' AS OF VERSION 30 RETAIN 365 DAYS</pre>   Use Spark on Amazon EMR or AWS Glue to remove the remaining untagged, intermediate snapshots.   | This pattern is helpful for compliance with business or legal requirements that require you to show the state of a table at a given point in the past. By placing retention policies on specific tagged snapshots, you can remove other (untagged) snapshots that were created. This way, you can meet data retention requirements without retaining every single snapshot created. | 
| **Archive old snapshots** |   Use Amazon S3 tags to mark objects with Spark. (Amazon S3 tags are different from Iceberg tags; for more information, see the [Iceberg documentation](https://iceberg.apache.org/docs/latest/aws/#s3-tags).) For example: <pre>spark.sql.catalog.my_catalog.s3.delete-enabled=false and \<br />spark.sql.catalog.my_catalog.s3.delete.tags.my_key=to_archive</pre>   Use Spark on Amazon EMR or AWS Glue to [remove snapshots](https://iceberg.apache.org/docs/latest/spark-procedures/#expire_snapshots). When you use the settings in the example, this procedure tags objects and detaches them from the Iceberg table metadata instead of deleting them from Amazon S3.   Use S3 Lifecycle rules to transition objects tagged as `to_archive` to one of the [S3 Glacier storage classes](https://docs.aws.amazon.com/amazonglacier/latest/dev/introduction.html).   To query archived data:   [Restore the archived objects](https://docs.aws.amazon.com/AmazonS3/latest/userguide/restoring-objects.html) (this step isn't required if objects were transitioned to the S3 Glacier Instant Retrieval storage class).   Use the [register\_table procedure](https://iceberg.apache.org/docs/latest/spark-procedures/#register_table) in Iceberg to register the snapshot as a table in the catalog.    For detailed instructions, see the AWS blog post [Improve operational efficiencies of Apache Iceberg tables built on Amazon S3 data lakes](https://aws.amazon.com/blogs/big-data/improve-operational-efficiencies-of-apache-iceberg-tables-built-on-amazon-s3-data-lakes/).  | This pattern allows you to keep all table versions and snapshots at a lower cost. You cannot time travel or roll back to archived snapshots without first restoring those versions as new tables. This is typically acceptable for audit purposes. You can combine this approach with the previous design pattern, setting retention policies for specific snapshots. | 
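As a sketch of snapshot expiration with Spark, the following `expire_snapshots` call removes snapshots older than a cutoff while retaining a minimum number of recent ones. The table name, timestamp, and retention count are illustrative placeholders:

```python
# Expire snapshots older than the cutoff, keeping at least the 10 most recent.
expire_call = """
CALL glue_catalog.system.expire_snapshots(
  table => 'db.events',
  older_than => TIMESTAMP '2023-06-01 00:00:00',
  retain_last => 10)
"""
# With an active SparkSession: spark.sql(expire_call)
```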

## Delete orphan files
<a name="storage-orphan-files"></a>

In certain situations, Iceberg applications can fail before their transactions are committed, which leaves data files behind in Amazon S3. Because there was no commit, these files aren't associated with any table, so you might have to clean them up asynchronously.

To handle these deletions, you can use the [VACUUM statement](https://docs.aws.amazon.com/athena/latest/ug/vacuum-statement.html) in Amazon Athena. This statement removes snapshots and also deletes orphaned files. This is very cost-efficient, because Athena doesn't charge for the compute cost of this operation. Also, you don't have to schedule any additional operations when you use the `VACUUM` statement.

Alternatively, you can use Spark on Amazon EMR or AWS Glue to run the `remove_orphan_files` procedure. This operation has a compute cost and has to be scheduled independently. For more information, see the [Iceberg documentation](https://iceberg.apache.org/docs/latest/spark-procedures/#remove_orphan_files).
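The `remove_orphan_files` call can be sketched as follows; the table name and cutoff timestamp are illustrative placeholders:

```python
# Remove files that are older than the cutoff and not referenced by any
# table snapshot. A conservative cutoff avoids deleting in-flight writes.
orphan_call = """
CALL glue_catalog.system.remove_orphan_files(
  table => 'db.events',
  older_than => TIMESTAMP '2023-06-01 00:00:00')
"""
# With an active SparkSession: spark.sql(orphan_call)
```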

# Maintaining tables by using compaction
<a name="best-practices-compaction"></a>

Iceberg includes features that enable you to carry out [table maintenance operations](https://iceberg.apache.org/docs/latest/maintenance/) after writing data to the table. Some maintenance operations focus on streamlining metadata files, while others enhance how the data is clustered in the files so that query engines can efficiently locate the necessary information to respond to user requests. This section focuses on compaction-related optimizations.

## Iceberg compaction
<a name="iceberg-compaction"></a>

In Iceberg, you can use compaction to perform four tasks:
+ Combining small files into larger files that are generally over 100 MB in size. This technique is known as *bin packing*.
+ Merging delete files with data files. Delete files are generated by updates or deletes that use the merge-on-read approach.
+ (Re)sorting the data in accordance with query patterns. Data can be written without any sort order or with a sort order that is suitable for writes and updates.
+ Clustering the data by using space filling curves to optimize for distinct query patterns, particularly z-order sorting.

On AWS, you can run table compaction and maintenance operations for Iceberg through Amazon Athena or by using Spark in Amazon EMR or AWS Glue.

When you run compaction by using the [rewrite\_data\_files](https://iceberg.apache.org/docs/latest/spark-procedures/#rewrite_data_files) procedure, you can adjust several knobs to control the compaction behavior. The following diagram shows the default behavior of bin packing. Understanding bin packing compaction is key to understanding hierarchical sorting and Z-order sorting implementations, because they are extensions of the bin packing interface and operate in a similar manner. The main distinction is the additional step required for sorting or clustering the data.

![\[Default bin packing behavior in Iceberg tables\]](http://docs.aws.amazon.com/prescriptive-guidance/latest/apache-iceberg-on-aws/images/compaction.png)


In this example, the Iceberg table consists of four partitions. Each partition has a different size and different number of files. If you start a Spark application to run compaction, the application creates a total of four file groups to process. A file group is an Iceberg abstraction that represents a collection of files that will be processed by a single Spark job. That is, the Spark application that runs compaction will create four Spark jobs to process the data.

## Tuning compaction behavior
<a name="compaction-behavior"></a>

The following key properties control how data files are selected for compaction:
+ [MAX\_FILE\_GROUP\_SIZE\_BYTES](https://iceberg.apache.org/javadoc/1.2.0/org/apache/iceberg/actions/RewriteDataFiles.html#MAX_FILE_GROUP_SIZE_BYTES) sets the data limit for a single file group (Spark job) at 100 GB by default. This property is especially important for tables without partitions or tables with partitions that span hundreds of gigabytes. By setting this limit, you can break down operations to plan work and make progress while preventing resource exhaustion on the cluster.

  Note: Each file group is sorted separately. Therefore, if you want to perform a partition-level sort, you must adjust this limit to match the partition size.
+ [MIN\_FILE\_SIZE\_BYTES](https://iceberg.apache.org/javadoc/1.2.0/org/apache/iceberg/actions/BinPackStrategy.html#MIN_FILE_SIZE_BYTES) or [MIN\_FILE\_SIZE\_DEFAULT\_RATIO](https://iceberg.apache.org/javadoc/1.2.0/org/apache/iceberg/actions/BinPackStrategy.html#MIN_FILE_SIZE_DEFAULT_RATIO) defaults to 75 percent of the target file size set at the table level. For example, if a table has a target size of 512 MB, any file that is smaller than 384 MB is included in the set of files that will be compacted.
+ [MAX\$1FILE\$1SIZE\$1BYTES](https://iceberg.apache.org/javadoc/1.2.0/org/apache/iceberg/actions/BinPackStrategy.html#MAX_FILE_SIZE_BYTES) or [MAX\$1FILE\$1SIZE\$1DEFAULT\$1RATIO](https://iceberg.apache.org/javadoc/1.2.0/org/apache/iceberg/actions/BinPackStrategy.html#MAX_FILE_SIZE_DEFAULT_RATIO) defaults to 180 percent of the target file size. As with the two properties that set minimum file sizes, these properties are used to identify candidate files for the compaction job.
+ [MIN\$1INPUT\$1FILES](https://iceberg.apache.org/javadoc/1.2.0/org/apache/iceberg/actions/BinPackStrategy.html#MIN_INPUT_FILES) specifies the minimum number of files to be compacted if a table partition size is smaller than the target file size. The value of this property is used to determine whether it is worthwhile to compact the files based on the number of files (defaults to 5).
+ [DELETE\$1FILE\$1THRESHOLD](https://iceberg.apache.org/javadoc/1.2.0/org/apache/iceberg/actions/BinPackStrategy.html#DELETE_FILE_THRESHOLD) specifies the minimum number of delete operations for a file before it's included in compaction. Unless you specify otherwise, compaction doesn't combine delete files with data files. To enable this functionality, you must set a threshold value by using this property. This threshold is specific to individual data files, so if you set it to 3, a data file will be rewritten only if there are three or more delete files that reference it.
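The candidate-selection thresholds described above can be sketched in a few lines of Python. This is an illustrative model, not Iceberg's actual implementation; the function name is an assumption for the example, the size ratios are the documented defaults, and the delete-file threshold of 3 mirrors the scenario in the last bullet.

```python
# Sketch of how the bin-pack planner picks candidate files, assuming the
# default ratios (75% / 180% of the target file size) and an explicitly
# configured delete-file threshold of 3. Illustrative only.

TARGET_FILE_SIZE = 512 * 1024 * 1024          # table-level target: 512 MB
MIN_FILE_SIZE = int(TARGET_FILE_SIZE * 0.75)  # MIN_FILE_SIZE_DEFAULT_RATIO -> 384 MB
MAX_FILE_SIZE = int(TARGET_FILE_SIZE * 1.80)  # MAX_FILE_SIZE_DEFAULT_RATIO -> 921.6 MB
DELETE_FILE_THRESHOLD = 3

def is_candidate(size_bytes, delete_file_count=0):
    """A file is rewritten if it's too small, too large, or referenced
    by enough delete files to cross the threshold."""
    return (
        size_bytes < MIN_FILE_SIZE
        or size_bytes > MAX_FILE_SIZE
        or delete_file_count >= DELETE_FILE_THRESHOLD
    )

mb = 1024 * 1024
print(is_candidate(100 * mb))      # small file -> True
print(is_candidate(500 * mb))      # within [384 MB, 921.6 MB] -> False
print(is_candidate(500 * mb, 3))   # right-sized, but 3 delete files -> True
```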

These properties provide insight into the formation of the file groups in the previous diagram.

For example, the partition labeled `month=01` includes two file groups because it exceeds the maximum size constraint of 100 GB. In contrast, the `month=02` partition contains a single file group because it's under 100 GB. The `month=03` partition doesn't satisfy the default minimum input file requirement of five files. As a result, it won't be compacted. Lastly, although the `month=04` partition doesn't contain enough data to form a single file of the desired size, the files will be compacted because the partition includes more than five small files.
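A simple greedy model approximates this file-group planning. The sketch below assumes the 100 GB group limit and five-file minimum from the properties above; the actual Iceberg planner is more sophisticated, so treat the function as illustrative.

```python
# Illustrative model of file-group formation per partition: a partition with
# fewer than MIN_INPUT_FILES files is skipped; otherwise its files are packed
# into groups of at most MAX_FILE_GROUP_SIZE_BYTES each.

MAX_FILE_GROUP_SIZE = 100 * 1024**3  # 100 GB default
MIN_INPUT_FILES = 5

def plan_file_groups(file_sizes):
    """Return the number of file groups for one partition, or 0 if skipped."""
    if len(file_sizes) < MIN_INPUT_FILES:
        return 0
    groups, current = 1, 0
    for size in sorted(file_sizes, reverse=True):
        if current + size > MAX_FILE_GROUP_SIZE:
            groups += 1          # start a new file group (a new Spark job)
            current = size
        else:
            current += size
    return groups

gb = 1024**3
print(plan_file_groups([10 * gb] * 15))      # 150 GB partition -> 2 groups
print(plan_file_groups([64 * 1024**2] * 4))  # only 4 files -> skipped (0)
print(plan_file_groups([32 * 1024**2] * 6))  # 6 small files -> 1 group
```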

You can set these parameters for Spark running on Amazon EMR or AWS Glue. For Amazon Athena, you can manage similar properties by using the [table properties](https://docs.aws.amazon.com/athena/latest/ug/querying-iceberg-creating-tables.html#querying-iceberg-table-properties) that start with the prefix `optimize_`.

## Running compaction with Spark on Amazon EMR or AWS Glue
<a name="compaction-emr-glue"></a>

This section describes how to properly size a Spark cluster to run Iceberg's compaction utility. The following example uses Amazon EMR Serverless, but you can apply the same methodology to Amazon EMR on EC2, Amazon EMR on EKS, or AWS Glue.

You can take advantage of the correlation between file groups and Spark jobs to plan the cluster resources. To process the file groups sequentially, considering the maximum size of 100 GB per file group, you can set the following [Spark properties](https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/jobs-spark.html#spark-defaults):
+ `spark.dynamicAllocation.enabled` = `FALSE`
+ `spark.executor.memory` = `20 GB`
+ `spark.executor.instances` = `5`

If you want to speed up compaction, you can scale horizontally by increasing the number of file groups that are compacted in parallel. You can also scale Amazon EMR by using manual or dynamic scaling.
+ **Manually scaling** (for example, by a factor of 4)
  + `MAX_CONCURRENT_FILE_GROUP_REWRITES` = `4` (our factor)
  + `spark.executor.instances` = `5` (value used in the example) x `4` (our factor) = `20`
  + `spark.dynamicAllocation.enabled` = `FALSE`
+ **Dynamic scaling**
  + `spark.dynamicAllocation.enabled` = `TRUE` (default, no action required)
  + [MAX_CONCURRENT_FILE_GROUP_REWRITES](https://iceberg.apache.org/javadoc/1.2.0/org/apache/iceberg/actions/RewriteDataFiles.html#MAX_CONCURRENT_FILE_GROUP_REWRITES) = `N` (align this value with `spark.dynamicAllocation.maxExecutors`, which is 100 by default; based on the executor configurations in the example, you can set `N` to 20)

  These are guidelines to help size the cluster. However, you should also monitor the performance of your Spark jobs to find the best settings for your workloads.
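The arithmetic behind these settings can be made explicit. This is a back-of-the-envelope sizing sketch using the numbers from this section; the idea that a file group's combined executor memory should roughly match its 100 GB data limit is a heuristic for this example, not a Spark requirement.

```python
# Back-of-the-envelope cluster sizing for the compaction example above.
# All numbers come from this section; adjust them to your workload.

executors_per_file_group = 5   # spark.executor.instances for one file group
executor_memory_gb = 20        # spark.executor.memory
concurrent_rewrites = 4        # MAX_CONCURRENT_FILE_GROUP_REWRITES (scaling factor)

# Memory available to one file group, sized to roughly cover its 100 GB of data.
memory_per_group_gb = executors_per_file_group * executor_memory_gb
print(memory_per_group_gb)     # 100

# Total executors needed when scaling out by the concurrency factor.
total_executors = executors_per_file_group * concurrent_rewrites
print(total_executors)         # 20
```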

## Running compaction with Amazon Athena
<a name="compaction-athena"></a>

Athena offers an implementation of Iceberg's compaction utility as a managed feature through the [OPTIMIZE statement](https://docs.aws.amazon.com/athena/latest/ug/optimize-statement.html). You can use this statement to run compaction without having to evaluate the infrastructure.

This statement groups small files into larger files by using the bin packing algorithm and merges delete files with existing data files. To cluster the data by using hierarchical sorting or z-order sorting, use Spark on Amazon EMR or AWS Glue.

You can change the default behavior of the `OPTIMIZE` statement at table creation by passing table properties in the `CREATE TABLE` statement, or after table creation by using the `ALTER TABLE` statement. For default values, see the [Athena documentation](https://docs.aws.amazon.com/athena/latest/ug/querying-iceberg-creating-tables.html#querying-iceberg-table-properties).

## Recommendations for running compaction
<a name="compaction-recommendations"></a>


| **Use case** | **Recommendation** | 
| --- |--- |
| **Running bin packing compaction based on a schedule** |   Use the `OPTIMIZE` statement in Athena if you don't know how many small files your table contains. The Athena pricing model is based on the data scanned, so if there are no files to be compacted, there is no cost associated with these operations. To avoid encountering timeouts on Athena tables, run `OPTIMIZE` on a per-table-partition basis.   Use Amazon EMR or AWS Glue with dynamic scaling when you expect large volumes of small files to be compacted.   | 
| **Running bin packing compaction based on events** |   Use Amazon EMR or AWS Glue with dynamic scaling when you expect large volumes of small files to be compacted.   | 
| **Running compaction to sort data** |   Use Amazon EMR or AWS Glue, because sorting is an expensive operation and might need to spill data to disk.   | 
| **Running compaction to cluster the data using z-order sorting** |   Use Amazon EMR or AWS Glue, because z-order sorting is a very expensive operation and might need to spill data to disk.   | 
| **Running compaction on partitions that might be updated by other applications because of late-arriving data** |   Use Amazon EMR or AWS Glue. Enable the Iceberg [PARTIAL_PROGRESS_ENABLED](https://iceberg.apache.org/javadoc/1.2.0/org/apache/iceberg/actions/RewriteDataFiles.html#PARTIAL_PROGRESS_ENABLED) property. When you use this option, Iceberg splits the compaction output into multiple commits. If there is a collision (that is, if the data file is updated while compaction is running), this setting reduces the cost of retry by limiting it to the commit that includes the affected file. Otherwise, you might have to recompact all files.   | 
| **Running compaction on cold partitions (data partitions that no longer receive active writes)** |   Use Amazon EMR or AWS Glue. In the `rewrite_data_files` procedure, specify a `where` predicate that excludes actively written partitions. This strategy prevents data conflicts between writers and compaction jobs, and leaves only metadata conflicts that Iceberg can automatically resolve.    | 

# Using Iceberg workloads in Amazon S3
<a name="best-practices-workloads"></a>

This section discusses Iceberg properties that you can use to optimize Iceberg's interaction with Amazon S3.

## Prevent hot partitioning (HTTP 503 errors)
<a name="workloads-503"></a>

Some data lake applications that run on Amazon S3 handle millions or billions of objects and process petabytes of data. This can lead to prefixes that receive a high volume of traffic, which are typically detected through HTTP 503 (service unavailable) errors. To prevent this issue, use the following Iceberg properties:
+ Set `write.distribution-mode` to `hash` or `range` so that Iceberg writes large files, which results in fewer Amazon S3 requests. This is the preferred configuration and should address the majority of cases.
+ If you continue to experience 503 errors due to an immense volume of data in your workloads, you can set `write.object-storage.enabled` to `true` in Iceberg. This instructs Iceberg to hash object names and distribute the load across multiple, randomized Amazon S3 prefixes.

For more information about these properties, see [Write properties](https://iceberg.apache.org/docs/latest/configuration/#write-properties) in the Iceberg documentation.
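As a minimal sketch, both properties can be applied through a Spark SQL `ALTER TABLE ... SET TBLPROPERTIES` statement. The catalog, database, and table names below are placeholders, and the snippet only builds the DDL string so you can inspect it before running it through a Spark session.

```python
# Hypothetical table names; replace with your catalog, database, and table.
props = {
    "write.distribution-mode": "hash",       # fewer, larger files per write
    "write.object-storage.enabled": "true",  # hash-based, distributed S3 prefixes
}

tblproperties = ", ".join(f"'{k}'='{v}'" for k, v in sorted(props.items()))
ddl = f"ALTER TABLE glue_catalog.db.sales SET TBLPROPERTIES ({tblproperties})"
print(ddl)  # pass this to spark.sql(...) in a configured Spark session
```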

## Use Iceberg maintenance operations to release unused data
<a name="workloads-unused-data"></a>

To manage Iceberg tables, you can use the Iceberg core API, Iceberg clients (such as Spark), or managed services such as Amazon Athena. To delete old or unused files from Amazon S3, we recommend that you only use Iceberg native APIs to [remove snapshots](https://iceberg.apache.org/docs/latest/maintenance/#expire-snapshots), [remove old metadata files](https://iceberg.apache.org/docs/latest/maintenance/#remove-old-metadata-files), and [delete orphan files](https://iceberg.apache.org/docs/latest/maintenance/#delete-orphan-files).

Using Amazon S3 APIs through Boto3, the Amazon S3 SDK, or the AWS Command Line Interface (AWS CLI), or using any other non-Iceberg method to overwrite or remove Amazon S3 files for an Iceberg table, leads to table corruption and query failures.

## Replicate data across AWS Regions
<a name="workloads-replication"></a>

When you store Iceberg tables in Amazon S3, you can use the built-in features in Amazon S3, such as [Cross-Region Replication (CRR)](https://docs.aws.amazon.com/AmazonS3/latest/userguide/replication.html) and [Multi-Region Access Points (MRAP)](https://docs.aws.amazon.com/AmazonS3/latest/userguide/MultiRegionAccessPoints.html), to replicate data across multiple AWS Regions. MRAP provides a global endpoint for applications to access S3 buckets that are located in multiple AWS Regions. Iceberg doesn't support relative paths, but you can use MRAP to perform Amazon S3 operations by mapping buckets to access points. MRAP also integrates seamlessly with the Amazon S3 Cross-Region Replication process, which introduces a lag of up to 15 minutes. You have to replicate both data and metadata files.

**Important**  
Currently, Iceberg integration with MRAP works only with Apache Spark. If you need to fail over to the secondary AWS Region, you have to plan to redirect user queries to a Spark SQL environment (such as Amazon EMR) in the failover Region.

The CRR and MRAP features help you build a cross-Region replication solution for Iceberg tables, as illustrated in the following diagram.

![\[Cross-region replication for Iceberg tables\]](http://docs.aws.amazon.com/prescriptive-guidance/latest/apache-iceberg-on-aws/images/cross-region-replication.png)


To set up this cross-Region replication architecture:

1. Create tables by using the MRAP location. This ensures that Iceberg metadata files point to the MRAP location instead of the physical bucket location.

1. Replicate Iceberg files by using Amazon S3 MRAP. MRAP supports data replication with a service-level agreement (SLA) of 15 minutes. Iceberg prevents read operations from introducing inconsistencies during replication.

1. Make the tables available in the AWS Glue Data Catalog in the secondary Region. You can choose from two options:
   + Set up a pipeline for replicating Iceberg table metadata by using AWS Glue Data Catalog replication. This utility is available in the GitHub [Glue Catalog and Lake Formation Permissions replication](https://github.com/aws-samples/lake-formation-pemissions-sync) repository. This event-driven mechanism replicates tables in the target Region based on event logs.
   + Register the tables in the secondary Region when you need to fail over. For this option, you can use the previous utility or the Iceberg [register_table procedure](https://iceberg.apache.org/docs/latest/spark-procedures/#register_table) and point it to the latest `metadata.json` file.