Optimize Iceberg tables
Athena provides several optimization features to improve query performance on Apache Iceberg tables. As data accumulates, queries can become less efficient due to increased file processing overhead and the computational cost of applying row-level deletes stored in Iceberg delete files. To address these challenges, Athena supports manual compaction and vacuum operators to optimize table structure. Athena also works with Iceberg statistics to enable cost-based query optimization and Parquet column indexing for precise data pruning during query execution. These features work together to reduce query execution time, minimize data scanning, and lower costs. This topic describes how to use these optimization capabilities to maintain high-performance queries on your Iceberg tables.
OPTIMIZE
The OPTIMIZE compaction
action rewrites data files into a more optimized layout based on their size and
number of associated delete files. For syntax and table property details, see OPTIMIZE.table REWRITE DATA
Example
The following example merges delete files into data files and produces files
near the targeted file size where the value of category is
c1.
OPTIMIZE iceberg_table REWRITE DATA USING BIN_PACK WHERE category = 'c1'
VACUUM
VACUUM performs snapshot expiration
Example
The following example uses a table property to configure the table
iceberg_table to retain the last three days of data, then uses
VACUUM to expire the old snapshots and remove the orphan files
from the table.
ALTER TABLE iceberg_table SET TBLPROPERTIES ( 'vacuum_max_snapshot_age_seconds'='259200' ) VACUUM iceberg_table
Use Iceberg table statistics
Athena's cost-based optimizer uses Iceberg statistics to produce optimal query plans. When statistics have been generated for your Iceberg tables, Athena automatically uses this information to make intelligent decisions about join ordering, filters, and aggregation behavior, often improving query performance and reducing costs.
Iceberg statistics are turned on by default when you use S3 Tables. For other Iceberg tables, Athena uses the table property use_iceberg_statistics to
determine whether to leverage statistics for cost-based optimization. To get started, see
Optimizing query performance using column statistics in
the AWS Glue User Guide or use the Athena console to generate on-demand statistics on your Iceberg tables.
Use Parquet column indexing
Parquet column indexing makes it possible for Athena to perform more precise data pruning during query execution by leveraging page-level min/max statistics in addition to row group-level statistics. This allows Athena to skip unnecessary pages within row groups, significantly reducing the amount of data scanned and improving query performance. It works best for queries with selective filter predicates on sorted columns, improving both execution time and data scan efficiency while reducing the amount of data Athena needs to read from Amazon S3.
Athena uses Parquet column indexes by default with S3 Tables if column indexes are present in the underlying Parquet files. For other Iceberg tables,
Athena uses the use_iceberg_parquet_column_index property to determine whether to utilize the column indexes in the Parquet file. Set
this table property using the AWS Glue console or UpdateTable API.