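Accessing Amazon S3 tables with the Amazon S3 Tables Catalog for Apache Iceberg

You can access S3 tables from open source query engines such as Apache Spark by using the Amazon S3 Tables Catalog for Apache Iceberg client catalog. The Amazon S3 Tables Catalog for Apache Iceberg is an open source library hosted by AWS Labs. It works by translating Apache Iceberg operations in your query engine (such as table discovery, metadata updates, and adding or removing tables) into S3 Tables API operations.

The Amazon S3 Tables Catalog for Apache Iceberg is distributed as a Maven JAR named s3-tables-catalog-for-iceberg.jar. You can build the client catalog JAR from the AWS Labs GitHub repository or download it from Maven. When you connect to tables, the client catalog JAR is used as a dependency when you initialize a Spark session for Apache Iceberg.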

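Using the Amazon S3 Tables Catalog for Apache Iceberg with Apache Spark

When you initialize a Spark session, you can use the Amazon S3 Tables Catalog for Apache Iceberg client catalog to connect to tables from open source applications. In your session configuration, you specify the Iceberg and Amazon S3 dependencies and create a custom catalog that uses your table bucket as the metadata warehouse.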

Prerequisites

An IAM identity with access to your table bucket and S3 Tables actions. For more information, see Access management for S3 Tables.

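To initialize a Spark session using the Amazon S3 Tables Catalog for Apache Iceberg

Initialize Spark using the following command. To use the command, replace the Amazon S3 Tables Catalog for Apache Iceberg version number with the latest version from the AWS Labs GitHub repository, and replace the table bucket ARN with your own table bucket ARN.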


spark-shell \
--packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.6.1,software.amazon.s3tables:s3-tables-catalog-for-iceberg-runtime:0.1.4 \
--conf spark.sql.catalog.s3tablesbucket=org.apache.iceberg.spark.SparkCatalog \
--conf spark.sql.catalog.s3tablesbucket.catalog-impl=software.amazon.s3tables.iceberg.S3TablesCatalog \
--conf spark.sql.catalog.s3tablesbucket.warehouse=arn:aws:s3tables:us-east-1:111122223333:bucket/amzn-s3-demo-table-bucket \
--conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
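
If you run a packaged application instead of spark-shell, the same catalog settings can be applied when the session is built. The following is a minimal sketch, not an official sample: it assumes the Iceberg runtime and S3 Tables catalog JARs from the --packages option above are already on the application classpath, and that the job is launched with spark-submit so the master is supplied externally. The object name S3TablesSession and the application name are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

object S3TablesSession {
  def main(args: Array[String]): Unit = {
    // Placeholder ARN taken from the command above; replace it with your own.
    val tableBucketArn =
      "arn:aws:s3tables:us-east-1:111122223333:bucket/amzn-s3-demo-table-bucket"

    // Same configuration keys as the --conf options of the spark-shell command.
    val spark = SparkSession.builder()
      .appName("s3-tables-example") // hypothetical application name
      .config("spark.sql.catalog.s3tablesbucket",
        "org.apache.iceberg.spark.SparkCatalog")
      .config("spark.sql.catalog.s3tablesbucket.catalog-impl",
        "software.amazon.s3tables.iceberg.S3TablesCatalog")
      .config("spark.sql.catalog.s3tablesbucket.warehouse", tableBucketArn)
      .config("spark.sql.extensions",
        "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
      .getOrCreate()

    // List the namespaces in the table bucket to confirm the catalog resolves.
    spark.sql("SHOW NAMESPACES IN s3tablesbucket").show()

    spark.stop()
  }
}
```

The catalog name s3tablesbucket is only a label chosen in the configuration keys; whatever name you use there is the first part of every table name in your queries.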

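Querying S3 tables with Spark SQL

Using Spark, you can run DQL, DML, and DDL operations on S3 tables. When you query tables, use the fully qualified table name, including the session catalog name, following this pattern: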

CatalogName.NamespaceName.TableName
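
As an illustration of this naming pattern, the sketch below assembles a fully qualified name from its three parts and wraps each part in backticks, doubling any embedded backtick, which is how Spark SQL escapes one inside a quoted identifier. The TableNames object and its methods are hypothetical helpers, not part of Spark or the S3 Tables catalog.

```scala
object TableNames {
  // Wrap one identifier part in backticks, doubling any embedded backtick.
  def quote(part: String): String =
    "`" + part.replace("`", "``") + "`"

  // Join the parts in the order CatalogName.NamespaceName.TableName.
  def qualifiedName(catalog: String, namespace: String, table: String): String =
    Seq(catalog, namespace, table).map(quote).mkString(".")
}

// Usage inside a Spark session (requires the catalog configured above):
// val name = TableNames.qualifiedName("s3tablesbucket", "my_namespace", "my_table")
// spark.sql(s"SELECT * FROM $name").show()
```

Quoting every part is a conservative choice: it is not required for simple names such as my_table, but it keeps names that contain special characters from breaking the query.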

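The following example queries show some of the ways you can interact with S3 tables. To use these example queries in your query engine, replace the user input placeholder values with your own.

To query tables with Spark

Create a namespace.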


spark.sql(" CREATE NAMESPACE IF NOT EXISTS s3tablesbucket.my_namespace")

Create a table.


spark.sql(""" CREATE TABLE IF NOT EXISTS s3tablesbucket.my_namespace.`my_table`
( id INT, name STRING, value INT ) USING iceberg """)

Query a table.


spark.sql(" SELECT * FROM s3tablesbucket.my_namespace.`my_table` ").show()

Insert data into a table.


spark.sql(
"""
    INSERT INTO s3tablesbucket.my_namespace.my_table 
    VALUES 
        (1, 'ABC', 100), 
        (2, 'XYZ', 200)
""")

Load an existing data file into a table.

Read the data into Spark.


val data_file_location = "Path such as S3 URI to data file"
val data_file = spark.read.parquet(data_file_location)

Write the data into an Iceberg table.


data_file.writeTo("s3tablesbucket.my_namespace.my_table").using("Iceberg").tableProperty("format-version", "2").createOrReplace()
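
Note that createOrReplace() creates the table or replaces an existing one with the contents of the data frame. If the goal is to keep the rows already in the table, Spark's DataFrameWriterV2 also offers append(). The sketch below assumes the target table already exists and that the schema of the data frame is compatible with it; appendToTable is a hypothetical helper name.

```scala
import org.apache.spark.sql.DataFrame

// Append the rows of df to an existing table instead of replacing the table.
// The call fails if the table does not exist.
def appendToTable(df: DataFrame, tableName: String): Unit =
  df.writeTo(tableName).append()

// Usage in the spark-shell session above:
// appendToTable(data_file, "s3tablesbucket.my_namespace.my_table")
```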
