AWS HealthOmics variant store and annotation store availability change

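After careful consideration, we decided to close AWS HealthOmics variant stores and annotation stores to new customers starting November 7, 2025. If you would like to use variant stores and annotation stores, sign up before that date. Existing customers can continue to use the service as normal.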

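The following sections describe migration options to help you move your variant stores and analytics stores to new solutions. If you have any questions or concerns, create a support case at https://support.console.aws.amazon.com.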

Topics

Migration options overview
Migration options for ETL logic
Migration options for storage
Analytics
AWS Partners
Examples

Migration options overview

The following migration options provide alternatives to using variant stores and annotation stores:

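Use the HealthOmics-provided reference implementation of the ETL logic. Use S3 table buckets for storage and continue to use existing AWS analytics services.

Create a solution using a combination of existing AWS services. For ETL, you can write custom AWS Glue ETL jobs, or use open source HAIL or GLOW code on Amazon EMR to transform variant data. Use S3 table buckets for storage and continue to use existing AWS analytics services.

Select an AWS Partner that offers an alternative to variant stores and annotation stores.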

Migration options for ETL logic

Consider the following migration options for ETL logic:

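HealthOmics provides the current variant store ETL logic as a reference HealthOmics workflow. You can use this workflow's engine to power exactly the same variant data ETL process as the variant store, while retaining full control over the ETL logic. This reference workflow is available on request. To request access, create a support case at https://support.console.aws.amazon.com.

To transform variant data, you can write custom AWS Glue ETL jobs, or use open source HAIL or GLOW code on Amazon EMR.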
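
As a rough illustration of the kind of transformation a custom ETL job performs, the following Python sketch turns one VCF data line into one record per sample. It is not the HealthOmics reference workflow. The function names (parse_info, vcf_line_to_records) are hypothetical, the output field names are assumed to mirror the example table in the Examples section, and treating an ALT value of <NON_REF> as a reference block is an assumption based on GATK-style GVCF output.

```python
from typing import Any, Dict, List


def parse_info(info_field: str) -> Dict[str, str]:
    """Split a VCF INFO column into a key/value map; flags get an empty value."""
    if info_field in (".", ""):
        return {}
    result: Dict[str, str] = {}
    for entry in info_field.split(";"):
        if not entry:
            continue
        key, _, value = entry.partition("=")
        result[key] = value
    return result


def vcf_line_to_records(line: str, sample_names: List[str]) -> List[Dict[str, Any]]:
    """Convert one tab-separated VCF data line into one record per sample."""
    cols = line.rstrip("\n").split("\t")
    chrom, pos, variant_id, ref, alt, qual, flt, info = cols[:8]
    format_keys = cols[8].split(":") if len(cols) > 8 else []
    sample_cols = cols[9:]

    alts = [] if alt == "." else alt.split(",")
    info_map = parse_info(info)
    # Assumption: a lone <NON_REF> allele marks a GVCF reference block.
    is_reference_block = alts == ["<NON_REF>"]

    records = []
    for name, sample_col in zip(sample_names, sample_cols):
        # Per-sample FORMAT values; GT goes to its own column, the rest to attributes.
        attributes = dict(zip(format_keys, sample_col.split(":")))
        genotype = attributes.pop("GT", None)
        records.append({
            "sample_name": name,
            "variant_name": variant_id,
            "chrom": chrom,
            "pos": int(pos),
            "ref": ref,
            "alt": alts,
            "qual": None if qual == "." else float(qual),
            "filter": flt,
            "genotype": genotype,
            "info": dict(info_map),
            "attributes": attributes,
            "is_reference_block": is_reference_block,
        })
    return records
```

In an AWS Glue or Amazon EMR job, logic along these lines would typically run over each input line before the resulting records are written to an Iceberg table.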

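Migration options for storage

As an alternative to a service-hosted data store, you can use Amazon S3 table buckets to define a custom table schema. For more information about table buckets, see Table buckets in the Amazon S3 User Guide.

You can use table buckets for fully managed Iceberg tables in Amazon S3.

You can open a support case to request that the HealthOmics team migrate the data from your variant store or annotation store to the Amazon S3 table bucket that you configure.

After your data is populated in the Amazon S3 table bucket, you can delete your variant stores and annotation stores. For more information, see Deleting HealthOmics analytics stores.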

Analytics

For data analytics, continue to use AWS analytics services, such as Amazon Athena, Amazon EMR, Amazon Redshift, or Amazon Quick Suite.

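AWS Partners

You can work with an AWS Partner that provides customizable ETL, table schemas, built-in query and analysis tools, and user interfaces for interacting with the data.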

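Examples

The following examples show how to create tables suitable for storing VCF and GVCF data.

Athena DDL

You can use the following DDL example in Athena to create a table suitable for storing VCF and GVCF data in a single table. This example differs from the variant store schema, but it works well for general use cases.

When you create the table, substitute your own values for DATABASE_NAME and TABLE_NAME. Also replace {URL} with the Amazon S3 location for the table.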


CREATE TABLE <DATABASE_NAME>.<TABLE_NAME> (
  sample_name string,
  variant_name string COMMENT 'The ID field in VCF files; a period (.) indicates no name',
  chrom string,
  pos bigint,
  ref string,
  alt array<string>,
  qual double,
  filter string,
  genotype string,
  info map<string, string>,
  attributes map<string, string>,
  is_reference_block boolean COMMENT 'Used in GVCF for non-variant sites')
PARTITIONED BY (bucket(128, sample_name), chrom)
LOCATION '{URL}/'
TBLPROPERTIES (
  'table_type'='iceberg',
  'write_compression'='zstd'
);
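
After the table is populated, a common query retrieves one sample's variant calls within a genomic interval. The following Python sketch builds such a query as a string. The helper name build_region_query and the example database and table names are hypothetical. Filtering on sample_name and chrom with equality predicates matches the partitioning in the DDL above, so Iceberg can prune partitions.

```python
import re

# Assumption for this sketch: only simple unquoted identifiers are accepted.
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def _quote(value: str) -> str:
    """Render a SQL string literal, doubling any embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def build_region_query(database: str, table: str, sample_name: str,
                       chrom: str, start: int, end: int) -> str:
    """Build a query for one sample's variant calls in chrom:start-end."""
    for identifier in (database, table):
        if not _IDENTIFIER.match(identifier):
            raise ValueError(f"Unsupported identifier: {identifier!r}")
    if int(start) > int(end):
        raise ValueError("start must not be greater than end")

    return (
        "SELECT chrom, pos, ref, alt, genotype\n"
        f"FROM {database}.{table}\n"
        f"WHERE sample_name = {_quote(sample_name)}\n"
        f"  AND chrom = {_quote(chrom)}\n"
        f"  AND pos BETWEEN {int(start)} AND {int(end)}\n"
        # Skip GVCF reference blocks; a NULL flag is treated as a variant row.
        "  AND NOT coalesce(is_reference_block, false)\n"
        "ORDER BY pos"
    )


if __name__ == "__main__":
    print(build_region_query("variant_db", "genomic_variants",
                             "NA12878", "chr1", 1000000, 2000000))
```

You can paste the resulting SQL into the Athena query editor, or submit it programmatically, for example with the start_query_execution call of the boto3 Athena client.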

Create a table using Python (without Athena)

The following Python code example shows how to create the table without using Athena.


import boto3
from pyiceberg.catalog import Catalog, load_catalog
from pyiceberg.schema import Schema
from pyiceberg.table import Table
from pyiceberg.table.sorting import SortOrder, SortField, SortDirection, NullOrder
from pyiceberg.partitioning import PartitionSpec, PartitionField
from pyiceberg.transforms import IdentityTransform, BucketTransform
from pyiceberg.types import (
    NestedField,
    StringType,
    LongType,
    DoubleType,
    MapType,
    BooleanType,
    ListType
)


def load_s3_tables_catalog(bucket_arn: str) -> Catalog:
    session = boto3.session.Session()
    region = session.region_name or 'us-east-1'
    
    catalog_config = {
        "type": "rest",
        "warehouse": bucket_arn,
        "uri": f"https://s3tables.{region}.amazonaws.com/iceberg",
        "rest.sigv4-enabled": "true",
        "rest.signing-name": "s3tables",
        "rest.signing-region": region
    }
    
    return load_catalog("s3tables", **catalog_config)


def create_namespace(catalog: Catalog, namespace: str) -> None:
    try:
        catalog.create_namespace(namespace)
        print(f"Created namespace: {namespace}")
    except Exception as e:
        # Treat an existing namespace as success; re-raise anything else.
        if "already exists" in str(e):
            print(f"Namespace {namespace} already exists.")
        else:
            raise


def create_table(catalog: Catalog, namespace: str, table_name: str, schema: Schema, 
                partition_spec: PartitionSpec = None, sort_order: SortOrder = None) -> Table:
    if catalog.table_exists(f"{namespace}.{table_name}"):
        print(f"Table {namespace}.{table_name} already exists.")
        return catalog.load_table(f"{namespace}.{table_name}")
    
    create_table_args = {
        "identifier": f"{namespace}.{table_name}",
        "schema": schema,
        "properties": {"format-version": "2"}
    }
    
    if partition_spec is not None:
        create_table_args["partition_spec"] = partition_spec
    if sort_order is not None:
        create_table_args["sort_order"] = sort_order
    
    table = catalog.create_table(**create_table_args)
    print(f"Created table: {namespace}.{table_name}")
    return table


def main(bucket_arn: str, namespace: str, table_name: str):
    # Schema definition
    genomic_variants_schema = Schema(
        NestedField(1, "sample_name", StringType(), required=True),
        NestedField(2, "variant_name", StringType(), required=True),
        NestedField(3, "chrom", StringType(), required=True),
        NestedField(4, "pos", LongType(), required=True),
        NestedField(5, "ref", StringType(), required=True),
        NestedField(6, "alt", ListType(element_id=1000, element_type=StringType(), element_required=True), required=True),
        NestedField(7, "qual", DoubleType()),
        NestedField(8, "filter", StringType()),
        NestedField(9, "genotype", StringType()),
        NestedField(10, "info", MapType(key_type=StringType(), key_id=1001, value_type=StringType(), value_id=1002)),
        NestedField(11, "attributes", MapType(key_type=StringType(), key_id=2001, value_type=StringType(), value_id=2002)),
        NestedField(12, "is_reference_block", BooleanType()),
        identifier_field_ids=[1, 2, 3, 4]
    )
    
    # Partition and sort specifications
    partition_spec = PartitionSpec(
        PartitionField(source_id=1, field_id=1001, transform=BucketTransform(128), name="sample_bucket"),
        PartitionField(source_id=3, field_id=1002, transform=IdentityTransform(), name="chrom")
    )
    
    sort_order = SortOrder(
        SortField(source_id=3, transform=IdentityTransform(), direction=SortDirection.ASC, null_order=NullOrder.NULLS_LAST),
        SortField(source_id=4, transform=IdentityTransform(), direction=SortDirection.ASC, null_order=NullOrder.NULLS_LAST)
    )
    
    # Connect to catalog and create table
    catalog = load_s3_tables_catalog(bucket_arn)
    create_namespace(catalog, namespace)
    table = create_table(catalog, namespace, table_name, genomic_variants_schema, partition_spec, sort_order)
    
    return table


if __name__ == "__main__":
    bucket_arn = 'arn:aws:s3tables:<REGION>:<ACCOUNT_ID>:bucket/<TABLE_BUCKET_NAME>'
    namespace = "variant_db"
    table_name = "genomic_variants"
    
    main(bucket_arn, namespace, table_name)
