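AWS HealthOmics variant store and annotation store availability change

After careful consideration, we decided to close AWS HealthOmics variant stores and annotation stores to new customers starting November 7, 2025. If you would like to use variant stores and annotation stores, sign up before that date. Existing customers can continue to use the service as normal.

The following sections describe migration options to help you move your variant stores and analytics stores to new solutions. For any questions or concerns, create a support case at support.console.aws.amazon.com.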

Topics

Overview of migration options
Migration options for ETL logic
Migration options for storage
Analytics
AWS partners
Examples

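Overview of migration options

The following migration options provide alternatives to using variant stores and annotation stores:

Use the reference implementation of the ETL logic that HealthOmics provides. Use S3 table buckets for storage and continue to use existing AWS analytics services.

Create a solution from a combination of existing AWS services. For ETL, you can write custom Glue ETL jobs, or use open source HAIL or GLOW code on EMR, to transform variant data. Use S3 table buckets for storage and continue to use existing AWS analytics services.

Select an AWS partner that offers a variant and annotation store alternative.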

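Migration options for ETL logic

Consider the following migration options for ETL logic:

HealthOmics provides the current variant store ETL logic as a reference HealthOmics workflow. You can use the engine of this workflow to run exactly the same variant data ETL process as the variant store, but with full control over the ETL logic.

This reference workflow is available on request. To request access, create a support case at support.console.aws.amazon.com.
To transform variant data, you can write custom Glue ETL jobs, or use open source HAIL or GLOW code on EMR.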
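
To make the transform step concrete, the following is a minimal pure-Python sketch of the kind of flattening a custom ETL job performs: it turns one VCF data line into one record per sample. It is not the HealthOmics reference workflow, and it does not use Glue, HAIL, or GLOW. The function name parse_vcf_line and the output field names are illustrative assumptions that mirror the table schema in the Examples section of this page; the sketch omits header parsing and the is_reference_block flag.

```python
from typing import Dict, List


def parse_vcf_line(line: str, sample_names: List[str]) -> List[Dict[str, object]]:
    """Turn one tab-separated VCF data line into one flat record per sample.

    Hypothetical helper for illustration only; sample_names must be given in
    the same order as the sample columns of the VCF header line.
    """
    fields = line.rstrip("\n").split("\t")
    chrom, pos, variant_id, ref, alt, qual, filt, info = fields[:8]

    # INFO is a ';'-separated list of KEY=VALUE pairs or bare flags.
    info_map: Dict[str, str] = {}
    if info != ".":
        for item in info.split(";"):
            key, _, value = item.partition("=")
            info_map[key] = value  # bare flags get an empty string value

    # FORMAT (column 9) names the per-sample keys, for example "GT:DP".
    format_keys = fields[8].split(":") if len(fields) > 8 else []

    records = []
    for sample_name, sample_field in zip(sample_names, fields[9:]):
        attributes = dict(zip(format_keys, sample_field.split(":")))
        genotype = attributes.pop("GT", None)  # keep GT in its own column
        records.append({
            "sample_name": sample_name,
            "variant_name": variant_id,
            "chrom": chrom,
            "pos": int(pos),
            "ref": ref,
            "alt": [] if alt == "." else alt.split(","),
            "qual": None if qual == "." else float(qual),
            "filter": filt,
            "genotype": genotype,
            "info": dict(info_map),  # copy so records do not share state
            "attributes": attributes,
        })
    return records
```

A production job would also need to read the VCF header, handle multi-sample and GVCF specifics, and write the records to the target table, which is where Glue, HAIL, or GLOW would come in.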

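Migration options for storage

As a replacement for the service-hosted data store, you can use Amazon S3 table buckets to define a custom table schema. For more information about table buckets, see Table buckets in the Amazon S3 User Guide.

You can use table buckets for fully managed Iceberg tables in Amazon S3.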
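
As one possible way to provision the table bucket programmatically, the sketch below looks up a table bucket by name and creates it if it does not exist. It assumes a boto3 "s3tables" client is passed in; the helper name get_or_create_table_bucket is hypothetical, and the request and response field names (tableBuckets, continuationToken, arn) should be checked against the current boto3 reference before you rely on them.

```python
from typing import Any


def get_or_create_table_bucket(s3tables_client: Any, name: str) -> str:
    """Return the ARN of the table bucket called `name`, creating it if needed.

    Hypothetical helper; `s3tables_client` is assumed to behave like
    boto3.client("s3tables").
    """
    token = None
    while True:
        # Page through existing table buckets, looking for a name match.
        kwargs = {"continuationToken": token} if token else {}
        page = s3tables_client.list_table_buckets(**kwargs)
        for bucket in page.get("tableBuckets", []):
            if bucket["name"] == name:
                return bucket["arn"]
        token = page.get("continuationToken")
        if not token:
            break

    # No existing bucket matched, so create one and return its ARN.
    return s3tables_client.create_table_bucket(name=name)["arn"]
```

With real credentials, a call might look like get_or_create_table_bucket(boto3.client("s3tables"), "my-variant-tables"), where the bucket name is a placeholder. The returned ARN is the value that the Python example later on this page expects as bucket_arn.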

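You can open a support case to request that the HealthOmics team migrate the data from your variant or annotation store to the Amazon S3 table bucket that you configured.

After your data is populated in the Amazon S3 table bucket, you can delete your variant stores and annotation stores. For more information, see Deleting HealthOmics analytics stores.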
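
If you prefer to script the cleanup, the following sketch requests deletion of one variant store and one annotation store. It assumes a boto3 "omics" client is passed in; the wrapper name delete_analytics_stores and the store names are hypothetical. The force flag is passed through unchanged, so check the HealthOmics API reference for its exact effect. Deletion is destructive, so confirm that the migrated data is complete before you run anything like this.

```python
from typing import Any, Dict


def delete_analytics_stores(omics_client: Any, variant_store: str,
                            annotation_store: str,
                            force: bool = False) -> Dict[str, str]:
    """Request deletion of one variant store and one annotation store.

    Hypothetical wrapper; `omics_client` is assumed to behave like
    boto3.client("omics"). Returns the status reported for each request.
    """
    variant_resp = omics_client.delete_variant_store(
        name=variant_store, force=force)
    annotation_resp = omics_client.delete_annotation_store(
        name=annotation_store, force=force)
    return {
        "variant_store": variant_resp["status"],
        "annotation_store": annotation_resp["status"],
    }
```

A call might look like delete_analytics_stores(boto3.client("omics"), "my_variant_store", "my_annotation_store"), with both store names replaced by your own.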

Analytics

For data analytics, continue to use AWS analytics services, such as Amazon Athena, Amazon EMR, Amazon Redshift, or Amazon Quick Suite.
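
As an illustration of querying migrated data with Athena, the sketch below builds a region query and submits it through a boto3 Athena client. The function names, the column names (taken from the example schema in the Examples section), and the database and table names are assumptions for illustration. Depending on how your table is cataloged, the FROM clause might also need a catalog prefix.

```python
import re
from typing import Any

_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")
_CHROM = re.compile(r"^[A-Za-z0-9_.]+$")


def build_region_query(database: str, table: str, chrom: str,
                       start: int, end: int) -> str:
    """Build a SQL query for variants on `chrom` between `start` and `end`."""
    # Validate names because they are interpolated into the SQL text.
    for name in (database, table):
        if not _IDENTIFIER.match(name):
            raise ValueError(f"Invalid identifier: {name!r}")
    if not _CHROM.match(chrom):
        raise ValueError(f"Invalid chromosome name: {chrom!r}")
    return (
        f"SELECT sample_name, chrom, pos, ref, alt, genotype "
        f"FROM {database}.{table} "
        f"WHERE chrom = '{chrom}' AND pos BETWEEN {int(start)} AND {int(end)} "
        f"ORDER BY pos"
    )


def start_region_query(athena_client: Any, query: str,
                       output_location: str) -> str:
    """Submit `query` and return the Athena query execution ID.

    `athena_client` is assumed to behave like boto3.client("athena");
    `output_location` is an S3 URI where Athena writes query results.
    """
    response = athena_client.start_query_execution(
        QueryString=query,
        ResultConfiguration={"OutputLocation": output_location},
    )
    return response["QueryExecutionId"]
```

Filtering on chrom matches the partitioning used in the example table definitions, which may let the query engine skip data for unrelated chromosomes.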

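AWS partners

You can work with an AWS partner that offers customizable ETL, table schemas, built-in query and analysis tools, and user interfaces for interacting with data.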

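Examples

The following examples show how to create tables suitable for storing VCF and GVCF data.

Athena DDL

You can use the following DDL example in Athena to create a table suitable for storing VCF and GVCF data in a single table. This example is not an exact equivalent of the variant store structure, but it works well for generic use cases.

When you create the table, supply your own values for DATABASE_NAME and TABLE_NAME, and replace {URL} with the Amazon S3 location for the table.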


CREATE TABLE <DATABASE_NAME>.<TABLE_NAME> (
  sample_name string,
  variant_name string COMMENT 'The ID field in VCF files; a period (.) indicates no name',
  chrom string,
  pos bigint,
  ref string,
  alt array<string>,
  qual double,
  filter string,
  genotype string,
  info map<string, string>,
  attributes map<string, string>,
  is_reference_block boolean COMMENT 'Used in GVCF for non-variant sites')
PARTITIONED BY (bucket(128, sample_name), chrom)
LOCATION '{URL}/'
TBLPROPERTIES (
  'table_type'='iceberg',
  'write_compression'='zstd'
);

Create tables using Python (without Athena)

The following Python code example shows how to create the table without using Athena.


import boto3
from pyiceberg.catalog import Catalog, load_catalog
from pyiceberg.schema import Schema
from pyiceberg.table import Table
from pyiceberg.table.sorting import SortOrder, SortField, SortDirection, NullOrder
from pyiceberg.partitioning import PartitionSpec, PartitionField
from pyiceberg.transforms import IdentityTransform, BucketTransform
from pyiceberg.types import (
    NestedField,
    StringType,
    LongType,
    DoubleType,
    MapType,
    BooleanType,
    ListType
)


def load_s3_tables_catalog(bucket_arn: str) -> Catalog:
    session = boto3.session.Session()
    region = session.region_name or 'us-east-1'
    
    catalog_config = {
        "type": "rest",
        "warehouse": bucket_arn,
        "uri": f"https://s3tables.{region}.amazonaws.com/iceberg",
        "rest.sigv4-enabled": "true",
        "rest.signing-name": "s3tables",
        "rest.signing-region": region
    }
    
    return load_catalog("s3tables", **catalog_config)


def create_namespace(catalog: Catalog, namespace: str) -> None:
    try:
        catalog.create_namespace(namespace)
        print(f"Created namespace: {namespace}")
    except Exception as e:
        if "already exists" in str(e):
            print(f"Namespace {namespace} already exists.")
        else:
            raise e


def create_table(catalog: Catalog, namespace: str, table_name: str, schema: Schema, 
                partition_spec: PartitionSpec = None, sort_order: SortOrder = None) -> Table:
    if catalog.table_exists(f"{namespace}.{table_name}"):
        print(f"Table {namespace}.{table_name} already exists.")
        return catalog.load_table(f"{namespace}.{table_name}")
    
    create_table_args = {
        "identifier": f"{namespace}.{table_name}",
        "schema": schema,
        "properties": {"format-version": "2"}
    }
    
    if partition_spec is not None:
        create_table_args["partition_spec"] = partition_spec
    if sort_order is not None:
        create_table_args["sort_order"] = sort_order
    
    table = catalog.create_table(**create_table_args)
    print(f"Created table: {namespace}.{table_name}")
    return table


def main(bucket_arn: str, namespace: str, table_name: str):
    # Schema definition
    genomic_variants_schema = Schema(
        NestedField(1, "sample_name", StringType(), required=True),
        NestedField(2, "variant_name", StringType(), required=True),
        NestedField(3, "chrom", StringType(), required=True),
        NestedField(4, "pos", LongType(), required=True),
        NestedField(5, "ref", StringType(), required=True),
        NestedField(6, "alt", ListType(element_id=1000, element_type=StringType(), element_required=True), required=True),
        NestedField(7, "qual", DoubleType()),
        NestedField(8, "filter", StringType()),
        NestedField(9, "genotype", StringType()),
        NestedField(10, "info", MapType(key_type=StringType(), key_id=1001, value_type=StringType(), value_id=1002)),
        NestedField(11, "attributes", MapType(key_type=StringType(), key_id=2001, value_type=StringType(), value_id=2002)),
        NestedField(12, "is_reference_block", BooleanType()),
        identifier_field_ids=[1, 2, 3, 4]
    )
    
    # Partition and sort specifications
    partition_spec = PartitionSpec(
        PartitionField(source_id=1, field_id=1001, transform=BucketTransform(128), name="sample_bucket"),
        PartitionField(source_id=3, field_id=1002, transform=IdentityTransform(), name="chrom")
    )
    
    sort_order = SortOrder(
        SortField(source_id=3, transform=IdentityTransform(), direction=SortDirection.ASC, null_order=NullOrder.NULLS_LAST),
        SortField(source_id=4, transform=IdentityTransform(), direction=SortDirection.ASC, null_order=NullOrder.NULLS_LAST)
    )
    
    # Connect to catalog and create table
    catalog = load_s3_tables_catalog(bucket_arn)
    create_namespace(catalog, namespace)
    table = create_table(catalog, namespace, table_name, genomic_variants_schema, partition_spec, sort_order)
    
    return table


if __name__ == "__main__":
    bucket_arn = 'arn:aws:s3tables:<REGION>:<ACCOUNT_ID>:bucket/<TABLE_BUCKET_NAME>'
    namespace = "variant_db"
    table_name = "genomic_variants"
    
    main(bucket_arn, namespace, table_name)
