Exporting asset metadata - Amazon SageMaker Unified Studio

Exporting asset metadata

In the current release of Amazon SageMaker Unified Studio, you can export asset metadata as an Apache Iceberg table through Amazon S3 Tables. This allows data teams to query catalog inventory and answer questions - like the following: How many assets were registered last month?, Which assets are classified as confidential?, or Which assets lack business descriptions?, etc. using standard SQL without building custom ETL infrastructure for reporting.

This capability automatically converts catalog asset metadata into a queryable table accessible from Amazon Athena, Amazon SageMaker Unified Studio notebooks, AI agents, and other analytics and BI tools. The exported table includes technical metadata (such as resource_id, resource_type), business metadata (such as asset_name, business_description), ownership details, and timestamps. Data is partitioned by snapshot_date for query performance and automatically appears in Amazon SageMaker Unified Studio under the aws-sagemaker-catalog bucket.

Note

In the current release, you can enable exporting asset metadata only for one domain per AWS account per region. If you disable exporting asset metadata feature for a domain where it's already enabled, you cannot enable this feature for another domain in the same AWS account and region.

Also, encryption configuration for the exported asset table cannot be updated.

This capability is available in all AWS Regions where Amazon SageMaker Catalog is supported at no additional charge. You pay only for underlying services including S3 Tables storage and Amazon Athena queries. You can control storage costs by setting retention policies on S3 tables to automatically remove records older than your specified period. For more information, see https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-tables-record-expiration.html.

Start exporting asset metadata

To get started, activate dataset export by invoking the PutDataExportConfiguration API action, then access the asset table through S3 Tables or Amazon SageMaker Unified Studio's Data tab within 24 hours. Query using Amazon Athena, Studio notebooks, or connect external BI tools through the S3 Tables Iceberg REST Catalog endpoint.

Asset metadata is exported once a day around midnight local time per AWS region.

For more information, see KMS permissions for exporting asset metadata in Amazon SageMaker Unified Studio.

Enable data export:

aws datazone put-data-export-configuration --domain-identifier dzd-440699i00ezy21 --region us-east-2 --enable-export

With KMS key encryption configuration:

aws datazone put-data-export-configuration --encryption-configuration kmsKeyArn=arn:aws:kms:us-east-2:651673343886:key/292fedfe-c9h6-40fa-961b-87393584195c,sseAlgorithm=aws:kms --enable-export --region us-east-2 --domain-identifier dzd-440699i00ezy21

For more information, see the API reference documentation.

Asset table schema

The asset metadata is exported to the following table structure:

Column name Data type Comment
snapshot_time timestamp Timestamp when this metadata snapshot was captured from AWS Sagemaker. Partition key for temporal queries. Use for point-in-time analysis and time-travel queries (e.g., "show catalog state 30 days ago").
asset_id string Unique identifier assigned by AWS Sagemaker Catalog for this asset. Primary key for asset lookups.
resource_type_enum string Type of data resource cataloged. Values: GlueTable, RedshiftTable, S3Collection etc. Used for filtering by resource category.
resource_id string Platform-native unique identifier for the resource. For AWS: ARN format (e.g., arn:aws:glue:region:account:table/db/table). Use for cross-referencing with source systems.
account_id string Cloud account or organizational identifier. For AWS: 12-digit account ID. Used for multi-account queries and cost allocation.
region string Geographic region or availability zone where resource is hosted. For AWS: us-east-1, eu-west-1, etc. NULL for region-agnostic resources. Use for regional analysis and compliance queries.
catalog string Top-level namespace identifier. Meaning varies by resource_type: For GLUE_TABLE: AWS account ID. For REDSHIFT_TABLE: database name. For S3_COLLECTION: AWS account ID. Use for catalog-level grouping.
namespace string Mid-level namespace identifier. Meaning varies by resource_type: For GLUE_TABLE: database name. For REDSHIFT_TABLE: schema name. For S3_COLLECTION: bucket name. May contain multiple levels (e.g., database.schema).
asset_name string Business friendly name for this asset. Optional business identifier that can differ from the technical resource_name. NULL if no custom name was provided during asset registration. Examples: 'Customer Master Dataset', 'Q4 Sales Report', 'Production Orders Table'. Use for user-friendly display and business-oriented searches.
resource_name string Leaf-level resource identifier. For tables: table name. For S3: prefix path. For views: view name. Use for resource-level filtering.
resource_description string Technical description from source system. For Glue: table comment field. For Redshift: table remarks. NULL if source system has no description. Use for technical documentation searches.
business_description string Business-friendly description provided to AWS Sagemaker Catalog. Explains purpose, usage, and business context of the asset. NULL if not provided by data owner. Use for business glossary and discovery queries.
extended_metadata map<string,string> Flexible key-value store for platform-specific attributes not covered by standard columns. Examples: s3_location, compression_type, column_count, partition_keys, namespace_name, cluster_name. Use for advanced filtering and platform-specific queries. Schema varies by platform and resource_type.
asset_created_time timestamp Timestamp when this asset was first created in AWS Sagemaker Catalog.
asset_updated_time timestamp Timestamp when this asset was last updated in AWS Sagemaker Catalog.

Querying asset tables

To query Amazon S3 tables using Amazon SageMaker Unified Studio or Amazon Athena you must first do the following:

Assets registered in last one month

  • Query with sample aggregates

    SELECT DATE(asset_created_time) as date, resource_type_enum, COUNT(*) as count FROM asset_metadata.asset WHERE DATE(snapshot_time) = CURRENT_DATE AND asset_created_time >= DATE_ADD('month', -1, CURRENT_DATE) GROUP BY DATE(asset_created_time), resource_type_enum ORDER BY date DESC;
  • Plain query without aggregates and groupBy

    SELECT * FROM asset_metadata.asset WHERE DATE(snapshot_time) = CURRENT_DATE AND asset_created_time >= DATE_ADD('month', -1, CURRENT_DATE)

Assets without business description or owningEntityId in them

SELECT asset_id, asset_name, resource_name, resource_type_enum, account_id, business_description, extended_metadata['owningEntityId'] as owner FROM asset_metadata.asset WHERE DATE(snapshot_time) = CURRENT_DATE AND (business_description IS NULL OR extended_metadata['owningEntityId'] IS NULL);

Query asset matching metadata form field values

SELECT * FROM asset_metadata.asset WHERE DATE(snapshot_time) = CURRENT_DATE AND extended_metadata['<metadata-form-name>.<field-name>'] = '<field-value>';

Asset distribution queries

  • Get distributions by account

    SELECT account_id, resource_type_enum, COUNT(*) as count FROM asset_metadata.asset WHERE DATE(snapshot_time) = CURRENT_DATE GROUP BY account_id, resource_type_enum ORDER BY count DESC
  • Get distribution by asset owner (projectIds)

    SELECT extended_metadata['owningEntityId'] as owner, COUNT(*) as count FROM asset_metadata.asset WHERE DATE(snapshot_time) = CURRENT_DATE AND extended_metadata['owningEntityId'] IS NOT NULL GROUP BY extended_metadata['owningEntityId'] ORDER BY count DESC;

Time travel capabilities

The asset_metadata.asset table captures daily snapshots of asset metadata, allowing us to view the state of data catalog at any point in time. Simply change the date filter in our query to travel back to any previous snapshot

Note

Querying without a snapshot_time filter will read all historical snapshots, resulting in duplicate records and slower performance. Always filter by the desired date or current timestamp.

  • View Current Assets snapshot

    SELECT * FROM asset_metadata.asset WHERE DATE(snapshot_time) = CURRENT_DATE;
  • Travel to a Specific Date ex: Nov-26-2025

    SELECT * FROM asset_metadata.asset WHERE DATE(snapshot_time) = DATE('2025-11-26');
  • Travel Back Relative to Today ex: travel back by 2 days

    SELECT * FROM asset_metadata.asset WHERE DATE(snapshot_time) = date_add('day', -2, CURRENT_DATE);

Common use cases

  1. Track Metadata Improvements see which assets gained descriptions or ownership over time:

    SELECT t.asset_id, t.resource_name, p.business_description as description_before, t.business_description as description_now FROM asset_metadata.asset t JOIN asset_metadata.asset p ON t.asset_id = p.asset_id WHERE DATE(t.snapshot_time) = CURRENT_DATE AND DATE(p.snapshot_time) = CURRENT_DATE - INTERVAL '7' DAY AND p.business_description IS NULL AND t.business_description IS NOT NULL;
  2. Monitor Asset Growth View how data catalog has grown over the last 30 days:

    SELECT DATE(snapshot_time) as date, COUNT(*) as total_assets FROM asset_metadata.asset WHERE DATE(snapshot_time) >= CURRENT_DATE - INTERVAL '30' DAY GROUP BY DATE(snapshot_time) ORDER BY date DESC;
  3. Audit Historical Changes to investigate what an asset looked like at a specific point in time:

    SELECT asset_id, resource_name, business_description, extended_metadata['owningEntityId'] as owner, snapshot_time FROM asset_metadata.asset WHERE asset_id = 'your-asset-id' AND DATE(snapshot_time) = DATE('2025-11-26');