DataFrame Utilities - Amazon SageMaker Unified Studio

DataFrame Utilities

Read from and write to catalog tables using pandas DataFrames with automatic format detection and database management.

Supported catalog types:

  • AwsDataCatalog

  • S3CatalogTables

Basic Usage

Import the DataFrame utilities:

from sagemaker_studio import dataframeutils

Reading from Catalog Tables

Required Inputs:

  • database (str): Database name within the catalog

  • table (str): Table name

Optional Parameters:

  • catalog (str): Catalog identifier (defaults to AwsDataCatalog if not specified)

  • format (str): Data format; auto-detected from table metadata, falling back to parquet if detection fails

  • **kwargs: Additional arguments

    • For AwsDataCatalog, kwargs can include columns, chunked, etc.

    • For S3 Tables, kwargs can include limit, row_filter, selected_fields, etc.
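The format fallback described above can be sketched roughly as follows. This is an illustrative sketch only; the function and the metadata dictionary are assumptions for clarity, not the library's actual internals.

```python
# Illustrative sketch of the format fallback: an explicit format argument
# wins, then the format recorded in table metadata, then parquet.
# resolve_format and its inputs are hypothetical, not sagemaker_studio API.
from typing import Optional


def resolve_format(table_metadata: dict, explicit_format: Optional[str] = None) -> str:
    """Pick the read format: explicit argument, then table metadata, then parquet."""
    if explicit_format:
        return explicit_format
    detected = table_metadata.get("format")
    if detected:
        return detected
    return "parquet"


print(resolve_format({"format": "csv"}, explicit_format="json"))  # explicit wins: json
print(resolve_format({"format": "csv"}))                          # detected: csv
print(resolve_format({}))                                         # fallback: parquet
```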

import pandas as pd

# Read from AwsDataCatalog
df = pd.read_catalog_table(
    database="my_database",
    table="my_table"
)

# Read from S3 Tables
df = pd.read_catalog_table(
    database="my_database",
    table="my_table",
    catalog="s3tablescatalog/my_s3_tables_catalog",
)

Usage with optional parameters

import pandas as pd

# Read from AwsDataCatalog, explicitly specifying catalog ID and format
df = pd.read_catalog_table(
    database="my_database",
    table="my_table",
    catalog="123456789012",
    format="parquet"
)

# Read from AwsDataCatalog, explicitly specifying catalog ID, format,
# and an additional argument -> columns
df = pd.read_catalog_table(
    database="my_database",
    table="my_table",
    catalog="123456789012",
    format="parquet",
    columns=['<column_name_1>', '<column_name_2>']
)

# Read from S3 Tables with an additional argument -> limit
df = pd.read_catalog_table(
    database="my_database",
    table="my_table",
    catalog="s3tablescatalog/my_s3_tables_catalog",
    limit=500
)

# Read from S3 Tables with an additional argument -> selected_fields
df = pd.read_catalog_table(
    database="my_database",
    table="my_table",
    catalog="s3tablescatalog/my_s3_tables_catalog",
    selected_fields=['<field_name_1>', '<field_name_2>']
)

Writing to Catalog Tables

Required Inputs:

  • database (str): Database name within the catalog

  • table (str): Table name

Optional Parameters:

  • catalog (str): Catalog identifier (defaults to AwsDataCatalog if not specified)

  • format (str): Data format used for AwsDataCatalog (default: parquet)

  • path (str): Custom Amazon S3 path for writing to AwsDataCatalog (auto-determined if not provided)

  • **kwargs: Additional arguments

Path Resolution Priority - Amazon S3 path is determined in this order:

  • User-provided path parameter

  • Existing database location + table name

  • Existing table location

  • Project default Amazon S3 location
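The priority order above can be sketched as a simple resolution function. This is an illustrative sketch only; the function name and its parameters are assumptions, not the library's actual internals.

```python
# Illustrative sketch of the S3 path resolution order described above:
# user-provided path, then database location + table name, then the
# existing table location, then the project default location.
# resolve_s3_path and its parameters are hypothetical, not sagemaker_studio API.
from typing import Optional


def resolve_s3_path(
    user_path: Optional[str],
    database_location: Optional[str],
    table_location: Optional[str],
    project_default: str,
    table: str,
) -> str:
    """Return the S3 path a write would target, following the documented priority."""
    if user_path:
        return user_path
    if database_location:
        return f"{database_location.rstrip('/')}/{table}/"
    if table_location:
        return table_location
    return f"{project_default.rstrip('/')}/{table}/"


# User-provided path takes precedence over everything else
print(resolve_s3_path("s3://my-bucket/custom/", "s3://db-loc", None,
                      "s3://default", "my_table"))  # s3://my-bucket/custom/
# With no user path or database location, fall back to the table location
print(resolve_s3_path(None, None, "s3://tbl-loc/",
                      "s3://default", "my_table"))  # s3://tbl-loc/
```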

import pandas as pd

# Create sample DataFrame
df = pd.DataFrame({
    'id': [1, 2, 3],
    'name': ['Alice', 'Bob', 'Charlie'],
    'value': [10.5, 20.3, 15.7]
})

# Write to AwsDataCatalog
df.to_catalog_table(
    database="my_database",
    table="my_table"
)

# Write to S3 Tables catalog
df.to_catalog_table(
    database="my_database",
    table="my_table",
    catalog="s3tablescatalog/my_s3_tables_catalog"
)

Usage with optional parameters

# Write to AwsDataCatalog with csv format
df.to_catalog_table(
    database="my_database",
    table="my_table",
    format="csv"
)

# Write to AwsDataCatalog at a user-specified S3 path
df.to_catalog_table(
    database="my_database",
    table="my_table",
    path="s3://my-bucket/custom/path/"
)

# Write to AwsDataCatalog with an additional argument -> compression
df.to_catalog_table(
    database="my_database",
    table="my_table",
    compression='gzip'
)