Data lineage support matrix - Amazon SageMaker Unified Studio

Lineage capture is automated from the following tools in Amazon SageMaker Unified Studio:

Tools support matrix
Tool | Compute | AWS Service | Service deployment option | Support status | Notes
JupyterLab notebook | Spark | EMR | EMR Serverless | Automated | Spark DataFrames only; remote workflow execution
JupyterLab notebook | Spark | AWS Glue | N/A | Automated | Spark DataFrames only; remote workflow execution
Visual ETL | Spark | AWS Glue | compatibility mode | Automated | Spark DataFrames only
Visual ETL | Spark | AWS Glue | fineGrained mode | Not supported | Spark DataFrames only
Query Editor | | Amazon Redshift | | Automated |

Lineage is captured from the following services:

Services support matrix
Data source | Lineage support status | Required configuration | Notes
AWS Glue Crawler | Automated by default in SageMaker Unified Studio | None | Supported for assets crawled via AWS Glue Crawler from the following data sources: Amazon S3, Amazon DynamoDB, Amazon S3 open table formats (Delta Lake, Iceberg, and Hudi tables), JDBC, PostgreSQL, DocumentDB, and MongoDB.
Amazon Redshift | Automated by default in SageMaker Unified Studio | None | Redshift system tables are used to retrieve user queries, and lineage is generated by parsing those queries.
AWS Glue jobs in AWS Glue console | Not automated by default | User can select "generate lineage events" and pass domainId |
EMR | Not automated by default | User has to pass Spark conf parameters to publish lineage events |

Supported EMR versions:

  • EMR Serverless (EMR-S): 7.5+

  • EMR on EC2: 7.11+

  • EMR on EKS: 7.12+

For more details, see Capture lineage from EMR Spark executions.
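
The minimum versions listed above are easy to misjudge if release numbers are compared as decimals, because 7.11 is a later release than 7.5. The following is a minimal Python sketch of a version check, not part of any AWS SDK. The names MIN_EMR_RELEASE, parse_release, and supports_lineage are hypothetical, and the sketch assumes release labels contain a major.minor pair, as in "emr-7.11.0".

```python
import re

# Minimum EMR release per deployment option, copied from the list above.
# The dictionary and function names are illustrative, not an AWS API.
MIN_EMR_RELEASE = {
    "EMR Serverless": (7, 5),
    "EMR on EC2": (7, 11),
    "EMR on EKS": (7, 12),
}


def parse_release(label: str) -> tuple[int, int]:
    """Extract (major, minor) from a label such as 'emr-7.11.0' or '7.5'."""
    match = re.search(r"(\d+)\.(\d+)", label)
    if match is None:
        raise ValueError(f"cannot parse EMR release label: {label!r}")
    return int(match.group(1)), int(match.group(2))


def supports_lineage(deployment: str, release_label: str) -> bool:
    """Return True if the release meets the documented minimum for lineage."""
    # Tuples compare element by element, so (7, 11) >= (7, 5) is True,
    # whereas the float comparison 7.11 >= 7.5 would be False.
    return parse_release(release_label) >= MIN_EMR_RELEASE[deployment]
```

For example, supports_lineage("EMR on EC2", "emr-7.5.0") evaluates to False under these assumptions, while supports_lineage("EMR Serverless", "emr-7.5.0") evaluates to True.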