Data lineage support matrix
Lineage capture is automated for the following tools in Amazon SageMaker Unified Studio:
| Tool | Compute | AWS Service | Service deployment option | Support status | Notes |
|---|---|---|---|---|---|
| JupyterLab notebook | Spark | EMR | EMR Serverless | Automated | Spark DataFrames only; remote workflow execution |
| JupyterLab notebook | Spark | AWS Glue | N/A | Automated | Spark DataFrames only; remote workflow execution |
| Visual ETL | Spark | AWS Glue | compatibility mode | Automated | Spark DataFrames only |
| Visual ETL | Spark | AWS Glue | fineGrained mode | Not supported | Spark DataFrames only |
| Query Editor | | Amazon Redshift | | Automated | |
Lineage is captured from the following services:
| Data source | Lineage Support status | Required Configuration | Notes |
|---|---|---|---|
| AWS Glue Crawler | Automated by default in SageMaker Unified Studio | None | Supported for assets crawled via AWS Glue Crawler from the following data sources: Amazon S3, Amazon DynamoDB, Amazon S3 open table formats (Delta Lake, Iceberg, and Hudi tables), JDBC, PostgreSQL, DocumentDB, and MongoDB. |
| Amazon Redshift | Automated by default in SageMaker Unified Studio | None | User queries are retrieved from Redshift system tables, and lineage is generated by parsing those queries. |
| AWS Glue jobs in AWS Glue console | Not automated by default | User can select "generate lineage events" and pass domainId | |
| EMR | Not automated by default | User has to pass Spark conf parameters to publish lineage events | Supported versions: More details in Capture lineage from EMR Spark executions |
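
The EMR row above says that Spark conf parameters must be passed to publish lineage events, but this section does not list them. The following is a minimal sketch of how such parameters could be assembled into `spark-submit` arguments. It assumes that lineage events are emitted through the OpenLineage Spark listener with its Amazon DataZone transport; the property names, the helper `build_lineage_conf_args`, and the domain ID `dzd_example123` are illustrative assumptions, not values taken from this page. Confirm the exact parameters in Capture lineage from EMR Spark executions before using them.

```python
from typing import List


def build_lineage_conf_args(domain_id: str) -> List[str]:
    """Return spark-submit '--conf key=value' arguments for lineage publishing.

    Hypothetical helper: the property names below are assumptions based on the
    OpenLineage Spark integration and should be checked against the EMR
    lineage documentation.
    """
    conf = {
        # Registers the OpenLineage listener with Spark (assumed).
        "spark.extraListeners": "io.openlineage.spark.agent.OpenLineageSparkListener",
        # Sends events through the Amazon DataZone transport (assumed).
        "spark.openlineage.transport.type": "amazon_datazone_api",
        # Domain that receives the lineage events (assumed key name).
        "spark.openlineage.transport.domainId": domain_id,
    }
    args: List[str] = []
    for key, value in conf.items():
        args.extend(["--conf", f"{key}={value}"])
    return args


if __name__ == "__main__":
    # "dzd_example123" is a placeholder domain ID, not a real one.
    print(" ".join(build_lineage_conf_args("dzd_example123")))
```

The resulting list can be appended to a `spark-submit` command or adapted into the configuration of an EMR step. Keeping the parameters in one dictionary makes it easier to adjust them once the documented values are confirmed.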