Automate lineage capture from data connections
Configure automated lineage capture for AWS Glue (Lakehouse) connections
As databases and tables are added to the Amazon SageMaker Unified Studio catalog, you can automate lineage extraction from the source for those assets by using data source runs in the Create Connection workflow. Lineage capture is not enabled automatically when a connection is created; you enable it per connection as described below.
To enable lineage capture for an AWS Glue connection
- Navigate to Amazon SageMaker Unified Studio using the URL from your admin, and log in using your SSO or AWS credentials.
- Choose Select project from the top navigation pane, and select the project that contains the data source.
- Choose Data sources from the left navigation pane under Project catalog.
- Choose the data source that you want to modify.
- Expand the Actions menu, then choose Edit data source. Alternatively, choose the data source run name to view its details, go to the Data Source Definition tab, and choose Edit in Connection details.
- Go to the connections and select the Import data lineage checkbox to configure lineage capture from the source.
- Make other changes to the data source fields as desired, then choose Save.
Limitations
Lineage collection in data source runs builds lineage from table metadata. AWS Glue crawlers support several source types for which lineage is captured, including Amazon S3, DynamoDB, Catalog, Delta Lake, Iceberg tables, and Hudi tables stored in Amazon S3. JDBC, DocumentDB, and MongoDB sources are currently not supported.
Lineage is captured only for crawlers that imported fewer than 250 tables in a crawler run.
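The limitations above can be expressed as a small pre-flight check. The following is a minimal sketch, not part of any AWS SDK: the function name, the source-type labels, and the constant names are all hypothetical and chosen only for illustration. It encodes the two rules stated above (supported source types, and fewer than 250 tables per crawler run).

```python
from typing import Tuple

# Hypothetical labels for the source types listed in the limitations above.
SUPPORTED_SOURCES = {"S3", "DYNAMODB", "CATALOG", "DELTA", "ICEBERG", "HUDI"}
UNSUPPORTED_SOURCES = {"JDBC", "DOCUMENTDB", "MONGODB"}

# Lineage is captured only when a crawler run imported fewer than this many tables.
MAX_TABLES_PER_CRAWLER_RUN = 250


def lineage_eligible(source_type: str, tables_imported: int) -> Tuple[bool, str]:
    """Return (eligible, reason) for a crawler run, based on the documented limits."""
    kind = source_type.strip().upper()
    if kind in UNSUPPORTED_SOURCES:
        return False, f"{kind} is not supported as a lineage source"
    if kind not in SUPPORTED_SOURCES:
        return False, f"{kind} is not in the list of supported sources"
    if tables_imported >= MAX_TABLES_PER_CRAWLER_RUN:
        return False, "crawler run imported 250 or more tables"
    return True, "eligible"
```

For example, `lineage_eligible("s3", 120)` reports the run as eligible, while `lineage_eligible("jdbc", 120)` and `lineage_eligible("s3", 250)` do not.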
Note
When enabled, lineage runs asynchronously: it captures metadata from the source and generates lineage events, which are stored in SageMaker Catalog and can be visualized from a particular asset. You can view the status of lineage runs for the data source along with the data source run details.
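If you prefer to check run status from a script rather than the console, one possible approach is to list the data source runs through the Amazon DataZone API, on which the catalog is built. The sketch below assumes a boto3 `datazone` client and its `list_data_source_runs` operation; the `lineageSummary` and `importStatus` field names are assumptions that you should verify against the API reference, and the helper name is hypothetical. Pagination is omitted for brevity.

```python
from typing import Any, Dict, List


def lineage_status_for_runs(
    client: Any, domain_id: str, data_source_id: str
) -> List[Dict[str, str]]:
    """Summarize run status and (if reported) lineage import status per run.

    `client` is expected to behave like boto3.client("datazone").
    """
    response = client.list_data_source_runs(
        domainIdentifier=domain_id,
        dataSourceIdentifier=data_source_id,
    )
    report = []
    for run in response.get("items", []):
        # Assumed field names; a run without lineage data is reported as such.
        lineage = run.get("lineageSummary") or {}
        report.append(
            {
                "run_id": run.get("id", "unknown"),
                "run_status": run.get("status", "unknown"),
                "lineage_status": lineage.get("importStatus", "not reported"),
            }
        )
    return report


# Usage (requires AWS credentials; identifiers are placeholders):
#   import boto3
#   client = boto3.client("datazone")
#   for row in lineage_status_for_runs(client, "<domain-id>", "<data-source-id>"):
#       print(row)
```

Passing the client in as an argument keeps the helper easy to exercise with a stub before pointing it at a real domain.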
Configure automated lineage capture for Amazon Redshift connections
You can automate lineage capture from Amazon Redshift when a connection to an Amazon Redshift source is added in the Amazon SageMaker Unified Studio Data explorer. Lineage capture is configured per connection in the data source configuration. Lineage is not enabled automatically when a connection is created.
To enable lineage capture for an Amazon Redshift connection
- Navigate to Amazon SageMaker Unified Studio using the URL from your admin, and log in using your SSO or AWS credentials.
- Choose Select project from the top navigation pane, and select the project that contains the data source.
- Choose Data sources from the left navigation pane under Project catalog.
- Choose the data source that you want to modify.
- Expand the Actions menu, then choose Edit data source. Alternatively, choose the data source run name to view its details, go to the Data Source Definition tab, and choose Edit in Connection details.
- Go to the connections and select the Import data lineage checkbox to configure lineage capture from the source.
- Make other changes to the data source fields as desired, then choose Save.
Note
When enabled, lineage runs capture the queries executed for a given database and generate lineage events, which are stored in Amazon DataZone and can be visualized from a particular asset. The lineage run for Amazon Redshift is scheduled daily and pulls from the Amazon Redshift system tables to derive lineage. After you enable the feature, the first pull is scheduled about 5 minutes later, and subsequent pulls run daily. You can configure a specific run time programmatically.
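To make the default timing concrete, the sketch below computes when the first few pulls would be expected. It assumes the first run lands exactly 5 minutes after enablement (the note says approximately) and that later runs repeat at 24-hour intervals from that first run; the function and constant names are hypothetical and do not correspond to an AWS API.

```python
from datetime import datetime, timedelta, timezone
from typing import List

FIRST_RUN_DELAY = timedelta(minutes=5)  # approximate, per the note above
RUN_INTERVAL = timedelta(days=1)        # assumed: daily, measured from the first run


def planned_lineage_runs(enabled_at: datetime, count: int = 3) -> List[datetime]:
    """Return the expected start times of the first `count` lineage pulls."""
    first_run = enabled_at + FIRST_RUN_DELAY
    return [first_run + i * RUN_INTERVAL for i in range(count)]


enabled = datetime(2025, 1, 15, 9, 0, tzinfo=timezone.utc)
for run_time in planned_lineage_runs(enabled):
    print(run_time.isoformat())
# 2025-01-15T09:05:00+00:00
# 2025-01-16T09:05:00+00:00
# 2025-01-17T09:05:00+00:00
```

If a fixed time of day is required instead, that is the case the programmatic configuration mentioned above is intended for.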