

# Data lineage in Amazon SageMaker Unified Studio

Data lineage in Amazon SageMaker Unified Studio is an OpenLineage-compatible feature that captures and visualizes lineage events, from OpenLineage-enabled systems or through APIs, so that you can trace data origins, track transformations, and view cross-organizational data consumption. It provides an overarching view of your data assets, showing where each asset originated and its chain of connections. The lineage data covers activity inside the Amazon SageMaker Catalog, including information about cataloged assets and their subscribers, as well as activity outside the business data catalog that is captured programmatically through the APIs.

**Topics**
+ [What is OpenLineage?](datazone-data-lineage-what-is-openlineage.md)
+ [Data lineage support](datazone-data-lineage-support.md)
+ [Data lineage support matrix](datazone-support-matrix.md)
+ [Visualizing data lineage](datazone-visualizing-data-lineage.md)
+ [Test drive data lineage](datazone-data-lineage-sample-experience.md)
+ [Data lineage authorization](datazone-data-lineage-authorization.md)
+ [Automate lineage capture from data connections](datazone-data-lineage-automate-capture-from-data-connections.md)
+ [Automate lineage capture from tools](datazone-data-lineage-automate-capture-from-tools.md)
+ [Permissions required for data lineage](datazone-data-lineage-permissions.md)
+ [Publishing data lineage programmatically](datazone-data-lineage-apis.md)
+ [The importance of the sourceIdentifier attribute to lineage nodes](datazone-data-lineage-sourceIdentifier-attribute.md)
+ [Linking dataset lineage nodes with assets imported into Amazon SageMaker Unified Studio](datazone-data-lineage-linking-nodes.md)
+ [Troubleshooting data lineage](datazone-lineage-troubleshooting.md)

# What is OpenLineage?


[OpenLineage](https://openlineage.io/) is an open platform for the collection and analysis of data lineage. It tracks metadata about datasets, jobs, and runs, giving users the information required to identify the root cause of complex issues and to understand the impact of changes. It is an open standard for lineage metadata collection, designed to record metadata for a job in execution.

The standard defines a generic model of dataset, job, and run entities that are uniquely identified using consistent naming strategies. Dataset and job entities are identified by a combination of 'namespace' and 'name' attributes, whereas a run is identified by its runId. Entities can be enriched with user-defined metadata through facets (similar to metadata forms in Amazon SageMaker Unified Studio).

OpenLineage supports three types of events: RunEvent, DatasetEvent, and JobEvent.
+ RunEvent: generated as a result of a job-run execution. It contains details of the run, the job it belongs to, the input datasets the run consumes, and the output datasets the run produces. Currently, Amazon SageMaker Unified Studio supports only RunEvents.
+ DatasetEvent: represents changes in a dataset (such as static updates to the dataset)
+ JobEvent: represents changes in job configuration or details
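To make the identification model concrete, the following Python sketch builds a minimal RunEvent payload using the namespace/name identification described above; the job and dataset names are illustrative placeholders, and `make_run_event` is a hypothetical helper, not part of any SDK:

```python
import json
from datetime import datetime, timezone
from uuid import uuid4

def make_run_event(job_namespace, job_name, event_type, inputs=(), outputs=()):
    """Build a minimal OpenLineage RunEvent payload as a plain dict.
    Datasets and jobs are identified by namespace + name; the run by a runId."""
    return {
        "producer": "https://github.com/OpenLineage/OpenLineage",
        "schemaURL": "https://openlineage.io/spec/2-0-0/OpenLineage.json#/definitions/RunEvent",
        "eventType": event_type,                      # e.g. START, COMPLETE, FAIL
        "eventTime": datetime.now(timezone.utc).isoformat(),
        "run": {"runId": str(uuid4())},               # runs are identified by runId
        "job": {"namespace": job_namespace, "name": job_name},
        "inputs": [{"namespace": job_namespace, "name": n} for n in inputs],
        "outputs": [{"namespace": job_namespace, "name": n} for n in outputs],
    }

# Placeholder names for illustration only
event = make_run_event("xyz.analytics", "transform_sales_data", "COMPLETE",
                       inputs=["raw_sales"], outputs=["clean_sales"])
serialized = json.dumps(event)
```

The same dict shape, serialized as JSON, is what a RunEvent payload looks like on the wire.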

In the current release of Amazon SageMaker Unified Studio, OpenLineage versions 1.22.0 and later are supported.

# Data lineage support


In Amazon SageMaker Unified Studio, domain administrators or data users can configure lineage in projects while setting up connections for data lake and data warehouse sources, so that data source runs created from those resources are enabled for automatic lineage capture. Data lineage is automatically captured from data sources, such as AWS Glue and Amazon Redshift, as well as from tools, such as Visual ETL and notebooks, as executions create, update, or transform data. Additionally, Amazon SageMaker Unified Studio captures the movement of data within the catalog as producers bring their assets into inventory and publish them, and as consumers subscribe and get access, indicating the subscribing projects for a given asset. With this automation, the different stages of an asset in the catalog are captured, including when schema changes are detected.

Using Amazon SageMaker Unified Studio's OpenLineage-compatible APIs, domain administrators and data producers can capture and store lineage events beyond what is available in Amazon SageMaker Unified Studio, including transformations in Amazon S3, AWS Glue, and other services. This provides a comprehensive view for data consumers and helps them gain confidence in an asset's origin, while data producers can assess the impact of changes to an asset by understanding its usage. Additionally, Amazon SageMaker Unified Studio versions lineage with each event, enabling users to visualize lineage at any point in time or compare transformations across an asset's or job's history. This historical lineage provides a deeper understanding of how data has evolved, which is essential for troubleshooting, auditing, and ensuring the integrity of data assets.

With data lineage, you can accomplish the following in Amazon SageMaker Unified Studio: 
+ Understand the provenance of data: knowing where the data originated fosters trust in data by providing you with a clear understanding of its origins, dependencies, and transformations. This transparency helps in making confident data-driven decisions.
+ Understand the impact of changes to data pipelines: when changes are made to data pipelines, lineage can be used to identify all of the downstream consumers that will be affected. This helps ensure that changes are made without disrupting critical data flows.
+ Identify the root cause of data quality issues: if a data quality issue is detected in a downstream report, lineage, especially column-level lineage, can be used to trace the issue back to its source at the column level. This can help data engineers identify and fix the problem.
+ Improve data governance and compliance: column-level lineage can be used to demonstrate compliance with data governance and privacy regulations. For example, column-level lineage can be used to show where sensitive data (such as PII) is stored and how it is processed in downstream activities.

**OpenLineage custom transport to send lineage events to SageMaker**

OpenLineage events, which contain metadata about data pipelines, jobs, and runs, are typically sent to a backend for storage and analysis. The transport mechanism handles this transmission. As an extension of the OpenLineage project, a custom transport is available to send lineage events directly to Amazon SageMaker Unified Studio's endpoint. The custom transport was merged into OpenLineage version 1.33.0 ([https://openlineage.io/docs/releases/1_33_0/](https://openlineage.io/docs/releases/1_33_0/)). This allows the use of OpenLineage plugins with the transport to send collected lineage events directly to Amazon SageMaker Unified Studio. 
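For OpenLineage integrations configured through the client's `openlineage.yml` file, the transport section would select this custom transport. The sketch below is an assumption inferred from the corresponding Spark properties (`spark.openlineage.transport.type` and `spark.openlineage.transport.domainId`); check the OpenLineage 1.33.0 release notes for the exact key names, and replace the domain ID placeholder with your own:

```yaml
# Hypothetical sketch -- key names inferred from the corresponding
# spark.openlineage.transport.* properties; values are placeholders.
transport:
  type: amazon_datazone_api
  domainId: dzd_1234567890
```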

# Data lineage support matrix


Lineage capture is automated from the following tools in Amazon SageMaker Unified Studio:


**Tools support matrix**  

| **Tool** | **Compute** | **AWS Service** | **Service deployment option** | **Support status** | **Notes** | 
| --- | --- | --- | --- | --- | --- | 
| JupyterLab notebook | Spark | EMR | EMR Serverless | Automated | Spark DataFrames only; remote workflow execution | 
| JupyterLab notebook | Spark | AWS Glue | N/A | Automated | Spark DataFrames only; remote workflow execution | 
| Visual ETL | Spark | AWS Glue | compatibility mode | Automated | Spark DataFrames only | 
| Visual ETL | Spark | AWS Glue | fineGrained mode | Not supported | Spark DataFrames only | 
| Query Editor |  | Amazon Redshift |  | Automated |  | 

Lineage is captured from the following services: 


**Services support matrix**  

| **Data source** | **Lineage Support status** | **Required Configuration** | **Notes** | 
| --- | --- | --- | --- | 
| AWS Glue Crawler | Automated by default in SageMaker Unified Studio | None | Supported for assets crawled via AWS Glue Crawler for the following data sources: Amazon S3, Amazon DynamoDB, Amazon S3 open table formats (Delta Lake, Iceberg tables, and Hudi tables), JDBC, PostgreSQL, DocumentDB, and MongoDB. | 
| Amazon Redshift | Automated by default in SageMaker Unified Studio | None | Amazon Redshift system tables are used to retrieve user queries, and lineage is generated by parsing those queries | 
| AWS Glue jobs in AWS Glue console  | Not automated by default | User can select "generate lineage events" and pass domainId  |  | 
| EMR | Not automated by default | User has to pass spark conf parameters to publish lineage events | Supported versions: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker-unified-studio/latest/userguide/datazone-support-matrix.html)  More details in [Capture lineage from EMR Spark executions](datazone-data-lineage-automate-capture-from-tools.md#datazone-data-lineage-automate-capture-from-tools-emrnotebook) | 

# Visualizing data lineage


In Amazon SageMaker Unified Studio, nodes in the lineage graph contain lineage information while edges represent upstream/downstream directions of data propagation. The lineage information is present in metadata forms attached to the lineage node. Amazon SageMaker Unified Studio defines three types of lineage nodes: 
+ Dataset node - this node includes data lineage information about a specific dataset.
  + Dataset refers to any object such as table, view, Amazon S3 file, Amazon S3 bucket, etc. It also refers to the assets in Amazon SageMaker Unified Studio's inventory and catalog, and subscribed tables/views.
  + Each version of the dataset node represents an event happening on the dataset at that timestamp. The history tab on the dataset node shows all dataset versions.
+ Job node - this node includes job details such as the job type (query, ETL, and so on) and the processing type (batch, streaming).
+ JobRun node - this node represents job run details such as the job it belongs to, status, and start/end timestamps. Amazon SageMaker Unified Studio's lineage graph shows a combined view for a job and its job runs, which displays job details and the latest run details along with a history of previous job runs.

The lineage graph can be visualized with an asset as the base node. In SageMaker Unified Studio, search for assets, open any asset, and view its lineage on the asset details page. 

Here is a sample lineage graph for a user who is a data producer:

![\[Sample lineage graph for a user who is a data producer.\]](http://docs.aws.amazon.com/sagemaker-unified-studio/latest/userguide/images/Screenshot4datalineage.png)


Here is a sample lineage graph for a user who is a data consumer:

![\[Sample lineage graph for a user who is a data consumer.\]](http://docs.aws.amazon.com/sagemaker-unified-studio/latest/userguide/images/Screenshot5datalineage.png)


The asset details page provides the following capabilities to navigate the graph:
+ Column-level lineage: expand column-level lineage when available in dataset nodes. This automatically shows relationships with upstream or downstream dataset nodes if source column information is available.
+ Column search: by default, 10 columns are displayed. If there are more than 10 columns, pagination is activated to navigate to the rest. To quickly view a particular column, you can search on the dataset node, which then lists just the matching column.
+ View dataset nodes only: to view only dataset lineage nodes and filter out the job nodes, choose the **Open view control** icon on the top left of the graph viewer and toggle the **Display dataset nodes only** option. This removes all the job nodes from the graph and lets you navigate just the dataset nodes. Note that when this option is turned on, the graph cannot be expanded upstream or downstream.
+ Details pane: Each lineage node has details captured and displayed when selected.
  + The dataset node detail pane displays all the details captured for that node at a given timestamp. Every dataset node has three tabs: Lineage info, Schema, and History. The History tab lists the different versions of the lineage event captured for that node. All details captured from the API are displayed using metadata forms or a JSON viewer.
  + The job node detail pane displays job details in two tabs: Job info and History. The details pane also shows queries or expressions captured as part of the job run. The History tab lists the different job runs captured for that job. All details captured from the API are displayed using metadata forms or a JSON viewer.
+ History tab: all lineage nodes in Amazon SageMaker Unified Studio's lineage have versioning. For every dataset node or job node, the versions are captured as history and that enables you to navigate between the different versions to identify what has changed over time. Each version opens a new tab in the lineage page to help compare or contrast.

# Aggregated lineage view


You can view an asset's lineage in two ways:
+ **Aggregated view** - Shows all jobs that are currently contributing to an asset's lineage, providing a complete picture of the data transformations and dependencies across multiple levels of the lineage graph. Use this view to understand the full scope of jobs impacting your datasets and to identify all upstream sources and downstream consumers.
+ **Timestamp view** - Shows the lineage graph as it existed at a specific point in time, displaying only the latest job run for each job at that timestamp. This view includes column-level lineage and is useful for troubleshooting and investigating specific data processing events.

The aggregated view is the default in most regions and shows the current state of your data lineage. In Opt-In Regions, only the timestamp view is available.

To switch between views, choose the **Open view control** icon in the top left of the lineage graph viewer and toggle the **Display in event timestamp order** option. When enabled, the timestamp view is displayed. When disabled, the aggregated view is displayed. This toggle is not available in Opt-In Regions.

Here is a sample aggregated view of a lineage graph:

![\[Sample aggregated view of a lineage graph showing all jobs currently contributing to the asset.\]](http://docs.aws.amazon.com/sagemaker-unified-studio/latest/userguide/images/Screenshot6datalineage.png)


Here is a sample timestamp view of a lineage graph:

![\[Sample timestamp view of a lineage graph showing the latest job run at a specific point in time.\]](http://docs.aws.amazon.com/sagemaker-unified-studio/latest/userguide/images/Screenshot7datalineage.png)


# Test drive data lineage


You can use the data lineage sample experience to browse and understand data lineage in Amazon SageMaker Unified Studio, including traversing upstream or downstream in your data lineage graph, exploring versions and column-level lineage.

Complete the following procedure to try the sample data lineage experience in Amazon SageMaker Unified Studio:

1. Navigate to Amazon SageMaker Unified Studio using the URL from your admin and log in using your SSO or AWS credentials. 

1. Choose **Select project** from the top navigation pane and select the project you want to view lineage in.

1. Under **Project catalog** in the left side navigation, choose **Assets**.

1. On the **Inventory** tab, choose the name of the asset that you want to view lineage for. This opens the asset details page.

1. On the asset details page, choose the **Lineage** tab.

1. In the data lineage window, choose the info icon that says **Try sample data lineage**. Then choose **Launch**. A new pop-up window appears.

1. Choose **Start guided data lineage tour**.

1. Select a guided tour option, and then choose **Start tour**.

   At this point, a tab opens with a dedicated space for the lineage information. The sample data lineage graph initially displays a base node with one level of depth at either end, upstream and downstream. You can expand the graph upstream or downstream. Column information is also available for you to choose and see how lineage flows through the nodes. 

# Data lineage authorization


**Write permissions** - to publish lineage events into Amazon SageMaker Unified Studio, you must have an IAM role with a permissions policy that includes an ALLOW action on the PostLineageEvent API. This IAM authorization happens at the API Gateway layer.

**Read permissions to view lineage** - GetLineageNode and ListLineageNodeHistory are included in the AmazonSageMakerDomainExecution managed policy, so every user in an Amazon SageMaker Unified Studio domain can invoke them to view the data lineage graph in Amazon SageMaker Unified Studio.

**Read permissions to get lineage events** - you must have an IAM role with a policy that includes an ALLOW action on the ListLineageEvents and GetLineageEvent APIs to view lineage events posted to Amazon SageMaker Unified Studio.
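As a sketch, an IAM identity policy combining the write and event-read permissions described above could look like the following; scoping `Resource` to your domain ARN instead of `*` is recommended where supported:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "PublishLineageEvents",
      "Effect": "Allow",
      "Action": "datazone:PostLineageEvent",
      "Resource": "*"
    },
    {
      "Sid": "ReadLineageEvents",
      "Effect": "Allow",
      "Action": [
        "datazone:ListLineageEvents",
        "datazone:GetLineageEvent"
      ],
      "Resource": "*"
    }
  ]
}
```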

# Automate lineage capture from data connections


**Topics**
+ [Configure automated lineage capture for AWS Glue (Lakehouse) connections](#datazone-data-lineage-automate-capture-from-data-connections-glue)
+ [Configure automated lineage capture for Amazon Redshift connections](#datazone-data-lineage-automate-capture-from-data-connections-redshift)

## Configure automated lineage capture for AWS Glue (Lakehouse) connections


As databases and tables are added to the Amazon SageMaker Unified Studio catalog, lineage extraction from the source can be automated for those assets using data source runs in the Create Connection workflow. Lineage is not automatically enabled for every connection created. 

**To enable lineage capture for an AWS Glue connection**

1. Navigate to Amazon SageMaker Unified Studio using the URL from your admin and log in using your SSO or AWS credentials.

1. Choose **Select project** from the top navigation pane and select the project to which you want to add the data source.

1. Choose **Data sources** from the left navigation pane under Project catalog.

1. Choose the data source that you want to modify.

1. Expand the **Actions** menu and choose **Edit data source**, or choose the data source run name to view the details, go to the **Data Source Definition** tab, and choose **Edit** in **Connection details**. 

1. Go to the connections and select the **Import data lineage** checkbox to configure lineage capture from the source. 

1. Make other changes to the data source fields as desired, then choose **Save**.

   **Limitations**

   Lineage is captured only for crawlers that imported fewer than 250 tables in a crawler run.

**Note**  
When enabled, the lineage run executes asynchronously to capture metadata from the source and generate lineage events, which are stored in the SageMaker Catalog and can be visualized from a particular asset. The status of lineage runs for the data source can be viewed along with the data source run details. 

## Configure automated lineage capture for Amazon Redshift connections


Capturing lineage from Amazon Redshift can be automated when the connection is added to an Amazon Redshift source in Amazon SageMaker Unified Studio's Data explorer. Lineage capture can be automated for a connection in the data source configuration. Lineage is not automatically enabled for every connection created. 

**To enable lineage capture for an Amazon Redshift connection**

1. Navigate to Amazon SageMaker Unified Studio using the URL from your admin and log in using your SSO or AWS credentials.

1. Choose **Select project** from the top navigation pane and select the project to which you want to add the data source.

1. Choose **Data sources** from the left navigation pane under **Project catalog**.

1. Choose the data source that you want to modify.

1. Expand the **Actions** menu and choose **Edit data source**, or choose the data source run name to view the details, go to the **Data Source Definition** tab, and choose **Edit** in **Connection details**. 

1. Go to the connections and select the **Import data lineage** checkbox to configure lineage capture from the source. 

1. Make other changes to the data source fields as desired, then choose **Save**.

**Note**  
When enabled, the lineage run captures queries executed for a given database and generates lineage events, which are stored in the SageMaker Catalog and can be visualized from a particular asset. The lineage run for Amazon Redshift is set up as a daily run that pulls from the Amazon Redshift system tables to derive lineage. After you enable the feature, the first pull is scheduled 15 minutes later and then runs daily. You can configure a specific time programmatically. 

# Automate lineage capture from tools


**Topics**
+ [Capture lineage for Spark executions in Visual ETL](#datazone-data-lineage-automate-capture-from-tools-vetl)
+ [Capture lineage for AWS Glue Spark executions in Notebooks](#datazone-data-lineage-automate-capture-from-tools-gluenotebook)
+ [Capture lineage from EMR Spark executions](#datazone-data-lineage-automate-capture-from-tools-emrnotebook)

## Capture lineage for Spark executions in Visual ETL


When a Visual ETL flow is created in Amazon SageMaker Unified Studio, lineage capture for that flow is automatically enabled when you choose **Save to Project**. For every flow to capture lineage automatically, choose **Save to Project** and then choose **Run**.

**Note:** if lineage is not being captured, choose **Save**, move back to the Visual ETL flows section, and then reopen the Visual ETL flow.

The following Spark configuration parameters are automatically added to the job being executed. When invoking Visual ETL programmatically, use the below configuration.

```
{
    "--conf":"spark.extraListeners=io.openlineage.spark.agent.OpenLineageSparkListener 
    --conf spark.openlineage.transport.type=amazon_datazone_api 
    --conf spark.openlineage.transport.domainId={DOMAIN_ID} 
    --conf spark.glue.accountId={ACCOUNT_ID} 
    --conf spark.openlineage.facets.custom_environment_variables=[AWS_DEFAULT_REGION;GLUE_VERSION;GLUE_COMMAND_CRITERIA;GLUE_PYTHON_VERSION;]
    --conf spark.openlineage.columnLineage.datasetLineageEnabled=True
    --conf spark.glue.JOB_NAME={JOB_NAME}"
}
```

The parameters are auto-configured and do not need any updates from the user. The parameters in detail: 
+ `spark.extraListeners=io.openlineage.spark.agent.OpenLineageSparkListener` - The OpenLineageSparkListener is created and registered with Spark's listener bus.
+ `spark.openlineage.transport.type=amazon_datazone_api` - This is an OpenLineage setting that tells the OpenLineage plugin to use the DataZone API transport to emit lineage events to DataZone's PostLineageEvent API. For more information, see [https://openlineage.io/docs/integrations/spark/configuration/spark_conf/](https://openlineage.io/docs/integrations/spark/configuration/spark_conf/)
+ `spark.openlineage.transport.domainId={DOMAIN_ID}` - This parameter establishes the domain to which the API transport submits the lineage events.
+ `spark.openlineage.facets.custom_environment_variables=[AWS_DEFAULT_REGION;GLUE_VERSION;GLUE_COMMAND_CRITERIA;GLUE_PYTHON_VERSION;]` - The listed environment variables (`AWS_DEFAULT_REGION`, `GLUE_VERSION`, `GLUE_COMMAND_CRITERIA`, and `GLUE_PYTHON_VERSION`), which the AWS Glue interactive session populates, are added to the LineageEvent.
+ `spark.glue.accountId={ACCOUNT_ID}` - The account ID of the AWS Glue Data Catalog where the metadata resides. This account ID is used to construct the AWS Glue ARN in the lineage event.
+ `spark.glue.JOB_NAME={JOB_NAME}` - The job name of the lineage event. In a Visual ETL flow, the job name is configured automatically as `{projectId}.{pathToNotebook}`.

**Spark compute limitations**
+ OpenLineage libraries for Spark are built into AWS Glue version 5.0 and later for Spark DataFrames only. Dynamic DataFrames are not supported.
+ A LineageEvent has a size limit of 300 KB.

## Capture lineage for AWS Glue Spark executions in Notebooks


Sessions in notebooks do not have the concept of a job. You can map Spark executions to lineage events by generating a unique job name for the notebook. Use the `%%configure` magic with the parameters below to enable lineage capture for Spark executions in the notebook. 

**Note:** for AWS Glue Spark executions in notebooks, lineage capture is automated when the notebook is scheduled with a workflow in a shared environment using remote workflows.

```
%%configure --name {COMPUTE_NAME} -f
{
"--conf":"spark.extraListeners=io.openlineage.spark.agent.OpenLineageSparkListener --conf spark.openlineage.transport.type=amazon_datazone_api --conf spark.openlineage.transport.domainId={DOMAIN_ID} --conf spark.glue.accountId={ACCOUNT_ID} --conf spark.openlineage.columnLineage.datasetLineageEnabled=True --conf spark.openlineage.facets.custom_environment_variables=[AWS_DEFAULT_REGION;GLUE_VERSION;GLUE_COMMAND_CRITERIA;GLUE_PYTHON_VERSION;] --conf spark.glue.JOB_NAME={JOB_NAME}" 
}
```

Examples of `{COMPUTE_NAME}`: `project.spark.compatibility` or `project.spark.fineGrained`

Here are these parameters and what they configure, in detail:
+ `spark.extraListeners=io.openlineage.spark.agent.OpenLineageSparkListener`
  + The OpenLineageSparkListener is created and registered with Spark's listener bus.
+ `spark.openlineage.transport.type=amazon_datazone_api`
  + [https://openlineage.io/docs/integrations/spark/configuration/spark_conf](https://openlineage.io/docs/integrations/spark/configuration/spark_conf)
  + This is an OpenLineage setting that tells the OpenLineage plugin to use the DataZone API transport to emit lineage events to DataZone's PostLineageEvent API, to be captured in SageMaker.
+ `spark.openlineage.transport.domainId={DOMAIN_ID}`
  + This parameter establishes the domain to which the API transport submits the lineage events.
+ `spark.openlineage.facets.custom_environment_variables=[AWS_DEFAULT_REGION;GLUE_VERSION;GLUE_COMMAND_CRITERIA;GLUE_PYTHON_VERSION;]`
  + The listed environment variables (`AWS_DEFAULT_REGION`, `GLUE_VERSION`, `GLUE_COMMAND_CRITERIA`, and `GLUE_PYTHON_VERSION`), which the AWS Glue interactive session populates, are added to the LineageEvent.
+ `spark.glue.accountId={ACCOUNT_ID}`
  + The account ID of the AWS Glue Data Catalog where the metadata resides. This account ID is used to construct the AWS Glue ARN in the lineage event.
+ [optional] `spark.openlineage.transport.region={DOMAIN_REGION}`
  + If the domain's region differs from the job's execution region, pass this parameter with the domain's region as its value.
+ `spark.glue.JOB_NAME={JOB_NAME}`
  + The job name of the lineage event. For example, the job name can be set to `{projectId}.{pathToNotebook}`.

## Capture lineage from EMR Spark executions


Amazon EMR with the Spark engine has the necessary OpenLineage libraries built in. You need to pass the following Spark parameters. Be sure to replace `{DOMAIN_ID}` with your specific Amazon DataZone or Amazon SageMaker Unified Studio domain ID, and `{ACCOUNT_ID}` with the account ID where the EMR job runs.

```
%%configure --name emr-s.{EMR_SERVERLESS_COMPUTE_NAME}
{   
    "conf": {
         "spark.extraListeners": "io.openlineage.spark.agent.OpenLineageSparkListener",
         "spark.openlineage.columnLineage.datasetLineageEnabled":"True",
         "spark.openlineage.transport.type":"amazon_datazone_api",
         "spark.openlineage.transport.domainId":"{DOMAIN_ID}",
         "spark.openlineage.transport.region":"{DOMAIN_REGION}", // Only needed if the domain is in a different region than the job
         "spark.glue.accountId":"{ACCOUNT_ID}", // Needed if AWS Glue is being used as the Hive metastore
         "spark.jars":"/usr/share/aws/datazone-openlineage-spark/lib/DataZoneOpenLineageSpark-1.0.jar" // Only needed in the case of EMR Serverless
    }
}
```
+ Lineage is supported for the following EMR versions:
  + EMR Serverless: 7.5 and later
  + EMR on EC2: 7.11 and later
  + EMR on EKS: 7.12 and later
+ The `JOB_NAME` is the Spark application name, which is set automatically.
+ Replace `{DOMAIN_ID}`, `{ACCOUNT_ID}`, and `{DOMAIN_REGION}` with your values.
+ The Amazon SageMaker Unified Studio VPC endpoint must be deployed in the EMR VPC.

# Permissions required for data lineage


## Read permissions to view lineage


Permissions on the following actions are needed to view the lineage graph:
+ `datazone:GetLineageNode`
+ `datazone:ListLineageNodeHistory`
+ `datazone:QueryGraph`

The above permissions are included in the `AmazonSageMakerDomainExecution` managed policy, so every user in an Amazon SageMaker Unified Studio domain can invoke these actions to view the data lineage graph in Amazon SageMaker Unified Studio.

Permissions on the following actions are needed to view lineage events:
+ `datazone:ListLineageEvents`
+ `datazone:GetLineageEvent`

Users must have an IAM role with a policy that includes an "Allow" action on these APIs to view lineage events posted to Amazon SageMaker Unified Studio.

## Write permissions to publish lineage


### Lineage for AWS Glue crawler


The project user role is used to fetch the required data from AWS Glue. The project user role should contain the following permissions on AWS Glue operations:
+ `glue:listCrawls`
+ `glue:getConnection`

**Note**  
`SageMakerStudioProjectUserRolePolicy` already contains the above permissions.

### Lineage for Amazon Redshift


The project user role is used to execute queries on the cluster/workgroup defined in the connection. The project user role should contain the following permissions:
+ `redshift-data:BatchExecuteStatement`
+ `redshift-data:ExecuteStatement`
+ `redshift-data:DescribeStatement`
+ `redshift-data:GetStatementResult`

**Note**  
`SageMakerStudioProjectUserRolePolicy` already contains the above permissions.

In addition, the credentials provided for the Amazon Redshift connection in Amazon SageMaker Unified Studio should have the following permissions:
+ The `sys:operator` role, to access data from system tables for all user queries performed on the cluster or workgroup
+ A `SELECT` grant on all the tables

### Lineage for AWS Glue, EMR jobs


The IAM role used to execute the job should contain following permissions to publish lineage events to Amazon SageMaker Unified Studio:
+ An ALLOW action on `datazone:PostLineageEvent`
+ If your Amazon SageMaker Unified Studio domain is encrypted with an AWS KMS customer managed key (CMK), the job role should also have permissions to encrypt and decrypt.
+ If the Spark job runs in an account different from the Amazon SageMaker Unified Studio domain account, associate the account with the domain before running the job. Follow [https://docs.aws.amazon.com/datazone/latest/userguide/working-with-associated-accounts.html](https://docs.aws.amazon.com/datazone/latest/userguide/working-with-associated-accounts.html) to set up the account association.

### Publish Lineage using API


An IAM role with a policy that allows the `datazone:PostLineageEvent` action is needed to post lineage events programmatically.

# Publishing data lineage programmatically


You can also publish data lineage programmatically using [PostLineageEvent](https://docs.aws.amazon.com/datazone/latest/APIReference/API_PostLineageEvent.html) API. The API takes in open lineage run event as the payload. Additionally, the following APIs support retrieving lineage events and traversing lineage graph: 
+ [GetLineageEvent](https://docs.aws.amazon.com/datazone/latest/APIReference/API_GetLineageEvent.html)
+ [ListLineageEvents](https://docs.aws.amazon.com/datazone/latest/APIReference/API_ListLineageEvents.html)
+ [QueryGraph](https://docs.aws.amazon.com/datazone/latest/APIReference/API_QueryGraph.html): a paginated API that returns the aggregate view of the lineage graph
+ [GetLineageNode](https://docs.aws.amazon.com/datazone/latest/APIReference/API_GetLineageNode.html): gets the lineage node along with its immediate neighbors
+ [ListLineageNodeHistory](https://docs.aws.amazon.com/datazone/latest/APIReference/API_ListLineageNodeHistory.html): lists the lineage node versions, with each version derived from a data or metadata change event

The following is a sample PostLineageEvent operation payload:

```
{
  "producer": "https://github.com/OpenLineage/OpenLineage",
  "schemaURL": "https://openlineage.io/spec/2-0-0/OpenLineage.json#/definitions/RunEvent",    
  "eventType": "COMPLETE",
  "eventTime": "2024-05-04T10:15:30Z",
  "run": {
    "runId": "d2e7c111-8f3c-4f5b-9ebd-cb1d7995082a"
  },
  "job": {
    "namespace": "xyz.analytics",
    "name": "transform_sales_data"
  },
  "inputs": [
    {
      "namespace": "xyz.analytics",
      "name": "raw_sales",
      "facets": {
        "schema": {
          "_producer": "https://github.com/OpenLineage/OpenLineage",
          "_schemaURL": "https://openlineage.io/spec/facets/schema_dataset.json",
          "fields": [
            { "name": "region", "type": "string" },
            { "name": "year", "type": "int" },
            { "name": "created_at", "type": "timestamp" }
          ]
        }
      }
    }
  ],
  "outputs": [
    {
      "namespace": "xyz.analytics",
      "name": "clean_sales",
      "facets": {
        "schema": {
          "_producer": "https://github.com/OpenLineage/OpenLineage",
          "_schemaURL": "https://openlineage.io/spec/facets/schema_dataset.json",
          "fields": [
            { "name": "region", "type": "string" },
            { "name": "year", "type": "int" },
            { "name": "created_at", "type": "timestamp" }
          ]
        },
        "columnLineage": {
          "_producer": "https://github.com/OpenLineage/OpenLineage",
          "_schemaURL": "https://openlineage.io/spec/facets/ColumnLineageDatasetFacet.json",
          "fields": {
            "region": {
              "inputFields": [
                {
                  "namespace": "xyz.analytics",
                  "name": "raw_sales",
                  "field": "region"
                }
              ]
            },
            "year": {
              "inputFields": [
                {
                  "namespace": "xyz.analytics",
                  "name": "raw_sales",
                  "field": "year"
                }
              ]
            },
            "created_at": {
              "inputFields": [
                {
                  "namespace": "xyz.analytics",
                  "name": "raw_sales",
                  "field": "created_at"
                }
              ]
            }
          }
        }
      }
    }
  ]
}
```
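A payload like the one above can be built and posted with the AWS SDK. The following is a minimal Python sketch using boto3; the domain ID is a placeholder, and the posting call is shown commented out because it requires AWS credentials and a real domain:

```python
import json
import uuid
from datetime import datetime, timezone

def build_run_event(namespace, job_name, inputs, outputs):
    """Build a minimal OpenLineage COMPLETE run event as a JSON string."""
    dataset = lambda name: {"namespace": namespace, "name": name}
    return json.dumps({
        "producer": "https://github.com/OpenLineage/OpenLineage",
        "schemaURL": "https://openlineage.io/spec/2-0-0/OpenLineage.json#/definitions/RunEvent",
        "eventType": "COMPLETE",
        "eventTime": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "run": {"runId": str(uuid.uuid4())},
        "job": {"namespace": namespace, "name": job_name},
        "inputs": [dataset(n) for n in inputs],
        "outputs": [dataset(n) for n in outputs],
    })

event = build_run_event("xyz.analytics", "transform_sales_data",
                        inputs=["raw_sales"], outputs=["clean_sales"])

# Posting requires AWS credentials and a real domain ID (placeholder below):
# import boto3
# boto3.client("datazone").post_lineage_event(
#     domainIdentifier="dzd_EXAMPLE", event=event)
```

Facets such as `schema` and `columnLineage` can be added to each dataset entry exactly as in the sample payload above.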

# The importance of the sourceIdentifier attribute to lineage nodes


Every lineage node is uniquely identified by its sourceIdentifier (usually provided as part of the OpenLineage event) in addition to a system-generated nodeId. The sourceIdentifier is generated using the <namespace> and <name> of the node in the lineage event.

The following are examples of sourceIdentifier values for different types of nodes:
+ **Job nodes**
  + The sourceIdentifier of job nodes is populated from <namespace>.<name> of the job node in the OpenLineage run event
+ **Job run nodes**
  + The sourceIdentifier of job run nodes is populated from <job's namespace>.<job's name>/<run_id>
+ **Dataset nodes**
  + Dataset nodes representing AWS resources: sourceIdentifier is in ARN format
    + AWS Glue table: arn:aws:glue:<region>:<account-id>:table/<database>/<table-name>
    + AWS Glue table with federated sources: arn:aws:glue:<region>:<account-id>:table/<catalog>/<database>/<table-name>
      + Example: the catalog can be "s3tablescatalog/s3tablesBucket", "lakehouse_catalog", and so on
    + Amazon Redshift table:
      + serverless: arn:aws:redshift-serverless:<region>:<account-id>:table/<workgroup-name>/<database>/<schema>/<table-name>
      + provisioned: arn:aws:redshift:<region>:<account-id>:table/<cluster-identifier>/<database>/<schema>/<table-name>
    + Amazon Redshift view:
      + serverless: arn:aws:redshift-serverless:<region>:<account-id>:view/<workgroup-name>/<database>/<schema>/<view-name>
      + provisioned: arn:aws:redshift:<region>:<account-id>:view/<cluster-identifier>/<database>/<schema>/<view-name>
  + Dataset nodes representing SageMaker catalog resources:
    + Asset: amazon.datazone.asset/<assetId>
    + Listing (published asset): amazon.datazone.listing/<listingId>
  + In all other cases, the dataset node's sourceIdentifier is populated using the <namespace>/<name> of the dataset node in the OpenLineage run event
    + [https://openlineage.io/docs/spec/naming/](https://openlineage.io/docs/spec/naming/) contains the naming conventions for various datastores.
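The composition rules above can be sketched in Python. These helpers simply concatenate the namespace, name, and run ID exactly as described; the values used below are illustrative, not real identifiers:

```python
def job_source_identifier(namespace: str, name: str) -> str:
    # Job nodes: <namespace>.<name>
    return f"{namespace}.{name}"

def job_run_source_identifier(namespace: str, name: str, run_id: str) -> str:
    # Job run nodes: <job's namespace>.<job's name>/<run_id>
    return f"{namespace}.{name}/{run_id}"

def default_dataset_source_identifier(namespace: str, name: str) -> str:
    # Dataset nodes without a more specific ARN mapping: <namespace>/<name>
    return f"{namespace}/{name}"

print(job_run_source_identifier("xyz.analytics", "transform_sales_data",
                                "d2e7c111-8f3c-4f5b-9ebd-cb1d7995082a"))
```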

The following table contains examples of how the sourceIdentifier is generated for datasets of different types.



| Source for lineage event | Sample OpenLineage event data | Source ID computed by Amazon DataZone | 
| --- | --- | --- | 
|  AWS Glue ETL  |  <pre><br />{<br />   "run": {<br />      "runId":"4e3da9e8-6228-4679-b0a2-fa916119fthr",<br />      "facets":{<br />           "environment-properties":{<br />                 ....<br />                "environment-properties":{<br />                     "GLUE_VERSION":"3.0",<br />                     "GLUE_COMMAND_CRITERIA":"glueetl",<br />                     "GLUE_PYTHON_VERSION":"3"<br />                }<br />           }<br />       } <br />    },<br />    .....<br />   "outputs":[<br />      {<br />         "namespace":"namespace.output",<br />         "name":"output_name",<br />         "facets":{<br />             "symlinks":{<br />                 .... <br />                 "identifiers":[<br />                    {<br />                       "namespace":"arn:aws:glue:us-west-2:123456789012",<br />                       "name":"table/testdb/testtb-1",<br />                       "type":"TABLE"<br />                    }<br />                 ]<br />             }<br />        }<br />     }<br />   ]<br />    <br />}<br />                               </pre>  | arn:aws:glue:us-west-2:123456789012:table/testdb/testtb-1. If environment-properties contains GLUE_VERSION, GLUE_PYTHON_VERSION, and so on, Amazon DataZone uses the namespace and name in the symlink of the dataset (input or output) to construct the AWS Glue table ARN for the sourceIdentifier. | 
|  Amazon Redshift (provisioned)  |  <pre><br />{<br />   "run": {<br />      "runId":"4e3da9e8-6228-4679-b0a2-fa916119fthr",<br />      "facets":{<br />          .......<br />       } <br />    },<br />    .....<br />   "inputs":[<br />      {<br />         "namespace":"redshift://cluster-20240715.123456789012.us-east-1.redshift.amazonaws.com:5439",<br />         "name":"tpcds_data.public.dws_tpcds_7",<br />         "facets":{<br />             .....<br />        }<br />     }<br />   ]<br />    <br />}<br />                                </pre>  | arn:aws:redshift:us-east-1:123456789012:table/cluster-20240715/tpcds_data/public/dws_tpcds_7  If the namespace prefix is `redshift`, Amazon DataZone uses it to construct the Amazon Redshift ARN using the values of the namespace and name attributes. | 
|  Amazon Redshift (serverless)  |  <pre><br />{<br />   "run": {<br />      "runId":"4e3da9e8-6228-4679-b0a2-fa916119fthr",<br />      "facets":{<br />          .......<br />       } <br />    },<br />    .....<br />   "outputs":[<br />      {<br />         "namespace":"redshift://workgroup-20240715.123456789012.us-east-1.redshift-serverless.amazonaws.com:5439",<br />         "name":"tpcds_data.public.dws_tpcds_7",<br />         "facets":{<br />             .....<br />        }<br />     }<br />   ]<br />}<br />                                </pre>  | arn:aws:redshift-serverless:us-east-1:123456789012:table/workgroup-20240715/tpcds_data/public/dws_tpcds_7  As per the OpenLineage naming convention, the namespace for an Amazon Redshift dataset should be `provider://{cluster_identifier or workgroup}.{region_name}:{port}`. If the namespace contains `redshift-serverless`, Amazon DataZone uses it to construct the Amazon Redshift ARN using the values of the namespace and name attributes. | 
|  Any other datastore  |  The recommendation is to populate namespace and name per the OpenLineage convention defined in [https://openlineage.io/docs/spec/naming/](https://openlineage.io/docs/spec/naming/).  |  Amazon DataZone populates the sourceIdentifier as <namespace>/<name>.  | 
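As an illustration of the Amazon Redshift rows in the table above, the following Python sketch derives a table ARN from an OpenLineage dataset namespace and name. It is not an official parser: it assumes the namespace host begins with `{identifier}.{account}.{region}.` and that the name is `{database}.{schema}.{table}`, matching the examples shown.

```python
from urllib.parse import urlparse

def redshift_table_arn(namespace: str, name: str) -> str:
    """Derive an Amazon Redshift table ARN from an OpenLineage dataset,
    assuming the host form {identifier}.{account}.{region}.… shown above."""
    host = urlparse(namespace).hostname  # strips the scheme and port
    identifier, account, region = host.split(".")[:3]
    service = ("redshift-serverless"
               if "redshift-serverless" in namespace else "redshift")
    database, schema, table = name.split(".")
    return (f"arn:aws:{service}:{region}:{account}:"
            f"table/{identifier}/{database}/{schema}/{table}")

print(redshift_table_arn(
    "redshift://cluster-20240715.123456789012.us-east-1.redshift.amazonaws.com:5439",
    "tpcds_data.public.dws_tpcds_7"))
# arn:aws:redshift:us-east-1:123456789012:table/cluster-20240715/tpcds_data/public/dws_tpcds_7
```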

# Linking dataset lineage nodes with assets imported into Amazon SageMaker Unified Studio


Every lineage node is uniquely identified by its sourceIdentifier. The previous section describes the formats of sourceIdentifier. Amazon SageMaker Unified Studio automatically links dataset nodes with assets in the inventory based on the sourceIdentifier value. Therefore, use the same sourceIdentifier value as the dataset node when creating or updating the asset (through the AssetCommonDetailsForm::sourceIdentifier attribute).
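For example, the sourceIdentifier can be supplied through the AssetCommonDetailsForm when creating an asset with boto3. The following is a hedged sketch: the Glue table ARN, domain, project, and type identifiers are placeholders, and the client call is shown commented out because it requires AWS credentials and real identifiers.

```python
import json

# The form content is a JSON-encoded string carrying the sourceIdentifier,
# which must match the sourceIdentifier of the dataset lineage node
# (here, an assumed AWS Glue table ARN).
source_identifier = "arn:aws:glue:us-west-2:123456789012:table/testdb/testtb-1"
forms_input = [{
    "formName": "AssetCommonDetailsForm",
    "content": json.dumps({"sourceIdentifier": source_identifier}),
}]

# Creating the asset requires AWS credentials and real identifiers:
# import boto3
# boto3.client("datazone").create_asset(
#     domainIdentifier="dzd_EXAMPLE",
#     owningProjectIdentifier="PROJECT_ID",
#     name="testtb-1",
#     typeIdentifier="ASSET_TYPE_ID",
#     formsInput=forms_input)
```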

The following images show the sourceIdentifier on the asset details page, along with a lineage graph highlighting that the sourceIdentifier of the dataset node matches its downstream asset's sourceIdentifier.

Asset details page:

![\[Asset details page.\]](http://docs.aws.amazon.com/sagemaker-unified-studio/latest/userguide/images/Screenshot1datalineage.png)


Asset’s SourceIdentifier in node details:

![\[Asset’s SourceIdentifier in node details.\]](http://docs.aws.amazon.com/sagemaker-unified-studio/latest/userguide/images/Screenshot2datalineage.png)


Amazon Redshift dataset/table’s sourceIdentifier in node details:

![\[Amazon Redshift dataset/table’s sourceIdentifier in node details.\]](http://docs.aws.amazon.com/sagemaker-unified-studio/latest/userguide/images/Screenshot3datalineage.png)


# Troubleshooting data lineage
Data lineage troubleshooting

This troubleshooting guide helps you resolve common data lineage visibility issues in Amazon SageMaker Unified Studio. It covers programmatically published events, data source configurations, and tool-specific lineage capture problems.

**Topics**
+ [

## Not seeing lineage graph for events published programmatically
](#lineage-troubleshooting-programmatic-events)
+ [

## Not seeing lineage for assets even though importLineage is shown as true in AWS Glue datasource
](#lineage-troubleshooting-glue-datasource)
+ [

## Not seeing lineage for assets even though importLineage is shown as true in Amazon Redshift datasource
](#lineage-troubleshooting-redshift-datasource)
+ [

## Troubleshooting lineage for lineage events published from AWS Glue ETL jobs/vETL/Notebooks
](#lineage-troubleshooting-glue-etl-jobs)
+ [

## Troubleshooting lineage for lineage events published from EMR-S/EC2/EKS
](#lineage-troubleshooting-emr)

## Not seeing lineage graph for events published programmatically


**Primary requirement:** Lineage graphs are only visible in Amazon SageMaker Unified Studio if at least one node of the graph is an asset. You must create assets for any dataset nodes and properly link them using the sourceIdentifier attribute.

**Troubleshooting steps:**

1. Create assets for any of the dataset nodes involved in your lineage events. Refer to the following sections for proper linking:
   + [The importance of the sourceIdentifier attribute to lineage nodes](datazone-data-lineage-sourceIdentifier-attribute.md) 
   + [Linking dataset lineage nodes with assets imported into Amazon SageMaker Unified Studio](datazone-data-lineage-linking-nodes.md)

1. Once the asset is created, verify that you can see the lineage on the asset details page.

1. If you are still not seeing lineage, verify that the asset's sourceIdentifier (present in AssetCommonDetailsForm) has the same value as the sourceIdentifier of any input/output dataset node in the lineage event.

   Use the following command to get asset details:

   ```
   aws datazone get-asset --domain-identifier {DOMAIN_ID} --identifier {ASSET_ID}
   ```

   The response appears as follows:

   ```
   {
       .....
       "formsOutput": [
           ..... 
           {
               "content": "{\"sourceIdentifier\":\"arn:aws:glue:eu-west-1:123456789012:table/testlfdb/testlftb-1\"}",
               "formName": "AssetCommonDetailsForm",
               "typeName": "amazon.datazone.AssetCommonDetailsFormType",
               "typeRevision": "6"
           },
           .....
       ],
       "id": "{ASSET_ID}",
       ....
   }
   ```

1. If both sourceIdentifiers are matching but you still cannot see lineage, retrieve the eventId from the PostLineageEvent response or use ListLineageEvents to find the eventId, then invoke GetLineageEvent:

   ```
   aws datazone list-lineage-events --domain-identifier {DOMAIN_ID}
# You can apply additional filters such as a time range.
# Refer to https://docs.aws.amazon.com/datazone/latest/APIReference/API_ListLineageEvents.html
   
   aws datazone get-lineage-event --domain-identifier {DOMAIN_ID} --identifier {EVENT_ID} --output json event.json
   ```

   The response appears as follows and the open-lineage event is written to the event.json file:

   ```
   {
       "domainId": "{DOMAIN_ID}",
       "id": "{EVENT_ID}",
       "createdBy": ....,
    "processingStatus": SUCCESS | FAILED | ...,
       "eventTime": "2024-05-04T10:15:30+00:00",
       "createdAt": "2025-05-04T22:18:27+00:00"
   }
   ```

1. If the GetLineageEvent response's processingStatus is FAILED, contact AWS Support by providing the GetLineageEvent response for the appropriate event and the response from GetAsset.

1. If the GetLineageEvent response's processingStatus is SUCCESS, double-check that the sourceIdentifier of the dataset node from the lineage event matches the value in the GetAsset response above. The following steps help verify this.

1. Invoke GetLineageNode for the job run, where the identifier is composed of the namespace and name of the job and the run_id from the lineage event:

   ```
   aws datazone get-lineage-node --domain-identifier {DOMAIN_ID} --identifier <job's namespace>.<job's name>/<run_id>
   ```

   The response appears as follows:

   ```
   {
       .....
       "downstreamNodes": [
           {
               "eventTimestamp": "2024-07-24T18:08:55+08:00",
               "id": "afymge5k4v0euf"
           }
       ],
       "formsOutput": [
           <some forms corresponding to run and job>
       ],
       "id": "<system generated node-id for run>",
       "sourceIdentifier": "<job's namespace>.<job's name>/<run_id>",
       "typeName": "amazon.datazone.JobRunLineageNodeType",
       ....
       "upstreamNodes": [
           {
               "eventTimestamp": "2024-07-24T18:08:55+08:00",
               "id": "6wf2z27c8hghev"
           },
           {
               "eventTimestamp": "2024-07-24T18:08:55+08:00",
               "id": "4tjbcsnre6banb"
           }
       ]
   }
   ```

1. Invoke GetLineageNode again by passing in the downstream/upstream node identifier (which you think should be linked to the asset node):

   ```
   aws datazone get-lineage-node --domain-identifier {DOMAIN_ID} --identifier afymge5k4v0euf
   ```

   This returns the lineage node details corresponding to the dataset `afymge5k4v0euf`. Verify whether the sourceIdentifier matches that of the asset. If it does not match, fix the namespace and name of the dataset in the lineage event and publish the lineage event again; the lineage graph then appears on the asset.

   ```
   {
       .....
       "downstreamNodes": [],
       "eventTimestamp": "2024-07-24T18:08:55+08:00",
       "formsOutput": [
           .....
       ],
       "id": "afymge5k4v0euf",
       "sourceIdentifier": "<sample_sourceIdentifier_value>",
       "typeName": "amazon.datazone.DatasetLineageNodeType",
       "typeRevision": "1",
       ....
       "upstreamNodes": [
           ...
       ]
   }
   ```
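Several of the steps above compare the asset's sourceIdentifier (from GetAsset in step 3) with the dataset node's sourceIdentifier (from GetLineageNode in step 8). Because the AssetCommonDetailsForm `content` field is a JSON-encoded string inside the JSON response, it has to be decoded twice. The following Python sketch shows the extraction against an abbreviated response:

```python
import json

def asset_source_identifier(get_asset_response):
    """Pull sourceIdentifier out of the AssetCommonDetailsForm, whose
    'content' field is a JSON-encoded string inside the JSON response."""
    for form in get_asset_response.get("formsOutput", []):
        if form.get("formName") == "AssetCommonDetailsForm":
            return json.loads(form["content"]).get("sourceIdentifier")
    return None

# Abbreviated GetAsset response, as shown earlier in this guide.
response = {
    "formsOutput": [{
        "content": "{\"sourceIdentifier\":\"arn:aws:glue:eu-west-1:123456789012:table/testlfdb/testlftb-1\"}",
        "formName": "AssetCommonDetailsForm",
    }]
}
print(asset_source_identifier(response))
# arn:aws:glue:eu-west-1:123456789012:table/testlfdb/testlftb-1
```

Compare the returned value against the `sourceIdentifier` field of the dataset lineage node; they must match exactly for the asset and the node to be linked.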

## Not seeing lineage for assets even though importLineage is shown as true in AWS Glue datasource


Open the data source run(s) associated with the AWS Glue data source to see the assets imported as part of the run and the lineage import status, along with an error message in case of failure.

**Limitations:**
+ Lineage for a crawler run importing more than 250 tables isn't supported.

## Not seeing lineage for assets even though importLineage is shown as true in Amazon Redshift datasource


Lineage for Amazon Redshift tables is captured by retrieving, from the system tables, the user queries performed on the Amazon Redshift database.

**Lineage is not supported in the following cases:**
+ External tables
+ Unload/copy operations
+ Merge/update operations
+ Queries that produce lineage events larger than 16 MB
+ Datashares
+ Column lineage limitations:
  + Column lineage is not supported for queries that do not reference specific columns (for example, `select * from tableA`)
  + Column lineage is not supported for queries involving temp tables
+ Any limitations of the [OpenLineageSqlParser](https://github.com/OpenLineage/OpenLineage/blob/main/integration/sql/README.md) result in failures to process some queries

**Troubleshooting steps:**

1. On the Amazon Redshift connection details, you will see the lineageJobId along with the job run schedule. Alternatively, you can fetch it using the [get-connection](https://docs.aws.amazon.com/cli/latest/reference/datazone/get-connection.html) API.

1. Invoke [list-job-runs](https://docs.aws.amazon.com/cli/latest/reference/datazone/list-job-runs.html) to get the runs corresponding to the job:

   ```
   aws datazone list-job-runs --domain-identifier {DOMAIN_ID} --job-identifier {JOB_ID}
   ```

   The response appears as follows:

   ```
   {
      "items": [ 
         { 
            .....
            "error": { 
               "message": "string"
            },
            "jobId": {JOB_ID},
            "jobType": LINEAGE,
            "runId": ...,
            "runMode": SCHEDULED,
            "status": SCHEDULED | IN_PROGRESS | SUCCESS | PARTIALLY_SUCCEEDED | FAILED | ABORTED | TIMED_OUT | CANCELED
            .....
         }
      ],
      "nextToken": ...
   }
   ```

1. If no job runs are returned, check your job run schedule on the Amazon Redshift connection details. Reach out to AWS Support with the lineageJobId, connectionId, projectId, and domainId if job runs are not executed per the given schedule.

1. If job-runs are returned, pick the relevant jobRunId and invoke GetJobRun to get job run details:

   ```
   aws datazone get-job-run --domain-identifier {DOMAIN_ID} --identifier {JOB_RUN_ID}
   ```

   The response appears as follows:

   ```
   {
     ....
     "details": {
       "lineageRunDetails": {
         "sqlQueryRunDetails": {
           "totalQueriesProcessed": ..,
           "numQueriesFailed": ...,
           "errorMessages":....,
           "queryEndTime": ...,
           "queryStartTime": ...
         }
       }
     },
     .....
   }
   ```

1. The job run fails if none of the queries are successfully processed, partially succeeds if some queries are successfully processed, and succeeds if all queries are successfully processed. The response also contains the start and end times of the processed queries.

## Troubleshooting lineage for lineage events published from AWS Glue ETL jobs/vETL/Notebooks


**Limitations:**
+ OpenLineage libraries for Spark are built into AWS Glue version 5.0 and later for Spark DataFrames only. AWS Glue DynamicFrames are not supported.
+ Lineage capture for Spark jobs in fine-grained permission mode is not supported.
+ A lineage event has a size limit of 300 KB.

**Common Issues:**
+ Verify necessary permissions are given to your job execution role as per [Permissions required for data lineage](datazone-data-lineage-permissions.md) 
+ A Spark job working with Amazon S3 files produces lineage events with S3 datasets, even when those files are cataloged in AWS Glue. To generate events that include AWS Glue tables and build a proper lineage graph with AWS Glue assets, your Spark job should work with AWS Glue tables instead.
+ If the AWS Glue ETL job is in a VPC, make sure the Amazon DataZone VPC endpoint is deployed in that VPC.
+ In case your domain is using a CMK, make sure that the AWS Glue execution role has the appropriate KMS permissions. The CMK can be found via [https://docs.aws.amazon.com/datazone/latest/APIReference/API_GetDomain.html](https://docs.aws.amazon.com/datazone/latest/APIReference/API_GetDomain.html)
+ Failed to publish a lineage event because the payload is greater than 300 KB:
  + Add the following to the Spark conf:

    ` "spark.openlineage.columnLineage.datasetLineageEnabled": "true" `
  + **Important note:**
    + Column lineage typically constitutes a significant portion of the payload, and enabling this setting generates the column lineage information more efficiently.
    + Disabling column lineage can help reduce the payload size and avoid validation exceptions.
+ Cross account lineage event submission:
  + Follow [https://docs.aws.amazon.com/datazone/latest/userguide/working-with-associated-accounts.html](https://docs.aws.amazon.com/datazone/latest/userguide/working-with-associated-accounts.html) to set up account association
  + Ensure that RAM policy is using the latest policy
+ If your Amazon SageMaker Unified Studio domain is in a different region from the job:
  + Add this Spark parameter: `"spark.openlineage.transport.region":"{region of your domain}"`
+ When the same DataFrame is written to multiple destinations or formats in sequence, the lineage SparkListener may only capture the lineage for the first write operation:
  + For optimization purposes, Spark's internals may reuse execution plan definitions for consecutive write operations on the same DataFrame. This can result in capturing only the first lineage event.

**Troubleshooting steps:**

1. Lineage graph can only be visualized if at least one node of the graph is an asset. Therefore, create assets for any of the datasets (such as tables) involved in the job and then attempt to visualize lineage on the asset.

1. First, invoke [ListLineageEvents](https://docs.aws.amazon.com/datazone/latest/APIReference/API_ListLineageEvents.html) to see if the lineage events were submitted (refer to the linked doc to pass filters).

1. If no events are submitted, check the Amazon CloudWatch logs to see if any exceptions are thrown from the Amazon DataZone lineage library:
   + Log groups: /aws-glue/jobs/error, /aws-glue/sessions/error
   + Make sure logging is enabled: [https://docs.aws.amazon.com/glue/latest/dg/monitor-continuous-logging-enable.html](https://docs.aws.amazon.com/glue/latest/dg/monitor-continuous-logging-enable.html)
   + The following is a CloudWatch Logs Insights query to check for exceptions:

     ```
     fields @timestamp, @message
      | filter @message like /(?i)exception/ and @message like /(?i)datazone/
     | sort @timestamp desc
     ```
   + The following is a CloudWatch Logs Insights query to confirm events are submitted:

     ```
     fields @timestamp, @message
     | filter @message like /Successfully posted a LineageEvent:/
     | sort @timestamp desc
     ```

1. Fetch the lineage events generated from this job by executing the Python script [retrieve_lineage_events.py](https://github.com/aws-samples/amazon-datazone-examples/tree/main/data_lineage).

1. Check if the dataset on which you expected lineage is present in any of the events:
   + You can ignore empty events without any input/output nodes
   + Check whether your dataset node has a Glue ARN prefix in the namespace of the node or in the "symlinks" facet of the node. If you don't see any node with a Glue ARN prefix, your script is not using Glue tables directly, so lineage is not linked to the Glue asset. One way to work around this is to update the script to work with Glue tables.

1. If you are still unable to see lineage and it doesn't fall under the limitations category, reach out to AWS Support by providing:
   + Spark config parameters
   + The lineage events file from executing the retrieve_lineage_events.py script
   + The GetAsset response

## Troubleshooting lineage for lineage events published from EMR-S/EC2/EKS


**Notes:**
+ Lineage is supported from the following versions of Amazon EMR:
  + EMR Serverless: 7.5 and later
  + EMR on EC2: 7.11 and later
  + EMR on EKS: 7.12 and later
+ Lineage capture for Spark jobs in fine-grained permission mode is not supported.
+ If you are using EMR outside of Amazon SageMaker Unified Studio, the Amazon DataZone VPC endpoint needs to be deployed in the EMR VPC.
+ A lineage event has a size limit of 300 KB.

**Common Issues:**
+ Verify necessary permissions are given to your job execution role as per [Permissions required for data lineage](datazone-data-lineage-permissions.md) 
+ In case your domain is using a CMK, make sure that the job's execution role has the appropriate KMS permissions. The CMK can be found via [https://docs.aws.amazon.com/datazone/latest/APIReference/API_GetDomain.html](https://docs.aws.amazon.com/datazone/latest/APIReference/API_GetDomain.html)
+ Failed to publish a lineage event because the payload is greater than 300 KB:
  + Add the following to the Spark conf, which generates the event payload more efficiently:

    ` "spark.openlineage.columnLineage.datasetLineageEnabled": "true" `
+ Cross account lineage event submission:
  + Follow [https://docs.aws.amazon.com/datazone/latest/userguide/working-with-associated-accounts.html](https://docs.aws.amazon.com/datazone/latest/userguide/working-with-associated-accounts.html) to set up account association
  + Ensure that RAM policy is using the latest policy
+ If your Amazon SageMaker Unified Studio domain is in a different region from the job:
  + Add this Spark parameter: `"spark.openlineage.transport.region":"{region of your domain}"`
+ When the same DataFrame is written to multiple destinations or formats in sequence, the lineage SparkListener may only capture the lineage for the first write operation:
  + For optimization purposes, Spark's internals may reuse execution plan definitions for consecutive write operations on the same DataFrame. This can result in capturing only the first lineage event.

**Troubleshooting steps:**

1. Lineage graph can only be visualized if at least one node of the graph is an asset. Therefore, create assets for any of the datasets (such as tables) involved in the job and then attempt to visualize lineage on the asset.

1. First, invoke [ListLineageEvents](https://docs.aws.amazon.com/datazone/latest/APIReference/API_ListLineageEvents.html) to see if the lineage events were submitted (refer to the linked doc to pass filters).

1. If no events are submitted, check the Amazon CloudWatch logs to see if any exceptions are thrown from the Amazon DataZone lineage library:
   + **EC2:**
     + You can provide the CloudWatch log group or an S3 path as the log destination at the time of creating the EC2 cluster. Refer to [https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-debugging.html](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-debugging.html)
     + You will see logs in the stderr file within the cluster-id/containers/application_*/ folder.
   + **EKS:**
     + You need to provide the CloudWatch log group or an S3 path as the log destination while submitting the Spark job. Refer to [https://docs.aws.amazon.com/emr/latest/EMR-on-EKS-DevelopmentGuide/getting-started.html](https://docs.aws.amazon.com/emr/latest/EMR-on-EKS-DevelopmentGuide/getting-started.html)
     + You will see logs in the stderr file of the spark-driver within the virtual-cluster-id/jobs/job-id/containers/* folder.
   + **EMR-S:**
     + You can enable logs by following [https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/logging.html](https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/logging.html)
     + You will see logs in the stderr files of the Spark driver
   + The following is a CloudWatch Logs Insights query to check for exceptions:

     ```
     fields @timestamp, @message
      | filter @message like /(?i)exception/ and @message like /(?i)datazone/
     | sort @timestamp desc
     ```
   + The following is a CloudWatch Logs Insights query to inspect generated events:

     ```
     fields @timestamp, @message
     | filter @message like /Successfully posted a LineageEvent:/
     | sort @timestamp desc
     ```

1. Fetch the lineage events generated from this job by executing the Python script [retrieve_lineage_events.py](https://github.com/aws-samples/amazon-datazone-examples/tree/main/data_lineage).

1. Check if the dataset on which you expected lineage is present in any of the events:
   + You can ignore empty events without any input/output nodes
   + Check whether your dataset node's namespace/name matches the sourceIdentifier of the asset. If you don't see any node with the asset's sourceIdentifier, refer to the following docs on how to fix it:
     + [The importance of the sourceIdentifier attribute to lineage nodes](datazone-data-lineage-sourceIdentifier-attribute.md)
     + [Linking dataset lineage nodes with assets imported into Amazon SageMaker Unified Studio](datazone-data-lineage-linking-nodes.md)

1. If you are still unable to see lineage and it doesn't fall under the limitations category, reach out to AWS Support by providing:
   + Spark config parameters
   + The lineage events file from executing the retrieve_lineage_events.py script
   + The GetAsset response