

# Data inventory and publishing


This section describes how to create an inventory of your data in Amazon SageMaker Unified Studio and how to publish that data.

To use Amazon SageMaker Unified Studio to catalog your data, you must first bring your data (assets) into your project’s inventory in Amazon SageMaker Unified Studio. Creating an inventory for a particular project makes the assets discoverable only to that project’s members. Project inventory assets are not available to all domain users in search or browse until they are published to the Amazon SageMaker Catalog. After creating a project inventory, data owners can curate their inventory assets with the required business metadata by adding or updating business names (asset and schema), descriptions (asset and schema), READMEs, glossary terms (asset and schema), and metadata forms. 

The next step in using Amazon SageMaker Unified Studio to catalog your data is to make your project’s inventory assets discoverable by domain users. You can do this by publishing the inventory assets to the Amazon SageMaker Unified Studio catalog. Only the latest version of an inventory asset can be published to the catalog, and only the latest published version is active in the discovery catalog. If an inventory asset is updated after it's been published to the Amazon SageMaker Unified Studio catalog, you must publish it again for the latest version to appear in the discovery catalog. 

For more information, see [Amazon SageMaker Unified Studio terminology and concepts](concepts.md).

**Topics**
+ [Configure Lake Formation permissions for Amazon SageMaker Unified Studio](lake-formation-permissions-for-amazon-sagemaker-unified-studio.md)
+ [Create custom asset types in Amazon SageMaker Unified Studio](create-asset-types.md)
+ [Create an Amazon SageMaker Unified Studio data source for AWS Glue in the project catalog](data-source-glue.md)
+ [Create an Amazon SageMaker Unified Studio data source for Amazon Redshift in the project catalog](create-redshift-data-source.md)
+ [Create an Amazon SageMaker Unified Studio data source for Amazon SageMaker AI in the project catalog](create-sagemaker-data-source.md)
+ [Edit a data source in Amazon SageMaker Unified Studio](editing-a-data-source.md)
+ [Delete a data source in Amazon SageMaker Unified Studio](removing-a-data-source.md)
+ [Publish assets to the Amazon SageMaker Unified Studio catalog from the project inventory](publishing-data-asset.md)
+ [Share assets](share-assets.md)
+ [Manage inventory and curate assets in Amazon SageMaker Unified Studio](update-metadata.md)
+ [Manually create an asset in Amazon SageMaker Unified Studio](create-data-asset-manually.md)
+ [Unpublish an asset from the Amazon SageMaker Catalog](archive-data-asset.md)
+ [Delete an Amazon SageMaker Unified Studio asset](delete-data-asset.md)
+ [Manually start a data source run in Amazon SageMaker Unified Studio](manually-start-data-source-run.md)
+ [Asset revisions in Amazon SageMaker Unified Studio](asset-versioning.md)
+ [Data quality in Amazon SageMaker Unified Studio](data-quality.md)
+ [Using machine learning and generative AI in Amazon SageMaker Unified Studio](autodoc.md)
+ [Data lineage in Amazon SageMaker Unified Studio](datazone-data-lineage.md)
+ [Analyze Amazon SageMaker Unified Studio data with external analytics applications via JDBC connection](query-with-jdbc.md)
+ [Metadata enforcement rules for publishing](metadata-rules-publishing.md)

# Configure Lake Formation permissions for Amazon SageMaker Unified Studio


When you create a project in Amazon SageMaker Unified Studio, an AWS Glue database is added as part of this project. If you want to publish assets from this AWS Glue database, no additional permissions are needed.

However, if you want to publish assets and subscribe to assets from an AWS Glue database that exists outside of your Amazon SageMaker Unified Studio project, you must explicitly provide your project with the permissions to access tables in the external AWS Glue database. To do this, you must complete the following configuration in AWS Lake Formation and attach the necessary AWS Lake Formation permissions to the project's IAM role.
+ Configure the Amazon S3 location for your data lake in AWS Lake Formation with **Lake Formation** permission mode or **Hybrid access mode**. For more information, see [https://docs.aws.amazon.com/lake-formation/latest/dg/register-data-lake.html](https://docs.aws.amazon.com/lake-formation/latest/dg/register-data-lake.html).
+ Attach the following AWS Lake Formation permissions to the AWS Glue manage access role (see the example after this list):
  + `Describe` and `Describe grantable` permissions on the database where the tables exist.
  + `Describe`, `Select`, `Describe grantable`, and `Select grantable` permissions on all the tables in that database that you want Amazon DataZone to manage access to on your behalf.
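
For example, you might grant these permissions with the AWS CLI. The following is a minimal sketch; the role, database, and table names are placeholder values, not resources from this guide:

```
# Grant Describe (with grant option) on the external database to the
# AWS Glue manage access role. The role ARN and database name are placeholders.
aws lakeformation grant-permissions \
  --principal DataLakePrincipalIdentifier=arn:aws:iam::123456789012:role/my-glue-manage-access-role \
  --resource '{"Database": {"Name": "external_db"}}' \
  --permissions DESCRIBE \
  --permissions-with-grant-option DESCRIBE

# Grant Describe and Select (with grant options) on a table in that database.
aws lakeformation grant-permissions \
  --principal DataLakePrincipalIdentifier=arn:aws:iam::123456789012:role/my-glue-manage-access-role \
  --resource '{"Table": {"DatabaseName": "external_db", "Name": "my_table"}}' \
  --permissions DESCRIBE SELECT \
  --permissions-with-grant-option DESCRIBE SELECT
```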

**Note**  
Amazon SageMaker Unified Studio supports AWS Lake Formation hybrid mode. Lake Formation hybrid mode enables you to start managing permissions on your AWS Glue databases and tables through Lake Formation, while continuing to maintain any existing IAM permissions on these tables and databases. 

# Create custom asset types in Amazon SageMaker Unified Studio

In Amazon SageMaker Unified Studio, assets represent specific types of data resources such as database tables, dashboards, or machine learning models. To provide consistency and standardization when describing catalog assets, an Amazon SageMaker Unified Studio domain must have a set of asset types that define how assets are represented in the catalog. An asset type defines the schema for a specific type of asset, and consists of a set of named metadata form types that can be required or optional. Asset types in Amazon SageMaker Unified Studio are versioned. When assets are created, they are validated against the schema defined by their asset type (typically the latest version), and if an invalid structure is specified, asset creation fails. 

**System asset types** - Amazon SageMaker Unified Studio provisions service-owned system asset types. System asset types cannot be altered. Amazon SageMaker Unified Studio includes the following system asset types:
+ Amazon Bedrock chat app
+ Amazon Bedrock flow app
+ Amazon Bedrock inference only
+ Amazon Bedrock model
+ Amazon Bedrock prompt
+ Databricks table
+ Databricks view
+ AWS Glue table
+ AWS Glue view
+ Amazon Redshift table
+ Amazon Redshift view
+ Amazon S3 object collection
+ SageMaker feature group
+ SageMaker model package group
+ Snowflake table
+ Snowflake view
+ Data product

**Custom asset types** - To create custom asset types, you start by creating the required metadata form types and glossaries to use in the form types. You can then create custom asset types by specifying a name, description, and associated metadata forms that can be required or optional. 

For asset types with structured data, to represent the column schema in Amazon SageMaker Unified Studio, you can use the `RelationalTableFormType` to add the technical metadata to your columns, including column names, descriptions, and data types, and the `ColumnBusinessMetadataForm` to add the business descriptions of the columns, including business names, glossary terms, and custom key value pairs. 

To create a custom asset type in Amazon SageMaker Unified Studio, complete the following steps:

1. Navigate to Amazon SageMaker Unified Studio using the URL from your admin and log in using your SSO or AWS credentials. 

1. Choose **Select project** from the top navigation pane and select the project where you want to create a custom asset type.

1. Navigate to the **Discover** menu in the top navigation.

1. Choose **Data catalog**.

1. Choose **View asset types**.

1. Choose **Create asset type**.

1. Specify the following:
   + **Name** - the name of the custom asset type 
   + **Description** - the description of the custom asset type.
   + Choose **Add metadata form** to add metadata forms to this custom asset type.
   + Under **Usage permission**, restrict access by specifying which projects or domain units are authorized to use this asset type. 
**Note**  
You must be a domain unit owner or a project owner in order to modify usage permissions. Project contributors can view usage permissions but cannot edit them.

     You can choose the following:
     + **All projects** - give permissions to all projects in this domain
     + **Owning project** - give permissions only to the owning project
     + **Selected projects or domain units** - give permissions to specific projects and/or domain units

        If you select this option, choose **Add usage permission**, and in the **Add projects and designations** pop-up window, specify the authorized projects (you can choose **Select projects in a domain unit** or **All projects in a domain unit**), the specific domain unit, and the allowed designations - the designations that a project member must have to use this policy. You can choose **Owner** or **Contributor**.

1. Choose **Create**. After the custom asset type is created, you can use it to create assets.
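
You can also create a custom asset type programmatically by invoking the `CreateAssetType` API action or the `create-asset-type` CLI command. The following is a minimal sketch; the domain and project identifiers and the form type name are placeholder values:

```
aws datazone create-asset-type \
  --domain-identifier dzd_example123 \
  --owning-project-identifier project123 \
  --name "MyCustomAssetType" \
  --description "Custom asset type for internal dashboards" \
  --forms-input '{"MyMetadataForm": {"typeIdentifier": "MyMetadataFormType", "typeRevision": "1", "required": true}}'
```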

# Create an Amazon SageMaker Unified Studio data source for AWS Glue in the project catalog

In Amazon SageMaker Unified Studio, you can create an AWS Glue Data Catalog data source in order to import technical metadata of database tables from AWS Glue. To add a data source for the AWS Glue Data Catalog, the source database must already exist in AWS Glue. Your Amazon SageMaker Unified Studio project’s IAM role also needs certain permissions to be able to create a data source, as described in the section [Configure Lake Formation permissions for Amazon SageMaker Unified Studio](lake-formation-permissions-for-amazon-sagemaker-unified-studio.md).

When you create and run an AWS Glue data source, you add assets from the source AWS Glue database to your Amazon SageMaker Unified Studio project's inventory. You can run your AWS Glue data sources on a set schedule or on demand to create or update your assets' technical metadata. During the data source runs, you can optionally choose to publish your assets to the Amazon SageMaker Unified Studio catalog and thus make them discoverable by all domain users. You can also publish your project inventory assets after editing their business metadata. Domain users can search for and discover your published assets, and request subscriptions to these assets. 

**Note**  
Adding a data source in the project catalog makes it possible to publish that data into the Amazon SageMaker Catalog. To add a data source for analyzing and editing within your project, use the **Data** page of your project. Data that you add to or connect to on the **Data** page can also be published to the Amazon SageMaker Catalog. For more information, see [The lakehouse architecture of Amazon SageMaker](lakehouse.md).

**To create an AWS Glue data source**

1. Navigate to Amazon SageMaker Unified Studio using the URL from your admin and log in using your SSO or AWS credentials. 

1. Choose **Select project** from the top navigation pane and select the project to which you want to add the data source.

1. Choose **Data sources** from the left navigation pane under **Project catalog**.

1. Choose **Create data source**.

1. Configure the following fields:
   + **Name** – The data source name.
   + **Description** – The data source description.

1. Under **Data source type**, choose **AWS Glue**.

1. (Optional) Under **Connection**, select **Import data lineage** if you want to import lineage for the data sources that use the connection.

1. Under **Data selection**, provide a catalog, one or more database names, and selection criteria for tables. For example, if you choose **Include** and enter `*corporate`, the data source includes all source tables whose names end with `corporate`.

   You can either choose an AWS Glue catalog from the dropdown or type a catalog name. The dropdown includes the default AWS Glue catalog for the connection account. 

   You can add multiple include and exclude rules for tables. You can also add multiple databases using the **Add another database** button.

1. Choose **Next**.

1. For **Publishing settings**, choose whether assets are immediately discoverable in the Amazon SageMaker Catalog. If you only add them to the inventory, you can choose subscription terms later and then publish them to the Amazon SageMaker Catalog. 

1. For **Metadata generation methods**, choose whether to automatically generate metadata for assets as they're imported from the source.

1. Under **Data quality**, you can choose to **Enable data quality for this data source**. If you do this, Amazon SageMaker Unified Studio imports your existing AWS Glue data quality output into your Amazon SageMaker Unified Studio catalog. By default, Amazon SageMaker Unified Studio imports the latest existing 100 quality reports with no expiration date from AWS Glue.

   Data quality metrics in Amazon SageMaker Unified Studio help you understand the completeness and accuracy of your data sources. Amazon SageMaker Unified Studio pulls these data quality metrics from AWS Glue in order to provide context during a point in time, for example, during a business data catalog search. Data users can see how data quality metrics change over time for their subscribed assets. Data producers can ingest AWS Glue data quality scores on a schedule. The Amazon SageMaker Unified Studio business data catalog can also display data quality metrics from third-party systems through data quality APIs. 

1. (Optional) For **Metadata forms**, add forms to define the metadata that is collected and saved when the assets are imported into Amazon SageMaker Unified Studio. For more information, see [Create a metadata form in Amazon SageMaker Unified Studio](create-metadata-form.md).

1. Choose **Next**.

1. For **Run preference**, choose when to run the data source.
   + **Run on a schedule** – Specify the dates and time to run the data source.
   + **Run on demand** – You can manually initiate data source runs.

1. Choose **Next**.

1. Review your data source configuration and choose **Create**.

You can also create an Amazon SageMaker Unified Studio data source for AWS Glue by invoking the `CreateDataSource` API action or the `create-data-source` CLI command:

```
aws datazone create-data-source --cli-input-json file://create-sagemaker-datasource-example.json
```

Sample payload (`create-sagemaker-datasource-example.json` in the example above) for creating an AWS Glue data source in an Amazon DataZone domain:

```
{
  "name": "my-data-source",
  "projectIdentifier": "project123",
  "type": "GLUE",
  "description": "Description of the datasource",
  "environmentIdentifier": "environment123",
  "configuration": {
    "glueRunConfiguration": {
        "autoImportDataQualityResult": true,
        "relationalFilterConfigurations": [{
            "databaseName": "my_database",
            "filterExpressions": [{
                "expression": "*",
                "type": "INCLUDE"
            }]
        }]
    }
  },
  "recommendation": {
    "enableBusinessNameGeneration": true
  },
  "enableSetting": "ENABLED",
  "schedule": {
    "timezone": "UTC",
    "schedule": "cron(7 22 * * ? *)"
  },
  "publishOnImport": true,
  "assetFormsInput": [
    {
      "formName": "AssetCommonDetailsForm",
      "typeIdentifier": "amazon.datazone.AssetCommonDetailsFormType",
      "typeRevision": "3",
      "content": "form-content",
      "renderingConfig": {
        "collapse": true
      }
    }
  ],
  "clientToken": "123456"
}
```

Sample payload (`create-sagemaker-datasource-example.json` in the example above) for creating an AWS Glue data source in an Amazon SageMaker unified domain:

```
{
  "name": "my-data-source",
  "projectIdentifier": "project123",
  "type": "GLUE",
  "description": "Description of the datasource",
  "connectionIdentifier": "connection123",
  "configuration": {
    "glueRunConfiguration": {
        "catalogName": "my_catalog",
        "autoImportDataQualityResult": true,
        "relationalFilterConfigurations": [{
            "databaseName": "my_database",
            "filterExpressions": [{
                "expression": "*",
                "type": "INCLUDE"
            }]
        }]
    }
  },
  "recommendation": {
    "enableBusinessNameGeneration": true
  },
  "enableSetting": "ENABLED",
  "schedule": {
    "timezone": "UTC",
    "schedule": "cron(7 22 * * ? *)"
  },
  "publishOnImport": true,
  "assetFormsInput": [
    {
      "formName": "AssetCommonDetailsForm",
      "typeIdentifier": "amazon.datazone.AssetCommonDetailsFormType",
      "typeRevision": "3",
      "content": "form-content"
    }
  ],
  "clientToken": "123456"
}
```

# Create an Amazon SageMaker Unified Studio data source for Amazon Redshift in the project catalog

In Amazon SageMaker Unified Studio, you can create an Amazon Redshift data source in order to import technical metadata of database tables and views from an Amazon Redshift data warehouse. To add an Amazon SageMaker Unified Studio data source for Amazon Redshift, the source data warehouse must already exist in Amazon Redshift.

When you create and run an Amazon Redshift data source, you add assets from the source Amazon Redshift data warehouse to your Amazon SageMaker Unified Studio project's inventory. You can run your Amazon Redshift data sources on a set schedule or on demand to create or update your assets' technical metadata. During the data source runs, you can optionally choose to publish your project inventory assets to the Amazon SageMaker Unified Studio catalog and thus make them discoverable by all domain users. You can also publish your inventory assets after editing their business metadata. Domain users can search for and discover your published assets and request subscriptions to these assets.

**Note**  
Adding a data source in the project catalog makes it possible to publish that data into the Amazon SageMaker Catalog. To add a data source for analyzing and editing within your project, use the **Data** page of your project. Data that you add to or connect to on the **Data** page can also be published to the Amazon SageMaker Catalog. For more information, see [The lakehouse architecture of Amazon SageMaker](lakehouse.md).

**To add an Amazon Redshift data source**

1. Navigate to Amazon SageMaker Unified Studio using the URL from your admin and log in using your SSO or AWS credentials. 

1. Choose **Select project** from the top navigation pane and select the project to which you want to add the data source.

1. Choose **Data sources** from the left navigation pane under **Project catalog**.

1. Choose **Create data source**.

1. Configure the following fields:
   + **Name** – The data source name.
   + **Description** – The data source description.

1. Under **Data source type**, choose **Amazon Redshift**.

1. Under **Connection**, select a connection for your data source. The connection cannot be changed after the data source is created.

1. Under **Data selection**, provide an Amazon Redshift database schema name and enter your table or view selection criteria. For example, if you choose **Include** and enter `*corporate`, the data source includes all source tables whose names end with `corporate`.

   You can add multiple include rules. You can also add another schema using the **Add another schema** button.

1. Choose **Next**.

1. For **Publishing settings**, choose whether assets are immediately discoverable in Amazon SageMaker Catalog. If you only add them to the inventory, you can choose subscription terms later and then publish them to the Amazon SageMaker Catalog. 

1. For **Metadata generation methods**, choose whether to automatically generate metadata for assets as they're published and updated from the source.

1. (Optional) For **Metadata forms**, add forms to define the metadata that is collected and saved when the assets are imported into Amazon SageMaker Unified Studio. For more information, see [Create a metadata form in Amazon SageMaker Unified Studio](create-metadata-form.md).

1. Choose **Next**.

1. For **Run preference**, choose when to run the data source.
   + **Run on a schedule** – Specify the dates and time to run the data source.
   + **Run on demand** – You can manually initiate data source runs.

1. Choose **Next**.

1. Review your data source configuration and choose **Create**.

You can also create an Amazon SageMaker Unified Studio data source for Amazon Redshift by invoking the `CreateDataSource` API action or the `create-data-source` CLI command:

```
aws datazone create-data-source --cli-input-json file://create-sagemaker-datasource-example.json
```

Sample payload (`create-sagemaker-datasource-example.json` in the example above) for creating an Amazon Redshift data source in an Amazon DataZone domain:

```
{
  "name": "my-data-source",
  "projectIdentifier": "project123",
  "type": "REDSHIFT",
  "description": "Description of the datasource",
  "environmentIdentifier": "environment123",
  "configuration": {
    "redshiftRunConfiguration": {
        "dataAccessRole": "arn:aws:iam::123456789012:role/my-data-access-role",
        "redshiftCredentialConfiguration": {
            "secretManagerArn": "arn:aws:secretsmanager:us-east-1:123456789012:secret:my-secret"
        },
        "redshiftStorage": {
            "redshiftClusterSource": {
                "clusterName": "my-redshift-cluster"
            }
        },
        "relationalFilterConfigurations": [{
            "databaseName": "my_database",
            "filterExpressions": [{
                "expression": "*",
                "type": "INCLUDE"
            }],
            "schemaName": "my_schema"
        }]
    }
  },
  "recommendation": {
    "enableBusinessNameGeneration": true
  },
  "enableSetting": "ENABLED",
  "schedule": {
    "timezone": "UTC",
    "schedule": "cron(7 22 * * ? *)"
  },
  "publishOnImport": true,
  "assetFormsInput": [
    {
      "formName": "AssetCommonDetailsForm",
      "typeIdentifier": "amazon.datazone.AssetCommonDetailsFormType",
      "typeRevision": "3",
      "content": "form-content"
    }
  ],
  "clientToken": "123456"
}
```

Sample payload (`create-sagemaker-datasource-example.json` in the example above) for creating an Amazon Redshift data source in an Amazon SageMaker unified domain:

```
{
  "name": "my-data-source",
  "projectIdentifier": "project123",
  "type": "REDSHIFT",
  "description": "Description of the datasource",
  "connectionIdentifier": "connection123",
  "configuration": {
    "redshiftRunConfiguration": {
        "dataAccessRole": "arn:aws:iam::123456789012:role/my-data-access-role",
        "redshiftCredentialConfiguration": {
            "secretManagerArn": "arn:aws:secretsmanager:us-east-1:123456789012:secret:my-secret"
        },
        "redshiftStorage": {
            "redshiftClusterSource": {
                "clusterName": "my-redshift-cluster"
            }
        },
        "relationalFilterConfigurations": [{
            "databaseName": "my_database",
            "filterExpressions": [{
                "expression": "*",
                "type": "INCLUDE"
            }],
            "schemaName": "my_schema"
        }]
    }
  },
  "recommendation": {
    "enableBusinessNameGeneration": true
  },
  "enableSetting": "ENABLED",
  "schedule": {
    "timezone": "UTC",
    "schedule": "cron(7 22 * * ? *)"
  },
  "publishOnImport": true,
  "assetFormsInput": [
    {
      "formName": "AssetCommonDetailsForm",
      "typeIdentifier": "amazon.datazone.AssetCommonDetailsFormType",
      "typeRevision": "3",
      "content": "form-content"
    }
  ],
  "clientToken": "123456"
}
```

# Create an Amazon SageMaker Unified Studio data source for Amazon SageMaker AI in the project catalog

In the current release of Amazon SageMaker Unified Studio, creating an Amazon SageMaker AI data source is not supported via the UI and can only be done by invoking API or CLI actions.

In order to create a data source for Amazon SageMaker AI in Amazon SageMaker Unified Studio, you must first create an AWS RAM share between Amazon SageMaker and Amazon DataZone. This RAM share is necessary for Amazon SageMaker to successfully make the Amazon DataZone API calls that are needed for various membership and security checks.

If you're using the Amazon DataZone domain, you can complete this step by [adding Amazon SageMaker as a trusted service in your Amazon DataZone domain](https://docs.aws.amazon.com/datazone/latest/userguide/add-sagemaker-as-trusted-service-associate.html).

If you're using an Amazon SageMaker unified domain, you can do this by completing the following procedure:

1. Navigate to the RAM console at [https://us-east-1.console.aws.amazon.com/ram/home](https://us-east-1.console.aws.amazon.com/ram/home).

1. Choose **Create resource share**.

1. For the resource share name, enter a unique name. For example, `DataZone-<DataZone DomainId>-SageMaker`.

1. Under **Resources**, choose **DataZone Domains** from the dropdown, select the Amazon DataZone domain from the list, and then choose **Next**.

1. From the **Managed permissions** dropdown, choose **AWSRAMSageMakerServicePrincipalPermissionAmazonDataZoneDomain** and then choose **Next**.

1. Under **Principals**, from the **Select principal type** dropdown, choose **Service principal**.

1. Enter `sagemaker.amazonaws.com` for the service principal name, choose **Add**, and then choose **Next**.

1. Choose **Create resource share**.

Once this is completed, you can invoke the `CreateDataSource` API action or the `create-data-source` CLI action to create a new data source for Amazon SageMaker AI in Amazon SageMaker Unified Studio. 

```
aws datazone create-data-source --cli-input-json file://create-sagemaker-datasource-example.json
```

Sample payload (`create-sagemaker-datasource-example.json` in the example above) for creating an Amazon SageMaker AI data source in an Amazon DataZone domain:

```
{
  "name": "my-data-source",
  "projectIdentifier": "project123",
  "type": "SAGEMAKER",
  "description": "Description of the datasource",
  "environmentIdentifier": "environment123",
  "configuration": {
    "sageMakerRunConfiguration": {
        "trackingAssets": {
            "SageMakerModelPackageGroupAssetType": [
                "arn:aws:sagemaker:us-east-1:123456789012:model-package-group/my-model-package-group"
            ],
            "SageMakerFeatureGroupAssetType": [
                "arn:aws:sagemaker:us-east-1:123456789012:feature-group/my-feature-group"
            ]
        }
    }
  },
  "recommendation": {
    "enableBusinessNameGeneration": true
  },
  "enableSetting": "ENABLED",
  "schedule": {
    "timezone": "UTC",
    "schedule": "cron(7 22 * * ? *)"
  },
  "publishOnImport": true,
  "assetFormsInput": [
    {
      "formName": "AssetCommonDetailsForm",
      "typeIdentifier": "amazon.datazone.AssetCommonDetailsFormType",
      "typeRevision": "3",
      "content": "form-content"
    }
  ],
  "clientToken": "123456"
}
```

Sample payload (`create-sagemaker-datasource-example.json` in the example above) for creating an Amazon SageMaker AI data source in an Amazon SageMaker unified domain:

```
{
  "name": "my-data-source",
  "projectIdentifier": "project123",
  "type": "SAGEMAKER",
  "description": "Description of the datasource",
  "connectionIdentifier": "connection123",
  "configuration": {
    "sageMakerRunConfiguration": {
        "trackingAssets": {
            "SageMakerModelPackageGroupAssetType": [
                "arn:aws:sagemaker:us-east-1:123456789012:model-package-group/my-model-package-group"
            ]
        }
    }
  },
  "recommendation": {
    "enableBusinessNameGeneration": true
  },
  "enableSetting": "ENABLED",
  "schedule": {
    "timezone": "UTC",
    "schedule": "cron(7 22 * * ? *)"
  },
  "publishOnImport": true,
  "assetFormsInput": [
    {
      "formName": "AssetCommonDetailsForm",
      "typeIdentifier": "amazon.datazone.AssetCommonDetailsFormType",
      "typeRevision": "3",
      "content": "form-content"
    }
  ],
  "clientToken": "123456"
}
```

# Edit a data source in Amazon SageMaker Unified Studio

After you create an Amazon SageMaker Unified Studio data source, you can modify it to change the source details or the data selection criteria. When you no longer need a data source, you can delete it.

**To edit a data source in the project catalog**

1. Navigate to Amazon SageMaker Unified Studio using the URL from your admin and log in using your SSO or AWS credentials. 

1. Choose **Select project** from the top navigation pane and select the project that contains the data source that you want to edit.

1. Choose **Data sources** from the left navigation pane under **Project catalog**.

1. Choose the data source that you want to modify.

1. Expand the **Actions** menu, then choose **Edit data source**.

1. Make your changes to the data source fields as desired, then choose **Save**.

# Delete a data source in Amazon SageMaker Unified Studio

When you no longer need an Amazon SageMaker Unified Studio data source, you can remove it permanently. After you delete a data source, all assets that originated from that data source are still available in the catalog, and users can still subscribe to them. However, the assets stop receiving updates from the source. We recommend that you move any dependent assets to a different data source before you delete it.

**To delete a data source in the project catalog**

1. Navigate to Amazon SageMaker Unified Studio using the URL from your admin and log in using your SSO or AWS credentials. 

1. Choose **Select project** from the top navigation pane and select the project that contains the data source that you want to delete.

1. Choose **Data sources** from the left navigation pane under **Project catalog**.

1. Choose the data source that you want to delete.

1. Expand the **Actions** menu, then choose **Delete data source**.

1. To confirm deletion, type `delete` in the text entry field. Then choose **Delete**.
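
You can also delete a data source programmatically by invoking the `DeleteDataSource` API action or the `delete-data-source` CLI command; a minimal sketch with placeholder identifiers:

```
aws datazone delete-data-source \
  --domain-identifier dzd_example123 \
  --identifier datasource123
```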

# Publish assets to the Amazon SageMaker Unified Studio catalog from the project inventory

You can publish Amazon SageMaker Unified Studio assets and their metadata from project inventories into the Amazon SageMaker Unified Studio catalog. You can only publish the most recent version of an asset to the catalog.

Consider the following when publishing assets to the catalog:
+ To publish an asset to the catalog, you must be the owner or contributor of the project that contains the asset.
+ For Amazon Redshift assets, ensure that the Amazon Redshift clusters associated with both the publisher and the subscriber meet all the requirements for Amazon Redshift data sharing in order for Amazon SageMaker Unified Studio to manage access for Redshift tables and views. See [Data sharing concepts for Amazon Redshift](https://docs.aws.amazon.com/redshift/latest/dg/concepts.html).

## Publish an asset in Amazon SageMaker Unified Studio

If you didn't choose to make assets immediately discoverable in the data catalog when you created a data source, perform the following steps to publish them later.

**To publish an asset**

1. Navigate to Amazon SageMaker Unified Studio using the URL from your admin and log in using your SSO or AWS credentials. 

1. Choose **Select project** from the top navigation pane and select the project to which the asset belongs.

1. Under **Project catalog** in the left side navigation, choose **Assets**.

1. Make sure you are on the **Inventory** tab, then choose the name of the asset that you want to publish. You are then brought to the asset details page.
**Note**  
By default, all assets require subscription approval, which means a data owner must approve all subscription requests to the asset. If you want to change this setting before publishing the asset, open the asset details and choose **Edit** next to **Subscription approval**. You can change this setting later by modifying and re-publishing the asset.

1. Choose **Publish asset**. The asset is directly published to the catalog.

   If you make changes to the asset, such as modifying its approval requirements, you can choose **Re-publish asset** to publish the updates to the catalog.
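
You can also publish an inventory asset programmatically by invoking the `CreateListingChangeSet` API action with the `PUBLISH` action. The following is a minimal sketch; the domain and asset identifiers are placeholder values:

```
aws datazone create-listing-change-set \
  --domain-identifier dzd_example123 \
  --entity-identifier asset123 \
  --entity-type ASSET \
  --action PUBLISH
```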

# Share assets


In the current release of Amazon SageMaker Unified Studio, you can share your Amazon S3 assets, AWS Glue (SageMaker Lakehouse) assets, and your Amazon QuickSight assets with other projects or users/groups.

For more information about sharing your Amazon QuickSight assets, see [Share Amazon QuickSight dashboards](https://docs.aws.amazon.com/sagemaker-unified-studio/latest/userguide/share-qs-dashboard.html).

For more information about sharing your Amazon S3 assets, see [Sharing Amazon S3 data](data-s3-publish.md).

To share your AWS Glue (SageMaker Lakehouse) data, complete the following procedure:

1. Navigate to Amazon SageMaker Unified Studio using the URL from your admin and log in using your SSO or AWS credentials. 

1. From the top center menu, choose **Browse all projects**.

1. Select the name of the project to navigate to that project. You can select either one of the projects that you created manually or the project that was automatically created when you onboarded your Amazon SageMaker Lakehouse data.

1. Choose the **Data** tab, then choose the catalog that you want to work with under **Lakehouse** and navigate down to the database and the table asset that you want to share. 

1. Choose the asset that you want to share, then expand **Actions**, and choose **Share**.

1. In the **Share table** window, specify the project with which you want to share this asset and then choose **Share**.

**Note**  
In the current release of Amazon SageMaker Unified Studio, sharing row and column filters is not supported.

# Manage inventory and curate assets in Amazon SageMaker Unified Studio

In order to use Amazon SageMaker Unified Studio to catalog your data, you must first bring your data (assets) into your project’s inventory in Amazon SageMaker Unified Studio. Creating inventory for a particular project makes the assets discoverable only to that project’s members. 

After the assets are created in project inventory, their metadata can be curated. For example, you can edit the asset's name, description, or README. Each edit to the asset creates a new version of the asset. You can use the History tab on the asset's details page to view all asset versions. 

You can edit the **README** section and add rich descriptions for the asset. The **README** section supports markdown, thus enabling you to format your descriptions as required and describe key information about an asset to consumers. 

Glossary terms can be added at the asset level by filling out available forms. 

To curate the schema, you can review the columns, add business names, descriptions, and add glossary terms at column level. 

**To update the schema of an asset**

1. Navigate to Amazon SageMaker Unified Studio using the URL from your admin and log in using your SSO or AWS credentials. 

1. Choose **Select project** from the top navigation pane and select the project to which the asset belongs.

1. Under **Project catalog** in the left side navigation, choose **Assets**.

1. Make sure you are on the **Inventory** tab, then choose the name of the asset that you want to curate. You are then brought to the asset details page.

1. Choose the **Schema** tab and then on the schema details page, choose the **View/Edit** link of the column that you'd like to curate.

   In the right-hand pane that opens, you can edit the details, ReadMe, glossary terms, and metadata forms of the column.

If automated metadata generation is enabled when the data source is created, the business names for assets and columns are available to review and accept or reject individually or all at once. 

You can also edit the subscription terms to specify if approval for the asset is required or not. 

Metadata forms in Amazon SageMaker Unified Studio enable you to extend a data asset's metadata model by adding custom-defined attributes (for example, sales region, sales year, and sales quarter). The metadata forms that are attached to an asset type are applied to all assets created from that asset type. You can also add additional metadata forms to individual assets as part of the data source run or after it's created. For creating new forms, see [Create a metadata form in Amazon SageMaker Unified Studio](create-metadata-form.md). 

To update the metadata of an asset, you must be the owner or the contributor of the project to which the asset belongs.

**To update the metadata of an asset**

1. Navigate to Amazon SageMaker Unified Studio using the URL from your admin and log in using your SSO or AWS credentials. 

1. Choose **Select project** from the top navigation pane and select the project to which the asset belongs.

1. Under **Project catalog** in the left side navigation, choose **Assets**.

1. Make sure you are on the **Inventory** tab, then choose the name of the asset that you want to update. You are then brought to the asset details page.

1. On the asset details page, under **Metadata forms**, choose **Edit values** to edit the existing forms as needed, or choose **Add metadata form** and enter values for each of the metadata fields to attach additional metadata forms to the asset. 

1. When you're done making updates, choose **Save**.

   When you save the form, Amazon SageMaker Unified Studio generates a new inventory version of the asset. To publish the updated version to the catalog, choose **Re-publish asset**.

By default, metadata forms attached to a domain are attached to all assets published to that domain. Data publishers can associate additional metadata forms to individual assets in order to provide additional context.

When you are satisfied with the asset curation, the data owner can publish an asset version to the Amazon SageMaker Unified Studio catalog and thus make it discoverable by all domain users. The asset in the project shows the inventory version and the published version. In the discovery catalog, only the latest published version appears. If the metadata is updated after publishing, then a new inventory version will be available for publishing to the catalog. 
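
Curation can also be scripted. The following is a minimal sketch of creating a new asset revision with updated metadata form values through the `create-asset-revision` CLI command; the identifiers, form names, and form content are placeholder values:

```
aws datazone create-asset-revision \
  --domain-identifier dzd_example123 \
  --identifier asset123 \
  --name "my-asset" \
  --forms-input '[{"formName": "MyMetadataForm", "typeIdentifier": "MyMetadataFormType", "content": "{\"salesRegion\": \"EMEA\"}"}]'
```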

# Manually create an asset in Amazon SageMaker Unified Studio

In Amazon SageMaker Unified Studio, an asset is an entity that represents a single physical data object (for example, a table, a dashboard, a file) or virtual data object (for example, a view). For more information, see [Amazon SageMaker Unified Studio terminology and concepts](concepts.md). Manually creating an asset is a one-time operation. You don't specify a run schedule for the asset, so it's not updated automatically if its source changes.

To manually create an asset through a project, you must be the owner or contributor of that project.

**To create an asset manually**

1. Navigate to Amazon SageMaker Unified Studio using the URL from your admin and log in using your SSO or AWS credentials. 

1. Choose **Select project** from the top navigation pane and select the project that you want to create an asset in.

1. Under **Project catalog** in the left side navigation, choose **Assets**.

1. On the **Inventory** tab, choose **Create**, then choose **Create asset**.

1. For **Data asset details**, configure the following settings:
   + **Name** – The name of the asset.
   + **Description** – A description of the asset.

1. Choose **Next**.

1. For **Asset type details**, configure the following settings:
   + **Asset type** – The type of asset.
   + **Revision** – The revision of the asset type to use.

1. If you are adding an **S3 object collection**, for **S3 location**, enter the Amazon Resource Name (ARN) of the source S3 bucket.

   Optionally, enter an S3 access point. For more information, see [Managing data access with Amazon S3 access points](https://docs.aws.amazon.com/AmazonS3/latest/userguide/access-points.html).

1. Choose **Next**.

1. Review the selections, then choose **Create**. 

   After the asset is created, it will be stored in the inventory until you decide to publish it.
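
You can also create an asset programmatically by invoking the `CreateAsset` API action or the `create-asset` CLI command. The following is a minimal sketch; the domain, project, and asset type identifiers are placeholder values:

```
aws datazone create-asset \
  --domain-identifier dzd_example123 \
  --owning-project-identifier project123 \
  --name "my-manual-asset" \
  --type-identifier MyCustomAssetType
```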

# Unpublish an asset from the Amazon SageMaker Catalog

When you unpublish an Amazon SageMaker Unified Studio asset from the catalog, it no longer appears in global search results. New users won't be able to find or subscribe to the asset listing in the catalog, but all existing subscriptions remain the same.

To unpublish an asset, you must be the owner or the contributor of the project to which the asset belongs.

**To unpublish an asset**

1. Navigate to Amazon SageMaker Unified Studio using the URL from your admin and log in using your SSO or AWS credentials. 

1. Choose **Select project** from the top navigation pane and select the project to which the asset belongs.

1. Under **Project catalog** in the left side navigation, choose **Assets**.

1. On the **Inventory** tab, choose the name of the asset that you want to unpublish. This opens the asset details page.

1. Expand the **Actions** menu, then choose **Unpublish**.

1. In the pop-up window, confirm the action by choosing **Unpublish**.

   The asset is then removed from the catalog. You can re-publish the asset at any time by choosing **Publish asset**.
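
Unpublishing can likewise be done programmatically by invoking the `CreateListingChangeSet` API action with the `UNPUBLISH` action; the identifiers below are placeholders:

```
aws datazone create-listing-change-set \
  --domain-identifier dzd_example123 \
  --entity-identifier asset123 \
  --entity-type ASSET \
  --action UNPUBLISH
```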

# Delete an Amazon SageMaker Unified Studio asset

When you no longer need an asset in Amazon SageMaker Unified Studio, you can permanently delete it. Deleting an asset is different from unpublishing an asset from the catalog. You can delete an asset and its related listing in the catalog so that it's not visible in any search results. To delete the asset listing, you must first revoke all of its subscriptions. 

To delete an asset, you must be the owner or the contributor of the project to which the asset belongs.

**Note**  
In order to delete an asset listing, you must first revoke all existing subscriptions to the asset, and the asset must be removed from all data products. You can't delete an asset listing that has existing subscribers or that is included in a current data product.

**To delete an asset**

1. Navigate to Amazon SageMaker Unified Studio using the URL from your admin and log in using your SSO or AWS credentials. 

1. Choose **Select project** from the top navigation pane and select the project to which the asset belongs.

1. Under **Project catalog** in the left side navigation, choose **Assets**.

1. On the **Inventory** tab, choose the name of the asset that you want to delete. This opens the asset details page.

1. Expand the **Actions** menu, then choose **Delete**.

1. In the pop-up window, type `delete` to confirm deletion, then choose **Delete**.

   When the asset is deleted, it's no longer available to view or subscribe to.
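
You can also delete an asset programmatically by invoking the `DeleteAsset` API action or the `delete-asset` CLI command; a minimal sketch with placeholder identifiers:

```
aws datazone delete-asset \
  --domain-identifier dzd_example123 \
  --identifier asset123
```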

# Manually start a data source run in Amazon SageMaker Unified Studio

When you run a data source, Amazon SageMaker Unified Studio pulls any new or modified metadata from the source and updates the associated assets in the inventory. When you add a data source to Amazon SageMaker Unified Studio, you specify the source's run preference, which defines whether the source runs on a schedule or on demand. If your source runs on demand, you must initiate a data source run manually.

Even if your source runs on a schedule, you can still run it manually at any time. After adding business metadata to the assets, you can select assets and publish them to the Amazon SageMaker Catalog in order for these assets to be discoverable by all domain users. Only published assets are searchable by other domain users.

**To run a data source manually**

1. Navigate to Amazon SageMaker Unified Studio using the URL from your admin and log in using your SSO or AWS credentials. 

1. Choose **Select project** from the top navigation pane and select the project to which the data source belongs.

1. Choose **Data sources** from the left navigation pane under **Project catalog**.

1. Choose the data source that you want to run. This opens the data source details page.

1. Choose **Run**.

   The data source status changes as Amazon SageMaker Unified Studio updates the asset metadata with the most recent data from the source. You can monitor the status of the run on the **Data source runs** tab. 
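
You can also start a run programmatically by invoking the `StartDataSourceRun` API action or the `start-data-source-run` CLI command; a minimal sketch with placeholder identifiers:

```
aws datazone start-data-source-run \
  --domain-identifier dzd_example123 \
  --data-source-identifier datasource123
```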

# Asset revisions in Amazon SageMaker Unified Studio

Amazon SageMaker Unified Studio increments the revision of an asset when you edit its business or technical metadata. These edits include modifying the asset name, description, glossary terms, column names, metadata forms, and metadata form field values. These changes can result from manual edits, data source job runs, or API operations. Amazon SageMaker Unified Studio automatically generates a new asset revision any time you make an edit to the asset.

After you update an asset and a new revision is generated, you must publish the new revision to the catalog for it to be updated and available to subscribers. For more information, see [Publish assets to the Amazon SageMaker Unified Studio catalog from the project inventory](publishing-data-asset.md). You can only publish the most recent version of an asset to the catalog.

**To view past revisions of an asset**

1. Navigate to Amazon SageMaker Unified Studio using the URL from your admin and log in using your SSO or AWS credentials. 

1. Choose **Select project** from the top navigation pane and select the project to which the asset belongs.

1. Under **Project catalog** in the left side navigation, choose **Assets**.

1. On the **Inventory** tab, choose the name of the asset whose revisions you want to view. This opens the asset details page.

1. Navigate to the **History** tab, which displays a list of past revisions of the asset.
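
You can also list past revisions programmatically by invoking the `ListAssetRevisions` API action; a minimal sketch with placeholder identifiers:

```
aws datazone list-asset-revisions \
  --domain-identifier dzd_example123 \
  --identifier asset123
```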

# Data quality in Amazon SageMaker Unified Studio

Data quality metrics in Amazon SageMaker Unified Studio help you understand different aspects of the quality of your data sources, such as completeness, timeliness, and accuracy. Amazon SageMaker Unified Studio integrates with AWS Glue Data Quality and offers APIs to integrate data quality metrics from third-party data quality solutions. Data users can see how data quality metrics change over time for their subscribed assets. To author and run the data quality rules, you can use your data quality tool of choice, such as AWS Glue Data Quality. With data quality metrics in Amazon SageMaker Unified Studio, data consumers can visualize the data quality scores for the assets and columns, helping build trust in the data they use for decisions. 

**Prerequisites and IAM role changes**

If you are using Amazon SageMaker Unified Studio's AWS managed policies, there are no additional configuration steps and these managed policies are automatically updated to support data quality. If you are using your own policies for the roles that grant Amazon SageMaker Unified Studio the required permissions to interoperate with supported services, you must update the policies attached to these roles to enable support for reading the AWS Glue data quality information.

## Enabling data quality for AWS Glue assets


Amazon SageMaker Unified Studio pulls the data quality metrics from AWS Glue in order to provide context during a point in time, for example, during a business data catalog search. Data users can see how data quality metrics change over time for their subscribed assets. Data producers can ingest AWS Glue data quality scores on a schedule. The Amazon SageMaker Unified Studio business data catalog can also display data quality metrics from third-party systems through data quality APIs. For more information, see [AWS Glue Data Quality](https://docs.aws.amazon.com/glue/latest/dg/glue-data-quality.html) and [Getting started with AWS Glue Data Quality for the Data Catalog](https://docs.aws.amazon.com/glue/latest/dg/data-quality-getting-started.html).

You can enable data quality metrics for your Amazon SageMaker Unified Studio assets in the following ways:
+ Use Amazon SageMaker Unified Studio to enable data quality for your AWS Glue data source, either while creating a new AWS Glue data source or while editing an existing one.
**Note**  
You can use Amazon SageMaker Unified Studio to enable data quality only for your AWS Glue inventory assets. In this release of Amazon SageMaker Unified Studio, enabling data quality for custom type assets must be done using APIs.
+ You can also use the APIs to enable data quality for your new or existing data sources. You can do this by invoking the [CreateDataSource](https://docs.aws.amazon.com/datazone/latest/APIReference/API_CreateDataSource.html) or [UpdateDataSource](https://docs.aws.amazon.com/datazone/latest/APIReference/API_UpdateDataSource.html) APIs and setting the `autoImportDataQualityResult` parameter to `true`.

After data quality is enabled, you can run the data source on demand or on a schedule. Each run can bring in up to 100 metrics per asset. There is no need to create forms or add metrics manually when using a data source for data quality. When the asset is published, the updates that were made to the data quality form (up to 30 data points of history per rule) are reflected in the listing for the consumers. Subsequently, each new addition of metrics to the asset is automatically added to the listing. There is no need to republish the asset to make the latest scores available to consumers. 
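
As a sketch of the API route, data quality import might be enabled on an existing AWS Glue data source with the `update-data-source` CLI command; the domain and data source identifiers and the filter values below are placeholders:

```
aws datazone update-data-source \
  --domain-identifier dzd_example123 \
  --identifier datasource123 \
  --configuration '{"glueRunConfiguration": {"autoImportDataQualityResult": true, "relationalFilterConfigurations": [{"databaseName": "my_database", "filterExpressions": [{"expression": "*", "type": "INCLUDE"}]}]}}'
```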

## Enabling data quality for custom asset types


You can use the Amazon SageMaker Unified Studio APIs to enable data quality for any of your custom type assets. For more information, see the following:
+ [PostTimeSeriesDataPoints](https://docs.aws.amazon.com/datazone/latest/APIReference/API_PostTimeSeriesDataPoints.html)
+ [ListTimeSeriesDataPoints](https://docs.aws.amazon.com/datazone/latest/APIReference/API_ListTimeSeriesDataPoints.html)
+ [GetTimeSeriesDataPoint](https://docs.aws.amazon.com/datazone/latest/APIReference/API_GetTimeSeriesDataPoint.html)
+ [DeleteTimeSeriesDataPoints](https://docs.aws.amazon.com/datazone/latest/APIReference/API_DeleteTimeSeriesDataPoints.html)

The following steps provide an example of using APIs or CLI to import third-party metrics for your assets in Amazon SageMaker Unified Studio:

1. Invoke the `PostTimeSeriesDataPoints` API as follows:

   ```
   aws datazone post-time-series-data-points \
   --cli-input-json file://createTimeSeriesPayload.json
   ```

   with the following payload:

   ```
   "domainId": "dzd_5oo7xzoqltu8mf",
       "entityId": "4wyh64k2n8czaf",
       "entityType": "ASSET",
       "form": {
           "content": "{\n  \"evaluations\" : [ {\n    \"types\" : [ \"MaximumLength\" ],\n    \"description\" : \"ColumnLength \\\"ShippingCountry\\\" <= 6\",\n    \"details\" : { },\n    \"applicableFields\" : [ \"ShippingCountry\" ],\n    \"status\" : \"PASS\"\n  }, {\n    \"types\" : [ \"MaximumLength\" ],\n    \"description\" : \"ColumnLength \\\"ShippingState\\\" <= 2\",\n    \"details\" : { },\n    \"applicableFields\" : [ \"ShippingState\" ],\n    \"status\" : \"PASS\"\n  }, {\n    \"types\" : [ \"MaximumLength\" ],\n    \"description\" : \"ColumnLength \\\"ShippingCity\\\" <= 8\",\n    \"details\" : { },\n    \"applicableFields\" : [ \"ShippingCity\" ],\n    \"status\" : \"PASS\"\n  }, {\n    \"types\" : [ \"Completeness\" ],\n    \"description\" : \"Completeness \\\"ShippingStreet\\\" >= 0.59\",\n    \"details\" : { },\n    \"applicableFields\" : [ \"ShippingStreet\" ],\n    \"status\" : \"PASS\"\n  }, {\n    \"types\" : [ \"MaximumLength\" ],\n    \"description\" : \"ColumnLength \\\"ShippingStreet\\\" <= 101\",\n    \"details\" : { },\n    \"applicableFields\" : [ \"ShippingStreet\" ],\n    \"status\" : \"PASS\"\n  }, {\n    \"types\" : [ \"MaximumLength\" ],\n    \"description\" : \"ColumnLength \\\"BillingCountry\\\" <= 6\",\n    \"details\" : { },\n    \"applicableFields\" : [ \"BillingCountry\" ],\n    \"status\" : \"PASS\"\n  }, {\n    \"types\" : [ \"Completeness\" ],\n    \"description\" : \"Completeness \\\"biLlingcountry\\\" >= 0.5\",\n    \"details\" : {\n      \"EVALUATION_MESSAGE\" : \"Value: 0.26666666666666666 does not meet the constraint requirement!\"\n    },\n    \"applicableFields\" : [ \"biLlingcountry\" ],\n    \"status\" : \"FAIL\"\n  }, {\n    \"types\" : [ \"Completeness\" ],\n    \"description\" : \"Completeness \\\"Billingstreet\\\" >= 0.5\",\n    \"details\" : { },\n    \"applicableFields\" : [ \"Billingstreet\" ],\n    \"status\" : \"PASS\"\n  } ],\n  \"passingPercentage\" : 88.0,\n  \"evaluationsCount\" : 8\n}",
           "formName": "shortschemaruleset",
           "id": "athp9dyw75gzhj",
           "timestamp": 1.71700477757E9,
           "typeIdentifier": "amazon.datazone.DataQualityResultFormType",
           "typeRevision": "8"
       },
       "formName": "shortschemaruleset"
   }
   ```

   You can obtain the structure of this form type by invoking the `GetFormType` action:

   ```
   aws datazone get-form-type --domain-identifier <your_domain_id> --form-type-identifier amazon.datazone.DataQualityResultFormType --region <domain_region> --output text --query 'model.smithy'
   ```

1. Invoke the `DeleteTimeSeriesDataPoints` API as follows:

   ```
   aws datazone delete-time-series-data-points \
   --domain-identifier dzd_bqqlk3nz21zp2f \
   --entity-identifier dzd_bqqlk3nz21zp2f \
   --entity-type ASSET \
   --form-name rulesET1
   ```

# Using machine learning and generative AI in Amazon SageMaker Unified Studio


**Note**  
Powered by Amazon Bedrock: AWS implements automated abuse detection. Because the AI recommendations for assets in Amazon SageMaker Unified Studio are built on Amazon Bedrock, users inherit the controls implemented in Amazon Bedrock to enforce safety, security, and the responsible use of AI.

In the current release of Amazon SageMaker Unified Studio, you can use the AI recommendations for names, descriptions, and glossary terms functionality to automate data discovery and cataloging. 

Powered by Amazon Bedrock's large language models, the AI recommendations for data asset names, descriptions, and glossary terms in Amazon SageMaker Unified Studio help you ensure that your data is comprehensible and easily discoverable. The AI recommendations also suggest the most pertinent analytical applications for datasets. By reducing manual documentation tasks and advising on appropriate data usage, auto-generated names and descriptions help you enhance the trustworthiness of your data, avoid overlooking valuable data, and accelerate informed decision making.

AI recommendations for glossary terms is a feature that automatically analyzes asset metadata and context to determine the most relevant business glossary terms for each asset and its columns. Instead of relying on manual tagging or static rules, it reasons about the data and performs iterative searches across what already exists in the customer’s environment to identify the best-fit glossary term concepts. Because the system suggests terms only from glossaries and definitions already present in the system, customers are encouraged to maintain high-quality, well-described glossary entries so the AI can return accurate and meaningful suggestions. This improves metadata quality, strengthens governance, accelerates data onboarding, and reduces manual stewardship effort at scale.

## Supported Regions for the AI recommendations for names and descriptions


In the current Amazon SageMaker Unified Studio release, the AI recommendations for names and descriptions feature is supported in the following regions:
+ US East (N. Virginia)
+ US West (Oregon)
+ Asia Pacific (Tokyo)
+ Europe (Frankfurt)
+ Asia Pacific (Sydney)
+ Canada (Central)
+ Europe (London)
+ South America (Sao Paulo)
+ Europe (Ireland)
+ Asia Pacific (Singapore)
+ US East (Ohio)
+ Asia Pacific (Seoul)

Amazon SageMaker Unified Studio supports Business Description Generation in the following regions:
+ Asia Pacific (Mumbai)
+ Europe (Paris)

Amazon SageMaker Unified Studio supports Business Name Generation in the following regions:
+ Europe (Stockholm)

**Bedrock Cross Region Inference**  
Amazon SageMaker Unified Studio leverages Amazon Bedrock's Cross Region inference endpoint to serve recommendations for the US East (Ohio) region. All other regions use in-region endpoints.

## Supported Regions for the AI recommendations for glossary terms


In the current Amazon SageMaker Unified Studio release, the AI recommendations for glossary terms feature is supported in the following regions:
+ US East (N. Virginia)
+ US West (Oregon)
+ Asia Pacific (Tokyo)
+ Europe (Frankfurt)
+ Asia Pacific (Sydney)
+ Europe (London)
+ Europe (Ireland)
+ Asia Pacific (Singapore)
+ US East (Ohio)
+ Asia Pacific (Seoul)
+ Asia Pacific (Mumbai)
+ Europe (Paris)
+ Europe (Stockholm)

**Bedrock Cross Region Inference**  
Amazon SageMaker Unified Studio leverages Amazon Bedrock's Cross Region inference endpoint to serve recommendations for all of the supported regions for AI recommendations for glossary terms. 

## Steps to use GenAI


The following procedure describes how to generate AI recommendations for names, descriptions, and glossary terms in Amazon SageMaker Unified Studio:
+ Navigate to Amazon SageMaker Unified Studio using the URL from your admin and log in using your SSO or AWS credentials. 
+ Choose the project that contains the asset for which you want to generate AI recommendations for descriptions.

### Generating Business Descriptions and Summaries

+ Navigate to the **Data** tab for the project.
+ From **Project catalog**, choose **Assets** and choose the asset for which you want to generate AI recommendations for descriptions.
+ On the asset's details page, in the **Business metadata** tab, choose **Generate descriptions**.

### Generating glossary terms

+ Navigate to the **Data** tab for the project.
+ From **Project catalog**, choose **Assets** and choose the asset for which you want to generate AI recommendations for glossary terms.
+ On the asset's details page, in the **Business metadata** tab, choose **Generate terms**.

### Generating Business Names

+ Navigate to the **Data** tab for the project.
+ In the left navigation pane, choose **Data sources**, and then choose the data source for which you want to enable business name generation.
+ Go to the **Details** tab and enable the **AUTOMATED BUSINESS NAME GENERATION** configuration.
+ Business names can also be generated programmatically when creating an asset by enabling the `businessNameGeneration` flag under `predictionConfiguration` in the [CreateAsset API](https://docs.aws.amazon.com/datazone/latest/APIReference/API_CreateAsset.html) payload, as shown in the sketch that follows.
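
For example, a minimal AWS CLI sketch; the identifiers are placeholders, and the managed `amazon.datazone.GlueTableAssetType` asset type is used only as an illustration:

```
aws datazone create-asset \
--domain-identifier <your_domain_id> \
--owning-project-identifier <project_id> \
--name raw_sales \
--type-identifier amazon.datazone.GlueTableAssetType \
--prediction-configuration '{"businessNameGeneration":{"enabled":true}}'
```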

### Accepting/Rejecting Predictions

+ Once the metadata suggestions (name, description, or terms) are generated, you can edit, accept, or reject them.
+ Sparkle icons are displayed next to each piece of automatically generated metadata (name, description, or terms) for the data asset. In the **Business metadata** tab, you can choose the sparkle icon next to the automatically generated **Summary**, and then choose **Edit**, **Accept**, or **Reject** to address the generated description.
+ You can also choose the **Accept all** or **Reject all** options displayed at the top of the page when the **Business metadata** tab is selected, to perform the selected action on all automatically generated metadata (name, description, or terms).
+ Alternatively, choose the **Schema** tab, and then address automatically generated metadata (name, description, or terms) individually by choosing the sparkle icon for one suggested metadata change at a time and then choosing **Accept** or **Reject**.
+ In the **Schema** tab, you can also choose **Accept all** or **Reject all** to perform the selected action on all automatically generated metadata.

To publish the asset to the catalog with the generated descriptions, choose **Publish asset**, and then confirm this action by choosing **Publish asset** again in the **Publish asset** pop-up window.

**Note**  
If you don't accept or reject the generated metadata for an asset and then publish the asset, the unreviewed automatically generated metadata is not included in the published data asset.

## Support for custom relational asset types


Amazon SageMaker Unified Studio supports generative AI capabilities for custom asset types. Previously, this feature was supported only for the managed AWS Glue and Amazon Redshift asset types.

To enable this feature, create your own asset type definition and attach `RelationalTableFormType` as one of the forms. Amazon SageMaker Unified Studio automatically detects the presence of such forms and enables generative AI capabilities for these assets. The overall experience remains the same for generating business names (via `predictionConfiguration` in the CreateAsset API), business descriptions (via the **Generate descriptions** button on the asset details page), and glossary terms.
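
As a hedged sketch, you might define such an asset type with the `CreateAssetType` API; the type name here is hypothetical and the form revision is illustrative:

```
aws datazone create-asset-type \
--domain-identifier <your_domain_id> \
--owning-project-identifier <project_id> \
--name CustomRelationalAssetType \
--forms-input '{"RelationalTableForm":{"typeIdentifier":"amazon.datazone.RelationalTableFormType","typeRevision":"1","required":false}}'
```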

For more information about creating custom asset types see [Create custom asset types in Amazon SageMaker Unified Studio](create-asset-types.md). 

## Quotas


Amazon SageMaker Unified Studio supports different quotas for business name generation, business description generation, and glossary term generation. You can contact AWS Support to request an increase in these quotas.
+ BusinessDescriptionGeneration: 10K invocations/month
+ BusinessNameGeneration: 50K invocations/month
+ GlossaryTermGeneration: 10K invocations/month

# Data lineage in Amazon SageMaker Unified Studio

Data lineage in Amazon SageMaker Unified Studio is an OpenLineage-compatible feature that can help you to capture and visualize lineage events, from OpenLineage-enabled systems or through APIs, to trace data origins, track transformations, and view cross-organizational data consumption. It provides you with an overarching view into your data assets to see the origin of assets and their chain of connections. The lineage data includes information about activities inside the Amazon SageMaker Catalog, such as the cataloged assets and the subscribers of those assets, as well as activities that happen outside the business data catalog that are captured programmatically using the APIs.

**Topics**
+ [

# What is OpenLineage?
](datazone-data-lineage-what-is-openlineage.md)
+ [

# Data lineage support
](datazone-data-lineage-support.md)
+ [

# Data lineage support matrix
](datazone-support-matrix.md)
+ [

# Visualizing data lineage
](datazone-visualizing-data-lineage.md)
+ [

# Test drive data lineage
](datazone-data-lineage-sample-experience.md)
+ [

# Data lineage authorization
](datazone-data-lineage-authorization.md)
+ [

# Automate lineage capture from data connections
](datazone-data-lineage-automate-capture-from-data-connections.md)
+ [

# Automate lineage capture from tools
](datazone-data-lineage-automate-capture-from-tools.md)
+ [

# Permissions required for data lineage
](datazone-data-lineage-permissions.md)
+ [

# Publishing data lineage programmatically
](datazone-data-lineage-apis.md)
+ [

# The importance of the sourceIdentifier attribute to lineage nodes
](datazone-data-lineage-sourceIdentifier-attribute.md)
+ [

# Linking dataset lineage nodes with assets imported into Amazon SageMaker Unified Studio
](datazone-data-lineage-linking-nodes.md)
+ [

# Troubleshooting data lineage
](datazone-lineage-troubleshooting.md)

# What is OpenLineage?


[OpenLineage](https://openlineage.io/) is an open platform for the collection and analysis of data lineage. It tracks metadata about datasets, jobs, and runs, giving users the information required to identify the root cause of complex issues and understand the impact of changes. It is an open standard for lineage metadata collection designed to record metadata for a job in execution.

The standard defines a generic model of dataset, job, and run entities uniquely identified using consistent naming strategies. The dataset and job entities are identified by a combination of their 'namespace' and 'name' attributes, whereas a run is identified by its runId. The entities can be enriched with user-defined metadata via facets (similar to metadata forms in Amazon SageMaker Unified Studio).

OpenLineage supports three types of events: RunEvent, DatasetEvent, and JobEvent.
+ RunEvent: this event is generated as a result of a job-run execution. It contains details of the run, the job it belongs to, the input datasets the run consumes, and the output datasets the run produces. Currently, Amazon SageMaker Unified Studio supports only RunEvents.
+ DatasetEvent: this event represents changes in a dataset (such as static updates on the dataset)
+ JobEvent: this event represents changes in job configuration or details

In the current release of Amazon SageMaker Unified Studio, OpenLineage versions 1.22.0 and later are supported.

# Data lineage support


In Amazon SageMaker Unified Studio, domain administrators or data users can configure lineage in projects while setting up connections for data lake and data warehouse sources, so that the data source runs created from those resources are enabled for automatic lineage capture. Data lineage is automatically captured from data sources, such as AWS Glue and Amazon Redshift, as well as from tools, such as Visual ETL and notebooks, as executions create, update, or transform data. Additionally, Amazon SageMaker Unified Studio captures the movement of data within the catalog as producers bring their assets into inventory and publish them, as well as when consumers subscribe and get access, to indicate who the subscribing projects are for a given asset. With this automation, the different stages of an asset in the catalog are captured, including when schema changes are detected.

Using Amazon SageMaker Unified Studio's OpenLineage-compatible APIs, domain administrators and data producers can capture and store lineage events beyond what is available in Amazon SageMaker Unified Studio, including transformations in Amazon S3, AWS Glue, and other services. This provides a comprehensive view for data consumers and helps them gain confidence in an asset's origin, while data producers can assess the impact of changes to an asset by understanding its usage. Additionally, Amazon SageMaker Unified Studio versions lineage with each event, enabling users to visualize lineage at any point in time or compare transformations across an asset's or job's history. This historical lineage provides a deeper understanding of how data has evolved, which is essential for troubleshooting, auditing, and ensuring the integrity of data assets.

With data lineage, you can accomplish the following in Amazon SageMaker Unified Studio: 
+ Understand the provenance of data: knowing where the data originated fosters trust in data by providing you with a clear understanding of its origins, dependencies, and transformations. This transparency helps in making confident data-driven decisions.
+ Understand the impact of changes to data pipelines: when changes are made to data pipelines, lineage can be used to identify all of the downstream consumers that are to be affected. This helps to ensure that changes are made without disrupting critical data flows.
+ Identify the root cause of data quality issues: if a data quality issue is detected in a downstream report, lineage, especially column-level lineage, can be used to trace the data back (at a column level) to identify the issue back to its source. This can help data engineers to identify and fix the problem.
+ Improve data governance and compliance: column-level lineage can be used to demonstrate compliance with data governance and privacy regulations. For example, column-level lineage can be used to show where sensitive data (such as PII) is stored and how it is processed in downstream activities.

**OpenLineage custom transport to send lineage events to SageMaker**

OpenLineage events, which contain metadata about data pipelines, jobs, and runs, are typically sent to a backend for storage and analysis. The transport mechanism handles this transmission. As an extension of the OpenLineage project, a custom transport is available to send lineage events directly to Amazon SageMaker Unified Studio's endpoint. The custom transport was merged into OpenLineage version 1.33.0 ([https://openlineage.io/docs/releases/1_33_0/](https://openlineage.io/docs/releases/1_33_0/)). This allows the use of OpenLineage plugins with the transport to send collected lineage events directly to Amazon SageMaker Unified Studio.

# Data lineage support matrix


Lineage capture is automated from the following tools in Amazon SageMaker Unified Studio:


**Tools support matrix**  

| **Tool** | **Compute** | **AWS Service** | **Service deployment option** | **Support status** | **Notes** | 
| --- | --- | --- | --- | --- | --- | 
| Jupyterlab notebook | Spark | EMR | EMR Serverless | Automated | Spark DataFrames only; remote workflow execution | 
| Jupyterlab notebook | Spark | AWS Glue | N/A | Automated | Spark DataFrames only; remote workflow execution | 
| Visual ETL | Spark | AWS Glue | compatibility mode | Automated | Spark DataFrames only | 
| Visual ETL | Spark | AWS Glue | fineGrained mode | Not supported | Spark DataFrames only | 
| Query Editor |  | Amazon Redshift |  | Automated |  | 

Lineage is captured from the following services: 


**Services support matrix**  

| **Data source** | **Lineage Support status** | **Required Configuration** | **Notes** | 
| --- | --- | --- | --- | 
| AWS Glue Crawler | Automated by default in SageMaker Unified Studio | None | Supported for assets crawled via AWS Glue Crawler for the following data sources: Amazon S3, Amazon DynamoDB, Amazon S3 Open Table Formats including: Delta Lake, Iceberg tables, Hudi tables, JDBC, PostgreSQL, DocumentDB, and MongoDB. | 
| Amazon Redshift | Automated by default in SageMaker Unified Studio | None | Redshift System tables will be used to retrieve user queries and lineage is generated by parsing those queries | 
| AWS Glue jobs in AWS Glue console  | Not automated by default | User can select "generate lineage events" and pass domainId  |  | 
| EMR | Not automated by default | User has to pass spark conf parameters to publish lineage events | Supported versions: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker-unified-studio/latest/userguide/datazone-support-matrix.html)  More details in [Capture lineage from EMR Spark executions](datazone-data-lineage-automate-capture-from-tools.md#datazone-data-lineage-automate-capture-from-tools-emrnotebook) | 

# Visualizing data lineage


In Amazon SageMaker Unified Studio, nodes in the lineage graph contain lineage information while edges represent upstream/downstream directions of data propagation. The lineage information is present in metadata forms attached to the lineage node. Amazon SageMaker Unified Studio defines three types of lineage nodes: 
+ Dataset node - this node includes data lineage information about a specific dataset.
  + Dataset refers to any object such as table, view, Amazon S3 file, Amazon S3 bucket, etc. It also refers to the assets in Amazon SageMaker Unified Studio's inventory and catalog, and subscribed tables/views.
  + Each version of the dataset node represents an event happening on the dataset at that timestamp. The history tab on the dataset node shows all dataset versions.
+ Job node - this node includes job details such as the job type (query, ETL, etc.) and the processing type (batch, streaming).
+ JobRun node - this node represents job run details such as the job it belongs to, its status, and its start/end timestamps. Amazon SageMaker Unified Studio's lineage graph shows a combined node for a job and its job runs, which displays the job details and the latest run details along with a history of previous job runs.

The lineage graph is visualized with an asset as the base node. In SageMaker Unified Studio, search for an asset, open it, and view its lineage on the asset details page. 

Here is a sample lineage graph for a user who is a data producer:

![\[Sample lineage graph for a user who is a data producer.\]](http://docs.aws.amazon.com/sagemaker-unified-studio/latest/userguide/images/Screenshot4datalineage.png)


Here is a sample lineage graph for a user who is a data consumer:

![\[Sample lineage graph for a user who is a data consumer.\]](http://docs.aws.amazon.com/sagemaker-unified-studio/latest/userguide/images/Screenshot5datalineage.png)


The asset details page provides the following capabilities to navigate the graph:
+ Column-level lineage: expand column-level lineage when available in dataset nodes. This automatically shows relationships with upstream or downstream dataset nodes if source column information is available.
+ Column search: by default, 10 columns are displayed. If there are more than 10 columns, pagination is activated to navigate to the rest of the columns. To quickly view a particular column, you can search for it on the dataset node, which then lists just the searched column.
+ View dataset nodes only: if you want to toggle to view only dataset lineage nodes and filter out the job nodes, you can choose the **Open view control** icon on the top left of the graph viewer and toggle the **Display dataset nodes only** option. This will remove all the job nodes from the graph and lets you navigate just the dataset nodes. Note that when the view only dataset nodes is turned on, the graph cannot be expanded upstream or downstream.
+ Details pane: Each lineage node has details captured and displayed when selected.
  + A dataset node has a details pane that displays all the details captured for that node for a given timestamp. Every dataset node has three tabs: Lineage info, Schema, and History. The History tab lists the different versions of the lineage event captured for that node. All details captured from the API are displayed using metadata forms or a JSON viewer.
  + A job node has a details pane that displays job details in two tabs: Job info and History. The details pane also shows queries or expressions captured as part of the job run. The History tab lists the different job runs captured for that job. All details captured from the API are displayed using metadata forms or a JSON viewer.
+ History tab: all lineage nodes in Amazon SageMaker Unified Studio's lineage have versioning. For every dataset node or job node, the versions are captured as history, which enables you to navigate between the different versions to identify what has changed over time. Each version opens in a new tab on the lineage page to help you compare and contrast.

# Aggregated lineage view


You can view an asset's lineage in two ways:
+ **Aggregated view** - Shows all jobs that are currently contributing to an asset's lineage, providing a complete picture of the data transformations and dependencies across multiple levels of the lineage graph. Use this view to understand the full scope of jobs impacting your datasets and to identify all upstream sources and downstream consumers.
+ **Timestamp view** - Shows the lineage graph as it existed at a specific point in time, displaying only the latest job run for each job at that timestamp. This view includes column-level lineage and is useful for troubleshooting and investigating specific data processing events.

The aggregated view is the default in most regions and shows the current state of your data lineage. In Opt-In Regions, only the timestamp view is available.

To switch between views, choose the **Open view control** icon in the top left of the lineage graph viewer and toggle the **Display in event timestamp order** option. When enabled, the timestamp view is displayed. When disabled, the aggregated view is displayed. This toggle is not available in Opt-In Regions.

Here is a sample aggregated view of a lineage graph:

![\[Sample aggregated view of a lineage graph showing all jobs currently contributing to the asset.\]](http://docs.aws.amazon.com/sagemaker-unified-studio/latest/userguide/images/Screenshot6datalineage.png)


Here is a sample timestamp view of a lineage graph:

![\[Sample timestamp view of a lineage graph showing the latest job run at a specific point in time.\]](http://docs.aws.amazon.com/sagemaker-unified-studio/latest/userguide/images/Screenshot7datalineage.png)


# Test drive data lineage


You can use the data lineage sample experience to browse and understand data lineage in Amazon SageMaker Unified Studio, including traversing upstream or downstream in your data lineage graph, exploring versions and column-level lineage.

Complete the following procedure to try the sample data lineage experience in Amazon SageMaker Unified Studio:

1. Navigate to Amazon SageMaker Unified Studio using the URL from your admin and log in using your SSO or AWS credentials. 

1. Choose **Select project** from the top navigation pane and select the project you want to view lineage in.

1. Under **Project catalog** in the left side navigation, choose **Assets**.

1. On the **Inventory** tab, choose the name of the asset that you want to view lineage for. This opens the asset details page.

1. On the asset details page, choose the **Lineage** tab.

1. In the data lineage window, choose the info icon that says **Try sample data lineage**. Then choose **Launch**. A new pop-up window appears.

1. Choose **Start guided data lineage tour**.

1. Select a guided tour option, and then choose **Start tour**.

   At this point, a tab that provides the full space for lineage information is displayed. The sample data lineage graph is initially displayed with a base node and 1-depth at either end, upstream and downstream. You can expand the graph upstream or downstream. Column information is also available for you to choose and see how lineage flows through the nodes. 

# Data lineage authorization


**Write permissions** - to publish lineage events into Amazon SageMaker Unified Studio, you must have an IAM role with a permissions policy that includes an ALLOW action on the PostLineageEvent API. This IAM authorization happens at the API Gateway layer.

**Read permissions to view lineage** - GetLineageNode and ListLineageNodeHistory are included in the AmazonSageMakerDomainExecution managed policy, and therefore every user in an Amazon SageMaker Unified Studio domain can invoke them to view the data lineage graph in Amazon SageMaker Unified Studio.

**Read permissions to get lineage events** - you must have an IAM role with a policy that includes an ALLOW action on the ListLineageEvents and GetLineageEvent APIs to view lineage events posted to Amazon SageMaker Unified Studio.
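
As a minimal sketch, an IAM policy granting both the write and read lineage-event permissions described above might look like the following; scope the `Resource` to your domain as appropriate:

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "LineageEventReadWrite",
            "Effect": "Allow",
            "Action": [
                "datazone:PostLineageEvent",
                "datazone:ListLineageEvents",
                "datazone:GetLineageEvent"
            ],
            "Resource": "*"
        }
    ]
}
```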

# Automate lineage capture from data connections


**Topics**
+ [

## Configure automated lineage capture for AWS Glue (Lakehouse) connections
](#datazone-data-lineage-automate-capture-from-data-connections-glue)
+ [

## Configure automated lineage capture for Amazon Redshift connections
](#datazone-data-lineage-automate-capture-from-data-connections-redshift)

## Configure automated lineage capture for AWS Glue (Lakehouse) connections


As databases and tables are added to the Amazon SageMaker Unified Studio catalog, lineage extraction from the source can be automated for those assets using data source runs in the Create Connection workflow. Lineage is not automatically enabled for every connection created. 

**To enable lineage capture for an AWS Glue connection**

1. Navigate to Amazon SageMaker Unified Studio using the URL from your admin and log in using your SSO or AWS credentials.

1. Choose **Select project** from the top navigation pane and select the project to which you want to add the data source.

1. Choose **Data sources** from the left navigation pane under Project catalog.

1. Choose the data source that you want to modify.

1. Expand the **Actions** menu and choose **Edit data source**. Alternatively, choose the data source run name to view the details, go to the **Data Source Definition** tab, and choose **Edit** in **Connection details**. 

1. Go to the connections and select the **Import data lineage** checkbox to configure lineage capture from the source. 

1. Make other changes to the data source fields as desired, then choose **Save**.

   **Limitations**

   Lineage is captured only for crawler runs that imported fewer than 250 tables.

**Note**  
When enabled, lineage runs asynchronously to capture metadata from the source and generate lineage events, which are stored in the SageMaker Catalog and can be visualized from a particular asset. The status of lineage runs for the data source can be viewed along with the data source run details. 
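
If you prefer the CLI, you can retrieve a run's details, including its status, with the `GetDataSourceRun` API; the identifiers below are placeholders:

```
aws datazone get-data-source-run \
--domain-identifier <your_domain_id> \
--identifier <data_source_run_id>
```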

## Configure automated lineage capture for Amazon Redshift connections


Capturing lineage from Amazon Redshift can be automated when the connection is added to an Amazon Redshift source in Amazon SageMaker Unified Studio's data explorer. Lineage capture is configured per connection in the data source configuration; it is not automatically enabled for every connection created. 

**To enable lineage capture for an Amazon Redshift connection**

1. Navigate to Amazon SageMaker Unified Studio using the URL from your admin and log in using your SSO or AWS credentials.

1. Choose **Select project** from the top navigation pane and select the project to which you want to add the data source.

1. Choose **Data sources** from the left navigation pane under **Project catalog**.

1. Choose the data source that you want to modify.

1. Expand the **Actions** menu and choose **Edit data source**. Alternatively, choose the data source run name to view the details, go to the **Data Source Definition** tab, and choose **Edit** in **Connection details**. 

1. Go to the connections and select the **Import data lineage** checkbox to configure lineage capture from the source. 

1. Make other changes to the data source fields as desired, then choose **Save**.

**Note**  
When enabled, the lineage run captures queries executed for a given database and generates lineage events, which are stored in the SageMaker Catalog and can be visualized from a particular asset. The lineage run for Amazon Redshift pulls from the Amazon Redshift system tables daily to derive lineage. After you enable the feature, the first pull is scheduled 15 minutes later, and the run then repeats daily. You can configure a specific time programmatically. 

# Automate lineage capture from tools


**Topics**
+ [

## Capture lineage for Spark executions in Visual ETL
](#datazone-data-lineage-automate-capture-from-tools-vetl)
+ [

## Capture lineage for AWS Glue Spark executions in Notebooks
](#datazone-data-lineage-automate-capture-from-tools-gluenotebook)
+ [

## Capture lineage from EMR Spark executions
](#datazone-data-lineage-automate-capture-from-tools-emrnotebook)

## Capture lineage for Spark executions in Visual ETL


When a new job is created in Visual ETL in Amazon SageMaker Unified Studio, lineage is automatically enabled. When a Visual ETL flow is created, lineage capture for that flow is enabled automatically when you choose **Save to Project**. For every flow to capture lineage automatically, choose **Save to Project** and then choose **Run**.

**Note:** if lineage is not being captured, choose **Save**, move back to the Visual ETL flows section, and then reopen the Visual ETL flow.

The following Spark configuration parameters are automatically added to the job being executed. When invoking Visual ETL programmatically, use the following configuration.

```
{
    "--conf":"spark.extraListeners=io.openlineage.spark.agent.OpenLineageSparkListener 
    --conf spark.openlineage.transport.type=amazon_datazone_api 
    --conf spark.openlineage.transport.domainId={DOMAIN_ID} 
    --conf spark.glue.accountId={ACCOUNT_ID} 
    --conf spark.openlineage.facets.custom_environment_variables=[AWS_DEFAULT_REGION;GLUE_VERSION;GLUE_COMMAND_CRITERIA;GLUE_PYTHON_VERSION;]
    --conf spark.openlineage.columnLineage.datasetLineageEnabled=True
    --conf spark.glue.JOB_NAME={JOB_NAME}"
}
```

The parameters are auto-configured and do not need any updates from the user. The parameters in detail: 
+ `spark.extraListeners=io.openlineage.spark.agent.OpenLineageSparkListener` - The OpenLineageSparkListener is created and registered with Spark's listener bus.
+ `spark.openlineage.transport.type=amazon_datazone_api` - This is an OpenLineage specification that tells the OpenLineage plugin to use the DataZone API transport to emit lineage events to DataZone's PostLineageEvent API. For more information, see [https://openlineage.io/docs/integrations/spark/configuration/spark_conf/](https://openlineage.io/docs/integrations/spark/configuration/spark_conf/)
+ `spark.openlineage.transport.domainId={DOMAIN_ID}` - This parameter establishes the domain to which the API transport submits the lineage events.
+ `spark.openlineage.facets.custom_environment_variables=[AWS_DEFAULT_REGION;GLUE_VERSION;GLUE_COMMAND_CRITERIA;GLUE_PYTHON_VERSION;]` - These environment variables (`AWS_DEFAULT_REGION`, `GLUE_VERSION`, `GLUE_COMMAND_CRITERIA`, and `GLUE_PYTHON_VERSION`), which the AWS Glue interactive session populates, are added to the LineageEvent.
+ `spark.glue.accountId={ACCOUNT_ID}` - The account ID of the AWS Glue Data Catalog where the metadata resides. This account ID is used to construct the AWS Glue ARN in the lineage event.
+ `spark.glue.JOB_NAME` - The job name of the lineage event. In a Visual ETL flow, the job name is configured automatically as `spark.glue.JOB_NAME: {projectId}.{pathToNotebook}`.

**Spark compute limitations**
+ OpenLineage libraries for Spark are built into AWS Glue v5.0 and later for Spark DataFrames only. DynamicFrames are not supported.
+ A LineageEvent has a size limit of 300 KB.

## Capture lineage for AWS Glue Spark executions in Notebooks


Sessions in notebooks do not have a concept of a job. You can map the Spark executions to lineage events by generating a unique job name for the notebook. You can use the %%configure magic with the following parameters to enable lineage capture for Spark executions in the notebook. 

Note: for AWS Glue Spark executions in notebooks, lineage capture is automated when scheduled with a workflow in a shared environment using remote workflows.

```
%%configure --name {COMPUTE_NAME} -f
{
"--conf":"spark.extraListeners=io.openlineage.spark.agent.OpenLineageSparkListener --conf spark.openlineage.transport.type=amazon_datazone_api --conf spark.openlineage.transport.domainId={DOMAIN_ID} --conf spark.glue.accountId={ACCOUNT_ID} --conf spark.openlineage.columnLineage.datasetLineageEnabled=True --conf spark.openlineage.facets.custom_environment_variables=[AWS_DEFAULT_REGION;GLUE_VERSION;GLUE_COMMAND_CRITERIA;GLUE_PYTHON_VERSION;] --conf spark.glue.JOB_NAME={JOB_NAME}" 
}
```

Examples of {COMPUTE_NAME}: `project.spark.compatibility` or `project.spark.fineGrained`

Here are these parameters and what they configure, in detail:
+ `spark.extraListeners=io.openlineage.spark.agent.OpenLineageSparkListener`
  + The OpenLineageSparkListener is created and registered with Spark's listener bus.
+ `spark.openlineage.transport.type=amazon_datazone_api`
  + [https://openlineage.io/docs/integrations/spark/configuration/spark_conf](https://openlineage.io/docs/integrations/spark/configuration/spark_conf)
  + This is an OpenLineage specification that tells the OpenLineage plugin to use the DataZone API transport to emit lineage events to DataZone's PostLineageEvent API, to be captured in SageMaker.
+ `spark.openlineage.transport.domainId={DOMAIN_ID}`
  + This parameter establishes the domain to which the API transport submits the lineage events.
+ `spark.openlineage.facets.custom_environment_variables=[AWS_DEFAULT_REGION;GLUE_VERSION;GLUE_COMMAND_CRITERIA;GLUE_PYTHON_VERSION;]`
  + These environment variables (`AWS_DEFAULT_REGION`, `GLUE_VERSION`, `GLUE_COMMAND_CRITERIA`, and `GLUE_PYTHON_VERSION`), which the AWS Glue interactive session populates, are added to the LineageEvent.
+ `spark.glue.accountId={ACCOUNT_ID}`
  + The account ID of the AWS Glue Data Catalog where the metadata resides. This account ID is used to construct the AWS Glue ARN in the lineage event.
+ [optional] `spark.openlineage.transport.region={DOMAIN_REGION}`
  + If the domain region is different from the job's execution region, pass this parameter with the domain's region as the value.
+ `spark.glue.JOB_NAME`
  + The job name of the lineage event. For example, the job name can be set to `spark.glue.JOB_NAME: {projectId}.{pathToNotebook}`.

## Capture lineage from EMR Spark executions


EMR with the Spark engine has the necessary OpenLineage libraries built in. You need to pass the following Spark parameters. Be sure to replace {DOMAIN_ID} with your specific Amazon DataZone or Amazon SageMaker Unified Studio domain, and replace {ACCOUNT_ID} with the account ID where the EMR job is run.

```
%%configure --name emr-s.{EMR_SERVERLESS_COMPUTE_NAME}
{   
    "conf": {
         "spark.extraListeners": "io.openlineage.spark.agent.OpenLineageSparkListener",
         "spark.openlineage.columnLineage.datasetLineageEnabled":"True",
         "spark.openlineage.transport.type":"amazon_datazone_api",
         "spark.openlineage.transport.domainId":"{DOMAIN_ID}",
         "spark.openlineage.transport.region":"{DOMAIN_REGION}" // Only needed if the domain is in different region than that of the job         
         "spark.glue.accountId":"{ACCOUNT_ID}", // needed if AWS Glue is being used as the Hive metastore
         "spark.jars":"/usr/share/aws/datazone-openlineage-spark/lib/DataZoneOpenLineageSpark-1.0.jar" // Only needed incase of EMR-S
    }
}
```
+ Lineage is supported for the following EMR versions:
  + EMR-S: 7.5 and later
  + EMR on EC2: 7.11 and later
  + EMR on EKS: 7.12 and later
+ The JOB_NAME is the Spark application name, which is set automatically
+ Replace {DOMAIN_ID}, {ACCOUNT_ID}, and {DOMAIN_REGION} with your values
+ Ensure that the Amazon SageMaker Unified Studio VPC endpoint is deployed in the EMR VPC

# Permissions required for data lineage


## Read permissions to view lineage


Permissions on the following actions are needed to view the lineage graph:
+ `datazone:GetLineageNode`
+ `datazone:ListLineageNodeHistory`
+ `datazone:QueryGraph`

The above permissions are included in the `AmazonSageMakerDomainExecution` managed policy, and therefore every user in an Amazon SageMaker Unified Studio domain can invoke them to view the data lineage graph in Amazon SageMaker Unified Studio.

Permissions on the following actions are needed to view lineage events:
+ `datazone:ListLineageEvents`
+ `datazone:GetLineageEvent`

Users must have an IAM role with a policy that includes an "Allow" action on these APIs to view lineage events posted to Amazon SageMaker Unified Studio.

## Write permissions to publish lineage


### Lineage for AWS Glue crawler


The project user role is used to fetch required data from AWS Glue. The project user role should contain the following permissions on Glue operations:
+ `glue:listCrawls`
+ `glue:getConnection`

**Note**  
`SageMakerStudioProjectUserRolePolicy` already contains the above permissions.

### Lineage for Amazon Redshift


The project user role is used to execute queries on the cluster/workgroup defined in the connection. The project user role should contain the following permissions:
+ `redshift-data:BatchExecuteStatement`
+ `redshift-data:ExecuteStatement`
+ `redshift-data:DescribeStatement`
+ `redshift-data:GetStatementResult`

**Note**  
`SageMakerStudioProjectUserRolePolicy` already contains the above permissions.

In addition, the credentials provided for the Amazon Redshift connection in Amazon SageMaker Unified Studio should have the following permissions:
+ The `sys:operator` role, to access the data from system tables for all user queries performed on the cluster/workgroup
+ A SELECT grant on all the tables

### Lineage for AWS Glue, EMR jobs


The IAM role used to execute the job must have the following permissions to publish lineage events to Amazon SageMaker Unified Studio:
+ An ALLOW action on `datazone:PostLineageEvent`
+ If your Amazon SageMaker Unified Studio domain is encrypted with a KMS CMK (customer managed key), the job role must also have permissions to encrypt and decrypt
+ If the Spark job runs in a different account from the Amazon SageMaker Unified Studio domain account, associate the account with the domain before running the job. Follow [https://docs.aws.amazon.com/datazone/latest/userguide/working-with-associated-accounts.html](https://docs.aws.amazon.com/datazone/latest/userguide/working-with-associated-accounts.html) to set up the account association

### Publish Lineage using API


An IAM role with a policy that allows the `datazone:PostLineageEvent` action is needed to post lineage events programmatically.

# Publishing data lineage programmatically


You can also publish data lineage programmatically using the [PostLineageEvent](https://docs.aws.amazon.com/datazone/latest/APIReference/API_PostLineageEvent.html) API, which takes an OpenLineage run event as the payload. Additionally, the following APIs support retrieving lineage events and traversing the lineage graph: 
+ [GetLineageEvent](https://docs.aws.amazon.com/datazone/latest/APIReference/API_GetLineageEvent.html)
+ [ListLineageEvents](https://docs.aws.amazon.com/datazone/latest/APIReference/API_ListLineageEvents.html)
+ [QueryGraph](https://docs.aws.amazon.com/datazone/latest/APIReference/API_QueryGraph.html): a paginated API that returns the aggregate view of the lineage graph 
+ [GetLineageNode](https://docs.aws.amazon.com/datazone/latest/APIReference/API_GetLineageNode.html): gets a lineage node along with its immediate neighbors
+ [ListLineageNodeHistory](https://docs.aws.amazon.com/datazone/latest/APIReference/API_ListLineageNodeHistory.html): lists the lineage node versions, with each version derived from a data/metadata change event

The following is a sample PostLineageEvent operation payload:

```
{
  "producer": "https://github.com/OpenLineage/OpenLineage",
  "schemaURL": "https://openlineage.io/spec/2-0-0/OpenLineage.json#/definitions/RunEvent",    
  "eventType": "COMPLETE",
  "eventTime": "2024-05-04T10:15:30Z",
  "run": {
    "runId": "d2e7c111-8f3c-4f5b-9ebd-cb1d7995082a"
  },
  "job": {
    "namespace": "xyz.analytics",
    "name": "transform_sales_data"
  },
  "inputs": [
    {
      "namespace": "xyz.analytics",
      "name": "raw_sales",
      "facets": {
        "schema": {
          "_producer": "https://github.com/OpenLineage/OpenLineage",
          "_schemaURL": "https://openlineage.io/spec/facets/schema_dataset.json",
          "fields": [
            { "name": "region", "type": "string" },
            { "name": "year", "type": "int" },
            { "name": "created_at", "type": "timestamp" }
          ]
        }
      }
    }
  ],
  "outputs": [
    {
      "namespace": "xyz.analytics",
      "name": "clean_sales",
      "facets": {
        "schema": {
          "_producer": "https://github.com/OpenLineage/OpenLineage",
          "_schemaURL": "https://openlineage.io/spec/facets/schema_dataset.json",
          "fields": [
            { "name": "region", "type": "string" },
            { "name": "year", "type": "int" },
            { "name": "created_at", "type": "timestamp" }
            
          ]
        },
        "columnLineage": {
          "_producer": "https://github.com/OpenLineage/OpenLineage",
          "_schemaURL": "https://openlineage.io/spec/facets/columnLineage" + "DatasetFacet.json",
          "fields": {
            "id": {
              "inputFields": [
                {
                  "namespace": "xyz.analytics",
                  "name": "raw_sales",
                  "field": "id"
                }
              ]
            },
            "year": {
              "inputFields": [
                {
                  "namespace": "xyz.analytics",
                  "name": "raw_sales",
                  "field": "year"
                }
              ]
            },
            "created_at": {
              "inputFields": [
                {
                  "namespace": "xyz.analytics",
                  "name": "raw_sales",
                  "field": "created_at"
                }
              ]
            }
          }
        }
      }
    }
  ]
}
```
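
Assuming the payload above is saved to a local file (the file name `event.json` is illustrative), you can post it with the AWS CLI; the `--cli-binary-format` flag lets the raw JSON file be accepted as the event blob:

```
aws datazone post-lineage-event \
--domain-identifier <your_domain_id> \
--cli-binary-format raw-in-base64-out \
--event file://event.json
```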

# The importance of the sourceIdentifier attribute to lineage nodes


Every lineage node is uniquely identified by its sourceIdentifier (usually provided as part of the OpenLineage event), in addition to a system-generated nodeId. The sourceIdentifier is generated using the <namespace> and <name> of the node in the lineage event.

The following are examples of sourceIdentifier values for different types of nodes:
+ **Job nodes**
  + SourceIdentifier of job nodes is populated from <namespace>.<name> on the job node in open-lineage run event
+ **Jobrun nodes**
  + SourceIdentifier of jobrun nodes is populated from <job's namespace>.<job's name>/<run_id>
+ **Dataset nodes**
  + Dataset nodes representing AWS resources: sourceIdentifier is in ARN format
    + AWS Glue table: arn:aws:glue:<region>:<account-id>:table/<database>/<table-name>
    + AWS Glue table with federated sources: arn:aws:glue:<region>:<account-id>:table/<catalog><database>/<table-name>
      + Example: the catalog can be "s3tablescatalog"/"s3tablesBucket", "lakehouse_catalog", etc.
    + Amazon Redshift table:
      + serverless: arn:aws:redshift-serverless:<region>:<account-id>:table/workgroupName/<database>/<schema>/<table-name>
      + provisioned: arn:aws:redshift:<region>:<account-id>:table/clusterIdentifier/<database>/<schema>/<table-name>
    + Amazon Redshift view:
      + serverless: arn:aws:redshift-serverless:<region>:<account-id>:view/workgroupName/<database>/<schema>/<view-name>
      + provisioned: arn:aws:redshift:<region>:<account-id>:view/clusterIdentifier/<database>/<schema>/<view-name>
  + Dataset nodes representing SageMaker catalog resources:
    + Asset: amazon.datazone.asset/<assetId>
    + Listing (published asset): amazon.datazone.listing/<listingId>
  + In all other cases, dataset nodes' sourceIdentifier is populated using <namespace>/<name> of the dataset nodes in open-lineage run event
    + https://openlineage.io/docs/spec/naming/ contains naming convention for various datastores.

The following table contains examples of how the sourceIdentifier is generated for datasets of different types.



| Source for lineage event | Sample OpenLineage event data | Source ID computed by Amazon DataZone | 
| --- | --- | --- | 
|  AWS Glue ETL  |  <pre><br />{<br />   "run": {<br />      "runId":"4e3da9e8-6228-4679-b0a2-fa916119fthr",<br />      "facets":{<br />           "environment-properties":{<br />                 ....<br />                "environment-properties":{<br />                     "GLUE_VERSION":"3.0",<br />                     "GLUE_COMMAND_CRITERIA":"glueetl",<br />                     "GLUE_PYTHON_VERSION":"3"<br />                }<br />           }<br />       } <br />    },<br />    .....<br />   "outputs":[<br />      {<br />         "namespace":"namespace.output",<br />         "name":"output_name",<br />         "facets":{<br />             "symlinks":{<br />                 .... <br />                 "identifiers":[<br />                    {<br />                       "namespace":"arn:aws:glue:us-west-2:123456789012",<br />                       "name":"table/testdb/testtb-1",<br />                       "type":"TABLE"<br />                    }<br />                 ]<br />             }<br />        }<br />     }<br />   ]<br />    <br />}<br />                               </pre>  | arn:aws:glue:us-west-2:123456789012:table/testdb/testtb-1 If environment-properties contains GLUE_VERSION, GLUE_PYTHON_VERSION, etc., Amazon DataZone uses the namespace and name in the symlink of the dataset (input or output) to construct the AWS Glue table ARN for the sourceIdentifier. | 
|  Amazon Redshift (Provisioned)  |  <pre><br />{<br />   "run": {<br />      "runId":"4e3da9e8-6228-4679-b0a2-fa916119fthr",<br />      "facets":{<br />          .......<br />       } <br />    },<br />    .....<br />   "inputs":[<br />      {<br />         "namespace":"redshift://cluster-20240715.123456789012.us-east-1.redshift.amazonaws.com:5439",<br />         "name":"tpcds_data.public.dws_tpcds_7"<br />         "facets":{<br />             .....<br />        }<br />     }<br />   ]<br />    <br />}<br />                                </pre>  | arn:aws:redshift:us-east-1:123456789012:table/cluster-20240715/tpcds_data/public/dws_tpcds_7  If the namespace prefix is `redshift`, Amazon DataZone uses that to construct the Amazon Redshift ARN using the values of the namespace and name attributes. | 
|  Amazon Redshift (serverless)  |  <pre><br />{<br />   "run": {<br />      "runId":"4e3da9e8-6228-4679-b0a2-fa916119fthr",<br />      "facets":{<br />          .......<br />       } <br />    },<br />    .....<br />   "outputs":[<br />      {<br />         "namespace":"redshift://workgroup-20240715.123456789012.us-east-1.redshift-serverless.amazonaws.com:5439",<br />         "name":"tpcds_data.public.dws_tpcds_7"<br />         "facets":{<br />             .....<br />        }<br />     }<br />   ]<br />}<br />                                </pre>  | arn:aws:redshift-serverless:us-east-1:123456789012:table/workgroup-20240715/tpcds_data/public/dws_tpcds_7  As per the OpenLineage naming convention, the namespace for an Amazon Redshift dataset should be `provider://{cluster_identifier or workgroup}.{region_name}:{port}`. If the namespace contains `redshift-serverless`, Amazon DataZone uses that to construct the Amazon Redshift ARN using the values of the namespace and name attributes. | 
|  Any other datastore  |  Recommendation is to populate namespace and name as per OpenLineage convention defined in [https://openlineage.io/docs/spec/naming/](https://openlineage.io/docs/spec/naming/).  |  Amazon DataZone populates sourceIdentifier as <namespace>/<name>.  | 

# Linking dataset lineage nodes with assets imported into Amazon SageMaker Unified Studio



Every lineage node is uniquely identified by its sourceIdentifier. The previous section describes the formats of the sourceIdentifier. Amazon SageMaker Unified Studio automatically links dataset nodes with assets in the inventory based on the sourceIdentifier value. Therefore, use the same sourceIdentifier value as the dataset node when creating or updating the asset (via the AssetCommonDetailsForm::sourceIdentifier attribute).
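
For example, a minimal sketch that creates an asset whose AssetCommonDetailsForm carries the same sourceIdentifier as a dataset lineage node; the identifiers, asset name, and Glue table ARN are placeholders, and the managed Glue table asset type is used only as an illustration:

```
aws datazone create-asset \
--domain-identifier <your_domain_id> \
--owning-project-identifier <project_id> \
--name testtb-1 \
--type-identifier amazon.datazone.GlueTableAssetType \
--forms-input '[{"formName":"AssetCommonDetailsForm","typeIdentifier":"amazon.datazone.AssetCommonDetailsFormType","content":"{\"sourceIdentifier\":\"arn:aws:glue:us-west-2:123456789012:table/testdb/testtb-1\"}"}]'
```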

The following images show the sourceIdentifier on the asset details page, along with a lineage graph highlighting that the sourceIdentifier of the dataset node matches its downstream asset's sourceIdentifier.

Asset details page:

![\[Asset details page.\]](http://docs.aws.amazon.com/sagemaker-unified-studio/latest/userguide/images/Screenshot1datalineage.png)


Asset’s SourceIdentifier in node details:

![\[Asset’s SourceIdentifier in node details.\]](http://docs.aws.amazon.com/sagemaker-unified-studio/latest/userguide/images/Screenshot2datalineage.png)


Amazon Redshift dataset/table’s sourceIdentifier in node details:

![\[Amazon Redshift dataset/table’s sourceIdentifier in node details.\]](http://docs.aws.amazon.com/sagemaker-unified-studio/latest/userguide/images/Screenshot3datalineage.png)


# Troubleshooting data lineage

This comprehensive troubleshooting guide helps you resolve common data lineage visibility issues in Amazon SageMaker Unified Studio. The guide covers programmatically published events, data source configurations, and tool-specific lineage capture problems.

**Topics**
+ [

## Not seeing lineage graph for events published programmatically
](#lineage-troubleshooting-programmatic-events)
+ [

## Not seeing lineage for assets even though importLineage is shown as true in AWS Glue datasource
](#lineage-troubleshooting-glue-datasource)
+ [

## Not seeing lineage for assets even though importLineage is shown as true in Amazon Redshift datasource
](#lineage-troubleshooting-redshift-datasource)
+ [

## Troubleshooting lineage for lineage events published from AWS Glue ETL jobs/vETL/Notebooks
](#lineage-troubleshooting-glue-etl-jobs)
+ [

## Troubleshooting lineage for lineage events published from EMR-S/EC2/EKS
](#lineage-troubleshooting-emr)

## Not seeing lineage graph for events published programmatically


**Primary requirement:** Lineage graphs are only visible in Amazon SageMaker Unified Studio if at least one node of the graph is an asset. You must create assets for any dataset nodes and properly link them using the sourceIdentifier attribute.

**Troubleshooting steps:**

1. Create assets for any of the dataset nodes involved in your lineage events. Refer to the following sections for proper linking:
   + [The importance of the sourceIdentifier attribute to lineage nodes](datazone-data-lineage-sourceIdentifier-attribute.md) 
   + [Linking dataset lineage nodes with assets imported into Amazon SageMaker Unified Studio](datazone-data-lineage-linking-nodes.md)

1. Once the asset is created, verify that you can see the lineage on the asset details page.

1. If you are still not seeing lineage, verify that the asset's sourceIdentifier (present in AssetCommonDetailsForm) has the same value as the sourceIdentifier of any input/output dataset node in the lineage event.

   Use the following command to get asset details:

   ```
   aws datazone get-asset --domain-identifier {DOMAIN_ID} --identifier {ASSET_ID}
   ```

   The response appears as follows:

   ```
   {
       .....
       "formsOutput": [
           ..... 
           {
               "content": "{\"sourceIdentifier\":\"arn:aws:glue:eu-west-1:123456789012:table/testlfdb/testlftb-1\"}",
               "formName": "AssetCommonDetailsForm",
               "typeName": "amazon.datazone.AssetCommonDetailsFormType",
               "typeRevision": "6"
           },
           .....
       ],
       "id": "{ASSET_ID}",
       ....
   }
   ```

1. If both sourceIdentifiers match but you still cannot see lineage, retrieve the eventId from the PostLineageEvent response or use ListLineageEvents to find the eventId, then invoke GetLineageEvent:

   ```
   aws datazone list-lineage-events --domain-identifier {DOMAIN_ID}
   # You can apply additional filters, such as a time range.
   # Refer to https://docs.aws.amazon.com/datazone/latest/APIReference/API_ListLineageEvents.html
   
   aws datazone get-lineage-event --domain-identifier {DOMAIN_ID} --identifier {EVENT_ID} --output json event.json
   ```

   The response appears as follows and the open-lineage event is written to the event.json file:

   ```
   {
       "domainId": "{DOMAIN_ID}",
       "id": "{EVENT_ID}",
       "createdBy": ....,
       "processingStatus": "SUCCESS"/"FAILED etc",
       "eventTime": "2024-05-04T10:15:30+00:00",
       "createdAt": "2025-05-04T22:18:27+00:00"
   }
   ```

1. If the GetLineageEvent response's processingStatus is FAILED, contact AWS Support by providing the GetLineageEvent response for the appropriate event and the response from GetAsset.

1. If the GetLineageEvent response's processingStatus is SUCCESS, double-check that the sourceIdentifier of the dataset node from the lineage event matches the value in the GetAsset response above. The following steps help verify this.

1. Invoke GetLineageNode for the job run, where the identifier is composed of the namespace and name of the job and the run_id in the lineage event:

   ```
   aws datazone get-lineage-node --domain-identifier {DOMAIN_ID} --identifier <job's namespace>.<job's name>/<run_id>
   ```

   The response appears as follows:

   ```
   {
       .....
       "downstreamNodes": [
           {
               "eventTimestamp": "2024-07-24T18:08:55+08:00",
               "id": "afymge5k4v0euf"
           }
       ],
       "formsOutput": [
           <some forms corresponding to run and job>
       ],
       "id": "<system generated node-id for run>",
       "sourceIdentifier": "<job's namespace>.<job's name>/<run_id>",
       "typeName": "amazon.datazone.JobRunLineageNodeType",
       ....
       "upstreamNodes": [
           {
               "eventTimestamp": "2024-07-24T18:08:55+08:00",
               "id": "6wf2z27c8hghev"
           },
           {
               "eventTimestamp": "2024-07-24T18:08:55+08:00",
               "id": "4tjbcsnre6banb"
           }
       ]
   }
   ```

1. Invoke GetLineageNode again by passing in the downstream/upstream node identifier (which you think should be linked to the asset node):

   ```
   aws datazone get-lineage-node --domain-identifier {DOMAIN_ID} --identifier afymge5k4v0euf
   ```

   This returns the lineage node details corresponding to the dataset: `afymge5k4v0euf`. Verify that the sourceIdentifier matches that of the asset. If it doesn't match, fix the namespace and name of the dataset in the lineage event and publish the lineage event again. You will then see the lineage graph on the asset.

   ```
   {
       .....
       "downstreamNodes": [],
       "eventTimestamp": "2024-07-24T18:08:55+08:00",
       "formsOutput": [
           .....
       ],
       "id": "afymge5k4v0euf",
       "sourceIdentifier": "<sample_sourceIdentifier_value>",
       "typeName": "amazon.datazone.DatasetLineageNodeType",
       "typeRevision": "1",
       ....
       "upstreamNodes": [
           ...
       ]
   }
   ```

## Not seeing lineage for assets even though importLineage is shown as true in AWS Glue datasource


Open the data source run(s) associated with the AWS Glue data source to see the assets imported as part of the run and the lineage import status, along with an error message in case of failure.

**Limitations:**
+ Lineage for crawler runs importing more than 250 tables isn't supported.

## Not seeing lineage for assets even though importLineage is shown as true in Amazon Redshift datasource


Lineage on Amazon Redshift tables is captured by retrieving user queries performed on the Amazon Redshift database, from the system tables.

**Lineage is not supported in the following cases:**
+ External tables
+ UNLOAD / COPY
+ MERGE / UPDATE
+ Queries that produce lineage events larger than 16 MB
+ Datashares
+ Column lineage limitations:
  + Column lineage is not supported for queries that do not reference specific columns, such as `select * from tableA`
  + Column lineage is not supported for queries involving temp tables
+ Any limitations of the [OpenLineageSqlParser](https://github.com/OpenLineage/OpenLineage/blob/main/integration/sql/README.md) can result in failures to process some queries

**Troubleshooting steps:**

1. On the Amazon Redshift connection details, you will see the lineageJobId along with the job run schedule. Alternatively, you can fetch it using the [get-connection](https://docs.aws.amazon.com/cli/latest/reference/datazone/get-connection.html) API.
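
   For example, you can fetch the connection details with the AWS CLI; the placeholders are illustrative:

   ```
   aws datazone get-connection --domain-identifier {DOMAIN_ID} --identifier {CONNECTION_ID}
   ```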

1. Invoke [list-job-runs](https://docs.aws.amazon.com/cli/latest/reference/datazone/list-job-runs.html) to get the runs corresponding to the job:

   ```
   aws datazone list-job-runs --domain-identifier {DOMAIN_ID} --job-identifier {JOB_ID}
   ```

   The response appears as follows:

   ```
   {
      "items": [ 
         { 
            .....
            "error": { 
               "message": "string"
            },
            "jobId": {JOB_ID},
            "jobType": LINEAGE,
            "runId": ...,
            "runMode": SCHEDULED,
            "status": SCHEDULED | IN_PROGRESS | SUCCESS | PARTIALLY_SUCCEEDED | FAILED | ABORTED | TIMED_OUT | CANCELED
            .....
         }
      ],
      "nextToken": ...
   }
   ```

1. If no job runs are returned, check your job run schedule on the Amazon Redshift connection details. Reach out to AWS Support with the lineageJobId, connectionId, projectId, and domainId if job runs are not executed per the given schedule.

1. If job runs are returned, pick the relevant jobRunId and invoke GetJobRun to get the job run details:

   ```
   aws datazone get-job-run --domain-identifier {DOMAIN_ID} --identifier {JOB_RUN_ID}
   ```

   The response appears as follows:

   ```
   {
     ....
     "details": {
       "lineageRunDetails": {
         "sqlQueryRunDetails": {
           "totalQueriesProcessed": ..,
           "numQueriesFailed": ...,
           "errorMessages":....,
           "queryEndTime": ...,
           "queryStartTime": ...
         }
       }
     },
     .....
   }
   ```

1. The job run fails if none of the queries are processed successfully, partially succeeds if some queries are processed successfully, and succeeds if all queries are processed successfully. The response also contains the start and end times of the processed queries.

## Troubleshooting lineage for lineage events published from AWS Glue ETL jobs, visual ETL, or notebooks


**Limitations:**
+ OpenLineage libraries for Spark are built into AWS Glue v5.0+ and support Spark DataFrames only; AWS Glue DynamicFrames are not supported.
+ Lineage capture for Spark jobs with fine-grained permission mode is not supported.
+ A lineage event has a size limit of 300 KB.

**Common Issues:**
+ Verify that the necessary permissions are granted to your job execution role, as described in [Permissions required for data lineage](datazone-data-lineage-permissions.md)
+ A Spark job that works directly with Amazon S3 files produces lineage events with Amazon S3 datasets, even when those files are cataloged in AWS Glue. To generate events that include AWS Glue tables and build a proper lineage graph with AWS Glue assets, your Spark job should instead work with AWS Glue tables.
+ If the AWS Glue ETL job is in a VPC, make sure the Amazon DataZone VPC endpoint is deployed in that VPC.
+ In case your domain is using a CMK, make sure that the AWS Glue execution role has the appropriate KMS permissions. The CMK can be found via [https://docs.aws.amazon.com/datazone/latest/APIReference/API_GetDomain.html](https://docs.aws.amazon.com/datazone/latest/APIReference/API_GetDomain.html)
+ Failed to publish a lineage event because the payload is greater than 300 KB:
  + Add the following to the Spark conf (see the sketch after this list for one way to pass it to an AWS Glue job run):

    ` "spark.openlineage.columnLineage.datasetLineageEnabled": "true" `
  + **Important Note:**
    + Column lineage typically constitutes a significant portion of the payload, and enabling this setting generates the column lineage info more efficiently.
    + Disabling column lineage can help reduce the payload size and avoid validation exceptions.
+ Cross-account lineage event submission:
  + Follow [https://docs.aws.amazon.com/datazone/latest/userguide/working-with-associated-accounts.html](https://docs.aws.amazon.com/datazone/latest/userguide/working-with-associated-accounts.html) to set up the account association
  + Ensure that the AWS RAM policy is using the latest policy version
+ If your Amazon SageMaker Unified Studio domain is in a different Region from that of the job:
  + Add this Spark parameter: `"spark.openlineage.transport.region":"{region of your domain}"`
+ When the same DataFrame is written to multiple destinations or formats in sequence, the lineage SparkListener may only capture the lineage for the first write operation:
  + For optimization purposes, Spark's internals may reuse the execution plan definition for consecutive write operations on the same DataFrame. This can lead to only the first lineage event being captured.
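
The following is a minimal sketch of passing the Spark configuration above to an AWS Glue job run through the `--conf` special job parameter; the job name is a placeholder:

```
aws glue start-job-run \
    --job-name my-lineage-job \
    --arguments '{"--conf": "spark.openlineage.columnLineage.datasetLineageEnabled=true"}'
```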

**Troubleshooting steps:**

1. The lineage graph can only be visualized if at least one node of the graph is an asset. Therefore, create assets for any of the datasets (such as tables) involved in the job, and then attempt to visualize lineage on the asset.

1. First, invoke [ListLineageEvents](https://docs.aws.amazon.com/datazone/latest/APIReference/API_ListLineageEvents.html) to see if the lineage events were submitted (refer to the linked doc to pass filters), as in the following sketch.
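
   The following is a minimal AWS CLI sketch with placeholder values; adjust the timestamp filters so the window covers your job run:

   ```
   aws datazone list-lineage-events --domain-identifier {DOMAIN_ID} --timestamp-after {START_TIME} --timestamp-before {END_TIME}
   ```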

1. If no events are submitted, check the Amazon CloudWatch logs to see if any exceptions are thrown from the Amazon DataZone lineage library:
   + Log groups: /aws-glue/jobs/error, /aws-glue/sessions/error
   + Make sure logging is enabled: [https://docs.aws.amazon.com/glue/latest/dg/monitor-continuous-logging-enable.html](https://docs.aws.amazon.com/glue/latest/dg/monitor-continuous-logging-enable.html)
   + The following is the Amazon CloudWatch Logs Insights query to check for exceptions:

     ```
     fields @timestamp, @message
     | filter @message like /(?i)exception/ and like /(?i)datazone/
     | sort @timestamp desc
     ```
   + The following is the Amazon CloudWatch Logs Insights query to confirm that events are submitted:

     ```
     fields @timestamp, @message
     | filter @message like /Successfully posted a LineageEvent:/
     | sort @timestamp desc
     ```

1. Fetch the lineage events generated from this job by executing the Python script [retrieve_lineage_events.py](https://github.com/aws-samples/amazon-datazone-examples/tree/main/data_lineage).

1. Check if the dataset on which you expected lineage is present in any of the events:
   + You can ignore empty events without any input/output nodes.
   + Check if your dataset node has the AWS Glue ARN prefix in the namespace of the node or in the "symlink" facet of the node. If you don't see any node with the AWS Glue ARN prefix, it means your script is not using AWS Glue tables directly, and hence lineage is not linked to the AWS Glue asset. One way to work around this is to update the script to work with AWS Glue tables.

1. If you are still unable to see lineage and your scenario doesn't fall under the limitations above, reach out to AWS Support and provide:
   + Spark config parameters
   + The lineage events file from executing the retrieve_lineage_events.py script
   + The GetAsset response

## Troubleshooting lineage for lineage events published from EMR Serverless, EMR on EC2, or EMR on EKS


**Notes:**
+ Lineage is supported from the following versions of Amazon EMR:
  + EMR-S (EMR Serverless): 7.5+
  + EMR-EC2: 7.11+
  + EMR-EKS: 7.12+
+ Lineage capture for Spark jobs with fine-grained permission mode is not supported.
+ If you are using EMR outside of Amazon SageMaker Unified Studio, the Amazon DataZone VPC endpoint needs to be deployed in the EMR VPC.
+ A lineage event has a size limit of 300 KB.

**Common Issues:**
+ Verify that the necessary permissions are granted to your job execution role, as described in [Permissions required for data lineage](datazone-data-lineage-permissions.md)
+ In case your domain is using a CMK, make sure that the job's execution role has the appropriate KMS permissions. The CMK can be found via [https://docs.aws.amazon.com/datazone/latest/APIReference/API_GetDomain.html](https://docs.aws.amazon.com/datazone/latest/APIReference/API_GetDomain.html)
+ Failed to publish a lineage event because the payload is greater than 300 KB:
  + Add the following to the Spark conf, which generates the event payload more efficiently (see the sketch after this list for one way to pass it to an EMR Serverless job run):

    ` "spark.openlineage.columnLineage.datasetLineageEnabled": "true" `
+ Cross-account lineage event submission:
  + Follow [https://docs.aws.amazon.com/datazone/latest/userguide/working-with-associated-accounts.html](https://docs.aws.amazon.com/datazone/latest/userguide/working-with-associated-accounts.html) to set up the account association
  + Ensure that the AWS RAM policy is using the latest policy version
+ If your Amazon SageMaker Unified Studio domain is in a different Region from that of the job:
  + Add this Spark parameter: `"spark.openlineage.transport.region":"{region of your domain}"`
+ When the same DataFrame is written to multiple destinations or formats in sequence, the lineage SparkListener may only capture the lineage for the first write operation:
  + For optimization purposes, Spark's internals may reuse the execution plan definition for consecutive write operations on the same DataFrame. This can lead to only the first lineage event being captured.
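
The following is a minimal sketch of passing the OpenLineage Spark configuration above when starting an EMR Serverless job run; the application ID, execution role ARN, and script location are placeholders:

```
aws emr-serverless start-job-run \
    --application-id {APPLICATION_ID} \
    --execution-role-arn {EXECUTION_ROLE_ARN} \
    --job-driver '{
        "sparkSubmit": {
            "entryPoint": "s3://{BUCKET}/scripts/my-job.py",
            "sparkSubmitParameters": "--conf spark.openlineage.transport.region={region of your domain} --conf spark.openlineage.columnLineage.datasetLineageEnabled=true"
        }
    }'
```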

**Troubleshooting steps:**

1. The lineage graph can only be visualized if at least one node of the graph is an asset. Therefore, create assets for any of the datasets (such as tables) involved in the job, and then attempt to visualize lineage on the asset.

1. First, invoke [ListLineageEvents](https://docs.aws.amazon.com/datazone/latest/APIReference/API_ListLineageEvents.html) to see if the lineage events were submitted (refer to the linked doc to pass filters).

1. If no events are submitted, check the Amazon CloudWatch logs to see if any exceptions are thrown from the Amazon DataZone lineage library:
   + **EC2:**
     + You can provide the CloudWatch log group or the Amazon S3 log destination path at the time of creating the EC2 cluster. Refer to [https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-debugging.html](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-debugging.html)
     + You will see logs in the stderr file within the cluster-id/containers/application_<application-id>/ folder.
   + **EKS:**
     + You need to provide the CloudWatch log group or the Amazon S3 log destination path while submitting the Spark job. Refer to [https://docs.aws.amazon.com/emr/latest/EMR-on-EKS-DevelopmentGuide/getting-started.html](https://docs.aws.amazon.com/emr/latest/EMR-on-EKS-DevelopmentGuide/getting-started.html)
     + You will see logs in the stderr file of the Spark driver within the virtual-cluster-id/jobs/job-id/containers/<pod-name>/ folder.
   + **EMR-S:**
     + You can enable logs by following [https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/logging.html](https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/logging.html)
     + You will see logs in the stderr files of the Spark driver
   + The following is the Amazon CloudWatch Logs Insights query to check for exceptions:

     ```
     fields @timestamp, @message
     | filter @message like /(?i)exception/ and like /(?i)datazone/
     | sort @timestamp desc
     ```
   + The following is the Amazon CloudWatch Logs Insights query to inspect generated events:

     ```
     fields @timestamp, @message
     | filter @message like /Successfully posted a LineageEvent:/
     | sort @timestamp desc
     ```

1. Fetch the lineage events generated from this job by executing the Python script [retrieve_lineage_events.py](https://github.com/aws-samples/amazon-datazone-examples/tree/main/data_lineage).

1. Check if the dataset on which you expected lineage is present in any of the events:
   + You can ignore empty events without any input/output nodes.
   + Check if your dataset node has a namespace/name matching the sourceIdentifier of the asset. If you don't see any node with the asset's sourceIdentifier, refer to the following docs on how to fix it:
     + [The importance of the sourceIdentifier attribute to lineage nodes](datazone-data-lineage-sourceIdentifier-attribute.md)
     + [Linking dataset lineage nodes with assets imported into Amazon SageMaker Unified Studio](datazone-data-lineage-linking-nodes.md)

1. If you are still unable to see lineage and your scenario doesn't fall under the limitations above, reach out to AWS Support and provide:
   + Spark config parameters
   + The lineage events file from executing the retrieve_lineage_events.py script
   + The GetAsset response

# Analyze Amazon SageMaker Unified Studio data with external analytics applications via JDBC connection

Amazon SageMaker Unified Studio enables data consumers to easily locate and subscribe to data from multiple sources within a single project and analyze this data using Amazon Athena, Amazon Redshift Query Editor, and Amazon SageMaker.

Amazon SageMaker Unified Studio also supports authentication via the Athena JDBC driver, which enables users to query their subscribed Amazon SageMaker Unified Studio data using popular external SQL and analytics tools such as SQL Workbench, DBeaver, Tableau, Domino, Power BI, and many others. Users can authenticate using their corporate credentials through SSO or IAM and begin analyzing the subscribed data within their Amazon SageMaker Unified Studio projects.

Amazon SageMaker Unified Studio's support of the Athena JDBC driver provides the following benefits:
+ Greater tool choice for querying and visualization - data consumers can connect to Amazon SageMaker Unified Studio using their preferred tools from the wide range of analytics tools that support a JDBC connection. This enables them to continue using the software they are familiar with, without the need to learn new tools for data consumption.
+ Programmatic access - a JDBC connection to access governed data via servers or custom applications enables data consumers to perform automated and more complex data operations.

You can use your JDBC URL to connect your external analytics tools to your Amazon SageMaker Unified Studio subscribed data. To obtain your JDBC URL, perform the following procedure:

**Important**  
In the current release, Amazon SageMaker Unified Studio supports authentication using the Amazon Athena JDBC Driver. To complete this procedure, make sure that you have downloaded and installed the latest [Athena JDBC driver](https://docs.aws.amazon.com/athena/latest/ug/jdbc-v3-driver.html) for your analytics application of choice. 

1. Navigate to Amazon SageMaker Unified Studio using the URL from your admin and log in using your SSO or AWS credentials. 

1. Choose **Select project** from the top navigation pane and select the project where you have the data that you want to analyze.

1. In the **Project overview**, choose the **JDBC connection details** tab.

1. In **JDBC connection details** choose your authentication method (**Using IDC auth** or **Using IAM auth**) and then choose the icon next to **JDBC connection URL** to copy the string or the individual parameters of the JDBC URL. You can then use it to connect to your external analytics application. 

When you connect your external analytics application to Amazon DataZone using your JDBC URL or parameters, you invoke the `RedeemAccessToken` API. The `RedeemAccessToken` API exchanges an Identity Center access token for the `AmazonDataZoneDomainExecutionRole` credentials, which are used to call the `GetEnvironmentCredentials` API.

For more information about the authentication mechanism that uses IAM credentials to connect to Amazon DataZone-governed data in Athena, see [DataZone IAM Credentials Provider](https://docs.aws.amazon.com/athena/latest/ug/jdbc-v3-driver-datazone-iamcp.html). For more information about the authentication mechanism that enables connecting to Amazon DataZone-governed data in Athena using IAM Identity Center, see [DataZone Idc Credentials Provider](https://docs.aws.amazon.com/athena/latest/ug/jdbc-v3-driver-datazone-idc.html).

## RedeemAccessToken API Reference


**Request syntax**

```
POST /sso/redeem-token HTTP/1.1
Content-type: application/json

{
   "domainId": "string",
   "accessToken": "string"
}
```

**Request parameters**

The request uses the following parameters.

**domainId**  
The ID of the Amazon DataZone domain.  
Type: string  
Pattern: `^dzd[-_][a-zA-Z0-9_-]{1,36}$`  
Required: yes

**accessToken**  
The Identity Center access token.  
Type: string  
Required: yes

**Response syntax**

```
HTTP/1.1 200
Content-type: application/json

{
   "credentials": AwsCredentials
}
```

**Response elements**

**credentials**  
The `AmazonDataZoneDomainExecutionRole` credentials that are used to call the `GetEnvironmentCredentials` API.  
Type: `AwsCredentials` object. This data type includes the following properties:  
+ accessKeyId: AccessKeyId
+ secretAccessKey: SecretAccessKey
+ sessionToken: SessionToken
+ expiration: Timestamp

**Errors**

**AccessDeniedException**  
You do not have sufficient access to perform this action.  
HTTP Status Code: 403

**ResourceNotFoundException**  
The specified resource cannot be found.  
HTTP Status Code: 404

**ValidationException**  
The input fails to satisfy the constraints specified by the AWS service.  
HTTP Status Code: 400

**InternalServerException**  
The request has failed because of an unknown error, exception or failure.  
HTTP Status Code: 500
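
**Example request**

The following is a sketch for illustration only, assuming the standard `datazone.{region}.amazonaws.com` endpoint; the domain ID and access token are placeholders, and any additional request signing that your environment requires is omitted:

```
curl -X POST "https://datazone.{region}.amazonaws.com/sso/redeem-token" \
    -H "Content-Type: application/json" \
    -d '{
        "domainId": "{DOMAIN_ID}",
        "accessToken": "{IDENTITY_CENTER_ACCESS_TOKEN}"
    }'
```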

# Metadata enforcement rules for publishing


The metadata enforcement rules for publishing in Amazon SageMaker Unified Studio strengthen data governance by enabling domain unit owners to establish clear metadata requirements for data producers and streamline access requests.

The feature is supported in all the AWS commercial Regions where Amazon SageMaker Unified Studio is currently available.

Domain unit owners can complete the following procedure to configure metadata enforcement in Amazon SageMaker Unified Studio:

1. Navigate to Amazon SageMaker Unified Studio using the URL from your admin and log in using your SSO or AWS credentials. 

1. Choose **Govern** -> **Domain units** from the top navigation pane and then choose the domain or the domain unit that you want to work with.

1. Choose the **Rules** tab and then choose **Add**.

1. On the **Rule configuration** page, do the following and then choose **Add rule**:
   + Specify a name for your rule.
   + Under **Action**, choose **Data asset and product publishing** or **Subscription request**.
   + If you chose **Subscription request**, then under **Required metadata forms**, choose **Add metadata form**, choose a metadata form within the domain / domain unit that you want to add to this rule, and then choose **Add**. You can add up to 5 metadata forms per rule.
   + If you chose **Data asset and product publishing**, then under **Rule requirements**, choose either **Metadata forms** or **Glossary association**:
     + If you chose **Metadata forms**, then under **Required metadata forms**, choose **Add metadata form**, choose a metadata form within the domain / domain unit that you want to add to this rule, and then choose **Add**. You can add up to 5 metadata forms per rule.
     + If you chose **Glossary association**, then choose **Add terms** and add your glossary terms to your rule. You can add up to 5 glossary terms per rule.
   + Under **Scope**, specify the data entities with which you want to associate these forms. You can choose data products and/or data assets.
   + Under **Data asset types**, specify whether the rule applies across all asset types or is limited to selected asset types.
   + Under **Projects**, specify whether the required forms will be associated with data products and/or assets published by all projects or only selected projects in this domain unit. Also, check **Cascade rule to child domain units** if you want child domain units to inherit this requirement. 