

# Data


Data in Amazon SageMaker Unified Studio includes data in projects of which you are a member and data that you can discover and subscribe to from other projects.

The **Data** page in Amazon SageMaker Unified Studio displays a data browser in which you can explore datasets, files, and artifacts that you connect to your project. Projects configured with certain profiles contain a lakehouse architecture for accessing data within your project, as well as a default Amazon Redshift connection and an Amazon S3 bucket. You can add data to the project on the **Data** page by uploading data from your local desktop or by gaining access to existing data sources and then adding a connection to them in your Amazon SageMaker Unified Studio project. For more information about lakehouse architecture, see [What is the lakehouse architecture of Amazon SageMaker?](https://docs.aws.amazon.com/sagemaker-lakehouse-architecture/latest/userguide/what-is-smlh.html).

You can also connect to AWS Glue and Amazon Redshift data sources from within your project catalog. The project catalog contains your data as data products and assets with metadata. When you want to share your data with other projects in the domain, publish the data from your project catalog into the Amazon SageMaker Catalog. If you want to create more detailed access control for your data before allowing other users to subscribe to it, you can configure fine-grained access control. For more information, see [Data inventory and publishing](data-publishing.md) and [Fine-grained access control to data](fine-grained-access-control.md).

The Amazon SageMaker Catalog contains business glossaries and metadata forms. If you have been granted access through the authorization policies, you can create business glossaries and metadata forms. For more information, see [Domain units and authorization policies in Amazon SageMaker Unified Studio](https://docs.aws.amazon.com/sagemaker-unified-studio/latest/adminguide/domain-units.html) and [Amazon SageMaker Catalog](working-with-business-catalog.md).

You can use the Amazon SageMaker Catalog to discover and subscribe to assets and data products. For more information, see [Data discovery, subscription, and consumption](discover-data.md).

**Topics**
+ [Data in Identity Center-based domains](data-identity-center-based-domains.md)
+ [Working with Catalog in IAM-based domains](data-iam-based-domains.md)
+ [Data and catalog connections in IAM-based domains](data-connections-iam-based-domains.md)
+ [Third-party business data catalog integrations](third-party-catalog-integrations.md)

# Data in Identity Center-based domains


**Topics**
+ [Amazon S3 data in Amazon SageMaker Unified Studio](data-s3.md)
+ [The lakehouse architecture of Amazon SageMaker](lakehouse.md)
+ [Amazon SageMaker Catalog](working-with-business-catalog.md)

# Amazon S3 data in Amazon SageMaker Unified Studio

You can bring Amazon S3 data into your project and access it on the **Data** page of your project in Amazon SageMaker Unified Studio.

To add S3 tables to your lakehouse in Amazon SageMaker Unified Studio, see [Amazon S3 tables integration](https://docs.aws.amazon.com/sagemaker-lakehouse-architecture/latest/userguide/lakehouse-s3-tables-integration.html).

To add S3 data as assets in your Amazon SageMaker Unified Studio project catalog, see [Adding Amazon S3 data](adding-existing-s3-data.md). In Amazon SageMaker Unified Studio, assets represent specific types of data resources such as database tables, dashboards, S3 buckets or prefixes, or machine learning models. 

For S3 data in projects, SageMaker Catalog supports the creation of an asset type of **S3 Object Collection** for an Amazon S3 bucket or S3 prefix in the project. The S3 Object Collection asset type can be curated with business context by adding business names, descriptions, READMEs, glossary terms, and metadata forms, including mandatory metadata forms. Assets in Amazon SageMaker Unified Studio are versioned as changes are made to their metadata.

# Adding Amazon S3 data


To bring Amazon S3 data into your project, you must first gain access to the data and then add the data to your project. You can gain access to the data by using the project role or an access role.

**Note**  
 If you are using a bucket in a different account than the account that contains the project tooling environment, you must use an access role to gain access to the data.

## Prerequisite option 1 (recommended): Gain access using an access role


Work with your admin to complete the following steps:

1. Retrieve the project role ARN and the project ID and send them to your admin.

   1. Navigate to Amazon SageMaker Unified Studio using the URL from your admin and log in using your SSO or AWS credentials. 

   1. Navigate to the project that you want to add Amazon S3 data to. You can do this by choosing **Browse all projects** from the center menu, and then selecting the name of the project.

   1. On the **Project overview** page, copy the project role ARN and the project ID.

1. The admin then must go to the Amazon S3 console and add a CORS policy to the bucket that you want to access in your project.

   1. Navigate to the Amazon S3 console.

   1. Navigate to the bucket you want to grant access to.

   1. On the **Permissions** tab, under **Cross-origin resource sharing (CORS)**, choose **Edit**.

   1. Enter the new CORS policy, then choose **Save changes**.

      ```
      [
          {
              "AllowedHeaders": [
                  "*"
              ],
              "AllowedMethods": [
                  "PUT",
                  "GET",
                  "POST",
                  "DELETE",
                  "HEAD"
              ],
              "AllowedOrigins": [
                  "domainUrl" // example: https://dzd_abcdefg1234567.sagemaker.us-east-1.on.aws
              ],
              "ExposeHeaders": [
                  "x-amz-version-id"
              ]
          }
      ]
      ```

   1. Choose the name of an object to view its details. On the **Properties** tab, note the Amazon Resource Name (ARN) and the S3 URI; you will need these later.

1. The admin then must go to the IAM console and create an access role.

   1. Navigate to the IAM console.

   1. On the **Roles** page, choose **Create role**.

   1. Under **Trusted entity type**, choose **Custom trust policy**.

   1. Edit the policy to include the project ID, the project ARN, and the AWS account ID to grant Amazon S3 access permissions.

      ```
      {
          "Version":"2012-10-17",		 	 	 
          "Statement": [
              {
                  "Effect": "Allow",
                  "Principal": {
                      "Service": "access-grants.s3.amazonaws.com"
                  },
                  "Action": [
                      "sts:AssumeRole",
                      "sts:SetSourceIdentity"
                  ],
                  "Condition": {
                      "StringEquals": {
                      "aws:SourceAccount": "111122223333"
                      }
                  }
              },
              {
                  "Effect": "Allow",
                  "Principal": {
                      "AWS": "project-role-arn"
                  },
                  "Action": "sts:AssumeRole",
                  "Condition": {
                      "StringEquals": {
                          "sts:ExternalId": "project-id"
                      }
                  }
              },
              {
                  "Effect": "Allow",
                  "Principal": {
                      "AWS": "project-role-arn"
                  },
                  "Action": [
                      "sts:SetSourceIdentity"
                  ],
                  "Condition": {
                      "StringLike": {
                          "sts:SourceIdentity": "${aws:PrincipalTag/datazone:userId}"
                      }
                  }
              },
              {
                  "Effect": "Allow",
                  "Principal": {
                      "AWS": "project-role-arn"
                  },
                  "Action": "sts:TagSession",
                  "Condition": {
                      "StringEquals": {
                          "aws:RequestTag/AmazonDataZoneProject": "project-id",
                          "aws:RequestTag/AmazonDataZoneDomain": "domain-id"
                      }
                  }
              }
          ]
      }
      ```


   1. Choose **Next** twice.

   1. Enter a name for the role, then choose **Create role**.

   1. Select the access role from the list on the **Roles** page.

   1. On the **Permissions** tab of the role, choose **Add permissions**, then **Create inline policy**.

   1. Use the JSON editor to create a policy that grants Amazon S3 access permissions.
**Note**  
Amazon SageMaker Unified Studio grants access to subscribed assets using S3 Access Grants. To grant access to data using S3 Access Grants, an S3 Access Grants instance is required. Amazon SageMaker Unified Studio uses an existing instance if one is available, or creates one. S3 Access Grants needs one instance per AWS Region in a single AWS account. For more information, see [Working with S3 Access Grants instances](https://docs.aws.amazon.com/AmazonS3/latest/userguide/access-grants-instance.html).

   1. Choose **Next**.

   1. Enter a name for the policy, then choose **Create policy**.

   1. Optional: if you want to support cross-account data sharing for S3, add the following to your policy:

      ```
      {
          "Sid": "CrossAccountS3AGResourceSharingPermissions",
          "Effect": "Allow",
          "Action": [
              "ram:CreateResourceShare"
          ],
          "Resource": "*",
          "Condition": {
              "StringEqualsIfExists": {
                  "ram:RequestedResourceType": [
                      "s3:AccessGrants"
                  ]
              },
              "StringEquals": {
                  "aws:ResourceAccount": "${aws:PrincipalAccount}"
              }
          }
      },
      {
          "Sid": "CrossAccountS3AGResourceSharingPolicyPermissions",
          "Effect": "Allow",
          "Action": [
              "s3:PutAccessGrantsInstanceResourcePolicy"
          ],
          "Resource": "arn:aws:s3:*:*:access-grants/default",
          "Condition": {
              "StringEquals": {
                  "aws:ResourceAccount": "${aws:PrincipalAccount}"
              }
          }
      }
      ```

   1. Choose **Next**.

   1. Enter a name for the policy, then choose **Create policy**.

   1. Optional: If the bucket is in a different account than the access role, add a bucket policy that grants cross-account permissions to the access role. For example:

      ```
      {
          "Version":"2012-10-17",		 	 	 
          "Statement": [
              {
                  "Sid": "S3AdditionalBucketPermissions",
                  "Effect": "Allow",
                  "Principal": {
                      "AWS": "access-role-arn"
                  },
                  "Action": [
                      "s3:ListBucket",
                      "s3:GetBucketLocation"
                  ],
                  "Resource": [
                      "arn:aws:s3:::bucketName"
                  ]
              },
              {
                  "Sid": "S3AdditionalObjectPermissions",
                  "Effect": "Allow",
                  "Principal": {
                      "AWS": "access-role-arn"
                  },
                  "Action": [
                      "s3:GetObject*",
                      "s3:PutObject"
                  ],
                  "Resource": [
                      "arn:aws:s3:::bucketName/key/*"
                  ]
              }
          ]
      }
      ```


   1. Choose **Update policy**.
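The trust policy in this procedure has several placeholders that must all agree: the project ID appears as both the external ID and a session tag, and the project role ARN is the principal of three statements. As a sanity check before pasting into the IAM console, a short script can assemble the same document from one set of values (the identifiers below are placeholders, not real resources):

```python
import json

# Placeholder values -- substitute the real ones copied from the
# Project overview page and your AWS account.
account_id = "111122223333"
project_id = "project-id"
domain_id = "domain-id"
project_role_arn = f"arn:aws:iam::{account_id}:role/datazone_usr_role_example"

trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            # Lets the S3 Access Grants service assume the access role.
            "Effect": "Allow",
            "Principal": {"Service": "access-grants.s3.amazonaws.com"},
            "Action": ["sts:AssumeRole", "sts:SetSourceIdentity"],
            "Condition": {"StringEquals": {"aws:SourceAccount": account_id}},
        },
        {
            # Lets the project role assume the access role, using the
            # project ID as the external ID.
            "Effect": "Allow",
            "Principal": {"AWS": project_role_arn},
            "Action": "sts:AssumeRole",
            "Condition": {"StringEquals": {"sts:ExternalId": project_id}},
        },
        {
            # Propagates the user identity from the project role's session tag.
            "Effect": "Allow",
            "Principal": {"AWS": project_role_arn},
            "Action": ["sts:SetSourceIdentity"],
            "Condition": {
                "StringLike": {
                    "sts:SourceIdentity": "${aws:PrincipalTag/datazone:userId}"
                }
            },
        },
        {
            # Restricts session tagging to this project and domain.
            "Effect": "Allow",
            "Principal": {"AWS": project_role_arn},
            "Action": "sts:TagSession",
            "Condition": {
                "StringEquals": {
                    "aws:RequestTag/AmazonDataZoneProject": project_id,
                    "aws:RequestTag/AmazonDataZoneDomain": domain_id,
                }
            },
        },
    ],
}

# Serializing confirms the document is well-formed JSON before use.
print(json.dumps(trust_policy, indent=4))
```

Generating the document this way keeps the project ID consistent across the `sts:ExternalId` and session-tag conditions, which is a common source of copy-paste errors.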

## Prerequisite option 2: Gain access using the project role


Work with your admin to complete the following steps:

1. Retrieve the project role ARN and send it to your admin.

   1. Navigate to Amazon SageMaker Unified Studio using the URL from your admin and log in using your SSO or AWS credentials. 

   1. Navigate to the project that you want to add Amazon S3 data to. You can do this by choosing **Browse all projects** from the center menu, and then selecting the name of the project.

   1. On the **Project overview** page, copy the project role ARN.

1. The admin then must go to the Amazon S3 console and add a CORS policy to the bucket that you want to access in your project.

   1. Navigate to the Amazon S3 console.

   1. Navigate to the bucket you want to grant access to.

   1. On the **Permissions** tab, under **Cross-origin resource sharing (CORS)**, choose **Edit**.

   1. Enter the new CORS policy, then choose **Save changes**.

      ```
      [
          {
              "AllowedHeaders": [
                  "*"
              ],
              "AllowedMethods": [
                  "PUT",
                  "GET",
                  "POST",
                  "DELETE",
                  "HEAD"
              ],
              "AllowedOrigins": [
                  "domainUrl" // example: https://dzd_abcdefg1234567.sagemaker.us-east-1.on.aws
              ],
              "ExposeHeaders": [
                  "x-amz-version-id"
              ]
          }
      ]
      ```

   1. Choose the name of an object to view its details. On the **Properties** tab, note the Amazon Resource Name (ARN) and the S3 URI; you will need these later.

1. The admin then must go to the IAM console and update the project role.

   1. Navigate to the IAM console.

   1. On the **Roles** page, search for the project role using the last string in the project role ARN, for example: `datazone_usr_role_1a2b3c45de6789_abcd1efghij2kl`.

   1. Select the project role to navigate to the project role details.

   1. Under the **Permissions** tab, choose **Add permissions**, then choose **Create inline policy**.

   1. Use the JSON editor to create a policy so that the project has access to an Amazon S3 location, using the Amazon S3 resource ARN that you noted in step 2.

      ```
      {
          "Version":"2012-10-17",		 	 	 
          "Statement": [
              {
                  "Sid": "S3AdditionalBucketPermissions",
                  "Effect": "Allow",
                  "Action": [
                      "s3:ListBucket",
                      "s3:GetBucketLocation"
                  ],
                  "Resource": [
                      "arn:aws:s3:::bucketName"
                  ]
              },
              {
                  "Sid": "S3AdditionalObjectPermissions",
                  "Effect": "Allow",
                  "Action": [
                      "s3:GetObject*",
                      "s3:PutObject"
                  ],
                  "Resource": [
                      "arn:aws:s3:::bucketName/key/*"
                  ]
              }
          ]
      }
      ```


   1. Choose **Next**.

   1. Enter a name for the policy, then choose **Create policy**.

1. Under the **Permissions** tab, choose **Add permissions**, then choose **Create inline policy**.

1. Use the JSON editor to create a policy so that the project has access to an Amazon S3 location, using the Amazon S3 resource ARN that you noted previously.

   ```
   {
       "Sid": "S3AGLocationManagement",
       "Effect": "Allow",
       "Action": [
           "s3:CreateAccessGrantsLocation",
           "s3:DeleteAccessGrantsLocation",
           "s3:GetAccessGrantsLocation"
       ],
       "Resource": [
           "arn:aws:s3:*:*:access-grants/default/*"
       ],
       "Condition": {
           "StringEquals": {
               "s3:accessGrantsLocationScope": "s3://bucket/folder/"
           }
       }
   },
   {
       "Sid": "S3AGPermissionManagement",
       "Effect": "Allow",
       "Action": [
           "s3:CreateAccessGrant",
           "s3:DeleteeAccessGrant"
       ],
       "Resource": [
           "arn:aws:s3:*:*:access-grants/default/location/*",
           "arn:aws:s3:*:*:access-grants/default/grant/*"
       ],
       "Condition": {
           "StringLike": {
               "s3:accessGrantScope": "s3://bucket/folder/*"
           }
       }
   }
   ```
**Note**  
Amazon SageMaker Unified Studio grants access to subscribed assets using S3 Access Grants. To grant access to data using S3 Access Grants, an S3 Access Grants instance is required. Amazon SageMaker Unified Studio uses an existing instance if one is available, or creates one. S3 Access Grants needs one instance per AWS Region in a single AWS account. For more information, see [Working with S3 Access Grants instances](https://docs.aws.amazon.com/AmazonS3/latest/userguide/access-grants-instance.html).

1. Choose **Next**.

1. Enter a name for the policy, then choose **Create policy**.
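The two statements in the policy above must reference the same bucket and prefix in both scope conditions: the location scope is an exact match, while the grant scope carries a trailing wildcard. A short sketch (the bucket and prefix below are placeholders) that generates the fragment from one pair of values keeps the two scopes from drifting apart:

```python
import json

bucket = "amzn-s3-demo-bucket"  # placeholder bucket name
prefix = "folder/"              # placeholder prefix; ends with "/"

statements = [
    {
        "Sid": "S3AGLocationManagement",
        "Effect": "Allow",
        "Action": [
            "s3:CreateAccessGrantsLocation",
            "s3:DeleteAccessGrantsLocation",
            "s3:GetAccessGrantsLocation",
        ],
        "Resource": ["arn:aws:s3:*:*:access-grants/default/*"],
        # Exact-match scope: location management is limited to this prefix.
        "Condition": {
            "StringEquals": {
                "s3:accessGrantsLocationScope": f"s3://{bucket}/{prefix}"
            }
        },
    },
    {
        "Sid": "S3AGPermissionManagement",
        "Effect": "Allow",
        "Action": ["s3:CreateAccessGrant", "s3:DeleteAccessGrant"],
        "Resource": [
            "arn:aws:s3:*:*:access-grants/default/location/*",
            "arn:aws:s3:*:*:access-grants/default/grant/*",
        ],
        # Wildcard scope: individual grants may target sub-prefixes.
        "Condition": {
            "StringLike": {"s3:accessGrantScope": f"s3://{bucket}/{prefix}*"}
        },
    },
]

print(json.dumps(statements, indent=4))
```

The printed statements can then be pasted into the inline policy's `Statement` array in the JSON editor.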

## Add the data to your project


When your admin has granted your project access to the Amazon S3 resources, you can add them to your project.

1. Navigate to Amazon SageMaker Unified Studio using the URL from your admin and log in using your SSO or AWS credentials. 

1. Navigate to the project that you want to add Amazon S3 data to.

1. On the **Data** page, choose the plus icon (**+**).

1. Select **Add S3 location**, then choose **Next**.

1. Enter a name for the location path.

1. (Optional) Add a description of the location path.

1. Use the S3 URI and Region provided by your admin.

1. If your admin has granted you access using an access role instead of the project role, enter the access role ARN from your admin. 

1. Choose **Add S3 location**.

The Amazon S3 data is then accessible within your project in the left navigation on the **Data** page.

# Sharing Amazon S3 data


Sharing data with other users in Amazon SageMaker Unified Studio means that you and other users can access the same data in multiple projects. There are two ways to share data with other users in Amazon SageMaker Unified Studio:
+ Publish Amazon S3 data to the catalog. This means that other projects can create subscription requests to request access to the data you publish. When you approve a subscription request, the other project will then have access to that data. 
+ Share Amazon S3 data directly with consumers. This means that the data you share is available to the projects you specify right away, without needing a subscription process.

In both cases, you can track and manage access to your data on the **Project catalog** page of your project in Amazon SageMaker Unified Studio. You can choose whether to grant read-only or read and write access.

Amazon SageMaker Unified Studio grants access to subscribed assets using Amazon S3 Access Grants. When a subscription is revoked, a project member may still get Amazon S3 Access Grants credentials for up to 5 minutes, and credentials can be used for 15 minutes. As a result, a user may have access to the data for up to 20 minutes after the access is revoked in SageMaker Unified Studio.
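The 20-minute figure above is simply the sum of the two intervals: credentials issued at the last possible moment remain valid for their full lifetime. A trivial sketch of the worst case:

```python
# After a subscription is revoked:
# credentials can still be issued for up to 5 minutes,
# and each credential remains usable for 15 minutes.
issuance_window_minutes = 5
credential_lifetime_minutes = 15

# Worst case: a credential obtained at the end of the issuance window
# is still valid for its full lifetime.
max_residual_access_minutes = issuance_window_minutes + credential_lifetime_minutes
print(max_residual_access_minutes)  # 20
```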

## Publish Amazon S3 data to the catalog


When you publish data to the Amazon SageMaker Catalog, other projects in your Amazon SageMaker Unified Studio domain can create subscription requests to request access to the data you published. When you approve a subscription request, the other project will then have access to that data.

1. Navigate to Amazon SageMaker Unified Studio using the URL from your admin and log in using your SSO or AWS credentials. 

1. Navigate to the project that contains your Amazon S3 connection.

1. On the **Data** page, in the side navigation, choose **S3** to explore your S3 data assets.

1. Choose the name of the S3 folder or bucket you want to publish.

1. Choose **Actions**, then choose **Publish to Catalog**. A confirmation window appears.

1. Choose **Publish** to confirm that you want the S3 data to be discoverable in the Amazon SageMaker Catalog. Members of other projects in the domain can then create subscription requests for the data asset. When you review a subscription request, you have the option to grant read-only or read and write access to the data. If you approve the request, the subscribing project gains access to the S3 data asset.

The S3 data folder or bucket you published then appears in the Amazon SageMaker Catalog as a data asset of type **S3 Object Collection**.

## Share Amazon S3 data directly with consumers


Sharing data directly makes the data you share available to the projects you specify right away, without a subscription process. You first publish your dataset, and then share it.

1. Navigate to Amazon SageMaker Unified Studio using the URL from your admin and log in using your SSO or AWS credentials. 

1. Navigate to the project that contains your Amazon S3 connection.

1. On the **Data** page, in the side navigation, explore your S3 data assets.

1. Choose the name of the S3 folder or bucket you want to publish.

1. Choose **Actions**, then choose **Publish to Catalog**. A confirmation window appears.

1. Choose **Publish**.

1. Navigate to the **Assets** page from the left navigation and select the asset you want to share.

1. Choose **Actions**, then choose **Share**.

1. Use the dropdown to select projects that you want to share the S3 data with, then choose **Next**.

1. Select the access type. For read-only access, move to the next step. To grant read and write access to users in the other project, select **Read and write access**.

1. Choose **Share**.

The S3 data asset is then shown in the project catalog under approved subscription requests. You can revoke access at any time. For more information about subscription requests, see [Data discovery, subscription, and consumption](discover-data.md).

# The lakehouse architecture of Amazon SageMaker

The lakehouse architecture of Amazon SageMaker is a unified data architecture built on AWS cloud-native infrastructure that bridges Amazon S3 data lakes and Amazon Redshift data warehouses into a cohesive analytics platform. The architecture uses the Apache Iceberg table format for cross-service interoperability and implements a shared metadata catalog that provides consistent data access patterns across storage systems.

For more information about the lakehouse architecture of Amazon SageMaker, see [What is the lakehouse architecture of Amazon SageMaker?](https://docs.aws.amazon.com/sagemaker-lakehouse-architecture/latest/userguide/what-is-smlh.html).

# Amazon SageMaker Catalog

You can use the Amazon SageMaker Catalog to catalog data across your organization with business context and thus enable everyone in your organization to find and understand data quickly.

In order to use Amazon SageMaker Unified Studio to catalog your data, you must first bring in your data (assets) as the inventory of your project in Amazon SageMaker Unified Studio. Creating inventory for a project makes the assets discoverable only to that project's members. Project inventory assets are not visible to all domain users in search or browse results unless explicitly published.

After creating a project inventory, data owners can curate their inventory assets with the required business metadata by adding or updating business names (asset and schema), descriptions (asset and schema), READMEs, glossary terms (asset and schema), and metadata forms.

The next step in using Amazon SageMaker Unified Studio to catalog your data is to make your project's inventory assets discoverable by domain users. You do this by publishing the inventory assets to the Amazon SageMaker Unified Studio catalog. Only the latest version of an inventory asset can be published to the catalog, and only the latest published version is active in the discovery catalog. If an inventory asset is updated after it's been published to the Amazon SageMaker Unified Studio catalog, you must explicitly publish it again for the latest version to appear in the discovery catalog.

For more information, see [Amazon SageMaker Unified Studio terminology and concepts](concepts.md).

**Topics**
+ [Data governance and metadata](data-governance.md)
+ [Data products](data-products.md)
+ [Data inventory and publishing](data-publishing.md)
+ [Data discovery, subscription, and consumption](discover-data.md)
+ [Fine-grained access control to data](fine-grained-access-control.md)
+ [Exporting asset metadata](export-asset-metadata.md)

# Data governance and metadata

Data governance in Amazon SageMaker Unified Studio encompasses the management of business glossaries, metadata forms, and asset classification to ensure consistent data understanding across your organization. Both business glossaries and metadata forms work together to provide comprehensive business context for your data assets, making it easier for users across your organization to discover, understand, and effectively use data in Amazon SageMaker Unified Studio. 

A business glossary is a centralized collection of business terms and their definitions that ensures consistent vocabulary when analyzing data. Metadata forms are customizable forms that allow data owners to augment asset metadata with additional business context beyond the technical metadata that the service collects automatically.

Glossaries and metadata forms are owned by the project that creates them. Only members of the owning project can edit or delete a glossary, its terms, or a metadata form. However, glossaries, their terms, and metadata forms are visible to all users in the domain. Users in any project within the domain can search for and view glossaries and terms, and can attach glossary terms and metadata forms to their assets to curate metadata. This allows a governance team to create and manage glossary terms and metadata forms in a dedicated project while users across the domain use them to describe and categorize their data assets.

**Topics**
+ [Create a business glossary in Amazon SageMaker Unified Studio](create-maintain-business-glossary.md)
+ [Edit a business glossary in Amazon SageMaker Unified Studio](edit-business-glossary.md)
+ [Delete a business glossary in Amazon SageMaker Unified Studio](delete-business-glossary.md)
+ [Create a term in a glossary in Amazon SageMaker Unified Studio](create-maintain-term.md)
+ [Edit a term in a glossary in Amazon SageMaker Unified Studio](edit-term.md)
+ [Delete a term in a glossary in Amazon SageMaker Unified Studio](delete-term.md)
+ [Create a metadata form in Amazon SageMaker Unified Studio](create-metadata-form.md)
+ [Edit a metadata form in Amazon SageMaker Unified Studio](edit-metadata-form.md)
+ [Delete a metadata form in Amazon SageMaker Unified Studio](delete-metadata-form.md)
+ [Create a field in a metadata form in Amazon SageMaker Unified Studio](create-field-in-metadata-form.md)
+ [Edit a field in a metadata form in Amazon SageMaker Unified Studio](edit-field-in-metadata-form.md)
+ [Delete a field in a metadata form in Amazon SageMaker Unified Studio](delete-field-in-metadata-form.md)
+ [Restricted asset classification in Amazon SageMaker Unified Studio](restricted-asset-classification.md)

# Create a business glossary in Amazon SageMaker Unified Studio

In Amazon SageMaker Unified Studio, a business glossary is a collection of business terms (words) that may be associated with assets (data). It provides appropriate vocabularies with a list of business terms and their definitions for business users to ensure the same definitions are used across the organization when analyzing data. Business glossaries are created in the catalog domain and can be applied to assets and columns to help understand key characteristics of that asset or column. One or more glossary terms can be applied. A business glossary can be a flat list of terms where any term in the business glossary can be associated with a sublist of other terms. For more information, see [Amazon SageMaker Unified Studio terminology and concepts](concepts.md). To create, edit, or delete a glossary in your Amazon SageMaker Unified Studio domain, you must be a member of the owning project with the right permissions for that domain.

To create a glossary, complete the following steps:

1. Navigate to Amazon SageMaker Unified Studio using the URL from your admin and log in using your SSO or AWS credentials. 

1. Navigate to the **Discover** menu in the top navigation bar.

1. Choose **Glossaries**, and then choose **Create glossary**.

1. Specify a name, description, and owning project for the glossary and then choose **Create glossary**. 

1. (Optional) If you want to create a [restricted glossary](restricted-asset-classification.md), choose **Restrict this glossary for governed term use**, and then specify the usage permission by selecting one of the following options:
   + All projects - gives permissions to all projects in this domain
   + (Default) Owning project - gives permissions only to the owning project
   + Selected projects or domain units - gives permissions to specific projects and/or domain units

1. Enable the new glossary by choosing the **Enabled** toggle.
**Note**  
After you create and enable a glossary, it is visible to all users in the domain. Users in other projects can search for, view, and attach the glossary's terms to their assets. Only members of the project that owns the glossary can edit or delete it.

To disable or enable a business glossary, complete the following steps:

1. Navigate to Amazon SageMaker Unified Studio using the URL from your admin and log in using your SSO or AWS credentials. 

1. Navigate to the **Discover** menu in the top navigation bar.

1. Choose **Glossaries**.

1. Select the business glossary that you want to disable or enable. 

1. On the glossary details page, locate the **Enabled** toggle and use it to enable or disable your selected glossary.
**Note**  
Disabling a glossary also disables all the terms that it contains.

# Edit a business glossary in Amazon SageMaker Unified Studio

In Amazon SageMaker Unified Studio, a business glossary is a collection of business terms (words) that may be associated with assets (data). For more information, see [Amazon SageMaker Unified Studio terminology and concepts](concepts.md). To edit a glossary in your Amazon SageMaker Unified Studio domain, you must be a member of the owning project with the right permissions for that domain.

To edit a business glossary, complete the following steps:

1. Navigate to Amazon SageMaker Unified Studio using the URL from your admin and log in using your SSO or AWS credentials. 

1. Navigate to the **Discover** menu in the top navigation bar.

1. Choose **Glossaries**.

1. Select the business glossary that you want to edit. 

1. On the glossary details page, expand **Actions** and then choose **Edit business glossary** to edit the glossary. 

1. Make your updates to the name and description, and then choose **Update glossary**.

# Delete a business glossary in Amazon SageMaker Unified Studio
Delete a business glossary

In Amazon SageMaker Unified Studio, a business glossary is a collection of business terms (words) that may be associated with assets (data). It provides business users with a vocabulary of business terms and their definitions, ensuring that the same definitions are used across the organization when analyzing data. Business glossaries are created in the catalog domain and can be applied to assets and columns to help users understand key characteristics of that asset or column. One or more glossary terms can be applied. A business glossary can be a flat list of terms in which any term can be associated with a sublist of other terms. For more information, see [Amazon SageMaker Unified Studio terminology and concepts](concepts.md). To delete a glossary in your Amazon SageMaker Unified Studio domain, you must be a member of the owning project with the right permissions for that domain.

To delete a business glossary, complete the following steps:

1. Navigate to Amazon SageMaker Unified Studio using the URL from your admin and log in using your SSO or AWS credentials. 

1. Navigate to the **Discover** menu in the top navigation bar.

1. Choose **Glossaries**.

1. Select the business glossary that you want to delete. 

1. On the glossary details page, make sure the **Enabled** toggle is off, so that the glossary is disabled.

1. Expand **Actions** and then choose **Delete** to delete the glossary. 
**Note**  
You must delete all existing terms in the glossary before you can delete the glossary.

1. Confirm the deletion of the glossary by choosing **Delete**.

# Create a term in a glossary in Amazon SageMaker Unified Studio
Create a term in a glossary

In Amazon SageMaker Unified Studio, a business glossary is a collection of business terms that may be associated with assets (data). For more information, see [Amazon SageMaker Unified Studio terminology and concepts](concepts.md). To create, edit, or delete terms in a glossary in your Amazon SageMaker Unified Studio domain, you must be a member of the owning project with the right permissions for that domain.

In Amazon SageMaker Unified Studio, business glossary terms can have detailed descriptions. To set the context of a particular term, you can specify relationships among terms. When you define a relationship for a term, it is automatically added to the definition of the related term. The glossary term relationships available in Amazon SageMaker Unified Studio include the following:
+ **Is a Type of** - indicates that the current term is a type of the identified term; that is, the identified term is a parent of the current term.
+ **Has Types** - indicates that the current term is a generic term for the identified specific term or terms. This relationship can denote child terms for the generic term.

To create a new term, complete the following steps:

1. Navigate to Amazon SageMaker Unified Studio using the URL from your admin and log in using your SSO or AWS credentials. 

1. Navigate to the **Discover** menu in the top navigation bar.

1. Choose **Glossaries**.

1. Select the glossary where you want to create the new term. 

1. Choose **Create term**.

1. Specify a name and description for the term and then choose **Create term**. 

1. Enable the new term by choosing the **Enabled** toggle.

1. To add a **Readme**, select the name of the term to navigate to the term details page. Then choose **Create readme** to add additional information about this term.

1. To add relationships, complete the following steps:

   1. Select the name of the term to navigate to the term details page.

   1. If this is the first relationship added to the term, under **Terms relationships**, choose **Add terms**. If there are other terms relationships listed, under **Term Relationships**, choose **Edit**, and then choose **Add terms**.

   1. In the dialog, choose the relationship and the terms you want to relate.

   1. Choose **Add terms** to add the selected terms to the appropriate relationship type. This relationship is also added to all of the related terms.
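The same term-and-relationship creation can be scripted against the DataZone `CreateGlossaryTerm` operation. In this sketch, the `termRelations` keys `isA` and `classifies` are assumed to map to the **Is a Type of** and **Has Types** relationships respectively, and all IDs are placeholders.

```python
def build_create_term_request(domain_id, glossary_id, name, description,
                              is_a_type_of=(), has_types=()):
    """Build the parameters for a DataZone CreateGlossaryTerm call, with relationships."""
    params = {
        "domainIdentifier": domain_id,
        "glossaryIdentifier": glossary_id,
        "name": name,
        "shortDescription": description,
        "status": "ENABLED",
    }
    relations = {}
    if is_a_type_of:                       # "Is a Type of": parent terms
        relations["isA"] = list(is_a_type_of)
    if has_types:                          # "Has Types": child terms
        relations["classifies"] = list(has_types)
    if relations:
        params["termRelations"] = relations
    return params

# Placeholder IDs; "revenue" is declared a type of an existing "metric" term.
params = build_create_term_request("dzd_example1234", "glossary-abc123",
                                   "revenue", "Total sales income",
                                   is_a_type_of=["term-metric-001"])
print(params["termRelations"])
```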

# Edit a term in a glossary in Amazon SageMaker Unified Studio
Edit a term in a glossary

In Amazon SageMaker Unified Studio, a business glossary is a collection of business terms that may be associated with assets (data). For more information, see [Amazon SageMaker Unified Studio terminology and concepts](concepts.md). To create, edit, or delete terms in a glossary in your Amazon SageMaker Unified Studio domain, you must be a member of the owning project with the right permissions for that domain.

In Amazon SageMaker Unified Studio, business glossary terms can have detailed descriptions. To set the context of a particular term, you can specify relationships among terms. When you define a relationship for a term, it is automatically added to the definition of the related term. The glossary term relationships available in Amazon SageMaker Unified Studio include the following:
+ **Is a Type of** - indicates that the current term is a type of the identified term; that is, the identified term is a parent of the current term.
+ **Has Types** - indicates that the current term is a generic term for the identified specific term or terms. This relationship can denote child terms for the generic term.

To edit a term in a glossary, complete the following steps:

1. Navigate to Amazon SageMaker Unified Studio using the URL from your admin and log in using your SSO or AWS credentials. 

1. Navigate to the **Discover** menu in the top navigation bar.

1. Choose **Glossaries**.

1. Select the glossary that contains the term that you want to edit. 

1. Choose the name of the term to navigate to the term details page.

1. On the term details page, expand **Actions** and then choose **Edit** to edit the term. 

1. Make your updates to the name and description, and then choose **Update term**.

# Delete a term in a glossary in Amazon SageMaker Unified Studio
Delete a term in a glossary

In Amazon SageMaker Unified Studio, a business glossary is a collection of business terms that may be associated with assets (data). For more information, see [Amazon SageMaker Unified Studio terminology and concepts](concepts.md). To create, edit, or delete terms in a glossary in your Amazon SageMaker Unified Studio domain, you must be a member of the owning project with the right permissions for that domain.

In Amazon SageMaker Unified Studio, business glossary terms can have detailed descriptions. To set the context of a particular term, you can specify relationships among terms. When you define a relationship for a term, it is automatically added to the definition of the related term. The glossary term relationships available in Amazon SageMaker Unified Studio include the following:
+ **Is a Type of** - indicates that the current term is a type of the identified term; that is, the identified term is a parent of the current term.
+ **Has Types** - indicates that the current term is a generic term for the identified specific term or terms. This relationship can denote child terms for the generic term.

To delete a term in a glossary, complete the following steps:

1. Navigate to Amazon SageMaker Unified Studio using the URL from your admin and log in using your SSO or AWS credentials. 

1. Navigate to the **Discover** menu in the top navigation bar.

1. Choose **Glossaries**.

1. Select the glossary that contains the term that you want to delete. 

1. Select the name of the term to navigate to the term details page. 

1. Make sure that the **Enabled** toggle is off so that the term is disabled.

1. On the glossary term details page, expand **Actions** and then choose **Delete**. 

1. Confirm the deletion of the term by choosing **Delete**.
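The disable-then-delete sequence above maps onto two DataZone calls, `UpdateGlossaryTerm` followed by `DeleteGlossaryTerm`. This sketch only builds the two request payloads; the parameter names are drawn from the DataZone API reference, and the IDs are placeholders.

```python
def build_term_deletion_calls(domain_id: str, term_id: str) -> tuple:
    """Return the two payloads used to remove a term: disable it, then delete it."""
    disable = {
        "domainIdentifier": domain_id,
        "identifier": term_id,
        "status": "DISABLED",   # a term must be disabled before it can be deleted
    }
    delete = {"domainIdentifier": domain_id, "identifier": term_id}
    return disable, delete

# Placeholder IDs; with credentials, the calls would be
# update_glossary_term(**disable) followed by delete_glossary_term(**delete).
disable, delete = build_term_deletion_calls("dzd_example1234", "term-abc123")
print(disable["status"])  # DISABLED
```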

# Create a metadata form in Amazon SageMaker Unified Studio
Create a metadata form

In Amazon SageMaker Unified Studio, metadata forms are simple forms that add business context to the asset metadata in the catalog. They serve as extensible mechanisms for data owners to enrich assets with information that helps data users when they search for and find that data. Metadata forms can also serve as a mechanism to enforce consistency across all assets being published to the Amazon SageMaker Unified Studio catalog. 

A metadata form definition is composed of one or more field definitions, with support for boolean, date, decimal, integer, string, and business glossary field value data types. For more information, see [Amazon SageMaker Unified Studio terminology and concepts](concepts.md). To create, edit, or delete metadata forms in your Amazon SageMaker Unified Studio domain, you must be a member of the owning project with the right permissions for that domain. 

To create a metadata form, complete the following steps:

1. Navigate to Amazon SageMaker Unified Studio using the URL from your admin and log in using your SSO or AWS credentials. 

1. Navigate to the **Discover** menu in the top navigation bar.

1. Choose **Metadata forms**.

1. Choose **Create metadata form**.

1. Specify the metadata form technical name, owning project, and optional display name and description.

1. Choose **Create metadata form**.

**Note**  
After you create a metadata form, it is visible to all users in the domain. Users in other projects can attach the metadata form to their assets to add business context. Only members of the project that owns the metadata form can edit or delete it.
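Metadata forms can also be created through the DataZone `CreateFormType` operation, whose `model` parameter carries the field definitions as a Smithy structure. The form name, field names, and IDs below are illustrative assumptions, and the sketch only builds the request payload.

```python
# A minimal Smithy model for a hypothetical form; field names are illustrative.
SMITHY_MODEL = """\
structure SalesMetadataForm {
    @required
    sourceSystem: String
    recordCount: Integer
}
"""

def build_create_form_type_request(domain_id, project_id, name, smithy_model):
    """Build the parameters for a DataZone CreateFormType call."""
    return {
        "domainIdentifier": domain_id,
        "owningProjectIdentifier": project_id,
        "name": name,
        "model": {"smithy": smithy_model},  # field definitions live in the Smithy model
        "status": "ENABLED",
    }

params = build_create_form_type_request("dzd_example1234", "project-abc123",
                                        "SalesMetadataForm", SMITHY_MODEL)
print(params["model"]["smithy"].splitlines()[0])  # structure SalesMetadataForm {
```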

# Edit a metadata form in Amazon SageMaker Unified Studio
Edit a metadata form

In Amazon SageMaker Unified Studio, metadata forms are simple forms that add business context to the asset metadata in the catalog. They serve as extensible mechanisms for data owners to enrich assets with information that helps data users when they search for and find that data. Metadata forms can also serve as a mechanism to enforce consistency across all assets being published to the Amazon SageMaker Unified Studio catalog. 

A metadata form definition is composed of one or more field definitions, with support for boolean, date, decimal, integer, string, and business glossary field value data types. For more information, see [Amazon SageMaker Unified Studio terminology and concepts](concepts.md). To create, edit, or delete metadata forms in your Amazon SageMaker Unified Studio domain, you must be a member of the owning project with the right permissions for that domain. 

To edit a metadata form, complete the following steps:

1. Navigate to Amazon SageMaker Unified Studio using the URL from your admin and log in using your SSO or AWS credentials. 

1. Navigate to the **Discover** menu in the top navigation bar.

1. Choose **Metadata forms**.

1. Choose the name of the metadata form that you want to edit. This takes you to the metadata form details page.

1. On the metadata form details page, expand **Actions**, and then choose **Edit metadata form**.

1. Perform your updates to the name, description, and owning project.

1. Choose **Update form**.

# Delete a metadata form in Amazon SageMaker Unified Studio
Delete a metadata form

In Amazon SageMaker Unified Studio, metadata forms are simple forms that add business context to the asset metadata in the catalog. They serve as extensible mechanisms for data owners to enrich assets with information that helps data users when they search for and find that data. Metadata forms can also serve as a mechanism to enforce consistency across all assets being published to the Amazon SageMaker Unified Studio catalog. 

A metadata form definition is composed of one or more field definitions, with support for boolean, date, decimal, integer, string, and business glossary field value data types. For more information, see [Amazon SageMaker Unified Studio terminology and concepts](concepts.md). To create, edit, or delete metadata forms in your Amazon SageMaker Unified Studio domain, you must be a member of the owning project with the right permissions for that domain. 

To delete a metadata form, complete the following steps:

**Note**  
Before you can delete a metadata form, you must remove it from all asset types or assets to which it is applied. 

1. Navigate to Amazon SageMaker Unified Studio using the URL from your admin and log in using your SSO or AWS credentials. 

1. Navigate to the **Discover** menu in the top navigation bar.

1. Choose **Metadata forms**.

1. Choose the name of the metadata form that you want to delete. This takes you to the metadata form details page.

1. If the metadata form that you want to delete is enabled, disable the metadata form by choosing the **Enabled** toggle.

1. On the metadata form's details page, expand **Actions**, and then choose **Delete**.

1. Confirm deletion by choosing **Delete**. 

# Create a field in a metadata form in Amazon SageMaker Unified Studio
Create a field in a metadata form

In Amazon SageMaker Unified Studio, metadata forms are simple forms that add business context to the asset metadata in the catalog. They serve as extensible mechanisms for data owners to enrich assets with information that helps data users when they search for and find that data. Metadata forms can also serve as a mechanism to enforce consistency across all assets being published to the Amazon SageMaker Unified Studio catalog. 

A metadata form definition is composed of one or more field definitions, with support for boolean, date, decimal, integer, string, and business glossary field value data types. For more information, see [Amazon SageMaker Unified Studio terminology and concepts](concepts.md). To create, edit, or delete fields in metadata forms in your Amazon SageMaker Unified Studio domain, you must be a member of the owning project with the right permissions for that domain. 

To create a field in a metadata form, complete the following steps:

1. Navigate to Amazon SageMaker Unified Studio using the URL from your admin and log in using your SSO or AWS credentials. 

1. Navigate to the **Discover** menu in the top navigation bar.

1. Choose **Metadata forms**.

1. Choose the name of the metadata form that you want to add a field to. This takes you to the metadata form details page.

1. On the metadata form details page, choose **Create field**.

1. Specify the field name, description, type, and whether this is a required field. Depending on the field type, you might be able to configure additional selections.

1. Choose **Create field**.
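In the underlying API, fields are members of the form type's Smithy model, so adding a field amounts to submitting a revised model. The helper below is a hypothetical convenience for editing such a model string locally before resubmitting it; the structure and field names are illustrative.

```python
def add_field_to_model(smithy_model: str, field_name: str, field_type: str,
                       required: bool = False) -> str:
    """Insert a new member before the closing brace of a single-structure Smithy model."""
    line = ("    @required\n" if required else "") + f"    {field_name}: {field_type}\n"
    closing = smithy_model.rstrip().rfind("}")   # index of the final closing brace
    return smithy_model[:closing] + line + smithy_model[closing:]

# Add a required "region" field to a hypothetical form model.
model = "structure SalesMetadataForm {\n    sourceSystem: String\n}\n"
print(add_field_to_model(model, "region", "String", required=True))
```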

# Edit a field in a metadata form in Amazon SageMaker Unified Studio
Edit a field in a metadata form

In Amazon SageMaker Unified Studio, metadata forms are simple forms that add business context to the asset metadata in the catalog. They serve as extensible mechanisms for data owners to enrich assets with information that helps data users when they search for and find that data. Metadata forms can also serve as a mechanism to enforce consistency across all assets being published to the Amazon SageMaker Unified Studio catalog. 

A metadata form definition is composed of one or more field definitions, with support for boolean, date, decimal, integer, string, and business glossary field value data types. For more information, see [Amazon SageMaker Unified Studio terminology and concepts](concepts.md). To create, edit, or delete fields in metadata forms in your Amazon SageMaker Unified Studio domain, you must be a member of the owning project with the right permissions for that domain. 

To edit a field in a metadata form, complete the following steps:

1. Navigate to Amazon SageMaker Unified Studio using the URL from your admin and log in using your SSO or AWS credentials. 

1. Navigate to the **Discover** menu in the top navigation bar.

1. Choose **Metadata forms**.

1. Choose the name of the metadata form that contains the field that you want to edit. This takes you to the metadata form details page.

1. On the metadata form details page, choose the field that you want to edit.

1. Expand the **Actions** menu, and then choose **Edit field**.

1. Make your updates to the field name, description, type, and whether it is a required field. Make updates to other selections if more are available with the selected field type.

1. Choose **Save**.

# Delete a field in a metadata form in Amazon SageMaker Unified Studio
Delete a field in a metadata form

In Amazon SageMaker Unified Studio, metadata forms are simple forms that add business context to the asset metadata in the catalog. They serve as extensible mechanisms for data owners to enrich assets with information that helps data users when they search for and find that data. Metadata forms can also serve as a mechanism to enforce consistency across all assets being published to the Amazon SageMaker Unified Studio catalog. 

A metadata form definition is composed of one or more field definitions, with support for boolean, date, decimal, integer, string, and business glossary field value data types. For more information, see [Amazon SageMaker Unified Studio terminology and concepts](concepts.md). To create, edit, or delete fields in metadata forms in your Amazon SageMaker Unified Studio domain, you must be a member of the owning project with the right permissions for that domain. 

To delete a field in a metadata form, complete the following steps:

1. Navigate to Amazon SageMaker Unified Studio using the URL from your admin and log in using your SSO or AWS credentials. 

1. Navigate to the **Discover** menu in the top navigation bar.

1. Choose **Metadata forms**.

1. Choose the name of the metadata form that contains the field that you want to delete. This takes you to the metadata form details page.

1. On the metadata form details page, choose the field that you want to delete.

1. Expand the **Actions** menu, and choose **Delete**.

1. Confirm deletion by choosing **Delete**.

# Restricted asset classification in Amazon SageMaker Unified Studio
Restricted asset classification

Restricted classification allows domain unit owners and glossary project owners to control who can apply specific classification terms to assets in the Amazon SageMaker Catalog. This feature helps maintain classification consistency and governance standards across your domain while enabling controlled workflows based on governed classification terms.

Restricted classification provides the following benefits:
+ Governance control - maintain consistent classification standards across your entire domain
+ Access management - control which users can apply sensitive or restricted classification terms to the assets
+ Workflow enablement - build automated workflows based on governed classification terms
+ Clear separation - distinguish between open and restricted classification terms

The Amazon SageMaker Catalog separates classification terms into two categories:
+ Unrestricted terms - available for all users to apply to their assets
+ Restricted terms - only authorized users can apply these to assets they own

This functionality uses the following authorization model:
+ Restricted glossaries can be created and managed by glossary project owners and contributors. Project owners have the ability to grant usage permissions to specific domain units as well as to other project owners and contributors. If a restricted glossary is created by a contributor, only the project owner is granted permission to use it by default.
+ Project owners are by default granted access to use the glossary for the assets in their projects.
+ Authorized users are granted permission to use restricted terms in their projects through project-specific grants.
+ All users can filter and discover assets using restricted classification terms.

When using restricted glossaries in Amazon SageMaker Unified Studio, you must abide by the following constraints:
+ Scope of application - restricted glossary terms are currently supported only at the asset level. Column-level terms, metadata form-level terms, and data product-level terms are not currently supported. 
+ Term relationships - restricted glossary terms cannot be related to other terms.
+ Glossary usage permission conversion - after creation, a restricted glossary cannot be converted into a regular glossary, and a regular glossary cannot be converted into a restricted glossary.

## Creating restricted classification terms


As a project owner or contributor:
+ Navigate to the catalog governance section
+ [Create a new restricted classification glossary](create-maintain-business-glossary.md)
+ Define terms within the glossary
+ Set usage policies for the restricted glossary

## Applying restricted terms to assets


Complete the following procedure based on the configured usage permission for the restricted glossary:

**Apply restricted terms to assets**

1. Navigate to Amazon SageMaker Unified Studio using the URL from your admin and log in using your SSO or AWS credentials. 

1. Navigate to the **Discover** menu in the top navigation bar.

1. Choose **Data catalog**.

1. Find the asset to which you want to assign restricted terms, and on the asset's details page, choose **View inventory asset**. 

1. Under **Glossary terms**, choose **Add terms**, then search for the restricted term that you want to assign to this asset, choose it, and then choose **Add term**. 

   Once the term is successfully added, you can identify it as a restricted term by the presence of a lock icon next to its name.

You can associate or disassociate up to five restricted terms with an asset at any one time.

# Data products
Data products

In Amazon SageMaker Unified Studio, data producers can group data assets into well-defined, self-contained packages called *data products* that are tailored for specific business use cases. Using cohesive, business-aligned data products enhances both the publishing and the subscription processes. 

Data consumers can identify interconnected data assets by searching for and finding them as a single unit. This approach reduces the time and effort required to find all relevant information and lowers the risk of missing important data. Also, data products simplify access to data with a single request by implementing a unified access model. This eliminates the need for multiple permissions, thereby speeding up the initiation of data analysis. 

By cataloging assets as data products, data producers reduce administrative overhead by enabling metadata and access control management at the data product level, rather than individually. The ability to surface these purpose-built grouped assets for consumption makes access governance and data utilization more efficient, ensuring it aligns with business goals and is easily accessible for its intended use. Data governance teams can monitor consumption rates for these data products, providing valuable insights into data literacy maturity. For more information, see [Amazon SageMaker Unified Studio terminology and concepts](concepts.md).

**Topics**
+ [Create new data products in Amazon SageMaker Unified Studio](create-new-data-product.md)
+ [Publish data products in Amazon SageMaker Unified Studio](publish-data-product.md)
+ [Edit data products in Amazon SageMaker Unified Studio](edit-data-product.md)
+ [Unpublish data products in Amazon SageMaker Unified Studio](unpublish-data-product.md)
+ [Delete data products in Amazon SageMaker Unified Studio](delete-data-product.md)
+ [Subscribe to a data product in Amazon SageMaker Unified Studio](subscribe-data-product.md)
+ [Review a subscription request and grant a subscription to a data product in Amazon SageMaker Unified Studio](review-grant-subscription-to-data-product.md)
+ [Republish data products in Amazon SageMaker Unified Studio](republish-data-product.md)

# Create new data products in Amazon SageMaker Unified Studio
Create new data products

Amazon SageMaker Unified Studio enables data producers to group data assets into well-defined, self-contained packages called data products that are tailored for specific business use cases. For more information, see [Amazon SageMaker Unified Studio terminology and concepts](concepts.md). 

Any Amazon SageMaker Unified Studio user with the required permissions can create an Amazon SageMaker Unified Studio data product.

To create a new data product, complete the following steps:

1. Navigate to Amazon SageMaker Unified Studio using the URL from your admin and log in using your SSO or AWS credentials. 

1. Navigate to the project in which you'd like to create a data product. You can do this by using the center menu at the top of the page and choosing **Browse all projects**, then choosing the name of the project that you want to navigate to.

1. Under **Project catalog**, choose **Assets**.

1. Expand the **Create** menu and then choose **Create data product**. 

1. On the **Create new data product** page, complete the following steps.

   1. Specify the name and the description for the data product.

   1. Choose **Select assets** to add various assets to your data product.

   1. In the **Select assets** pop-up window, choose **Choose** next to the assets that you want to add to this data product.

   1. Choose **Choose** at the bottom of the pop-up window.

1. To complete creating the data product, choose **Create**.
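Data products can also be created through the DataZone `CreateDataProduct` operation, passing the grouped assets as `items`. This sketch only builds the request payload; the `itemType` value and all IDs are placeholders based on my reading of the DataZone API.

```python
def build_create_data_product_request(domain_id, project_id, name, description, asset_ids):
    """Build the parameters for a DataZone CreateDataProduct call."""
    return {
        "domainIdentifier": domain_id,
        "owningProjectIdentifier": project_id,
        "name": name,
        "description": description,
        # Each grouped asset becomes an item of the data product.
        "items": [{"identifier": a, "itemType": "ASSET"} for a in asset_ids],
    }

# Placeholder IDs for a hypothetical data product grouping two assets.
params = build_create_data_product_request(
    "dzd_example1234", "project-abc123", "sales-q1",
    "Q1 sales tables and reports", ["asset-001", "asset-002"])
print(len(params["items"]))  # 2
```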

# Publish data products in Amazon SageMaker Unified Studio
Publish data products

Amazon SageMaker Unified Studio enables data producers to group data assets into well-defined, self-contained packages called data products that are tailored for specific business use cases. For more information, see [Amazon SageMaker Unified Studio terminology and concepts](concepts.md). 

Any Amazon SageMaker Unified Studio user with the required permissions can publish an Amazon SageMaker Unified Studio data product.

To publish a data product, complete the following steps:

1. Navigate to Amazon SageMaker Unified Studio using the URL from your admin and log in using your SSO or AWS credentials. 

1. Navigate to the project that contains the data product that you want to publish. You can do this by using the center menu at the top of the page and choosing **Browse all projects**, then choosing the name of the project that you want to navigate to.

1. Under **Project catalog**, choose **Assets**.

1. Choose the **Inventory** tab, and then choose the **Data products** filter. This displays existing data products in the project inventory.

1. Choose the data product that you want to publish. This opens the data product details page.

1. Choose **Publish**. Confirm the publishing of this data product by choosing **Publish data product**. 
**Note**  
Any unpublished data assets that are in this data product will become published, but will only be available through this data product.
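In the DataZone API, publishing corresponds to creating a listing change set for the data product. The sketch below builds the request payload; the `entityType` and `action` values follow the `CreateListingChangeSet` operation as I understand it, and the IDs are placeholders.

```python
def build_publish_request(domain_id: str, data_product_id: str, publish: bool = True) -> dict:
    """Build the parameters for a DataZone CreateListingChangeSet call."""
    return {
        "domainIdentifier": domain_id,
        "entityIdentifier": data_product_id,
        "entityType": "DATA_PRODUCT",
        # The same operation with action="UNPUBLISH" removes the listing.
        "action": "PUBLISH" if publish else "UNPUBLISH",
    }

# Placeholder IDs for publishing a hypothetical data product.
params = build_publish_request("dzd_example1234", "dataproduct-abc123")
print(params["action"])  # PUBLISH
```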

# Edit data products in Amazon SageMaker Unified Studio
Edit data products

Amazon SageMaker Unified Studio enables data producers to group data assets into well-defined, self-contained packages called data products that are tailored for specific business use cases. For more information, see [Amazon SageMaker Unified Studio terminology and concepts](concepts.md). 

Any Amazon SageMaker Unified Studio user with the required permissions can edit an Amazon SageMaker Unified Studio data product.

To edit a data product, complete the following steps:

1. Navigate to Amazon SageMaker Unified Studio using the URL from your admin and log in using your SSO or AWS credentials. 

1. Navigate to the project that contains the data product that you want to edit. You can do this by using the center menu at the top of the page and choosing **Browse all projects**, then choosing the name of the project that you want to navigate to.

1. Under **Project catalog**, choose **Assets**.

1. Choose the **Inventory** tab, and then choose the **Data products** filter. This displays existing data products in the project inventory.

1. Choose the data product that you want to edit. As part of editing a data product, you can do the following: 
   + Choose **Create README** to add a README that helps users understand the data product.
   + Choose **Add terms** to add glossary terms. Make your selections of glossary terms in the **Add terms** window and then choose **Add terms**.
   + Choose **Add metadata form** and then select your form in the **Add metadata form** window and choose **Add**.
   + Expand **Actions**, choose **Edit**, make your edits to the name and description of the data product, and then choose **Update**.
   + On the **Assets** tab, remove one of the existing assets in the data product by choosing that asset, then expanding the three-dot action icon and choosing **Remove asset**. Confirm the asset removal by choosing **Remove** in the **Remove asset** pop-up window. When you republish, this asset will be removed from all subscribers to this data product.
   + On the **Assets** tab, add a new asset to the data product by choosing the **Add** button and then selecting one or more assets to be added to the data product. 

# Unpublish data products in Amazon SageMaker Unified Studio
Unpublish data products

Amazon SageMaker Unified Studio enables data producers to group data assets into well-defined, self-contained packages called data products that are tailored for specific business use cases. For more information, see [Amazon SageMaker Unified Studio terminology and concepts](concepts.md). 

Any Amazon SageMaker Unified Studio user with the required permissions can unpublish an Amazon SageMaker Unified Studio data product.

To unpublish a data product, complete the following steps:

1. Navigate to Amazon SageMaker Unified Studio using the URL from your admin and log in using your SSO or AWS credentials. 

1. Navigate to the project that contains the data product that you want to unpublish. You can do this by using the center menu at the top of the page and choosing **Browse all projects**, then choosing the name of the project that you want to navigate to.

1. Under **Project catalog**, choose **Assets**.

1. Choose the **Inventory** tab, and then choose the **Data products** filter. This displays existing data products in the project inventory.

1. Choose the data product that you want to unpublish. This opens the data product details page.

1. Expand **Actions** and choose **Unpublish**. Confirm the unpublishing of this data product by choosing **Unpublish**.
**Note**  
Unpublishing a data product has the following effects:
+ This data product will no longer be available to view or to subscribe to.
+ Any data assets that are only available through this data product will no longer be available.
+ All active subscriptions to this data product will remain.
+ Any individually published data assets will not be affected.

# Delete data products in Amazon SageMaker Unified Studio
Delete data products

Amazon SageMaker Unified Studio enables data producers to group data assets into well-defined, self-contained packages called data products that are tailored for specific business use cases. For more information, see [Amazon SageMaker Unified Studio terminology and concepts](concepts.md). 

Any Amazon SageMaker Unified Studio user with the required permissions can delete an Amazon SageMaker Unified Studio data product.

To delete a data product, complete the following steps:

1. Navigate to Amazon SageMaker Unified Studio using the URL from your admin and log in using your SSO or AWS credentials. 

1. Navigate to the project that contains the data product that you want to delete. You can do this by using the center menu at the top of the page and choosing **Browse all projects**, then choosing the name of the project that you want to navigate to.

1. Under **Project catalog**, choose **Assets**.

1. Choose the **Inventory** tab, and then choose the **Data products** filter. This displays existing data products in the project inventory.

1. Choose the data product that you want to delete.

1.  Expand **Actions** and choose **Delete**. Confirm the deletion of this data product by typing `delete` in the text field and then choosing **Delete**. 
**Note**  
Deleting a data product has the following effects:
+ The data product will no longer be available to publish, view, or subscribe to.
+ Any data assets that are only available through this data product will no longer be visible in the data catalog. They will not be deleted from your inventory assets.

# Subscribe to a data product in Amazon SageMaker Unified Studio
Subscribe to a data product

Amazon SageMaker Unified Studio enables data producers to group data assets into well-defined, self-contained packages called data products that are tailored for specific business use cases. For more information, see [Amazon SageMaker Unified Studio terminology and concepts](concepts.md). 

Any Amazon SageMaker Unified Studio user with the required permissions can subscribe to an Amazon SageMaker Unified Studio data product.

To subscribe to a data product, complete the following steps. 

1. Navigate to Amazon SageMaker Unified Studio using the URL from your admin and log in using your SSO or AWS credentials. 

1. Navigate to the **Discover** menu in the top navigation bar.

1. Choose **Data catalog**.

1. Choose **Browse data products**.

1. Find the data product to which you want to subscribe and then choose that data product.

1. On the data product details page, choose **Subscribe**.

1. Specify the project and the reason for requesting a subscription. Then choose **Request**.

When the owning project grants the subscription request, you will be subscribed to the data product.

# Review a subscription request and grant a subscription to a data product in Amazon SageMaker Unified Studio
Review a subscription request and grant a subscription to a data product

Amazon SageMaker Unified Studio enables data producers to group data assets into well-defined, self-contained packages called data products that are tailored for specific business use cases. For more information, see [Amazon SageMaker Unified Studio terminology and concepts](concepts.md). 

The owning project of the data product can review and grant the subscription to an Amazon SageMaker Unified Studio data product.

To review a subscription request and grant or reject a subscription to a data product, complete the following steps:

1. Navigate to Amazon SageMaker Unified Studio using the URL from your admin and log in using your SSO or AWS credentials. 

1. Navigate to the project that contains the data product that has a subscription request. You can do this by using the center menu at the top of the page and choosing **Browse all projects**, then choosing the name of the project that you want to navigate to.

1. Under **Project catalog**, choose **Assets**.

1. Choose the **Inventory** tab, and then choose the **Data products** filter. This displays existing data products in the project inventory.

1. Choose the data product that has a subscription request.

1. Choose the **Subscription requests** tab.

1. Locate the request that you want to review and then choose **View request**.

1. In the **Subscription request** window, type in a decision comment. Then choose either **Approve** or **Reject**.

# Republish data products in Amazon SageMaker Unified Studio
Republish data products

Amazon SageMaker Unified Studio enables data producers to group data assets into well-defined, self-contained packages called data products that are tailored for specific business use cases. For more information, see [Amazon SageMaker Unified Studio terminology and concepts](concepts.md). 

Any Amazon SageMaker Unified Studio user with the required permissions can republish an Amazon SageMaker Unified Studio data product.

To republish a data product, complete the following steps.

1. Navigate to Amazon SageMaker Unified Studio using the URL from your admin and log in using your SSO or AWS credentials. 

1. Navigate to the project that contains the data product that you want to edit. You can do this by using the center menu at the top of the page and choosing **Browse all projects**, then choosing the name of the project that you want to navigate to.

1. Under **Project catalog**, choose **Assets**.

1. Choose the **Inventory** tab, and then choose the **Data products** filter. This displays existing data products in the project inventory.

1. Choose the data product that you want to republish.

1. Make the desired edits to the data product. For more information, see [Edit data products in Amazon SageMaker Unified Studio](edit-data-product.md).

1. On the data product's details page, choose **Re-publish**. Confirm this action by choosing **Re-publish data product** in the **Re-publish data product** pop-up window. 
**Note**  
Republishing this data product will update the following for all subscribers:
+ If assets have been removed from the data product, subscribers will no longer have access to these assets.
+ If assets have been added to the data product, subscribers will get access to these assets.
+ New published versions of data assets will be available.

# Data inventory and publishing


This section describes the tasks and procedures to create an inventory of your data in Amazon SageMaker Unified Studio and to publish your data in Amazon SageMaker Unified Studio.

To use Amazon SageMaker Unified Studio to catalog your data, you must first bring your data (assets) into your project's inventory in Amazon SageMaker Unified Studio. Creating an inventory for a particular project makes the assets discoverable only to that project’s members. Project inventory assets are not available to all domain users in search or browse unless they are published to the Amazon SageMaker Catalog. After creating a project inventory, data owners can curate their inventory assets with the required business metadata by adding or updating business names (asset and schema), descriptions (asset and schema), README, glossary terms (asset and schema), and metadata forms. 

The next step of using Amazon SageMaker Unified Studio to catalog your data is to make your project’s inventory assets discoverable by the domain users. You can do this by publishing the inventory assets to the Amazon SageMaker Unified Studio catalog. Only the latest version of the inventory asset can be published to the catalog and only the latest published version is active in the discovery catalog. If an inventory asset is updated after it's been published into the Amazon SageMaker Unified Studio catalog, you must publish it again for the latest version to be in the discovery catalog. 

For more information, see [Amazon SageMaker Unified Studio terminology and concepts](concepts.md).

**Topics**
+ [Configure Lake Formation permissions for Amazon SageMaker Unified Studio](lake-formation-permissions-for-amazon-sagemaker-unified-studio.md)
+ [Create custom asset types in Amazon SageMaker Unified Studio](create-asset-types.md)
+ [Create an Amazon SageMaker Unified Studio data source for AWS Glue in the project catalog](data-source-glue.md)
+ [Create an Amazon SageMaker Unified Studio data source for Amazon Redshift in the project catalog](create-redshift-data-source.md)
+ [Create an Amazon SageMaker Unified Studio data source for Amazon SageMaker AI in the project catalog](create-sagemaker-data-source.md)
+ [Edit a data source in Amazon SageMaker Unified Studio](editing-a-data-source.md)
+ [Delete a data source in Amazon SageMaker Unified Studio](removing-a-data-source.md)
+ [Publish assets to the Amazon SageMaker Unified Studio catalog from the project inventory](publishing-data-asset.md)
+ [Share assets](share-assets.md)
+ [Manage inventory and curate assets in Amazon SageMaker Unified Studio](update-metadata.md)
+ [Manually create an asset in Amazon SageMaker Unified Studio](create-data-asset-manually.md)
+ [Unpublish an asset from the Amazon SageMaker Catalog](archive-data-asset.md)
+ [Delete an Amazon SageMaker Unified Studio asset](delete-data-asset.md)
+ [Manually start a data source run in Amazon SageMaker Unified Studio](manually-start-data-source-run.md)
+ [Asset revisions in Amazon SageMaker Unified Studio](asset-versioning.md)
+ [Data quality in Amazon SageMaker Unified Studio](data-quality.md)
+ [Using machine learning and generative AI in Amazon SageMaker Unified Studio](autodoc.md)
+ [Data lineage in Amazon SageMaker Unified Studio](datazone-data-lineage.md)
+ [Analyze Amazon SageMaker Unified Studio data with external analytics applications via JDBC connection](query-with-jdbc.md)
+ [Metadata enforcement rules for publishing](metadata-rules-publishing.md)

# Configure Lake Formation permissions for Amazon SageMaker Unified Studio


When you create a project in Amazon SageMaker Unified Studio, an AWS Glue database is added as part of this project. If you want to publish assets from this AWS Glue database, no additional permissions are needed.

However, if you want to publish assets and subscribe to assets from an AWS Glue database that exists outside of your Amazon SageMaker Unified Studio project, you must explicitly provide your project with the permissions to access tables in the external AWS Glue database. To do this, you must complete the following settings in AWS Lake Formation and attach the necessary AWS Lake Formation permissions to the project's IAM role.
+ Configure the Amazon S3 location for your data lake in AWS Lake Formation with **Lake Formation** permission mode or **Hybrid access mode**. For more information, see [https://docs.aws.amazon.com/lake-formation/latest/dg/register-data-lake.html](https://docs.aws.amazon.com/lake-formation/latest/dg/register-data-lake.html).
+ Attach the following AWS Lake Formation permissions to the AWS Glue manage access role:
  + `Describe` and `Describe grantable` permissions on the database where the tables exist.
  + `Describe`, `Select`, `Describe grantable`, and `Select grantable` permissions on all the tables in the above database that you want DataZone to manage access to on your behalf.
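
The grants above can also be applied programmatically. The following is a minimal sketch of the corresponding AWS Lake Formation `GrantPermissions` parameters; the role ARN and database name are placeholders, not values from this guide:

```python
# Placeholder ARN for the AWS Glue manage access role (substitute your
# project's actual role).
MANAGE_ACCESS_ROLE = "arn:aws:iam::123456789012:role/glue-manage-access-role"

# Describe (grantable) on the database where the tables exist.
database_grant = {
    "Principal": {"DataLakePrincipalIdentifier": MANAGE_ACCESS_ROLE},
    "Resource": {"Database": {"Name": "my_database"}},
    "Permissions": ["DESCRIBE"],
    "PermissionsWithGrantOption": ["DESCRIBE"],
}

# Describe/Select (grantable) on all tables in that database.
table_grant = {
    "Principal": {"DataLakePrincipalIdentifier": MANAGE_ACCESS_ROLE},
    "Resource": {"Table": {"DatabaseName": "my_database", "TableWildcard": {}}},
    "Permissions": ["DESCRIBE", "SELECT"],
    "PermissionsWithGrantOption": ["DESCRIBE", "SELECT"],
}

# With AWS credentials configured, each dict would be passed to
# boto3.client("lakeformation").grant_permissions(**grant).
```

The `TableWildcard` resource grants the permissions on all tables in the database, matching the "all the tables" wording above.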

**Note**  
Amazon SageMaker Unified Studio supports the AWS Lake Formation hybrid mode. Lake Formation hybrid mode enables you to start managing permissions on your AWS Glue databases and tables through Lake Formation, while continuing to maintain any existing IAM permissions on these tables and databases. 

# Create custom asset types in Amazon SageMaker Unified Studio
Create custom asset types

In Amazon SageMaker Unified Studio, assets represent specific types of data resources such as database tables, dashboards, or machine learning models. To provide consistency and standardization when describing catalog assets, an Amazon SageMaker Unified Studio domain must have a set of asset types that define how assets are represented in the catalog. An asset type defines the schema for a specific type of asset and has a set of required and optional named metadata form types. Asset types in Amazon SageMaker Unified Studio are versioned. When assets are created, they are validated against the schema defined by their asset type (typically the latest version); if an invalid structure is specified, asset creation fails. 

**System asset types** - Amazon SageMaker Unified Studio provisions service-owned system asset types. System asset types cannot be altered. Amazon SageMaker Unified Studio includes the following system asset types:
+ Amazon Bedrock chat app
+ Amazon Bedrock flow app
+ Amazon Bedrock inference only
+ Amazon Bedrock model
+ Amazon Bedrock prompt
+ Databricks table
+ Databricks view
+ AWS Glue table
+ AWS Glue view
+ Amazon Redshift table
+ Amazon Redshift view
+ Amazon S3 object collection
+ SageMaker feature group
+ SageMaker model package group
+ Snowflake table
+ Snowflake view
+ Data product

**Custom asset types** - to create custom asset types, you start by creating the required metadata form types and glossaries to use in the form types. You can then create custom asset types by specifying a name, description, and associated metadata forms that can be required or optional. 

For asset types with structured data, to represent the column schema in Amazon SageMaker Unified Studio, you can use the `RelationalTableFormType` to add technical metadata to your columns, including column names, descriptions, and data types, and the `ColumnBusinessMetadataForm` to add business descriptions of the columns, including business names, glossary terms, and custom key-value pairs. 

To create a custom asset type in Amazon SageMaker Unified Studio, complete the following steps:

1. Navigate to Amazon SageMaker Unified Studio using the URL from your admin and log in using your SSO or AWS credentials. 

1. Choose **Select project** from the top navigation pane and select the project where you want to create a custom asset type.

1. Navigate to the **Discover** menu in the top navigation.

1. Choose **Data catalog**.

1. Choose **View asset types**.

1. Choose **Create asset type**.

1. Specify the following:
   + **Name** - the name of the custom asset type.
   + **Description** - the description of the custom asset type.
   + Choose **Add metadata form** to add metadata forms to this custom asset type.
   + Under **Usage permission**, restrict access by specifying which projects or domain units are authorized to use this asset type. 
**Note**  
You must be a domain unit owner or a project owner in order to modify usage permissions. Project contributors can view usage permissions but cannot edit them.

     You can choose the following:
     + **All projects** - give permissions to all projects in this domain
     + **Owning project** - give permissions only to the owning project
     + **Selected projects or domain units** - give permissions to specific projects and/or domain units

       If you select this option, choose **Add usage permission**. In the **Add projects and designations** pop-up window, specify the authorized projects (you can choose **Select projects in a domain unit** or **All projects in a domain unit**), the specific domain unit, and the allowed designations - the designations that a project member must have to use this policy. You can choose **Owner** or **Contributor**.

1. Choose **Create**. After the custom asset type is created, you can use it to create assets.

# Create an Amazon SageMaker Unified Studio data source for AWS Glue in the project catalog
Create a data source for AWS Glue

In Amazon SageMaker Unified Studio, you can create an AWS Glue Data Catalog data source in order to import technical metadata of database tables from AWS Glue. To add a data source for the AWS Glue Data Catalog, the source database must already exist in AWS Glue. Your Amazon SageMaker Unified Studio project’s IAM role also needs certain permissions to be able to create a data source, as described in the section [Configure Lake Formation permissions for Amazon SageMaker Unified Studio](lake-formation-permissions-for-amazon-sagemaker-unified-studio.md).

When you create and run an AWS Glue data source, you add assets from the source AWS Glue database to your Amazon SageMaker Unified Studio project's inventory. You can run your AWS Glue data sources on a set schedule or on demand to create or update your assets' technical metadata. During the data source runs, you can optionally choose to publish your assets to the Amazon SageMaker Unified Studio catalog and thus make them discoverable by all domain users. You can also publish your project inventory assets after editing their business metadata. Domain users can search for and discover your published assets, and request subscriptions to these assets. 

**Note**  
Adding a data source in the project catalog makes it possible to publish that data into the Amazon SageMaker Catalog. To add a data source for analyzing and editing within your project, use the **Data** page of your project. Data that you add to or connect to on the **Data** page can also be published to the Amazon SageMaker Catalog. For more information, see [The lakehouse architecture of Amazon SageMaker](lakehouse.md).

**To create an AWS Glue data source**

1. Navigate to Amazon SageMaker Unified Studio using the URL from your admin and log in using your SSO or AWS credentials. 

1. Choose **Select project** from the top navigation pane and select the project to which you want to add the data source.

1. Choose **Data sources** from the left navigation pane under **Project catalog**.

1. Choose **Create data source**.

1. Configure the following fields:
   + **Name** – The data source name.
   + **Description** – The data source description.

1. Under **Data source type**, choose **AWS Glue**.

1. (Optional) Under **Connection**, select **Import data lineage** if you want to import lineage for the data sources that use the connection.

1. Under **Data selection**, provide an AWS Glue catalog, database names, and selection criteria for tables. For example, if you choose **Include** and enter `*corporate`, the data source will include all source tables whose names end with `corporate`.

   You can either choose an AWS Glue catalog from the dropdown or type a catalog name. The dropdown includes the default AWS Glue catalog for the connection account. 

   You can add multiple include and exclude rules for tables. You can also add multiple databases using the **Add another database** button.

   

1. Choose **Next**.

1. For **Publishing settings**, choose whether assets are immediately discoverable in the Amazon SageMaker Catalog. If you only add them to the inventory, you can choose subscription terms later and then publish them to the Amazon SageMaker Catalog. 

1. For **Metadata generation methods**, choose whether to automatically generate metadata for assets as they're imported from the source.

1. Under **Data quality**, you can choose to **Enable data quality for this data source**. If you do this, Amazon SageMaker Unified Studio imports your existing AWS Glue data quality output into your Amazon SageMaker Unified Studio catalog. By default, Amazon SageMaker Unified Studio imports the 100 most recent existing quality reports with no expiration date from AWS Glue.

   Data quality metrics in Amazon SageMaker Unified Studio help you understand the completeness and accuracy of your data sources. Amazon SageMaker Unified Studio pulls these data quality metrics from AWS Glue to provide point-in-time context, for example, during a business data catalog search. Data users can see how data quality metrics change over time for their subscribed assets. Data producers can ingest AWS Glue data quality scores on a schedule. The Amazon SageMaker Unified Studio business data catalog can also display data quality metrics from third-party systems through data quality APIs. 

1. (Optional) For **Metadata forms**, add forms to define the metadata that is collected and saved when the assets are imported into Amazon SageMaker Unified Studio. For more information, see [Create a metadata form in Amazon SageMaker Unified Studio](create-metadata-form.md).

1. Choose **Next**.

1. For **Run preference**, choose when to run the data source.
   + **Run on a schedule** – Specify the dates and time to run the data source.
   + **Run on demand** – You can manually initiate data source runs.

1. Choose **Next**.

1. Review your data source configuration and choose **Create**.
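
The include and exclude criteria in the data selection step behave like glob patterns over table names. The following is a rough local sketch of the matching semantics (an approximation for illustration, not the service's implementation):

```python
from fnmatch import fnmatch  # glob-style pattern matching

def select_tables(tables, include=("*",), exclude=()):
    """Keep tables matching any include pattern and no exclude pattern."""
    return [
        t for t in tables
        if any(fnmatch(t, p) for p in include)
        and not any(fnmatch(t, p) for p in exclude)
    ]

# With include pattern `*corporate`, only tables whose names end with
# "corporate" are selected.
tables = ["sales_corporate", "hr_corporate", "sales_staging"]
selected = select_tables(tables, include=["*corporate"])
```

Multiple include and exclude rules combine the same way: a table is imported if it matches at least one include rule and no exclude rule.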

You can also create an Amazon SageMaker Unified Studio data source for AWS Glue by invoking the `CreateDataSource` API action or the `create-data-source` CLI action:

```
aws datazone create-data-source --cli-input-json file://create-sagemaker-datasource-example.json
```

The following sample payload (`create-sagemaker-datasource-example.json` in the preceding command) creates an AWS Glue data source in an Amazon DataZone domain:

```
{
  "name": "my-data-source",
  "projectIdentifier": "project123",
  "type": "GLUE",
  "description": "Description of the datasource",
  "environmentIdentifier": "environment123",
  "configuration": {
    "glueRunConfiguration": {
        "autoImportDataQualityResult": "True",
        "relationalFilterConfigurations": [{
            "databaseName": "my_database",
            "filterExpressions": [{
                "expression": "*",
                "type": "INCLUDE"
            }]
        }]
    }
  },
  "recommendation": {
    "enableBusinessNameGeneration": "True"
  },
  "enableSetting": "ENABLED",
  "schedule": {
    "timezone": "UTC",
    "schedule": "cron(7 22 * * ? *)"
  },
  "publishOnImport": "True",
  "assetFormsInput": [
    {
      "formName": "AssetCommonDetailsForm"
      "typeIdentifier": "amazon.datazone.AssetCommonDetailsFormType",
      "typeRevision": "3",
      "content": "form-content",
      "renderingConfig": {
        "collapse": "True""
      }
    }
  ],
  "clientToken": "123456"
}
```
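
Because the CLI rejects malformed input, it can help to sanity-check the payload locally before invoking `create-data-source`. The following is a minimal sketch; the required field names are an assumption based on the sample payloads in this section:

```python
import json

# Fields that the sample payloads above always include.
REQUIRED_FIELDS = {"name", "projectIdentifier", "type", "configuration"}

def check_payload(text):
    """Parse the payload JSON and verify the expected top-level fields."""
    payload = json.loads(text)  # raises json.JSONDecodeError if malformed
    missing = REQUIRED_FIELDS - payload.keys()
    if missing:
        raise ValueError(f"payload is missing fields: {sorted(missing)}")
    return payload

sample = json.dumps({
    "name": "my-data-source",
    "projectIdentifier": "project123",
    "type": "GLUE",
    "configuration": {},
})
payload = check_payload(sample)
```

In practice you would read the file referenced by `--cli-input-json` and pass its contents to a check like this before running the command.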

The following sample payload (`create-sagemaker-datasource-example.json` in the preceding command) creates an AWS Glue data source in an Amazon SageMaker unified domain:

```
{
  "name": "my-data-source",
  "projectIdentifier": "project123",
  "type": "GLUE",
  "description": "Description of the datasource",
  "connectionIdentifier": "connection123",  
  "configuration": {
    "glueRunConfiguration": {
        "catalogName": "my_catalog",
        "autoImportDataQualityResult": "True",
        "relationalFilterConfigurations": [{
            "databaseName": "my_database",
            "filterExpressions": [{
                "expression": "*",
                "type": "INCLUDE"
            }]
        }]
    }
  },
  "recommendation": {
    "enableBusinessNameGeneration": "True"
  },
  "enableSetting": "ENABLED",
  "schedule": {
    "timezone": "UTC",
    "schedule": "cron(7 22 * * ? *)"
  },
  "publishOnImport": "True",
  "assetFormsInput": [
    {
      "formName": "AssetCommonDetailsForm"
      "typeIdentifier": "amazon.datazone.AssetCommonDetailsFormType",
      "typeRevision": "3",
      "content": "form-content"
    }
  ],
  "clientToken": "123456"
}
```
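
Both payloads schedule runs with an AWS cron expression, which has six fields: minutes, hours, day of month, month, day of week, and year. A small sketch that splits the expression used above into its fields:

```python
def parse_aws_cron(expr):
    """Split an AWS cron() expression into its six named fields."""
    assert expr.startswith("cron(") and expr.endswith(")")
    fields = expr[len("cron("):-1].split()
    names = ["minutes", "hours", "day_of_month", "month", "day_of_week", "year"]
    assert len(fields) == len(names), "AWS cron expressions have six fields"
    return dict(zip(names, fields))

# cron(7 22 * * ? *) runs daily at 22:07 in the payload's timezone (UTC here);
# "?" means no specific day-of-week value.
schedule = parse_aws_cron("cron(7 22 * * ? *)")
```

Note that, unlike standard Unix cron, the AWS format includes a year field and requires `?` in either the day-of-month or day-of-week position.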

# Create an Amazon SageMaker Unified Studio data source for Amazon Redshift in the project catalog
Create a data source for Amazon Redshift

In Amazon SageMaker Unified Studio, you can create an Amazon Redshift data source in order to import technical metadata of database tables and views from the Amazon Redshift data warehouse. To add an Amazon SageMaker Unified Studio data source for Amazon Redshift, the source data warehouse must already exist in Amazon Redshift.

When you create and run an Amazon Redshift data source, you add assets from the source Amazon Redshift data warehouse to your Amazon SageMaker Unified Studio project's inventory. You can run your Amazon Redshift data sources on a set schedule or on demand to create or update your assets' technical metadata. During the data source runs, you can optionally choose to publish your project inventory assets to the Amazon SageMaker Unified Studio catalog and thus make them discoverable by all domain users. You can also publish your inventory assets after editing their business metadata. Domain users can search for and discover your published assets and request subscriptions to these assets.

**Note**  
Adding a data source in the project catalog makes it possible to publish that data into the Amazon SageMaker Catalog. To add a data source for analyzing and editing within your project, use the **Data** page of your project. Data that you add to or connect to on the **Data** page can also be published to the Amazon SageMaker Catalog. For more information, see [The lakehouse architecture of Amazon SageMaker](lakehouse.md).

**To add an Amazon Redshift data source**

1. Navigate to Amazon SageMaker Unified Studio using the URL from your admin and log in using your SSO or AWS credentials. 

1. Choose **Select project** from the top navigation pane and select the project to which you want to add the data source.

1. Choose **Data sources** from the left navigation pane under **Project catalog**.

1. Choose **Create data source**.

1. Configure the following fields:
   + **Name** – The data source name.
   + **Description** – The data source description.

1. Under **Data source type**, choose **Amazon Redshift**.

1. Under **Connection**, select a connection for your data source. The connection cannot be changed after the data source is created.

1. Under **Data selection**, provide an Amazon Redshift database schema name and enter your table or view selection criteria. For example, if you choose **Include** and enter `*corporate`, the data source will include all source tables whose names end with `corporate`.

   You can add multiple include rules. You can also add another schema using the **Add another schema** button.

1. Choose **Next**.

1. For **Publishing settings**, choose whether assets are immediately discoverable in the Amazon SageMaker Catalog. If you only add them to the inventory, you can choose subscription terms later and then publish them to the Amazon SageMaker Catalog. 

1. For **Metadata generation methods**, choose whether to automatically generate metadata for assets as they're published and updated from the source.

1. (Optional) For **Metadata forms**, add forms to define the metadata that is collected and saved when the assets are imported into Amazon SageMaker Unified Studio. For more information, see [Create a metadata form in Amazon SageMaker Unified Studio](create-metadata-form.md).

1. Choose **Next**.

1. For **Run preference**, choose when to run the data source.
   + **Run on a schedule** – Specify the dates and time to run the data source.
   + **Run on demand** – You can manually initiate data source runs.

1. Choose **Next**.

1. Review your data source configuration and choose **Create**.

You can also create an Amazon SageMaker Unified Studio data source for Amazon Redshift by invoking the `CreateDataSource` API action or the `create-data-source` CLI action:

```
aws datazone create-data-source --cli-input-json file://create-sagemaker-datasource-example.json
```

The following sample payload (`create-sagemaker-datasource-example.json` in the preceding command) creates an Amazon Redshift data source in an Amazon DataZone domain:

```
{
  "name": "my-data-source",
  "projectIdentifier": "project123",
  "type": "REDSHIFT",
  "description": "Description of the datasource",
  "environmentIdentifier": "environment123",  
  "configuration": {
    "redshiftRunConfiguration": {
        "dataAccessRole": "arn:aws:iam::123456789012:role/my-data-access-role",
        "redshiftCredentialConfiguration": {
                "secretManagerArn": "arn:aws:secretsmanager:us-east-1:123456789012:secret:my-secret"
            },
        "redshiftStorage": {
            "redshiftClusterSource": {
                "clusterName": "my-redshift-cluster"
            }
        },
        "relationalFilterConfigurations": [{
            "databaseName": "my_database",
            "filterExpressions": [{
                "expression": "*",
                "type": "INCLUDE"
            }],
            "schemaName": "my_schema"
        }]
    }
  },
  "recommendation": {
    "enableBusinessNameGeneration": "True"
  },
  "enableSetting": "ENABLED",
  "schedule": {
    "timezone": "UTC",
    "schedule": "cron(7 22 * * ? *)"
  },
  "publishOnImport": "True",
  "assetFormsInput": [
    {
      "formName": "AssetCommonDetailsForm"
      "typeIdentifier": "amazon.datazone.AssetCommonDetailsFormType",
      "typeRevision": "3",
      "content": "form-content"
    }
  ],
  "clientToken": "123456"
}
```

The following sample payload (`create-sagemaker-datasource-example.json` in the preceding command) creates an Amazon Redshift data source in an Amazon SageMaker unified domain:

```
{
  "name": "my-data-source",
  "projectIdentifier": "project123",
  "type": "REDSHIFT",
  "description": "Description of the datasource",
  "connectionIdentifier": "connection123",  
  "configuration": {
    "redshiftRunConfiguration": {
        "dataAccessRole": "arn:aws:iam::123456789012:role/my-data-access-role",
        "redshiftCredentialConfiguration": {
                "secretManagerArn": "arn:aws:secretsmanager:us-east-1:123456789012:secret:my-secret"
            },
        "redshiftStorage": {
            "redshiftClusterSource": {
                "clusterName": "my-redshift-cluster"
            }
        },
        "relationalFilterConfigurations": [{
            "databaseName": "my_database",
            "filterExpressions": [{
                "expression": "*",
                "type": "INCLUDE"
            }],
            "schemaName": "my_schema"
        }]
    }
  },
  "recommendation": {
    "enableBusinessNameGeneration": "True"
  },
  "enableSetting": "ENABLED",
  "schedule": {
    "timezone": "UTC",
    "schedule": "cron(7 22 * * ? *)"
  },
  "publishOnImport": "True",
  "assetFormsInput": [
    {
      "formName": "AssetCommonDetailsForm"
      "typeIdentifier": "amazon.datazone.AssetCommonDetailsFormType",
      "typeRevision": "3",
      "content": "form-content"
    }
  ],
  "clientToken": "123456"
}
```
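
When importing several schemas from the same database, each schema gets its own entry in `relationalFilterConfigurations`. A small helper sketch that builds those entries (the names are placeholders):

```python
def schema_filters(database, schemas, expression="*", filter_type="INCLUDE"):
    """Build one relationalFilterConfigurations entry per schema."""
    return [
        {
            "databaseName": database,
            "schemaName": schema,
            "filterExpressions": [
                {"expression": expression, "type": filter_type}
            ],
        }
        for schema in schemas
    ]

# Two schemas in the same database, each importing all tables.
filters = schema_filters("my_database", ["sales", "finance"])
```

The resulting list can be placed directly in the `redshiftRunConfiguration` of a payload like the one above.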

# Create an Amazon SageMaker Unified Studio data source for Amazon SageMaker AI in the project catalog
Create a data source for Amazon SageMaker AI

In the current release of Amazon SageMaker Unified Studio, creating an Amazon SageMaker AI data source is not supported in the UI; it can only be done by invoking API or CLI actions.

To create a data source for Amazon SageMaker AI in Amazon SageMaker Unified Studio, you must first create an AWS RAM share between Amazon SageMaker and Amazon DataZone. This RAM share is necessary for Amazon SageMaker to successfully make the Amazon DataZone API calls that are needed for various membership and security checks.

If you're using the Amazon DataZone domain, you can complete this step by [adding Amazon SageMaker as a trusted service in your Amazon DataZone domain](https://docs.aws.amazon.com/datazone/latest/userguide/add-sagemaker-as-trusted-service-associate.html).

If you're using an Amazon SageMaker unified domain, you can do this by completing the following procedure:

1. Navigate to the RAM console at [https://us-east-1.console.aws.amazon.com/ram/home](https://us-east-1.console.aws.amazon.com/ram/home).

1. Choose **Create resource share**.

1. For the resource share name, enter a unique name, for example `DataZone-<DataZone DomainId>-SageMaker`.

1. Under **Resources**, choose **DataZone Domains** from the dropdown, select the Amazon DataZone domain from the list, and then choose **Next**.

1. From the **Managed permissions** dropdown, choose **AWSRAMSageMakerServicePrincipalPermissionAmazonDataZoneDomain**, and then choose **Next**.

1. Under **Principals**, from the **Select principal type** dropdown, choose **Service principal**.

1. Enter `sagemaker.amazonaws.com` for the service principal name, choose **Add**, and then choose **Next**.

1. Choose **Create resource share**.
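
The console steps above correspond roughly to a single AWS RAM `CreateResourceShare` call. The following is a hedged sketch of the parameters only; the domain ID, account, and Region are placeholders:

```python
# Placeholder Amazon DataZone domain ID; substitute your own.
DOMAIN_ID = "dzd_example123"

# With AWS credentials configured, these parameters would be passed to
# boto3.client("ram").create_resource_share(**params).
params = {
    "name": f"DataZone-{DOMAIN_ID}-SageMaker",
    "resourceArns": [
        f"arn:aws:datazone:us-east-1:123456789012:domain/{DOMAIN_ID}"
    ],
    "principals": ["sagemaker.amazonaws.com"],
    "permissionArns": [
        "arn:aws:ram::aws:permission/"
        "AWSRAMSageMakerServicePrincipalPermissionAmazonDataZoneDomain"
    ],
}
```

The service principal and managed permission name are the ones used in the console procedure above.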

Once this is completed, you can invoke the `CreateDataSource` API action or the `create-data-source` CLI action to create a new data source for Amazon SageMaker AI in Amazon SageMaker Unified Studio. 

```
aws datazone create-data-source --cli-input-json file://create-sagemaker-datasource-example.json
```

The following sample payload (`create-sagemaker-datasource-example.json` in the preceding command) creates an Amazon SageMaker AI data source in an Amazon DataZone domain:

```
{
  "name": "my-data-source",
  "projectIdentifier": "project123",
  "type": "SAGEMAKER",
  "description": "Description of the datasource",
  "environmentIdentifier": "environment123",
  "configuration": {
    "sageMakerRunConfiguration": {
        "trackingAssets": {
            "SageMakerModelPackageGroupAssetType": [
                "arn:aws:sagemaker:us-east-1:123456789012:model-package-group/my-model-package-group"
            ],
            "SageMakerFeatureGroupAssetType": [
                "arn:aws:sagemaker:us-east-1:123456789012:feature-group/my-feature-group"
            ]
        }
    }
  },
  "recommendation": {
    "enableBusinessNameGeneration": true
  },
  "enableSetting": "ENABLED",
  "schedule": {
    "timezone": "UTC",
    "schedule": "cron(7 22 * * ? *)"
  },
  "publishOnImport": true,
  "assetFormsInput": [
    {
      "formName": "AssetCommonDetailsForm",
      "typeIdentifier": "amazon.datazone.AssetCommonDetailsFormType",
      "typeRevision": "3",
      "content": "form-content"
    }
  ],
  "clientToken": "123456"
}
```
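
Before passing a payload like the one above to the CLI, it can help to generate and validate it programmatically. The following Python sketch (using the same placeholder identifiers) builds a trimmed version of the payload and round-trips it through the `json` module, which catches malformed JSON such as trailing commas or Python-style `True` booleans before the API call:

```python
import json

# Placeholder identifiers from the example above; substitute your own values.
payload = {
    "name": "my-data-source",
    "projectIdentifier": "project123",
    "type": "SAGEMAKER",
    "environmentIdentifier": "environment123",
    "configuration": {
        "sageMakerRunConfiguration": {
            "trackingAssets": {
                "SageMakerModelPackageGroupAssetType": [
                    "arn:aws:sagemaker:us-east-1:123456789012:model-package-group/my-model-package-group"
                ]
            }
        }
    },
    # json.dumps serializes these as lowercase true, as the API expects
    "publishOnImport": True,
    "recommendation": {"enableBusinessNameGeneration": True},
}

text = json.dumps(payload, indent=2)   # write this to create-sagemaker-datasource-example.json
parsed = json.loads(text)              # round-trip check: fails if the JSON is malformed
```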

Sample payload (`create-sagemaker-datasource-example.json` per the example above) to create an Amazon SageMaker data source in an Amazon SageMaker unified domain:

```
{
  "name": "my-data-source",
  "projectIdentifier": "project123",
  "type": "SAGEMAKER",
  "description": "Description of the datasource",
  "connectionIdentifier": "connection123",
  "configuration": {
    "sageMakerRunConfiguration": {
        "trackingAssets": {
            "SageMakerModelPackageGroupAssetType": [
                "arn:aws:sagemaker:us-east-1:123456789012:model-package-group/my-model-package-group"
            ]
        }
    }
  },
  "recommendation": {
    "enableBusinessNameGeneration": true
  },
  "enableSetting": "ENABLED",
  "schedule": {
    "timezone": "UTC",
    "schedule": "cron(7 22 * * ? *)"
  },
  "publishOnImport": true,
  "assetFormsInput": [
    {
      "formName": "AssetCommonDetailsForm",
      "typeIdentifier": "amazon.datazone.AssetCommonDetailsFormType",
      "typeRevision": "3",
      "content": "form-content"
    }
  ],
  "clientToken": "123456"
}
```

# Edit a data source in Amazon SageMaker Unified Studio

After you create an Amazon SageMaker Unified Studio data source, you can modify it to change the source details or the data selection criteria. When you no longer need a data source, you can delete it.

**To edit a data source in the project catalog**

1. Navigate to Amazon SageMaker Unified Studio using the URL from your admin and log in using your SSO or AWS credentials. 

1. Choose **Select project** from the top navigation pane and select the project that contains the data source that you want to edit.

1. Choose **Data sources** from the left navigation pane under **Project catalog**.

1. Choose the data source that you want to modify.

1. Expand the **Actions** menu, then choose **Edit data source**.

1. Make your changes to the data source fields as desired, then choose **Save**.

# Delete a data source in Amazon SageMaker Unified Studio

When you no longer need an Amazon SageMaker Unified Studio data source, you can remove it permanently. After you delete a data source, all assets that originated from that data source are still available in the catalog, and users can still subscribe to them. However, the assets will stop receiving updates from the source. We recommend that you first move the dependent assets to a different data source before you delete it.

**To delete a data source in the project catalog**

1. Navigate to Amazon SageMaker Unified Studio using the URL from your admin and log in using your SSO or AWS credentials. 

1. Choose **Select project** from the top navigation pane and select the project that contains the data source that you want to delete.

1. Choose **Data sources** from the left navigation pane under **Project catalog**.

1. Choose the data source that you want to delete.

1. Expand the **Actions** menu, then choose **Delete data source**.

1. To confirm deletion, type `delete` in the text entry field. Then choose **Delete**.

# Publish assets to the Amazon SageMaker Unified Studio catalog from the project inventory

You can publish Amazon SageMaker Unified Studio assets and their metadata from project inventories into the Amazon SageMaker Unified Studio catalog. You can only publish the most recent version of an asset to the catalog.

Consider the following when publishing assets to the catalog:
+ To publish an asset to the catalog, you must be the owner or contributor of the project that contains the asset.
+ For Amazon Redshift assets, ensure that the Amazon Redshift clusters associated with both the publisher and subscriber projects meet all the requirements for Amazon Redshift data sharing so that Amazon SageMaker Unified Studio can manage access to Redshift tables and views. See [Data sharing concepts for Amazon Redshift](https://docs.aws.amazon.com/redshift/latest/dg/concepts.html).

## Publish an asset in Amazon SageMaker Unified Studio

If you didn't choose to make assets immediately discoverable in the data catalog when you created a data source, perform the following steps to publish them later.

**To publish an asset**

1. Navigate to Amazon SageMaker Unified Studio using the URL from your admin and log in using your SSO or AWS credentials. 

1. Choose **Select project** from the top navigation pane and select the project to which the asset belongs.

1. Under **Project catalog** in the left side navigation, choose **Assets**.

1. Make sure you are on the **Inventory** tab, then choose the name of the asset that you want to publish. You are then brought to the asset details page.
**Note**  
By default, all assets require subscription approval, which means a data owner must approve all subscription requests to the asset. If you want to change this setting before publishing the asset, open the asset details and choose **Edit** next to **Subscription approval**. You can change this setting later by modifying and re-publishing the asset.

1. Choose **Publish asset**. The asset is directly published to the catalog.

   If you make changes to the asset, such as modifying its approval requirements, you can choose **Re-publish asset** to publish the updates to the catalog.

# Share assets


In the current release of Amazon SageMaker Unified Studio, you can share your Amazon S3 assets, AWS Glue (SageMaker Lakehouse) assets, and Amazon QuickSight assets with other projects, users, or groups.

For more information about sharing your Amazon QuickSight assets, see [Share Amazon QuickSight dashboards](https://docs.aws.amazon.com/sagemaker-unified-studio/latest/userguide/share-qs-dashboard.html).

For more information about sharing your Amazon S3 assets, see [Sharing Amazon S3 data](data-s3-publish.md).

To share your AWS Glue (SageMaker Lakehouse) data, complete the following procedure:

1. Navigate to Amazon SageMaker Unified Studio using the URL from your admin and log in using your SSO or AWS credentials. 

1. From the top center menu, choose **Browse all projects**.

1. Select the name of the project to navigate to that project. You can select either a project that you created manually or the project that was automatically created when you onboarded your Amazon SageMaker Lakehouse data.

1. Choose the **Data** tab, then choose the catalog that you want to work with under **Lakehouse** and navigate down to the database and the table asset that you want to share.

1. Choose the asset that you want to share, then expand **Actions**, and choose **Share**.

1. In the **Share table** window, specify the project with which you want to share this asset and then choose **Share**.

**Note**  
In the current release of Amazon SageMaker Unified Studio, sharing row and column filters is not supported.

# Manage inventory and curate assets in Amazon SageMaker Unified Studio

In order to use Amazon SageMaker Unified Studio to catalog your data, you must first bring in your data (assets) as the inventory of your project in Amazon SageMaker Unified Studio. Creating inventory for a particular project makes the assets discoverable only to that project's members.

After the assets are created in project inventory, their metadata can be curated. For example, you can edit the asset's name, description, or README. Each edit to the asset creates a new version of the asset. You can use the History tab on the asset's details page to view all asset versions. 

You can edit the **README** section and add rich descriptions for the asset. The **README** section supports markdown, thus enabling you to format your descriptions as required and describe key information about an asset to consumers. 

Glossary terms can be added at the asset level by filling out available forms. 

To curate the schema, you can review the columns, add business names and descriptions, and add glossary terms at the column level.

**To update the schema of an asset**

1. Navigate to Amazon SageMaker Unified Studio using the URL from your admin and log in using your SSO or AWS credentials. 

1. Choose **Select project** from the top navigation pane and select the project to which the asset belongs.

1. Under **Project catalog** in the left side navigation, choose **Assets**.

1. Make sure you are on the **Inventory** tab, then choose the name of the asset whose schema you want to update. You are then brought to the asset details page.

1. Choose the **Schema** tab and then on the schema details page, choose the **View/Edit** link of the column that you'd like to curate.

   In the right-hand pane that opens, you can edit the details, ReadMe, glossary terms, and metadata forms of the column.

If automated metadata generation is enabled when the data source is created, the business names for assets and columns are available to review and accept or reject individually or all at once. 

You can also edit the subscription terms to specify whether approval for the asset is required.

Metadata forms in Amazon SageMaker Unified Studio enable you to extend a data asset's metadata model by adding custom-defined attributes (for example, sales region, sales year, and sales quarter). The metadata forms that are attached to an asset type are applied to all assets created from that asset type. You can also add additional metadata forms to individual assets as part of the data source run or after it's created. For creating new forms, see [Create a metadata form in Amazon SageMaker Unified Studio](create-metadata-form.md). 

To update the metadata of an asset, you must be the owner or the contributor of the project to which the asset belongs.

**To update the metadata of an asset**

1. Navigate to Amazon SageMaker Unified Studio using the URL from your admin and log in using your SSO or AWS credentials. 

1. Choose **Select project** from the top navigation pane and select the project to which the asset belongs.

1. Under **Project catalog** in the left side navigation, choose **Assets**.

1. Make sure you are on the **Inventory** tab, then choose the name of the asset that you want to update. You are then brought to the asset details page.

1. On the asset details page, under **Metadata forms**, choose **Edit values** to edit the existing forms as needed, or choose **Add metadata form** and enter values for each of the metadata fields to attach additional metadata forms to the asset. 

1. When you're done making updates, choose **Save**.

   When you save the form, Amazon SageMaker Unified Studio generates a new inventory version of the asset. To publish the updated version to the catalog, choose **Re-publish asset**.

By default, metadata forms attached to a domain are attached to all assets published to that domain. Data publishers can associate additional metadata forms to individual assets in order to provide additional context.

When the asset curation is complete, the data owner can publish an asset version to the Amazon SageMaker Unified Studio catalog and thus make it discoverable by all domain users. The asset in the project shows the inventory version and the published version. In the discovery catalog, only the latest published version appears. If the metadata is updated after publishing, then a new inventory version will be available for publishing to the catalog.

# Manually create an asset in Amazon SageMaker Unified Studio

In Amazon SageMaker Unified Studio, an asset is an entity that represents a single physical data object (for example, a table, a dashboard, a file) or virtual data object (for example, a view). For more information, see [Amazon SageMaker Unified Studio terminology and concepts](concepts.md). Publishing an asset manually is a one-time operation. You don't specify a run schedule for the asset, so it's not updated automatically if its source changes.

To manually create an asset through a project, you must be the owner or contributor of that project.

**To create an asset manually**

1. Navigate to Amazon SageMaker Unified Studio using the URL from your admin and log in using your SSO or AWS credentials. 

1. Choose **Select project** from the top navigation pane and select the project that you want to create an asset in.

1. Under **Project catalog** in the left side navigation, choose **Assets**.

1. On the **Inventory** tab, choose **Create**, then choose **Create asset**.

1. For **Data asset details**, configure the following settings:
   + **Name** – The name of the asset.
   + **Description** – A description of the asset.

1. Choose **Next**.

1. For **Asset type details**, configure the following settings:
   + **Asset type** – The type of asset.
   + **Revision** – The revision of the asset type.

1. If you are adding an **S3 object collection**, for **S3 location**, enter the Amazon Resource Name (ARN) of the source S3 bucket.

   Optionally, enter an S3 access point. For more information, see [Managing data access with Amazon S3 access points](https://docs.aws.amazon.com/AmazonS3/latest/userguide/access-points.html).

1. Choose **Next**.

1. Review the selections, then choose **Create**. 

   After the asset is created, it will be stored in the inventory until you decide to publish it.

# Unpublish an asset from the Amazon SageMaker Catalog

When you unpublish an Amazon SageMaker Unified Studio asset from the catalog, it no longer appears in global search results. New users won't be able to find or subscribe to the asset listing in the catalog, but all existing subscriptions remain the same.

To unpublish an asset, you must be the owner or the contributor of the project to which the asset belongs.

**To unpublish an asset**

1. Navigate to Amazon SageMaker Unified Studio using the URL from your admin and log in using your SSO or AWS credentials. 

1. Choose **Select project** from the top navigation pane and select the project to which the asset belongs.

1. Under **Project catalog** in the left side navigation, choose **Assets**.

1. On the **Inventory** tab, choose the name of the asset that you want to unpublish. This opens the asset details page.

1. Expand the **Actions** menu, then choose **Unpublish**.

1. In the pop-up window, confirm the action by choosing **Unpublish**.

   The asset is then removed from the catalog. You can re-publish the asset at any time by choosing **Publish asset**.

# Delete an Amazon SageMaker Unified Studio asset

When you no longer need an asset in Amazon SageMaker Unified Studio, you can permanently delete it. Deleting an asset is different from unpublishing an asset from the catalog. You can delete an asset and its related listing in the catalog so that it's not visible in any search results. To delete the asset listing, you must first revoke all of its subscriptions.

To delete an asset, you must be the owner or the contributor of the project to which the asset belongs.

**Note**  
In order to delete an asset listing, you must first revoke all existing subscriptions to the asset, and the asset must be removed from all data products. You can't delete an asset listing that has existing subscribers or that is included in a current data product.

**To delete an asset**

1. Navigate to Amazon SageMaker Unified Studio using the URL from your admin and log in using your SSO or AWS credentials. 

1. Choose **Select project** from the top navigation pane and select the project to which the asset belongs.

1. Under **Project catalog** in the left side navigation, choose **Assets**.

1. On the **Inventory** tab, choose the name of the asset that you want to delete. This opens the asset details page.

1. Expand the **Actions** menu, then choose **Delete**.

1. In the pop-up window, type `delete` to confirm deletion, then choose **Delete**.

   When the asset is deleted, it's no longer available to view or subscribe to.

# Manually start a data source run in Amazon SageMaker Unified Studio

When you run a data source, Amazon SageMaker Unified Studio pulls any new or modified metadata from the source and updates the associated assets in the inventory. When you add a data source to Amazon SageMaker Unified Studio, you specify the source's run preference, which defines whether the source runs on a schedule or on demand. If your source runs on demand, you must initiate a data source run manually.

Even if your source runs on a schedule, you can still run it manually at any time. After adding business metadata to the assets, you can select assets and publish them to the Amazon SageMaker Catalog in order for these assets to be discoverable by all domain users. Only published assets are searchable by other domain users.

**To run a data source manually**

1. Navigate to Amazon SageMaker Unified Studio using the URL from your admin and log in using your SSO or AWS credentials. 

1. Choose **Select project** from the top navigation pane and select the project to which the data source belongs.

1. Choose **Data sources** from the left navigation pane under **Project catalog**.

1. Choose the data source that you want to run. This opens the data source details page.

1. Choose **Run**.

   The data source status changes as Amazon SageMaker Unified Studio updates the asset metadata with the most recent data from the source. You can monitor the status of the run on the **Data source runs** tab. 
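
If you prefer the CLI, a run can also be started with the Amazon DataZone `StartDataSourceRun` API; the identifiers below are placeholders:

```
aws datazone start-data-source-run \
--domain-identifier dzd_example123 \
--data-source-identifier datasource123
```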

# Asset revisions in Amazon SageMaker Unified Studio

Amazon SageMaker Unified Studio increments the revision of an asset when you edit its business or technical metadata. These edits include modifying the asset name, description, glossary terms, column names, metadata forms, and metadata form field values. These changes can result from manual edits, data source job runs, or API operations. Amazon SageMaker Unified Studio automatically generates a new asset revision any time you make an edit to the asset.

After you update an asset and a new revision is generated, you must publish the new revision to the catalog for it to be updated and available to subscribers. For more information, see [Publish assets to the Amazon SageMaker Unified Studio catalog from the project inventory](publishing-data-asset.md). You can only publish the most recent version of an asset to the catalog.

**To view past revisions of an asset**

1. Navigate to Amazon SageMaker Unified Studio using the URL from your admin and log in using your SSO or AWS credentials. 

1. Choose **Select project** from the top navigation pane and select the project to which the asset belongs.

1. Under **Project catalog** in the left side navigation, choose **Assets**.

1. On the **Inventory** tab, choose the name of the asset whose revisions you want to view. This opens the asset details page.

1. Navigate to the **History** tab, which displays a list of past revisions of the asset.

# Data quality in Amazon SageMaker Unified Studio

Data quality metrics in Amazon SageMaker Unified Studio help you understand quality dimensions such as completeness, timeliness, and accuracy of your data sources. Amazon SageMaker Unified Studio integrates with AWS Glue Data Quality and offers APIs to integrate data quality metrics from third-party data quality solutions. Data users can see how data quality metrics change over time for their subscribed assets. To author and run the data quality rules, you can use your data quality tool of choice, such as AWS Glue Data Quality. With data quality metrics, data consumers can visualize the data quality scores for assets and columns, which helps build trust in the data they use for decisions.

**Prerequisites and IAM role changes**

If you are using Amazon SageMaker Unified Studio's AWS managed policies, there are no additional configuration steps and these managed policies are automatically updated to support data quality. If you are using your own policies for the roles that grant Amazon SageMaker Unified Studio the required permissions to interoperate with supported services, you must update the policies attached to these roles to enable support for reading the AWS Glue data quality information.

## Enabling data quality for AWS Glue assets


Amazon SageMaker Unified Studio pulls data quality metrics from AWS Glue in order to provide point-in-time context, for example, during a business data catalog search. Data users can see how data quality metrics change over time for their subscribed assets. Data producers can ingest AWS Glue data quality scores on a schedule. The Amazon SageMaker Unified Studio business data catalog can also display data quality metrics from third-party systems through data quality APIs. For more information, see [AWS Glue Data Quality](https://docs.aws.amazon.com/glue/latest/dg/glue-data-quality.html) and [Getting started with AWS Glue Data Quality for the Data Catalog](https://docs.aws.amazon.com/glue/latest/dg/data-quality-getting-started.html).

You can enable data quality metrics for your Amazon SageMaker Unified Studio assets in the following ways:
+ Use Amazon SageMaker Unified Studio to enable data quality for your AWS Glue data source, either while creating a new data source or while editing an existing one.
**Note**  
You can use Amazon SageMaker Unified Studio to enable data quality only for your AWS Glue inventory assets. In this release of Amazon SageMaker Unified Studio, enabling data quality for custom asset types must be done using APIs.
+ You can also use the APIs to enable data quality for your new or existing data sources by invoking the [CreateDataSource](https://docs.aws.amazon.com/datazone/latest/APIReference/API_CreateDataSource.html) or [UpdateDataSource](https://docs.aws.amazon.com/datazone/latest/APIReference/API_UpdateDataSource.html) API and setting the `autoImportDataQualityResult` parameter to `true`.
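
As a sketch only (the identifiers are placeholders, and the `glueRunConfiguration` shape should be confirmed against the API reference), enabling the parameter on an existing AWS Glue data source with the CLI might look like:

```
aws datazone update-data-source --cli-input-json file://update-datasource-example.json
```

with a payload such as:

```
{
  "domainIdentifier": "dzd_example123",
  "identifier": "datasource123",
  "configuration": {
    "glueRunConfiguration": {
      "autoImportDataQualityResult": true
    }
  }
}
```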

After data quality is enabled, you can run the data source on demand or on a schedule. Each run can bring in up to 100 metrics per asset. There is no need to create forms or add metrics manually when using a data source for data quality. When the asset is published, the updates that were made to the data quality form (up to 30 historical data points per rule) are reflected in the listing for consumers. Subsequently, each new addition of metrics to the asset is automatically added to the listing. There is no need to republish the asset to make the latest scores available to consumers.

## Enabling data quality for custom asset types


You can use the Amazon SageMaker Unified Studio APIs to enable data quality for any of your custom asset types. For more information, see the following:
+ [PostTimeSeriesDataPoints](https://docs.aws.amazon.com/datazone/latest/APIReference/API_PostTimeSeriesDataPoints.html)
+ [ListTimeSeriesDataPoints](https://docs.aws.amazon.com/datazone/latest/APIReference/API_ListTimeSeriesDataPoints.html)
+ [GetTimeSeriesDataPoint](https://docs.aws.amazon.com/datazone/latest/APIReference/API_GetTimeSeriesDataPoint.html)
+ [DeleteTimeSeriesDataPoints](https://docs.aws.amazon.com/datazone/latest/APIReference/API_DeleteTimeSeriesDataPoints.html)

The following steps provide an example of using the APIs or the AWS CLI to import third-party metrics for your assets in Amazon SageMaker Unified Studio:

1. Invoke the `PostTimeSeriesDataPoints` API as follows:

   ```
   aws datazone post-time-series-data-points \
   --cli-input-json file://createTimeSeriesPayload.json
   ```

   with the following payload:

   ```
   {
       "domainId": "dzd_5oo7xzoqltu8mf",
       "entityId": "4wyh64k2n8czaf",
       "entityType": "ASSET",
       "form": {
           "content": "{\n  \"evaluations\" : [ {\n    \"types\" : [ \"MaximumLength\" ],\n    \"description\" : \"ColumnLength \\\"ShippingCountry\\\" <= 6\",\n    \"details\" : { },\n    \"applicableFields\" : [ \"ShippingCountry\" ],\n    \"status\" : \"PASS\"\n  }, {\n    \"types\" : [ \"MaximumLength\" ],\n    \"description\" : \"ColumnLength \\\"ShippingState\\\" <= 2\",\n    \"details\" : { },\n    \"applicableFields\" : [ \"ShippingState\" ],\n    \"status\" : \"PASS\"\n  }, {\n    \"types\" : [ \"MaximumLength\" ],\n    \"description\" : \"ColumnLength \\\"ShippingCity\\\" <= 8\",\n    \"details\" : { },\n    \"applicableFields\" : [ \"ShippingCity\" ],\n    \"status\" : \"PASS\"\n  }, {\n    \"types\" : [ \"Completeness\" ],\n    \"description\" : \"Completeness \\\"ShippingStreet\\\" >= 0.59\",\n    \"details\" : { },\n    \"applicableFields\" : [ \"ShippingStreet\" ],\n    \"status\" : \"PASS\"\n  }, {\n    \"types\" : [ \"MaximumLength\" ],\n    \"description\" : \"ColumnLength \\\"ShippingStreet\\\" <= 101\",\n    \"details\" : { },\n    \"applicableFields\" : [ \"ShippingStreet\" ],\n    \"status\" : \"PASS\"\n  }, {\n    \"types\" : [ \"MaximumLength\" ],\n    \"description\" : \"ColumnLength \\\"BillingCountry\\\" <= 6\",\n    \"details\" : { },\n    \"applicableFields\" : [ \"BillingCountry\" ],\n    \"status\" : \"PASS\"\n  }, {\n    \"types\" : [ \"Completeness\" ],\n    \"description\" : \"Completeness \\\"biLlingcountry\\\" >= 0.5\",\n    \"details\" : {\n      \"EVALUATION_MESSAGE\" : \"Value: 0.26666666666666666 does not meet the constraint requirement!\"\n    },\n    \"applicableFields\" : [ \"biLlingcountry\" ],\n    \"status\" : \"FAIL\"\n  }, {\n    \"types\" : [ \"Completeness\" ],\n    \"description\" : \"Completeness \\\"Billingstreet\\\" >= 0.5\",\n    \"details\" : { },\n    \"applicableFields\" : [ \"Billingstreet\" ],\n    \"status\" : \"PASS\"\n  } ],\n  \"passingPercentage\" : 88.0,\n  \"evaluationsCount\" : 8\n}",
           "formName": "shortschemaruleset",
           "id": "athp9dyw75gzhj",
           "timestamp": 1.71700477757E9,
           "typeIdentifier": "amazon.datazone.DataQualityResultFormType",
           "typeRevision": "8"
       },
       "formName": "shortschemaruleset"
   }
   ```

   You can obtain the form type model used by this payload by invoking the `GetFormType` action:

   ```
   aws datazone get-form-type --domain-identifier <your_domain_id> --form-type-identifier amazon.datazone.DataQualityResultFormType --region <domain_region> --output text --query 'model.smithy'
   ```

1. Invoke the `DeleteTimeSeriesDataPoints` API as follows:

   ```
   aws datazone delete-time-series-data-points \
   --domain-identifier dzd_bqqlk3nz21zp2f \
   --entity-identifier dzd_bqqlk3nz21zp2f \
   --entity-type ASSET \
   --form-name rulesET1
   ```
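
The `content` field in the `PostTimeSeriesDataPoints` payload shown in step 1 is itself a JSON document serialized as a string, which is why it appears escaped. A small Python sketch (with an illustrative rule) shows how to produce and verify that double-encoded field:

```python
import json

# Illustrative evaluation results; field names follow the
# amazon.datazone.DataQualityResultFormType shape shown above.
evaluations = {
    "evaluations": [
        {
            "types": ["Completeness"],
            "description": 'Completeness "BillingCountry" >= 0.5',
            "details": {},
            "applicableFields": ["BillingCountry"],
            "status": "PASS",
        }
    ],
    "passingPercentage": 100.0,
    "evaluationsCount": 1,
}

form = {
    # The inner document is serialized to a string here, so it shows up
    # escaped when the outer payload is itself serialized to JSON.
    "content": json.dumps(evaluations),
    "formName": "shortschemaruleset",
    "typeIdentifier": "amazon.datazone.DataQualityResultFormType",
    "typeRevision": "8",
}

# Decoding the content string recovers the original evaluations document.
recovered = json.loads(form["content"])
```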

# Using machine learning and generative AI in Amazon SageMaker Unified Studio


**Note**  
Powered by Amazon Bedrock: AWS implements automated abuse detection. Because the AI recommendations for assets in Amazon SageMaker Unified Studio are built on Amazon Bedrock, users inherit the controls implemented in Amazon Bedrock to enforce safety, security, and the responsible use of AI.

In the current release of Amazon SageMaker Unified Studio, you can use the AI recommendations for names, descriptions, and glossary terms functionality to automate data discovery and cataloging. 

Powered by Amazon Bedrock's large language models, the AI recommendations for data asset names, descriptions, and glossary terms in Amazon SageMaker Unified Studio help you ensure that your data is comprehensible and easily discoverable. The AI recommendations also suggest the most pertinent analytical applications for datasets. By reducing manual documentation tasks and advising on appropriate data usage, auto-generated names and descriptions help enhance the trustworthiness of your data, minimize the chance of overlooking valuable data, and accelerate informed decision making.

AI recommendations for glossary terms is a feature that automatically analyzes asset metadata and context to determine the most relevant business glossary terms for each asset and its columns. Instead of relying on manual tagging or static rules, it reasons about the data and performs iterative searches across what already exists in the customer’s environment to identify the best-fit glossary term concepts. Because the system suggests terms only from glossaries and definitions already present in the system, customers are encouraged to maintain high-quality, well-described glossary entries so the AI can return accurate and meaningful suggestions. This improves metadata quality, strengthens governance, accelerates data onboarding, and reduces manual stewardship effort at scale.

## Supported Regions for the AI recommendations for names and descriptions


In the current Amazon SageMaker Unified Studio release, the AI recommendations for names and descriptions feature is supported in the following regions:
+ US East (N. Virginia)
+ US West (Oregon)
+ Asia Pacific (Tokyo)
+ Europe (Frankfurt)
+ Asia Pacific (Sydney)
+ Canada (Central)
+ Europe (London)
+ South America (São Paulo)
+ Europe (Ireland)
+ Asia Pacific (Singapore)
+ US East (Ohio)
+ Asia Pacific (Seoul)

Amazon SageMaker Unified Studio supports Business Description Generation in the following Regions:
+ Asia Pacific (Mumbai)
+ Europe (Paris)

Amazon SageMaker Unified Studio supports Business Name Generation in the following Regions:
+ Europe (Stockholm)

**Bedrock Cross Region Inference**  
Amazon SageMaker Unified Studio uses Amazon Bedrock's cross-Region inference endpoint to serve recommendations for the US East (Ohio) Region. All other Regions use in-region endpoints.

## Supported Regions for the AI recommendations for glossary terms


In the current Amazon SageMaker Unified Studio release, the AI recommendations for glossary terms feature is supported in the following regions:
+ US East (N. Virginia)
+ US West (Oregon)
+ Asia Pacific (Tokyo)
+ Europe (Frankfurt)
+ Asia Pacific (Sydney)
+ Europe (London)
+ Europe (Ireland)
+ Asia Pacific (Singapore)
+ US East (Ohio)
+ Asia Pacific (Seoul)
+ Asia Pacific (Mumbai)
+ Europe (Paris)
+ Europe (Stockholm)

**Bedrock Cross Region Inference**  
Amazon SageMaker Unified Studio leverages Amazon Bedrock's Cross Region inference endpoint to serve recommendations for all of the supported regions for AI recommendations for glossary terms. 

## Steps to use GenAI


The following procedure describes how to generate AI recommendations for names, descriptions, and glossary terms in Amazon SageMaker Unified Studio:
+ Navigate to Amazon SageMaker Unified Studio using the URL from your admin and log in using your SSO or AWS credentials. 
+ Choose the project that contains the asset for which you want to generate AI recommendations.

### Generating Business Descriptions and Summaries

+ Navigate to the **Data** tab for the project.
+ From **Project catalog**, choose **Assets** and choose the asset for which you want to generate AI recommendations for descriptions.
+ On the asset's details page, in the **Business metadata** tab, choose **Generate descriptions**.

### Generating glossary terms

+ Navigate to the **Data** tab for the project.
+ From **Project catalog**, choose **Assets** and choose the asset for which you want to generate AI recommendations for glossary terms.
+ On the asset's details page, in the **Business metadata** tab, choose **Generate terms**.

### Generating Business Names

+ Navigate to the **Data** tab for the project.
+ In the left navigation pane, choose **Data sources**, and then choose the data source for which you want to enable business name generation.
+ Go to the **Details** tab and enable the **AUTOMATED BUSINESS NAME GENERATION** configuration.
+ Business names can also be generated programmatically when creating an asset by enabling the `businessNameGeneration` flag under `predictionConfiguration` in the [CreateAsset API](https://docs.aws.amazon.com/datazone/latest/APIReference/API_CreateAsset.html) payload.
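As an illustration, the request payload can be assembled as follows. The flag name follows the Amazon DataZone CreateAsset API reference; the domain, project, asset-type, and asset names below are placeholder assumptions, not real identifiers:

```python
import json

def build_create_asset_request(domain_id, project_id, name, type_id):
    """Sketch of a CreateAsset payload with automated business name generation enabled."""
    return {
        "domainIdentifier": domain_id,
        "owningProjectIdentifier": project_id,
        "name": name,
        "typeIdentifier": type_id,
        # predictionConfiguration.businessNameGeneration turns on name generation
        "predictionConfiguration": {
            "businessNameGeneration": {"enabled": True}
        },
    }

# Placeholder identifiers for illustration only
request = build_create_asset_request(
    "dzd_example123", "prj_example456", "sales_orders",
    "amazon.datazone.GlueTableAssetType",
)
print(json.dumps(request, indent=2))
```

A payload shaped like this can then be passed to the CreateAsset operation (for example, via an AWS SDK `datazone` client).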

### Accepting/Rejecting Predictions

+ Once the metadata suggestions (names, descriptions, or terms) are generated, you can edit, accept, or reject them.
+ Sparkle icons are displayed next to each piece of automatically generated metadata (name, description, or terms) for the data asset. In the **Business metadata** tab, you can choose the sparkle icon next to the automatically generated **Summary**, and then choose **Edit**, **Accept**, or **Reject** to address the generated description.
+ You can also choose the **Accept all** or **Reject all** options that are displayed at the top of the page when the **Business metadata** tab is selected to perform that action on all automatically generated metadata (names, descriptions, or terms).
+ Alternatively, you can choose the **Schema** tab and address automatically generated metadata individually by choosing the sparkle icon for one suggested metadata change at a time and then choosing **Accept** or **Reject**.
+ In the **Schema** tab, you can also choose **Accept all** or **Reject all** to perform that action on all automatically generated metadata.

To publish the asset to the catalog with the generated descriptions, choose **Publish asset**, and then confirm this action by choosing **Publish asset** again in the **Publish asset** pop up window.

**Note**  
If you don't accept or reject the generated metadata for an asset, and then you publish this asset, this unreviewed automatically generated metadata is not included in the published data asset.

## Support for custom relational asset types


Amazon SageMaker Unified Studio supports generative AI capabilities for custom asset types. Previously, this feature was supported only for the managed AWS Glue and Amazon Redshift asset types.

To enable this feature, create your own asset type definition and attach `RelationalTableFormType` as one of the forms. Amazon SageMaker Unified Studio automatically detects the presence of such forms and enables generative AI capabilities for these assets. The overall experience remains the same for generating business names (via `predictionConfiguration` in the CreateAsset API), business descriptions (via the **Generate descriptions** button on the asset details page), and glossary terms.
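A sketch of such an asset type definition follows. The field names mirror the Amazon DataZone CreateAssetType API; the form key name, type revision, and all identifiers are illustrative assumptions and should be checked against the API reference:

```python
# Hypothetical CreateAssetType payload attaching RelationalTableFormType so that
# generative AI capabilities activate for assets of this custom type.
asset_type_request = {
    "domainIdentifier": "dzd_example123",  # placeholder domain ID
    "name": "CustomRelationalAssetType",
    "description": "Custom asset type with relational table metadata",
    "formsInput": {
        "RelationalTableForm": {  # assumed form key name
            "typeIdentifier": "amazon.datazone.RelationalTableFormType",
            "typeRevision": "1",  # assumed revision
            "required": False,
        }
    },
}
```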

For more information about creating custom asset types, see [Create custom asset types in Amazon SageMaker Unified Studio](create-asset-types.md).

## Quotas


Amazon SageMaker Unified Studio supports different quotas for business name generation and business description generation. You can contact AWS Support to request an increase in these quotas.
+ BusinessDescriptionGeneration: 10K invocations/month
+ BusinessNameGeneration: 50K invocations/month
+ GlossaryTermGeneration: 10K invocations/month

# Data lineage in Amazon SageMaker Unified Studio

Data lineage in Amazon SageMaker Unified Studio is an OpenLineage-compatible feature that can help you to capture and visualize lineage events, from OpenLineage-enabled systems or through APIs, to trace data origins, track transformations, and view cross-organizational data consumption. It provides you with an overarching view into your data assets to see the origin of assets and their chain of connections. The lineage data includes information on the activities inside the Amazon SageMaker Catalog, including information about the catalogued assets, the subscribers of those assets, and the activities that happen outside the business data catalog captured programmatically using the APIs.

**Topics**
+ [What is OpenLineage?](datazone-data-lineage-what-is-openlineage.md)
+ [Data lineage support](datazone-data-lineage-support.md)
+ [Data lineage support matrix](datazone-support-matrix.md)
+ [Visualizing data lineage](datazone-visualizing-data-lineage.md)
+ [Test drive data lineage](datazone-data-lineage-sample-experience.md)
+ [Data lineage authorization](datazone-data-lineage-authorization.md)
+ [Automate lineage capture from data connections](datazone-data-lineage-automate-capture-from-data-connections.md)
+ [Automate lineage capture from tools](datazone-data-lineage-automate-capture-from-tools.md)
+ [Permissions required for data lineage](datazone-data-lineage-permissions.md)
+ [Publishing data lineage programmatically](datazone-data-lineage-apis.md)
+ [The importance of the sourceIdentifier attribute to lineage nodes](datazone-data-lineage-sourceIdentifier-attribute.md)
+ [Linking dataset lineage nodes with assets imported into Amazon SageMaker Unified Studio](datazone-data-lineage-linking-nodes.md)
+ [Troubleshooting data lineage](datazone-lineage-troubleshooting.md)

# What is OpenLineage?


[OpenLineage](https://openlineage.io/) is an open platform for collection and analysis of data lineage. It tracks metadata about datasets, jobs, and runs, giving users the information required to identify the root cause of complex issues and understand the impact of changes. It is an Open Standard for lineage metadata collection designed to record metadata for a job in execution.

The standard defines a generic model of dataset, job, and run entities uniquely identified using consistent naming strategies. The dataset and job entities are identified by a combination of 'namespace' and 'name' attributes, whereas a run is identified by a runId. The entities can be enriched with user-defined metadata via facets (similar to metadata forms in Amazon SageMaker Unified Studio).

OpenLineage supports three types of events: RunEvent, DatasetEvent, and JobEvent.
+ RunEvent: this event is generated as a result of a job-run execution. It contains details of the run, the job it belongs to, the input datasets the run consumes, and the output datasets the run produces. Currently, Amazon SageMaker Unified Studio only supports RunEvents.
+ DatasetEvent: this event represents changes in a dataset (such as static updates on the dataset).
+ JobEvent: this event represents changes in job configuration or details.

In the current release of Amazon SageMaker Unified Studio, OpenLineage versions 1.22.0 and later are supported.
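To make the model concrete, here is a minimal RunEvent payload sketch following the OpenLineage specification. All namespaces, dataset names, and the producer URL are placeholder assumptions, not values required by Amazon SageMaker Unified Studio:

```python
import uuid
from datetime import datetime, timezone

# A run is identified by its runId; the job by namespace + name, per OpenLineage.
run_event = {
    "eventType": "COMPLETE",
    "eventTime": datetime.now(timezone.utc).isoformat(),
    "run": {"runId": str(uuid.uuid4())},
    "job": {"namespace": "example-pipeline", "name": "daily_orders_load"},
    # Input datasets the run consumed and output datasets it produced
    "inputs": [{"namespace": "s3://example-bucket", "name": "raw/orders"}],
    "outputs": [{"namespace": "example-redshift", "name": "analytics.orders"}],
    "producer": "https://example.com/lineage-producer",
    "schemaURL": "https://openlineage.io/spec/1-0-5/OpenLineage.json#/definitions/RunEvent",
}
```

An event shaped like this is what OpenLineage-enabled systems emit and what the PostLineageEvent API accepts as the event body.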

# Data lineage support


In Amazon SageMaker Unified Studio, domain administrators or data users can configure lineage in projects while setting up connections for data lake and data warehouse sources to ensure the data source runs created from those resources are enabled for automatic lineage capture. Data lineage is automatically captured from data sources, such as AWS Glue and Amazon Redshift, as well as from tools, such as Visual ETL and notebooks, as executions create, update, or transform data. Additionally, Amazon SageMaker Unified Studio captures the movement of data within the catalog as producers bring their assets into inventory and publish them, as well as when consumers subscribe and get access, to indicate who the subscribing projects are for a given asset. With this automation, the different stages of an asset in the catalog are captured, including when schema changes are detected.

Using Amazon SageMaker Unified Studio's OpenLineage-compatible APIs, domain administrators and data producers can capture and store lineage events beyond what is available in Amazon SageMaker Unified Studio, including transformations in Amazon S3, AWS Glue, and other services. This provides a comprehensive view for the data consumers and helps them gain confidence of the asset's origin, while data producers can assess the impact of changes to an asset by understanding its usage. Additionally, Amazon SageMaker Unified Studio versions lineage with each event, enabling users to visualize lineage at any point in time or compare transformations across an asset's or job's history. This historical lineage provides a deeper understanding of how data has evolved, essential for troubleshooting, auditing, and ensuring the integrity of data assets.

With data lineage, you can accomplish the following in Amazon SageMaker Unified Studio: 
+ Understand the provenance of data: knowing where the data originated fosters trust in data by providing you with a clear understanding of its origins, dependencies, and transformations. This transparency helps in making confident data-driven decisions.
+ Understand the impact of changes to data pipelines: when changes are made to data pipelines, lineage can be used to identify all of the downstream consumers that are to be affected. This helps to ensure that changes are made without disrupting critical data flows.
+ Identify the root cause of data quality issues: if a data quality issue is detected in a downstream report, lineage, especially column-level lineage, can be used to trace the data back (at a column level) to identify the issue back to its source. This can help data engineers to identify and fix the problem.
+ Improve data governance and compliance: column-level lineage can be used to demonstrate compliance with data governance and privacy regulations. For example, column-level lineage can be used to show where sensitive data (such as PII) is stored and how it is processed in downstream activities.

**OpenLineage custom transport to send lineage events to SageMaker**

OpenLineage events, which contain metadata about data pipelines, jobs, and runs, are typically sent to a backend for storage and analysis. The transport mechanism handles this transmission. As an extension of the OpenLineage project, a custom transport is available to send lineage events directly to Amazon SageMaker Unified Studio's endpoint. The custom transport was merged into OpenLineage version 1.33.0 ([https://openlineage.io/docs/releases/1_33_0/](https://openlineage.io/docs/releases/1_33_0/)). This allows the use of OpenLineage plugins with the transport to send collected lineage events directly to Amazon SageMaker Unified Studio.

# Data lineage support matrix


Lineage capture is automated from the following tools in Amazon SageMaker Unified Studio:


**Tools support matrix**  

| **Tool** | **Compute** | **AWS Service** | **Service deployment option** | **Support status** | **Notes** | 
| --- | --- | --- | --- | --- | --- | 
| Jupyterlab notebook | Spark | EMR | EMR Serverless | Automated | Spark DataFrames only; remote workflow execution | 
| Jupyterlab notebook | Spark | AWS Glue | N/A | Automated | Spark DataFrames only; remote workflow execution | 
| Visual ETL | Spark | AWS Glue | compatibility mode | Automated | Spark DataFrames only | 
| Visual ETL | Spark | AWS Glue | fineGrained mode | Not supported | Spark DataFrames only | 
| Query Editor |  | Amazon Redshift |  | Automated |  | 

Lineage is captured from the following services: 


**Services support matrix**  

| **Data source** | **Lineage Support status** | **Required Configuration** | **Notes** | 
| --- | --- | --- | --- | 
| AWS Glue Crawler | Automated by default in SageMaker Unified Studio | None | Supported for assets crawled via AWS Glue Crawler for the following data sources: Amazon S3, Amazon DynamoDB, Amazon S3 Open Table Formats including: Delta Lake, Iceberg tables, Hudi tables, JDBC, PostgreSql, DocumentDB, and MongoDB. | 
| Amazon Redshift | Automated by default in SageMaker Unified Studio | None | Redshift System tables will be used to retrieve user queries and lineage is generated by parsing those queries | 
| AWS Glue jobs in AWS Glue console  | Not automated by default | User can select "generate lineage events" and pass domainId  |  | 
| EMR | Not automated by default | User has to pass spark conf parameters to publish lineage events | Supported versions: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/sagemaker-unified-studio/latest/userguide/datazone-support-matrix.html)  More details in [Capture lineage from EMR Spark executions](datazone-data-lineage-automate-capture-from-tools.md#datazone-data-lineage-automate-capture-from-tools-emrnotebook) | 

# Visualizing data lineage


In Amazon SageMaker Unified Studio, nodes in the lineage graph contain lineage information, while edges represent the upstream/downstream directions of data propagation. The lineage information is present in metadata forms attached to the lineage node. Amazon SageMaker Unified Studio defines three types of lineage nodes: 
+ Dataset node - this node includes data lineage information about a specific dataset.
  + Dataset refers to any object such as a table, view, Amazon S3 file, or Amazon S3 bucket. It also refers to assets in Amazon SageMaker Unified Studio's inventory and catalog, and to subscribed tables/views.
  + Each version of the dataset node represents an event happening on the dataset at that timestamp. The history tab on the dataset node shows all dataset versions.
+ Job node - this node includes job details such as the job type (query, ETL, and so on) and the processing type (batch, streaming).
+ JobRun node - this node represents job run details such as the job it belongs to, status, and start/end timestamps. Amazon SageMaker Unified Studio's lineage graph shows a combined node for the job and job run, which displays job details and the latest run details along with a history of previous job runs.

The lineage graph is visualized with an asset as the base node. In SageMaker Unified Studio, search for assets, open any asset, and you can see its lineage on the asset details page. 

Here is the sample lineage graph for a user who is a data producer:

![\[Sample lineage graph for a user who is a data producer.\]](http://docs.aws.amazon.com/sagemaker-unified-studio/latest/userguide/images/Screenshot4datalineage.png)


Here is the sample lineage graph for a user who is a data consumer:

![\[Sample lineage graph for a user who is a data consumer.\]](http://docs.aws.amazon.com/sagemaker-unified-studio/latest/userguide/images/Screenshot5datalineage.png)


The asset details page provides the following capabilities to navigate the graph:
+ Column-level lineage: expand column-level lineage when available in dataset nodes. This automatically shows relationships with upstream or downstream dataset nodes if source column information is available.
+ Column search: by default, 10 columns are displayed. If there are more than 10 columns, pagination is activated to navigate to the rest of the columns. To quickly view a particular column, you can search on the dataset node to list just that column.
+ View dataset nodes only: if you want to toggle to view only dataset lineage nodes and filter out the job nodes, you can choose the **Open view control** icon on the top left of the graph viewer and toggle the **Display dataset nodes only** option. This will remove all the job nodes from the graph and lets you navigate just the dataset nodes. Note that when the view only dataset nodes is turned on, the graph cannot be expanded upstream or downstream.
+ Details pane: Each lineage node has details captured and displayed when selected.
  + Dataset node has a details pane that displays all the details captured for that node at a given timestamp. Every dataset node has three tabs: Lineage info, Schema, and History. The History tab lists the different versions of the lineage event captured for that node. All details captured from the API are displayed using metadata forms or a JSON viewer.
  + Job node has a details pane that displays job details in two tabs: Job info and History. The details pane also shows queries or expressions captured as part of the job run. The History tab lists the different job runs captured for that job. All details captured from the API are displayed using metadata forms or a JSON viewer.
+ History tab: all lineage nodes in Amazon SageMaker Unified Studio's lineage have versioning. For every dataset node or job node, the versions are captured as history and that enables you to navigate between the different versions to identify what has changed over time. Each version opens a new tab in the lineage page to help compare or contrast.

# Aggregated lineage view


You can view an asset's lineage in two ways:
+ **Aggregated view** - Shows all jobs that are currently contributing to an asset's lineage, providing a complete picture of the data transformations and dependencies across multiple levels of the lineage graph. Use this view to understand the full scope of jobs impacting your datasets and to identify all upstream sources and downstream consumers.
+ **Timestamp view** - Shows the lineage graph as it existed at a specific point in time, displaying only the latest job run for each job at that timestamp. This view includes column-level lineage and is useful for troubleshooting and investigating specific data processing events.

The aggregated view is the default in most regions and shows the current state of your data lineage. In Opt-In Regions, only the timestamp view is available.

To switch between views, choose the **Open view control** icon in the top left of the lineage graph viewer and toggle the **Display in event timestamp order** option. When enabled, the timestamp view is displayed. When disabled, the aggregated view is displayed. This toggle is not available in Opt-In Regions.

Here is a sample aggregated view of a lineage graph:

![\[Sample aggregated view of a lineage graph showing all jobs currently contributing to the asset.\]](http://docs.aws.amazon.com/sagemaker-unified-studio/latest/userguide/images/Screenshot6datalineage.png)


Here is a sample timestamp view of a lineage graph:

![\[Sample timestamp view of a lineage graph showing the latest job run at a specific point in time.\]](http://docs.aws.amazon.com/sagemaker-unified-studio/latest/userguide/images/Screenshot7datalineage.png)


# Test drive data lineage


You can use the data lineage sample experience to browse and understand data lineage in Amazon SageMaker Unified Studio, including traversing upstream or downstream in your data lineage graph, exploring versions and column-level lineage.

Complete the following procedure to try the sample data lineage experience in Amazon SageMaker Unified Studio:

1. Navigate to Amazon SageMaker Unified Studio using the URL from your admin and log in using your SSO or AWS credentials. 

1. Choose **Select project** from the top navigation pane and select the project you want to view lineage in.

1. Under **Project catalog** in the left side navigation, choose **Assets**.

1. On the **Inventory** tab, choose the name of the asset that you want to view lineage for. This opens the asset details page.

1. On the asset details page, choose the **Lineage** tab.

1. In the data lineage window, choose the info icon that says **Try sample data lineage**. Then choose **Launch**. A new pop-up window appears.

1. Choose **Start guided data lineage tour**.

1. Select a guided tour option, and then choose **Start tour**.

   At this point, a tab that provides a full-page view of the lineage information is displayed. The sample data lineage graph is initially displayed with a base node and 1-depth at either end, upstream and downstream. You can expand the graph upstream or downstream. Column information is also available for you to choose and see how lineage flows through the nodes. 

# Data lineage authorization


**Write permissions** - to publish lineage events into Amazon SageMaker Unified Studio, you must have an IAM role with a permissions policy that includes an ALLOW action on the PostLineageEvent API. This IAM authorization happens at the API Gateway layer.

**Read permissions to view lineage** - GetLineageNode and ListLineageNodeHistory are included in the AmazonSageMakerDomainExecution managed policy, and therefore every user in an Amazon SageMaker Unified Studio domain can invoke these APIs to view the data lineage graph in Amazon SageMaker Unified Studio.

**Read permissions to get lineage events** - you must have an IAM role with a policy that includes an ALLOW action on the ListLineageEvents and GetLineageEvent APIs to view lineage events posted to Amazon SageMaker Unified Studio.
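Putting the write and event-read permissions together, a minimal identity policy might look like the following sketch. The `Resource: "*"` scoping is illustrative only; in practice you would scope the statements to your domain's ARN:

```python
import json

# Hedged sketch of an IAM policy granting the lineage actions described above.
lineage_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "PublishLineage",
            "Effect": "Allow",
            "Action": ["datazone:PostLineageEvent"],
            "Resource": "*",  # illustrative; scope to your domain ARN
        },
        {
            "Sid": "ReadLineageEvents",
            "Effect": "Allow",
            "Action": ["datazone:ListLineageEvents", "datazone:GetLineageEvent"],
            "Resource": "*",  # illustrative; scope to your domain ARN
        },
    ],
}
print(json.dumps(lineage_policy, indent=2))
```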

# Automate lineage capture from data connections


**Topics**
+ [Configure automated lineage capture for AWS Glue (Lakehouse) connections](#datazone-data-lineage-automate-capture-from-data-connections-glue)
+ [Configure automated lineage capture for Amazon Redshift connections](#datazone-data-lineage-automate-capture-from-data-connections-redshift)

## Configure automated lineage capture for AWS Glue (Lakehouse) connections


As databases and tables are added to the Amazon SageMaker Unified Studio catalog, lineage extraction from the source can be automated for those assets using data source runs in the Create Connection workflow. Lineage is not automatically enabled for every connection that is created. 

**To enable lineage capture for an AWS Glue connection**

1. Navigate to Amazon SageMaker Unified Studio using the URL from your admin and log in using your SSO or AWS credentials.

1. Choose **Select project** from the top navigation pane and select the project to which you want to add the data source.

1. Choose **Data sources** from the left navigation pane under Project catalog.

1. Choose the data source that you want to modify.

1. Expand the **Actions** menu and choose **Edit data source**, or click the data source run name to view the details, go to the **Data Source Definition** tab, and choose **Edit** in **Connection details**. 

1. Go to the connections and select the **Import data lineage** checkbox to configure lineage capture from the source. 

1. Make other changes to the data source fields as desired, then choose **Save**.

   **Limitations**

   Lineage is captured only for crawlers that imported fewer than 250 tables in a crawler run.

**Note**  
When enabled, the lineage run executes asynchronously to capture metadata from the source and generate lineage events, which are stored in the SageMaker Catalog and visualized from a particular asset. The status of lineage runs for the data source can be viewed along with the data source run details. 

## Configure automated lineage capture for Amazon Redshift connections


Capturing lineage from Amazon Redshift can be automated when the connection is added to an Amazon Redshift source in Amazon SageMaker Unified Studio's data explorer. Lineage capture can be automated for a connection in the data source configuration. Lineage is not automatically enabled for every connection that is created. 

**To enable lineage capture for an Amazon Redshift connection**

1. Navigate to Amazon SageMaker Unified Studio using the URL from your admin and log in using your SSO or AWS credentials.

1. Choose **Select project** from the top navigation pane and select the project to which you want to add the data source.

1. Choose **Data sources** from the left navigation pane under **Project catalog**.

1. Choose the data source that you want to modify.

1. Expand the **Actions** menu and choose **Edit data source**, or click the data source run name to view the details, go to the **Data Source Definition** tab, and select **Edit** in **Connection details**. 

1. Go to the connections and select the **Import data lineage** checkbox to configure lineage capture from the source. 

1. Make other changes to the data source fields as desired, then choose **Save**.

**Note**  
When enabled, the lineage run captures queries executed for a given database and generates lineage events, which are stored in Amazon DataZone and visualized from a particular asset. The lineage run for Amazon Redshift is set up as a daily run that pulls from the Amazon Redshift system tables to derive lineage. After you enable the feature, the first pull is scheduled for 15 minutes later and then set for a daily run. You can configure a specific time programmatically. 

# Automate lineage capture from tools


**Topics**
+ [Capture lineage for Spark executions in Visual ETL](#datazone-data-lineage-automate-capture-from-tools-vetl)
+ [Capture lineage for AWS Glue Spark executions in Notebooks](#datazone-data-lineage-automate-capture-from-tools-gluenotebook)
+ [Capture lineage from EMR Spark executions](#datazone-data-lineage-automate-capture-from-tools-emrnotebook)

## Capture lineage for Spark executions in Visual ETL


When a new job is created in Visual ETL in Amazon SageMaker Unified Studio, lineage is automatically enabled. When a Visual ETL flow is created, lineage capture for that flow is automatically enabled when you choose **Save to Project**. For every flow to capture lineage automatically, select **Save to Project** and then select **Run**.

**Note:** if lineage is not being captured, select **Save**, move back to the Visual ETL flows section, and then reopen the Visual ETL flow.

The following Spark configuration parameters are automatically added to the job being executed. When invoking Visual ETL programmatically, use the configuration below.

```
{
    "--conf":"spark.extraListeners=io.openlineage.spark.agent.OpenLineageSparkListener 
    --conf spark.openlineage.transport.type=amazon_datazone_api 
    --conf spark.openlineage.transport.domainId={DOMAIN_ID} 
    --conf spark.glue.accountId={ACCOUNT_ID} 
    --conf spark.openlineage.facets.custom_environment_variables=[AWS_DEFAULT_REGION;GLUE_VERSION;GLUE_COMMAND_CRITERIA;GLUE_PYTHON_VERSION;]
    --conf spark.openlineage.columnLineage.datasetLineageEnabled=True
    --conf spark.glue.JOB_NAME={JOB_NAME}"
}
```

The parameters are auto-configured and do not need any updates from the user. To understand the parameters in detail: 
+ `spark.extraListeners=io.openlineage.spark.agent.OpenLineageSparkListener` - the OpenLineageSparkListener is created and registered with Spark's listener bus.
+ `spark.openlineage.transport.type=amazon_datazone_api` - this is an OpenLineage setting that tells the OpenLineage plugin to use the DataZone API transport to emit lineage events to DataZone's PostLineageEvent API. For more information, see [https://openlineage.io/docs/integrations/spark/configuration/spark_conf/](https://openlineage.io/docs/integrations/spark/configuration/spark_conf/).
+ `spark.openlineage.transport.domainId={DOMAIN_ID}` - this parameter establishes the domain to which the API transport submits the lineage events.
+ `spark.openlineage.facets.custom_environment_variables=[AWS_DEFAULT_REGION;GLUE_VERSION;GLUE_COMMAND_CRITERIA;GLUE_PYTHON_VERSION;]` - these environment variables (`AWS_DEFAULT_REGION`, `GLUE_VERSION`, `GLUE_COMMAND_CRITERIA`, and `GLUE_PYTHON_VERSION`), which the AWS Glue interactive session populates, are added to the LineageEvent.
+ `spark.glue.accountId={ACCOUNT_ID}` - account ID of the AWS Glue Data Catalog where the metadata resides. This account ID is used to construct the Glue ARN in the lineage event.
+ `spark.glue.JOB_NAME` - job name of the lineage event. In a Visual ETL flow, the job name is automatically configured as `spark.glue.JOB_NAME: {projectId}.{pathToNotebook}`.
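For example, when starting the generated AWS Glue job programmatically (such as through Glue's StartJobRun job arguments), the same parameters can be assembled into a single `--conf` argument string. The domain ID, account ID, and job name below are placeholder assumptions:

```python
# Placeholders - substitute your own values
DOMAIN_ID = "dzd_example123"   # SageMaker Unified Studio domain ID
ACCOUNT_ID = "111122223333"    # account that owns the Glue Data Catalog
JOB_NAME = "my-vetl-flow"      # hypothetical job name

# Build the single --conf string in the same shape as the block above
lineage_conf = " ".join([
    "spark.extraListeners=io.openlineage.spark.agent.OpenLineageSparkListener",
    "--conf spark.openlineage.transport.type=amazon_datazone_api",
    f"--conf spark.openlineage.transport.domainId={DOMAIN_ID}",
    f"--conf spark.glue.accountId={ACCOUNT_ID}",
    "--conf spark.openlineage.facets.custom_environment_variables="
    "[AWS_DEFAULT_REGION;GLUE_VERSION;GLUE_COMMAND_CRITERIA;GLUE_PYTHON_VERSION;]",
    "--conf spark.openlineage.columnLineage.datasetLineageEnabled=True",
    f"--conf spark.glue.JOB_NAME={JOB_NAME}",
])
arguments = {"--conf": lineage_conf}
```

The resulting `arguments` dictionary can be supplied as the job arguments when the job is invoked.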

**Spark compute limitations**
+ OpenLineage libraries for Spark are built into AWS Glue 5.0 and later for Spark DataFrames only. AWS Glue DynamicFrames are not supported.
+ A LineageEvent has a size limit of 300 KB.

## Capture lineage for AWS Glue Spark executions in Notebooks


Sessions in notebooks do not have a concept of a job. You can map Spark executions to lineage events by generating a unique job name for the notebook. You can use the `%%configure` magic with the parameters below to enable lineage capture for Spark executions in the notebook. 

Note: for AWS Glue Spark executions in notebooks, lineage capture is automated when scheduled with a workflow in a shared environment using remote workflows.

```
%%configure --name {COMPUTE_NAME} -f
{
"--conf":"spark.extraListeners=io.openlineage.spark.agent.OpenLineageSparkListener --conf spark.openlineage.transport.type=amazon_datazone_api --conf spark.openlineage.transport.domainId={DOMAIN_ID} --conf spark.glue.accountId={ACCOUNT_ID} --conf spark.openlineage.columnLineage.datasetLineageEnabled=True --conf spark.openlineage.facets.custom_environment_variables=[AWS_DEFAULT_REGION;GLUE_VERSION;GLUE_COMMAND_CRITERIA;GLUE_PYTHON_VERSION;] --conf spark.glue.JOB_NAME={JOB_NAME}" 
}
```

Examples of `{COMPUTE_NAME}`: `project.spark.compatibility` or `project.spark.fineGrained`

Here are these parameters and what they configure, in detail:
+ `spark.extraListeners=io.openlineage.spark.agent.OpenLineageSparkListener`
  + The OpenLineageSparkListener is created and registered with Spark's listener bus.
+ `spark.openlineage.transport.type=amazon_datazone_api`
  + [https://openlineage.io/docs/integrations/spark/configuration/spark_conf](https://openlineage.io/docs/integrations/spark/configuration/spark_conf)
  + This is an OpenLineage setting that tells the OpenLineage plugin to use the DataZone API transport to emit lineage events to DataZone's PostLineageEvent API to be captured in SageMaker.
+ `spark.openlineage.transport.domainId={DOMAIN_ID}`
  + This parameter establishes the domain to which the API transport submits the lineage events.
+ `spark.openlineage.facets.custom_environment_variables=[AWS_DEFAULT_REGION;GLUE_VERSION;GLUE_COMMAND_CRITERIA;GLUE_PYTHON_VERSION;]`
  + These environment variables (`AWS_DEFAULT_REGION`, `GLUE_VERSION`, `GLUE_COMMAND_CRITERIA`, and `GLUE_PYTHON_VERSION`), which the AWS Glue interactive session populates, are added to the LineageEvent.
+ `spark.glue.accountId={ACCOUNT_ID}`
  + Account ID of the AWS Glue Data Catalog where the metadata resides. This account ID is used to construct the Glue ARN in the lineage event.
+ [optional] `spark.openlineage.transport.region={DOMAIN_REGION}`
  + If the domain region is different from the job's execution region, pass this parameter with the domain's region as its value.
+ `spark.glue.JOB_NAME`
  + Job name of the lineage event. For example, the job name can be set as `spark.glue.JOB_NAME: {projectId}.{pathToNotebook}`.

## Capture lineage from EMR Spark executions


EMR with the Spark engine has the necessary OpenLineage libraries built in. You need to pass the following Spark parameters. Be sure to replace `{DOMAIN_ID}` with your specific Amazon DataZone or Amazon SageMaker Unified Studio domain and `{ACCOUNT_ID}` with the account ID where the EMR job is run.

```
%%configure --name emr-s.{EMR_SERVERLESS_COMPUTE_NAME}
{   
    "conf": {
         "spark.extraListeners": "io.openlineage.spark.agent.OpenLineageSparkListener",
         "spark.openlineage.columnLineage.datasetLineageEnabled":"True",
         "spark.openlineage.transport.type":"amazon_datazone_api",
         "spark.openlineage.transport.domainId":"{DOMAIN_ID}",
         "spark.openlineage.transport.region":"{DOMAIN_REGION}", // Only needed if the domain is in a different region than the job
         "spark.glue.accountId":"{ACCOUNT_ID}", // Needed if AWS Glue is being used as the Hive metastore
         "spark.jars":"/usr/share/aws/datazone-openlineage-spark/lib/DataZoneOpenLineageSpark-1.0.jar" // Only needed in case of EMR-S
    }
}
```
+ Lineage is supported for the following EMR versions:
  + EMR-S: 7.5+
  + EMR on EC2: 7.11+
  + EMR on EKS: 7.12+
+ The JOB_NAME is the Spark application name, which is set automatically
+ Replace {DOMAIN_ID}, {ACCOUNT_ID}, and {DOMAIN_REGION} with your values
+ Ensure that the Amazon SageMaker Unified Studio VPC endpoint is deployed in the EMR VPC

# Permissions required for data lineage


## Read permissions to view lineage


Permissions on the following actions are needed to view the lineage graph:
+ `datazone:GetLineageNode`
+ `datazone:ListLineageNodeHistory`
+ `datazone:QueryGraph`

The above permissions are included in the `AmazonSageMakerDomainExecution` managed policy, so every user in an Amazon SageMaker Unified Studio domain can invoke these APIs to view the data lineage graph in Amazon SageMaker Unified Studio.

Permissions on the following actions are needed to view lineage events:
+ `datazone:ListLineageEvents`
+ `datazone:GetLineageEvent`

Users must have an IAM role with a policy that includes an "Allow" on these actions to view lineage events posted to Amazon SageMaker Unified Studio.
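For example, a minimal identity-based policy granting these read actions might look like the following (a sketch; scope the Resource to your domain as needed):

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "datazone:ListLineageEvents",
                "datazone:GetLineageEvent"
            ],
            "Resource": "*"
        }
    ]
}
```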

## Write permissions to publish lineage


### Lineage for AWS Glue crawler


The project user role is used to fetch the required data from AWS Glue and should contain the following permissions on AWS Glue operations:
+ `glue:listCrawls`
+ `glue:getConnection`

**Note**  
`SageMakerStudioProjectUserRolePolicy` already contains the above permissions.

### Lineage for Amazon Redshift


The project user role is used to execute queries on the cluster or workgroup defined in the connection and should contain the following permissions:
+ `redshift-data:BatchExecuteStatement`
+ `redshift-data:ExecuteStatement`
+ `redshift-data:DescribeStatement`
+ `redshift-data:GetStatementResult`

**Note**  
`SageMakerStudioProjectUserRolePolicy` already contains the above permissions.

In addition, the credentials provided for the Amazon Redshift connection in Amazon SageMaker Unified Studio should have the following permissions:
+ The `sys:operator` role, to access data from the system tables for all user queries performed on the cluster or workgroup
+ A `SELECT` grant on all the tables

### Lineage for AWS Glue, EMR jobs


The IAM role used to execute the job should contain the following permissions to publish lineage events to Amazon SageMaker Unified Studio:
+ An "Allow" on the `datazone:PostLineageEvent` action
+ If your Amazon SageMaker Unified Studio domain is encrypted with a KMS customer managed key (CMK), the job role should also have permissions to encrypt and decrypt with that key
+ If the Spark job runs in a different account than the Amazon SageMaker Unified Studio domain account, associate that account with the domain before running the job. Follow [https://docs.aws.amazon.com/datazone/latest/userguide/working-with-associated-accounts.html](https://docs.aws.amazon.com/datazone/latest/userguide/working-with-associated-accounts.html) to set up the account association
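A sketch of the corresponding job-role policy statements follows. The KMS statement applies only if the domain uses a CMK, and the key ARN and KMS actions shown are illustrative:

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "datazone:PostLineageEvent",
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": ["kms:Decrypt", "kms:GenerateDataKey"],
            "Resource": "arn:aws:kms:<region>:<account-id>:key/<key-id>"
        }
    ]
}
```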

### Publish Lineage using API


An IAM role with a policy that allows the `datazone:PostLineageEvent` action is needed to post lineage events programmatically.

# Publishing data lineage programmatically


You can also publish data lineage programmatically using the [PostLineageEvent](https://docs.aws.amazon.com/datazone/latest/APIReference/API_PostLineageEvent.html) API, which takes an OpenLineage run event as its payload. Additionally, the following APIs support retrieving lineage events and traversing the lineage graph:
+ [GetLineageEvent](https://docs.aws.amazon.com/datazone/latest/APIReference/API_GetLineageEvent.html)
+ [ListLineageEvents](https://docs.aws.amazon.com/datazone/latest/APIReference/API_ListLineageEvents.html)
+ [QueryGraph](https://docs.aws.amazon.com/datazone/latest/APIReference/API_QueryGraph.html): a paginated API that returns the aggregate view of the lineage graph
+ [GetLineageNode](https://docs.aws.amazon.com/datazone/latest/APIReference/API_GetLineageNode.html): gets a lineage node along with its immediate neighbors
+ [ListLineageNodeHistory](https://docs.aws.amazon.com/datazone/latest/APIReference/API_ListLineageNodeHistory.html): lists the lineage node versions, with each version derived from a data or metadata change event

The following is a sample PostLineageEvent operation payload:

```
{
  "producer": "https://github.com/OpenLineage/OpenLineage",
  "schemaURL": "https://openlineage.io/spec/2-0-0/OpenLineage.json#/definitions/RunEvent",    
  "eventType": "COMPLETE",
  "eventTime": "2024-05-04T10:15:30Z",
  "run": {
    "runId": "d2e7c111-8f3c-4f5b-9ebd-cb1d7995082a"
  },
  "job": {
    "namespace": "xyz.analytics",
    "name": "transform_sales_data"
  },
  "inputs": [
    {
      "namespace": "xyz.analytics",
      "name": "raw_sales",
      "facets": {
        "schema": {
          "_producer": "https://github.com/OpenLineage/OpenLineage",
          "_schemaURL": "https://openlineage.io/spec/facets/schema_dataset.json",
          "fields": [
            { "name": "region", "type": "string" },
            { "name": "year", "type": "int" },
            { "name": "created_at", "type": "timestamp" }
          ]
        }
      }
    }
  ],
  "outputs": [
    {
      "namespace": "xyz.analytics",
      "name": "clean_sales",
      "facets": {
        "schema": {
          "_producer": "https://github.com/OpenLineage/OpenLineage",
          "_schemaURL": "https://openlineage.io/spec/facets/schema_dataset.json",
          "fields": [
            { "name": "region", "type": "string" },
            { "name": "year", "type": "int" },
            { "name": "created_at", "type": "timestamp" }
            
          ]
        },
        "columnLineage": {
          "_producer": "https://github.com/OpenLineage/OpenLineage",
          "_schemaURL": "https://openlineage.io/spec/facets/columnLineageDatasetFacet.json",
          "fields": {
            "id": {
              "inputFields": [
                {
                  "namespace": "xyz.analytics",
                  "name": "raw_sales",
                  "field": "id"
                }
              ]
            },
            "year": {
              "inputFields": [
                {
                  "namespace": "xyz.analytics",
                  "name": "raw_sales",
                  "field": "year"
                }
              ]
            },
            "created_at": {
              "inputFields": [
                {
                  "namespace": "xyz.analytics",
                  "name": "raw_sales",
                  "field": "created_at"
                }
              ]
            }
          }
        }
      }
    }
  ]
}
```
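A payload like the sample above can be assembled in Python and submitted with the AWS SDK. The following is a sketch; the helper name is hypothetical and the domain ID in the commented-out boto3 call is a placeholder:

```python
import json

def build_run_event(job_namespace, job_name, run_id, inputs, outputs,
                    event_type="COMPLETE", event_time="2024-05-04T10:15:30Z"):
    """Assemble a minimal OpenLineage RunEvent for PostLineageEvent."""
    def dataset(namespace, name):
        return {"namespace": namespace, "name": name}
    return {
        "producer": "https://github.com/OpenLineage/OpenLineage",
        "schemaURL": "https://openlineage.io/spec/2-0-0/OpenLineage.json#/definitions/RunEvent",
        "eventType": event_type,
        "eventTime": event_time,
        "run": {"runId": run_id},
        "job": {"namespace": job_namespace, "name": job_name},
        "inputs": [dataset(ns, n) for ns, n in inputs],
        "outputs": [dataset(ns, n) for ns, n in outputs],
    }

event = build_run_event(
    "xyz.analytics", "transform_sales_data",
    "d2e7c111-8f3c-4f5b-9ebd-cb1d7995082a",
    inputs=[("xyz.analytics", "raw_sales")],
    outputs=[("xyz.analytics", "clean_sales")],
)

# Submit with boto3 (requires AWS credentials and a real domain ID):
# import boto3
# boto3.client("datazone").post_lineage_event(
#     domainIdentifier="dzd_EXAMPLE", event=json.dumps(event))
print(event["job"]["name"])
```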

# The importance of the sourceIdentifier attribute to lineage nodes


Every lineage node is uniquely identified by its sourceIdentifier (usually provided as part of the OpenLineage event) in addition to a system-generated nodeId. The sourceIdentifier is generated from the <namespace> and <name> of the node in the lineage event.

The following are examples of sourceIdentifier values for different types of nodes:
+ **Job nodes**
  + The sourceIdentifier of job nodes is populated from <namespace>.<name> of the job node in the OpenLineage run event
+ **Jobrun nodes**
  + The sourceIdentifier of jobrun nodes is populated from <job's namespace>.<job's name>/<run_id>
+ **Dataset nodes**
  + Dataset nodes representing AWS resources: the sourceIdentifier is in ARN format
    + AWS Glue table: arn:aws:glue:<region>:<account-id>:table/<database>/<table-name>
    + AWS Glue table with federated sources: arn:aws:glue:<region>:<account-id>:table/<catalog><database>/<table-name>
      + Example: the catalog can be "s3tablescatalog"/"s3tablesBucket", "lakehouse_catalog", and so on
    + Amazon Redshift table:
      + serverless: arn:aws:redshift-serverless:<region>:<account-id>:table/workgroupName/<database>/<schema>/<table-name>
      + provisioned: arn:aws:redshift:<region>:<account-id>:table/clusterIdentifier/<database>/<schema>/<table-name>
    + Amazon Redshift view:
      + serverless: arn:aws:redshift-serverless:<region>:<account-id>:view/workgroupName/<database>/<schema>/<view-name>
      + provisioned: arn:aws:redshift:<region>:<account-id>:view/clusterIdentifier/<database>/<schema>/<view-name>
  + Dataset nodes representing SageMaker catalog resources:
    + Asset: amazon.datazone.asset/<assetId>
    + Listing (published asset): amazon.datazone.listing/<listingId>
  + In all other cases, a dataset node's sourceIdentifier is populated using the <namespace>/<name> of the dataset node in the OpenLineage run event
    + [https://openlineage.io/docs/spec/naming/](https://openlineage.io/docs/spec/naming/) contains the naming conventions for various datastores.
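The jobrun rule and the fallback dataset rule above can be sketched as small helpers (the function names are hypothetical; the service's internal logic may differ):

```python
def jobrun_source_identifier(job_namespace: str, job_name: str, run_id: str) -> str:
    """sourceIdentifier of a jobrun node: <job's namespace>.<job's name>/<run_id>."""
    return f"{job_namespace}.{job_name}/{run_id}"

def default_dataset_source_identifier(namespace: str, name: str) -> str:
    """Fallback sourceIdentifier for dataset nodes: <namespace>/<name>."""
    return f"{namespace}/{name}"

print(jobrun_source_identifier("xyz.analytics", "transform_sales_data", "abc-123"))
# → xyz.analytics.transform_sales_data/abc-123
print(default_dataset_source_identifier("xyz.analytics", "raw_sales"))
# → xyz.analytics/raw_sales
```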

The following table contains examples of how the sourceIdentifier is generated for datasets of different types.


****  

| Source for lineage event | Sample OpenLineage event data | Source ID computed by Amazon DataZone | 
| --- | --- | --- | 
|  AWS Glue ETL  |  <pre><br />{<br />   "run": {<br />      "runId":"4e3da9e8-6228-4679-b0a2-fa916119fthr",<br />      "facets":{<br />           "environment-properties":{<br />                 ....<br />                "environment-properties":{<br />                     "GLUE_VERSION":"3.0",<br />                     "GLUE_COMMAND_CRITERIA":"glueetl",<br />                     "GLUE_PYTHON_VERSION":"3"<br />                }<br />           }<br />       } <br />    },<br />    .....<br />   "outputs":[<br />      {<br />         "namespace":"namespace.output",<br />         "name":"output_name",<br />         "facets":{<br />             "symlinks":{<br />                 .... <br />                 "identifiers":[<br />                    {<br />                       "namespace":"arn:aws:glue:us-west-2:123456789012",<br />                       "name":"table/testdb/testtb-1",<br />                       "type":"TABLE"<br />                    }<br />                 ]<br />             }<br />        }<br />     }<br />   ]<br />    <br />}<br />                               </pre>  | arn:aws:glue:us-west-2:123456789012:table/testdb/testtb-1 If environment-properties contains GLUE_VERSION, GLUE_PYTHON_VERSION, etc., Amazon DataZone uses the namespace and name in the symlink of the dataset (input or output) to construct the AWS Glue table ARN for the sourceIdentifier. | 
|  Amazon Redshift (Provisioned)  |  <pre><br />{<br />   "run": {<br />      "runId":"4e3da9e8-6228-4679-b0a2-fa916119fthr",<br />      "facets":{<br />          .......<br />       } <br />    },<br />    .....<br />   "inputs":[<br />      {<br />         "namespace":"redshift://cluster-20240715.123456789012.us-east-1.redshift.amazonaws.com:5439",<br />         "name":"tpcds_data.public.dws_tpcds_7"<br />         "facets":{<br />             .....<br />        }<br />     }<br />   ]<br />    <br />}<br />                                </pre>  | arn:aws:redshift:us-east-1:123456789012:table/cluster-20240715/tpcds_data/public/dws_tpcds_7  If the namespace prefix is `redshift`, Amazon DataZone uses that to construct the Amazon Redshift ARN using the values of the namespace and name attributes. | 
|  Amazon Redshift (serverless)  |  <pre><br />{<br />   "run": {<br />      "runId":"4e3da9e8-6228-4679-b0a2-fa916119fthr",<br />      "facets":{<br />          .......<br />       } <br />    },<br />    .....<br />   "outputs":[<br />      {<br />         "namespace":"redshift://workgroup-20240715.123456789012.us-east-1.redshift-serverless.amazonaws.com:5439",<br />         "name":"tpcds_data.public.dws_tpcds_7"<br />         "facets":{<br />             .....<br />        }<br />     }<br />   ]<br />}<br />                                </pre>  | arn:aws:redshift-serverless:us-east-1:123456789012:table/workgroup-20240715/tpcds_data/public/dws_tpcds_7  As per the OpenLineage naming convention, the namespace for an Amazon Redshift dataset should be `provider://{cluster_identifier or workgroup}.{region_name}:{port}`. If the namespace contains `redshift-serverless`, Amazon DataZone uses that to construct the Amazon Redshift ARN using the values of the namespace and name attributes. | 
|  Any other datastore  |  Recommendation is to populate namespace and name as per OpenLineage convention defined in [https://openlineage.io/docs/spec/naming/](https://openlineage.io/docs/spec/naming/).  |  Amazon DataZone populates sourceIdentifier as <namespace>/<name>.  | 

# Linking dataset lineage nodes with assets imported into Amazon SageMaker Unified Studio



Every lineage node is uniquely identified by its sourceIdentifier; the previous section describes the formats of the sourceIdentifier. Amazon SageMaker Unified Studio automatically links dataset nodes with assets in your inventory based on the sourceIdentifier value. Therefore, use the same sourceIdentifier value as the dataset node when creating or updating the asset (via the AssetCommonDetailsForm::sourceIdentifier attribute).
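When creating the asset through the API, the sourceIdentifier goes into the AssetCommonDetailsForm, whose content is itself a JSON-encoded string. The following is a sketch (the table ARN is illustrative):

```python
import json

# Illustrative table ARN; use the sourceIdentifier of your dataset node.
source_identifier = "arn:aws:glue:us-west-2:123456789012:table/testdb/testtb-1"

forms_input = [{
    "formName": "AssetCommonDetailsForm",
    "typeIdentifier": "amazon.datazone.AssetCommonDetailsFormType",
    # The form content is a JSON-encoded string, not a nested object.
    "content": json.dumps({"sourceIdentifier": source_identifier}),
}]

# forms_input can then be passed as the formsInput parameter of
# CreateAsset or CreateAssetRevision.
print(forms_input[0]["content"])
```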

The following images show the sourceIdentifier on the asset details page, along with a lineage graph highlighting that the sourceIdentifier of the dataset node matches its downstream asset's sourceIdentifier.

Asset details page:

![\[Asset details page.\]](http://docs.aws.amazon.com/sagemaker-unified-studio/latest/userguide/images/Screenshot1datalineage.png)


Asset’s SourceIdentifier in node details:

![\[Asset’s SourceIdentifier in node details.\]](http://docs.aws.amazon.com/sagemaker-unified-studio/latest/userguide/images/Screenshot2datalineage.png)


Amazon Redshift dataset/table’s sourceIdentifier in node details:

![\[Amazon Redshift dataset/table’s sourceIdentifier in node details.\]](http://docs.aws.amazon.com/sagemaker-unified-studio/latest/userguide/images/Screenshot3datalineage.png)


# Troubleshooting data lineage

This troubleshooting guide helps you resolve common data lineage visibility issues in Amazon SageMaker Unified Studio. It covers programmatically published events, data source configurations, and tool-specific lineage capture problems.

**Topics**
+ [Not seeing lineage graph for events published programmatically](#lineage-troubleshooting-programmatic-events)
+ [Not seeing lineage for assets even though importLineage is shown as true in AWS Glue datasource](#lineage-troubleshooting-glue-datasource)
+ [Not seeing lineage for assets even though importLineage is shown as true in Amazon Redshift datasource](#lineage-troubleshooting-redshift-datasource)
+ [Troubleshooting lineage for lineage events published from AWS Glue ETL jobs/vETL/Notebooks](#lineage-troubleshooting-glue-etl-jobs)
+ [Troubleshooting lineage for lineage events published from EMR-S/EC2/EKS](#lineage-troubleshooting-emr)

## Not seeing lineage graph for events published programmatically


**Primary requirement:** Lineage graphs are only visible in Amazon SageMaker Unified Studio if at least one node of the graph is an asset. You must create assets for any dataset nodes and properly link them using the sourceIdentifier attribute.

**Troubleshooting steps:**

1. Create assets for any of the dataset nodes involved in your lineage events. Refer to the following sections for proper linking:
   + [The importance of the sourceIdentifier attribute to lineage nodes](datazone-data-lineage-sourceIdentifier-attribute.md) 
   + [Linking dataset lineage nodes with assets imported into Amazon SageMaker Unified Studio](datazone-data-lineage-linking-nodes.md)

1. Once the asset is created, verify that you can see the lineage on the asset details page.

1. If you are still not seeing lineage, verify that the asset's sourceIdentifier (present in AssetCommonDetailsForm) has the same value as the sourceIdentifier of any input/output dataset node in the lineage event.

   Use the following command to get asset details:

   ```
   aws datazone get-asset --domain-identifier {DOMAIN_ID} --identifier {ASSET_ID}
   ```

   The response appears as follows:

   ```
   {
       .....
       "formsOutput": [
           ..... 
           {
               "content": "{\"sourceIdentifier\":\"arn:aws:glue:eu-west-1:123456789012:table/testlfdb/testlftb-1\"}",
               "formName": "AssetCommonDetailsForm",
               "typeName": "amazon.datazone.AssetCommonDetailsFormType",
               "typeRevision": "6"
           },
           .....
       ],
       "id": "{ASSET_ID}",
       ....
   }
   ```
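   Note that the content in formsOutput is a JSON-encoded string, so it needs a second parse. For example (the response shape is abbreviated from above; the helper name is hypothetical):

   ```python
   import json

   # Abbreviated GetAsset response, as returned by the command above.
   response = {
       "formsOutput": [
           {
               "content": "{\"sourceIdentifier\":\"arn:aws:glue:eu-west-1:123456789012:table/testlfdb/testlftb-1\"}",
               "formName": "AssetCommonDetailsForm",
           }
       ]
   }

   def asset_source_identifier(get_asset_response):
       """Pull sourceIdentifier out of the AssetCommonDetailsForm, if present."""
       for form in get_asset_response.get("formsOutput", []):
           if form.get("formName") == "AssetCommonDetailsForm":
               return json.loads(form["content"]).get("sourceIdentifier")
       return None

   print(asset_source_identifier(response))
   ```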

1. If both sourceIdentifiers match but you still cannot see lineage, retrieve the eventId from the PostLineageEvent response or use ListLineageEvents to find the eventId, then invoke GetLineageEvent:

   ```
   aws datazone list-lineage-events --domain-identifier {DOMAIN_ID}
   // You can apply additional filters like timerange etc. 
   // Refer https://docs.aws.amazon.com/datazone/latest/APIReference/API_ListLineageEvents.html
   
   aws datazone get-lineage-event --domain-identifier {DOMAIN_ID} --identifier {EVENT_ID} --output json event.json
   ```

   The response appears as follows and the open-lineage event is written to the event.json file:

   ```
   {
       "domainId": "{DOMAIN_ID}",
       "id": "{EVENT_ID}",
       "createdBy": ....,
       "processingStatus": "SUCCESS" | "FAILED" | ...,
       "eventTime": "2024-05-04T10:15:30+00:00",
       "createdAt": "2025-05-04T22:18:27+00:00"
   }
   ```

1. If the GetLineageEvent response's processingStatus is FAILED, contact AWS Support by providing the GetLineageEvent response for the appropriate event and the response from GetAsset.

1. If the GetLineageEvent response's processingStatus is SUCCESS, double-check that the sourceIdentifier of the dataset node from the lineage event matches the value in the GetAsset response above. The following steps help verify this.

1. Invoke GetLineageNode for the job run, where the identifier is composed of the namespace and name of the job and the run_id in the lineage event:

   ```
   aws datazone get-lineage-node --domain-identifier {DOMAIN_ID} --identifier <job's namespace>.<job's name>/<run_id>
   ```

   The response appears as follows:

   ```
   {
       .....
       "downstreamNodes": [
           {
               "eventTimestamp": "2024-07-24T18:08:55+08:00",
               "id": "afymge5k4v0euf"
           }
       ],
       "formsOutput": [
           <some forms corresponding to run and job>
       ],
       "id": "<system generated node-id for run>",
       "sourceIdentifier": "<job's namespace>.<job's name>/<run_id>",
       "typeName": "amazon.datazone.JobRunLineageNodeType",
       ....
       "upstreamNodes": [
           {
               "eventTimestamp": "2024-07-24T18:08:55+08:00",
               "id": "6wf2z27c8hghev"
           },
           {
               "eventTimestamp": "2024-07-24T18:08:55+08:00",
               "id": "4tjbcsnre6banb"
           }
       ]
   }
   ```

1. Invoke GetLineageNode again by passing in the downstream/upstream node identifier (which you think should be linked to the asset node):

   ```
   aws datazone get-lineage-node --domain-identifier {DOMAIN_ID} --identifier afymge5k4v0euf
   ```

   This returns the lineage node details corresponding to the dataset `afymge5k4v0euf`. Verify that the sourceIdentifier matches that of the asset. If it doesn't match, fix the namespace and name of the dataset in the lineage event and publish the lineage event again. You will then see the lineage graph on the asset.

   ```
   {
       .....
       "downstreamNodes": [],
       "eventTimestamp": "2024-07-24T18:08:55+08:00",
       "formsOutput": [
           .....
       ],
       "id": "afymge5k4v0euf",
       "sourceIdentifier": "<sample_sourceIdentifier_value>",
       "typeName": "amazon.datazone.DatasetLineageNodeType",
       "typeRevision": "1",
       ....
       "upstreamNodes": [
           ...
       ]
   }
   ```

## Not seeing lineage for assets even though importLineage is shown as true in AWS Glue datasource


Open the datasource run(s) associated with the AWS Glue datasource to see the assets imported as part of the run and the lineage import status, along with an error message in case of failure.

**Limitations:**
+ Lineage for a crawler run importing more than 250 tables isn't supported.

## Not seeing lineage for assets even though importLineage is shown as true in Amazon Redshift datasource


Lineage on Amazon Redshift tables is captured by retrieving user queries performed on the Amazon Redshift database, from the system tables.

**Lineage is not supported in the following cases:**
+ External Tables
+ Unload / Copy
+ Merge / Update
+ Queries that produce Lineage Events larger than 16MB
+ Datashares
+ Column lineage limitations:
  + Column lineage is not supported for queries that don't name specific columns, such as (select * from tableA)
  + Column lineage is not supported for queries involving temp tables
+ Any limitations of the [OpenLineage SQL parser](https://github.com/OpenLineage/OpenLineage/blob/main/integration/sql/README.md) result in failure to process some queries

**Troubleshooting steps:**

1. On the Amazon Redshift connection details, you will see the lineageJobId along with the job run schedule. Alternatively, you can fetch it using the [get-connection](https://docs.aws.amazon.com/cli/latest/reference/datazone/get-connection.html) API.

1. Invoke [list-job-runs](https://docs.aws.amazon.com/cli/latest/reference/datazone/list-job-runs.html) to get the runs corresponding to the job:

   ```
   aws datazone list-job-runs --domain-identifier {DOMAIN_ID} --job-identifier {JOB_ID}
   ```

   The response appears as follows:

   ```
   {
      "items": [ 
         { 
            .....
            "error": { 
               "message": "string"
            },
            "jobId": {JOB_ID},
            "jobType": LINEAGE,
            "runId": ...,
            "runMode": SCHEDULED,
            "status": SCHEDULED | IN_PROGRESS | SUCCESS | PARTIALLY_SUCCEEDED | FAILED | ABORTED | TIMED_OUT | CANCELED
            .....
         }
      ],
      "nextToken": ...
   }
   ```

1. If no job runs are returned, check your job run schedule on the Amazon Redshift connection details. Reach out to AWS Support with the lineageJobId, connectionId, projectId, and domainId if job runs are not executed per the given schedule.

1. If job-runs are returned, pick the relevant jobRunId and invoke GetJobRun to get job run details:

   ```
   aws datazone get-job-run --domain-identifier {DOMAIN_ID} --identifier {JOB_RUN_ID}
   ```

   The response appears as follows:

   ```
   {
     ....
     "details": {
       "lineageRunDetails": {
         "sqlQueryRunDetails": {
           "totalQueriesProcessed": ..,
           "numQueriesFailed": ...,
           "errorMessages":....,
           "queryEndTime": ...,
           "queryStartTime": ...
         }
       }
     },
     .....
   }
   ```

1. The job run fails if none of the queries are successfully processed, partially succeeds if some queries are successfully processed, and succeeds if all queries are successfully processed. The response also contains the start and end times of the processed queries.

## Troubleshooting lineage for lineage events published from AWS Glue ETL jobs/vETL/Notebooks


**Limitations:**
+ OpenLineage libraries for Spark are built into AWS Glue v5.0+ for Spark DataFrames only. AWS Glue DynamicFrames are not supported.
+ Lineage capture for Spark jobs in fine-grained permission mode is not supported.
+ A lineage event has a size limit of 300 KB.

**Common Issues:**
+ Verify that the necessary permissions are given to your job execution role as described in [Permissions required for data lineage](datazone-data-lineage-permissions.md)
+ A Spark job working with S3 files produces lineage events with S3 datasets, even when those files are cataloged in AWS Glue. To generate events that include AWS Glue tables and build a proper lineage graph with AWS Glue assets, your Spark job should instead work with the AWS Glue tables.
+ If the AWS Glue ETL runs in a VPC, make sure the Amazon DataZone VPC endpoint is deployed in that VPC.
+ If your domain uses a CMK, make sure that the AWS Glue execution role has the appropriate KMS permissions. The CMK can be found via [https://docs.aws.amazon.com/datazone/latest/APIReference/API_GetDomain.html](https://docs.aws.amazon.com/datazone/latest/APIReference/API_GetDomain.html)
+ Failed to publish a lineage event because the payload is greater than 300 KB:
  + Add the following to the Spark conf:

    ` "spark.openlineage.columnLineage.datasetLineageEnabled": "True" `
  + **Important note:**
    + Column lineage typically constitutes a significant portion of the payload, and enabling this setting generates the column lineage info more efficiently.
    + Disabling column lineage can help reduce the payload size and avoid validation exceptions.
+ Cross-account lineage event submission:
  + Follow [https://docs.aws.amazon.com/datazone/latest/userguide/working-with-associated-accounts.html](https://docs.aws.amazon.com/datazone/latest/userguide/working-with-associated-accounts.html) to set up the account association
  + Ensure that the AWS RAM policy is the latest version
+ If your Amazon SageMaker Unified Studio domain is in a different region from the job:
  + Add this Spark parameter: `"spark.openlineage.transport.region":"{region of your domain}"`
+ When the same DataFrame is written to multiple destinations or formats in sequence, the lineage SparkListener may capture lineage only for the first write operation:
  + For optimization purposes, Spark internals may reuse the execution plan definition for consecutive write operations on the same DataFrame. This can lead to only the first lineage event being captured.

**Troubleshooting steps:**

1. A lineage graph can only be visualized if at least one node of the graph is an asset. Therefore, create assets for any of the datasets (such as tables) involved in the job, and then attempt to visualize lineage on the asset.

1. First, invoke [ListLineageEvents](https://docs.aws.amazon.com/datazone/latest/APIReference/API_ListLineageEvents.html) to see if the lineage events were submitted (refer to the linked doc to pass filters).

1. If no events are submitted, check AWS CloudWatch logs to see if any exceptions are thrown from the Amazon DataZone Lineage lib:
   + Log groups: /aws-glue/jobs/error, /aws-glue/sessions/error
   + Make sure logging is enabled: [https://docs.aws.amazon.com/glue/latest/dg/monitor-continuous-logging-enable.html](https://docs.aws.amazon.com/glue/latest/dg/monitor-continuous-logging-enable.html)
   + The following CloudWatch Logs Insights query checks for exceptions:

     ```
     fields @timestamp, @message
     | filter @message like /(?i)exception/ and like /(?i)datazone/
     | sort @timestamp desc
     ```
   + The following CloudWatch Logs Insights query confirms that events were submitted:

     ```
     fields @timestamp, @message
     | filter @message like /Successfully posted a LineageEvent:/
     | sort @timestamp desc
     ```

1. Fetch the lineage events generated from this job by executing the Python script [retrieve_lineage_events.py](https://github.com/aws-samples/amazon-datazone-examples/tree/main/data_lineage).

1. Check if the dataset on which you expected lineage is present in any of the events:
   + You can ignore empty events without any input/output nodes
   + Check if your dataset node has a Glue ARN prefix in the namespace of the node or in the "symlinks" facet of the node. If you don't see any node with a Glue ARN prefix, your script is not using Glue tables directly, so the lineage is not linked to a Glue asset. One way to work around this is to update the script to work with Glue tables.
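   A quick way to perform this check is to scan each event's datasets for a Glue ARN in the namespace or symlinks facet. The following is a sketch over an already-parsed event (the helper name is hypothetical; the traversal assumes the standard OpenLineage event shape):

   ```python
   def has_glue_dataset(event: dict) -> bool:
       """True if any input/output references a Glue table, directly or via symlink."""
       for ds in event.get("inputs", []) + event.get("outputs", []):
           if ds.get("namespace", "").startswith("arn:aws:glue"):
               return True
           symlinks = ds.get("facets", {}).get("symlinks", {})
           for ident in symlinks.get("identifiers", []):
               if ident.get("namespace", "").startswith("arn:aws:glue"):
                   return True
       return False

   # Example event whose input carries a Glue table symlink.
   event = {
       "inputs": [{"namespace": "s3://my-bucket", "name": "raw/",
                   "facets": {"symlinks": {"identifiers": [
                       {"namespace": "arn:aws:glue:us-west-2:123456789012",
                        "name": "table/testdb/testtb-1", "type": "TABLE"}]}}}],
       "outputs": [],
   }
   print(has_glue_dataset(event))
   ```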

1. If you are still unable to see lineage and it doesn't fall under the limitations category, reach out to AWS Support providing:
   + Spark config parameters
   + The lineage events file from executing the retrieve_lineage_events.py script
   + The GetAsset response

## Troubleshooting lineage for lineage events published from EMR-S/EC2/EKS


**Notes:**
+ Lineage is supported from the following versions of EMR:
  + EMR-S: 7.5+
  + EMR on EC2: 7.11+
  + EMR on EKS: 7.12+
+ Lineage capture for Spark jobs in fine-grained permission mode is not supported.
+ If you are running EMR outside of Amazon SageMaker Unified Studio, the Amazon DataZone VPC endpoint needs to be deployed in the EMR VPC.
+ A lineage event has a size limit of 300 KB.

**Common Issues:**
+ Verify that the necessary permissions are given to your job execution role as described in [Permissions required for data lineage](datazone-data-lineage-permissions.md)
+ If your domain uses a CMK, make sure that the job's execution role has the appropriate KMS permissions. The CMK can be found via [https://docs.aws.amazon.com/datazone/latest/APIReference/API_GetDomain.html](https://docs.aws.amazon.com/datazone/latest/APIReference/API_GetDomain.html)
+ Failed to publish a lineage event because the payload is greater than 300 KB:
  + Add the following to the Spark conf, which generates the event payload more efficiently:

    ` "spark.openlineage.columnLineage.datasetLineageEnabled": "true" `
+ Cross-account lineage event submission:
  + Follow [https://docs.aws.amazon.com/datazone/latest/userguide/working-with-associated-accounts.html](https://docs.aws.amazon.com/datazone/latest/userguide/working-with-associated-accounts.html) to set up the account association
  + Ensure that the AWS RAM policy is the latest version
+ If your Amazon SageMaker Unified Studio domain is in a different region from the job:
  + Add this Spark parameter: `"spark.openlineage.transport.region":"{region of your domain}"`
+ When the same DataFrame is written to multiple destinations or formats in sequence, the lineage SparkListener may capture lineage only for the first write operation:
  + For optimization purposes, Spark internals may reuse the execution plan definition for consecutive write operations on the same DataFrame. This can lead to only the first lineage event being captured.

**Troubleshooting steps:**

1. A lineage graph can only be visualized if at least one node of the graph is an asset. Therefore, create assets for any of the datasets (such as tables) involved in the job, and then attempt to visualize lineage on the asset.

1. First, invoke [ListLineageEvents](https://docs.aws.amazon.com/datazone/latest/APIReference/API_ListLineageEvents.html) to see if the lineage events were submitted (refer to the linked doc to pass filters).

1. If no events are submitted, check AWS CloudWatch logs to see if any exceptions are thrown from the Amazon DataZone lineage lib:
   + **EC2:**
     + You can provide the CloudWatch log group or an S3 log destination at the time of creating the EC2 cluster. Refer to [https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-debugging.html](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-debugging.html)
     + You will see logs in the stderr file within the cluster-id/containers/application_<application-id>/ folder.
   + **EKS:**
     + You need to provide the CloudWatch log group or an S3 log destination when submitting the Spark job. Refer to [https://docs.aws.amazon.com/emr/latest/EMR-on-EKS-DevelopmentGuide/getting-started.html](https://docs.aws.amazon.com/emr/latest/EMR-on-EKS-DevelopmentGuide/getting-started.html)
     + You will see logs in the stderr file of the Spark driver within the virtual-cluster-id/jobs/job-id/containers/ folder.
   + **EMR-S:**
     + You can enable logs by following [https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/logging.html](https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/logging.html)
     + You will see logs in the stderr files of the Spark driver
   + Following is the CloudWatch Logs Insights query to check for exceptions:

     ```
     fields @timestamp, @message
     | filter @message like /(?i)exception/ and @message like /(?i)datazone/
     | sort @timestamp desc
     ```
   + Following is the CloudWatch Logs Insights query to inspect the generated events:

     ```
     fields @timestamp, @message
     | filter @message like /Successfully posted a LineageEvent:/
     | sort @timestamp desc
     ```

1. Fetch the lineage events generated from this job by executing the Python script [retrieve_lineage_events.py](https://github.com/aws-samples/amazon-datazone-examples/tree/main/data_lineage).

1. Check if the dataset on which you expected lineage is present in any of the events:
   + You can ignore empty events without any input/output nodes.
   + Check whether your dataset node has a namespace/name matching the sourceIdentifier of the asset. If you don't see any node with the asset's sourceIdentifier, refer to the following docs on how to fix it:
     + [The importance of the sourceIdentifier attribute to lineage nodes](datazone-data-lineage-sourceIdentifier-attribute.md) 
     + [Linking dataset lineage nodes with assets imported into Amazon SageMaker Unified Studio](datazone-data-lineage-linking-nodes.md)

1. If you are still unable to see lineage, and your scenario doesn't fall under the limitations described above, reach out to the AWS Support team and provide the following:
   + Spark config parameters
   + The lineage events file produced by the retrieve_lineage_events.py script
   + The GetAsset response
   + GetAsset response
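Steps 2 and 5 above can be sketched in Python: build the `ListLineageEvents` parameters for a recent time window, then scan each returned event for a dataset node whose namespace/name matches the asset's sourceIdentifier. Parameter names follow the documented DataZone API; the domain ID, event shape, and identifier values below are illustrative placeholders.

```python
import datetime

# Step 2: parameters for the DataZone ListLineageEvents call, limited to a
# recent window. Pass to boto3 with:
#   boto3.client("datazone").list_lineage_events(**params)
def lineage_event_params(domain_id: str, hours: int = 24) -> dict:
    now = datetime.datetime.now(datetime.timezone.utc)
    return {
        "domainIdentifier": domain_id,
        "timestampAfter": now - datetime.timedelta(hours=hours),
        "sortOrder": "DESCENDING",
        "maxResults": 50,
    }

# Step 5: check whether any input/output dataset node in an OpenLineage-style
# event matches the asset's sourceIdentifier (here assumed "namespace/name").
def matches_asset(event: dict, source_identifier: str) -> bool:
    datasets = event.get("inputs", []) + event.get("outputs", [])
    return any(
        f"{d['namespace']}/{d['name']}" == source_identifier for d in datasets
    )

params = lineage_event_params("dzd_example123")
sample_event = {
    "inputs": [{"namespace": "glue", "name": "sales_db.orders"}],
    "outputs": [],
}
```

The exact namespace/name format of your nodes depends on the source system, so compare against the output of the retrieve script rather than assuming the shape shown here.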

# Analyze Amazon SageMaker Unified Studio data with external analytics applications via JDBC connection
Analyze your subscribed data with external analytics applications via JDBC connection

Amazon SageMaker Unified Studio enables data consumers to easily locate and subscribe to data from multiple sources within a single project and analyze this data using Amazon Athena, Amazon Redshift Query Editor, and Amazon SageMaker.

Amazon SageMaker Unified Studio also supports authentication via the Athena JDBC driver that enables users to query their subscribed Amazon SageMaker Unified Studio data using popular external SQL and analytics tools, such as SQL Workbench, DBeaver, Tableau, Domino, Power BI and many others. Users can authenticate using their corporate credentials through SSO or IAM and begin analyzing their subscribed data within their Amazon SageMaker Unified Studio projects.

Amazon SageMaker Unified Studio's support of the Athena JDBC driver provides the following benefits:
+ Greater tool choice for querying and visualization - data consumers can connect to Amazon SageMaker Unified Studio using their preferred tools from a wide range of analytics tools that support a JDBC connection. This enables them to continue using the software they are familiar with without the need to learn new tools for data consumption. 
+ Programmatic access - a JDBC connection to access-governed data via servers or custom applications enables data consumers to perform automated and more complex data operations.

You can use your JDBC URL to connect your external analytics tools to your Amazon SageMaker Unified Studio subscribed data. To obtain your JDBC URL, perform the following procedure:

**Important**  
In the current release, Amazon SageMaker Unified Studio supports authentication using the Amazon Athena JDBC Driver. To complete this procedure, make sure that you have downloaded and installed the latest [Athena JDBC driver](https://docs.aws.amazon.com/athena/latest/ug/jdbc-v3-driver.html) for your analytics application of choice. 

1. Navigate to Amazon SageMaker Unified Studio using the URL from your admin and log in using your SSO or AWS credentials. 

1. Choose **Select project** from the top navigation pane and select the project where you have the data that you want to analyze.

1. In the **Project overview**, choose the **JDBC connection details** tab.

1. In **JDBC connection details** choose your authentication method (**Using IDC auth** or **Using IAM auth**) and then choose the icon next to **JDBC connection URL** to copy the string or the individual parameters of the JDBC URL. You can then use it to connect to your external analytics application. 

When you connect your external analytics application to Amazon DataZone using your JDBC URL or parameters, you invoke the `RedeemAccessToken` API. The `RedeemAccessToken` API exchanges an Identity Center access token for the `AmazonDataZoneDomainExecutionRole` credentials, which are used to call the `GetEnvironmentCredentials` API.

For more information about the authentication mechanism that uses IAM credentials to connect to Amazon DataZone-governed data in Athena, see [DataZone IAM Credentials Provider](https://docs.aws.amazon.com/athena/latest/ug/jdbc-v3-driver-datazone-iamcp.html). For more information about the authentication mechanism that enables connecting to Amazon DataZone-governed data in Athena using IAM Identity Center, see [DataZone Idc Credentials Provider](https://docs.aws.amazon.com/athena/latest/ug/jdbc-v3-driver-datazone-idc.html).
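The token exchange can be sketched as a plain HTTP request matching the request syntax in the API reference that follows. This is a minimal illustration, not a client implementation: the domain-ID pattern mirrors the documented format, and the token value is a placeholder.

```python
import json
import re

# Documented format of a DataZone domain ID (e.g. "dzd_abc123").
DOMAIN_ID_PATTERN = re.compile(r"^dzd[-_][a-zA-Z0-9_-]{1,36}$")

def redeem_token_request(domain_id: str, access_token: str) -> dict:
    """Build the RedeemAccessToken request described in the API reference."""
    if not DOMAIN_ID_PATTERN.match(domain_id):
        raise ValueError(f"invalid domain ID: {domain_id}")
    return {
        "method": "POST",
        "path": "/sso/redeem-token",
        "headers": {"Content-type": "application/json"},
        "body": json.dumps({"domainId": domain_id, "accessToken": access_token}),
    }

request = redeem_token_request("dzd_abc123", "example-identity-center-token")
payload = json.loads(request["body"])
```

The `credentials` object in the response then carries the temporary `accessKeyId`, `secretAccessKey`, `sessionToken`, and `expiration` used for the subsequent `GetEnvironmentCredentials` call; the JDBC driver performs this exchange for you.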

## RedeemAccessToken API Reference


**Request syntax**

```
POST /sso/redeem-token HTTP/1.1
Content-type: application/json

{
   "domainId": "string",
   "accessToken": "string"
}
```

**Request parameters**

The request uses the following parameters.

**domainId**  
The ID of the Amazon DataZone domain.  
Type: string  
Pattern: `^dzd[-_][a-zA-Z0-9_-]{1,36}$`  
Required: yes

**accessToken**  
The Identity Center access token.  
Type: string  
Required: yes

**Response syntax**

```
HTTP/1.1 200
Content-type: application/json

{
   "credentials": AwsCredentials
}
```

**Response elements**

**credentials**  
The `AmazonDataZoneDomainExecutionRole` credentials that are used to call the `GetEnvironmentCredentials` API.  
Type: `AwsCredentials` object. This data type includes the following properties:  
+ accessKeyId: AccessKeyId
+ secretAccessKey: SecretAccessKey
+ sessionToken: SessionToken
+ expiration: Timestamp

**Errors**

**AccessDeniedException**  
You do not have sufficient access to perform this action.  
HTTP Status Code: 403

**ResourceNotFoundException**  
The specified resource cannot be found.  
HTTP Status Code: 404

**ValidationException**  
The input fails to satisfy the constraints specified by the AWS service.  
HTTP Status Code: 400

**InternalServerException**  
The request has failed because of an unknown error, exception or failure.  
HTTP Status Code: 500

# Metadata enforcement rules for publishing


The metadata enforcement rules for publishing in Amazon SageMaker Unified Studio strengthen data governance by enabling domain unit owners to establish clear metadata requirements for data producers, streamlining access requests and enhancing data governance.

The feature is supported in all the AWS commercial Regions where Amazon SageMaker Unified Studio is currently available.

Domain unit owners can complete the following procedure to configure metadata enforcement in Amazon SageMaker Unified Studio:

1. Navigate to Amazon SageMaker Unified Studio using the URL from your admin and log in using your SSO or AWS credentials. 

1. Choose **Govern** -> **Domain units** from the top navigation pane and then choose the domain or the domain unit that you want to work with.

1. Choose the **Rules** tab and then choose **Add**.

1. On the **Rule configuration** page, do the following and then choose **Add rule**:
   + Specify a name for your rule.
   + Under **Action**, choose **Data asset and product publishing** or **Subscription request**.
   + If you chose **Subscription request**, then under **Required metadata forms**, choose **Add metadata form**, choose a metadata form within the domain or domain unit that you want to add to this rule, and then choose **Add**. You can add up to 5 metadata forms per rule.
   + If you chose **Data asset and product publishing**, then under **Rule requirements**, choose either **Metadata forms** or **Glossary association**. If you chose **Metadata forms**, then under **Required metadata forms**, choose **Add metadata form**, choose a metadata form within the domain or domain unit that you want to add to this rule, and then choose **Add**. You can add up to 5 metadata forms per rule. If you chose **Glossary association**, then choose **Add terms** and add your glossary terms to the rule. You can add up to 5 glossary terms per rule. 
   + Under **Scope**, specify with which data entities you want to associate these forms. You can choose data products and/or data assets.
   + Under **Data asset types**, specify whether the rule applies across all asset types or limit it to selected asset types. 
   + Under **Projects**, specify whether the required forms will be associated with data products and/or assets published by all projects or only selected projects in this domain unit. Also, check **Cascade rule to child domain units** if you want child domain units to inherit this requirement. 

# Data discovery, subscription, and consumption


In Amazon SageMaker Unified Studio, after an asset is published to a domain, subscribers can discover and request a subscription to this asset. The subscription process begins with a subscriber searching for and browsing the catalog to find an asset they want. In the Amazon SageMaker Catalog, they subscribe to the asset by submitting a subscription request that includes justification and the reason for the request. The owner of the asset reviews the request. They can either approve or reject the request. 

After a subscription is granted, a fulfillment process starts to facilitate access to the asset for the subscriber. There are two primary modes of asset access control and fulfillment: those for Amazon SageMaker Unified Studio managed assets and those for assets that are not managed by Amazon SageMaker Unified Studio.
+ **Managed assets** – Amazon SageMaker Unified Studio can manage fulfillment and permissions for managed assets, such as AWS Glue tables and Amazon Redshift tables and views.
+ **Unmanaged assets** – Amazon SageMaker Unified Studio publishes standard events related to your actions (for example, approval given to a subscription request to Amazon EventBridge). You can use these standard events to integrate with other AWS services or third-party solutions for custom integrations.

**Topics**
+ [Search for and view assets in the Amazon SageMaker Unified Studio catalog](search-for-data.md)
+ [Request subscription to assets in Amazon SageMaker Unified Studio](subscribe-to-data-assets-managed.md)
+ [Approve or reject a subscription request in Amazon SageMaker Unified Studio](approve-reject-subscription-request.md)
+ [Revoke an existing subscription in Amazon SageMaker Unified Studio](revoke-subscription.md)
+ [Cancel a subscription request in Amazon SageMaker Unified Studio](cancel-subscription-request.md)
+ [Unsubscribe from an asset in Amazon SageMaker Unified Studio](unsubscribe-from-subscription.md)
+ [Grant access to managed AWS Glue Data Catalog assets in Amazon SageMaker Unified Studio](grant-access-to-glue-asset.md)
+ [Grant access to managed Amazon Redshift assets in Amazon SageMaker Unified Studio](grant-access-to-redshift-asset.md)
+ [Grant access for approved subscriptions to unmanaged assets in Amazon SageMaker Unified Studio](grant-access-to-unmanaged-asset.md)
+ [Metadata enforcement rules for subscription requests](metadata-rules-subscription.md)

# Search for and view assets in the Amazon SageMaker Unified Studio catalog
Search for and view assets in the catalog

Amazon SageMaker Unified Studio provides a streamlined way to search for data. Any Amazon SageMaker Unified Studio user with permissions to access Amazon SageMaker Unified Studio can search for assets in the Amazon SageMaker Unified Studio catalog and view asset names and the metadata assigned to them. You can take a closer look at an asset by examining its details page.

**Note**  
To view the actual data that an asset contains, you must first subscribe to the asset and have your subscription request approved and access granted. 

**To search for assets in the catalog**

1. Navigate to Amazon SageMaker Unified Studio using the URL from your admin and log in using your SSO or AWS credentials. 

1. Navigate to the **Discover** menu in the top navigation bar.

1. Choose **Data catalog**.

1. Find the asset that you want to subscribe to by browsing or entering the name of the asset into the search bar. You can apply filters to narrow the results. The filters include asset type, source account, the AWS Region to which the asset belongs, date range, and custom metadata filters. To add a custom metadata filter, choose **Add Filter** at the bottom of the filters panel. You can filter by asset name, description, or metadata form fields.

   For metadata form filters, select the form, field, and operator (`contains` for string fields; `equals`, `greater than`, or `less than` for numeric fields). Enter a value and choose **Apply**. You can combine multiple custom filters.

   Filter selections persist in your browser by using local storage. Only fields that are marked as searchable (strings) or sortable (numerics) are available for filtering.

1. To view details about a specific asset, choose the asset to open its details page. The details page includes the following information:
   + The asset name and type.
   + A description of the asset.
   + The current published revision of the asset, the owner, whether approval is required for subscriptions, and update history.
   + A **Business metadata** tab which includes glossary terms and metadata forms.
   + A **Subscription requests** tab, which includes a list of the asset's subscription requests.
   + A **Lineage** tab which displays a chart of past revisions of the asset.

# Request subscription to assets in Amazon SageMaker Unified Studio
Request subscription to assets

Amazon SageMaker Unified Studio allows you to find, access, and consume the assets in the Amazon SageMaker Unified Studio catalog. When you find an asset in the catalog that you want to access, you need to *subscribe* to the asset, which creates a subscription request. An approver can then approve or reject your request.

You must be a member of a project in order to request a subscription to an asset on behalf of that project.

**To subscribe to an asset**

1. Navigate to Amazon SageMaker Unified Studio using the URL from your admin and log in using your SSO or AWS credentials. 

1. Navigate to the **Discover** menu in the top navigation bar.

1. Choose **Data catalog**.

1. Find the asset you want to subscribe to by browsing or typing the name of the asset into the search bar.

1. Choose the asset to which you want to subscribe, and then choose **Subscribe**. 

1. In the **Subscribe** pop-up window, provide the following information:
   + The project that you want to subscribe to the asset.
   + A short justification for your subscription request.

1. Choose **Request**.

   The project will be subscribed to the asset when the publisher approves your request.

To view the status of the subscription request, locate and choose the project with which you subscribed to the asset. Choose **Subscription requests** from the project left side navigation, then choose the **Outgoing requests** tab. This page lists the assets to which the project has requested access. You can filter the list by the status of the request.

# Approve or reject a subscription request in Amazon SageMaker Unified Studio
Approve or reject a subscription request

Amazon SageMaker Unified Studio allows you to find, access and consume the assets in the Amazon SageMaker Unified Studio catalog. When you find an asset in the catalog that you want to access, you must *subscribe* to the asset, which creates a subscription request. An approver can then approve or reject your request.

You must be a member of the owning project (the project that published the asset) to approve or reject a subscription request.

**To approve or reject a subscription request**

1. Navigate to Amazon SageMaker Unified Studio using the URL from your admin and log in using your SSO or AWS credentials. 

1. Navigate to the project that contains the asset that has a subscription request. You can do this by using the center menu at the top of the page and choosing **Browse all projects**, then choosing the name of the project that you want to navigate to.

1. Under **Project catalog**, choose **Subscription requests**.

1. Choose the **Incoming requests** tab. 

1. Locate the request and choose **View request**. You can filter by **Requested** to see only requests that are still open.

1. Review the subscription request and reason for access, and decide whether to approve or reject it.

1. To approve, choose between the following two options:
   + **Full access**: If you approve the subscription with the full access option, the subscriber gets access to all the rows and columns in your data asset. 
   + **Approve with row and column filters**: To limit access to specific rows and columns of data, you can choose to approve with row and column filters. For more information, see [Fine-grained access control to data](fine-grained-access-control.md). 
     + Select **Choose filters**, and then, from the dropdown, select one or more available filters that you want to apply to the subscription. 
     + To create a new filter, choose **Create new filter**, which opens a new page where you can create a row or column filter. For more information, see [Create column filters in Amazon SageMaker Unified Studio](create-column-filter.md) and [Create row filters in Amazon SageMaker Unified Studio](create-row-filter.md).

1. (Optional) Enter a response that explains your reason for accepting or rejecting the request.

1. Choose either **Approve** or **Reject**.

As the project owner, you can revoke the subscription at any time. For more information, see [Revoke an existing subscription in Amazon SageMaker Unified Studio](revoke-subscription.md).

**Note**  
Amazon SageMaker Unified Studio supports fine-grained access control for AWS Glue tables, Amazon Redshift tables, and Amazon Redshift views.

## Automatic approval of subscription requests


By default, subscription requests to a published asset require manual approval by a data owner. However, Amazon SageMaker Unified Studio supports two scenarios where subscription requests can be automatically approved:
+ Approval disabled during asset publishing - when publishing a data asset, you can choose to not require subscription approval. In this case, all incoming subscription requests to that asset are automatically approved. To learn how to disable approval for an asset, see [Publish assets to the Amazon SageMaker Unified Studio catalog from the project inventory](publishing-data-asset.md) .
+ Requester is an owner or contributor in the project that published the asset - a subscription request is also automatically approved if the requester is already authorized to approve it manually. Specifically, if they are a member of both the project that published the asset and the project requesting access.

  To qualify for auto-approval:
  + The requester must be listed as an owner or contributor in the project where the asset was originally published.
  + The requester must also be listed as an owner or contributor in the project making the subscription request.

  This ensures that auto-approval only occurs when the requester has visibility and permissions in both projects — the one sharing the asset and the one requesting access. If the requester meets both conditions, the system auto-approves the request.
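The two auto-approval scenarios above reduce to a simple predicate. The following sketch encodes that decision logic; the role names and project IDs are illustrative, not an actual Amazon SageMaker Unified Studio API.

```python
# Pure decision logic for the two auto-approval scenarios described above.
APPROVER_ROLES = {"owner", "contributor"}

def is_auto_approved(
    requester_roles: dict,       # maps project ID -> role for the requester
    publishing_project: str,     # project that published the asset
    requesting_project: str,     # project requesting the subscription
    approval_required: bool,     # set when the asset was published
) -> bool:
    # Scenario 1: approval was disabled when the asset was published.
    if not approval_required:
        return True
    # Scenario 2: the requester is an owner or contributor in BOTH the
    # publishing project and the requesting project.
    return (
        requester_roles.get(publishing_project) in APPROVER_ROLES
        and requester_roles.get(requesting_project) in APPROVER_ROLES
    )
```

For example, a viewer in the publishing project does not qualify, even if they own the requesting project, because they lack approval rights on the publishing side.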

# Revoke an existing subscription in Amazon SageMaker Unified Studio
Revoke an existing subscription

Amazon SageMaker Unified Studio allows you to find, access, and consume the assets in the Amazon SageMaker Unified Studio catalog. When you find an asset in the catalog that you want to access, you need to *subscribe* to the asset, which creates a subscription request. An approver can then approve or reject your request. You might need to revoke a subscription after you have approved it, either because the approval was a mistake, or because the subscriber no longer needs access to the asset.

You must be a member of the owning project (the project that published the asset) to revoke a subscription.

**To revoke a subscription**

1. Navigate to Amazon SageMaker Unified Studio using the URL from your admin and log in using your SSO or AWS credentials. 

1. Navigate to the project that contains the asset that has a subscription request. You can do this by using the center menu at the top of the page and choosing **Browse all projects**, then choosing the name of the project that you want to navigate to.

1. Under **Project catalog**, choose **Subscription requests**.

1. Choose the **Incoming requests** tab. 

1. Locate the subscription you want to revoke and choose **View subscription**.

1. (Optional) Enable the checkbox to allow the subscriber to keep the asset in the project's subscription targets. A subscription target is a reference to a set of resources where subscribed data can be made available within an environment.

   If you want to revoke access to the asset from the subscription target at a later time, you must do so in AWS Lake Formation.

1. Choose **Revoke subscription**.

You can't re-approve a subscription after you revoke it. The subscriber must request a subscription to the asset again in order for you to approve it.

# Cancel a subscription request in Amazon SageMaker Unified Studio
Cancel a subscription request

Amazon SageMaker Unified Studio allows you to find, access, and consume the assets in the Amazon SageMaker Unified Studio catalog. When you find an asset in the catalog that you want to access, you need to *subscribe* to the asset, which creates a subscription request. An approver can then approve or reject your request. You might need to cancel a pending subscription request, either because you submitted it by mistake, or because you no longer need read access to the asset.

To cancel a subscription request, you must be either a project owner or contributor.

**To cancel a subscription request**

1. Navigate to Amazon SageMaker Unified Studio using the URL from your admin and log in using your SSO or AWS credentials. 

1. Navigate to the project that contains the asset that has a subscription request. You can do this by using the center menu at the top of the page and choosing **Browse all projects**, then choosing the name of the project that you want to navigate to.

1. Under **Project catalog**, choose **Subscription requests**.

1. Choose the **Outgoing requests** tab. 

1. Filter by **Requested** to see only requests that are still pending.

1. Locate the request and choose **View request**. 

1. Review the subscription request and choose **Cancel request**.

If you want to re-subscribe to the asset (or to a different asset), see [Request subscription to assets in Amazon SageMaker Unified Studio](subscribe-to-data-assets-managed.md).

# Unsubscribe from an asset in Amazon SageMaker Unified Studio
Unsubscribe from an asset

Amazon SageMaker Unified Studio allows you to find, access, and consume the assets in the Amazon SageMaker Unified Studio catalog. When you find an asset in the catalog that you want to access, you need to *subscribe* to the asset, which creates a subscription request. An approver can then approve or reject your request. You might need to unsubscribe from an asset, either because you subscribed by mistake and were approved, or because you no longer need read access to the asset.

You must be a member of a project in order to unsubscribe from one of its assets.

**To unsubscribe from an asset**

1. Navigate to Amazon SageMaker Unified Studio using the URL from your admin and log in using your SSO or AWS credentials. 

1. Navigate to the project that contains the asset that has a subscription request. You can do this by using the center menu at the top of the page and choosing **Browse all projects**, then choosing the name of the project that you want to navigate to.

1. Under **Project catalog**, choose **Subscription requests**.

1. Choose the **Outgoing requests** tab. 

1. Filter by **Approved** to see only requests that have been approved.

1. Locate the request and choose **View subscription**. 

1. Review the subscription and choose **Unsubscribe**.

If you want to re-subscribe to the asset (or to a different asset), see [Request subscription to assets in Amazon SageMaker Unified Studio](subscribe-to-data-assets-managed.md).

# Grant access to managed AWS Glue Data Catalog assets in Amazon SageMaker Unified Studio
Grant access to managed AWS Glue Data Catalog assets

**Note**  
Access management for AWS Glue Data Catalog assets using the AWS Lake Formation LF-TBAC method is not supported.  
Cross-Region sharing of assets in the AWS Glue Data Catalog is not supported.  
Cross-account sharing of assets in a federated catalog within the AWS Glue Data Catalog is not supported.

When a subscription request to managed AWS Glue Data Catalog assets is approved, Amazon SageMaker Unified Studio grants and manages access to the approved AWS Glue Data Catalog tables on your behalf through AWS Lake Formation. For the subscriber project, assets that are granted appear in the AWS Glue Data Catalog as resources in your account. You can then use Amazon Athena, Amazon Redshift, or Spark to query the tables.

For Amazon SageMaker Unified Studio to be able to grant access to AWS Glue Data Catalog tables, the following conditions must be met.
+ The AWS Glue table must be Lake Formation-managed since Amazon SageMaker Unified Studio grants access by managing Lake Formation permissions.
+ The IAM role of the project that has published the asset to the Amazon SageMaker Catalog must have the following AWS Lake Formation permissions:
  + `DESCRIBE` and `DESCRIBE GRANTABLE` permissions on the AWS Glue database that contains the published table.
  + `DESCRIBE`, `SELECT`, `DESCRIBE GRANTABLE`, and `SELECT GRANTABLE` permissions in Lake Formation on the published table itself.

For more information, see [Granting and revoking permissions on catalog resources](https://docs.aws.amazon.com/lake-formation/latest/dg/granting-catalog-permissions.html) in the *AWS Lake Formation Developer Guide*.
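Assuming these grants are issued through the Lake Formation `GrantPermissions` API, the required permission set for the publishing project's role can be sketched as request payloads. The account ID, role ARN, database, and table names below are placeholders.

```python
# Build the two Lake Formation GrantPermissions payloads listed above:
# grantable DESCRIBE on the database, and grantable DESCRIBE + SELECT on
# the published table. Each entry could be passed to boto3 as
#   boto3.client("lakeformation").grant_permissions(**entry)
def lf_publish_grants(role_arn: str, catalog_id: str,
                      database: str, table: str) -> list:
    principal = {"DataLakePrincipalIdentifier": role_arn}
    return [
        {   # DESCRIBE (with grant option) on the containing database
            "Principal": principal,
            "Resource": {"Database": {"CatalogId": catalog_id, "Name": database}},
            "Permissions": ["DESCRIBE"],
            "PermissionsWithGrantOption": ["DESCRIBE"],
        },
        {   # DESCRIBE and SELECT (with grant option) on the table itself
            "Principal": principal,
            "Resource": {"Table": {"CatalogId": catalog_id,
                                   "DatabaseName": database, "Name": table}},
            "Permissions": ["DESCRIBE", "SELECT"],
            "PermissionsWithGrantOption": ["DESCRIBE", "SELECT"],
        },
    ]

grants = lf_publish_grants(
    "arn:aws:iam::111122223333:role/project-role",
    "111122223333", "sales_db", "orders",
)
```

The `GRANTABLE` variants in the list above correspond to `PermissionsWithGrantOption` in the API, which is what lets Amazon SageMaker Unified Studio re-grant access to subscribers on your behalf.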

# Grant access to managed Amazon Redshift assets in Amazon SageMaker Unified Studio
Grant access to managed Amazon Redshift assets

When a subscription to an Amazon Redshift table or view is approved, Amazon SageMaker Unified Studio can automatically add the subscribed asset to the Amazon Redshift Serverless workgroup created for the project, so that members of the project can query the data using the Amazon Redshift query editor link within the project. Under the hood, Amazon SageMaker Unified Studio creates the necessary grants and datashares. 

The process of granting access varies depending on where the source database (publisher) and the target database (subscriber) are located. 
+ **Same cluster, same database** – If data must be shared within the same database, Amazon SageMaker Unified Studio grants permissions directly on the source table. 
+ **Same cluster, different database** – If data must be shared across two databases within the same cluster, Amazon SageMaker Unified Studio creates a view in the target database, and permissions are granted on the created view.
+ **Same account, different cluster** – Amazon SageMaker Unified Studio creates a datashare between the source and target clusters and creates a view on top of the shared table. Permissions are granted on the view.
+ **Cross-account** – Same as above, but an additional step is required to authorize the cross-account datashare on the producer cluster side, and another step to associate the datashare on the consumer cluster side.

Make sure that your publishing and subscribing Amazon Redshift clusters meet all requirements for Amazon Redshift datashares. For more information, see [Data sharing in Amazon Redshift](https://docs.aws.amazon.com/redshift/latest/dg/datashare-overview.html) in the *Amazon Redshift Database Developer Guide*.

**Note**  
Cross-Region data sharing using Amazon Redshift is not supported.

# Grant access for approved subscriptions to unmanaged assets in Amazon SageMaker Unified Studio
Grant access for approved subscriptions to unmanaged assets

Amazon SageMaker Unified Studio enables users to publish any type of asset in the Amazon SageMaker Catalog. For some of these assets, Amazon SageMaker Unified Studio can automatically manage access grants. These assets are called **managed assets** and include Lake Formation-managed AWS Glue Data Catalog tables and Amazon Redshift tables and views. All other assets, to which Amazon SageMaker Unified Studio can't automatically grant subscriptions, are called **unmanaged**.

Amazon SageMaker Unified Studio provides a path for you to manage access grants for your unmanaged assets. When a subscription to an asset in the Amazon SageMaker Catalog is approved by the data owner, Amazon SageMaker Unified Studio publishes an event to Amazon EventBridge in your account, along with all the necessary information in the payload that enables you to create the access grants between the source and the target. When you receive this event, you can trigger a custom handler that uses the information in the event to create the necessary grants or permissions. After you have granted the access, you can report back and update the status of the subscription in Amazon SageMaker Unified Studio so that it can notify the users who subscribed to the asset that they can start consuming it. 
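A custom handler for this flow might look like the following sketch. The `detail-type` string and `detail` keys are illustrative placeholders: inspect the actual EventBridge payload in your account before relying on specific field names.

```python
# Sketch of a handler (e.g. a Lambda target of an EventBridge rule) for the
# event published when a subscription to an unmanaged asset is approved.
def is_subscription_event(event: dict) -> bool:
    # DataZone-governed events arrive with source "aws.datazone"; the exact
    # detail-type for approvals is assumed here and should be verified.
    return (
        event.get("source") == "aws.datazone"
        and "Subscription" in event.get("detail-type", "")
    )

def handle(event: dict):
    """Return the subscription ID after creating the grants, else None."""
    if not is_subscription_event(event):
        return None
    detail = event["detail"]
    # Use the payload here to create the grants between source and target,
    # then report the grant status back so subscribers are notified.
    return detail.get("subscriptionId")

sample = {
    "source": "aws.datazone",
    "detail-type": "Subscription Approved",
    "detail": {"subscriptionId": "sub-123"},
}
```

In practice, the "report back" step updates the subscription's grant status through the service API so that Amazon SageMaker Unified Studio can notify the subscribers.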

## Set up cross-Region subscriptions


Cross-region subscriptions allow data consumers to subscribe to and access data assets published in a different AWS Region than their consuming project or environment.

With cross-region subscriptions, you can:
+ Subscribe to data published in a different Region than your consuming environment
+ Extend existing approved subscriptions to another Region

For AWS Glue assets, cross-region access is achieved through resource links. The original table remains in the source Region, and a resource link is created in the target Region for consumer access.
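A resource link of this kind can be expressed as a Glue `TableInput` whose `TargetTable` points at the source Region. This is a sketch only: Amazon SageMaker Unified Studio creates these links for you during fulfillment, and the `Region` field plus all names and IDs below are illustrative assumptions.

```python
# Build a Glue TableInput that defines a cross-Region resource link:
# the link lives in the target Region, while TargetTable points back at
# the original table in the source Region.
def resource_link_input(source_region: str, source_catalog_id: str,
                        source_database: str, source_table: str) -> dict:
    return {
        "Name": f"{source_table}_link",
        "TargetTable": {
            "CatalogId": source_catalog_id,
            "DatabaseName": source_database,
            "Name": source_table,
            "Region": source_region,  # points the link at the source Region
        },
    }

def create_resource_link(database: str, table_input: dict) -> None:
    # Requires boto3 and AWS credentials; not invoked in this sketch.
    import boto3
    boto3.client("glue").create_table(
        DatabaseName=database, TableInput=table_input
    )

link = resource_link_input("us-east-1", "111122223333", "sales_db", "orders")
```

Queries in the target Region then reference the link name, and Glue resolves reads through to the source-Region table.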

For Amazon Redshift assets, cross-region data sharing uses Redshift's native datashare functionality. For cross-account scenarios, AWS Resource Access Manager (AWS RAM) authorization is required.

### Supported assets and Regions


Cross-region subscriptions support AWS Glue tables, AWS Glue views, Amazon Redshift tables, and Amazon Redshift views across all standard (non-opt-in) AWS Regions. Cross-region subscriptions to opt-in Regions are not supported.

### Prerequisites


Before you enable cross-region subscriptions, you must have the following:
+ An existing Amazon DataZone or SageMaker Unified Studio domain
+ Permissions to manage blueprints, environments, and projects in your domain
+ For Glue assets: The appropriate data lake blueprint enabled in both source and target Regions
+ For Redshift assets: The appropriate data warehouse blueprint enabled in both source and target Regions

### Enabling cross-region subscriptions (DataZone domains - V1)


Complete the following steps to enable cross-region subscriptions in DataZone domains.

#### Step 1: Enable the blueprint in the target Region


1. Open the Amazon DataZone console at [https://console.aws.amazon.com/datazone/](https://console.aws.amazon.com/datazone/).

1. Choose your domain.

1. In the navigation pane, choose **Blueprints**.

1. Choose the appropriate blueprint:
   + For Glue assets, choose **DataLake**
   + For Redshift assets, choose **DataWarehouse**

1. If the blueprint is disabled, choose **Enable**.

#### Step 2: Create an environment profile


1. Sign in to the Amazon DataZone data portal.

1. Navigate to the subscriber project.

1. Choose **Create environment profile**.

1. For **Region**, select the Region that you enabled in Step 1.

1. Configure other settings as needed, and then choose **Create**.

#### Step 3: Create an environment


1. In the subscriber project, choose **Environments**.

1. Choose **Create environment**.

1. For **Environment profile**, select the environment profile that you created in Step 2.

1. Configure other settings as needed, and then choose **Create**.

#### Step 4: Subscribe to assets


1. Navigate to the data catalog and find the asset that you want to subscribe to.

1. Choose **Subscribe**.

1. Select the subscriber project with the cross-region environment.

1. Complete the subscription request.

The subscription is automatically fulfilled in the new Region. You can query the data from the environment in the new Region.

### Enabling cross-region subscriptions (SageMaker Unified Studio domains - V2)


Complete the following steps to enable cross-region subscriptions in SageMaker Unified Studio domains.

#### Step 1: Enable the blueprints in the target Region


1. Open the Amazon SageMaker Unified Studio portal.

1. Choose your domain.

1. In the navigation pane, choose **Blueprints**.

1. Enable the **Tooling** blueprint in the target Region. This is required for both Glue and Redshift assets.

1. Enable the appropriate asset blueprint in the target Region:
   + For Glue assets, choose **LakeHouseDatabase**
   + For Redshift assets, choose **RedshiftServerless**

1. Add the target Regions to each blueprint.

#### Step 2: Create a project profile


1. In the navigation pane, choose **Project profiles**.

1. Choose **Create project profile**.

1. For **Region**, select the Region that you enabled in Step 1.

1. Configure other settings as needed, and then choose **Create**.

#### Step 3: Create a project


1. In Amazon SageMaker Unified Studio, choose **Create project**.

1. For **Project profile**, select the project profile that you created in Step 2.

1. Configure other settings as needed, and then choose **Create**.

The project is provisioned in the target Region. Subscriptions to this project are automatically fulfilled in the target Region.

### Considerations


When working with cross-region subscriptions, keep the following in mind:
+ **Region restrictions** – Cross-region subscriptions are not supported in opt-in Regions.
+ **Blueprint requirements** – Blueprints must be enabled in both the source and target Regions before you can create cross-region subscriptions.
+ **Environment requirements (V1)** – Environments must exist in the target Region before subscriptions can be fulfilled to that Region.
+ **Project requirements (V2)** – In SageMaker Unified Studio domains, you cannot add new environments to existing projects through the console. To subscribe to assets in a new Region, you must create a new project with a project profile configured for that Region.
+ **Tooling blueprint (V2)** – The Tooling blueprint must be enabled in the target Region before enabling LakeHouseDatabase or RedshiftServerless blueprints.
+ **Cross-account Redshift sharing** – For cross-account Redshift data sharing, AWS RAM authorization is required on both the producer and consumer sides.

# Metadata enforcement rules for subscription requests


The metadata enforcement rules for subscription requests in Amazon SageMaker Unified Studio strengthen data governance by enabling domain unit owners to establish clear metadata requirements for data consumers, streamlining access requests and enhancing data governance.

The feature is supported in all the AWS commercial Regions where Amazon SageMaker Unified Studio is currently available.

Domain unit owners can complete the following procedure to configure metadata enforcement for subscription requests in Amazon SageMaker Unified Studio:

1. Navigate to Amazon SageMaker Unified Studio using the URL from your admin and log in using your SSO or AWS credentials. 

1. Choose **Govern** -> **Domain units** from the top navigation pane and then choose the domain or the domain unit that you want to work with.

1. Choose the **Rules** tab and then choose **Add**.

1. On the **Create required metadata form rule** page, do the following and then choose **Add rule**:
   + Specify a name for your rule.
   + Under **Action**, choose **Subscription request**.
   + Under **Required forms**, choose **Add metadata form**, choose a metadata form within the domain / domain unit that you want to add to this rule, and then choose **Add**. You can add up to 5 metadata forms per rule.
   + Under **Scope**, specify with which data entities you want to associate these forms. You can choose data products and/or data assets.
   + Under **Data asset types**, specify whether the rule applies across all asset types or limit it to selected asset types. 
   + Under **Projects**, specify whether the required forms will be associated with data products and/or assets published by all projects or only selected projects in this domain unit. Also, check **Cascade rule to child domain units** if you want child domain units to inherit this requirement. 

# Fine-grained access control to data

In the current release of Amazon SageMaker Unified Studio, fine-grained access control of your data is supported so you can have granular access control over your sensitive data. You can control which project can access specific records of data within your data assets published to the Amazon SageMaker Unified Studio business data catalog. Amazon SageMaker Unified Studio supports row and column filters to implement fine-grained access control.

Use **row filters** to restrict access to specific rows based on the criteria you define. For example, if your table contains data for two regions (America and Europe) and you want to ensure that employees in Europe can only access data relevant to their region, you can create a row filter that includes rows where the region is Europe (`region = 'Europe'`). This way, employees in Europe won't have access to America’s data.

Use **column filters** to limit access to specific columns within your data assets. For example, if your table includes sensitive information such as Personally Identifiable Information (PII), you can create a column filter to exclude PII columns. This ensures that subscribers can only access non-sensitive data.

To utilize fine-grained access control, you can create row and column filters for your AWS Glue and Amazon Redshift assets in Amazon SageMaker Unified Studio. When you receive a subscription request to access your data assets, you can approve it by applying the appropriate row and column filters. Amazon SageMaker Unified Studio ensures that the subscriber can only access the rows and columns permitted by the filters you applied at the time of subscription approval.
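
To illustrate the semantics only (the service enforces filters through AWS Lake Formation and Amazon Redshift, not in consumer code), the following Python sketch applies a row filter and a column filter to sample records:

```python
# Illustrative only: Amazon SageMaker Unified Studio enforces filters through
# AWS Lake Formation and Amazon Redshift, not in consumer code.
rows = [
    {"id": 1, "region": "Europe", "email": "a@example.com", "amount": 10},
    {"id": 2, "region": "America", "email": "b@example.com", "amount": 20},
]

def apply_filters(records, row_predicate, allowed_columns):
    """Keep rows matching the row filter and project only allowed columns."""
    return [{k: r[k] for k in allowed_columns}
            for r in records if row_predicate(r)]

# Row filter: region = 'Europe'; column filter: exclude the PII column email.
visible = apply_filters(rows, lambda r: r["region"] == "Europe",
                        ["id", "region", "amount"])
print(visible)  # → [{'id': 1, 'region': 'Europe', 'amount': 10}]
```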

**Topics**
+ [Limitations](#fine-grained-data-limitations)
+ [Create row filters in Amazon SageMaker Unified Studio](create-row-filter.md)
+ [Create column filters in Amazon SageMaker Unified Studio](create-column-filter.md)
+ [Delete row or column filters in Amazon SageMaker Unified Studio](delete-row-column-filter.md)
+ [Edit row or column filters in Amazon SageMaker Unified Studio](edit-row-column-filter.md)
+ [Grant access with filters in Amazon SageMaker Unified Studio](grant-access-with-filters.md)

## Limitations


When you configure row or column filters for fine-grained access control, filtering on columns whose names contain special characters affects which compute types can access the data.
+ If the column name contains special characters, adding an asset filter automatically wraps the column name in double quotes (`" "`) to escape the special characters. 

  As a result, the asset is not accessible by data processing compute engines such as Amazon EMR on EC2, Amazon EMR Serverless, or AWS Glue ETL. The asset is still accessible by other compute engines.

  To remove this limitation, either remove the filters on the column names that contain special characters, or rename the column to remove the special characters and re-create the filter.

# Create row filters in Amazon SageMaker Unified Studio

Amazon SageMaker Unified Studio allows you to create row filters that you can use when approving subscriptions to make sure that the subscriber can only access rows of data as defined in the row filters. To create a row filter, follow the steps below: 

1. Navigate to Amazon SageMaker Unified Studio using the URL from your admin and log in using your SSO or AWS credentials. 

1. Choose **Select project** from the top navigation pane and select the project to which the asset belongs.

1. Under **Project catalog** in the left side navigation, choose **Assets**.

1. Make sure you are on the **Inventory** tab, then choose the name of the asset that you want to create a row filter for. You can add row filters if your data asset in Amazon SageMaker Unified Studio is of type AWS Glue table, Amazon Redshift table, or Amazon Redshift view. You are then brought to the asset details page.

1. On the asset detail page, go to the **Asset filters** tab and then choose **Add asset filter**.

1. Configure the following fields:
   + **Name** – the name of the filter
   + **Description** – the description of the filter

1. Under filter type, choose **Row filter**.

1. Under row filter expression, provide one or more expressions for row filter.
   + Choose a column from the **Column** dropdown.
   + Choose an operator from the **Operator** dropdown.
   + Enter a value in the **Value** field.

1. To add another condition to your filter expression, choose **Add condition**.

1. When using multiple conditions in the row filter expression, choose **And** or **Or** to link the conditions.

1. Select an option to indicate whether or not the filter contains sensitive values that you want to hide from approved subscribers.

1. Choose **Create asset filter**.

# Create column filters in Amazon SageMaker Unified Studio

Amazon SageMaker Unified Studio enables you to create column filters that you can use when approving subscriptions to make sure that the subscriber can only access columns of data as defined in the column filters. To create a column filter, follow the steps below: 

1. Navigate to Amazon SageMaker Unified Studio using the URL from your admin and log in using your SSO or AWS credentials. 

1. Choose **Select project** from the top navigation pane and select the project to which the asset belongs.

1. Under **Project catalog** in the left side navigation, choose **Assets**.

1. Make sure you are on the **Inventory** tab, then choose the name of the asset that you want to create a column filter for. You can add column filters if your data asset in Amazon SageMaker Unified Studio is of type AWS Glue table, Amazon Redshift table, or Amazon Redshift view. You are then brought to the asset details page.

1. On the asset detail page, go to the **Asset filters** tab and then choose **Add asset filter**.

1. Configure the following fields:
   + **Name** – the name of the filter
   + **Description** – the description of the filter

1. Under filter type, choose **Column**.

1. Select the columns you want to include in the filters using the check boxes for the columns in the data asset. 

1. Choose **Create asset filter**.

# Delete row or column filters in Amazon SageMaker Unified Studio

To delete a row or a column filter, follow the steps below: 

1. Navigate to Amazon SageMaker Unified Studio using the URL from your admin and log in using your SSO or AWS credentials. 

1. Choose **Select project** from the top navigation pane and select the project to which the asset belongs.

1. Under **Project catalog** in the left side navigation, choose **Assets**.

1. Make sure you are on the **Inventory** tab, then choose the name of the asset where you want to delete a row or a column filter. 

1. On the asset details page, go to the **Asset filters** tab and then choose the name of the filter that you want to delete. 

1. Choose **Actions**, **Delete** and then confirm the deletion.

**Note**  
You can delete a filter only if it is not being used in active subscriptions.

# Edit row or column filters in Amazon SageMaker Unified Studio

To edit a row or a column filter, follow the steps below: 

1. Navigate to Amazon SageMaker Unified Studio using the URL from your admin and log in using your SSO or AWS credentials. 

1. Choose **Select project** from the top navigation pane and select the project to which the asset belongs.

1. Under **Project catalog** in the left side navigation, choose **Assets**.

1. Make sure you are on the **Inventory** tab, then choose the name of the asset that contains the filter that you want to edit.

1. On the asset detail page, go to **Asset filters** tab and then choose the name of the filter that you want to edit.

1. You can edit the following fields:
   + **Name** – the name of the filter
   + **Description** – the description of the filter

1. If you're editing a row filter, you can update the row filter expression.

1. If you're editing a column filter, you can add or remove the columns selected in the filter. 

1. After you have made the changes, choose **Edit asset filter**.

**Note**  
 If you edit a filter that is being used in active subscriptions, Amazon SageMaker Unified Studio will automatically update the permissions granted to the subscriber projects. This means that the subscribers will only be able to access the rows or columns as defined in the updated filter, ensuring that your data access policies are consistently enforced.

# Grant access with filters in Amazon SageMaker Unified Studio

Amazon SageMaker Unified Studio enables fine-grained access control by translating the defined row and column filters into appropriate grants for AWS Lake Formation and Amazon Redshift. The following sections explain how Amazon SageMaker Unified Studio materializes these filters for both AWS Glue tables and Amazon Redshift.

## AWS Glue tables


When a subscription to an AWS Glue table with row and/or column filters is approved, Amazon SageMaker Unified Studio materializes the subscription by creating grants in AWS Lake Formation with Data Cell Filters, ensuring that the members of the subscriber project are only able to access the rows and columns they are allowed to access based on the filters applied to the subscription. 

Amazon SageMaker Unified Studio first translates the row and column filters applied in Amazon SageMaker Unified Studio to AWS Lake Formation data cell filters. If multiple row and column filters are used, Amazon SageMaker Unified Studio unions all the columns and all the row filter conditions to compute effective permissions at both the row and column level. Amazon SageMaker Unified Studio then creates a single AWS Lake Formation data cell filter using the effective row and column permissions. 
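
The merging step described above can be sketched as follows; this is an illustration of the union semantics, not the service's actual implementation:

```python
def effective_filter(filters):
    """Merge multiple approved filters into one effective permission set:
    union of all columns, OR of all row filter conditions (sketch)."""
    columns = sorted({c for f in filters for c in f["columns"]})
    row_expr = " OR ".join(f"({f['row_condition']})"
                           for f in filters if f["row_condition"])
    return {"columns": columns, "row_filter": row_expr or "TRUE"}

merged = effective_filter([
    {"columns": ["id", "region"], "row_condition": "region = 'Europe'"},
    {"columns": ["id", "amount"], "row_condition": "amount < 100"},
])
# merged["row_filter"] is "(region = 'Europe') OR (amount < 100)" and
# merged["columns"] is ['amount', 'id', 'region'].
```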

After the data cell filter is created, Amazon SageMaker Unified Studio shares the subscribed table with the subscriber project by creating read-only (SELECT) permissions in AWS Lake Formation using this data cell filter. 

## Amazon Redshift


When a subscription to an Amazon Redshift table/view with row and/or column filters is approved, Amazon SageMaker Unified Studio materializes the subscription by creating scoped-down late binding views in Amazon Redshift, ensuring that the members of the subscriber project are only able to access the rows and columns they are allowed to access based on the row and column filters applied to the subscription. 

Amazon SageMaker Unified Studio first translates the row and column filters applied to a subscription into an Amazon Redshift late binding view. If multiple row and column filters are used, Amazon SageMaker Unified Studio unions all the columns and all the row filter conditions to compute effective permissions at both the row and column level. Amazon SageMaker Unified Studio then creates the late binding view using the effective row and column permissions. 

After the late binding view is created, Amazon SageMaker Unified Studio shares this view with the members of subscriber project by creating read-only (SELECT) permissions in Amazon Redshift.
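
As an illustration of the kind of scoped-down late binding view this produces, the following Python sketch renders one from a column list and a row filter. The view and table names are hypothetical, and the view text that Amazon SageMaker Unified Studio actually generates may differ.

```python
def late_binding_view_sql(view_name, source_table, columns, row_filter=None):
    """Render an illustrative scoped-down late binding view; the view text
    the service actually generates may differ."""
    sql = (f"CREATE VIEW {view_name} AS "
           f"SELECT {', '.join(columns)} FROM {source_table}")
    if row_filter:
        sql += f" WHERE {row_filter}"
    # WITH NO SCHEMA BINDING is what makes this a late binding view.
    return sql + " WITH NO SCHEMA BINDING;"

# Hypothetical names: a view scoped to non-PII columns and European rows.
sql = late_binding_view_sql("shared.orders_v", "sales.orders",
                            ["id", "region", "amount"], "region = 'Europe'")
```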

# Exporting asset metadata


In the current release of Amazon SageMaker Unified Studio, you can export asset metadata as an Apache Iceberg table through Amazon S3 Tables. This allows data teams to query catalog inventory using standard SQL and answer questions such as "How many assets were registered last month?", "Which assets are classified as confidential?", or "Which assets lack business descriptions?" without building custom ETL infrastructure for reporting.

This capability automatically converts catalog asset metadata into a queryable table accessible from Amazon Athena, Amazon SageMaker Unified Studio notebooks, AI agents, and other analytics and BI tools. The exported table includes technical metadata (such as resource_id and resource_type), business metadata (such as asset_name and business_description), ownership details, and timestamps. Data is partitioned by snapshot_date for query performance and automatically appears in Amazon SageMaker Unified Studio under the aws-sagemaker-catalog bucket. 

**Note**  
You can enable asset metadata export for only one domain per AWS account per Region. To enable export for a different domain, complete the following steps:  

1. Delete the export configuration for the currently enabled domain using the DeleteDataExportConfiguration operation.

1. Delete the asset S3 table under the aws-sagemaker-catalog S3 table bucket. We recommend backing up the S3 table before deletion.

1. Call the PutDataExportConfiguration API to enable export for the new domain.

Also, the encryption configuration for the exported asset table cannot be updated.

This capability is available in all AWS Regions where Amazon SageMaker Catalog is supported at no additional charge. You pay only for underlying services including S3 Tables storage and Amazon Athena queries. You can control storage costs by setting retention policies on S3 tables to automatically remove records older than your specified period. For more information, see [https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-tables-record-expiration.html](https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-tables-record-expiration.html). 

## Requirements


Before you export asset metadata from Amazon SageMaker Unified Studio, make sure that you meet the following requirements:

### IAM permissions


The IAM user or role that runs the export commands must have permissions to configure asset metadata export in Amazon DataZone and to create the Amazon S3 Tables resources used to store the exported metadata.

The following example IAM policy shows the required permissions:

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "datazone:GetDataExportConfiguration",
                "datazone:PutDataExportConfiguration"
            ],
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3tables:CreateTableBucket",
                "s3tables:PutTableBucketPolicy"
            ],
            "Resource": "arn:aws:s3tables:<REGION>:<ACCOUNT_ID>:bucket/aws-sagemaker-catalog"
        }
    ]
}
```

These permissions allow you to retrieve and update the asset metadata export configuration. They also allow Amazon DataZone to create the Amazon SageMaker Catalog S3 table bucket and apply the required bucket policy on your behalf, enabling DataZone to write the exported metadata. 

For KMS related permission requirements, see [KMS permissions for exporting asset metadata in Amazon SageMaker Unified Studio](https://docs.aws.amazon.com/sagemaker-unified-studio/latest/adminguide/sagemaker-unified-studio-export-asset-metadata-kms-permissions.html).

## Start exporting asset metadata


To get started, activate dataset export by invoking the `PutDataExportConfiguration` API action. In response, the DataZone service creates an S3 table bucket named aws-sagemaker-catalog with an asset_metadata namespace and an empty asset table. It also schedules a daily job to export updated asset data. Within 24 hours, you can access the asset table through the S3 Tables console or the Data tab in Amazon SageMaker Unified Studio. You can query the data using Amazon Athena or Studio notebooks, or connect external BI tools through the S3 Tables Iceberg REST Catalog endpoint.

Asset metadata is exported once a day, around midnight local time in each AWS Region.

Enable data export:

```
aws datazone put-data-export-configuration  --domain-identifier dzd-440699i00ezy21 --region us-east-2 --enable-export
```

With KMS key encryption configuration:

```
aws datazone put-data-export-configuration --encryption-configuration kmsKeyArn=arn:aws:kms:us-east-2:651673343886:key/292fedfe-c9h6-40fa-961b-87393584195c,sseAlgorithm=aws:kms --enable-export --region us-east-2 --domain-identifier dzd-440699i00ezy21
```

For more information, see the [API reference documentation](https://docs.aws.amazon.com/datazone/latest/APIReference/API_PutDataExportConfiguration.html).

## Asset table schema


The asset metadata is exported to the following table structure:


| Column name | Data type | Comment | 
| --- | --- | --- | 
| snapshot_time | timestamp | Timestamp when this metadata snapshot was captured from Amazon SageMaker. Partition key for temporal queries. Use for point-in-time analysis and time-travel queries (for example, "show catalog state 30 days ago"). | 
| asset_id | string | Unique identifier assigned by Amazon SageMaker Catalog for this asset. Primary key for asset lookups. | 
| resource_type_enum | string | Type of data resource cataloged. Values: GlueTable, RedshiftTable, S3Collection, etc. Used for filtering by resource category. | 
| resource_id | string | Platform-native unique identifier for the resource. For AWS: ARN format (for example, arn:aws:glue:region:account:table/db/table). Use for cross-referencing with source systems. | 
| account_id | string | Cloud account or organizational identifier. For AWS: 12-digit account ID. Used for multi-account queries and cost allocation. | 
| region | string | Geographic region or availability zone where the resource is hosted. For AWS: us-east-1, eu-west-1, etc. NULL for region-agnostic resources. Use for regional analysis and compliance queries. | 
| catalog | string | Top-level namespace identifier. Meaning varies by resource_type: For GLUE_TABLE: AWS account ID. For REDSHIFT_TABLE: database name. For S3_COLLECTION: AWS account ID. Use for catalog-level grouping. | 
| namespace | string | Mid-level namespace identifier. Meaning varies by resource_type: For GLUE_TABLE: database name. For REDSHIFT_TABLE: schema name. For S3_COLLECTION: bucket name. May contain multiple levels (for example, database.schema). | 
| asset_name | string | Business-friendly name for this asset. Optional business identifier that can differ from the technical resource_name. NULL if no custom name was provided during asset registration. Examples: 'Customer Master Dataset', 'Q4 Sales Report', 'Production Orders Table'. Use for user-friendly display and business-oriented searches. | 
| resource_name | string | Leaf-level resource identifier. For tables: table name. For S3: prefix path. For views: view name. Use for resource-level filtering. | 
| resource_description | string | Technical description from the source system. For Glue: table comment field. For Redshift: table remarks. NULL if the source system has no description. Use for technical documentation searches. | 
| business_description | string | Business-friendly description provided to Amazon SageMaker Catalog. Explains the purpose, usage, and business context of the asset. NULL if not provided by the data owner. Use for business glossary and discovery queries. | 
| extended_metadata | map<string,string> | Flexible key-value store for platform-specific attributes not covered by standard columns. Examples: s3_location, compression_type, column_count, partition_keys, namespace_name, cluster_name. Use for advanced filtering and platform-specific queries. Schema varies by platform and resource_type. | 
| asset_created_time | timestamp | Timestamp when this asset was first created in Amazon SageMaker Catalog. | 
| asset_updated_time | timestamp | Timestamp when this asset was last updated in Amazon SageMaker Catalog. | 

## Querying asset tables


To query Amazon S3 tables using Amazon SageMaker Unified Studio or Amazon Athena, you must first do the following:
+ Enable analytic services integration by following [https://docs.aws.amazon.com/lake-formation/latest/dg/enable-s3-tables-catalog-integration.html](https://docs.aws.amazon.com/lake-formation/latest/dg/enable-s3-tables-catalog-integration.html)
+ Grant the query role permission in Lake Formation by following [https://docs.aws.amazon.com/lake-formation/latest/dg/granting-table-permissions.html](https://docs.aws.amazon.com/lake-formation/latest/dg/granting-table-permissions.html)

  Example CLI command: 

  ```
  aws lakeformation grant-permissions \
      --principal DataLakePrincipalIdentifier=arn:aws:iam::123456789012:role/datazone_usr_role_3guzb15tfpk015_agjdowt5f47xgp \
      --resource '{"Table": {"CatalogId": "123456789012:s3tablescatalog/aws-sagemaker-catalog", "DatabaseName": "asset_metadata", "Name": "asset"}}' \
      --permissions DESCRIBE SELECT --region us-east-2
  ```

### Assets registered in the last month

+ Query with sample aggregates

  ```
  SELECT 
      DATE(asset_created_time) as date,
      resource_type_enum,
      COUNT(*) as count
  FROM asset_metadata.asset
  WHERE DATE(snapshot_time) = CURRENT_DATE
      AND asset_created_time >= DATE_ADD('month', -1, CURRENT_DATE)
  GROUP BY DATE(asset_created_time), resource_type_enum
  ORDER BY date DESC;
  ```
+ Plain query without aggregates and groupBy

  ```
  SELECT *
  FROM asset_metadata.asset
  WHERE DATE(snapshot_time) = CURRENT_DATE
      AND asset_created_time >= DATE_ADD('month', -1, CURRENT_DATE)
  ```

### Assets without a business description or owningEntityId


```
SELECT 
    asset_id,
    asset_name,
    resource_name,
    resource_type_enum,
    account_id,
    business_description,
    extended_metadata['owningEntityId'] as owner
FROM asset_metadata.asset
WHERE DATE(snapshot_time) = CURRENT_DATE
    AND (business_description IS NULL 
         OR extended_metadata['owningEntityId'] IS NULL);
```

### Query asset matching metadata form field values


```
SELECT *
FROM asset_metadata.asset
WHERE DATE(snapshot_time) = CURRENT_DATE
    AND extended_metadata['<metadata-form-name>.<field-name>'] = '<field-value>';
```

### Asset distribution queries

+ Get distributions by account

  ```
  SELECT 
      account_id,
      resource_type_enum,
      COUNT(*) as count
  FROM asset_metadata.asset
  WHERE DATE(snapshot_time) = CURRENT_DATE
  GROUP BY account_id, resource_type_enum
  ORDER BY count DESC
  ```
+ Get distribution by asset owner (projectIds)

  ```
  SELECT 
      extended_metadata['owningEntityId'] as owner,
      COUNT(*) as count
  FROM asset_metadata.asset
  WHERE DATE(snapshot_time) = CURRENT_DATE
      AND extended_metadata['owningEntityId'] IS NOT NULL
  GROUP BY extended_metadata['owningEntityId']
  ORDER BY count DESC;
  ```

### Time travel capabilities


The asset_metadata.asset table captures daily snapshots of asset metadata, allowing you to view the state of the data catalog at any point in time. Change the date filter in your query to travel back to any previous snapshot.

**Note**  
Querying without a snapshot_time filter reads all historical snapshots, resulting in duplicate records and slower performance. Always filter by the desired date or the current timestamp.
+ View the current assets snapshot

  ```
  SELECT * 
  FROM asset_metadata.asset
  WHERE DATE(snapshot_time) = CURRENT_DATE;
  ```
+ Travel to a specific date (for example, November 26, 2025)

  ```
  SELECT *
  FROM asset_metadata.asset
  WHERE DATE(snapshot_time) = DATE('2025-11-26');
  ```
+ Travel back relative to today (for example, by 2 days)

  ```
  SELECT *
  FROM asset_metadata.asset 
  WHERE DATE(snapshot_time) = date_add('day', -2, CURRENT_DATE);
  ```

### Common use cases


1. Track metadata improvements. See which assets gained descriptions or ownership over time:

   ```
   SELECT 
       t.asset_id,
       t.resource_name,
       p.business_description as description_before,
       t.business_description as description_now
   FROM asset_metadata.asset t
   JOIN asset_metadata.asset p ON t.asset_id = p.asset_id
   WHERE DATE(t.snapshot_time) = CURRENT_DATE
       AND DATE(p.snapshot_time) = CURRENT_DATE - INTERVAL '7' DAY
       AND p.business_description IS NULL 
       AND t.business_description IS NOT NULL;
   ```

1. Monitor asset growth. View how the data catalog has grown over the last 30 days:

   ```
   SELECT 
       DATE(snapshot_time) as date,
       COUNT(*) as total_assets
   FROM asset_metadata.asset
   WHERE DATE(snapshot_time) >= CURRENT_DATE - INTERVAL '30' DAY
   GROUP BY DATE(snapshot_time)
   ORDER BY date DESC;
   ```

1. Audit historical changes. Investigate what an asset looked like at a specific point in time:

   ```
   SELECT 
       asset_id,
       resource_name,
       business_description,
       extended_metadata['owningEntityId'] as owner,
       snapshot_time
   FROM asset_metadata.asset
   WHERE asset_id = 'your-asset-id'
       AND DATE(snapshot_time) = DATE('2025-11-26');
   ```

# Working with Catalog in IAM-based domains


Amazon SageMaker Unified Studio provides integrated data catalog functionality that allows you to discover, organize, and manage your data assets. The catalog integrates with AWS Glue Data Catalog to provide a unified view of your databases, tables, and Amazon S3 buckets within your project.

You can use the catalog to browse existing AWS Glue Data Catalog assets and Amazon S3 buckets that your project's execution role has access to.

## Key capabilities

+ Browse accessible data assets - Navigate through AWS Glue catalogs and Amazon S3 buckets you have access to using hierarchical browsing
+ Search across all data - Find tables, models, and other assets across the environment using global search, including data you may not have access to
+ Create tables from uploaded data - Upload CSV, JSON, Parquet, or delimited files to create new catalog tables
+ Manage Amazon S3 bucket contents - Browse Amazon S3 bucket hierarchies, create folders, upload files, and manage Amazon S3 objects

# Browse data you have access to


Use browsing when you know where your data is located. You can also use it to explore your existing data assets. Expand catalog hierarchies to view databases and tables. Navigate through Amazon S3 bucket folders to find specific files and objects.

**To browse data assets you have access to**

1. In Amazon SageMaker Unified Studio, choose **Data** in the left navigation panel.

1. To browse catalog assets:

   1. Choose the **Catalogs** tab to view available data catalogs.

   1. Expand AwsDataCatalog to see your databases.

   1. Expand any database to view its tables.

   1. Choose a table to view its schema, columns, and metadata in the right panel.

   1. For a selected table, use the **Overview**, **Preview data**, and **Details** tabs to explore additional metadata and sample data.

1. To browse Amazon S3 bucket contents:

   1. Choose the **Amazon S3 buckets** tab to view accessible Amazon S3 buckets.

   1. Expand any Amazon S3 bucket to navigate through its folder structure.

   1. Choose folders to view their contents.

   1. Select files to view their properties and metadata.

# Search and find data


**Note**  
For files in Amazon S3, the search is limited to locations that can be accessed by your IAM role.

**To search and find data**

1. Use the search bar at the top of the Amazon SageMaker Unified Studio interface.

1. Enter your search terms, such as:
   + Table names (for example, "churn")
   + Database names
   + Model names
   + Asset descriptions or metadata

1. Use the search suggestions and filters to refine your results:

   1. Select **Show all results for "[search term]"** to see all the results.

   1. Use the **Tables**, **Files**, and **Models** filter tabs to focus on specific asset types.

1. Choose any search result to view available details and metadata.

1. Use the **Select**, **Open**, and **Open in new tab** options to work with assets.

# Create new catalog table


**To add data to a catalog**

1. In the data explorer, choose the **Add** button.

1. In the **Add data** dialog, choose one of the following options:
   + Create table: Upload data files to create a new table
   + Create Amazon S3 Tables catalog: Create a new catalog for your Amazon S3 Tables
   + Add connection: Connect to first- and third-party data sources
   + Add Amazon S3 location: Add an existing Amazon S3 location

1. To create a table from uploaded data:

   1. Choose the **Create table** option and select **Next**.

   1. In the file upload area, either drag and drop your file or select **Choose file** to browse for your data file.

   1. Configure the table properties:
      + Table type: Select Amazon S3/external table
      + Catalog name: Choose the target catalog
      + Database: Select the database
      + Table name: Enter a name for your table

   1. Set the data format options:
      + Data format: Choose CSV, JSON, or Parquet
      + Compression type: Select compression if applicable
      + Delimiter: Specify the field separator for CSV files
      + Ignore first row: Check if the first row contains column headers

   1. Choose **Next** to proceed to schema configuration.

   1. In the Schema section:
      + Review the automatically detected column names and data types.
      + Modify column names by editing the text fields.
      + Change the data types of columns as needed.

   1. Choose **Create table** to complete the process.

1. The new table will appear in your catalog with the specified schema and uploaded data.

# Work with Amazon S3 buckets


**To manage an Amazon S3 bucket**

1. In the data explorer, choose the **Amazon S3 buckets** tab.

1. Expand the Amazon S3 bucket nodes to view available buckets in your account.

1. Select a bucket to view its contents.

1. To create a new folder:

   1. Choose the **Actions** menu in the bucket view.

   1. Select **Create folder**.

   1. Enter a folder name and choose **Create**.

1. To upload files:

   1. Choose **Upload files** from the **Actions** menu.

   1. Select files from your local system.

   1. Monitor the upload progress.

1. To create a table from existing Amazon S3 data:

   1. Choose **Create table from contents** from the **Actions** menu.

   1. Follow the table creation workflow to define schema and properties.

# Working with an existing AWS Glue Data Catalog IRC


This section outlines the procedure for onboarding existing AWS Glue IRC federated catalogs managed by AWS Lake Formation into an Amazon SageMaker Unified Studio domain. Successful onboarding requires that the data lake administrator grant appropriate permissions to the Studio role within AWS Lake Formation.

## IAM based domain


### Prerequisites

+ Amazon SageMaker Unified Studio deployed with IAM-based domain mode
+ Existing AWS Glue federated catalog managed by AWS Lake Formation
+ Data Lake Administrator credentials

### Step by Step Instructions


**Step 1: Retrieve the Project Execution Role**

1. Access your Amazon SageMaker Unified Studio project

1. Locate and copy the project execution role ARN

**Step 2: Configure Lake Formation Permissions**

1. Sign in to the AWS Management Console using Data Lake Administrator credentials

1. Navigate to AWS Lake Formation and grant permissions (select one option):  
Option 1: Full Catalog Access (Recommended for Admin project)  
Grant the execution role super user permissions on the federated catalog. The execution role receives complete access to discover and query all databases and tables within the federated catalog.  
Option 2: Granular Access Control (Recommended for non-Admin project)  
Apply least-privilege permissions by granting specific access levels:  

   1. Catalog Level: Grant DESCRIBE permission on the catalog to the execution role

   1. Database Level: Grant DESCRIBE permission on the target database(s) to the execution role

   1. Table Level: Grant SELECT and DESCRIBE permissions on the target table(s) to the execution role

**Step 3: Query federated resources from Unified Studio**

1. Use the Query Editor:

   1. The federated catalogs are now discoverable in the Query Editor, where you can also query them.

1. Use a Data notebook:

   1. To query with a Data notebook, navigate to the notebooks tab in the left navigation panel.

   1. Create a notebook; you can now run SELECT statements against the federated catalog.

   1. For Athena (SQL), you can run a query such as the following:

      ```
      SELECT * FROM "smus_lfuc_poc"."lfuc"."customer" LIMIT 100
      ```

   1. For Athena (Spark), add the following configuration to enable federated catalog querying:

      ```
      SET `spark.sql.catalog.<catalog_name>`=`org.apache.iceberg.spark.SparkCatalog`;
      SET `spark.sql.catalog.<catalog_name>.catalog-impl`=`org.apache.iceberg.aws.glue.GlueCatalog`;
      SET `spark.sql.catalog.<catalog_name>.glue.id`=`<account_id>:<federated_catalog_name>`;
      SET `spark.sql.catalog.<catalog_name>.glue.lakeformation-enabled` = `true`;
      SET `spark.sql.catalog.<catalog_name>.glue.account-id` = `<accountid>`;
      SET `spark.sql.catalog.<catalog_name>.client.region` = `<region>`;
      ```

   1. Query the catalog by running the following:

      ```
      SELECT * FROM <federated_catalog_name>.<database_name>.<table_name>
      ```
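
The SET statements above follow a fixed template, so they can be rendered programmatically when you work with several federated catalogs. The helper below is a hypothetical sketch for illustration (the function name and parameters are not part of the product):

```python
# Hypothetical helper: renders the Spark SQL SET statements shown above
# for a given federated catalog. Not an official API; just string templating.
def spark_catalog_config(catalog_name, account_id, federated_catalog_name, region):
    prefix = f"spark.sql.catalog.{catalog_name}"
    settings = {
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog",
        f"{prefix}.glue.id": f"{account_id}:{federated_catalog_name}",
        f"{prefix}.glue.lakeformation-enabled": "true",
        f"{prefix}.glue.account-id": account_id,
        f"{prefix}.client.region": region,
    }
    return [f"SET `{key}`=`{value}`;" for key, value in settings.items()]

for line in spark_catalog_config("my_catalog", "111122223333", "lfuc", "us-east-1"):
    print(line)
```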

## Identity Center based domain


### Prerequisites

+ Amazon SageMaker Unified Studio deployed with IAM Identity Center-based domain mode
+ Existing AWS Glue federated catalog managed by AWS Lake Formation
+ Data Lake Administrator credentials

### Step by Step Instructions


**Step 1: Retrieve the Project IAM Role**

1. Access your Amazon SageMaker Unified Studio project

1. Locate and copy the project IAM role ARN

**Step 2: Configure Lake Formation Permissions**

1. Sign in to the AWS Management Console using Data Lake Administrator credentials

1. Navigate to AWS Lake Formation and grant permissions (select one option):  
Option 1: Full Catalog Access  
Grant the project role super user permissions on the federated catalog. The project role receives complete access to discover and query all databases and tables within the federated catalog.  
Option 2: Granular Access Control  
Apply least-privilege permissions by granting specific access levels:  

   1. Catalog Level: Grant DESCRIBE permission on the catalog to the project role

   1. Database Level: Grant DESCRIBE permission on the target database(s) to the project role

   1. Table Level: Grant SELECT and DESCRIBE permissions on the target table(s) to the project role

**Step 3: Query federated resources from Unified Studio**
+ Log in to Amazon SageMaker Unified Studio as an IAM Identity Center user and access the federated resource from the data explorer. You can select a resource and query it with Athena and Amazon Redshift.

# Data and catalog connections in IAM-based domains


Amazon SageMaker Unified Studio notebooks can connect to multiple data sources including Amazon S3, AWS Glue Data Catalog, Amazon Athena, Amazon Redshift, and third-party sources. You can query data directly from these sources using SQL cells or Python code. The notebook interface provides built-in connectors for AWS services and supports custom connections for external data sources. Data connections are configured at the project level and shared across notebooks.

## Prerequisites


1. Configured data connections in your Amazon SageMaker Unified Studio project

1. Appropriate IAM permissions to access data sources

1. Network connectivity to external data sources if applicable

## Supported data connections


Amazon SageMaker Unified Studio supports the following data connections for IAM-based domains:

### Databases and data warehouses

+ Amazon DocumentDB
+ Amazon DynamoDB
+ Amazon Redshift
+ Aurora MySQL
+ Aurora PostgreSQL
+ Azure SQL
+ Google BigQuery
+ Microsoft SQL Server
+ MySQL
+ Oracle
+ PostgreSQL
+ Snowflake

### Storage

+ Amazon S3

## AWS resources created by connections


When you create a connection in Amazon SageMaker Unified Studio, the following resources are created in your AWS account(s) behind the scenes:
+ AWS Glue connection - a connection object that stores core connection information.

Those resources are visible in the account where the Amazon SageMaker Unified Studio domain is hosted, and you can discover and describe them through the console or the API/SDK/CLI of the corresponding service (in this case, AWS Glue).

# Connecting to a new data source


## Domain and project VPC configuration


The Amazon DocumentDB, Amazon Redshift, Aurora MySQL, Aurora PostgreSQL, Azure SQL, PostgreSQL, Oracle, and Microsoft SQL Server data sources require your project to be configured with a VPC ([Step 1](https://docs.aws.amazon.com/sagemaker-unified-studio/latest/adminguide/configure-vpc-networking-iam-based-domains.html), [Step 2](https://docs.aws.amazon.com/sagemaker-unified-studio/latest/adminguide/update-individual-projects-vpc.html)). You might need to contact your administrator to complete that configuration.

**To connect to a new data source**

1. In the navigation pane, choose **Connections**.

1. Choose **Create connection**.

1. Select the connection type in the gallery that opens.

1. For **Name** enter a descriptive name for your connection.

1. Configure the connection parameters based on your selected connection type.

   1. You can select [SSL connection](#ssl-connection) for supported data sources.

1. Configure the authentication details. You can either enter **Username** and **Password** directly in the connection details, or use a secret in AWS Secrets Manager that stores the credentials. You might need to contact your administrator to create a new secret for the connection.

1. Choose **Create connection**.

1. If all validations pass, a new connection will be created.

Your connection will appear under the **Data** section in the navigation pane. In that section, select **Connections** to see the list of all available connections. From there you can see connection details such as **Name**, **Connection type**, and **Authorization mode**. You can also check the connection status.

## SSL connection


You can configure SSL connection for Amazon DocumentDB, Amazon Redshift, Aurora MySQL, Aurora PostgreSQL, Azure SQL, Microsoft SQL Server, MySQL, Oracle, and PostgreSQL connection types.

SSL connection enables encrypted communication between Amazon SageMaker Unified Studio and your data source. When enabled, all data transmitted between your notebooks and the database is encrypted using SSL/TLS protocols, protecting sensitive information in transit. This setting is recommended for production environments and required when connecting to databases that enforce SSL connections.

To enable an SSL connection, select the **Enforce SSL** checkbox when configuring connection properties.

**Note**  
Enabling SSL may slightly increase connection latency due to the encryption overhead.

# Connecting to Amazon S3


You can create a data connection to Amazon S3 when you need to directly access files stored in Amazon S3 buckets from your notebooks. This connection is only required if you want to read or write individual files (such as CSV, JSON, or Parquet files) directly from Amazon S3 storage. If you are working with Data Catalog tables that are backed by Amazon S3, you do not need to create a separate Amazon S3 connection, you can access those tables directly through the catalog.
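
When you read or write individual files through an Amazon S3 connection, you typically address objects by S3 URI, while most SDK calls take the bucket and key separately. The snippet below is an illustrative sketch (not part of any SageMaker API) of that split:

```python
from urllib.parse import urlparse

# Illustrative helper: split an s3:// URI into bucket and key.
# Most SDK calls (for example, boto3's get_object) take these separately.
def split_s3_uri(uri):
    parsed = urlparse(uri)
    if parsed.scheme != "s3":
        raise ValueError(f"not an S3 URI: {uri}")
    return parsed.netloc, parsed.path.lstrip("/")

bucket, key = split_s3_uri("s3://amzn-s3-demo-bucket/data/sales.csv")
print(bucket, key)  # amzn-s3-demo-bucket data/sales.csv
```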

Before connecting to Amazon S3, complete one of the following prerequisite options:
+ [Prerequisite option 1 (recommended): Gain access using an access role](https://docs.aws.amazon.com/sagemaker-unified-studio/latest/userguide/adding-existing-s3-data.html#adding-existing-s3-access-role)
+ [Prerequisite option 2: Gain access using the project role](https://docs.aws.amazon.com/sagemaker-unified-studio/latest/userguide/adding-existing-s3-data.html#adding-existing-s3-project-role)

**To connect to Amazon S3**

1. In the navigation pane, choose **Connections**.

1. Choose **Create connection**.

1. In the gallery that opens, select Amazon S3.

1. For **Name** enter a descriptive name for your connection.

1. (Optional) Enter an S3 URI. If you do not specify an Amazon S3 URI, SageMaker Unified Studio lists all buckets accessible with the provided credentials.

1. For **AWS Region**, enter the AWS Region where the S3 bucket is located.

1. (Optional) For **Access role ARN**, select an existing IAM role from the dropdown. You might need to contact your administrator to configure an access role if you are connecting to an S3 bucket in an AWS account that is different from the AWS account where your SageMaker Unified Studio domain is hosted.

1. Choose **Create connection**.

1. If all validations pass, a new Amazon S3 connection will be created.

After creating the connection, you can use it in your notebooks to read and write files directly from the specified S3 location. You can also view all the buckets you connected to by selecting **Data** in the navigation pane and then selecting the **S3 buckets** tab.

# Connecting to Amazon Redshift


You can create a data connection to Amazon Redshift to query data warehouses from Amazon SageMaker Unified Studio. You can connect to both provisioned clusters and serverless workgroups.

## VPC Requirements


Amazon Redshift connections use different connection methods depending on the tool you are using in Amazon SageMaker Unified Studio.
+ Visual ETL and data processing jobs require your Amazon Redshift database to be in the same VPC as the Amazon SageMaker Unified Studio domain. To configure this, see the following documentation: [Step 1](https://docs.aws.amazon.com/sagemaker-unified-studio/latest/adminguide/configure-vpc-networking-iam-based-domains.html), [Step 2](https://docs.aws.amazon.com/sagemaker-unified-studio/latest/adminguide/update-individual-projects-vpc.html). You might need to contact your administrator to complete that configuration.
+ Query Editor and Data notebooks do not require VPC configuration.

## Steps to connect to Amazon Redshift from Amazon SageMaker Unified Studio


1. In the navigation pane, choose **Connections**.

1. Choose **Create connection**.

1. In the gallery that opens, select Amazon Redshift.

1. For **Name** enter a descriptive name for your connection.

1. Configure the connection properties:

   1. **Redshift compute** - Select existing compute or enter a JDBC URL.

      1. JDBC URL format: `jdbc:redshift://endpoint:port/database`.

   1. **EnforceSSL** - Select to enable encrypted communication (recommended).

   1. **JDBC URL Parameters** - Optional configuration parameters for the JDBC/ODBC driver in the following format: `<key1>=<value1>;<key2>=<value2>`.

1. For authentication, select one of the following options:

   1. **IAM** - generates a database username based on your IAM identity. You can apply grants directly to this user, or use the RedshiftDbUser principal tag on your project execution role to connect as a different database user.

   1. **Username and password** - Provide a username and password for the database that you are connecting to. We will store your credentials in AWS Secrets Manager.

   1. **AWS Secrets Manager** - Choose a secret in AWS Secrets Manager with the credentials. You might need to contact your Administrator to create a secret.

1. **Access role ARN** - optional. Required when connecting to resources in a different AWS account. See [Gaining access to Amazon Redshift resources](compute-prerequisite-redshift.md).

1. Choose **Create connection**.
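
The JDBC URL and parameter fields in the procedure above follow simple text formats. The sketch below is a hypothetical helper (not part of the console) that assembles them so the pieces are easy to see together; the exact properties your driver accepts depend on the driver version:

```python
# Hypothetical helper: assemble a Redshift JDBC URL in the
# jdbc:redshift://endpoint:port/database format described above,
# with optional <key>=<value>;<key>=<value> driver parameters.
def build_redshift_jdbc_url(endpoint, port, database, params=None):
    url = f"jdbc:redshift://{endpoint}:{port}/{database}"
    if params:
        url += ";" + ";".join(f"{k}={v}" for k, v in params.items())
    return url

print(build_redshift_jdbc_url(
    "examplecluster.abc123.us-east-1.redshift.amazonaws.com", 5439, "dev",
    {"ssl": "true"},
))
```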

## Steps to connect to Amazon SageMaker Unified Studio from Amazon Redshift console pages


These steps describe how Amazon Redshift customers can automatically create an Amazon Redshift connection within the Amazon SageMaker Unified Studio environment from the Amazon Redshift Management Console and Amazon Redshift Query Editor v2 (QEv2). This functionality is available from the following Amazon Redshift console pages:
+ Landing page
+ Provisioned dashboard
+ Serverless dashboard
+ Cluster list
+ Cluster detail
+ Serverless workgroup and namespace list pages
+ Workgroup and namespace detail pages

When a user selects the "Query in SageMaker Unified Studio" option, they will be prompted to choose the Amazon Redshift cluster they want to use. Amazon SageMaker Unified Studio will then issue temporary credentials and provision the session, redirecting the user into the Amazon SageMaker Unified Studio environment where they can begin querying data immediately.

## Steps to connect to Amazon SageMaker Unified Studio from Amazon Redshift Query Editor v2


These steps describe how Amazon Redshift customers can automatically create an Amazon Redshift connection within the Amazon SageMaker Unified Studio environment from the Amazon Redshift Query Editor v2 (QEv2). This functionality is available within Amazon Redshift Query Editor v2 under a "Run in SageMaker Unified Studio" button in the following locations.
+ As a tooltip option next to the "Run" button in the SQL editor
+ As a tooltip option next to the "Run" button in notebook cells

If the selected connection uses IAM authentication, choosing the "Run in SageMaker Unified Studio" button executes the query currently typed in the active editor (SQL or notebook cell) directly in the Amazon SageMaker Unified Studio environment.

## Redshift Access using IAM Authentication


IAM authentication generates temporary database credentials based on your AWS identity, eliminating the need to manage database passwords. When you connect using IAM authentication, Amazon SageMaker Unified Studio maps your IAM identity to a database user.

### How IAM Authentication Works


When connecting with IAM authentication, your database username is automatically generated based on your IAM identity. By default, the username follows the pattern `IAMR:<your-federated-identity>`. You can customize this behavior using the RedshiftDbUser principal tag on your project execution role to specify a different database user.

### Using Principal Tags


Principal tags allow you to control which database user is used when connecting to Redshift. Configure the RedshiftDbUser tag on your project execution role in IAM. When this tag is set, your connection will use `IAMR:<tag-value>` as the database username instead of the federated IAM identity.

For more information about setting up principal tags, see [Setting up principal tags to connect a cluster or workgroup from query editor v2](https://docs.aws.amazon.com/redshift/latest/mgmt/query-editor-v2-getting-started.html#query-editor-v2-principal-tags-iam).
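
The mapping described above can be summarized in a few lines. This is an illustrative sketch of the behavior, not an actual API: when the RedshiftDbUser principal tag is present on the project execution role it replaces the federated identity, and the `IAMR:` prefix is applied either way:

```python
# Illustrative sketch of the username mapping described above.
# Not a SageMaker or Redshift API.
def effective_db_user(federated_identity, redshift_db_user_tag=None):
    # The RedshiftDbUser principal tag, when set, overrides the
    # federated identity; the IAMR: prefix is applied in both cases.
    return f"IAMR:{redshift_db_user_tag or federated_identity}"

print(effective_db_user("jane-doe"))                    # IAMR:jane-doe
print(effective_db_user("jane-doe", "analytics_user"))  # IAMR:analytics_user
```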

## Troubleshooting Database Access


If, after establishing a connection, you do not see the expected databases, schemas, or tables, follow the steps below.

First, identify your effective database user by running the following SQL:

```
SELECT current_user;
```

Your database username follows these patterns:
+ Federated: `IAMR:<your-federated-identity>`
+ With RedshiftDbUser tag on the project execution role: `IAMR:<tag-value>`

### Common Access Issues


**Databases are not displayed**  
Your database user lacks connection privileges. The database administrator must grant:  

```
GRANT CONNECT ON DATABASE <database_name> TO "IAMR:<your-identity>";
```

**Schemas are not displayed**  
Your database user lacks schema usage privileges. The database administrator must grant:  

```
GRANT USAGE ON SCHEMA <schema_name> TO "IAMR:<your-identity>";
```

**Tables are not displayed**  
Your database user lacks table access privileges. The database administrator must grant:  

```
GRANT SELECT ON ALL TABLES IN SCHEMA <schema_name> TO "IAMR:<your-identity>";
```

For more information on grants in Redshift, see [GRANT](https://docs.aws.amazon.com/redshift/latest/dg/r_GRANT.html).
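
If you administer access for many identities, the three grants above can be generated rather than typed by hand. The following hypothetical helper renders the statements for one identity; review the SQL before running it against your cluster:

```python
# Hypothetical helper: render the three GRANT statements described
# above for one database user. Review before executing against Redshift.
def access_grants(identity, database, schema):
    user = f'"IAMR:{identity}"'
    return [
        f"GRANT CONNECT ON DATABASE {database} TO {user};",
        f"GRANT USAGE ON SCHEMA {schema} TO {user};",
        f"GRANT SELECT ON ALL TABLES IN SCHEMA {schema} TO {user};",
    ]

for stmt in access_grants("jane-doe", "dev", "public"):
    print(stmt)
```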

# Third-party business data catalog integrations

Amazon SageMaker Unified Studio supports metadata synchronization with third-party business data catalog platforms. These integrations keep catalog metadata aligned between Amazon SageMaker Catalog and partner platforms. Teams get a consistent view of their data and AI assets regardless of which tool they use day to day.

With these integrations, you can synchronize key metadata elements such as projects, assets, descriptions, glossary terms, and their hierarchies. Organizations can maintain aligned glossary terms, asset descriptions, and ownership information across platforms without manual reconciliation.

Amazon SageMaker Unified Studio currently integrates with the following third-party catalog platforms:
+ **Atlan** – Bidirectional metadata synchronization between Amazon SageMaker Catalog and Atlan.
+ **Collibra** – Bidirectional metadata synchronization and access request workflow integration between Amazon SageMaker Catalog and Collibra.
+ **Alation** – Metadata extraction from Amazon SageMaker Catalog into Alation.

**Topics**
+ [Atlan integration](atlan-integration.md)
+ [Collibra integration](collibra-integration.md)
+ [Alation integration](alation-integration.md)

# Atlan integration


The integration between Amazon SageMaker Catalog and Atlan enables bidirectional metadata synchronization across both platforms. Atlan is a data workspace that helps business users, analysts, and engineers collaborate on data projects. This integration connects teams working in Atlan with technical teams working in Amazon SageMaker Unified Studio for analytics and machine learning. For detailed setup instructions, see [Unifying governance and metadata across Amazon SageMaker Unified Studio and Atlan](https://aws.amazon.com/blogs/big-data/unifying-governance-and-metadata-across-amazon-sagemaker-unified-studio-and-atlan/).

## Capabilities


The Atlan integration supports the following capabilities:
+ On-demand and scheduled bidirectional metadata synchronization.
+ Synchronization of glossary terms and descriptions, including parent-child relationships.
+ Ingestion of projects, published and subscribed assets, domains, data products, metadata forms, and column descriptions from Amazon SageMaker Catalog into Atlan.
+ Automatic association of glossary terms with related data assets.
+ Real-time reverse sync of metadata updates from Atlan back to Amazon SageMaker Catalog.

## How it works


The integration uses AWS Identity and Access Management roles to establish a secure connection between your AWS account and Atlan. You deploy an AWS CloudFormation template that creates the required IAM role and policies. This role follows the principle of least privilege, granting Atlan access only to the resources required for cataloging and governance.

After you configure the connection, the Atlan connector calls Amazon SageMaker Unified Studio APIs to ingest assets and metadata. The connector transforms ingested assets into Atlan's metadata model, making them discoverable and governable inside Atlan. When users update metadata in Atlan, the real-time reverse sync pipeline detects changes and pushes updates back to Amazon SageMaker Catalog.

You set up this integration by configuring a connection to Amazon SageMaker Unified Studio from within Atlan.

# Collibra integration


The integration between Amazon SageMaker Catalog and Collibra provides bidirectional metadata synchronization and access governance across both platforms. Collibra is a data intelligence platform that helps organizations centralize governance workflows, define business glossaries, and enforce policies across data assets. This integration is available as an open-source solution on [GitHub](https://github.com/aws-samples/amazon-datazone-examples/tree/main/blogs/unifying_metadata_governance_across_amazon_sagemaker_catalog_and_collibra), co-developed by AWS and Collibra. For detailed setup instructions, see [Unifying metadata governance across Amazon SageMaker and Collibra](https://aws.amazon.com/blogs/big-data/unifying-metadata-governance-across-amazon-sagemaker-and-collibra/).

## Capabilities


### Metadata synchronization


The Collibra integration synchronizes the following metadata between Amazon SageMaker Catalog and Collibra:
+ Bidirectional synchronization of glossary terms and descriptions.
+ Preservation of glossary structure, including parent-child relationships.
+ Association of terms with data assets such as datasets, tables, and columns.
+ Synchronization of classifications, data categories, and tags.
+ Alignment of technical descriptions for datasets and columns.

Core metadata elements synchronize every 5 minutes. Subscription requests that originate in Amazon SageMaker Catalog synchronize to Collibra instantly.

### Access request workflows


The Collibra integration extends Collibra's access governance workflows to assets cataloged in Amazon SageMaker Catalog. Users can discover and request access to datasets from within Collibra or Amazon SageMaker Unified Studio using familiar approval processes.

Key capabilities of the access request workflow include:
+ Access request initiation from either Collibra or Amazon SageMaker Unified Studio.
+ Centralized review and approval managed within Collibra by designated business stewards.
+ Automatic access provisioning through the Amazon SageMaker Catalog grant mechanism.
+ Status tracking of subscription requests across both platforms.

## How it works


The integration uses the APIs of both Amazon SageMaker and Collibra Data Governance Center. You deploy an AWS CloudFormation template that provisions the required AWS resources, including IAM roles and AWS Lambda functions. On the Collibra side, you configure operating model changes, import workflows, and assign business stewards to assets.

The solution is available as an open-source project on [GitHub](https://github.com/aws-samples/amazon-datazone-examples/tree/main/blogs/unifying_metadata_governance_across_amazon_sagemaker_catalog_and_collibra).

# Alation integration


The integration between Amazon SageMaker Catalog and Alation synchronizes catalog metadata between both systems. Alation is a data intelligence platform that helps organizations make data discoverable, governed, and actionable. This integration creates a unified metadata experience where technical teams working in Amazon SageMaker Unified Studio and business teams working in Alation collaborate on top of the same metadata. For detailed setup instructions, see [Build a trusted foundation for data and AI using Alation and Amazon SageMaker Unified Studio](https://aws.amazon.com/blogs/big-data/build-a-trusted-foundation-for-data-and-ai-using-alation-and-amazon-sagemaker-unified-studio/).

## Capabilities


The current phase of the Alation integration extracts metadata from Amazon SageMaker Catalog into Alation. The integration synchronizes the following metadata:
+ Domains, projects, and asset names.
+ Descriptions, owners, and glossary terms.
+ Custom metadata fields (metadata forms).
+ Provenance metadata, including the originating service, the actor who made the change, and the timestamp.

You can run metadata extractions on demand or schedule them to run automatically. The system performs an initial bulk extraction and then keeps data current through incremental updates.

## How it works


The integration connects through AWS Identity and Access Management authentication. You can use either an IAM role (recommended) or an IAM user with access keys. The connector uses scoped IAM permissions following least-privilege principles. Communication uses encrypted APIs, and only metadata is synchronized. Your data files and artifacts remain in their original AWS locations.

You set up this integration by installing the SageMaker enhanced connector in Alation and configuring a data source connection.