

# Prepare ML Data with Amazon SageMaker Data Wrangler

**Important**  
Amazon SageMaker Data Wrangler has been integrated into Amazon SageMaker Canvas. Within the new Data Wrangler experience in SageMaker Canvas, you can use a natural language interface to explore and transform your data in addition to the visual interface. For more information about Data Wrangler in SageMaker Canvas, see [Data preparation](canvas-data-prep.md).

Amazon SageMaker Data Wrangler (Data Wrangler) is a feature of Amazon SageMaker Studio Classic that provides an end-to-end solution to import, prepare, transform, featurize, and analyze data. You can integrate a Data Wrangler data preparation flow into your machine learning (ML) workflows to simplify and streamline data pre-processing and feature engineering using little to no coding. You can also add your own Python scripts and transformations to customize workflows.

Data Wrangler provides the following core functionalities to help you analyze and prepare data for machine learning applications. 
+ **Import** – Connect to and import data from Amazon Simple Storage Service (Amazon S3), Amazon Athena (Athena), Amazon Redshift, Snowflake, and Databricks.
+ **Data Flow** – Create a data flow to define a series of ML data prep steps. You can use a flow to combine datasets from different data sources, identify the number and types of transformations you want to apply to datasets, and define a data prep workflow that can be integrated into an ML pipeline. 
+ **Transform** – Clean and transform your dataset using standard *transforms* like string, vector, and numeric data formatting tools. Featurize your data using transforms like text and date/time embedding and categorical encoding.
+ **Generate Data Insights** – Automatically verify data quality and detect abnormalities in your data with the Data Wrangler Data Quality and Insights Report. 
+ **Analyze** – Analyze features in your dataset at any point in your flow. Data Wrangler includes built-in data visualization tools like scatter plots and histograms, as well as data analysis tools like target leakage analysis and quick modeling to understand feature correlation. 
+ **Export** – Export your data preparation workflow to a different location. The following are example locations: 
  + Amazon Simple Storage Service (Amazon S3) bucket
  + Amazon SageMaker Pipelines – Use Pipelines to automate model deployment. You can export the data that you've transformed directly to the pipelines.
  + Amazon SageMaker Feature Store – Store the features and their data in a centralized store.
  + Python script – Store the data and their transformations in a Python script for your custom workflows.

To start using Data Wrangler, see [Get Started with Data Wrangler](data-wrangler-getting-started.md).

**Important**  
Data Wrangler no longer supports Jupyter Lab Version 1 (JL1). To access the latest features and updates, update to Jupyter Lab Version 3. For more information about upgrading, see [View and update the JupyterLab version of an application from the console](studio-jl.md#studio-jl-view).

**Important**  
The information and procedures in this guide use the latest version of Amazon SageMaker Studio Classic. For information about updating Studio Classic to the latest version, see [Amazon SageMaker Studio Classic UI Overview](studio-ui.md).

You must use Studio Classic version 1.3.0 or later. Use the following procedure to open Amazon SageMaker Studio Classic and see which version you're running.


1. Use the steps in [Prerequisites](data-wrangler-getting-started.md#data-wrangler-getting-started-prerequisite) to access Data Wrangler through Amazon SageMaker Studio Classic.

1. Next to the user you want to use to launch Studio Classic, select **Launch app**.

1. Choose **Studio**.

1. After Studio Classic loads, select **File**, then **New**, and then **Terminal**.  
![\[The Studio Classic context menu options described in step 4.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/terminal.png)


1. Enter `cat /opt/conda/share/jupyter/lab/staging/yarn.lock | grep -A 1 "@amzn/sagemaker-ui-data-prep-plugin@"` to print the version of your Studio Classic instance. You must have Studio Classic version 1.3.0 or later to use Snowflake.  
![\[A terminal window opened in Studio Classic with the command from step 6 copied and pasted in.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/cat-command.png)

You can update Amazon SageMaker Studio Classic from within the AWS Management Console. For more information about updating Studio Classic, see [Amazon SageMaker Studio Classic UI Overview](studio-ui.md).

**Topics**
+ [Get Started with Data Wrangler](data-wrangler-getting-started.md)
+ [Import](data-wrangler-import.md)
+ [Create and Use a Data Wrangler Flow](data-wrangler-data-flow.md)
+ [Get Insights On Data and Data Quality](data-wrangler-data-insights.md)
+ [Automatically Train Models on Your Data Flow](data-wrangler-autopilot.md)
+ [Transform Data](data-wrangler-transform.md)
+ [Analyze and Visualize](data-wrangler-analyses.md)
+ [Reusing Data Flows for Different Datasets](data-wrangler-parameterize.md)
+ [Export](data-wrangler-data-export.md)
+ [Use an Interactive Data Preparation Widget in an Amazon SageMaker Studio Classic Notebook to Get Data Insights](data-wrangler-interactively-prepare-data-notebook.md)
+ [Security and Permissions](data-wrangler-security.md)
+ [Release Notes](data-wrangler-release-notes.md)
+ [Troubleshoot](data-wrangler-trouble-shooting.md)
+ [Increase Amazon EC2 Instance Limit](data-wrangler-increase-instance-limit.md)
+ [Update Data Wrangler](data-wrangler-update.md)
+ [Shut Down Data Wrangler](data-wrangler-shut-down.md)

# Get Started with Data Wrangler


Amazon SageMaker Data Wrangler is a feature in Amazon SageMaker Studio Classic. Use this section to learn how to access and get started using Data Wrangler. Do the following:

1. Complete each step in [Prerequisites](#data-wrangler-getting-started-prerequisite).

1. Follow the procedure in [Access Data Wrangler](#data-wrangler-getting-started-access) to start using Data Wrangler.

## Prerequisites


To use Data Wrangler, you must complete the following prerequisites. 

1. To use Data Wrangler, you need access to an Amazon Elastic Compute Cloud (Amazon EC2) instance. For more information about the Amazon EC2 instances that you can use, see [Instances](data-wrangler-data-flow.md#data-wrangler-data-flow-instances). To learn how to view your quotas and, if necessary, request a quota increase, see [AWS service quotas](https://docs.aws.amazon.com/general/latest/gr/aws_service_limits.html).

1. Configure the required permissions described in [Security and Permissions](data-wrangler-security.md). 

1. If your organization is using a firewall that blocks internet traffic, you must have access to the following URLs:
   + `https://ui.prod-1.data-wrangler.sagemaker.aws/`
   + `https://ui.prod-2.data-wrangler.sagemaker.aws/`
   + `https://ui.prod-3.data-wrangler.sagemaker.aws/`
   + `https://ui.prod-4.data-wrangler.sagemaker.aws/`

To use Data Wrangler, you need an active Studio Classic instance. To learn how to launch a new instance, see [Amazon SageMaker AI domain overview](gs-studio-onboard.md). When your Studio Classic instance is **Ready**, use the instructions in [Access Data Wrangler](#data-wrangler-getting-started-access).

## Access Data Wrangler


The following procedure assumes you have completed the [Prerequisites](#data-wrangler-getting-started-prerequisite).

To access Data Wrangler in Studio Classic, do the following.

1. Sign in to Studio Classic. For more information, see [Amazon SageMaker AI domain overview](gs-studio-onboard.md).

1. Choose **Studio**.

1. Choose **Launch app**.

1. From the dropdown list, select **Studio**.

1. Choose the Home icon.

1. Choose **Data**.

1. Choose **Data Wrangler**.

1. You can also create a Data Wrangler flow by doing the following.

   1. In the top navigation bar, select **File**.

   1. Select **New**.

   1. Select **Data Wrangler Flow**.  
![\[Home tab of the Studio Classic console.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/new-flow-file-menu.png)

1. (Optional) Rename the new directory and the .flow file.

1. When you create a new .flow file in Studio Classic, you might see a carousel that introduces you to Data Wrangler.

   **This may take a few minutes.**

   This messaging persists as long as the **KernelGateway** app on your **User Details** page is **Pending**. To see the status of this app, in the SageMaker AI console on the **Amazon SageMaker Studio Classic** page, select the name of the user you are using to access Studio Classic. On the **User Details** page, you see a **KernelGateway** app under **Apps**. Wait until this app status is **Ready** to start using Data Wrangler. This can take around 5 minutes the first time you launch Data Wrangler.  
![\[Example showing the KernelGateway app status is Ready on the User Details page.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/gatewayKernel-ready.png)

1. To get started, choose a data source and use it to import a dataset. See [Import](data-wrangler-import.md) to learn more. 

   When you import a dataset, it appears in your data flow. To learn more, see [Create and Use a Data Wrangler Flow](data-wrangler-data-flow.md).

1. After you import a dataset, Data Wrangler automatically infers the type of data in each column. Choose the **+** next to the **Data types** step and select **Edit data types**. 
**Important**  
After you add transforms to the **Data types** step, you cannot bulk-update column types using **Update types**. 

1. Use the data flow to add transforms and analyses. To learn more see [Transform Data](data-wrangler-transform.md) and [Analyze and Visualize](data-wrangler-analyses.md).

1. To export a complete data flow, choose **Export** and choose an export option. To learn more, see [Export](data-wrangler-data-export.md). 

1. Finally, choose the **Components and registries** icon, and select **Data Wrangler** from the dropdown list to see all the .flow files that you've created. You can use this menu to find and move between data flows.

After you have launched Data Wrangler, you can use the following section to walk through how you might use Data Wrangler to create an ML data prep flow. 

## Update Data Wrangler


We recommend that you periodically update the Data Wrangler Studio Classic app to access the latest features and updates. The Data Wrangler app name starts with **sagemaker-data-wrang**. To learn how to update a Studio Classic app, see [Shut Down and Update Amazon SageMaker Studio Classic Apps](studio-tasks-update-apps.md).

## Demo: Data Wrangler Titanic Dataset Walkthrough


The following sections provide a walkthrough to help you get started using Data Wrangler. This walkthrough assumes that you have already followed the steps in [Access Data Wrangler](#data-wrangler-getting-started-access) and have a new data flow file open that you intend to use for the demo. You may want to rename this .flow file to something similar to `titanic-demo.flow`.

This walkthrough uses the [Titanic dataset](https://s3.us-west-2.amazonaws.com/amazon-sagemaker-data-wrangler-documentation-artifacts/walkthrough_titanic.csv). It's a modified version of the [Titanic dataset](https://www.openml.org/d/40945) that you can import into your Data Wrangler flow more easily. This dataset contains the survival status, age, gender, and class (which serves as a proxy for economic status) of passengers aboard the maiden voyage of the *RMS Titanic* in 1912.

In this tutorial, you perform the following steps.

1. Do one of the following:
   + Open your Data Wrangler flow and choose **Use Sample Dataset**.
   + Upload the [Titanic dataset](https://s3.us-west-2.amazonaws.com/amazon-sagemaker-data-wrangler-documentation-artifacts/walkthrough_titanic.csv) to Amazon Simple Storage Service (Amazon S3), and then import this dataset into Data Wrangler.

1. Analyze this dataset using Data Wrangler analyses. 

1. Define a data flow using Data Wrangler data transforms.

1. Export your flow to a Jupyter Notebook that you can use to create a Data Wrangler job. 

1. Process your data, and kick off a SageMaker training job to train an XGBoost Binary Classifier. 

### Upload Dataset to S3 and Import


To get started, you can use one of the following methods to import the Titanic dataset into Data Wrangler:
+ Importing the dataset directly from the Data Wrangler flow
+ Uploading the dataset to Amazon S3 and then importing it into Data Wrangler

To import the dataset directly into Data Wrangler, open the flow and choose **Use Sample Dataset**.

Uploading the dataset to Amazon S3 and importing it into Data Wrangler is closer to the experience you have importing your own data. The following information tells you how to upload your dataset and import it.

Before you start importing the data into Data Wrangler, download the [Titanic dataset](https://s3.us-west-2.amazonaws.com/amazon-sagemaker-data-wrangler-documentation-artifacts/walkthrough_titanic.csv) and upload it to an Amazon S3 bucket in the AWS Region in which you want to complete this demo.

If you are a new user of Amazon S3, you can do this using drag and drop in the Amazon S3 console. To learn how, see [Uploading Files and Folders by Using Drag and Drop](https://docs.aws.amazon.com/AmazonS3/latest/user-guide/upload-objects.html#upload-objects-by-drag-and-drop) in the Amazon Simple Storage Service User Guide.

**Important**  
Upload your dataset to an S3 bucket in the same AWS Region you want to use to complete this demo. 

When your dataset has been successfully uploaded to Amazon S3, you can import it into Data Wrangler.

**Import the Titanic dataset to Data Wrangler**

1. Choose the **Import data** button in your **Data flow** tab or choose the **Import** tab.

1. Select **Amazon S3**.

1. Use the **Import a dataset from S3** table to find the bucket to which you added the Titanic dataset. Choose the Titanic dataset CSV file to open the **Details** pane.

1. Under **Details**, the **File type** should be CSV. Check **First row is header** to specify that the first row of the dataset is a header. You can also name the dataset something more friendly, such as **Titanic-train**.

1. Choose the **Import** button.

When your dataset is imported into Data Wrangler, it appears in your **Data Flow** tab. You can double-click a node to open the node detail view, where you can add transformations and analyses. You can use the plus icon for quick access to the navigation. In the next section, you use this data flow to add analysis and transform steps.

### Data Flow


In the data flow section, the only steps in the data flow are your recently imported dataset and a **Data type** step. After applying transformations, you can come back to this tab and see what the data flow looks like. Now, add some basic transformations under the **Prepare** and **Analyze** tabs. 

#### Prepare and Visualize


Data Wrangler has built-in transformations and visualizations that you can use to analyze, clean, and transform your data. 

The **Data** tab of the node detail view lists all built-in transformations in the right panel, which also contains an area in which you can add custom transformations. The following use case showcases how to use these transformations.

To get information that might help you with data exploration and feature engineering, create a data quality and insights report. The information from the report can help you clean and process your data. It gives you information such as the number of missing values and the number of outliers. If you have issues with your data, such as target leakage or imbalance, the insights report can bring those issues to your attention. For more information about creating a report, see [Get Insights On Data and Data Quality](data-wrangler-data-insights.md).

##### Data Exploration


First, create a table summary of the data using an analysis. Do the following:

1. Choose the **+** next to the **Data type** step in your data flow and select **Add analysis**.

1. In the **Analysis** area, select **Table summary** from the dropdown list.

1. Give the table summary a **Name**.

1. Select **Preview** to preview the table that will be created.

1. Choose **Save** to save it to your data flow. It appears under **All Analyses**.

Using the statistics you see, you can make observations similar to the following about this dataset: 
+ Fare average (mean) is around \$133, while the max is over \$1500. This column likely has outliers. 
+ This dataset uses *?* to indicate missing values. A number of columns have missing values: *cabin*, *embarked*, and *home.dest*.
+ The age category is missing over 250 values.

Next, clean your data using the insights gained from these stats. 
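Outside the Data Wrangler UI, you can verify these observations with a few lines of pandas. The following is a minimal sketch that uses a small hypothetical sample in place of the full Titanic dataset:

```
import numpy as np
import pandas as pd

# Hypothetical sample standing in for the Titanic dataset,
# which marks missing values with "?".
df = pd.DataFrame({
    "age": ["22", "?", "38", "?"],
    "cabin": ["?", "C85", "?", "?"],
    "fare": ["7.25", "71.28", "?", "8.05"],
})

# Treat "?" as a true missing value, then count missing entries per column.
df = df.replace("?", np.nan)
missing_counts = df.isna().sum()
print(missing_counts)
```

On the real dataset, the same two lines at the end would reproduce the missing-value counts that the table summary reports.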

##### Drop Unused Columns


Using the analysis from the previous section, clean up the dataset to prepare it for training. To add a new transform to your data flow, choose the **+** next to the **Data type** step in your data flow and choose **Add transform**.

First, drop columns that you don't want to use for training. You can use the [pandas](https://pandas.pydata.org/) data analysis library to do this, or you can use one of the built-in transforms.

Use the following procedure to drop the unused columns.


1. Open the Data Wrangler flow.

1. There are two nodes in your Data Wrangler flow. Choose the **+** to the right of the **Data types** node.

1. Choose **Add transform**.

1. In the **All steps** column, choose **Add step**.

1. In the **Standard** transform list, choose **Manage Columns**. The standard transformations are ready-made, built-in transformations. Make sure that **Drop column** is selected.

1. Under **Columns to drop**, check the following column names:
   + cabin
   + ticket
   + name
   + sibsp
   + parch
   + home.dest
   + boat
   + body

1. Choose **Preview**.

1. Verify that the columns have been dropped, then choose **Add**.

To do this using pandas, follow these steps.

1. In the **All steps** column, choose **Add step**.

1. In the **Custom** transform list, choose **Custom transform**.

1. Provide a name for your transformation, and choose **Python (Pandas)** from the dropdown list.

1. Enter the following Python script in the code box.

   ```
   cols = ['name', 'ticket', 'cabin', 'sibsp', 'parch', 'home.dest','boat', 'body']
   df = df.drop(cols, axis=1)
   ```

1. Choose **Preview** to preview the change, and then choose **Add** to add the transformation. 

##### Clean up Missing Values


Now, clean up missing values. You can do this with the **Handling missing values** transform group.

Of the remaining columns, *age* and *fare* contain missing values. Inspect them using a **Custom Transform**.

With the **Python (Pandas)** option, use the following to quickly review the number of entries in each column:

```
df.info()
```

![\[Example review the number of entries in each column.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/inspect-missing-pandas.png)


To drop rows with missing values in the *age* category, do the following: 

1. Choose **Handle missing**. 

1. Choose **Drop missing** for the **Transformer**.

1. Choose *age* for the **Input column**.

1. Choose **Preview** to see the new data frame, and then choose **Add** to add the transform to your flow.

1. Repeat the same process for *fare*. 

You can use `df.info()` in the **Custom transform** section to confirm that each column now has 1,045 non-null values.
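The same drop-missing step can be approximated outside the UI with pandas `dropna`. This is a sketch on a hypothetical sample, not the exact transform that Data Wrangler runs:

```
import numpy as np
import pandas as pd

# Hypothetical sample with missing values already converted to NaN.
df = pd.DataFrame({
    "survived": [1, 0, 1, 0],
    "age": [29.0, np.nan, 24.0, 47.0],
    "fare": [211.34, 7.75, np.nan, 52.0],
})

# Drop any row that is missing age or fare, mirroring the two
# Drop missing transforms applied above.
df = df.dropna(subset=["age", "fare"])
print(len(df))  # rows remaining after the drop
```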

##### Custom Pandas: Encode


Try one-hot encoding using pandas. Encoding categorical data is the process of creating a numerical representation for categories. For example, if your categories are `Dog` and `Cat`, you may encode this information into two vectors: `[1,0]` to represent `Dog`, and `[0,1]` to represent `Cat`.
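As a minimal illustration of this idea (using a hypothetical `Dog`/`Cat` column, separate from the Titanic flow), pandas `get_dummies` produces exactly these indicator vectors. Note that `get_dummies` sorts the output columns alphabetically, so the column order here is (`Cat`, `Dog`):

```
import pandas as pd

# Two categories become two indicator columns.
pets = pd.Series(["Dog", "Cat", "Dog"])
encoded = pd.get_dummies(pets).astype(int)
print(encoded)
```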

1. In the **Custom Transform** section, choose **Python (Pandas)** from the dropdown list.

1. Enter the following in the code box.

   ```
   import pandas as pd
   
   dummies = []
   cols = ['pclass','sex','embarked']
   for col in cols:
       dummies.append(pd.get_dummies(df[col]))
       
   encoded = pd.concat(dummies, axis=1)
   
   df = pd.concat((df, encoded),axis=1)
   ```

1. Choose **Preview** to preview the change. The encoded version of each column is added to the dataset. 

1. Choose **Add** to add the transformation. 

#### Custom SQL: SELECT Columns


Now, select the columns you want to keep using SQL. For this demo, select the columns listed in the following `SELECT` statement. Because *survived* is your target column for training, put that column first.

1. In the **Custom Transform** section, select **SQL (PySpark SQL)** from the dropdown list.

1. Enter the following in the code box.

   ```
SELECT survived, age, fare, `1`, `2`, `3`, female, male, C, Q, S FROM df;
   ```

1. Choose **Preview** to preview the change. The columns listed in your `SELECT` statement are the only remaining columns.

1. Choose **Add** to add the transformation. 

### Export to a Data Wrangler Notebook


When you've finished creating a data flow, you have a number of export options. The following section explains how to export to a Data Wrangler job notebook. A Data Wrangler job is used to process your data using the steps defined in your data flow. To learn more about all export options, see [Export](data-wrangler-data-export.md).

#### Export to Data Wrangler Job Notebook


When you export your data flow using a **Data Wrangler job**, the process automatically creates a Jupyter Notebook. This notebook automatically opens in your Studio Classic instance and is configured to run a SageMaker Processing job that executes your Data Wrangler data flow. This processing job is referred to as a Data Wrangler job. 

1. Save your data flow. Select **File** and then select **Save Data Wrangler Flow**.

1. Back in the **Data Flow** tab, select the last step in your data flow (SQL), then choose the **+** to open the navigation.

1. Choose **Export**, and **Amazon S3 (via Jupyter Notebook)**. This opens a Jupyter Notebook.  
![\[Example showing how to open the navigation in the data flow tab in the Data Wrangler console.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/export-select-step.png)

1. For the **Kernel**, choose a **Python 3 (Data Science)** kernel. 

1. When the kernel starts, run the cells in the notebook until **Kick off SageMaker Training Job (Optional)**. 

1. Optionally, you can run the cells in **Kick off SageMaker Training Job (Optional)** if you want to create a SageMaker AI training job to train an XGBoost classifier. You can find the cost to run a SageMaker training job in [Amazon SageMaker Pricing](https://aws.amazon.com/sagemaker/pricing/). 

   Alternatively, you can add the code blocks found in [Training XGBoost Classifier](#data-wrangler-getting-started-train-xgboost) to the notebook and run them to use the [XGBoost](https://xgboost.readthedocs.io/en/latest/) open source library to train an XGBoost classifier. 

1. Uncomment the cell under **Cleanup** and run it to revert the SageMaker Python SDK to its original version.

You can monitor your Data Wrangler job status in the SageMaker AI console in the **Processing** tab. Additionally, you can monitor your Data Wrangler job using Amazon CloudWatch. For additional information, see [Monitor Amazon SageMaker Processing Jobs with CloudWatch Logs and Metrics](https://docs.aws.amazon.com/sagemaker/latest/dg/processing-job.html#processing-job-cloudwatch). 

If you kicked off a training job, you can monitor its status using the SageMaker AI console under **Training jobs** in the **Training** section.

#### Training XGBoost Classifier


You can train an XGBoost Binary Classifier using either a Jupyter notebook or Amazon SageMaker Autopilot. You can use Autopilot to automatically train and tune models on the data that you've transformed directly from your Data Wrangler flow. For information about Autopilot, see [Automatically Train Models on Your Data Flow](data-wrangler-autopilot.md).

In the same notebook that kicked off the Data Wrangler job, you can pull the data and train an XGBoost Binary Classifier using the prepared data with minimal data preparation. 

1. First, upgrade the necessary modules using `pip` and remove the `_SUCCESS` file (this file is problematic when using `awswrangler`).

   ```
   ! pip install --upgrade awscli awswrangler boto sklearn
   ! aws s3 rm {output_path} --recursive  --exclude "*" --include "*_SUCCESS*"
   ```

1. Read the data from Amazon S3. You can use `awswrangler` to recursively read all the CSV files in the S3 prefix. The data is then split into features and labels. The label is the first column of the dataframe.

   ```
   import awswrangler as wr
   
   df = wr.s3.read_csv(path=output_path, dataset=True)
   X, y = df.iloc[:, 1:], df.iloc[:, 0]
   ```
1. Finally, create DMatrices (the XGBoost primitive structure for data) and do cross-validation using the XGBoost binary classification objective.

   ```
   import xgboost as xgb
   
   dmatrix = xgb.DMatrix(data=X, label=y)
   
   params = {"objective": "binary:logistic", "learning_rate": 0.1, "max_depth": 5, "alpha": 10}
   
   xgb.cv(
       dtrain=dmatrix,
       params=params,
       nfold=3,
       num_boost_round=50,
       early_stopping_rounds=10,
       metrics="rmse",
       as_pandas=True,
       seed=123)
   ```

#### Shut down Data Wrangler


When you are finished using Data Wrangler, we recommend that you shut down the instance it runs on to avoid incurring additional charges. To learn how to shut down the Data Wrangler app and associated instance, see [Shut Down Data Wrangler](data-wrangler-shut-down.md). 

# Import


You can use Amazon SageMaker Data Wrangler to import data from the following *data sources*: Amazon Simple Storage Service (Amazon S3), Amazon Athena, Amazon Redshift, Amazon EMR, Databricks, Salesforce Data Cloud, Snowflake, and Software as a Service (SaaS) platforms. The dataset that you import can include up to 1,000 columns.

**Topics**
+ [Import data from Amazon S3](#data-wrangler-import-s3)
+ [Import data from Athena](#data-wrangler-import-athena)
+ [Import data from Amazon Redshift](#data-wrangler-import-redshift)
+ [Import data from Amazon EMR](#data-wrangler-emr)
+ [Import data from Databricks (JDBC)](#data-wrangler-databricks)
+ [Import data from Salesforce Data Cloud](#data-wrangler-import-salesforce-data-cloud)
+ [Import data from Snowflake](#data-wrangler-snowflake)
+ [Import Data From Software as a Service (SaaS) Platforms](#data-wrangler-import-saas)
+ [Imported Data Storage](#data-wrangler-import-storage)

Some data sources allow you to add multiple *data connections*:
+ You can connect to multiple Amazon Redshift clusters. Each cluster becomes a data source. 
+ You can query any Athena database in your account to import data from that database.



When you import a dataset from a data source, it appears in your data flow. Data Wrangler automatically infers the data type of each column in your dataset. To modify these types, select the **Data types** step and select **Edit data types**.

When you import data from Athena or Amazon Redshift, the imported data is automatically stored in the default SageMaker AI S3 bucket for the AWS Region in which you are using Studio Classic. Additionally, Athena stores data you preview in Data Wrangler in this bucket. To learn more, see [Imported Data Storage](#data-wrangler-import-storage).

**Important**  
The default Amazon S3 bucket might not have the most restrictive security settings, such as a bucket policy and server-side encryption (SSE). We strongly recommend that you [Add a Bucket Policy To Restrict Access to Datasets Imported to Data Wrangler](https://docs.aws.amazon.com/sagemaker/latest/dg/data-wrangler-security.html#data-wrangler-security-bucket-policy). 

**Important**  
In addition, if you use the managed policy for SageMaker AI, we strongly recommend that you scope it down to the most restrictive policy that allows you to perform your use case. For more information, see [Grant an IAM Role Permission to Use Data Wrangler](data-wrangler-security.md#data-wrangler-security-iam-policy).

All data sources except for Amazon Simple Storage Service (Amazon S3) require you to specify a SQL query to import your data. For each query, you must specify the following:
+ **Data catalog**
+ **Database**
+ **Table**

You can specify the name of the database or the data catalog in either the dropdown menus or within the query. The following are example queries:
+ `select * from example-data-catalog-name.example-database-name.example-table-name` – The query doesn't use anything specified in the dropdown menus of the user interface (UI). It queries `example-table-name` within `example-database-name` within `example-data-catalog-name`.
+ `select * from example-database-name.example-table-name` – The query uses the data catalog that you've specified in the **Data catalog** dropdown menu to run. It queries `example-table-name` within `example-database-name` within the data catalog that you've specified.
+ `select * from example-table-name` – The query requires you to select fields for both the **Data catalog** and **Database name** dropdown menus. It queries `example-table-name` within the database and data catalog that you've specified.

The link between Data Wrangler and the data source is a *connection*. You use the connection to import data from your data source.

There are two types of connections:
+ Direct
+ Cataloged

Data Wrangler always has access to the most recent data in a direct connection. If the data in the data source has been updated, you can use the connection to import the data. For example, if someone adds a file to one of your Amazon S3 buckets, you can import the file.

A cataloged connection is the result of a data transfer. The data in the cataloged connection doesn't necessarily have the most recent data. For example, you might set up a data transfer between Salesforce and Amazon S3. If there's an update to the Salesforce data, you must transfer the data again. You can automate the process of transferring data. For more information about data transfers, see [Import Data From Software as a Service (SaaS) Platforms](#data-wrangler-import-saas).

## Import data from Amazon S3


You can use Amazon Simple Storage Service (Amazon S3) to store and retrieve any amount of data, at any time, from anywhere on the web. You can accomplish these tasks using the AWS Management Console, which is a simple and intuitive web interface, and the Amazon S3 API. If you've stored your dataset locally, we recommend that you add it to an S3 bucket for import into Data Wrangler. To learn how, see [Uploading an object to a bucket](https://docs.aws.amazon.com/AmazonS3/latest/gsg/PuttingAnObjectInABucket.html) in the Amazon Simple Storage Service User Guide. 

Data Wrangler uses [S3 Select](https://aws.amazon.com/s3/features/#s3-select) to allow you to preview your Amazon S3 files in Data Wrangler. You incur standard charges for each file preview. To learn more about pricing, see the **Requests & data retrievals** tab on [Amazon S3 pricing](https://aws.amazon.com/s3/pricing/). 

**Important**  
If you plan to export a data flow and launch a Data Wrangler job, ingest data into a SageMaker AI feature store, or create a SageMaker AI pipeline, be aware that these integrations require the Amazon S3 input data to be located in the same AWS Region.

**Important**  
If you're importing a CSV file, make sure it meets the following requirements:
+ A record in your dataset can't be longer than one line.
+ A backslash, `\`, is the only valid escape character.
+ Your dataset must use one of the following delimiters:
  + Comma – `,`
  + Colon – `:`
  + Semicolon – `;`
  + Pipe – `|`
  + Tab – `[TAB]`

To save space, you can import compressed CSV files.
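
As a quick pre-upload check, the following sketch uses only the Python standard library to catch two common violations of these requirements: an unsupported delimiter and a record that spans more than one line. The function name and sample data are hypothetical illustrations, not part of Data Wrangler.

```python
import csv
import io

# Delimiters that Data Wrangler accepts for CSV imports.
SUPPORTED_DELIMITERS = [",", ":", ";", "|", "\t"]

def check_csv_for_import(text, delimiter=","):
    """Return a list of problems that would block a CSV import."""
    problems = []
    if delimiter not in SUPPORTED_DELIMITERS:
        problems.append(f"unsupported delimiter: {delimiter!r}")
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter, escapechar="\\"))
    # Each record must fit on a single physical line, so the parsed row
    # count should match the non-empty line count.
    physical_lines = [line for line in text.splitlines() if line]
    if len(rows) != len(physical_lines):
        problems.append("a record spans more than one line")
    return problems

print(check_csv_for_import("id,name\n1,alice\n2,bob\n"))  # [] means no problems found
```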

Data Wrangler gives you the ability to either import the entire dataset or sample a portion of it. For Amazon S3, it provides the following sampling options:
+ None – Import the entire dataset.
+ First K – Sample the first K rows of the dataset, where K is an integer that you specify.
+ Randomized – Takes a random sample of a size that you specify.
+ Stratified – Takes a stratified random sample. A stratified sample preserves the ratio of values in a column.

After you've imported your data, you can also use the sampling transformer to take one or more samples from your entire dataset. For more information about the sampling transformer, see [Sampling](data-wrangler-transform.md#data-wrangler-transform-sampling).
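
To illustrate what the stratified option preserves, here is a minimal pure-Python sketch of stratified sampling. It is an illustration of the concept, not Data Wrangler's implementation:

```python
import random
from collections import defaultdict

def stratified_sample(rows, column, fraction, seed=0):
    """Sample `fraction` of rows while preserving the ratio of values in `column`."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for row in rows:
        strata[row[column]].append(row)
    sample = []
    for group in strata.values():
        # Sample each stratum at the same rate, so value ratios are preserved.
        k = round(len(group) * fraction)
        sample.extend(rng.sample(group, k))
    return sample

# 80 "no" rows and 20 "yes" rows; a 50% stratified sample keeps the 4:1 ratio.
data = [{"label": "no"} for _ in range(80)] + [{"label": "yes"} for _ in range(20)]
sample = stratified_sample(data, "label", 0.5)
print(sum(1 for r in sample if r["label"] == "no"),
      sum(1 for r in sample if r["label"] == "yes"))  # 40 10
```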

You can use one of the following resource identifiers to import your data:
+ An Amazon S3 URI that uses an Amazon S3 bucket or Amazon S3 access point
+ An Amazon S3 access point alias
+ An Amazon Resource Name (ARN) that uses an Amazon S3 access point or Amazon S3 bucket

Amazon S3 access points are named network endpoints that are attached to the buckets. Each access point has distinct permissions and network controls that you can configure. For more information about access points, see [Managing data access with Amazon S3 access points](https://docs.aws.amazon.com/AmazonS3/latest/userguide/access-points.html).

**Important**  
If you're using an Amazon Resource Name (ARN) to import your data, it must be for a resource located in the same AWS Region that you're using to access Amazon SageMaker Studio Classic.

You can import either a single file or multiple files as a dataset. You can use the multifile import operation when you have a dataset that is partitioned into separate files. It takes all of the files from an Amazon S3 directory and imports them as a single dataset. For information on the types of files that you can import and how to import them, see the following sections.
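
Conceptually, the multifile import behaves like concatenating same-schema parts into one table. The following standard-library sketch (hypothetical function and data, not Data Wrangler's internals) shows the idea for CSV parts that share a header:

```python
import csv
import io

def combine_partitioned_csv(files):
    """Merge CSV parts that share a header into a single dataset (header kept once)."""
    header = None
    rows = []
    for text in files:
        reader = csv.reader(io.StringIO(text))
        file_header = next(reader)
        if header is None:
            header = file_header
        elif file_header != header:
            raise ValueError("all parts must share the same header")
        rows.extend(reader)
    return header, rows

# Two partition files from the same (hypothetical) dataset.
parts = ["id,score\n1,10\n2,12\n", "id,score\n3,9\n"]
header, rows = combine_partitioned_csv(parts)
print(header, len(rows))  # ['id', 'score'] 3
```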

------
#### [ Single File Import ]

You can import single files in the following formats:
+ Comma Separated Values (CSV)
+ Parquet
+ Javascript Object Notation (JSON)
+ Optimized Row Columnar (ORC)
+ Image – Data Wrangler uses OpenCV to import images. For more information about supported image formats, see [Image file reading and writing](https://docs.opencv.org/3.4/d4/da8/group__imgcodecs.html#ga288b8b3da0892bd651fce07b3bbd3a56).

For files formatted in JSON, Data Wrangler supports both JSON lines (.jsonl) and JSON documents (.json). When you preview your data, it automatically shows the JSON in tabular format. For nested JSON documents that are larger than 5 MB, Data Wrangler shows the schema for the structure and the arrays as values in the dataset. Use the **Flatten structured** and **Explode array** operators to display the nested values in tabular format. For more information, see [Unnest JSON Data](data-wrangler-transform.md#data-wrangler-transform-flatten-column) and [Explode Array](data-wrangler-transform.md#data-wrangler-transform-explode-array).

When you choose a dataset, you can rename it, specify the file type, and identify the first row as a header.

You can import a dataset that you've partitioned into multiple files in an Amazon S3 bucket in a single import step.

**To import a dataset into Data Wrangler from a single file that you've stored in Amazon S3:**

1. If you are not currently on the **Import** tab, choose **Import**.

1. Under **Available**, choose **Amazon S3**.

1. From **Import tabular, image, or time-series data from S3**, do one of the following:
   + Choose an Amazon S3 bucket from the tabular view and navigate to the file that you're importing.
   + For **S3 source**, specify an Amazon S3 bucket or an Amazon S3 URI and select **Go**. The Amazon S3 URIs can be in one of the following formats:
     + `s3://amzn-s3-demo-bucket/example-prefix/example-file`
     + `example-access-point-aqfqprnstn7aefdfbarligizwgyfouse1a-s3alias/datasets/example-file`
     + `s3://arn:aws:s3:AWS-Region:111122223333:accesspoint/example-prefix/example-file`

1. Choose the dataset to open the **Import settings** pane.

1. If your CSV file has a header, select the checkbox next to **Add header to table**.

1. Use the **Preview** table to preview your dataset. This table shows up to 100 rows. 

1. In the **Details** pane, verify or change the **Name** and **File Type** for your dataset. If you add a **Name** that contains spaces, these spaces are replaced with underscores when your dataset is imported. 

1. Specify the sampling configuration that you'd like to use. 

1. Choose **Import**.

------
#### [ Multifile Import ]

The following are the requirements for importing multiple files:
+ The files must be in the same folder of your Amazon S3 bucket.
+ The files must either share the same header or have no header.

Each file must be in one of the following formats:
+ CSV
+ Parquet
+ Optimized Row Columnar (ORC)
+ Image – Data Wrangler uses OpenCV to import images. For more information about supported image formats, see [Image file reading and writing](https://docs.opencv.org/3.4/d4/da8/group__imgcodecs.html#ga288b8b3da0892bd651fce07b3bbd3a56).

Use the following procedure to import multiple files.

**To import a dataset into Data Wrangler from multiple files that you've stored in an Amazon S3 directory**

1. If you are not currently on the **Import** tab, choose **Import**.

1. Under **Available**, choose **Amazon S3**.

1. From **Import tabular, image, or time-series data from S3**, do one of the following:
   + Choose an Amazon S3 bucket from the tabular view and navigate to the folder containing the files that you're importing.
   + For **S3 source**, specify the Amazon S3 bucket or an Amazon S3 URI with your files and select **Go**. The following are valid URIs:
     + `s3://amzn-s3-demo-bucket/example-prefix/example-prefix`
     + `example-access-point-aqfqprnstn7aefdfbarligizwgyfouse1a-s3alias/example-prefix/`
     + `s3://arn:aws:s3:AWS-Region:111122223333:accesspoint/example-prefix`

1. Select the folder containing the files that you want to import. Each file must be in one of the supported formats, and all of the files must be of the same type.

1. If your folder contains CSV files with headers, select the checkbox next to **First row is header**.

1. If your files are nested within other folders, select the checkbox next to **Include nested directories**.

1. (Optional) Choose **Add filename column** to add a column to the dataset that shows the filename for each observation.

1. (Optional) By default, Data Wrangler doesn't show you a preview of a folder. You can activate previewing by choosing the blue **Preview off** button. A preview shows the first 10 rows of the first 10 files in the folder.

1. In the **Details** pane, verify or change the **Name** and **File Type** for your dataset. If you add a **Name** that contains spaces, these spaces are replaced with underscores when your dataset is imported. 

1. Specify the sampling configuration that you'd like to use. 

1. Choose **Import dataset**.

------

You can also use parameters to import a subset of files that match a pattern. Parameters help you more selectively pick the files that you're importing. To start using parameters, edit the data source and apply them to the path that you're using to import the data. For more information, see [Reusing Data Flows for Different Datasets](data-wrangler-parameterize.md).
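
Conceptually, a parameterized path selects the subset of keys that match a pattern. The following sketch (with hypothetical key names) shows the idea with Python's `fnmatch`:

```python
from fnmatch import fnmatch

# Hypothetical object keys under a bucket prefix.
keys = [
    "example-prefix/sales-2023-01.csv",
    "example-prefix/sales-2023-02.csv",
    "example-prefix/inventory-2023-01.csv",
]

# A parameterized path such as s3://bucket/example-prefix/sales-{month}.csv
# behaves roughly like a glob over the matching keys.
selected = [k for k in keys if fnmatch(k, "example-prefix/sales-*.csv")]
print(selected)  # only the two sales files
```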

## Import data from Athena


Use Amazon Athena to import your data from Amazon Simple Storage Service (Amazon S3) into Data Wrangler. In Athena, you write standard SQL queries to select the data that you're importing from Amazon S3. For more information, see [What is Amazon Athena?](https://docs.aws.amazon.com/athena/latest/ug/what-is.html)

You can use the AWS Management Console to set up Amazon Athena. You must create at least one database in Athena before you start running queries. For more information about getting started with Athena, see [Getting started](https://docs.aws.amazon.com/athena/latest/ug/getting-started.html).

Athena is directly integrated with Data Wrangler. You can write Athena queries without having to leave the Data Wrangler UI.

In addition to writing simple Athena queries in Data Wrangler, you can also use:
+ Athena workgroups for query result management. For more information about workgroups, see [Managing query results](#data-wrangler-import-manage-results).
+ Lifecycle configurations for setting data retention periods. For more information about data retention, see [Setting data retention periods](#data-wrangler-import-athena-retention).

### Query Athena within Data Wrangler


**Note**  
Data Wrangler does not support federated queries.

If you use AWS Lake Formation with Athena, make sure your Lake Formation IAM permissions do not override IAM permissions for the database `sagemaker_data_wrangler`.

Data Wrangler gives you the ability to either import the entire dataset or sample a portion of it. For Athena, it provides the following sampling options:
+ None – Import the entire dataset.
+ First K – Sample the first K rows of the dataset, where K is an integer that you specify.
+ Randomized – Takes a random sample of a size that you specify.
+ Stratified – Takes a stratified random sample. A stratified sample preserves the ratio of values in a column.

The following procedure shows how to import a dataset from Athena into Data Wrangler.

**To import a dataset into Data Wrangler from Athena**

1. Sign in to the [Amazon SageMaker AI console](https://console.aws.amazon.com/sagemaker).

1. Choose **Studio**.

1. Choose **Launch app**.

1. From the dropdown list, select **Studio**.

1. Choose the Home icon.

1. Choose **Data**.

1. Choose **Data Wrangler**.

1. Choose **Import data**.

1. Under **Available**, choose **Amazon Athena**.

1. For **Data Catalog**, choose a data catalog.

1. Use the **Database** dropdown list to select the database that you want to query. When you select a database, you can preview all tables in your database using the **Tables** listed under **Details**.

1. (Optional) Choose **Advanced configuration**.

   1. Choose a **Workgroup**.

   1. If your workgroup hasn't enforced the Amazon S3 output location or if you don't use a workgroup, specify a value for **Amazon S3 location of query results**.

   1. (Optional) For **Data retention period**, select the checkbox to set a data retention period and specify the number of days to store the data before it's deleted.

   1. (Optional) By default, Data Wrangler saves the connection. You can choose to deselect the checkbox and not save the connection.

1. For **Sampling**, choose a sampling method. Choose **None** to turn off sampling.

1. Enter your query in the query editor and use the **Run** button to run the query. After a successful query, you can preview your result under the editor.
**Note**  
Salesforce data uses the `timestamptz` type. If you're querying the timestamp column that you've imported to Athena from Salesforce, cast the data in the column to the `timestamp` type. The following query casts the timestamp column to the correct type.  

   ```
   -- cast column timestamptz_col as timestamp type, and name it timestamp_col
   select cast(timestamptz_col as timestamp) as timestamp_col from table
   ```

1. To import the results of your query, select **Import**.

After you complete the preceding procedure, the dataset that you've queried and imported appears in the Data Wrangler flow.

By default, Data Wrangler saves the connection settings as a new connection. When you import your data, the query that you've already specified appears as a new connection. The saved connections store information about the Athena workgroups and Amazon S3 buckets that you're using. When you're connecting to the data source again, you can choose the saved connection.

### Managing query results


Data Wrangler supports using Athena workgroups to manage the query results within an AWS account. You can specify an Amazon S3 output location for each workgroup. You can also specify whether the output of the query can go to different Amazon S3 locations. For more information, see [Using Workgroups to Control Query Access and Costs](https://docs.aws.amazon.com/athena/latest/ug/manage-queries-control-costs-with-workgroups.html).

Your workgroup might be configured to enforce the Amazon S3 query output location. You can't change the output location of the query results for those workgroups.

If you don't use a workgroup or specify an output location for your queries, Data Wrangler uses the default Amazon S3 bucket in the same AWS Region in which your Studio Classic instance is located to store Athena query results. It creates temporary tables in this database to move the query output to this Amazon S3 bucket. It deletes these tables after data has been imported; however, the database `sagemaker_data_wrangler` persists. To learn more, see [Imported Data Storage](#data-wrangler-import-storage).

To use Athena workgroups, set up the IAM policy that gives access to workgroups. If you're using a `SageMaker AI-Execution-Role`, we recommend adding the policy to the role. For more information about IAM policies for workgroups, see [IAM policies for accessing workgroups](https://docs.aws.amazon.com/athena/latest/ug/workgroups-iam-policy.html). For example workgroup policies, see [Workgroup example policies](https://docs.aws.amazon.com/athena/latest/ug/example-policies-workgroup.html).

### Setting data retention periods


Data Wrangler automatically sets a data retention period for the query results. The results are deleted after the length of the retention period. The default retention period is five days, so the results of a query are deleted after five days. This configuration is designed to help you clean up data that you're no longer using. Cleaning up your data prevents unauthorized users from gaining access to it. It also helps control the costs of storing your data on Amazon S3.

If you don't set a retention period, the Amazon S3 lifecycle configuration determines the duration that the objects are stored. The data retention policy that you've specified for the lifecycle configuration removes any query results that are older than the lifecycle configuration that you've specified. For more information, see [Setting lifecycle configuration on a bucket](https://docs.aws.amazon.com/AmazonS3/latest/userguide/how-to-set-lifecycle-configuration-intro.html).

Data Wrangler uses Amazon S3 lifecycle configurations to manage data retention and expiration. You must give your Amazon SageMaker Studio Classic IAM execution role permissions to manage bucket lifecycle configurations. Use the following procedure to give permissions.

To give permissions to manage the lifecycle configuration, do the following.

1. Sign in to the AWS Management Console and open the IAM console at [https://console.aws.amazon.com/iam/](https://console.aws.amazon.com/iam/).

1. Choose **Roles**.

1. In the search bar, specify the Amazon SageMaker AI execution role that Amazon SageMaker Studio Classic is using.

1. Choose the role.

1. Choose **Add permissions**.

1. Choose **Create inline policy**.

1. For **Service**, specify **S3** and choose it.

1. Under the **Read** section, choose **GetLifecycleConfiguration**.

1. Under the **Write** section, choose **PutLifecycleConfiguration**.

1. For **Resources**, choose **Specific**.

1. For **Actions**, select the arrow icon next to **Permissions management**.

1. Choose **PutResourcePolicy**.

1. For **Resources**, choose **Specific**.

1. Choose the checkbox next to **Any in this account**.

1. Choose **Review policy**.

1. For **Name**, specify a name.

1. Choose **Create policy**.
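
With those permissions in place, the retention period maps to an Amazon S3 lifecycle rule. The following sketch builds such a rule for a five-day retention period; the bucket name and prefix are hypothetical placeholders, and the boto3 call is shown commented out because it requires AWS credentials:

```python
# Sketch of an S3 lifecycle rule that expires query results after five days.
lifecycle_configuration = {
    "Rules": [
        {
            "ID": "expire-athena-query-results",
            "Status": "Enabled",
            "Filter": {"Prefix": "athena-query-results/"},
            "Expiration": {"Days": 5},
        }
    ]
}

# Applying the rule uses the PutLifecycleConfiguration permission granted above:
# import boto3
# s3 = boto3.client("s3")
# s3.put_bucket_lifecycle_configuration(
#     Bucket="amzn-s3-demo-bucket",
#     LifecycleConfiguration=lifecycle_configuration,
# )
print(lifecycle_configuration["Rules"][0]["Expiration"])  # {'Days': 5}
```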

## Import data from Amazon Redshift


Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. The first step to create a data warehouse is to launch a set of nodes, called an Amazon Redshift cluster. After you provision your cluster, you can upload your dataset and then perform data analysis queries. 

You can connect to and query one or more Amazon Redshift clusters in Data Wrangler. To use this import option, you must create at least one cluster in Amazon Redshift. To learn how, see [Getting started with Amazon Redshift](https://docs.aws.amazon.com/redshift/latest/gsg/getting-started.html).

You can output the results of your Amazon Redshift query in one of the following locations:
+ The default Amazon S3 bucket
+ An Amazon S3 output location that you specify

Data Wrangler gives you the ability to either import the entire dataset or sample a portion of it. For Amazon Redshift, it provides the following sampling options:
+ None – Import the entire dataset.
+ First K – Sample the first K rows of the dataset, where K is an integer that you specify.
+ Randomized – Takes a random sample of a size that you specify.
+ Stratified – Takes a stratified random sample. A stratified sample preserves the ratio of values in a column.

Data Wrangler stores Amazon Redshift query results in the default Amazon S3 bucket, which is in the same AWS Region in which your Studio Classic instance is located. For more information, see [Imported Data Storage](#data-wrangler-import-storage).

For either the default Amazon S3 bucket or the bucket that you specify, you have the following encryption options:
+ The default server-side encryption with an Amazon S3 managed key (SSE-S3)
+ An AWS Key Management Service (AWS KMS) key that you specify

An AWS KMS key is an encryption key that you create and manage. For more information on KMS keys, see [AWS Key Management Service](https://docs.aws.amazon.com//kms/latest/developerguide/overview.html).

You can specify an AWS KMS key using either the key ARN or the alias ARN.
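
For reference, the two identifier formats look like the following; the Region, account ID, key ID, and alias name shown here are placeholders:

```
Key ARN:   arn:aws:kms:us-east-1:111122223333:key/1234abcd-12ab-34cd-56ef-1234567890ab
Alias ARN: arn:aws:kms:us-east-1:111122223333:alias/example-alias
```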

If you use the IAM managed policy, `AmazonSageMakerFullAccess`, to grant a role permission to use Data Wrangler in Studio Classic, your **Database User** name must have the prefix `sagemaker_access`.

Use the following procedures to learn how to add a new cluster. 

**Note**  
Data Wrangler uses the Amazon Redshift Data API with temporary credentials. To learn more about this API, refer to [Using the Amazon Redshift Data API](https://docs.aws.amazon.com//redshift/latest/mgmt/data-api.html) in the Amazon Redshift Management Guide. 

**To connect to an Amazon Redshift cluster**

1. Sign in to the [Amazon SageMaker AI console](https://console.aws.amazon.com/sagemaker).

1. Choose **Studio**.

1. Choose **Launch app**.

1. From the dropdown list, select **Studio**.

1. Choose the Home icon.

1. Choose **Data**.

1. Choose **Data Wrangler**.

1. Choose **Import data**.

1. Under **Available**, choose **Amazon Redshift**.

1. Choose **Temporary credentials (IAM)** for **Type**.

1. Enter a **Connection Name**. This is a name used by Data Wrangler to identify this connection. 

1. Enter the **Cluster Identifier** to specify the cluster that you want to connect to. Enter only the cluster identifier, not the full endpoint of the Amazon Redshift cluster.

1. Enter the **Database Name** of the database to which you want to connect.

1. Enter a **Database User** to identify the user you want to use to connect to the database. 

1. For **UNLOAD IAM Role**, enter the IAM role ARN of the role that the Amazon Redshift cluster should assume to move and write data to Amazon S3. For more information about this role, see [Authorizing Amazon Redshift to access other AWS services on your behalf](https://docs.aws.amazon.com/redshift/latest/mgmt/authorizing-redshift-service.html) in the Amazon Redshift Management Guide. 

1. Choose **Connect**.

1. (Optional) For **Amazon S3 output location**, specify the S3 URI to store the query results.

1. (Optional) For **KMS key ID**, specify the ARN of the AWS KMS key or alias. The following image shows you where you can find either key in the AWS Management Console.  
![\[The location of the AWS KMS alias ARN, alias name, and key ARN in the AWS KMS console.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/kms-alias-redacted.png)

The following image shows all the fields from the preceding procedure.

![\[The Add Amazon Redshift connection panel.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/redshift-connection.png)


After your connection is successfully established, it appears as a data source under **Data Import**. Select this data source to query your database and import data.

**To query and import data from Amazon Redshift**

1. Select the connection that you want to query from **Data Sources**.

1. Select a **Schema**. To learn more about Amazon Redshift Schemas, see [Schemas](https://docs.aws.amazon.com/redshift/latest/dg/r_Schemas_and_tables.html) in the Amazon Redshift Database Developer Guide.

1. (Optional) Under **Advanced configuration**, specify the **Sampling** method that you'd like to use.

1. Enter your query in the query editor and choose **Run** to run the query. After a successful query, you can preview your result under the editor.

1. Select **Import dataset** to import the dataset that has been queried. 

1. Enter a **Dataset name**. If you add a **Dataset name** that contains spaces, these spaces are replaced with underscores when your dataset is imported. 

1. Choose **Add**.

To edit a dataset, do the following.

1. Navigate to your Data Wrangler flow.

1. Choose the edit icon next to **Source - Sampled**.

1. Change the data that you're importing.

1. Choose **Apply**.

## Import data from Amazon EMR


You can use Amazon EMR as a data source for your Amazon SageMaker Data Wrangler flow. Amazon EMR is a managed cluster platform that you can use to process and analyze large amounts of data. For more information about Amazon EMR, see [What is Amazon EMR?](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-what-is-emr.html). To import a dataset from Amazon EMR, you connect to it and query it. 

**Important**  
You must meet the following prerequisites to connect to an Amazon EMR cluster:  
You have an Amazon VPC in the Region that you're using to launch Amazon SageMaker Studio Classic and Amazon EMR.
Both Amazon EMR and Amazon SageMaker Studio Classic must be launched in private subnets. They can be in the same subnet or in different ones.
Amazon SageMaker Studio Classic must be in VPC-only mode.  
For more information about creating a VPC, see [Create a VPC](https://docs.aws.amazon.com/vpc/latest/userguide/working-with-vpcs.html#Create-VPC).  
For more information about connecting Studio Classic notebooks in a VPC to external resources, see [Connect SageMaker Studio Classic Notebooks in a VPC to External Resources](https://docs.aws.amazon.com/vpc/latest/userguide/studio-notebooks-and-internet-access.html).
The Amazon EMR clusters that you're running must be in the same Amazon VPC.
The Amazon EMR clusters and the Amazon VPC must be in the same AWS account.
Your Amazon EMR clusters are running Hive or Presto.  
Hive clusters must allow inbound traffic from Studio Classic security groups on port 10000.
Presto clusters must allow inbound traffic from Studio Classic security groups on port 8889.  
The port number is different for Amazon EMR clusters using IAM roles. Navigate to the end of the prerequisites section for more information.
Amazon SageMaker Studio Classic must run Jupyter Lab Version 3. For information about updating the Jupyter Lab Version, see [View and update the JupyterLab version of an application from the console](studio-jl.md#studio-jl-view).
Amazon SageMaker Studio Classic has an IAM role that controls user access. The default IAM role that you're using to run Amazon SageMaker Studio Classic doesn't have policies that can give you access to Amazon EMR clusters. You must attach the policy granting permissions to the IAM role. For more information, see [Configure listing Amazon EMR clusters](studio-notebooks-configure-discoverability-emr-cluster.md).
The IAM role must also have a policy attached that grants the `secretsmanager:PutResourcePolicy` permission.
If you're using a Studio Classic domain that you've already created, make sure that its `AppNetworkAccessType` is in VPC-only mode. For information about updating a domain to use VPC-only mode, see [Shut Down and Update Amazon SageMaker Studio Classic](studio-tasks-update-studio.md).
You must have Hive or Presto installed on your cluster.
The Amazon EMR release must be version 5.5.0 or later.  
Amazon EMR supports auto termination. Auto termination stops idle clusters from running and prevents you from incurring costs. The following are the releases that support auto termination:  
For 6.x releases, version 6.1.0 or later.
For 5.x releases, version 5.30.0 or later.
Use the following pages to set up IAM runtime roles for the Amazon EMR cluster. You must enable in-transit encryption when you're using runtime roles:  
[Prerequisites for launching an Amazon EMR cluster with a runtime role](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-steps-runtime-roles.html#emr-steps-runtime-roles-configure)
[Launch an Amazon EMR cluster with role-based access control](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-steps-runtime-roles.html#emr-steps-runtime-roles-launch)
You must use Lake Formation as a governance tool for the data within your databases. You must also use external data filtering for access control.  
For more information about Lake Formation, see [What is AWS Lake Formation?](https://docs.aws.amazon.com/lake-formation/latest/dg/what-is-lake-formation.html)
For more information about integrating Lake Formation into Amazon EMR, see [Integrating third-party services with Lake Formation](https://docs.aws.amazon.com/lake-formation/latest/dg/Integrating-with-LakeFormation.html).
The version of your cluster must be 6.9.0 or later.
Access to AWS Secrets Manager. For more information about Secrets Manager, see [What is AWS Secrets Manager?](https://docs.aws.amazon.com/secretsmanager/latest/userguide/intro.html)
Hive clusters must allow inbound traffic from Studio Classic security groups on port 10000.

An Amazon VPC is a virtual network that is logically isolated from other networks in the AWS Cloud. Amazon SageMaker Studio Classic and your Amazon EMR cluster exist only within the Amazon VPC.

Use the following procedure to launch Amazon SageMaker Studio Classic in an Amazon VPC.

To launch Studio Classic within a VPC, do the following.

1. Navigate to the SageMaker AI console at [https://console.aws.amazon.com/sagemaker/](https://console.aws.amazon.com/sagemaker/).

1. Choose **Launch SageMaker Studio Classic**.

1. Choose **Standard setup**.

1. For **Default execution role**, choose the IAM role to set up Studio Classic.

1. Choose the VPC where you've launched the Amazon EMR clusters.

1. For **Subnet**, choose a private subnet.

1. For **Security group(s)**, specify the security groups that you're using to control communication within your VPC.

1. Choose **VPC Only**.

1. (Optional) AWS uses a default encryption key. You can specify an AWS Key Management Service key to encrypt your data.

1. Choose **Next**.

1. Under **Studio settings**, choose the configurations that are best suited to you.

1. Choose **Next** to skip the SageMaker Canvas settings.

1. Choose **Next** to skip the RStudio settings.

If you don't have an Amazon EMR cluster ready, you can use the following procedure to create one. For more information about Amazon EMR, see [What is Amazon EMR?](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-what-is-emr.html)

To create a cluster, do the following.

1. Navigate to the AWS Management Console.

1. In the search bar, specify **Amazon EMR**.

1. Choose **Create cluster**.

1. For **Cluster name**, specify the name of your cluster.

1. For **Release**, select the release version of the cluster.
**Note**  
Amazon EMR supports auto termination for the following releases:  
For 6.x releases, releases 6.1.0 or later
For 5.x releases, releases 5.30.0 or later
Auto termination stops idle clusters from running and prevents you from incurring costs.

1. (Optional) For **Applications**, choose **Presto**.

1. Choose the application that you're running on the cluster.

1. Under **Networking**, for **Hardware configuration**, specify the hardware configuration settings.
**Important**  
For **Networking**, choose the VPC that is running Amazon SageMaker Studio Classic and choose a private subnet.

1. Under **Security and access**, specify the security settings.

1. Choose **Create**.

For a tutorial about creating an Amazon EMR cluster, see [Getting started with Amazon EMR](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-gs.html). For information about best practices for configuring a cluster, see [Considerations and best practices](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-ha-considerations.html).

**Note**  
For security best practices, Data Wrangler can only connect to VPCs on private subnets. You can't connect to the master node unless you use AWS Systems Manager for your Amazon EMR instances. For more information, see [Securing access to EMR clusters using AWS Systems Manager](https://aws.amazon.com/blogs/big-data/securing-access-to-emr-clusters-using-aws-systems-manager/).

You can currently use the following methods to access an Amazon EMR cluster:
+ No authentication
+ Lightweight Directory Access Protocol (LDAP)
+ IAM (Runtime role)

Not using authentication or using LDAP can require you to create multiple clusters and Amazon EC2 instance profiles. If you’re an administrator, you might need to provide groups of users with different levels of access to the data. These methods can result in administrative overhead that makes it more difficult to manage your users.

We recommend using an IAM runtime role that gives multiple users the ability to connect to the same Amazon EMR cluster. A runtime role is an IAM role that you can assign to a user who is connecting to an Amazon EMR cluster. You can configure the runtime IAM role to have permissions that are specific to each group of users.

Use the following sections to create a Presto or Hive Amazon EMR cluster with LDAP activated.

------
#### [ Presto ]

**Important**  
To use AWS Glue as a metastore for Presto tables, select **Use** for **Presto table metadata** to store the results of your Amazon EMR queries in an AWS Glue data catalog when you're launching an EMR cluster. Storing the query results in an AWS Glue data catalog can save you from incurring charges.  
To query large datasets on Amazon EMR clusters, you must add the following properties to the Presto configuration file on your Amazon EMR clusters:  

```
[{"classification":"presto-config","properties":{
"http-server.max-request-header-size":"5MB",
"http-server.max-response-header-size":"5MB"}}]
```
You can also modify the configuration settings when you launch the Amazon EMR cluster.  
The configuration file for your Amazon EMR cluster is located under the following path: `/etc/presto/conf/config.properties`.
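If you script cluster creation, you can generate the same classification programmatically. The following is a minimal Python sketch (the helper name is illustrative, not part of any AWS SDK) that builds the configuration list shown above and validates it as JSON before you pass it to your cluster-creation tooling:

```python
import json

def presto_header_config(max_header_size="5MB"):
    """Build the presto-config classification that raises the
    HTTP request/response header limits for large queries."""
    return [{
        "classification": "presto-config",
        "properties": {
            "http-server.max-request-header-size": max_header_size,
            "http-server.max-response-header-size": max_header_size,
        },
    }]

config = presto_header_config()
# Round-trip through JSON to confirm the structure is valid
# before supplying it during cluster launch.
assert json.loads(json.dumps(config)) == config
print(json.dumps(config))
```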

Use the following procedure to create a Presto cluster with LDAP activated.

To create a cluster, do the following.

1. Navigate to the AWS Management Console.

1. In the search bar, specify **Amazon EMR**.

1. Choose **Create cluster**.

1. For **Cluster name**, specify the name of your cluster.

1. For **Release**, select the release version of the cluster.
**Note**  
Amazon EMR supports auto termination for the following releases:  
For 6.x releases, releases 6.1.0 or later
For 5.x releases, releases 5.30.0 or later
Auto termination stops idle clusters from running and prevents you from incurring costs.

1. Choose the application that you're running on the cluster.

1. Under **Networking**, for **Hardware configuration**, specify the hardware configuration settings.
**Important**  
For **Networking**, choose the VPC that is running Amazon SageMaker Studio Classic and choose a private subnet.

1. Under **Security and access**, specify the security settings.

1. Choose **Create**.

------
#### [ Hive ]

**Important**  
To use AWS Glue as a metastore for Hive tables, select **Use** for **Hive table metadata** to store the results of your Amazon EMR queries in an AWS Glue Data Catalog when you're launching an EMR cluster. Storing the query results in an AWS Glue Data Catalog can save you from incurring charges.  
To query large datasets on Amazon EMR clusters, add the following properties to the Hive configuration file on your Amazon EMR clusters:  

```
[{"classification":"hive-site", "properties"
:{"hive.resultset.use.unique.column.names":"false"}}]
```
You can also modify the configuration settings when you launch the Amazon EMR cluster.  
The configuration file for your Amazon EMR cluster is located under the following path: `/etc/hive/conf/hive-site.xml`. You can specify the following property and restart the cluster:  

```
<property>
    <name>hive.resultset.use.unique.column.names</name>
    <value>false</value>
</property>
```
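If you'd rather generate the property element than hand-edit `hive-site.xml`, the following Python sketch (a convenience helper, not an AWS tool) builds the same XML fragment:

```python
import xml.etree.ElementTree as ET

def hive_property(name, value):
    """Build a <property> element in the hive-site.xml format."""
    prop = ET.Element("property")
    ET.SubElement(prop, "name").text = name
    ET.SubElement(prop, "value").text = value
    return prop

elem = hive_property("hive.resultset.use.unique.column.names", "false")
xml_fragment = ET.tostring(elem, encoding="unicode")
print(xml_fragment)
```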

Use the following procedure to create a Hive cluster with LDAP activated.

To create a Hive cluster with LDAP activated, do the following.

1. Navigate to the AWS Management Console.

1. In the search bar, specify **Amazon EMR**.

1. Choose **Create cluster**.

1. Choose **Go to advanced options**.

1. For **Release**, select an Amazon EMR release version.

1. The **Hive** configuration option is selected by default. Make sure that the checkbox next to **Hive** is selected.

1. (Optional) You can also select **Presto** as a configuration option to activate both Hive and Presto on your cluster.

1. (Optional) Select **Use for Hive table metadata** to store the results of your Amazon EMR queries in an AWS Glue Data Catalog. Storing the query results in an AWS Glue Data Catalog can save you from incurring charges. For more information, see [Using the AWS Glue Data Catalog as the metastore for Hive](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-hive-metastore-glue.html).
**Note**  
Storing the query results in a data catalog requires Amazon EMR version 5.8.0 or later.

1. Under **Enter configuration**, specify the following JSON:

   ```
   [
     {
       "classification": "hive-site",
       "properties": {
         "hive.server2.authentication.ldap.baseDN": "dc=example,dc=org",
         "hive.server2.authentication": "LDAP",
         "hive.server2.authentication.ldap.url": "ldap://ldap-server-dns-name:389"
       }
     }
   ]
   ```
**Note**  
As a security best practice, we recommend enabling SSL for HiveServer by adding a few properties in the preceding hive-site JSON. For more information, see [Enable SSL on HiveServer2](https://docs.cloudera.com/HDPDocuments/HDP3/HDP-3.0.1/configuring-wire-encryption/content/enable_ssl_on_hiveserver2.html).

1. Specify the remaining cluster settings and create a cluster.

------

Use the following sections to use LDAP authentication for Amazon EMR clusters that you've already created.

------
#### [ LDAP for Presto ]

Using LDAP on a cluster running Presto requires access to the Presto coordinator through HTTPS. Do the following to provide access:
+ Activate access on port 636
+ Enable SSL for the Presto coordinator

Use the following template to configure Presto:

```
- Classification: presto-config
     ConfigurationProperties:
        http-server.authentication.type: 'PASSWORD'
        http-server.https.enabled: 'true'
        http-server.https.port: '8889'
        http-server.http.port: '8899'
        node-scheduler.include-coordinator: 'true'
        http-server.https.keystore.path: '/path/to/keystore/path/for/presto'
        http-server.https.keystore.key: 'keystore-key-password'
        discovery.uri: 'http://master-node-dns-name:8899'
- Classification: presto-password-authenticator
     ConfigurationProperties:
        password-authenticator.name: 'ldap'
        ldap.url: !Sub 'ldaps://ldap-server-dns-name:636'
        ldap.user-bind-pattern: "uid=${USER},dc=example,dc=org"
        internal-communication.authentication.ldap.user: "ldap-user-name"
        internal-communication.authentication.ldap.password: "ldap-password"
```
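The preceding template is CloudFormation-style YAML (the `!Sub` tag is a CloudFormation intrinsic; with no substitution variables it resolves to the literal string). If you're supplying the same settings as a JSON configuration instead, the equivalent structure looks like the following Python sketch. All LDAP and keystore values are placeholders that you replace with the values for your environment:

```python
import json

# Equivalent JSON form of the preceding YAML template.
presto_ldap_config = [
    {
        "Classification": "presto-config",
        "Properties": {
            "http-server.authentication.type": "PASSWORD",
            "http-server.https.enabled": "true",
            "http-server.https.port": "8889",
            "http-server.http.port": "8899",
            "node-scheduler.include-coordinator": "true",
            "http-server.https.keystore.path": "/path/to/keystore/path/for/presto",
            "http-server.https.keystore.key": "keystore-key-password",
            "discovery.uri": "http://master-node-dns-name:8899",
        },
    },
    {
        "Classification": "presto-password-authenticator",
        "Properties": {
            "password-authenticator.name": "ldap",
            "ldap.url": "ldaps://ldap-server-dns-name:636",
            "ldap.user-bind-pattern": "uid=${USER},dc=example,dc=org",
            "internal-communication.authentication.ldap.user": "ldap-user-name",
            "internal-communication.authentication.ldap.password": "ldap-password",
        },
    },
]

print(json.dumps(presto_ldap_config, indent=2))
```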

For information about setting up LDAP in Presto, see the following resources:
+ [LDAP Authentication](https://prestodb.io/docs/current/security/ldap.html)
+ [Using LDAP Authentication for Presto on Amazon EMR](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-presto-ldap.html)

**Note**  
As a security best practice, we recommend enabling SSL for Presto. For more information, see [Secure Internal Communication](https://prestodb.io/docs/current/security/internal-communication.html).

------
#### [ LDAP for Hive ]

To use LDAP for Hive for a cluster that you've created, use the procedure in [Reconfigure an instance group in the console](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-configure-apps-running-cluster.html#emr-configure-apps-running-cluster-considerations).

When you reconfigure the instance group, specify the following configuration for the cluster that you're connecting to.

```
[
  {
    "classification": "hive-site",
    "properties": {
      "hive.server2.authentication.ldap.baseDN": "dc=example,dc=org",
      "hive.server2.authentication": "LDAP",
      "hive.server2.authentication.ldap.url": "ldap://ldap-server-dns-name:389"
    }
  }
]
```

------

Use the following procedure to import data from a cluster.

To import data from a cluster, do the following.

1. Open a Data Wrangler flow.

1. Choose **Create Connection**.

1. Choose **Amazon EMR**.

1. Do one of the following.
   + (Optional) For **Secrets ARN**, specify the Amazon Resource Name (ARN) of the database within the cluster. Secrets provide additional security. For more information about secrets, see [What is AWS Secrets Manager?](https://docs.aws.amazon.com/secretsmanager/latest/userguide/intro.html) For information about creating a secret for your cluster, see [Creating an AWS Secrets Manager secret for your cluster](#data-wrangler-emr-secrets-manager).
**Important**  
You must specify a secret if you're using an IAM runtime role for authentication.
   + From the dropdown list, choose a cluster.

1. Choose **Next**.

1. For **Select an endpoint for *example-cluster-name* cluster**, choose a query engine.

1. (Optional) Select **Save connection**.

1. Choose **Next, select login** and choose one of the following:
   + No authentication
   + LDAP
   + IAM

1. For **Login into *example-cluster-name* cluster**, specify the **Username** and **Password** for the cluster.

1. Choose **Connect**.

1. In the query editor, specify a SQL query.

1. Choose **Run**.

1. Choose **Import**.

### Creating an AWS Secrets Manager secret for your cluster


If you're using an IAM runtime role to access your Amazon EMR cluster, you must store the credentials that you're using to access the cluster as a Secrets Manager secret. You store all the credentials that you use to access the cluster within the secret.

You must store the following information in the secret:
+ JDBC endpoint – `jdbc:hive2://`
+ DNS name – The DNS name of your Amazon EMR cluster. It's either the endpoint for the primary node or the hostname.
+ Port – `8446`

You can also store the following additional information within the secret:
+ IAM role – The IAM role that you're using to access the cluster. Data Wrangler uses your SageMaker AI execution role by default.
+ Truststore path – By default, Data Wrangler creates a truststore path for you. You can also use your own truststore path. For more information about truststore paths, see [In-transit encryption in HiveServer2](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/hs2-encryption-intransit.html).
+ Truststore password – By default, Data Wrangler creates a truststore password for you. You can also use your own truststore password. For more information about truststore passwords, see [In-transit encryption in HiveServer2](https://docs.aws.amazon.com/emr/latest/ReleaseGuide/hs2-encryption-intransit.html).

Use the following procedure to store the credentials within a Secrets Manager secret.

To store your credentials as a secret, do the following.

1. Navigate to the AWS Management Console.

1. In the search bar, specify Secrets Manager.

1. Choose **AWS Secrets Manager**.

1. Choose **Store a new secret**.

1. For **Secret type**, choose **Other type of secret**.

1. Under **Key/value** pairs, select **Plaintext**.

1. For clusters running Hive, you can use the following template for IAM authentication.

   ```
   {"jdbcURL": "",
    "iam_auth": {"endpoint": "jdbc:hive2://", #required
                 "dns": "ip-xx-x-xxx-xxx.ec2.internal", #required
                 "port": "10000", #required
                 "cluster_id": "j-xxxxxxxxx", #required
                 "iam_role": "arn:aws:iam::xxxxxxxx:role/xxxxxxxxxxxx", #optional
                 "truststore_path": "/etc/alternatives/jre/lib/security/cacerts", #optional
                 "truststore_password": "changeit" #optional
                }}
   ```
**Note**  
After you import your data, you apply transformations to it. You then export the transformed data to a specific location. If you're using a Jupyter notebook to export your transformed data to Amazon S3, you must use the truststore path specified in the preceding example.
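Note that JSON doesn't support `#` comments, so remove the `#required` and `#optional` annotations from the template before you store the secret. The following Python sketch (the helper name and sample values are illustrative placeholders) assembles a valid secret string with the required fields and any optional ones you supply:

```python
import json

def build_emr_hive_secret(dns, cluster_id, iam_role=None,
                          truststore_path=None, truststore_password=None):
    """Assemble the Secrets Manager secret value for IAM-authenticated
    Hive access. dns and cluster_id are required; the rest are optional."""
    iam_auth = {
        "endpoint": "jdbc:hive2://",
        "dns": dns,
        "port": "10000",
        "cluster_id": cluster_id,
    }
    if iam_role:
        iam_auth["iam_role"] = iam_role
    if truststore_path:
        iam_auth["truststore_path"] = truststore_path
    if truststore_password:
        iam_auth["truststore_password"] = truststore_password
    return json.dumps({"jdbcURL": "", "iam_auth": iam_auth})

secret = build_emr_hive_secret("ip-10-0-0-1.ec2.internal", "j-EXAMPLE12345")
# The string must parse back as valid JSON before you paste it
# into the Secrets Manager console.
parsed = json.loads(secret)
```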

A Secrets Manager secret stores the JDBC URL of the Amazon EMR cluster as a secret. Using a secret is more secure than directly entering in your credentials.

Use the following procedure to store the JDBC URL as a secret.

To store the JDBC URL as a secret, do the following.

1. Navigate to the AWS Management Console.

1. In the search bar, specify Secrets Manager.

1. Choose **AWS Secrets Manager**.

1. Choose **Store a new secret**.

1. For **Secret type**, choose **Other type of secret**.

1. For **Key/value pairs**, specify `jdbcURL` as the key and a valid JDBC URL as the value.

   The format of a valid JDBC URL depends on whether you use authentication and whether you use Hive or Presto as the query engine. The following list shows the valid JDBC URL formats for the different possible configurations.
   + Hive, no authentication – `jdbc:hive2://emr-cluster-master-public-dns:10000/;`
   + Hive, LDAP authentication – `jdbc:hive2://emr-cluster-master-public-dns-name:10000/;AuthMech=3;UID=david;PWD=welcome123;`
   + For Hive with SSL enabled, the JDBC URL format depends on whether you use a Java Keystore File for the TLS configuration. The Java Keystore File helps verify the identity of the master node of the Amazon EMR cluster. To use a Java Keystore File, generate it on an EMR cluster and upload it to Data Wrangler. To generate a file, use the following command on the Amazon EMR cluster, `keytool -genkey -alias hive -keyalg RSA -keysize 1024 -keystore hive.jks`. For information about running commands on an Amazon EMR cluster, see [Securing access to EMR clusters using AWS Systems Manager](https://aws.amazon.com/blogs/big-data/securing-access-to-emr-clusters-using-aws-systems-manager/). To upload a file, choose the upward arrow on the left-hand navigation of the Data Wrangler UI.

     The following are the valid JDBC URL formats for Hive with SSL enabled:
     + Without a Java Keystore File – `jdbc:hive2://emr-cluster-master-public-dns:10000/;AuthMech=3;UID=user-name;PWD=password;SSL=1;AllowSelfSignedCerts=1;`
     + With a Java Keystore File – `jdbc:hive2://emr-cluster-master-public-dns:10000/;AuthMech=3;UID=user-name;PWD=password;SSL=1;SSLKeyStore=/home/sagemaker-user/data/Java-keystore-file-name;SSLKeyStorePwd=Java-keystore-file-password;`
   + Presto, no authentication – `jdbc:presto://emr-cluster-master-public-dns:8889/;`
   + For Presto with LDAP authentication and SSL enabled, the JDBC URL format depends on whether you use a Java Keystore File for the TLS configuration. The Java Keystore File helps verify the identity of the master node of the Amazon EMR cluster. To use a Java Keystore File, generate it on an EMR cluster and upload it to Data Wrangler. To upload a file, choose the upward arrow on the left-hand navigation of the Data Wrangler UI. For information about creating a Java Keystore File for Presto, see [Java Keystore File for TLS](https://prestodb.io/docs/current/security/tls.html#server-java-keystore). For information about running commands on an Amazon EMR cluster, see [Securing access to EMR clusters using AWS Systems Manager](https://aws.amazon.com/blogs/big-data/securing-access-to-emr-clusters-using-aws-systems-manager/).
     + Without a Java Keystore File – `jdbc:presto://emr-cluster-master-public-dns:8889/;SSL=1;AuthenticationType=LDAP Authentication;UID=user-name;PWD=password;AllowSelfSignedServerCert=1;AllowHostNameCNMismatch=1;`
      + With a Java Keystore File – `jdbc:presto://emr-cluster-master-public-dns:8889/;SSL=1;AuthenticationType=LDAP Authentication;SSLTrustStorePath=/home/sagemaker-user/data/Java-keystore-file-name;SSLTrustStorePwd=Java-keystore-file-password;UID=user-name;PWD=password;`
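The URL variants above differ only in the key-value segments that they append after the host and port. A small Python sketch (a convenience helper, not part of Data Wrangler) that assembles a URL from those segments can help you avoid typos:

```python
def build_jdbc_url(engine, host, port, **params):
    """Assemble a Hive or Presto JDBC URL. Each key=value segment
    is terminated with a semicolon, matching the formats above."""
    url = f"jdbc:{engine}://{host}:{port}/;"
    for key, value in params.items():
        url += f"{key}={value};"
    return url

# Hive with LDAP authentication:
hive_url = build_jdbc_url("hive2", "emr-cluster-master-public-dns", 10000,
                          AuthMech=3, UID="david", PWD="welcome123")
# → jdbc:hive2://emr-cluster-master-public-dns:10000/;AuthMech=3;UID=david;PWD=welcome123;
print(hive_url)
```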

Throughout the process of importing data from an Amazon EMR cluster, you might run into issues. For information about troubleshooting them, see [Troubleshooting issues with Amazon EMR](data-wrangler-trouble-shooting.md#data-wrangler-trouble-shooting-emr).

## Import data from Databricks (JDBC)


You can use Databricks as a data source for your Amazon SageMaker Data Wrangler flow. To import a dataset from Databricks, use the JDBC (Java Database Connectivity) import functionality to access your Databricks database. After you access the database, specify a SQL query to get the data and import it.

We assume that you have a running Databricks cluster and that you've configured your JDBC driver to it. For more information, see the following Databricks documentation pages:
+ [JDBC driver](https://docs.databricks.com/integrations/bi/jdbc-odbc-bi.html#jdbc-driver)
+ [JDBC configuration and connection parameters](https://docs.databricks.com/integrations/bi/jdbc-odbc-bi.html#jdbc-configuration-and-connection-parameters)
+ [Authentication parameters](https://docs.databricks.com/integrations/bi/jdbc-odbc-bi.html#authentication-parameters)

Data Wrangler stores your JDBC URL in AWS Secrets Manager. You must give your Amazon SageMaker Studio Classic IAM execution role permissions to use Secrets Manager. Use the following procedure to give permissions.

To give permissions to Secrets Manager, do the following.

1. Sign in to the AWS Management Console and open the IAM console at [https://console.aws.amazon.com/iam/](https://console.aws.amazon.com/iam/).

1. Choose **Roles**.

1. In the search bar, specify the Amazon SageMaker AI execution role that Amazon SageMaker Studio Classic is using.

1. Choose the role.

1. Choose **Add permissions**.

1. Choose **Create inline policy**.

1. For **Service**, specify **Secrets Manager** and choose it.

1. For **Actions**, select the arrow icon next to **Permissions management**.

1. Choose **PutResourcePolicy**.

1. For **Resources**, choose **Specific**.

1. Choose the checkbox next to **Any in this account**.

1. Choose **Review policy**.

1. For **Name**, specify a name.

1. Choose **Create policy**.

You can use partitions to import your data more quickly. Partitions give Data Wrangler the ability to process the data in parallel. By default, Data Wrangler uses 2 partitions. For most use cases, 2 partitions give you near-optimal data processing speeds.

If you choose to specify more than 2 partitions, you can also specify a column to partition the data. The type of the values in the column must be numeric or date.

We recommend using partitions only if you understand the structure of the data and how it's processed.

You can either import the entire dataset or sample a portion of it. For a Databricks database, Data Wrangler provides the following sampling options:
+ None – Import the entire dataset.
+ First K – Sample the first K rows of the dataset, where K is an integer that you specify.
+ Randomized – Takes a random sample of a size that you specify.
+ Stratified – Takes a stratified random sample. A stratified sample preserves the ratio of values in a column.

Use the following procedure to import your data from a Databricks database.

To import data from Databricks, do the following.

1. Sign into [Amazon SageMaker AI Console](https://console.aws.amazon.com/sagemaker).

1. Choose **Studio**.

1. Choose **Launch app**.

1. From the dropdown list, select **Studio**.

1. From the **Import data** tab of your Data Wrangler flow, choose **Databricks**.

1. Specify the following fields:
   + **Dataset name** – A name that you want to use for the dataset in your Data Wrangler flow.
   + **Driver** – **com.simba.spark.jdbc.Driver**.
   + **JDBC URL** – The URL of the Databricks database. The URL formatting can vary between Databricks instances. For information about finding the URL and the specifying the parameters within it, see [JDBC configuration and connection parameters](https://docs.databricks.com/integrations/bi/jdbc-odbc-bi.html#jdbc-configuration-and-connection-parameters). The following is an example of how a URL can be formatted: `jdbc:spark://aws-sagemaker-datawrangler.cloud.databricks.com:443/default;transportMode=http;ssl=1;httpPath=sql/protocolv1/o/3122619508517275/0909-200301-cut318;AuthMech=3;UID=token;PWD=personal-access-token`.
**Note**  
You can specify a secret ARN that contains the JDBC URL instead of specifying the JDBC URL itself. The secret must contain a key-value pair with the following format: `jdbcURL:JDBC-URL`. For more information, see [What is Secrets Manager?](https://docs.aws.amazon.com/secretsmanager/latest/userguide/intro.html).

1. Specify a SQL SELECT statement.
**Note**  
Data Wrangler doesn't support Common Table Expressions (CTE) or temporary tables within a query.

1. For **Sampling**, choose a sampling method.

1. Choose **Run**. 

1. (Optional) In the **PREVIEW** pane, choose the gear icon to open the **Partition settings**.

   1. Specify the number of partitions. You can partition by column if you specify the number of partitions:
     + **Enter number of partitions** – Specify a value greater than 2.
     + (Optional) **Partition by column** – Specify the following fields. You can only partition by a column if you've specified a value for **Enter number of partitions**.
       + **Select column** – Select the column that you're using for the data partition. The data type of the column must be numeric or date.
       + **Upper bound** – From the values in the column that you've specified, the upper bound is the value that you're using in the partition. The value that you specify doesn't change the data that you're importing. It only affects the speed of the import. For the best performance, specify an upper bound that's close to the column's maximum.
       + **Lower bound** – From the values in the column that you've specified, the lower bound is the value that you're using in the partition. The value that you specify doesn't change the data that you're importing. It only affects the speed of the import. For the best performance, specify a lower bound that's close to the column's minimum.

1. Choose **Import**.
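To build intuition for how **Upper bound** and **Lower bound** affect the import, the following Python sketch shows the way JDBC readers commonly split a numeric column into per-partition ranges; Data Wrangler's internal behavior may differ in its details:

```python
def partition_ranges(lower, upper, num_partitions):
    """Split [lower, upper] into contiguous ranges, one per partition.
    The bounds affect how evenly the parallel reads are balanced,
    not which rows are imported."""
    stride = (upper - lower) / num_partitions
    ranges = []
    for i in range(num_partitions):
        lo = lower + i * stride
        hi = lower + (i + 1) * stride
        ranges.append((lo, hi))
    return ranges

# Four partitions over a column whose values run from 0 to 1000:
print(partition_ranges(0, 1000, 4))
# → [(0.0, 250.0), (250.0, 500.0), (500.0, 750.0), (750.0, 1000.0)]
```

Bounds far from the column's actual minimum and maximum leave most rows in the first or last range, which is why bounds close to the real extremes give the best performance.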

## Import data from Salesforce Data Cloud


You can use Salesforce Data Cloud as a data source in Amazon SageMaker Data Wrangler to prepare the data in your Salesforce Data Cloud for machine learning.

With Salesforce Data Cloud as a data source in Data Wrangler, you can quickly connect to your Salesforce data without writing a single line of code. You can join your Salesforce data with data from any other data source in Data Wrangler.

After you connect to the data cloud, you can do the following:
+ Visualize your data with built-in visualizations
+ Understand data and identify potential errors and extreme values
+ Transform data with more than 300 built-in transformations
+ Export the data that you've transformed

**Topics**
+ [Administrator setup](#data-wrangler-import-salesforce-data-cloud-administrator)
+ [Data Scientist Guide](#data-wrangler-salesforce-data-cloud-ds)

### Administrator setup


**Important**  
Before you get started, make sure that your users are running Amazon SageMaker Studio Classic version 1.3.0 or later. For information about checking the version of Studio Classic and updating it, see [Prepare ML Data with Amazon SageMaker Data Wrangler](data-wrangler.md).

When you're setting up access to Salesforce Data Cloud, you must complete the following tasks:
+ Getting your Salesforce Domain URL. Salesforce also refers to the Domain URL as your org's URL.
+ Getting OAuth credentials from Salesforce. 
+ Getting the authorization URL and token URL for your Salesforce Domain.
+ Creating an AWS Secrets Manager secret with the OAuth configuration.
+ Creating a lifecycle configuration that Data Wrangler uses to read the credentials from the secret.
+ Giving Data Wrangler permissions to read the secret.

After you perform the preceding tasks, your users can log into the Salesforce Data Cloud using OAuth.

**Note**  
Your users might run into issues after you've set everything up. For information about troubleshooting, see [Troubleshooting with Salesforce](data-wrangler-trouble-shooting.md#data-wrangler-troubleshooting-salesforce-data-cloud).

Use the following procedure to get the Domain URL.

1. Navigate to the [Salesforce](https://login.salesforce.com) login page.

1. For **Quick find**, specify **My Domain**.

1. Copy the value of **Current My Domain URL** to a text file.

1. Add `https://` to the beginning of the URL. 

After you get the Salesforce Domain URL, you can use the following procedure to get the login credentials from Salesforce and allow Data Wrangler to access your Salesforce data.

To get the login credentials from Salesforce and provide access to Data Wrangler, do the following.

1. Navigate to your Salesforce Domain URL and log into your account.

1. Choose the gear icon.

1. In the search bar that appears, specify **App Manager**.

1. Select **New Connected App**.

1. Specify the following fields:
   + Connected App Name – You can specify any name, but we recommend choosing a name that includes Data Wrangler. For example, you can specify **Salesforce Data Cloud Data Wrangler Integration**.
   + API name – Use the default value.
   + Contact Email – Specify your email address.
   + Under **API heading (Enable OAuth Settings)**, select the checkbox to activate OAuth settings.
   + For **Callback URL**, specify the Amazon SageMaker Studio Classic URL. To get the URL for Studio Classic, access it from the AWS Management Console and copy the URL.

1. Under **Selected OAuth Scopes**, move the following from the **Available OAuth Scopes** to **Selected OAuth Scopes**:
   + Manage user data via APIs (`api`)
   + Perform requests at any time (`refresh_token`, `offline_access`)
   + Perform ANSI SQL queries on Salesforce Data Cloud data (`cdp_query_api`)
   + Manage Salesforce Customer Data Platform profile data (`cdp_profile_api`)

1. Choose **Save**. After you save your changes, Salesforce opens a new page.

1. Choose **Continue**.

1. Navigate to **Consumer Key and Secret**.

1. Choose **Manage Consumer Details**. Salesforce redirects you to a new page where you might have to pass two-factor authentication.

1. Copy the **Consumer Key** and **Consumer Secret** to a text editor.
**Important**  
You need this information to connect the data cloud to Data Wrangler.

1. Navigate back to **Manage Connected Apps**.

1. Navigate to **Connected App Name** and choose the name of your application.

1. Choose **Manage**.

   1. Select **Edit Policies**.

   1. Change **IP Relaxation** to **Relax IP restrictions**.

   1. Choose **Save**.

After you provide access to your Salesforce Data Cloud, you need to provide permissions for your users. Use the following procedure to provide them with permissions.

To provide your users with permissions, do the following.

1. Navigate to the setup home page.

1. On the left-hand navigation, search for **Users** and choose the **Users** menu item.

1. Choose the hyperlink with your user name.

1. Navigate to **Permission Set Assignments**.

1. Choose **Edit Assignments**.

1. Add the following permissions:
   + **Customer Data Platform Admin**
   + **Customer Data Platform Data Aware Specialist**

1. Choose **Save**.

After you get the information for your Salesforce Domain, you must get the authorization URL and the token URL for the AWS Secrets Manager secret that you're creating.

Use the following procedure to get the authorization URL and the token URL.

**To get the authorization URL and token URL**

1. Navigate to your Salesforce Domain URL.

1. Use one of the following methods to get the URLs. If you are on a Linux distribution with `curl` and `jq` installed, we recommend using the method that only works on Linux.
   + (Linux only) Specify the following command in your terminal.

     ```
     curl salesforce-domain-URL/.well-known/openid-configuration | \
     jq '. | { authorization_url: .authorization_endpoint, token_url: .token_endpoint }' | \
     jq '.  += { identity_provider: "SALESFORCE", client_id: "example-client-id", client_secret: "example-client-secret" }'
     ```
   + 

      1. Navigate to *example-org-URL*/.well-known/openid-configuration in your browser.

     1. Copy the `authorization_endpoint` and `token_endpoint` to a text editor.

     1. Create the following JSON object:

        ```
        {
          "identity_provider": "SALESFORCE",
          "authorization_url": "example-authorization-endpoint", 
          "token_url": "example-token-endpoint",
          "client_id": "example-consumer-key",
          "client_secret": "example-consumer-secret"
        }
        ```
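If you'd rather script the browser-based method, the following Python sketch performs the same transformation as the `jq` pipeline above: it maps the fields of a `.well-known/openid-configuration` response to the JSON object for the secret. The sample endpoints are placeholders, not real values:

```python
import json

def build_salesforce_oauth_secret(openid_config, client_id, client_secret):
    """Map a Salesforce .well-known/openid-configuration response
    to the OAuth configuration object stored in the secret."""
    return {
        "identity_provider": "SALESFORCE",
        "authorization_url": openid_config["authorization_endpoint"],
        "token_url": openid_config["token_endpoint"],
        "client_id": client_id,
        "client_secret": client_secret,
    }

# Placeholder values standing in for a real response:
sample_response = {
    "authorization_endpoint": "https://example.my.salesforce.com/services/oauth2/authorize",
    "token_endpoint": "https://example.my.salesforce.com/services/oauth2/token",
}
secret = build_salesforce_oauth_secret(sample_response,
                                       "example-consumer-key",
                                       "example-consumer-secret")
print(json.dumps(secret, indent=2))
```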

After you create the OAuth configuration object, you can create an AWS Secrets Manager secret that stores it. Use the following procedure to create the secret.

To create a secret, do the following.

1. Navigate to the [AWS Secrets Manager console](https://console.aws.amazon.com/secretsmanager/).

1. Choose **Store a new secret**.

1. Select **Other type of secret**.

1. Under **Key/value** pairs, select **Plaintext**.

1. Replace the empty JSON with the following configuration settings.

   ```
   {
     "identity_provider": "SALESFORCE",
     "authorization_url": "example-authorization-endpoint", 
     "token_url": "example-token-endpoint",
     "client_id": "example-consumer-key",
     "client_secret": "example-consumer-secret"
   }
   ```

1. Choose **Next**.

1. For **Secret Name**, specify the name of the secret.

1. Under **Tags**, choose **Add**.

   1. For the **Key**, specify **sagemaker:partner**. For **Value**, we recommend specifying a value that might be useful for your use case. However, you can specify anything.
**Important**  
You must create the key. You can't import your data from Salesforce if you don't create it.

1. Choose **Next**.

1. Choose **Store**.

1. Choose the secret you've created.

1. Make a note of the following fields:
   + The Amazon Resource Name (ARN) of the secret
   + The name of the secret

After you've created the secret, you must add permissions for Data Wrangler to read the secret. Use the following procedure to add permissions.

To add read permissions for Data Wrangler, do the following.

1. Navigate to the [Amazon SageMaker AI console](https://console.aws.amazon.com/sagemaker/).

1. Choose **domains**.

1. Choose the domain that you're using to access Data Wrangler.

1. Choose your **User Profile**.

1. Under **Details**, find the **Execution role**. Its ARN is in the following format: `arn:aws:iam::111122223333:role/example-role`. Make a note of the SageMaker AI execution role. Within the ARN, it's everything after `role/`.

1. Navigate to the [IAM console](https://console.aws.amazon.com/iam).

1. In the **Search IAM** search bar, specify the name of the SageMaker AI execution role.

1. Choose the role.

1. Choose **Add permissions**.

1. Choose **Create inline policy**.

1. Choose the JSON tab.

1. Specify the following policy within the editor.

------
#### [ JSON ]


   ```
   {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "secretsmanager:GetSecretValue",
                "secretsmanager:PutSecretValue"
            ],
            "Resource": "arn:aws:secretsmanager:*:*:secret:*",
            "Condition": {
                "ForAnyValue:StringLike": {
                    "aws:ResourceTag/sagemaker:partner": "*"
                }
            }
        },
        {
            "Effect": "Allow",
            "Action": [
                "secretsmanager:UpdateSecret"
            ],
            "Resource": "arn:aws:secretsmanager:*:*:secret:AmazonSageMaker-*"
        }
    ]
   }
   ```

------

1. Choose **Review Policy**.

1. For **Name**, specify a name.

1. Choose **Create policy**.

After you've given Data Wrangler permissions to read the secret, you must add a Lifecycle Configuration that uses your Secrets Manager secret to your Amazon SageMaker Studio Classic user profile.

Use the following procedure to create a lifecycle configuration and add it to the Studio Classic profile.

To create a lifecycle configuration and add it to the Studio Classic profile, do the following.

1. Navigate to the [Amazon SageMaker AI console](https://console.aws.amazon.com/sagemaker).

1. Choose **domains**.

1. Choose the domain that you're using to access Data Wrangler.

1. Choose your **User Profile**.

1. If you see the following applications, delete them:
   + KernelGateway
   + JupyterKernel
**Note**  
Deleting the applications updates Studio Classic. It can take a while for the updates to happen.

1. While you're waiting for updates to happen, choose **Lifecycle configurations**.

1. Make sure the page you're on says **Studio Classic Lifecycle configurations**.

1. Choose **Create configuration**.

1. Make sure **Jupyter server app** has been selected.

1. Choose **Next**.

1. For **Name**, specify a name for the configuration.

1. For **Scripts**, specify the following script:

   ```
   #!/bin/bash
   set -eux
   
   cat > ~/.sfgenie_identity_provider_oauth_config <<EOL
   {
       "secret_arn": "secrets-arn-containing-salesforce-credentials"
   }
   EOL
   ```

1. Choose **Submit**.

1. On the left-hand navigation, choose **domains**.

1. Choose your domain.

1. Choose **Environment**.

1. Under **Lifecycle configurations for personal Studio Classic apps**, choose **Attach**. 

1. Select **Existing configuration**.

1. Under **Studio Classic Lifecycle configurations** select the lifecycle configuration that you've created.

1. Choose **Attach to domain**.

1. Select the checkbox next to the lifecycle configuration that you've attached.

1. Select **Set as default**.

You might run into issues when you set up your lifecycle configuration. For information about debugging them, see [Debug Lifecycle Configurations in Amazon SageMaker Studio Classic](studio-lcc-debug.md).
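As an alternative to the console steps above, you can create the same lifecycle configuration with the AWS CLI. The following is a minimal sketch: the configuration name is a hypothetical placeholder, and the `aws` call is shown commented out because it requires credentials and the real secret ARN.

```shell
# Placeholder ARN; substitute the Secrets Manager secret that holds
# your Salesforce credentials.
SECRET_ARN="secrets-arn-containing-salesforce-credentials"

# Reproduce the lifecycle script from the console step above.
LCC_SCRIPT="#!/bin/bash
set -eux
cat > ~/.sfgenie_identity_provider_oauth_config <<EOL
{
    \"secret_arn\": \"${SECRET_ARN}\"
}
EOL"

# The CreateStudioLifecycleConfig API expects the script base64-encoded.
LCC_CONTENT=$(printf '%s' "$LCC_SCRIPT" | base64 | tr -d '\n')

# aws sagemaker create-studio-lifecycle-config \
#     --studio-lifecycle-config-name sfgenie-oauth-config \
#     --studio-lifecycle-config-app-type JupyterServer \
#     --studio-lifecycle-config-content "$LCC_CONTENT"
```

After the configuration exists, you still attach it to the domain and set it as default as described in the preceding procedure.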

### Data Scientist Guide


Use the following to connect to Salesforce Data Cloud and access your data in Data Wrangler.

**Important**  
Your administrator needs to use the information in the preceding sections to set up Salesforce Data Cloud. If you're running into issues, contact them for troubleshooting help.

Use the following procedure to open Studio Classic.

1. Use the steps in [Prerequisites](data-wrangler-getting-started.md#data-wrangler-getting-started-prerequisite) to access Data Wrangler through Amazon SageMaker Studio Classic.

1. Next to the user you want to use to launch Studio Classic, select **Launch app**.

1. Choose **Studio**.

**To create a dataset in Data Wrangler with data from the Salesforce Data Cloud**

1. Sign into [Amazon SageMaker AI Console](https://console.aws.amazon.com/sagemaker).

1. Choose **Studio**.

1. Choose **Launch app**.

1. From the dropdown list, select **Studio**.

1. Choose the Home icon.

1. Choose **Data**.

1. Choose **Data Wrangler**.

1. Choose **Import data**.

1. Under **Available**, choose **Salesforce Data Cloud**.

1. For **Connection name**, specify a name for your connection to the Salesforce Data Cloud.

1. For **Org URL**, specify the organization URL in your Salesforce account. You can get the URL from your administrator.

1. Choose **Connect**.

1. Specify your credentials to log into Salesforce.

You can begin creating a dataset using data from Salesforce Data Cloud after you've connected to it.

After you select a table, you can write queries and run them. The output of your query shows under **Query results**.

After you have settled on the output of your query, you can import it into a Data Wrangler flow to perform data transformations. 

After you've created a dataset, navigate to the **Data flow** screen to start transforming your data.

## Import data from Snowflake


You can use Snowflake as a data source in SageMaker Data Wrangler to prepare data in Snowflake for machine learning.

With Snowflake as a data source in Data Wrangler, you can quickly connect to Snowflake without writing a single line of code. You can join your data in Snowflake with data from any other data source in Data Wrangler.

Once connected, you can interactively query data stored in Snowflake, transform data with more than 300 preconfigured data transformations, and understand your data and identify potential errors and extreme values with a set of robust preconfigured visualization templates. You can also quickly identify inconsistencies in your data preparation workflow and diagnose issues before models are deployed into production. Finally, you can export your data preparation workflow to Amazon S3 for use with other SageMaker AI features such as Amazon SageMaker Autopilot, Amazon SageMaker Feature Store, and Amazon SageMaker Pipelines.

You can encrypt the output of your queries using an AWS Key Management Service key that you've created. For more information about AWS KMS, see [AWS Key Management Service](https://docs.aws.amazon.com//kms/latest/developerguide/overview.html).

**Topics**
+ [Administrator Guide](#data-wrangler-snowflake-admin)
+ [Data Scientist Guide](#data-wrangler-snowflake-ds)

### Administrator Guide


**Important**  
To learn more about granular access control and best practices, see [Security Access Control](https://docs.snowflake.com/en/user-guide/security-access-control.html). 

This section is for Snowflake administrators who are setting up access to Snowflake from within SageMaker Data Wrangler.

**Important**  
You are responsible for managing and monitoring the access control within Snowflake. Data Wrangler does not add a layer of access control with respect to Snowflake.  
Access control includes the following:  
+ The data that a user accesses
+ (Optional) The storage integration that provides Snowflake the ability to write query results to an Amazon S3 bucket
+ The queries that a user can run

#### (Optional) Configure Snowflake Data Import Permissions


By default, Data Wrangler queries the data in Snowflake without creating a copy of it in an Amazon S3 location. Use the following information if you're configuring a storage integration with Snowflake. Your users can use a storage integration to store their query results in an Amazon S3 location.

Your users might have different levels of access of sensitive data. For optimal data security, provide each user with their own storage integration. Each storage integration should have its own data governance policy.

This feature is currently not available in the opt-in Regions.

Snowflake requires the following permissions on an S3 bucket and directory to be able to access files in the directory:
+ `s3:GetObject`
+ `s3:GetObjectVersion`
+ `s3:ListBucket`
+ `s3:ListObjects`
+ `s3:GetBucketLocation`

**Create an IAM policy**

You must create an IAM policy to configure access permissions for Snowflake to load and unload data from an Amazon S3 bucket.

The following is the JSON policy document that you use to create the policy:

```
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:GetObject",
                "s3:GetObjectVersion",
                "s3:DeleteObject",
                "s3:DeleteObjectVersion"
            ],
            "Resource": "arn:aws:s3:::bucket/prefix/*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket"
            ],
            "Resource": "arn:aws:s3:::bucket",
            "Condition": {
                "StringLike": {
                    "s3:prefix": ["prefix/*"]
                }
            }
        }
    ]
}
```

For information and procedures about creating policies with policy documents, see [Creating IAM policies](https://docs.aws.amazon.com/IAM/latest/UserGuide/access_policies_create.html).
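If you prefer the command line, the following sketch creates the same policy with the AWS CLI. The bucket name, prefix, and policy name are placeholders, and the `aws` call is commented out because it requires credentials:

```shell
BUCKET="bucket"    # placeholder: your S3 bucket name
PREFIX="prefix"    # placeholder: the prefix Snowflake may read and write

# Write the policy document from the example above, substituting placeholders.
cat > snowflake-s3-policy.json <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:PutObject", "s3:GetObject", "s3:GetObjectVersion",
        "s3:DeleteObject", "s3:DeleteObjectVersion"
      ],
      "Resource": "arn:aws:s3:::${BUCKET}/${PREFIX}/*"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": "arn:aws:s3:::${BUCKET}",
      "Condition": {"StringLike": {"s3:prefix": ["${PREFIX}/*"]}}
    }
  ]
}
EOF

# Sanity-check that the document is valid JSON before creating the policy.
python3 -m json.tool snowflake-s3-policy.json > /dev/null && echo "valid JSON"

# aws iam create-policy \
#     --policy-name snowflake-storage-integration \
#     --policy-document file://snowflake-s3-policy.json
```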

For documentation that provides an overview of using IAM permissions with Snowflake, see the following resources:
+ [What is IAM?](https://docs.aws.amazon.com/IAM/latest/UserGuide/introduction.html)
+ [Create the IAM Role in AWS](https://docs.snowflake.com/en/user-guide/data-load-s3-config-storage-integration.html#step-2-create-the-iam-role-in-aws)
+ [Create a Cloud Storage Integration in Snowflake](https://docs.snowflake.com/en/user-guide/data-load-s3-config-storage-integration.html#step-3-create-a-cloud-storage-integration-in-snowflake)
+ [Retrieve the AWS IAM User for your Snowflake Account](https://docs.snowflake.com/en/user-guide/data-load-s3-config-storage-integration.html#step-4-retrieve-the-aws-iam-user-for-your-snowflake-account)
+ [Grant the IAM User Permissions to Access Bucket](https://docs.snowflake.com/en/user-guide/data-load-s3-config-storage-integration.html#step-5-grant-the-iam-user-permissions-to-access-bucket-objects).

To grant the data scientist's Snowflake role usage permission on the storage integration, you must run `GRANT USAGE ON INTEGRATION integration_name TO ROLE snowflake_role;`.
+ `integration_name` is the name of your storage integration.
+ `snowflake_role` is the name of the default [Snowflake role](https://docs.snowflake.com/en/user-guide/security-access-control-overview.html#roles) given to the data scientist user.
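For example, with a hypothetical integration named `s3_int` and a data scientist role named `data_scientist`, the statement run in a Snowflake worksheet would look like the following:

```
GRANT USAGE ON INTEGRATION s3_int TO ROLE data_scientist;
```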

#### Setting up Snowflake OAuth Access


Instead of having your users directly enter their credentials into Data Wrangler, you can have them use an identity provider to access Snowflake. The following are links to the Snowflake documentation for the identity providers that Data Wrangler supports.
+ [Azure AD](https://docs.snowflake.com/en/user-guide/oauth-azure.html)
+ [Okta](https://docs.snowflake.com/en/user-guide/oauth-okta.html)
+ [Ping Federate](https://docs.snowflake.com/en/user-guide/oauth-pingfed.html)

Use the documentation from the preceding links to set up access to your identity provider. The information and procedures in this section help you understand how to properly use the documentation to access Snowflake within Data Wrangler.

Your identity provider needs to recognize Data Wrangler as an application. Use the following procedure to register Data Wrangler as an application within the identity provider:

1. Select the configuration that starts the process of registering Data Wrangler as an application.

1. Provide the users within the identity provider access to Data Wrangler.

1. Turn on OAuth client authentication by storing the client credentials as an AWS Secrets Manager secret.

1. Specify a redirect URL using the following format: https://*domain-ID*.studio.*AWS Region*.sagemaker.aws/jupyter/default/lab
**Important**  
You're specifying the Amazon SageMaker AI domain ID and AWS Region that you're using to run Data Wrangler.
**Important**  
You must register a URL for each Amazon SageMaker AI domain and AWS Region where you're running Data Wrangler. Users from a domain and AWS Region that don't have redirect URLs set up for them won't be able to authenticate with the identity provider to access the Snowflake connection.

1. Make sure that the authorization code and refresh token grant types are allowed for the Data Wrangler application.

Within your identity provider, you must set up a server that sends OAuth tokens to Data Wrangler at the user level. The server sends the tokens with Snowflake as the audience.

Snowflake uses roles that are distinct from the IAM roles used in AWS. You must configure the identity provider to use the any-role scope so that the connection uses the default role associated with the user's Snowflake profile. For example, if a user has `systems administrator` as the default role in their Snowflake profile, the connection from Data Wrangler to Snowflake uses `systems administrator` as the role.

To set up the server, do the following. You're working within Snowflake for all steps except the last one.

1. Start setting up the server or API.

1. Configure the authorization server to use the authorization code and refresh token grant types.

1. Specify the lifetime of the access token.

1. Set the refresh token idle timeout. The idle timeout is the time that the refresh token expires if it's not used.
**Note**  
If you're scheduling jobs in Data Wrangler, we recommend making the idle timeout time greater than the frequency of the processing job. Otherwise, some processing jobs might fail because the refresh token expired before they could run. When the refresh token expires, the user must re-authenticate by accessing the connection that they've made to Snowflake through Data Wrangler.

1. Specify `session:role-any` as the new scope.
**Note**  
For Azure AD, copy the unique identifier for the scope. Data Wrangler requires you to provide it with the identifier.

1. Within the External OAuth Security Integration for Snowflake, enable `external_oauth_any_role_mode`.

**Important**  
Data Wrangler doesn't support rotating refresh tokens. Using rotating refresh tokens might result in access failures or users needing to log in frequently.

**Important**  
If the refresh token expires, your users must reauthenticate by accessing the connection that they've made to Snowflake through Data Wrangler.

After you've set up the OAuth provider, you provide Data Wrangler with the information it needs to connect to the provider. You can use the documentation from your identity provider to get values for the following fields:
+ Token URL – The URL of the endpoint that the identity provider uses to send tokens to Data Wrangler.
+ Authorization URL – The URL of the authorization server of the identity provider.
+ Client ID – The ID that the identity provider assigned to the Data Wrangler application.
+ Client secret – The secret that only the authorization server or API recognizes.
+ (Azure AD only) The OAuth scope credentials that you've copied.

You store the fields and values in an AWS Secrets Manager secret and add it to the Amazon SageMaker Studio Classic lifecycle configuration that you're using for Data Wrangler. A Lifecycle Configuration is a shell script. Use it to make the Amazon Resource Name (ARN) of the secret accessible to Data Wrangler. For information about creating secrets, see [Move hardcoded secrets to AWS Secrets Manager](https://docs.aws.amazon.com/secretsmanager/latest/userguide/hardcoded.html). For information about using lifecycle configurations in Studio Classic, see [Use Lifecycle Configurations to Customize Amazon SageMaker Studio Classic](studio-lcc.md).

**Important**  
Before you create a Secrets Manager secret, make sure that the SageMaker AI execution role that you're using for Amazon SageMaker Studio Classic has permissions to create and update secrets in Secrets Manager. For more information about adding permissions, see [Example: Permission to create secrets](https://docs.aws.amazon.com/secretsmanager/latest/userguide/auth-and-access_examples.html#auth-and-access_examples_create).

For Okta and Ping Federate, the following is the format of the secret:

```
{
    "token_url":"https://identityprovider.com/oauth2/example-portion-of-URL-path/v2/token",
    "client_id":"example-client-id",
    "client_secret":"example-client-secret",
    "identity_provider":"OKTA"|"PING_FEDERATE",
    "authorization_url":"https://identityprovider.com/oauth2/example-portion-of-URL-path/v2/authorize"
}
```

For Azure AD, the following is the format of the secret:

```
{
    "token_url":"https://identityprovider.com/oauth2/example-portion-of-URL-path/v2/token",
    "client_id":"example-client-id",
    "client_secret":"example-client-secret",
    "identity_provider":"AZURE_AD",
    "authorization_url":"https://identityprovider.com/oauth2/example-portion-of-URL-path/v2/authorize",
    "datasource_oauth_scope":"api://appuri/session:role-any)"
}
```
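As a sketch, you can also store the secret with the AWS CLI. All values are the placeholders from the format above, the secret name is hypothetical, and the `aws` call is commented out because it requires credentials:

```shell
# Assemble the secret value in the Okta/Ping Federate format shown above.
SECRET_STRING='{
    "token_url":"https://identityprovider.com/oauth2/example-portion-of-URL-path/v2/token",
    "client_id":"example-client-id",
    "client_secret":"example-client-secret",
    "identity_provider":"OKTA",
    "authorization_url":"https://identityprovider.com/oauth2/example-portion-of-URL-path/v2/authorize"
}'

# Validate the JSON locally before creating the secret.
printf '%s' "$SECRET_STRING" | python3 -m json.tool > /dev/null && echo "valid JSON"

# aws secretsmanager create-secret \
#     --name AmazonSageMaker-snowflake-oauth-secret \
#     --secret-string "$SECRET_STRING"
```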

You must have a lifecycle configuration that uses the Secrets Manager secret that you've created. You can either create the lifecycle configuration or modify one that has already been created. The configuration must use the following script.

```
#!/bin/bash

set -eux

## Script Body

cat > ~/.snowflake_identity_provider_oauth_config <<EOL
{
    "secret_arn": "example-secret-arn"
}
EOL
```

For information about setting up lifecycle configurations, see [Create and Associate a Lifecycle Configuration with Amazon SageMaker Studio Classic](studio-lcc-create.md). When you're going through the process of setting up, do the following:
+ Set the application type of the configuration to `Jupyter Server`.
+ Attach the configuration to the Amazon SageMaker AI domain that has your users.
+ Have the configuration run by default. It must run every time a user logs into Studio Classic. Otherwise, the credentials saved in the configuration won't be available to your users when they're using Data Wrangler.
+ The lifecycle configuration creates a file named `.snowflake_identity_provider_oauth_config` in the user's home folder. The file contains the ARN of the Secrets Manager secret. Make sure that the file is in the user's home folder every time the Jupyter Server's instance is initialized.
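To confirm that the configuration ran for a user, you can check for that file from a Studio Classic terminal. The sketch below simulates the file the lifecycle script writes (with a placeholder ARN) and reads the secret ARN back; in Studio Classic you would run only the last command, against `~/.snowflake_identity_provider_oauth_config`:

```shell
# Simulate the file the lifecycle configuration writes (placeholder ARN).
CONFIG_FILE=".snowflake_identity_provider_oauth_config"   # normally lives in ~/
cat > "$CONFIG_FILE" <<'EOL'
{
    "secret_arn": "example-secret-arn"
}
EOL

# Read the ARN back, which also verifies the file is well-formed JSON.
SECRET_ARN=$(python3 -c "import json; print(json.load(open('$CONFIG_FILE'))['secret_arn'])")
echo "$SECRET_ARN"
```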

#### Private Connectivity between Data Wrangler and Snowflake via AWS PrivateLink


This section explains how to use AWS PrivateLink to establish a private connection between Data Wrangler and Snowflake. The steps are explained in the following sections. 

##### Create a VPC


If you do not have a VPC set up, then follow the [Create a new VPC](https://docs.aws.amazon.com/directoryservice/latest/admin-guide/gsg_create_vpc.html#create_vpc) instructions to create one.

Once you have chosen the VPC you would like to use for establishing a private connection, provide the following information to your Snowflake administrator to enable AWS PrivateLink:
+ VPC ID
+ AWS Account ID
+ Your corresponding account URL you use to access Snowflake

**Important**  
As described in Snowflake's documentation, enabling your Snowflake account can take up to two business days. 

##### Set up Snowflake AWS PrivateLink Integration


After AWS PrivateLink is activated, retrieve the AWS PrivateLink configuration for your Region by running the following command in a Snowflake worksheet. Log into your Snowflake console and enter the following under **Worksheets**: `select SYSTEM$GET_PRIVATELINK_CONFIG();` 

1. Retrieve the values for the following: `privatelink-account-name`, `privatelink-vpce-id`, `privatelink-account-url`, and `privatelink_ocsp-url` from the resulting JSON object. Examples of each value are shown in the following snippet. Store these values for later use.

   ```
   privatelink-account-name: xxxxxxxx.region.privatelink
   privatelink-vpce-id: com.amazonaws.vpce.region.vpce-svc-xxxxxxxxxxxxxxxxx
   privatelink-account-url: xxxxxxxx.region.privatelink.snowflakecomputing.com
   privatelink_ocsp-url: ocsp.xxxxxxxx.region.privatelink.snowflakecomputing.com
   ```
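   The result of `SYSTEM$GET_PRIVATELINK_CONFIG()` is a JSON object, so you can also pull the individual values out programmatically. A sketch, assuming you've pasted the result into a local file (the values below are placeholders in the same shape as the snippet above):

   ```shell
   # Paste the JSON returned by SYSTEM$GET_PRIVATELINK_CONFIG() into a file.
   cat > privatelink-config.json <<'EOF'
   {
     "privatelink-account-name": "xxxxxxxx.region.privatelink",
     "privatelink-vpce-id": "com.amazonaws.vpce.region.vpce-svc-xxxxxxxxxxxxxxxxx",
     "privatelink-account-url": "xxxxxxxx.region.privatelink.snowflakecomputing.com",
     "privatelink_ocsp-url": "ocsp.xxxxxxxx.region.privatelink.snowflakecomputing.com"
   }
   EOF

   # Pull out the endpoint service name that the VPC endpoint setup needs.
   VPCE_ID=$(python3 -c "import json; print(json.load(open('privatelink-config.json'))['privatelink-vpce-id'])")
   echo "$VPCE_ID"
   ```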

1. Switch to your AWS Console and navigate to the VPC menu.

1. From the left side panel, choose the **Endpoints** link to navigate to the **VPC Endpoints** setup.

   Once there, choose **Create Endpoint**. 

1. Select the radio button for **Find service by name**, as shown in the following screenshot.   
![\[The Create Endpoint section in the console.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/snowflake-radio.png)

1. In the **Service Name** field, paste in the value for `privatelink-vpce-id` that you retrieved in the preceding step and choose **Verify**. 

   If the connection is successful, a green alert saying **Service name found** appears on your screen and the **VPC** and **Subnet** options automatically expand, as shown in the following screenshot. Depending on your targeted Region, your resulting screen may show another AWS Region name.   
![\[The Create Endpoint section in the console showing the connection is successful.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/snowflake-service-name-found.png)

1. Select the same VPC ID that you sent to Snowflake from the **VPC** dropdown list.

1. If you have not yet created a subnet, create one using the following step. 

1. Select **Subnets** from the **VPC** dropdown list. Then select **Create subnet** and follow the prompts to create a subnet in your VPC. Ensure that you select the VPC ID you sent to Snowflake. 

1. Under **Security Group Configuration**, select **Create New Security Group** to open the default **Security Group** screen in a new tab. In this new tab, select **Create Security Group**. 

1. Provide a name for the new security group (such as `datawrangler-doc-snowflake-privatelink-connection`) and a description. Be sure to select the VPC ID you have used in previous steps. 

1. Add two rules to allow traffic from within your VPC to this VPC endpoint. 

   Navigate to your VPC under **Your VPCs** in a separate tab, and retrieve the CIDR block for your VPC. Then choose **Add Rule** in the **Inbound Rules** section. Select `HTTPS` for the type, leave the **Source** as **Custom** in the form, and paste in your VPC's CIDR block (such as `10.0.0.0/16`). 

1. Choose **Create Security Group**. Retrieve the **Security Group ID** from the newly created security group (such as `sg-xxxxxxxxxxxxxxxxx`).

1. In the **VPC Endpoint** configuration screen, remove the default security group. Paste in the security group ID in the search field and select the checkbox.  
![\[The Security group section in the console.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/snowflake-security-group.png)

1. Select **Create Endpoint**. 

1. If the endpoint creation is successful, you see a page that has a link to your VPC endpoint configuration, specified by the VPC ID. Select the link to view the configuration in full.   
![\[The endpoint Details section.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/snowflake-success-endpoint.png)

   Retrieve the topmost record in the DNS names list. This can be differentiated from other DNS names because it only includes the Region name (such as `us-west-2`), and no Availability Zone letter notation (such as `us-west-2a`). Store this information for later use.
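The regional DNS name can also be picked out mechanically: it's the only entry whose Region token isn't followed by an Availability Zone letter. A sketch with hypothetical endpoint DNS names:

```shell
# Hypothetical DNS names as they might appear on the endpoint details page.
DNS_NAMES="vpce-0abc123-xyz-us-west-2a.vpce-svc-0abc123.us-west-2.vpce.amazonaws.com
vpce-0abc123-xyz-us-west-2b.vpce-svc-0abc123.us-west-2.vpce.amazonaws.com
vpce-0abc123-xyz.vpce-svc-0abc123.us-west-2.vpce.amazonaws.com"

# Keep the line whose Region code is not followed by an AZ letter.
REGIONAL=$(printf '%s\n' "$DNS_NAMES" | grep -vE 'us-west-2[a-z]\.')
echo "$REGIONAL"
```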

##### Configure DNS for Snowflake Endpoints in your VPC


This section explains how to configure DNS for Snowflake endpoints in your VPC. This allows your VPC to resolve requests to the Snowflake AWS PrivateLink endpoint. 

1. Navigate to the [Route 53 menu](https://console.aws.amazon.com/route53) within your AWS console.

1. Select the **Hosted Zones** option (if necessary, expand the left-hand menu to find this option).

1. Choose **Create Hosted Zone**.

   1. In the **Domain name** field, reference the value that was stored for `privatelink-account-url` in the preceding steps, with your Snowflake account ID removed from the DNS name; that is, use only the value starting with the Region identifier. A **Resource Record Set** is also created later for the subdomain, such as `region.privatelink.snowflakecomputing.com`.

   1. Select the radio button for **Private Hosted Zone** in the **Type** section. Your Region code may not be `us-west-2`. Reference the DNS name returned to you by Snowflake.  
![\[The Create hosted zone page in the console.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/snowflake-create-hosted-zone.png)

   1. In the **VPCs to associate with the hosted zone** section, select the Region in which your VPC is located and the VPC ID used in previous steps.  
![\[The VPCs to associate with the hosted zone section in the console.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/snowflake-vpc-hosted-zone.png)

   1. Choose **Create hosted zone**.

1. Next, create two records, one for `privatelink-account-url` and one for `privatelink_ocsp-url`.
   + In the **Hosted Zone** menu, choose **Create Record Set**.

     1. Under **Record name**, enter your Snowflake Account ID only (the first 8 characters in `privatelink-account-url`).

     1. Under **Record type**, select **CNAME**.

     1. Under **Value**, enter the DNS name for the regional VPC endpoint you retrieved in the last step of the *Set up the Snowflake AWS PrivateLink Integration* section.   
![\[The Quick create record section in the console.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/snowflake-quick-create-record.png)

     1. Choose **Create records**.

     1. Repeat the preceding steps for the OCSP record noted as `privatelink_ocsp-url`, using `ocsp` followed by the 8-character Snowflake ID for the record name (such as `ocsp.xxxxxxxx`).  
![\[The Quick create record section in the console.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/snowflake-quick-create-ocsp.png)

##### Configure Route 53 Resolver Inbound Endpoint for your VPC


This section explains how to configure a Route 53 Resolver inbound endpoint for your VPC.

1. Navigate to the [Route 53 menu](https://console.aws.amazon.com/route53) within your AWS console.
   + In the left hand panel in the **Security** section, select the **Security Groups** option.

1. Choose **Create Security Group**. 
   + Provide a name for your security group (such as `datawranger-doc-route53-resolver-sg`) and a description.
   + Select the VPC ID used in previous steps.
   + Create rules that allow for DNS over UDP and TCP from within the VPC CIDR block.   
![\[The Inbound rules section in the console.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/snowflake-inbound-rules.png)
   + Choose **Create Security Group**. Note the **Security Group ID**, because you later add a rule that allows traffic to the VPC endpoint security group.

1. Navigate to the [Route 53 menu](https://console.aws.amazon.com/route53) within your AWS console.
   + In the **Resolver** section, select the **Inbound Endpoint** option.

1. Choose **Create Inbound Endpoint**. 
   + Provide an endpoint name.
   + From the **VPC in the Region** dropdown list, select the VPC ID you have used in all previous steps. 
   + In the **Security group for this endpoint** dropdown list, select the security group ID from Step 2 in this section.   
![\[The General settings for inbound endpoint section in the console.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/snowflake-inbound-endpoint.png)
   + In the **IP Address** section, select an Availability Zone and a subnet, and leave the radio selector for **Use an IP address that is selected automatically** selected for each IP address.   
![\[The IP Address section in the console.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/snowflake-ip-address-1.png)
   + Choose **Submit**.

1. Select the **Inbound endpoint** after it has been created.

1. Once the inbound endpoint is created, note the two IP addresses for the resolvers.  
![\[The IP Addresses section in the console.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/snowflake-ip-addresses-2.png)

##### SageMaker AI VPC Endpoints


This section explains how to create VPC endpoints for the following: Amazon SageMaker Studio Classic, SageMaker Notebooks, the SageMaker API, SageMaker Runtime, and Amazon SageMaker Feature Store Runtime.

**Create a security group that is applied to all endpoints.**

1. Navigate to the [EC2 menu](https://console.aws.amazon.com/ec2) in the AWS Console.

1. In the **Network & Security** section, select the **Security groups** option.

1. Choose **Create security group**.

1. Provide a security group name and description (such as `datawrangler-doc-sagemaker-vpce-sg`). A rule is added later to allow traffic over HTTPS from SageMaker AI to this group. 

**Creating the endpoints**

1. Navigate to the [VPC menu](https://console.aws.amazon.com/vpc) in the AWS console.

1. Select the **Endpoints** option.

1. Choose **Create Endpoint**.

1. Search for the service by entering its name in the **Search** field.

1. From the **VPC** dropdown list, select the VPC in which your Snowflake AWS PrivateLink connection exists.

1. In the **Subnets** section, select the subnets which have access to the Snowflake PrivateLink connection.

1. Leave the **Enable DNS Name** checkbox selected.

1. In the **Security Groups** section, select the security group you created in the preceding section.

1. Choose **Create Endpoint**.

**Configure Studio Classic and Data Wrangler**

This section explains how to configure Studio Classic and Data Wrangler.

1. Configure the security group.

   1. Navigate to the Amazon EC2 menu in the AWS Console.

   1. Select the **Security Groups** option in the **Network & Security** section.

   1. Choose **Create Security Group**. 

   1. Provide a name and description for your security group (such as `datawrangler-doc-sagemaker-studio`). 

   1. Create the following inbound rules.
      + The HTTPS connection to the security group you provisioned for the Snowflake PrivateLink connection you created in the *Set up the Snowflake PrivateLink Integration* step.
      + The HTTP connection to the security group you provisioned for the Snowflake PrivateLink connection you created in the *Set up the Snowflake PrivateLink Integration step*.
      + The UDP and TCP for DNS (port 53) to Route 53 Resolver Inbound Endpoint security group you create in step 2 of *Configure Route 53 Resolver Inbound Endpoint for your VPC*.

   1. Choose the **Create Security Group** button in the lower right-hand corner.

1. Configure Studio Classic.
   + Navigate to the SageMaker AI menu in the AWS console.
   + From the left-hand panel, select the **SageMaker AI Studio Classic** option.
   + If you do not have any domains configured, the **Get Started** menu is present.
   + Select the **Standard Setup** option from the **Get Started** menu.
   + Under **Authentication method**, select **AWS Identity and Access Management (IAM)**.
   + From the **Permissions** menu, you can create a new role or use a pre-existing role, depending on your use case.
     + If you choose **Create a new role**, you are presented the option to provide an S3 bucket name, and a policy is generated for you.
     + If you already have a role created with permissions for the S3 buckets to which you require access, select the role from the dropdown list. This role should have the `AmazonSageMakerFullAccess` policy attached to it.
   + Select the **Network and Storage** dropdown list to configure the VPC, security, and subnets SageMaker AI uses.
     + Under **VPC**, select the VPC in which your Snowflake PrivateLink connection exists.
     + Under **Subnet(s)**, select the subnets which have access to the Snowflake PrivateLink connection.
     + Under **Network Access for Studio Classic**, select **VPC Only**.
     + Under **Security Group(s)**, select the security group you created in step 1.
   + Choose **Submit**.

1. Edit the SageMaker AI security group.
   + Create the following inbound rules:
     + Port 2049 to the inbound and outbound NFS Security Groups created automatically by SageMaker AI in step 2 (the security group names contain the Studio Classic domain ID).
     + Access to all TCP ports to itself (required for SageMaker AI for VPC Only).

1. Edit the VPC Endpoint Security Groups:
   + Navigate to the Amazon EC2 menu in the AWS console.
   + Locate the security group you created in a preceding step.
   + Add an inbound rule allowing for HTTPS traffic from the security group created in step 1.

1. Create a user profile.
   + From the **SageMaker Studio Classic Control Panel**, choose **Add User**.
   + Provide a user name. 
   + For the **Execution Role**, choose to create a new role or to use a pre-existing role.
     + If you choose **Create a new role**, you are presented the option to provide an Amazon S3 bucket name, and a policy is generated for you.
     + If you already have a role created with permissions to the Amazon S3 buckets to which you require access, select the role from the dropdown list. This role should have the `AmazonSageMakerFullAccess` policy attached to it.
   + Choose **Submit**. 

1. Create a data flow (follow the data scientist guide outlined in a preceding section). 
   + When adding a Snowflake connection, enter the value of `privatelink-account-name` (from the *Set up Snowflake PrivateLink Integration* step) into the **Snowflake account name (alphanumeric)** field, instead of the plain Snowflake account name. Everything else is left unchanged.

#### Provide information to the data scientist


Provide the data scientist with the information that they need to access Snowflake from Amazon SageMaker AI Data Wrangler.

**Important**  
Your users need to run Amazon SageMaker Studio Classic version 1.3.0 or later. For information about checking the version of Studio Classic and updating it, see [Prepare ML Data with Amazon SageMaker Data Wrangler](data-wrangler.md).

1. To allow your data scientist to access Snowflake from SageMaker Data Wrangler, provide them with one of the following:
   + For Basic Authentication, a Snowflake account name, user name, and password.
   + For OAuth, a user name and password in the identity provider.
   + For the ARN option, a secret created with [AWS Secrets Manager](https://docs.aws.amazon.com/secretsmanager/latest/userguide/intro.html) and the Amazon Resource Name (ARN) of the secret. Use the following procedure to create the secret for Snowflake if you choose this option.
**Important**  
If your data scientists use the **Snowflake Credentials (User name and Password)** option to connect to Snowflake, you can use [Secrets Manager](https://docs.aws.amazon.com/secretsmanager/latest/userguide/intro.html) to store the credentials in a secret. Secrets Manager rotates secrets as part of a best practice security plan. The secret created in Secrets Manager is only accessible with the Studio Classic role configured when you set up a Studio Classic user profile. This requires you to add this permission, `secretsmanager:PutResourcePolicy`, to the policy that is attached to your Studio Classic role.  
We strongly recommend that you scope the role policy to use different roles for different groups of Studio Classic users. You can add additional resource-based permissions for the Secrets Manager secrets. See [Manage Secret Policy](https://docs.aws.amazon.com/secretsmanager/latest/userguide/manage_secret-policy.html) for condition keys you can use.  
For information about creating a secret, see [Create a secret](https://docs.aws.amazon.com/secretsmanager/latest/userguide/create_secret.html). You're charged for the secrets that you create.

1. (Optional) Provide the data scientist with the name of the storage integration that you created using the procedure in [Create a Cloud Storage Integration in Snowflake](https://docs.snowflake.com/en/user-guide/data-load-s3-config-storage-integration.html#step-3-create-a-cloud-storage-integration-in-snowflake). This is the name of the new integration, called `integration_name` in the `CREATE INTEGRATION` SQL command that you ran, which is shown in the following snippet: 

   ```
     CREATE STORAGE INTEGRATION integration_name
     TYPE = EXTERNAL_STAGE
     STORAGE_PROVIDER = S3
     ENABLED = TRUE
     STORAGE_AWS_ROLE_ARN = 'iam_role'
     [ STORAGE_AWS_OBJECT_ACL = 'bucket-owner-full-control' ]
     STORAGE_ALLOWED_LOCATIONS = ('s3://bucket/path/', 's3://bucket/path/')
     [ STORAGE_BLOCKED_LOCATIONS = ('s3://bucket/path/', 's3://bucket/path/') ]
   ```
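If the data scientist will connect with a Secrets Manager secret, the credentials can be stored as a JSON secret string. The following is a sketch only; the key names and secret name are assumptions, so confirm the exact format that your connection tooling expects:

```python
import json

def snowflake_secret_string(username, password):
    """Serialize Snowflake credentials as the JSON secret string to store
    in AWS Secrets Manager. The key names here are assumptions."""
    return json.dumps({"username": username, "password": password})

# Creating the secret (not run here) would look like:
# boto3.client("secretsmanager").create_secret(
#     Name="snowflake-credentials",  # placeholder name
#     SecretString=snowflake_secret_string("my_user", "my_password"))
```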

### Data Scientist Guide


Use the following to connect to Snowflake and access your data in Data Wrangler.

**Important**  
Your administrator needs to use the information in the preceding sections to set up Snowflake. If you're running into issues, contact them for troubleshooting help.

You can connect to Snowflake in one of the following ways:
+ Specifying your Snowflake credentials (account name, user name, and password) in Data Wrangler. 
+ Providing an Amazon Resource Name (ARN) of a secret containing the credentials.
+ Using an open standard for access delegation (OAuth) provider that connects to Snowflake. Your administrator can give you access to one of the following OAuth providers:
  + [Azure AD](https://docs.snowflake.com/en/user-guide/oauth-azure.html)
  + [Okta](https://docs.snowflake.com/en/user-guide/oauth-okta.html)
  + [Ping Federate](https://docs.snowflake.com/en/user-guide/oauth-pingfed.html)

Talk to your administrator about the method that you need to use to connect to Snowflake.

The following sections have information about how you can connect to Snowflake using the preceding methods.

------
#### [ Specifying your Snowflake Credentials ]

**To import a dataset into Data Wrangler from Snowflake using your credentials**

1. Sign in to the [Amazon SageMaker AI Console](https://console.aws.amazon.com/sagemaker).

1. Choose **Studio**.

1. Choose **Launch app**.

1. From the dropdown list, select **Studio**.

1. Choose the Home icon.

1. Choose **Data**.

1. Choose **Data Wrangler**.

1. Choose **Import data**.

1. Under **Available**, choose **Snowflake**.

1. For **Connection name**, specify a name that uniquely identifies the connection.

1. For **Authentication method**, choose **Basic Username-Password**.

1. For **Snowflake account name (alphanumeric)**, specify the full name of the Snowflake account.

1. For **Username**, specify the username that you use to access the Snowflake account.

1. For **Password**, specify the password associated with the username.

1. (Optional) For **Advanced settings**, specify the following:
   + **Role** – A role within Snowflake. Some roles have access to different datasets. If you don't specify a role, Data Wrangler uses the default role in your Snowflake account.
   + **Storage integration** – When you specify and run a query, Data Wrangler creates a temporary copy of the query results in memory. To store a permanent copy of the query results, specify the Amazon S3 location for the storage integration. Your administrator provided you with the S3 URI.
   + **KMS key ID** – A KMS key that you've created. You can specify its ARN to encrypt the output of the Snowflake query. Otherwise, Data Wrangler uses the default encryption.

1. Choose **Connect**.

------
#### [ Providing an Amazon Resource Name (ARN) ]

**To import a dataset into Data Wrangler from Snowflake using an ARN**

1. Sign in to the [Amazon SageMaker AI Console](https://console.aws.amazon.com/sagemaker).

1. Choose **Studio**.

1. Choose **Launch app**.

1. From the dropdown list, select **Studio**.

1. Choose the Home icon.

1. Choose **Data**.

1. Choose **Data Wrangler**.

1. Choose **Import data**.

1. Under **Available**, choose **Snowflake**.

1. For **Connection name**, specify a name that uniquely identifies the connection.

1. For **Authentication method**, choose **ARN**.

1. For **Secrets Manager ARN**, specify the ARN of the AWS Secrets Manager secret that stores the credentials for connecting to Snowflake.

1. (Optional) For **Advanced settings**, specify the following:
   + **Role** – A role within Snowflake. Some roles have access to different datasets. If you don't specify a role, Data Wrangler uses the default role in your Snowflake account.
   + **Storage integration** – When you specify and run a query, Data Wrangler creates a temporary copy of the query results in memory. To store a permanent copy of the query results, specify the Amazon S3 location for the storage integration. Your administrator provided you with the S3 URI.
   + **KMS key ID** – A KMS key that you've created. You can specify its ARN to encrypt the output of the Snowflake query. Otherwise, Data Wrangler uses the default encryption.

1. Choose **Connect**.

------
#### [ Using an OAuth Connection ]

**Important**  
Your administrator customized your Studio Classic environment to provide the functionality for using an OAuth connection. You might need to restart the Jupyter server application to use the functionality.  
Use the following procedure to update the Jupyter server application.  
Within Studio Classic, choose **File**.  
Choose **Shut down**.  
Choose **Shut down server**.  
Close the tab or window that you're using to access Studio Classic.  
From the Amazon SageMaker AI console, open Studio Classic.

**To import a dataset into Data Wrangler from Snowflake using OAuth**

1. Sign in to the [Amazon SageMaker AI Console](https://console.aws.amazon.com/sagemaker).

1. Choose **Studio**.

1. Choose **Launch app**.

1. From the dropdown list, select **Studio**.

1. Choose the Home icon.

1. Choose **Data**.

1. Choose **Data Wrangler**.

1. Choose **Import data**.

1. Under **Available**, choose **Snowflake**.

1. For **Connection name**, specify a name that uniquely identifies the connection.

1. For **Authentication method**, choose **OAuth**.

1. (Optional) For **Advanced settings**, specify the following:
   + **Role** – A role within Snowflake. Some roles have access to different datasets. If you don't specify a role, Data Wrangler uses the default role in your Snowflake account.
   + **Storage integration** – When you specify and run a query, Data Wrangler creates a temporary copy of the query results in memory. To store a permanent copy of the query results, specify the Amazon S3 location for the storage integration. Your administrator provided you with the S3 URI.
   + **KMS key ID** – A KMS key that you've created. You can specify its ARN to encrypt the output of the Snowflake query. Otherwise, Data Wrangler uses the default encryption.

1. Choose **Connect**.

------

You can begin the process of importing your data from Snowflake after you've connected to it.

Within Data Wrangler, you can view your data warehouses, databases, and schemas, along with an eye icon that you can use to preview a table. After you select the **Preview Table** icon, the schema preview of that table is generated. You must select a warehouse before you can preview a table.

**Important**  
If you're importing a dataset with columns of type `TIMESTAMP_TZ` or `TIMESTAMP_LTZ`, cast those columns to strings by appending `::string` to the column names in your query. For more information, see [How To: Unload TIMESTAMP_TZ and TIMESTAMP_LTZ data to a Parquet file](https://community.snowflake.com/s/article/How-To-Unload-Timestamp-data-in-a-Parquet-file).
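As a sketch, a query applying the cast could be built like the following; the table and column names are hypothetical:

```python
def cast_timestamp_columns(columns, table):
    """Build a SELECT that casts TIMESTAMP_TZ / TIMESTAMP_LTZ columns to
    strings by appending ::string, per the note above. The table and
    column names passed in are hypothetical."""
    select_list = ", ".join(f"{col}::string AS {col}" for col in columns)
    return f"SELECT {select_list} FROM {table}"

# cast_timestamp_columns(["event_time"], "events")
# -> 'SELECT event_time::string AS event_time FROM events'
```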

After you select a data warehouse, database, and schema, you can write queries and run them. The output of your query appears under **Query results**.

After you have settled on the output of your query, you can import it into a Data Wrangler flow to perform data transformations. 

After you've imported your data, navigate to your Data Wrangler flow and start adding transformations to it. For a list of available transforms, see [Transform Data](data-wrangler-transform.md).

## Import Data From Software as a Service (SaaS) Platforms
Import Data from SaaS Platforms

You can use Data Wrangler to import data from more than forty software as a service (SaaS) platforms. To import your data from your SaaS platform, you or your administrator must use Amazon AppFlow to transfer the data from the platform to Amazon S3 or Amazon Redshift. For more information about Amazon AppFlow, see [What is Amazon AppFlow?](https://docs.aws.amazon.com/appflow/latest/userguide/what-is-appflow.html) If you don't need to use Amazon Redshift, we recommend transferring the data to Amazon S3 for a simpler process.

Data Wrangler supports transferring data from the following SaaS platforms:
+ [Amplitude](https://docs.aws.amazon.com/appflow/latest/userguide/amplitude.html)
+ [Asana](https://docs.aws.amazon.com/appflow/latest/userguide/connectors-asana.html)
+ [Braintree](https://docs.aws.amazon.com/appflow/latest/userguide/connectors-braintree.html)
+ [CircleCI](https://docs.aws.amazon.com/appflow/latest/userguide/connectors-circleci.html)
+ [DocuSign Monitor](https://docs.aws.amazon.com/appflow/latest/userguide/connectors-docusign-monitor.html)
+ [Delighted](https://docs.aws.amazon.com/appflow/latest/userguide/connectors-delighted.html)
+ [Domo](https://docs.aws.amazon.com/appflow/latest/userguide/connectors-domo.html)
+ [Datadog](https://docs.aws.amazon.com/appflow/latest/userguide/datadog.html)
+ [Dynatrace](https://docs.aws.amazon.com/appflow/latest/userguide/dynatrace.html)
+ [Facebook Ads](https://docs.aws.amazon.com/appflow/latest/userguide/connectors-facebook-ads.html)
+ [Facebook Page Insights](https://docs.aws.amazon.com/appflow/latest/userguide/connectors-facebook-page-insights.html)
+ [Google Ads](https://docs.aws.amazon.com/appflow/latest/userguide/connectors-google-ads.html)
+ [Google Analytics 4](https://docs.aws.amazon.com/appflow/latest/userguide/connectors-google-analytics-4.html)
+ [Google Calendar](https://docs.aws.amazon.com/appflow/latest/userguide/connectors-google-calendar.html)
+ [Google Search Console](https://docs.aws.amazon.com/appflow/latest/userguide/connectors-google-search-console.html)
+ [GitHub](https://docs.aws.amazon.com/appflow/latest/userguide/connectors-github.html)
+ [GitLab](https://docs.aws.amazon.com/appflow/latest/userguide/connectors-gitlab.html)
+ [Infor Nexus](https://docs.aws.amazon.com/appflow/latest/userguide/infor-nexus.html)
+ [Instagram Ads](https://docs.aws.amazon.com/appflow/latest/userguide/connectors-instagram-ads.html)
+ [Intercom](https://docs.aws.amazon.com/appflow/latest/userguide/connectors-intercom.html)
+ [JDBC (Sync)](https://docs.aws.amazon.com/appflow/latest/userguide/connectors-jdbc.html)
+ [Jira Cloud](https://docs.aws.amazon.com/appflow/latest/userguide/connectors-jira-cloud.html)
+ [LinkedIn Ads](https://docs.aws.amazon.com/appflow/latest/userguide/connectors-linkedin-ads.html)
+ [Mailchimp](https://docs.aws.amazon.com/appflow/latest/userguide/connectors-mailchimp.html)
+ [Marketo](https://docs.aws.amazon.com/appflow/latest/userguide/marketo.html)
+ [Microsoft Dynamics 365](https://docs.aws.amazon.com/appflow/latest/userguide/connectors-microsoft-dynamics-365.html)
+ [Microsoft Teams](https://docs.aws.amazon.com/appflow/latest/userguide/connectors-microsoft-teams.html)
+ [Mixpanel](https://docs.aws.amazon.com/appflow/latest/userguide/connectors-mixpanel.html)
+ [Okta](https://docs.aws.amazon.com/appflow/latest/userguide/connectors-okta.html)
+ [Oracle HCM](https://docs.aws.amazon.com/appflow/latest/userguide/connectors-oracle-hcm.html)
+ [Paypal Checkout](https://docs.aws.amazon.com/appflow/latest/userguide/connectors-paypal.html)
+ [Pendo](https://docs.aws.amazon.com/appflow/latest/userguide/connectors-pendo.html)
+ [Salesforce](https://docs.aws.amazon.com/appflow/latest/userguide/salesforce.html)
+ [Salesforce Marketing Cloud](https://docs.aws.amazon.com/appflow/latest/userguide/connectors-salesforce-marketing-cloud.html)
+ [Salesforce Pardot](https://docs.aws.amazon.com/appflow/latest/userguide/pardot.html)
+ [SAP OData](https://docs.aws.amazon.com/appflow/latest/userguide/sapodata.html)
+ [SendGrid](https://docs.aws.amazon.com/appflow/latest/userguide/connectors-sendgrid.html)
+ [ServiceNow](https://docs.aws.amazon.com/appflow/latest/userguide/servicenow.html)
+ [Singular](https://docs.aws.amazon.com/appflow/latest/userguide/singular.html)
+ [Slack](https://docs.aws.amazon.com/appflow/latest/userguide/slack.html)
+ [Smartsheet](https://docs.aws.amazon.com/appflow/latest/userguide/connectors-smartsheet.html)
+ [Snapchat Ads](https://docs.aws.amazon.com/appflow/latest/userguide/connectors-snapchat-ads.html)
+ [Stripe](https://docs.aws.amazon.com/appflow/latest/userguide/connectors-stripe.html)
+ [Trend Micro](https://docs.aws.amazon.com/appflow/latest/userguide/trend-micro.html)
+ [Typeform](https://docs.aws.amazon.com/appflow/latest/userguide/connectors-typeform.html)
+ [Veeva](https://docs.aws.amazon.com/appflow/latest/userguide/veeva.html)
+ [WooCommerce](https://docs.aws.amazon.com/appflow/latest/userguide/connectors-woocommerce.html)
+ [Zendesk](https://docs.aws.amazon.com/appflow/latest/userguide/slack.html)
+ [Zendesk Chat](https://docs.aws.amazon.com/appflow/latest/userguide/connectors-zendesk-chat.html)
+ [Zendesk Sell](https://docs.aws.amazon.com/appflow/latest/userguide/connectors-zendesk-sell.html)
+ [Zendesk Sunshine](https://docs.aws.amazon.com/appflow/latest/userguide/connectors-zendesk-sunshine.html)
+ [Zoho CRM](https://docs.aws.amazon.com/appflow/latest/userguide/connectors-zoho-crm.html)
+ [Zoom Meetings](https://docs.aws.amazon.com/appflow/latest/userguide/connectors-zoom-meetings.html)

The preceding list has links to more information about setting up your data source. You or your administrator can refer to the preceding links after you've read the following information.

When you navigate to the **Import** tab of your Data Wrangler flow, you see data sources under the following sections:
+ **Available**
+ **Set up data sources**

You can connect to data sources under **Available** without needing additional configuration. You can choose the data source and import your data.

Data sources under **Set up data sources** require you or your administrator to use Amazon AppFlow to transfer the data from the SaaS platform to Amazon S3 or Amazon Redshift. For information about performing a transfer, see [Using Amazon AppFlow to transfer your data](#data-wrangler-import-saas-transfer).

After you perform the data transfer, the SaaS platform appears as a data source under **Available**. You can choose it and import the data that you've transferred into Data Wrangler. The data that you've transferred appears as tables that you can query.

### Using Amazon AppFlow to transfer your data


Amazon AppFlow is a platform that you can use to transfer data from your SaaS platform to Amazon S3 or Amazon Redshift without having to write any code. To perform a data transfer, you use the AWS Management Console.

**Important**  
You must make sure you've set up the permissions to perform a data transfer. For more information, see [Amazon AppFlow Permissions](data-wrangler-security.md#data-wrangler-appflow-permissions).

After you've added permissions, you can transfer the data. Within Amazon AppFlow, you create a *flow* to transfer the data. A flow is a series of configurations. You can use it to specify whether you're running the data transfer on a schedule or whether you're partitioning the data into separate files. After you've configured the flow, you run it to transfer the data.

For information about creating a flow, see [Creating flows in Amazon AppFlow](https://docs.aws.amazon.com/appflow/latest/userguide/create-flow.html). For information about running a flow, see [Activate an Amazon AppFlow flow](https://docs.aws.amazon.com/appflow/latest/userguide/run-flow.html).

After the data has been transferred, use the following procedure to access the data in Data Wrangler.
**Important**  
Before you try to access your data, make sure your IAM role has the following policy:  


```
{
    "Version":"2012-10-17",		 	 	 
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "glue:SearchTables",
            "Resource": [
                "arn:aws:glue:*:*:table/*/*",
                "arn:aws:glue:*:*:database/*",
                "arn:aws:glue:*:*:catalog"
            ]
        }
    ]
}
```
By default, the IAM role that you use to access Data Wrangler is the `SageMakerExecutionRole`. For more information about adding policies, see [Adding IAM identity permissions (console)](https://docs.aws.amazon.com/IAM/latest/UserGuide/access_policies_manage-attach-detach.html#add-policies-console).
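The same policy can also be built and attached programmatically. The following sketch uses placeholder role and policy names, and the attaching call is shown as a comment:

```python
import json

# The glue:SearchTables policy from the preceding note, as a Python dict.
glue_search_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "glue:SearchTables",
            "Resource": [
                "arn:aws:glue:*:*:table/*/*",
                "arn:aws:glue:*:*:database/*",
                "arn:aws:glue:*:*:catalog",
            ],
        }
    ],
}

# Attaching it as an inline policy (not run here) would look like:
# boto3.client("iam").put_role_policy(
#     RoleName="SageMakerExecutionRole",  # the role you use with Data Wrangler
#     PolicyName="GlueSearchTables",      # placeholder policy name
#     PolicyDocument=json.dumps(glue_search_policy))
```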

To connect to a data source, do the following.

1. Sign in to the [Amazon SageMaker AI Console](https://console.aws.amazon.com/sagemaker).

1. Choose **Studio**.

1. Choose **Launch app**.

1. From the dropdown list, select **Studio**.

1. Choose the Home icon.

1. Choose **Data**.

1. Choose **Data Wrangler**.

1. Choose **Import data**.

1. Under **Available**, choose the data source.

1. For the **Name** field, specify the name of the connection.

1. (Optional) Choose **Advanced configuration**.

   1. Choose a **Workgroup**.

   1. If your workgroup hasn't enforced the Amazon S3 output location or if you don't use a workgroup, specify a value for **Amazon S3 location of query results**.

   1. (Optional) For **Data retention period**, select the checkbox to set a data retention period and specify the number of days to store the data before it's deleted.

   1. (Optional) By default, Data Wrangler saves the connection. You can choose to deselect the checkbox and not save the connection.

1. Choose **Connect**.

1. Specify a query.
**Note**  
To help you specify a query, you can choose a table on the left-hand navigation panel. Data Wrangler shows the table name and a preview of the table. Choose the icon next to the table name to copy the name. You can use the table name in the query.

1. Choose **Run**.

1. Choose **Import query**.

1. For **Dataset name**, specify the name of the dataset.

1. Choose **Add**.

When you navigate to the **Import data** screen, you can see the connection that you've created. You can use the connection to import more data.

## Imported Data Storage


**Important**  
We strongly recommend that you protect your Amazon S3 bucket by following [Security best practices](https://docs.aws.amazon.com/AmazonS3/latest/userguide/security-best-practices.html).

When you query data from Amazon Athena or Amazon Redshift, the queried dataset is automatically stored in Amazon S3. Data is stored in the default SageMaker AI S3 bucket for the AWS Region in which you are using Studio Classic.

Default S3 buckets have the following naming convention: `sagemaker-region-account number`. For example, if your account number is 111122223333 and you are using Studio Classic in `us-east-1`, your imported datasets are stored in `sagemaker-us-east-1-111122223333`. 
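As a sketch, the default bucket name can be derived from the Region and account number:

```python
def default_sagemaker_bucket(region, account_id):
    """Default SageMaker AI S3 bucket name, following the
    sagemaker-region-account-number convention described above."""
    return f"sagemaker-{region}-{account_id}"

# default_sagemaker_bucket("us-east-1", "111122223333")
# -> 'sagemaker-us-east-1-111122223333'
```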

 Data Wrangler flows depend on this Amazon S3 dataset location, so you should not modify this dataset in Amazon S3 while you are using a dependent flow. If you do modify this S3 location, and you want to continue using your data flow, you must remove all objects in `trained_parameters` in your .flow file. To do this, download the .flow file from Studio Classic and for each instance of `trained_parameters`, delete all entries. When you are done, `trained_parameters` should be an empty JSON object:

```
"trained_parameters": {}
```
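Clearing every `trained_parameters` entry by hand is error-prone; a small script like the following could automate it (the file path is a placeholder):

```python
import json

def clear_trained_parameters(node):
    """Recursively reset every trained_parameters object in a parsed
    .flow document to an empty dict, leaving everything else intact."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "trained_parameters":
                node[key] = {}
            else:
                clear_trained_parameters(value)
    elif isinstance(node, list):
        for item in node:
            clear_trained_parameters(item)
    return node

# Usage (path is a placeholder):
# with open("my-flow.flow") as f:
#     doc = clear_trained_parameters(json.load(f))
# with open("my-flow.flow", "w") as f:
#     json.dump(doc, f, indent=2)
```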

When you export and use your data flow to process your data, the .flow file you export refers to this dataset in Amazon S3. Use the following sections to learn more. 

### Amazon Redshift Import Storage


Data Wrangler stores the datasets that result from your query in a Parquet file in your default SageMaker AI S3 bucket. 

This file is stored under the following prefix (directory): redshift/*uuid*/data/, where *uuid* is a unique identifier that gets created for each query. 

For example, if your default bucket is `sagemaker-us-east-1-111122223333`, a single dataset queried from Amazon Redshift is located in s3://sagemaker-us-east-1-111122223333/redshift/*uuid*/data/.
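The prefix convention can be sketched as a helper, together with the S3 listing call (shown as a comment) that would locate a query's output; the bucket name and uuid are placeholders:

```python
def redshift_output_prefix(query_uuid):
    """S3 prefix under which Data Wrangler stores one Redshift query
    result, per the convention described above."""
    return f"redshift/{query_uuid}/data/"

# Listing the Parquet output (not run here) would look like:
# boto3.client("s3").list_objects_v2(
#     Bucket="sagemaker-us-east-1-111122223333",
#     Prefix=redshift_output_prefix("your-query-uuid"))
```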

### Amazon Athena Import Storage


When you query an Athena database and import a dataset, Data Wrangler stores the dataset, as well as a subset of that dataset, or *preview files*, in Amazon S3. 

The dataset you import by selecting **Import dataset** is stored in Parquet format in Amazon S3. 

Preview files are written in CSV format when you select **Run** on the Athena import screen, and contain up to 100 rows from your queried dataset. 

The dataset you query is located under the prefix (directory): athena/*uuid*/data/, where *uuid* is a unique identifier that gets created for each query.

For example, if your default bucket is `sagemaker-us-east-1-111122223333`, a single dataset queried from Athena is located in s3://sagemaker-us-east-1-111122223333/athena/*uuid*/data/*example_dataset.parquet*.

The subset of the dataset that is stored to preview dataframes in Data Wrangler is stored under the prefix: athena/.

# Create and Use a Data Wrangler Flow


Use an Amazon SageMaker Data Wrangler flow, or a *data flow*, to create and modify a data preparation pipeline. The data flow connects the datasets, transformations, and analyses, or *steps*, you create and can be used to define your pipeline. 

## Instances


When you create a Data Wrangler flow in Amazon SageMaker Studio Classic, Data Wrangler uses an Amazon EC2 instance to run the analyses and transformations in your flow. By default, Data Wrangler uses the ml.m5.4xlarge instance. m5 instances are general purpose instances that provide a balance between compute and memory. You can use m5 instances for a variety of compute workloads.

Data Wrangler also gives you the option of using r5 instances. r5 instances are designed to deliver fast performance that processes large datasets in memory.

We recommend that you choose an instance that is best optimized around your workloads. For example, the r5.8xlarge might have a higher price than the m5.4xlarge, but the r5.8xlarge might be better optimized for your workloads. With better optimized instances, you can run your data flows in less time at lower cost.

The following table shows the instances that you can use to run your Data Wrangler flow.


| Standard Instances | vCPU | Memory | 
| --- | --- | --- | 
| ml.m5.4xlarge | 16 | 64 GiB | 
| ml.m5.8xlarge | 32 | 128 GiB | 
| ml.m5.16xlarge | 64 |  256 GiB  | 
| ml.m5.24xlarge | 96 | 384 GiB | 
| ml.r5.4xlarge | 16 | 128 GiB | 
| ml.r5.8xlarge | 32 | 256 GiB | 
| ml.r5.24xlarge | 96 | 768 GiB | 

For more information about r5 instances, see [Amazon EC2 R5 Instances](https://aws.amazon.com/ec2/instance-types/r5/). For more information about m5 instances, see [Amazon EC2 M5 Instances](https://aws.amazon.com/ec2/instance-types/m5/).

Each Data Wrangler flow has an Amazon EC2 instance associated with it. You might have multiple flows that are associated with a single instance.

For each flow file, you can seamlessly switch the instance type. If you switch the instance type, the instance that you used to run the flow continues to run.

To switch the instance type of your flow, do the following.

1. Choose the **Running Terminals and Kernels** icon (![\[Black square icon representing a placeholder or empty image.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/icons/running-terminals-kernels.png)).

1. Navigate to the instance that you're using and choose it.

1. Choose the instance type that you want to use.  
![\[Example showing how to choose an instance in the data flow page of the Data Wrangler console.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/data-wrangler-instance-switching-list-instances.png)

1. Choose **Save**.

You are charged for all running instances. To avoid incurring additional charges, manually shut down the instances that you aren't using.

To shut down a running instance, use the following procedure.

1. Choose the instance icon. The following image shows you where to select the **RUNNING INSTANCES** icon.  
![\[\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/instance-switching-running-instances.png)

1. Choose **Shut down** next to the instance that you want to shut down.

If you shut down an instance used to run a flow, you temporarily can't access the flow. If you get an error while attempting to open a flow whose instance you previously shut down, wait about 5 minutes and try opening it again.

When you export your data flow to a location such as Amazon Simple Storage Service or Amazon SageMaker Feature Store, Data Wrangler runs an Amazon SageMaker processing job. You can use one of the following instances for the processing job. For more information on exporting your data, see [Export](data-wrangler-data-export.md).


| Standard Instances | vCPU | Memory | 
| --- | --- | --- | 
| ml.m5.4xlarge | 16 | 64 GiB | 
| ml.m5.12xlarge | 48 |  192 GiB  | 
| ml.m5.24xlarge | 96 | 384 GiB | 

For more information about the cost per hour for using the available instance types, see [SageMaker Pricing](https://aws.amazon.com//sagemaker/pricing/). 

## The Data Flow UI


When you import a dataset, the original dataset appears on the data flow and is named **Source**. If you turned on sampling when you imported your data, this dataset is named **Source - sampled**. Data Wrangler automatically infers the types of each column in your dataset and creates a new dataframe named **Data types**. You can select this frame to update the inferred data types. You see results similar to those shown in the following image after you upload a single dataset: 

![\[Example showing Source - sampled and Data types in the Data Wrangler console.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/dataflow-after-import.png)


Each time you add a transform step, you create a new dataframe. When multiple transform steps (other than **Join** or **Concatenate**) are added to the same dataset, they are stacked. 

**Join** and **Concatenate** create standalone steps that contain the new joined or concatenated dataset. 

The following diagram shows a data flow with a join between two datasets, as well as two stacks of steps. The first stack (**Steps (2)**) adds two transforms to the type inferred in the **Data types** dataset. The *downstream* stack, or the stack to the right, adds transforms to the dataset resulting from a join named **demo-join**. 

![\[Example showing steps in the data flow page of the Data Wrangler console.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/data-flow-steps.png)


The small, gray box in the bottom right corner of the data flow provides an overview of the number of stacks and steps in the flow and the layout of the flow. The lighter box inside the gray box indicates the steps that are within the UI view. You can use this box to see sections of your data flow that fall outside of the UI view. Use the fit screen icon (![\[Dotted square outline icon representing a placeholder or empty state.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/updates/fit-screen.png)) to fit all steps and datasets into your UI view. 

The bottom left navigation bar includes icons that you can use to zoom in (![\[Plus symbol icon representing an addition or new item action.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/updates/zoom-in.png)) and zoom out (![\[Horizontal line or divider, typically used to separate content sections.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/updates/zoom-out.png)) of your data flow and resize the data flow to fit the screen (![\[Dotted square outline icon representing a placeholder or empty state.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/updates/fit-screen.png)). Use the lock icon (![\[Trash can icon representing deletion or removal functionality.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/updates/lock-nodes.png)) to lock and unlock the location of each step on the screen. 



## Add a Step to Your Data Flow


Select **+** next to any dataset or previously added step and then select one of the following options:
+ **Edit data types** (For a **Data types** step only): If you have not added any transforms to a **Data types** step, you can select **Edit data types** to update the data types Data Wrangler inferred when importing your dataset. 
+ **Add transform**: Adds a new transform step. See [Transform Data](data-wrangler-transform.md) to learn more about the data transformations you can add. 
+ **Add analysis**: Adds an analysis. You can use this option to analyze your data at any point in the data flow. When you add one or more analyses to a step, an analysis icon (![\[Bar chart icon representing data visualization or analytics functionality.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/updates/analysis-icon.png)) appears on that step. See [Analyze and Visualize](data-wrangler-analyses.md) to learn more about the analyses you can add. 
+ **Join**: Joins two datasets and adds the resulting dataset to the data flow. To learn more, see [Join Datasets](data-wrangler-transform.md#data-wrangler-transform-join).
+ **Concatenate**: Concatenates two datasets and adds the resulting dataset to the data flow. To learn more, see [Concatenate Datasets](data-wrangler-transform.md#data-wrangler-transform-concatenate).

## Delete a Step from Your Data Flow


To delete a step, select the step and then select **Delete**. For a node that has a single input, you delete only the step that you select; the steps that follow it aren't deleted. Deleting a step for a source, join, or concatenate node also deletes all the steps that follow it.

To delete a step from a stack of steps, select the stack and then select the step you want to delete. 

You can use one of the following procedures to delete a step without deleting the downstream steps.

------
#### [ Delete a step in the Data Wrangler flow ]

You can delete an individual step for nodes in your data flow that have a single input. You can't delete individual steps for source, join, and concatenate nodes.

Use the following procedure to delete a step in the Data Wrangler flow.

1. Choose the group of steps that has the step that you're deleting.

1. Choose the icon next to the step.

1. Choose **Delete step**.  
![\[Example showing how to delete a step in the data flow page of the Data Wrangler console.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/delete-step-flow-1.png)

------
#### [ Delete a step in the table view ]

Use the following procedure to delete a step in the table view.

You can delete an individual step for nodes in your data flow that have a single input. You can't delete individual steps for source, join, and concatenate nodes.

1. Choose the step and open the table view for the step.

1. Move your cursor over the step so the ellipsis icon appears.

1. Choose the icon next to the step.

1. Choose **Delete**.  
![\[Example showing how to delete a step in the table view of the Data Wrangler console.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/delete-step-table-0.png)

------

## Edit a Step in Your Data Wrangler Flow


You can edit each step that you've added to your Data Wrangler flow. Editing a step lets you change the transformations or the data types of the columns, which can help you perform better analyses.

There are many ways that you can edit a step. Some examples include changing the imputation method or changing the threshold for considering a value to be an outlier.

To edit a step, do the following.

1. Choose a step in the Data Wrangler flow to open the table view.  
![\[Example step in the data flow page of the Data Wrangler console.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/data-flow-edit-choose-step.png)

1. In the table view, choose the step that you want to edit.

1. Edit the step.

The following image shows an example of editing a step.

![\[Example showing how to edit steps in the data flow page of the Data Wrangler console.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/data-flow-table-edit-step.png)


**Note**  
You can use the shared spaces within your Amazon SageMaker AI domain to work collaboratively on your Data Wrangler flows. Within a shared space, you and your collaborators can edit the same flow file. However, neither you nor your collaborators can see each other's changes in real time. When anyone makes a change to the Data Wrangler flow, they must save it immediately. When someone saves the file, a collaborator can't see the changes until they close and reopen the file. Any changes that aren't saved by one person are overwritten by the person who saves their changes.

# Get Insights On Data and Data Quality


Use the **Data Quality and Insights Report** to perform an analysis of the data that you've imported into Data Wrangler. We recommend that you create the report after you import your dataset. You can use the report to help you clean and process your data. It gives you information such as the number of missing values and the number of outliers. If you have issues with your data, such as target leakage or imbalance, the insights report can bring those issues to your attention.

Use the following procedure to create a Data Quality and Insights report. It assumes that you've already imported a dataset into your Data Wrangler flow.

**To create a Data Quality and Insights report**

1. Choose the **+** next to a node in your Data Wrangler flow.

1. Select **Get data insights**.

1. For **Analysis name**, specify a name for the insights report.

1. (Optional) For **Target column**, specify the target column.

1. For **Problem type**, specify **Regression** or **Classification**.

1. For **Data size**, specify one of the following:
   + **50 K** – Uses the first 50,000 rows of the dataset that you've imported to create the report.
   + **Entire dataset** – Uses the entire dataset that you've imported to create the report.
**Note**  
Creating a Data Quality and Insights report on the entire dataset uses an Amazon SageMaker processing job. A SageMaker Processing job provisions the additional compute resources required to get insights for all of your data. For more information about SageMaker Processing jobs, see [Data transformation workloads with SageMaker Processing](processing-job.md).

1. Choose **Create**.

The following topics show the sections of the report:

**Topics**
+ [Summary](#data-wrangler-data-insights-summary)
+ [Target column](#data-wrangler-data-insights-target-column)
+ [Quick model](#data-wrangler-data-insights-quick-model)
+ [Feature summary](#data-wrangler-data-insights-feature-summary)
+ [Samples](#data-wrangler-data-insights-samples)
+ [Definitions](#data-wrangler-data-insights-definitions)

You can either download the report or view it online. To download the report, choose the download button at the top right corner of the screen. The following image shows the button.

![\[Example showing the download button.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/data-insights/data-insights-download.png)


## Summary


The insights report has a brief summary of the data that includes general information such as missing values, invalid values, feature types, outlier counts, and more. It can also include high severity warnings that point to probable issues with the data. We recommend that you investigate the warnings.

The following is an example of a report summary.

![\[Example report summary.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/data-insights/data-insights-report-summary.png)


## Target column


When you create the data quality and insights report, Data Wrangler gives you the option to select a target column. A target column is a column that you're trying to predict. When you choose a target column, Data Wrangler automatically creates a target column analysis. It also ranks the features in the order of their predictive power. When you select a target column, you must specify whether you’re trying to solve a regression or a classification problem.

For classification, Data Wrangler shows a table and a histogram of the most common classes. A class is a category. It also presents observations, or rows, with a missing or invalid target value.

The following image shows an example target column analysis for a classification problem.

![\[Example target column analysis.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/data-insights/data-insights-target-column-classification.png)


For regression, Data Wrangler shows a histogram of all the values in the target column. It also presents observations, or rows, with a missing, invalid, or outlier target value.

The following image shows an example target column analysis for a regression problem.

![\[Example target column analysis.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/data-insights/data-insights-target-column-regression.png)


## Quick model


The **Quick model** provides an estimate of the expected prediction quality of a model that you train on your data.

Data Wrangler splits your data into training and validation folds. It uses 80% of the samples for training and 20% for validation. For classification, the sample is split with stratification, so that each fold has the same ratio of labels. For classification problems, it's important to have the same ratio of labels between the training and validation folds. Data Wrangler trains an XGBoost model with the default hyperparameters. It applies early stopping on the validation data and performs minimal feature preprocessing.
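Data Wrangler performs the stratified split internally; the following pure-Python sketch (all names, such as `stratified_split`, are illustrative and not part of any Data Wrangler API) shows how an 80/20 split can preserve label ratios:

```python
import random
from collections import defaultdict

def stratified_split(labels, train_frac=0.8, seed=0):
    """Split row indices so each fold keeps roughly the same label ratio."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for i, y in enumerate(labels):
        by_label[y].append(i)
    train, validation = [], []
    for idxs in by_label.values():
        rng.shuffle(idxs)
        cut = int(len(idxs) * train_frac)
        train.extend(idxs[:cut])
        validation.extend(idxs[cut:])
    return sorted(train), sorted(validation)

# 20% "fraud" labels: both folds keep that 20% ratio.
train, val = stratified_split(["fraud"] * 20 + ["ok"] * 80)
```

Because each label is split separately, a 20/80 class balance in the full dataset yields the same balance in both folds.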

For classification models, Data Wrangler returns both a model summary and a confusion matrix.

The following is an example of a classification model summary. To learn more about the information that it returns, see [Definitions](#data-wrangler-data-insights-definitions).

![\[Example classification model summary.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/data-insights/data-insights-quick-model-classification-summary.png)


The following is an example of a confusion matrix that the quick model returns.

![\[Example confusion matrix.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/data-insights/data-insights-quick-model-classification-confusion-matrix.png)


A confusion matrix gives you the following information:
+ The number of times the predicted label matches the true label.
+ The number of times the predicted label doesn't match the true label.

The true label represents an actual observation in your data. For example, if you're using a model to detect fraudulent transactions, the true label represents a transaction that is actually fraudulent or non-fraudulent. The predicted label represents the label that your model assigns to the data.

You can use the confusion matrix to see how well the model predicts the presence or the absence of a condition. If you're predicting fraudulent transactions, you can use the confusion matrix to get a sense of both the sensitivity and the specificity of the model. The sensitivity refers to the model's ability to detect fraudulent transactions. The specificity refers to the model's ability to avoid detecting non-fraudulent transactions as fraudulent.
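To make the sensitivity and specificity reading concrete, here is a small sketch (the function name and the counts are illustrative) that derives both metrics from confusion-matrix counts:

```python
def sensitivity_specificity(tp, fp, tn, fn):
    """Sensitivity: share of actual positives caught.
    Specificity: share of actual negatives correctly left alone."""
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical fraud detector: catches 90 of 100 fraudulent transactions
# and correctly clears 950 of 1000 legitimate ones.
sens, spec = sensitivity_specificity(tp=90, fp=50, tn=950, fn=10)
```

A high sensitivity with a low specificity would mean the model catches most fraud but also flags many legitimate transactions.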

The following is an example of the quick model outputs for a regression problem.

![\[Example of the quick model outputs for a regression problem.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/data-insights/data-insights-quick-model-regression-summary.png)


## Feature summary


When you specify a target column, Data Wrangler orders the features by their prediction power. Prediction power is measured on the data after it was split into 80% training and 20% validation folds. Data Wrangler fits a model for each feature separately on the training fold. It applies minimal feature preprocessing and measures prediction performance on the validation data.

It normalizes the scores to the range [0,1]. Higher prediction scores indicate columns that are more useful for predicting the target on their own. Lower scores point to columns that aren’t predictive of the target column.

It’s uncommon for a column that isn’t predictive on its own to be predictive when it’s used in tandem with other columns. You can confidently use the prediction scores to determine whether a feature in your dataset is predictive.

A low score usually indicates that the feature isn't predictive on its own. A score of 1 implies perfect predictive ability, which often indicates target leakage. Target leakage usually happens when the dataset contains a column that isn't available at prediction time. For example, it could be a duplicate of the target column.
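Data Wrangler doesn't document its exact normalization; a plain min-max rescaling into [0, 1] is one way to picture it (an illustrative sketch, not the product's implementation):

```python
def normalize_scores(scores):
    """Min-max rescale raw per-feature scores into the [0, 1] range."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

# The weakest feature maps to 0.0 and the strongest to 1.0.
normalized = normalize_scores([0.2, 0.5, 0.9, 1.4])
```

Under this sketch, a normalized score at or near 1.0 is the kind of result you would inspect for target leakage.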

The following are examples of the table and the histogram that show the prediction value of each feature.

![\[Example summary table showing the prediction value of each feature.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/data-insights/data-insights-feature-summary-table.png)


![\[Example histogram showing the prediction value of each feature.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/data-insights/data-insights-feature-summary-histogram.png)


## Samples


Data Wrangler provides information about whether your samples are anomalous or if there are duplicates in your dataset.

Data Wrangler detects anomalous samples using the *isolation forest* algorithm. The isolation forest associates an anomaly score with each sample (row) of the dataset. Low anomaly scores indicate anomalous samples, and high scores are associated with non-anomalous samples. Samples with a negative anomaly score are usually considered anomalous, and samples with a positive anomaly score are considered non-anomalous.

When you look at a sample that might be anomalous, we recommend that you pay attention to unusual values. For example, you might have anomalous values that result from errors in gathering and processing the data. We recommend using domain knowledge and business logic when you examine anomalous samples.

Data Wrangler detects duplicate rows and calculates the ratio of duplicate rows in your data. Some data sources could include valid duplicates. Other data sources could have duplicates that point to problems in data collection. Duplicate samples that result from faulty data collection could interfere with machine learning processes that rely on splitting the data into independent training and validation folds.
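The duplicate ratio can be sketched in a few lines of pure Python (illustrative, not Data Wrangler's implementation): a row counts as a duplicate when an identical row appeared earlier in the dataset.

```python
def duplicate_ratio(rows):
    """Fraction of rows that exactly repeat an earlier row."""
    seen, dupes = set(), 0
    for row in rows:
        key = tuple(row)
        if key in seen:
            dupes += 1
        else:
            seen.add(key)
    return dupes / len(rows)

# Two of the five rows repeat the first row, so 40% of the data is duplicated.
ratio = duplicate_ratio([(1, "a"), (2, "b"), (1, "a"), (1, "a"), (3, "c")])
```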

The following are elements of the insights report that can be impacted by duplicated samples:
+ Quick model
+ Prediction power estimation
+ Automatic hyperparameter tuning

You can remove duplicate samples from the dataset using the **Drop duplicates** transform under **Manage rows**. Data Wrangler shows you the most frequently duplicated rows.

## Definitions


The following are definitions for the technical terms that are used in the data insights report.

------
#### [ Feature types ]

The following are the definitions for each of the feature types:
+ **Numeric** – Numeric values can be either floats or integers, such as age or income. The machine learning models assume that numeric values are ordered and a distance is defined over them. For example, 3 is closer to 4 than to 10 and 3 < 4 < 10.
+ **Categorical** – The column entries belong to a set of unique values, which is usually much smaller than the number of entries in the column. For example, a column of length 100 could contain the unique values `Dog`, `Cat`, and `Mouse`. The values could be numeric, text, or a combination of both. `Horse`, `House`, `8`, `Love`, and `3.1` would all be valid values and could be found in the same categorical column. The machine learning model does not assume order or distance on the values of categorical features, as opposed to numeric features, even when all the values are numbers.
+ **Binary** – Binary features are a special categorical feature type in which the cardinality of the set of unique values is 2.
+ **Text** – A text column contains many non-numeric unique values. In extreme cases, all the elements of the column are unique, so no two entries are the same.
+ **Datetime** – A datetime column contains information about the date or time. It can have information about both the date and time.

------
#### [ Feature statistics ]

The following are definitions for each of the feature statistics:
+ **Prediction power** – Prediction power measures how useful the column is in predicting the target.
+ **Outliers** (in numeric columns) – Data Wrangler detects outliers using two statistics that are robust to outliers: median and robust standard deviation (RSTD). RSTD is derived by clipping the feature values to the range [5th percentile, 95th percentile] and calculating the standard deviation of the clipped vector. All values larger than median + 5 \* RSTD or smaller than median - 5 \* RSTD are considered to be outliers.
+ **Skew** (in numeric columns) – Skew measures the symmetry of the distribution and is defined as the third moment of the distribution divided by the third power of the standard deviation. The skewness of the normal distribution or any other symmetric distribution is zero. Positive values imply that the right tail of the distribution is longer than the left tail. Negative values imply that the left tail of the distribution is longer than the right tail. As a rule of thumb, a distribution is considered skewed when the absolute value of the skew is larger than 3.
+ **Kurtosis** (in numeric columns) – Pearson's kurtosis measures the heaviness of the tail of the distribution. It's defined as the fourth moment of the distribution divided by the square of the second moment. The kurtosis of the normal distribution is 3. Kurtosis values lower than 3 imply that the distribution is concentrated around the mean and the tails are lighter than the tails of the normal distribution. Kurtosis values higher than 3 imply heavier tails or outliers.
+ **Missing values** – Null-like objects, empty strings and strings composed of only white spaces are considered missing.
+ **Valid values for numeric features or regression target** – All values that you can cast to finite floats are valid. Missing values are not valid.
+ **Valid values for categorical, binary, or text features, or for classification target** – All values that are not missing are valid.
+ **Datetime features** – All values that you can cast to a datetime object are valid. Missing values are not valid.
+ **Invalid values** – Values that are either missing or you can't properly cast. For example, in a numeric column, you can't cast the string `"six"` or a null value.
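The outlier, skew, and kurtosis definitions above follow directly from their formulas. The sketch below is illustrative; in particular, the exact percentile interpolation that Data Wrangler uses for the RSTD clipping is an assumption here:

```python
import statistics

def robust_std(values):
    """Standard deviation after clipping values to the [5th, 95th] percentile range."""
    ordered = sorted(values)
    lo = ordered[int(0.05 * (len(ordered) - 1))]
    hi = ordered[int(0.95 * (len(ordered) - 1))]
    return statistics.pstdev(min(max(v, lo), hi) for v in values)

def outliers(values, k=5):
    """Values farther than k * RSTD from the median."""
    med = statistics.median(values)
    limit = k * robust_std(values)
    return [v for v in values if abs(v - med) > limit]

def skew_kurtosis(values):
    """Third and fourth standardized moments (Pearson's definitions)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2
```

For example, the symmetric sample `[1, 2, 3, 4, 5]` has zero skew and a kurtosis below 3 (lighter tails than the normal distribution), while `outliers([1, 2, 3, 4, 1000])` flags only the extreme value.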

------
#### [ Quick model metrics for regression ]

The following are the definitions for the quick model metrics:
+ **R2** (coefficient of determination) – R2 is the proportion of the variation in the target that is predicted by the model. R2 is in the range of [-infty, 1]. 1 is the score of the model that predicts the target perfectly and 0 is the score of the trivial model that always predicts the target mean.
+ **MSE** (mean squared error) – MSE is in the range [0, infty]. 0 is the score of the model that predicts the target perfectly.
+ **MAE** (mean absolute error) – MAE is in the range [0, infty] where 0 is the score of the model that predicts the target perfectly.
+ **RMSE** (root mean square error) – RMSE is in the range [0, infty] where 0 is the score of the model that predicts the target perfectly.
+ **Max error** – The maximum absolute value of the error over the dataset. Max error is in the range [0, infty]. 0 is the score of the model that predicts the target perfectly.
+ **Median absolute error** – Median absolute error is in the range [0, infty]. 0 is the score of the model that predicts the target perfectly.
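These metrics follow their standard definitions, so a short sketch can make them concrete (an illustrative helper, not part of Data Wrangler):

```python
import statistics

def regression_metrics(y_true, y_pred):
    """Standard regression error metrics over paired true/predicted values."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    abs_errors = sorted(abs(e) for e in errors)
    mse = sum(e ** 2 for e in errors) / n
    mean = sum(y_true) / n
    ss_tot = sum((t - mean) ** 2 for t in y_true)  # variance of the trivial mean model
    return {
        "MSE": mse,
        "RMSE": mse ** 0.5,
        "MAE": sum(abs_errors) / n,
        "Max error": abs_errors[-1],
        "Median absolute error": statistics.median(abs_errors),
        "R2": 1 - (mse * n) / ss_tot,  # 1 minus residual sum of squares over total sum of squares
    }

metrics = regression_metrics([1, 2, 3], [1, 2, 4])
```

In this example only one of three predictions is off (by 1), so the max error is 1, the median absolute error is 0, and R2 is 0.5.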

------
#### [ Quick model metrics for classification ]

The following are the definitions for the quick model metrics:
+ **Accuracy** – Accuracy is the ratio of samples that are predicted accurately. Accuracy is in the range [0, 1]. 0 is the score of the model that predicts all samples incorrectly and 1 is the score of the perfect model.
+ **Balanced accuracy** – Balanced accuracy is the ratio of samples that are predicted accurately when the class weights are adjusted to balance the data. All classes are given the same importance, regardless of their frequency. Balanced accuracy is in the range [0, 1]. 0 is the score of the model that predicts all samples wrong. 1 is the score of the perfect model.
+ **AUC (binary classification)** – This is the area under the receiver operating characteristic curve. AUC is in the range [0, 1] where a random model returns a score of 0.5 and the perfect model returns a score of 1.
+ **AUC (OVR)** – For multiclass classification, this is the area under the receiver operating characteristic curve calculated separately for each label using one versus rest. Data Wrangler reports the average of the areas. AUC is in the range [0, 1] where a random model returns a score of 0.5 and the perfect model returns a score of 1.
+ **Precision** – Precision is defined for a specific class. Precision is the fraction of true positives out of all the instances that the model classified as that class. Precision is in the range [0, 1]. 1 is the score of the model that has no false positives for the class. For binary classification, Data Wrangler reports the precision of the positive class.
+ **Recall** – Recall is defined for a specific class. Recall is the fraction of the relevant class instances that are successfully retrieved. Recall is in the range [0, 1]. 1 is the score of the model that classifies all the instances of the class correctly. For binary classification, Data Wrangler reports the recall of the positive class.
+ **F1** – F1 is defined for a specific class. It's the harmonic mean of the precision and recall. F1 is in the range [0, 1]. 1 is the score of the perfect model. For binary classification, Data Wrangler reports the F1 for classes with positive values.
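Precision, recall, and F1 for a single class can be sketched from their standard definitions (an illustrative helper, not Data Wrangler's implementation):

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall, and F1 for one class treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if p == positive and t == positive)
    fp = sum(1 for t, p in pairs if p == positive and t != positive)
    fn = sum(1 for t, p in pairs if p != positive and t == positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# 2 true positives, 1 false positive, 1 false negative.
p, r, f1 = precision_recall_f1([1, 1, 1, 0, 0], [1, 1, 0, 1, 0], positive=1)
```

Because precision and recall are equal here (both 2/3), their harmonic mean, F1, is also 2/3.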

------
#### [ Textual patterns ]

**Patterns** describe the textual format of a string using an easy to read format. The following are examples of textual patterns:
+ "\$1digits:4-7\$1" describes a sequence of digits that have a length between 4 and 7.
+ "\$1alnum:5\$1" describes an alpha-numeric string with a length of exactly 5.

Data Wrangler infers the patterns by looking at samples of non-empty strings from your data. It can describe many of the commonly used patterns. The **confidence** expressed as a percentage indicates how much of the data is estimated to match the pattern. Using the textual pattern, you can see which rows in your data you need to correct or drop.

The following describes the patterns that Data Wrangler can recognize:


| Pattern | Textual Format | 
| --- | --- | 
|  {alnum}  |  Alphanumeric strings  | 
|  {any}  |  Any string of word characters  | 
|  {digits}  |  A sequence of digits  | 
|  {lower}  |  A lowercase word  | 
|  {mixed}  |  A mixed-case word  | 
|  {name}  |  A word beginning with a capital letter  | 
|  {upper}  |  An uppercase word  | 
|  {whitespace}  |  Whitespace characters  | 

A word character is either an underscore or a character that might appear in a word in any language. For example, the strings 'Hello_word' and 'écoute' both consist of word characters. 'H' and 'é' are both examples of word characters.
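One way to picture the confidence estimate is as a full-match ratio over regular expressions. The pattern-to-regex mapping below is hypothetical (Data Wrangler's internal matching isn't published); it only illustrates the idea:

```python
import re

# Hypothetical translations of textual patterns into regular expressions.
PATTERNS = {
    "{digits:4-7}": r"\d{4,7}",
    "{alnum:5}": r"[A-Za-z0-9]{5}",
    "{lower}": r"[a-z]+",
    "{upper}": r"[A-Z]+",
}

def pattern_confidence(pattern, samples):
    """Share of non-empty samples that fully match the pattern."""
    regex = re.compile(PATTERNS[pattern])
    samples = [s for s in samples if s]
    return sum(1 for s in samples if regex.fullmatch(s)) / len(samples)

# Half of these strings are 4-7 digit sequences.
confidence = pattern_confidence("{digits:4-7}", ["1234", "98765", "12", "abc1234"])
```

Rows that fail the match (here, `"12"` and `"abc1234"`) are the candidates you would review, correct, or drop.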

------

# Automatically Train Models on Your Data Flow


You can use Amazon SageMaker Autopilot to automatically train, tune, and deploy models on the data that you've transformed in your data flow. Amazon SageMaker Autopilot can go through several algorithms and use the one that works best with your data. For more information about Amazon SageMaker Autopilot, see [SageMaker Autopilot](autopilot-automate-model-development.md).

When you train and tune a model, Data Wrangler exports your data to an Amazon S3 location where Amazon SageMaker Autopilot can access it.

You can prepare and deploy a model by choosing a node in your Data Wrangler flow and choosing **Export and Train** in the data preview. You can use this method to view your dataset before you choose to train a model on it.

You can also train and deploy a model directly from your data flow.

The following procedure prepares and deploys a model from the data flow. For Data Wrangler flows with multi-row transforms, you can't use the transforms from the Data Wrangler flow when you're deploying the model. You can use the following procedure to process the data before you use it to perform inference.

To train and deploy a model directly from your data flow, do the following.

1. Choose the **+** next to the node containing the training data.

1. Choose **Train model**.

1. (Optional) Specify an AWS KMS key or ID. For more information about creating and controlling cryptographic keys to protect your data, see [AWS Key Management Service](https://docs.aws.amazon.com/kms/latest/developerguide/overview.html).

1. Choose **Export and train**.

1. After Amazon SageMaker Autopilot trains the model on the data that Data Wrangler exported, specify a name for **Experiment name**.

1. Under **Input data**, choose **Preview** to verify that Data Wrangler properly exported your data to Amazon SageMaker Autopilot.

1. For **Target**, choose the target column.

1. (Optional) For **S3 location** under **Output data**, specify an Amazon S3 location other than the default location.

1. Choose **Next: Training method**.

1. Choose a training method. For more information, see [Training modes](autopilot-model-support-validation.md#autopilot-training-mode).

1. (Optional) For **Auto deploy endpoint**, specify a name for the endpoint.

1. For **Deployment option**, choose a deployment method. You can choose to deploy with or without the transformations that you've made to your data.
**Important**  
You can't deploy an Amazon SageMaker Autopilot model with the transformations that you've made in your Data Wrangler flow. For more information about those transformations, see [Export to an Inference Endpoint](data-wrangler-data-export.md#data-wrangler-data-export-inference).

1. Choose **Next: Review and create**.

1. Choose **Create experiment**.

For more information about model training and deployment, see [Create Regression or Classification Jobs for Tabular Data Using the AutoML API](autopilot-automate-model-development-create-experiment.md). Autopilot shows you analyses about the best model's performance. For more information about model performance, see [View an Autopilot model performance report](autopilot-model-insights.md).

# Transform Data


Amazon SageMaker Data Wrangler provides numerous ML data transforms to streamline cleaning, transforming, and featurizing your data. When you add a transform, it adds a step to the data flow. Each transform you add modifies your dataset and produces a new dataframe. All subsequent transforms apply to the resulting dataframe.

Data Wrangler includes built-in transforms, which you can use to transform columns without any code. You can also add custom transformations using PySpark, Python (User-Defined Function), pandas, and PySpark SQL. Some transforms operate in place, while others create a new output column in your dataset.

You can apply transforms to multiple columns at once. For example, you can delete multiple columns in a single step.

You can apply the **Process numeric** and **Handle missing** transforms only to a single column.

Use this page to learn more about these built-in and custom transforms.

## Transform UI


Most of the built-in transforms are located in the **Prepare** tab of the Data Wrangler UI. You can access the join and concatenate transforms through the data flow view. Use the following table to preview these two views. 

------
#### [ Transform ]

You can add a transform to any step in your data flow. Use the following procedure to add a transform to your data flow.

To add a step to your data flow, do the following.

1. Choose the **+** next to the step in the data flow.

1. Choose **Add transform**.

1. Choose **Add step**.  
![\[\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/data-wrangler-add-step.png)

1. Choose a transform. 

1. (Optional) You can search for the transform that you want to use. Data Wrangler highlights the query in the results.  
![\[\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/data-wrangler-search.png)

------
#### [ Join View ]

To join two datasets, select the first dataset in your data flow and choose **Join**. When you choose **Join**, you see results similar to those shown in the following image. Your left and right datasets are displayed in the left panel. The main panel displays your data flow, with the newly joined dataset added. 

![\[The joined dataset flow in the data flow section of the Data Wrangler console.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/join-1.png)


When you choose **Configure** to configure your join, you see results similar to those shown in the following image. Your join configuration is displayed in the left panel. You can use this panel to choose the joined dataset name, join type, and columns to join. The main panel displays three tables. The top two tables display the left and right datasets on the left and right respectively. Under this table, you can preview the joined dataset. 

![\[The joined dataset tables in the data flow section of the Data Wrangler console.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/join-2.png)


See [Join Datasets](#data-wrangler-transform-join) to learn more. 

------
#### [ Concatenate View ]

To concatenate two datasets, select the first dataset in your data flow and choose **Concatenate**. When you choose **Concatenate**, you see results similar to those shown in the following image. Your left and right datasets are displayed in the left panel. The main panel displays your data flow, with the newly concatenated dataset added. 

![\[Example concatenated dataset flow in the data flow section in the Data Wrangler console.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/concat-1.png)


When you choose **Configure** to configure your concatenation, you see results similar to those shown in the following image. Your concatenate configuration displays in the left panel. You can use this panel to choose the concatenated dataset's name, and choose to remove duplicates after concatenation and add columns to indicate the source dataframe. The main panel displays three tables. The top two tables display the left and right datasets on the left and right respectively. Under this table, you can preview the concatenated dataset. 

![\[Example concatenated dataset tables in the data flow section in the Data Wrangler console.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/concat-2.png)


See [Concatenate Datasets](#data-wrangler-transform-concatenate) to learn more.

------

## Join Datasets


You join dataframes directly in your data flow. When you join two datasets, the resulting joined dataset appears in your flow. The following join types are supported by Data Wrangler.
+ **Left Outer** – Include all rows from the left table. If the value for the column joined on a left table row does not match any right table row values, that row contains null values for all right table columns in the joined table.
+ **Left Anti** – Include rows from the left table that do not contain values in the right table for the joined column.
+ **Left Semi** – Include a single row from the left table for all identical rows that satisfy the criteria in the join statement. This excludes duplicate rows from the left table that match the criteria of the join.
+ **Right Outer** – Include all rows from the right table. If the value for the joined column in a right table row does not match any left table row values, that row contains null values for all left table columns in the joined table.
+ **Inner** – Include rows from left and right tables that contain matching values in the joined column. 
+ **Full Outer** – Include all rows from the left and right tables. If the row value for the joined column in either table does not match, separate rows are created in the joined table. If a row doesn’t contain a value for a column in the joined table, null is inserted for that column.
+ **Cartesian Cross** – Include rows that combine each row from the first table with each row from the second table. This is a [Cartesian product](https://en.wikipedia.org/wiki/Cartesian_product) of rows from tables in the join. The size of the result is the number of rows in the left table times the number of rows in the right table. Therefore, we recommend caution in using this join between very large datasets. 
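
These join semantics are standard. Purely as an illustration (Data Wrangler runs the joins for you on Spark), the following pandas sketch reproduces several of the join types on two small hypothetical tables keyed on `id`:

```python
import pandas as pd

# Hypothetical left and right tables joined on "id".
left = pd.DataFrame({"id": [1, 2, 3], "l_val": ["a", "b", "c"]})
right = pd.DataFrame({"id": [2, 3, 4], "r_val": ["x", "y", "z"]})

left_outer = left.merge(right, on="id", how="left")    # all left rows; NaN where no match
inner = left.merge(right, on="id", how="inner")        # matching rows only
full_outer = left.merge(right, on="id", how="outer")   # all rows from both sides
cross = left.merge(right, how="cross")                 # Cartesian product: 3 x 3 = 9 rows

# Left anti: left rows with no matching "id" in the right table.
left_anti = left[~left["id"].isin(right["id"])]
```

The `cross` result illustrates the Cartesian warning above: even these tiny 3-row tables produce 9 rows, and the output grows as the product of the two input sizes.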

Use the following procedure to join two dataframes.

1. Select the **+** next to the left dataframe that you want to join. The first dataframe you select is always the left table in your join. 

1. Choose **Join**.

1. Select the right dataframe. The second dataframe you select is always the right table in your join.

1. Choose **Configure** to configure your join. 

1. Give your joined dataset a name using the **Name** field.

1. Select a **Join type**.

1. Select a column from the left and right tables to join. 

1. Choose **Apply** to preview the joined dataset on the right. 

1. To add the joined table to your data flow, choose **Add**. 

## Concatenate Datasets


**Concatenate two datasets:**

1. Choose the **+** next to the left dataframe that you want to concatenate. The first dataframe you select is always the left table in your concatenation. 

1. Choose **Concatenate**.

1. Select the right dataframe. The second dataframe you select is always the right table in your concatenation.

1. Choose **Configure** to configure your concatenation. 

1. Give your concatenated dataset a name using the **Name** field.

1. (Optional) Select the checkbox next to **Remove duplicates after concatenation** to remove duplicate rows. 

1. (Optional) Select the checkbox next to **Add column to indicate source dataframe** if you want to add a column to the new dataset that indicates the source dataframe of each row. 

1. Choose **Apply** to preview the new dataset. 

1. Choose **Add** to add the new dataset to your data flow. 

## Balance Data


You can balance the data for datasets with an underrepresented category. Balancing a dataset can help you create better models for binary classification.

**Note**  
You can't balance datasets containing column vectors.

You can use the **Balance data** operation to balance your data using one of the following operators:
+ *Random oversampling* – Randomly duplicates samples in the minority category. For example, if you're trying to detect fraud, you might only have cases of fraud in 10% of your data. For an equal proportion of fraudulent and non-fraudulent cases, this operator randomly duplicates fraud cases within the dataset 8 times.
+ *Random undersampling* – Roughly equivalent to random oversampling. Randomly removes samples from the overrepresented category to get the proportion of samples that you desire.
+ *Synthetic Minority Oversampling Technique (SMOTE)* – Uses samples from the underrepresented category to interpolate new synthetic minority samples. For more information about SMOTE, see the following description.

You can use all transforms for datasets containing both numeric and non-numeric features. SMOTE interpolates values by using neighboring samples. Data Wrangler uses the R-squared distance to determine the neighborhood to interpolate the additional samples. Data Wrangler only uses numeric features to calculate the distances between samples in the underrepresented group.

For two real samples in the underrepresented group, Data Wrangler interpolates the numeric features by using a weighted average, randomly assigning the weights in the range of [0, 1]. For samples A and B, Data Wrangler could randomly assign a weight of 0.7 to A and 0.3 to B. The interpolated sample then has a value of 0.7A + 0.3B.

Data Wrangler interpolates non-numeric features by copying from either of the interpolated real samples. It copies the samples with a probability that it randomly assigns to each sample. For samples A and B, it can assign probabilities 0.8 to A and 0.2 to B. For the probabilities it assigned, it copies A 80% of the time.
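
As a sketch of the interpolation described above (not Data Wrangler's internal implementation), the following Python function combines two hypothetical minority-class samples with a random weight: numeric features get a weighted average, and non-numeric features are copied from one of the two samples with the corresponding probability.

```python
import random

# Hypothetical minority-class samples: two numeric features and one categorical.
sample_a = {"amount": 120.0, "age": 30.0, "country": "US"}
sample_b = {"amount": 80.0,  "age": 40.0, "country": "CA"}

def interpolate(a, b, numeric_keys, weight):
    """Weighted average for numeric features; weighted coin flip for the rest."""
    synthetic = {}
    for key in a:
        if key in numeric_keys:
            synthetic[key] = weight * a[key] + (1 - weight) * b[key]
        else:
            # Copy the non-numeric value from A with probability `weight`.
            synthetic[key] = a[key] if random.random() < weight else b[key]
    return synthetic

# With weight 0.7: amount = 0.7*120 + 0.3*80 = 108.0, age = 0.7*30 + 0.3*40 = 33.0
new_sample = interpolate(sample_a, sample_b, {"amount", "age"}, weight=0.7)
```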

## Custom Transforms


The **Custom Transforms** group allows you to use Python (User-Defined Function), PySpark, pandas, or PySpark (SQL) to define custom transformations. For all of these options, you use the variable `df` to access the dataframe to which you want to apply the transform. To apply your custom code to your dataframe, assign the dataframe with the transformations that you've made to the `df` variable. If you're not using Python (User-Defined Function), you don't need to include a return statement. Choose **Preview** to preview the result of the custom transform. Choose **Add** to add the custom transform to your list of **Previous steps**.

You can import popular libraries with an `import` statement in the custom transform code block, such as the following:
+ NumPy version 1.19.0
+ scikit-learn version 0.23.2
+ SciPy version 1.5.4
+ pandas version 1.0.3
+ PySpark version 3.0.0

**Important**  
**Custom transform** doesn't support columns with spaces or special characters in the name. We recommend that you specify column names that only have alphanumeric characters and underscores. You can use the **Rename column** transform in the **Manage columns** transform group to remove spaces from a column's name. You can also add a **Python (Pandas)** **Custom transform** similar to the following to remove spaces from multiple columns in a single step. This example changes columns named `A column` and `B column` to `A_column` and `B_column` respectively.   

```
df = df.rename(columns={"A column": "A_column", "B column": "B_column"})
```

If you include print statements in the code block, the result appears when you select **Preview**. You can resize the custom code transformer panel. Resizing the panel provides more space to write code. The following image shows the resizing of the panel.

![\[For the Python function, replace the comments under pd.Series with your code.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/resizing-panel.gif)


The following sections provide additional context and examples for writing custom transform code.

**Python (User-Defined Function)**

The Python function gives you the ability to write custom transformations without needing to know Apache Spark or pandas. Data Wrangler is optimized to run your custom code quickly. You get similar performance using custom Python code and an Apache Spark plugin.

To use the Python (User-Defined Function) code block, you specify the following:
+ **Input column** – The input column where you're applying the transform.
+ **Mode** – The scripting mode, either pandas or Python.
+ **Return type** – The data type of the value that you're returning.

Using the pandas mode gives better performance. The Python mode makes it easier for you to write transformations by using pure Python functions.
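The following sketch illustrates the difference between the two modes using the salutation example from the video below; the exact callback signatures that Data Wrangler expects are an assumption here.

```python
import pandas as pd

# pandas mode (sketch): the function receives the whole input column as a
# pd.Series and returns a pd.Series of the same length.
def pandas_mode(series: pd.Series) -> pd.Series:
    return series.str.extract(r"(Mr\.|Mrs\.|Miss\.)", expand=False)

# Python mode (sketch): the function receives one scalar value at a time.
def python_mode(value: str) -> str:
    for salutation in ("Mr.", "Mrs.", "Miss."):
        if salutation in value:
            return salutation
    return ""

names = pd.Series(["Braund, Mr. Owen", "Cumings, Mrs. John"])
```

The pandas-mode version operates on the whole column at once, which is why it tends to perform better; the Python-mode version is plain per-value logic.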

The following video shows an example of how to use custom code to create a transformation. It uses the [Titanic dataset](https://s3.us-west-2.amazonaws.com/amazon-sagemaker-data-wrangler-documentation-artifacts/walkthrough_titanic.csv) to create a column with the person's salutation.

![\[For the Python function, replace the comments under pd.Series with your code.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/python-function-transform-titanic-720.gif)


**PySpark**

The following example extracts date and time from a timestamp.

```
from pyspark.sql.functions import from_unixtime, to_date, date_format
df = df.withColumn('DATE_TIME', from_unixtime('TIMESTAMP'))
df = df.withColumn( 'EVENT_DATE', to_date('DATE_TIME')).withColumn(
'EVENT_TIME', date_format('DATE_TIME', 'HH:mm:ss'))
```

**pandas**

The following example provides an overview of the dataframe to which you are adding transforms. 

```
df.info()
```

**PySpark (SQL)**

The following example creates a new dataframe with four columns: *name*, *fare*, *pclass*, *survived*.

```
SELECT name, fare, pclass, survived FROM df
```

If you don’t know how to use PySpark, you can use custom code snippets to help you get started.

Data Wrangler has a searchable collection of code snippets. You can use the code snippets to perform tasks such as dropping columns, grouping by columns, or modeling.

To use a code snippet, choose **Search example snippets** and specify a query in the search bar. The text you specify in the query doesn’t have to match the name of the code snippet exactly.

The following example shows a **Drop duplicate rows** code snippet that can delete rows with similar data in your dataset. You can find the code snippet by searching for one of the following:
+ Duplicates
+ Identical
+ Remove

The following snippet has comments to help you understand the changes that you need to make. For most snippets, you must specify the column names of your dataset in the code.

```
# Specify the subset of columns
# all rows having identical values in these columns will be dropped

subset = ["col1", "col2", "col3"]
df = df.dropDuplicates(subset)  

# to drop the full-duplicate rows run
# df = df.dropDuplicates()
```

To use a snippet, copy and paste its content into the **Custom transform** field. You can copy and paste multiple code snippets into the custom transform field.

## Custom Formula


Use **Custom formula** to define a new column using a Spark SQL expression to query data in the current dataframe. The query must use the conventions of Spark SQL expressions.

**Important**  
**Custom formula** doesn't support columns with spaces or special characters in the name. We recommend that you specify column names that only have alphanumeric characters and underscores. You can use the **Rename column** transform in the **Manage columns** transform group to remove spaces from a column's name. You can also add a **Python (Pandas)** **Custom transform** similar to the following to remove spaces from multiple columns in a single step. This example changes columns named `A column` and `B column` to `A_column` and `B_column` respectively.   

```
df = df.rename(columns={"A column": "A_column", "B column": "B_column"})
```

You can use this transform to perform operations on columns, referencing the columns by name. For example, assuming the current dataframe contains columns named *col\_a* and *col\_b*, you can use the following operation to produce an **Output column** that is the product of these two columns:

```
col_a * col_b
```

Other common operations include the following, assuming a dataframe contains `col_a` and `col_b` columns:
+ Concatenate two columns: `concat(col_a, col_b)`
+ Add two columns: `col_a + col_b`
+ Subtract two columns: `col_a - col_b`
+ Divide two columns: `col_a / col_b`
+ Take the absolute value of a column: `abs(col_a)`
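
Data Wrangler evaluates these expressions as Spark SQL. Purely for illustration, the following pandas sketch shows what each of the listed operations computes on a small hypothetical dataframe:

```python
import pandas as pd

df = pd.DataFrame({"col_a": [1.0, -2.0], "col_b": [3.0, 4.0]})

# Spark SQL expression -> pandas equivalent (illustration only):
product  = df["col_a"] * df["col_b"]                          # col_a * col_b
concat   = df["col_a"].astype(str) + df["col_b"].astype(str)  # concat(col_a, col_b)
total    = df["col_a"] + df["col_b"]                          # col_a + col_b
ratio    = df["col_a"] / df["col_b"]                          # col_a / col_b
absolute = df["col_a"].abs()                                  # abs(col_a)
```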

For more information, see the [Spark documentation](http://spark.apache.org/docs/latest/api/python) on selecting data. 

## Reduce Dimensionality within a Dataset
Reduce Dimensionality (PCA)

Reduce the dimensionality in your data by using Principal Component Analysis (PCA). The dimensionality of your dataset corresponds to the number of features. When you use dimensionality reduction in Data Wrangler, you get a new set of features called components. Each component accounts for some variability in the data.

The first component accounts for the largest amount of variation in the data. The second component accounts for the second largest amount of variation in the data, and so on.

You can use dimensionality reduction to reduce the size of the datasets that you use to train models. Instead of using all of the features in your dataset, you can use the principal components.

To perform PCA, Data Wrangler creates axes for your data. An axis is an affine combination of columns in your dataset. The first principal component is the value on the axis that has the largest amount of variance. The second principal component is the value on the axis that has the second largest amount of variance. The nth principal component is the value on the axis that has the nth largest amount of variance.

You can configure the number of principal components that Data Wrangler returns. You can either specify the number of principal components directly or you can specify the variance threshold percentage. Each principal component explains an amount of variance in the data. For example, you might have a principal component with a value of 0.5. The component would explain 50% of the variation in the data. When you specify a variance threshold percentage, Data Wrangler returns the smallest number of components that meet the percentage that you specify.

The following are example principal components with the amount of variance that they explain in the data.
+ Component 1 – 0.5
+ Component 2 – 0.45
+ Component 3 – 0.05

If you specify a variance threshold percentage of `94` or `95`, Data Wrangler returns Component 1 and Component 2. If you specify a variance threshold percentage of `96`, Data Wrangler returns all three principal components.
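
The threshold selection described above can be sketched as a cumulative sum over the ordered explained-variance values. This illustrates the rule, not Data Wrangler's internal code:

```python
# Explained-variance ratios for the example components above.
explained = [0.5, 0.45, 0.05]

def components_for_threshold(explained, threshold_pct):
    """Smallest number of leading components whose cumulative explained
    variance meets the threshold percentage."""
    cumulative = 0.0
    for i, variance in enumerate(explained, start=1):
        cumulative += variance
        if cumulative * 100 >= threshold_pct:
            return i
    return len(explained)
```

For a threshold of 95, the first two components (0.5 + 0.45 = 0.95) suffice; for 96, all three are needed.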

To run PCA on your dataset, do the following.

1. Open your Data Wrangler data flow.

1. Choose the **+**, and select **Add transform**.

1. Choose **Add step**.

1. Choose **Dimensionality Reduction**.

1. For **Input Columns**, choose the features that you're reducing into the principal components.

1. (Optional) For **Number of principal components**, choose the number of principal components that Data Wrangler returns in your dataset. If you specify a value for this field, you can't specify a value for **Variance threshold percentage**.

1. (Optional) For **Variance threshold percentage**, specify the percentage of variation in the data that you want explained by the principal components. Data Wrangler uses the default value of `95` if you don't specify a value for the variance threshold. You can't specify a variance threshold percentage if you've specified a value for **Number of principal components**.

1. (Optional) Deselect **Center** to not use the mean of the columns as the center of the data. By default, Data Wrangler centers the data with the mean before scaling.

1. (Optional) Deselect **Scale** to not scale the data with the unit standard deviation.

1. (Optional) Choose **Columns** to output the components to separate columns. Choose **Vector** to output the components as a single vector.

1. (Optional) For **Output column**, specify a name for an output column. If you're outputting the components to separate columns, the name that you specify is a prefix. If you're outputting the components to a vector, the name that you specify is the name of the vector column.

1. (Optional) Select **Keep input columns**. We don't recommend selecting this option if you plan to use only the principal components to train your model.

1. Choose **Preview**.

1. Choose **Add**.

## Encode Categorical


Categorical data is usually composed of a finite number of categories, where each category is represented with a string. For example, if you have a table of customer data, a column that indicates the country a person lives in is categorical. The categories would be *Afghanistan*, *Albania*, *Algeria*, and so on. Categorical data can be *nominal* or *ordinal*. Ordinal categories have an inherent order, and nominal categories do not. The highest degree obtained (*High school*, *Bachelors*, *Masters*, and so on) is an example of ordinal categories. 

Encoding categorical data is the process of creating a numerical representation for categories. For example, if your categories are *Dog* and *Cat*, you may encode this information into two vectors, `[1,0]` to represent *Dog*, and `[0,1]` to represent *Cat*.

When you encode ordinal categories, you may need to translate the natural order of categories into your encoding. For example, you can represent the highest degree obtained with the following map: `{"High school": 1, "Bachelors": 2, "Masters":3}`.
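
For illustration, applying that ordinal map with pandas looks like the following (Data Wrangler performs the equivalent encoding for you):

```python
import pandas as pd

# The ordinal map from the text, applied with pandas.
degree_map = {"High school": 1, "Bachelors": 2, "Masters": 3}
degrees = pd.Series(["Masters", "High school", "Bachelors"])
encoded = degrees.map(degree_map)
```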

Use categorical encoding to encode categorical data that is in string format into arrays of integers. 

The Data Wrangler categorical encoders create encodings for all categories that exist in a column at the time the step is defined. If new categories have been added to a column when you start a Data Wrangler job to process your dataset at time *t*, and this column was the input for a Data Wrangler categorical encoding transform at time *t*-1, these new categories are considered *missing* in the Data Wrangler job. The option you select for **Invalid handling strategy** is applied to these missing values. Examples of when this can occur are: 
+ When you use a .flow file to create a Data Wrangler job to process a dataset that was updated after the creation of the data flow. For example, you may use a data flow to regularly process sales data each month. If that sales data is updated weekly, new categories may be introduced into columns for which an encode categorical step is defined. 
+ When you select **Sampling** when you import your dataset, some categories may be left out of the sample. 

In these situations, these new categories are considered missing values in the Data Wrangler job.

You can choose from and configure an *ordinal* and a *one-hot encode*. Use the following sections to learn more about these options. 

Both transforms create a new column with the name that you specify for **Output column name**. You specify the output format of this column with **Output style**:
+ Select **Vector** to produce a single column with a sparse vector. 
+ Select **Columns** to create a column for every category with an indicator variable for whether the text in the original column contains a value that is equal to that category.

### Ordinal Encode


Select **Ordinal encode** to encode categories into an integer between 0 and the total number of categories in the **Input column** you select.

**Invalid handling strategy**: Select a method to handle invalid or missing values. 
+ Choose **Skip** if you want to omit the rows with missing values.
+ Choose **Keep** to retain missing values as the last category.
+ Choose **Error** if you want Data Wrangler to throw an error if missing values are encountered in the **Input column**.
+ Choose **Replace with NaN** to replace missing values with NaN. This option is recommended if your ML algorithm can handle missing values. Otherwise, the first three options in this list may produce better results.

### One-Hot Encode


Select **One-hot encode** for **Transform** to use one-hot encoding. Configure this transform using the following: 
+ **Drop last category**: If `True`, the last category does not have a corresponding index in the one-hot encoding. When missing values are possible, a missing category is always the last one and setting this to `True` means that a missing value results in an all zero vector.
+ **Invalid handling strategy**: Select a method to handle invalid or missing values. 
  + Choose **Skip** if you want to omit the rows with missing values.
  + Choose **Keep** to retain missing values as the last category.
  + Choose **Error** if you want Data Wrangler to throw an error if missing values are encountered in the **Input column**.
+ **Is input ordinal encoded**: Select this option if the input vector contains ordinal encoded data. This option requires that input data contain non-negative integers. If **True**, input *i* is encoded as a vector with a non-zero in the *i*th location. 
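
For illustration only, the following pandas sketch mimics one-hot encoding with the **Drop last category** behavior; Data Wrangler's own implementation runs on Spark.

```python
import pandas as pd

# Hypothetical categorical column with categories "Dog" and "Cat".
animals = pd.Series(["Dog", "Cat", "Dog"], dtype="category")

# One indicator column per category (column order follows the category order).
one_hot = pd.get_dummies(animals).astype(int)

# "Drop last category" behavior (sketch): drop the last category's column,
# so an all-zero row denotes that category (or a missing value).
dropped = one_hot.iloc[:, :-1]
```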

### Similarity encode


Use similarity encoding when you have the following:
+ A large number of categorical variables
+ Noisy data

The similarity encoder creates embeddings for columns with categorical data. An embedding is a mapping of discrete objects, such as words, to vectors of real numbers. It encodes similar strings to vectors containing similar values. For example, it creates very similar encodings for "California" and "Calfornia".

Data Wrangler converts each category in your dataset into a set of tokens using a 3-gram tokenizer. It converts the tokens into an embedding using min-hash encoding.
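
A minimal sketch of the idea, assuming character 3-grams and MD5-based min-hashing (the actual tokenizer and hash functions that Data Wrangler uses may differ):

```python
import hashlib

def char_trigrams(text):
    """Character 3-grams of a lowercased string."""
    text = text.lower()
    return {text[i:i + 3] for i in range(len(text) - 2)}

def minhash(tokens, num_hashes=8):
    """Min-hash signature: for each seeded hash, keep the smallest token hash."""
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int(hashlib.md5(f"{seed}:{t}".encode()).hexdigest(), 16)
            for t in tokens
        ))
    return signature

a = minhash(char_trigrams("California"))
b = minhash(char_trigrams("Calfornia"))
# Similar strings share many 3-grams, so their signatures tend to agree in
# several positions, yielding nearby embedding vectors.
```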

The following example shows how the similarity encoder creates vectors from strings.

![\[Example on using ENCODE CATEGORICAL for a table in the Data Wrangler console.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/destination-nodes/similarity-encode-example-screenshot-0.png)


![\[Example vector representation of a variable found in a table in the Data Wrangler console.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/destination-nodes/similarity-encode-example-screenshot-1.png)


The similarity encodings that Data Wrangler creates:
+ Have low dimensionality
+ Are scalable to a large number of categories
+ Are robust and resistant to noise

For the preceding reasons, similarity encoding is more versatile than one-hot encoding.

To use similarity encoding, do the following.

1. Sign in to the [Amazon SageMaker AI Console](https://console.aws.amazon.com/sagemaker/).

1. Choose **Open Studio Classic**.

1. Choose **Launch app**.

1. Choose **Studio**.

1. Specify your data flow.

1. Choose a step with a transformation.

1. Choose **Add step**.

1. Choose **Encode categorical**.

1. Specify the following:
   + **Transform** – **Similarity encode**
   + **Input column** – The column containing the categorical data that you're encoding.
   + **Target dimension** – (Optional) The dimension of the categorical embedding vector. The default value is 30. We recommend using a larger target dimension if you have a large dataset with many categories.
   + **Output style** – Choose **Vector** for a single vector with all of the encoded values. Choose **Column** to have the encoded values in separate columns.
   + **Output column** – (Optional) The name of the output column for a vector-encoded output. For a column-encoded output, this is the prefix of the column names, followed by a number.

## Featurize Text


Use the **Featurize Text** transform group to inspect string-typed columns and use text embedding to featurize these columns. 

This feature group contains two features, *Character statistics* and *Vectorize*. Use the following sections to learn more about these transforms. For both options, the **Input column** must contain text data (string type).

### Character Statistics


Use **Character statistics** to generate statistics for each row in a column containing text data. 

This transform computes the following ratios and counts for each row, and creates a new column to report the result. The new column is named using the input column name as a prefix and a suffix that is specific to the ratio or count. 
+ **Number of words**: The total number of words in that row. The suffix for this output column is `-stats_word_count`.
+ **Number of characters**: The total number of characters in that row. The suffix for this output column is `-stats_char_count`.
+ **Ratio of upper**: The number of uppercase characters, from A to Z, divided by all characters in the column. The suffix for this output column is `-stats_capital_ratio`.
+ **Ratio of lower**: The number of lowercase characters, from a to z, divided by all characters in the column. The suffix for this output column is `-stats_lower_ratio`.
+ **Ratio of digits**: The ratio of digits in a single row over the sum of digits in the input column. The suffix for this output column is `-stats_digit_ratio`.
+ **Special characters ratio**: The ratio of non-alphanumeric characters (characters like #$&%:@) over the sum of all characters in the input column. The suffix for this output column is `-stats_special_ratio`.
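
As a sketch of the statistics listed above, computed per row in plain Python (the exact definitions Data Wrangler uses, such as whether some ratios are per row or per column, may differ slightly):

```python
def character_stats(text):
    """Per-row text statistics like those listed above (sketch)."""
    total = len(text)
    return {
        "word_count": len(text.split()),
        "char_count": total,
        "capital_ratio": sum(c.isupper() for c in text) / total,
        "lower_ratio": sum(c.islower() for c in text) / total,
        "digit_ratio": sum(c.isdigit() for c in text) / total,
        "special_ratio": sum(not c.isalnum() and not c.isspace() for c in text) / total,
    }

stats = character_stats("Hello World 123!")
```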

### Vectorize


Text embedding involves mapping words or phrases from a vocabulary to vectors of real numbers. Use the Data Wrangler text embedding transform to tokenize and vectorize text data into term frequency–inverse document frequency (TF-IDF) vectors. 

When TF-IDF is calculated for a column of text data, each word in each sentence is converted to a real number that represents its semantic importance. Higher numbers are associated with less frequent words, which tend to be more meaningful. 
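
As a toy illustration of TF-IDF, the following uses Spark MLlib's smoothed IDF formula, log((N + 1) / (df + 1)); whether Data Wrangler applies exactly this formula is an assumption here.

```python
import math

# Toy corpus: three tokenized "documents" (rows).
docs = [["good", "dog"], ["bad", "dog"], ["good", "good", "cat"]]
n_docs = len(docs)

def idf(term):
    # Smoothed inverse document frequency: log((N + 1) / (df + 1)).
    df = sum(term in doc for doc in docs)
    return math.log((n_docs + 1) / (df + 1))

def tf_idf(term, doc):
    # Raw term count in the row, weighted by the corpus-level IDF.
    return doc.count(term) * idf(term)

# "dog" appears in two of three rows, so it gets a lower weight than "cat",
# which appears in only one row.
```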

When you define a **Vectorize** transform step, Data Wrangler uses the data in your dataset to define the count vectorizer and TF-IDF methods. Running a Data Wrangler job uses these same methods.

You configure this transform using the following: 
+ **Output column name**: This transform creates a new column with the text embedding. Use this field to specify a name for this output column. 
+ **Tokenizer**: A tokenizer converts the sentence into a list of words, or *tokens*. 

  Choose **Standard** to use a tokenizer that splits by white space and converts each word to lowercase. For example, `"Good dog"` is tokenized to `["good","dog"]`.

  Choose **Custom** to use a customized tokenizer. If you choose **Custom**, you can use the following fields to configure the tokenizer:
  + **Minimum token length**: The minimum length, in characters, for a token to be valid. Defaults to `1`. For example, if you specify `3` for minimum token length, words like `a, at, in` are dropped from the tokenized sentence. 
  + **Should regex split on gaps**: If selected, **regex** splits on gaps. Otherwise, it matches tokens. Defaults to `True`. 
  + **Regex pattern**: Regex pattern that defines the tokenization process. Defaults to `'\\s+'`.
  + **To lowercase**: If chosen, Data Wrangler converts all characters to lowercase before tokenization. Defaults to `True`.

  To learn more, see the Spark documentation on [Tokenizer](https://spark.apache.org/docs/latest/ml-features#tokenizer).
+ **Vectorizer**: The vectorizer converts the list of tokens into a sparse numeric vector. Each token corresponds to an index in the vector and a non-zero indicates the existence of the token in the input sentence. You can choose from two vectorizer options, *Count* and *Hashing*.
  + **Count vectorize** allows customizations that filter infrequent or too common tokens. **Count vectorize parameters** include the following: 
    + **Minimum term frequency**: In each row, terms (tokens) with smaller frequency are filtered. If you specify an integer, this is an absolute threshold (inclusive). If you specify a fraction between 0 (inclusive) and 1, the threshold is relative to the total term count. Defaults to `1`.
    + **Minimum document frequency**: Minimum number of rows in which a term (token) must appear to be included. If you specify an integer, this is an absolute threshold (inclusive). If you specify a fraction between 0 (inclusive) and 1, the threshold is relative to the number of rows. Defaults to `1`.
    + **Maximum document frequency**: Maximum number of documents (rows) in which a term (token) can appear to be included. If you specify an integer, this is an absolute threshold (inclusive). If you specify a fraction between 0 (inclusive) and 1, the threshold is relative to the number of rows. Defaults to `0.999`.
    + **Maximum vocabulary size**: Maximum size of the vocabulary. The vocabulary is made up of all terms (tokens) in all rows of the column. Defaults to `262144`.
    + **Binary outputs**: If selected, the vector outputs do not include the number of appearances of a term in a document, but rather are a binary indicator of its appearance. Defaults to `False`.

    To learn more about this option, see the Spark documentation on [CountVectorizer](https://spark.apache.org/docs/latest/ml-features#countvectorizer).
  + **Hashing** is computationally faster. **Hash vectorize parameters** include the following:
    + **Number of features during hashing**: A hash vectorizer maps tokens to a vector index according to their hash value. This feature determines the number of possible hash values. Large values result in fewer collisions between hash values but a higher dimension output vector.

    To learn more about this option, see the Spark documentation on [FeatureHasher](https://spark.apache.org/docs/latest/ml-features#featurehasher)
+ **Apply IDF** applies an IDF transformation, which multiplies the term frequency with the standard inverse document frequency used for TF-IDF embedding. **IDF parameters** include the following: 
  + **Minimum document frequency**: Minimum number of documents (rows) in which a term (token) must appear to be included. If **Count vectorize** is the chosen vectorizer, we recommend that you keep the default value and only modify the **Minimum document frequency** field in **Count vectorize parameters**. Defaults to `5`.
+ **Output format**: The output format of each row. 
  + Select **Vector** to produce a single column with a sparse vector. 
  + Select **Flattened** to create a column for every category with an indicator variable for whether the text in the original column contains a value that is equal to that category. You can only choose flattened when **Vectorizer** is set as **Count vectorizer**.

## Transform Time Series


In Data Wrangler, you can transform time series data. The values in a time series dataset are indexed to specific times. For example, a dataset that shows the number of customers in a store for each hour in a day is a time series dataset. The following table shows an example of a time series dataset.

Hourly number of customers in a store


| Number of customers | Time (hour) | 
| --- | --- | 
| 4 | 09:00 | 
| 10 | 10:00 | 
| 14 | 11:00 | 
| 25 | 12:00 | 
| 20 | 13:00 | 
| 18 | 14:00 | 

For the preceding table, the **Number of customers** column contains the time series data, indexed on the hourly timestamps in the **Time (hour)** column.

You might need to perform a series of transformations on your data to get it in a format that you can use for your analysis. Use the **Time series** transform group to transform your time series data. For more information about the transformations that you can perform, see the following sections.

**Topics**
+ [Group by a Time Series](#data-wrangler-group-by-time-series)
+ [Resample Time Series Data](#data-wrangler-resample-time-series)
+ [Handle Missing Time Series Data](#data-wrangler-transform-handle-missing-time-series)
+ [Validate the Timestamp of Your Time Series Data](#data-wrangler-transform-validate-timestamp)
+ [Standardizing the Length of the Time Series](#data-wrangler-transform-standardize-length)
+ [Extract Features from Your Time Series Data](#data-wrangler-transform-extract-time-series-features)
+ [Use Lagged Features from Your Time Series Data](#data-wrangler-transform-lag-time-series)
+ [Create a Datetime Range In Your Time Series](#data-wrangler-transform-datetime-range)
+ [Use a Rolling Window In Your Time Series](#data-wrangler-transform-rolling-window)

### Group by a Time Series


You can use the group by operation to group time series data for specific values in a column.

For example, you have the following table that tracks the average daily electricity usage in a household.

Average daily household electricity usage


| Household ID | Daily timestamp | Electricity usage (kWh) | Number of household occupants | 
| --- | --- | --- | --- | 
| household\_0 | 1/1/2020 | 30 | 2 | 
| household\_0 | 1/2/2020 | 40 | 2 | 
| household\_0 | 1/4/2020 | 35 | 3 | 
| household\_1 | 1/2/2020 | 45 | 3 | 
| household\_1 | 1/3/2020 | 55 | 4 | 

If you choose to group by ID, you get the following table.

Electricity usage grouped by household ID


| Household ID | Electricity usage series (kWh) | Number of household occupants series | 
| --- | --- | --- | 
| household\_0 | [30, 40, 35] | [2, 2, 3] | 
| household\_1 | [45, 55] | [3, 4] | 

Each entry in the time series sequence is ordered by the corresponding timestamp. The first element of the sequence corresponds to the first timestamp of the series. For `household_0`, `30` is the first value of the **Electricity Usage Series**. The value of `30` corresponds to the first timestamp of `1/1/2020`.

You can include the starting timestamp and ending timestamp. The following table shows how that information appears.

Electricity usage grouped by household ID


| Household ID | Electricity usage series (kWh) | Number of household occupants series | Start time | End time | 
| --- | --- | --- | --- | --- | 
| household\_0 | [30, 40, 35] | [2, 2, 3] | 1/1/2020 | 1/4/2020 | 
| household\_1 | [45, 55] | [3, 4] | 1/2/2020 | 1/3/2020 | 
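Outside of Data Wrangler, the same grouping can be sketched with pandas. The following is a hypothetical recreation of the household table above; the column names are illustrative, not part of the Data Wrangler flow:

```python
import pandas as pd

# Hypothetical recreation of the household table above.
df = pd.DataFrame({
    "household_id": ["household_0", "household_0", "household_0",
                     "household_1", "household_1"],
    "timestamp": pd.to_datetime(["1/1/2020", "1/2/2020", "1/4/2020",
                                 "1/2/2020", "1/3/2020"]),
    "usage_kwh": [30, 40, 35, 45, 55],
})

# Order by timestamp, then collect each household's values into a sequence,
# keeping the start and end timestamps of each series.
grouped = (
    df.sort_values("timestamp")
      .groupby("household_id")
      .agg(usage_series=("usage_kwh", list),
           start_time=("timestamp", "min"),
           end_time=("timestamp", "max"))
      .reset_index()
)
```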

You can use the following procedure to group by a time series column. 

1. Open your Data Wrangler data flow.

1. If you haven't imported your dataset, import it under the **Import data** tab.

1. In your data flow, under **Data types**, choose the **+**, and select **Add transform**.

1. Choose **Add step**.

1. Choose **Time Series**.

1. Under **Transform**, choose **Group by**.

1. Specify a column in **Group by this column**.

1. For **Apply to columns**, specify a value.

1. Choose **Preview** to generate a preview of the transform.

1. Choose **Add** to add the transform to the Data Wrangler data flow.

### Resample Time Series Data


Time series data usually has observations that aren't taken at regular intervals. For example, a dataset could have some observations that are recorded hourly and other observations that are recorded every two hours.

Many analyses, such as forecasting algorithms, require the observations to be taken at regular intervals. Resampling gives you the ability to establish regular intervals for the observations in your dataset.

You can either upsample or downsample a time series. Downsampling increases the interval between observations in the dataset. For example, if you downsample observations that are taken either every hour or every two hours, each observation in your dataset is taken every two hours. The hourly observations are aggregated into a single value using an aggregation method such as the mean or median.

Upsampling reduces the interval between observations in the dataset. For example, if you upsample observations that are taken every two hours into hourly observations, you can use an interpolation method to infer hourly observations from the ones that have been taken every two hours. For information on interpolation methods, see [pandas.DataFrame.interpolate](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.interpolate.html).

You can resample both numeric and non-numeric data.
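As a sketch of the two operations, the pandas `resample` API (which is not necessarily what Data Wrangler uses internally) can downsample with an aggregation and upsample with interpolation. The readings below are illustrative:

```python
import pandas as pd

# Illustrative hourly readings.
ts = pd.Series(
    [30, 32, 35, 32, 30],
    index=pd.date_range("2020-01-01 12:00", periods=5, freq="h"),
)

# Downsample: aggregate hourly readings into two-hour bins using the mean.
down = ts.resample("2h").mean()

# Upsample: expand the two-hour readings back to hourly intervals and
# infer the missing observations with linear interpolation.
up = down.resample("h").interpolate(method="linear")
```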

Use the **Resample** operation to resample your time series data. If you have multiple time series in your dataset, Data Wrangler standardizes the time interval for each time series.

The following table shows an example of downsampling time series data by using the mean as the aggregation method. The data is downsampled from every hour to every two hours.

Hourly temperature readings over a day before downsampling


| Timestamp | Temperature (Celsius) | 
| --- | --- | 
| 12:00 | 30 | 
| 1:00 | 32 | 
| 2:00 | 35 | 
| 3:00 | 32 | 
| 4:00 | 30 | 

Temperature readings downsampled to every two hours


| Timestamp | Temperature (Celsius) | 
| --- | --- | 
| 12:00 | 30 | 
| 2:00 | 33.5 | 
| 4:00 | 31 | 

You can use the following procedure to resample time series data.

1. Open your Data Wrangler data flow.

1. If you haven't imported your dataset, import it under the **Import data** tab.

1. In your data flow, under **Data types**, choose the **+**, and select **Add transform**.

1. Choose **Add step**.

1. Choose **Resample**.

1. For **Timestamp**, choose the timestamp column.

1. For **Frequency unit**, specify the frequency that you're resampling.

1. (Optional) Specify a value for **Frequency quantity**.

1. Configure the transform by specifying the remaining fields.

1. Choose **Preview** to generate a preview of the transform.

1. Choose **Add** to add the transform to the Data Wrangler data flow.

### Handle Missing Time Series Data


If you have missing values in your dataset, you can do one of the following:
+ For datasets that have multiple time series, drop the time series that have missing values that are greater than a threshold that you specify.
+ Impute the missing values in a time series by using other values in the time series.

Imputing a missing value involves replacing the data by either specifying a value or by using an inferential method. The following are the methods that you can use for imputation:
+ Constant value – Replace all the missing data in your dataset with a value that you specify.
+ Most common value – Replace all the missing data with the value that has the highest frequency in the dataset.
+ Forward fill – Use a forward fill to replace the missing values with the non-missing value that precedes the missing values. For the sequence: [2, 4, 7, NaN, NaN, NaN, 8], all of the missing values are replaced with 7. The sequence that results from using a forward fill is [2, 4, 7, 7, 7, 7, 8].
+ Backward fill – Use a backward fill to replace the missing values with the non-missing value that follows the missing values. For the sequence: [2, 4, 7, NaN, NaN, NaN, 8], all of the missing values are replaced with 8. The sequence that results from using a backward fill is [2, 4, 7, 8, 8, 8, 8]. 
+ Interpolate – Uses an interpolation function to impute the missing values. For more information on the functions that you can use for interpolation, see [pandas.DataFrame.interpolate](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.interpolate.html).
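The forward fill and backward fill examples above map directly onto the pandas `ffill` and `bfill` methods, shown here as a minimal sketch:

```python
import numpy as np
import pandas as pd

s = pd.Series([2, 4, 7, np.nan, np.nan, np.nan, 8])

forward = s.ffill()   # missing values take the preceding value, 7
backward = s.bfill()  # missing values take the following value, 8
```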

Some of the imputation methods might not be able to impute all of the missing values in your dataset. For example, a **Forward fill** can't impute a missing value that appears at the beginning of the time series. In that case, you can impute the leading values by using a backward fill instead.

You can either impute missing values within a cell or within a column.

The following example shows how values are imputed within a cell.

Electricity usage with missing values


| Household ID | Electricity usage series (kWh) | 
| --- | --- | 
| household\$10 | [30, 40, 35, NaN, NaN] | 
| household\$11 | [45, NaN, 55] | 

Electricity usage with values imputed using a forward fill


| Household ID | Electricity usage series (kWh) | 
| --- | --- | 
| household\$10 | [30, 40, 35, 35, 35] | 
| household\$11 | [45, 45, 55] | 

The following example shows how values are imputed within a column.

Average daily household electricity usage with missing values


| Household ID | Electricity usage (kWh) | 
| --- | --- | 
| household\$10 | 30 | 
| household\$10 | 40 | 
| household\$10 | NaN | 
| household\$11 | NaN | 
| household\$11 | NaN | 

Average daily household electricity usage with values imputed using a forward fill


| Household ID | Electricity usage (kWh) | 
| --- | --- | 
| household\$10 | 30 | 
| household\$10 | 40 | 
| household\$10 | 40 | 
| household\$11 | 40 | 
| household\$11 | 40 | 

You can use the following procedure to handle missing values.

1. Open your Data Wrangler data flow.

1. If you haven't imported your dataset, import it under the **Import data** tab.

1. In your data flow, under **Data types**, choose the **+**, and select **Add transform**.

1. Choose **Add step**.

1. Choose **Handle missing**.

1. For **Time series input type**, choose whether you want to handle missing values inside of a cell or along a column.

1. For **Impute missing values for this column**, specify the column that has the missing values.

1. For **Method for imputing values**, select a method.

1. Configure the transform by specifying the remaining fields.

1. Choose **Preview** to generate a preview of the transform.


1. Choose **Add** to add the transform to the Data Wrangler data flow.

### Validate the Timestamp of Your Time Series Data


You might have timestamp data that is invalid. You can use the **Validate timestamps** transform to determine whether the timestamps in your dataset are valid. Your timestamp can be invalid for one or more of the following reasons:
+ Your timestamp column has missing values.
+ The values in your timestamp column are not formatted correctly.

If you have invalid timestamps in your dataset, you can't perform your analysis successfully. You can use Data Wrangler to identify invalid timestamps and understand where you need to clean your data.

You can configure Data Wrangler to do one of the following when it encounters missing or invalid values in your dataset:
+ Drop the rows that have the missing or invalid values.
+ Identify the rows that have the missing or invalid values.
+ Throw an error if it finds any missing or invalid values in your dataset.

You can validate the timestamps on columns that either have the `timestamp` type or the `string` type. If the column has the `string` type, Data Wrangler converts the type of the column to `timestamp` and performs the validation.
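As an illustration of what a validation pass does, the following hypothetical pandas sketch flags rows with missing or unparseable timestamps; the column name is made up:

```python
import pandas as pd

df = pd.DataFrame({"ts": ["2020-01-01 12:00", "not a date", None]})

# Invalid or missing timestamps become NaT when parsing with errors="coerce".
parsed = pd.to_datetime(df["ts"], errors="coerce")
df["ts_valid"] = parsed.notna()

# A "drop" policy keeps only the rows with valid timestamps.
clean = df[df["ts_valid"]]
```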

You can use the following procedure to validate the timestamps in your dataset.

1. Open your Data Wrangler data flow.

1. If you haven't imported your dataset, import it under the **Import data** tab.

1. In your data flow, under **Data types**, choose the **+**, and select **Add transform**.

1. Choose **Add step**.

1. Choose **Validate timestamps**.

1. For **Timestamp Column**, choose the timestamp column.

1. For **Policy**, choose how you want to handle missing or invalid timestamps.

1. (Optional) For **Output column**, specify a name for the output column.

1. If the date time column is formatted for the string type, choose **Cast to datetime**.

1. Choose **Preview** to generate a preview of the transform.

1. Choose **Add** to add the transform to the Data Wrangler data flow.

### Standardizing the Length of the Time Series


If you have time series data stored as arrays, you can standardize each time series to the same length. Standardizing the length of the time series array might make it easier for you to perform your analysis on the data.

You can standardize your time series for data transformations that require the length of your data to be fixed.

Many ML algorithms require you to flatten your time series data before you use them. Flattening time series data means separating each value of the time series into its own column in a dataset. The number of columns in a dataset can't change, so the lengths of the time series need to be standardized before you flatten each array into a set of features.

Each time series is set to the length that you specify as a quantile or percentile of the time series set. For example, you can have three sequences that have the following lengths:
+ 3
+ 4
+ 5

You can set the length of all of the sequences to the length of the sequence at the 50th percentile, which is 4 in this example.

Time series arrays that are shorter than the length you've specified have missing values added. The following is an example of standardizing a time series to a longer length: [2, 4, 5, NaN, NaN, NaN].

You can use different approaches to handle the missing values. For information on those approaches, see [Handle Missing Time Series Data](#data-wrangler-transform-handle-missing-time-series).

The time series arrays that are longer than the length that you specify are truncated.

You can use the following procedure to standardize the length of the time series.

1. Open your Data Wrangler data flow.

1. If you haven't imported your dataset, import it under the **Import data** tab.

1. In your data flow, under **Data types**, choose the **+**, and select **Add transform**.

1. Choose **Add step**.

1. Choose **Standardize length**.

1. For **Standardize the time series length for the column**, choose a column.

1. (Optional) For **Output column**, specify a name for the output column. If you don't specify a name, the transform is done in place.

1. If the datetime column is formatted for the string type, choose **Cast to datetime**.

1. Choose **Cutoff quantile** and specify a quantile to set the length of the sequence.

1. Choose **Flatten the output** to output the values of the time series into separate columns.

1. Choose **Preview** to generate a preview of the transform.

1. Choose **Add** to add the transform to the Data Wrangler data flow.

### Extract Features from Your Time Series Data


If you're running a classification or a regression algorithm on your time series data, we recommend extracting features from the time series before running the algorithm. Extracting features might improve the performance of your algorithm.

Use the following options to choose how you want to extract features from your data:
+ Use **Minimal subset** to specify extracting 8 features that you know are useful in downstream analyses. You can use a minimal subset when you need to perform computations quickly. You can also use it when your ML algorithm has a high risk of overfitting and you want to provide it with fewer features.
+ Use **Efficient subset** to specify extracting the most features possible without extracting features that are computationally intensive in your analyses.
+ Use **All features** to specify extracting all features from the time series.
+ Use **Manual subset** to choose a list of features that you think explain the variation in your data well.

Use the following procedure to extract features from your time series data.

1. Open your Data Wrangler data flow.

1. If you haven't imported your dataset, import it under the **Import data** tab.

1. In your data flow, under **Data types**, choose the **+**, and select **Add transform**.

1. Choose **Add step**.

1. Choose **Extract features**.

1. For **Extract features for this column**, choose a column.

1. (Optional) Select **Flatten** to output the features into separate columns.

1. For **Strategy**, choose a strategy to extract the features.

1. Choose **Preview** to generate a preview of the transform.

1. Choose **Add** to add the transform to the Data Wrangler data flow.

### Use Lagged Features from Your Time Series Data


For many use cases, the best way to predict the future behavior of your time series is to use its most recent behavior.

The most common uses of lagged features are the following:
+ Collecting a handful of past values. For example, for time t + 1, you collect t, t - 1, t - 2, and t - 3.
+ Collecting values that correspond to seasonal behavior in the data. For example, to predict the occupancy in a restaurant at 1:00 PM, you might want to use the features from 1:00 PM on the previous day. Using the features from 12:00 PM or 11:00 AM on the same day might not be as predictive as using the features from previous days.
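A minimal pandas sketch of collecting a handful of past values as lag columns; the column names and sample values are illustrative:

```python
import pandas as pd

df = pd.DataFrame(
    {"value": [10, 14, 24, 40, 30, 20]},
    index=pd.date_range("2020-01-01 13:00", periods=6, freq="h"),
)

# Collect t - 1, t - 2, and t - 3 as lagged feature columns.
for k in (1, 2, 3):
    df[f"lag_{k}"] = df["value"].shift(k)

# Rows without enough history contain NaN; drop them if needed.
complete = df.dropna()
```

You can use the following procedure to create lagged features in Data Wrangler.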

1. Open your Data Wrangler data flow.

1. If you haven't imported your dataset, import it under the **Import data** tab.

1. In your data flow, under **Data types**, choose the **+**, and select **Add transform**.

1. Choose **Add step**.

1. Choose **Lag features**.

1. For **Generate lag features for this column**, choose a column.

1. For **Timestamp Column**, choose the column containing the timestamps.

1. For **Lag**, specify the duration of the lag.

1. (Optional) Configure the output using one of the following options:
   + **Include the entire lag window**
   + **Flatten the output**
   + **Drop rows without history**

1. Choose **Preview** to generate a preview of the transform.

1. Choose **Add** to add the transform to the Data Wrangler data flow.

### Create a Datetime Range In Your Time Series


You might have time series data that doesn't have timestamps. If you know that the observations were taken at regular intervals, you can generate timestamps for the time series in a separate column. To generate timestamps, you specify the value for the start timestamp and the frequency of the timestamps.

For example, you might have the following time series data for the number of customers at a restaurant.

Time series data on the number of customers at a restaurant


| Number of customers | 
| --- | 
| 10 | 
| 14 | 
| 24 | 
| 40 | 
| 30 | 
| 20 | 

If you know that the restaurant opened at 1:00 PM and that the observations are taken hourly, you can add a timestamp column that corresponds to the time series data. You can see the timestamp column in the following table.

Time series data on the number of customers at a restaurant


| Number of customers | Timestamp | 
| --- | --- | 
| 10 | 1:00 PM | 
| 14 | 2:00 PM | 
| 24 | 3:00 PM | 
| 40 | 4:00 PM | 
| 30 | 5:00 PM | 
| 20 | 6:00 PM | 
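The equivalent idea in pandas is `date_range` with a start timestamp and a frequency; the following is a hypothetical sketch of the restaurant example:

```python
import pandas as pd

customers = pd.DataFrame({"num_customers": [10, 14, 24, 40, 30, 20]})

# Generate hourly timestamps starting at 1:00 PM, one per observation.
customers["timestamp"] = pd.date_range(
    start="2020-01-01 13:00", periods=len(customers), freq="h"
)
```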

Use the following procedure to add a datetime range to your data.

1. Open your Data Wrangler data flow.

1. If you haven't imported your dataset, import it under the **Import data** tab.

1. In your data flow, under **Data types**, choose the **+**, and select **Add transform**.

1. Choose **Add step**.

1. Choose **Datetime range**.

1. For **Frequency type**, choose the unit used to measure the frequency of the timestamps.

1. For **Starting timestamp**, specify the start timestamp.

1. For **Output column**, specify a name for the output column.

1. (Optional) Configure the output using the remaining fields.

1. Choose **Preview** to generate a preview of the transform.

1. Choose **Add** to add the transform to the Data Wrangler data flow.

### Use a Rolling Window In Your Time Series


You can extract features over a time period. For example, for time, *t*, and a time window length of 3, and for the row that indicates the *t*th timestamp, we append the features that are extracted from the time series at times *t* - 3, *t* - 2, and *t* - 1. For information on extracting features, see [Extract Features from Your Time Series Data](#data-wrangler-transform-extract-time-series-features). 
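As a sketch of the idea, the following pandas snippet extracts two simple features (mean and max) over the window [*t* - 3, *t* - 1] for each row; Data Wrangler's extraction strategies compute a richer feature set:

```python
import pandas as pd

s = pd.Series([10, 14, 24, 40, 30, 20])

# Shift by one so the window for row t covers t - 3 through t - 1,
# then compute rolling statistics over a window of length 3.
windows = s.shift(1).rolling(window=3)
features = pd.DataFrame({
    "roll_mean": windows.mean(),
    "roll_max": windows.max(),
})
```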

You can use the following procedure to extract features over a time period.

1. Open your Data Wrangler data flow.

1. If you haven't imported your dataset, import it under the **Import data** tab.

1. In your data flow, under **Data types**, choose the **+**, and select **Add transform**.

1. Choose **Add step**.

1. Choose **Rolling window features**.

1. For **Generate rolling window features for this column**, choose a column.

1. For **Timestamp Column**, choose the column containing the timestamps.

1. (Optional) For **Output Column**, specify the name of the output column.

1. For **Window size**, specify the window size.

1. For **Strategy**, choose the extraction strategy.

1. Choose **Preview** to generate a preview of the transform.

1. Choose **Add** to add the transform to the Data Wrangler data flow.

## Featurize Datetime


Use **Featurize date/time** to create a vector embedding representing a datetime field. To use this transform, your datetime data must be in one of the following formats: 
+ Strings describing datetime: For example, `"January 1st, 2020, 12:44pm"`. 
+ A Unix timestamp: A Unix timestamp describes the number of seconds, milliseconds, microseconds, or nanoseconds from 1/1/1970. 

You can choose to **Infer datetime format** and provide a **Datetime format**. If you provide a datetime format, you must use the codes described in the [Python documentation](https://docs.python.org/3/library/datetime.html#strftime-and-strptime-format-codes). The options you select for these two configurations have implications for the speed of the operation and the final results.
+ The most manual and computationally fastest option is to specify a **Datetime format** and select **No** for **Infer datetime format**.
+ To reduce manual labor, you can choose **Infer datetime format** and not specify a datetime format. This is also a computationally fast operation; however, the first datetime format encountered in the input column is assumed to be the format for the entire column. If there are other formats in the column, those values are NaN in the final output. Inferring the datetime format can therefore leave some strings unparsed. 
+ If you don't specify a format and select **No** for **Infer datetime format**, you get the most robust results. All the valid datetime strings are parsed. However, this operation can be an order of magnitude slower than the first two options in this list. 

When you use this transform, you specify an **Input column** which contains datetime data in one of the formats listed above. The transform creates an output column named **Output column name**. The format of the output column depends on your configuration using the following:
+ **Vector**: Outputs a single column as a vector. 
+ **Columns**: Creates a new column for every feature. For example, if the output contains a year, month, and day, three separate columns are created for year, month, and day. 

Additionally, you must choose an **Embedding mode**. For linear models and deep networks, we recommend choosing **cyclic**. For tree-based algorithms, we recommend choosing **ordinal**.
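As an illustration of the cyclic idea (not Data Wrangler's exact embedding), hour-of-day can be mapped onto a circle with sine and cosine so that 11 PM and midnight end up close together:

```python
import numpy as np
import pandas as pd

ts = pd.to_datetime(pd.Series([
    "2020-01-01 00:00", "2020-01-01 06:00", "2020-01-01 12:00",
]))
hour = ts.dt.hour

# Map the 24-hour cycle onto the unit circle.
hour_sin = np.sin(2 * np.pi * hour / 24)
hour_cos = np.cos(2 * np.pi * hour / 24)
```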

## Format String


The **Format string** transforms contain standard string formatting operations. For example, you can use these operations to remove special characters, normalize string lengths, and update string casing.

This feature group contains the following transforms. All transforms return copies of the strings in the **Input column** and add the result to a new, output column.


| Name | Function | 
| --- | --- | 
| Left pad |  Left-pad the string with a given **Fill character** to the given **width**. If the string is longer than **width**, the return value is shortened to **width** characters.  | 
| Right pad |  Right-pad the string with a given **Fill character** to the given **width**. If the string is longer than **width**, the return value is shortened to **width** characters.  | 
| Center (pad on either side) |  Center-pad the string (add padding on both sides of the string) with a given **Fill character** to the given **width**. If the string is longer than **width**, the return value is shortened to **width** characters.  | 
| Prepend zeros |  Left-fill a numeric string with zeros, up to a given **width**. If the string is longer than **width**, the return value is shortened to **width** characters.  | 
| Strip left and right |  Returns a copy of the string with the leading and trailing characters removed.  | 
| Strip characters from left |  Returns a copy of the string with leading characters removed.  | 
| Strip characters from right |  Returns a copy of the string with trailing characters removed.  | 
| Lower case |  Convert all letters in text to lowercase.  | 
| Upper case |  Convert all letters in text to uppercase.  | 
| Capitalize |  Capitalize the first letter in each sentence.   | 
| Swap case | Converts all uppercase characters to lowercase and all lowercase characters to uppercase characters of the given string, and returns it. | 
| Add prefix or suffix |  Adds a prefix and a suffix to the string column. You must specify at least one of **Prefix** and **Suffix**.   | 
| Remove symbols |  Removes given symbols from a string. All listed characters are removed. Defaults to white space.   | 
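Several of these transforms correspond to standard pandas string methods; the following is a brief sketch with made-up values:

```python
import pandas as pd

s = pd.Series(["  hello  ", "world"])

stripped = s.str.strip()                                       # strip left and right
padded = stripped.str.pad(width=8, side="left", fillchar="*")  # left pad
upper = stripped.str.upper()                                   # upper case
zeros = pd.Series(["42", "7"]).str.zfill(4)                    # prepend zeros
```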

## Handle Outliers


Machine learning models are sensitive to the distribution and range of your feature values. Outliers, or rare values, can negatively impact model accuracy and lead to longer training times. Use this feature group to detect and update outliers in your dataset. 

When you define a **Handle outliers** transform step, the statistics used to detect outliers are generated on the data available in Data Wrangler when defining this step. These same statistics are used when running a Data Wrangler job. 

Use the following sections to learn more about the transforms this group contains. You specify an **Output name** and each of these transforms produces an output column with the resulting data. 

### Robust standard deviation numeric outliers


This transform detects and fixes outliers in numeric features using statistics that are robust to outliers.

You must define an **Upper quantile** and a **Lower quantile** for the statistics used to calculate outliers. You must also specify the number of **Standard deviations** from which a value must vary from the mean to be considered an outlier. For example, if you specify 3 for **Standard deviations**, a value must fall more than 3 standard deviations from the mean to be considered an outlier. 

The **Fix method** is the method used to handle outliers when they are detected. You can choose from the following:
+ **Clip**: Use this option to clip the outliers to the corresponding outlier detection bound.
+ **Remove**: Use this option to remove rows with outliers from the dataframe.
+ **Invalidate**: Use this option to replace outliers with invalid values.

### Standard Deviation Numeric Outliers


This transform detects and fixes outliers in numeric features using the mean and standard deviation.

You specify the number of **Standard deviations** a value must vary from the mean to be considered an outlier. For example, if you specify 3 for **Standard deviations**, a value must fall more than 3 standard deviations from the mean to be considered an outlier. 

The **Fix method** is the method used to handle outliers when they are detected. You can choose from the following:
+ **Clip**: Use this option to clip the outliers to the corresponding outlier detection bound.
+ **Remove**: Use this option to remove rows with outliers from the dataframe.
+ **Invalidate**: Use this option to replace outliers with invalid values.
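The three fix methods can be sketched with plain pandas arithmetic; the threshold of 2 standard deviations and the sample values below are illustrative:

```python
import pandas as pd

s = pd.Series([10, 12, 11, 13, 12, 100])
mean, std = s.mean(), s.std()
k = 2  # number of standard deviations

lower, upper = mean - k * std, mean + k * std

clipped = s.clip(lower=lower, upper=upper)          # "Clip": pull to the bounds
removed = s[(s >= lower) & (s <= upper)]            # "Remove": drop outlier rows
invalidated = s.where((s >= lower) & (s <= upper))  # "Invalidate": NaN out
```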

### Quantile Numeric Outliers


Use this transform to detect and fix outliers in numeric features using quantiles. You can define an **Upper quantile** and a **Lower quantile**. All values that fall above the upper quantile or below the lower quantile are considered outliers. 

The **Fix method** is the method used to handle outliers when they are detected. You can choose from the following:
+ **Clip**: Use this option to clip the outliers to the corresponding outlier detection bound.
+ **Remove**: Use this option to remove rows with outliers from the dataframe.
+ **Invalidate**: Use this option to replace outliers with invalid values. 

### Min-Max Numeric Outliers


This transform detects and fixes outliers in numeric features using upper and lower thresholds. Use this method if you know the threshold values that demarcate outliers.

You specify an **Upper threshold** and a **Lower threshold**. If values fall above or below those thresholds respectively, they are considered outliers. 

The **Fix method** is the method used to handle outliers when they are detected. You can choose from the following:
+ **Clip**: Use this option to clip the outliers to the corresponding outlier detection bound.
+ **Remove**: Use this option to remove rows with outliers from the dataframe.
+ **Invalidate**: Use this option to replace outliers with invalid values. 

### Replace Rare


When you use the **Replace rare** transform, you specify a threshold and Data Wrangler finds all values that meet that threshold and replaces them with a string that you specify. For example, you may want to use this transform to categorize all outliers in a column into an "Others" category. 
+ **Replacement string**: The string with which to replace outliers.
+ **Absolute threshold**: A category is rare if the number of instances is less than or equal to this absolute threshold.
+ **Fraction threshold**: A category is rare if the number of instances is less than or equal to this fraction threshold multiplied by the number of rows.
+ **Max common categories**: Maximum not-rare categories that remain after the operation. If the threshold does not filter enough categories, those with the top number of appearances are classified as not rare. If set to 0 (default), there is no hard limit to the number of categories.
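A sketch of the absolute-threshold behavior with made-up categories:

```python
import pandas as pd

s = pd.Series(["a", "a", "a", "b", "b", "c"])
threshold = 1  # categories with <= 1 instance are considered rare

counts = s.value_counts()
rare = counts[counts <= threshold].index

# Replace rare categories with the replacement string.
replaced = s.where(~s.isin(rare), "Others")
```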

## Handle Missing Values


Missing values are a common occurrence in machine learning datasets. In some situations, it is appropriate to impute missing data with a calculated value, such as an average or categorically common value. You can process missing values using the **Handle missing values** transform group. This group contains the following transforms. 

### Fill Missing


Use the **Fill missing** transform to replace missing values with a **Fill value** you define. 

### Impute Missing


Use the **Impute missing** transform to create a new column that contains imputed values where missing values were found in input categorical and numerical data. The configuration depends on your data type.

For numeric data, choose an imputing strategy, the strategy used to determine the new value to impute. You can choose to impute the mean or the median over the values that are present in your dataset. Data Wrangler uses the value that it computes to impute the missing values.

For categorical data, Data Wrangler imputes missing values using the most frequent value in the column. To impute a custom string, use the **Fill missing** transform instead.
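For numeric data, the strategy boils down to filling with a statistic computed from the present values; a minimal sketch:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0, 8.0])

# Mean of the present values is 4.0; median is 3.0.
imputed_mean = s.fillna(s.mean())
imputed_median = s.fillna(s.median())
```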

### Add Indicator for Missing


Use the **Add indicator for missing** transform to create a new indicator column, which contains a Boolean `"false"` if a row contains a value, and `"true"` if a row contains a missing value. 

### Drop Missing


Use the **Drop missing** option to drop rows that contain missing values from the **Input column**.

## Manage Columns


You can use the following transforms to quickly update and manage columns in your dataset: 


| Name | Function | 
| --- | --- | 
| Drop Column | Delete a column.  | 
| Duplicate Column | Duplicate a column. | 
| Rename Column | Rename a column. | 
| Move Column |  Move a column's location in the dataset. Choose to move your column to the start or end of the dataset, before or after a reference column, or to a specific index.   | 

## Manage Rows


Use this transform group to quickly perform sort and shuffle operations on rows. This group contains the following:
+ **Sort**: Sort the entire dataframe by a given column. Select the check box next to **Ascending order** for this option; otherwise, deselect the check box and descending order is used for the sort. 
+ **Shuffle**: Randomly shuffle all rows in the dataset. 

## Manage Vectors


Use this transform group to combine or flatten vector columns. This group contains the following transforms. 
+ **Assemble**: Use this transform to combine Spark vectors and numeric data into a single column. For example, you can combine three columns: two containing numeric data and one containing vectors. Add all the columns you want to combine in **Input columns** and specify an **Output column name** for the combined data. 
+ **Flatten**: Use this transform to flatten a single column containing vector data. The input column must contain PySpark vectors or array-like objects. You can control the number of columns created by specifying a **Method to detect number of outputs**. For example, if you select **Length of first vector**, the number of elements in the first valid vector or array found in the column determines the number of output columns that are created. All other input vectors with too many items are truncated. Inputs with too few items are filled with NaNs.

  You also specify an **Output prefix**, which is used as the prefix for each output column. 

## Process Numeric


Use the **Process Numeric** feature group to process numeric data. Each scaler in this group is defined using the Spark library. The following scalers are supported:
+ **Standard Scaler**: Standardize the input column by subtracting the mean from each value and scaling to unit variance. To learn more, see the Spark documentation for [StandardScaler](https://spark.apache.org/docs/latest/ml-features#standardscaler).
+ **Robust Scaler**: Scale the input column using statistics that are robust to outliers. To learn more, see the Spark documentation for [RobustScaler](https://spark.apache.org/docs/latest/ml-features#robustscaler).
+ **Min Max Scaler**: Transform the input column by scaling each feature to a given range. To learn more, see the Spark documentation for [MinMaxScaler](https://spark.apache.org/docs/latest/ml-features#minmaxscaler).
+ **Max Absolute Scaler**: Scale the input column by dividing each value by the maximum absolute value. To learn more, see the Spark documentation for [MaxAbsScaler](https://spark.apache.org/docs/latest/ml-features#maxabsscaler).

## Sampling


After you've imported your data, you can use the **Sampling** transformer to take one or more samples of it. When you use the sampling transformer, Data Wrangler samples your original dataset.

You can choose one of the following sample methods:
+ **Limit**: Samples the dataset starting from the first row up to the limit that you specify.
+ **Randomized**: Takes a random sample of a size that you specify.
+ **Stratified**: Takes a stratified random sample.

You can stratify a randomized sample to make sure that it represents the original distribution of the dataset.
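A seeded stratified sample can be sketched in plain Python (illustrative only; Data Wrangler performs the equivalent at scale on Spark):

```python
# Illustrative sketch: sample the same fraction from each group of the
# stratify column so the sample preserves the original label distribution.
import random
from collections import defaultdict

rows = [{"label": "a"}] * 80 + [{"label": "b"}] * 20
fraction = 0.5
rng = random.Random(0)  # fixed seed -> reproducible sample

groups = defaultdict(list)
for row in rows:
    groups[row["label"]].append(row)

sample = []
for label, members in groups.items():
    sample.extend(rng.sample(members, round(len(members) * fraction)))
```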

You might be performing data preparation for multiple use cases. For each use case, you can take a different sample and apply a different set of transformations.

The following procedure describes the process of creating a random sample. 

**To take a random sample from your data**

1. Choose the **+** to the right of the dataset that you've imported. The name of your dataset is located below the **+**.

1. Choose **Add transform**.

1. Choose **Sampling**.

1. For **Sampling method**, choose the sampling method.

1. For **Approximate sample size**, choose the approximate number of observations that you want in your sample.

1. (Optional) Specify an integer for **Random seed** to create a reproducible sample.

The following procedure describes the process of creating a stratified sample.

**To take a stratified sample from your data**

1. Choose the **+** to the right of the dataset that you've imported. The name of your dataset is located below the **+**.

1. Choose **Add transform**.

1. Choose **Sampling**.

1. For **Sampling method**, choose the sampling method.

1. For **Approximate sample size**, choose the approximate number of observations that you want in your sample.

1. For **Stratify column**, specify the name of the column that you want to stratify on.

1. (Optional) Specify an integer for **Random seed** to create a reproducible sample.

## Search and Edit


Use this section to search for and edit specific patterns within strings. For example, you can find and update strings within sentences or documents, split strings by delimiters, and find occurrences of specific strings. 

The following transforms are supported under **Search and edit**. All transforms return copies of the strings in the **Input column** and add the result to a new output column.


| Name | Function | 
| --- | --- | 
|  Find substring  |  Returns the index of the first occurrence of the **Substring** for which you searched. You can start and end the search at **Start** and **End** respectively.   | 
|  Find substring (from right)  |  Returns the index of the last occurrence of the **Substring** for which you searched. You can start and end the search at **Start** and **End** respectively.   | 
|  Matches prefix  |  Returns a Boolean value indicating whether the string contains a given **Pattern**. A pattern can be a character sequence or regular expression. Optionally, you can make the pattern case sensitive.   | 
|  Find all occurrences  |  Returns an array with all occurrences of a given pattern. A pattern can be a character sequence or regular expression.   | 
|  Extract using regex  |  Returns a string that matches a given Regex pattern.  | 
|  Extract between delimiters  |  Returns a string with all characters found between **Left delimiter** and **Right delimiter**.   | 
|  Extract from position  |  Returns a string, starting from **Start position** in the input string, that contains all characters up to the start position plus **Length**.   | 
|  Find and replace substring  |  Returns a string with all matches of a given **Pattern** (regular expression) replaced by **Replacement string**.  | 
|  Replace between delimiters  |  Returns a string with the substring found between the first appearance of a **Left delimiter** and the last appearance of a **Right delimiter** replaced by **Replacement string**. If no match is found, nothing is replaced.   | 
|  Replace from position  |  Returns a string with the substring between **Start position** and **Start position** plus **Length** replaced by **Replacement string**. If **Start position** plus **Length** is greater than the length of the replacement string, the output contains **…**.  | 
|  Convert regex to missing  |  Converts a string to `None` if invalid and returns the result. Validity is defined with a regular expression in **Pattern**.  | 
|  Split string by delimiter  |  Returns an array of strings from the input string, split by **Delimiter**, with up to **Max number of splits** (optional). The delimiter defaults to white space.   | 
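Several of the transforms in the table have close analogues in Python's standard string and regex tools; this sketch is illustrative and is not Data Wrangler's implementation:

```python
# Plain-Python analogues of a few Search and edit transforms.
import re

s = "order-1234; customer [Ferguson] placed order-5678"

find = s.find("order-")                        # Find substring -> first index
find_right = s.rfind("order-")                 # Find substring (from right)
all_hits = re.findall(r"order-\d+", s)         # Find all occurrences
between = re.search(r"\[(.*?)\]", s).group(1)  # Extract between delimiters
replaced = re.sub(r"\d+", "#", s)              # Find and replace substring
parts = "a,b,c".split(",", 1)                  # Split by delimiter, max 1 split
```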

## Split data


Use the **Split data** transform to split your dataset into two or three datasets. For example, you can split your dataset into a dataset used to train your model and a dataset used to test it. You can determine the proportion of the dataset that goes into each split. For example, if you’re splitting one dataset into two datasets, the training dataset can have 80% of the data while the testing dataset has 20%.

Splitting your data into three datasets gives you the ability to create training, validation, and test datasets. To see how well the model performs on data that it hasn't seen, drop the target column from the test dataset and use the model to predict its values.

Your use case determines how much of the original dataset each of your datasets get and the method you use to split the data. For example, you might want to use a stratified split to make sure that the distribution of the observations in the target column are the same across datasets. You can use the following split transforms:
+ Randomized split — Each split is a random, non-overlapping sample of the original dataset. For larger datasets, using a randomized split might be computationally expensive and take longer than an ordered split.
+ Ordered split – Splits the dataset based on the sequential order of the observations. For example, for an 80/20 train-test split, the first observations that make up 80% of the dataset go to the training dataset. The last 20% of the observations go to the testing dataset. Ordered splits are effective in keeping the existing order of the data between splits.
+ Stratified split – Splits the dataset to make sure that the number of observations in the input column have proportional representation. For an input column that has the observations 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, an 80/20 split on the column would mean that approximately 80% of the 1s, 80% of the 2s, and 80% of the 3s go to the training set. About 20% of each type of observation go to the testing set.
+ Split by key – Avoids data with the same key occurring in more than one split. For example, if you have a dataset with the column `customer_id` and you're using it as a key, no customer ID is in more than one split.
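Two of these methods can be sketched in plain Python (illustrative only; Data Wrangler's own splits run on Spark):

```python
# Illustrative sketch of an ordered split and a split by key.
import zlib

rows = list(range(100))  # stand-in for ordered observations

# Ordered split: the first 80% of rows, in sequence, become the training set.
cut = int(len(rows) * 0.8)
train, test = rows[:cut], rows[cut:]

# Split by key: hash each key to a stable fraction in [0, 1) so every row
# with the same key always lands in the same split.
def split_of(key, train_fraction=0.8):
    bucket = zlib.crc32(str(key).encode()) % 10_000 / 10_000
    return "train" if bucket < train_fraction else "test"

assignments = {k: split_of(k) for k in ("cust-1", "cust-2", "cust-3")}
```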

After you split the data, you can apply additional transformations to each dataset. For most use cases, they aren't necessary.

Data Wrangler approximates the proportions of the splits for performance. You can set an error threshold to control the accuracy of the splits. Lower error thresholds more accurately reflect the proportions that you specify for the splits. Higher error thresholds give you better performance, but lower accuracy.

For perfectly split data, set the error threshold to 0. You can specify a threshold between 0 and 1 for better performance. If you specify a value greater than 1, Data Wrangler interprets that value as 1.

If you have 10,000 rows in your dataset and you specify an 80/20 split with an error of 0.001, you would get observations approximating one of the following results:
+ 8010 observations in the training set and 1990 in the testing set
+ 7990 observations in the training set and 2010 in the testing set

The number of observations in the training set is in the interval between 7990 and 8010, and the number of observations in the testing set is in the interval between 1990 and 2010.

By default, Data Wrangler uses a random seed to make the splits reproducible. You can specify a different value for the seed to create a different reproducible split.

------
#### [ Randomized split ]

Use the following procedure to perform a randomized split on your dataset.

To split your dataset randomly, do the following.

1. Choose the **+** next to the node containing the dataset that you're splitting.

1. Choose **Add transform**.

1. Choose **Split data**.

1. (Optional) For **Splits**, specify the names and proportions of each split. The proportions must sum to 1.

1. (Optional) Choose the **+** to create an additional split.

   1. Specify the names and proportions of all the splits. The proportions must sum to 1.

1. (Optional) Specify a value for **Error threshold** other than the default value.

1. (Optional) Specify a value for **Random seed**.

1. Choose **Preview**.

1. Choose **Add**.

------
#### [ Ordered split ]

Use the following procedure to perform an ordered split on your dataset.

To make an ordered split in your dataset, do the following.

1. Choose the **+** next to the node containing the dataset that you're splitting.

1. Choose **Add transform**.

1. Choose **Split data**.

1. For **Transform**, choose **Ordered split**.

1. (Optional) For **Splits**, specify the names and proportions of each split. The proportions must sum to 1.

1. (Optional) Choose the **+** to create an additional split.

   1. Specify the names and proportions of all the splits. The proportions must sum to 1.

1. (Optional) Specify a value for **Error threshold** other than the default value.

1. (Optional) For **Input column**, specify a column with numeric values. Data Wrangler uses the values of the column to determine which records are in each split. The smaller values go to one split, with the larger values going to the other splits.

1. (Optional) Select **Handle duplicates** to add noise to duplicate values and create a dataset of entirely unique values.

1. (Optional) Specify a value for **Random seed**.

1. Choose **Preview**.

1. Choose **Add**.

------
#### [ Stratified split ]

Use the following procedure to perform a stratified split on your dataset.

To make a stratified split in your dataset, do the following.

1. Choose the **+** next to the node containing the dataset that you're splitting.

1. Choose **Add transform**.

1. Choose **Split data**.

1. For **Transform**, choose **Stratified split**.

1. (Optional) For **Splits**, specify the names and proportions of each split. The proportions must sum to 1.

1. (Optional) Choose the **+** to create an additional split.

   1. Specify the names and proportions of all the splits. The proportions must sum to 1.

1. For **Input column**, specify a column with up to 100 unique values. Data Wrangler can't stratify a column with more than 100 unique values.

1. (Optional) Specify a value for **Error threshold** other than the default value.

1. (Optional) Specify a value for **Random seed** to create a reproducible split.

1. Choose **Preview**.

1. Choose **Add**.

------
#### [ Split by column keys ]

Use the following procedure to split by the column keys in your dataset.

To split by the column keys in your dataset, do the following.

1. Choose the **+** next to the node containing the dataset that you're splitting.

1. Choose **Add transform**.

1. Choose **Split data**.

1. For **Transform**, choose **Split by key**.

1. (Optional) For **Splits**, specify the names and proportions of each split. The proportions must sum to 1.

1. (Optional) Choose the **+** to create an additional split.

   1. Specify the names and proportions of all the splits. The proportions must sum to 1.

1. For **Key columns**, specify the columns with values that you don't want to appear in both datasets.

1. (Optional) Specify a value for **Error threshold** other than the default value.

1. Choose **Preview**.

1. Choose **Add**.

------

## Parse Value as Type


Use this transform to cast a column to a new type. The supported Data Wrangler data types are:
+ Long
+ Float
+ Boolean
+ Date, in the format dd-MM-yyyy, representing day, month, and year respectively. 
+ String
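The equivalent casts can be sketched in plain Python; note that the transform's dd-MM-yyyy date format maps to `"%d-%m-%Y"` in Python's `datetime`:

```python
# Illustrative sketch of casting string values to each supported type.
from datetime import datetime

as_long   = int("42")
as_float  = float("3.14")
as_bool   = "true".lower() in ("true", "1")
as_date   = datetime.strptime("25-12-2023", "%d-%m-%Y")  # dd-MM-yyyy
as_string = str(42)
```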

## Validate String


Use the **Validate string** transforms to create a new column that indicates whether a row of text data meets a specified condition. For example, you can use a **Validate string** transform to verify that a string only contains lowercase characters. 

The following transforms are included in this transform group. If a transform outputs a Boolean value, `True` is represented with a `1` and `False` is represented with a `0`.


| Name | Function | 
| --- | --- | 
|  String length  |  Returns `True` if a string's length equals a specified length. Otherwise, returns `False`.   | 
|  Starts with  |  Returns `True` if a string starts with a specified prefix. Otherwise, returns `False`.  | 
|  Ends with  |  Returns `True` if a string ends with a specified suffix. Otherwise, returns `False`.  | 
|  Is alphanumeric  |  Returns `True` if a string only contains numbers and letters. Otherwise, returns `False`.  | 
|  Is alpha (letters)  |  Returns `True` if a string only contains letters. Otherwise, returns `False`.  | 
|  Is digit  |  Returns `True` if a string only contains digits. Otherwise, returns `False`.  | 
|  Is space  |  Returns `True` if a string only contains white space. Otherwise, returns `False`.  | 
|  Is title  |  Returns `True` if a string is title cased, with each word starting with an uppercase character. Otherwise, returns `False`.  | 
|  Is lowercase  |  Returns `True` if a string only contains lower case letters. Otherwise, returns `False`.  | 
|  Is uppercase  |  Returns `True` if a string only contains upper case letters. Otherwise, returns `False`.  | 
|  Is numeric  |  Returns `True` if a string only contains numbers. Otherwise, returns `False`.  | 
|  Is decimal  |  Returns `True` if a string only contains decimal numbers. Otherwise, returns `False`.  | 
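Most of the checks in the table correspond directly to Python's built-in string predicates, encoded as `1`/`0` the way Data Wrangler represents the Boolean output (an illustrative mapping, not Data Wrangler's code):

```python
# Plain-Python analogues of the Validate string checks.
checks = {
    "is_alphanumeric": int("abc123".isalnum()),
    "is_alpha":        int("abc".isalpha()),
    "is_digit":        int("123".isdigit()),
    "is_space":        int("   ".isspace()),
    "is_title":        int("Hello World".istitle()),
    "is_lowercase":    int("abc".islower()),
    "is_uppercase":    int("ABC".isupper()),
    "is_numeric":      int("123".isnumeric()),
    "starts_with":     int("prefix_rest".startswith("prefix")),
    "ends_with":       int("rest_suffix".endswith("suffix")),
}
```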

## Unnest JSON Data


If you have a .csv file, you might have values in your dataset that are JSON strings. Similarly, you might have nested data in columns of either a Parquet file or a JSON document.

Use the **Flatten structured** operator to separate the first level keys into separate columns. A first level key is a key that isn't nested within a value.

For example, you might have a dataset that has a *person* column with demographic information on each person stored as JSON strings. A JSON string might look like the following.

```
 "{"seq": 1,"name": {"first": "Nathaniel","last": "Ferguson"},"age": 59,"city": "Posbotno","state": "WV"}"
```

The **Flatten structured** operator converts the following first level keys into additional columns in your dataset:
+ seq
+ name
+ age
+ city
+ state

Data Wrangler places the value of each key under the corresponding column. The following shows the column names and values from the JSON.

```
seq, name,                                    age, city, state
1, {"first": "Nathaniel","last": "Ferguson"}, 59, Posbotno, WV
```

For each value in your dataset containing JSON, the **Flatten structured** operator creates columns for the first-level keys. To create columns for nested keys, call the operator again. For the preceding example, calling the operator creates the columns:
+ name\_first
+ name\_last

The following example shows the dataset that results from calling the operation again.

```
seq, name,                                    age, city, state, name_first, name_last
1, {"first": "Nathaniel","last": "Ferguson"}, 59, Posbotno, WV, Nathaniel, Ferguson
```

Choose **Keys to flatten on** to specify the first-level keys that you want to extract as separate columns. If you don't specify any keys, Data Wrangler extracts all the keys by default.
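The two passes can be sketched in plain Python with the standard `json` module (illustrative only; Data Wrangler applies this to every row of the column):

```python
# Illustrative sketch: flatten the first-level keys of a JSON string, then
# apply a second pass to flatten the nested "name" value.
import json

person = '{"seq": 1, "name": {"first": "Nathaniel", "last": "Ferguson"}, "age": 59, "city": "Posbotno", "state": "WV"}'

row = json.loads(person)
columns = dict(row)  # one column per first-level key; nested values stay as-is

nested = {f"name_{k}": v for k, v in columns["name"].items()}
```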

## Explode Array


Use **Explode array** to expand the values of an array into separate output rows. For example, the operation can take each value in the array [[1, 2, 3], [4, 5, 6], [7, 8, 9]] and create a new column with the following rows:

```
                [1, 2, 3]
                [4, 5, 6]
                [7, 8, 9]
```

Data Wrangler names the new column `input_column_name_flatten`.

You can call the **Explode array** operation multiple times to get the nested values of the array into separate output columns. The following example shows the result of calling the operation multiple times on a dataset with a nested array.

Putting the values of a nested array into separate columns


| id | array | id | array\_items | id | array\_items\_items | 
| --- | --- | --- | --- | --- | --- | 
| 1 | [ [cat, dog], [bat, frog] ] | 1 | [cat, dog] | 1 | cat | 
| 2 |  [[rose, petunia], [lily, daisy]]  | 1 | [bat, frog] | 1 | dog | 
|  |  | 2 | [rose, petunia] | 1 | bat | 
|  |  | 2 | [lily, daisy] | 1 | frog | 
|  |  |  |  | 2 | rose | 
|  |  |  |  | 2 | petunia | 
|  |  |  |  | 2 | lily | 
|  |  |  |  | 2 | daisy | 
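The progression in the table can be sketched in plain Python, where each call to an `explode` helper (hypothetical, for illustration only) unnests one level:

```python
# Illustrative sketch of exploding a nested array column into rows.
rows = [
    {"id": 1, "array": [["cat", "dog"], ["bat", "frog"]]},
    {"id": 2, "array": [["rose", "petunia"], ["lily", "daisy"]]},
]

def explode(rows, column, out):
    # Emit one output row per element of the array column, keeping other fields.
    return [{**{k: v for k, v in r.items() if k != column}, out: item}
            for r in rows for item in r[column]]

level_1 = explode(rows, "array", "array_items")
level_2 = explode(level_1, "array_items", "array_items_items")
```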

## Transform Image Data


Use Data Wrangler to import and transform the images that you're using for your machine learning (ML) pipelines. After you've prepared your image data, you can export it from your Data Wrangler flow to your ML pipeline.

You can use the information provided here to familiarize yourself with importing and transforming image data in Data Wrangler. Data Wrangler uses OpenCV to import images. For more information about supported image formats, see [Image file reading and writing](https://docs.opencv.org/3.4/d4/da8/group__imgcodecs.html#ga288b8b3da0892bd651fce07b3bbd3a56).

After you've familiarized yourself with the concepts of transforming your image data, go through the following tutorial, [Prepare image data with Amazon SageMaker Data Wrangler](https://aws.amazon.com/blogs/machine-learning/prepare-image-data-with-amazon-sagemaker-data-wrangler/).

The following industries and use cases are examples where applying machine learning to transformed image data can be useful:
+ Manufacturing – Identifying defects in items from the assembly line
+ Food – Identifying spoiled or rotten food
+ Medicine – Identifying lesions in tissues

When you work with image data in Data Wrangler, you go through the following process:

1. Import – Select the images by choosing the directory containing them in your Amazon S3 bucket.

1. Transform – Use the built-in transformations to prepare the images for your machine learning pipeline.

1. Export – Export the images that you’ve transformed to a location that can be accessed from the pipeline.

Use the following procedure to import your image data.

**To import your image data**

1. Navigate to the **Create connection** page.

1. Choose **Amazon S3**.

1. Specify the Amazon S3 file path that contains the image data.

1. For **File type**, choose **Image**.

1. (Optional) Choose **Import nested directories** to import images from multiple Amazon S3 paths.

1. Choose **Import**.

Data Wrangler uses the open-source [imgaug](https://imgaug.readthedocs.io/en/latest/) library for its built-in image transformations. You can use the following built-in transformations:
+ **ResizeImage**
+ **EnhanceImage**
+ **CorruptImage**
+ **SplitImage**
+ **DropCorruptedImages**
+ **DropImageDuplicates**
+ **Brightness**
+ **ColorChannels**
+ **Grayscale**
+ **Rotate**

Use the following procedure to transform your images without writing code.

**To transform the image data without writing code**

1. From your Data Wrangler flow, choose the **+** next to the node representing the images that you've imported.

1. Choose **Add transform**.

1. Choose **Add step**.

1. Choose the transform and configure it.

1. Choose **Preview**.

1. Choose **Add**.

In addition to using the transformations that Data Wrangler provides, you can also use your own custom code snippets. For more information about using custom code snippets, see [Custom Transforms](#data-wrangler-transform-custom). You can import the OpenCV and imgaug libraries within your code snippets and use the transforms associated with them. The following is an example of a code snippet that detects edges within the images.

```
# A table with your image data is stored in the `df` variable
import cv2
import numpy as np
from pyspark.sql.functions import column

from sagemaker_dataprep.compute.operators.transforms.image.constants import DEFAULT_IMAGE_COLUMN, IMAGE_COLUMN_TYPE
from sagemaker_dataprep.compute.operators.transforms.image.decorators import BasicImageOperationDecorator, PandasUDFOperationDecorator


@BasicImageOperationDecorator
def my_transform(image: np.ndarray) -> np.ndarray:
    # To use the code snippet on your image data, modify the following lines within the function
    HYST_THRLD_1, HYST_THRLD_2 = 100, 200
    edges = cv2.Canny(image,HYST_THRLD_1,HYST_THRLD_2)
    return edges
    

@PandasUDFOperationDecorator(IMAGE_COLUMN_TYPE)
def custom_image_udf(image_row):
    return my_transform(image_row)
    

df = df.withColumn(DEFAULT_IMAGE_COLUMN, custom_image_udf(column(DEFAULT_IMAGE_COLUMN)))
```

When you apply transformations in your Data Wrangler flow, Data Wrangler applies them only to a sample of the images in your dataset. To optimize your experience with the application, Data Wrangler doesn't apply the transforms to all of your images.

To apply the transformations to all of your images, export your Data Wrangler flow to an Amazon S3 location. You can use the images that you've exported in your training or inference pipelines. Use a destination node or a Jupyter Notebook to export your data. You can access either method for exporting your data from the Data Wrangler flow. For information about using these methods, see [Export to Amazon S3](data-wrangler-data-export.md#data-wrangler-data-export-s3).

## Filter data


Use Data Wrangler to filter the data in your columns. When you filter the data in a column, you specify the following fields:
+ **Column name** – The name of the column that you're using to filter the data.
+ **Condition** – The type of filter that you're applying to values in the column.
+ **Value** – The value or category in the column to which you're applying the filter.

You can filter on the following conditions:
+ **=** – Returns values that match the value or category that you specify.
+ **!=** – Returns values that don't match the value or category that you specify.
+ **>=** – For **Long** or **Float** data, filters for values that are greater than or equal to the value that you specify.
+ **<=** – For **Long** or **Float** data, filters for values that are less than or equal to the value that you specify.
+ **>** – For **Long** or **Float** data, filters for values that are greater than the value that you specify.
+ **<** – For **Long** or **Float** data, filters for values that are less than the value that you specify.

For a column that has the categories `male` and `female`, you can filter out all the `male` values. You could also filter for all the `female` values. Because there are only `male` and `female` values in the column, the filter returns a column that only has `female` values.

You can also add multiple filters. The filters can be applied across multiple columns or the same column. For example, if you're creating a column that only has values within a certain range, you add two different filters. One filter specifies that the column must have values greater than the value that you provide. The other filter specifies that the column must have values less than the value that you provide.
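The behavior can be sketched in plain Python: one equality filter, then a range built from two filters on the same column (illustrative only; Data Wrangler applies filters on Spark):

```python
# Illustrative sketch of single and combined filters.
rows = [{"sex": "male", "age": 34}, {"sex": "female", "age": 29}, {"sex": "female", "age": 61}]

only_female = [r for r in rows if r["sex"] == "female"]            # Condition: =
in_range = [r for r in rows if r["age"] >= 30 and r["age"] <= 40]  # two filters: >= and <=
```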

Use the following procedure to add the filter transform to your data.

**To filter your data**

1. From your Data Wrangler flow, choose the **\$1** next to the node with the data that you're filtering.

1. Choose **Add transform**.

1. Choose **Add step**.

1. Choose **Filter data**.

1. Specify the following fields:
   + **Column name** – The column that you're filtering.
   + **Condition** – The condition of the filter.
   + **Value** – The value or category in the column to which you're applying the filter.

1. (Optional) Choose **\$1** following the filter that you've created.

1. Configure the filter.

1. Choose **Preview**.

1. Choose **Add**.

## Map Columns for Amazon Personalize


Data Wrangler integrates with Amazon Personalize, a fully managed machine learning service that generates item recommendations and user segments. You can use the **Map columns for Amazon Personalize** transform to get your data into a format that Amazon Personalize can interpret. For more information about the transforms specific to Amazon Personalize, see [Importing data using Amazon SageMaker Data Wrangler](https://docs.aws.amazon.com/personalize/latest/dg/preparing-importing-with-data-wrangler.html#dw-transform-data). For more information about Amazon Personalize, see [What is Amazon Personalize?](https://docs.aws.amazon.com/personalize/latest/dg/what-is-personalize.html)

# Analyze and Visualize


Amazon SageMaker Data Wrangler includes built-in analyses that help you generate visualizations and data analyses in a few clicks. You can also create custom analyses using your own code. 

You add an analysis to a dataframe by selecting a step in your data flow, and then choosing **Add analysis**. To access an analysis you've created, select the step that contains the analysis, and select the analysis. 

All analyses are generated using 100,000 rows of your dataset. 

You can add the following analyses to a dataframe:
+ Data visualizations, including histograms and scatter plots. 
+ A quick summary of your dataset, including number of entries, minimum and maximum values (for numeric data), and most and least frequent categories (for categorical data).
+ A quick model of the dataset, which can be used to generate an importance score for each feature. 
+ A target leakage report, which you can use to determine if one or more features are strongly correlated with your target feature.
+ A custom visualization using your own code. 

Use the following sections to learn more about these options. 

## Histogram


Use histograms to see the counts of feature values for a specific feature. You can inspect the relationships between features using the **Color by** option. For example, the following histogram charts the distribution of user ratings of the best-selling books on Amazon from 2009–2019, colored by genre. 

![\[Example histogram chart in the Data Wrangler console.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/histogram.png)


You can use the **Facet by** feature to create histograms of one column, for each value in another column. For example, the following diagram shows histograms of user reviews of best-selling books on Amazon if faceted by year. 

![\[Example histograms in the Data Wrangler console.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/review_by_year.png)


## Scatter Plot


Use the **Scatter Plot** feature to inspect the relationship between features. To create a scatter plot, select a feature to plot on the **X axis** and the **Y axis**. Both of these columns must be numeric typed columns. 

You can color scatter plots by an additional column. For example, the following example shows a scatter plot comparing the number of reviews against user ratings of top-selling books on Amazon between 2009 and 2019. The scatter plot is colored by book genre. 

![\[Example scatter plot in the Data Wrangler console.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/scatter-plot.png)


Additionally, you can facet scatter plots by features. For example, the following image shows an example of the same review versus user rating scatter plot, faceted by year. 

![\[Example faceted scatter plot in the Data Wrangler console.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/scatter-plot-facet.png)


## Table Summary


Use the **Table Summary** analysis to quickly summarize your data.

For columns with numerical data, including long and float data, a table summary reports the number of entries (count), minimum (min), maximum (max), mean, and standard deviation (stddev) for each column.

For columns with non-numerical data, including columns with string, Boolean, or date/time data, a table summary reports the number of entries (count), least frequent value (min), and most frequent value (max). 
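The two summary paths can be sketched with the standard library (illustrative only; Data Wrangler computes the same statistics on Spark, and the exact standard-deviation convention used here, sample stddev, is an assumption):

```python
# Illustrative sketch of the numeric and categorical summary paths.
from collections import Counter
from statistics import mean, stdev

numeric = [2.0, 4.0, 4.0, 6.0]
numeric_summary = {
    "count": len(numeric), "min": min(numeric), "max": max(numeric),
    "mean": mean(numeric), "stddev": stdev(numeric),
}

categorical = ["cat", "dog", "cat", "cat", "bird"]
freq = Counter(categorical)
categorical_summary = {
    "count": len(categorical),
    "min": freq.most_common()[-1][0],  # least frequent value
    "max": freq.most_common(1)[0][0],  # most frequent value
}
```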

## Quick Model


Use the **Quick Model** visualization to quickly evaluate your data and produce importance scores for each feature. A [feature importance score](http://spark.apache.org/docs/2.1.0/api/python/pyspark.ml.html#pyspark.ml.classification.DecisionTreeClassificationModel.featureImportances) indicates how useful a feature is at predicting a target label. The feature importance score is between [0, 1], and a higher number indicates that the feature is more important to the whole dataset. At the top of the quick model chart, there is a model score. A classification problem shows an F1 score. A regression problem has a mean squared error (MSE) score.

When you create a quick model chart, you select a dataset you want evaluated, and a target label against which you want feature importance to be compared. Data Wrangler does the following:
+ Infers the data types for the target label and each feature in the dataset selected. 
+ Determines the problem type. Based on the number of distinct values in the label column, Data Wrangler determines if this is a regression or classification problem type. Data Wrangler sets a categorical threshold to 100. If there are more than 100 distinct values in the label column, Data Wrangler classifies it as a regression problem; otherwise, it is classified as a classification problem. 
+ Pre-processes features and label data for training. The algorithm used requires encoding features to vector type and encoding labels to double type. 
+ Trains a random forest algorithm with 70% of data. Spark’s [RandomForestRegressor](https://spark.apache.org/docs/latest/ml-classification-regression.html#random-forest-regression) is used to train a model for regression problems. The [RandomForestClassifier](https://spark.apache.org/docs/latest/ml-classification-regression.html#random-forest-classifier) is used to train a model for classification problems.
+ Evaluates a random forest model with the remaining 30% of data. Data Wrangler evaluates classification models using an F1 score and evaluates regression models using an MSE score.
+ Calculates feature importance for each feature using the Gini importance method. 
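The problem-type rule from the steps above can be sketched on its own (illustrative only; the training itself uses Spark's random forest, which this example does not reproduce):

```python
# Illustrative sketch of Data Wrangler's problem-type rule: more than 100
# distinct label values -> regression; otherwise -> classification.
CATEGORICAL_THRESHOLD = 100

def infer_problem_type(label_column):
    distinct = len(set(label_column))
    return "regression" if distinct > CATEGORICAL_THRESHOLD else "classification"

labels_cls = [0, 1, 1, 0, 2]                      # few distinct values
labels_reg = [float(i) / 10 for i in range(500)]  # many distinct values
```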

The following image shows the user interface for the quick model feature. 

![\[Example UI of the quick model feature in the Data Wrangler console.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/quick-model.png)


## Target Leakage


Target leakage occurs when there is data in a machine learning training dataset that is strongly correlated with the target label, but is not available in real-world data. For example, you may have a column in your dataset that serves as a proxy for the column you want to predict with your model. 

When you use the **Target Leakage** analysis, you specify the following:
+ **Target**: This is the feature about which you want your ML model to be able to make predictions.
+ **Problem type**: This is the ML problem type on which you are working. Problem type can either be **classification** or **regression**. 
+  (Optional) **Max features**: This is the maximum number of features to present in the visualization, which shows features ranked by their risk of being target leakage.

For classification, the target leakage analysis uses the area under the receiver operating characteristic curve (AUC-ROC) for each column, up to **Max features**. For regression, it uses the coefficient of determination (R2).

The AUC-ROC metric is computed individually for each column using cross-validation, on a sample of up to approximately 1,000 rows. A score of 1 indicates perfect predictive ability, which often indicates target leakage. A score of 0.5 or lower indicates that the column, on its own, provides no useful information for predicting the target. Although a column can be uninformative on its own but useful for predicting the target when used in tandem with other features, a low score can indicate that the feature is redundant.
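The AUC-ROC score has a simple interpretation: it is the probability that a randomly chosen positive row is ranked above a randomly chosen negative row by the column's values. A minimal sketch of that rank statistic (illustrative only; Data Wrangler's implementation also applies cross-validation):

```
def roc_auc(labels, scores):
    # Probability that a random positive outranks a random negative;
    # ties count as half a win.
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

A leaky column that simply restates the label scores 1.0, while a constant column scores 0.5.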

For example, the following image shows a target leakage report for a diabetes classification problem, that is, predicting whether a person has diabetes. An AUC-ROC score is calculated for each of five features, and all are determined to be safe from target leakage.

![\[Example target leakage report in the Data Wrangler console.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/target-leakage.png)


## Multicollinearity


Multicollinearity is a circumstance where two or more predictor variables are related to each other. The predictor variables are the features in your dataset that you're using to predict a target variable. When you have multicollinearity, the predictor variables are not only predictive of the target variable, but also predictive of each other.

You can use the **Variance Inflation Factor (VIF)**, **Principal Component Analysis (PCA)**, or **Lasso feature selection** as measures for the multicollinearity in your data. For more information, see the following.

------
#### [ Variance Inflation Factor (VIF) ]

The Variance Inflation Factor (VIF) measures how strongly each variable is correlated with the other predictor variables. Data Wrangler returns a VIF score for each variable. A VIF score is a positive number that is greater than or equal to 1.

A score of 1 means that the variable is uncorrelated with the other variables. Scores greater than 1 indicate higher correlation.

Theoretically, you can have a VIF score with a value of infinity. Data Wrangler clips high scores to 50. If you have a VIF score greater than 50, Data Wrangler sets the score to 50.

You can use the following guidelines to interpret your VIF scores:
+ A VIF score less than 5 indicates that the variable is moderately correlated with the other variables.
+ A VIF score of 5 or greater indicates that the variable is highly correlated with the other variables.
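The VIF for feature *i* can be computed as `1 / (1 - R_i^2)`, where `R_i^2` comes from regressing feature *i* on the remaining features. The following NumPy sketch, including the clipping at 50 described above, illustrates the calculation; it is not Data Wrangler's code.

```
import numpy as np

def vif_scores(X, clip=50.0):
    # VIF_i = 1 / (1 - R_i^2), where R_i^2 comes from regressing
    # column i on the remaining columns (with an intercept term).
    n, k = X.shape
    scores = []
    for i in range(k):
        y = X[:, i]
        others = np.column_stack([np.delete(X, i, axis=1), np.ones(n)])
        coef, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ coef
        ss_res = float(resid @ resid)
        ss_tot = float(((y - y.mean()) ** 2).sum())
        r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else 0.0
        vif = 1.0 / (1.0 - r2) if r2 < 1.0 else float("inf")
        scores.append(min(vif, clip))  # clip high scores to 50
    return scores
```

Independent columns score near 1; a duplicated column is clipped to 50.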

------
#### [ Principal Component Analysis (PCA) ]

Principal Component Analysis (PCA) measures the variance of the data along different directions in the feature space. The feature space consists of all the predictor variables that you use to predict the target variable in your dataset.

For example, if you're trying to predict who survived on the *RMS Titanic* after it hit an iceberg, your feature space can include the passengers' age, gender, and the fare that they paid.

From the feature space, PCA generates an ordered list of variances. These variances are also known as singular values. The values in the list of variances are greater than or equal to 0. We can use them to determine how much multicollinearity there is in our data.

When the numbers are roughly uniform, the data has very few instances of multicollinearity. When there is a lot of variability among the values, we have many instances of multicollinearity. Before it performs PCA, Data Wrangler normalizes each feature to have a mean of 0 and a standard deviation of 1.

**Note**  
PCA in this circumstance can also be referred to as Singular Value Decomposition (SVD).
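In code, this amounts to standardizing each feature and taking the singular values of the standardized matrix. A NumPy sketch (illustrative only):

```
import numpy as np

def pca_singular_values(X):
    # Normalize each feature to mean 0 and standard deviation 1,
    # then return the singular values of the standardized matrix.
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    return np.linalg.svd(Z, compute_uv=False)
```

A singular value near zero signals a linear dependence among the features, that is, multicollinearity.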

------
#### [ Lasso feature selection ]

Lasso feature selection uses the L1 regularization technique to only include the most predictive features in your dataset.

For both classification and regression, the regularization technique generates a coefficient for each feature. The absolute value of the coefficient provides an importance score for the feature. A higher importance score indicates that it is more predictive of the target variable. A common feature selection method is to use all the features that have a non-zero lasso coefficient.
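The effect of L1 regularization can be illustrated with a simplified coordinate-descent lasso for regression. This sketch is an assumption-laden stand-in for the actual training procedure: features with no real predictive signal get a coefficient of exactly zero.

```
import numpy as np

def lasso_coefficients(X, y, alpha=0.1, n_iter=200):
    # Coordinate descent minimizing (1/2n)||y - Xw||^2 + alpha * ||w||_1.
    # Soft-thresholding drives uninformative coefficients to exactly 0.
    n, k = X.shape
    w = np.zeros(k)
    for _ in range(n_iter):
        for j in range(k):
            residual = y - X @ w + X[:, j] * w[j]
            rho = X[:, j] @ residual / n
            z = X[:, j] @ X[:, j] / n
            w[j] = np.sign(rho) * max(abs(rho) - alpha, 0.0) / z
    return w
```

With a target that depends only on the first feature, the second feature's coefficient is driven to zero, so only the first would be selected.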

------

## Detect Anomalies In Time Series Data


You can use the anomaly detection visualization to see outliers in your time series data. To determine what counts as an anomaly, Data Wrangler decomposes the time series into a predicted term and an error term. The seasonality and trend of the time series are treated as the predicted term, and the residuals are treated as the error term.

For the error term, you specify a threshold as the number of standard deviations the residual can be away from the mean before it is considered an anomaly. For example, you can specify a threshold of 3 standard deviations. Any residual more than 3 standard deviations from the mean is an anomaly.
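This rule is straightforward to express in code. A plain-Python illustration (not Data Wrangler's implementation):

```
def flag_anomalies(residuals, threshold=3.0):
    # Flag residuals that are more than `threshold` standard
    # deviations away from the mean of the residuals.
    mean = sum(residuals) / len(residuals)
    std = (sum((r - mean) ** 2 for r in residuals) / len(residuals)) ** 0.5
    return [abs(r - mean) > threshold * std for r in residuals]
```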

You can use the following procedure to perform an **Anomaly detection** analysis.

1. Open your Data Wrangler data flow.

1. In your data flow, under **Data types**, choose the **+**, and select **Add analysis**.

1. For **Analysis type**, choose **Time Series**.

1. For **Visualization**, choose **Anomaly detection**.

1. For **Anomaly threshold**, choose the threshold at which a value is considered an anomaly.

1. Choose **Preview** to generate a preview of the analysis.

1. Choose **Add** to add the transform to the Data Wrangler data flow.

## Seasonal Trend Decomposition In Time Series Data


You can determine whether there's seasonality in your time series data by using the Seasonal Trend Decomposition visualization. Data Wrangler uses the STL (Seasonal Trend decomposition using LOESS) method to perform the decomposition, splitting the time series into its seasonal, trend, and residual components. The trend reflects the long-term progression of the series. The seasonal component is a signal that recurs over a fixed time period. After removing the trend and the seasonal components from the time series, you have the residual.
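To illustrate the idea of splitting a series into trend, seasonal, and residual components, here is a naive additive decomposition in plain Python. It stands in for STL only conceptually: it uses a centered moving average for the trend and per-position seasonal means, whereas STL uses LOESS smoothing.

```
def decompose(series, period):
    # Naive additive decomposition into trend, seasonal, and residual
    # components (illustrative stand-in for STL).
    n = len(series)
    half = period // 2
    trend = []
    for i in range(n):
        lo, hi = max(0, i - half), min(n, i + half + 1)
        trend.append(sum(series[lo:hi]) / (hi - lo))  # centered moving average
    detrended = [s - t for s, t in zip(series, trend)]
    seasonal_means = [
        sum(detrended[i::period]) / len(detrended[i::period])
        for i in range(period)
    ]
    seasonal = [seasonal_means[i % period] for i in range(n)]
    residual = [s - t - c for s, t, c in zip(series, trend, seasonal)]
    return trend, seasonal, residual
```

By construction, trend + seasonal + residual reconstructs the original series, and the seasonal component repeats with the chosen period.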

You can use the following procedure to perform a **Seasonal-Trend decomposition** analysis.

1. Open your Data Wrangler data flow.

1. In your data flow, under **Data types**, choose the **+**, and select **Add analysis**.

1. For **Analysis type**, choose **Time Series**.

1. For **Visualization**, choose **Seasonal-Trend decomposition**.

1. Choose **Preview** to generate a preview of the analysis.

1. Choose **Add** to add the transform to the Data Wrangler data flow.

## Bias Report


You can use the bias report in Data Wrangler to uncover potential biases in your data. To generate a bias report, you must specify the target column, or **Label**, that you want to predict and a **Facet**, or the column that you want to inspect for biases.

**Label**: The feature about which you want a model to make predictions. For example, if you are predicting customer conversion, you may select a column containing data on whether or not a customer has placed an order. You must also specify whether this feature is a label or a threshold. If you specify a label, you must specify what a *positive outcome* looks like in your data. In the customer conversion example, a positive outcome may be a 1 in the orders column, representing the positive outcome of a customer placing an order within the last three months. If you specify a threshold, you must specify a lower bound that defines a positive outcome. For example, if your customer orders column contains the number of orders placed in the last year, you may want to specify 1.

**Facet**: The column that you want to inspect for biases. For example, if you are trying to predict customer conversion, your facet may be the age of the customer. You may choose this facet because you believe that your data is biased toward a certain age group. You must identify whether the facet is measured as a value or threshold. For example, if you wanted to inspect one or more specific ages, you select **Value** and specify those ages. If you want to look at an age group, you select **Threshold** and specify the threshold of ages you want to inspect.

After you select your feature and label, you select the types of bias metrics you want to calculate.
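As an example of what these metrics compute, one common pre-training metric is class imbalance: the normalized difference in size between the advantaged and disadvantaged facet groups. A minimal sketch (illustrative; see the bias report documentation for the full metric definitions):

```
def class_imbalance(facet_values, advantaged):
    # CI = (n_a - n_d) / (n_a + n_d); ranges from -1 to 1,
    # where n_a counts the advantaged group and n_d everyone else.
    n_a = sum(1 for v in facet_values if v == advantaged)
    n_d = len(facet_values) - n_a
    return (n_a - n_d) / (n_a + n_d)
```

A value near 0 indicates balanced groups; values near -1 or 1 indicate severe imbalance.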

To learn more, see [Generate reports for bias in pre-training data](https://docs.aws.amazon.com/sagemaker/latest/dg/data-bias-reports.html). 

## Create Custom Visualizations


You can add an analysis to your Data Wrangler flow to create a custom visualization. Your dataset, with all the transformations you've applied, is available as a [Pandas DataFrame](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html). Data Wrangler uses the `df` variable to store the dataframe. You access the dataframe by calling the variable.

You must provide the output variable, `chart`, to store an [Altair](https://altair-viz.github.io/) output chart. For example, you can use the following code block to create a custom histogram using the Titanic dataset.

```
import altair as alt

# The transformed dataset is available as the pandas DataFrame `df`
df = df.iloc[:30]
df = df.rename(columns={"Age": "value"})
df = df.assign(count=df.groupby('value').value.transform('count'))
df = df[["value", "count"]]
base = alt.Chart(df)
bar = base.mark_bar().encode(x=alt.X('value', bin=True, axis=None), y=alt.Y('count'))
rule = base.mark_rule(color='red').encode(
    x='mean(value):Q',
    size=alt.value(5))
# Data Wrangler renders the Altair chart stored in the `chart` variable
chart = bar + rule
```

**To create a custom visualization:**

1. Next to the node containing the transformation that you'd like to visualize, choose the **+**.

1. Choose **Add analysis**.

1. For **Analysis type**, choose **Custom Visualization**.

1. For **Analysis name**, specify a name.

1. Enter your code in the code box. 

1. Choose **Preview** to preview your visualization.

1. Choose **Save** to add your visualization.

![\[Example on how to add your visualization in the Data Wrangler console.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/custom-visualization.png)


If you don’t know how to use the Altair visualization package in Python, you can use custom code snippets to help you get started.

Data Wrangler has a searchable collection of visualization snippets. To use a visualization snippet, choose **Search example snippets** and specify a query in the search bar.

The following example uses the **Binned scatterplot** code snippet. It plots a two-dimensional binned scatterplot, which is analogous to a histogram across two dimensions.

The snippets have comments to help you understand the changes that you need to make to the code. You usually need to specify the column names of your dataset in the code.

```
import altair as alt

# Specify the number of top rows for plotting
rows_number = 1000
df = df.head(rows_number)  
# You can also choose bottom rows or randomly sampled rows
# df = df.tail(rows_number)
# df = df.sample(rows_number)


chart = (
    alt.Chart(df)
    .mark_circle()
    .encode(
        # Specify the column names for binning and number of bins for X and Y axis
        x=alt.X("col1:Q", bin=alt.Bin(maxbins=20)),
        y=alt.Y("col2:Q", bin=alt.Bin(maxbins=20)),
        size="count()",
    )
)

# :Q specifies that label column has quantitative type.
# For more details on Altair typing refer to
# https://altair-viz.github.io/user_guide/encoding.html#encoding-data-types
```

# Reusing Data Flows for Different Datasets


For Amazon Simple Storage Service (Amazon S3) data sources, you can create and use parameters. A parameter is a variable that you've saved in your Data Wrangler flow. Its value can be any portion of the data source's Amazon S3 path. Use parameters to quickly change the data that you're importing into a Data Wrangler flow or exporting to a processing job. You can also use parameters to select and import a specific subset of your data.

After you've created a Data Wrangler flow, you might train a model on the data that you've transformed. For datasets that have the same schema, you can use parameters to apply the same transformations to a different dataset and train a different model. You can use the new datasets to perform inference with your model or to retrain it.

In general, parameters have the following attributes:
+ Name – The name you specify for the parameter
+ Type – The type of value that the parameter represents
+ Default value – The value of the parameter when you don't specify a new value

**Note**  
Datetime parameters have a time range attribute that they use as the default value.

Data Wrangler uses curly braces, `{{}}`, to indicate that a parameter is being used in the Amazon S3 path. For example, you can have a URL such as `s3://amzn-s3-demo-bucket1/{{example_parameter_name}}/example-dataset.csv`.

You create a parameter when you're editing the Amazon S3 data source that you've imported. You can set any portion of the file path to a parameter value. You can set the parameter value to either a value or a pattern. The following are the available parameter value types in the Data Wrangler flow:
+ Number
+ String
+ Pattern
+ Datetime

**Note**  
You can't create a pattern parameter or a datetime parameter for the name of the bucket in the Amazon S3 path.

You must set a number as the default value of a number parameter. You can change the value of the parameter to a different number when you're editing a parameter or when you're launching a processing job. For example, in the S3 path, `s3://amzn-s3-demo-bucket/example-prefix/example-file-1.csv`, you can create a number parameter named `number_parameter` in the place of `1`. Your S3 path now appears as `s3://amzn-s3-demo-bucket/example-prefix/example-file-{{number_parameter}}.csv`. The path continues to point to the `example-file-1.csv` dataset until you change the value of the parameter. If you change the value of `number_parameter` to `2`, the path is now `s3://amzn-s3-demo-bucket/example-prefix/example-file-2.csv`. You can import `example-file-2.csv` into Data Wrangler if you've uploaded the file to that Amazon S3 location.
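Conceptually, resolving a parameterized path is a template substitution over the `{{name}}` placeholders. Data Wrangler performs this internally; the helper below is an illustrative sketch.

```
import re

def resolve_path(template, params):
    # Replace each {{name}} placeholder with its parameter value.
    return re.sub(r"\{\{(\w+)\}\}", lambda m: str(params[m.group(1)]), template)
```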

A string parameter stores a string as its default value. For example, in the S3 path, `s3://amzn-s3-demo-bucket/example-prefix/example-file-1.csv`, you can create a string parameter named `string_parameter` in the place of the filename, `example-file-1.csv`. The path now appears as `s3://amzn-s3-demo-bucket/example-prefix/{{string_parameter}}`. It continues to match `s3://amzn-s3-demo-bucket/example-prefix/example-file-1.csv`, until you change the value of the parameter.

Instead of specifying the filename as a string parameter, you can create a string parameter using the entire Amazon S3 path. You can specify a dataset from any Amazon S3 location in the string parameter.

A pattern parameter stores a regular expression (Python REGEX) string as its default value. You can use a pattern parameter to import multiple data files at the same time. To import more than one object at a time, specify a parameter value that matches the Amazon S3 objects that you're importing.

For example, you can create a single pattern parameter that matches all of the following datasets:
+ s3://amzn-s3-demo-bucket1/example-prefix/example-file-1.csv
+ s3://amzn-s3-demo-bucket1/example-prefix/example-file-2.csv
+ s3://amzn-s3-demo-bucket1/example-prefix/example-file-10.csv
+ s3://amzn-s3-demo-bucket1/example-prefix/example-file-0123.csv

For `s3://amzn-s3-demo-bucket1/example-prefix/example-file-1.csv`, you can create a pattern parameter in the place of `1`, and set the default value of the parameter to `\d+`. The `\d+` REGEX string matches any one or more decimal digits. If you create a pattern parameter named `pattern_parameter`, your S3 path appears as `s3://amzn-s3-demo-bucket1/example-prefix/example-file-{{pattern_parameter}}.csv`.

You can also use pattern parameters to match all CSV objects within your bucket. To match all objects in a bucket, create a pattern parameter with the default value of `.*` and set the path to `s3://amzn-s3-demo-bucket/{{pattern_parameter}}.csv`. The `.*` pattern matches any sequence of characters in the path.

The `s3://amzn-s3-demo-bucket/{{pattern_parameter}}.csv` path can match the following datasets.
+ `example-file-1.csv`
+ `other-example-file.csv`
+ `example-file-a.csv`
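To see how pattern parameters select objects, you can expand the parameter into the path and match object keys against the resulting regular expression. An illustrative sketch (the helper name is hypothetical):

```
import re

def matching_keys(keys, template, pattern_value):
    # Expand the {{pattern_parameter}} placeholder, then keep only
    # the keys that fully match the resulting regular expression.
    regex = re.compile(template.replace("{{pattern_parameter}}", pattern_value))
    return [k for k in keys if regex.fullmatch(k)]
```

With a value of `\d+`, only numbered files match; with `.*`, every file matches.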

A datetime parameter stores the following information:
+ A format for parsing datetime strings inside an Amazon S3 path.
+ A relative time range that limits the datetime values that match.

For example, in the Amazon S3 file path, `s3://amzn-s3-demo-bucket/2020/01/01/example-dataset.csv`, 2020/01/01 represents a datetime in the format of `year/month/day`. You can set the parameter’s time range to an interval such as `1 years` or `24 hours`. An interval of `1 years` matches all S3 paths with datetimes that fall between the current time and the time exactly a year before the current time. The current time is the time when you start exporting the transformations that you've made to the data. For more information about exporting data, see [Export](data-wrangler-data-export.md). If the current date is 2022/01/01 and the time range is `1 years`, the S3 path matches datasets such as the following:
+ s3://amzn-s3-demo-bucket/2021/01/01/example-dataset.csv
+ s3://amzn-s3-demo-bucket/2021/06/30/example-dataset.csv
+ s3://amzn-s3-demo-bucket/2021/12/31/example-dataset.csv

The datetime values within a relative time range change as time passes. The S3 paths that fall within the relative time range might also differ.
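The matching logic can be pictured as parsing the datetime out of each path and keeping the paths whose datetime falls inside the relative range. An illustrative sketch, where a fixed `now` and a range in days stand in for Data Wrangler's export-time evaluation and interval strings such as `1 years`:

```
from datetime import datetime, timedelta

def paths_in_range(paths, now, days):
    # Keep paths whose yyyy/MM/dd segment falls within the last `days` days.
    kept = []
    for path in paths:
        stamp = "/".join(path.split("/")[3:6])  # e.g. "2021/06/30"
        ts = datetime.strptime(stamp, "%Y/%m/%d")
        if now - timedelta(days=days) <= ts <= now:
            kept.append(path)
    return kept
```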

For the Amazon S3 file path, `s3://amzn-s3-demo-bucket1/20200101/example-dataset.csv`, `20200101` is an example of a path segment that can become a datetime parameter.

To view a table of all parameters that you've created in a Data Wrangler flow, choose the **{{}}** to the right of the text box containing the Amazon S3 path. If you no longer need a parameter that you've created, you can edit or delete it by choosing the icons to the right of the parameter.

**Important**  
Before you delete a parameter, make sure that you haven't used it anywhere in your Data Wrangler flow. Deleted parameters that are still within the flow cause errors.

You can create parameters for any step of your Data Wrangler flow. You can edit or delete any parameter that you create. If you're applying transformations to data that is no longer relevant to your use case, you can modify the values of parameters. Modifying the values of the parameters changes the data that you're importing.

The following sections provide additional examples and general guidance on using parameters. You can use the sections to understand the parameters that work best for you.

**Note**  
The following sections contain procedures that use the Data Wrangler interface to override the parameters and create a processing job.  
You can also override the parameters by using the following procedures.  
To export your Data Wrangler flow and override the value of a parameter, do the following.  
1. Choose the **+** next to the node that you want to export.  
2. Choose **Export to**.  
3. Choose the location where you're exporting the data.  
4. Under `parameter_overrides`, specify different values for the parameters that you've created.  
5. Run the Jupyter Notebook.

## Applying a Data Wrangler flow to files using patterns


You can use parameters to apply transformations in your Data Wrangler flow to different files that match a pattern in the Amazon S3 URI path. Patterns give you precise control over which files in your S3 bucket you transform. For example, you might have a dataset with the path `s3://amzn-s3-demo-bucket1/example-prefix-0/example-prefix-1/example-prefix-2/example-dataset.csv`. Different datasets named `example-dataset.csv` are stored under many different example prefixes. The prefixes might also be numbered sequentially. You can create patterns for the numbers in the Amazon S3 URI. Pattern parameters use REGEX to select any number of files that match the pattern of the expression. The following REGEX patterns might be useful:
+ `.*` – Matches zero or more of any character, except newline characters
+ `.+` – Matches one or more of any character, excluding newline characters
+ `\d+` – Matches one or more of any decimal digit
+ `\w+` – Matches one or more of any alphanumeric character
+ `[abc_-]{2,4}` – Matches a string of two, three, or four characters composed of the characters provided within the brackets (the hyphen is placed last so that it is treated literally)
+ `abc|def` – Matches one string or another. For example, the operation matches either `abc` or `def`
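You can verify how patterns like these behave with Python's `re` module. `fullmatch` requires the entire string to match, which mirrors how a pattern parameter must account for its whole path segment:

```
import re

# Each pattern with a string it accepts and, where useful, one it rejects.
assert re.fullmatch(r".*", "anything, including the empty string") is not None
assert re.fullmatch(r".+", "x") and not re.fullmatch(r".+", "")
assert re.fullmatch(r"\d+", "137") and not re.fullmatch(r"\d+", "13a")
assert re.fullmatch(r"\w+", "file_01") and not re.fullmatch(r"\w+", "file-01")
# Inside brackets, a hyphen must be first or last to count as a literal character.
assert re.fullmatch(r"[abc_-]{2,4}", "ab_c") and not re.fullmatch(r"[abc_-]{2,4}", "abcde")
assert re.fullmatch(r"abc|def", "def") and not re.fullmatch(r"abc|def", "abcdef")
```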

You can replace each number in the following paths with a single parameter that has a value of `\d+`.
+ `s3://amzn-s3-demo-bucket1/example-prefix-3/example-prefix-4/example-prefix-5/example-dataset.csv`
+ `s3://amzn-s3-demo-bucket1/example-prefix-8/example-prefix-12/example-prefix-13/example-dataset.csv`
+ `s3://amzn-s3-demo-bucket1/example-prefix-4/example-prefix-9/example-prefix-137/example-dataset.csv`

The following procedure creates a pattern parameter for a dataset with the path `s3://amzn-s3-demo-bucket1/example-prefix-0/example-prefix-1/example-prefix-2/example-dataset.csv`.

To create a pattern parameter, do the following.

1. Next to the dataset that you've imported, choose **Edit dataset**.

1. Highlight the `0` in `example-prefix-0`.

1. Specify values for the following fields:
   + **Name** – A name for the parameter
   + **Type** – **Pattern**
   + **Value** – `\d+`, a regular expression that matches one or more digits

1. Choose **Create**.

1. Replace the `1` and the `2` in S3 URI path with the parameter. The path should have the following format: `s3://amzn-s3-demo-bucket1/example-prefix-{{example_parameter_name}}/example-prefix-{{example_parameter_name}}/example-prefix-{{example_parameter_name}}/example-dataset.csv`

The following is a general procedure for creating a pattern parameter.

1. Navigate to your Data Wrangler flow.

1. Next to the dataset that you've imported, choose **Edit dataset**.

1. Highlight the portion of the URI that you're using as the value of the pattern parameter.

1. Choose **Create custom parameter**.

1. Specify values for the following fields:
   + **Name** – A name for the parameter
   + **Type** – **Pattern**
   + **Value** – A regular expression containing the pattern that you'd like to store.

1. Choose **Create**.

## Applying a Data Wrangler flow to files using numeric values


You can use parameters to apply transformations in your Data Wrangler flow to different files that have similar paths. For example, you might have a dataset with the path `s3://amzn-s3-demo-bucket1/example-prefix-0/example-prefix-1/example-prefix-2/example-dataset.csv`.

You might have the transformations from your Data Wrangler flow that you've applied to datasets under `example-prefix-1`. You might want to apply the same transformations to `example-dataset.csv` that falls under `example-prefix-10` or `example-prefix-20`.

You can create a parameter that stores the value `1`. If you want to apply the transformations to different datasets, you can create processing jobs that replace the value of the parameter with a different value. The parameter acts as a placeholder for you to change when you want to apply the transformations from your Data Wrangler flow to new data. You can override the value of the parameter when you create a Data Wrangler processing job to apply the transformations in your Data Wrangler flow to different datasets.

Use the following procedure to create numeric parameters for `s3://amzn-s3-demo-bucket1/example-prefix-0/example-prefix-1/example-prefix-2/example-dataset.csv`.

To create parameters for the preceding S3 URI path, do the following.

1. Navigate to your Data Wrangler flow.

1. Next to the dataset that you've imported, choose **Edit dataset**.

1. Highlight the number in an example prefix of `example-prefix-number`.

1. Choose **Create custom parameter**.

1. For **Name**, specify a name for the parameter.

1. For **Type**, choose **Integer**.

1. For **Value**, specify the number.

1. Create parameters for the remaining numbers by repeating the procedure.

After you've created the parameters, apply the transforms to your dataset and create a destination node for them. For more information about destination nodes, see [Export](data-wrangler-data-export.md).

Use the following procedure to apply the transformations from your Data Wrangler flow to a different dataset. It assumes that you've created a destination node for the transformations in your flow.

To change the value of a numeric parameter in a Data Wrangler processing job, do the following.

1. From your Data Wrangler flow, choose **Create job**.

1. Select only the destination node that contains the transformations to the dataset containing the numeric parameters.

1. Choose **Configure job**.

1. Choose **Parameters**.

1. Choose the name of a parameter that you've created.

1. Change the value of the parameter.

1. Repeat the procedure for the other parameters.

1. Choose **Run**.

## Applying a Data Wrangler flow to files using strings


You can use parameters to apply transformations in your Data Wrangler flow to different files that have similar paths. For example, you might have a dataset with the path `s3://amzn-s3-demo-bucket1/example-prefix/example-dataset.csv`.

You might have transformations from your Data Wrangler flow that you've applied to datasets under `example-prefix`. You might want to apply the same transformations to `example-dataset.csv` under `another-example-prefix` or `example-prefix-20`.

You can create a parameter that stores the value `example-prefix`. If you want to apply the transformations to different datasets, you can create processing jobs that replace the value of the parameter with a different value. The parameter acts as a placeholder for you to change when you want to apply the transformations from your Data Wrangler flow to new data. You can override the value of the parameter when you create a Data Wrangler processing job to apply the transformations in your Data Wrangler flow to different datasets.

Use the following procedure to create a string parameter for `s3://amzn-s3-demo-bucket1/example-prefix/example-dataset.csv`.

To create a parameter for the preceding S3 URI path, do the following.

1. Navigate to your Data Wrangler flow.

1. Next to the dataset that you've imported, choose **Edit dataset**.

1. Highlight the example prefix, `example-prefix`.

1. Choose **Create custom parameter**.

1. For **Name**, specify a name for the parameter.

1. For **Type**, choose **String**.

1. For **Value**, specify the prefix.

After you've created the parameter, apply the transforms to your dataset and create a destination node for them. For more information about destination nodes, see [Export](data-wrangler-data-export.md).

Use the following procedure to apply the transformations from your Data Wrangler flow to a different dataset. It assumes that you've created a destination node for the transformations in your flow.

To change the value of a string parameter in a Data Wrangler processing job, do the following:

1. From your Data Wrangler flow, choose **Create job**.

1. Select only the destination node that contains the transformations to the dataset containing the string parameters.

1. Choose **Configure job**.

1. Choose **Parameters**.

1. Choose the name of a parameter that you've created.

1. Change the value of the parameter.

1. Repeat the procedure for the other parameters.

1. Choose **Run**.

## Applying a Data Wrangler flow to different datetime ranges


Use datetime parameters to apply transformations in your Data Wrangler flow to different time ranges. Highlight the portion of the Amazon S3 URI that has a timestamp and create a parameter for it. When you create a parameter, you specify a time range from the current time to a time in the past. For example, you might have an Amazon S3 URI that looks like the following: `s3://amzn-s3-demo-bucket1/example-prefix/2022/05/15/example-dataset.csv`. You can save `2022/05/15` as a datetime parameter. If you specify a year as the time range, the time range includes the moment that you run the processing job containing the datetime parameter and the time exactly one year ago. If the moment you're running the processing job is September 6th, 2022 or `2022/09/06`, the time ranges can include the following:
+ `s3://amzn-s3-demo-bucket1/example-prefix/2022/03/15/example-dataset.csv`
+ `s3://amzn-s3-demo-bucket1/example-prefix/2022/01/08/example-dataset.csv`
+ `s3://amzn-s3-demo-bucket1/example-prefix/2022/07/31/example-dataset.csv`
+ `s3://amzn-s3-demo-bucket1/example-prefix/2021/09/07/example-dataset.csv`

The transformations in the Data Wrangler flow apply to all of the preceding prefixes. Changing the value of the parameter in the processing job doesn't change the value of the parameter in the Data Wrangler flow. To apply the transformations to datasets within a different time range, do the following:

1. Create a destination node containing all the transformations that you'd like to use.

1. Create a Data Wrangler job.

1. Configure the job to use a different time range for the parameter.

For more information about destination nodes and Data Wrangler jobs, see [Export](data-wrangler-data-export.md).

The following procedure creates a datetime parameter for the Amazon S3 path: `s3://amzn-s3-demo-bucket1/example-prefix/2022/05/15/example-dataset.csv`.

To create a datetime parameter for the preceding S3 URI path, do the following.

1. Navigate to your Data Wrangler flow.

1. Next to the dataset that you've imported, choose **Edit dataset**.

1. Highlight the portion of the URI that you're using as the value of the datetime parameter.

1. Choose **Create custom parameter**.

1. For **Name**, specify a name for the parameter.

1. For **Type**, choose **Datetime**.
**Note**  
By default, Data Wrangler selects **Predefined**, which provides a dropdown menu for you to select a date format. However, the timestamp format that you're using might not be available. Instead of using **Predefined** as the default option, you can choose **Custom** and specify the timestamp format manually.

1. For **Date format**, open the dropdown menu following **Predefined** and choose **yyyy/MM/dd**. The **yyyy/MM/dd** format corresponds to the year/month/day of the timestamp.

1. For **Timezone**, choose a time zone.
**Note**  
The data that you're analyzing might have time stamps taken in a different time zone from your time zone. Make sure that the time zone that you select matches the time zone of the data. 

1. For **Time range**, specify the time range for the parameter.

1. (Optional) Enter a description to describe how you're using the parameter.

1. Choose **Create**.

After you've created the datetime parameters, apply the transforms to your dataset and create a destination node for them. For more information about destination nodes, see [Export](data-wrangler-data-export.md).

Use the following procedure to apply the transformations from your Data Wrangler flow to a different time range. It assumes that you've created a destination node for the transformations in your flow.

To change the value of a datetime parameter in a Data Wrangler processing job, do the following:

1. From your Data Wrangler flow, choose **Create job**.

1. Select only the destination node that contains the transformations to the dataset containing the datetime parameters.

1. Choose **Configure job**.

1. Choose **Parameters**.

1. Choose the name of the datetime parameter that you've created.

1. For **Time range**, change the time range for the datasets.

1. Choose **Run**.

# Export


In your Data Wrangler flow, you can export some or all of the transformations that you've made to your data processing pipelines.

A *Data Wrangler flow* is the series of data preparation steps that you've performed on your data. In your data preparation, you perform one or more transformations to your data. Each transformation is done using a transform step. The flow has a series of nodes that represent the import of your data and the transformations that you've performed. For an example of nodes, see the following image.

![\[Example data flow in the Data Wrangler console.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/data-wrangler-destination-nodes-photo-0.png)


The preceding image shows a Data Wrangler flow with two nodes. The **Source - sampled** node shows the data source from which you've imported your data. The **Data types** node indicates that Data Wrangler has performed a transformation to convert the dataset into a usable format. 

Each transformation that you add to the Data Wrangler flow appears as an additional node. For information on the transforms that you can add, see [Transform Data](data-wrangler-transform.md). For example, a flow might include a **Rename-column** node that changes the name of a column in a dataset.

You can export your data transformations to the following:
+ Amazon S3
+ Pipelines
+ Amazon SageMaker Feature Store
+ Python Code

**Important**  
We recommend that you use the IAM `AmazonSageMakerFullAccess` managed policy to grant permission to use Data Wrangler. If you don't use the managed policy, you can use an IAM policy that gives Data Wrangler access to an Amazon S3 bucket. For more information about the policy, see [Security and Permissions](data-wrangler-security.md).

When you export your data flow, you're charged for the AWS resources that you use. You can use cost allocation tags to organize and manage the costs of those resources. You create these tags for your user profile, and Data Wrangler automatically applies them to the resources used to export the data flow. For more information, see [Using Cost Allocation Tags](https://docs.aws.amazon.com/awsaccountbilling/latest/aboutv2/cost-alloc-tags.html).

## Export to Amazon S3


Data Wrangler gives you the ability to export your data to a location within an Amazon S3 bucket. You can specify the location using one of the following methods:
+ Destination node – Where Data Wrangler stores the data after it has processed it.
+ Export to – Exports the data resulting from a transformation to Amazon S3.
+ Export data – For small datasets, quickly exports the data that you've transformed.

Use the following sections to learn more about each of these methods.

------
#### [ Destination Node ]

If you want to output a series of data processing steps that you've performed to Amazon S3, you create a destination node. A *destination node* tells Data Wrangler where to store the data after you've processed it. After you create a destination node, you create a processing job to output the data. A *processing job* is an Amazon SageMaker processing job. When you use a destination node, the processing job provisions the computational resources needed to output the transformed data to Amazon S3. 

You can use a destination node to export some of the transformations or all of the transformations that you've made in your Data Wrangler flow.

You can use multiple destination nodes to export different transformations or sets of transformations. The following example shows two destination nodes in a single Data Wrangler flow.

![\[Example data flow showing two destination nodes in the Data Wrangler console.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/data-wrangler-destination-nodes-photo-4.png)


You can use the following procedure to create destination nodes and export them to an Amazon S3 bucket.

To export your data flow, you create destination nodes and a Data Wrangler job to export the data. Creating a Data Wrangler job starts a SageMaker Processing job to export your flow. You can choose the destination nodes that you want to export after you've created them.
**Note**  
You can choose **Create job** in the Data Wrangler flow to view the instructions to use a processing job.

Use the following procedure to create destination nodes.

1. Choose the **+** next to the nodes that represent the transformations that you want to export.

1. Choose **Add destination**.  
![\[Example data flow showing how to add a destination in the Data Wrangler console.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/destination-nodes/destination-nodes-add-destination-0.png)

1. Choose **Amazon S3**.  
![\[Example dataflow showing how to add destination in the Data Wrangler console.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/destination-nodes/destination-nodes-add-destination-S3-selected.png)

1. Specify the following fields.
   + **Dataset name** – The name that you specify for the dataset that you're exporting.
   + **File type** – The format of the file that you're exporting.
   + **Delimiter** (CSV and Parquet files only) – The character used to separate values in the file.
   + **Compression** (CSV and Parquet files only) – The compression method used to reduce the file size. You can use the following compression methods:
     + bzip2
     + deflate
     + gzip
   + (Optional) **Amazon S3 location** – The S3 location that you're using to output the files.
   + (Optional) **Number of partitions** – The number of output files that Data Wrangler writes as the output of the processing job.
   + (Optional) **Partition by column** – Writes all records that have the same unique value in the column to the same partition.
   + (Optional) **Inference Parameters** – Selecting **Generate inference artifact** applies all of the transformations you've used in the Data Wrangler flow to data coming into your inference pipeline. The model in your pipeline makes predictions on the transformed data.

1. Choose **Add destination**.

Use the following procedure to create a processing job.

Create a job from the **Data flow** page and choose the destination nodes that you want to export.
**Note**  
You can choose **Create job** in the Data Wrangler flow to view the instructions for creating a processing job.

1. Choose **Create job**. The following image shows the pane that appears after you select **Create job**.  
![\[Example data flow create job pane in the Data Wrangler console.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/destination-nodes/destination-nodes-create-job.png)

1. For **Job name**, specify the name of the export job.

1. Choose the destination nodes that you want to export.

1. (Optional) Specify an AWS KMS key ARN. An AWS KMS key is a cryptographic key that you can use to protect your data. For more information about AWS KMS keys, see [AWS Key Management Service](https://docs.aws.amazon.com/kms/latest/developerguide/overview.html).

1. (Optional) Under **Trained parameters**, choose **Refit** if you've done the following:
   + Sampled your dataset
   + Applied a transform that uses your data to create a new column in the dataset

   For more information about refitting the transformations you've made to an entire dataset, see [Refit Transforms to The Entire Dataset and Export Them](#data-wrangler-data-export-fit-transform).
**Note**  
For image data, Data Wrangler exports the transformations that you've made to all of the images. Refitting the transformations isn't applicable to your use case.

1. Choose **Configure job**. The following image shows the **Configure job** page.  
![\[Example data flow configure job page in the Data Wrangler console.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/destination-nodes/destination-nodes-configure-job.png)

1. (Optional) Configure the Data Wrangler job. You can make the following configurations:
   + **Job configuration**
   + **Spark memory configuration**
   + **Network configuration**
   + **Tags**
   + **Parameters**
   + **Associate Schedules**

1. Choose **Run**.

------
#### [ Export to ]

As an alternative to using a destination node, you can use the **Export to** option to export your Data Wrangler flow to Amazon S3 using a Jupyter notebook. You can choose any data node in your Data Wrangler flow and export it. Exporting the data node exports the transformation that the node represents and the transformations that precede it.

Use the following procedure to generate a Jupyter notebook and run it to export your Data Wrangler flow to Amazon S3.

1. Choose the **+** next to the node that you want to export.

1. Choose **Export to**.

1. Choose **Amazon S3 (via Jupyter Notebook)**.

1. Run the Jupyter notebook.  
![\[Example data flow showing how to export your Data Wrangler flow in the Data Wrangler console.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/data-wrangler-destination-nodes-photo-export-to.png)

When you run the notebook, it exports your data flow (.flow file) in the same AWS Region as the Data Wrangler flow.

The notebook provides options that you can use to configure the processing job and the data that it outputs.

**Important**  
We provide you with job configurations to configure the output of your data. For the partitioning and driver memory options, we strongly recommend that you don't specify a configuration unless you're already familiar with them.

Under **Job Configurations**, you can configure the following:
+ `output_content_type` – The content type of the output file. Uses `CSV` as the default format, but you can specify `Parquet`.
+ `delimiter` – The character used to separate values in the dataset when writing to a CSV file.
+ `compression` – If set, compresses the output file. Uses gzip as the default compression format.
+ `num_partitions` – The number of partitions or files that Data Wrangler writes as the output.
+ `partition_by` – The names of the columns that you use to partition the output.

To change the output file format from CSV to Parquet, change the value from `"CSV"` to `"Parquet"`. For the rest of the preceding fields, uncomment the lines containing the fields that you want to specify.
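As a sketch, a configuration cell that switches the output to compressed Parquet might look like the following. The exact variable names and defaults come from the notebook that Data Wrangler generates, so treat these values as illustrative.

```python
# Illustrative values for the job configuration fields described above
output_content_type = "Parquet"   # change from the default "CSV" to Parquet
delimiter = ","                   # used only when writing CSV output
compression = "gzip"              # compress the output files with gzip
num_partitions = 4                # number of partitions (files) to write
partition_by = ["year", "month"]  # column names to partition the output by
```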

Under **(Optional) Configure Spark Cluster Driver Memory**, you can configure Spark properties for the job, such as the Spark driver memory, in the `config` dictionary.

The following shows the `config` dictionary.

```
config = json.dumps({
    "Classification": "spark-defaults",
    "Properties": {
        "spark.driver.memory": f"{driver_memory_in_mb}m",
    }
})
```

To apply the configuration to the processing job, uncomment the following lines:

```
# data_sources.append(ProcessingInput(
#     source=config_s3_uri,
#     destination="/opt/ml/processing/input/conf",
#     input_name="spark-config",
#     s3_data_type="S3Prefix",
#     s3_input_mode="File",
#     s3_data_distribution_type="FullyReplicated"
# ))
```

------
#### [ Export data ]

If you have a transformation on a small dataset that you want to export quickly, you can use the **Export data** method. When you choose **Export data**, Data Wrangler works synchronously to export the transformed data to Amazon S3. You can't use Data Wrangler until it finishes exporting your data or you cancel the operation.

For information on using the **Export data** method in your Data Wrangler flow, see the following procedure.

To use the **Export data** method:

1. Choose a node in your Data Wrangler flow by opening (double-clicking on) it.  
![\[Example data flow showing how to export data in the Data Wrangler console.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/export-s3.png)

1. Configure how you want to export the data.

1. Choose **Export data**.

------

When you export your data flow to an Amazon S3 bucket, Data Wrangler stores a copy of the flow file in the S3 bucket under the `data_wrangler_flows` prefix. If you use the default Amazon S3 bucket to store your flow files, the bucket uses the following naming convention: `sagemaker-region-account number`. For example, if your account number is 111122223333 and you are using Studio Classic in us-east-1, your imported datasets are stored in `sagemaker-us-east-1-111122223333`. In this example, your .flow files created in us-east-1 are stored in `s3://sagemaker-region-account number/data_wrangler_flows/`. 
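The naming convention can be expressed as a small helper. This is purely illustrative; the function name is hypothetical, and the convention applies only to the default bucket.

```python
def default_flow_prefix(region: str, account_number: str) -> str:
    """Build the default S3 location where Data Wrangler stores .flow files."""
    return f"s3://sagemaker-{region}-{account_number}/data_wrangler_flows/"

# Matches the example in the text: account 111122223333 using us-east-1
print(default_flow_prefix("us-east-1", "111122223333"))
# → s3://sagemaker-us-east-1-111122223333/data_wrangler_flows/
```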

## Export to Pipelines


When you want to build and deploy large-scale machine learning (ML) workflows, you can use Pipelines to create workflows that manage and deploy SageMaker AI jobs. With Pipelines, you can build workflows that manage your SageMaker AI data preparation, model training, and model deployment jobs, and you can use the first-party algorithms that SageMaker AI offers. For more information about Pipelines, see [SageMaker Pipelines](https://docs.aws.amazon.com/sagemaker/latest/dg/pipelines.html).

When you export one or more steps from your data flow to Pipelines, Data Wrangler creates a Jupyter notebook that you can use to define, instantiate, run, and manage a pipeline.

### Use a Jupyter Notebook to Create a Pipeline


Use the following procedure to generate a Jupyter notebook and run it to export your Data Wrangler flow to Pipelines.

1. Choose the **+** next to the node that you want to export.

1. Choose **Export to**.

1. Choose **Pipelines (via Jupyter Notebook)**.

1. Run the Jupyter notebook.

![\[Example data flow showing how to export your Data Wrangler flow in the Data Wrangler console.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/data-wrangler-destination-nodes-photo-export-to.png)


You can use the Jupyter notebook that Data Wrangler produces to define a pipeline. The pipeline includes the data processing steps that are defined by your Data Wrangler flow. 

You can add additional steps to your pipeline by adding steps to the `steps` list in the following code in the notebook:

```
pipeline = Pipeline(
    name=pipeline_name,
    parameters=[instance_type, instance_count],
    steps=[step_process], #Add more steps to this list to run in your Pipeline
)
```

For more information on defining pipelines, see [Define SageMaker AI Pipeline](https://docs.aws.amazon.com/sagemaker/latest/dg/define-pipeline.html).

## Export to an Inference Endpoint


Use your Data Wrangler flow to process data at the time of inference by creating a SageMaker AI serial inference pipeline from your Data Wrangler flow. An inference pipeline is a series of steps that results in a trained model making predictions on new data. A serial inference pipeline within Data Wrangler transforms the raw data and provides it to the machine learning model for a prediction. You create, run, and manage the inference pipeline from a Jupyter notebook within Studio Classic. For more information about accessing the notebook, see [Use a Jupyter Notebook to create an inference endpoint](#data-wrangler-inference-notebook).

Within the notebook, you can either train a machine learning model or specify one that you've already trained. You can use either Amazon SageMaker Autopilot or XGBoost to train the model using the data that you've transformed in your Data Wrangler flow.

The pipeline provides the ability to perform either batch or real-time inference. You can also add the Data Wrangler flow to SageMaker Model Registry. For more information about hosting models, see [Multi-model endpoints](multi-model-endpoints.md).

**Important**  
You can't export your Data Wrangler flow to an inference endpoint if it has any of the following transformations:
+ Join
+ Concatenate
+ Group by

If you must use the preceding transforms to prepare your data, use the following procedure.

1. Create a Data Wrangler flow.

1. Apply the preceding transforms that aren't supported.

1. Export the data to an Amazon S3 bucket.

1. Create a separate Data Wrangler flow.

1. Import the data that you've exported from the preceding flow.

1. Apply the remaining transforms.

1. Create a serial inference pipeline using the Jupyter notebook that we provide.

For information about exporting your data to an Amazon S3 bucket, see [Export to Amazon S3](#data-wrangler-data-export-s3). For information about opening the Jupyter notebook used to create the serial inference pipeline, see [Use a Jupyter Notebook to create an inference endpoint](#data-wrangler-inference-notebook).

Data Wrangler ignores transforms that remove data at the time of inference. For example, Data Wrangler ignores the [Handle Missing Values](data-wrangler-transform.md#data-wrangler-transform-handle-missing) transform if you use the **Drop missing** configuration.

If you've refit transforms to your entire dataset, the transforms carry over to your inference pipeline. For example, if you used the median value to impute missing values, the median value from refitting the transform is applied to your inference requests. You can either refit the transforms from your Data Wrangler flow when you're using the Jupyter notebook or when you're exporting your data to an inference pipeline. For information about refitting transforms, see [Refit Transforms to The Entire Dataset and Export Them](#data-wrangler-data-export-fit-transform).

The serial inference pipeline supports the following data types for the input and output strings. Each data type has a set of requirements.

**Supported datatypes**
+ `text/csv` – the datatype for CSV strings
  + The string can't have a header.
  + Features used for the inference pipeline must be in the same order as features in the training dataset.
  + There must be a comma delimiter between features.
  + Records must be delimited by a newline character.

  The following is an example of a validly formatted CSV string that you can provide in an inference request.

  ```
  abc,0.0,"Doe, John",12345\ndef,1.1,"Doe, Jane",67890                    
  ```
+ `application/json` – the datatype for JSON strings
  + The features used in the dataset for the inference pipeline must be in the same order as the features in the training dataset.
  + The data must have a specific schema. You define the schema as a single `instances` key whose value is a list of objects. Each object has a `features` list that represents one observation.

  The following is an example of a validly formatted JSON string that you can provide in an inference request.

  ```
  {
      "instances": [
          {
              "features": ["abc", 0.0, "Doe, John", 12345]
          },
          {
              "features": ["def", 1.1, "Doe, Jane", 67890]
          }
      ]
  }
  ```
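Both request bodies can be built programmatically. The following sketch uses Python's standard `csv` and `json` modules to produce payloads that satisfy the requirements above: no header row and comma-delimited, newline-separated records for `text/csv`, and an `instances` list of `features` objects for `application/json`.

```python
import csv
import io
import json

records = [["abc", 0.0, "Doe, John", 12345],
           ["def", 1.1, "Doe, Jane", 67890]]

# text/csv body: no header row; the csv module quotes embedded commas
buf = io.StringIO()
writer = csv.writer(buf, lineterminator="\n")
writer.writerows(records)
csv_body = buf.getvalue().rstrip("\n")

# application/json body: one "features" object per observation
json_body = json.dumps({"instances": [{"features": r} for r in records]})

print(csv_body)
print(json_body)
```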

### Use a Jupyter Notebook to create an inference endpoint


Use the following procedure to export your Data Wrangler flow to create an inference pipeline.

To create an inference pipeline using a Jupyter notebook, do the following.

1. Choose the **+** next to the node that you want to export.

1. Choose **Export to**.

1. Choose **SageMaker AI Inference Pipeline (via Jupyter Notebook)**.

1. Run the Jupyter notebook.

When you run the Jupyter notebook, it creates an inference flow artifact. An inference flow artifact is a Data Wrangler flow file with additional metadata used to create the serial inference pipeline. The node that you're exporting encompasses all of the transforms from the preceding nodes.

**Important**  
Data Wrangler needs the inference flow artifact to run the inference pipeline. You can't use your own flow file as the artifact. You must create it by using the preceding procedure.

## Export to Python Code


Use the following procedure to generate a Jupyter notebook and run it to export all the steps in your Data Wrangler flow to a Python file that you can manually integrate into any data processing workflow.

1. Choose the **+** next to the node that you want to export.

1. Choose **Export to**.

1. Choose **Python Code**.

1. Run the Jupyter notebook.

![\[Example data flow showing how to export your Data Wrangler flow in the Data Wrangler console.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/data-wrangler-destination-nodes-photo-export-to.png)


You might need to configure the Python script to make it run in your pipeline. For example, if you're running a Spark environment, make sure that you are running the script from an environment that has permission to access AWS resources.

## Export to Amazon SageMaker Feature Store


You can use Data Wrangler to export features you've created to Amazon SageMaker Feature Store. A feature is a column in your dataset. Feature Store is a centralized store for features and their associated metadata. You can use Feature Store to create, share, and manage curated data for machine learning (ML) development. Centralized stores make your data more discoverable and reusable. For more information about Feature Store, see [Amazon SageMaker Feature Store](https://docs.aws.amazon.com/sagemaker/latest/dg/feature-store.html).

A core concept in Feature Store is a feature group. A feature group is a collection of features, their records (observations), and associated metadata. It's similar to a table in a database.

You can use Data Wrangler to do one of the following:
+ Update an existing feature group with new records. A record is an observation in the dataset.
+ Create a new feature group from a node in your Data Wrangler flow. Data Wrangler adds the observations from your datasets as records in your feature group.

If you're updating an existing feature group, your dataset's schema must match the schema of the feature group. All the records in the feature group are replaced with the observations in your dataset.

You can use either a Jupyter notebook or a destination node to update your feature group with the observations in the dataset.

If your feature groups use the Iceberg table format and have a custom offline store encryption key, make sure that you grant the IAM role that you're using for the Amazon SageMaker Processing job permissions to use it. At a minimum, you must grant it permissions to encrypt the data that you're writing to Amazon S3. To grant the permissions, give the IAM role the ability to use [GenerateDataKey](https://docs.aws.amazon.com/kms/latest/APIReference/API_GenerateDataKey.html). For more information about granting IAM roles permissions to use AWS KMS keys, see [Key policies in AWS KMS](https://docs.aws.amazon.com/kms/latest/developerguide/key-policies.html).

------
#### [ Destination Node ]

If you want to output a series of data processing steps that you've performed to a feature group, you can create a destination node. When you create and run a destination node, Data Wrangler updates a feature group with your data. You can also create a new feature group from the destination node UI. After you create a destination node, you create a processing job to output the data. A processing job is an Amazon SageMaker processing job. When you use a destination node, the processing job provisions the computational resources needed to output the transformed data to the feature group. 

You can use a destination node to export some of the transformations or all of the transformations that you've made in your Data Wrangler flow.

Use the following procedure to create a destination node to update a feature group with the observations from your dataset.

To update a feature group using a destination node, do the following.
**Note**  
You can choose **Create job** in the Data Wrangler flow to view the instructions for using a processing job to update the feature group.

1. Choose the **+** symbol next to the node containing the dataset that you'd like to export.

1. Under **Add destination**, choose **SageMaker AI Feature Store**.  
![\[Example data flow showing how to add destination in the Data Wrangler console.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/feature-store-destination-node-selection.png)

1. Choose (double-click) the feature group. Data Wrangler checks whether the schema of the feature group matches the schema of the data that you're using to update the feature group.

1. (Optional) Select **Export to offline store only** for feature groups that have both an online store and an offline store. This option only updates the offline store with observations from your dataset.

1. After Data Wrangler validates the schema of your dataset, choose **Add**.

Use the following procedure to create a new feature group with data from your dataset.

You can store your feature group in one of the following ways:
+ Online – Low-latency, high-availability cache for a feature group that provides real-time lookup of records. The online store allows quick access to the latest value for a record in a feature group.
+ Offline – Stores data for your feature group in an Amazon S3 bucket. You can store your data offline when you don't need low-latency (sub-second) reads. You can use an offline store for features used in data exploration, model training, and batch inference.
+ Both online and offline – Stores your data in both an online store and an offline store.

To create a feature group using a destination node, do the following.

1. Choose the **+** symbol next to the node containing the dataset that you'd like to export.

1. Under **Add destination**, choose **SageMaker AI Feature Store**.

1. Choose **Create Feature Group**.

1. In the following dialog box, if your dataset doesn't have an event time column, select **Create "EventTime" column**.

1. Choose **Next**.

1. Choose **Copy JSON Schema**. When you create a feature group, you paste the schema into the feature definitions.

1. Choose **Create**.

1. For **Feature group name**, specify a name for your feature group.

1. For **Description (optional)**, specify a description to make your feature group more discoverable.

1. To create a feature group for an online store, do the following.

   1. Select **Enable storage online**.

   1. For **Online store encryption key**, specify an AWS managed encryption key or an encryption key of your own.

1. To create a feature group for an offline store, do the following.

   1. Select **Enable storage offline**. Specify values for the following fields:
      + **S3 bucket name** – The name of the Amazon S3 bucket that stores the feature group.
      + (Optional) **Dataset directory name** – The Amazon S3 prefix that you're using to store the feature group.
      + **IAM Role ARN** – The IAM role that has access to Feature Store.
      + **Table Format** – Table format of your offline store. You can specify **Glue** or **Iceberg**. **Glue** is the default format.
      + **Offline store encryption key** – By default, Feature Store uses an AWS Key Management Service managed key, but you can use the field to specify a key of your own.


1. Choose **Continue**.

1. Choose **JSON**.

1. Remove the placeholder brackets in the window.

1. Paste the JSON text that you copied when you chose **Copy JSON Schema**.

1. Choose **Continue**.

1. For **RECORD IDENTIFIER FEATURE NAME**, choose the column in your dataset that has unique identifiers for each record in your dataset.

1. For **EVENT TIME FEATURE NAME**, choose the column with the timestamp values.

1. Choose **Continue**.

1. (Optional) Add tags to make your feature group more discoverable.

1. Choose **Continue**.

1. Choose **Create feature group**.

1. Navigate back to your Data Wrangler flow and choose the refresh icon next to the **Feature Group** search bar.

**Note**  
If you've already created a destination node for a feature group within a flow, you can't create another destination node for the same feature group. If you want to create another destination node for the same feature group, you must create another flow file.

Use the following procedure to create a Data Wrangler job.

Create a job from the **Data flow** page and choose the destination nodes that you want to export.

1. Choose **Create job**.

1. For **Job name**, specify the name of the export job.

1. Choose the destination nodes that you want to export.

1. (Optional) For **Output KMS Key**, specify an ARN, ID, or alias of an AWS KMS key. A KMS key is a cryptographic key. You can use the key to encrypt the output data from the job. For more information about AWS KMS keys, see [AWS Key Management Service](https://docs.aws.amazon.com/kms/latest/developerguide/overview.html).

1. The following image shows the **Configure job** page with the **Job configuration** tab open.  
![\[Example data flow create job page in the Data Wrangler console.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/destination-nodes/destination-nodes-configure-job.png)

   (Optional) Under **Trained parameters**, choose **Refit** if you've done the following:
   + Sampled your dataset
   + Applied a transform that uses your data to create a new column in the dataset

   For more information about refitting the transformations you've made to an entire dataset, see [Refit Transforms to The Entire Dataset and Export Them](#data-wrangler-data-export-fit-transform).

1. Choose **Configure job**.

1. (Optional) Configure the Data Wrangler job. You can make the following configurations:
   + **Job configuration**
   + **Spark memory configuration**
   + **Network configuration**
   + **Tags**
   + **Parameters**
   + **Associate Schedules**

1. Choose **Run**.

------
#### [ Jupyter notebook ]

Use the following procedure to generate a Jupyter notebook and run it to export your Data Wrangler flow to Amazon SageMaker Feature Store.

1. Choose the **+** next to the node that you want to export.

1. Choose **Export to**.

1. Choose **Amazon SageMaker Feature Store (via Jupyter Notebook)**.

1. Run the Jupyter notebook.

![\[Example data flow showing how to export your Data Wrangler flow in the Data Wrangler console.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/data-wrangler-destination-nodes-photo-export-to.png)


Running a Jupyter notebook runs a Data Wrangler job. Running a Data Wrangler job starts a SageMaker AI processing job. The processing job ingests the flow into an online and offline feature store.

**Important**  
The IAM role you use to run this notebook must have the following AWS managed policies attached: `AmazonSageMakerFullAccess` and `AmazonSageMakerFeatureStoreAccess`.

You must enable at least one store, online or offline, when you create a feature group. You can also enable both. To disable online store creation, set `EnableOnlineStore` to `False`:

```
# Online Store Configuration
online_store_config = {
    "EnableOnlineStore": False
}
```

The notebook uses the column names and types of the dataframe you export to create a feature group schema, which is used to create a feature group. A feature group is a group of features defined in the feature store to describe a record. A feature group definition is composed of a list of features, a record identifier feature name, an event time feature name, and configurations for its online store and offline store. 

Each feature in a feature group can have one of the following types: *String*, *Fractional*, or *Integral*. If a column in your exported dataframe is not one of these types, it defaults to `String`. 
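The type-defaulting behavior described above can be sketched with a small helper. The `to_feature_type` function below is hypothetical (not part of the exported notebook); it simply mirrors the String/Fractional/Integral mapping using pandas dtype checks.

```python
import pandas as pd

# Hypothetical helper: map a pandas dtype to a Feature Store feature type.
# Columns that are neither integer nor float default to "String",
# mirroring the behavior described above.
def to_feature_type(dtype) -> str:
    if pd.api.types.is_integer_dtype(dtype):
        return "Integral"
    if pd.api.types.is_float_dtype(dtype):
        return "Fractional"
    return "String"

df = pd.DataFrame({"Height": [72, 68], "Input": ["a", "b"], "Score": [0.5, 0.7]})
schema = {col: to_feature_type(df[col].dtype) for col in df.columns}
# schema == {"Height": "Integral", "Input": "String", "Score": "Fractional"}
```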

The following is an example of a feature group schema.

```
column_schema = [
    {
        "name": "Height",
        "type": "long"
    },
    {
        "name": "Input",
        "type": "string"
    },
    {
        "name": "Output",
        "type": "string"
    },
    {
        "name": "Sum",
        "type": "string"
    },
    {
        "name": "Time",
        "type": "string"
    }
]
```

Additionally, you must specify a record identifier name and event time feature name:
+ The *record identifier name* is the name of the feature whose value uniquely identifies a record defined in the feature store. Only the latest record per identifier value is stored in the online store. The record identifier feature name must be the name of one of the features in the feature definitions.
+ The *event time feature name* is the name of the feature that stores the `EventTime` of a record in a feature group. An `EventTime` is a point in time when a new event occurs that corresponds to the creation or update of a record in a feature. All records in the feature group must have a corresponding `EventTime`.
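The pieces above (feature definitions, record identifier name, event time feature name, and store configurations) come together in a create-feature-group request. The following sketch builds such a request as a plain dictionary; the group name, feature names, and S3 URI are hypothetical, and the actual call in the exported notebook is made through the SageMaker AI APIs.

```python
# Hypothetical example request; names and S3 URI are illustrative only.
feature_definitions = [
    {"FeatureName": "record_id", "FeatureType": "String"},
    {"FeatureName": "event_time", "FeatureType": "Fractional"},
    {"FeatureName": "height", "FeatureType": "Integral"},
]

create_request = {
    "FeatureGroupName": "example-feature-group",
    "RecordIdentifierFeatureName": "record_id",  # uniquely identifies a record
    "EventTimeFeatureName": "event_time",        # creation/update time of a record
    "FeatureDefinitions": feature_definitions,
    "OnlineStoreConfig": {"EnableOnlineStore": True},
    "OfflineStoreConfig": {
        "S3StorageConfig": {"S3Uri": "s3://example-bucket/feature-store"}
    },
}

# Both the record identifier and the event time must be declared features.
names = {f["FeatureName"] for f in feature_definitions}
assert create_request["RecordIdentifierFeatureName"] in names
assert create_request["EventTimeFeatureName"] in names
```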

The notebook uses these configurations to create a feature group, process your data at scale, and then ingest the processed data into your online and offline feature stores. To learn more, see [Data Sources and Ingestion](https://docs.aws.amazon.com/sagemaker/latest/dg/feature-store-ingest-data.html).

------

## Refit Transforms to The Entire Dataset and Export Them


When you import data, Data Wrangler uses a sample of the data to apply the encodings. By default, Data Wrangler uses the first 50,000 rows as a sample, but you can import the entire dataset or use a different sampling method. For more information, see [Import](data-wrangler-import.md).

The following transformations use your data to create a column in the dataset:
+ [Encode Categorical](data-wrangler-transform.md#data-wrangler-transform-cat-encode)
+ [Featurize Text](data-wrangler-transform.md#data-wrangler-transform-featurize-text)
+ [Handle Outliers](data-wrangler-transform.md#data-wrangler-transform-handle-outlier)
+ [Handle Missing Values](data-wrangler-transform.md#data-wrangler-transform-handle-missing)

If you used sampling to import your data, the preceding transforms only use the data from the sample to create the column. The transform might not have used all of the relevant data. For example, if you use the **Encode Categorical** transform, there might have been a category in the entire dataset that wasn't present in the sample.

You can either use a destination node or a Jupyter notebook to refit the transformations to the entire dataset. When Data Wrangler exports the transformations in the flow, it creates a SageMaker Processing job. When the processing job finishes, Data Wrangler saves the following files in either the default Amazon S3 location or an S3 location that you specify:
+ The Data Wrangler flow file that specifies the transformations that are refit to the dataset
+ The dataset with the refit transformations applied to it

You can open a Data Wrangler flow file within Data Wrangler and apply the transformations to a different dataset. For example, if you've applied the transformations to a training dataset, you can open and use the Data Wrangler flow file to apply the transformations to a dataset used for inference.

For information about using destination nodes to refit transforms and export them, see the following pages:
+ [Export to Amazon S3](#data-wrangler-data-export-s3)
+ [Export to Amazon SageMaker Feature Store](#data-wrangler-data-export-feature-store)

Use the following procedure to run a Jupyter notebook to refit the transformations and export the data.

To run a Jupyter notebook and to refit the transformations and export your Data Wrangler flow, do the following.

1. Choose the **+** next to the node that you want to export.

1. Choose **Export to**.

1. Choose the location to which you're exporting the data.

1. For the `refit_trained_params` object, set `refit` to `True`.

1. For the `output_flow` field, specify the name of the output flow file with the refit transformations.

1. Run the Jupyter notebook.
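In the exported notebook, the refit configuration referenced in the procedure above might look like the following. The exact variable layout can differ between notebook versions, so treat this as an assumed shape with a hypothetical flow file name.

```python
# Assumed shape of the refit configuration in the exported notebook.
refit_trained_params = {
    "refit": True,                        # refit transforms on the entire dataset
    "output_flow": "refit-example.flow",  # hypothetical output flow file name
}
```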

## Create a Schedule to Automatically Process New Data


If you're processing data periodically, you can create a schedule to run the processing job automatically. For example, you can create a schedule that runs a processing job automatically when you get new data. For more information about processing jobs, see [Export to Amazon S3](#data-wrangler-data-export-s3) and [Export to Amazon SageMaker Feature Store](#data-wrangler-data-export-feature-store).

When you create a job, you must specify an IAM role that has permissions to create the job. By default, the IAM role that you use to access Data Wrangler is the `SageMakerExecutionRole`.

The following permissions allow Data Wrangler to access EventBridge and allow EventBridge to run processing jobs:
+ Add the following AWS Managed policy to the Amazon SageMaker Studio Classic execution role that provides Data Wrangler with permissions to use EventBridge:

  ```
  arn:aws:iam::aws:policy/AmazonEventBridgeFullAccess
  ```

  For more information about the policy, see [AWS managed policies for EventBridge](https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-use-identity-based.html#eb-full-access-policy).
+ Add the following policy to the IAM role that you specify when you create a job in Data Wrangler:

------
#### [ JSON ]

****  

  ```
  {
      "Version": "2012-10-17",
      "Statement": [
          {
              "Effect": "Allow",
              "Action": "sagemaker:StartPipelineExecution",
              "Resource": "arn:aws:sagemaker:us-east-1:111122223333:pipeline/data-wrangler-*"
          }
      ]
  }
  ```

------

  If you're using the default IAM role, you add the preceding policy to the Amazon SageMaker Studio Classic execution role.

  Add the following trust policy to the role to allow EventBridge to assume it.

  ```
  {
      "Effect": "Allow",
      "Principal": {
          "Service": "events.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
  }
  ```

**Important**  
When you create a schedule, Data Wrangler creates an `eventRule` in EventBridge. You incur charges for both the event rules that you create and the instances used to run the processing job.  
For information about EventBridge pricing, see [Amazon EventBridge pricing](https://aws.amazon.com/eventbridge/pricing/). For information about processing job pricing, see [Amazon SageMaker Pricing](https://aws.amazon.com/sagemaker/pricing/).

You can set a schedule using one of the following methods:
+ [CRON expressions](https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-create-rule-schedule.html)
**Note**  
Data Wrangler doesn't support the following expressions:  
LW#
Abbreviations for days
Abbreviations for months
+ [RATE expressions](https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-create-rule-schedule.html#eb-rate-expressions)
+ Recurring – Set an hourly or daily interval to run the job.
+ Specific time – Set specific days and times to run the job.
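EventBridge schedule expressions take the form `cron(...)` or `rate(...)`. The helpers below are a hypothetical illustration of composing those strings; they don't validate the fields against EventBridge's rules.

```python
# Hypothetical helpers for composing EventBridge schedule expression strings.
def cron_expression(minutes, hours, day_of_month, month, day_of_week, year) -> str:
    fields = (minutes, hours, day_of_month, month, day_of_week, year)
    return "cron({})".format(" ".join(str(f) for f in fields))

def rate_expression(value: int, unit: str) -> str:
    # EventBridge uses the singular unit for a value of 1 ("1 hour")
    # and the plural otherwise ("2 hours").
    if value == 1:
        unit = unit.rstrip("s")
    return f"rate({value} {unit})"

# Run every 5 hours, every day:
print(cron_expression(0, "*/5", "*", "*", "?", "*"))  # cron(0 */5 * * ? *)
print(rate_expression(1, "hours"))                    # rate(1 hour)
```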

The following sections provide procedures on creating jobs.

------
#### [ CRON ]

Use the following procedure to create a schedule with a CRON expression.

To specify a schedule with a CRON expression, do the following.

1. Open your Data Wrangler flow.

1. Choose **Create job**.

1. (Optional) For **Output KMS key**, specify an AWS KMS key to configure the output of the job.

1. Choose **Next, 2. Configure job**.

1. Select **Associate Schedules**.

1. Choose **Create a new schedule**.

1. For **Schedule Name**, specify the name of the schedule.

1. For **Run Frequency**, choose **CRON**.

1. Specify a valid CRON expression.

1. Choose **Create**.

1. (Optional) Choose **Add another schedule** to run the job on an additional schedule.
**Note**  
You can associate a maximum of two schedules. The schedules are independent and don't affect each other unless the times overlap.

1. Choose one of the following:
   + **Schedule and run now** – Data Wrangler runs the job immediately and subsequently runs it on the schedules.
   + **Schedule only** – Data Wrangler runs the job only on the schedules that you specify.

1. Choose **Run**.

------
#### [ RATE ]

Use the following procedure to create a schedule with a RATE expression.

To specify a schedule with a RATE expression, do the following.

1. Open your Data Wrangler flow.

1. Choose **Create job**.

1. (Optional) For **Output KMS key**, specify an AWS KMS key to configure the output of the job.

1. Choose **Next, 2. Configure job**.

1. Select **Associate Schedules**.

1. Choose **Create a new schedule**.

1. For **Schedule Name**, specify the name of the schedule.

1. For **Run Frequency**, choose **Rate**.

1. For **Value**, specify an integer.

1. For **Unit**, select one of the following:
   + **Minutes**
   + **Hours**
   + **Days**

1. Choose **Create**.

1. (Optional) Choose **Add another schedule** to run the job on an additional schedule.
**Note**  
You can associate a maximum of two schedules. The schedules are independent and don't affect each other unless the times overlap.

1. Choose one of the following:
   + **Schedule and run now** – Data Wrangler runs the job immediately and subsequently runs it on the schedules.
   + **Schedule only** – Data Wrangler runs the job only on the schedules that you specify.

1. Choose **Run**.

------
#### [ Recurring ]

Use the following procedure to create a schedule that runs a job on a recurring basis.

To specify a recurring schedule, do the following.

1. Open your Data Wrangler flow.

1. Choose **Create job**.

1. (Optional) For **Output KMS key**, specify an AWS KMS key to configure the output of the job.

1. Choose **Next, 2. Configure job**.

1. Select **Associate Schedules**.

1. Choose **Create a new schedule**.

1. For **Schedule Name**, specify the name of the schedule.

1. For **Run Frequency**, make sure **Recurring** is selected by default.

1. For **Every x hours**, specify the hourly frequency that the job runs during the day. Valid values are integers in the inclusive range of **1** and **23**.

1. For **On days**, select one of the following options:
   + **Every Day**
   + **Weekends**
   + **Weekdays**
   + **Select Days**

   1. (Optional) If you've selected **Select Days**, choose the days of the week to run the job.
**Note**  
The schedule resets every day. If you schedule a job to run every five hours, it runs at the following times during the day:  
00:00
05:00
10:00
15:00
20:00

1. Choose **Create**.

1. (Optional) Choose **Add another schedule** to run the job on an additional schedule.
**Note**  
You can associate a maximum of two schedules. The schedules are independent and don't affect each other unless the times overlap.

1. Choose one of the following:
   + **Schedule and run now** – Data Wrangler runs the job immediately and subsequently runs it on the schedules.
   + **Schedule only** – Data Wrangler runs the job only on the schedules that you specify.

1. Choose **Run**.
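The daily reset described in the note above can be sketched as follows: the run times are the hours of the day at the chosen interval, counting from midnight.

```python
# Compute the run times for a recurring schedule that resets every day.
def daily_run_times(every_x_hours: int):
    return [f"{hour:02d}:00" for hour in range(0, 24, every_x_hours)]

print(daily_run_times(5))  # ['00:00', '05:00', '10:00', '15:00', '20:00']
```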

------
#### [ Specific time ]

Use the following procedure to create a schedule that runs a job at specific times.

To specify a schedule that runs the job at specific times, do the following.

1. Open your Data Wrangler flow.

1. Choose **Create job**.

1. (Optional) For **Output KMS key**, specify an AWS KMS key to configure the output of the job.

1. Choose **Next, 2. Configure job**.

1. Select **Associate Schedules**.

1. Choose **Create a new schedule**.

1. For **Schedule Name**, specify the name of the schedule.

1. Choose **Create**.

1. (Optional) Choose **Add another schedule** to run the job on an additional schedule.
**Note**  
You can associate a maximum of two schedules. The schedules are independent and don't affect each other unless the times overlap.

1. Choose one of the following:
   + **Schedule and run now** – Data Wrangler runs the job immediately and subsequently runs it on the schedules.
   + **Schedule only** – Data Wrangler runs the job only on the schedules that you specify.

1. Choose **Run**.

------

You can use Amazon SageMaker Studio Classic to view the jobs that are scheduled to run. Your processing jobs run within Pipelines; each processing job runs as a processing step within its own pipeline. You can view the schedules that you've created within a pipeline. For information about viewing a pipeline, see [View the details of a pipeline](pipelines-studio-list.md).

Use the following procedure to view the jobs that you've scheduled.

To view the jobs you've scheduled, do the following.

1. Open Amazon SageMaker Studio Classic.

1. Open **Pipelines**.

1. View the pipelines for the jobs that you've created.

   The pipeline running the job uses the job name as a prefix. For example, if you've created a job named `housing-data-feature-engineering`, the name of the pipeline is `data-wrangler-housing-data-feature-engineering`.

1. Choose the pipeline containing your job.

1. View the status of the pipelines. Pipelines with a **Status** of **Succeeded** have run the processing job successfully.

To stop a processing job from running, delete the event rule that specifies the schedule. Deleting an event rule stops all the jobs associated with the schedule from running. For information about deleting a rule, see [Disabling or deleting an Amazon EventBridge rule](https://docs.aws.amazon.com/eventbridge/latest/userguide/eb-delete-rule.html).

You can stop and delete the pipelines associated with the schedules as well. For information about stopping a pipeline, see [StopPipelineExecution](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_StopPipelineExecution.html). For information about deleting a pipeline, see [DeletePipeline](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DeletePipeline.html#API_DeletePipeline_RequestSyntax).

# Use an Interactive Data Preparation Widget in an Amazon SageMaker Studio Classic Notebook to Get Data Insights
Use Data Preparation in a Studio Classic Notebook to Get Data Insights

Use the Data Wrangler data preparation widget to interact with your data, get visualizations, explore actionable insights, and fix data quality issues. 

You can access the data preparation widget from an Amazon SageMaker Studio Classic notebook. For each column, the widget creates a visualization that helps you better understand its distribution. If a column has data quality issues, a warning appears in its header.

To see the data quality issues, select the column header showing the warning. You can use the information that you get from the insights and the visualizations to apply the widget's built-in transformations to help you fix the issues. 

For example, the widget might detect that you have a column that only has one unique value and show you a warning. The warning provides the option to drop the column from the dataset.

## Getting started with running the widget


Use the following information to help you get started with running a notebook.

Open a notebook in Amazon SageMaker Studio Classic. For information about opening a notebook, see [Create or Open an Amazon SageMaker Studio Classic Notebook](notebooks-create-open.md).

**Important**  
To run the widget, the notebook must use one of the following images:  
Python 3 (Data Science) with Python 3.7
Python 3 (Data Science 2.0) with Python 3.8
Python 3 (Data Science 3.0) with Python 3.10
SparkAnalytics 1.0
SparkAnalytics 2.0
For more information about images, see [Amazon SageMaker Images Available for Use With Studio Classic Notebooks](notebooks-available-images.md).

Use the following code to import the data preparation widget and pandas. The widget uses pandas dataframes to analyze your data.

```
import pandas as pd
import sagemaker_datawrangler
```

The following example code loads a file into the dataframe called `df`.

```
df = pd.read_csv("example-dataset.csv")
```

You can use a dataset in any format that you can load as a pandas dataframe object. For more information about pandas formats, see [IO tools (text, CSV, HDF5, …)](https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html).

The following cell runs the `df` variable to start the widget.

```
df
```

The top of the dataframe has the following options:
+ **View the Pandas table** – Switches between the interactive visualization and a pandas table.
+ **Use all of the rows in your dataset to compute the insights. Using the entire dataset might increase the time it takes to generate the insights.** – If you don't select the option, Data Wrangler computes the insights for the first 10,000 rows of the dataset.

The dataframe shows the first 1000 rows of the dataset. Each column header has a stacked bar chart that shows the column's characteristics. It shows the proportion of valid values, invalid values, and missing values. You can hover over the different portions of the stacked bar chart to get the calculated percentages.
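The proportions behind the stacked bar chart can be approximated with plain pandas. This is a hypothetical sketch with made-up column names; note that the widget additionally separates invalid values (entries whose type doesn't match the column type), which plain `isna` doesn't detect.

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"age": [25, np.nan, 31, np.nan], "city": ["a", "b", None, "d"]})

# Proportion of missing values per column; the remainder is the proportion
# of values that are present (valid or invalid).
missing = df.isna().mean()
present = 1 - missing
print(missing["age"], present["city"])  # 0.5 0.75
```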

Each column has a visualization in the header. The following shows the types of visualizations the columns can have:
+ Categorical – Bar chart
+ Numeric – Histogram
+ Datetime – Bar chart
+ Text – Bar chart

For each visualization, the data preparation widget highlights outliers in orange.

When you choose a column, it opens a side panel. The side panel shows you the **Insights** tab. The pane provides a count for the following types of values:
+ Invalid values – Values whose type doesn’t match the column type.
+ Missing values – Values that are missing, such as `NaN` or `None`.
+ Valid values – Values that are neither missing nor invalid.

For numeric columns, the **Insights** tab shows the following summary statistics:
+ Minimum – The smallest value.
+ Maximum – The largest value.
+ Mean – The mean of the values.
+ Mode – The value that appears most frequently.
+ Standard deviation – The standard deviation of the values.
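For a numeric column, the same summary statistics can be reproduced with pandas; a minimal sketch on a made-up series:

```python
import pandas as pd

col = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])

stats = {
    "Minimum": col.min(),
    "Maximum": col.max(),
    "Mean": col.mean(),
    "Mode": col.mode().iloc[0],       # most frequent value
    "Standard deviation": col.std(),  # sample standard deviation
}
print(stats["Mean"], stats["Mode"])  # 5.0 4
```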

For categorical columns, the **Insights** tab shows the following summary statistics:
+ Unique values – The number of unique values in the column.
+ Top – The value that appears most frequently.

The columns that have warning icons in their headers have data quality issues. Choosing a column opens a **Data quality** tab that you can use to find transforms to help you fix the issue. A warning has one of the following severity levels:
+ Low – Issues that might not affect your analysis, but can be useful to fix.
+ Medium – Issues that are likely to affect your analysis, but are likely not critical to fix.
+ High – Severe issues that we strongly recommend fixing.

**Note**  
The widget sorts the column to show the values that have data quality issues at the top of the dataframe. It also highlights the values that are causing the issues. The color of the highlighting corresponds to the severity level.

Under **SUGGESTED TRANSFORMS**, you can choose a transform to fix the data quality issue. The widget can offer multiple transforms that can fix the issue. It can offer recommendations for the transforms that are best suited to the problem. You can move your cursor over the transform to get more information about it.

To apply a transform to the dataset, choose **Apply and export code**. The transform modifies the dataset and updates the visualization with modified values. The code for the transform appears in the following cell of the notebook. If you apply additional transforms to the dataset, the widget appends the transforms to the cell. You can use the code that the widget generates to do the following:
+ Customize it to better fit your needs.
+ Use it in your own workflows.

You can reproduce all the transforms you've made by rerunning all of the cells in the notebook.

The widget can provide insights and warnings for the target column. The target column is the column that you're trying to predict. Use the following procedure to get target column insights.

To get target column insights, do the following.

1. Choose the column that you're using as the target column.

1. Choose **Select as target column**.

1. Choose the problem type. The widget's insights and warnings are tailored to the problem types. The following are the problem types:
   + **Classification** – The target column has categorical data.
   + **Regression** – The target column has numeric data.

1. Choose **Run**.

1. (Optional) Under **Target Column Insights**, choose one of the suggested transforms.

## Reference for the insights and transforms in the widget


For feature columns (columns that aren't the target column), you can get the following insights to warn you about issues with your dataset.
+ **Missing values** – The column has missing values such as `None`, `NaN` (not a number), or `NaT` (not a timestamp). Many machine learning algorithms don’t support missing values in the input data. Filling them in or dropping the rows with missing data is therefore a crucial data preparation step. If you see the missing values warning, you can use one of the following transforms to correct the issue.
  + **Drop missing** – Drops rows with missing values. We recommend dropping rows when the percentage of rows with missing data is small and imputing the missing values isn't appropriate. 
  + **Replace with new value** – Replaces textual missing values with `Other`. You can change `Other` to a different value in the output code. Replaces numeric missing values with 0.
  + **Replace with mean** – Replaces missing values with the mean of the column.
  + **Replace with median** – Replaces missing values with the median of the column.
  + **Drop column** – Drops the column with missing values from the dataset. We recommend dropping the entire column when there's a high percentage of rows with missing data.
+ **Disguised missing values** – The column has disguised missing values. A disguised missing value is a value that isn't explicitly encoded as a missing value. For example, instead of using a `NaN` to indicate a missing value, the value could be `Placeholder`. You can use one of the following transforms to handle the missing values:
  + **Drop missing** – Drops rows with missing values
  + **Replace with new value** – Replaces textual missing values with `Other`. You can change `Other` to a different value in the output code. Replaces numeric missing values with 0.
+ **Constant column** – The column only has one value. It therefore has no predictive power. We strongly recommend using the **Drop column** transform to drop the column from the dataset.
+ **ID column** – The column has no repeating values. All of the values in the column are unique. They might be either IDs or database keys. Without additional information, the column has no predictive power. We strongly recommend using the **Drop column** transform to drop the column from the dataset.
+ **High cardinality** – The column has a high percentage of unique values. High cardinality limits the predictive power of categorical columns. Examine the importance of the column in your analysis and consider using the **Drop column** transform to drop it.
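The missing-value transforms above map onto common pandas operations. The following is a hedged sketch with hypothetical column names, not the widget's generated code:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"color": ["red", None, "blue"], "price": [10.0, np.nan, 30.0]})

dropped = df.dropna()                                     # Drop missing
replaced = df.fillna({"color": "Other", "price": 0})      # Replace with new value
mean_filled = df["price"].fillna(df["price"].mean())      # Replace with mean
median_filled = df["price"].fillna(df["price"].median())  # Replace with median
no_color = df.drop(columns=["color"])                     # Drop column

print(replaced["color"].tolist())  # ['red', 'Other', 'blue']
```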

For the target column, you can get the following insights to warn you about issues with your dataset. You can use the suggested transformation provided with the warning to correct the issue.
+ **Mixed data types in target (Regression)** – There are some non-numeric values in the target column. There might be data entry errors. We recommend removing the rows that have the values that can't be converted.
+ **Frequent label** – Certain values in the target column appear more frequently than what would be normal in the context of regression. There might be an error in data collection or processing. A frequently appearing category might indicate that either the value is used as a default value or that it’s a placeholder for missing values. We recommend using the **Replace with new value** transform to replace the missing values with `Other`.
+ **Too few instances per class** – The target column has categories that appear rarely. Some of the categories don't have enough rows for the target column to be useful. You can use one of the following transforms:
  + **Drop rare target** – Drops unique values with fewer than ten observations. For example, drops the value `cat` if it appears nine times in the column.
  + **Replace rare target** – Replaces categories that appear rarely in the dataset with the value `Other`.
+ **Classes too imbalanced (multi-class classification)** – There are categories in the dataset that appear much more frequently than the other categories. The class imbalance might affect prediction accuracy. For the most accurate predictions possible, we recommend updating the dataset with rows that have the categories that currently appear less frequently.
+ **Large amount of classes/too many classes** – There's a large number of classes in the target column. Having many classes might result in longer training times or poor predictive quality. We recommend doing one of the following:
  + Grouping some of the categories into their own category. For example, if six categories are closely related, we recommend using a single category for them.
  + Using an ML algorithm that's resilient to multiple categories.
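The rare-target transforms can be sketched with pandas value counts. The ten-observation threshold follows the description above; the category values are hypothetical:

```python
import pandas as pd

target = pd.Series(["dog"] * 12 + ["cat"] * 9 + ["bird"] * 2)
counts = target.value_counts()
rare = counts[counts < 10].index  # values with fewer than ten observations

dropped = target[~target.isin(rare)]                  # Drop rare target
replaced = target.where(~target.isin(rare), "Other")  # Replace rare target

print(sorted(replaced.unique()))  # ['Other', 'dog']
```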

# Security and Permissions


When you query data from Athena or Amazon Redshift, the queried dataset is automatically stored in the default SageMaker AI S3 bucket for the AWS Region in which you are using Studio Classic. Additionally, when you export a Jupyter Notebook from Amazon SageMaker Data Wrangler and run it, your data flows, or .flow files, are saved to the same default bucket, under the prefix *data\_wrangler\_flows*.

For high-level security needs, you can configure a bucket policy that restricts the AWS roles that have access to this default SageMaker AI S3 bucket. Use the following section to add this type of policy to an S3 bucket. To follow the instructions on this page, use the AWS Command Line Interface (AWS CLI). To learn how, see [Configuring the AWS CLI](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html) in the AWS Command Line Interface User Guide.

Additionally, you need to grant each IAM role that uses Data Wrangler permissions to access required resources. If you do not require granular permissions for the IAM role you use to access Data Wrangler, you can add the IAM managed policy, [https://console.aws.amazon.com/iam/home?#/policies/arn:aws:iam::aws:policy/AmazonSageMakerFullAccess](https://console.aws.amazon.com/iam/home?#/policies/arn:aws:iam::aws:policy/AmazonSageMakerFullAccess), to an IAM role that you use to create your Studio Classic user. This policy grants you full permission to use Data Wrangler. If you require more granular permissions, refer to the section, [Grant an IAM Role Permission to Use Data Wrangler](#data-wrangler-security-iam-policy).

## Add a Bucket Policy To Restrict Access to Datasets Imported to Data Wrangler


You can add a policy to the S3 bucket that contains your Data Wrangler resources using an Amazon S3 bucket policy. The resources that Data Wrangler uploads to your default SageMaker AI S3 bucket in the AWS Region in which you are using Studio Classic include the following:
+ Queried Amazon Redshift results. These are stored under the *redshift/* prefix.
+ Queried Athena results. These are stored under the *athena/* prefix. 
+ The .flow files uploaded to Amazon S3 when you run an exported Jupyter notebook that Data Wrangler produces. These are stored under the *data\_wrangler\_flows/* prefix.

Use the following procedure to create an S3 bucket policy that you can add to restrict IAM role access to that bucket. To learn how to add a policy to an S3 bucket, see [How do I add an S3 Bucket policy?](https://docs.aws.amazon.com/AmazonS3/latest/user-guide/add-bucket-policy.html).

**To set up a bucket policy on the S3 bucket that stores your Data Wrangler resources:**

1. Configure one or more IAM roles that you want to be able to access Data Wrangler.

1. Open a command prompt or shell. For each role that you create, replace *role-name* with the name of the role and run the following:

   ```
   $ aws iam get-role --role-name role-name
   ```

   In the response, you see a `RoleId` string which begins with `AROA`. Copy this string. 

1. Add the following policy to the SageMaker AI default bucket in the AWS Region in which you are using Data Wrangler. Replace *region* with the AWS Region in which the bucket is located, and *account-id* with your AWS account ID. Replace the `userId` values starting with *AROAEXAMPLEID* with the IDs of the AWS roles to which you want to grant permission to use Data Wrangler. 

------
#### [ JSON ]

****  

   ```
   {
     "Version": "2012-10-17",
     "Statement": [
       {
         "Effect": "Deny",
         "Principal": "*",
         "Action": "s3:*",
         "Resource": [
           "arn:aws:s3:::sagemaker-us-east-1-111122223333/data_wrangler_flows/",
           "arn:aws:s3:::sagemaker-us-east-1-111122223333/data_wrangler_flows/*",
           "arn:aws:s3:::sagemaker-us-east-1-111122223333/athena",
           "arn:aws:s3:::sagemaker-us-east-1-111122223333/athena/*",
           "arn:aws:s3:::sagemaker-us-east-1-111122223333/redshift",
           "arn:aws:s3:::sagemaker-us-east-1-111122223333/redshift/*"
         ],
         "Condition": {
           "StringNotLike": {
             "aws:userId": [
               "AROAEXAMPLEID_1:*",
               "AROAEXAMPLEID_2:*"
             ]
           }
         }
       }
     ]
   }
   ```

------
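The deny statement above can also be generated programmatically before you attach it to the bucket. The following sketch assumes hypothetical bucket name, prefixes, and role IDs; it only builds the policy document, it doesn't attach it.

```python
import json

def build_deny_policy(bucket, prefixes, role_ids):
    # Deny all S3 actions on the Data Wrangler prefixes unless the caller's
    # aws:userId matches one of the allowed role IDs.
    resources = []
    for p in prefixes:
        resources += [f"arn:aws:s3:::{bucket}/{p}", f"arn:aws:s3:::{bucket}/{p}/*"]
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:*",
            "Resource": resources,
            "Condition": {
                "StringNotLike": {"aws:userId": [f"{r}:*" for r in role_ids]}
            },
        }],
    }

policy = build_deny_policy(
    "sagemaker-us-east-1-111122223333",
    ["data_wrangler_flows", "athena", "redshift"],
    ["AROAEXAMPLEID_1", "AROAEXAMPLEID_2"],
)
policy_json = json.dumps(policy, indent=2)  # ready to pass to put-bucket-policy
```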

## Create an Allowlist for Data Wrangler


Whenever a user starts running Data Wrangler from the Amazon SageMaker Studio Classic user interface, they make a call to the SageMaker AI application programming interface (API) to create a Data Wrangler application.

Your organization might not provide your users with permissions to make those API calls by default. To provide permissions, you must create and attach a policy to the user's IAM roles using the following policy template: [Data Wrangler Allow List Example](https://s3.us-west-2.amazonaws.com/amazon-sagemaker-data-wrangler-documentation-artifacts/DataWranglerAllowListExample.txt).

**Note**  
The preceding policy example only gives your users access to the Data Wrangler application.

For information about creating a policy, see [Creating policies on the JSON tab](https://docs.aws.amazon.com/IAM/latest/UserGuide/access_policies_create-console.html#access_policies_create-json-editor). When you're creating a policy, copy and paste the JSON policy from [Data Wrangler Allow List Example](https://s3.us-west-2.amazonaws.com/amazon-sagemaker-data-wrangler-documentation-artifacts/DataWranglerAllowListExample.txt) in the **JSON** tab.

**Important**  
Delete any IAM policies that prevent users from running the following operations:  
[CreateApp](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_CreateApp.html)
[DescribeApp](https://docs.aws.amazon.com/sagemaker/latest/APIReference/API_DescribeApp.html)
If you don't delete the policies, your users could still be affected by them.

After you've created the policy using the template, attach it to the IAM roles of your users. For information about attaching a policy, see [Adding IAM identity permissions (console)](https://docs.aws.amazon.com/IAM/latest/UserGuide/access_policies_manage-attach-detach.html#add-policies-console).

## Grant an IAM Role Permission to Use Data Wrangler


You can grant an IAM role permission to use Data Wrangler with the general IAM managed policy, [https://console.aws.amazon.com/iam/home?#/policies/arn:aws:iam::aws:policy/AmazonSageMakerFullAccess](https://console.aws.amazon.com/iam/home?#/policies/arn:aws:iam::aws:policy/AmazonSageMakerFullAccess). This is a general policy that includes [permissions](https://docs.aws.amazon.com/sagemaker/latest/dg/security-iam-awsmanpol-AmazonSageMakerFullAccess.html) required to use all SageMaker AI services. This policy grants an IAM role full access to Data Wrangler. You should be aware of the following when using `AmazonSageMakerFullAccess` to grant access to Data Wrangler:
+ If you import data from Amazon Redshift, the **Database User** name must have the prefix `sagemaker_access`.
+ This managed policy only grants permission to access buckets with one of the following phrases in the name: `SageMaker`, `Sagemaker`, `sagemaker`, or `aws-glue`. If you want to use Data Wrangler to import from an S3 bucket without one of these phrases in the name, refer to the last section on this page to learn how to grant an IAM entity permission to access your S3 buckets.
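As a rough illustration of that naming restriction, the following sketch checks whether a bucket name contains one of the covered phrases. This is a simplification, not the managed policy's exact logic: the real policy matches case-sensitive ARN wildcard patterns, and the helper name is hypothetical.

```python
def covered_by_full_access(bucket_name: str) -> bool:
    """Approximate the AmazonSageMakerFullAccess bucket-name restriction.

    Simplified illustration only: the managed policy matches bucket names
    containing "sagemaker" (in several capitalizations) or "aws-glue".
    """
    name = bucket_name.lower()
    return "sagemaker" in name or "aws-glue" in name
```

A default Data Wrangler bucket such as `sagemaker-us-east-1-111122223333` is covered; an arbitrarily named bucket is not, and requires the explicit S3 policy described later on this page.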

If you have high-security needs, you can attach the policies in this section to an IAM entity to grant permissions required to use Data Wrangler.

If you have datasets in Amazon Redshift or Athena that an IAM role needs to import from Data Wrangler, you must add a policy to that entity to access these resources. The following policies are the most restrictive policies you can use to give an IAM role permission to import data from Amazon Redshift and Athena. 

To learn how to attach a custom policy to an IAM role, refer to [Managing IAM policies](https://docs.aws.amazon.com/IAM/latest/UserGuide/access_policies_manage.html#create-managed-policy-console) in the IAM User Guide.

**Policy example to grant access to an Athena dataset import**

The following policy assumes that the IAM role has permission to access the underlying S3 bucket where data is stored through a separate IAM policy.

------
#### [ JSON ]

****  

```
{
    "Version":"2012-10-17",		 	 	 
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "athena:ListDataCatalogs",
                "athena:ListDatabases",
                "athena:ListTableMetadata",
                "athena:GetQueryExecution",
                "athena:GetQueryResults",
                "athena:StartQueryExecution",
                "athena:StopQueryExecution"
            ],
            "Resource": [
                "*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "glue:CreateTable"
            ],
            "Resource": [
                "arn:aws:glue:*:*:table/*/sagemaker_tmp_*",
                "arn:aws:glue:*:*:table/sagemaker_featurestore/*",
                "arn:aws:glue:*:*:catalog",
                "arn:aws:glue:*:*:database/*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "glue:DeleteTable"
            ],
            "Resource": [
                "arn:aws:glue:*:*:table/*/sagemaker_tmp_*",
                "arn:aws:glue:*:*:catalog",
                "arn:aws:glue:*:*:database/*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "glue:GetDatabases",
                "glue:GetTable",
                "glue:GetTables"
            ],
            "Resource": [
                "arn:aws:glue:*:*:table/*",
                "arn:aws:glue:*:*:catalog",
                "arn:aws:glue:*:*:database/*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "glue:CreateDatabase",
                "glue:GetDatabase"
            ],
            "Resource": [
                "arn:aws:glue:*:*:catalog",
                "arn:aws:glue:*:*:database/sagemaker_featurestore",
                "arn:aws:glue:*:*:database/sagemaker_processing",
                "arn:aws:glue:*:*:database/default",
                "arn:aws:glue:*:*:database/sagemaker_data_wrangler"
            ]
        }
    ]
}
```

------

****Policy example to grant access to an Amazon Redshift dataset import****

The following policy grants permission to set up an Amazon Redshift connection to Data Wrangler using database users that have the prefix `sagemaker_access` in the name. To grant permission to connect using additional database users, add additional entries under `"Resource"` in the following policy. The following policy assumes that the IAM role has permission to access the underlying S3 bucket where data is stored through a separate IAM policy, if applicable. 

------
#### [ JSON ]

****  

```
{
    "Version":"2012-10-17",		 	 	 
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "redshift-data:ExecuteStatement",
                "redshift-data:DescribeStatement",
                "redshift-data:CancelStatement",
                "redshift-data:GetStatementResult",
                "redshift-data:ListSchemas",
                "redshift-data:ListTables"
            ],
            "Resource": [
                "*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "redshift:GetClusterCredentials"
            ],
            "Resource": [
                "arn:aws:redshift:*:*:dbuser:*/sagemaker_access*",
                "arn:aws:redshift:*:*:dbname:*"
            ]
        }
    ]
}
```

------
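The `dbuser` resource ARN in the preceding policy only matches database users whose names start with `sagemaker_access`. A quick way to sanity-check a proposed database user name (the helper name is hypothetical, for illustration):

```python
def dbuser_allowed(db_user: str) -> bool:
    # Mirrors the "dbuser:*/sagemaker_access*" resource pattern in the
    # policy above: only user names starting with the prefix match.
    return db_user.startswith("sagemaker_access")
```

For example, `sagemaker_access_analyst` would be matched by the policy, while `admin` would be denied `redshift:GetClusterCredentials`.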

**Policy to grant access to an S3 bucket**

If your dataset is stored in Amazon S3, you can grant an IAM role permission to access this bucket with a policy similar to the following. This example grants programmatic read-write access to the bucket named *test*.

------
#### [ JSON ]

****  

```
{
  "Version":"2012-10-17",		 	 	 
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": ["arn:aws:s3:::test"]
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3:PutObject",
        "s3:GetObject",
        "s3:DeleteObject"
      ],
      "Resource": ["arn:aws:s3:::test/*"]
    }
  ]
}
```

------

To import data from Athena and Amazon Redshift, you must grant an IAM role permission to access the following prefixes under the default Amazon S3 bucket in the AWS Region in which Data Wrangler is being used: `athena/`, `redshift/`. If a default Amazon S3 bucket does not already exist in the AWS Region, you must also give the IAM role permission to create a bucket in that Region.

Additionally, if you want the IAM role to be able to use the Amazon SageMaker Feature Store, Pipelines, and Data Wrangler job export options, you must grant access to the prefix `data_wrangler_flows/` in this bucket.

 Data Wrangler uses the `athena/` and `redshift/` prefixes to store preview files and imported datasets. To learn more, see [Imported Data Storage](data-wrangler-import.md#data-wrangler-import-storage).

Data Wrangler uses the `data_wrangler_flows/` prefix to store .flow files when you run a Jupyter Notebook exported from Data Wrangler. To learn more, see [Export](data-wrangler-data-export.md).
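A small helper (hypothetical, for illustration) can generate the prefix resource ARNs that a policy covering these requirements lists explicitly, given your Region and account number:

```python
def data_wrangler_prefix_arns(region: str, account_id: str) -> list[str]:
    """Build the S3 resource ARNs for the prefixes Data Wrangler uses
    under the default bucket (sagemaker-<region>-<account number>)."""
    bucket = f"sagemaker-{region}-{account_id}"
    arns = []
    for prefix in ("data_wrangler_flows", "athena", "redshift"):
        # Grant both the prefix itself and all objects under it.
        arns.append(f"arn:aws:s3:::{bucket}/{prefix}")
        arns.append(f"arn:aws:s3:::{bucket}/{prefix}/*")
    return arns
```

For `us-east-1` and account `111122223333`, this produces the six resource ARNs that appear in the example policy that follows.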

Use a policy similar to the following to grant the permissions described in the preceding paragraphs.

------
#### [ JSON ]

****  

```
{
    "Version":"2012-10-17",		 	 	 
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:PutObject"
            ],
            "Resource": [
                "arn:aws:s3:::sagemaker-us-east-1-111122223333/data_wrangler_flows/",
                "arn:aws:s3:::sagemaker-us-east-1-111122223333/data_wrangler_flows/*",
                "arn:aws:s3:::sagemaker-us-east-1-111122223333/athena",
                "arn:aws:s3:::sagemaker-us-east-1-111122223333/athena/*",
                "arn:aws:s3:::sagemaker-us-east-1-111122223333/redshift",
                "arn:aws:s3:::sagemaker-us-east-1-111122223333/redshift/*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:CreateBucket",
                "s3:ListBucket"
            ],
            "Resource": "arn:aws:s3:::sagemaker-us-east-1-111122223333"
        },
        {
            "Effect": "Allow",
            "Action": [
                "s3:ListAllMyBuckets",
                "s3:GetBucketLocation"
            ],
            "Resource": "*"
        }
    ]
}
```

------

You can also access data in an Amazon S3 bucket that belongs to another AWS account by specifying the Amazon S3 bucket URI. To do this, the IAM policy that grants access to the Amazon S3 bucket in the other account should be similar to the following example, where `BucketFolder` is the specific directory in the bucket `UserBucket`. Attach this policy to the IAM role or user that needs to access the bucket in the other account. 

------
#### [ JSON ]

****  

```
{
   "Version":"2012-10-17",		 	 	 
   "Statement": [
       {
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:PutObject",
                "s3:PutObjectAcl"
            ],
            "Resource": "arn:aws:s3:::UserBucket/BucketFolder/*"
            },
                {
                "Effect": "Allow",
                "Action": [
                    "s3:ListBucket"
                ],
                "Resource": "arn:aws:s3:::UserBucket",
                "Condition": {
                "StringLike": {
                    "s3:prefix": [
                    "BucketFolder/*"
                    ]
                }
            }
        } 
    ]
}
```

------
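The `s3:prefix` condition in the preceding policy restricts listing to keys under `BucketFolder/`. Its `StringLike` wildcard behaves similarly to shell-style pattern matching, which `fnmatch` can illustrate locally (a rough analogy, not IAM's exact evaluation logic; the helper name is hypothetical):

```python
from fnmatch import fnmatch

def prefix_allowed(requested_prefix: str) -> bool:
    # Roughly mirrors StringLike on s3:prefix with "BucketFolder/*":
    # only list requests scoped under BucketFolder/ are allowed.
    return fnmatch(requested_prefix, "BucketFolder/*")
```

A list request with prefix `BucketFolder/data.csv` would be allowed; one with prefix `OtherFolder/data.csv` would be denied.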

The bucket owner must also grant access by adding a bucket policy similar to the following example to the bucket. Here, account `111122223333` and `TestUser` refer to the AWS account and user that are accessing the bucket, respectively.

------
#### [ JSON ]

****  

```
{
    "Version":"2012-10-17",		 	 	 
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::111122223333:user/TestUser"
            },
            "Action": [
                "s3:GetObject",
                "s3:PutObject",
                "s3:PutObjectAcl"
            ],
            "Resource": [
                "arn:aws:s3:::UserBucket/BucketFolder/*"
            ]
        },
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::111122223333:user/TestUser"
            },
            "Action": [
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::UserBucket"
            ]
        }
    ]
}
```

------

**Policy example to grant access to use SageMaker AI Studio**

Use a policy like the following to create an IAM execution role that can be used to set up a Studio Classic instance. 

------
#### [ JSON ]

****  

```
{
    "Version":"2012-10-17",		 	 	 
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "sagemaker:CreatePresignedDomainUrl",
                "sagemaker:DescribeDomain",
                "sagemaker:ListDomains",
                "sagemaker:DescribeUserProfile",
                "sagemaker:ListUserProfiles",
                "sagemaker:*App",
                "sagemaker:ListApps"
            ],
            "Resource": "*"
        }
    ]
}
```

------

## Snowflake and Data Wrangler


All permissions for AWS resources are managed via your IAM role attached to your Studio Classic instance. The Snowflake administrator manages Snowflake-specific permissions, as they can grant granular permissions and privileges to each Snowflake user. This includes databases, schemas, tables, warehouses, and storage integration objects. You must ensure that the correct permissions are set up outside of Data Wrangler. 

Note that the Snowflake `COPY INTO Amazon S3` command moves data from Snowflake to Amazon S3 over the public internet by default, but data in transit is secured using SSL. Data at rest in Amazon S3 is encrypted with SSE-KMS using the default AWS KMS key.

With respect to Snowflake credentials storage, Data Wrangler does not store customer credentials. Data Wrangler uses Secrets Manager to store the credentials in a secret and rotates secrets as part of a best practice security plan. The Snowflake or Studio Classic administrator needs to ensure that the data scientist’s Studio Classic execution role is granted permission to perform `GetSecretValue` on the secret storing the credentials. If already attached to the Studio Classic execution role, the `AmazonSageMakerFullAccess` policy has the necessary permissions to read secrets created by Data Wrangler and secrets created by following the naming and tagging convention in the instructions above. Secrets that do not follow the conventions must be separately granted access. We recommend using Secrets Manager to prevent sharing credentials over unsecured channels; however, note that a logged-in user can retrieve the plain-text password by launching a terminal or Python notebook in Studio Classic and then invoking API calls from the Secrets Manager API. 

## Data Encryption with AWS KMS


Within Data Wrangler, you can decrypt encrypted files and add them to your Data Wrangler flow. You can also encrypt the output of the transforms using either a default AWS KMS key or one that you provide.

You can import files if they have the following:
+ server-side encryption
+ SSE-KMS as the encryption type

To decrypt the file and import it to a Data Wrangler flow, you must add the IAM role of the SageMaker Studio Classic user that you're using as a key user.

The following screenshot shows a Studio Classic user role added as a key user. See [IAM Roles](https://console.aws.amazon.com/iam/home#/roles) to access users under the left panel to make this change.

![\[The Key users section in the console.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/data-wrangler-kms.png)


### Amazon S3 customer managed key setup for Data Wrangler imported data storage


By default, Data Wrangler uses Amazon S3 buckets that have the following naming convention: `sagemaker-region-account number`. For example, if your account number is `111122223333` and you are using Studio Classic in `us-east-1`, your imported datasets are stored in a bucket named `sagemaker-us-east-1-111122223333`.
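A sketch of that naming convention (the helper name is hypothetical):

```python
def default_bucket_name(region: str, account_id: str) -> str:
    # Data Wrangler's default bucket: sagemaker-<region>-<account number>
    return f"sagemaker-{region}-{account_id}"
```

This is the bucket to which the customer managed key setup below applies.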

The following instructions explain how to set up a customer managed key for your default Amazon S3 bucket.

1. To enable server-side encryption and set up a customer managed key for your default S3 bucket, see [Using KMS Encryption](https://docs.aws.amazon.com/AmazonS3/latest/userguide/UsingKMSEncryption.html).

1. After completing step 1, navigate to AWS KMS in the AWS Management Console. Find the customer managed key that you selected in the previous step and add the Studio Classic role as a key user. To do this, follow the instructions in [Allows key users to use a customer managed key](https://docs.aws.amazon.com/kms/latest/developerguide/key-policies.html#key-policy-default-allow-users).

### Encrypting the Data That You Export


You can encrypt the data that you export using one of the following methods:
+ Specifying that the objects in your Amazon S3 bucket use SSE-KMS encryption.
+ Specifying an AWS KMS key to encrypt the data that you export from Data Wrangler.

On the **Export data** page, specify a value for the **AWS KMS key ID or ARN**.

For more information on using AWS KMS keys, see [Protecting Data Using Server-Side Encryption with AWS KMS keys Stored in AWS Key Management Service (SSE-KMS)](https://docs.aws.amazon.com/AmazonS3/latest/userguide/UsingKMSEncryption.html).

## Amazon AppFlow Permissions


When you're performing a transfer, you must specify an IAM role that has permissions to perform the transfer. You can use the same IAM role that has permissions to use Data Wrangler. By default, the IAM role that you use to access Data Wrangler is the `SageMakerExecutionRole`.

The IAM role must have the following permissions:
+ Permissions to Amazon AppFlow
+ Permissions to the AWS Glue Data Catalog
+ Permissions for AWS Glue to discover the data sources that are available

When you run a transfer, Amazon AppFlow stores metadata from the transfer in the AWS Glue Data Catalog. Data Wrangler uses the metadata from the catalog to determine whether the data is available for you to query and import.

To add permissions to Amazon AppFlow, add the `AmazonAppFlowFullAccess` AWS managed policy to the IAM role. For more information about adding policies, see [Adding or removing IAM identity permissions](https://docs.aws.amazon.com/IAM/latest/UserGuide/access_policies_manage-attach-detach.html).

If you're transferring data to Amazon S3, you must also attach the following policy.

------
#### [ JSON ]

****  

```
{
  "Version":"2012-10-17",		 	 	 
  "Statement": [
    {
      "Sid": "VisualEditor0",
      "Effect": "Allow",
      "Action": [
        "s3:GetBucketTagging",
        "s3:ListBucketVersions",
        "s3:CreateBucket",
        "s3:ListBucket",
        "s3:GetBucketPolicy",
        "s3:PutEncryptionConfiguration",
        "s3:GetEncryptionConfiguration",
        "s3:PutBucketTagging",
        "s3:GetObjectTagging",
        "s3:GetBucketOwnershipControls",
        "s3:PutObjectTagging",
        "s3:DeleteObject",
        "s3:DeleteBucket",
        "s3:DeleteObjectTagging",
        "s3:GetBucketPublicAccessBlock",
        "s3:GetBucketPolicyStatus",
        "s3:PutBucketPublicAccessBlock",
        "s3:PutAccountPublicAccessBlock",
        "s3:ListAccessPoints",
        "s3:PutBucketOwnershipControls",
        "s3:PutObjectVersionTagging",
        "s3:DeleteObjectVersionTagging",
        "s3:GetBucketVersioning",
        "s3:GetBucketAcl",
        "s3:PutObject",
        "s3:GetObject",
        "s3:GetAccountPublicAccessBlock",
        "s3:ListAllMyBuckets",
        "s3:GetAnalyticsConfiguration",
        "s3:GetBucketLocation"
      ],
      "Resource": "*"
    }
  ]
}
```

------

To add AWS Glue permissions, add the `AWSGlueConsoleFullAccess` managed policy to the IAM role. For more information about AWS Glue permissions with Amazon AppFlow, see [link-to-appflow-page].

Amazon AppFlow needs to access AWS Glue and Data Wrangler for you to import the data that you've transferred. To grant Amazon AppFlow access, add the following trust policy to the IAM role.

------
#### [ JSON ]

****  

```
{
    "Version":"2012-10-17",		 	 	 
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::123456789012:root",
                "Service": [
                    "appflow.amazonaws.com"
                ]
            },
            "Action": "sts:AssumeRole"
        }
    ]
}
```

------

To display the Amazon AppFlow data in Data Wrangler, add the following policy to the IAM role:

------
#### [ JSON ]

****  

```
{
    "Version":"2012-10-17",		 	 	 
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "glue:SearchTables",
            "Resource": [
                "arn:aws:glue:*:*:table/*/*",
                "arn:aws:glue:*:*:database/*",
                "arn:aws:glue:*:*:catalog"
            ]
        }
    ]
}
```

------

## Using Lifecycle Configurations in Data Wrangler


You might have an Amazon EC2 instance that is configured to run Kernel Gateway applications, but not the Data Wrangler application. Kernel Gateway applications provide access to the environment and the kernels that you use to run Studio Classic notebooks and terminals. The Data Wrangler application is the UI application that runs Data Wrangler. Amazon EC2 instances that aren't Data Wrangler instances require a modification to their lifecycle configurations to run Data Wrangler. Lifecycle configurations are shell scripts that automate the customization of your Amazon SageMaker Studio Classic environment.

For more information about lifecycle configurations, see [Use Lifecycle Configurations to Customize Amazon SageMaker Studio Classic](studio-lcc.md).

The default lifecycle configuration for your instance doesn't support using Data Wrangler. You can make the following modifications to the default configuration to use Data Wrangler with your instance.

```
#!/bin/bash
set -eux
STATUS=$(
python3 -c "import sagemaker_dataprep"
echo $?
)
if [ "$STATUS" -eq 0 ]; then
echo 'Instance is of Type Data Wrangler'
else
echo 'Instance is not of Type Data Wrangler'

# Replace this with the URL of your git repository
export REPOSITORY_URL="https://github.com/aws-samples/sagemaker-studio-lifecycle-config-examples.git"

git -C /root clone "$REPOSITORY_URL"

fi
```

You can save the script as `lifecycle_configuration.sh`.
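If you create the lifecycle configuration through the SageMaker API or AWS CLI rather than the console, the script content is passed base64-encoded via the `StudioLifecycleConfigContent` parameter. A minimal sketch of preparing that content (the abbreviated `script` string stands in for the full script above):

```python
import base64

def encode_lifecycle_script(script_text: str) -> str:
    # The StudioLifecycleConfigContent parameter of
    # CreateStudioLifecycleConfig expects the script base64-encoded.
    return base64.b64encode(script_text.encode("utf-8")).decode("ascii")

# Abbreviated stand-in for the full lifecycle script shown above.
script = "#!/bin/bash\nset -eux\n"
encoded = encode_lifecycle_script(script)
```

You would pass `encoded` as the `StudioLifecycleConfigContent` value when calling `CreateStudioLifecycleConfig`; the encoded text round-trips back to the original script.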

You attach the lifecycle configuration to your Studio Classic domain or user profile. For more information about creating and attaching a lifecycle configuration, see [Create and Associate a Lifecycle Configuration with Amazon SageMaker Studio Classic](studio-lcc-create.md).

You might run into errors when you're creating or attaching a lifecycle configuration. For information about debugging lifecycle configuration errors, see [KernelGateway app failure](studio-lcc-debug.md#studio-lcc-debug-kernel).

# Release Notes


Data Wrangler is regularly updated with new features and bug fixes. To upgrade the version of Data Wrangler you are using in Studio Classic, follow the instructions in [Shut Down and Update Amazon SageMaker Studio Classic Apps](studio-tasks-update-apps.md).


+ **8/31/2023** New functionality: You can now create a Data Quality and Insights report on your entire dataset. For more information, see [Get Insights On Data and Data Quality](data-wrangler-data-insights.md).
+ **5/20/2023** New functionality: You can now import your data from Salesforce Data Cloud. For more information, see [Import data from Salesforce Data Cloud](data-wrangler-import.md#data-wrangler-import-salesforce-data-cloud).
+ **4/18/2023** New functionality: You can now get your data in a format that Amazon Personalize can interpret. For more information, see [Map Columns for Amazon Personalize](data-wrangler-transform.md#data-wrangler-transform-personalize).
+ **3/1/2023** New functionality: You can now use Hive to import your data from Amazon EMR. For more information, see [Import data from Amazon EMR](data-wrangler-import.md#data-wrangler-emr).
+ **12/10/2022** New functionality: You can now export your Data Wrangler flow to an inference endpoint. For more information, see [Export to an Inference Endpoint](data-wrangler-data-export.md#data-wrangler-data-export-inference). You can now use an interactive notebook widget for data preparation. For more information, see [Use an Interactive Data Preparation Widget in an Amazon SageMaker Studio Classic Notebook to Get Data Insights](data-wrangler-interactively-prepare-data-notebook.md). You can now import data from SaaS platforms. For more information, see [Import Data From Software as a Service (SaaS) Platforms](data-wrangler-import.md#data-wrangler-import-saas).
+ **10/12/2022** New functionality: You can now reuse data flows for different datasets. For more information, see [Reusing Data Flows for Different Datasets](data-wrangler-parameterize.md).
+ **10/05/2022** New functionality: You can now use Principal Component Analysis (PCA) as a transform. For more information, see [Reduce Dimensionality within a Dataset](data-wrangler-transform.md#data-wrangler-transform-dimensionality-reduction). You can now refit parameters in your Data Wrangler flow. For more information, see [Export](data-wrangler-data-export.md).
+ **10/03/2022** New functionality: You can now deploy models from your Data Wrangler flow. For more information, see [Automatically Train Models on Your Data Flow](data-wrangler-autopilot.md).
+ **9/20/2022** New functionality: You can now set data retention periods in Athena. For more information, see [Import data from Athena](data-wrangler-import.md#data-wrangler-import-athena).
+ **6/9/2022** New functionality: You can now use Amazon SageMaker Autopilot to train a model directly from your Data Wrangler flow. For more information, see [Automatically Train Models on Your Data Flow](data-wrangler-autopilot.md).
+ **5/6/2022** New functionality: You can now use additional m5 and r5 instances. For more information, see [Instances](data-wrangler-data-flow.md#data-wrangler-data-flow-instances).
+ **4/27/2022** New functionalities: [See the AWS documentation website for more details](http://docs.aws.amazon.com/sagemaker/latest/dg/data-wrangler-release-notes.html).
+ **4/1/2022** New functionality: You can now use Databricks as a data source. For more information, see [Import data from Databricks (JDBC)](data-wrangler-import.md#data-wrangler-databricks).
+ **2/2/2022** New functionalities: [See the AWS documentation website for more details](http://docs.aws.amazon.com/sagemaker/latest/dg/data-wrangler-release-notes.html).
+ **10/16/2021** New functionality: Data Wrangler now supports Athena workgroups. For more information, see [Import data from Athena](data-wrangler-import.md#data-wrangler-import-athena).
+ **10/6/2021** New functionality: Data Wrangler now supports transforming time series data. For more information, see [Transform Time Series](data-wrangler-transform.md#data-wrangler-transform-time-series).
+ **7/15/2021** New functionalities, enhancements, and bug fixes: [See the AWS documentation website for more details](http://docs.aws.amazon.com/sagemaker/latest/dg/data-wrangler-release-notes.html).
+ **4/26/2021** Enhancements and bug fixes: [See the AWS documentation website for more details](http://docs.aws.amazon.com/sagemaker/latest/dg/data-wrangler-release-notes.html).
+ **2/8/2021** New functionalities, enhancements, and bug fixes: [See the AWS documentation website for more details](http://docs.aws.amazon.com/sagemaker/latest/dg/data-wrangler-release-notes.html).

# Troubleshoot


If an issue arises when using Amazon SageMaker Data Wrangler, we recommend you do the following:
+ If an error message is provided, read the message and resolve the issue it reports if possible.
+ Make sure the IAM role of your Studio Classic user has the required permissions to perform the action. For more information, see [Security and Permissions](data-wrangler-security.md).
+ If the issue occurs when you are trying to import from another AWS service, such as Amazon Redshift or Athena, make sure that you have configured the necessary permissions and resources to perform the data import. For more information, see [Import](data-wrangler-import.md).
+ If you're still having issues, choose **Get help** at the top right of your screen to reach out to the Data Wrangler team. For more information, see the following images.  
![\[The location of the Data Wrangler help form in the Data Wrangler console.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/get-help/get-help.png)  
![\[The Data Wrangler help form in the Data Wrangler console.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/get-help/get-help-forms.png)

As a last resort, you can try restarting the kernel on which Data Wrangler is running. 

1. Save and exit the .flow file for which you want to restart the kernel. 

1. Select the ****Running Terminals and Kernels**** icon, as shown in the following image.  
![\[The location of the Running Terminals and Kernels icon in the console.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/stop-kernel-option.png)

1. Select the **Stop** icon to the right of the .flow file for which you want to terminate the kernel, as shown in the following image.  
![\[The location of the Stop icon in the console.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/studio/mohave/stop-kernel.png)

1. Refresh the browser. 

1. Reopen the .flow file on which you were working. 

## Troubleshooting issues with Amazon EMR


Use the following information to help you troubleshoot errors that might come up when you're using Amazon EMR.
+ Connection failure – If the connection fails with the error message `The IP address of the EMR cluster isn't private`, your Amazon EMR cluster might not have been launched in a private subnet. As a security best practice, Data Wrangler only supports connecting to private Amazon EMR clusters. Choose a private subnet when you launch the EMR cluster.
+ Connection hanging and timing out – This issue is most likely due to a network connectivity problem. After you start connecting to the cluster, the screen doesn't refresh. After about 2 minutes, you might see the following error at the top of the screen: `JdbcAddConnectionError: An error occurred when trying to connect to presto: xxx: Connect to xxx failed: Connection timed out (Connection timed out)`.

  The errors might have two root causes:
  + The Amazon EMR cluster and Amazon SageMaker Studio Classic are in different VPCs. We recommend launching both in the same VPC. You can also use VPC peering. For more information, see [What is VPC peering?](https://docs.aws.amazon.com/vpc/latest/peering/what-is-vpc-peering.html).
  + The Amazon EMR master security group lacks the inbound traffic rule for the security group of Amazon SageMaker Studio Classic on the port used for Presto. To resolve the issue, allow inbound traffic on port 8889.
+ Connection fails due to the connection type being misconfigured – You might see the following error message: `Data Wrangler couldn't create a connection to {connection_source} successfully. Try connecting to {connection_source} again. For more information, see Troubleshoot. If you're still experiencing issues, contact support.`

  Check the authentication method. The authentication method that you've specified in Data Wrangler should match the authentication method that you're using on the cluster.
+ You don't have HDFS permissions for LDAP authentication – Use the guidance in [Set up HDFS Permissions using Linux Credentials](https://docs.aws.amazon.com/whitepapers/latest/teaching-big-data-skills-with-amazon-emr/set-up-hdfs-permissions-using-linux-credentials.html) to resolve the issue. After you log in to the cluster, you can set up the user's HDFS directory using the following commands:

  ```
  hdfs dfs -mkdir /user/USERNAME
  hdfs dfs -chown USERNAME:USERNAME /user/USERNAME
  ```
+ LDAP authentication missing connection key error – You might see the following error message: `Data Wrangler couldn't connect to EMR hive successfully. JDBC connection is missing required connection key(s): PWD`.

  For LDAP authentication, you must specify both a username and a password. The error indicates that the JDBC URL stored in Secrets Manager is missing the `PWD` property.
+ Troubleshooting the LDAP configuration – We recommend making sure that the LDAP authenticator (LDAP server) is correctly configured to connect to the Amazon EMR cluster. Use the `ldapwhoami` command to help you resolve the configuration issue. The following are example commands that you can run:
  + For LDAPS – `ldapwhoami -x -H ldaps://ldap-server`
  + For LDAP – `ldapwhoami -x -H ldap://ldap-server`

  Either command should return `Anonymous` if you've configured the authenticator successfully.
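The inbound traffic rule for Presto described above can also be added with the AWS CLI. The following is a sketch; the two security group IDs are placeholders that you'd replace with the security group of your Amazon EMR primary node and the security group used by Studio Classic:

```shell
# Placeholders: substitute your own security group IDs.
EMR_PRIMARY_SG="sg-0123456789abcdef0"   # security group of the EMR primary (master) node
STUDIO_SG="sg-0fedcba9876543210"        # security group used by Studio Classic

# Allow Studio Classic to reach Presto on port 8889 of the EMR primary node.
aws ec2 authorize-security-group-ingress \
  --group-id "$EMR_PRIMARY_SG" \
  --protocol tcp \
  --port 8889 \
  --source-group "$STUDIO_SG"
```

Using `--source-group` instead of a CIDR range scopes the rule to traffic originating from the Studio Classic security group only.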

## Troubleshooting with Salesforce


### Lifecycle configuration error


When your user opens Studio Classic for the first time, they might get an error saying that there's something wrong with their lifecycle configuration. Use Amazon CloudWatch to access the logs written by your lifecycle configuration script. For more information about debugging lifecycle configurations, see [Debug Lifecycle Configurations in Amazon SageMaker Studio Classic](studio-lcc-debug.md).

If you aren't able to debug the error, you can create the configuration file manually. You must create the file every time you delete or restart the Jupyter server. Use the following procedure to create the file manually.

**To create a configuration file**

1. Navigate to Studio Classic.

1. Choose **File**, then **New**, then **Terminal**.

1. Create `.sfgenie_identity_provider_oauth_config`.

1. Open the file in a text editor.

1. Add a JSON object containing the Amazon Resource Name (ARN) of the Secrets Manager secret to the file. You can use the following template to create the object.

   ```
   {
     "secret_arn": "example-secret-ARN"
   }
   ```

1. Save your changes to the file.
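As a sketch, steps 3 through 6 can be performed with a single terminal command. The following assumes the file is created in the home directory, and the secret ARN is a placeholder:

```shell
# Write the OAuth configuration file in the home directory.
# Replace example-secret-ARN with the ARN of your Secrets Manager secret.
cat > "$HOME/.sfgenie_identity_provider_oauth_config" <<'EOF'
{
  "secret_arn": "example-secret-ARN"
}
EOF
```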

### Unable to access Salesforce Data Cloud from the Data Wrangler flow


After your user chooses **Salesforce Data Cloud** from your Data Wrangler flow, they might get an error indicating that the prerequisites for setting up the connection haven't been met. The error can be caused by any of the following issues:
+ The Salesforce secret in Secrets Manager hasn't been created.
+ The Salesforce secret in Secrets Manager has been created, but it's missing the Salesforce tag.
+ The Salesforce secret in Secrets Manager has been created in the wrong AWS Region. For example, your user won't be able to access the Salesforce Data Cloud in `ca-central-1` because you've created the secret in `us-east-1`. You can either replicate the secret to `ca-central-1` or create a new secret with the same credentials in `ca-central-1`. For information about replicating secrets, see [Replicate an AWS Secrets Manager secret to other AWS Regions](https://docs.aws.amazon.com/secretsmanager/latest/userguide/create-manage-multi-region-secrets.html).
+ The policy that your users are using to access Amazon SageMaker Studio Classic is missing permissions for AWS Secrets Manager.
+ There's a typo in the Secrets Manager ARN of the JSON object that you've specified through your lifecycle configuration.
+ There's a typo in the Secrets Manager secret containing your Salesforce OAuth configuration.
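You can check several of these conditions from the AWS CLI. The following sketch uses a placeholder secret ID; the first command verifies that the secret exists in the Region you expect and shows its tags, and the second replicates it to another Region as described above:

```shell
SECRET_ID="example-salesforce-secret"   # placeholder: your secret name or ARN

# Confirm the secret exists in the expected Region and inspect its tags.
aws secretsmanager describe-secret \
  --secret-id "$SECRET_ID" \
  --region ca-central-1 \
  --query '{Name: Name, Tags: Tags}'

# Replicate the secret from its source Region (us-east-1 here) to ca-central-1.
aws secretsmanager replicate-secret-to-regions \
  --secret-id "$SECRET_ID" \
  --region us-east-1 \
  --add-replica-regions Region=ca-central-1
```

If `describe-secret` fails with `ResourceNotFoundException` in the Region that your users work in, the secret was likely created in a different Region.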

### Blank page showing `redirect_uri_mismatch`


After your users choose **Save and Connect**, they might get redirected to a page that shows `redirect_uri_mismatch`. The callback URI that you've registered in your Salesforce Connected App settings is either missing or incorrect.

Use the following URL to check that your Studio Classic URL is correctly registered in your Salesforce org's Connected App settings: `https://EXAMPLE_SALESFORCE_ORG/lightning/setup/ConnectedApplication/home/`.

**Note**  
It takes roughly ten minutes to propagate the URI within Salesforce's systems.

### Shared spaces


Shared spaces don't currently work with the Salesforce Data Cloud integration. You can either delete the shared spaces in the Amazon SageMaker AI domain that you intend to use, or use another domain that doesn't have shared spaces set up.

### OAuth Redirect Error


Your users should be able to import their data from the Salesforce Data Cloud after they choose **Connect**. If they're running into an error, we recommend asking them to do the following:
+ Tell them to be patient – When they get redirected back to Amazon SageMaker Studio Classic, it can take up to a minute to complete the authentication process. While they're getting redirected, we recommend telling them to avoid interacting with the browser. For example, they shouldn't close the browser tab, switch to another tab, or interact with the Data Wrangler flow. Interacting with the browser might remove the authorization code required to connect to the data cloud.
+ Have your users reconnect to the data cloud – There are transient issues that can cause a connection to the Salesforce Data Cloud to fail. Have your users create a new Data Wrangler flow and try connecting to the Salesforce Data Cloud again.
+ Make sure your users close all other tabs with Amazon SageMaker Studio Classic – Having Studio Classic open in multiple tabs can cause the Salesforce Data Cloud connection to fail. Make sure your users only have one Studio Classic tab open.
+ Multiple users accessing Studio Classic at the same time – Only one user should access an Amazon SageMaker AI domain at a time. If multiple users access the same domain, the connection that a user is trying to create to the Salesforce Data Cloud might fail.

Updating both Data Wrangler and Studio Classic might also fix their error. For information about updating Data Wrangler, see [Update Data Wrangler](data-wrangler-update.md). For information about updating Studio Classic, see [Shut Down and Update Amazon SageMaker Studio Classic](studio-tasks-update-studio.md).

If none of the preceding troubleshooting steps work, you might find an error message from Salesforce with a corresponding description embedded in the Studio Classic URL. The following is an example of a message you could find: `error=invalid_client_id&error_description=client%20identifier%20invalid`.

You can look at the error message in the URL and try to address the issues it presents. If the error message or description is unclear, we recommend searching the Salesforce Knowledge Base. If searching the knowledge base doesn't work, you can reach out to the Salesforce help desk for more assistance.

### Data Wrangler takes a long time to load


When your users are getting redirected back to Data Wrangler from the Salesforce Data Cloud, they might experience long load times.

If this is the user's first time using Data Wrangler or they've deleted the kernel, it might take about 5 minutes to provision the new Amazon EC2 instance to use Data Wrangler.

If this isn't the user's first time using Data Wrangler and they haven't deleted the kernel, you can ask them to refresh the page or close as many browser tabs as possible.

If none of the preceding interventions work, have them set up a new connection to the Salesforce Data Cloud.

### User fails to export their data with an `Invalid batch Id` error


When your user exports the transformations that they've made to their Salesforce data, the SageMaker Processing job that Data Wrangler uses on the backend might fail. The Salesforce Data Cloud might be temporarily unavailable or there could be a caching issue.

To address the issue, we recommend having your users go back to the step where they're importing the data and change the order of the columns that they're querying. For example, they can change the following query:

```
SELECT col_A, col_B FROM table                
```

To the following query:

```
SELECT col_B, col_A FROM table                
```

After they've changed the order of the columns and made sure that the subsequent transformations they've made are still valid, they can start exporting their data again.

### Users can't export a very large dataset


If your users imported a very large dataset from the Salesforce Data Cloud, they might not be able to export the transformations that they've made. The dataset might have too many rows, or the query used to import it might be too complex.

We recommend having your users take the following actions:
+ Simplify their SQL query
+ Sample their data

The following are some strategies that they can use to simplify their queries:
+ Specify column names instead of using the `*` operator
+ Find a smaller subset of the data to import instead of using a larger one
+ Minimize joins between very large datasets

They can use sampling to reduce the number of rows in their dataset. For information about sampling methods, your users can refer to [Sampling](data-wrangler-transform.md#data-wrangler-transform-sampling).

### Users can't export data due to invalid refresh token


Data Wrangler uses a JDBC driver to integrate with the Salesforce Data Cloud. The method for authentication is OAuth. For OAuth, the refresh token and the access token are two different pieces of data that are used to authorize access to resources within your Salesforce Data Cloud.

The access token, or core token, is what allows you to access your Salesforce data and run queries directly through Data Wrangler. It's short lived and designed to expire quickly. To maintain access to your Salesforce data, Data Wrangler uses the refresh token to get a new access token from Salesforce.

You might have set the refresh token to expire too quickly, which prevents Data Wrangler from getting a new access token for your users. Revisit your refresh token policy to make sure that it can accommodate queries that take a long time to run. For information about configuring your refresh token policy, see `https://EXAMPLE_SALESFORCE_ORG_URL/lightning/setup/ConnectedApplication/home/`.

### Queries failing or tables not loading


Salesforce experiences service outages. Even if you’ve configured everything correctly, your users might not be able to import their data for periods of time.

Service outages can happen for maintenance reasons. We recommend checking back the following day to see if the issue has been resolved.

If you’re experiencing issues for more than a day, we recommend contacting Salesforce’s help desk for further assistance. For information about contacting Salesforce, see [How would you like to contact Salesforce?](https://www.salesforce.com/company/contact-us/)

### `OAUTH_APP_BLOCKED` during Studio Classic redirect


When your user gets redirected back to Amazon SageMaker Studio Classic, they might notice the query parameter `error=OAUTH_APP_BLOCKED` within the URL. They might be experiencing a transient issue that should resolve itself within a day.

It's possible that you've blocked their access to the Connected App as well. For information about resolving the issue, see `https://EXAMPLE_SALESFORCE_ORG_URL/lightning/setup/ConnectedApplication/home/`.

### `OAUTH_APP_DENIED` during Studio Classic redirect


When your user gets redirected back to Amazon SageMaker Studio Classic, they might notice the query parameter `error=OAUTH_APP_ACCESS_DENIED` within the URL. You haven't given their profile type permissions to access the `Connected App` associated with Data Wrangler.

To resolve their access issue, navigate to `https://EXAMPLE_SALESFORCE_ORG_URL/lightning/setup/ManageUsers/home/` and check whether the user is assigned to the correct profile.

# Increase Amazon EC2 Instance Limit


You might see the following error message when you're using Data Wrangler: `The following instance type is not available: ml.m5.4xlarge. Try selecting a different instance below.`

The message can indicate that you need to select a different instance type, but it can also indicate that you don't have enough Amazon EC2 instances to successfully run Data Wrangler on your workflow.

To increase the number of instances, do the following.

1. Open the AWS Management Console.

1. In the search bar, specify **Service Quotas**.

1. Choose **Service Quotas**.

1. Choose **AWS services**.

1. In the search bar, specify **Amazon SageMaker AI**.

1. Choose **Amazon SageMaker AI**.

1. Under **Service quotas**, specify **Studio KernelGateway Apps running on *ml.m5.4xlarge* instance**.
**Note**  
ml.m5.4xlarge is the default instance type for Data Wrangler. You can use other instance types and request quota increases for them. For more information, see [Instances](data-wrangler-data-flow.md#data-wrangler-data-flow-instances).

1. Select **Studio KernelGateway Apps running on *ml.m5.4xlarge* instance**.

1. Choose **Request quota increase**.

1. For **Change quota value**, specify a value greater than **Applied quota value**.

1. Choose **Request**.

If your request is approved, AWS sends a notification to the email address associated with your account. You can also check the status of your request by choosing **Quota request history** on the **Service Quotas** page. Processed requests have a **Status** of **Closed**.
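The same request can be made with the AWS CLI. The following is a sketch; `L-EXAMPLE0` is a placeholder for the quota code, which you'd look up with the first command:

```shell
# Find the quota code for the KernelGateway quota on the instance type.
aws service-quotas list-service-quotas \
  --service-code sagemaker \
  --query "Quotas[?contains(QuotaName, 'ml.m5.4xlarge')].{Name: QuotaName, Code: QuotaCode, Value: Value}"

# Request an increase, substituting the quota code returned above.
aws service-quotas request-service-quota-increase \
  --service-code sagemaker \
  --quota-code L-EXAMPLE0 \
  --desired-value 2
```

You can then poll the request with `aws service-quotas list-requested-service-quota-change-history --service-code sagemaker` instead of checking the console.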

# Update Data Wrangler


To update Data Wrangler to the latest release, first shut down the corresponding KernelGateway app from the Amazon SageMaker Studio Classic control panel. After the KernelGateway app is shut down, restart it by opening a new or existing Data Wrangler flow in Studio Classic. When you open a new or existing Data Wrangler flow, the kernel that starts contains the latest version of Data Wrangler.

**Update your Studio Classic and Data Wrangler instance**

1. Navigate to your [SageMaker AI Console](https://console.aws.amazon.com/sagemaker).

1. Choose **SageMaker AI** and then **Studio Classic**.

1. Choose your user name.

1. Under **Apps**, in the row displaying the **App name**, choose **Delete app** for the app that starts with `sagemaker-data-wrang`, and for the JupyterServer app.

1. Choose **Yes, delete app**.

1. Type `delete` in the confirmation box.

1. Choose **Delete**.

1. Reopen your Studio Classic instance. When you begin to create a Data Wrangler flow, your instance now uses the latest version of Data Wrangler.
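You can also delete the apps with the AWS CLI instead of the console. The following is a sketch; the domain ID, user profile name, and app names are placeholders that you'd take from the `list-apps` output:

```shell
DOMAIN_ID="d-exampledomain"     # placeholder: your Studio domain ID
USER_PROFILE="example-user"     # placeholder: your user profile name

# List the apps for the user to find the Data Wrangler KernelGateway app name.
aws sagemaker list-apps \
  --domain-id-equals "$DOMAIN_ID" \
  --user-profile-name-equals "$USER_PROFILE"

# Delete the Data Wrangler KernelGateway app (placeholder app name).
aws sagemaker delete-app \
  --domain-id "$DOMAIN_ID" \
  --user-profile-name "$USER_PROFILE" \
  --app-type KernelGateway \
  --app-name "sagemaker-data-wrang-ml-m5-4xlarge-example"

# Delete the JupyterServer app (placeholder app name).
aws sagemaker delete-app \
  --domain-id "$DOMAIN_ID" \
  --user-profile-name "$USER_PROFILE" \
  --app-type JupyterServer \
  --app-name "default"
```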

Alternatively, if you are using a Data Wrangler application version that is not the latest version, and you have an existing Data Wrangler flow open, you are prompted to update your Data Wrangler application version in the Studio Classic UI. The following screenshot shows this prompt. 

**Important**  
This updates the Data Wrangler kernel gateway app only. You still need to shut down the JupyterServer app in your user account. To do this, follow the preceding steps.

![\[The Update Data Wrangler section in the Data Wrangler console.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/data-wrangler-1click-restart.png)


You can also choose **Remind me later**, in which case an **Update** button appears in the top-right corner of the screen.

![\[The location of the Update in the Data Wrangler console.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/data-wrangler-1click-restart-update.png)


# Shut Down Data Wrangler


When you are not using Data Wrangler, it is important to shut down the instance on which it runs to avoid incurring additional fees. 

To avoid losing work, save your data flow before shutting Data Wrangler down. To save your data flow in Studio Classic, choose **File** and then choose **Save Data Wrangler Flow**. Data Wrangler automatically saves your data flow every 60 seconds. 

**To shut down the Data Wrangler instance in Studio Classic**

1. In Studio Classic, select the **Running Instances and Kernels** icon (![\[Icon of a gear or cog symbol representing settings or configuration options.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/icons/studio_classic_dw_instances.png)).

1. Under **RUNNING APPS** is the **sagemaker-data-wrangler-1.0** app. Select the shutdown icon (![\[Power button icon with a circular shape and vertical line symbol.\]](http://docs.aws.amazon.com/sagemaker/latest/dg/images/icons/Shutdown_light.png)) next to this app.

   Data Wrangler runs on an ml.m5.4xlarge instance. This instance disappears from **RUNNING INSTANCES** when you shut down the Data Wrangler app.

**Important**  
If you open Data Wrangler again, an Amazon EC2 instance starts running the application and you will be charged for the compute. In addition to compute, you are also charged for the storage that you use. For example, you're charged for any Amazon S3 buckets that you're using with Data Wrangler.  
If you find that you're still getting charged for Data Wrangler after shutting down your applications, there's a Jupyter extension that you can use to automatically shut down idle sessions. For information about the extension, see [SageMaker-Studio-Autoshutdown-Extension](https://github.com/aws-samples/sagemaker-studio-auto-shutdown-extension).

After you shut down the Data Wrangler app, it has to restart the next time you open a Data Wrangler flow file. This can take a few minutes. 