

# Connecting to data with AWS Glue DataBrew
<a name="datasets"></a>

In AWS Glue DataBrew, a *dataset* represents data that's either uploaded from a file or stored elsewhere. For example, data can be stored in Amazon S3, in a supported JDBC data source, or in the AWS Glue Data Catalog. If you're not uploading a file directly to DataBrew, the dataset also contains details on how DataBrew can connect to the data.

When you create your dataset (for example, `inventory-dataset`), you enter the connection details only once. From that point, DataBrew can access the underlying data for you. With this approach, you can create projects and develop transformations for your data, without having to worry about connection details or file formats.

**Topics**
+ [Supported file types for data sources](supported-data-file-sources.md)
+ [Supported connections for data sources and outputs](supported-data-connection-sources.md)
+ [Using datasets in AWS Glue DataBrew](datasets.creating.md)
+ [Connecting to your data](datasets.connecting-to-data.md)
+ [Connecting to data in a text file with DataBrew](datasets.connecting-to-data-in-text-files.md)
+ [Connecting data in multiple files in Amazon S3](datasets.multiple-files.md)
+ [Data types](datatypes.md)
+ [Advanced data types](projects.adv-data-types.md)

# Supported file types for data sources
<a name="supported-data-file-sources"></a>

The following file requirements apply to files stored in Amazon S3 and to files that you upload from a local drive. DataBrew supports the following file formats: comma-separated value (CSV), Microsoft Excel, JSON, ORC, and Parquet. You can use files with a nonstandard extension or no extension if the file is of one of the supported types. 

If DataBrew is unable to infer the file type, make sure to select the correct file type yourself (CSV, Excel, JSON, ORC, or Parquet). Compressed CSV, JSON, ORC, and Parquet files are supported, but CSV and JSON files must include the compression codec as the file extension. If you are importing a folder, all files in the folder must be of the same file type.

File formats and supported compression algorithms are shown in the following table.

**Note**  
CSV, Excel, and JSON files must be encoded with Unicode (UTF-8).<a name="datasets.databrew-input-formats"></a>


| **Format** | **File extension (optional)** |  **Extensions for compressed files (required)**  | 
| --- | --- | --- | 
|  Comma-separated values  |  `.csv`  |  `.gz`  `.snappy` `.lz4` `.bz2` `.deflate`  | 
| Microsoft Excel workbook |  `.xlsx`  | No compression support | 
|  JSON (JSON document and JSON lines)  |  `.json, .jsonl`  |  `.gz` `.snappy` `.lz4` `.bz2` `.deflate`  | 
| Apache ORC |  `.orc`  |  `.zlib` `.snappy`  | 
| Apache Parquet |  `.parquet`  |  `.gz` `.snappy` `.lz4`  | 
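
As an illustration of the extension rules above (not DataBrew's actual inference logic), the following Python sketch strips a trailing compression codec and then maps the remaining extension to a format:

```
# Illustrative only: rough extension-based format inference following the
# table above. This is not how DataBrew itself infers file types.
import pathlib

FORMATS = {
    ".csv": "csv", ".xlsx": "excel", ".json": "json",
    ".jsonl": "json", ".orc": "orc", ".parquet": "parquet",
}
CODECS = {".gz", ".snappy", ".lz4", ".bz2", ".deflate", ".zlib"}

def infer_format(filename):
    """Strip a trailing compression codec, then map the remaining extension."""
    suffixes = pathlib.PurePath(filename).suffixes
    if suffixes and suffixes[-1] in CODECS:
        suffixes = suffixes[:-1]          # e.g. "data.csv.gz" -> [".csv"]
    return FORMATS.get(suffixes[-1]) if suffixes else None
```

For example, `infer_format("data.csv.gz")` returns `"csv"`, while `infer_format("data.gz")` returns `None`, mirroring the requirement that compressed CSV and JSON files include the compression codec as the file extension.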

# Supported connections for data sources and outputs
<a name="supported-data-connection-sources"></a>

You can connect to the following data sources for DataBrew recipe jobs. These include any source of data that isn't a file you're uploading directly to DataBrew. The data source that you're using might be called a database, a data warehouse, or something else. We refer to all data providers as data sources or connections. 

You can create a dataset using any of the following as data sources. 

You can also use Amazon S3, AWS Glue Data Catalog, or JDBC databases supported through Amazon RDS for the output of DataBrew recipe jobs. Amazon AppFlow and AWS Data Exchange aren't supported data stores for the output of DataBrew recipe jobs.
+ **Amazon S3**

  You can use S3 to store and protect any amount of data. To create a dataset, you specify an S3 URL where DataBrew can access a data file, for example: `s3://your-bucket-name/inventory-data.csv`

  DataBrew can also read all of the files in an S3 folder, which means that you can create a dataset that spans multiple files. To do this, specify an S3 URL in this form: `s3://your-bucket-name/your-folder-name/`.

   DataBrew supports only the following Amazon S3 storage classes: Standard, Reduced Redundancy, Standard-IA, and S3 One Zone-IA. DataBrew ignores files with other storage classes. DataBrew also ignores empty files (files containing 0 bytes). For more information about Amazon S3 storage classes, see [ Using Amazon S3 storage classes](https://docs.aws.amazon.com/AmazonS3/latest/userguide/storage-class-intro.html) in the *Amazon S3 Console User Guide*. 
+ **AWS Glue Data Catalog**

  You can use the Data Catalog to define references to data that's stored in the AWS Cloud. With the Data Catalog, you can build connections to individual tables in the following services:
  + Data Catalog Amazon S3
  + Data Catalog Amazon Redshift
  + Data Catalog Amazon RDS
  + AWS Glue

  DataBrew can also read all of the files in an Amazon S3 folder, which means that you can create a dataset that spans multiple files. To do this, specify an Amazon S3 URL in this form: `s3://your-bucket-name/your-folder-name/`

  To be used with DataBrew, Amazon S3 tables defined in the AWS Glue Data Catalog must have a table property named `classification` that identifies the format of the data as `csv`, `json`, or `parquet`, and the `typeOfData` as `file`. If the table property wasn't added when the table was created, you can add it using the AWS Glue console. 

   DataBrew supports only the Amazon S3 storage classes Standard, Reduced Redundancy, Standard-IA, and S3 One Zone-IA. DataBrew ignores files with other storage classes. DataBrew also ignores empty files (files containing 0 bytes). For more information about Amazon S3 storage classes, see [ Using Amazon S3 storage classes](https://docs.aws.amazon.com/AmazonS3/latest/userguide/storage-class-intro.html) in the *Amazon S3 Console User Guide*. 

  DataBrew can also access AWS Glue Data Catalog S3 tables from other accounts if an appropriate resource policy is created. You can create a policy in the AWS Glue console on the **Settings** tab under **Data Catalog**. The following is an example policy specifically for a single AWS Region.
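
  The policy itself isn't reproduced here; the following sketch shows the general shape of such a policy, with `us-east-1` and the `$ACCOUNT_TO`/`$ACCOUNT_FROM` placeholders standing in for your own Region and account IDs:

  ```
  {
      "Version": "2012-10-17",
      "Statement": [
          {
              "Effect": "Allow",
              "Principal": { "AWS": ["arn:aws:iam::$ACCOUNT_TO:root"] },
              "Action": ["glue:*"],
              "Resource": ["arn:aws:glue:us-east-1:$ACCOUNT_FROM:*"]
          }
      ]
  }
  ```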
**Warning**  
This is a highly permissive resource policy that grants `*$ACCOUNT_TO*` unrestricted access to the Data Catalog of `*$ACCOUNT_FROM*`. In most cases, we recommend that you lock your resource policy down to specific catalogs or tables. For more information, see [AWS Glue resource policies for access control](https://docs.aws.amazon.com/glue/latest/dg/glue-resource-policies.html) in the *AWS Glue Developer Guide*.

  In some cases, you might want to create a project or run a job in AWS Glue DataBrew in `*$ACCOUNT_TO*` with an AWS Glue Data Catalog S3 table in `*$ACCOUNT_FROM*` that points to an S3 location that is also in `*$ACCOUNT_FROM*`. In such cases, the IAM role used when creating the project and job in `*$ACCOUNT_TO*` must have permission to list and get objects in that S3 location from `*$ACCOUNT_FROM*`. For more information, see [Granting cross-account access](https://docs.aws.amazon.com/glue/latest/dg/cross-account-access.html) in the *AWS Glue Developer Guide*.
+ **Data connected using JDBC drivers**

  You can create a dataset by connecting to data with a supported JDBC driver. For more information, see [Using drivers with AWS Glue DataBrew](dbms-driver-connections.md).

  DataBrew officially supports the following data sources using Java Database Connectivity (JDBC): 
  + Microsoft SQL Server
  + MySQL
  + Oracle
  + PostgreSQL
  + Amazon Redshift
  + Snowflake Connector for Spark

  The data sources can be located anywhere that you can connect to them from DataBrew. This list includes only JDBC connections that we've tested and can therefore support.

  Amazon Redshift and Snowflake Connector for Spark data sources can be connected in either of the following ways:
  + With a table name.
  + With a SQL query that spans multiple tables and operations.

  SQL queries are executed when you start a project or a job run.

  To connect to data that requires an unlisted JDBC driver, make sure that the driver is compatible with JDK 8. To use the driver, store it in S3 in a bucket where you can access it with your IAM role for DataBrew. Then point your dataset at the driver file. For more information, see [Using drivers with AWS Glue DataBrew](dbms-driver-connections.md). 

  Example query for a SQL-based dataset:

  ```
  SELECT
      * 
  FROM
      public.customer as c 
  JOIN
      public.customer_address as ca on c.current_address=ca.current_address
  WHERE
      ca.address_id>0 AND ca.address_id<10001 ORDER BY ca.address_id
  ```

  **Limitations of Custom SQL**

  If you use a JDBC connection to access data for a DataBrew dataset, keep in mind the following:
  + AWS Glue DataBrew does not validate the custom SQL you provide as part of dataset creation. The SQL query will be executed when you start a project or job run. DataBrew takes the query you provide and passes it to the database engine using the default or provided JDBC drivers.
  + A dataset created with an invalid query will fail when it is used in a project or job. Validate your query before creating the dataset.
  + The **Validate SQL** feature is only available for Amazon Redshift-based data sources.
  + If you want to use a dataset in a project, limit SQL query runtime to under three minutes to avoid a timeout during project loading. Check the query runtime before creating a project.
+ **Amazon AppFlow**

  Using Amazon AppFlow, you can transfer data into Amazon S3 from third-party Software-as-a-Service (SaaS) applications such as Salesforce, Zendesk, Slack, and ServiceNow. You can then use the data to create a DataBrew dataset.

  In Amazon AppFlow, you create a connection and a flow to transfer data between your third-party application and a destination application. When using Amazon AppFlow with DataBrew, make sure that the Amazon AppFlow destination application is Amazon S3. Amazon AppFlow destination applications other than Amazon S3 don't appear in the DataBrew console. For more information on transferring data from your third-party application and creating Amazon AppFlow connections and flows, see the [Amazon AppFlow documentation](https://docs.aws.amazon.com/appflow/index.html). 

  When you choose **Connect new dataset** on the **Datasets** tab of DataBrew and choose **Amazon AppFlow**, you see all flows in Amazon AppFlow that are configured with Amazon S3 as the destination application. To use a flow's data for your dataset, choose that flow.

   Choosing **Create flow**, **Manage flows**, and **View details** for Amazon AppFlow in the DataBrew console opens the Amazon AppFlow console so that you can perform those tasks. 

  After you create a dataset from Amazon AppFlow, you can run the flow and view the latest flow run details when viewing dataset details or job details. When you run the flow in DataBrew, the dataset is updated in S3 and is ready to be used in DataBrew.

  The following situations can arise when you select an Amazon AppFlow flow in the DataBrew console to create a dataset:
  + **Data hasn't been aggregated** - If the flow trigger is **Run on demand** or is **Run on schedule** with full data transfer, make sure to aggregate the data for the flow before using it to create a DataBrew dataset. Aggregating the flow combines all records in the flow into a single file. Flows with the trigger type **Run on schedule** with incremental data transfer, or **Run on event** don't require aggregation. To aggregate data in Amazon AppFlow, choose **Edit flow configuration** > **Destination details** > **Additional settings** > **Data transfer preference**.
  + **Flow hasn't been run** - If the run status for a flow is empty, it means one of the following:
    + If the trigger for running the flow is **Run on demand**, the flow has not yet been run.
    + If the trigger for running the flow is **Run on event**, the triggering event has not yet occurred.
    + If the trigger for running the flow is **Run on schedule**, a scheduled run has not yet occurred.

    Before creating a dataset with a flow, choose **Run flow** for that flow.

    For more information, see [Amazon AppFlow flows](https://docs.aws.amazon.com/appflow/latest/userguide/flows.html) in the *Amazon AppFlow User Guide*.
+ **AWS Data Exchange**

  You can choose from hundreds of third-party data sources that are available in AWS Data Exchange. By subscribing to these data sources, you get the most up-to-date version of the data.

  To create a dataset, you specify the name of an AWS Data Exchange data product that you're subscribed to and entitled to use.
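
Most of the console workflows above also have API equivalents. As an illustration, the following Python sketch assembles the request for creating an Amazon S3-backed dataset with the AWS SDK for Python (boto3); the dataset, bucket, and file names are placeholders, and the commented-out call requires valid AWS credentials.

```
# Sketch: building a create_dataset request for an S3-backed dataset.
# The names used here are placeholders, not real resources.
def build_s3_dataset_request(name, bucket, key):
    """Assemble parameters for databrew.create_dataset for an S3 file.

    Ending the key with "/" points the dataset at a whole folder, as
    described above.
    """
    return {
        "Name": name,
        "Input": {"S3InputDefinition": {"Bucket": bucket, "Key": key}},
    }

request = build_s3_dataset_request(
    "inventory-dataset", "your-bucket-name", "inventory-data.csv"
)
# import boto3                                   # requires AWS credentials
# boto3.client("databrew").create_dataset(**request)
```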

# Using datasets in AWS Glue DataBrew
<a name="datasets.creating"></a>

To view a list of your datasets in the DataBrew console, choose **DATASET** at left. On the datasets page, you can view detailed information for each dataset by choosing its name or by choosing **Actions**, **Edit** from its context menu. 

To create a new dataset, you choose **DATASET**, **Connect new dataset**. Different data sources have different connection parameters, and you enter these so that DataBrew can connect. When you save your connection and choose **Create dataset**, DataBrew connects to your data and begins loading data. For more information, see [Connecting to your data](datasets.connecting-to-data.md). 

The dataset page has the following elements to help you explore your data.

**Dataset preview** – On this tab, you can find connection information for the dataset and an overview of its structure, as shown following.

![\[Dataset details and preview showing metadata and sample rows from a JSON file stored in S3.\]](http://docs.aws.amazon.com/databrew/latest/dg/images/dataset-preview.png)


**Data profile overview** – On this tab, you can find a graphical data profile of statistics and volumetrics for your dataset, as shown following.

![\[Data profile overview showing dataset summary, data types, missing cells, and correlations.\]](http://docs.aws.amazon.com/databrew/latest/dg/images/data-profile-overview2.png)


**Note**  
To create a data profile, run a DataBrew profile job on your dataset. For information about how to do this, see [Step 5: Create a data profile](getting-started.05.md).

**Column statistics** – On this tab, you can find detailed statistics about each column in your dataset, as shown following.

![\[Data profile overview showing column statistics, data quality, and value distribution for a dataset.\]](http://docs.aws.amazon.com/databrew/latest/dg/images/dataset-column-stats.png)


**Data lineage** – This tab shows a graphical representation of how your dataset was created and how it's used in DataBrew, as shown following.

![\[Data lineage diagram showing dataset creation and usage flow in DataBrew.\]](http://docs.aws.amazon.com/databrew/latest/dg/images/dataset-lineage.png)


**Topics**
+ [Deleting a dataset](datasets.deleting.md)

# Deleting a dataset
<a name="datasets.deleting"></a>

If you no longer need a dataset, you can delete it. Deleting a dataset doesn't affect the underlying data source in any way. It simply removes the information that DataBrew used to access the data source.

You can't delete a dataset if any other DataBrew resources rely on it. For example, if you currently have a DataBrew project that uses the dataset, delete the project first before you delete the dataset.

To delete a dataset, choose **Dataset** from the navigation pane. Choose the dataset that you want to delete, and then for **Actions**, choose **Delete**.

# Connecting to your data
<a name="datasets.connecting-to-data"></a>

For more information on connecting to the following data sources, choose the section that applies to you. 
+ **AWS Glue Data Catalog** – You can use the Data Catalog to define references to data objects stored in the AWS Cloud, including the following services:
  + Amazon Redshift
  + Aurora MySQL
  + Aurora PostgreSQL
  + Amazon RDS for MySQL
  + Amazon RDS for PostgreSQL

  DataBrew recognizes all Lake Formation permissions that have been applied to Data Catalog resources, so DataBrew users can only access these resources if they're authorized.

  To create a dataset, you specify a Data Catalog database name and a table name. DataBrew takes care of the other connection details.
+ **AWS Data Exchange** – You can choose from hundreds of third-party data sources that are available in AWS Data Exchange. By subscribing to these data sources, you always have the most up-to-date version of the data.

  To create a dataset, you specify the name of a Data Exchange data product that you're subscribed to or entitled to use.
+  **JDBC driver connections** – You can create a dataset by connecting DataBrew to a JDBC-compatible data source. DataBrew supports connecting to the following sources through JDBC:
  + Amazon Redshift
  + Microsoft SQL Server
  + MySQL
  + Oracle
  + PostgreSQL
  + Snowflake

**Topics**
+ [Using drivers with AWS Glue DataBrew](dbms-driver-connections.md)
+ [Supported JDBC drivers](jdbc-drivers.md)

# Using drivers with AWS Glue DataBrew
<a name="dbms-driver-connections"></a>

A *database driver* is a file or URL that implements a database connection protocol, for example Java Database Connectivity (JDBC). The driver functions as an adaptor or a translator between a specific database management system (DBMS) and another system. 

In this case, it allows AWS Glue DataBrew to connect to your data. Then you can access a database object, like a table or view, from a supported data source. The data source that you're using might be called a database, a data warehouse, or something else. However, for the purpose of this documentation we refer to all data providers as data sources or connections.<a name="jdbc"></a><a name="download-jars"></a>

To use a JDBC driver or jar file, download the file or files that you need and put them in an S3 bucket. The IAM role that you use to access the data must have read permissions for the driver files. 

**Note**  
With AWS Glue 4.0, connecting to Snowflake as a data source is supported natively. You don't need to provide custom `jar` files. In AWS Glue DataBrew, choose Snowflake as the External source connection and provide the URL of your Snowflake instance. The URL will use a hostname in the form `https://account_identifier.snowflakecomputing.com`.   
 Provide the data access credentials, Snowflake database name, and Snowflake schema name. Additionally, if your Snowflake user does not have a default warehouse set, you will need to provide a warehouse name.   
 Snowflake connections use an AWS Secrets Manager secret to provide credential information. Your project and job roles must have permission to read this secret.   

![\[The screenshot shows a Connection access wizard with Snowflake selected as the External source. Complete the other fields to finish the connection: JDBC URL and Database access credentials.\]](http://docs.aws.amazon.com/databrew/latest/dg/images/databrew-connection-access-snowflake-new.png)


**To use drivers with DataBrew**

1. Find out which version of your data source you're on, using the method provided by the product. 

1. Find the latest version of the connector and driver required. You can locate this information on the data provider's website.

1. Download the required version of the JDBC files. These are normally distributed as Java archive (`.jar`) files.

1. Either upload the drivers from the console to your S3 bucket or provide the S3 path to your `.jar` files.

1. Enter the basic connection details, for example class, instance, and so on.

1. Enter any additional configuration information that your data source needs, for example virtual private cloud (VPC) information.

# Supported JDBC drivers
<a name="jdbc-drivers"></a>


| Product | Supported version | Driver instructions and downloads | SQL queries supported | 
| --- | --- | --- | --- | 
|   Microsoft SQL Server  |  v6.x or higher  |  [Microsoft JDBC Driver for SQL Server](https://docs.microsoft.com/sql/connect/jdbc/download-microsoft-jdbc-driver-for-sql-server)  | Not supported | 
|   MySQL  |  v5.1 or higher  |  [MySQL Connectors](https://www.mysql.com/products/connector/)  | Not supported | 
|   Oracle  |  v11.2 or higher  |  [Oracle JDBC downloads](https://www.oracle.com/database/technologies/appdev/jdbc-downloads.html)  | Not supported | 
|   PostgreSQL  |  v4.2.x or higher  |  [PostgreSQL JDBC driver](https://jdbc.postgresql.org/download.html)  | Not supported | 
|   Amazon Redshift  |  v4.1 or higher  |  [Connecting to Amazon Redshift with JDBC](https://docs.aws.amazon.com/redshift/latest/mgmt/configuring-connections.html#connecting-drivers)  | Supported | 
|   Snowflake  |  To see your Snowflake version, use [CURRENT\_VERSION](https://docs.snowflake.com/en/sql-reference/functions/current_version.html) as described in the Snowflake documentation.  |  To connect to Snowflake you need both of the following:  [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/databrew/latest/dg/jdbc-drivers.html)  | Supported | 

To connect to databases or data warehouses that require a different version of the driver from what DataBrew natively supports, you can provide a JDBC driver of your choice. The driver must be compatible with JDK 8 or Java 8. For instructions on how to find the latest driver version for your database, see [Using drivers with AWS Glue DataBrew](dbms-driver-connections.md).

# Connecting to data in a text file with DataBrew
<a name="datasets.connecting-to-data-in-text-files"></a>

You can configure the following format options for the input files that DataBrew supports:
+ **Comma-separated value (CSV) files**
  + **Delimiters**

    The default delimiter is a comma for .csv files. If your file uses a different delimiter, choose the delimiter for **CSV delimiter** in the **Additional configurations** section when you create your dataset. The following delimiters are supported for .csv files:
    + Comma (,)
    + Colon (:)
    + Semi-colon (;)
    + Pipe (|)
    + Tab (\t)
    + Caret (^)
    + Backslash (\\)
    + Space
  + **Column header values**

    Your CSV file can include a header row as the first row of the file. If it doesn't, DataBrew creates a header row for you.
    + If your CSV file includes a header row, choose **Treat first row as header**. If you do, the first row of your CSV file is treated as containing the column header values. 
    + If your CSV file doesn't include a header row, choose **Add default header**. If you do, DataBrew creates a header row for the file and doesn't treat your first row of data as containing header values. The headers that DataBrew creates consist of the word `Column`, an underscore, and a number for each column in the file, in the format `Column_1`, `Column_2`, `Column_3`, and so on.
+ **JSON files**

  DataBrew supports two formats for JSON files, JSON Lines and JSON document. JSON Lines files contain one row per line. In JSON document files, all rows are contained in a single JSON structure or an array. You can specify your JSON file type in the **Additional configurations** section when you create a JSON dataset. The default format is JSON Lines. 
+ **Excel files**

  The following apply to Excel sheets in DataBrew:
  + **Excel sheet loading**

    By default, DataBrew loads the first sheet in your Excel file. However, you can specify a different sheet number or sheet name in the **Additional configurations** section when you create an Excel dataset.
  + **Column header values**

    Your Excel sheets can include a header row as the first row of the file, but if they don't, DataBrew will create a header row for you.
    + If your Excel sheets include a header row, choose **Treat first row as header**. If you do, the first row of your Excel sheets is treated as containing the column header values. 
    + If your Excel file doesn't include a header row, choose **Add default header**. By doing this, you specify that DataBrew should create a header row for the file and not treat your first row of data as containing header values. The headers that DataBrew creates consist of the word `Column`, an underscore, and a number for each column in the file, in the format `Column_1`, `Column_2`, `Column_3`, and so on.
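
As a quick illustration, the default header naming described above (for both CSV and Excel files) amounts to this simple rule; the function name here is ours, not a DataBrew API:

```
def default_headers(num_columns):
    # DataBrew-style default names: Column_1, Column_2, ..., Column_n
    return [f"Column_{i}" for i in range(1, num_columns + 1)]
```

For example, `default_headers(3)` returns `["Column_1", "Column_2", "Column_3"]`.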

# Connecting data in multiple files in Amazon S3
<a name="datasets.multiple-files"></a>

With the DataBrew console, you can navigate Amazon S3 buckets and folders and choose a file for your dataset. However, a dataset doesn't need to be limited to one file.

Suppose that you have an S3 bucket named `my-databrew-bucket` that contains a folder named `databrew-input`. In that folder, suppose that you have a number of JSON files, all with the same file format and `.json` file extension. On the console, you can specify a source URL of `s3://my-databrew-bucket/databrew-input/`. On the DataBrew console, you can then choose this folder. Your dataset consists of all the JSON files in that folder.

DataBrew can process all of the files in an S3 folder, but only if the following conditions are true:
+ All of the files in the folder have the same format.
+ All of the files in the folder have the same file extension.

For more information on supported file formats and extensions, see [DataBrew input formats](supported-data-file-sources.md#datasets.databrew-input-formats).

## Schemas when using multiple files as a dataset
<a name="datasets.multiple-files-schemas-multiple-files"></a>

When you use multiple files as a DataBrew dataset, the schemas must be the same across all the files. Otherwise, the project workspace automatically chooses one of the schemas from the files and tries to conform the rest of the dataset files to it. This behavior makes the view shown in the project workspace irregular, and as a result, the job output is also irregular.

If your files must have different schemas, you need to create multiple datasets and profile them separately.

## Using parameterized paths for Amazon S3
<a name="datasets.multiple-files.parameterized-s3"></a>

In some cases, you might want to create a dataset with files that follow a certain naming convention, or a dataset that can span multiple Amazon S3 folders. Or you might want to reuse the same dataset for identically structured data that is periodically generated in an S3 location with a path that depends on certain parameters. An example is a path named for the date of data production.

DataBrew supports this approach with parameterized S3 paths. A *parameterized path* is an Amazon S3 URL containing regular expressions or custom path parameters, or both. 

### Defining a dataset with an S3 path using regular expressions
<a name="defining.dataset.with.s3.path.using.regex"></a>

Regular expressions in the path can be useful to match several files from one or more folders and at the same time filter out unrelated files in those folders.

Here are a couple of examples:
+ Define a dataset including all JSON files from a folder whose name begins with `invoice`.
+ Define a dataset including all files in folders with `2020` in their names.

You can implement this type of approach by using regular expressions in a dataset S3 path. These regular expressions can replace any substring in the key of the S3 URL (but not the bucket name). 

As an example of a key in an S3 URL, see the following. Here, `my-bucket` is the bucket name, US West (Oregon) is the AWS Region, and `puppy.png` is the key name. 

`https://my-bucket.s3.us-west-2.amazonaws.com/puppy.png` 

In a parameterized S3 path, any characters between two angle brackets (`<` and `>`) are treated as regular expressions. Two examples are the following:
+ `s3://my-databrew-bucket/databrew-input/invoice<.*>/data.json` matches all files named `data.json`, within all of the subfolders of `databrew-input` whose names begin with `invoice`.
+ `s3://my-databrew-bucket/databrew-input/<.*>2020<.*>/` matches all files in folders with `2020` in their names.

In these examples, `.*` matches zero or more characters.
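
To get a feel for how such a parameterized path behaves, you can expand the angle-bracket segments into an ordinary regular expression and test candidate S3 URLs against it. The following Python sketch is illustrative only; it isn't DataBrew's matching implementation:

```
import re

def s3_path_to_regex(path):
    """Turn a DataBrew-style parameterized S3 path into a compiled regex.

    Text between < and > is kept as a regular expression; everything
    else is escaped. Illustrative only, not DataBrew's implementation.
    """
    parts = re.split(r"<(.*?)>", path)
    # Even-indexed parts are literal text; odd-indexed parts are regex bodies.
    pattern = "".join(p if i % 2 else re.escape(p) for i, p in enumerate(parts))
    return re.compile(pattern + r"$")

pattern = s3_path_to_regex("s3://my-databrew-bucket/databrew-input/invoice<.*>/data.json")
```

Here `pattern.match` accepts `s3://my-databrew-bucket/databrew-input/invoice-2020/data.json` but rejects a `data.json` that isn't under a folder whose name begins with `invoice`.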

**Note**  
You can only use regular expressions in the key part of the S3 path—the part that goes after the bucket name. Thus, `s3://my-databrew-bucket/<.*>-input/` is valid, but `s3://my-<.*>-bucket/<.*>-input/` isn't.

We recommend that you test your regular expressions to ensure that they match only the S3 URLs that you want, and not ones that you don't want. 

Here are some other examples of regular expressions:
+ `<\d{2}>` matches a string that consists of exactly two consecutive digits, for example `07` or `03`, but not `1a2`.
+ `<[a-z]+.*>` matches a string that begins with one or more lowercase Latin letters and has zero or more other characters after it. An example is `a3`, `abc/def`, or `a-z`, but not `A2`.
+ `<[^/]+>` matches a string that contains any characters except for a slash (**/**). In an S3 URL, slashes are used for separating folders in the path.
+ `<.*=.*>` matches a string that contains an equals sign (=), for example `month=02`, `abc/day=2`, or `=10`, but not `test`.
+ `<\d.*\d>` matches a string that begins and ends with a digit and can have any other characters in between the digits, for example `1abc2`, `01-02-03`, or `2020/Jul/21`, but not `123a`.
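
You can sanity-check the bracketed patterns above against the sample strings by using Python's `re.fullmatch`, which requires the whole string to match:

```
import re

# Each pattern from the list above, with strings that should and
# shouldn't match it.
examples = {
    r"\d{2}":    (["07", "03"],                         ["1a2"]),
    r"[a-z]+.*": (["a3", "abc/def", "a-z"],             ["A2"]),
    r"[^/]+":    (["month=02"],                         ["abc/def"]),
    r".*=.*":    (["month=02", "abc/day=2", "=10"],     ["test"]),
    r"\d.*\d":   (["1abc2", "01-02-03", "2020/Jul/21"], ["123a"]),
}
for pattern, (should_match, should_not) in examples.items():
    for s in should_match:
        assert re.fullmatch(pattern, s), (pattern, s)
    for s in should_not:
        assert not re.fullmatch(pattern, s), (pattern, s)
```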

### Defining a dataset with an S3 path using custom parameters
<a name="defining.dataset.with.s3.path.using.custom.parameters"></a>

Defining a parameterized dataset using custom parameters offers advantages over using regular expressions:
+ You can achieve the same results as with a regular expression, without needing to know regular expression syntax. You can define parameters using familiar terms like "starts with" and "contains."
+ When you define a dynamic dataset using parameters in the path, you can include a time range in your definition, such as "past month" or "past 24 hours." That way, your dataset definition continues to apply to new incoming data.

Here are some examples of when you might want to use dynamic datasets:
+ To connect multiple files that are partitioned by *last updated* date or other meaningful attributes into a single dataset. You can then capture these partition attributes as additional columns in a dataset.
+ To restrict files in a dataset to S3 locations that satisfy certain conditions. For example, suppose that your S3 path contains date-based folders like `folder/2021/04/01/`. In this case, you can parameterize the date and restrict it to a certain range like "between Mar 01 2021 and Apr 01 2021" or "Past week."

To define a path using parameters, define the parameters and add them to your path using the following format: 

`s3://my-databrew-bucket/some-folder/{parameter1}/file-{parameter2}.json`

**Note**  
As with regular expressions in an S3 path, you can only use parameters in the key part of the path—the part that goes after the bucket name.

Two fields are required in a parameter definition: name and type. The type can be **String**, **Number**, or **Date**. Parameters of type **Date** must include a date format so that DataBrew can correctly interpret and compare date values. Optionally, you can define matching conditions for a parameter. You can also choose to add the matching values of a parameter as a column to your dataset when it's loaded by a DataBrew job or interactive session. 
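
For illustration, here is how a date parameter like the one in the example that follows might look when you create a dataset through the AWS SDK for Python (boto3). The `PathOptions` field names follow our reading of the `create_dataset` API (note that the API calls the **Date** type `Datetime`); verify them against the current API reference before relying on this sketch:

```
# Hedged sketch: a dataset whose S3 key embeds a date parameter.
# Bucket, key, and parameter names are placeholders.
dataset_request = {
    "Name": "daily-reports",
    "Input": {
        "S3InputDefinition": {
            "Bucket": "my-databrew-bucket",
            "Key": "some-folder/file-{report_date}.json",
        }
    },
    "PathOptions": {
        "Parameters": {
            "report_date": {
                "Name": "report_date",
                "Type": "Datetime",              # console label: Date
                "DatetimeOptions": {"Format": "yyyy-MM-dd"},
                "CreateColumn": True,            # "Add as column"
            }
        }
    },
}
# import boto3
# boto3.client("databrew").create_dataset(**dataset_request)
```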

#### Example
<a name="dynamic.dataset.example"></a>

Let's consider an example of defining a dynamic dataset using parameters in the DataBrew console. In this example, assume that the input data is regularly written into an S3 bucket using locations like these:
+ `s3://databrew-dynamic-datasets/new-cases/UR/daily-report-2021-03-30.csv`
+ `s3://databrew-dynamic-datasets/new-cases/UR/daily-report-2021-03-31.csv`
+ `s3://databrew-dynamic-datasets/new-cases/US/daily-report-2021-03-30.csv`
+ `s3://databrew-dynamic-datasets/new-cases/US/daily-report-2021-03-31.csv`

The paths have two dynamic parts: a country code, such as `US`, and a date in the file name, such as `2021-03-30`. The same cleanup recipe applies to all of these files. Suppose that you want to run your cleanup job daily. You can define a parameterized path for this scenario as follows: 

1. Navigate to a specific file.

1. Select a varying part of the path (in this case, the date) to replace it with a parameter.  
![\[S3 bucket path with dynamic dataset folder and daily report CSV file for US region.\]](http://docs.aws.amazon.com/databrew/latest/dg/images/dynamic-datasets-01.png)

1. Choose **Create custom parameter** from the context (right-click) menu, and set the following properties:
   + Name: report date
   + Type: Date
   + Date format: yyyy-MM-dd (selected from the predefined formats)
   + Conditions (Time range): Past 24 hours
   + Add as column: true (checked)

   Keep other fields at their default values.

1. Choose **Create**.

After you do, you see the updated path, as in the following screenshot.

![\[S3 source input field with a file path and matching files count displayed below.\]](http://docs.aws.amazon.com/databrew/latest/dg/images/dynamic-datasets-02.png)


Now you can do the same for the country code and parameterize it as follows:
+ Name: country code
+ Type: String
+ Add as column: true (checked)

You don't have to specify conditions if all values are relevant. In this example, the `new-cases` folder contains only subfolders named with country codes, so no conditions are needed. If there were other folders to exclude, you might use the following condition.

![\[Text input field showing a string value with alphanumeric characters in square brackets.\]](http://docs.aws.amazon.com/databrew/latest/dg/images/dynamic-datasets-03.png)


This condition limits the matching subfolders of `new-cases` to names that consist of two capital Latin letters.
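Read as a regular expression, such a condition can be written as `[A-Z]{2}`: exactly two characters, each a capital Latin letter. A quick, illustrative check of what it accepts:

```python
import re

# Anchored pattern requiring exactly two capital Latin letters,
# mirroring the "matches" condition described above.
country_code = re.compile(r"^[A-Z]{2}$")

print(bool(country_code.match("US")))   # matches
print(bool(country_code.match("us")))   # lowercase: no match
print(bool(country_code.match("USA")))  # three letters: no match
```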

After this parameterization, only matching files are included in your dataset, and you can choose **Create Dataset**.

**Note**  
When you use relative time ranges in conditions, the time ranges are evaluated when the dataset is loaded. This is true whether they are predefined time ranges, such as "Past 24 hours", or custom time ranges, such as "5 days ago". This evaluation approach applies whether the dataset is loaded during an interactive session initialization or during a job start.

After you choose **Create Dataset**, your dynamic dataset is ready to use. For example, you might first use it to create a project and define a cleanup recipe in an interactive DataBrew session. You might then create a job that is scheduled to run daily and applies the cleanup recipe to the dataset files that meet the conditions of your parameters at the time the job starts.

### Supported conditions for dynamic datasets
<a name="conditions.for.dynamic.datasets"></a>

You can use conditions to filter matching S3 files by parameter values or by the last modified date attribute. 

Following, you can find lists of supported conditions for each parameter type.


**Conditions used with String parameters**  

| Name in DataBrew SDK | SDK synonyms | Name in DataBrew console | Description | 
| --- | --- | --- | --- | 
|  is  |  eq, ==  |  Is exactly  |  The value of the parameter is the same as the value that was provided in the condition.  | 
|  is not  |  not eq, \!=  |  Is not  |  The value of the parameter isn't the same as the value that was provided in the condition.  | 
|  contains  |    |  Contains  |  The string value of the parameter contains the value that was provided in the condition.  | 
|  not contains  |    |  Does not contain  |  The string value of the parameter doesn't contain the value that was provided in the condition.  | 
|  starts\_with  |    |  Starts with  |  The string value of the parameter starts with the value that was provided in the condition.  | 
|  not starts\_with  |    |  Does not start with  |  The string value of the parameter doesn't start with the value that was provided in the condition.  | 
|  ends\_with  |    |  Ends with  |  The string value of the parameter ends with the value that was provided in the condition.  | 
|  not ends\_with  |    |  Does not end with  |  The string value of the parameter doesn't end with the value that was provided in the condition.  | 
|  matches  |    |  Matches  |  The value of the parameter matches the regular expression provided in the condition.  | 
|  not matches  |    |  Does not match  |  The value of the parameter doesn't match the regular expression provided in the condition.  | 

**Note**  
All conditions for String parameters use case-sensitive comparison. If you aren't sure about the case used in an S3 path, you can use the "matches" condition with a regular expression value that starts with `(?i)`. Doing this results in a case-insensitive comparison.   
For example, suppose that you want your string parameter to start with `abc`, but `Abc` or `ABC` are also possible. In this case, you can use the "matches" condition with `(?i)^abc` as the condition value. 
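In other words, the `(?i)` prefix switches the whole pattern to case-insensitive matching. The following snippet shows how that pattern behaves:

```python
import re

# The case-insensitive trick from the note: (?i) makes the "matches"
# pattern accept any casing of "abc" at the start of the value.
pattern = re.compile(r"(?i)^abc")

for value in ("abc-report", "Abc-report", "ABC-report", "xyz-report"):
    # The first three values match; the last one does not.
    print(value, bool(pattern.match(value)))
```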


**Conditions used with Number parameters**  

| Name in DataBrew SDK | SDK synonyms | Name in DataBrew console | Description | 
| --- | --- | --- | --- | 
|  is  |  eq, ==  |  Is exactly  |  The value of the parameter is the same as the value that was provided in the condition.  | 
|  is not  |  not eq, \!=  |  Is not  |  The value of the parameter isn't the same as the value that was provided in the condition.  | 
|  less\_than  |  lt, <  |  Less than  |  The numeric value of the parameter is less than the value that was provided in the condition.  | 
|  less\_than\_equal  |  lte, <=  |  Less than or equal to  |  The numeric value of the parameter is less than or equal to the value that was provided in the condition.  | 
|  greater\_than  |  gt, >  |  Greater than  |  The numeric value of the parameter is greater than the value that was provided in the condition.  | 
|  greater\_than\_equal  |  gte, >=  |  Greater than or equal to  |  The numeric value of the parameter is greater than or equal to the value that was provided in the condition.  | 


**Conditions used with Date parameters**  

| Name in DataBrew SDK | Name in DataBrew console | Condition value format (SDK) | Description | 
| --- | --- | --- | --- | 
|  after  |  Start  |  ISO 8601 date format like `2021-03-30T01:00:00Z` or `2021-03-30T01:00-07:00`  |  The value of the date parameter is after the date provided in the condition.  | 
|  before  |  End  |  ISO 8601 date format like `2021-03-30T01:00:00Z` or `2021-03-30T01:00-07:00`  |  The value of the date parameter is before the date provided in the condition.  | 
|  relative\_after  |  Start (relative)  |  Positive or negative number of time units, like `-48h` or `+7d`.   |  The value of the date parameter is after the relative date provided in the condition. Relative dates are evaluated when the dataset is loaded, either when an interactive session is initialized or when an associated job is started. This is the moment that is called "now" in the examples.  | 
|  relative\_before  |  End (relative)  |  Positive or negative number of time units, like `-48h` or `+7d`.   |  The value of the date parameter is before the relative date provided in the condition. Relative dates are evaluated when the dataset is loaded, either when an interactive session is initialized or when an associated job is started. This is the moment that is called "now" in the examples.  | 

If you use the SDK, provide relative dates in the following format: `±{number_of_time_units}{time_unit}`. The following examples show the supported time units:
+ -1h (1 hour ago)
+ \+2d (2 days from now)
+ -120m (120 minutes ago)
+ 5000s (5,000 seconds from now)
+ -3w (3 weeks ago)
+ \+4M (4 months from now)
+ -1y (1 year ago)

Relative dates are evaluated when the dataset is loaded, either when an interactive session is initialized or when an associated job is started. This is the moment that is called "now" in the examples preceding.
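To sanity-check a relative-date value before using it, you can mirror the documented semantics in a few lines of Python. This is only an illustration: DataBrew evaluates these expressions itself, and the month and year units below are approximated with fixed day counts, which the service may handle differently.

```python
from datetime import datetime, timedelta, timezone

# Unit letters are case-sensitive: "m" is minutes, "M" is months.
UNITS = {"s": "seconds", "m": "minutes", "h": "hours",
         "d": "days", "w": "weeks"}

def resolve_relative(value, now=None):
    """Resolve a ±{number}{unit} string (like "-48h") against "now"."""
    now = now or datetime.now(timezone.utc)
    amount, unit = int(value[:-1]), value[-1]
    if unit == "M":                                # months: approximate
        return now + timedelta(days=30 * amount)
    if unit == "y":                                # years: approximate
        return now + timedelta(days=365 * amount)
    return now + timedelta(**{UNITS[unit]: amount})

now = datetime(2021, 3, 31, tzinfo=timezone.utc)
print(resolve_relative("-48h", now))  # two days before "now"
print(resolve_relative("+7d", now))   # one week after "now"
```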

### Configuring settings for dynamic datasets
<a name="dynamic.datasets.configuration"></a>

Besides providing a parameterized S3 path, you can configure other settings for datasets with multiple files. With these settings, you can filter S3 files by their last modified date and limit the number of files. 

Similar to setting a date parameter in a path, you can define a time range during which matching files were last updated and include only those files in your dataset. You can define these ranges using either absolute dates, such as "March 30, 2021", or relative ranges, such as "Past 24 hours".

![\[Dropdown menu showing "Past 24 hours" option for specifying last updated date range.\]](http://docs.aws.amazon.com/databrew/latest/dg/images/dynamic-datasets-04.png)


To limit the number of matching files, specify a number of files greater than 0 and choose whether to include the latest or the oldest matching files.

![\[UI options to specify number of files to include, with dropdown for Latest and input for quantity.\]](http://docs.aws.amazon.com/databrew/latest/dg/images/dynamic-datasets-05.png)
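Expressed through the SDK, both settings live alongside the parameters in the `PathOptions` structure of `CreateDataset`. The following sketch shows one plausible shape, with illustrative values; verify the field names and allowed values against the current DataBrew API reference before use.

```python
# Sketch of the same settings in PathOptions: keep files last modified
# in the past 24 hours, and of those, at most the 10 most recent.
path_options = {
    "LastModifiedDateCondition": {
        "Expression": "relative_after :updated",
        "ValuesMap": {":updated": "-24h"},
    },
    "FilesLimit": {
        "MaxFiles": 10,
        "OrderedBy": "LAST_MODIFIED_DATE",
        "Order": "DESCENDING",  # DESCENDING keeps the latest files
    },
}
print(sorted(path_options))
```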


# Data types
<a name="datatypes"></a>

The data for each column of your dataset is converted to one of the following data types:
+ **byte** – 1-byte signed integer numbers. The range of numbers is from -128 to 127.
+ **short** – 2-byte signed integer numbers. The range of numbers is from -32768 to 32767.
+ **integer** – 4-byte signed integer numbers. The range of numbers is from -2147483648 to 2147483647.
+ **long** – 8-byte signed integer numbers. The range of numbers is from -9223372036854775808 to 9223372036854775807.
+ **float** – 4-byte single-precision floating point numbers.
+ **double** – 8-byte double-precision floating point numbers.
+ **decimal** – Signed decimal numbers with up to 38 digits total and 18 digits after the decimal point.
+ **string** – Character string values.
+ **boolean** – Has one of two possible values: `true` and `false`, or `yes` and `no`.
+ **timestamp** – Values comprising the fields year, month, day, hour, minute, and second.
+ **date** – Values comprising the fields year, month, and day.
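The integer ranges above follow from two's-complement storage: an *n*-byte signed type holds values from `-(2^(8n-1))` through `2^(8n-1) - 1`. A quick check:

```python
# Signed two's-complement range for each integer type:
# n bytes give -(2**(8*n - 1)) .. 2**(8*n - 1) - 1.
for name, nbytes in [("byte", 1), ("short", 2), ("integer", 4), ("long", 8)]:
    bits = 8 * nbytes
    low, high = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    print(f"{name}: {low} to {high}")  # e.g. "byte: -128 to 127"
```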

## Advanced data types
<a name="advanced-datatypes"></a>

*Advanced data types* are data types that DataBrew detects within a string column in a project, and therefore are not part of a dataset. For information about advanced data types, see [ Advanced data types](projects.adv-data-types.md#projects.adv-data-types.title).

# Advanced data types
<a name="projects.adv-data-types"></a>

*Advanced data types* are data types that DataBrew detects within a string column in a project by means of pattern matching. When you select a string column, DataBrew flags it as the corresponding advanced data type if 50% or more of the values in the column meet the criteria for that data type.

The data types DataBrew can detect are:
+ Date/timestamp
+ SSN
+ Phone number
+ Email
+ Credit card
+ Gender
+ IP address
+ URL
+ Zipcode
+ Country
+ Currency
+ State
+ City
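The 50% rule above can be illustrated with a toy detector for one of these types, email. This is only an illustration of the threshold; DataBrew's actual detection patterns aren't published here, and the email regular expression below is deliberately simplistic.

```python
import re

# Deliberately simple email pattern, used only to demonstrate the
# 50%-of-values threshold -- not DataBrew's real detector.
EMAIL = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def flag_as_email(values):
    """Flag the column as Email if at least half its values match."""
    matches = sum(1 for v in values if EMAIL.match(v))
    return matches >= len(values) / 2

print(flag_as_email(["a@example.com", "b@example.com", "not-an-email"]))
```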

You can use the following transforms to work with advanced data types:
+ [GET\_ADVANCED\_DATATYPE](recipe-actions.GET_ADVANCED_DATATYPE.md): Given a string column, identifies the advanced data type of the column, if any.
+ [EXTRACT\_ADVANCED\_DATATYPE\_DETAILS](recipe-actions.EXTRACT_ADVANCED_DATATYPE_DETAILS.md): Extracts details for an advanced data type.
+ [ADVANCED\_DATATYPE\_FILTER](recipe-actions.ADVANCED_DATATYPE_FILTER.md): Filters the current source column based on advanced data type detection.
+ [ADVANCED\_DATATYPE\_FLAG](recipe-actions.ADVANCED_DATATYPE_FLAG.md): Creates a new flag column based on the values of the current source column.