

# What is AWS Glue DataBrew?

AWS Glue DataBrew is a visual data preparation tool that enables users to clean and normalize data without writing any code. Using DataBrew helps reduce the time it takes to prepare data for analytics and machine learning (ML) by up to 80 percent, compared to custom-developed data preparation. You can choose from over 250 ready-made transformations to automate data preparation tasks, such as filtering anomalies, converting data to standard formats, and correcting invalid values.

Using DataBrew, business analysts, data scientists, and data engineers can more easily collaborate to get insights from raw data. Because DataBrew is serverless, no matter what your technical level, you can explore and transform terabytes of raw data without needing to create clusters or manage any infrastructure.

With the intuitive DataBrew interface, you can interactively discover, visualize, clean, and transform raw data. DataBrew makes smart suggestions to help you identify data quality issues that can be difficult to find and time-consuming to fix. With DataBrew preparing your data, you can use your time to act on the results and iterate more quickly. You can save transformations as steps in a recipe, which you can update or reuse later with other datasets, and deploy on a continuing basis.

The following image shows how DataBrew works at a high level.

![\[A simple diagram about how DataBrew works. DataBrew can visually clean, prepare, and transform data without the need to write code. A box shows data entering DataBrew from Amazon S3. It shows boxes for a few of the transforms that DataBrew can do. The transform boxes include the following: Format, clean and standardize data. Restructure and transform data. Handle missing and invalid data. Handle categorical variables. Handle numerical variables. Use natural language processing. The diagram shows that the data is exported to S3 as a prepared dataset.\]](http://docs.aws.amazon.com/databrew/latest/dg/images/databrew-overview-diagram.png)


To use DataBrew, you create a project and connect to your data. In the project workspace, you see your data displayed in a grid-like visual interface. Here, you can explore the data and see value distributions and charts to understand its profile. 

To prepare the data, you can choose from more than 250 point-and-click transformations. These include removing nulls, replacing missing values, fixing schema inconsistencies, creating columns based on functions, and many more. You can also use transformations to apply natural language processing (NLP) techniques to split sentences into phrases. Immediate previews show a portion of your data before and after transformation, so you can modify your recipe before applying it to the entire dataset. 

After DataBrew has run your recipe on your dataset, the output is stored in Amazon Simple Storage Service (Amazon S3). After your cleansed, prepared dataset is in Amazon S3, your other data storage or data management systems can ingest it.
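Using the AWS SDK, this flow corresponds to creating a recipe job whose output location points at an S3 bucket. The following sketch builds the request parameters for boto3's `databrew` `create_recipe_job` call; the job, dataset, recipe, role, and bucket names are hypothetical examples, and the actual API call is shown as a comment so the snippet runs without AWS credentials.

```python
# Sketch: parameters for a DataBrew recipe job that writes prepared output
# to Amazon S3. All names and ARNs below are hypothetical placeholders.

def build_recipe_job_params(job_name, dataset_name, recipe_name, role_arn, bucket):
    """Build the keyword arguments for boto3's databrew create_recipe_job."""
    return {
        "Name": job_name,
        "DatasetName": dataset_name,
        "RecipeReference": {"Name": recipe_name, "RecipeVersion": "1.0"},
        "RoleArn": role_arn,
        # The prepared dataset lands under this S3 location.
        "Outputs": [{"Location": {"Bucket": bucket, "Key": "prepared/"}}],
    }

params = build_recipe_job_params(
    "clean-orders-job",
    "orders-dataset",
    "orders-recipe",
    "arn:aws:iam::123456789012:role/DataBrewServiceRole",
    "my-prepared-data-bucket",
)

# With credentials configured, you would then submit the job:
# import boto3
# boto3.client("databrew").create_recipe_job(**params)
```

The role ARN must grant DataBrew read access to the dataset source and write access to the output bucket.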

# Core concepts and terms in AWS Glue DataBrew

Following, you can find an overview of the core concepts and terminology in AWS Glue DataBrew. After you read this section, see [Getting started with AWS Glue DataBrew](getting-started.md), which walks you through the process of creating projects, connecting datasets, and running jobs.

**Topics**
+ [Project](#projects-concept)
+ [Dataset](#datasets-concept)
+ [Recipe](#recipes-concept)
+ [Job](#jobs-concept)
+ [Data lineage](#data-lineage-concept)
+ [Data profile](#data-profile-concept)

## Project

The interactive data preparation workspace in DataBrew is called a *project*. Using a data project, you manage a collection of related items: data, transformations, and scheduled processes. As part of creating a project, you choose or create a dataset to work on. Next, you create a *recipe*, which is a set of instructions or steps that you want DataBrew to act on. These actions transform your raw data into a form that is ready to be consumed by your data pipeline.

## Dataset

Dataset simply means a set of data: rows or records that are divided into columns or fields. When you create a DataBrew project, you connect to or upload data that you want to transform or prepare. DataBrew can work with data from any source: you can import data from formatted files, or connect directly to a growing list of data stores.

For DataBrew, a *dataset* is a read-only connection to your data. DataBrew collects a set of descriptive metadata to refer to the data; it doesn't alter or store your actual data. For simplicity, we use dataset to refer to both the actual dataset and the metadata DataBrew uses.

## Recipe

In DataBrew, a *recipe* is a set of instructions or steps for data that you want DataBrew to act on. A recipe can contain many steps, and each step can contain many actions. You use the transformation tools on the toolbar to set up all the changes that you want to make to your data. Later, when you're ready to see the finished product of your recipe, you assign this job to DataBrew and schedule it. DataBrew stores the instructions about the data transformation, but it doesn't store any of your actual data. You can download and reuse recipes in other projects. You can also publish multiple versions of a recipe.
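Because DataBrew stores only the transformation instructions, a recipe is essentially an ordered list of steps, each wrapping one action. The sketch below shows that shape as the JSON-like structure the service works with; the operation names and parameters here are illustrative, not a complete reference.

```python
# Sketch: the general shape of DataBrew recipe steps. The specific
# operations and parameters shown are illustrative examples.

recipe_steps = [
    {
        "Action": {
            # Rename a column (illustrative operation).
            "Operation": "RENAME",
            "Parameters": {"sourceColumn": "cust_id", "targetColumn": "customer_id"},
        }
    },
    {
        "Action": {
            # Remove rows with problem values in a column (illustrative).
            "Operation": "REMOVE_VALUES",
            "Parameters": {"sourceColumn": "order_total"},
        }
    },
]

# A recipe is the ordered list of such steps; DataBrew applies them in order.
```

Publishing a recipe snapshots this list as a named version, which is what lets you reuse and re-run it against other datasets.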

## Job

DataBrew takes on the job of transforming your data by running the instructions that you set up when you made a recipe. The process of running these instructions is called a *job*. A job can put your data recipes into action according to a preset schedule. But you aren't confined to a schedule. You can also run jobs on demand. If you want to profile some data, you don't need a recipe. In that case, you can just set up a profile job to create a data profile.
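A profile job is the simplest case: it needs only a dataset, a role, and an S3 location for the report, with no recipe attached. The sketch below builds the parameters for boto3's `databrew` `create_profile_job` call and shows the on-demand run as a comment; all names and ARNs are hypothetical, and the snippet makes no AWS calls as written.

```python
# Sketch: a profile job needs no recipe -- only a dataset, a role, and an
# S3 output location for the report. Names below are hypothetical.

def build_profile_job_params(job_name, dataset_name, role_arn, bucket):
    """Build the keyword arguments for boto3's databrew create_profile_job."""
    return {
        "Name": job_name,
        "DatasetName": dataset_name,
        "RoleArn": role_arn,
        # The data profile report is written under this S3 location.
        "OutputLocation": {"Bucket": bucket, "Key": "profiles/"},
    }

params = build_profile_job_params(
    "profile-orders",
    "orders-dataset",
    "arn:aws:iam::123456789012:role/DataBrewServiceRole",
    "my-profile-bucket",
)

# With credentials configured, create the job and run it on demand:
# import boto3
# client = boto3.client("databrew")
# client.create_profile_job(**params)
# client.start_job_run(Name=params["Name"])
```

The same `start_job_run` call runs a recipe job on demand, independent of any schedule attached to it.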

## Data lineage

DataBrew tracks your data in a visual interface called a *data lineage* view, which shows how the data flows through different entities from its origin. You can see where the data came from, what other entities influenced it, what happened to it over time, and where it was stored.

## Data profile

When you profile your data, DataBrew creates a report called a *data profile*. This summary tells you about the existing shape of your data, including the context of the content, the structure of the data, and its relationships. You can make a data profile for any dataset by running a data profile job. 

# Product and service integrations


Use this section to learn which products and services integrate with DataBrew.

DataBrew works with the following AWS services for networking, management, and governance:
+ [Amazon CloudFront](https://docs.aws.amazon.com/AmazonCloudFront/latest/DeveloperGuide/Introduction.html)
+ [AWS CloudFormation](https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/Welcome.html)
+ [AWS CloudTrail](https://docs.aws.amazon.com/awscloudtrail/latest/userguide/cloudtrail-user-guide.html)
+ [Amazon CloudWatch](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/WhatIsCloudWatch.html)
+ [AWS Step Functions](https://docs.aws.amazon.com/step-functions/latest/dg/connect-databrew.html)

DataBrew works with the following AWS data lakes and data stores:
+ [AWS Lake Formation](https://docs.aws.amazon.com/lake-formation/latest/dg/what-is-lake-formation.html)
+ [Amazon S3](https://docs.aws.amazon.com/AmazonS3/latest/dev/Welcome.html)

DataBrew supports the following file formats and extensions for uploading data.


| **Format** | **File extension (optional)** |  **Extensions for compressed files (required)**  | 
| --- | --- | --- | 
|  Comma-separated values  |  `.csv`  |  `.gz`  `.snappy` `.lz4` `.bz2` `.deflate`  | 
| Microsoft Excel workbook |  `.xlsx`  | No compression support | 
|  JSON (JSON document and JSON lines)  |  `.json, .jsonl`  |  `.gz` `.snappy` `.lz4` `.bz2` `.deflate`  | 
| Apache ORC |  `.orc`  |  `.zlib` `.snappy`  | 
| Apache Parquet |  `.parquet`  |  `.gz` `.snappy` `.lz4`  | 
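The rule in the table above is that the format extension is optional on its own but required in front of a compression extension. The helper below encodes that rule as an illustration of how you might pre-check filenames before uploading; it is our own sketch, not the service's actual validation logic.

```python
# Sketch: pre-checking upload filenames against the input-format table
# above. This helper is illustrative, not DataBrew's own validation.

INPUT_FORMATS = {
    ".csv": {".gz", ".snappy", ".lz4", ".bz2", ".deflate"},
    ".xlsx": set(),  # Excel workbooks: no compression support
    ".json": {".gz", ".snappy", ".lz4", ".bz2", ".deflate"},
    ".jsonl": {".gz", ".snappy", ".lz4", ".bz2", ".deflate"},
    ".orc": {".zlib", ".snappy"},
    ".parquet": {".gz", ".snappy", ".lz4"},
}

def is_supported_upload(filename: str) -> bool:
    """Return True if the filename ends in a supported format extension,
    optionally followed by a supported compression extension."""
    name = filename.lower()
    for fmt, compressions in INPUT_FORMATS.items():
        if name.endswith(fmt):
            return True  # uncompressed, e.g. "sales.parquet"
        if any(name.endswith(fmt + c) for c in compressions):
            return True  # compressed, e.g. "orders.csv.gz"
    return False
```

For example, `orders.csv.gz` and `sales.parquet` pass, while `report.xlsx.gz` does not, because Excel workbooks have no compression support.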

DataBrew writes output files to Amazon S3 and supports the following file formats and extensions.


| **Format** |  **File extension (uncompressed)**  |  **File extensions (compressed)**  | 
| --- | --- | --- | 
|  Comma-separated values  | .csv | .csv.snappy, .csv.gz, .csv.lz4, .csv.bz2, .csv.deflate, .csv.br | 
|  Tab-separated values  | .tsv | .tsv.snappy, .tsv.gz, .tsv.lz4, .tsv.bz2, .tsv.deflate, .tsv.br | 
| Apache Parquet  | .parquet | .parquet.snappy, .parquet.gz, .parquet.lz4, .parquet.lzo, .parquet.br | 
| AWS Glue Parquet | Not supported | .glue.parquet.snappy | 
| Apache Avro | .avro | .avro.snappy, .avro.gz, .avro.lz4, .avro.bz2, .avro.deflate, .avro.br | 
| Apache ORC | .orc | .orc.snappy, .orc.lzo, .orc.zlib | 
| XML | .xml | .xml.snappy, .xml.gz, .xml.lz4, .xml.bz2, .xml.deflate, .xml.br | 
| JSON (JSON Lines format only) |  .json  | .json.snappy, .json.gz, .json.lz4, .json.bz2, .json.deflate, .json.br | 
| Tableau Hyper | Not supported | Not applicable | 