

# Using crawlers to populate the Data Catalog
<a name="add-crawler"></a>

You can use an AWS Glue crawler to populate the AWS Glue Data Catalog with databases and tables. This is the primary method used by most AWS Glue users. A crawler can crawl multiple data stores in a single run. Upon completion, the crawler creates or updates one or more tables in your Data Catalog. Extract, transform, and load (ETL) jobs that you define in AWS Glue use these Data Catalog tables as sources and targets. The ETL job reads from and writes to the data stores that are specified in the source and target Data Catalog tables.

## Workflow
<a name="crawler-workflow"></a>

The following workflow diagram shows how AWS Glue crawlers interact with data stores and other elements to populate the Data Catalog.

![\[Workflow showing how AWS Glue crawler populates the Data Catalog in 5 basic steps.\]](http://docs.aws.amazon.com/glue/latest/dg/images/PopulateCatalog-overview.png)


The following is the general workflow for how a crawler populates the AWS Glue Data Catalog:

1. A crawler runs any custom *classifiers* that you choose to infer the format and schema of your data. You provide the code for custom classifiers, and they run in the order that you specify.

   The first custom classifier to successfully recognize the structure of your data is used to create a schema. Custom classifiers lower in the list are skipped.

1. If no custom classifier matches your data's schema, built-in classifiers try to recognize your data's schema. An example of a built-in classifier is one that recognizes JSON.

1. The crawler connects to the data store. Some data stores require connection properties for crawler access.

1. The inferred schema is created for your data.

1. The crawler writes metadata to the Data Catalog. A table definition contains metadata about the data in your data store. The table is written to a database, which is a container of tables in the Data Catalog. Attributes of a table include classification, which is a label created by the classifier that inferred the table schema.
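The five steps above can be sketched with the AWS Glue API via `boto3`. This is a minimal illustration, not a complete setup: the crawler name, role ARN, database, classifier name, and S3 path are placeholder assumptions you would replace with your own.

```python
CRAWLER_NAME = "example-crawler"  # hypothetical crawler name
ROLE_ARN = "arn:aws:iam::111122223333:role/AWSGlueServiceRole-example"  # placeholder

def build_crawler_request():
    """Request body for glue.create_crawler; all names and paths are placeholders."""
    return {
        "Name": CRAWLER_NAME,
        "Role": ROLE_ARN,
        "DatabaseName": "example_db",  # Data Catalog database that receives the tables
        "Targets": {"S3Targets": [{"Path": "s3://example-bucket/data/"}]},
        # Custom classifiers run first, in the order listed here (steps 1-2).
        "Classifiers": ["my-custom-classifier"],
    }

def create_and_run():
    import boto3  # deferred so the request builder can be inspected without the SDK
    glue = boto3.client("glue")
    glue.create_crawler(**build_crawler_request())
    glue.start_crawler(Name=CRAWLER_NAME)  # steps 3-5 happen during this run
```

Calling `create_and_run()` requires AWS credentials with permission to create and start crawlers; `build_crawler_request()` can be inspected on its own.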

**Topics**
+ [Workflow](#crawler-workflow)
+ [How crawlers work](#crawler-running)
+ [How does a crawler determine when to create partitions?](#crawler-s3-folder-table-partition)
+ [Supported data sources for crawling](crawler-data-stores.md)
+ [Crawler prerequisites](crawler-prereqs.md)
+ [Defining and managing classifiers](add-classifier.md)
+ [Configuring a crawler](define-crawler.md)
+ [Scheduling a crawler](schedule-crawler.md)
+ [Viewing crawler results and details](console-crawlers-details.md)
+ [Customizing crawler behavior](crawler-configuration.md)
+ [Tutorial: Adding an AWS Glue crawler](tutorial-add-crawler.md)

## How crawlers work
<a name="crawler-running"></a>

When a crawler runs, it takes the following actions to interrogate a data store:
+ **Classifies data to determine the format, schema, and associated properties of the raw data** – You can configure the results of classification by creating a custom classifier.
+ **Groups data into tables or partitions** – Data is grouped based on crawler heuristics.
+ **Writes metadata to the Data Catalog** – You can configure how the crawler adds, updates, and deletes tables and partitions.

When you define a crawler, you choose one or more classifiers that evaluate the format of your data to infer a schema. When the crawler runs, the first classifier in your list to successfully recognize your data store is used to create a schema for your table. You can use built-in classifiers or define your own. You define your custom classifiers in a separate operation, before you define the crawlers. AWS Glue provides built-in classifiers to infer schemas from common files with formats that include JSON, CSV, and Apache Avro. For the current list of built-in classifiers in AWS Glue, see [Built-in classifiers](add-classifier.md#classifier-built-in). 

The metadata tables that a crawler creates are contained in a database that you specify when you define the crawler. If your crawler does not specify a database, your tables are placed in the default database. In addition, each table has a classification column that is filled in by the first classifier that successfully recognized the data store.

If the file that is crawled is compressed, the crawler must download it to process it. When a crawler runs, it interrogates files to determine their format and compression type and writes these properties into the Data Catalog. Some file formats (for example, Apache Parquet) enable you to compress parts of the file as it is written. For these files, the compressed data is an internal component of the file, and AWS Glue does not populate the `compressionType` property when it writes tables into the Data Catalog. In contrast, if an *entire file* is compressed by a compression algorithm (for example, gzip), then the `compressionType` property is populated when tables are written into the Data Catalog. 

The crawler generates the names for the tables that it creates. The names of the tables that are stored in the AWS Glue Data Catalog follow these rules:
+ Only alphanumeric characters and underscore (`_`) are allowed.
+ A custom prefix cannot be longer than 64 characters.
+ Names cannot be longer than 128 characters. The crawler truncates generated names to fit within the limit.
+ If duplicate table names are encountered, the crawler adds a hash string suffix to the name.
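As a rough sketch of the naming rules above (this approximates the documented behavior; it is not AWS Glue's actual algorithm, and the hash length is an assumption):

```python
import hashlib
import re

MAX_NAME_LEN = 128    # documented table-name limit
MAX_PREFIX_LEN = 64   # documented custom-prefix limit

def normalize_table_name(name, existing=(), prefix=""):
    """Approximate the documented naming rules; not Glue's real implementation."""
    if len(prefix) > MAX_PREFIX_LEN:
        raise ValueError("custom prefix cannot exceed 64 characters")
    # Only alphanumeric characters and underscore are allowed.
    candidate = re.sub(r"[^A-Za-z0-9_]", "_", prefix + name).lower()
    # Generated names are truncated to fit within the 128-character limit.
    candidate = candidate[:MAX_NAME_LEN]
    # On a duplicate, a hash string suffix is appended (8 chars is an assumption).
    if candidate in existing:
        suffix = "_" + hashlib.sha256(candidate.encode()).hexdigest()[:8]
        candidate = candidate[: MAX_NAME_LEN - len(suffix)] + suffix
    return candidate
```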

If your crawler runs more than once, perhaps on a schedule, it looks for new or changed files or tables in your data store. The output of the crawler includes new tables and partitions found since a previous run.

## How does a crawler determine when to create partitions?
<a name="crawler-s3-folder-table-partition"></a>

When an AWS Glue crawler scans an Amazon S3 data store and detects multiple folders in a bucket, it determines the root of a table in the folder structure and which folders are partitions of the table. The name of the table is based on the Amazon S3 prefix or folder name. You provide an **Include path** that points to the folder level to crawl. When the majority of schemas at a folder level are similar, the crawler creates partitions of a table instead of separate tables. To influence the crawler to create separate tables, add each table's root folder as a separate data store when you define the crawler.

For example, consider the following Amazon S3 folder structure.

![\[Rectangles at multiple levels represent a folder hierarchy in Amazon S3. The top rectangle is labeled Sales. Rectangle below that is labeled year=2019. Two rectangles below that are labeled month=Jan and month=Feb. Each of those rectangles has two rectangles below them, labeled day=1 and day=2. All four "day" (bottom) rectangles have either two or four files under them. All rectangles and files are connected with lines.\]](http://docs.aws.amazon.com/glue/latest/dg/images/crawlers-s3-folders.png)


The paths to the four lowest level folders are the following:

```
s3://sales/year=2019/month=Jan/day=1
s3://sales/year=2019/month=Jan/day=2
s3://sales/year=2019/month=Feb/day=1
s3://sales/year=2019/month=Feb/day=2
```

Assume that the crawler target is set at `Sales`, and that all files in the `day=n` folders have the same format (for example, JSON, not encrypted), and have the same or very similar schemas. The crawler will create a single table with four partitions, with partition keys `year`, `month`, and `day`.

In the next example, consider the following Amazon S3 structure:

```
s3://bucket01/folder1/table1/partition1/file.txt
s3://bucket01/folder1/table1/partition2/file.txt
s3://bucket01/folder1/table1/partition3/file.txt
s3://bucket01/folder1/table2/partition4/file.txt
s3://bucket01/folder1/table2/partition5/file.txt
```

If the schemas for files under `table1` and `table2` are similar, and a single data store is defined in the crawler with **Include path** `s3://bucket01/folder1/`, the crawler creates a single table with two partition key columns. The first partition key column contains `table1` and `table2`, and the second partition key column contains `partition1` through `partition3` for the `table1` partition and `partition4` and `partition5` for the `table2` partition. To create two separate tables, define the crawler with two data stores. In this example, define the first **Include path** as `s3://bucket01/folder1/table1/` and the second as `s3://bucket01/folder1/table2`.
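The two-data-store definition described above can be sketched as a `boto3` call; the crawler name, role, and database are placeholders:

```python
def build_two_table_targets():
    """Two include paths so the crawler creates table1 and table2 separately."""
    return {
        "S3Targets": [
            {"Path": "s3://bucket01/folder1/table1/"},
            {"Path": "s3://bucket01/folder1/table2/"},
        ]
    }

def create_crawler(name, role_arn):
    import boto3  # deferred so the target builder can be inspected without the SDK
    glue = boto3.client("glue")
    glue.create_crawler(
        Name=name,
        Role=role_arn,
        DatabaseName="example_db",  # hypothetical database name
        Targets=build_two_table_targets(),
    )
```

With a single target of `s3://bucket01/folder1/` instead, the crawler would produce the one-table, two-partition-key result described above.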

**Note**  
In Amazon Athena, each table corresponds to an Amazon S3 prefix with all the objects in it. If objects have different schemas, Athena does not recognize different objects within the same prefix as separate tables. This can happen if a crawler creates multiple tables from the same Amazon S3 prefix. This might lead to queries in Athena that return zero results. For Athena to properly recognize and query tables, create the crawler with a separate **Include path** for each different table schema in the Amazon S3 folder structure. For more information, see [Best Practices When Using Athena with AWS Glue](https://docs.aws.amazon.com/athena/latest/ug/glue-best-practices.html) and this [AWS Knowledge Center article](https://aws.amazon.com/premiumsupport/knowledge-center/athena-empty-results/).

# Supported data sources for crawling
<a name="crawler-data-stores"></a>

Crawlers can crawl the following file-based and table-based data stores.


| Access type that crawler uses | Data stores | 
| --- | --- | 
| Native client |  [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/crawler-data-stores.html)  | 
| JDBC |  Amazon Redshift; Snowflake; and, within Amazon Relational Database Service (Amazon RDS) or external to Amazon RDS: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/crawler-data-stores.html)  | 
| MongoDB client |  [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/crawler-data-stores.html)  | 

**Note**  
Currently AWS Glue does not support crawlers for data streams.

For JDBC, MongoDB, MongoDB Atlas, and Amazon DocumentDB (with MongoDB compatibility) data stores, you must specify an AWS Glue *connection* that the crawler can use to connect to the data store. For Amazon S3, you can optionally specify a connection of type Network. A connection is a Data Catalog object that stores connection information, such as credentials, URL, Amazon Virtual Private Cloud information, and more. For more information, see [Connecting to data](glue-connections.md).
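A JDBC connection of the kind described above can be created through the API; the following is a minimal sketch in which the connection name, URL, and credentials are placeholders (in practice, prefer AWS Secrets Manager over inline credentials):

```python
def build_jdbc_connection(name, jdbc_url, username, password):
    """ConnectionInput for glue.create_connection; all values are placeholders."""
    return {
        "Name": name,
        "ConnectionType": "JDBC",
        "ConnectionProperties": {
            "JDBC_CONNECTION_URL": jdbc_url,
            "USERNAME": username,
            "PASSWORD": password,
        },
    }

def create_connection(conn_input):
    import boto3  # deferred so the builder can be inspected without the SDK
    boto3.client("glue").create_connection(ConnectionInput=conn_input)
```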

The following are the versions of drivers supported by the crawler:


| Product | Crawler supported driver | 
| --- | --- | 
| PostgreSQL | 42.2.1 | 
| Amazon Aurora | Same as native crawler drivers | 
| MariaDB | 8.0.13 | 
| Microsoft SQL Server | 6.1.0 | 
| MySQL | 8.0.13 | 
| Oracle | 11.2.2 | 
| Amazon Redshift | 4.1 | 
| Snowflake | 3.13.20 | 
| MongoDB | 4.7.2 | 
| MongoDB Atlas | 4.7.2 | 

The following are notes about the various data stores.

**Amazon S3**  
You can choose to crawl a path in your account or in another account. If all the Amazon S3 files in a folder have the same schema, the crawler creates one table. Also, if the Amazon S3 object is partitioned, only one metadata table is created and partition information is added to the Data Catalog for that table.

**Amazon S3 and Amazon DynamoDB**  
Crawlers use an AWS Identity and Access Management (IAM) role for permission to access your data stores. *The role you pass to the crawler must have permission to access Amazon S3 paths and Amazon DynamoDB tables that are crawled*.

**Amazon DynamoDB**  
When defining a crawler using the AWS Glue console, you specify one DynamoDB table. If you're using the AWS Glue API, you can specify a list of tables. You can choose to crawl only a small sample of the data to reduce crawler run times.
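Specifying a list of DynamoDB tables through the API, with sampling enabled to reduce run times, might look like the following sketch (the `scanRate` value of 0.5 is an arbitrary illustration, not a recommendation):

```python
def build_dynamodb_targets(table_names, sample_only=True):
    """DynamoDBTargets for glue.create_crawler. With scanAll set to False,
    the crawler reads only a sample of rows, which shortens run times."""
    return [
        {"Path": name, "scanAll": not sample_only, "scanRate": 0.5}
        for name in table_names
    ]
```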

**Delta Lake**  
For each Delta Lake data store, you specify how to create the Delta tables:  
+ **Create Native tables**: Allow integration with query engines that support querying of the Delta transaction log directly. For more information, see [Querying Delta Lake tables](https://docs.aws.amazon.com/athena/latest/ug/delta-lake-tables.html).
+ **Create Symlink tables**: Create a `_symlink_manifest` folder with manifest files partitioned by the partition keys, based on the specified configuration parameters.

**Iceberg**  
For each Iceberg data store, you specify an Amazon S3 path that contains the metadata for your Iceberg tables. If the crawler discovers Iceberg table metadata, it registers the metadata in the Data Catalog. You can set a schedule for the crawler to keep the tables updated.  
You can define these parameters for the data store:  
+ **Exclusions**: Allows you to skip certain folders.
+ **Maximum Traversal Depth**: Sets the depth limit the crawler can crawl in your Amazon S3 bucket. The default maximum traversal depth is 10 and the maximum depth you can set is 20.
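An Iceberg target with these parameters can be sketched as follows; the S3 path and exclusion pattern are placeholders:

```python
def build_iceberg_targets(s3_path, exclusions=(), max_depth=10):
    """IcebergTargets entry for glue.create_crawler. The default traversal
    depth is 10 and the maximum you can set is 20."""
    if max_depth > 20:
        raise ValueError("MaximumTraversalDepth cannot exceed 20")
    return [{
        "Paths": [s3_path],
        "Exclusions": list(exclusions),  # glob patterns for folders to skip
        "MaximumTraversalDepth": max_depth,
    }]
```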

**Hudi**  
For each Hudi data store, you specify an Amazon S3 path that contains the metadata for your Hudi tables. If the crawler discovers Hudi table metadata, it registers the metadata in the Data Catalog. You can set a schedule for the crawler to keep the tables updated.  
You can define these parameters for the data store:  
+ **Exclusions**: Allows you to skip certain folders.
+ **Maximum Traversal Depth**: Sets the depth limit the crawler can crawl in your Amazon S3 bucket. The default maximum traversal depth is 10 and the maximum depth you can set is 20.
Timestamp columns with a `millis` logical type are interpreted as `bigint`, due to an incompatibility between Hudi 0.13.1 and timestamp types. A resolution might be provided in an upcoming Hudi release.  
Hudi tables are categorized as follows, with specific implications for each:  
+ Copy on Write (CoW): Data is stored in a columnar format (Parquet), and each update creates a new version of files during a write.
+ Merge on Read (MoR): Data is stored using a combination of columnar (Parquet) and row-based (Avro) formats. Updates are logged to row-based delta files and are compacted as needed to create new versions of the columnar files.
With CoW datasets, each time there is an update to a record, the file that contains the record is rewritten with the updated values. With a MoR dataset, each time there is an update, Hudi writes only the row for the changed record. MoR is better suited for write- or change-heavy workloads with fewer reads. CoW is better suited for read-heavy workloads on data that change less frequently.  
Hudi provides three query types for accessing the data:  
+ Snapshot queries: Queries that see the latest snapshot of the table as of a given commit or compaction action. For MoR tables, snapshot queries expose the most recent state of the table by merging the base and delta files of the latest file slice at the time of the query.
+ Incremental queries: Queries see only new data written to the table since a given commit or compaction. This effectively provides change streams to enable incremental data pipelines.
+ Read optimized queries: For MoR tables, queries see the latest data compacted. For CoW tables, queries see the latest data committed.
For Copy on Write tables, the crawler creates a single table in the Data Catalog with the ReadOptimized serde `org.apache.hudi.hadoop.HoodieParquetInputFormat`.  
For Merge on Read tables, the crawler creates two tables in the Data Catalog for the same table location:  
+ A table with suffix `_ro` which uses the ReadOptimized serde `org.apache.hudi.hadoop.HoodieParquetInputFormat`.
+ A table with suffix `_rt` which uses the RealTime Serde allowing for Snapshot queries: `org.apache.hudi.hadoop.realtime.HoodieParquetRealtimeInputFormat`.

**MongoDB and Amazon DocumentDB (with MongoDB compatibility)**  
MongoDB versions 3.2 and later are supported. You can choose to crawl only a small sample of the data to reduce crawler run times.

**Relational database**  
Authentication is with a database user name and password. Depending on the type of database engine, you can choose which objects are crawled, such as databases, schemas, and tables.

**Snowflake**  
The Snowflake JDBC crawler supports crawling tables, external tables, views, and materialized views. The materialized view definition is not populated.  
For Snowflake external tables, the crawler crawls a table only if it points to an Amazon S3 location. In addition to the table schema, the crawler also crawls the Amazon S3 location and file format, and outputs them as table parameters in the Data Catalog table. Note that partition information for a partitioned external table is not populated.  
ETL is currently not supported for Data Catalog tables created using the Snowflake crawler.

# Crawler prerequisites
<a name="crawler-prereqs"></a>

The crawler assumes the permissions of the AWS Identity and Access Management (IAM) role that you specify when you define it. This IAM role must have permissions to extract data from your data store and write to the Data Catalog. The AWS Glue console lists only IAM roles that have a trust policy attached for the AWS Glue service principal. From the console, you can also create an IAM role with an IAM policy to access Amazon S3 data stores accessed by the crawler. For more information about providing roles for AWS Glue, see [Identity-based policies for AWS Glue](security_iam_service-with-iam.md#security_iam_service-with-iam-id-based-policies).

**Note**  
When crawling a Delta Lake data store, you must have Read/Write permissions to the Amazon S3 location.

For your crawler, you can create a role and attach the following policies:
+ The `AWSGlueServiceRole` AWS managed policy, which grants the required permissions on the Data Catalog.
+ An inline policy that grants permissions on the data source.
+ An inline policy that grants `iam:PassRole` permission on the role.
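Creating such a role programmatically might look like the following sketch; the role and policy names are placeholders, and the trust and `iam:PassRole` policy documents are standard IAM JSON expressed as Python dicts:

```python
import json

TRUST_POLICY = {  # lets the AWS Glue service principal assume the role
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "glue.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}

def build_pass_role_policy(role_arn):
    """Inline policy granting iam:PassRole on the crawler role itself."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": "iam:PassRole",
            "Resource": role_arn,
        }],
    }

def create_crawler_role(role_name):
    import boto3  # deferred so the policy documents can be inspected offline
    iam = boto3.client("iam")
    role = iam.create_role(
        RoleName=role_name,
        AssumeRolePolicyDocument=json.dumps(TRUST_POLICY),
    )
    iam.attach_role_policy(
        RoleName=role_name,
        PolicyArn="arn:aws:iam::aws:policy/service-role/AWSGlueServiceRole",
    )
    iam.put_role_policy(
        RoleName=role_name,
        PolicyName="pass-crawler-role",  # hypothetical policy name
        PolicyDocument=json.dumps(build_pass_role_policy(role["Role"]["Arn"])),
    )
```

You would still add an inline policy for the data source itself, such as the Amazon S3 and DynamoDB examples shown in this section.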

A quicker approach is to let the AWS Glue console crawler wizard create a role for you. The role that it creates is specifically for the crawler, and includes the `AWSGlueServiceRole` AWS managed policy plus the required inline policy for the specified data source.

If you specify an existing role for a crawler, ensure that it includes the `AWSGlueServiceRole` policy or equivalent (or a scoped down version of this policy), plus the required inline policies. For example, for an Amazon S3 data store, the inline policy would at a minimum be the following: 

```
{
  "Version":"2012-10-17",		 	 	 
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject"
      ],
      "Resource": [
        "arn:aws:s3:::bucket/object*"
      ]
    }
  ]
}
```


For an Amazon DynamoDB data store, the policy would at a minimum be the following: 


```
{
  "Version":"2012-10-17",		 	 	 
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "dynamodb:DescribeTable",
        "dynamodb:Scan"
      ],
      "Resource": [
        "arn:aws:dynamodb:us-east-1:111122223333:table/table-name*"
      ]
    }
  ]
}
```


In addition, if the crawler reads AWS Key Management Service (AWS KMS) encrypted Amazon S3 data, then the IAM role must have decrypt permission on the AWS KMS key. For more information, see [Step 2: Create an IAM role for AWS Glue](create-an-iam-role.md).

# Defining and managing classifiers
<a name="add-classifier"></a>

A classifier reads the data in a data store. If it recognizes the format of the data, it generates a schema. The classifier also returns a certainty number to indicate how certain the format recognition was. 

AWS Glue provides a set of built-in classifiers, but you can also create custom classifiers. AWS Glue invokes custom classifiers first, in the order that you specify in your crawler definition. Depending on the results that are returned from custom classifiers, AWS Glue might also invoke built-in classifiers. If a classifier returns `certainty=1.0` during processing, it indicates that it's 100 percent certain that it can create the correct schema. AWS Glue then uses the output of that classifier. 

If no classifier returns `certainty=1.0`, AWS Glue uses the output of the classifier that has the highest certainty. If no classifier returns a certainty greater than `0.0`, AWS Glue returns the default classification string of `UNKNOWN`.
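The selection rules above can be expressed as a short sketch (this mirrors the documented behavior, not AWS Glue's internal code):

```python
UNKNOWN = "UNKNOWN"  # default classification string

def select_classification(results):
    """results: (classification_string, certainty) pairs in invocation order.
    Returns the classification AWS Glue would use per the documented rules."""
    for name, certainty in results:
        if certainty == 1.0:
            return name  # first fully certain classifier wins; the rest are skipped
    # Otherwise, use the highest-certainty result, if any certainty exceeds 0.0.
    best = max(results, key=lambda r: r[1], default=(UNKNOWN, 0.0))
    return best[0] if best[1] > 0.0 else UNKNOWN
```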

## When do I use a classifier?
<a name="classifier-when-used"></a>

You use classifiers when you crawl a data store to define metadata tables in the AWS Glue Data Catalog. You can set up your crawler with an ordered set of classifiers. When the crawler invokes a classifier, the classifier determines whether the data is recognized. If the classifier can't recognize the data or is not 100 percent certain, the crawler invokes the next classifier in the list to determine whether it can recognize the data. 

 For more information about creating a classifier using the AWS Glue console, see [Creating classifiers using the AWS Glue console](console-classifiers.md). 

## Custom classifiers
<a name="classifier-defining"></a>

The output of a classifier includes a string that indicates the file's classification or format (for example, `json`) and the schema of the file. For custom classifiers, you define the logic for creating the schema based on the type of classifier. Classifier types include defining schemas based on grok patterns, XML tags, and JSON paths.

If you change a classifier definition, any data that was previously crawled using the classifier is not reclassified. A crawler keeps track of previously crawled data. New data is classified with the updated classifier, which might result in an updated schema. If the schema of your data has evolved, update the classifier to account for any schema changes when your crawler runs. To reclassify data to correct an incorrect classifier, create a new crawler with the updated classifier. 

For more information about creating custom classifiers in AWS Glue, see [Writing custom classifiers for diverse data formats](custom-classifier.md).

**Note**  
If your data format is recognized by one of the built-in classifiers, you don't need to create a custom classifier.

## Built-in classifiers
<a name="classifier-built-in"></a>

 AWS Glue provides built-in classifiers for various formats, including JSON, CSV, web logs, and many database systems.

If AWS Glue doesn't find a custom classifier that fits the input data format with 100 percent certainty, it invokes the built-in classifiers in the order shown in the following table. The built-in classifiers return a result to indicate whether the format matches (`certainty=1.0`) or does not match (`certainty=0.0`). The first classifier that has `certainty=1.0` provides the classification string and schema for a metadata table in your Data Catalog.


| Classifier type | Classification string | Notes | 
| --- | --- | --- | 
| Apache Avro | avro | Reads the schema at the beginning of the file to determine format. | 
| Apache ORC | orc | Reads the file metadata to determine format. | 
| Apache Parquet | parquet | Reads the schema at the end of the file to determine format. | 
| JSON | json | Reads the beginning of the file to determine format. | 
| Binary JSON | bson | Reads the beginning of the file to determine format. | 
| XML | xml | Reads the beginning of the file to determine format. AWS Glue determines the table schema based on XML tags in the document.  For information about creating a custom XML classifier to specify rows in the document, see [Writing XML custom classifiers](custom-classifier.md#custom-classifier-xml).  | 
| Amazon Ion | ion | Reads the beginning of the file to determine format. | 
| Combined Apache log | combined\_apache | Determines log formats through a grok pattern. | 
| Apache log | apache | Determines log formats through a grok pattern. | 
| Linux kernel log | linux\_kernel | Determines log formats through a grok pattern. | 
| Microsoft log | microsoft\_log | Determines log formats through a grok pattern. | 
| Ruby log | ruby\_logger | Reads the beginning of the file to determine format. | 
| Squid 3.x log | squid | Reads the beginning of the file to determine format. | 
| Redis monitor log | redismonlog | Reads the beginning of the file to determine format. | 
| Redis log | redislog | Reads the beginning of the file to determine format. | 
| CSV | csv | Checks for the following delimiters: comma (,), pipe (\|), tab (\\t), semicolon (;), and Ctrl-A (\\u0001). Ctrl-A is the Unicode control character for Start Of Heading. | 
| Amazon Redshift | redshift | Uses JDBC connection to import metadata. | 
| MySQL | mysql | Uses JDBC connection to import metadata. | 
| PostgreSQL | postgresql | Uses JDBC connection to import metadata. | 
| Oracle database | oracle | Uses JDBC connection to import metadata. | 
| Microsoft SQL Server | sqlserver | Uses JDBC connection to import metadata. | 
| Amazon DynamoDB | dynamodb | Reads data from the DynamoDB table. | 

Files in the following compressed formats can be classified:
+ ZIP (supported for archives containing only a single file). Note that ZIP is not well supported in other services (because of the archive format).
+ BZIP
+ GZIP
+ LZ4
+ Snappy (supported for both standard and Hadoop native Snappy formats)

### Built-in CSV classifier
<a name="classifier-builtin-rules"></a>

The built-in CSV classifier parses CSV file contents to determine the schema for an AWS Glue table. This classifier checks for the following delimiters:
+ Comma (,)
+ Pipe (|)
+ Tab (\\t)
+ Semicolon (;)
+ Ctrl-A (\\u0001)

  Ctrl-A is the Unicode control character for `Start Of Heading`.

To be classified as CSV, the table schema must have at least two columns and two rows of data. The CSV classifier uses a number of heuristics to determine whether a header is present in a given file. If the classifier can't determine a header from the first row of data, column headers are displayed as `col1`, `col2`, `col3`, and so on. The built-in CSV classifier determines whether to infer a header by evaluating the following characteristics of the file:
+ Every column in a potential header parses as a STRING data type.
+ Except for the last column, every column in a potential header has content that is fewer than 150 characters. To allow for a trailing delimiter, the last column can be empty throughout the file.
+ Every column in a potential header must meet the AWS Glue `regex` requirements for a column name.
+ The header row must be sufficiently different from the data rows. To determine this, one or more of the rows must parse as other than STRING type. If all columns are of type STRING, then the first row of data is not sufficiently different from subsequent rows to be used as the header.
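The header heuristics above can be approximated in a short sketch. This is an illustration of the documented rules, not the classifier itself, and the column-name regex is an assumption:

```python
import re

COLUMN_NAME_RE = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")  # assumed name rule

def looks_numeric(value):
    """Treat anything parseable as a number as a non-STRING cell."""
    try:
        float(value)
        return True
    except ValueError:
        return False

def infer_has_header(rows):
    """Approximate the documented heuristics for detecting a CSV header row."""
    header, data = rows[0], rows[1:]
    for i, cell in enumerate(header):
        # To allow for a trailing delimiter, the last column may be empty.
        if i == len(header) - 1 and cell == "":
            continue
        # Every potential header cell must be a STRING under 150 characters
        # that satisfies the column-name requirements.
        if looks_numeric(cell) or len(cell) >= 150:
            return False
        if not COLUMN_NAME_RE.match(cell):
            return False
    # The header must differ from the data rows: some data cell must parse
    # as other than STRING, or the first row is not treated as a header.
    return any(looks_numeric(cell) for row in data for cell in row)
```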

**Note**  
If the built-in CSV classifier does not create your AWS Glue table as you want, you might be able to use one of the following alternatives:  
+ Change the column names in the Data Catalog, set the `SchemaChangePolicy` to LOG, and set the partition output configuration to `InheritFromTable` for future crawler runs.
+ Create a custom grok classifier to parse the data and assign the columns that you want.
The built-in CSV classifier creates tables referencing the `LazySimpleSerDe` as the serialization library, which is a good choice for type inference. However, if the CSV data contains quoted strings, edit the table definition and change the SerDe library to `OpenCSVSerDe`. Adjust any inferred types to STRING, set the `SchemaChangePolicy` to LOG, and set the partition output configuration to `InheritFromTable` for future crawler runs. For more information about SerDe libraries, see [SerDe Reference](https://docs.aws.amazon.com/athena/latest/ug/serde-reference.html) in the Amazon Athena User Guide.

# Writing custom classifiers for diverse data formats
<a name="custom-classifier"></a>

You can provide a custom classifier to classify your data in AWS Glue. You can create a custom classifier using a grok pattern, an XML tag, JavaScript Object Notation (JSON), or comma-separated values (CSV). An AWS Glue crawler calls a custom classifier. If the classifier recognizes the data, it returns the classification and schema of the data to the crawler. You might need to define a custom classifier if your data doesn't match any built-in classifiers, or if you want to customize the tables that are created by the crawler.

 For more information about creating a classifier using the AWS Glue console, see [Creating classifiers using the AWS Glue console](console-classifiers.md). 

AWS Glue runs custom classifiers before built-in classifiers, in the order you specify. When a crawler finds a classifier that matches the data, the classification string and schema are used in the definition of tables that are written to your AWS Glue Data Catalog.

**Topics**
+ [Writing grok custom classifiers](#custom-classifier-grok)
+ [Writing XML custom classifiers](#custom-classifier-xml)
+ [Writing JSON custom classifiers](#custom-classifier-json)
+ [Writing CSV custom classifiers](#custom-classifier-csv)

## Writing grok custom classifiers
<a name="custom-classifier-grok"></a>

Grok is a tool that is used to parse textual data given a matching pattern. A grok pattern is a named set of regular expressions (regex) that are used to match data one line at a time. AWS Glue uses grok patterns to infer the schema of your data. When a grok pattern matches your data, AWS Glue uses the pattern to determine the structure of your data and map it into fields.

AWS Glue provides many built-in patterns, or you can define your own. You can create a grok pattern using built-in patterns and custom patterns in your custom classifier definition. You can tailor a grok pattern to classify custom text file formats.

**Note**  
AWS Glue grok custom classifiers use the `GrokSerDe` serialization library for tables created in the AWS Glue Data Catalog. If you are using the AWS Glue Data Catalog with Amazon Athena, Amazon EMR, or Redshift Spectrum, check the documentation about those services for information about support of the `GrokSerDe`. Currently, you might encounter problems querying tables created with the `GrokSerDe` from Amazon EMR and Redshift Spectrum.

The following is the basic syntax for the components of a grok pattern:

```
%{PATTERN:field-name}
```

Data that matches the named `PATTERN` is mapped to the `field-name` column in the schema, with a default data type of `string`. Optionally, the data type for the field can be cast to `byte`, `boolean`, `double`, `short`, `int`, `long`, or `float` in the resulting schema.

```
%{PATTERN:field-name:data-type}
```

For example, to cast a `num` field to an `int` data type, you can use this pattern: 

```
%{NUMBER:num:int}
```

Patterns can be composed of other patterns. For example, you can have a pattern for a `SYSLOG` timestamp that is defined by patterns for month, day of the month, and time (for example, `Feb 1 06:25:43`). For this data, you might define the following pattern:

```
SYSLOGTIMESTAMP %{MONTH} +%{MONTHDAY} %{TIME}
```

**Note**  
Grok patterns can process only one line at a time. Multiple-line patterns are not supported. Also, line breaks within a pattern are not supported.

### Custom values for grok classifier
<a name="classifier-values"></a>

When you define a grok classifier, you supply the following values to create the custom classifier.

**Name**  
Name of the classifier.

**Classification**  
The text string that is written to describe the format of the data that is classified; for example, `special-logs`.

**Grok pattern**  
The set of patterns that are applied to the data store to determine whether there is a match. These patterns are from AWS Glue [built-in patterns](#classifier-builtin-patterns) and any custom patterns that you define.  
The following is an example of a grok pattern:  

```
%{TIMESTAMP_ISO8601:timestamp} \[%{MESSAGEPREFIX:message_prefix}\] %{CRAWLERLOGLEVEL:loglevel} : %{GREEDYDATA:message}
```
When the data matches `TIMESTAMP_ISO8601`, a schema column `timestamp` is created. The behavior is similar for the other named patterns in the example.

**Custom patterns**  
Optional custom patterns that you define. These patterns are referenced by the grok pattern that classifies your data. Each custom component pattern must be on a separate line. [Regular expression (regex)](http://en.wikipedia.org/wiki/Regular_expression) syntax is used to define the pattern.  
The following is an example of using custom patterns:  

```
CRAWLERLOGLEVEL (BENCHMARK|ERROR|WARN|INFO|TRACE)
MESSAGEPREFIX .*-.*-.*-.*-.*
```
The first custom named pattern, `CRAWLERLOGLEVEL`, is a match when the data matches one of the enumerated strings. The second custom pattern, `MESSAGEPREFIX`, tries to match a message prefix string.
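As a rough illustration (plain Python `re`, not AWS Glue's grok implementation), the sketch below translates `%{PATTERN:field}` references into named regex groups and applies the example pattern to a sample log line. The `TIMESTAMP_ISO8601` stand-in is simplified:

```python
import re

# Component patterns: a simplified TIMESTAMP_ISO8601 stand-in plus the
# two custom patterns defined above.
components = {
    "TIMESTAMP_ISO8601": r"\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}",  # simplified
    "CRAWLERLOGLEVEL": r"(?:BENCHMARK|ERROR|WARN|INFO|TRACE)",
    "MESSAGEPREFIX": r".*-.*-.*-.*-.*",
    "GREEDYDATA": r".*",
}

def to_regex(grok: str) -> str:
    """Turn %{NAME:field} into a named group and %{NAME} into a plain group."""
    def repl(m):
        name, field = m.group(1), m.group(2)
        body = components[name]
        return f"(?P<{field}>{body})" if field else f"(?:{body})"
    return re.sub(r"%\{(\w+)(?::(\w+))?\}", repl, grok)

pattern = to_regex(
    r"%{TIMESTAMP_ISO8601:timestamp} \[%{MESSAGEPREFIX:message_prefix}\] "
    r"%{CRAWLERLOGLEVEL:loglevel} : %{GREEDYDATA:message}"
)
line = "2024-05-01 12:00:00 [a-b-c-d-e] INFO : crawl started"
m = re.match(pattern, line)
print(m.groupdict()["loglevel"])  # INFO
```

Each named pattern that matches becomes a schema column (`timestamp`, `message_prefix`, `loglevel`, `message`), which mirrors how the classifier builds the table schema.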

AWS Glue keeps track of the creation time, last update time, and version of your classifier.

### Built-in patterns
<a name="classifier-builtin-patterns"></a>

AWS Glue provides many common patterns that you can use to build a custom classifier. You add a named pattern to the `grok pattern` in a classifier definition.

The following list consists of a line for each pattern. In each line, the pattern name is followed by its definition. [Regular expression (regex)](http://en.wikipedia.org/wiki/Regular_expression) syntax is used in defining the pattern.

```
# AWS Glue Built-in patterns
 USERNAME [a-zA-Z0-9._-]+
 USER %{USERNAME:UNWANTED}
 INT (?:[+-]?(?:[0-9]+))
 BASE10NUM (?<![0-9.+-])(?>[+-]?(?:(?:[0-9]+(?:\.[0-9]+)?)|(?:\.[0-9]+)))
 NUMBER (?:%{BASE10NUM:UNWANTED})
 BASE16NUM (?<![0-9A-Fa-f])(?:[+-]?(?:0x)?(?:[0-9A-Fa-f]+))
 BASE16FLOAT \b(?<![0-9A-Fa-f.])(?:[+-]?(?:0x)?(?:(?:[0-9A-Fa-f]+(?:\.[0-9A-Fa-f]*)?)|(?:\.[0-9A-Fa-f]+)))\b
 BOOLEAN (?i)(true|false)
 
 POSINT \b(?:[1-9][0-9]*)\b
 NONNEGINT \b(?:[0-9]+)\b
 WORD \b\w+\b
 NOTSPACE \S+
 SPACE \s*
 DATA .*?
 GREEDYDATA .*
 #QUOTEDSTRING (?:(?<!\\)(?:"(?:\\.|[^\\"])*"|(?:'(?:\\.|[^\\'])*')|(?:`(?:\\.|[^\\`])*`)))
 QUOTEDSTRING (?>(?<!\\)(?>"(?>\\.|[^\\"]+)+"|""|(?>'(?>\\.|[^\\']+)+')|''|(?>`(?>\\.|[^\\`]+)+`)|``))
 UUID [A-Fa-f0-9]{8}-(?:[A-Fa-f0-9]{4}-){3}[A-Fa-f0-9]{12}
 
 # Networking
 MAC (?:%{CISCOMAC:UNWANTED}|%{WINDOWSMAC:UNWANTED}|%{COMMONMAC:UNWANTED})
 CISCOMAC (?:(?:[A-Fa-f0-9]{4}\.){2}[A-Fa-f0-9]{4})
 WINDOWSMAC (?:(?:[A-Fa-f0-9]{2}-){5}[A-Fa-f0-9]{2})
 COMMONMAC (?:(?:[A-Fa-f0-9]{2}:){5}[A-Fa-f0-9]{2})
 IPV6 ((([0-9A-Fa-f]{1,4}:){7}([0-9A-Fa-f]{1,4}|:))|(([0-9A-Fa-f]{1,4}:){6}(:[0-9A-Fa-f]{1,4}|((25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)(\.(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)){3})|:))|(([0-9A-Fa-f]{1,4}:){5}(((:[0-9A-Fa-f]{1,4}){1,2})|:((25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)(\.(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)){3})|:))|(([0-9A-Fa-f]{1,4}:){4}(((:[0-9A-Fa-f]{1,4}){1,3})|((:[0-9A-Fa-f]{1,4})?:((25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)(\.(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)){3}))|:))|(([0-9A-Fa-f]{1,4}:){3}(((:[0-9A-Fa-f]{1,4}){1,4})|((:[0-9A-Fa-f]{1,4}){0,2}:((25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)(\.(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)){3}))|:))|(([0-9A-Fa-f]{1,4}:){2}(((:[0-9A-Fa-f]{1,4}){1,5})|((:[0-9A-Fa-f]{1,4}){0,3}:((25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)(\.(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)){3}))|:))|(([0-9A-Fa-f]{1,4}:){1}(((:[0-9A-Fa-f]{1,4}){1,6})|((:[0-9A-Fa-f]{1,4}){0,4}:((25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)(\.(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)){3}))|:))|(:(((:[0-9A-Fa-f]{1,4}){1,7})|((:[0-9A-Fa-f]{1,4}){0,5}:((25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)(\.(25[0-5]|2[0-4]\d|1\d\d|[1-9]?\d)){3}))|:)))(%.+)?
 IPV4 (?<![0-9])(?:(?:25[0-5]|2[0-4][0-9]|[0-1]?[0-9]{1,2})[.](?:25[0-5]|2[0-4][0-9]|[0-1]?[0-9]{1,2})[.](?:25[0-5]|2[0-4][0-9]|[0-1]?[0-9]{1,2})[.](?:25[0-5]|2[0-4][0-9]|[0-1]?[0-9]{1,2}))(?![0-9])
 IP (?:%{IPV6:UNWANTED}|%{IPV4:UNWANTED})
 HOSTNAME \b(?:[0-9A-Za-z][0-9A-Za-z-_]{0,62})(?:\.(?:[0-9A-Za-z][0-9A-Za-z-_]{0,62}))*(\.?|\b)
 HOST %{HOSTNAME:UNWANTED}
 IPORHOST (?:%{HOSTNAME:UNWANTED}|%{IP:UNWANTED})
 HOSTPORT (?:%{IPORHOST}:%{POSINT:PORT})
 
 # paths
 PATH (?:%{UNIXPATH}|%{WINPATH})
 UNIXPATH (?>/(?>[\w_%!$@:.,~-]+|\\.)*)+
 #UNIXPATH (?<![\w\/])(?:/[^\/\s?*]*)+
 TTY (?:/dev/(pts|tty([pq])?)(\w+)?/?(?:[0-9]+))
 WINPATH (?>[A-Za-z]+:|\\)(?:\\[^\\?*]*)+
 URIPROTO [A-Za-z]+(\+[A-Za-z+]+)?
 URIHOST %{IPORHOST}(?::%{POSINT:port})?
 # uripath comes loosely from RFC1738, but mostly from what Firefox
 # doesn't turn into %XX
 URIPATH (?:/[A-Za-z0-9$.+!*'(){},~:;=@#%_\-]*)+
 #URIPARAM \?(?:[A-Za-z0-9]+(?:=(?:[^&]*))?(?:&(?:[A-Za-z0-9]+(?:=(?:[^&]*))?)?)*)?
 URIPARAM \?[A-Za-z0-9$.+!*'|(){},~@#%&/=:;_?\-\[\]]*
 URIPATHPARAM %{URIPATH}(?:%{URIPARAM})?
 URI %{URIPROTO}://(?:%{USER}(?::[^@]*)?@)?(?:%{URIHOST})?(?:%{URIPATHPARAM})?
 
 # Months: January, Feb, 3, 03, 12, December
 MONTH \b(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)\b
 MONTHNUM (?:0?[1-9]|1[0-2])
 MONTHNUM2 (?:0[1-9]|1[0-2])
 MONTHDAY (?:(?:0[1-9])|(?:[12][0-9])|(?:3[01])|[1-9])
 
 # Days: Monday, Tue, Thu, etc...
 DAY (?:Mon(?:day)?|Tue(?:sday)?|Wed(?:nesday)?|Thu(?:rsday)?|Fri(?:day)?|Sat(?:urday)?|Sun(?:day)?)
 
 # Years?
 YEAR (?>\d\d){1,2}
 # Time: HH:MM:SS
 #TIME \d{2}:\d{2}(?::\d{2}(?:\.\d+)?)?
 # TIME %{POSINT<24}:%{POSINT<60}(?::%{POSINT<60}(?:\.%{POSINT})?)?
 HOUR (?:2[0123]|[01]?[0-9])
 MINUTE (?:[0-5][0-9])
 # '60' is a leap second in most time standards and thus is valid.
 SECOND (?:(?:[0-5]?[0-9]|60)(?:[:.,][0-9]+)?)
 TIME (?!<[0-9])%{HOUR}:%{MINUTE}(?::%{SECOND})(?![0-9])
 # datestamp is YYYY/MM/DD-HH:MM:SS.UUUU (or something like it)
 DATE_US %{MONTHNUM}[/-]%{MONTHDAY}[/-]%{YEAR}
 DATE_EU %{MONTHDAY}[./-]%{MONTHNUM}[./-]%{YEAR}
 DATESTAMP_US %{DATE_US}[- ]%{TIME}
 DATESTAMP_EU %{DATE_EU}[- ]%{TIME}
 ISO8601_TIMEZONE (?:Z|[+-]%{HOUR}(?::?%{MINUTE}))
 ISO8601_SECOND (?:%{SECOND}|60)
 TIMESTAMP_ISO8601 %{YEAR}-%{MONTHNUM}-%{MONTHDAY}[T ]%{HOUR}:?%{MINUTE}(?::?%{SECOND})?%{ISO8601_TIMEZONE}?
 TZ (?:[PMCE][SD]T|UTC)
 DATESTAMP_RFC822 %{DAY} %{MONTH} %{MONTHDAY} %{YEAR} %{TIME} %{TZ}
 DATESTAMP_RFC2822 %{DAY}, %{MONTHDAY} %{MONTH} %{YEAR} %{TIME} %{ISO8601_TIMEZONE}
 DATESTAMP_OTHER %{DAY} %{MONTH} %{MONTHDAY} %{TIME} %{TZ} %{YEAR}
 DATESTAMP_EVENTLOG %{YEAR}%{MONTHNUM2}%{MONTHDAY}%{HOUR}%{MINUTE}%{SECOND}
 CISCOTIMESTAMP %{MONTH} %{MONTHDAY} %{TIME}
 
 # Syslog Dates: Month Day HH:MM:SS
 SYSLOGTIMESTAMP %{MONTH} +%{MONTHDAY} %{TIME}
 PROG (?:[\w._/%-]+)
 SYSLOGPROG %{PROG:program}(?:\[%{POSINT:pid}\])?
 SYSLOGHOST %{IPORHOST}
 SYSLOGFACILITY <%{NONNEGINT:facility}.%{NONNEGINT:priority}>
 HTTPDATE %{MONTHDAY}/%{MONTH}/%{YEAR}:%{TIME} %{INT}
 
 # Shortcuts
 QS %{QUOTEDSTRING:UNWANTED}
 
 # Log formats
 SYSLOGBASE %{SYSLOGTIMESTAMP:timestamp} (?:%{SYSLOGFACILITY} )?%{SYSLOGHOST:logsource} %{SYSLOGPROG}:
 
 MESSAGESLOG %{SYSLOGBASE} %{DATA}
 
 COMMONAPACHELOG %{IPORHOST:clientip} %{USER:ident} %{USER:auth} \[%{HTTPDATE:timestamp}\] "(?:%{WORD:verb} %{NOTSPACE:request}(?: HTTP/%{NUMBER:httpversion})?|%{DATA:rawrequest})" %{NUMBER:response} (?:%{NUMBER:bytes}|-)
 COMBINEDAPACHELOG %{COMMONAPACHELOG} %{QS:referrer} %{QS:agent}
 COMMONAPACHELOG_DATATYPED %{IPORHOST:clientip} %{USER:ident;boolean} %{USER:auth} \[%{HTTPDATE:timestamp;date;dd/MMM/yyyy:HH:mm:ss Z}\] "(?:%{WORD:verb;string} %{NOTSPACE:request}(?: HTTP/%{NUMBER:httpversion;float})?|%{DATA:rawrequest})" %{NUMBER:response;int} (?:%{NUMBER:bytes;long}|-)
 
 
 # Log Levels
 LOGLEVEL ([A|a]lert|ALERT|[T|t]race|TRACE|[D|d]ebug|DEBUG|[N|n]otice|NOTICE|[I|i]nfo|INFO|[W|w]arn?(?:ing)?|WARN?(?:ING)?|[E|e]rr?(?:or)?|ERR?(?:OR)?|[C|c]rit?(?:ical)?|CRIT?(?:ICAL)?|[F|f]atal|FATAL|[S|s]evere|SEVERE|EMERG(?:ENCY)?|[Ee]merg(?:ency)?)
```

## Writing XML custom classifiers
<a name="custom-classifier-xml"></a>

XML defines the structure of a document with the use of tags in the file. With an XML custom classifier, you can specify the tag name used to define a row.

### Custom classifier values for an XML classifier
<a name="classifier-values-xml"></a>

When you define an XML classifier, you supply the following values to AWS Glue to create the classifier. The classification field of this classifier is set to `xml`.

**Name**  
Name of the classifier.

**Row tag**  
The XML tag name that defines a table row in the XML document, without angle brackets `< >`. The name must comply with XML rules for a tag.  
The element containing the row data **cannot** be a self-closing empty element. For example, this empty element is **not** parsed by AWS Glue:  

```
<row att1="xx" att2="yy" />
```
 Empty elements can be written as follows:  

```
<row att1="xx" att2="yy"> </row>
```

AWS Glue keeps track of the creation time, last update time, and version of your classifier.

For example, suppose that you have the following XML file. To create an AWS Glue table that only contains columns for author and title, create a classifier in the AWS Glue console with **Row tag** as `AnyCompany`. Then add and run a crawler that uses this custom classifier.

```
<?xml version="1.0"?>
<catalog>
   <book id="bk101">
     <AnyCompany>
       <author>Rivera, Martha</author>
       <title>AnyCompany Developer Guide</title>
     </AnyCompany>
   </book>
   <book id="bk102">
     <AnyCompany>   
       <author>Stiles, John</author>
       <title>Style Guide for AnyCompany</title>
     </AnyCompany>
   </book>
</catalog>
```
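The effect of the **Row tag** setting can be sketched with Python's standard `xml.etree.ElementTree`: each `AnyCompany` element becomes one row, and its child elements become columns. This illustrates the concept only; it is not AWS Glue's XML parser:

```python
import xml.etree.ElementTree as ET

xml_doc = """<?xml version="1.0"?>
<catalog>
   <book id="bk101">
     <AnyCompany>
       <author>Rivera, Martha</author>
       <title>AnyCompany Developer Guide</title>
     </AnyCompany>
   </book>
   <book id="bk102">
     <AnyCompany>
       <author>Stiles, John</author>
       <title>Style Guide for AnyCompany</title>
     </AnyCompany>
   </book>
</catalog>"""

ROW_TAG = "AnyCompany"  # the value entered as Row tag in the console

root = ET.fromstring(xml_doc)
# One row per ROW_TAG element; child tags become the columns.
rows = [
    {child.tag: child.text for child in element}
    for element in root.iter(ROW_TAG)
]
print(rows[0])  # {'author': 'Rivera, Martha', 'title': 'AnyCompany Developer Guide'}
```

Because the row tag is `AnyCompany` rather than `book`, the `id` attribute of `book` does not appear in the resulting rows, only `author` and `title`.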

## Writing JSON custom classifiers
<a name="custom-classifier-json"></a>

JSON is a data-interchange format. It defines data structures with name-value pairs or an ordered list of values. With a JSON custom classifier, you can specify the JSON path to a data structure that is used to define the schema for your table.

### Custom classifier values in AWS Glue
<a name="classifier-values-json"></a>

When you define a JSON classifier, you supply the following values to AWS Glue to create the classifier. The classification field of this classifier is set to `json`.

**Name**  
Name of the classifier.

**JSON path**  
A JSON path that points to an object that is used to define a table schema. The JSON path can be written in dot notation or bracket notation. The following operators are supported:      
[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/custom-classifier.html)

AWS Glue keeps track of the creation time, last update time, and version of your classifier.

**Example Using a JSON classifier to pull records from an array**  
Suppose that your JSON data is an array of records. For example, the first few lines of your file might look like the following:  

```
[
  {
    "type": "constituency",
    "id": "ocd-division\/country:us\/state:ak",
    "name": "Alaska"
  },
  {
    "type": "constituency",
    "id": "ocd-division\/country:us\/state:al\/cd:1",
    "name": "Alabama's 1st congressional district"
  },
  {
    "type": "constituency",
    "id": "ocd-division\/country:us\/state:al\/cd:2",
    "name": "Alabama's 2nd congressional district"
  },
  {
    "type": "constituency",
    "id": "ocd-division\/country:us\/state:al\/cd:3",
    "name": "Alabama's 3rd congressional district"
  },
  {
    "type": "constituency",
    "id": "ocd-division\/country:us\/state:al\/cd:4",
    "name": "Alabama's 4th congressional district"
  },
  {
    "type": "constituency",
    "id": "ocd-division\/country:us\/state:al\/cd:5",
    "name": "Alabama's 5th congressional district"
  },
  {
    "type": "constituency",
    "id": "ocd-division\/country:us\/state:al\/cd:6",
    "name": "Alabama's 6th congressional district"
  },
  {
    "type": "constituency",
    "id": "ocd-division\/country:us\/state:al\/cd:7",
    "name": "Alabama's 7th congressional district"
  },
  {
    "type": "constituency",
    "id": "ocd-division\/country:us\/state:ar\/cd:1",
    "name": "Arkansas's 1st congressional district"
  },
  {
    "type": "constituency",
    "id": "ocd-division\/country:us\/state:ar\/cd:2",
    "name": "Arkansas's 2nd congressional district"
  },
  {
    "type": "constituency",
    "id": "ocd-division\/country:us\/state:ar\/cd:3",
    "name": "Arkansas's 3rd congressional district"
  },
  {
    "type": "constituency",
    "id": "ocd-division\/country:us\/state:ar\/cd:4",
    "name": "Arkansas's 4th congressional district"
  }
]
```
When you run a crawler using the built-in JSON classifier, the entire file is used to define the schema. Because you don’t specify a JSON path, the crawler treats the data as one object, that is, just an array. For example, the schema might look like the following:  

```
root
|-- record: array
```
However, to create a schema that is based on each record in the JSON array, create a custom JSON classifier and specify the JSON path as `$[*]`. When you specify this JSON path, the classifier interrogates all 12 records in the array to determine the schema. The resulting schema contains separate fields for each object, similar to the following example:  

```
root
|-- type: string
|-- id: string
|-- name: string
```
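The difference between the two behaviors can be sketched in plain Python: without a JSON path, the whole document is one array value, while `$[*]` iterates the array so each element's keys become columns. This is an illustration of the semantics, not the crawler's implementation:

```python
import json

data = json.loads("""[
  {"type": "constituency", "id": "ocd-division/country:us/state:ak", "name": "Alaska"},
  {"type": "constituency", "id": "ocd-division/country:us/state:al/cd:1",
   "name": "Alabama's 1st congressional district"}
]""")

# Built-in behavior (no JSON path): the whole document is one array value.
schema_default = type(data).__name__
print(schema_default)  # list

# Custom classifier with JSON path $[*]: each array element is a record,
# and the union of its keys becomes the column set.
columns = sorted({key for record in data for key in record})
print(columns)  # ['id', 'name', 'type']
```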

**Example Using a JSON classifier to examine only parts of a file**  
Suppose that your JSON data follows the pattern of the example JSON file `s3://awsglue-datasets/examples/us-legislators/all/areas.json` drawn from [http://everypolitician.org/](http://everypolitician.org/). Example objects in the JSON file look like the following:  

```
{
  "type": "constituency",
  "id": "ocd-division\/country:us\/state:ak",
  "name": "Alaska"
}
{
  "type": "constituency",
  "identifiers": [
    {
      "scheme": "dmoz",
      "identifier": "Regional\/North_America\/United_States\/Alaska\/"
    },
    {
      "scheme": "freebase",
      "identifier": "\/m\/0hjy"
    },
    {
      "scheme": "fips",
      "identifier": "US02"
    },
    {
      "scheme": "quora",
      "identifier": "Alaska-state"
    },
    {
      "scheme": "britannica",
      "identifier": "place\/Alaska"
    },
    {
      "scheme": "wikidata",
      "identifier": "Q797"
    }
  ],
  "other_names": [
    {
      "lang": "en",
      "note": "multilingual",
      "name": "Alaska"
    },
    {
      "lang": "fr",
      "note": "multilingual",
      "name": "Alaska"
    },
    {
      "lang": "nov",
      "note": "multilingual",
      "name": "Alaska"
    }
  ],
  "id": "ocd-division\/country:us\/state:ak",
  "name": "Alaska"
}
```
When you run a crawler using the built-in JSON classifier, the entire file is used to create the schema. You might end up with a schema like this:  

```
root
|-- type: string
|-- id: string
|-- name: string
|-- identifiers: array
|    |-- element: struct
|    |    |-- scheme: string
|    |    |-- identifier: string
|-- other_names: array
|    |-- element: struct
|    |    |-- lang: string
|    |    |-- note: string
|    |    |-- name: string
```
However, to create a schema using just the "`id`" object, create a custom JSON classifier and specify the JSON path as `$.id`. Then the schema is based on only the "`id`" field:  

```
root
|-- record: string
```
The first few lines of data extracted with this schema look like this:  

```
{"record": "ocd-division/country:us/state:ak"}
{"record": "ocd-division/country:us/state:al/cd:1"}
{"record": "ocd-division/country:us/state:al/cd:2"}
{"record": "ocd-division/country:us/state:al/cd:3"}
{"record": "ocd-division/country:us/state:al/cd:4"}
{"record": "ocd-division/country:us/state:al/cd:5"}
{"record": "ocd-division/country:us/state:al/cd:6"}
{"record": "ocd-division/country:us/state:al/cd:7"}
{"record": "ocd-division/country:us/state:ar/cd:1"}
{"record": "ocd-division/country:us/state:ar/cd:2"}
{"record": "ocd-division/country:us/state:ar/cd:3"}
{"record": "ocd-division/country:us/state:ar/cd:4"}
{"record": "ocd-division/country:us/state:as"}
{"record": "ocd-division/country:us/state:az/cd:1"}
{"record": "ocd-division/country:us/state:az/cd:2"}
{"record": "ocd-division/country:us/state:az/cd:3"}
{"record": "ocd-division/country:us/state:az/cd:4"}
{"record": "ocd-division/country:us/state:az/cd:5"}
{"record": "ocd-division/country:us/state:az/cd:6"}
{"record": "ocd-division/country:us/state:az/cd:7"}
```
To create a schema based on a deeply nested object, such as "`identifier`," in the JSON file, you can create a custom JSON classifier and specify the JSON path as `$.identifiers[*].identifier`. Although the schema is similar to the previous example, it is based on a different object in the JSON file.   
The schema looks like the following:  

```
root
|-- record: string
```
Listing the first few lines of data from the table shows that the schema is based on the data in the "`identifier`" object:  

```
{"record": "Regional/North_America/United_States/Alaska/"}
{"record": "/m/0hjy"}
{"record": "US02"}
{"record": "5879092"}
{"record": "4001016-8"}
{"record": "destination/alaska"}
{"record": "1116270"}
{"record": "139487266"}
{"record": "n79018447"}
{"record": "01490999-8dec-4129-8254-eef6e80fadc3"}
{"record": "Alaska-state"}
{"record": "place/Alaska"}
{"record": "Q797"}
{"record": "Regional/North_America/United_States/Alabama/"}
{"record": "/m/0gyh"}
{"record": "US01"}
{"record": "4829764"}
{"record": "4084839-5"}
{"record": "161950"}
{"record": "131885589"}
```
To create a table based on another deeply nested object, such as the "`name`" field in the "`other_names`" array in the JSON file, you can create a custom JSON classifier and specify the JSON path as `$.other_names[*].name`. Although the schema is similar to the previous example, it is based on a different object in the JSON file. The schema looks like the following:  

```
root
|-- record: string
```
Listing the first few lines of data in the table shows that it is based on the data in the "`name`" object in the "`other_names`" array:  

```
{"record": "Alaska"}
{"record": "Alaska"}
{"record": "Аляска"}
{"record": "Alaska"}
{"record": "Alaska"}
{"record": "Alaska"}
{"record": "Alaska"}
{"record": "Alaska"}
{"record": "Alaska"}
{"record": "ألاسكا"}
{"record": "ܐܠܐܣܟܐ"}
{"record": "الاسكا"}
{"record": "Alaska"}
{"record": "Alyaska"}
{"record": "Alaska"}
{"record": "Alaska"}
{"record": "Штат Аляска"}
{"record": "Аляска"}
{"record": "Alaska"}
{"record": "আলাস্কা"}
```
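A minimal sketch of how paths like `$.identifiers[*].identifier` and `$.other_names[*].name` select values from each object (plain Python approximating the JSON path semantics described above, using a trimmed version of the sample object):

```python
import json

obj = json.loads("""{
  "type": "constituency",
  "identifiers": [
    {"scheme": "dmoz", "identifier": "Regional/North_America/United_States/Alaska/"},
    {"scheme": "freebase", "identifier": "/m/0hjy"},
    {"scheme": "fips", "identifier": "US02"}
  ],
  "other_names": [
    {"lang": "en", "note": "multilingual", "name": "Alaska"},
    {"lang": "fr", "note": "multilingual", "name": "Alaska"}
  ],
  "id": "ocd-division/country:us/state:ak",
  "name": "Alaska"
}""")

def select(record, array_field, value_field):
    """Approximate $.array_field[*].value_field: one output row per element."""
    return [{"record": item[value_field]} for item in record.get(array_field, [])]

print(select(obj, "identifiers", "identifier")[0])
# {'record': 'Regional/North_America/United_States/Alaska/'}
print(select(obj, "other_names", "name"))
# [{'record': 'Alaska'}, {'record': 'Alaska'}]
```

Each element of the selected array becomes one `record` row, which matches the single-column `record: string` schemas shown above.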

## Writing CSV custom classifiers
<a name="custom-classifier-csv"></a>

Custom CSV classifiers allow you to specify a datatype for each column of the data. You specify the column datatypes as a comma-separated list. By specifying datatypes, you can override the crawler's inferred datatypes and ensure that the data is classified appropriately.

You can set the SerDe for processing CSV in the classifier, which is applied in the Data Catalog.

When you create a custom classifier, you can also reuse the classifier for different crawlers.

CSV files that contain only headers (no data) are classified as UNKNOWN because not enough information is provided. If you specify that the CSV `Has headings` in the **Column headings** option and provide the datatypes, these files are classified correctly.

You can use a custom CSV classifier to infer the schema of various types of CSV data. The custom attributes that you can provide for your classifier include delimiters, a CSV SerDe option, options about the header, and whether to perform certain validations on the data.

### Custom classifier values in AWS Glue
<a name="classifier-values-csv"></a>

When you define a CSV classifier, you provide the following values to AWS Glue to create the classifier. The classification field of this classifier is set to `csv`.

**Classifier name**  
Name of the classifier.

**CSV Serde**  
Sets the SerDe for processing CSV in the classifier, which will be applied in the Data Catalog. Options are Open CSV SerDe, Lazy Simple SerDe, and None. You can specify the None value when you want the crawler to do the detection.

**Column delimiter**  
A custom symbol to denote what separates each column entry in the row. Provide a Unicode character. If you cannot type your delimiter, you can copy and paste it. This works for printable characters, including those that your system does not support (typically shown as □).

**Quote symbol**  
A custom symbol to denote what combines content into a single column value. It must be different from the column delimiter. Provide a Unicode character. If you cannot type your quote symbol, you can copy and paste it. This works for printable characters, including those that your system does not support (typically shown as □).

**Column headings**  
Indicates the behavior for how column headings should be detected in the CSV file. If your custom CSV file has column headings, enter a comma-delimited list of the column headings.

**Processing options: Allow files with single column**  
Enables the processing of files that contain only one column.

**Processing options: Trim white space before identifying column values**  
Specifies whether to trim values before identifying the type of column values.

**Custom datatypes - *optional***  
Specifies custom datatypes for the columns in the CSV file, entered as a comma-separated list. Each custom datatype must be a supported datatype: `BINARY`, `BOOLEAN`, `DATE`, `DECIMAL`, `DOUBLE`, `FLOAT`, `INT`, `LONG`, `SHORT`, `STRING`, or `TIMESTAMP`. Unsupported datatypes display an error.
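To illustrate how declared datatypes replace inference, here is a sketch in plain Python (not AWS Glue code); the column names, sample data, and declared types are hypothetical:

```python
import csv
import io

# Hypothetical classifier settings: headers present and custom datatypes
# declared for the three columns instead of letting the crawler infer them.
declared_types = ["STRING", "INT", "DOUBLE"]

# Map each declared datatype to a cast function.
casters = {"STRING": str, "INT": int, "DOUBLE": float,
           "BOOLEAN": lambda v: v.lower() == "true"}

raw = io.StringIO("name,age,score\nalice,30,91.5\nbob,25,88.0\n")
reader = csv.reader(raw)
headers = next(reader)

rows = [
    {h: casters[t](v) for h, t, v in zip(headers, declared_types, row)}
    for row in reader
]
print(rows[0])  # {'name': 'alice', 'age': 30, 'score': 91.5}
```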

# Creating classifiers using the AWS Glue console
<a name="console-classifiers"></a>

A classifier determines the schema of your data. You can write a custom classifier and point to it from AWS Glue. 

## Creating classifiers
<a name="add-classifier-console"></a>

To add a classifier in the AWS Glue console, choose **Add classifier**. When you define a classifier, you supply values for the following:

**Classifier name**  
Provide a unique name for your classifier.

**Classifier type**  
Choose the type of classifier to create.

Depending on the type of classifier you choose, configure the following properties for your classifier:

------
#### [ Grok ]
+ **Classification** 

  Describe the format or type of data that is classified or provide a custom label. 
+ **Grok pattern** 

  This is used to parse your data into a structured schema. The grok pattern is composed of named patterns that describe the format of your data store. You write this grok pattern using the named built-in patterns provided by AWS Glue and custom patterns that you write and include in the **Custom patterns** field. We suggest that you try your pattern with some sample data in a grok debugger, although debugger results might not exactly match the results from AWS Glue. You can find grok debuggers on the web. The named built-in patterns provided by AWS Glue are generally compatible with grok patterns that are available on the web. 

  Build your grok pattern by iteratively adding named patterns and check your results in a debugger. This activity gives you confidence that when the AWS Glue crawler runs your grok pattern, your data can be parsed.
+ **Custom patterns** 

  For grok classifiers, these are optional building blocks for the **Grok pattern** that you write. When built-in patterns cannot parse your data, you might need to write a custom pattern. These custom patterns are defined in this field and referenced in the **Grok pattern** field. Each custom pattern is defined on a separate line. Just like the built-in patterns, it consists of a named pattern definition that uses [regular expression (regex)](http://en.wikipedia.org/wiki/Regular_expression) syntax. 

  For example, the following has the name `MESSAGEPREFIX` followed by a regular expression definition to apply to your data to determine whether it follows the pattern. 

  ```
  MESSAGEPREFIX .*-.*-.*-.*-.*
  ```

------
#### [ XML ]
+ **Row tag** 

  For XML classifiers, this is the name of the XML tag that defines a table row in the XML document. Type the name without angle brackets `< >`. The name must comply with XML rules for a tag.

  For more information, see [Writing XML custom classifiers](custom-classifier.md#custom-classifier-xml). 

------
#### [ JSON ]
+ **JSON path** 

  For JSON classifiers, this is the JSON path to the object, array, or value that defines a row of the table being created. Type the name in either dot or bracket JSON syntax using AWS Glue supported operators. 

  For more information, see the list of operators in [Writing JSON custom classifiers](custom-classifier.md#custom-classifier-json). 

------
#### [ CSV ]
+ **Column delimiter** 

  A single character or symbol to denote what separates each column entry in the row. Choose the delimiter from the list, or choose `Other` to enter a custom delimiter.
+ **Quote symbol** 

  A single character or symbol to denote what combines content into a single column value. Must be different from the column delimiter. Choose the quote symbol from the list, or choose `Other` to enter a custom quote character.
+ **Column headings** 

  Indicates the behavior for how column headings should be detected in the CSV file. You can choose `Has headings`, `No headings`, or `Detect headings`. If your custom CSV file has column headings, enter a comma-delimited list of the column headings. 
+ **Allow files with single column** 

  To be classified as CSV, the data must have at least two columns and two rows of data. Use this option to allow the processing of files that contain only one column.
+ **Trim whitespace before identifying column values** 

  This option specifies whether to trim values before identifying the type of column values.
+  **Custom datatype** 

   (Optional) - Enter custom datatypes in a comma-delimited list. The supported datatypes are: `BINARY`, `BOOLEAN`, `DATE`, `DECIMAL`, `DOUBLE`, `FLOAT`, `INT`, `LONG`, `SHORT`, `STRING`, and `TIMESTAMP`. 
+  **CSV Serde** 

   (Optional) - A SerDe for processing CSV in the classifier, which will be applied in the Data Catalog. Choose from `Open CSV SerDe`, `Lazy Simple SerDe`, or `None`. You can specify the `None` value when you want the crawler to do the detection. 

------

For more information, see [Writing custom classifiers for diverse data formats](custom-classifier.md).

## Viewing classifiers
<a name="view-classifiers-console"></a>

To see a list of all the classifiers that you have created, open the AWS Glue console at [https://console.aws.amazon.com/glue/](https://console.aws.amazon.com/glue/), and choose the **Classifiers** tab.

The list displays the following properties about each classifier:
+ **Classifier** – The classifier name. When you create a classifier, you must provide a name for it.
+ **Classification** – The classification type of tables inferred by this classifier.
+ **Last updated** – The last time this classifier was updated.

## Managing classifiers
<a name="manage-classifiers-console"></a>

From the **Classifiers** list in the AWS Glue console, you can add, edit, and delete classifiers. To see more details for a classifier, choose the classifier name in the list. Details include the information you defined when you created the classifier. 

# Configuring a crawler
<a name="define-crawler"></a>

A crawler accesses your data store, identifies metadata, and creates table definitions in the AWS Glue Data Catalog. The **Crawlers** pane in the AWS Glue console lists all the crawlers that you create. The list displays status and metrics from the last run of your crawler.

This topic walks you through the process of configuring a crawler, covering essential aspects such as setting the crawler's properties, defining the data sources to crawl, configuring security, and managing the crawler output.

**Topics**
+ [Step 1: Set crawler properties](define-crawler-set-crawler-properties.md)
+ [Step 2: Choose data sources and classifiers](define-crawler-choose-data-sources.md)
+ [Step 3: Configure security settings](define-crawler-configure-security-settings.md)
+ [Step 4: Set output and scheduling](define-crawler-set-output-and-scheduling.md)
+ [Step 5: Review and create](define-crawler-review.md)

# Step 1: Set crawler properties
<a name="define-crawler-set-crawler-properties"></a>

**To configure a crawler**

1. Sign in to the AWS Management Console and open the AWS Glue console at [https://console.aws.amazon.com/glue/](https://console.aws.amazon.com/glue/). Choose **Crawlers** in the navigation pane.

1.  Choose **Create crawler**, and follow the instructions in the **Add crawler** wizard. The wizard guides you through the steps required to create a crawler. If you want to add custom classifiers to define the schema, see [Defining and managing classifiers](add-classifier.md).

1.  Enter a name for your crawler and, optionally, a description. You can also tag your crawler with a **Tag key** and optional **Tag value**. Once created, tag keys are read-only. Use tags on some resources to help you organize and identify them. For more information, see AWS tags in AWS Glue.  
**Name**  
Name may contain letters (A-Z), numbers (0-9), hyphens (-), or underscores (_), and can be up to 255 characters long.  
**Description**  
Descriptions can be up to 2048 characters long.  
**Tags**  
Use tags to organize and identify your resources. For more information, see the following:   
   + [AWS tags in AWS Glue](monitor-tags.md)

# Step 2: Choose data sources and classifiers
<a name="define-crawler-choose-data-sources"></a>

Next, configure the data sources and classifiers for the crawler.

For more information about supported data sources, see [Supported data sources for crawling](crawler-data-stores.md).

**Data source configuration**  
For **Is your data already mapped to AWS Glue tables?**, choose **Not yet** or **Yes**. By default, **Not yet** is selected.   
The crawler can access data stores directly as the source of the crawl, or it can use existing tables in the Data Catalog as the source. If the crawler uses existing catalog tables, it crawls the data stores that are specified by those catalog tables.   
+ Not yet: Select one or more data sources to be crawled. A crawler can crawl multiple data stores of different types (Amazon S3, JDBC, and so on).

  You can configure only one data store at a time. After you have provided the connection information and include paths and exclude patterns, you then have the option of adding another data store.
+ Yes: Select existing tables from your AWS Glue Data Catalog. The catalog tables specify the data stores to crawl. The crawler can crawl only catalog tables in a single run; it can't mix in other source types.

  A common reason to specify a catalog table as the source is when you create the table manually (because you already know the structure of the data store) and you want a crawler to keep the table updated, including adding new partitions. For a discussion of other reasons, see [Updating manually created Data Catalog tables using crawlers](tables-described.md#update-manual-tables).

  When you specify existing tables as the crawler source type, the following conditions apply:
  + Database name is optional.
  + Only catalog tables that specify Amazon S3, Amazon DynamoDB, or Delta Lake data stores are permitted.
  + No new catalog tables are created when the crawler runs. Existing tables are updated as needed, including adding new partitions.
  + Deleted objects found in the data stores are ignored; no catalog tables are deleted. Instead, the crawler writes a log message. (`SchemaChangePolicy.DeleteBehavior=LOG`)
  + The crawler configuration option to create a single schema for each Amazon S3 path is enabled by default and cannot be disabled. (`TableGroupingPolicy`=`CombineCompatibleSchemas`) For more information, see [Creating a single schema for each Amazon S3 include path](crawler-grouping-policy.md).
  + You can't mix catalog tables as a source with any other source types (for example Amazon S3 or Amazon DynamoDB).
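
As an unofficial sketch of how these conditions surface in the API, the following builds a `create_crawler` request that targets existing catalog tables. The crawler name, role ARN, and database/table names are placeholders, and the actual call is shown only as a comment:

```
# Sketch of a catalog-target crawler definition (boto3). Names and the
# role ARN are placeholders. SchemaChangePolicy reflects the forced
# DeleteBehavior=LOG behavior described above.
crawler_args = {
    "Name": "my-catalog-crawler",                                       # placeholder
    "Role": "arn:aws:iam::111122223333:role/AWSGlueServiceRole-demo",   # placeholder
    "DatabaseName": "my_database",   # database name is optional for catalog targets
    "Targets": {
        "CatalogTargets": [
            {"DatabaseName": "my_database", "Tables": ["my_table"]}
        ]
    },
    # Deleted objects are only logged; no catalog tables are deleted.
    "SchemaChangePolicy": {
        "UpdateBehavior": "UPDATE_IN_DATABASE",
        "DeleteBehavior": "LOG",
    },
}

# To actually create the crawler (requires credentials and permissions):
# import boto3
# boto3.client("glue").create_crawler(**crawler_args)
```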
  
 To use Delta tables, first create a Delta table using Athena DDL or the AWS Glue API.   
 Using Athena, set the location to your Amazon S3 folder and the table type to 'DELTA'.   

```
CREATE EXTERNAL TABLE database_name.table_name
LOCATION 's3://bucket/folder/'
TBLPROPERTIES ('table_type' = 'DELTA')
```
 Using the AWS Glue API, specify the table type within the table parameters map. The table parameters must include the following key/value pair. For more information on how to create a table, see the [Boto3 documentation for create_table](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/glue/client/create_table.html).   

```
{
    "table_type":"delta"
}
```
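
As a hedged sketch, the equivalent table definition through the AWS Glue API might look like the following; the database name, table name, and Amazon S3 location are placeholders:

```
# Sketch of a Delta table definition for glue.create_table(). The table
# name and S3 location are placeholders.
table_input = {
    "Name": "my_delta_table",                 # placeholder
    "Parameters": {"table_type": "delta"},    # the required key/value pair
    "StorageDescriptor": {
        "Location": "s3://amzn-s3-demo-bucket/folder/",   # placeholder path
    },
}

# To create the table (requires credentials and permissions):
# import boto3
# boto3.client("glue").create_table(DatabaseName="my_database",
#                                   TableInput=table_input)
```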

**Data sources**  
Select or add the list of data sources to be scanned by the crawler.  
 (Optional) If you choose JDBC as the data source, you can use your own JDBC drivers when specifying the Connection access where the driver info is stored. 

**Include path**  
 When evaluating what to include or exclude in a crawl, a crawler starts by evaluating the required include path. For Amazon S3, MongoDB, MongoDB Atlas, Amazon DocumentDB (with MongoDB compatibility), and relational data stores, you must specify an include path.     
For an Amazon S3 data store  
Choose whether to specify a path in this account or in a different account, and then browse to choose an Amazon S3 path.  
For Amazon S3 data stores, include path syntax is `bucket-name/folder-name/file-name.ext`. To crawl all objects in a bucket, you specify just the bucket name in the include path. The exclude pattern is relative to the include path.  
For a Delta Lake data store  
Specify one or more Amazon S3 paths to Delta tables as s3://*bucket*/*prefix*/*object*.  
For an Iceberg or Hudi data store  
Specify one or more Amazon S3 paths that contain folders with Iceberg or Hudi table metadata as s3://*bucket*/*prefix*.  
For Iceberg and Hudi data stores, the Iceberg/Hudi folder may be located in a child folder of the root folder. The crawler will scan all folders underneath a path for a Hudi folder.  
For a JDBC data store  
Enter *<database>*/*<schema>*/*<table>* or *<database>*/*<table>*, depending on the database product. Oracle Database and MySQL don’t support schema in the path. You can substitute the percent (%) character for *<schema>* or *<table>*. For example, for an Oracle database with a system identifier (SID) of `orcl`, enter `orcl/%` to import all tables to which the user named in the connection has access.  
This field is case-sensitive.
 If you choose to bring in your own JDBC driver versions, AWS Glue crawlers will consume resources in AWS Glue jobs and Amazon S3 buckets to ensure your provided drivers run in your environment. The additional usage of resources will be reflected in your account. Drivers are limited to the properties described in [Adding an AWS Glue connection](https://docs.aws.amazon.com/glue/latest/dg/console-connections.html).   
For a MongoDB, MongoDB Atlas, or Amazon DocumentDB data store  
For MongoDB, MongoDB Atlas, and Amazon DocumentDB (with MongoDB compatibility), the syntax is `database/collection`.
For JDBC data stores, the syntax is either `database-name/schema-name/table-name` or `database-name/table-name`. The syntax depends on whether the database engine supports schemas within a database. For example, for database engines such as MySQL or Oracle, don't specify a `schema-name` in your include path. You can substitute the percent sign (`%`) for a schema or table in the include path to represent all schemas or all tables in a database. You cannot substitute the percent sign (`%`) for database in the include path. 
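
For illustration, the include path rules above can be captured in a small hypothetical helper (this function is not part of any AWS SDK):

```
# Hypothetical helper that assembles a JDBC include path following the
# rules above: engines without schema support (for example, MySQL and
# Oracle) omit the schema segment, and "%" stands for all schemas or
# all tables.
def build_include_path(database, schema=None, table="%", supports_schema=True):
    if supports_schema:
        return f"{database}/{schema or '%'}/{table}"
    return f"{database}/{table}"

print(build_include_path("MyDatabase", "MySchema"))       # MyDatabase/MySchema/%
print(build_include_path("orcl", supports_schema=False))  # orcl/%
```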

**Maximum traversal depth (for Iceberg or Hudi data stores only)**  
Defines the maximum depth of the Amazon S3 path that the crawler can traverse to discover the Iceberg or Hudi metadata folder in your Amazon S3 path. The purpose of this parameter is to limit the crawler run time. The default value is 10 and the maximum is 20.

**Exclude patterns**  
These enable you to exclude certain files or tables from the crawl. The exclude path is relative to the include path. For example, to exclude a table in your JDBC data store, type the table name in the exclude path.   
A crawler connects to a JDBC data store using an AWS Glue connection that contains a JDBC URI connection string. The crawler only has access to objects in the database engine using the JDBC user name and password in the AWS Glue connection. *The crawler can only create tables that it can access through the JDBC connection.* After the crawler accesses the database engine with the JDBC URI, the include path is used to determine which tables in the database engine are created in the Data Catalog. For example, with MySQL, if you specify an include path of `MyDatabase/%`, then all tables within `MyDatabase` are created in the Data Catalog. When accessing Amazon Redshift, if you specify an include path of `MyDatabase/%`, then all tables within all schemas for database `MyDatabase` are created in the Data Catalog. If you specify an include path of `MyDatabase/MySchema/%`, then all tables in database `MyDatabase` and schema `MySchema` are created.   
After you specify an include path, you can then exclude objects from the crawl that your include path would otherwise include by specifying one or more Unix-style `glob` exclude patterns. These patterns are applied to your include path to determine which objects are excluded. These patterns are also stored as a property of tables created by the crawler. AWS Glue PySpark extensions, such as `create_dynamic_frame.from_catalog`, read the table properties and exclude objects defined by the exclude pattern.   
AWS Glue supports the following `glob` patterns in the exclude pattern.       
[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/define-crawler-choose-data-sources.html)
AWS Glue interprets `glob` exclude patterns as follows:  
+ The slash (`/`) character is the delimiter to separate Amazon S3 keys into a folder hierarchy.
+ The asterisk (`*`) character matches zero or more characters of a name component without crossing folder boundaries.
+ A double asterisk (`**`) matches zero or more characters crossing folder or schema boundaries.
+ The question mark (`?`) character matches exactly one character of a name component.
+ The backslash (`\`) character is used to escape characters that otherwise can be interpreted as special characters. The expression `\\` matches a single backslash, and `\{` matches a left brace.
+ Brackets `[ ]` create a bracket expression that matches a single character of a name component out of a set of characters. For example, `[abc]` matches `a`, `b`, or `c`. The hyphen (`-`) can be used to specify a range, so `[a-z]` specifies a range that matches from `a` through `z` (inclusive). These forms can be mixed, so `[abce-g]` matches `a`, `b`, `c`, `e`, `f`, or `g`. If the character after the bracket (`[`) is an exclamation point (`!`), the bracket expression is negated. For example, `[!a-c]` matches any character except `a`, `b`, or `c`.

  Within a bracket expression, the `*`, `?`, and `\` characters match themselves. The hyphen (`-`) character matches itself if it is the first character within the brackets, or if it's the first character after the `!` when you are negating.
+ Braces (`{ }`) enclose a group of subpatterns, where the group matches if any subpattern in the group matches. A comma (`,`) character is used to separate the subpatterns. Groups cannot be nested.
+ Leading period or dot characters in file names are treated as normal characters in match operations. For example, the `*` exclude pattern matches the file name `.hidden`.

**Example Amazon S3 exclude patterns**  
Each exclude pattern is evaluated against the include path. For example, suppose that you have the following Amazon S3 directory structure:  

```
/mybucket/myfolder/
   departments/
      finance.json
      market-us.json
      market-emea.json
      market-ap.json
   employees/
      hr.json
      john.csv
      jane.csv
      juan.txt
```
Given the include path `s3://mybucket/myfolder/`, the following are some sample results for exclude patterns:    
[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/define-crawler-choose-data-sources.html)
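
To build intuition for how exclude patterns behave against this structure, here is a rough, unofficial approximation of a subset of the glob semantics (`*`, `**`, and `?` only); it is not the crawler's actual matcher:

```
import re

# Unofficial approximation of a subset of the crawler's glob semantics:
# "*" matches within one folder level, "**" crosses folder boundaries,
# and "?" matches a single character. Brace groups and bracket
# expressions are omitted for brevity.
def glob_to_regex(pattern):
    out, i = [], 0
    while i < len(pattern):
        if pattern.startswith("**", i):
            out.append(".*"); i += 2          # crosses folder boundaries
        elif pattern[i] == "*":
            out.append("[^/]*"); i += 1       # stays within one folder level
        elif pattern[i] == "?":
            out.append("[^/]"); i += 1        # exactly one character
        else:
            out.append(re.escape(pattern[i])); i += 1
    return re.compile("".join(out) + r"\Z")

# Object keys relative to the include path s3://mybucket/myfolder/
keys = ["departments/finance.json", "departments/market-us.json",
        "employees/hr.json", "employees/john.csv", "employees/jane.csv",
        "employees/juan.txt"]

excluded = [k for k in keys if glob_to_regex("employees/*.csv").match(k)]
print(excluded)  # ['employees/john.csv', 'employees/jane.csv']
```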

**Example Excluding a subset of Amazon S3 partitions**  
Suppose that your data is partitioned by day, so that each day in a year is in a separate Amazon S3 partition. For January 2015, there are 31 partitions. Now, to crawl data for only the first week of January, you must exclude all partitions except days 1 through 7:  

```
2015/01/{[!0],0[8-9]}**, 2015/0[2-9]/**, 2015/1[0-2]/**
```
Take a look at the parts of this glob pattern. The first part, `2015/01/{[!0],0[8-9]}**`, excludes all days that don't begin with a "0", in addition to day 08 and day 09, from month 01 in year 2015. Notice that "**" is used as the suffix to the day number pattern and crosses folder boundaries to lower-level folders. If "*" is used, lower folder levels are not excluded.  
The second part, ` 2015/0[2-9]/**`, excludes days in months 02 to 09, in year 2015.  
The third part, `2015/1[0-2]/**`, excludes days in months 10, 11, and 12, in year 2015.

**Example JDBC exclude patterns**  
Suppose that you are crawling a JDBC database with the following schema structure:  

```
MyDatabase/MySchema/
   HR_us
   HR_fr
   Employees_Table
   Finance
   Market_US_Table
   Market_EMEA_Table
   Market_AP_Table
```
Given the include path `MyDatabase/MySchema/%`, the following are some sample results for exclude patterns:    
[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/define-crawler-choose-data-sources.html)

**Additional crawler source parameters**  
Each source type requires a different set of additional parameters.

**Connection**  
Select or add an AWS Glue connection. For information about connections, see [Connecting to data](glue-connections.md).

**Additional metadata - optional (for JDBC data stores)**  
Select additional metadata properties for the crawler to crawl.  
+ Comments: Crawl associated table level and column level comments.
+ Raw types: Persist the raw datatypes of the table columns in additional metadata. As a default behavior, the crawler translates the raw datatypes to Hive-compatible types.

**JDBC Driver Class name - optional (for JDBC data stores)**  
 Type a custom JDBC driver class name for the crawler to connect to the data source:   
+ Postgres: org.postgresql.Driver
+ MySQL: com.mysql.jdbc.Driver, com.mysql.cj.jdbc.Driver
+ Redshift: com.amazon.redshift.jdbc.Driver, com.amazon.redshift.jdbc42.Driver
+ Oracle: oracle.jdbc.driver.OracleDriver
+ SQL Server: com.microsoft.sqlserver.jdbc.SQLServerDriver

**JDBC Driver S3 Path - optional (for JDBC data stores)**  
Choose an existing Amazon S3 path to a `.jar` file. This is where the `.jar` file will be stored when using a custom JDBC driver for the crawler to connect to the data source.

**Enable data sampling (for Amazon DynamoDB, MongoDB, MongoDB Atlas, and Amazon DocumentDB data stores only)**  
Select whether to crawl a data sample only. If not selected, the entire table is crawled. Scanning all the records can take a long time when the table is not a high-throughput table.

**Create tables for querying (for Delta Lake data stores only)**  
Select how you want to create the Delta Lake tables:  
+ Create Native tables: Allow integration with query engines that support querying of the Delta transaction log directly.
+ Create Symlink tables: Create a symlink manifest folder with manifest files partitioned by the partition keys, based on the specified configuration parameters.

**Scanning rate - optional (for DynamoDB data stores only)**  
Specify the percentage of the DynamoDB table Read Capacity Units to use by the crawler. Read capacity units is a term defined by DynamoDB, and is a numeric value that acts as a rate limiter for the number of reads that can be performed on that table per second. Enter a value between 0.1 and 1.5. If not specified, the value defaults to 0.5 for provisioned tables and to 1/4 of the maximum configured capacity for on-demand tables. Note that only provisioned capacity mode should be used with AWS Glue crawlers.  
For DynamoDB data stores, set the provisioned capacity mode for processing reads and writes on your tables. The AWS Glue crawler should not be used with the on-demand capacity mode.
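
As a back-of-the-envelope sketch, assuming the crawler's read rate is simply the scanning-rate fraction applied to the table's provisioned read capacity units (RCU):

```
# Rough arithmetic only: assumes the crawler's effective read rate is the
# scanning-rate fraction multiplied by the table's provisioned RCU.
def crawler_reads_per_second(provisioned_rcu, scanning_rate=0.5):
    if not 0.1 <= scanning_rate <= 1.5:
        raise ValueError("scanning rate must be between 0.1 and 1.5")
    return provisioned_rcu * scanning_rate

print(crawler_reads_per_second(100))        # 50.0 at the 0.5 default
print(crawler_reads_per_second(100, 1.5))   # 150.0 at the maximum rate
```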

**Network connection - optional (for Amazon S3, Delta, Iceberg, Hudi and Catalog target data stores)**  
Optionally include a Network connection to use with this Amazon S3 target. Note that each crawler is limited to one Network connection so any other Amazon S3 targets will also use the same connection (or none, if left blank).  
For information about connections, see [Connecting to data](glue-connections.md).

**Sample only a subset of files and Sample size (for Amazon S3 data stores only)**  
Specify the number of files in each leaf folder to be crawled when crawling sample files in a dataset. When this feature is turned on, instead of crawling all the files in this dataset, the crawler randomly selects some files in each leaf folder to crawl.   
The sampling crawler is best suited for customers who have previous knowledge about their data formats and know that schemas in their folders do not change. Turning on this feature will significantly reduce crawler runtime.  
A valid value is an integer between 1 and 249. If not specified, all the files are crawled.

**Subsequent crawler runs**  
This field is a global field that affects all Amazon S3 data sources.  
+ Crawl all sub-folders: Crawl all folders again with every subsequent crawl.
+ Crawl new sub-folders only: Only Amazon S3 folders that were added since the last crawl will be crawled. If the schemas are compatible, new partitions will be added to existing tables. For more information, see [Scheduling incremental crawls for adding new partitions](incremental-crawls.md).
+ Crawl based on events: Rely on Amazon S3 events to control what folders to crawl. For more information, see [Accelerating crawls using Amazon S3 event notifications](crawler-s3-event-notifications.md).

**Custom classifiers - optional**  
Define custom classifiers before defining crawlers. A classifier checks whether a given file is in a format the crawler can handle. If it is, the classifier creates a schema in the form of a `StructType` object that matches that data format.  
For more information, see [Defining and managing classifiers](add-classifier.md).

# Step 3: Configure security settings
<a name="define-crawler-configure-security-settings"></a>

**IAM role**  
The crawler assumes this role. It must have permissions similar to the AWS managed policy `AWSGlueServiceRole`. For Amazon S3 and DynamoDB sources, it must also have permissions to access the data store. If the crawler reads Amazon S3 data encrypted with AWS Key Management Service (AWS KMS), then the role must have decrypt permissions on the AWS KMS key.   
For an Amazon S3 data store, additional permissions attached to the role would be similar to the following:     
****  

```
{
  "Version":"2012-10-17",		 	 	 
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject"
      ],
      "Resource": [
        "arn:aws:s3:::bucket/object*"
      ]
    }
  ]
}
```
For an Amazon DynamoDB data store, additional permissions attached to the role would be similar to the following:     
****  

```
{
  "Version":"2012-10-17",		 	 	 
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "dynamodb:DescribeTable",
        "dynamodb:Scan"
      ],
      "Resource": [
        "arn:aws:dynamodb:*:111122223333:table/table-name*"
      ]
    }
  ]
}
```
 In order to add your own JDBC driver, additional permissions need to be added.   
+  Grant permissions for the following job actions: `CreateJob`, `DeleteJob`, `GetJob`, `GetJobRun`, `StartJobRun`. 
+  Grant permissions for Amazon S3 actions: `s3:DeleteObjects`, `s3:GetObject`, `s3:ListBucket`, `s3:PutObject`. 
**Note**  
The `s3:ListBucket` permission is not needed if the Amazon S3 bucket policy is disabled.
+  Grant service principal access to bucket/folder in the Amazon S3 policy. 
 Example Amazon S3 policy:     
****  

```
{
    "Version":"2012-10-17",		 	 	 
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:GetObject",
                "s3:ListBucket",
                "s3:DeleteObject"
            ],
            "Resource": [
                "arn:aws:s3:::amzn-s3-demo-bucket/driver-parent-folder/driver.jar",
                "arn:aws:s3:::amzn-s3-demo-bucket"
            ]
        }
    ]
}
```
 AWS Glue creates the following folders (`_crawler` and `_glue_job_crawler`) at the same level as the JDBC driver in your Amazon S3 bucket. For example, if the driver path is `<s3-path/driver_folder/driver.jar>`, then the following folders will be created if they do not already exist:   
+  `<s3-path/driver_folder/_crawler>` 
+  `<s3-path/driver_folder/_glue_job_crawler>` 
 Optionally, you can add a security configuration to a crawler to specify at-rest encryption options.  
For more information, see [Step 2: Create an IAM role for AWS Glue](create-an-iam-role.md) and [Identity and access management for AWS Glue](security-iam.md).

**Lake Formation configuration - optional**  
Allow the crawler to use Lake Formation credentials for crawling the data source.  
Checking **Use Lake Formation credentials for crawling S3 data source** allows the crawler to use Lake Formation credentials for crawling the data source. If the data source belongs to another account, you must provide the registered account ID. Otherwise, the crawler crawls only those data sources associated with the account. This setting is applicable only to Amazon S3 and Data Catalog data sources.

**Security configuration - optional**  
Settings include security configurations. For more information, see the following:   
+ [Encrypting data written by AWS Glue](encryption-security-configuration.md)
Once a security configuration has been set on a crawler, you can change it, but you cannot remove it. To lower the level of security on a crawler, explicitly set the security feature to `DISABLED` within your configuration, or create a new crawler.

# Step 4: Set output and scheduling
<a name="define-crawler-set-output-and-scheduling"></a>

**Output configuration**  
Options include how the crawler should handle detected schema changes, deleted objects in the data store, and more. For more information, see [Customizing crawler behavior](crawler-configuration.md).

**Crawler schedule**  
You can run a crawler on demand or define a time-based schedule for your crawlers and jobs in AWS Glue. The definition of these schedules uses the Unix-like cron syntax. For more information, see [Scheduling a crawler](schedule-crawler.md).

# Step 5: Review and create
<a name="define-crawler-review"></a>

Review the crawler settings you configured, and create the crawler.

# Scheduling a crawler
<a name="schedule-crawler"></a>

You can run an AWS Glue crawler on demand or on a regular schedule. When you set up a crawler based on a schedule, you can specify certain constraints, such as the frequency of the crawler runs, which days of the week it runs, and at what time. You can create these custom schedules in *cron* format. For more information, see [cron](http://en.wikipedia.org/wiki/Cron) in Wikipedia.

When setting up a crawler schedule, you should consider the features and limitations of cron. For example, if you choose to run your crawler on day 31 each month, keep in mind that some months don't have 31 days.

**Topics**
+ [Create a crawler schedule](create-crawler-schedule.md)
+ [Create a schedule for an existing crawler](Update-crawler-schedule.md)

# Create a crawler schedule
<a name="create-crawler-schedule"></a>

You can create a schedule for the crawler using the AWS Glue console or AWS CLI.

------
#### [ AWS Management Console ]

1. Sign in to the AWS Management Console, and open the AWS Glue console at [https://console.aws.amazon.com/glue/](https://console.aws.amazon.com/glue/). 

1. Choose **Crawlers** in the navigation pane.

1. Follow steps 1-3 in the [Configuring a crawler](define-crawler.md) section.

1. In [Step 4: Set output and scheduling](define-crawler-set-output-and-scheduling.md), choose a **Crawler schedule** to set the frequency of the run. You can choose to run the crawler hourly, daily, weekly, or monthly, or you can define a custom schedule using cron expressions.

   A cron expression is a string representing a schedule pattern, consisting of 6 fields separated by spaces: `<minute> <hour> <day of month> <month> <day of week> <year>` 

   For example, to run a task every day at midnight, the cron expression is: `0 0 * * ? *`

   For more information, see [Cron expressions](https://docs.aws.amazon.com/glue/latest/dg/monitor-data-warehouse-schedule.html#CronExpressions).

1. Review the crawler settings you configured, and create the crawler to run on a schedule.
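
   For illustration only, the six-field format from step 4 can be broken down with a small parser (this is not part of AWS Glue or the AWS CLI):

   ```
   # Illustrative breakdown of the six-field AWS cron format.
   def parse_cron(expression):
       names = ["minute", "hour", "day_of_month",
                "month", "day_of_week", "year"]
       fields = expression.split()
       if len(fields) != 6:
           raise ValueError("AWS cron expressions have exactly 6 fields")
       return dict(zip(names, fields))

   # "Run every day at midnight" from the example above:
   print(parse_cron("0 0 * * ? *"))
   # {'minute': '0', 'hour': '0', 'day_of_month': '*',
   #  'month': '*', 'day_of_week': '?', 'year': '*'}
   ```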

------
#### [ AWS CLI ]

```
aws glue create-crawler \
 --name myCrawler \
 --role AWSGlueServiceRole-myCrawler \
 --targets '{"S3Targets": [{"Path": "s3://amzn-s3-demo-bucket/"}]}' \
 --schedule "cron(15 12 * * ? *)"
```

------

For more information about using cron to schedule jobs and crawlers, see [Time-based schedules for jobs and crawlers](monitor-data-warehouse-schedule.md). 

# Create a schedule for an existing crawler
<a name="Update-crawler-schedule"></a>

Follow these steps to set up a recurring schedule for an existing crawler.

------
#### [ AWS Management Console ]

1. Sign in to the AWS Management Console and open the AWS Glue console at [https://console.aws.amazon.com/glue/](https://console.aws.amazon.com/glue/). 

1. Choose **Crawlers** in the navigation pane.

1. Choose a crawler that you want to schedule from the available list.

1. Choose **Edit** from the **Actions menu.**

1. Scroll down to **Step 4: Set output and scheduling**, and choose **Edit**. 

1.  Update your crawler schedule under **Crawler schedule**. 

1. Choose **Update**.

------
#### [ AWS CLI ]

Use the following CLI command to update an existing crawler configuration:

```
aws glue update-crawler-schedule \
   --crawler-name myCrawler \
   --schedule "cron(15 12 * * ? *)"
```

------

# Viewing crawler results and details
<a name="console-crawlers-details"></a>

 After the crawler runs successfully, it creates table definitions in the Data Catalog. Choose **Tables** in the navigation pane to see the tables that were created by your crawler in the database that you specified. 

 You can view information related to the crawler itself as follows:
+ The **Crawlers** page on the AWS Glue console displays the following properties for a crawler:    
[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/console-crawlers-details.html)
+  To view the history of a crawler, choose **Crawlers** in the navigation pane to see the crawlers you created. Choose a crawler from the list of available crawlers. You can view the crawler properties and view the crawler history in the **Crawler runs** tab. 

   The Crawler runs tab displays information about each time the crawler ran, including **Start time (UTC)**, **End time (UTC)**, **Duration**, **Status**, **DPU hours**, and **Table changes**. 

  The Crawler runs tab displays only the crawls that have occurred since the launch date of the crawler history feature, and only retains up to 12 months of crawls. Older crawls will not be returned.
+ To see additional information, choose a tab in the crawler details page. Each tab will display information related to the crawler. 
  +  **Schedule**: Any schedules created for the crawler will be visible here. 
  +  **Data sources**: All data sources scanned by the crawler will be visible here. 
  +  **Classifiers**: All classifiers assigned to the crawler will be visible here. 
  +  **Tags**: Any tags created and assigned to an AWS resource will be visible here. 

# Parameters set on Data Catalog tables by crawler
<a name="table-properties-crawler"></a>

 These table properties are set by AWS Glue crawlers. We expect users to consume the `classification` and `compressionType` properties. Other properties, including table size estimates, are used for internal calculations, and we do not guarantee their accuracy or applicability to customer use cases. Changing these parameters may alter the behavior of the crawler; we do not support this workflow. 


| Property key | Property value | 
| --- | --- | 
| UPDATED_BY_CRAWLER | Name of the crawler performing the update. | 
| connectionName | The name of the connection in the Data Catalog that the crawler used to connect to the data store. | 
| recordCount | Estimated count of records in the table, based on file sizes and headers. | 
| skip.header.line.count | Rows skipped to skip the header. Set on tables classified as CSV. | 
| CrawlerSchemaSerializerVersion | For internal use | 
| classification | Format of data, inferred by crawler. For more information about data formats supported by AWS Glue crawlers see [Built-in classifiers](add-classifier.md#classifier-built-in). | 
| CrawlerSchemaDeserializerVersion | For internal use | 
| sizeKey | Combined size of files in table crawled. | 
| averageRecordSize | Average size of row in table, in bytes. | 
| compressionType | Type of compression used on data in the table. For more information about compression types supported by AWS Glue crawlers see [Built-in classifiers](add-classifier.md#classifier-built-in). | 
| typeOfData | `file`, `table` or `view`. | 
| objectCount | Number of objects under Amazon S3 path for table. | 

 These additional table properties are set by AWS Glue crawlers for Snowflake data stores. 


| Property key | Property value | 
| --- | --- | 
| aws:RawTableLastAltered | Records the last altered timestamp of the Snowflake table. | 
| ViewOriginalText | View SQL statement. | 
| ViewExpandedText | View SQL statement encoded in Base64 format. | 
| ExternalTable:S3Location | Amazon S3 location of the Snowflake external table. | 
| ExternalTable:FileFormat | Amazon S3 file format of the Snowflake external table. | 

 These additional table properties are set by AWS Glue crawlers for JDBC-type data stores such as Amazon Redshift, Microsoft SQL Server, MySQL, PostgreSQL, and Oracle. 


| Property key | Property value | 
| --- | --- | 
| aws:RawType | When a crawler stores data in the Data Catalog, it translates the datatypes to Hive-compatible types, which often causes information about the native datatype to be lost. The crawler outputs the `aws:RawType` parameter to provide the native-level datatype. | 
| aws:RawColumnComment | If a comment is associated with a column in the database, the crawler outputs the corresponding comment in the catalog table. The comment string is truncated to 255 bytes. Comments are not supported for Microsoft SQL Server.  | 
| aws:RawTableComment | If a comment is associated with a table in the database, the crawler outputs the corresponding comment in the catalog table. The comment string is truncated to 255 bytes. Comments are not supported for Microsoft SQL Server. | 

# Customizing crawler behavior
<a name="crawler-configuration"></a>

When you configure an AWS Glue crawler, you have several options for defining the behavior of your crawler.
+ **Incremental crawls** – You can configure a crawler to run incremental crawls to add only new partitions to the table schema. 
+ **Partition indexes** – A crawler creates partition indexes for Amazon S3 and Delta Lake targets by default to provide efficient lookup for specific partitions.
+ **Accelerate crawl time by using Amazon S3 events** – You can configure a crawler to use Amazon S3 events to identify the changes between two crawls by listing all the files from the subfolder which triggered the event instead of listing the full Amazon S3 or Data Catalog target.
+ **Handling schema changes** – You can prevent a crawler from making any schema changes to the existing schema. You can use the AWS Management Console or the AWS Glue API to configure how your crawler processes certain types of changes. 
+ **A single schema for multiple Amazon S3 paths** – You can configure a crawler to create a single schema for each S3 path if the data is compatible.
+ **Table location and partitioning levels** – The table level crawler option provides you the flexibility to tell the crawler where the tables are located, and how you want partitions created. 
+ **Table threshold** – You can specify the maximum number of tables the crawler is allowed to create by specifying a table threshold.
+ **AWS Lake Formation credentials** – You can configure a crawler to use Lake Formation credentials to access an Amazon S3 data store or a Data Catalog table with an underlying Amazon S3 location within the same AWS account or another AWS account. 

 For more information about using the AWS Glue console to add a crawler, see [Configuring a crawler](define-crawler.md). 

**Topics**
+ [Scheduling incremental crawls for adding new partitions](incremental-crawls.md)
+ [Generating partition indexes](crawler-configure-partition-indexes.md)
+ [Preventing a crawler from changing an existing schema](crawler-schema-changes-prevent.md)
+ [Creating a single schema for each Amazon S3 include path](crawler-grouping-policy.md)
+ [Specifying the table location and partitioning level](crawler-table-level.md)
+ [Specifying the maximum number of tables the crawler is allowed to create](crawler-maximum-number-of-tables.md)
+ [Configuring a crawler to use Lake Formation credentials](crawler-lf-integ.md)
+ [Accelerating crawls using Amazon S3 event notifications](crawler-s3-event-notifications.md)

# Scheduling incremental crawls for adding new partitions
<a name="incremental-crawls"></a>

You can configure an AWS Glue crawler to run incremental crawls to add only new partitions to the table schema. When the crawler runs for the first time, it performs a full crawl that processes the entire data source to record the complete schema and all existing partitions in the AWS Glue Data Catalog.

Subsequent crawls after the initial full crawl will be incremental, where the crawler identifies and adds only the new partitions that have been introduced since the previous crawl. This approach results in faster crawl times, as the crawler no longer needs to process the entire data source for each run, but instead focuses only on the new partitions. 

**Note**  
Incremental crawls don't detect modifications or deletions of existing partitions. This configuration is best suited for data sources with a stable schema. If a one-time major schema change occurs, it is advisable to temporarily set the crawler to perform a full crawl to capture the new schema accurately, and then switch back to incremental crawling mode. 

The following diagram shows that with the incremental crawl setting enabled, the crawler will only detect and add the newly added folder, month=March, to the catalog.

![\[The following diagram shows that files for the month of March have been added.\]](http://docs.aws.amazon.com/glue/latest/dg/images/crawlers-s3-folders-new.png)


Follow these steps to update your crawler to perform incremental crawls:

------
#### [ AWS Management Console ]

1. Sign in to the AWS Management Console and open the AWS Glue console at [https://console.aws.amazon.com/glue/](https://console.aws.amazon.com/glue/).

1. Choose **Crawlers** under the **Data Catalog**.

1. Choose a crawler that you want to set up to crawl incrementally.

1. Choose **Edit**.

1. Choose **Step 2. Choose data sources and classifiers**.

1. Choose the data source that you want to incrementally crawl. 

1. Choose **Edit**.

1. Choose **Crawl new sub-folders only** under **Subsequent crawler runs**.

1. Choose **Update**.

To create a schedule for a crawler, see [Scheduling a crawler](schedule-crawler.md).

------
#### [ AWS CLI ]

```
aws glue update-crawler \
 --name myCrawler \
 --recrawl-policy RecrawlBehavior=CRAWL_NEW_FOLDERS_ONLY \
 --schema-change-policy UpdateBehavior=LOG,DeleteBehavior=LOG
```

------

**Notes and restrictions**  
When this option is turned on, you can't change the Amazon S3 target data stores when editing the crawler. This option also affects certain crawler configuration settings: when turned on, it forces both the update behavior and the delete behavior of the crawler to `LOG`. This means that:
+ If the crawler discovers objects with incompatible schemas, it doesn't add those objects to the Data Catalog; instead, it records the detail in CloudWatch Logs.
+ The crawler doesn't update the Data Catalog for deleted objects.

# Generating partition indexes
<a name="crawler-configure-partition-indexes"></a>

The Data Catalog supports creating partition indexes to provide efficient lookup for specific partitions. For more information, see [Creating partition indexes](https://docs.aws.amazon.com/glue/latest/dg/partition-indexes.html). The AWS Glue crawler creates partition indexes for Amazon S3 and Delta Lake targets by default.

------
#### [ AWS Management Console ]

1. Sign in to the AWS Management Console and open the AWS Glue console at [https://console.aws.amazon.com/glue/](https://console.aws.amazon.com/glue/).

1. Choose **Crawlers** under the **Data Catalog**.

1. When you define a crawler, the option **Create partition indexes automatically** is enabled by default under **Advanced options** on the **Set output and scheduling** page.

   To disable this option, clear the **Create partition indexes automatically** checkbox in the console. 

1. Complete the crawler configuration and choose **Create crawler**.

------
#### [ AWS CLI ]

You can also disable this option by using the AWS CLI: set `CreatePartitionIndex` to `false` in the `configuration` parameter. The default value is `true`.

```
aws glue update-crawler \
    --name myCrawler \
    --configuration '{"Version": 1.0, "CreatePartitionIndex": false }'
```

------

## Usage notes for partition indexes
<a name="crawler-configure-partition-indexes-usage-notes"></a>
+ Tables created by the crawler do not have the variable `partition_filtering.enabled` by default. For more information, see [AWS Glue partition indexing and filtering](https://docs.aws.amazon.com/athena/latest/ug/glue-best-practices.html#glue-best-practices-partition-index).
+ Creating partition indexes for encrypted partitions is not supported.

# Preventing a crawler from changing an existing schema
<a name="crawler-schema-changes-prevent"></a>

You can prevent AWS Glue crawlers from making any schema changes to the Data Catalog when they run. By default, a crawler updates the schema in the Data Catalog to match the data source being crawled. However, in some cases you may want to prevent the crawler from modifying the existing schema, especially if you have transformed or cleaned the data and don't want the original schema to overwrite those changes.

 Follow these steps to configure your crawler not to overwrite the existing schema in a table definition. 

------
#### [  AWS Management Console  ]

1. Sign in to the AWS Management Console and open the AWS Glue console at [https://console.aws.amazon.com/glue/](https://console.aws.amazon.com/glue/).

1. Choose **Crawlers** under the **Data Catalog**.

1. Choose a crawler from the list, and choose **Edit**.

1. Choose **Step 4, Set output and scheduling**.

1. Under **Advanced options**, choose **Add new columns only** or **Ignore the change and don't update the table in the Data Catalog**. 

1.  You can also set a configuration option to **Update all new and existing partitions with metadata from the table**. This sets partition schemas to inherit from the table. 

1. Choose **Update**.

------
#### [ AWS CLI ]

The following example shows how to configure a crawler so that it doesn't change the existing schema and only adds new columns:

```
aws glue update-crawler \
  --name myCrawler \
  --configuration '{"Version": 1.0, "CrawlerOutput": {"Tables": {"AddOrUpdateBehavior": "MergeNewColumns"}}}'
```

The following example shows how to configure a crawler so that it doesn't change the existing schema and doesn't add new columns:

```
aws glue update-crawler \
  --name myCrawler \
  --schema-change-policy UpdateBehavior=LOG \
  --configuration '{"Version": 1.0, "CrawlerOutput": {"Partitions": { "AddOrUpdateBehavior": "InheritFromTable" }}}'
```

------
#### [ API ]

If you don't want a table schema to change at all when a crawler runs, set the schema change policy to `LOG`. 

When you configure the crawler using the API, set the following parameters:
+ Set the `UpdateBehavior` field in `SchemaChangePolicy` structure to `LOG`.
+  Set the `Configuration` field with a string representation of the following JSON object in the crawler API; for example: 

  ```
  {
     "Version": 1.0,
     "CrawlerOutput": {
        "Partitions": { "AddOrUpdateBehavior": "InheritFromTable" }
     }
  }
  ```
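
If you configure the crawler programmatically, a minimal sketch using boto3 might look like the following (assumptions: a crawler named `myCrawler` already exists and AWS credentials are configured; the client call is shown commented out because it requires a live AWS account):

```python
import json

# The crawler API takes the Configuration field as a string representation
# of the JSON object above, so it is serialized with json.dumps before the call.
configuration = json.dumps({
    "Version": 1.0,
    "CrawlerOutput": {
        "Partitions": {"AddOrUpdateBehavior": "InheritFromTable"}
    },
})

# import boto3
# glue = boto3.client("glue")
# glue.update_crawler(
#     Name="myCrawler",  # assumed crawler name
#     SchemaChangePolicy={"UpdateBehavior": "LOG", "DeleteBehavior": "LOG"},
#     Configuration=configuration,
# )

print(configuration)
```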

------

# Creating a single schema for each Amazon S3 include path
<a name="crawler-grouping-policy"></a>

By default, when a crawler defines tables for data stored in Amazon S3, it considers both data compatibility and schema similarity. Data compatibility factors that it considers include whether the data is of the same format (for example, JSON), the same compression type (for example, GZIP), the structure of the Amazon S3 path, and other data attributes. Schema similarity is a measure of how closely the schemas of separate Amazon S3 objects are similar.

To help illustrate this option, suppose that you define a crawler with an include path `s3://amzn-s3-demo-bucket/table1/`. When the crawler runs, it finds two JSON files with the following characteristics:
+ **File 1** – `s3://amzn-s3-demo-bucket/table1/year=2017/data1.json`
  + *File content* – `{"A": 1, "B": 2}`
  + *Schema* – `A:int, B:int`
+ **File 2** – `s3://amzn-s3-demo-bucket/table1/year=2018/data2.json`
  + *File content* – `{"C": 3, "D": 4}`
  + *Schema* – `C:int, D:int`

By default, the crawler creates two tables, named `year_2017` and `year_2018` because the schemas are not sufficiently similar. However, if the option **Create a single schema for each S3 path** is selected, and if the data is compatible, the crawler creates one table. The table has the schema `A:int,B:int,C:int,D:int` and `partitionKey` `year:string`.
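
The merge that produces the combined table can be sketched as follows (illustrative only; the crawler's actual compatibility rules also consider format, compression, and Amazon S3 path structure):

```python
# Hypothetical sketch: merge the column sets of compatible files into one
# schema, rejecting conflicting column types.
def merge_schemas(*schemas: dict[str, str]) -> dict[str, str]:
    merged: dict[str, str] = {}
    for schema in schemas:
        for column, col_type in schema.items():
            if merged.get(column, col_type) != col_type:
                raise ValueError(f"Incompatible type for column {column}")
            merged[column] = col_type
    return merged

file1 = {"A": "int", "B": "int"}  # year=2017/data1.json
file2 = {"C": "int", "D": "int"}  # year=2018/data2.json

print(merge_schemas(file1, file2))  # {'A': 'int', 'B': 'int', 'C': 'int', 'D': 'int'}
```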

------
#### [ AWS Management Console ]

1. Sign in to the AWS Management Console and open the AWS Glue console at [https://console.aws.amazon.com/glue/](https://console.aws.amazon.com/glue/).

1. Choose **Crawlers** under the **Data Catalog**.

1. When you configure a new crawler, under **Output and scheduling**, select the option **Create a single schema for each S3 path** under **Advanced options**. 

------
#### [ AWS CLI ]

You can configure a crawler to `CombineCompatibleSchemas` into a common table definition when possible. With this option, the crawler still considers data compatibility, but ignores the similarity of the specific schemas when evaluating Amazon S3 objects in the specified include path.

When you configure the crawler using the AWS CLI, set the following configuration option:

```
aws glue update-crawler \
   --name myCrawler \
   --configuration '{"Version": 1.0, "Grouping": {"TableGroupingPolicy": "CombineCompatibleSchemas" }}'
```

------
#### [ API ]

When you configure the crawler using the API, set the following configuration option:

 Set the `Configuration` field with a string representation of the following JSON object in the crawler API; for example: 

```
{
   "Version": 1.0,
   "Grouping": {
      "TableGroupingPolicy": "CombineCompatibleSchemas" }
}
```

------

# Specifying the table location and partitioning level
<a name="crawler-table-level"></a>

By default, when a crawler defines tables for data stored in Amazon S3, the crawler attempts to merge schemas together and create top-level tables (`year=2019`). In some cases, you may expect the crawler to create a table for the folder `month=Jan`, but instead the crawler creates a partition because a sibling folder (`month=Mar`) was merged into the same table.

The table level crawler option gives you the flexibility to tell the crawler where tables are located and how you want partitions created. When you specify a **Table level**, the table is created at that absolute level from the Amazon S3 bucket.

![\[Crawler grouping with table level specified as level 2.\]](http://docs.aws.amazon.com/glue/latest/dg/images/crawler-table-level1.jpg)


 When configuring the crawler on the console, you can specify a value for the **Table level** crawler option. The value must be a positive integer that indicates the table location (the absolute level in the dataset). The level for the top level folder is 1. For example, for the path `mydataset/year/month/day/hour`, if the level is set to 3, the table is created at location `mydataset/year/month`. 
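
The level-to-location mapping can be sketched as follows (illustrative only, not AWS Glue code: level N keeps the first N folders of the path):

```python
# Hypothetical sketch of how a Table level value maps a dataset path to
# the table location described above.
def table_location(path: str, level: int) -> str:
    parts = path.strip("/").split("/")
    if not 1 <= level <= len(parts):
        raise ValueError("Table level must be a positive integer within the path depth")
    return "/".join(parts[:level])

print(table_location("mydataset/year/month/day/hour", 3))  # mydataset/year/month
print(table_location("mydataset/year/month/day/hour", 1))  # mydataset
```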

------
#### [ AWS Management Console ]

1. Sign in to the AWS Management Console and open the AWS Glue console at [https://console.aws.amazon.com/glue/](https://console.aws.amazon.com/glue/).

1. Choose **Crawlers** under the **Data Catalog**.

1. When you configure a crawler, under **Output and scheduling**, choose **Table level** under **Advanced options**.

![\[Specifying a table level in the crawler configuration.\]](http://docs.aws.amazon.com/glue/latest/dg/images/crawler-configuration-console.png)


------
#### [ AWS CLI ]

When you configure the crawler using the AWS CLI, set the `configuration` parameter as shown in the example code: 

```
aws glue update-crawler \
  --name myCrawler \
  --configuration '{"Version": 1.0, "Grouping": { "TableLevelConfiguration": 2 }}'
```

------
#### [ API ]

When you configure the crawler using the API, set the `Configuration` field with a string representation of the following JSON object; for example: 

```
{
   "Version": 1.0,
   "Grouping": {
      "TableLevelConfiguration": 2
   }
}
```

------
#### [ CloudFormation ]

In this example, you set the **Table level** option available in the console within your CloudFormation template:

```
"Configuration": "{
    \"Version\":1.0,
    \"Grouping\":{\"TableLevelConfiguration\":2}
}"
```

------

# Specifying the maximum number of tables the crawler is allowed to create
<a name="crawler-maximum-number-of-tables"></a>

You can optionally specify the maximum number of tables the crawler is allowed to create by specifying a `TableThreshold` via the AWS Glue console or AWS CLI. If the number of tables detected by the crawler during its crawl is greater than this input value, the crawl fails and no data is written to the Data Catalog.

This parameter is useful when the crawler would detect and create many more tables than you expect. There can be multiple reasons for this, such as:
+ When using an AWS Glue job to populate your Amazon S3 locations you can end up with empty files at the same level as a folder. In such cases when you run a crawler on this Amazon S3 location, the crawler creates multiple tables due to files and folders present at the same level.
+ If you do not configure `"TableGroupingPolicy": "CombineCompatibleSchemas"` you may end up with more tables than expected. 

You specify the `TableThreshold` as an integer value greater than 0. This value is configured on a per-crawler basis; it is considered on every crawl. For example, suppose a crawler has the `TableThreshold` value set to 5. On each crawl, AWS Glue compares the number of tables detected with this table threshold value (5). If the number of tables detected is less than 5, AWS Glue writes the tables to the Data Catalog; otherwise, the crawl fails without writing to the Data Catalog.
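
The per-crawl check can be sketched as follows (illustrative only; a strict "less than" comparison is assumed here, matching the description above):

```python
# Hypothetical sketch of the table threshold check.
def crawl_allowed(tables_detected: int, table_threshold: int) -> bool:
    """Return True when the crawl may write its tables to the Data Catalog."""
    return tables_detected < table_threshold

print(crawl_allowed(4, 5))    # True: tables are written to the Data Catalog
print(crawl_allowed(29, 28))  # False: the crawl fails without writing
```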

------
#### [ AWS Management Console ]

**To set `TableThreshold` using the AWS Management Console:**

1. Sign in to the AWS Management Console and open the AWS Glue console at [https://console.aws.amazon.com/glue/](https://console.aws.amazon.com/glue/).

1. When configuring a crawler, in **Output and scheduling**, set the **Maximum table threshold** to the number of tables the crawler is allowed to generate.  
![\[The Output and scheduling section of the AWS console showing the Maximum table threshold parameter.\]](http://docs.aws.amazon.com/glue/latest/dg/images/crawler-max-tables.png)

------
#### [ AWS CLI ]

To set `TableThreshold` using the AWS CLI:

```
aws glue update-crawler \
    --name myCrawler \
    --configuration '{"Version": 1.0, "CrawlerOutput": {"Tables": { "TableThreshold": 5 }}}'
```

------
#### [ API ]

To set `TableThreshold` using the API:

```
"{"Version":1.0,
"CrawlerOutput":
{"Tables":{"AddOrUpdateBehavior":"MergeNewColumns",
"TableThreshold":5}}}";
```

------

Error messages are logged to help you identify table paths and clean up your data. The following is an example of the log in your account when the crawler fails because the table count exceeded the table threshold value provided:

```
Table Threshold value = 28, Tables detected - 29
```

In CloudWatch, all detected table locations are logged as INFO messages. The reason for the failure is logged as an ERROR:

```
ERROR com.amazonaws.services.glue.customerLogs.CustomerLogService - CustomerLogService received CustomerFacingException with message 
The number of tables detected by crawler: 29 is greater than the table threshold value provided: 28. Failing crawler without writing to Data Catalog.
com.amazonaws.services.glue.exceptions.CustomerFacingInternalException: The number of tables detected by crawler: 29 is greater than the table threshold value provided: 28. 
Failing crawler without writing to Data Catalog.
```

# Configuring a crawler to use Lake Formation credentials
<a name="crawler-lf-integ"></a>

You can configure a crawler to use AWS Lake Formation credentials to access an Amazon S3 data store or a Data Catalog table with an underlying Amazon S3 location within the same AWS account or another AWS account. You can configure an existing Data Catalog table as a crawler's target, if the crawler and the Data Catalog table reside in the same account. Currently, only a single catalog target with a single catalog table is allowed when using a Data Catalog table as a crawler’s target.

**Note**  
When you are defining a Data Catalog table as a crawler target, make sure that the underlying location of the Data Catalog table is an Amazon S3 location. Crawlers that use Lake Formation credentials only support Data Catalog targets with underlying Amazon S3 locations.

## Setup required when the crawler and registered Amazon S3 location or Data Catalog table reside in the same account (in-account crawling)
<a name="in-account-crawling"></a>

To allow the crawler to access a data store or Data Catalog table by using Lake Formation credentials, you need to register the data location with Lake Formation. Also, the crawler's IAM role must have permissions to read the data from the destination where the Amazon S3 bucket is registered.

You can complete the following configuration steps using the AWS Management Console or AWS Command Line Interface (AWS CLI).

------
#### [ AWS Management Console ]

1. Before configuring a crawler to access the crawler source, register the data location of the data store or the Data Catalog with Lake Formation. In the Lake Formation console ([https://console.aws.amazon.com/lakeformation/](https://console.aws.amazon.com/lakeformation/)), register an Amazon S3 location as the root location of your data lake in the AWS account where the crawler is defined. For more information, see [Registering an Amazon S3 location](https://docs.aws.amazon.com/lake-formation/latest/dg/register-location.html).

1. Grant **Data location** permissions to the IAM role that's used for the crawler run so that the crawler can read the data from the destination in Lake Formation. For more information, see [Granting data location permissions (same account)](https://docs.aws.amazon.com/lake-formation/latest/dg/granting-location-permissions-local.html).

1. Grant the crawler role access permissions (`Create`) to the database, which is specified as the output database. For more information, see [Granting database permissions using the Lake Formation console and the named resource method](https://docs.aws.amazon.com/lake-formation/latest/dg/granting-database-permissions.html).

1. In the IAM console ([https://console.aws.amazon.com/iam/](https://console.aws.amazon.com/iam/)), create an IAM role for the crawler. Add the `lakeformation:GetDataAccess` policy to the role.

1. In the AWS Glue console ([https://console.aws.amazon.com/glue/](https://console.aws.amazon.com/glue/)), while configuring the crawler, select the option **Use Lake Formation credentials for crawling Amazon S3 data source**.
**Note**  
The accountId field is optional for in-account crawling.
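
For reference, the inline policy added to the crawler role in step 4 might look like the following sketch (illustrative only; `lakeformation:GetDataAccess` is commonly granted with `"Resource": "*"` — review the scoping against your own security requirements):

```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "lakeformation:GetDataAccess",
      "Resource": "*"
    }
  ]
}
```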

------
#### [ AWS CLI ]

```
aws glue --profile demo create-crawler --debug --cli-input-json '{
    "Name": "prod-test-crawler",
    "Role": "arn:aws:iam::111122223333:role/service-role/AWSGlueServiceRole-prod-test-run-role",
    "DatabaseName": "prod-run-db",
    "Description": "",
    "Targets": {
    "S3Targets":[
                {
                 "Path": "s3://amzn-s3-demo-bucket"
                }
                ]
                },
   "SchemaChangePolicy": {
      "UpdateBehavior": "LOG",
      "DeleteBehavior": "LOG"
  },
  "RecrawlPolicy": {
    "RecrawlBehavior": "CRAWL_EVERYTHING"
  },
  "LineageConfiguration": {
    "CrawlerLineageSettings": "DISABLE"
  },
  "LakeFormationConfiguration": {
    "UseLakeFormationCredentials": true,
    "AccountId": "111122223333"
  },
  "Configuration": {
           "Version": 1.0,
           "CrawlerOutput": {
             "Partitions": { "AddOrUpdateBehavior": "InheritFromTable" },
             "Tables": {"AddOrUpdateBehavior": "MergeNewColumns" }
           },
           "Grouping": { "TableGroupingPolicy": "CombineCompatibleSchemas" }
         },
  "CrawlerSecurityConfiguration": "",
  "Tags": {
    "KeyName": ""
  }
}'
```

------

# Setup required when the crawler and registered Amazon S3 location reside in different accounts (cross-account crawling)
<a name="cross-account-crawling"></a>

To allow the crawler to access a data store in a different account using Lake Formation credentials, you must first register the Amazon S3 data location with Lake Formation. Then, you grant data location permissions to the crawler's account by taking the following steps.

You can complete the following steps using the AWS Management Console or AWS CLI.

------
#### [ AWS Management Console ]

1. In the account where the Amazon S3 location is registered (account B):

   1. Register an Amazon S3 path with Lake Formation. For more information, see [Registering Amazon S3 location](https://docs.aws.amazon.com/lake-formation/latest/dg/register-location.html).

   1.  Grant **Data location** permissions to the account (account A) where the crawler will be run. For more information, see [Grant data location permissions](https://docs.aws.amazon.com/lake-formation/latest/dg/granting-location-permissions-local.html). 

   1. Create an empty database in Lake Formation with the underlying location as the target Amazon S3 location. For more information, see [Creating a database](https://docs.aws.amazon.com/lake-formation/latest/dg/creating-database.html).

   1. Grant account A (the account where the crawler will be run) access to the database that you created in the previous step. For more information, see [Granting database permissions](https://docs.aws.amazon.com/lake-formation/latest/dg/granting-database-permissions.html). 

1. In the account where the crawler is created and will be run (account A):

   1.  Using the AWS RAM console, accept the database that was shared from the external account (account B). For more information, see [Accepting a resource share invitation from AWS Resource Access Manager](https://docs.aws.amazon.com/lake-formation/latest/dg/accepting-ram-invite.html). 

   1.  Create an IAM role for the crawler. Add `lakeformation:GetDataAccess` policy to the role.

   1.  In the Lake Formation console ([https://console.aws.amazon.com/lakeformation/](https://console.aws.amazon.com/lakeformation/)), grant **Data location** permissions on the target Amazon S3 location to the IAM role used for the crawler run so that the crawler can read the data from the destination in Lake Formation. For more information, see [Granting data location permissions](https://docs.aws.amazon.com/lake-formation/latest/dg/granting-location-permissions-local.html). 

   1.  Create a resource link on the shared database. For more information, see [Create a resource link](https://docs.aws.amazon.com/lake-formation/latest/dg/create-resource-link-database.html). 

   1.  Grant the crawler role access permissions (`Create`) on the shared database and (`Describe`) the resource link. The resource link is specified in the output for the crawler. 

   1.  In the AWS Glue console ([https://console.aws.amazon.com/glue/](https://console.aws.amazon.com/glue/)), while configuring the crawler, select the option **Use Lake Formation credentials for crawling Amazon S3 data source**.

      For cross-account crawling, specify the AWS account ID where the target Amazon S3 location is registered with Lake Formation. For in-account crawling, the accountId field is optional.   
![\[IAM role selection and Lake Formation configuration options for AWS Glue crawler security settings.\]](http://docs.aws.amazon.com/glue/latest/dg/images/cross-account-crawler.png)

------
#### [ AWS CLI ]

```
aws glue --profile demo create-crawler --debug --cli-input-json '{
    "Name": "prod-test-crawler",
    "Role": "arn:aws:iam::111122223333:role/service-role/AWSGlueServiceRole-prod-test-run-role",
    "DatabaseName": "prod-run-db",
    "Description": "",
    "Targets": {
    "S3Targets":[
                {
                 "Path": "s3://amzn-s3-demo-bucket"
                }
                ]
                },
   "SchemaChangePolicy": {
      "UpdateBehavior": "LOG",
      "DeleteBehavior": "LOG"
  },
  "RecrawlPolicy": {
    "RecrawlBehavior": "CRAWL_EVERYTHING"
  },
  "LineageConfiguration": {
    "CrawlerLineageSettings": "DISABLE"
  },
  "LakeFormationConfiguration": {
    "UseLakeFormationCredentials": true,
    "AccountId": "111111111111"
  },
  "Configuration": {
           "Version": 1.0,
           "CrawlerOutput": {
             "Partitions": { "AddOrUpdateBehavior": "InheritFromTable" },
             "Tables": {"AddOrUpdateBehavior": "MergeNewColumns" }
           },
           "Grouping": { "TableGroupingPolicy": "CombineCompatibleSchemas" }
         },
  "CrawlerSecurityConfiguration": "",
  "Tags": {
    "KeyName": ""
  }
}'
```

------

**Note**  
A crawler using Lake Formation credentials is only supported for Amazon S3 and Data Catalog targets.
For targets using Lake Formation credential vending, the underlying Amazon S3 locations must belong to the same bucket. For example, customers can use multiple targets (s3://amzn-s3-demo-bucket1/folder1, s3://amzn-s3-demo-bucket1/folder2) as long as all target locations are under the same bucket (amzn-s3-demo-bucket1). Specifying different buckets (s3://amzn-s3-demo-bucket1/folder1, s3://amzn-s3-demo-bucket2/folder2) is not allowed.
Currently for Data Catalog target crawlers, only a single catalog target with a single catalog table is allowed.
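
The same-bucket restriction above can be sketched as a simple validation (illustrative only, not part of AWS Glue):

```python
from urllib.parse import urlparse

# Hypothetical check: all Amazon S3 target paths must share one bucket
# for Lake Formation credential vending.
def same_bucket(paths: list[str]) -> bool:
    buckets = {urlparse(p).netloc for p in paths}
    return len(buckets) == 1

print(same_bucket(["s3://amzn-s3-demo-bucket1/folder1",
                   "s3://amzn-s3-demo-bucket1/folder2"]))  # True
print(same_bucket(["s3://amzn-s3-demo-bucket1/folder1",
                   "s3://amzn-s3-demo-bucket2/folder2"]))  # False
```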

# Accelerating crawls using Amazon S3 event notifications
<a name="crawler-s3-event-notifications"></a>

Instead of listing all the objects from an Amazon S3 or Data Catalog target, you can configure the crawler to use Amazon S3 events to find changes. This feature improves recrawl time by using Amazon S3 events to identify the changes between two crawls: the crawler lists only the files in the subfolders that triggered events, instead of listing the full Amazon S3 or Data Catalog target.

The first crawl lists all Amazon S3 objects from the target. After the first successful crawl, you can choose to recrawl manually or on a set schedule. On subsequent crawls, the crawler lists only the objects reported by those events instead of listing all objects.

When the target is a Data Catalog table, the crawler updates the existing tables in the Data Catalog with changes (for example, extra partitions in a table).

The advantages of moving to an Amazon S3 event based crawler are:
+ Faster recrawls, because listing all the objects from the target is not required; only the specific folders where objects were added or deleted are listed.
+ A reduction in the overall crawl cost, for the same reason.

The Amazon S3 event crawl runs by consuming Amazon S3 events from the SQS queue based on the crawler schedule. There is no cost if there are no events in the queue. Amazon S3 events can be configured to go directly to the SQS queue, or, in cases where multiple consumers need the same event, to a combination of SNS and SQS. For more information, see [Setting up your account for Amazon S3 event notifications](#crawler-s3-event-notifications-setup).

After you create and configure the crawler in event mode, the first crawl runs in listing mode by performing a full listing of the Amazon S3 or Data Catalog target. After the first successful crawl, the following log confirms that the crawl is consuming Amazon S3 events: "The crawl is running by consuming Amazon S3 events."

If, after creating the Amazon S3 event crawler, you update crawler properties that may impact the crawl, the next crawl operates in listing mode and the following log is added: "Crawl is not running in S3 event mode".

**Note**  
The maximum number of messages to consume is 100,000 messages per crawl.
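
To illustrate the idea (this is not AWS Glue's internal implementation), the following sketch extracts the folder prefixes to re-list from a sample Amazon S3 event notification message; the bucket name and object key are assumptions:

```python
import json

# A simplified S3 event notification message, as delivered via SQS.
sample_message = json.dumps({
    "Records": [
        {"eventName": "ObjectCreated:Put",
         "s3": {"bucket": {"name": "amzn-s3-demo-bucket"},
                "object": {"key": "table1/year=2018/month=March/data3.json"}}}
    ]
})

def changed_prefixes(message: str) -> set[str]:
    """Return the folder prefixes where objects were added or deleted."""
    prefixes = set()
    for record in json.loads(message).get("Records", []):
        key = record["s3"]["object"]["key"]
        prefixes.add(key.rsplit("/", 1)[0] + "/")  # keep the folder, drop the file
    return prefixes

print(changed_prefixes(sample_message))  # {'table1/year=2018/month=March/'}
```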

## Considerations and limitations
<a name="s3event-crawler-limitations"></a>

The following considerations and limitations apply when you configure a crawler to use Amazon S3 event notifications to find any changes. 
+  **Important behavior with deleted partitions** 

  When using Amazon S3 event crawlers with Data Catalog tables:
  +  If you delete a partition using the `DeletePartition` API call, you must also delete all S3 objects under that partition, and select **All object removal events** when you configure your S3 event notifications. If deletion events are not configured, the crawler recreates the deleted partition during its next run. 
+ Only a single target is supported by the crawler, whether for Amazon S3 or Data Catalog targets.
+ SQS on private VPC is not supported.
+ Amazon S3 sampling is not supported.
+ The crawler target should be a folder for an Amazon S3 target, or one or more AWS Glue Data Catalog tables for a Data Catalog target.
+ The 'everything' path wildcard is not supported: `s3://%`
+ For a Data Catalog target, all catalog tables should point to the same Amazon S3 bucket for Amazon S3 event mode.
+ For a Data Catalog target, a catalog table should not point to an Amazon S3 location in the Delta Lake format (containing `_symlink` folders, or checking the catalog table's `InputFormat`).

**Topics**
+ [Considerations and limitations](#s3event-crawler-limitations)
+ [Setting up your account for Amazon S3 event notifications](#crawler-s3-event-notifications-setup)
+ [Setting up a crawler for Amazon S3 event notifications for an Amazon S3 target](crawler-s3-event-notifications-setup-console-s3-target.md)
+ [Setting up a crawler for Amazon S3 event notifications for a Data Catalog table](crawler-s3-event-notifications-setup-console-catalog-target.md)

## Setting up your account for Amazon S3 event notifications
<a name="crawler-s3-event-notifications-setup"></a>

Complete the following setup tasks. Note that the values in parentheses reference the configurable settings from the script.

1. You need to set up event notifications for your Amazon S3 bucket.

   For more information, see [Amazon S3 event notifications](https://docs.aws.amazon.com/AmazonS3/latest/userguide/EventNotifications.html).

1. To use the Amazon S3 event based crawler, enable event notification on the Amazon S3 bucket with events filtered by a prefix that is the same as the S3 target, and store the events in SQS. You can set up SQS and event notification through the console by following the steps in [Walkthrough: Configuring a bucket for notifications](https://docs.aws.amazon.com/AmazonS3/latest/userguide/ways-to-add-notification-config-to-bucket.html).

1. Add the following SQS policy to the role used by the crawler. 

------
#### [ JSON ]

   ```
   {
     "Version":"2012-10-17",		 	 	 
     "Statement": [
       {
         "Sid": "VisualEditor0",
         "Effect": "Allow",
         "Action": [
           "sqs:DeleteMessage",
           "sqs:GetQueueUrl",
           "sqs:ListDeadLetterSourceQueues",
           "sqs:ReceiveMessage",
           "sqs:GetQueueAttributes",
           "sqs:ListQueueTags",
           "sqs:SetQueueAttributes",
           "sqs:PurgeQueue"
         ],
         "Resource": "arn:aws:sqs:us-east-1:111122223333:cfn-sqs-queue"
       }
     ]
   }
   ```

------

# Setting up a crawler for Amazon S3 event notifications for an Amazon S3 target
<a name="crawler-s3-event-notifications-setup-console-s3-target"></a>

Follow these steps to set up a crawler for Amazon S3 event notifications for an Amazon S3 target using the AWS Management Console or AWS CLI.

------
#### [ AWS Management Console ]

1. Sign in to the AWS Management Console and open the AWS Glue console at [https://console.aws.amazon.com/glue/](https://console.aws.amazon.com/glue/).

1.  Set your crawler properties. For more information, see [ Setting Crawler Configuration Options on the AWS Glue console ](https://docs.aws.amazon.com/glue/latest/dg/crawler-configuration.html#crawler-configure-changes-console). 

1.  In the section **Data source configuration**, you are asked *Is your data already mapped to AWS Glue tables?*

    By default, **Not yet** is selected. Leave this as the default, because you are using an Amazon S3 data source and the data is not already mapped to AWS Glue tables. 

1.  In the section **Data sources**, choose **Add a data source**.   
![\[Data source configuration interface with options to select or add data sources for crawling.\]](http://docs.aws.amazon.com/glue/latest/dg/images/crawler-s3-event-console1.png)

1.  In the **Add data source** modal, configure the Amazon S3 data source: 
   +  **Data source**: By default, Amazon S3 is selected. 
   +  **Network connection** (Optional): Choose **Add new connection**. 
   +  **Location of Amazon S3 data**: By default, **In this account** is selected. 
   +  **Amazon S3 path**: Specify the Amazon S3 path where folders and files are crawled. 
   +  **Subsequent crawler runs**: Choose **Crawl based on events** to use Amazon S3 event notifications for your crawler. 
   +  **Include SQS ARN**: Specify the data store parameters, including a valid SQS ARN (for example, `arn:aws:sqs:region:account:sqs`). 
   +  **Include dead-letter SQS ARN** (Optional): Specify a valid SQS dead-letter queue ARN (for example, `arn:aws:sqs:region:account:deadLetterQueue`). 
   +  Choose **Add an Amazon S3 data source**.   
![\[Add data source dialog for S3, showing options for network connection and crawl settings.\]](http://docs.aws.amazon.com/glue/latest/dg/images/crawler-s3-event-console2.png)
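Both ARN fields above expect the standard Amazon SQS ARN shape, `arn:aws:sqs:region:account:name`. As a rough sanity check before pasting an ARN into the console, here is a minimal Python sketch; the regex is an approximation and does not cover every valid queue name (for example, `.fifo` suffixes):

```python
import re

# Approximates the SQS queue ARN shape used in the examples above:
# arn:aws:sqs:<region>:<12-digit account id>:<queue name>
SQS_ARN_RE = re.compile(r"^arn:aws:sqs:[a-z0-9-]+:\d{12}:[A-Za-z0-9_-]+$")

def looks_like_sqs_arn(arn: str) -> bool:
    """Rough format check only; the console performs its own validation."""
    return SQS_ARN_RE.fullmatch(arn) is not None
```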

------
#### [ AWS CLI ]

 The following is an example AWS CLI call that configures a crawler to use Amazon S3 event notifications to crawl an Amazon S3 target bucket. 

```
aws glue update-crawler \
    --name myCrawler \
    --recrawl-policy RecrawlBehavior=CRAWL_EVENT_MODE \
    --schema-change-policy UpdateBehavior=UPDATE_IN_DATABASE,DeleteBehavior=LOG \
    --targets '{"S3Targets":[{"Path":"s3://amzn-s3-demo-bucket/", "EventQueueArn": "arn:aws:sqs:us-east-1:012345678910:MyQueue"}]}'
```
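To confirm that event mode took effect, you can read the crawler's recrawl policy back. A sketch, assuming the same crawler name (`myCrawler`) as in the example above:

```
# Read back the recrawl behavior; this should print CRAWL_EVENT_MODE
# once the update has been applied
aws glue get-crawler \
    --name myCrawler \
    --query 'Crawler.RecrawlPolicy.RecrawlBehavior' \
    --output text
```

This command requires a live AWS account with the crawler already created.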

------

# Setting up a crawler for Amazon S3 event notifications for a Data Catalog table
<a name="crawler-s3-event-notifications-setup-console-catalog-target"></a>

When you have a Data Catalog table, set up a crawler for Amazon S3 event notifications using the AWS Glue console:

1.  Set your crawler properties. For more information, see [ Setting Crawler Configuration Options on the AWS Glue console ](https://docs.aws.amazon.com/glue/latest/dg/crawler-configuration.html#crawler-configure-changes-console). 

1.  In the section **Data source configuration**, you are asked *Is your data already mapped to AWS Glue tables?* 

    Select **Yes** to select existing tables from your Data Catalog as your data source. 

1.  In the section **Glue tables**, choose **Add tables**.   
![\[Data source configuration interface with options to select existing Glue tables or add new ones.\]](http://docs.aws.amazon.com/glue/latest/dg/images/crawler-s3-event-console1-cat.png)

1.  In the **Add table** modal, configure the database and tables: 
   +  **Network connection** (Optional): Choose **Add new connection**. 
   +  **Database**: Select a database in the Data Catalog. 
   +  **Tables**: Select one or more tables from that database in the Data Catalog. 
   +  **Subsequent crawler runs**: Choose **Crawl based on events** to use Amazon S3 event notifications for your crawler. 
   +  **Include SQS ARN**: Specify the data store parameters, including a valid SQS ARN (for example, `arn:aws:sqs:region:account:sqs`). 
   +  **Include dead-letter SQS ARN** (Optional): Specify a valid SQS dead-letter queue ARN (for example, `arn:aws:sqs:region:account:deadLetterQueue`). 
   +  Choose **Confirm**.   
![\[Add Glue tables dialog with network, database, tables, and crawler options.\]](http://docs.aws.amazon.com/glue/latest/dg/images/crawler-s3-event-console2-cat.png)
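The console steps above can also be expressed with the AWS CLI by pointing the crawler at catalog targets rather than S3 targets. A sketch, assuming hypothetical crawler, database, table, and queue names; note that catalog-target crawlers require `DeleteBehavior=LOG`:

```
# Configure a catalog-target crawler to crawl based on S3 event notifications
aws glue update-crawler \
    --name myCatalogCrawler \
    --recrawl-policy RecrawlBehavior=CRAWL_EVENT_MODE \
    --schema-change-policy UpdateBehavior=UPDATE_IN_DATABASE,DeleteBehavior=LOG \
    --targets '{"CatalogTargets":[{"DatabaseName":"my-database","Tables":["my-table"],"EventQueueArn":"arn:aws:sqs:us-east-1:111122223333:MyQueue"}]}'
```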

# Tutorial: Adding an AWS Glue crawler
<a name="tutorial-add-crawler"></a>

For this AWS Glue scenario, you're asked to analyze arrival data for major air carriers to calculate the popularity of departure airports month over month. You have flights data for the year 2016 in CSV format stored in Amazon S3. Before you transform and analyze your data, you catalog its metadata in the AWS Glue Data Catalog.

In this tutorial, let’s add a crawler that infers metadata from these flight logs in Amazon S3 and creates a table in your Data Catalog.

**Topics**
+ [Prerequisites](#tutorial-add-crawler-prerequisites)
+ [Step 1: Add a crawler](#tutorial-add-crawler-step1)
+ [Step 2: Run the crawler](#tutorial-add-crawler-step2)
+ [Step 3: View AWS Glue Data Catalog objects](#tutorial-add-crawler-step3)

## Prerequisites
<a name="tutorial-add-crawler-prerequisites"></a>

This tutorial assumes that you have an AWS account and access to AWS Glue.

## Step 1: Add a crawler
<a name="tutorial-add-crawler-step1"></a>

Use these steps to configure and run a crawler that extracts the metadata from a CSV file stored in Amazon S3.

**To create a crawler that reads files stored on Amazon S3**

1. On the AWS Glue service console, on the left-side menu, choose **Crawlers**.

1. On the Crawlers page, choose **Create crawler**. This starts a series of pages that prompt you for the crawler details.  
![\[The screenshot shows the crawler page. From here you can create a crawler or edit, duplicate, delete, view an existing crawler.\]](http://docs.aws.amazon.com/glue/latest/dg/images/crawlers-create_crawler.png)

1. In the Crawler name field, enter **Flights Data Crawler**, and choose **Next**.

   Crawlers invoke classifiers to infer the schema of your data. This tutorial uses the built-in classifier for CSV by default. 

1. For the crawler source type, choose **Data stores** and choose **Next**.

1. Now let's point the crawler to your data. On the **Add a data store** page, choose the Amazon S3 data store. This tutorial doesn't use a connection, so leave the **Connection** field blank if it's visible. 

   For the option **Crawl data in**, choose **Specified path in another account**. Then, for **Include path**, enter the path where the crawler can find the flights data, which is **s3://crawler-public-us-east-1/flight/2016/csv**. Choose **Next**.

1. You can crawl multiple data stores with a single crawler. However, in this tutorial, we're using only a single data store, so choose **No**, and then choose **Next**.

1. The crawler needs permissions to access the data store and create objects in the AWS Glue Data Catalog. To configure these permissions, choose **Create an IAM role**. The IAM role name starts with `AWSGlueServiceRole-`, and in the field, you enter the last part of the role name. Enter **CrawlerTutorial**, and then choose **Next**. 
**Note**  
To create an IAM role, your AWS user must have `CreateRole`, `CreatePolicy`, and `AttachRolePolicy` permissions.

   The wizard creates an IAM role named `AWSGlueServiceRole-CrawlerTutorial`, attaches the AWS managed policy `AWSGlueServiceRole` to this role, and adds an inline policy that allows read access to the Amazon S3 location `s3://crawler-public-us-east-1/flight/2016/csv`.

1. Create a schedule for the crawler. For **Frequency**, choose **Run on demand**, and then choose **Next**. 

1. Crawlers create tables in your Data Catalog. Tables are contained in a database in the Data Catalog. First, choose **Add database** to create a database. In the pop-up window, enter **test-flights-db** for the database name, and then choose **Create**.

   Next, enter **flights** for **Prefix added to tables**. Use the default values for the rest of the options, and choose **Next**.

1. Verify the choices you made in the **Add crawler** wizard. If you see any mistakes, you can choose **Back** to return to previous pages and make changes.

   After you have reviewed the information, choose **Finish** to create the crawler.
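The wizard choices above roughly correspond to a single `create-crawler` call in the AWS CLI. A sketch, assuming the IAM role `AWSGlueServiceRole-CrawlerTutorial` already exists (in the console, the wizard creates it for you):

```
# Create the tutorial crawler: S3 source, target database, and table prefix
aws glue create-crawler \
    --name "Flights Data Crawler" \
    --role AWSGlueServiceRole-CrawlerTutorial \
    --database-name test-flights-db \
    --table-prefix flights \
    --targets '{"S3Targets":[{"Path":"s3://crawler-public-us-east-1/flight/2016/csv"}]}'
```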

## Step 2: Run the crawler
<a name="tutorial-add-crawler-step2"></a>

After creating a crawler, the wizard sends you to the Crawlers view page. Because you created the crawler with an on-demand schedule, you're given the option to run the crawler.

**To run the crawler**

1. The banner near the top of this page lets you know that the crawler was created, and asks if you want to run it now. Choose **Run it now?** to run the crawler.

   The banner changes to show "Attempting to run" and "Running" messages for your crawler. After the crawler starts running, the banner disappears, and the crawler display is updated to show a status of Starting for your crawler. After a minute, you can choose the refresh icon to update the status of the crawler that is displayed in the table.

1. When the crawler completes, a new banner appears that describes the changes made by the crawler. You can choose the **test-flights-db** link to view the Data Catalog objects.
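You can also start the crawler and watch its state from the AWS CLI; a sketch using the crawler name from Step 1:

```
# Start the on-demand crawler
aws glue start-crawler --name "Flights Data Crawler"

# The state moves through RUNNING (and STOPPING) back to READY
# when the run completes; repeat this call to poll
aws glue get-crawler \
    --name "Flights Data Crawler" \
    --query 'Crawler.State' \
    --output text
```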

## Step 3: View AWS Glue Data Catalog objects
<a name="tutorial-add-crawler-step3"></a>

The crawler reads data at the source location and creates tables in the Data Catalog. A table is the metadata definition that represents your data, including its schema. The tables in the Data Catalog do not contain data. Instead, you use these tables as a source or target in a job definition.

**To view the Data Catalog objects created by the crawler**

1. In the left-side navigation, under **Data catalog**, choose **Databases**. Here you can view the `test-flights-db` database that is created by the crawler.

1. In the left-side navigation, under **Data catalog** and below **Databases**, choose **Tables**. Here you can view the `flightscsv` table created by the crawler. If you choose the table name, then you can view the table settings, parameters, and properties. Scrolling down in this view, you can view the schema, which is information about the columns and data types of the table.

1. If you choose **View partitions** on the table view page, you can see the partitions created for the data. The first column is the partition key.
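The same Data Catalog objects can be inspected from the AWS CLI; a sketch using the database and table names created above:

```
# List the tables the crawler created in the tutorial database
aws glue get-tables \
    --database-name test-flights-db \
    --query 'TableList[].Name' \
    --output text

# Show the partition values of the first partition of the table
aws glue get-partitions \
    --database-name test-flights-db \
    --table-name flightscsv \
    --query 'Partitions[0].Values'
```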