

# Importing data using workflows in Lake Formation
<a name="workflows"></a>

With AWS Lake Formation, you can import your data using *workflows*. A workflow defines the data source and schedule to import data into your data lake. It is a container for the AWS Glue crawlers, jobs, and triggers that are used to orchestrate the processes that load and update the data lake. 

**Topics**
+ [Blueprints and workflows in Lake Formation](workflows-about.md)
+ [Creating a workflow](workflows-creating.md)
+ [Running a workflow](workflows-running.md)

# Blueprints and workflows in Lake Formation
<a name="workflows-about"></a>

A workflow encapsulates a complex multi-job extract, transform, and load (ETL) activity. Workflows generate AWS Glue crawlers, jobs, and triggers to orchestrate the loading and update of data. Lake Formation executes and tracks a workflow as a single entity. You can configure a workflow to run on demand or on a schedule.

**Note**  
The Spark Parquet writer doesn't support special characters in column names. This is a limitation of the writer itself, not a configuration issue.

Workflows that you create in Lake Formation are visible in the AWS Glue console as a directed acyclic graph (DAG). Each DAG node is a job, crawler, or trigger. To monitor progress and troubleshoot, you can track the status of each node in the workflow.
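
If you prefer to inspect the generated DAG programmatically rather than in the console, the AWS Glue `GetWorkflow` operation returns the graph. The following is a minimal boto3 sketch; the workflow name is a placeholder.

```python
import boto3

glue = boto3.client("glue")

# "lf-example-workflow" is a placeholder; use the name shown on the Blueprints page.
workflow = glue.get_workflow(Name="lf-example-workflow", IncludeGraph=True)["Workflow"]

# Each node in the graph is a crawler, job, or trigger.
for node in workflow["Graph"]["Nodes"]:
    print(node["Type"], node["Name"])
```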

When a Lake Formation workflow has completed, the user who ran the workflow is granted the Lake Formation `SELECT` permission on the Data Catalog tables that the workflow creates. 
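
To verify that grant, you can list the permissions on one of the tables that the workflow created. The following is a minimal boto3 sketch, with placeholder principal, database, and table names.

```python
import boto3

lakeformation = boto3.client("lakeformation")

# The principal ARN, database name, and table name below are placeholders.
response = lakeformation.list_permissions(
    Principal={"DataLakePrincipalIdentifier": "arn:aws:iam::111122223333:user/workflow-user"},
    Resource={"Table": {"DatabaseName": "example_db", "Name": "example_table"}},
)

for entry in response["PrincipalResourcePermissions"]:
    print(entry["Permissions"])  # Expect SELECT after the workflow completes
```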

You can also create workflows in AWS Glue. However, because Lake Formation enables you to create a workflow from a blueprint, creating workflows is much simpler and more automated in Lake Formation. Lake Formation provides the following types of blueprints:
+ **Database snapshot** – Loads or reloads data from all tables into the data lake from a JDBC source. You can exclude some data from the source based on an exclude pattern.
+ **Incremental database** – Loads only new data into the data lake from a JDBC source, based on previously set bookmarks. You specify the individual tables in the JDBC source database to include. For each table, you choose the bookmark columns and bookmark sort order to keep track of data that has previously been loaded. The first time that you run an incremental database blueprint against a set of tables, the workflow loads all data from the tables and sets bookmarks for the next incremental database blueprint run. You can therefore use an incremental database blueprint instead of the database snapshot blueprint to load all data, provided that you specify each table in the data source as a parameter.
+ **Log file** – Bulk loads data from log file sources, including AWS CloudTrail, Elastic Load Balancing logs, and Application Load Balancer logs.

Use the following table to help decide whether to use a database snapshot or incremental database blueprint.


| Use database snapshot when... | Use incremental database when... | 
| --- | --- | 
|  You want to load or reload all the data from the source tables each time the workflow runs.  |  You want to load only data that is new since the previous run, based on the bookmark columns that you specify for each table.  | 

**Note**  
Users cannot edit the blueprints and workflows that Lake Formation creates. 

# Creating a workflow
<a name="workflows-creating"></a>

Before you start, ensure that you have granted the required data permissions and data location permissions to the role `LakeFormationWorkflowRole`. This is so the workflow can create metadata tables in the Data Catalog and write data to target locations in Amazon S3. For more information, see [(Optional) Create an IAM role for workflows](initial-lf-config.md#iam-create-blueprint-role) and [Overview of Lake Formation permissions](lf-permissions-overview.md).
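
As an illustration only, the following boto3 sketch grants a workflow role data location access on a registered Amazon S3 path and table-management permissions on the target database; granting the same permissions through the Lake Formation console accomplishes the same thing. The role ARN, bucket path, and database name are placeholders.

```python
import boto3

lakeformation = boto3.client("lakeformation")

# Placeholder values; substitute your account ID, bucket path, and database name.
role_arn = "arn:aws:iam::111122223333:role/LakeFormationWorkflowRole"

# Allow the workflow to write to the registered data lake location.
lakeformation.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": role_arn},
    Resource={"DataLocation": {"ResourceArn": "arn:aws:s3:::amzn-s3-demo-bucket/datalake/"}},
    Permissions=["DATA_LOCATION_ACCESS"],
)

# Allow the workflow to create and maintain metadata tables in the target database.
lakeformation.grant_permissions(
    Principal={"DataLakePrincipalIdentifier": role_arn},
    Resource={"Database": {"Name": "example_db"}},
    Permissions=["CREATE_TABLE", "ALTER", "DROP"],
)
```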

**Note**  
Lake Formation uses `GetTemplateInstance`, `GetTemplateInstances`, and `InstantiateTemplate` operations to create workflows from blueprints. These operations are not publicly available, and are used only internally for creating resources on your behalf. You receive CloudTrail events for creating workflows.
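
If you want to audit that activity, you can look up the corresponding CloudTrail events. The following sketch assumes that the event name recorded by CloudTrail matches the operation name listed above; adjust the lookup attribute if your trail records it differently.

```python
import boto3

cloudtrail = boto3.client("cloudtrail")

# Assumes the CloudTrail event name matches the internal operation name.
events = cloudtrail.lookup_events(
    LookupAttributes=[{"AttributeKey": "EventName", "AttributeValue": "InstantiateTemplate"}],
    MaxResults=10,
)

for event in events["Events"]:
    print(event["EventTime"], event["EventName"])
```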

**To create a workflow from a blueprint**

1. Open the AWS Lake Formation console at [https://console.aws.amazon.com/lakeformation/](https://console.aws.amazon.com/lakeformation/). Sign in as the data lake administrator or as a user who has data engineer permissions. For more information, see [Lake Formation personas and IAM permissions reference](permissions-reference.md).

1. In the navigation pane, choose **Blueprints**, and then choose **Use blueprint**.

1. On the **Use a blueprint** page, choose a tile to select the blueprint type.

1. Under **Import source**, specify the data source. 

   If you are importing from a JDBC source, specify the following:
   + **Database connection** – Choose a connection from the list. Create additional connections using the AWS Glue console. The JDBC user name and password in the connection determine the database objects that the workflow has access to.
   + **Source data path** – Enter *<database>*/*<schema>*/*<table>* or *<database>*/*<table>*, depending on the database product. Oracle Database and MySQL don't support schema in the path. You can substitute the percent (%) character for *<schema>* or *<table>*. For example, for an Oracle database with a system identifier (SID) of `orcl`, enter `orcl/%` to import all tables that the user named in the connection has access to.
**Important**  
This field is case sensitive. The workflow will fail if there is a case mismatch for any of the components.

     If you specify a MySQL database, AWS Glue ETL uses the MySQL 5 JDBC driver by default, so MySQL 8 is not natively supported. You can edit the ETL job script to use a `customJdbcDriverS3Path` parameter, as described in [JDBC connectionType Values](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-connect.html#aws-glue-programming-etl-connect-jdbc) in the *AWS Glue Developer Guide*, to load a different JDBC driver that supports MySQL 8. For an illustrative sketch of such an override, see the example following this procedure.

   If you are importing from a log file, ensure that the role that you specify for the workflow (the "workflow role") has the required IAM permissions to access the data source. For example, to import AWS CloudTrail logs, the user must have the `cloudtrail:DescribeTrails` and `cloudtrail:LookupEvents` permissions to see the list of CloudTrail logs while creating the workflow, and the workflow role must have permissions on the CloudTrail location in Amazon S3.

1. Do one of the following:
   + For the **Database snapshot** blueprint type, optionally identify a subset of data to import by specifying one or more exclude patterns. These exclude patterns are Unix-style `glob` patterns. They are stored as a property of the tables that are created by the workflow.

     For details on the available exclude patterns, see [Include and Exclude Patterns](https://docs.aws.amazon.com/glue/latest/dg/define-crawler.html#crawler-data-stores-exclude) in the *AWS Glue Developer Guide*.
   + For the **Incremental database** blueprint type, specify the following fields. Add a row for each table to import.  
**Table name**  
Table to import. Must be all lower case.  
**Bookmark keys**  
Comma-delimited list of column names that define the bookmark keys. If blank, the primary key is used to determine new data. Case for each column must match the case as defined in the data source.  
The primary key qualifies as the default bookmark key only if it is sequentially increasing or decreasing (with no gaps). If you want to use the primary key as the bookmark key and it has gaps, you must name the primary key column as a bookmark key.  
**Bookmark order**  
When you choose **Ascending**, rows with values greater than bookmarked values are identified as new rows. When you choose **Descending**, rows with values less than bookmarked values are identified as new rows.  
**Partitioning scheme**  
(Optional) List of partitioning key columns, delimited by slashes (/). Example: `year/month/day`.  
![\[The Incremental data section of the console includes these fields: Table name, Bookmark keys, Bookmark order, Partitioning scheme. You can add or remove rows, where each row is for a different table.\]](http://docs.aws.amazon.com/lake-formation/latest/dg/images/incremental-data.png)

     For more information, see [Tracking Processed Data Using Job Bookmarks](https://docs.aws.amazon.com/glue/latest/dg/monitor-continuations.html) in the *AWS Glue Developer Guide*.

1. Under **Import target**, specify the target database, target Amazon S3 location, and data format.

   Ensure that the workflow role has the required Lake Formation permissions on the database and Amazon S3 target location.
**Note**  
Currently, blueprints do not support encrypting data at the target.

1. Choose an import frequency.

   You can specify a `cron` expression with the **Custom** option. For example, `cron(0 2 * * ? *)` runs the workflow every day at 2:00 AM (UTC).

1. Under **Import options**:

   1. Enter a workflow name.

   1. For role, choose the role `LakeFormationWorkflowRole`, which you created in [(Optional) Create an IAM role for workflows](initial-lf-config.md#iam-create-blueprint-role). 

   1. Optionally specify a table prefix. The prefix is prepended to the names of Data Catalog tables that the workflow creates.

1. Choose **Create**, and wait for the console to report that the workflow was successfully created.
**Tip**  
Did you get the following error message?  
`User: arn:aws:iam::<account-id>:user/<username> is not authorized to perform: iam:PassRole on resource:arn:aws:iam::<account-id>:role/<rolename>...`  
If so, check that you replaced *<account-id>* with a valid AWS account number in all policies.
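
The following is a rough sketch of how an edited ETL job script might load a MySQL 8-compatible JDBC driver, as mentioned in the **Source data path** step. The connection details, driver JAR location, and table name are placeholders, and the script that a blueprint generates will differ; see the linked *AWS Glue Developer Guide* topic for the authoritative connection options.

```python
# Sketch of reading from MySQL 8 in an AWS Glue ETL script by overriding the
# default MySQL 5 JDBC driver. All values below are placeholders.
from pyspark.context import SparkContext
from awsglue.context import GlueContext

glue_context = GlueContext(SparkContext.getOrCreate())

connection_options = {
    "url": "jdbc:mysql://example-host:3306/example_db",
    "user": "example_user",
    "password": "example_password",
    "dbtable": "example_table",
    # Point AWS Glue at a MySQL 8-compatible driver stored in Amazon S3.
    "customJdbcDriverS3Path": "s3://amzn-s3-demo-bucket/drivers/mysql-connector-java-8.0.jar",
    "customJdbcDriverClassName": "com.mysql.cj.jdbc.Driver",
}

dynamic_frame = glue_context.create_dynamic_frame.from_options(
    connection_type="mysql",
    connection_options=connection_options,
)
```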

**See also:**  
[Blueprints and workflows in Lake Formation](workflows-about.md)

# Running a workflow
<a name="workflows-running"></a>

You can run a workflow using the Lake Formation console, the AWS Glue console, the AWS Command Line Interface (AWS CLI), or the AWS Glue API.
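
For example, the following boto3 sketch starts a workflow run and polls its status until it reaches a terminal state; the workflow name is a placeholder. The console procedure below accomplishes the same thing.

```python
import time
import boto3

glue = boto3.client("glue")
workflow_name = "lf-example-workflow"  # placeholder

run_id = glue.start_workflow_run(Name=workflow_name)["RunId"]

# Poll until the run reaches a terminal state.
while True:
    run = glue.get_workflow_run(Name=workflow_name, RunId=run_id)["Run"]
    print(run["Status"], run.get("Statistics", {}))
    if run["Status"] in ("COMPLETED", "STOPPED", "ERROR"):
        break
    time.sleep(30)
```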

**To run a workflow (Lake Formation console)**

1. Open the AWS Lake Formation console at [https://console.aws.amazon.com/lakeformation/](https://console.aws.amazon.com/lakeformation/). Sign in as the data lake administrator or as a user who has data engineer permissions. For more information, see [Lake Formation personas and IAM permissions reference](permissions-reference.md).

1. In the navigation pane, choose **Blueprints**.

1. On the **Blueprints** page, select the workflow. Then on the **Actions** menu, choose **Start**.

1. As the workflow runs, view its progress in the **Last run status** column. Choose the refresh button occasionally.

   The status goes from **RUNNING** to **Discovering** to **Importing** to **COMPLETED**.

   When the workflow is complete:
   + The Data Catalog has new metadata tables.
   + Your data is ingested into the data lake.

   If the workflow fails, do the following:

   1. Select the workflow. Choose **Actions**, and then choose **View graph**.

      The workflow opens in the AWS Glue console.

   1. Ensure that the workflow is selected, and choose the **History** tab.

   1. Under **History**, select the most recent run and choose **View run details**.

   1. Select a failed job or crawler in the dynamic (runtime) graph, and review the error message. Failed nodes are either red or yellow. For an API-based way to locate failed nodes, see the sketch following this procedure.
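
If you prefer to locate the failed node programmatically, the AWS Glue `GetWorkflowRun` operation can return the runtime graph. The following is a minimal boto3 sketch, with the workflow name and run ID as placeholders.

```python
import boto3

glue = boto3.client("glue")

# Workflow name and run ID are placeholders; the run ID appears on the History tab.
run = glue.get_workflow_run(
    Name="lf-example-workflow",
    RunId="wr_0123456789abcdef",
    IncludeGraph=True,
)["Run"]

# Print the error message for any failed job or crawler node.
for node in run["Graph"]["Nodes"]:
    if node["Type"] == "JOB":
        for job_run in node.get("JobDetails", {}).get("JobRuns", []):
            if job_run["JobRunState"] == "FAILED":
                print(node["Name"], job_run.get("ErrorMessage"))
    elif node["Type"] == "CRAWLER":
        for crawl in node.get("CrawlerDetails", {}).get("Crawls", []):
            if crawl["State"] == "FAILED":
                print(node["Name"], crawl.get("ErrorMessage"))
```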

**See also:**  
[Blueprints and workflows in Lake Formation](workflows-about.md)