

# Connecting to data sources using Visual ETL jobs

When you edit visual ETL jobs in AWS Glue, you can use connections to access your data. You do this by adding source nodes that use connectors to read in data, and target nodes to specify the location for writing out data.

**Topics**
+ [Modifying properties of a data source node](edit-jobs-source.md)
+ [Using Data Catalog tables for the data source](edit-jobs-source-catalog-tables.md)
+ [Using a connector for the data source](edit-jobs-source-connectors.md)
+ [Using files in Amazon S3 for the data source](edit-jobs-source-s3-files.md)
+ [Using a streaming data source](edit-jobs-source-streaming.md)
+ [References](edit-jobs-source-references.md)

# Modifying properties of a data source node


To specify the data source properties, you first choose a data source node in the job diagram. Then, on the right side in the node details panel, you configure the node properties.

**To modify the properties of a data source node**

1. Go to the visual editor for a new or saved job.

1. Choose a data source node in the job diagram.

1. Choose the **Node properties** tab in the node details panel, and then enter the following information:
   + **Name**: (Optional) Enter a name to associate with the node in the job diagram. This name should be unique among all the nodes for this job.
   + **Node type**: The node type determines the action that is performed by the node. In the list of options for **Node type**, choose one of the values listed under the heading **Data source**.

1. Configure the **Data source properties** information. For more information, see the following sections:
   + [Using Data Catalog tables for the data source](edit-jobs-source-catalog-tables.md)
   + [Using a connector for the data source](edit-jobs-source-connectors.md)
   + [Using files in Amazon S3 for the data source](edit-jobs-source-s3-files.md)
   + [Using a streaming data source](edit-jobs-source-streaming.md)

1. (Optional) After configuring the node properties and data source properties, you can view the schema for your data source by choosing the **Output schema** tab in the node details panel. The first time you choose this tab for any node in your job, you are prompted to provide an IAM role to access the data. If you have not specified an IAM role on the **Job details** tab, you are prompted to enter an IAM role here.

1. (Optional) After configuring the node properties and data source properties, you can preview the dataset from your data source by choosing the **Data preview** tab in the node details panel. The first time you choose this tab for any node in your job, you are prompted to provide an IAM role to access the data. There is a cost associated with using this feature, and billing starts as soon as you provide an IAM role.

# Using Data Catalog tables for the data source


For all data sources except Amazon S3 and connectors, a table must exist in the AWS Glue Data Catalog for the source type that you choose. AWS Glue does not create the Data Catalog table.

**To configure a data source node based on a Data Catalog table**

1. Go to the visual editor for a new or saved job.

1. Choose a data source node in the job diagram.

1. Choose the **Data source properties** tab, and then enter the following information:
   + **S3 source type**: (For Amazon S3 data sources only) Choose the option **Select a Catalog table** to use an existing AWS Glue Data Catalog table.
   + **Database**: Choose the database in the Data Catalog that contains the source table you want to use for this job. You can use the search field to search for a database by its name.
   + **Table**: Choose the table associated with the source data from the list. This table must already exist in the AWS Glue Data Catalog. You can use the search field to search for a table by its name.
   + **Partition predicate**: (For Amazon S3 data sources only) Enter a Boolean expression based on Spark SQL that includes only the partitioning columns. For example: `"(year=='2020' and month=='04')"`
   + **Temporary directory**: (For Amazon Redshift data sources only) Enter a path for the location of a working directory in Amazon S3 where your ETL job can write temporary intermediate results.
   + **Role associated with the cluster**: (For Amazon Redshift data sources only) Enter a role for your ETL job to use that contains permissions for Amazon Redshift clusters. For more information, see [Data source and data target permissions](getting-started-min-privs-job.md#getting-started-min-privs-data).
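
Conceptually, a partition predicate such as `"(year=='2020' and month=='04')"` prunes partitions before any data is read, so the job never touches files in non-matching partitions. A rough pure-Python analogy of that pruning (the partition list and `matches` helper are illustrations, not AWS Glue internals):

```python
# Hypothetical partition metadata, as it might appear in a catalog.
partitions = [
    {"year": "2020", "month": "03"},
    {"year": "2020", "month": "04"},
    {"year": "2019", "month": "04"},
]

def matches(p):
    # Python equivalent of the Spark SQL predicate "(year=='2020' and month=='04')"
    return p["year"] == "2020" and p["month"] == "04"

# Only partitions that satisfy the predicate are read at all.
pruned = [p for p in partitions if matches(p)]
print(pruned)  # [{'year': '2020', 'month': '04'}]
```

Because the predicate references only partitioning columns, it can be evaluated against partition metadata alone, without scanning any data files.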

# Using a connector for the data source


If you select a connector for the **Node type**, follow the instructions at [Authoring jobs with custom connectors](job-authoring-custom-connectors.md) to finish configuring the data source properties.

# Using files in Amazon S3 for the data source


If you choose Amazon S3 as your data source, then you can choose either:
+ A Data Catalog database and table.
+ A bucket, folder, or file in Amazon S3.

If you use an Amazon S3 bucket as your data source, AWS Glue detects the schema of the data at the specified location from one of the files, or by using the file you specify as a sample file. Schema detection occurs when you use the **Infer schema** button. If you change the Amazon S3 location or the sample file, then you must choose **Infer schema** again to perform the schema detection using the new information.
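
The inference itself is opaque to the console user, but the general idea, sampling a file and guessing each column's type from its values, can be sketched in plain Python. This is an illustration only, not AWS Glue's actual inference algorithm:

```python
def infer_type(values):
    """Guess a column type from sample string values (illustration only)."""
    try:
        for v in values:
            int(v)
        return "int"
    except ValueError:
        pass
    try:
        for v in values:
            float(v)
        return "double"
    except ValueError:
        return "string"

# Sample rows as they might be read from a CSV file in Amazon S3.
rows = [
    {"id": "1", "price": "9.99", "name": "widget"},
    {"id": "2", "price": "12.50", "name": "gadget"},
]

schema = {col: infer_type([r[col] for r in rows]) for col in rows[0]}
print(schema)  # {'id': 'int', 'price': 'double', 'name': 'string'}
```

This also shows why the sample file matters: a column whose sampled values happen to all be numeric strings would be inferred as numeric, even if other files contain text in that column.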

**To configure a data source node that reads directly from files in Amazon S3**

1. Go to the visual editor for a new or saved job.

1. Choose a data source node in the job diagram for an Amazon S3 source.

1. Choose the **Data source properties** tab, and then enter the following information:
   + **S3 source type**: (For Amazon S3 data sources only) Choose the option **S3 location**.
   + **S3 URL**: Enter the path to the Amazon S3 bucket, folder, or file that contains the data for your job. You can choose **Browse S3** to select the path from the locations available to your account. 
   + **Recursive**: Choose this option if you want AWS Glue to read data from files in child folders at the S3 location. 

     If the child folders contain partitioned data, AWS Glue doesn't add any partition information that's specified in the folder names to the Data Catalog. For example, consider the following folders in Amazon S3:

     ```
     s3://sales/year=2019/month=Jan/day=1
     s3://sales/year=2019/month=Jan/day=2
     ```

     If you choose **Recursive** and select the `sales` folder as your S3 location, then AWS Glue reads the data in all the child folders, but doesn't create partitions for year, month, or day.
   + **Data format**: Choose the format that the data is stored in. You can choose JSON, CSV, or Parquet. The value you select tells the AWS Glue job how to read the data from the source file.
**Note**  
If you don't select the correct format for your data, AWS Glue might infer the schema correctly, but the job won't be able to correctly parse the data from the source file.

     You can enter additional configuration options, depending on the format you choose. 
     + **JSON** (JavaScript Object Notation)
       + **JsonPath**: Enter a JSON path that points to an object that is used to define a table schema. JSON path expressions always refer to a JSON structure in the same way that XPath expressions are used with an XML document. The "root member object" in the JSON path is always referred to as `$`, even if it's an object or array. The JSON path can be written in dot notation or bracket notation.

         For more information about the JSON path, see [JsonPath](https://github.com/json-path/JsonPath) on the GitHub website.
       + **Records in source files can span multiple lines**: Choose this option if a single record can span multiple lines in the JSON file.
     + **CSV** (comma-separated values)
       + **Delimiter**: Enter a character to denote what separates each column entry in the row, for example, `;` or `,`.
       + **Escape character**: Enter a character that is used as an escape character. This character indicates that the character that immediately follows the escape character should be taken literally, and should not be interpreted as a delimiter.
       + **Quote character**: Enter the character that is used to group separate strings into a single value. For example, you would choose **Double quote (")** if you have values such as `"This is a single value"` in your CSV file.
       + **Records in source files can span multiple lines**: Choose this option if a single record can span multiple lines in the CSV file.
       + **First line of source file contains column headers**: Choose this option if the first row in the CSV file contains column headers instead of data.
     + **Parquet** (Apache Parquet columnar storage)

       There are no additional settings to configure for data stored in Parquet format.
     + **Apache Hudi**

       There are no additional settings to configure for data stored in Apache Hudi format.
     + **Delta Lake**

       There are no additional settings to configure for data stored in Delta Lake format.
     + **Excel**

       There are no additional settings to configure for data stored in Excel format.
   + **Partition predicate**: To partition the data that is read from the data source, enter a Boolean expression based on Spark SQL that includes only the partitioning columns. For example: `"(year=='2020' and month=='04')"`
   + **Advanced options**: Expand this section if you want AWS Glue to detect the schema of your data based on a specific file. 
     + **Schema inference**: Choose the option **Choose a sample file from S3** if you want to use a specific file instead of letting AWS Glue choose a file. Schema inference is not available for the Excel source.
     + **Auto-sampled file**: Enter the path to the file in Amazon S3 to use for inferring the schema.

     If you're editing a data source node and change the selected sample file, choose **Reload schema** to detect the schema by using the new sample file.

1. Choose the **Infer schema** button to detect the schema from the source files in Amazon S3. If you change the Amazon S3 location or the sample file, you must choose **Infer schema** again to infer the schema using the new information.
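
The folder layout shown earlier (`s3://sales/year=2019/month=Jan/day=1`) uses Hive-style `key=value` partition names. Extracting those keys from a path is mechanical, as the sketch below shows; the `partition_values` helper is hypothetical, for illustration only, and is not how AWS Glue processes paths internally:

```python
def partition_values(s3_path):
    """Extract Hive-style key=value partition segments from an S3 path."""
    parts = {}
    for segment in s3_path.split("/"):
        if "=" in segment:
            key, _, value = segment.partition("=")
            parts[key] = value
    return parts

print(partition_values("s3://sales/year=2019/month=Jan/day=1"))
# {'year': '2019', 'month': 'Jan', 'day': '1'}
```

When you read the S3 location directly with **Recursive** instead of through a Data Catalog table, this information in the folder names is ignored rather than surfaced as partition columns.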

# Using a streaming data source


You can create streaming extract, transform, and load (ETL) jobs that run continuously and consume data from streaming sources in Amazon Kinesis Data Streams, Apache Kafka, and Amazon Managed Streaming for Apache Kafka (Amazon MSK).

**To configure properties for a streaming data source**

1. Go to the visual graph editor for a new or saved job.

1. Choose a data source node in the graph for Kafka or Kinesis Data Streams.

1. Choose the **Data source properties** tab, and then enter the following information:

------
#### [ Kinesis ]
   + **Kinesis source type**: Choose the option **Stream details** to use direct access to the streaming source or choose **Data Catalog table** to use the information stored there instead.

     If you choose **Stream details**, specify the following additional information.
     + **Location of data stream**: Choose whether the stream is associated with the current user, or if it is associated with a different user.
     + **Region**: Choose the AWS Region where the stream exists. This information is used to construct the ARN for accessing the data stream.
     + **Stream ARN**: Enter the Amazon Resource Name (ARN) for the Kinesis data stream. If the stream is located within the current account, you can choose the stream name from the drop-down list. You can use the search field to search for a data stream by its name or ARN.
     + **Data format**: Choose the format used by the data stream from the list. 

       AWS Glue automatically detects the schema from the streaming data.

     If you choose **Data Catalog table**, specify the following additional information.
     + **Database**: (Optional) Choose the database in the AWS Glue Data Catalog that contains the table associated with your streaming data source. You can use the search field to search for a database by its name. 
     + **Table**: (Optional) Choose the table associated with the source data from the list. This table must already exist in the AWS Glue Data Catalog. You can use the search field to search for a table by its name. 
     + **Detect schema**: Choose this option to have AWS Glue detect the schema from the streaming data, rather than using the schema information in a Data Catalog table. This option is enabled automatically if you choose the **Stream details** option.
   + **Starting position**: By default, the ETL job uses the **Earliest** option, which means it reads data starting with the oldest available record in the stream. You can instead choose **Latest**, which indicates the ETL job should start reading from just after the most recent record in the stream.
   + **Window size**: By default, your ETL job processes and writes out data in 100-second windows. This allows data to be processed efficiently and permits aggregations to be performed on data that arrives later than expected. You can modify this window size to increase timeliness or aggregation accuracy. 

     AWS Glue streaming jobs use checkpoints rather than job bookmarks to track the data that has been read. 
   + **Connection options**: Expand this section to add key-value pairs to specify additional connection options. For information about what options you can specify here, see ["connectionType": "kinesis"](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-connect.html#aws-glue-programming-etl-connect-kinesis) in the *AWS Glue Developer Guide*.
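
The **Region** and stream name described above combine into the stream ARN, which follows a fixed format. A small sketch of composing one (the account ID and stream name here are placeholders):

```python
def kinesis_stream_arn(region, account_id, stream_name):
    """Compose a Kinesis data stream ARN from its parts."""
    return f"arn:aws:kinesis:{region}:{account_id}:stream/{stream_name}"

print(kinesis_stream_arn("us-east-1", "123456789012", "sales-events"))
# arn:aws:kinesis:us-east-1:123456789012:stream/sales-events
```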

------
#### [ Kafka ]
   + **Apache Kafka source**: Choose the option **Stream details** to use direct access to the streaming source or choose **Data Catalog table** to use the information stored there instead.

     If you choose **Data Catalog table**, specify the following additional information.
     + **Database**: (Optional) Choose the database in the AWS Glue Data Catalog that contains the table associated with your streaming data source. You can use the search field to search for a database by its name. 
     + **Table**: (Optional) Choose the table associated with the source data from the list. This table must already exist in the AWS Glue Data Catalog. You can use the search field to search for a table by its name. 
     + **Detect schema**: Choose this option to have AWS Glue detect the schema from the streaming data, rather than using the schema information in a Data Catalog table. This option is enabled automatically if you choose the **Stream details** option.

     If you choose **Stream details**, specify the following additional information.
     + **Connection name**: Choose the AWS Glue connection that contains the access and authentication information for the Kafka data stream. You must use a connection with Kafka streaming data sources. If a connection doesn't exist, you can use the AWS Glue console to create a connection for your Kafka data stream.
     + **Topic name**: Enter the name of the topic to read from.
     + **Data format**: Choose the format to use when reading data from the Kafka event stream. 
   + **Starting position**: By default, the ETL job uses the **Earliest** option, which means it reads data starting with the oldest available record in the stream. You can instead choose **Latest**, which indicates the ETL job should start reading from just after the most recent record in the stream.
   + **Window size**: By default, your ETL job processes and writes out data in 100-second windows. This allows data to be processed efficiently and permits aggregations to be performed on data that arrives later than expected. You can modify this window size to increase timeliness or aggregation accuracy. 

     AWS Glue streaming jobs use checkpoints rather than job bookmarks to track the data that has been read. 
   + **Connection options**: Expand this section to add key-value pairs to specify additional connection options. For information about what options you can specify here, see ["connectionType": "kafka"](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-connect.html#aws-glue-programming-etl-connect-kafka) in the *AWS Glue Developer Guide*.

------

**Note**  
Data previews are not currently supported for streaming data sources.
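
The 100-second window described above can be pictured as bucketing records by arrival time, with each bucket processed and written out together. A minimal pure-Python sketch of the grouping idea (not how AWS Glue streaming is implemented):

```python
from collections import defaultdict

WINDOW_SECONDS = 100  # default window size

def assign_windows(records):
    """Group (timestamp, payload) records into fixed-size time windows."""
    windows = defaultdict(list)
    for ts, payload in records:
        window_start = ts - (ts % WINDOW_SECONDS)
        windows[window_start].append(payload)
    return dict(windows)

# Timestamps are seconds; records at 5s and 99s share the first window.
events = [(5, "a"), (99, "b"), (100, "c"), (250, "d")]
print(assign_windows(events))  # {0: ['a', 'b'], 100: ['c'], 200: ['d']}
```

A larger window tolerates later-arriving data within the same aggregation; a smaller one trades that tolerance for lower latency.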

# References


**Best Practices**
+ [Build an ETL service pipeline to load data incrementally from Amazon S3 to Amazon Redshift using AWS Glue](https://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/build-an-etl-service-pipeline-to-load-data-incrementally-from-amazon-s3-to-amazon-redshift-using-aws-glue.html)

**ETL programming**
+ [Connection types and options for ETL in AWS Glue](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-connections.html)
+ [JDBC connectionType values](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-connect.html#aws-glue-programming-etl-connect-jdbc)
+ [Advanced options for moving data to and from Amazon Redshift](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-redshift.html)