

# Amazon Q data integration in AWS Glue
<a name="q"></a>

Amazon Q data integration in AWS Glue is a new generative AI capability of AWS Glue that enables data engineers and ETL developers to build data integration jobs using natural language. Engineers and developers can ask Amazon Q to author jobs, troubleshoot issues, and answer questions about AWS Glue and data integration.

## What is Amazon Q?
<a name="q-what-is-amazon-q"></a>

**Note**  
Powered by Amazon Bedrock: AWS implements [automated abuse detection](https://docs.aws.amazon.com/bedrock/latest/userguide/abuse-detection.html). Because Amazon Q data integration is built on Amazon Bedrock, users can take full advantage of the controls implemented in Amazon Bedrock to enforce safety, security, and the responsible use of artificial intelligence (AI).

Amazon Q is a generative artificial intelligence (AI) powered conversational assistant that can help you understand, build, extend, and operate AWS applications. The model that powers Amazon Q has been augmented with high quality AWS content to get you more complete, actionable, and referenced answers to accelerate your building on AWS. For more information, see [What is Amazon Q?](https://docs.aws.amazon.com/amazonq/latest/aws-builder-use-ug/what-is.html)

## What is Amazon Q data integration in AWS Glue?
<a name="q-key-features"></a>

Amazon Q data integration in AWS Glue includes the following capabilities:
+ **Chat** – Amazon Q data integration in AWS Glue can answer natural language questions in English about AWS Glue and data integration topics, such as AWS Glue source and destination connectors, AWS Glue ETL jobs, the Data Catalog, crawlers, AWS Lake Formation, other feature documentation, and best practices. It responds with step-by-step instructions and includes references to its information sources.
+ **Data integration code generation** – Amazon Q data integration in AWS Glue can answer questions about AWS Glue ETL scripts and generate new code from a natural language request in English.
+ **Troubleshoot** – Amazon Q data integration in AWS Glue is purpose-built to help you understand errors in AWS Glue jobs, and provides step-by-step instructions to root-cause and resolve your issues.

**Note**  
Amazon Q data integration in AWS Glue does not use the context of your conversation to inform future responses for the duration of your conversation. Each conversation with Amazon Q data integration in AWS Glue is independent of your prior or future conversations.

## Working with Amazon Q data integration in AWS Glue
<a name="q-working-with"></a>

In the Amazon Q panel, you can ask Amazon Q to generate code for an AWS Glue ETL script, or to answer a question about AWS Glue features or about troubleshooting an error. For code generation, the response is an ETL script in PySpark with step-by-step instructions for customizing, reviewing, and running the script. For questions, the response is generated from the data integration knowledge base and includes a summary and source URLs for reference.

For example, you can ask Amazon Q to "*Please provide a Glue script that reads from Snowflake, renames the fields, and writes to Redshift*", and in response, Amazon Q data integration in AWS Glue returns an AWS Glue job script that performs the requested action. You can review the generated code to ensure that it fulfills your intent and, if satisfied, deploy it as an AWS Glue job in production. You can troubleshoot jobs by asking Amazon Q to explain errors and failures and to propose solutions. Amazon Q can also answer questions about AWS Glue and data integration best practices.

![\[An example of using Amazon Q data integration in AWS Glue.\]](http://docs.aws.amazon.com/glue/latest/dg/images/q-chat-experience-1.gif)


The following are example questions that demonstrate how Amazon Q data integration in AWS Glue can help you build on AWS Glue:

AWS Glue ETL code generation:
+ Write an AWS Glue script that reads JSON from S3, transforms fields using apply mapping and writes to Amazon Redshift
+ How do I write an AWS Glue script for reading from DynamoDB, applying the DropNullFields transform and writing to S3 as Parquet?
+ Give me an AWS Glue script that reads from MySQL, drops some fields based on my business logic, and writes to Snowflake
+ Write an AWS Glue job to read from DynamoDB and write to S3 as JSON
+ Help me develop an AWS Glue script for AWS Glue Data Catalog to S3
+ Write an AWS Glue job to read JSON from S3, drop nulls and write to Redshift

AWS Glue feature explanations:
+ How do I use AWS Glue Data Quality?
+ How to use AWS Glue job bookmarks?
+ How do I enable AWS Glue autoscaling?
+ What is the difference between AWS Glue dynamic frames and Spark data frames?
+ What are the different types of connections supported by AWS Glue?

AWS Glue troubleshooting:
+ How to troubleshoot Out Of Memory (OOM) errors on AWS Glue jobs?
+ What are some error messages you may see when setting up AWS Glue Data Quality and how can you fix them?
+ How do I fix an AWS Glue job with the error Amazon S3 access denied?
+ How do I resolve issues with data shuffle on AWS Glue jobs?

## Best practices for interacting with Amazon Q data integration
<a name="q-best-practices"></a>

The following are best practices for interacting with Amazon Q data integration:
+ When interacting with Amazon Q data integration, ask specific questions, iterate when you have complex requests, and verify the answers for accuracy.
+ When providing data integration prompts in natural language, be as specific as possible to help the assistant understand exactly what you need. Instead of asking "extract data from S3," provide more details like "write an AWS Glue script that extracts JSON files from S3."
+ Review the generated script before running it to ensure accuracy. If the generated script has errors or does not match your intent, provide instructions to the assistant on how to correct it.
+ Generative AI technology is new and there can be mistakes, sometimes called hallucinations, in the responses. Test and review all code for errors and vulnerabilities before using it in your environment or workload.

## Amazon Q data integration in AWS Glue service improvement
<a name="q-service-improvement"></a>

To help Amazon Q data integration in AWS Glue provide the most relevant information about AWS services, we may use certain content from Amazon Q, such as questions that you ask Amazon Q and its responses, for service improvement.

For information about what content we use and how to opt out, see [Amazon Q Developer service improvement](https://docs.aws.amazon.com/amazonq/latest/qdeveloper-ug/service-improvement.html) in the *Amazon Q Developer User Guide*.

## Considerations
<a name="q-considerations"></a>

Consider the following items before you use Amazon Q data integration in AWS Glue:
+ Currently, code generation works only with the PySpark kernel. The generated code is for AWS Glue jobs based on Spark with Python.
+ For information about the supported combinations of code generation abilities of Amazon Q data integration in AWS Glue, see [Supported code generation abilities](q-supported-actions.md).

# Setting up Amazon Q data integration in AWS Glue
<a name="q-setting-up"></a>

The following sections provide information about setting up Amazon Q data integration in AWS Glue.

**Topics**
+ [Configuring IAM permissions](q-setting-up-permissions.md)

# Configuring IAM permissions
<a name="q-setting-up-permissions"></a>

This topic describes the IAM permissions that you configure for the Amazon Q chat experience, and the AWS Glue Studio notebook experience.

**Topics**
+ [Configuring IAM permissions for Amazon Q chat](#q-setting-up-permissions-amazon-q-chat)
+ [Configuring IAM permissions for AWS Glue Studio notebooks](#q-setting-up-permissions-notebooks)

## Configuring IAM permissions for Amazon Q chat
<a name="q-setting-up-permissions-amazon-q-chat"></a>

Using the APIs behind Amazon Q data integration in AWS Glue requires appropriate AWS Identity and Access Management (IAM) permissions. You can obtain these permissions by attaching the following custom policy to your IAM identity (such as a user, role, or group):

------
#### [ JSON ]


```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "glue:StartCompletion",
        "glue:GetCompletion"
      ],
      "Resource": [
        "arn:aws:glue:*:*:completion/*"
      ]
    }
  ]
}
```

------
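If you manage policies as code, the same document can be built and validated with a short Python sketch before you attach it with your usual IAM tooling. This is a convenience sketch, not part of any AWS API; only the actions and the resource ARN come from the policy above.

```python
import json

# Build the same policy document shown above programmatically.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["glue:StartCompletion", "glue:GetCompletion"],
            "Resource": ["arn:aws:glue:*:*:completion/*"],
        }
    ],
}

# Serialize for use with the IAM console, the AWS CLI, or an SDK.
policy_json = json.dumps(policy, indent=2)
print(policy_json)
```

You can then paste the printed JSON into the IAM console or pass it to your deployment tooling.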

## Configuring IAM permissions for AWS Glue Studio notebooks
<a name="q-setting-up-permissions-notebooks"></a>

To enable Amazon Q data integration in AWS Glue Studio notebooks, ensure that the following permissions are attached to the notebook IAM role:

**Note**  
The `codewhisperer` prefix is a legacy name from a service that merged with Amazon Q Developer. For more information, see [Amazon Q Developer rename - Summary of changes](https://docs.aws.amazon.com/amazonq/latest/qdeveloper-ug/service-rename.html).

------
#### [ JSON ]


```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "glue:StartCompletion",
        "glue:GetCompletion"
      ],
      "Resource": [
        "arn:aws:glue:*:*:completion/*"
      ]
    },
    {
      "Sid": "AmazonQDeveloperPermissions",
      "Effect": "Allow",
      "Action": [
        "codewhisperer:GenerateRecommendations"
      ],
      "Resource": "*"
    }
  ]
}
```

------

**Note**  
Amazon Q data integration in AWS Glue does not have APIs available through the AWS SDK that you can use programmatically. The following two APIs are used in the IAM policy for enabling this experience through the Amazon Q chat panel or AWS Glue Studio notebooks: `StartCompletion` and `GetCompletion`.

### Assigning permissions
<a name="q-assigning-permissions"></a>

To provide access, add permissions to your users, groups, or roles:
+ Users and groups in AWS IAM Identity Center: Create a permission set. Follow the instructions in [Create a permission set](https://docs.aws.amazon.com/singlesignon/latest/userguide/howtocreatepermissionset.html) in the *AWS IAM Identity Center User Guide*.
+ Users managed in IAM through an identity provider: Create a role for identity federation. Follow the instructions in [Creating a role for a third-party identity provider (federation)](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_create_for-idp.html) in the *IAM User Guide*.
+ IAM users:
  + Create a role that your user can assume. Follow the instructions in [Creating a role for an IAM user](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_create_for-user.html) in the *IAM User Guide*.
  + (Not recommended) Attach a policy directly to a user or add a user to a user group. Follow the instructions in [Adding permissions to a user (console)](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_users_change-permissions.html#users_change_permissions-add-console) in the *IAM User Guide*.

# Supported code generation abilities
<a name="q-supported-actions"></a>

The following table lists the supported combinations of sources, targets, and transformations for the code generation abilities of Amazon Q data integration.


| Sources and Targets | Transformation | 
| --- | --- | 
| S3 with the following format types: json, csv, parquet, hudi, delta | Drop | 
| AWS Glue Data Catalog | Aggregate | 
| Redlake | DropDuplicates | 
| Amazon DynamoDB | Join | 
| MySQL | Filter | 
| Oracle | RenameColumns | 
| PostgreSQL | FillNull | 
| Microsoft SQL Server | DropNull | 
| Amazon DocumentDB / MongoDB | WithColumns | 
| Snowflake | SQL Query | 
| Google BigQuery | Union | 
| Teradata | Select | 
| Amazon OpenSearch Service |  | 
| Vertica |  | 
| SAP HANA |  | 
| Amazon Redshift |  | 
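
As a rough, plain-Python illustration of what two of the transforms above do conceptually (the real transforms operate on AWS Glue DynamicFrames and Spark DataFrames; the sample rows here are hypothetical):

```python
# Hypothetical rows; illustrates DropNull-style and RenameColumns-style
# behavior in plain Python, not the actual Glue transforms.
rows = [
    {"venue_id": 1, "city": "Washington", "capacity": None},
    {"venue_id": 2, "city": None, "capacity": 20000},
]

# DropNull-style: remove fields whose value is null in each row.
dropped = [{k: v for k, v in row.items() if v is not None} for row in rows]

# RenameColumns-style: rename venue_id to venueid.
renamed = [
    {("venueid" if k == "venue_id" else k): v for k, v in row.items()}
    for row in dropped
]
print(renamed)
```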

# Example interactions
<a name="q-using-example-interactions"></a>

Amazon Q data integration in AWS Glue lets you enter your question in the Amazon Q panel. You can ask about data integration functionality provided by AWS Glue, and a detailed answer with reference documents is returned.

Another use case is generating AWS Glue ETL job scripts. You can ask how to perform a data extract, transform, and load (ETL) task, and a generated PySpark script is returned.

**Topics**
+ [Amazon Q chat interactions](#q-using-example-interactions)
+ [AWS Glue Studio notebook interactions](#q-using-example-interactions-notebooks)

## Amazon Q chat interactions
<a name="q-using-example-interactions"></a>

On the AWS Glue console, start authoring a new job, and ask Amazon Q: *"Create a Glue ETL flow connect to two Glue catalog tables venue and event in my database glue_db, join the results on the venue's venueid and event's e_venueid, and then filter on venue state with condition as venuestate=='DC' and write to s3://amzn-s3-demo-bucket/codegen/BDB-9999/output/ in CSV format."*

![\[An example of asking Amazon Q data integration in AWS Glue for a generated ETL script.\]](http://docs.aws.amazon.com/glue/latest/dg/images/Q-SIDEPANEL-GS.gif)


You will notice that the code is generated. With this response, you can learn and understand how to author AWS Glue code for your purpose. You can copy the generated code to the script editor and configure the placeholders. After you configure an IAM role and AWS Glue connections for the job, save and run the job. When the job is complete, you can verify that the summary data is persisted to Amazon S3 as expected and can be used by your downstream workloads.

## AWS Glue Studio notebook interactions
<a name="q-using-example-interactions-notebooks"></a>

**Note**  
The Amazon Q data integration experience in AWS Glue Studio notebooks still focuses on DynamicFrame-based data integration flows.

Add a new cell and enter your comment to describe what you want to achieve. After you press **Tab** and **Enter**, the recommended code is shown.

The first intent is to extract the data: *"Give me code that reads a Glue Data Catalog table"*, followed by *"Give me code to apply a filter transform with star_rating>3"* and *"Give me code that writes the frame into S3 as Parquet"*.

![\[An example of using an AWS Glue Studio notebook to ask Amazon Q data integration in AWS Glue for a generated ETL script.\]](http://docs.aws.amazon.com/glue/latest/dg/images/q-notebook-experience-1.gif)


![\[An example of using an AWS Glue Studio notebook to ask Amazon Q data integration in AWS Glue for a generated ETL script.\]](http://docs.aws.amazon.com/glue/latest/dg/images/q-notebook-experience-2.gif)


![\[An example of using an AWS Glue Studio notebook to ask Amazon Q data integration in AWS Glue for a generated ETL script.\]](http://docs.aws.amazon.com/glue/latest/dg/images/q-notebook-experience-3.gif)


Similar to the Amazon Q chat experience, the code is recommended. If you press **Tab**, the recommended code is accepted.

You can run each cell by filling in the appropriate options for your sources in the generated code. At any point in the runs, you can also preview a sample of your dataset by using the `show()` method.

 You can run the notebook as a job, either programmatically or by choosing **Run**. 

### Complex prompts
<a name="q-using-example-interactions-notebooks-complex-prompt"></a>

You can generate a full script with a single complex prompt. *"I have JSON data in S3 and data in Oracle that needs combining. Please provide a Glue script that reads from both sources, does a join, and then writes results to Redshift."*

![\[An example of using an AWS Glue Studio notebook to ask Amazon Q data integration in AWS Glue for a generated ETL script.\]](http://docs.aws.amazon.com/glue/latest/dg/images/q-notebook-experience-4.gif)


You may notice that, on the notebook, Amazon Q data integration in AWS Glue generated the same code snippet that was generated in the Amazon Q chat.

You can run the notebook as a job, either by choosing **Run** or programmatically.

# Using context awareness with Amazon Q data integration in AWS Glue
<a name="q-context-awareness"></a>

You can create data processing jobs more efficiently with context-aware, query-based PySpark DataFrame code generation in Amazon Q data integration. For example, you can use this prompt to generate PySpark code: “create a job to load sales data from Redshift table ‘analytics.salesorder’ using connection ‘erp_conn’, filter order_amount below 50 dollars, and save to Amazon S3 in parquet format.”

Amazon Q generates the script based on your prompt and sets up the data integration workflow with the details provided in your question, such as connection configurations, schema details, database and table names, and column specifications for transformations. Sensitive information, such as connection option passwords, continues to be redacted.

If required information is missing from your prompt, Amazon Q inserts placeholders, and you must update the generated code with the appropriate values before running it.
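Because an unfilled placeholder such as `<connection-name>` will cause the script to fail at run time, it can be useful to scan generated code for leftover placeholders before running it. The following is a minimal stdlib sketch; the angle-bracket convention matches the placeholders in the generated scripts, and the helper name and sample line are our own.

```python
import re

def find_placeholders(script: str) -> list[str]:
    """Return angle-bracket placeholders (such as <connection-name>) left in a script."""
    return re.findall(r"<[a-zA-Z][\w-]*>", script)

# Hypothetical line from a generated script that still needs a connection name.
generated = 'CatalogConnectionHelper(spark).write(df, "mysql", "<connection-name>", {})'
placeholders = find_placeholders(generated)
print(placeholders)
```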

The following are examples of how to use context awareness.

## Example interactions
<a name="q-context-awareness-examples"></a>

### Reading an Amazon S3 file and writing to an Amazon S3 bucket
<a name="q-context-awareness-examples-generated-script-1"></a>

 Prompt: `Create an AWS Glue spark job that reads a file s3://amzn-s3-demo-bucket-input-bucket-name/input-filename.csv and saves it into the parquet file under directory s3://amzn-s3-demo-bucket-output-bucket-name/output-s3-prefix/` 

```python
import sys
from awsglue.transforms import *
from pyspark.context import SparkContext
from pyspark.sql import SparkSession

sc = SparkContext.getOrCreate()
spark = SparkSession.builder.getOrCreate()

# Script generated for node S3DataSource
S3DataSource_dsource1 = spark.read.format("csv").load(
    "s3://amzn-s3-demo-bucket-input-bucket-name/input-filename.csv"
)

# Script generated for node S3DataSink
S3DataSource_dsource1.write.format("parquet").mode("append").save(
    "s3://amzn-s3-demo-bucket-output-bucket-name/output-s3-prefix/"
)
```

![\[An example of asking Amazon Q data integration in AWS Glue for a generated ETL script.\]](http://docs.aws.amazon.com/glue/latest/dg/images/context-awareness-lakehouse-example.gif)


### Reading from a Lakehouse table and writing to a database
<a name="q-context-awareness-examples-generated-script-2"></a>

 Prompt: `write an ETL script to read from a Lakehouse table my-table in database my-database and write it to a RDS MySQL table my-target-table` 

For fields where you did not provide information (for example, the connectionName required for the MySQL data sink, which defaults to the placeholder <connection-name> in the generated code), a placeholder is kept for you to fill in with the required information before you run the script.

 Generated script: 

```python
import sys
from awsglue.transforms import *
from pyspark.context import SparkContext
from pyspark.sql import SparkSession
from connectivity.adapter import CatalogConnectionHelper

sc = SparkContext.getOrCreate()
spark = SparkSession.builder.getOrCreate()

# Script generated for node S3DataSource
S3DataSource_dsource1 = spark.read.format("parquet").load(
    "s3://amzn-lakehouse-demo-bucket/my-database/my-table"
)

# Script generated for node ConnectionV2DataSink
ConnectionV2DataSink_dsink1_additional_options = {"dbtable": "my-target-table"}
CatalogConnectionHelper(spark).write(
    S3DataSource_dsource1,
    "mysql",
    "<connection-name>",
    ConnectionV2DataSink_dsink1_additional_options,
)
```

![\[An example of asking Amazon Q data integration in AWS Glue for a generated ETL script.\]](http://docs.aws.amazon.com/glue/latest/dg/images/context-awareness-example-interactions.gif)


### Example: Full ETL workflow
<a name="q-context-awareness-complex-example"></a>

The following example demonstrates how you can ask Amazon Q to create an AWS Glue script for a full ETL workflow with the following prompt: `Create an AWS Glue ETL script that reads from two AWS Glue Data Catalog tables, venue and event, in my database glue_db_4fthqih3vvk1if, joins the results on the field venueid, filters on venue state with the condition venuestate=='DC' after joining the results, and writes the output to the Amazon S3 location s3://amz-s3-demo-bucket/output/ in CSV format`.

The workflow reads from two data sources (two AWS Glue Data Catalog tables), joins the results of the two reads, filters the joined data based on a condition, and writes the transformed output to an Amazon S3 destination in CSV format.

The generated job fills in the data source, transform, and sink operations with the corresponding details extracted from your question, as shown below.

```python
import sys
from awsglue.transforms import *
from pyspark.context import SparkContext
from pyspark.sql import SparkSession

sc = SparkContext.getOrCreate()
spark = SparkSession.builder.getOrCreate()

# Script generated for node CatalogDataSource
CatalogDataSource_dsource1 = spark.sql("select * from `glue_db_4fthqih3vvk1if`.`venue`")

# Script generated for node CatalogDataSource
CatalogDataSource_dsource2 = spark.sql("select * from `glue_db_4fthqih3vvk1if`.`event`")

# Script generated for node JoinTransform
JoinTransform_transform1 = CatalogDataSource_dsource1.join(
    CatalogDataSource_dsource2,
    (CatalogDataSource_dsource1["venueid"] == CatalogDataSource_dsource2["venueid"]),
    "inner",
)

# Script generated for node FilterTransform
FilterTransform_transform2 = JoinTransform_transform1.filter("venuestate=='DC'")

# Script generated for node S3DataSink
FilterTransform_transform2.write.format("csv").mode("append").save(
    "s3://amz-s3-demo-bucket/output/"
)
```
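The join-and-filter logic of the generated script can be illustrated in plain Python with a few hypothetical rows (the real job runs these steps on Spark DataFrames):

```python
# Hypothetical rows standing in for the venue and event tables.
venues = [
    {"venueid": 1, "venuename": "Venue A", "venuestate": "DC"},
    {"venueid": 2, "venuename": "Venue B", "venuestate": "NY"},
]
events = [
    {"venueid": 1, "eventname": "Event A"},
    {"venueid": 2, "eventname": "Event B"},
]

# Inner join on venueid, mirroring the JoinTransform node.
joined = [
    {**v, **e}
    for v in venues
    for e in events
    if v["venueid"] == e["venueid"]
]

# Keep only rows where venuestate == 'DC', mirroring the FilterTransform node.
dc_rows = [row for row in joined if row["venuestate"] == "DC"]
print(dc_rows)
```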

![\[An example of asking Amazon Q data integration in AWS Glue for a generated ETL script.\]](http://docs.aws.amazon.com/glue/latest/dg/images/context-awareness-complex-example.gif)


## Limitations
<a name="q-context-awareness-limitations"></a>
+ Context carryover:
  + The context-awareness feature only carries over context from the previous user query within the same conversation. It does not retain context beyond the immediately preceding query.
+ Support for node configurations:
  + Currently, context awareness supports only a subset of the required configurations for various nodes.
  + Support for optional fields is planned in upcoming releases.
+ Availability:
  + Context awareness and DataFrame support are available in Amazon Q chat and SageMaker Unified Studio notebooks. These features are not yet available in AWS Glue Studio notebooks.