# Developing and testing AWS Glue job scripts locally
<a name="aws-glue-programming-etl-libraries"></a>

When you develop and test your AWS Glue for Spark job scripts, there are multiple available options:
+ AWS Glue Studio console
  + Visual editor
  + Script editor
  + AWS Glue Studio notebook
+ Interactive sessions
  + Jupyter notebook
+ Docker image
  + Local development
  + Remote development

You can choose any of the above options based on your requirements.

If you prefer a no-code or low-code experience, the AWS Glue Studio visual editor is a good choice.

If you prefer an interactive notebook experience, an AWS Glue Studio notebook is a good choice. For more information, see [Using Notebooks with AWS Glue Studio and AWS Glue](https://docs.aws.amazon.com/glue/latest/ug/notebooks-chapter.html). If you want to use your own local environment, interactive sessions are a good choice. For more information, see [Using interactive sessions with AWS Glue](https://docs.aws.amazon.com/glue/latest/dg/interactive-sessions-chapter.html).

If you prefer a local or remote development experience, the Docker image is a good choice. It lets you develop and test AWS Glue for Spark job scripts anywhere you prefer without incurring AWS Glue costs.

If you prefer local development without Docker, installing the AWS Glue ETL library directly on your machine is a good choice.

## Developing using AWS Glue Studio
<a name="develop-using-studio"></a>

The AWS Glue Studio visual editor is a graphical interface that makes it easy to create, run, and monitor extract, transform, and load (ETL) jobs in AWS Glue. You can visually compose data transformation workflows and seamlessly run them on AWS Glue's Apache Spark-based serverless ETL engine. You can inspect the schema and data results in each step of the job. For more information, see the [AWS Glue Studio User Guide](https://docs.aws.amazon.com/glue/latest/ug/what-is-glue-studio.html).

## Developing using interactive sessions
<a name="develop-using-interactive-sessions"></a>

Interactive sessions allow you to build and test applications from the environment of your choice. For more information, see [Using interactive sessions with AWS Glue](https://docs.aws.amazon.com/glue/latest/dg/interactive-sessions-chapter.html).

# Develop and test AWS Glue jobs locally using a Docker image
<a name="develop-local-docker-image"></a>

For a production-ready data platform, the development process and CI/CD pipeline for AWS Glue jobs is a key topic. You can flexibly develop and test AWS Glue jobs in a Docker container. AWS Glue hosts Docker images on Amazon ECR Public so that you can set up your development environment with additional utilities. You can use your preferred IDE, notebook, or REPL with the AWS Glue ETL library. This topic describes how to develop and test AWS Glue version 5.0 jobs in a Docker container using a Docker image.

## Available Docker images
<a name="develop-local-available-docker-images-ecr"></a>

The following Docker images are available for AWS Glue on [Amazon ECR](https://gallery.ecr.aws/glue/aws-glue-libs):
+ For AWS Glue version 5.0: `public.ecr.aws/glue/aws-glue-libs:5`
+ For AWS Glue version 4.0: `public.ecr.aws/glue/aws-glue-libs:glue_libs_4.0.0_image_01`
+ For AWS Glue version 3.0: `public.ecr.aws/glue/aws-glue-libs:glue_libs_3.0.0_image_01`
+ For AWS Glue version 2.0: `public.ecr.aws/glue/aws-glue-libs:glue_libs_2.0.0_image_01`

**Note**  
AWS Glue Docker images are compatible with both x86\_64 and arm64.

 In this example, we use `public.ecr.aws/glue/aws-glue-libs:5` and run the container on a local machine (Mac, Windows, or Linux). This container image has been tested for AWS Glue version 5.0 Spark jobs. The image contains the following: 
+  Amazon Linux 2023 
+  AWS Glue ETL Library 
+  Apache Spark 3.5.4 
+  Open table format libraries: Apache Iceberg 1.7.1, Apache Hudi 0.15.0, and Delta Lake 3.3.0 
+  AWS Glue Data Catalog Client 
+  Amazon Redshift connector for Apache Spark 
+  Amazon DynamoDB connector for Apache Hadoop 

 To set up your container, pull the image from ECR Public Gallery and then run the container. This topic demonstrates how to run your container with the following methods, depending on your requirements: 
+ `spark-submit`
+ REPL shell (`pyspark`)
+ `pytest`
+ Visual Studio Code

## Prerequisites
<a name="develop-local-docker-image-prereq"></a>

Before you start, make sure that Docker is installed and the Docker daemon is running. For installation instructions, see the Docker documentation for [Mac](https://docs.docker.com/docker-for-mac/install/) or [Linux](https://docs.docker.com/engine/install/). The machine running Docker hosts the AWS Glue container. Also make sure that the host running Docker has at least 7 GB of free disk space for the image.

 For more information about restrictions when developing AWS Glue code locally, see [ Local development restrictions ](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-libraries.html#local-dev-restrictions). 

### Configuring AWS
<a name="develop-local-docker-image-config-aws-credentials"></a>

To enable AWS API calls from the container, set up AWS credentials by following these steps.

1.  [Create an AWS named profile](https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-files.html). 

1.  Open `cmd` on Windows or a terminal on Mac/Linux and run the following command: 

   ```
   PROFILE_NAME="<your_profile_name>"
   ```

In the following sections, we use this AWS named profile.
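Before mounting `~/.aws` into the container, it can help to confirm that the named profile actually exists. The following is a minimal sketch, not part of the AWS Glue tooling: `profile_exists` is a hypothetical helper that relies only on the AWS CLI file convention that `~/.aws/config` names sections `[profile NAME]` (or `[default]`), while `~/.aws/credentials` uses the bare `[NAME]`.

```python
import configparser
from pathlib import Path


def profile_exists(name, config_path, credentials_path):
    """Return True if the named profile appears in either AWS CLI file."""
    # ~/.aws/config uses "[profile NAME]" (except "[default]");
    # ~/.aws/credentials uses the bare "[NAME]".
    config_section = "default" if name == "default" else f"profile {name}"
    candidates = ((config_path, config_section), (credentials_path, name))
    for path, section in candidates:
        parser = configparser.ConfigParser(interpolation=None)
        parser.read(path)  # files that do not exist are skipped silently
        if parser.has_section(section):
            return True
    return False


def default_aws_files():
    """Return the conventional locations of the two AWS CLI files."""
    aws_dir = Path.home() / ".aws"
    return aws_dir / "config", aws_dir / "credentials"
```

For example, `profile_exists("<your_profile_name>", *default_aws_files())` returns `False` when neither file defines the profile, which would explain credential errors inside the container.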

### Pulling the image from ECR Public
<a name="develop-local-docker-pull-image-from-ecr-public"></a>

If you’re running Docker on Windows, right-click the Docker icon and choose **Switch to Linux containers** before pulling the image. 

Run the following command to pull the image from ECR Public:

```
docker pull public.ecr.aws/glue/aws-glue-libs:5 
```

## Run the container
<a name="develop-local-docker-image-setup-run"></a>

You can now run a container using this image. Choose any of the following methods based on your requirements.

### spark-submit
<a name="develop-local-docker-image-setup-run-spark-submit"></a>

You can run an AWS Glue job script by running the `spark-submit` command on the container. 

1.  Write your script and save it (as `sample.py` in this example) under the `/local_path_to_workspace/src/` directory using the following commands: 

   ```
   $ WORKSPACE_LOCATION=/local_path_to_workspace
   $ SCRIPT_FILE_NAME=sample.py
   $ mkdir -p ${WORKSPACE_LOCATION}/src
   $ vim ${WORKSPACE_LOCATION}/src/${SCRIPT_FILE_NAME}
   ```

1.  Run the following command to execute `spark-submit` on the container and submit a new Spark application. The command uses the variables set in the previous step. The sample code (`sample.py`) used in the `spark-submit` command is included in the appendix at the end of this topic. 

   ```
   $ docker run -it --rm \
       -v ~/.aws:/home/hadoop/.aws \
       -v $WORKSPACE_LOCATION:/home/hadoop/workspace/ \
       -e AWS_PROFILE=$PROFILE_NAME \
       --name glue5_spark_submit \
       public.ecr.aws/glue/aws-glue-libs:5 \
       spark-submit /home/hadoop/workspace/src/$SCRIPT_FILE_NAME
   ```

1. (Optional) Configure `spark-submit` to match your environment. For example, you can pass your dependencies with the `--jars` option. For more information, see [Dynamically Loading Spark Properties](https://spark.apache.org/docs/latest/configuration.html) in the Spark documentation. 
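A job script such as `sample.py` might look like the following minimal sketch. It assumes the `awsglue` and `pyspark` modules that ship in the container image, and it reads the public `us-legislators` sample dataset used in other AWS Glue tutorials; the `read_json` helper name and the S3 path are illustrative choices, so substitute a path that your profile can read. The Glue imports are deferred into `main` so that the helper can be imported and unit tested on a machine without Spark.

```python
import sys


def read_json(glue_context, path):
    """Read JSON files under an S3 path into a DynamicFrame."""
    return glue_context.create_dynamic_frame.from_options(
        connection_type="s3",
        connection_options={"paths": [path], "recurse": True},
        format="json",
    )


def main(argv):
    # Imported lazily: these modules are provided by the container image,
    # not by a plain local Python installation.
    try:
        from awsglue.context import GlueContext
        from awsglue.job import Job
        from awsglue.utils import getResolvedOptions
        from pyspark.context import SparkContext
    except ImportError:
        print("awsglue/pyspark not importable; run this inside the AWS Glue container")
        return 1

    # JOB_NAME is passed by AWS Glue; fall back to a fixed name for local runs.
    params = ["JOB_NAME"] if "--JOB_NAME" in argv else []
    args = getResolvedOptions(argv, params)

    glue_context = GlueContext(SparkContext.getOrCreate())
    job = Job(glue_context)
    job.init(args.get("JOB_NAME", "test"), args)

    # Illustrative input path (assumption): replace with your own data.
    dyf = read_json(
        glue_context,
        "s3://awsglue-datasets/examples/us-legislators/all/persons.json",
    )
    dyf.printSchema()

    job.commit()
    return 0


if __name__ == "__main__":
    main(sys.argv)
```

Keeping the read logic in a small function that takes the context as a parameter makes it straightforward to call from a test with either a real or a stubbed `GlueContext`.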

### REPL shell (PySpark)
<a name="develop-local-docker-image-setup-run-repl-shell"></a>

You can run a REPL (read-eval-print loop) shell for interactive development. Run the following command to execute the `pyspark` command on the container and start the REPL shell: 

```
$ docker run -it --rm \
    -v ~/.aws:/home/hadoop/.aws \
    -e AWS_PROFILE=$PROFILE_NAME \
    --name glue5_pyspark \
    public.ecr.aws/glue/aws-glue-libs:5 \
    pyspark
```

 You will see the following output: 

```
Python 3.11.6 (main, Jan  9 2025, 00:00:00) [GCC 11.4.1 20230605 (Red Hat 11.4.1-2)] on linux
Type "help", "copyright", "credits" or "license" for more information.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.5.4-amzn-0
      /_/

Using Python version 3.11.6 (main, Jan  9 2025 00:00:00)
Spark context Web UI available at None
Spark context available as 'sc' (master = local[*], app id = local-1740643079929).
SparkSession available as 'spark'.
>>>
```

 With this REPL shell, you can code and test interactively. 

### Pytest
<a name="develop-local-docker-image-setup-run-pytest"></a>

You can use `pytest` to unit test AWS Glue Spark job scripts. Run the following commands to prepare the workspace. 

```
$ WORKSPACE_LOCATION=/local_path_to_workspace
$ SCRIPT_FILE_NAME=sample.py
$ UNIT_TEST_FILE_NAME=test_sample.py
$ mkdir -p ${WORKSPACE_LOCATION}/tests
$ vim ${WORKSPACE_LOCATION}/tests/${UNIT_TEST_FILE_NAME}
```

 Run the following command to run `pytest` using `docker run`: 

```
$ docker run -i --rm \
    -v ~/.aws:/home/hadoop/.aws \
    -v $WORKSPACE_LOCATION:/home/hadoop/workspace/ \
    --workdir /home/hadoop/workspace \
    -e AWS_PROFILE=$PROFILE_NAME \
    --name glue5_pytest \
    public.ecr.aws/glue/aws-glue-libs:5 \
    -c "python3 -m pytest --disable-warnings"
```

 Once `pytest` finishes executing unit tests, your output will look something like this: 

```
============================= test session starts ==============================
platform linux -- Python 3.11.6, pytest-8.3.4, pluggy-1.5.0
rootdir: /home/hadoop/workspace
plugins: integration-mark-0.2.0
collected 1 item

tests/test_sample.py .                                                   [100%]

======================== 1 passed, 1 warning in 34.28s =========================
```
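As a rough illustration of what a file such as `tests/test_sample.py` could contain, the sketch below tests a hypothetical `read_json(glue_context, path)` helper against a stubbed context, so no Spark session or AWS call is needed. The helper is inlined here only to keep the sketch self-contained; a real test would import it from the job script, and a test that needs real data would build a `GlueContext` in a fixture instead of using the stub. The bucket name is a placeholder.

```python
# Stand-in for importing the helper from the job script (assumption:
# the script exposes a function with this name and signature).
def read_json(glue_context, path):
    return glue_context.create_dynamic_frame.from_options(
        connection_type="s3",
        connection_options={"paths": [path], "recurse": True},
        format="json",
    )


class StubReader:
    """Records the keyword arguments passed to from_options."""

    def __init__(self):
        self.calls = []

    def from_options(self, **kwargs):
        self.calls.append(kwargs)
        return "stub-frame"


class StubGlueContext:
    """Duck-typed replacement for GlueContext; starts no Spark session."""

    def __init__(self):
        self.create_dynamic_frame = StubReader()


def test_read_json_requests_recursive_s3_json():
    context = StubGlueContext()
    path = "s3://amzn-s3-demo-bucket/input/"  # placeholder path

    result = read_json(context, path)

    assert result == "stub-frame"
    (call,) = context.create_dynamic_frame.calls  # exactly one read issued
    assert call["connection_type"] == "s3"
    assert call["connection_options"] == {"paths": [path], "recurse": True}
    assert call["format"] == "json"
```

Because `pytest` collects any function whose name starts with `test_`, the file needs no `pytest` import unless it uses fixtures or markers.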

### Setting up the container to use Visual Studio Code
<a name="develop-local-docker-image-setup-visual-studio"></a>

 To set up the container with Visual Studio Code, complete the following steps: 

1. Install Visual Studio Code.

1. Install the [Python](https://marketplace.visualstudio.com/items?itemName=ms-python.python) extension.

1. Install the [Visual Studio Code Remote - Containers](https://code.visualstudio.com/docs/remote/containers) extension.

1. Open the workspace folder in Visual Studio Code.

1. Press `Ctrl+Shift+P` (Windows/Linux) or `Cmd+Shift+P` (Mac).

1. Type `Preferences: Open Workspace Settings (JSON)`.

1. Press Enter.

1. Paste the following JSON and save it.

   ```
   {
       "python.defaultInterpreterPath": "/usr/bin/python3.11",
       "python.analysis.extraPaths": [
           "/usr/lib/spark/python/lib/py4j-0.10.9.7-src.zip",
           "/usr/lib/spark/python/",
           "/usr/lib/spark/python/lib/"
       ]
   }
   ```

 To set up the container: 

1. Run the Docker container.

   ```
   $ docker run -it --rm \
       -v ~/.aws:/home/hadoop/.aws \
       -v $WORKSPACE_LOCATION:/home/hadoop/workspace/ \
       -e AWS_PROFILE=$PROFILE_NAME \
       --name glue5_pyspark \
       public.ecr.aws/glue/aws-glue-libs:5 \
       pyspark
   ```

1. Start Visual Studio Code.

1.  Choose **Remote Explorer** on the left menu, and choose `public.ecr.aws/glue/aws-glue-libs:5`. 

1.  Right-click and choose **Attach in Current Window**.   
![\[When right-click, a window with the option to Attach in Current Window is presented.\]](http://docs.aws.amazon.com/glue/latest/dg/images/vs-code-other-containers.png)

1.  If the following dialog appears, choose **Got it**.   
![\[A window warning with message "Attaching to a container may execute arbitrary code".\]](http://docs.aws.amazon.com/glue/latest/dg/images/vs-code-warning-got-it.png)

1. Open `/home/hadoop/workspace/`.  
![\[A window drop-down with the option 'workspace' is highlighted.\]](http://docs.aws.amazon.com/glue/latest/dg/images/vs-code-open-workspace.png)

1.  Create an AWS Glue PySpark script and choose **Run**. 

   You will see the successful run of the script.  
![\[The successful run of the script.\]](http://docs.aws.amazon.com/glue/latest/dg/images/vs-code-run-successful-script.png)

## Changes between AWS Glue 4.0 and AWS Glue 5.0 Docker image
<a name="develop-local-docker-glue4-glue5-changes"></a>

The major changes between the AWS Glue 4.0 and AWS Glue 5.0 Docker images are as follows: 
+  In AWS Glue 5.0, there is a single container image for both batch and streaming jobs. This differs from Glue 4.0, where there was one image for batch and another for streaming. 
+  In AWS Glue 5.0, the default user name of the container is `hadoop`. In AWS Glue 4.0, the default user name was `glue_user`. 
+  In AWS Glue 5.0, several additional libraries including JupyterLab and Livy have been removed from the image. You can manually install them. 
+  In AWS Glue 5.0, the Iceberg, Hudi, and Delta libraries are all preloaded by default, and the environment variable `DATALAKE_FORMATS` is no longer needed. In AWS Glue 4.0, the `DATALAKE_FORMATS` environment variable was used to specify which table formats to load. 

 The above list is specific to the Docker image. To learn more about AWS Glue 5.0 updates, see [Introducing AWS Glue 5.0 for Apache Spark ](https://aws.amazon.com/blogs/big-data/introducing-aws-glue-5-0-for-apache-spark/) and [Migrating AWS Glue for Spark jobs to AWS Glue version 5.0](https://docs.aws.amazon.com/glue/latest/dg/migrating-version-50.html). 
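The user-name change matters in practice because the volume mounts in `docker run` point into the container user's home directory. The sketch below shows one way to keep a single launcher for both images. `docker_run_args` is a hypothetical helper, and it assumes that each user's home directory is `/home/<user>` and that the AWS Glue 4.0 image uses the same `workspace` directory layout.

```python
import shlex
from pathlib import Path

# Image tag and default container user per AWS Glue version, taken from the
# image list and the change notes in this topic.
IMAGES = {
    "5.0": ("public.ecr.aws/glue/aws-glue-libs:5", "hadoop"),
    "4.0": ("public.ecr.aws/glue/aws-glue-libs:glue_libs_4.0.0_image_01", "glue_user"),
}


def docker_run_args(glue_version, workspace, profile, command, aws_dir=None):
    """Build a `docker run` argument list for the given AWS Glue version."""
    image, user = IMAGES[glue_version]
    home = f"/home/{user}"  # assumption: home directory follows the user name
    aws_dir = aws_dir or Path.home() / ".aws"
    return [
        "docker", "run", "-it", "--rm",
        "-v", f"{aws_dir}:{home}/.aws",
        "-v", f"{workspace}:{home}/workspace/",
        "-e", f"AWS_PROFILE={profile}",
        image,
        *command,
    ]


if __name__ == "__main__":
    # Print the command instead of running it so it can be reviewed first.
    args = docker_run_args("5.0", "/local_path_to_workspace", "default", ["pyspark"])
    print(shlex.join(args))
```

Returning a list rather than a single string lets the result be passed directly to `subprocess.run` without shell quoting concerns.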

## Considerations
<a name="develop-local-docker-considerations"></a>

 Keep in mind that the following features are not supported when using the AWS Glue container image to develop job scripts locally. 
+  [Job bookmarks](https://docs.aws.amazon.com/glue/latest/dg/monitor-continuations.html) 
+  AWS Glue Parquet writer ([ Using the Parquet format in AWS Glue](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-format-parquet-home.html)) 
+  [ FillMissingValues transform ](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-transforms-fillmissingvalues.html) 
+  [FindMatches transform](https://docs.aws.amazon.com/glue/latest/dg/machine-learning.html#find-matches-transform) 
+  [ Vectorized SIMD CSV reader ](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-format-csv-home.html#aws-glue-programming-etl-format-simd-csv-reader) 
+  The property [ customJdbcDriverS3Path ](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-connect.html#aws-glue-programming-etl-connect-jdbc) for loading JDBC driver from Amazon S3 path 
+  [AWS Glue Data Quality](https://docs.aws.amazon.com/glue/latest/dg/glue-data-quality.html) 
+  [Sensitive Data Detection](https://docs.aws.amazon.com/glue/latest/dg/detect-PII.html) 
+  AWS Lake Formation permission-based credential vending 
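To catch some of these cases before a local run fails, a simple text scan of the job script can help. The sketch below is a heuristic only: `find_unsupported` is a hypothetical helper, and the identifier spellings in `UNSUPPORTED_MARKERS` are assumptions about how these features typically appear in job scripts, so extend or correct the table for your own code base. Features without an obvious identifier, such as job bookmarks, are not covered.

```python
import re

# Marker text (assumed spelling in job scripts) -> feature from the list above.
UNSUPPORTED_MARKERS = {
    "customJdbcDriverS3Path": "JDBC driver loading from Amazon S3",
    "FillMissingValues": "FillMissingValues transform",
    "FindMatches": "FindMatches transform",
    "glueparquet": "AWS Glue Parquet writer",
    "EvaluateDataQuality": "AWS Glue Data Quality",
}


def find_unsupported(source):
    """Return (line_number, feature) pairs, in line order, for markers found."""
    hits = []
    for number, line in enumerate(source.splitlines(), start=1):
        for marker, feature in UNSUPPORTED_MARKERS.items():
            # Whole-word match so that longer identifiers are not flagged.
            if re.search(rf"\b{re.escape(marker)}\b", line):
                hits.append((number, feature))
    return hits
```

Calling `find_unsupported(open(path).read())` on a script returns an empty list when none of the markers appear; a non-empty result is a prompt to review those lines, not proof that the script will fail.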

## Appendix: Adding JDBC drivers and Java libraries
<a name="develop-local-docker-image-appendix"></a>

To add a JDBC driver that is not currently available in the container, create a new directory under your workspace with the JAR files you need and mount that directory to `/opt/spark/jars/` in the `docker run` command. JAR files found under `/opt/spark/jars/` within the container are automatically added to the Spark classpath and are available for use during the job run. 

For example, use the following `docker run` command to add JDBC driver JAR files to the PySpark REPL shell. 

```
docker run -it --rm \
    -v ~/.aws:/home/hadoop/.aws \
    -v $WORKSPACE_LOCATION:/home/hadoop/workspace/ \
    -v $WORKSPACE_LOCATION/jars/:/opt/spark/jars/ \
    --workdir /home/hadoop/workspace \
    -e AWS_PROFILE=$PROFILE_NAME \
    --name glue5_jdbc \
    public.ecr.aws/glue/aws-glue-libs:5 \
    pyspark
```

As highlighted in **Considerations**, the `customJdbcDriverS3Path` connection option cannot be used to import a custom JDBC driver from Amazon S3 in AWS Glue container images.