Using Python libraries with AWS Glue
You can install additional Python modules and libraries for use with AWS Glue ETL. For AWS Glue 2.0 and later, AWS Glue uses the Python Package Installer (pip3) to install the additional modules used by AWS Glue ETL. AWS Glue provides multiple options for bringing additional Python modules into your AWS Glue job environment.
You can use the --additional-python-modules parameter to bring in new modules using zip files containing bundled Python wheels (also known as a "zip of wheels", available for AWS Glue 5.0 and later), individual Python wheel files, requirements files (requirements.txt, available for AWS Glue 5.0 and later), or a comma-separated list of Python modules. You can also use this parameter to change the version of the Python modules already provided in the AWS Glue environment (see Python modules already provided in AWS Glue for more details).
Installing additional Python modules with pip in AWS Glue 2.0 or later
AWS Glue uses the Python Package Installer (pip3) to install additional modules to be used by
AWS Glue ETL. You can use the --additional-python-modules parameter with a list of
comma-separated Python modules to add a new module or change the version of an existing module. You can install
built wheel artifacts either through a zip of wheels or a standalone wheel artifact by uploading the file to Amazon S3, then including the path to the Amazon S3 object
in your list of modules. For more information about setting job parameters, see Using job parameters in AWS Glue jobs.
You can pass additional options to pip3 with the --python-modules-installer-option parameter. For
example, you could pass --only-binary :all: to force pip to install only pre-built artifacts for the packages specified by
--additional-python-modules. For more examples, see Building Python modules from a wheel for Spark ETL workloads using AWS Glue 2.0.
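As a sketch of how these two job parameters fit together, the following helper builds the argument map you would pass to a job. The function name and structure are illustrative only, not part of any AWS API:

```python
def build_module_args(modules, installer_options=None):
    """Build AWS Glue job arguments for installing extra Python modules.

    modules: list of pip requirement specifiers or S3 wheel paths.
    installer_options: optional list of extra pip3 flags.
    """
    # --additional-python-modules takes a comma-separated list with no spaces.
    args = {"--additional-python-modules": ",".join(modules)}
    if installer_options:
        # Extra flags are forwarded to pip3 via --python-modules-installer-option.
        args["--python-modules-installer-option"] = " ".join(installer_options)
    return args

# Force pip to install only pre-built artifacts for the listed modules.
args = build_module_args(
    ["scikit-learn==1.5.2", "s3://amzn-s3-demo-bucket/path/to/package-1.0.0-py3-none-any.whl"],
    installer_options=["--only-binary", ":all:"],
)
```

You would pass the resulting dictionary as the job's default arguments (for example, via DefaultArguments when creating the job, as shown in the sections below).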
Best Practices for Python Dependency Management
For production workloads, AWS Glue recommends packaging all your Python dependencies as wheel files in a single zip artifact. This approach provides:
- Deterministic execution: Exact control over which package versions are installed
- Reliability: No dependency on external package repositories during job execution
- Performance: A single download operation instead of multiple network calls
- Offline installation: Works in private VPC environments without internet access
Important Considerations
Under the AWS shared responsibility model, you are responsible for:
- Security updates: Regularly updating packages to address security vulnerabilities
- Version compatibility: Ensuring packages are compatible with your AWS Glue version
- Testing: Validating that your packaged dependencies work correctly in the AWS Glue environment
If you have minimal dependencies, you may consider using individual wheel files instead.
AWS Glue 5.0 and above supports packaging multiple wheel files into a single zip artifact containing bundled Python wheels for more reliable and deterministic dependency management. To use this approach, create a zip file containing all your wheel dependencies and their transitive dependencies with the .gluewheels.zip suffix, upload it to Amazon S3, and reference it using the --additional-python-modules parameter. Be sure to add --no-index to the --python-modules-installer-option job parameter. With this configuration, the zip of wheels file essentially acts as a local index for pip to resolve dependencies from at runtime. This eliminates dependencies on external package repositories like PyPI during job execution, providing greater stability and consistency for production workloads. For example:
--additional-python-modules s3://amzn-s3-demo-bucket/path/to/zip-of-wheels-1.0.0.gluewheels.zip --python-modules-installer-option --no-index
For instructions on how to create a zip of wheels file, see Appendix A: Creating a Zip of Wheels Artifact.
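As an illustrative sketch (the helper below is hypothetical, not an AWS API), you can validate the artifact name and assemble both required job parameters in one place. Note the mandatory .gluewheels.zip suffix and the --no-index installer option from the example above:

```python
def gluewheels_job_args(s3_path):
    """Build job arguments for a zip-of-wheels artifact (AWS Glue 5.0 and later)."""
    # The artifact must use the .gluewheels.zip suffix so AWS Glue treats it
    # as a bundle of wheels rather than a single package.
    if not s3_path.endswith(".gluewheels.zip"):
        raise ValueError("zip-of-wheels artifacts must use the .gluewheels.zip suffix")
    return {
        "--additional-python-modules": s3_path,
        # --no-index makes pip resolve packages only from the local zip, never from PyPI.
        "--python-modules-installer-option": "--no-index",
    }

args = gluewheels_job_args("s3://amzn-s3-demo-bucket/path/to/zip-of-wheels-1.0.0.gluewheels.zip")
```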
AWS Glue supports installing custom Python packages using wheel (.whl) files stored in Amazon S3. To include wheel files in your AWS Glue jobs,
provide a comma-separated list of the Amazon S3 paths to your wheel files in the --additional-python-modules job parameter.
For example:
--additional-python-modules s3://amzn-s3-demo-bucket/path/to/package-1.0.0-py3-none-any.whl,s3://your-bucket/path/to/another-package-2.1.0-cp311-cp311-linux_x86_64.whl
This approach is also useful when you need custom distributions, or packages with native dependencies that are pre-compiled for the correct
operating system. For more examples, see
Building Python modules from a wheel for Spark ETL workloads using AWS Glue 2.0.
In AWS Glue 5.0 and later, you can provide the de facto standard requirements.txt file to manage Python library dependencies.
To do so, provide the following two job parameters:
- Key: --python-modules-installer-option
  Value: -r
- Key: --additional-python-modules
  Value: s3://path_to_requirements.txt
AWS Glue 5.0 nodes initially load the Python libraries specified in requirements.txt.
Here is a sample requirements.txt:
awswrangler==3.9.1
elasticsearch==8.15.1
PyAthena==3.9.0
PyMySQL==1.1.1
PyYAML==6.0.2
pyodbc==5.2.0
pyorc==0.9.0
redshift-connector==2.1.3
scipy==1.14.1
scikit-learn==1.5.2
SQLAlchemy==2.0.36
Important
Use this option with caution, especially in production workloads. Pulling dependencies from PyPI at runtime is highly risky because you cannot be sure which artifact pip resolves to. Using unpinned library versions is especially risky because pip pulls the latest version of each Python module, which can introduce breaking changes or bring in incompatible modules. This can cause the job to fail when Python installation fails in the AWS Glue job environment. While pinning library versions increases stability, pip resolution is still not fully deterministic, so similar issues can arise. As a best practice, AWS Glue recommends using frozen artifacts such as a zip of wheels or individual wheel files (see (Recommended) Installing additional Python libraries in AWS Glue 5.0 or above using Zip of Wheels for more details).
Important
If you do not pin the versions of your transitive dependencies, a primary dependency may pull incompatible transitive dependency versions. As a best practice, all library versions should be pinned for increased consistency in AWS Glue jobs. Even better, AWS Glue recommends packaging your dependencies into a zip of wheels file to ensure maximum consistency and reliability for your production workloads.
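To guard against the unpinned-dependency risks described in these notes, a quick local check can flag requirement lines that are not exactly pinned. This is an illustrative sketch, not an AWS tool, and it only recognizes the simple name==version form used in the sample above:

```python
import re

# Matches "name==version" (with optional extras), the only form treated as pinned here.
PINNED = re.compile(r"^[A-Za-z0-9._-]+(\[[A-Za-z0-9,._-]+\])?==[A-Za-z0-9.*+!_-]+$")

def unpinned_requirements(text):
    """Return requirement lines that are not pinned with '=='."""
    bad = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        if not PINNED.match(line):
            bad.append(line)
    return bad

sample = "awswrangler==3.9.1\nscipy>=1.14\nscikit-learn"
```

Running such a check in CI before uploading requirements.txt to Amazon S3 catches range specifiers (>=) and bare names before they can cause non-deterministic installs.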
To update or add a new Python module, pass the --additional-python-modules parameter with a comma-separated list of
Python modules as its value. For example, to update or add the scikit-learn module, use the following key/value pair: "--additional-python-modules",
"scikit-learn==0.21.3". You have two options for directly configuring the Python modules.
- Pinned Python modules:
  "--additional-python-modules", "scikit-learn==0.21.3,ephem==4.1.6"
- Unpinned Python modules (not recommended for production workloads):
  "--additional-python-modules", "scikit-learn>=0.20.0,ephem>=4.0.0"
  or
  "--additional-python-modules", "scikit-learn,ephem"
Important
Use this option with caution, especially in production workloads. Pulling dependencies from PyPI at runtime is highly risky because you cannot be sure which artifact pip resolves to. Using unpinned library versions is especially risky because pip pulls the latest version of each Python module, which can introduce breaking changes or bring in incompatible modules. This can cause the job to fail when Python installation fails in the AWS Glue job environment. While pinning library versions increases stability, pip resolution is still not fully deterministic, so similar issues can arise. As a best practice, AWS Glue recommends using frozen artifacts such as a zip of wheels or individual wheel files (see (Recommended) Installing additional Python libraries in AWS Glue 5.0 or above using Zip of Wheels for more details).
Important
If you do not pin the versions of your transitive dependencies, a primary dependency may pull incompatible transitive dependency versions. As a best practice, all library versions should be pinned for increased consistency in AWS Glue jobs. Even better, AWS Glue recommends packaging your dependencies into a zip of wheels file to ensure maximum consistency and reliability for your production workloads.
Including Python files with PySpark native features
AWS Glue uses PySpark to include Python files in AWS Glue ETL jobs. Where available, you should prefer
--additional-python-modules to manage your dependencies. You can use the
--extra-py-files job parameter to include Python files. Dependencies must be hosted in Amazon S3, and the
argument value should be a comma-delimited list of Amazon S3 paths with no spaces. This functionality behaves like the
Python dependency management you would use with Spark. For more information on Python dependency management in
Spark, see Using PySpark Native Features. --extra-py-files is useful
in cases where your additional code is not packaged, or when you are migrating a Spark program with an existing
toolchain for managing dependencies. For your dependency tooling to be maintainable, you will
have to bundle your dependencies before submitting.
Programming scripts that use visual transforms
When you create an AWS Glue job using the AWS Glue Studio visual interface, you can transform your data with managed data transform nodes and custom visual transforms. For more information about managed data transform nodes, see Transform data with AWS Glue managed transforms. For more information about custom visual transforms, see Transform data with custom visual transforms. Scripts using visual transforms can only be generated when your job Language is set to use Python.
When generating an AWS Glue job using visual transforms, AWS Glue Studio will include these transforms in the runtime
environment using the --extra-py-files parameter in the job configuration. For more information about
job parameters, see Using job parameters in AWS Glue jobs. When making changes to a generated
script or runtime environment, you will need to preserve this job configuration for your script to run
successfully.
Zipping libraries for inclusion
Unless a library is contained in a single .py file, it should be
packaged in a .zip archive. The package directory should be at the root
of the archive, and must contain an __init__.py file for the package.
Python will then be able to import the package in the normal way.
If your library only consists of a single Python module in one .py
file, you do not need to place it in a .zip file.
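The layout described above can be produced with a few lines of Python. This sketch (package and file names are illustrative) places the package directory, with its __init__.py, at the root of the archive:

```python
import os
import tempfile
import zipfile

# Build a throwaway package layout: mypkg/__init__.py and mypkg/helpers.py.
root = tempfile.mkdtemp()
pkg = os.path.join(root, "mypkg")
os.makedirs(pkg)
open(os.path.join(pkg, "__init__.py"), "w").close()
with open(os.path.join(pkg, "helpers.py"), "w") as f:
    f.write("def greet():\n    return 'hello'\n")

# The package directory must sit at the root of the archive so that
# Python can import it as `import mypkg` once the zip is on the path.
archive = os.path.join(root, "mypkg.zip")
with zipfile.ZipFile(archive, "w") as zf:
    for name in ("__init__.py", "helpers.py"):
        zf.write(os.path.join(pkg, name), arcname=f"mypkg/{name}")

names = zipfile.ZipFile(archive).namelist()
```

The resulting mypkg.zip is the kind of artifact you would upload to Amazon S3 and reference with --extra-py-files.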
Loading Python libraries in AWS Glue Studio notebooks
To specify Python libraries in AWS Glue Studio notebooks, see Installing additional Python modules.
Loading Python libraries in a development endpoint in AWS Glue 0.9/1.0
If you are using different library sets for different ETL scripts,
you can either set up a separate development endpoint for each set,
or you can overwrite the library .zip file(s) that your
development endpoint loads every time you switch scripts.
You can use the console to specify one or more library .zip files for
a development endpoint when you create it. After assigning a name and
an IAM role, choose Script Libraries and job parameters
(optional) and enter the full Amazon S3 path to your library
.zip file in the Python library path box.
For example:
s3://bucket/prefix/site-packages.zip
If you want, you can specify multiple full paths to files, separating them with commas but no spaces, like this:
s3://bucket/prefix/lib_A.zip,s3://bucket_B/prefix/lib_X.zip
If you update these .zip files later, you can use the console
to re-import them into your development endpoint. Navigate to the developer
endpoint in question, check the box beside it, and choose Update
ETL libraries from the Action menu.
In a similar way, you can specify library files using the AWS Glue APIs.
When you create a development endpoint by calling CreateDevEndpoint action (Python: create_dev_endpoint),
you can specify one or more full paths to libraries in the ExtraPythonLibsS3Path
parameter, in a call that looks like this:
dep = glue.create_dev_endpoint(
    EndpointName="testDevEndpoint",
    RoleArn="arn:aws:iam::123456789012",
    SecurityGroupIds=["sg-7f5ad1ff"],
    SubnetId="subnet-c12fdba4",
    PublicKey="ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQCtp04H/y...",
    NumberOfNodes=3,
    ExtraPythonLibsS3Path="s3://bucket/prefix/lib_A.zip,s3://bucket_B/prefix/lib_X.zip")
When you update a development endpoint, you can also update the libraries it loads
using a DevEndpointCustomLibraries object
and setting the UpdateEtlLibraries parameter to True
when calling UpdateDevEndpoint (update_dev_endpoint).
Using Python libraries in a job or JobRun
When you are creating a new Job on the console, you can specify one or more library .zip files by choosing Script Libraries and job parameters (optional) and entering the full Amazon S3 library path(s) in the same way you would when creating a development endpoint:
s3://bucket/prefix/lib_A.zip,s3://bucket_B/prefix/lib_X.zip
If you are calling CreateJob (create_job),
you can specify one or more full paths to default libraries using the --extra-py-files
default parameter, like this:
job = glue.create_job(Name='sampleJob',
Role='Glue_DefaultRole',
Command={'Name': 'glueetl',
'ScriptLocation': 's3://my_script_bucket/scripts/my_etl_script.py'},
DefaultArguments={'--extra-py-files': 's3://bucket/prefix/lib_A.zip,s3://bucket_B/prefix/lib_X.zip'})
Then when you are starting a JobRun, you can override the default library setting with a different one:
runId = glue.start_job_run(JobName='sampleJob',
Arguments={'--extra-py-files': 's3://bucket/prefix/lib_B.zip'})
Proactively analyze Python dependencies
To proactively identify potential dependency issues before deploying to AWS Glue, you can use the dependency analysis tool to validate your Python packages against your target AWS Glue environment.
AWS provides an open-source Python dependency analyzer tool specifically designed for AWS Glue environments. This tool is available in the AWS Glue samples repository and can be used locally to validate your dependencies before deployment.
This analysis helps ensure your dependencies follow the recommended practice of pinning all library versions for consistent production deployments.
For more details, see the tool's README.
The AWS Glue Python Dependency Analyzer helps identify unpinned dependencies and version conflicts by simulating pip installation with platform-specific constraints that match your target AWS Glue environment.
# Analyze a single Glue job
python glue_dependency_analyzer.py -j my-glue-job

# Analyze multiple jobs with specific AWS configuration
python glue_dependency_analyzer.py -j job1 -j job2 --aws-profile production --aws-region us-west-2
The tool will flag:
- Unpinned dependencies that could install different versions across job runs
- Version conflicts between packages
- Dependencies not available for your target AWS Glue environment
Amazon Q Developer is a generative artificial intelligence (AI) powered conversational assistant that can help you understand, build, extend, and operate AWS applications. You can download it by following the instructions in the Getting started guide for Amazon Q.
Amazon Q Developer can be used to analyze and fix job failures caused by Python dependency issues. We suggest using the following prompt, replacing the <Job-Name> placeholder with the name of your AWS Glue job.
I have an AWS Glue job named <Job-Name> that has failed due to Python module installation conflicts. Please assist in diagnosing and resolving this issue using the following systematic approach. Proceed once sufficient information is available.

Objective: Implement a fix that addresses the root cause module while minimizing disruption to the existing working environment.

Step 1: Root Cause Analysis
- Retrieve the most recent failed job run ID for the specified Glue job
- Extract error logs from CloudWatch Logs using the job run ID as a log stream prefix
- Analyze the logs to identify:
  - The recently added or modified Python module that triggered the dependency conflict
  - The specific dependency chain causing the installation failure
  - Version compatibility conflicts between required and existing modules

Step 2: Baseline Configuration Identification
- Locate the last successful job run ID prior to the dependency failure
- Document the Python module versions that were functioning correctly in that baseline run
- Establish the compatible version constraints for conflicting dependencies

Step 3: Targeted Resolution Implementation
- Apply pinning by updating the job's additional_python_modules parameter
- Pin only the root cause module and its directly conflicting dependencies to compatible versions, and do not remove Python modules unless necessary
- Preserve flexibility for non-conflicting modules by avoiding unnecessary version constraints
- Deploy the configuration changes with minimal changes to the existing configuration and execute a validation test run. Do not change the Glue versions.
Implementation Example:
Scenario: Recently added pandas==2.0.0 to additional_python_modules
Error: numpy version conflict (pandas 2.0.0 requires numpy>=1.21, but existing job code requires numpy<1.20)
Resolution: Update additional_python_modules to "pandas==1.5.3,numpy==1.19.5"
Rationale: Use pandas 1.5.3 (compatible with numpy 1.19.5) and pin numpy to the last known working version
Expected Outcome: Restore job functionality with minimal configuration changes while maintaining system stability.
The prompt instructs Q to:
- Fetch the latest failed job run ID
- Find associated logs and details
- Find successful job runs to detect any changed Python packages
- Make any configuration fixes and trigger another test run
Python modules already provided in AWS Glue
To change the version of these provided modules, provide new versions with the --additional-python-modules job parameter.
Appendix A: Creating a Zip of Wheels Artifact
We demonstrate by example how to create a zip of wheels artifact. The example shown downloads the packages cryptography and scipy into a zip of wheels artifact and copies the zip of wheels to an Amazon S3 location.
1. You must run the commands to create the zip of wheels in an Amazon Linux environment similar to Glue's environment. See Appendix B: AWS Glue environment details. Glue 5.1 uses AL2023 with Python version 3.11. Create a Dockerfile that will build this environment:

   FROM --platform=linux/amd64 public.ecr.aws/amazonlinux/amazonlinux:2023-minimal

   # Install Python 3.11, pip, and zip utility
   RUN dnf install -y python3.11 pip zip && \
       dnf clean all

   WORKDIR /build

2. Create a requirements.txt file:

   cryptography
   scipy

3. Build and spin up the Docker container:

   # Build docker image
   docker build --platform linux/amd64 -t glue-wheel-builder .

   # Spin up container
   docker run --platform linux/amd64 -v $(pwd)/requirements.txt:/input/requirements.txt:ro -v $(pwd):/output -it glue-wheel-builder bash

4. Run the following commands in the Docker container:

   # Create a directory for the wheels
   mkdir wheels

   # Copy requirements.txt into the wheels directory
   cp /input/requirements.txt wheels/

   # Download the wheels with the correct platform and Python version
   pip3 download \
       -r wheels/requirements.txt \
       --dest wheels/ \
       --platform manylinux2014_x86_64 \
       --python-version 311 \
       --only-binary=:all:

   # Package the wheels into a zip archive with the .gluewheels.zip suffix
   zip -r mylibraries-1.0.0.gluewheels.zip wheels/

   # Copy the zip to the output directory
   cp mylibraries-1.0.0.gluewheels.zip /output/

   # Exit the container
   exit

5. Upload the zip of wheels to an Amazon S3 location:

   aws s3 cp mylibraries-1.0.0.gluewheels.zip s3://amzn-s3-demo-bucket/example-prefix/

6. Optional cleanup:

   rm mylibraries-1.0.0.gluewheels.zip
   rm Dockerfile
   rm requirements.txt

7. Run the Glue job with the following job arguments:

   --additional-python-modules s3://amzn-s3-demo-bucket/example-prefix/mylibraries-1.0.0.gluewheels.zip --python-modules-installer-option --no-index
Appendix B: AWS Glue environment details
| AWS Glue version | Python version | Base image | glibc version | Compatible platform tags |
|---|---|---|---|---|
| 5.1 | 3.11 | Amazon Linux 2023 (AL2023) | 2.34 | manylinux_2_34_x86_64, manylinux_2_28_x86_64, manylinux2014_x86_64 |
| 5.0 | 3.11 | Amazon Linux 2023 (AL2023) | 2.34 | manylinux_2_34_x86_64, manylinux_2_28_x86_64, manylinux2014_x86_64 |
| 4.0 | 3.10 | Amazon Linux 2 (AL2) | 2.26 | manylinux2014_x86_64 |
| 3.0 | 3.7 | Amazon Linux 2 (AL2) | 2.26 | manylinux2014_x86_64 |
| 2.0 | 3.7 | Amazon Linux AMI (AL1) | 2.17 | manylinux2014_x86_64 |
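The table above can drive the pip3 download flags used in Appendix A. The mapping below is a hypothetical transcription of the table into code (manylinux2014_x86_64 is chosen because it is the tag compatible with every listed version); the helper name is illustrative:

```python
# Hypothetical lookup transcribed from the table above.
GLUE_ENVIRONMENTS = {
    "5.1": {"python_version": "311", "platform": "manylinux2014_x86_64"},
    "5.0": {"python_version": "311", "platform": "manylinux2014_x86_64"},
    "4.0": {"python_version": "310", "platform": "manylinux2014_x86_64"},
    "3.0": {"python_version": "37", "platform": "manylinux2014_x86_64"},
    "2.0": {"python_version": "37", "platform": "manylinux2014_x86_64"},
}

def pip_download_flags(glue_version):
    """Return pip3 download flags matching a target AWS Glue version."""
    env = GLUE_ENVIRONMENTS[glue_version]
    return [
        "--platform", env["platform"],
        "--python-version", env["python_version"],
        "--only-binary=:all:",
    ]

flags = pip_download_flags("5.0")
```

Selecting the flags from such a table keeps the wheel-download step in Appendix A in sync with the Glue version your job actually targets.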
AWS Glue does not support compiling native code in the job environment. However, AWS Glue jobs run within an Amazon-managed Linux environment, so you may be able to provide your native dependencies in compiled form through a Python wheel file. Refer to the table above for AWS Glue version compatibility details.
Important
Using incompatible dependencies can result in runtime issues, particularly for libraries with native extensions that must match the target environment's architecture and system libraries. Each AWS Glue version runs on a specific Python version with pre-installed libraries and system configurations.