@Stability(Experimental) package software.amazon.awscdk.services.glue.alpha

AWS Glue Construct Library

---

cdk-constructs: Experimental

The APIs of higher level constructs in this module are experimental and under active development. They are subject to non-backward compatible changes or removal in any future version. These are not subject to the Semantic Versioning model and breaking changes will be announced in the release notes. This means that while you may use them, you may need to update your source code when upgrading to a newer version of this package.

This module is part of the AWS Cloud Development Kit project.

README

AWS Glue is a serverless data integration service that makes it easier to discover, prepare, move, and integrate data from multiple sources for analytics, machine learning (ML), and application development.

The Glue L2 construct has convenience methods working backwards from common use cases and sets required parameters to defaults that align with recommended best practices for each job type. It also provides customers with a balance between flexibility, via optional parameter overrides, and opinionated interfaces that discourage anti-patterns, resulting in reduced time to develop and deploy new resources.

References

Glue Launch Announcement
Glue Documentation
Glue L1 (CloudFormation) Constructs
Prior version of the @aws-cdk/aws-glue-alpha module

Create a Glue Job

A Job encapsulates a script that connects to data sources, processes them, and then writes output to a data target. There are four types of Glue Jobs: Spark (ETL and Streaming), Python Shell, Ray, and Flex Jobs. Most of the required parameters for these jobs are common across all types, but there are a few differences depending on the languages supported and features provided by each type. For all job types, the L2 defaults to AWS best practice recommendations, such as:

Use of Secrets Manager for Connection JDBC strings
Glue job autoscaling
Default parameter values for Glue job creation

This iteration of the L2 construct introduces breaking changes to the existing glue-alpha-module, but these changes streamline the developer experience, introduce new constants for defaults, and replace synth-time validations with interface contracts that enforce the parameter combinations Glue supports. As an opinionated construct, the Glue L2 construct does not allow developers to create resources that use non-current versions of Glue or deprecated language dependencies (e.g. deprecated versions of Python). As always, L1s allow you to specify a wider range of parameters if you need or want to use alternative configurations.

Optional and required parameters for each job are enforced via interface rather than validation; see Glue's public documentation for more granular details.

Spark Jobs

ETL Jobs

ETL jobs support the pySpark and Scala languages, for which there are separate but similar constructors. ETL jobs default to the G2 worker type, but you can override this default with other supported worker type values (G1, G2, G4 and G8). ETL jobs default to Glue version 4.0, which you can override to 3.0. The following ETL features are enabled by default: --enable-metrics, --enable-spark-ui, --enable-continuous-cloudwatch-log. You can find more details about version, worker type and other features in Glue's public documentation.

Reference the pyspark-etl-jobs.test.ts and scalaspark-etl-jobs.test.ts unit tests for examples of required-only and optional job parameters when creating these types of jobs.

For the sake of brevity, examples are shown using the pySpark job variety.

Example with only required parameters:

 import software.amazon.awscdk.*;
 import software.amazon.awscdk.services.iam.*;
 Stack stack;
 IRole role;
 Code script;
 
 PySparkEtlJob.Builder.create(stack, "PySparkETLJob")
         .role(role)
         .script(script)
         .jobName("PySparkETLJob")
         .build();

Example with optional override parameters:

 import software.amazon.awscdk.*;
 import software.amazon.awscdk.services.iam.*;
 Stack stack;
 IRole role;
 Code script;
 
 PySparkEtlJob.Builder.create(stack, "PySparkETLJob")
         .jobName("PySparkETLJobCustomName")
         .description("This is a description")
         .role(role)
         .script(script)
         .glueVersion(GlueVersion.V5_1)
         .continuousLogging(ContinuousLoggingProps.builder().enabled(false).build())
         .workerType(WorkerType.G_2X)
         .maxConcurrentRuns(100)
         .timeout(Duration.hours(2))
         .connections(List.of(Connection.fromConnectionName(stack, "Connection", "connectionName")))
         .securityConfiguration(SecurityConfiguration.fromSecurityConfigurationName(stack, "SecurityConfig", "securityConfigName"))
         .tags(Map.of(
                 "FirstTagName", "FirstTagValue",
                 "SecondTagName", "SecondTagValue",
                 "XTagName", "XTagValue"))
         .numberOfWorkers(2)
         .maxRetries(2)
         .build();

Streaming Jobs

Streaming jobs are similar to ETL jobs, except that they perform ETL on data streams using the Apache Spark Structured Streaming framework. Some Spark job features are not available to Streaming ETL jobs. They support the Scala and pySpark languages. PySpark streaming jobs default to Python 3.9, which you can override with any non-deprecated version of Python. Streaming jobs default to the G2 worker type and Glue 4.0, both of which you can override. The following best practice features are enabled by default: --enable-metrics, --enable-spark-ui, --enable-continuous-cloudwatch-log.

Reference the pyspark-streaming-jobs.test.ts and scalaspark-streaming-jobs.test.ts unit tests for examples of required-only and optional job parameters when creating these types of jobs.

Example with only required parameters:

 import software.amazon.awscdk.*;
 import software.amazon.awscdk.services.iam.*;
 Stack stack;
 IRole role;
 Code script;
 
 PySparkStreamingJob.Builder.create(stack, "ImportedJob").role(role).script(script).build();

Example with optional override parameters:

 import software.amazon.awscdk.*;
 import software.amazon.awscdk.services.iam.*;
 Stack stack;
 IRole role;
 Code script;
 
 PySparkStreamingJob.Builder.create(stack, "PySparkStreamingJob")
         .jobName("PySparkStreamingJobCustomName")
         .description("This is a description")
         .role(role)
         .script(script)
         .glueVersion(GlueVersion.V5_1)
         .continuousLogging(ContinuousLoggingProps.builder().enabled(false).build())
         .workerType(WorkerType.G_2X)
         .maxConcurrentRuns(100)
         .timeout(Duration.hours(2))
         .connections(List.of(Connection.fromConnectionName(stack, "Connection", "connectionName")))
         .securityConfiguration(SecurityConfiguration.fromSecurityConfigurationName(stack, "SecurityConfig", "securityConfigName"))
         .tags(Map.of(
                 "FirstTagName", "FirstTagValue",
                 "SecondTagName", "SecondTagValue",
                 "XTagName", "XTagValue"))
         .numberOfWorkers(2)
         .maxRetries(2)
         .build();

Flex Jobs

The flexible execution class is appropriate for non-urgent jobs such as pre-production jobs, testing, and one-time data loads. Flexible jobs default to Glue version 3.0 and worker type G_2X. The following best practice features are enabled by default: --enable-metrics, --enable-spark-ui, --enable-continuous-cloudwatch-log.

Reference the pyspark-flex-etl-jobs.test.ts and scalaspark-flex-etl-jobs.test.ts unit tests for examples of required-only and optional job parameters when creating these types of jobs.

Example with only required parameters:

 import software.amazon.awscdk.*;
 import software.amazon.awscdk.services.iam.*;
 Stack stack;
 IRole role;
 Code script;
 
 PySparkFlexEtlJob.Builder.create(stack, "ImportedJob").role(role).script(script).build();

Example with optional override parameters:

 import software.amazon.awscdk.*;
 import software.amazon.awscdk.services.iam.*;
 Stack stack;
 IRole role;
 Code script;
 
 PySparkFlexEtlJob.Builder.create(stack, "PySparkFlexEtlJob")
         .jobName("PySparkFlexEtlJob")
         .description("This is a description")
         .role(role)
         .script(script)
         .glueVersion(GlueVersion.V5_1)
         .continuousLogging(ContinuousLoggingProps.builder().enabled(false).build())
         .workerType(WorkerType.G_2X)
         .maxConcurrentRuns(100)
         .timeout(Duration.hours(2))
         .connections(List.of(Connection.fromConnectionName(stack, "Connection", "connectionName")))
         .securityConfiguration(SecurityConfiguration.fromSecurityConfigurationName(stack, "SecurityConfig", "securityConfigName"))
         .tags(Map.of(
                 "FirstTagName", "FirstTagValue",
                 "SecondTagName", "SecondTagValue",
                 "XTagName", "XTagValue"))
         .numberOfWorkers(2)
         .maxRetries(2)
         .build();

Python Shell Jobs

Python shell jobs support a Python version that depends on the AWS Glue version you use. These can be used to schedule and run tasks that don't require an Apache Spark environment. Python shell jobs default to Python 3.9 and a MaxCapacity of 0.0625. Python 3.9 supports pre-loaded analytics libraries using the library-set=analytics flag, which is enabled by default.

Reference the pyspark-shell-job.test.ts unit tests for examples of required-only and optional job parameters when creating these types of jobs.

Example with only required parameters:

 import software.amazon.awscdk.*;
 import software.amazon.awscdk.services.iam.*;
 Stack stack;
 IRole role;
 Code script;
 
 PythonShellJob.Builder.create(stack, "ImportedJob").role(role).script(script).build();

Example with optional override parameters:

 import software.amazon.awscdk.*;
 import software.amazon.awscdk.services.iam.*;
 Stack stack;
 IRole role;
 Code script;
 
 PythonShellJob.Builder.create(stack, "PythonShellJob")
         .jobName("PythonShellJobCustomName")
         .description("This is a description")
         .pythonVersion(PythonVersion.TWO)
         .maxCapacity(MaxCapacity.DPU_1)
         .role(role)
         .script(script)
         .glueVersion(GlueVersion.V2_0)
         .continuousLogging(ContinuousLoggingProps.builder().enabled(false).build())
         .workerType(WorkerType.G_2X)
         .maxConcurrentRuns(100)
         .timeout(Duration.hours(2))
         .connections(List.of(Connection.fromConnectionName(stack, "Connection", "connectionName")))
         .securityConfiguration(SecurityConfiguration.fromSecurityConfigurationName(stack, "SecurityConfig", "securityConfigName"))
         .tags(Map.of(
                 "FirstTagName", "FirstTagValue",
                 "SecondTagName", "SecondTagValue",
                 "XTagName", "XTagValue"))
         .numberOfWorkers(2)
         .maxRetries(2)
         .build();

Ray Jobs

Glue Ray jobs use worker type Z.2X and Glue version 4.0. These cannot be overridden because they are the only configuration that Glue Ray jobs currently support. The runtime defaults to Ray 2.4 and the minimum number of workers defaults to 3.

Reference the ray-job.test.ts unit tests for examples of required-only and optional job parameters when creating these types of jobs.

Example with only required parameters:

 import software.amazon.awscdk.*;
 import software.amazon.awscdk.services.iam.*;
 Stack stack;
 IRole role;
 Code script;
 
 RayJob.Builder.create(stack, "ImportedJob").role(role).script(script).build();

Example with optional override parameters:

 import software.amazon.awscdk.*;
 import software.amazon.awscdk.services.iam.*;
 Stack stack;
 IRole role;
 Code script;
 
 RayJob.Builder.create(stack, "ImportedJob")
         .role(role)
         .script(script)
         .jobName("RayCustomJobName")
         .description("This is a description")
         .workerType(WorkerType.Z_2X)
         .numberOfWorkers(5)
         .runtime(Runtime.RAY_TWO_FOUR)
         .maxRetries(3)
         .maxConcurrentRuns(100)
         .timeout(Duration.hours(2))
         .connections(List.of(Connection.fromConnectionName(stack, "Connection", "connectionName")))
         .securityConfiguration(SecurityConfiguration.fromSecurityConfigurationName(stack, "SecurityConfig", "securityConfigName"))
         .tags(Map.of(
                 "FirstTagName", "FirstTagValue",
                 "SecondTagName", "SecondTagValue",
                 "XTagName", "XTagValue"))
         .build();

Metrics Control

By default, Glue jobs enable CloudWatch metrics (--enable-metrics) and observability metrics (--enable-observability-metrics) for monitoring and debugging. You can disable these metrics to reduce CloudWatch costs:

 import software.amazon.awscdk.*;
 import software.amazon.awscdk.services.iam.*;
 Stack stack;
 IRole role;
 Code script;
 
 
 // Disable both metrics for cost optimization
 PySparkEtlJob.Builder.create(stack, "CostOptimizedJob")
         .role(role)
         .script(script)
         .enableMetrics(false)
         .enableObservabilityMetrics(false)
         .build();
 
 // Selective control - keep observability, disable profiling
 PySparkEtlJob.Builder.create(stack, "SelectiveJob")
         .role(role)
         .script(script)
         .enableMetrics(false)
         .build();

This feature is available for all Spark job types (ETL, Streaming, Flex) and Ray jobs.

Enable Job Run Queuing

AWS Glue job queuing monitors your account level quotas and limits. If quotas or limits are insufficient to start a Glue job run, AWS Glue will automatically queue the job and wait for limits to free up. Once limits become available, AWS Glue will retry the job run. Glue jobs will queue for limits like max concurrent job runs per account, max concurrent Data Processing Units (DPU), and resource unavailable due to IP address exhaustion in Amazon Virtual Private Cloud (Amazon VPC).

Enable job run queuing by setting the jobRunQueuingEnabled property to true.

 import software.amazon.awscdk.*;
 import software.amazon.awscdk.services.iam.*;
 Stack stack;
 IRole role;
 Code script;
 
 PySparkEtlJob.Builder.create(stack, "PySparkETLJob")
         .role(role)
         .script(script)
         .jobName("PySparkETLJob")
         .jobRunQueuingEnabled(true)
         .build();

Uploading scripts from the CDK app repository to S3

Similar to other L2 constructs, the Glue L2 automates uploading / updating scripts to S3 via an optional fromAsset parameter pointing to a script in the local file structure. You provide the existing S3 bucket and path to which you'd like the script to be uploaded.

Reference the unit tests for examples of repo and S3 code targets.
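
As a minimal sketch of the two script sources described above, the following assumes the module's Code.fromAsset and Code.fromBucket factory methods. The class name ScriptSourceSketch, the local path, the bucket name, and the object key are hypothetical placeholders, and the snippet needs the CDK libraries on the classpath.

```java
import software.amazon.awscdk.Stack;
import software.amazon.awscdk.services.glue.alpha.Code;
import software.amazon.awscdk.services.glue.alpha.PySparkEtlJob;
import software.amazon.awscdk.services.iam.IRole;
import software.amazon.awscdk.services.s3.Bucket;
import software.amazon.awscdk.services.s3.IBucket;

// Hypothetical helper class; only the Code factory methods come from the module.
final class ScriptSourceSketch {
    private ScriptSourceSketch() {}

    // Script stored next to the CDK app and uploaded as part of deployment.
    static PySparkEtlJob fromLocalFile(Stack stack, IRole role) {
        Code script = Code.fromAsset("jobs/etl_job.py"); // hypothetical local path
        return PySparkEtlJob.Builder.create(stack, "JobFromAsset")
                .role(role)
                .script(script)
                .build();
    }

    // Script that already exists as an object in an S3 bucket.
    static PySparkEtlJob fromExistingObject(Stack stack, IRole role) {
        IBucket bucket = Bucket.fromBucketName(stack, "ScriptBucket", "my-script-bucket"); // hypothetical bucket
        Code script = Code.fromBucket(bucket, "scripts/etl_job.py"); // hypothetical key
        return PySparkEtlJob.Builder.create(stack, "JobFromBucket")
                .role(role)
                .script(script)
                .build();
    }
}
```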

Workflow Triggers

You can use Glue workflows to create and visualize complex extract, transform, and load (ETL) activities involving multiple crawlers, jobs, and triggers. Standalone triggers are an anti-pattern, so you must create triggers from within a workflow using the L2 construct.

Within a workflow object, there are functions to create different types of triggers with actions and predicates. You then add those triggers to jobs.

StartOnCreation defaults to true for all trigger types, but you can override it if you prefer for your trigger not to start on creation.

Reference the workflow-triggers.test.ts unit tests for examples of creating workflows and triggers.

1. On-Demand Triggers

On-demand triggers can start Glue jobs or crawlers. This construct provides convenience functions to create on-demand crawler or job triggers. The constructor takes an optional description parameter and uses conditional types to accept the job or crawler objects directly, so you do not have to build the actions list yourself.

2. Scheduled Triggers

You can create scheduled triggers using cron expressions. This construct provides daily, weekly, and monthly convenience functions, as well as a custom function that allows you to create your own custom timing using the existing event Schedule class without having to build your own cron expressions. The L2 extracts the expression that Glue requires from the Schedule object. The constructor takes an optional description and a list of jobs or crawlers as actions.
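
To make the cron format concrete, here is a minimal sketch in plain Java of the kind of expression a daily, weekly, or monthly schedule might resolve to. The class GlueCronSketch and its methods are hypothetical and are not part of the construct library. The choice of 00:00 UTC, Sunday, and the first day of the month is an assumption made for illustration. The six-field layout (minutes, hours, day-of-month, month, day-of-week, year) is the one AWS uses for cron-based schedules.

```java
// Hypothetical helper: illustrates the six-field AWS cron layout only.
final class GlueCronSketch {
    private GlueCronSketch() {}

    // Field order: minutes hours day-of-month month day-of-week year.
    static String cron(String minute, String hour, String dayOfMonth,
                       String month, String dayOfWeek, String year) {
        return String.format("cron(%s %s %s %s %s %s)",
                minute, hour, dayOfMonth, month, dayOfWeek, year);
    }

    // Every day at 00:00 UTC (the time of day is an assumption).
    static String daily() {
        return cron("0", "0", "*", "*", "?", "*");
    }

    // Every Sunday at 00:00 UTC; day-of-month is '?' because day-of-week is set.
    static String weekly() {
        return cron("0", "0", "?", "*", "SUN", "*");
    }

    // First day of every month at 00:00 UTC.
    static String monthly() {
        return cron("0", "0", "1", "*", "?", "*");
    }
}
```

With the convenience functions you would not normally write these strings by hand; the sketch only shows what a schedule boils down to.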

3. Notify Event Triggers

There are two types of notify event triggers: batching and non-batching. For batching triggers, you must specify BatchSize. For non-batching triggers, BatchSize defaults to 1. For both triggers, BatchWindow defaults to 900 seconds, but you can override the window to align with your workload's requirements.
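
The defaulting rules above can be sketched in plain Java as follows. NotifyEventSettings is a hypothetical value class written for illustration and is not the library's API; it only encodes the rules stated in the text (BatchSize required for batching triggers, 1 for non-batching triggers, BatchWindow 900 seconds unless overridden).

```java
// Hypothetical value class; it mirrors the documented defaults, not the library's code.
final class NotifyEventSettings {
    static final int DEFAULT_BATCH_SIZE = 1;             // non-batching default
    static final int DEFAULT_BATCH_WINDOW_SECONDS = 900; // default for both trigger kinds

    final int batchSize;
    final int batchWindowSeconds;

    private NotifyEventSettings(int batchSize, int batchWindowSeconds) {
        this.batchSize = batchSize;
        this.batchWindowSeconds = batchWindowSeconds;
    }

    // Batching trigger: the batch size is mandatory, the window is optional (null means default).
    static NotifyEventSettings batching(Integer batchSize, Integer batchWindowSeconds) {
        if (batchSize == null) {
            throw new IllegalArgumentException("BatchSize is required for batching triggers");
        }
        return new NotifyEventSettings(batchSize, windowOrDefault(batchWindowSeconds));
    }

    // Non-batching trigger: the batch size is fixed to the default of 1.
    static NotifyEventSettings nonBatching(Integer batchWindowSeconds) {
        return new NotifyEventSettings(DEFAULT_BATCH_SIZE, windowOrDefault(batchWindowSeconds));
    }

    private static int windowOrDefault(Integer batchWindowSeconds) {
        return batchWindowSeconds == null ? DEFAULT_BATCH_WINDOW_SECONDS : batchWindowSeconds;
    }
}
```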

4. Conditional Triggers

Conditional triggers have a predicate and actions associated with them. The trigger actions are executed when the predicateCondition is true.

Connection Properties

A Connection allows Glue jobs, crawlers and development endpoints to access certain types of data stores.

Secrets Management: You must specify JDBC connection credentials in Secrets Manager and provide the Secrets Manager Key name as a property to the job connection.
Networking: The CDK determines the best-fit subnet for the Glue connection configuration. The prior version of the glue-alpha-module requires the developer to specify the subnet of the Connection when it’s defined. Now, you can still specify the subnet you want to use, but you are no longer required to. You are only required to provide a VPC and either a public or private subnet selection. Without a specific subnet provided, the L2 leverages the existing EC2 Subnet Selection library to choose the best-fit subnet.

 SecurityGroup securityGroup;
 Subnet subnet;
 
 Connection.Builder.create(this, "MyConnection")
         .type(ConnectionType.NETWORK)
         // The security groups granting AWS Glue inbound access to the data source within the VPC
         .securityGroups(List.of(securityGroup))
         // The VPC subnet which contains the data source
         .subnet(subnet)
         .build();

For an RDS connection over JDBC, it is recommended to manage credentials using AWS Secrets Manager. To use a secret, specify SECRET_ID in properties, as in the following code. Note that in this case, the subnet must have a route to the AWS Secrets Manager VPC endpoint or to the AWS Secrets Manager endpoint through a NAT gateway.

 SecurityGroup securityGroup;
 Subnet subnet;
 DatabaseCluster db;
 
 Connection.Builder.create(this, "RdsConnection")
         .type(ConnectionType.JDBC)
         .securityGroups(List.of(securityGroup))
         .subnet(subnet)
         .properties(Map.of(
                 "JDBC_CONNECTION_URL", String.format("jdbc:mysql://%s/databasename", db.getClusterEndpoint().getSocketAddress()),
                 "JDBC_ENFORCE_SSL", "false",
                 "SECRET_ID", db.getSecret().getSecretName()))
         .build();

If you need to use a connection type that doesn't exist as a static member on ConnectionType, you can instantiate a ConnectionType object, e.g.: new ConnectionType("NEW_TYPE").

See Adding a Connection to Your Data Store and Connection Structure documentation for more information on the supported data stores and their configurations.

SecurityConfiguration

A SecurityConfiguration is a set of security properties that can be used by AWS Glue to encrypt data at rest.

 SecurityConfiguration.Builder.create(this, "MySecurityConfiguration")
         .cloudWatchEncryption(CloudWatchEncryption.builder()
                 .mode(CloudWatchEncryptionMode.KMS)
                 .build())
         .jobBookmarksEncryption(JobBookmarksEncryption.builder()
                 .mode(JobBookmarksEncryptionMode.CLIENT_SIDE_KMS)
                 .build())
         .s3Encryption(S3Encryption.builder()
                 .mode(S3EncryptionMode.KMS)
                 .build())
         .build();

By default, a shared KMS key is created for use with the encryption configurations that require one. You can also supply your own key for each encryption config, for example, for CloudWatch encryption:

 Key key;
 
 SecurityConfiguration.Builder.create(this, "MySecurityConfiguration")
         .cloudWatchEncryption(CloudWatchEncryption.builder()
                 .mode(CloudWatchEncryptionMode.KMS)
                 .kmsKey(key)
                 .build())
         .build();

See the documentation for more information on how Glue encrypts data written by Crawlers, Jobs, and Development Endpoints.

Database

A Database is a logical grouping of Tables in the Glue Catalog.

 Database.Builder.create(this, "MyDatabase")
         .databaseName("my_database")
         .description("my_database_description")
         .build();

Table

A Glue table describes a table of data in S3: its structure (column names and types), location of data (S3 objects with a common prefix in an S3 bucket), and format for the files (Json, Avro, Parquet, etc.):

 Database myDatabase;
 
 S3Table.Builder.create(this, "MyTable")
         .database(myDatabase)
         .columns(List.of(Column.builder()
                 .name("col1")
                 .type(Schema.STRING)
                 .build(), Column.builder()
                 .name("col2")
                 .type(Schema.array(Schema.STRING))
                 .comment("col2 is an array of strings")
                 .build()))
         .dataFormat(DataFormat.JSON)
         .build();

By default, an S3 bucket will be created to store the table's data, but you can manually pass the bucket and s3Prefix:

 Bucket myBucket;
 Database myDatabase;
 
 S3Table.Builder.create(this, "MyTable")
         .bucket(myBucket)
         .s3Prefix("my-table/")
         // ...
         .database(myDatabase)
         .columns(List.of(Column.builder()
                 .name("col1")
                 .type(Schema.STRING)
                 .build()))
         .dataFormat(DataFormat.JSON)
         .build();

Glue tables can be configured to contain user-defined properties, to describe the physical storage of table data, through the storageParameters property:

 Database myDatabase;
 
 S3Table.Builder.create(this, "MyTable")
         .storageParameters(List.of(StorageParameter.skipHeaderLineCount(1), StorageParameter.compressionType(CompressionType.GZIP), StorageParameter.custom("separatorChar", ",")))
         // ...
         .database(myDatabase)
         .columns(List.of(Column.builder()
                 .name("col1")
                 .type(Schema.STRING)
                 .build()))
         .dataFormat(DataFormat.JSON)
         .build();

Glue tables can also be configured to contain user-defined table properties through the parameters property:

 Database myDatabase;
 
 S3Table.Builder.create(this, "MyTable")
         .parameters(Map.of(
                 "key1", "val1",
                 "key2", "val2"))
         .database(myDatabase)
         .columns(List.of(Column.builder()
                 .name("col1")
                 .type(Schema.STRING)
                 .build()))
         .dataFormat(DataFormat.JSON)
         .build();

Partition Keys

To improve query performance, a table can specify partitionKeys on which data is stored and queried separately. For example, you might partition a table by year and month to optimize queries based on a time window:

 Database myDatabase;
 
 S3Table.Builder.create(this, "MyTable")
         .database(myDatabase)
         .columns(List.of(Column.builder()
                 .name("col1")
                 .type(Schema.STRING)
                 .build()))
         .partitionKeys(List.of(Column.builder()
                 .name("year")
                 .type(Schema.SMALL_INT)
                 .build(), Column.builder()
                 .name("month")
                 .type(Schema.SMALL_INT)
                 .build()))
         .dataFormat(DataFormat.JSON)
         .build();

Partition Indexes

Another way to improve query performance is to specify partition indexes. If no partition indexes are present on the table, AWS Glue loads all partitions of the table and filters the loaded partitions using the query expression. The query takes more time to run as the number of partitions increases. With an index, the query will try to fetch a subset of the partitions instead of loading all partitions of the table.

The keys of a partition index must be a subset of the partition keys of the table. You can have a maximum of 3 partition indexes per table. To specify a partition index, you can use the partitionIndexes property:

 Database myDatabase;
 
 S3Table.Builder.create(this, "MyTable")
         .database(myDatabase)
         .columns(List.of(Column.builder()
                 .name("col1")
                 .type(Schema.STRING)
                 .build()))
         .partitionKeys(List.of(Column.builder()
                 .name("year")
                 .type(Schema.SMALL_INT)
                 .build(), Column.builder()
                 .name("month")
                 .type(Schema.SMALL_INT)
                 .build()))
         .partitionIndexes(List.of(PartitionIndex.builder()
                 .indexName("my-index") // optional
                 .keyNames(List.of("year"))
                 .build())) // supply up to 3 indexes
         .dataFormat(DataFormat.JSON)
         .build();

Alternatively, you can call the addPartitionIndex() function on a table:

 Table myTable;
 
 myTable.addPartitionIndex(PartitionIndex.builder()
         .indexName("my-index")
         .keyNames(List.of("year"))
         .build());
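
The two constraints stated above (index keys must be a subset of the table's partition keys, and a table can have at most 3 partition indexes) can be checked ahead of deployment with a few lines of plain Java. This is a minimal sketch; PartitionIndexCheck is a hypothetical helper and is not part of the construct library, which may enforce these rules differently.

```java
import java.util.List;

// Hypothetical pre-flight check for the partition index rules described in the text.
final class PartitionIndexCheck {
    static final int MAX_INDEXES_PER_TABLE = 3;

    private PartitionIndexCheck() {}

    // partitionKeyNames: names of the table's partition keys.
    // indexes: one inner list of key names per partition index.
    static void validate(List<String> partitionKeyNames, List<List<String>> indexes) {
        if (indexes.size() > MAX_INDEXES_PER_TABLE) {
            throw new IllegalArgumentException(
                    "A table can have at most " + MAX_INDEXES_PER_TABLE + " partition indexes");
        }
        for (List<String> keyNames : indexes) {
            // Every key used by an index must also be a partition key of the table.
            if (!partitionKeyNames.containsAll(keyNames)) {
                throw new IllegalArgumentException(
                        "Index keys " + keyNames + " are not a subset of partition keys " + partitionKeyNames);
            }
        }
    }
}
```

For a table partitioned by year and month, an index on year passes the check, while an index on day, or a fourth index, is rejected.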

Partition Filtering

If you have a table with a large number of partitions that grows over time, consider using AWS Glue partition indexing and filtering.

 Database myDatabase;
 
 S3Table.Builder.create(this, "MyTable")
         .database(myDatabase)
         .columns(List.of(Column.builder()
                 .name("col1")
                 .type(Schema.STRING)
                 .build()))
         .partitionKeys(List.of(Column.builder()
                 .name("year")
                 .type(Schema.SMALL_INT)
                 .build(), Column.builder()
                 .name("month")
                 .type(Schema.SMALL_INT)
                 .build()))
         .dataFormat(DataFormat.JSON)
         .enablePartitionFiltering(true)
         .build();

Glue Connections

Glue connections allow external data connections to third party databases and data warehouses. However, these connections can also be assigned to Glue Tables, allowing you to query external data sources using the Glue Data Catalog.

Whereas S3Table will point to (and if needed, create) a bucket to store the tables' data, ExternalTable will point to an existing table in a data source. For example, to create a table in Glue that points to a table in Redshift:

 Connection myConnection;
 Database myDatabase;
 
 ExternalTable.Builder.create(this, "MyTable")
         .connection(myConnection)
         .externalDataLocation("default_db_public_example") // A table in Redshift
         // ...
         .database(myDatabase)
         .columns(List.of(Column.builder()
                 .name("col1")
                 .type(Schema.STRING)
                 .build()))
         .dataFormat(DataFormat.JSON)
         .build();

Encryption

You can enable encryption on a Table's data:

S3Managed - (default) Server side encryption (SSE-S3) with an Amazon S3-managed key.

 Database myDatabase;
 
 S3Table.Builder.create(this, "MyTable")
         .encryption(TableEncryption.S3_MANAGED)
         // ...
         .database(myDatabase)
         .columns(List.of(Column.builder()
                 .name("col1")
                 .type(Schema.STRING)
                 .build()))
         .dataFormat(DataFormat.JSON)
         .build();

Kms - Server-side encryption (SSE-KMS) with an AWS KMS Key managed by the account owner.

 Database myDatabase;
 
 // KMS key is created automatically
 S3Table.Builder.create(this, "MyTable")
         .encryption(TableEncryption.KMS)
         // ...
         .database(myDatabase)
         .columns(List.of(Column.builder()
                 .name("col1")
                 .type(Schema.STRING)
                 .build()))
         .dataFormat(DataFormat.JSON)
         .build();
 
 // with an explicit KMS key
 S3Table.Builder.create(this, "MyTable")
         .encryption(TableEncryption.KMS)
         .encryptionKey(new Key(this, "MyKey"))
         // ...
         .database(myDatabase)
         .columns(List.of(Column.builder()
                 .name("col1")
                 .type(Schema.STRING)
                 .build()))
         .dataFormat(DataFormat.JSON)
         .build();

KmsManaged - Server-side encryption (SSE-KMS), like Kms, except with an AWS KMS Key managed by the AWS Key Management Service.

 Database myDatabase;
 
 S3Table.Builder.create(this, "MyTable")
         .encryption(TableEncryption.KMS_MANAGED)
         // ...
         .database(myDatabase)
         .columns(List.of(Column.builder()
                 .name("col1")
                 .type(Schema.STRING)
                 .build()))
         .dataFormat(DataFormat.JSON)
         .build();

ClientSideKms - Client-side encryption (CSE-KMS) with an AWS KMS Key managed by the account owner.

 Database myDatabase;
 
 // KMS key is created automatically
 S3Table.Builder.create(this, "MyTable")
         .encryption(TableEncryption.CLIENT_SIDE_KMS)
         // ...
         .database(myDatabase)
         .columns(List.of(Column.builder()
                 .name("col1")
                 .type(Schema.STRING)
                 .build()))
         .dataFormat(DataFormat.JSON)
         .build();
 
 // with an explicit KMS key
 S3Table.Builder.create(this, "MyTable")
         .encryption(TableEncryption.CLIENT_SIDE_KMS)
         .encryptionKey(new Key(this, "MyKey"))
         // ...
         .database(myDatabase)
         .columns(List.of(Column.builder()
                 .name("col1")
                 .type(Schema.STRING)
                 .build()))
         .dataFormat(DataFormat.JSON)
         .build();

Note: you cannot provide a Bucket when creating the S3Table if you wish to use server-side encryption (KMS, KMS_MANAGED or S3_MANAGED).
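
The rule in the note can be summarized as a small lookup. The enum below is a hypothetical illustration written for this page; it is not the library's TableEncryption class, and it only encodes which of the four modes listed above are server-side.

```java
// Hypothetical mirror of the four encryption modes described above.
enum TableEncryptionSketch {
    S3_MANAGED(true),
    KMS(true),
    KMS_MANAGED(true),
    CLIENT_SIDE_KMS(false);

    private final boolean serverSide;

    TableEncryptionSketch(boolean serverSide) {
        this.serverSide = serverSide;
    }

    boolean isServerSide() {
        return serverSide;
    }

    // Per the note: a caller-supplied bucket cannot be combined with server-side modes.
    boolean allowsCustomBucket() {
        return !serverSide;
    }
}
```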

Types

A table's schema is a collection of columns, each of which has a name and a type. Types are recursive structures, consisting of primitive and complex types:

 Database myDatabase;
 
 S3Table.Builder.create(this, "MyTable")
         .columns(List.of(Column.builder()
                 .name("primitive_column")
                 .type(Schema.STRING)
                 .build(), Column.builder()
                 .name("array_column")
                 .type(Schema.array(Schema.INTEGER))
                 .comment("array<integer>")
                 .build(), Column.builder()
                 .name("map_column")
                 .type(Schema.map(Schema.STRING, Schema.TIMESTAMP))
                 .comment("map<string,timestamp>")
                 .build(), Column.builder()
                 .name("struct_column")
                 .type(Schema.struct(List.of(Column.builder()
                         .name("nested_column")
                         .type(Schema.DATE)
                         .comment("nested comment")
                         .build())))
                 .comment("struct<nested_column:date COMMENT 'nested comment'>")
                 .build()))
         // ...
         .database(myDatabase)
         .dataFormat(DataFormat.JSON)
         .build();
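
To illustrate what "recursive" means here, the following plain-Java sketch builds type strings of the shape shown in the column comments above by nesting one type inside another. TypeSketch and its nested Field class are hypothetical and are not the library's Schema class; the primitive names used are assumptions for the example.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical illustration of recursive type rendering; not the library's Schema class.
final class TypeSketch {
    // Primitive types are the leaves of the recursion.
    static final TypeSketch STRING = new TypeSketch("string");
    static final TypeSketch DATE = new TypeSketch("date");
    static final TypeSketch TIMESTAMP = new TypeSketch("timestamp");

    final String inputString;

    private TypeSketch(String inputString) {
        this.inputString = inputString;
    }

    // Complex types wrap other types, which may themselves be complex.
    static TypeSketch array(TypeSketch itemType) {
        return new TypeSketch("array<" + itemType.inputString + ">");
    }

    static TypeSketch map(TypeSketch keyType, TypeSketch valueType) {
        return new TypeSketch("map<" + keyType.inputString + "," + valueType.inputString + ">");
    }

    static TypeSketch struct(List<Field> fields) {
        String body = fields.stream().map(Field::render).collect(Collectors.joining(","));
        return new TypeSketch("struct<" + body + ">");
    }

    // A named member of a struct, with an optional comment (null when absent).
    static final class Field {
        final String name;
        final TypeSketch type;
        final String comment;

        Field(String name, TypeSketch type, String comment) {
            this.name = name;
            this.type = type;
            this.comment = comment;
        }

        String render() {
            String base = name + ":" + type.inputString;
            return comment == null ? base : base + " COMMENT '" + comment + "'";
        }
    }
}
```

Because every factory method accepts a TypeSketch and returns one, types can be nested to any depth, for example an array of maps whose values are arrays.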

Public FAQ

What are we launching today?

We’re launching new features to an AWS CDK Glue L2 Construct to provide best-practice defaults and convenience methods to create Glue Jobs, Connections, Triggers, Workflows, and the underlying permissions and configuration.

Why should I use this Construct?

Developers should use this Construct to reduce the amount of boilerplate code and complexity each individual has to navigate, and make it easier to create best-practice Glue resources.

What’s not in scope?

Glue Crawlers and other resources that are now managed by the AWS LakeFormation team are not in scope for this effort. Developers should use existing methods to create these resources, and the new Glue L2 construct assumes they already exist as inputs. While best practice is for application and infrastructure code to be as close as possible for teams using fully-implemented DevOps mechanisms, in practice these ETL scripts are likely managed by a data science team who know Python or Scala and don’t necessarily own or manage their own infrastructure deployments. We want to meet developers where they are, and not assume that all of the code resides in the same repository. Developers who do own both can, however, automate this themselves via the CDK.

Validating Glue version and feature use per AWS region at synth time is also not in scope. AWS’ intention is for all features to eventually be propagated to all global regions, so the complexity involved in creating and updating region-specific configuration to match shifting feature sets is not justified by the small likelihood that a developer will use this construct to deploy a new feature to a region that doesn’t yet support it without first researching or manually trying that feature there. The developer will, of course, still get feedback from the underlying Glue APIs as CloudFormation deploys the resources, similar to the current CDK L1 Glue experience.

Related Packages

Package

Description

software.amazon.awscdk.services.glue

AWS Glue Construct Library
Class

Description

$Module

Action

(experimental) Represents a trigger action.

Action.Builder

A builder for Action

Action.Jsii$Proxy

An implementation for Action

AssetCode

(experimental) Job Code from a local file.

ClassificationString

(experimental) Classification string given to tables with this data format.

CloudWatchEncryption

(experimental) CloudWatch Logs encryption configuration.

CloudWatchEncryption.Builder

A builder for CloudWatchEncryption

CloudWatchEncryption.Jsii$Proxy

An implementation for CloudWatchEncryption

CloudWatchEncryptionMode

(experimental) Encryption mode for CloudWatch Logs.

Code

(experimental) Represents a Glue Job's Code assets (an asset can be a script, a jar, a python file or any other file).

CodeConfig

(experimental) Result of binding Code into a Job.

CodeConfig.Builder

A builder for CodeConfig

CodeConfig.Jsii$Proxy

An implementation for CodeConfig

Column

(experimental) A column of a table.

Column.Builder

A builder for Column

Column.Jsii$Proxy

An implementation for Column

ColumnCountMismatchHandlingAction

(experimental) Identifies if the file contains fewer or more values for a row than the number of columns specified in the external table definition.

CompressionType

(experimental) The compression type.

Condition

(experimental) Represents a trigger condition.

Condition.Builder

A builder for Condition

Condition.Jsii$Proxy

An implementation for Condition

ConditionalTriggerOptions

(experimental) Properties for configuring a Condition (Predicate) based Glue Trigger.

ConditionalTriggerOptions.Builder

A builder for ConditionalTriggerOptions

ConditionalTriggerOptions.Jsii$Proxy

An implementation for ConditionalTriggerOptions

ConditionLogicalOperator

(experimental) Represents the logical operator for evaluating a single condition in the Glue Trigger API.

Connection

(experimental) An AWS Glue connection to a data source.

Connection.Builder

(experimental) A fluent builder for Connection.

ConnectionOptions

(experimental) Base Connection Options.

ConnectionOptions.Builder

A builder for ConnectionOptions

ConnectionOptions.Jsii$Proxy

An implementation for ConnectionOptions

ConnectionProps

(experimental) Construction properties for Connection.

ConnectionProps.Builder

A builder for ConnectionProps

ConnectionProps.Jsii$Proxy

An implementation for ConnectionProps

ConnectionType

(experimental) The type of the Glue connection.

ContinuousLoggingProps

(experimental) Properties for enabling Continuous Logging for Glue Jobs.

ContinuousLoggingProps.Builder

A builder for ContinuousLoggingProps

ContinuousLoggingProps.Jsii$Proxy

An implementation for ContinuousLoggingProps

CrawlerState

(experimental) Represents the state of a crawler for a condition in the Glue Trigger API.

CustomScheduledTriggerOptions

(experimental) Properties for configuring a custom-scheduled Glue Trigger.

CustomScheduledTriggerOptions.Builder

A builder for CustomScheduledTriggerOptions

CustomScheduledTriggerOptions.Jsii$Proxy

An implementation for CustomScheduledTriggerOptions

DailyScheduleTriggerOptions

(experimental) Properties for configuring a daily-scheduled Glue Trigger.

DailyScheduleTriggerOptions.Builder

A builder for DailyScheduleTriggerOptions

DailyScheduleTriggerOptions.Jsii$Proxy

An implementation for DailyScheduleTriggerOptions

Database

(experimental) A Glue database.

Database.Builder

(experimental) A fluent builder for Database.

DatabaseProps

Example:

DatabaseProps.Builder

A builder for DatabaseProps

DatabaseProps.Jsii$Proxy

An implementation for DatabaseProps

DataFormat

(experimental) Defines the input/output formats and ser/de for a single DataFormat.

DataFormat.Builder

(experimental) A fluent builder for DataFormat.

DataFormatProps

(experimental) Properties of a DataFormat instance.

DataFormatProps.Builder

A builder for DataFormatProps

DataFormatProps.Jsii$Proxy

An implementation for DataFormatProps

DataQualityRuleset

(experimental) A Glue Data Quality ruleset.

DataQualityRuleset.Builder

(experimental) A fluent builder for DataQualityRuleset.

DataQualityRulesetProps

(experimental) Construction properties for DataQualityRuleset.

DataQualityRulesetProps.Builder

A builder for DataQualityRulesetProps

DataQualityRulesetProps.Jsii$Proxy

An implementation for DataQualityRulesetProps

DataQualityTargetTable

(experimental) Properties of a DataQualityTargetTable.

EventBatchingCondition

(experimental) Represents event trigger batch condition.

EventBatchingCondition.Builder

A builder for EventBatchingCondition

EventBatchingCondition.Jsii$Proxy

An implementation for EventBatchingCondition

ExecutionClass

(experimental) The ExecutionClass determines whether the job is run with a standard or flexible execution class.

ExternalTable

(experimental) A Glue table that targets an external data location (e.g.

ExternalTable.Builder

(experimental) A fluent builder for ExternalTable.

ExternalTableProps

Example:

ExternalTableProps.Builder

A builder for ExternalTableProps

ExternalTableProps.Jsii$Proxy

An implementation for ExternalTableProps

GlueVersion

(experimental) AWS Glue version determines the versions of Apache Spark and Python that are available to the job.

IConnection

(experimental) Interface representing a created or an imported Connection.

IConnection.Jsii$Default

Internal default implementation for IConnection.

IConnection.Jsii$Proxy

A proxy class which represents a concrete javascript instance of this type.

IDatabase

IDatabase.Jsii$Default

Internal default implementation for IDatabase.

IDatabase.Jsii$Proxy

A proxy class which represents a concrete javascript instance of this type.

IDataQualityRuleset

IDataQualityRuleset.Jsii$Default

Internal default implementation for IDataQualityRuleset.

IDataQualityRuleset.Jsii$Proxy

A proxy class which represents a concrete javascript instance of this type.

IJob

(experimental) Interface representing a new or an imported Glue Job.

IJob.Jsii$Default

Internal default implementation for IJob.

IJob.Jsii$Proxy

A proxy class which represents a concrete javascript instance of this type.

InputFormat

(experimental) Absolute class name of the Hadoop InputFormat to use when reading table files.

InvalidCharHandlingAction

(experimental) Specifies the action to perform when query results contain invalid UTF-8 character values.

ISecurityConfiguration

(experimental) Interface representing a created or an imported SecurityConfiguration.

ISecurityConfiguration.Jsii$Default

Internal default implementation for ISecurityConfiguration.

ISecurityConfiguration.Jsii$Proxy

A proxy class which represents a concrete javascript instance of this type.

ITable

ITable.Jsii$Default

Internal default implementation for ITable.

ITable.Jsii$Proxy

A proxy class which represents a concrete javascript instance of this type.

IWorkflow

(experimental) The base interface for Glue Workflow.

IWorkflow.Jsii$Default

Internal default implementation for IWorkflow.

IWorkflow.Jsii$Proxy

A proxy class which represents a concrete javascript instance of this type.

Job

(experimental) A Glue Job.

JobAttributes

(experimental) A subset of Job attributes are required for importing an existing job into a CDK project.

JobAttributes.Builder

A builder for JobAttributes

JobAttributes.Jsii$Proxy

An implementation for JobAttributes

JobBase

(experimental) A base class is needed to be able to import existing Jobs into a CDK app to reference as part of a larger stack or construct.

JobBookmarksEncryption

(experimental) Job bookmarks encryption configuration.

JobBookmarksEncryption.Builder

A builder for JobBookmarksEncryption

JobBookmarksEncryption.Jsii$Proxy

An implementation for JobBookmarksEncryption

JobBookmarksEncryptionMode

(experimental) Encryption mode for Job Bookmarks.

JobLanguage

(experimental) Runtime language of the Glue job.

JobProps

(experimental) JobProps will be used to create new Glue Jobs using this L2 Construct.

JobProps.Builder

A builder for JobProps

JobProps.Jsii$Proxy

An implementation for JobProps

JobState

(experimental) Job states emitted by Glue to CloudWatch Events.

JobType

(experimental) The job type.

MaxCapacity

(experimental) The number of AWS Glue data processing units (DPUs) that can be allocated when this job runs.

MetricType

(experimental) The Glue CloudWatch metric type.

NotifyEventTriggerOptions

(experimental) Properties for configuring an EventBridge based Glue Trigger.

NotifyEventTriggerOptions.Builder

A builder for NotifyEventTriggerOptions

NotifyEventTriggerOptions.Jsii$Proxy

An implementation for NotifyEventTriggerOptions

NumericOverflowHandlingAction

(experimental) Specifies the action to perform when ORC data contains an integer (for example, BIGINT or int64) that is larger than the column definition (for example, SMALLINT or int16).

OnDemandTriggerOptions

(experimental) Properties for configuring an on-demand Glue Trigger.

OnDemandTriggerOptions.Builder

A builder for OnDemandTriggerOptions

OnDemandTriggerOptions.Jsii$Proxy

An implementation for OnDemandTriggerOptions

OrcColumnMappingType

(experimental) Specifies how to map columns when the table uses ORC data format.

OutputFormat

(experimental) Absolute class name of the Hadoop OutputFormat to use when writing table files.

PartitionIndex

(experimental) Properties of a Partition Index.

PartitionIndex.Builder

A builder for PartitionIndex

PartitionIndex.Jsii$Proxy

An implementation for PartitionIndex

Predicate

(experimental) Represents a trigger predicate.

Predicate.Builder

A builder for Predicate

Predicate.Jsii$Proxy

An implementation for Predicate

PredicateLogical

PySparkEtlJob

(experimental) PySpark ETL Jobs class.

PySparkEtlJob.Builder

(experimental) A fluent builder for PySparkEtlJob.

PySparkEtlJobProps

(experimental) Properties for creating a Python Spark ETL job.

PySparkEtlJobProps.Builder

A builder for PySparkEtlJobProps

PySparkEtlJobProps.Jsii$Proxy

An implementation for PySparkEtlJobProps

PySparkFlexEtlJob

(experimental) Flex Jobs class.

PySparkFlexEtlJob.Builder

(experimental) A fluent builder for PySparkFlexEtlJob.

PySparkFlexEtlJobProps

(experimental) Properties for PySparkFlexEtlJob.

PySparkFlexEtlJobProps.Builder

A builder for PySparkFlexEtlJobProps

PySparkFlexEtlJobProps.Jsii$Proxy

An implementation for PySparkFlexEtlJobProps

PySparkStreamingJob

(experimental) Python Spark Streaming Jobs class.

PySparkStreamingJob.Builder

(experimental) A fluent builder for PySparkStreamingJob.

PySparkStreamingJobProps

(experimental) Properties for creating a Python Spark Streaming job.

PySparkStreamingJobProps.Builder

A builder for PySparkStreamingJobProps

PySparkStreamingJobProps.Jsii$Proxy

An implementation for PySparkStreamingJobProps

PythonShellJob

(experimental) Python Shell Jobs class.

PythonShellJob.Builder

(experimental) A fluent builder for PythonShellJob.

PythonShellJobProps

(experimental) Properties for creating a Python Shell job.

PythonShellJobProps.Builder

A builder for PythonShellJobProps

PythonShellJobProps.Jsii$Proxy

An implementation for PythonShellJobProps

PythonVersion

(experimental) Python version.

RayJob

(experimental) Ray Jobs class.

RayJob.Builder

(experimental) A fluent builder for RayJob.

RayJobProps

(experimental) Properties for creating a Ray Glue job.

RayJobProps.Builder

A builder for RayJobProps

RayJobProps.Jsii$Proxy

An implementation for RayJobProps

Runtime

(experimental) AWS Glue runtime determines the runtime engine of the job.

S3Code

(experimental) Glue job Code from an S3 bucket.

S3Encryption

(experimental) S3 encryption configuration.

S3Encryption.Builder

A builder for S3Encryption

S3Encryption.Jsii$Proxy

An implementation for S3Encryption

S3EncryptionMode

(experimental) Encryption mode for S3.

S3Table

(experimental) A Glue table that targets a S3 dataset.

S3Table.Builder

(experimental) A fluent builder for S3Table.

S3TableProps

Example:

S3TableProps.Builder

A builder for S3TableProps

S3TableProps.Jsii$Proxy

An implementation for S3TableProps

ScalaSparkEtlJob

(experimental) Spark ETL Jobs class.

ScalaSparkEtlJob.Builder

(experimental) A fluent builder for ScalaSparkEtlJob.

ScalaSparkEtlJobProps

(experimental) Properties for creating a Scala Spark ETL job.

ScalaSparkEtlJobProps.Builder

A builder for ScalaSparkEtlJobProps

ScalaSparkEtlJobProps.Jsii$Proxy

An implementation for ScalaSparkEtlJobProps

ScalaSparkFlexEtlJob

(experimental) Spark ETL Jobs class.

ScalaSparkFlexEtlJob.Builder

(experimental) A fluent builder for ScalaSparkFlexEtlJob.

ScalaSparkFlexEtlJobProps

(experimental) Properties for ScalaSparkFlexEtlJob.

ScalaSparkFlexEtlJobProps.Builder

A builder for ScalaSparkFlexEtlJobProps

ScalaSparkFlexEtlJobProps.Jsii$Proxy

An implementation for ScalaSparkFlexEtlJobProps

ScalaSparkStreamingJob

(experimental) Scala Streaming Jobs class.

ScalaSparkStreamingJob.Builder

(experimental) A fluent builder for ScalaSparkStreamingJob.

ScalaSparkStreamingJobProps

(experimental) Properties for creating a Scala Spark Streaming job.

ScalaSparkStreamingJobProps.Builder

A builder for ScalaSparkStreamingJobProps

ScalaSparkStreamingJobProps.Jsii$Proxy

An implementation for ScalaSparkStreamingJobProps

Schema

Example:

SecurityConfiguration

(experimental) A security configuration is a set of security properties that can be used by AWS Glue to encrypt data at rest.

SecurityConfiguration.Builder

(experimental) A fluent builder for SecurityConfiguration.

SecurityConfigurationProps

(experimental) Construction properties of SecurityConfiguration.

SecurityConfigurationProps.Builder

A builder for SecurityConfigurationProps

SecurityConfigurationProps.Jsii$Proxy

An implementation for SecurityConfigurationProps

SerializationLibrary

(experimental) Serialization library to use when serializing/deserializing (SerDe) table records.

SparkExtraCodeProps

(experimental) Code props for different Code assets used by different types of Spark jobs.

SparkExtraCodeProps.Builder

A builder for SparkExtraCodeProps

SparkExtraCodeProps.Jsii$Proxy

An implementation for SparkExtraCodeProps

SparkJob

(experimental) Base class for different types of Spark Jobs.

SparkJobProps

(experimental) Common properties for different types of Spark jobs.

SparkJobProps.Builder

A builder for SparkJobProps

SparkJobProps.Jsii$Proxy

An implementation for SparkJobProps

SparkUILoggingLocation

(experimental) The Spark UI logging location.

SparkUILoggingLocation.Builder

A builder for SparkUILoggingLocation

SparkUILoggingLocation.Jsii$Proxy

An implementation for SparkUILoggingLocation

SparkUIProps

(experimental) Properties for enabling Spark UI monitoring feature for Spark-based Glue jobs.

SparkUIProps.Builder

A builder for SparkUIProps

SparkUIProps.Jsii$Proxy

An implementation for SparkUIProps

StorageParameter

(experimental) A storage parameter.

StorageParameters

(experimental) The storage parameter keys that are currently known; this list is not exhaustive and other keys may be used.

SurplusBytesHandlingAction

(experimental) Specifies how to handle data being loaded that exceeds the length of the data type defined for columns containing VARBYTE data.

SurplusCharHandlingAction

(experimental) Specifies how to handle data being loaded that exceeds the length of the data type defined for columns containing VARCHAR, CHAR, or string data.

Table

Deprecated.
Use S3Table instead.

Table.Builder

(experimental) A fluent builder for Table.

TableAttributes

Example:

TableAttributes.Builder

A builder for TableAttributes

TableAttributes.Jsii$Proxy

An implementation for TableAttributes

TableBase

(experimental) A Glue table.

TableBaseProps

Example:

TableBaseProps.Builder

A builder for TableBaseProps

TableBaseProps.Jsii$Proxy

An implementation for TableBaseProps

TableEncryption

(experimental) Encryption options for a Table.

TableProps

Example:

TableProps.Builder

A builder for TableProps

TableProps.Jsii$Proxy

An implementation for TableProps

TriggerOptions

(experimental) Properties for configuring a Glue Trigger.

TriggerOptions.Builder

A builder for TriggerOptions

TriggerOptions.Jsii$Proxy

An implementation for TriggerOptions

TriggerSchedule

(experimental) Represents a trigger schedule.

Type

(experimental) Represents a type of a column in a table schema.

Type.Builder

A builder for Type

Type.Jsii$Proxy

An implementation for Type

WeeklyScheduleTriggerOptions

(experimental) Properties for configuring a weekly-scheduled Glue Trigger.

WeeklyScheduleTriggerOptions.Builder

A builder for WeeklyScheduleTriggerOptions

WeeklyScheduleTriggerOptions.Jsii$Proxy

An implementation for WeeklyScheduleTriggerOptions

WorkerType

(experimental) The type of predefined worker that is allocated when a job runs.

Workflow

(experimental) This module defines a construct for creating and managing AWS Glue Workflows and Triggers.

Workflow.Builder

(experimental) A fluent builder for Workflow.

WorkflowAttributes

(experimental) Properties for importing a Workflow using its attributes.

WorkflowAttributes.Builder

A builder for WorkflowAttributes

WorkflowAttributes.Jsii$Proxy

An implementation for WorkflowAttributes

WorkflowBase

(experimental) Base abstract class for Workflow.

WorkflowProps

(experimental) Properties for defining a Workflow.

WorkflowProps.Builder

A builder for WorkflowProps

WorkflowProps.Jsii$Proxy

An implementation for WorkflowProps

WriteParallel

(experimental) Specifies whether data is written in parallel or serially.
