

# Developing blueprints in AWS Glue
<a name="developing-blueprints"></a>

As an AWS Glue developer, you can create and publish blueprints that data analysts can use to generate workflows.

**Topics**
+ [Overview of developing blueprints](developing-blueprints-overview.md)
+ [Prerequisites for developing blueprints](developing-blueprints-prereq.md)
+ [Writing the blueprint code](developing-blueprints-code.md)
+ [Sample blueprint project](developing-blueprints-sample.md)
+ [Testing a blueprint](developing-blueprints-testing.md)
+ [Publishing a blueprint](developing-blueprints-publishing.md)
+ [AWS Glue blueprint classes reference](developing-blueprints-code-classes.md)
+ [Blueprint samples](developing-blueprints-samples.md)

**See also**  
[Overview of blueprints in AWS Glue](blueprints-overview.md)

# Overview of developing blueprints
<a name="developing-blueprints-overview"></a>

The first step in your development process is to identify a common use case that would benefit from a blueprint. A typical use case involves a recurring ETL problem that you believe should be solved in a general manner. Next, design a blueprint that implements the generalized use case, and define the blueprint input parameters that together can define a specific use case from the generalized use case.

A blueprint consists of a project that contains a blueprint parameter configuration file and a script that defines the *layout* of the workflow to generate. The layout defines the jobs and crawlers (or *entities* in blueprint script terminology) to create.

You do not directly specify any triggers in the layout script. Instead you write code to specify the dependencies between the jobs and crawlers that the script creates. AWS Glue generates the triggers based on your dependency specifications. The output of the layout script is a workflow object, which contains specifications for all workflow entities.

You build your workflow object using the following AWS Glue blueprint libraries:
+ `awsglue.blueprint.base_resource` – A library of base resources used by the other blueprint libraries.
+ `awsglue.blueprint.workflow` – A library for defining a `Workflow` class.
+ `awsglue.blueprint.job` – A library for defining a `Job` class.
+ `awsglue.blueprint.crawler` – A library for defining a `Crawler` class.

The only other libraries supported for layout generation are those that are available to Python shell jobs.

Before publishing your blueprint, you can use methods defined in the blueprint libraries to test the blueprint locally.

When you're ready to make the blueprint available to data analysts, you package the script, the parameter configuration file, and any supporting files, such as additional scripts and libraries, into a single deployable asset. You then upload the asset to Amazon S3 and ask an administrator to register it with AWS Glue.

For sample blueprint projects, see [Sample blueprint project](developing-blueprints-sample.md) and [Blueprint samples](developing-blueprints-samples.md).

# Prerequisites for developing blueprints
<a name="developing-blueprints-prereq"></a>

To develop blueprints, you should be familiar with using AWS Glue and writing scripts for Apache Spark ETL jobs or Python shell jobs. In addition, you must complete the following setup tasks.
+ Download four AWS Python libraries to use in your blueprint layout scripts.
+ Set up the AWS SDKs.
+ Set up the AWS CLI.

## Download the Python libraries
<a name="prereqs-get-libes"></a>

Download the following libraries from GitHub, and install them into your project:
+ [https://github.com/awslabs/aws-glue-blueprint-libs/tree/master/awsglue/blueprint/base_resource.py](https://github.com/awslabs/aws-glue-blueprint-libs/tree/master/awsglue/blueprint/base_resource.py)
+ [https://github.com/awslabs/aws-glue-blueprint-libs/tree/master/awsglue/blueprint/workflow.py](https://github.com/awslabs/aws-glue-blueprint-libs/tree/master/awsglue/blueprint/workflow.py)
+ [https://github.com/awslabs/aws-glue-blueprint-libs/tree/master/awsglue/blueprint/crawler.py](https://github.com/awslabs/aws-glue-blueprint-libs/tree/master/awsglue/blueprint/crawler.py)
+ [https://github.com/awslabs/aws-glue-blueprint-libs/tree/master/awsglue/blueprint/job.py](https://github.com/awslabs/aws-glue-blueprint-libs/tree/master/awsglue/blueprint/job.py)

## Set up the AWS Java SDK
<a name="prereqs-java-preview-sdk"></a>

For the AWS Java SDK, you must add a `jar` file that includes the API for blueprints.

1. If you haven't already done so, set up the AWS SDK for Java.
   + For Java 1.x, follow the instructions in [Set up the AWS SDK for Java](https://docs.aws.amazon.com/sdk-for-java/v1/developer-guide/setup-install.html) in the *AWS SDK for Java Developer Guide*.
   + For Java 2.x, follow the instructions in [Setting up the AWS SDK for Java 2.x](https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/setup.html) in the *AWS SDK for Java 2.x Developer Guide*.

1. Download the client `jar` file that has access to the APIs for blueprints.
   + For Java 1.x: s3://awsglue-custom-blueprints-preview-artifacts/awsglue-java-sdk-preview/AWSGlueJavaClient-1.11.x.jar
   + For Java 2.x: s3://awsglue-custom-blueprints-preview-artifacts/awsglue-java-sdk-v2-preview/AwsJavaSdk-Glue-2.0.jar

1. Add the client `jar` to the front of the Java classpath to override the AWS Glue client provided by the AWS Java SDK.

   ```
   export CLASSPATH=<path-to-preview-client-jar>:$CLASSPATH
   ```

1. (Optional) Test the SDK with the following Java application. The application should output an empty list.

   Replace `accessKey` and `secretKey` with your credentials, and replace `us-east-1` with your Region.

   ```
   import com.amazonaws.auth.AWSCredentials;
   import com.amazonaws.auth.AWSCredentialsProvider;
   import com.amazonaws.auth.AWSStaticCredentialsProvider;
   import com.amazonaws.auth.BasicAWSCredentials;
   import com.amazonaws.services.glue.AWSGlue;
   import com.amazonaws.services.glue.AWSGlueClientBuilder;
   import com.amazonaws.services.glue.model.ListBlueprintsRequest;
   
   public class App{
       public static void main(String[] args) {
           AWSCredentials credentials = new BasicAWSCredentials("accessKey", "secretKey");
           AWSCredentialsProvider provider = new AWSStaticCredentialsProvider(credentials);
           AWSGlue glue = AWSGlueClientBuilder.standard().withCredentials(provider)
                   .withRegion("us-east-1").build();
           ListBlueprintsRequest request = new ListBlueprintsRequest().withMaxResults(2);
           System.out.println(glue.listBlueprints(request));
       }
   }
   ```

## Set up the AWS Python SDK
<a name="prereqs-python-preview-sdk"></a>

The following steps assume that you have Python 2.7, or Python 3.9 or later, installed on your computer.

1. Download the following boto3 wheel file. If you're prompted to open or save the file, save it: s3://awsglue-custom-blueprints-preview-artifacts/aws-python-sdk-preview/boto3-1.17.31-py2.py3-none-any.whl

1. Download the following botocore wheel file: s3://awsglue-custom-blueprints-preview-artifacts/aws-python-sdk-preview/botocore-1.20.31-py2.py3-none-any.whl

1. Check your Python version.

   ```
   python --version
   ```

1. Depending on your Python version, enter the following commands (for Linux):
   + For Python 2.7:

     ```
     python -m pip install --user virtualenv
     python -m virtualenv env
     source env/bin/activate
     ```
   + For Python 3.9 or later:

     ```
     python3 -m venv python-sdk-test
     source python-sdk-test/bin/activate
     ```

1. Install the botocore wheel file.

   ```
   python3 -m pip install <download-directory>/botocore-1.20.31-py2.py3-none-any.whl
   ```

1. Install the boto3 wheel file.

   ```
   python3 -m pip install <download-directory>/boto3-1.17.31-py2.py3-none-any.whl
   ```

1. Configure your credentials and default region in the `~/.aws/credentials` and `~/.aws/config` files. For more information, see [Configuring the AWS CLI](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html) in the *AWS Command Line Interface User Guide*.

1. (Optional) Test your setup. The following commands should return an empty list.

   Replace `us-east-1` with your Region.

   ```
   $ python
   >>> import boto3
   >>> glue = boto3.client('glue', 'us-east-1')
   >>> glue.list_blueprints()
   ```

## Set up the preview AWS CLI
<a name="prereqs-setup-cli"></a>

1. If you haven't already done so, install and/or update the AWS Command Line Interface (AWS CLI) on your computer. The easiest way to do this is with `pip`, the Python installer utility:

   ```
   pip install awscli --upgrade --user
   ```

   You can find complete installation instructions for the AWS CLI here: [Installing the AWS Command Line Interface](https://docs.aws.amazon.com/cli/latest/userguide/installing.html).

1. Download the AWS CLI wheel file from: s3://awsglue-custom-blueprints-preview-artifacts/awscli-preview-build/awscli-1.19.31-py2.py3-none-any.whl

1. Install the AWS CLI wheel file.

   ```
   python3 -m pip install awscli-1.19.31-py2.py3-none-any.whl
   ```

1. Run the `aws configure` command to configure your AWS credentials (including access key and secret key) and default AWS Region. For information about configuring the AWS CLI, see [Configuring the AWS CLI](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html).

1. Test the AWS CLI. The following command should return an empty list.

   Replace `us-east-1` with your Region.

   ```
   aws glue list-blueprints --region us-east-1
   ```

# Writing the blueprint code
<a name="developing-blueprints-code"></a>

Each blueprint project that you create must contain at a minimum the following files:
+ A Python layout script that defines the workflow. The script contains a function that defines the entities (jobs and crawlers) in a workflow, and the dependencies between them.
+ A configuration file, `blueprint.cfg`, which defines:
  + The full path of the workflow layout definition function.
  + The parameters that the blueprint accepts.

**Topics**
+ [Creating the blueprint layout script](developing-blueprints-code-layout.md)
+ [Creating the configuration file](developing-blueprints-code-config.md)
+ [Specifying blueprint parameters](developing-blueprints-code-parameters.md)

# Creating the blueprint layout script
<a name="developing-blueprints-code-layout"></a>

The blueprint layout script must include a function that generates the entities in your workflow. You can name this function whatever you like. AWS Glue uses the configuration file to determine the fully qualified name of the function.

Your layout function does the following:
+ (Optional) Instantiates the `Job` class to create `Job` objects, and passes arguments such as `Command` and `Role`. These are job properties that you would specify if you were creating the job using the AWS Glue console or API.
+ (Optional) Instantiates the `Crawler` class to create `Crawler` objects, and passes name, role, and target arguments.
+ To indicate dependencies between the objects (workflow entities), passes the `DependsOn` and `WaitForDependencies` additional arguments to `Job()` and `Crawler()`. These arguments are explained later in this section.
+ Instantiates the `Workflow` class to create the workflow object that is returned to AWS Glue, passing a `Name` argument, an `Entities` argument, and an optional `OnSchedule` argument. The `Entities` argument specifies all of the jobs and crawlers to include in the workflow. To see how to construct an `Entities` object, see the sample project later in this section.
+ Returns the `Workflow` object.

For definitions of the `Job`, `Crawler`, and `Workflow` classes, see [AWS Glue blueprint classes reference](developing-blueprints-code-classes.md).

The layout function must accept the following input arguments.


| Argument | Description | 
| --- | --- | 
| user\_params | Python dictionary of blueprint parameter names and values. For more information, see [Specifying blueprint parameters](developing-blueprints-code-parameters.md). | 
| system\_params | Python dictionary containing two properties: `region` and `accountId`. | 

Here is a sample layout generator script in a file named `Layout.py`:

```
import argparse
import sys
import os
import json
from awsglue.blueprint.workflow import *
from awsglue.blueprint.job import *
from awsglue.blueprint.crawler import *


def generate_layout(user_params, system_params):

    etl_job = Job(Name="{}_etl_job".format(user_params['WorkflowName']),
                  Command={
                      "Name": "glueetl",
                      "ScriptLocation": user_params['ScriptLocation'],
                      "PythonVersion": "2"
                  },
                  Role=user_params['PassRole'])
    post_process_job = Job(Name="{}_post_process".format(user_params['WorkflowName']),
                            Command={
                                "Name": "pythonshell",
                                "ScriptLocation": user_params['ScriptLocation'],
                                "PythonVersion": "2"
                            },
                            Role=user_params['PassRole'],
                            DependsOn={
                                etl_job: "SUCCEEDED"
                            },
                            WaitForDependencies="AND")
    sample_workflow = Workflow(Name=user_params['WorkflowName'],
                            Entities=Entities(Jobs=[etl_job, post_process_job]))
    return sample_workflow
```

The sample script imports the required blueprint libraries and includes a `generate_layout` function that generates a workflow with two jobs. This is a very simple script. A more complex script could employ additional logic and parameters to generate a workflow with many jobs and crawlers, or even a variable number of jobs and crawlers.

## Using the DependsOn argument
<a name="developing-blueprints-code-layout-depends-on"></a>

The `DependsOn` argument is a dictionary representation of a dependency that this entity has on other entities within the workflow. It has the following form. 

```
DependsOn = {dependency1 : state, dependency2 : state, ...}
```

The keys in this dictionary represent the object reference, not the name, of the entity, while the values are strings that correspond to the state to watch for. AWS Glue infers the proper triggers. For the valid states, see [Condition Structure](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-jobs-trigger.html#aws-glue-api-jobs-trigger-Condition).

For example, a job might depend on the successful completion of a crawler. If you define a crawler object named `crawler2` as follows:

```
crawler2 = Crawler(Name="my_crawler", ...)
```

Then an object depending on `crawler2` would include a constructor argument such as: 

```
DependsOn = {crawler2 : "SUCCEEDED"}
```

For example:

```
job1 = Job(Name="Job1", ..., DependsOn = {crawler2 : "SUCCEEDED", ...})
```

If `DependsOn` is omitted for an entity, that entity depends on the workflow start trigger.

## Using the WaitForDependencies argument
<a name="developing-blueprints-code-layout-wait-for-dependencies"></a>

The `WaitForDependencies` argument defines whether a job or crawler entity should wait until *all* entities on which it depends complete or until *any* completes.

The allowable values are "`AND`" or "`ANY`".
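The difference can be illustrated with plain Python. The following is a standalone sketch of the semantics only, not the blueprint library itself; `would_trigger` and the entity names are hypothetical.

```
def would_trigger(dependency_matches, wait_for="AND"):
    # dependency_matches maps each dependency's name to True when that
    # dependency has reached the state named in DependsOn.
    matches = dependency_matches.values()
    return all(matches) if wait_for == "AND" else any(matches)

# etl_job has reached "SUCCEEDED"; post_crawler has not finished yet.
observed = {"etl_job": True, "post_crawler": False}

print(would_trigger(observed, "AND"))  # False: still waiting on post_crawler
print(would_trigger(observed, "ANY"))  # True: one matching dependency suffices
```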

## Using the OnSchedule argument
<a name="developing-blueprints-code-layout-on-schedule"></a>

The `OnSchedule` argument for the `Workflow` class constructor is a `cron` expression that defines the starting trigger definition for a workflow.

If this argument is specified, AWS Glue creates a schedule trigger with the corresponding schedule. If it isn't specified, the starting trigger for the workflow is an on-demand trigger.
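For example, a workflow that starts every day at 12:00 UTC could be declared as follows. This is a sketch that reuses the `etl_job` object from the earlier sample layout script; AWS Glue cron expressions use the form `cron(Minutes Hours Day-of-month Month Day-of-week Year)`.

```
sample_workflow = Workflow(Name=user_params['WorkflowName'],
                           Entities=Entities(Jobs=[etl_job]),
                           OnSchedule="cron(0 12 * * ? *)")
```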

# Creating the configuration file
<a name="developing-blueprints-code-config"></a>

The blueprint configuration file is a required file that defines the script entry point for generating the workflow, and the parameters that the blueprint accepts. The file must be named `blueprint.cfg`.

Here is a sample configuration file.

```
{
    "layoutGenerator": "DemoBlueprintProject.Layout.generate_layout",
    "parameterSpec" : {
           "WorkflowName" : {
                "type": "String",
                "collection": false
           },
           "WorkerType" : {
                "type": "String",
                "collection": false,
                "allowedValues": ["G1.X", "G2.X"],
                "defaultValue": "G1.X"
           },
           "Dpu" : {
                "type" : "Integer",
                "allowedValues" : [2, 4, 6],
                "defaultValue" : 2
           },
           "DynamoDBTableName": {
                "type": "String",
                "collection" : false
           },
           "ScriptLocation" : {
                "type": "String",
                "collection": false
           }
    }
}
```

The `layoutGenerator` property specifies the fully qualified name of the function in the script that generates the layout.

The `parameterSpec` property specifies the parameters that this blueprint accepts. For more information, see [Specifying blueprint parameters](developing-blueprints-code-parameters.md).

**Important**  
Your configuration file must include the workflow name as a blueprint parameter, or you must generate a unique workflow name in your layout script.
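If you take the second approach, the layout script can derive a unique name itself. The following is a minimal sketch of one way to do that, using a random suffix; the `unique_workflow_name` helper and the `conversion` prefix are hypothetical.

```
import uuid

def unique_workflow_name(prefix):
    # Append a short random suffix so that repeated workflow creation
    # from the same blueprint doesn't collide on the workflow Name.
    return "{}_{}".format(prefix, uuid.uuid4().hex[:8])

print(unique_workflow_name("conversion"))
```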

# Specifying blueprint parameters
<a name="developing-blueprints-code-parameters"></a>

The configuration file contains blueprint parameter specifications in a `parameterSpec` JSON object. `parameterSpec` contains one or more parameter objects.

```
"parameterSpec": {
    "<parameter_name>": {
      "type": "<parameter-type>",
      "collection": true|false, 
      "description": "<parameter-description>",
      "defaultValue": "<default value for the parameter if value not specified>"
      "allowedValues": "<list of allowed values>" 
    },
    "<parameter_name>": {    
       ...
    }
  }
```

The following are the rules for coding each parameter object:
+ The parameter name and `type` are mandatory. All other properties are optional.
+ If you specify the `defaultValue` property, the parameter is optional. Otherwise the parameter is mandatory and the data analyst who is creating a workflow from the blueprint must provide a value for it.
+ If you set the `collection` property to `true`, the parameter can take a collection of values. Collections can be of any data type.
+ If you specify `allowedValues`, the AWS Glue console displays a dropdown list of values for the data analyst to choose from when creating a workflow from the blueprint.
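To see how these specifications surface at run time, consider the shape of the `user_params` dictionary that the layout function receives. The following standalone sketch uses hypothetical parameter names and values; the point is that a parameter declared with `"collection": true` arrives as a Python list.

```
# A stand-in for the user_params dictionary passed to the layout
# function (hypothetical parameter names and placeholder values).
user_params = {
    "WorkflowName": "demo_workflow",          # mandatory String parameter
    "S3Paths": [                              # "collection": true -> a list
        "s3://amzn-s3-demo-bucket/input/a/",
        "s3://amzn-s3-demo-bucket/input/b/",
    ],
    "Dpu": 2,                                 # Integer parameter
}

for path in user_params["S3Paths"]:
    print(path)
```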

The following are the permitted values for `type`:


| Parameter data type | Notes | 
| --- | --- | 
| String | - | 
| Integer | - | 
| Double | - | 
| Boolean | Possible values are true and false. Generates a check box on the Create a workflow from <blueprint> page on the AWS Glue console. | 
| S3Uri | Complete Amazon S3 path, beginning with s3://. Generates a text field and Browse button on the Create a workflow from <blueprint> page. | 
| S3Bucket | Amazon S3 bucket name only. Generates a bucket picker on the Create a workflow from <blueprint> page. | 
| IAMRoleArn | Amazon Resource Name (ARN) of an AWS Identity and Access Management (IAM) role. Generates a role picker on the Create a workflow from <blueprint> page. | 
| IAMRoleName | Name of an IAM role. Generates a role picker on the Create a workflow from <blueprint> page. | 

# Sample blueprint project
<a name="developing-blueprints-sample"></a>

Data format conversion is a frequent extract, transform, and load (ETL) use case. In typical analytic workloads, column-based file formats like Parquet and ORC are preferred over text formats like CSV and JSON. This sample blueprint enables you to convert files on Amazon S3 from formats such as CSV and JSON into Parquet.

This blueprint takes a list of S3 paths defined by a blueprint parameter, converts the data to Parquet format, and writes it to the S3 location specified by another blueprint parameter. The layout script creates a crawler and job for each path. The layout script also uploads the ETL script in `Conversion.py` to an S3 bucket specified by another blueprint parameter. The layout script then specifies the uploaded script as the ETL script for each job. The ZIP archive for the project contains the layout script, the ETL script, and the blueprint configuration file.

For information about more sample blueprint projects, see [Blueprint samples](developing-blueprints-samples.md).

The following is the layout script, in the file `Layout.py`.

```
from awsglue.blueprint.workflow import *
from awsglue.blueprint.job import *
from awsglue.blueprint.crawler import *
import boto3

s3_client = boto3.client('s3')

# Ingesting all the S3 paths as Glue table in parquet format
def generate_layout(user_params, system_params):
    #Always give the full path for the file
    with open("ConversionBlueprint/Conversion.py", "rb") as f:
        s3_client.upload_fileobj(f, user_params['ScriptsBucket'], "Conversion.py")
    etlScriptLocation = "s3://{}/Conversion.py".format(user_params['ScriptsBucket'])    
    crawlers = []
    jobs = []
    workflowName = user_params['WorkflowName']
    for i, path in enumerate(user_params['S3Paths']):
      tablePrefix = "source_"
      crawler = Crawler(Name="{}_crawler_{}".format(workflowName, i),
                        Role=user_params['PassRole'],
                        DatabaseName=user_params['TargetDatabase'],
                        TablePrefix=tablePrefix,
                        Targets= {"S3Targets": [{"Path": path}]})
      crawlers.append(crawler)
      transform_job = Job(Name="{}_transform_job_{}".format(workflowName, i),
                         Command={"Name": "glueetl",
                                  "ScriptLocation": etlScriptLocation,
                                  "PythonVersion": "3"},
                         Role=user_params['PassRole'],
                         DefaultArguments={"--database_name": user_params['TargetDatabase'],
                                           "--table_prefix": tablePrefix,
                                           "--region_name": system_params['region'],
                                           "--output_path": user_params['TargetS3Location']},
                         DependsOn={crawler: "SUCCEEDED"},
                         WaitForDependencies="AND")
      jobs.append(transform_job)
    conversion_workflow = Workflow(Name=workflowName, Entities=Entities(Jobs=jobs, Crawlers=crawlers))
    return conversion_workflow
```

The following is the corresponding blueprint configuration file `blueprint.cfg`.

```
{
    "layoutGenerator": "ConversionBlueprint.Layout.generate_layout",
    "parameterSpec" : {
        "WorkflowName" : {
            "type": "String",
            "collection": false,
            "description": "Name for the workflow."
        },
        "S3Paths" : {
            "type": "S3Uri",
            "collection": true,
            "description": "List of Amazon S3 paths for data ingestion."
        },
        "PassRole" : {
            "type": "IAMRoleName",
            "collection": false,
            "description": "Choose an IAM role to be used in running the job/crawler"
        },
        "TargetDatabase": {
            "type": "String",
            "collection" : false,
            "description": "Choose a database in the Data Catalog."
        },
        "TargetS3Location": {
            "type": "S3Uri",
            "collection" : false,
            "description": "Choose an Amazon S3 output path, for example: s3://<target_path>/."
        },
        "ScriptsBucket": {
            "type": "S3Bucket",
            "collection": false,
            "description": "Provide an S3 bucket name (in the same AWS Region) to store the scripts."
        }
    }
}
```

The following script in the file `Conversion.py` is the uploaded ETL script. Note that it preserves the partitioning scheme during conversion. 

```
import sys
from pyspark.sql.functions import *
from pyspark.context import SparkContext
from awsglue.transforms import *
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
import boto3

args = getResolvedOptions(sys.argv, [
    'JOB_NAME',
    'region_name',
    'database_name',
    'table_prefix',
    'output_path'])
databaseName = args['database_name']
tablePrefix = args['table_prefix']
outputPath = args['output_path']

glue = boto3.client('glue', region_name=args['region_name'])

glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session
job = Job(glue_context)
job.init(args['JOB_NAME'], args)

def get_tables(database_name, table_prefix):
    tables = []
    paginator = glue.get_paginator('get_tables')
    for page in paginator.paginate(DatabaseName=database_name, Expression=table_prefix+"*"):
        tables.extend(page['TableList'])
    return tables

for table in get_tables(databaseName, tablePrefix):
    tableName = table['Name']
    partitionList = table['PartitionKeys']
    partitionKeys = []
    for partition in partitionList:
        partitionKeys.append(partition['Name'])

    # Create DynamicFrame from Catalog
    dyf = glue_context.create_dynamic_frame.from_catalog(
        name_space=databaseName,
        table_name=tableName,
        additional_options={
            'useS3ListImplementation': True
        },
        transformation_ctx='dyf'
    )

    # Resolve choice type with make_struct
    dyf = ResolveChoice.apply(
        frame=dyf,
        choice='make_struct',
        transformation_ctx='resolvechoice_' + tableName
    )

    # Drop null fields
    dyf = DropNullFields.apply(
        frame=dyf,
        transformation_ctx="dropnullfields_" + tableName
    )

    # Write DynamicFrame to S3 in glueparquet
    sink = glue_context.getSink(
        connection_type="s3",
        path=outputPath,
        enableUpdateCatalog=True,
        partitionKeys=partitionKeys
    )
    sink.setFormat("glueparquet")

    sink.setCatalogInfo(
        catalogDatabase=databaseName,
        catalogTableName=tableName[len(tablePrefix):]
    )
    sink.writeFrame(dyf)

job.commit()
```

**Note**  
Only two Amazon S3 paths can be supplied as an input to the sample blueprint. This is because AWS Glue triggers are limited to invoking only two crawler actions.

# Testing a blueprint
<a name="developing-blueprints-testing"></a>

While you develop your code, you should perform local testing to verify that the workflow layout is correct.

Local testing doesn't generate AWS Glue jobs, crawlers, or triggers. Instead, you run the layout script locally and use the `to_json()` and `validate()` methods to print objects and find errors. These methods are available in all three classes defined in the libraries. 

There are two ways to handle the `user_params` and `system_params` arguments that AWS Glue passes to your layout function. Your test-bench code can create a dictionary of sample blueprint parameter values and pass that to the layout function as the `user_params` argument. Or, you can remove the references to `user_params` and replace them with hardcoded strings.

If your code makes use of the `region` and `accountId` properties in the `system_params` argument, you can pass in your own dictionary for `system_params`.
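For example, your test bench can define a stand-in like the following (the region and account ID are placeholders, not real account details):

```
# A stand-in for the dictionary that AWS Glue passes as system_params.
SYSTEM_PARAMS = {"region": "us-east-1", "accountId": "111122223333"}

print(SYSTEM_PARAMS["region"])
```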

**To test a blueprint**

1. Start a Python interpreter in a directory with the libraries, or load the blueprint files and the supplied libraries into your preferred integrated development environment (IDE).

1. Ensure that your code imports the supplied libraries.

1. Add code to your layout function to call `validate()` or `to_json()` on any entity or on the `Workflow` object. For example, if your code creates a `Crawler` object named `mycrawler`, you can call `validate()` as follows.

   ```
   mycrawler.validate()
   ```

   You can print `mycrawler` as follows:

   ```
   print(mycrawler.to_json())
   ```

   If you call `to_json()` on an object, there is no need to also call `validate()`, because `to_json()` calls `validate()`. 

   It is most useful to call these methods on the workflow object. Assuming that your script names the workflow object `my_workflow`, validate and print the workflow object as follows.

   ```
   print(my_workflow.to_json())
   ```

   For more information about `to_json()` and `validate()`, see [Class methods](developing-blueprints-code-classes.md#developing-blueprints-code-methods).

   You can also import `pprint` and pretty-print the workflow object, as shown in the example later in this section.

1. Run the code, fix errors, and finally remove any calls to `validate()` or `to_json()`.

**Example**  
The following example shows how to construct a dictionary of sample blueprint parameters and pass it in as the `user_params` argument to layout function `generate_compaction_workflow`. It also shows how to pretty-print the generated workflow object.  

```
from pprint import pprint
from awsglue.blueprint.workflow import *
from awsglue.blueprint.job import *
from awsglue.blueprint.crawler import *
 
USER_PARAMS = {"WorkflowName": "compaction_workflow",
               "ScriptLocation": "s3://amzn-s3-demo-bucket/scripts/threaded-compaction.py",
               "PassRole": "arn:aws:iam::111122223333:role/GlueRole-ETL",
               "DatabaseName": "cloudtrial",
               "TableName": "ct_cloudtrail",
               "CoalesceFactor": 4,
               "MaxThreadWorkers": 200}
 
 
def generate_compaction_workflow(user_params: dict, system_params: dict) -> Workflow:
    compaction_job = Job(Name=f"{user_params['WorkflowName']}_etl_job",
                         Command={"Name": "glueetl",
                                  "ScriptLocation": user_params['ScriptLocation'],
                                  "PythonVersion": "3"},
                         Role="arn:aws:iam::111122223333:role/AWSGlueServiceRoleDefault",
                         DefaultArguments={"DatabaseName": user_params['DatabaseName'],
                                           "TableName": user_params['TableName'],
                                           "CoalesceFactor": user_params['CoalesceFactor'],
                                           "max_thread_workers": user_params['MaxThreadWorkers']})
 
    catalog_target = {"CatalogTargets": [{"DatabaseName": user_params['DatabaseName'], "Tables": [user_params['TableName']]}]}
 
    compacted_files_crawler = Crawler(Name=f"{user_params['WorkflowName']}_post_crawl",
                                      Targets = catalog_target,
                                      Role=user_params['PassRole'],
                                      DependsOn={compaction_job: "SUCCEEDED"},
                                      WaitForDependencies="AND",
                                      SchemaChangePolicy={"DeleteBehavior": "LOG"})
 
    compaction_workflow = Workflow(Name=user_params['WorkflowName'],
                                   Entities=Entities(Jobs=[compaction_job],
                                                     Crawlers=[compacted_files_crawler]))
    return compaction_workflow
 
generated = generate_compaction_workflow(user_params=USER_PARAMS, system_params={})
gen_dict = generated.to_json()
 
pprint(gen_dict)
```

# Publishing a blueprint
<a name="developing-blueprints-publishing"></a>

After you develop a blueprint, you must upload it to Amazon S3. You must have write permissions on the Amazon S3 bucket that you use to publish the blueprint. You must also make sure that the AWS Glue administrator, who will register the blueprint, has read access to the Amazon S3 bucket. For the suggested AWS Identity and Access Management (IAM) permissions policies for personas and roles for AWS Glue blueprints, see [Permissions for personas and roles for AWS Glue blueprints](blueprints-personas-permissions.md).

**To publish a blueprint**

1. Create the necessary scripts, resources, and blueprint configuration file.

1. Add all files to a ZIP archive and upload the ZIP file to Amazon S3. Use an S3 bucket that is in the same Region as the Region in which users will register and run the blueprint.

   You can create a ZIP file from the command line using the following command.

   ```
   zip -r folder.zip folder
   ```

1. Add a bucket policy that grants read permissions to the desired AWS account. The following is a sample policy.

   ```
   {
     "Version": "2012-10-17",
     "Statement": [
       {
         "Effect": "Allow",
         "Principal": {
           "AWS": "arn:aws:iam::111122223333:root"
         },
         "Action": "s3:GetObject",
         "Resource": "arn:aws:s3:::my-blueprints/*"
       }
     ]
   }
   ```


1. Grant the IAM `s3:GetObject` permission on the Amazon S3 bucket to the AWS Glue administrator or to whoever will be registering blueprints. For a sample policy to grant to administrators, see [AWS Glue administrator permissions for blueprints](blueprints-personas-permissions.md#bp-persona-admin).

After you have completed local testing of your blueprint, you might also want to test it on AWS Glue. To test a blueprint on AWS Glue, you must first register it. You can limit who sees the registered blueprint by using IAM authorization, or by using separate testing accounts.

**See also:**  
[Registering a blueprint in AWS Glue](registering-blueprints.md)

# AWS Glue blueprint classes reference
<a name="developing-blueprints-code-classes"></a>

The libraries for AWS Glue blueprints define three classes that you use in your workflow layout script: `Job`, `Crawler`, and `Workflow`.

**Topics**
+ [Job class](#developing-blueprints-code-jobclass)
+ [Crawler class](#developing-blueprints-code-crawlerclass)
+ [Workflow class](#developing-blueprints-code-workflowclass)
+ [Class methods](#developing-blueprints-code-methods)

## Job class
<a name="developing-blueprints-code-jobclass"></a>

The `Job` class represents an AWS Glue ETL job.

**Mandatory constructor arguments**  
The following are mandatory constructor arguments for the `Job` class.


| Argument name | Type | Description | 
| --- | --- | --- | 
| Name | str | Name to assign to the job. AWS Glue adds a randomly generated suffix to the name to distinguish the job from those created by other blueprint runs. | 
| Role | str | Amazon Resource Name (ARN) of the role that the job should assume while executing. | 
| Command | dict | Job command, as specified in the [JobCommand structure](aws-glue-api-jobs-job.md#aws-glue-api-jobs-job-JobCommand) in the API documentation.  | 

**Optional constructor arguments**  
The following are optional constructor arguments for the `Job` class.


| Argument name | Type | Description | 
| --- | --- | --- | 
| DependsOn | dict | List of workflow entities that the job depends on. For more information, see [Using the DependsOn argument](developing-blueprints-code-layout.md#developing-blueprints-code-layout-depends-on). | 
| WaitForDependencies | str | Indicates whether the job should wait until all of the entities on which it depends have completed, or until any one of them has completed, before it runs. For more information, see [Using the WaitForDependencies argument](developing-blueprints-code-layout.md#developing-blueprints-code-layout-wait-for-dependencies). Omit if the job depends on only one entity. | 
| (Job properties) | - | Any of the job properties listed in [Job structure](aws-glue-api-jobs-job.md#aws-glue-api-jobs-job-Job) in the AWS Glue API documentation (except CreatedOn and LastModifiedOn). | 
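
As an illustration, a `Job` entity with some optional properties might be constructed as follows. The job name, role ARN, script location, and property values are placeholder assumptions, and the `except` branch defines a minimal stand-in class so the sketch can run outside an AWS Glue environment.

```python
# Sketch: constructing a Job entity. All names, the role ARN, and the S3
# script path are placeholders, not values from this guide.
try:
    from awsglue.blueprint.job import Job
except ImportError:
    # Minimal stand-in so the sketch runs outside an AWS Glue environment.
    class Job:
        def __init__(self, **kwargs):
            self.__dict__.update(kwargs)

transform_job = Job(
    Name="Transform",  # AWS Glue appends a random suffix at workflow creation
    Role="arn:aws:iam::111122223333:role/AWSGlueServiceRoleDefault",
    Command={"Name": "glueetl",
             "ScriptLocation": "s3://my-bucket/scripts/transform.py",
             "PythonVersion": "3"},
    WorkerType="G.1X",       # optional property from the Job structure
    NumberOfWorkers=5)       # optional property from the Job structure
```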

## Crawler class
<a name="developing-blueprints-code-crawlerclass"></a>

The `Crawler` class represents an AWS Glue crawler.

**Mandatory constructor arguments**  
The following are mandatory constructor arguments for the `Crawler` class.


| Argument name | Type | Description | 
| --- | --- | --- | 
| Name | str | Name to assign to the crawler. AWS Glue adds a randomly generated suffix to the name to distinguish the crawler from those created by other blueprint runs. | 
| Role | str | ARN of the role that the crawler should assume while running. | 
| Targets | dict | Collection of targets to crawl. Targets class constructor arguments are defined in the [CrawlerTargets structure](aws-glue-api-crawler-crawling.md#aws-glue-api-crawler-crawling-CrawlerTargets) in the API documentation. All Targets constructor arguments are optional, but you must pass at least one.  | 

**Optional constructor arguments**  
The following are optional constructor arguments for the `Crawler` class.


| Argument name | Type | Description | 
| --- | --- | --- | 
| DependsOn | dict | List of workflow entities that the crawler depends on. For more information, see [Using the DependsOn argument](developing-blueprints-code-layout.md#developing-blueprints-code-layout-depends-on). | 
| WaitForDependencies | str | Indicates whether the crawler should wait until all of the entities on which it depends have completed, or until any one of them has completed, before it runs. For more information, see [Using the WaitForDependencies argument](developing-blueprints-code-layout.md#developing-blueprints-code-layout-wait-for-dependencies). Omit if the crawler depends on only one entity. | 
| (Crawler properties) | - | Any of the crawler properties listed in [Crawler structure](aws-glue-api-crawler-crawling.md#aws-glue-api-crawler-crawling-Crawler) in the AWS Glue API documentation, with the following exceptions:[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/developing-blueprints-code-classes.html) | 
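
For example, a `Crawler` that runs only after a job succeeds might be constructed as follows. All names, the role ARN, and the S3 path are placeholder assumptions, and the `except` branch defines minimal stand-in classes so the sketch can run outside an AWS Glue environment.

```python
# Sketch: a Crawler that depends on a Job. All names, the role ARN, and the
# S3 paths are placeholders.
try:
    from awsglue.blueprint.job import Job
    from awsglue.blueprint.crawler import Crawler
except ImportError:
    # Minimal stand-ins so the sketch runs outside an AWS Glue environment.
    class _Entity:
        def __init__(self, **kwargs):
            self.__dict__.update(kwargs)
    class Job(_Entity):
        pass
    class Crawler(_Entity):
        pass

etl_job = Job(Name="Etl",
              Role="arn:aws:iam::111122223333:role/AWSGlueServiceRoleDefault",
              Command={"Name": "glueetl",
                       "ScriptLocation": "s3://my-bucket/scripts/etl.py"})

post_crawler = Crawler(
    Name="PostCrawl",
    Role="arn:aws:iam::111122223333:role/AWSGlueServiceRoleDefault",
    # At least one target is required; here a single S3 target is assumed.
    Targets={"S3Targets": [{"Path": "s3://my-bucket/output/"}]},
    DependsOn={etl_job: "SUCCEEDED"},  # run only after the job succeeds
    SchemaChangePolicy={"DeleteBehavior": "LOG"})
```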

## Workflow class
<a name="developing-blueprints-code-workflowclass"></a>

The `Workflow` class represents an AWS Glue workflow. The workflow layout script returns a `Workflow` object. AWS Glue creates a workflow based on this object.

**Mandatory constructor arguments**  
The following are mandatory constructor arguments for the `Workflow` class.


| Argument name | Type | Description | 
| --- | --- | --- | 
| Name | str | Name to assign to the workflow. | 
| Entities | Entities | A collection of entities (jobs and crawlers) to include in the workflow. The Entities class constructor accepts a Jobs argument, which is a list of Job objects, and a Crawlers argument, which is a list of Crawler objects. | 

**Optional constructor arguments**  
The following are optional constructor arguments for the `Workflow` class.


| Argument name | Type | Description | 
| --- | --- | --- | 
| Description | str | See [Workflow structure](aws-glue-api-workflow.md#aws-glue-api-workflow-Workflow). | 
| DefaultRunProperties | dict | See [Workflow structure](aws-glue-api-workflow.md#aws-glue-api-workflow-Workflow). | 
| OnSchedule | str | A cron expression. | 
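
Putting these together, a scheduled workflow might be assembled as follows. The workflow name, role ARN, script path, and cron expression are placeholder assumptions, and the `except` branch defines minimal stand-in classes so the sketch can run outside an AWS Glue environment.

```python
# Sketch: assembling a Workflow from entities and scheduling it with a cron
# expression. All names and the role ARN are placeholders.
try:
    from awsglue.blueprint.job import Job
    from awsglue.blueprint.workflow import Workflow, Entities
except ImportError:
    # Minimal stand-ins so the sketch runs outside an AWS Glue environment.
    class Job:
        def __init__(self, **kwargs):
            self.__dict__.update(kwargs)
    class Entities:
        def __init__(self, Jobs=None, Crawlers=None):
            self.Jobs = Jobs or []
            self.Crawlers = Crawlers or []
    class Workflow:
        def __init__(self, **kwargs):
            self.__dict__.update(kwargs)

nightly_job = Job(Name="Nightly",
                  Role="arn:aws:iam::111122223333:role/AWSGlueServiceRoleDefault",
                  Command={"Name": "glueetl",
                           "ScriptLocation": "s3://my-bucket/scripts/nightly.py"})

nightly_workflow = Workflow(
    Name="NightlyWorkflow",
    Entities=Entities(Jobs=[nightly_job], Crawlers=[]),
    OnSchedule="cron(0 2 * * ? *)")  # start at 02:00 UTC every day
```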

## Class methods
<a name="developing-blueprints-code-methods"></a>

All three classes include the following methods.

**validate()**  
Validates the properties of the object. If errors are found, it outputs a message and exits; if there are no errors, it produces no output. For the `Workflow` class, `validate()` also calls itself on every entity in the workflow.

**to\_json()**  
Serializes the object to JSON. Also calls `validate()`. For the `Workflow` class, the JSON object includes job and crawler lists, and a list of triggers generated by the job and crawler dependency specifications.

# Blueprint samples
<a name="developing-blueprints-samples"></a>

There are a number of sample blueprint projects available in the [AWS Glue blueprint GitHub repository](https://github.com/awslabs/aws-glue-blueprint-libs/tree/master/samples). These samples are for reference only and are not intended for production use.

The sample projects are as follows:
+ Compaction: This blueprint creates a job that compacts input files into larger chunks based on the desired file size.
+ Conversion: This blueprint converts input files in various standard file formats into Apache Parquet format, which is optimized for analytic workloads.
+ Crawling Amazon S3 locations: This blueprint crawls multiple Amazon S3 locations to add metadata tables to the Data Catalog.
+ Custom connection to Data Catalog: This blueprint accesses data stores using AWS Glue custom connectors, reads the records, and populates table definitions in the AWS Glue Data Catalog based on the record schema.
+ Encoding: This blueprint converts non-UTF-8 encoded files into UTF-8 encoded files.
+ Partitioning: This blueprint creates a partitioning job that places output files into partitions based on specific partition keys.
+ Importing Amazon S3 data into a DynamoDB table: This blueprint imports data from Amazon S3 into a DynamoDB table.
+ Standard table to governed: This blueprint imports an AWS Glue Data Catalog table into a Lake Formation governed table.