

# Building AWS Glue jobs with interactive sessions

 Data engineers can author AWS Glue jobs faster and more easily than before using interactive sessions in AWS Glue. 

**Topics**
+ [Overview of AWS Glue interactive sessions](#interactive-sessions-overview)
+ [Getting started with AWS Glue interactive sessions](interactive-sessions.md)
+ [Configuring AWS Glue interactive sessions for Jupyter and AWS Glue Studio notebooks](interactive-sessions-magics.md)
+ [Converting a script or notebook into an AWS Glue job](interactive-sessions-convert.md)
+ [Working with streaming operations in AWS Glue interactive sessions](interactive-sessions-streaming.md)
+ [AWS Glue interactive session pricing](interactive-sessions-session-pricing.md)
+ [Developing and testing AWS Glue job scripts locally](aws-glue-programming-etl-libraries.md)
+ [Development endpoints](development.md)

## Overview of AWS Glue interactive sessions


 With AWS Glue interactive sessions, you can rapidly build, test, and run data preparation and analytics applications. Interactive sessions provide a programmatic and visual interface for building and testing extract, transform, and load (ETL) scripts for data preparation. Interactive sessions run Apache Spark analytics applications and provide on-demand access to a remote Spark runtime environment. AWS Glue transparently manages serverless Spark for these interactive sessions. 

 Interactive sessions are flexible, so you can build and test your applications from the environment of your choice. You can create and work with interactive sessions through the AWS Command Line Interface and the API. You can use Jupyter-compatible notebooks to visually author and test your notebook scripts. Interactive sessions provide an open-source Jupyter kernel that integrates almost anywhere that Jupyter does, including IDEs such as PyCharm, IntelliJ, and VS Code. This enables you to author code in your local environment and run it seamlessly on the interactive sessions backend. 

 Using the interactive sessions API, customers can programmatically run applications that use Apache Spark analytics without having to manage Spark infrastructure. You can run one or more Spark statements within a single interactive session. 

 Interactive sessions therefore provide a faster, cheaper, and more flexible way to build and run data preparation and analytics applications. To learn how to use interactive sessions, see the documentation in this section. For the magics available in a session, see [Magics supported by AWS Glue](https://docs.aws.amazon.com/glue/latest/dg/interactive-sessions-magics.html#interactive-sessions-magics2). 
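As a sketch of what the programmatic flow can look like with the AWS SDK for Python (Boto3): the role ARN, Region, session ID, and the `build_session_request` helper below are illustrative placeholders, not part of the AWS Glue API.

```python
# Placeholder values -- substitute your own role ARN and Region.
ROLE_ARN = "arn:aws:iam::123456789012:role/AWSGlueServiceRole-example"
REGION = "us-east-1"

def build_session_request(session_id, role_arn, workers=2):
    """Assemble the parameters for the CreateSession API call."""
    return {
        "Id": session_id,
        "Role": role_arn,
        # "glueetl" selects a Spark ETL session.
        "Command": {"Name": "glueetl", "PythonVersion": "3"},
        "WorkerType": "G.1X",
        "NumberOfWorkers": workers,
        "IdleTimeout": 60,  # minutes of inactivity before the session stops
    }

def run_statement_in_session():
    """Create a session, run one Spark statement against it, then clean up."""
    import boto3  # AWS SDK for Python

    glue = boto3.client("glue", region_name=REGION)
    glue.create_session(**build_session_request("example-session", ROLE_ARN))
    # In practice, poll get_session until the session status is READY
    # before running statements.
    statement = glue.run_statement(
        SessionId="example-session",
        Code="print(spark.version)",
    )
    print("Started statement:", statement["Id"])
    glue.delete_session(Id="example-session")
```

Calling `run_statement_in_session` requires valid AWS credentials and an execution role set up as described later in this section.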

### Limitations

+ Job bookmarks are not supported in interactive sessions.
+  Creating notebook jobs using the AWS Command Line Interface is not supported. 
+  AWS Glue Studio notebooks do not support Scala. 

# Getting started with AWS Glue interactive sessions

These sections describe how to run AWS Glue interactive sessions locally.

## Prerequisites for setting up interactive sessions locally


The following are prerequisites for installing interactive sessions:
+ Supported Python versions are 3.6 to 3.10. 
+  See the sections below for macOS/Linux and Windows instructions. 
+  Review the [interactive sessions pricing](https://docs.aws.amazon.com/glue/latest/dg/interactive-sessions-session-pricing.html) documentation to understand the cost structure. 

## Installing Jupyter and AWS Glue interactive sessions Jupyter kernels


 Use the following to install the kernel locally. 

 The `install-glue-kernels` command installs the Jupyter kernelspecs for both the PySpark and Spark kernels, and installs the kernel logos in the correct directory. 

```
pip3 install --upgrade jupyter boto3 aws-glue-sessions
```

```
install-glue-kernels
```

## Running Jupyter


 To run Jupyter Notebook, complete the following steps. 

1.  Run the following command to launch Jupyter Notebook. 

   ```
   jupyter notebook
   ```

1.  Choose **New**, and then choose one of the AWS Glue kernels to begin coding against AWS Glue. 
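Once a kernel is selected, a first cell might look like the following (the role ARN and Region are placeholders); running any code statement starts the session:

```
%iam_role arn:aws:iam::123456789012:role/AWSGlueServiceRole-example
%region us-east-1
%glue_version 3.0
print("session started")
```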

## Configuring session credentials and region


### MacOS/Linux instructions


 AWS Glue interactive sessions require the same IAM permissions as AWS Glue jobs and development endpoints. Specify the role used with interactive sessions in one of two ways: 

1.  With the `%iam_role` and `%region` magics 

1.  With an additional line in `~/.aws/config` 

 **Configuring a session role with magic** 

 In the first cell that you run, enter `%iam_role <YourGlueServiceRole>`. 

 **Configuring a session role with `~/.aws/config`** 

 The AWS Glue service role for interactive sessions can be specified either in the notebook itself or in your AWS CLI config. If you have a role that you typically use with AWS Glue jobs, use that role. If you don't have one, follow [Configuring IAM permissions for AWS Glue](https://docs.aws.amazon.com/glue/latest/dg/configure-iam-for-glue.html) to set one up. 

 To set this role as the default role for interactive sessions: 

1.  With a text editor, open `~/.aws/config`. 

1.  Look for the profile you use for AWS Glue. If you don't use a profile, use the `[Default]` profile. 

1.  Add a line to the profile for the role you intend to use, such as `glue_role_arn=<AWSGlueServiceRole>`. 

1.  [Optional]: If your profile does not have a default region set, we recommend adding one with `region=us-east-1`, replacing `us-east-1` with your desired region. 

1.  Save the config. 
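After these steps, the resulting profile might look like the following (the role ARN and Region are placeholders):

```
[default]
region=us-east-1
glue_role_arn=arn:aws:iam::123456789012:role/AWSGlueServiceRole-example
```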

 For more information, see [Interactive sessions with IAM](glue-is-security.md). 

### Windows instructions


 AWS Glue interactive sessions require the same IAM permissions as AWS Glue jobs and development endpoints. Specify the role used with interactive sessions in one of two ways: 

1.  With the `%iam_role` and `%region` magics 

1.  With an additional line in `~/.aws/config` 

 **Configuring a session role with magic** 

 In the first cell that you run, enter `%iam_role <YourGlueServiceRole>`. 

 **Configuring a session role with `~/.aws/config`** 

 The AWS Glue service role for interactive sessions can be specified either in the notebook itself or in your AWS CLI config. If you have a role that you typically use with AWS Glue jobs, use that role. If you don't have one, follow [Setting up IAM permissions for AWS Glue](https://docs.aws.amazon.com/glue/latest/dg/configure-iam-for-glue.html) to set one up. 

 To set this role as the default role for interactive sessions: 

1.  With a text editor, open `~/.aws/config`. 

1.  Look for the profile you use for AWS Glue. If you don't use a profile, use the `[Default]` profile. 

1.  Add a line to the profile for the role you intend to use, such as `glue_role_arn=<AWSGlueServiceRole>`. 

1.  [Optional]: If your profile does not have a default region set, we recommend adding one with `region=us-east-1`, replacing `us-east-1` with your desired region. 

1.  Save the config. 

 For more information, see [Interactive sessions with IAM](glue-is-security.md). 

## Upgrading from the interactive sessions preview


 The kernels were renamed when version 0.27 was released. To clean up the preview versions of the kernels, run the following from a terminal or PowerShell. 

**Note**  
If you are a part of any other AWS Glue preview that requires a custom service model, removing the kernel will remove the custom service model.

```
# Remove Old Glue Kernels
jupyter kernelspec remove glue_python_kernel
jupyter kernelspec remove glue_scala_kernel

# Remove Custom Model
cd ~/.aws/models
rm -rf glue/
```

# Using interactive sessions with SageMaker AI Studio


 AWS Glue interactive sessions is an on-demand, serverless Apache Spark runtime environment that data scientists and engineers can use to rapidly build, test, and run data preparation and analytics applications. You can initiate an AWS Glue interactive session by starting an Amazon SageMaker AI Studio Classic notebook. 

For more information, see [Prepare data using AWS Glue interactive sessions](https://docs.aws.amazon.com/sagemaker/latest/dg/studio-notebooks-glue.html). 

# Using interactive sessions with Microsoft Visual Studio Code


 **Prerequisites** 
+  Install AWS Glue interactive sessions and verify it works with Jupyter Notebook. 
+  Download and install Visual Studio Code with Jupyter. For details, see [Jupyter Notebook in VS Code](https://code.visualstudio.com/docs/datascience/jupyter-notebooks). 

**To get started with interactive sessions with VSCode**

1.  Disable Jupyter AutoStart in VS Code. 

    In Visual Studio Code, Jupyter kernels auto-start, which prevents your magics from taking effect because the session will already be started. To disable **Auto Start** on Windows, go to **File** > **Preferences** > **Extensions** > **Jupyter**, right-click Jupyter, and then choose **Extension Settings**. 

    On MacOS, go to **Code** > **Settings** > **Extensions** > **Jupyter** > right-click on Jupyter then choose **Extension Settings**. 

    Scroll down until you see **Jupyter: Disable Jupyter Auto Start**. Check the box "When true, disables Jupyter from being automatically started for you. You must instead run a cell to start Jupyter."   
![\[The screenshot shows the checkbox enabled in for the Jupyter Extension in VS Code.\]](http://docs.aws.amazon.com/glue/latest/dg/images/IS_vscode_step1.png)

1.  Go to **File** > **New File** > **Save** to save the file with a name of your choice and an `.ipynb` extension, or select **Jupyter** under **Select a language** and save the file.   
![\[The screenshot shows the file being saved with a new name.\]](http://docs.aws.amazon.com/glue/latest/dg/images/IS_vscode_step2.gif)

1.  Double-click the file. The notebook opens in the Jupyter interface.   
![\[The screenshot shows the open notebook.\]](http://docs.aws.amazon.com/glue/latest/dg/images/IS_vscode_step3.png)

1.  On Windows, when you first create a file, by default no kernel is selected. Click on **Select Kernel** and a list of available kernels is displayed. Choose **Glue PySpark**. 

    On macOS, if you do not see the **Glue PySpark** kernel, try the following steps: 

   1. Run a local Jupyter session to obtain the URL. 

      For example, run the following command to launch Jupyter Notebook.

      ```
      jupyter notebook
      ```

      When the notebook first runs, you will see a URL that looks like `http://localhost:8888/?token=3398XXXXXXXXXXXXXXXX`.

      Copy the URL.

   1. In VS Code, click the current kernel, then **Select Another Kernel...**, then select **Existing Jupyter Server...**. Paste the URL you copied from the step above.

      If you receive an error message, see the [ VS Code Jupyter wiki ](https://github.com/microsoft/vscode-jupyter/wiki/Connecting-to-a-remote-Jupyter-server-from-vscode.dev). 

   1. If successful, this will set the kernel to **Glue PySpark**.  
![\[The screenshot shows the Select Kernel button highlighted.\]](http://docs.aws.amazon.com/glue/latest/dg/images/IS_vscode_step4a.png)

    Choose the **Glue PySpark** or **Glue Spark** kernel (for Python and Scala respectively).   
![\[The screenshot shows the selection for AWS Glue PySpark.\]](http://docs.aws.amazon.com/glue/latest/dg/images/IS_vscode_step4b.png)

    If you don't see **AWS Glue PySpark** and **AWS Glue Spark** kernels in the drop-down list, please ensure you have installed the AWS Glue kernel in the step above, or that your `python.defaultInterpreterPath` setting in Visual Studio Code is correct. For more information, see [ python.defaultInterpreterPath setting description ](https://github.com/microsoft/vscode-python/wiki/Setting-descriptions#pythondefaultinterpreterpath). 

1.  Create an AWS Glue Interactive Session. Proceed to create a session in the same manner as you did in Jupyter Notebook. Specify any magics at the top of your first cell and run a statement of code. 

# Interactive sessions with IAM


 These sections describe security considerations for AWS Glue interactive sessions. 

**Topics**
+ [IAM principals used with interactive sessions](#glue-is-security-iam-principals)
+ [Setting up a client principal](#glue-is-client-principals)
+ [Setting up a runtime role](#glue-is-runtime-role)
+ [Make your session private with TagOnCreate](#glue-is-tagoncreate)
+ [IAM policy considerations](#glue-is-security-iam-managed-policy)

## IAM principals used with interactive sessions


 Two IAM principals are used with AWS Glue interactive sessions. 
+  **Client principal**: The client principal (either a user or a role) authorizes API operations for interactive sessions from an AWS Glue client that's configured with the principal's identity-based credentials. For example, this could be an IAM role that you typically use to access the AWS Glue console. This could also be a role given to a user in IAM whose credentials are used for the AWS Command Line Interface, or an AWS Glue client used by the interactive sessions Jupyter kernel. 
+  **Runtime role**: The runtime role is an IAM role that the client principal passes to interactive sessions API operations. AWS Glue uses this role to run statements in your session. For example, this role could be the one used for running AWS Glue ETL jobs. 

   For more information, see [Setting up a runtime role](#glue-is-runtime-role). 

## Setting up a client principal


 You must attach an identity policy to the client principal to allow it to call the interactive sessions API. This principal must have `iam:PassRole` access to the execution role that you pass to interactive sessions API operations, such as `CreateSession`. For example, you can attach the **AWSGlueConsoleFullAccess** managed policy to an IAM role, which allows users in your account with that policy attached to access all sessions created in the account (for example, to run or cancel statements). 

 If you would like to protect your session and make it private to only certain IAM roles, such as the roles associated with the user who created the session, you can use the tag-based authorization control in AWS Glue interactive sessions called TagOnCreate. For details on how an owner-tag-based scoped-down managed policy can make your session private with TagOnCreate, see [Make your session private with TagOnCreate](#glue-is-tagoncreate). For more information on identity-based policies, see [Identity-based policies for AWS Glue](https://docs.aws.amazon.com/glue/latest/dg/security_iam_service-with-iam.html#security_iam_service-with-iam-id-based-policies). 

## Setting up a runtime role


 You must pass an IAM role to the CreateSession API operation to allow AWS Glue to assume it and run statements in interactive sessions. The role should have the same IAM permissions as those required to run a typical AWS Glue job. For example, you can create a service role using the **AWSGlueServiceRole** policy that allows AWS Glue to call AWS services on your behalf. If you use the AWS Glue console, it automatically creates a service role on your behalf or uses an existing one. You can also create your own IAM role and attach your own IAM policy to allow similar permissions. 

 If you would like to protect your session and make it private to only the user who created it, you can use the tag-based authorization control in AWS Glue interactive sessions called TagOnCreate. For details on how an owner-tag-based scoped-down managed policy can make your session private with TagOnCreate, see [Make your session private with TagOnCreate](#glue-is-tagoncreate). For more information on identity-based policies, see [Identity-based policies for AWS Glue](security_iam_service-with-iam.md#security_iam_service-with-iam-id-based-policies). If you are creating the execution role yourself from the IAM console and you want to make your session private with the TagOnCreate feature, follow the steps below. 

1.  Create an IAM role with role type set to `Glue`. 

1.  Attach this AWS Glue managed policy: *AwsGlueSessionUserRestrictedServiceRole* 

1.  Prefix the role name with the policy name *AwsGlueSessionUserRestrictedServiceRole*. For example, you can create a role with name *AwsGlueSessionUserRestrictedServiceRole-myrole* and attach AWS Glue managed policy *AwsGlueSessionUserRestrictedServiceRole*. 

1.  Attach a trust policy like the following to allow AWS Glue to assume the role: 

   ```
   {
     "Version": "2012-10-17",
     "Statement": [
       {
         "Effect": "Allow",
         "Principal": {
           "Service": [
             "glue.amazonaws.com"
           ]
         },
         "Action": [
           "sts:AssumeRole"
         ]
       }
     ]
   }
   ```


 For an interactive sessions Jupyter kernel, you can specify the `iam_role` key in your AWS Command Line Interface profile. For more information, see [Configuring sessions with ~/.aws/config](https://docs.aws.amazon.com/glue/latest/ug/interactive-sessions-magics.html#interactive-sessions-named-profiles). If you're interacting with interactive sessions using an AWS Glue notebook, you can pass the execution role to the `%iam_role` magic in the first cell that you run. 

## Make your session private with TagOnCreate


 AWS Glue interactive sessions support tagging and tag-based authorization control (TBAC) for interactive sessions as a named resource. In addition to TBAC using the TagResource and UntagResource APIs, interactive sessions support the TagOnCreate feature, which tags a session with a given tag only during session creation with the CreateSession operation. Those tags are removed on DeleteSession (UntagOnDelete). 

 TagOnCreate offers a powerful security mechanism to make your session private to the creator of the session. For example, you can attach an IAM policy with an "owner" RequestTag and a value of `${aws:userId}` to a client principal (such as a user) in order to allow creating a session only if an "owner" tag with a value matching the caller's userId is provided in the CreateSession request. This policy allows AWS Glue interactive sessions to create a session resource and tag the session with the userId tag only during session creation. In addition, you can scope down access (such as running statements) to your session to only the creator of the session (that is, the owner tag with value `${aws:userId}`) by attaching an IAM policy with an "owner" ResourceTag to the execution role you pass in during CreateSession. 
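A sketch of such a client-principal policy statement (illustrative only, not the exact managed policy):

```
{
  "Effect": "Allow",
  "Action": "glue:CreateSession",
  "Resource": "arn:aws:glue:*:*:session/*",
  "Condition": {
    "StringEquals": {
      "aws:RequestTag/owner": "${aws:userId}"
    }
  }
}
```

A matching statement with the `aws:ResourceTag/owner` condition key on the session actions (such as `glue:RunStatement`) then limits interaction with the session to its creator.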

 To make it easier to use the TagOnCreate feature to make a session private to its creator, AWS Glue provides specialized managed policies and service roles. 

 If you want to create an AWS Glue interactive session using an IAM AssumeRole principal (that is, using credentials vended by assuming an IAM role) and you want to make the session private to its creator, use policies similar to the **AWSGlueSessionUserRestrictedNotebookPolicy** and **AWSGlueSessionUserRestrictedNotebookServiceRole**, respectively. These policies allow AWS Glue to use `${aws:PrincipalTag}` to extract the owner tag value, which requires you to pass a userId tag with value `${aws:userId}` as a SessionTag in the assume-role credentials. See [ID session tags](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_session-tags.html). If you are using an Amazon EC2 instance with an instance profile vending the credentials and you want to create a session or interact with the session from within the Amazon EC2 instance, you must also pass a userId tag with value `${aws:userId}` as a SessionTag in the assume-role credentials. 

 For example, if you are creating a session using an IAM AssumeRole principal credential and you want to make your session private with the TagOnCreate feature, follow the steps below. 

1.  Create a runtime role yourself from the IAM console. Attach the AWS Glue managed policy *AwsGlueSessionUserRestrictedNotebookServiceRole* and prefix the role name with the policy name *AwsGlueSessionUserRestrictedNotebookServiceRole*. For example, you can create a role named *AwsGlueSessionUserRestrictedNotebookServiceRole-myrole* and attach the AWS Glue managed policy *AwsGlueSessionUserRestrictedNotebookServiceRole*. 

1.  Attach a trust policy like the following to allow AWS Glue to assume the role above. 

   ```
   {
     "Version": "2012-10-17",
     "Statement": [
       {
         "Effect": "Allow",
         "Principal": {
           "Service": [
             "glue.amazonaws.com"
           ]
         },
         "Action": [
           "sts:AssumeRole"
         ]
       }
     ]
   }
   ```


1.  Create another role named with the prefix *AwsGlueSessionUserRestrictedNotebookPolicy* and attach the AWS Glue managed policy *AwsGlueSessionUserRestrictedNotebookPolicy* to make the session private. In addition to the managed policy, attach the following inline policy to allow `iam:PassRole` for the role you created in step 1. 

   ```
   {
     "Version": "2012-10-17",
     "Statement": [
       {
         "Effect": "Allow",
         "Action": [
           "iam:PassRole"
         ],
         "Resource": [
           "arn:aws:iam::*:role/AwsGlueSessionUserRestrictedNotebookServiceRole*"
         ],
         "Condition": {
           "StringLike": {
             "iam:PassedToService": [
               "glue.amazonaws.com"
             ]
           }
         }
       }
     ]
   }
   ```


1.  Attach a trust policy like the following to the role above to allow AWS Glue to assume it. 

   ```
   {
     "Version": "2012-10-17",
     "Statement": [
       {
         "Effect": "Allow",
         "Principal": {
           "Service": [
             "glue.amazonaws.com"
           ]
         },
         "Action": [
           "sts:AssumeRole",
           "sts:TagSession"
         ]
       }
     ]
   }
   ```

**Note**  
 Optionally, you can use a single role (for example, a notebook role) and attach both of the managed policies above, *AwsGlueSessionUserRestrictedNotebookServiceRole* and *AwsGlueSessionUserRestrictedNotebookPolicy*. Also attach the additional inline policy to allow `iam:PassRole` for your role to AWS Glue. Finally, attach the trust policy above to allow `sts:AssumeRole` and `sts:TagSession`. 

### AWSGlueSessionUserRestrictedNotebookPolicy


 The **AWSGlueSessionUserRestrictedNotebookPolicy** provides access to create an AWS Glue interactive session from a notebook only if a tag with key "owner" and a value matching the AWS user ID of the principal (user or role) is provided. For more information, see [Where you can use policy variables](https://docs.aws.amazon.com/IAM/latest/UserGuide/reference_policies_variables.html#policy-vars-infotouse). This policy is attached to the principal (user or role) that creates AWS Glue interactive session notebooks from AWS Glue Studio. It also permits sufficient access for the AWS Glue Studio notebook to interact with interactive session resources that are created with an "owner" tag value matching the AWS user ID of the principal. This policy denies permission to change or remove the "owner" tag from an AWS Glue session resource after the session is created. 

### AWSGlueSessionUserRestrictedNotebookServiceRole


 The **AWSGlueSessionUserRestrictedNotebookServiceRole** provides sufficient access for the AWS Glue Studio notebook to interact with interactive session resources that are created with an "owner" tag value matching the AWS user ID of the principal (user or role) that created the notebook. For more information, see [Where you can use policy variables](https://docs.aws.amazon.com/IAM/latest/UserGuide/reference_policies_variables.html#policy-vars-infotouse). This service-role policy is attached to the role that is passed as a magic to a notebook or passed as the execution role to the CreateSession API. The policy also permits creating an AWS Glue interactive session from a notebook only if a tag with key "owner" and a value matching the AWS user ID of the principal is provided, and it denies permission to change or remove the "owner" tag from an AWS Glue session resource after the session is created. It also includes permissions for reading from and writing to Amazon S3 buckets, writing CloudWatch logs, and creating and deleting tags for Amazon EC2 resources used by AWS Glue. 

### Make your session private with user policies


You can attach the **AWSGlueSessionUserRestrictedPolicy** to the IAM roles attached to each of the users in your account to restrict them to creating sessions only with an owner tag whose value matches their own `${aws:userId}`. Instead of the **AWSGlueSessionUserRestrictedNotebookPolicy** and **AWSGlueSessionUserRestrictedNotebookServiceRole**, use policies similar to the **AWSGlueSessionUserRestrictedPolicy** and **AWSGlueSessionUserRestrictedServiceRole**, respectively. For more information, see [Using identity-based policies](https://docs.aws.amazon.com/glue/latest/dg/using-identity-based-policies.html). This policy scopes down access to a session to its creator only: the user whose `${aws:userId}` matches the owner tag that was applied when the session was created. If you created the execution role yourself using the IAM console by following the steps in [Setting up a runtime role](#glue-is-runtime-role), then in addition to attaching the **AwsGlueSessionUserRestrictedPolicy** managed policy, attach the following inline policy to each of the users in your account to allow `iam:PassRole` for the execution role you created earlier. 

```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "iam:PassRole"
      ],
      "Resource": [
        "arn:aws:iam::*:role/AwsGlueSessionUserRestrictedServiceRole*"
      ],
      "Condition": {
        "StringLike": {
          "iam:PassedToService": [
            "glue.amazonaws.com"
          ]
        }
      }
    }
  ]
}
```


#### AWSGlueSessionUserRestrictedPolicy


 The **AWSGlueSessionUserRestrictedPolicy** provides access to create an AWS Glue interactive session using the CreateSession API only if a tag with key "owner" and a value matching the caller's AWS user ID is provided. This identity policy is attached to the user that invokes the CreateSession API. The policy also permits interacting with interactive session resources that were created with an "owner" tag whose value matches the caller's AWS user ID, and it denies permission to change or remove the "owner" tag from an AWS Glue session resource after the session is created. 

#### AWSGlueSessionUserRestrictedServiceRole


 The **AWSGlueSessionUserRestrictedServiceRole** provides full access to all AWS Glue resources except for sessions and allows users to create and use only the interactive sessions that are associated with the user. This policy also includes other permissions needed by AWS Glue to manage Glue resources in other AWS services. The policy also allows adding tags to AWS Glue resources in other AWS services. 

## IAM policy considerations


 Interactive sessions are IAM resources in AWS Glue, so access to and interaction with a session are governed by IAM policies. Based on the IAM policies attached to a client principal or execution role configured by an admin, a client principal (user or role) is able to create new sessions and interact with its own and other sessions. 

 If an admin has attached an IAM policy such as AWSGlueConsoleFullAccess or AWSGlueServiceRole that allows access to all AWS Glue resources in the account, client principals will be able to collaborate with each other. For example, one user will be able to interact with sessions created by other users if the policies allow it. 

 If you'd like to configure a policy tailored to your specific needs, see the [IAM documentation about configuring resources for a policy](https://docs.aws.amazon.com/IAM/latest/UserGuide/access_policies_identity-vs-resource.html). For example, to isolate sessions that belong to a user, you can use the TagOnCreate feature supported by AWS Glue interactive sessions. See [Make your session private with TagOnCreate](#glue-is-tagoncreate). 

 Interactive sessions support limiting session creation based on certain VPC conditions. See [Control policies that control settings using condition keys](security_iam_id-based-policy-examples.md#glue-identity-based-policy-condition-key-vpc). 

# Configuring AWS Glue interactive sessions for Jupyter and AWS Glue Studio notebooks


## Introduction to Jupyter Magics


 Jupyter Magics are commands that can be run at the beginning of a cell or as a whole cell body. Magics start with `%` for line-magics and `%%` for cell-magics. Line-magics such as `%region` and `%connections` can be run with multiple magics in a cell, or with code included in the cell body like the following example. 

```
%region us-east-2
%connections my_rds_connection
dy_f = glue_context.create_dynamic_frame.from_catalog(database='rds_tables', table_name='sales_table')
```

 Cell magics must use the entire cell and can have the command span multiple lines. An example of `%%sql` is below. 

```
%%sql
select * from rds_tables.sales_table
```

## Magics supported by AWS Glue interactive sessions for Jupyter
<a name="interactive-sessions-magics2"></a>

 The following are magics that you can use with AWS Glue interactive sessions for Jupyter notebooks. 

 **Sessions magics** 


| Name | Type | Description | 
| --- | --- | --- | 
| %help | n/a | Return a list of descriptions and input types for all magic commands. | 
| %profile | String | Specify a profile in your AWS configuration to use as the credentials provider. | 
| %region | String | Specify the AWS Region in which to initialize a session. Default from `~/.aws/config`. Example: `%region us-west-1` | 
| %idle_timeout | Int | The number of minutes of inactivity after which a session will time out after a cell has been run. The default idle timeout for Spark ETL sessions is the default timeout, 2880 minutes (48 hours). For other session types, consult the documentation for that session type. Example: `%idle_timeout 3000` | 
| %session_id | n/a | Return the session ID for the running session. | 
| %session_id_prefix | String | Define a string that will precede all session IDs in the format **[session_id_prefix]-[session_id]**. If a session ID is not provided, a random UUID will be generated. This magic is not supported when you run a Jupyter notebook in AWS Glue Studio. Example: `%session_id_prefix 001` | 
| %status | n/a | Return the status of the current AWS Glue session, including its duration, configuration, and executing user/role. | 
| %stop_session | n/a | Stop the current session. | 
| %list_sessions | n/a | List all currently running sessions by name and ID. | 
| %session_type | String | Set the session type to one of Streaming, ETL, or Ray. Example: `%session_type Streaming` | 
| %glue_version | String | The version of AWS Glue to be used by this session. Example: `%glue_version 3.0` | 

 **Magics for selecting job types** 


| Name | Type | Description | 
| --- | --- | --- | 
| %streaming | String | Changes the session type to AWS Glue Streaming. | 
| %etl | String | Changes the session type to AWS Glue ETL. | 
| %glue\_ray | String | Changes the session type to AWS Glue for Ray. See [Magics supported by AWS Glue Ray interactive sessions](https://docs.aws.amazon.com/glue/latest/dg/is-using-ray-configuration).  | 

 **AWS Glue for Spark config magics** 

 The `%%configure` magic takes a JSON-formatted dictionary consisting of all configuration parameters for a session. Each parameter can be specified here or through individual magics. 


| Name | Type | Description | 
| --- | --- | --- | 
|  %%configure  |  Dictionary  |   Specify a JSON-formatted dictionary consisting of all configuration parameters for a session. Each parameter can be specified here or through individual magics.   For a list of parameters and examples on how to use `%%configure`, see [%%configure cell magic arguments](#interactive-sessions-magics-configure-arguments).   | 
| %iam\_role | String |   Specify an IAM role ARN to run your session with. Default from `~/.aws/config`.   Example: `%iam_role AWSGlueServiceRole`  | 
| %number\_of\_workers | Int |  The number of workers of a defined `worker_type` that are allocated when a job runs. `worker_type` must be set too. The default `number_of_workers` is 5. Example: `%number_of_workers 2`  | 
| %additional\_python\_modules | List |  Comma-separated list of additional Python modules to include in your cluster (from PyPI or Amazon S3). Example: `%additional_python_modules pandas, numpy`.  | 
| %%tags | String |   Adds tags to a session. Specify the tags within curly brackets ({ }). Each tag name and value is enclosed in quotation marks (" "), and tag pairs are separated by a comma (,).  <pre>%%tags<br />{"billing":"Data-Platform", "team":"analytics"}<br />                      </pre> Use the `%status` magic to view tags associated with the session. <pre>%status</pre> <pre>Session ID: <sessionId><br /> Status: READY<br /> Role: <example-role><br /> CreatedOn: 2023-05-26 11:12:17.056000-07:00<br /> GlueVersion: 3.0<br /> Job Type: glueetl<br /> Tags: {'owner':'example-owner', 'team':'analytics', 'billing':'Data-Platform'}<br /> Worker Type: G.4X<br /> Number of Workers: 5<br /> Region: us-west-2<br /> Applying the following default arguments:<br /> --glue_kernel_version 0.38.0<br /> --enable-glue-datacatalog true<br /> Arguments Passed: ['--glue_kernel_version: 0.38.0', '--enable-glue-datacatalog: true']                <br />                </pre>  | 
| %%assume\_role | Dictionary |  Specify a JSON-formatted dictionary or an IAM role ARN string to create a session for cross-account access. Example with ARN: <pre>%%assume_role<br />'arn:aws:iam::XXXXXXXXXXXX:role/AWSGlueServiceRole'<br />                </pre> Example with credentials: <pre> %%assume_role<br />{<br />    "aws_access_key_id": "XXXXXXXXXXXX",<br />    "aws_secret_access_key": "XXXXXXXXXXXX",<br />    "aws_session_token": "XXXXXXXXXXXX"<br />}</pre>  | 

### %%configure cell magic arguments


 The `%%configure` magic takes a JSON-formatted dictionary consisting of all configuration parameters for a session. Each parameter can be specified here or through individual magics. The following tables list the arguments supported by the `%%configure` cell magic. Use the `--` prefix for run arguments specified for the job. Example: 

```
%%configure
{
   "--user-jars-first": "true",
   "--enable-glue-datacatalog": "false"
}
```

 For more information on job parameters, see [Job parameters](aws-glue-programming-etl-glue-arguments.md). 

**Session Configuration**


| Parameter | Type | Description | 
| --- | --- | --- | 
| max\_retries | Int | The maximum number of times to retry this job if it fails. Example: <pre>%%configure<br />{<br />  "max_retries": "0"<br />}                      <br />                          </pre> | 
| max\_concurrent\_runs | Int | The maximum number of concurrent runs allowed for a job. Example: <pre>%%configure<br />{<br />  "max_concurrent_runs": "3"<br />}</pre> | 

**Session parameters**


| Parameter | Type | Description | 
| --- | --- | --- | 
| --enable-spark-ui | Boolean | Enable Spark UI to monitor and debug AWS Glue ETL jobs. <pre>%%configure<br />{<br />  "--enable-spark-ui": "true"<br />}</pre> | 
| --spark-event-logs-path | String | Specifies an Amazon S3 path for use with the Spark UI monitoring feature. Example: <pre>%%configure<br />{<br />  "--spark-event-logs-path": "s3://path/to/event/logs/"<br />}                           <br />                          </pre> | 
| --script\_location | String | Specifies the Amazon S3 path to a script that executes a job. Example:<pre>%%configure <br />{<br />  "script_location": "s3://new-folder-here"<br />}                            <br />                          </pre> | 
| --SECURITY\_CONFIGURATION | String | The name of an AWS Glue security configuration. Example: <pre>%%configure<br />{<br />    "--security_configuration": {<br />"encryption_type": "kms",<br />"kms_key_id": "YOUR_KMS_KEY_ARN"<br />}<br />}<br />                  </pre>  | 
| --job-language | String | The script programming language. Accepts a value of 'scala' or 'python'. Default is 'python'. Example: <pre>%%configure <br />{<br />  "--job-language": "scala"<br />}                            <br />                  </pre>  | 
| --class | String | The Scala class that serves as the entry point for your Scala script. Default is null. Example: <pre>%%configure <br />{<br />  "--class": "className"<br />}                            <br />                  </pre>  | 
| --user-jars-first | Boolean | Prioritizes the customer's extra JAR files in the classpath. Default is null. Example: <pre>%%configure <br />{<br />  "--user-jars-first": "true"<br />}                            <br />                  </pre>  | 
| --use-postgres-driver | Boolean | Prioritizes the Postgres JDBC driver in the class path to avoid a conflict with the Amazon Redshift JDBC driver. Default is null. Example: <pre>%%configure <br />{<br />  "--use-postgres-driver": "true"<br />}                            <br />                  </pre>  | 
| --extra-files | List(string) | The Amazon S3 paths to additional files, such as configuration files that AWS Glue copies to the working directory of your script before executing it. Example: <pre>%%configure <br />{<br />  "--extra-files": "s3://path/to/additional/files/"<br />}                            <br />                  </pre>  | 
| --job-bookmark-option | String | Controls the behavior of a job bookmark. Accepts a value of 'job-bookmark-enable', 'job-bookmark-disable' or 'job-bookmark-pause'. Default is 'job-bookmark-disable'. Example: <pre>%%configure<br />{<br />  "--job-bookmark-option": "job-bookmark-enable"<br />}                            <br />                  </pre>  | 
| --TempDir | String | Specifies an Amazon S3 path to a bucket that can be used as a temporary directory for the job. Default is null. Example: <pre>%%configure <br />{<br />  "--TempDir": "s3://path/to/temp/dir"<br />}                            <br />                  </pre>  | 
| --enable-s3-parquet-optimized-committer | Boolean | Enables the EMRFS Amazon S3-optimized committer for writing Parquet data into Amazon S3. Default is 'true'. Example: <pre>%%configure <br />{<br />  "--enable-s3-parquet-optimized-committer": "false"<br />}                            <br />                  </pre>  | 
| --enable-rename-algorithm-v2 | Boolean | Sets the EMRFS rename algorithm version to version 2. Default is 'true'. Example: <pre>%%configure <br />{<br />  "--enable-rename-algorithm-v2": "true"<br />}                            <br />                  </pre>  | 
| --enable-glue-datacatalog | Boolean | Enables you to use the AWS Glue Data Catalog as an Apache Spark Hive metastore. Example: <pre>%%configure <br />{<br />  "--enable-glue-datacatalog": "true"<br />}                            <br />                  </pre>  | 
| --enable-metrics | Boolean | Enables the collection of metrics for job profiling for job run. Default is 'false'. Example: <pre>%%configure <br />{<br />  "--enable-metrics": "true"<br />}                            <br />                  </pre>  | 
| --enable-continuous-cloudwatch-log | Boolean | Enables real-time continuous logging for AWS Glue jobs. Default is 'false'. Example: <pre>%%configure <br />{<br />  "--enable-continuous-cloudwatch-log": "true"<br />}                            <br />                  </pre>  | 
| --enable-continuous-log-filter | Boolean | Specifies a standard filter or no filter when you create or edit a job enabled for continuous logging. Default is 'true'. Example: <pre>%%configure <br />{<br />  "--enable-continuous-log-filter": "true"<br />}                            <br />                  </pre>  | 
| --continuous-log-stream-prefix | String | Specifies a custom Amazon CloudWatch log stream prefix for a job enabled for continuous logging. Default is null. Example: <pre>%%configure <br />{<br />  "--continuous-log-stream-prefix": "prefix"<br />}                            <br />                  </pre>  | 
| --continuous-log-conversionPattern | String | Specifies a custom conversion log pattern for a job enabled for continuous logging. Default is null. Example: <pre>%%configure <br />{<br />  "--continuous-log-conversionPattern": "pattern"<br />}                      <br />                  </pre>  | 
| --conf | String | Controls Spark config parameters. It is for advanced use cases. Use --conf before each parameter. Example: <pre>%%configure<br />{<br />    "--conf": "spark.hadoop.hive.metastore.glue.catalogid=123456789012 --conf hive.metastore.client.factory.class=com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory --conf hive.metastore.schema.verification=false"<br />}       <br />        </pre>  | 
| timeout | Int | Determines the maximum amount of time that the Spark session should wait for a statement to complete before terminating it. <pre>%%configure <br />{<br />  "timeout": "30"<br />}</pre>  | 
| auto-scaling | Boolean | Determines whether to use auto scaling. <pre>%%configure <br />{<br />  "--enable-auto-scaling": "true"<br />}</pre>  | 
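
 Several of the parameters above can be combined in a single `%%configure` cell; for example (the values shown are illustrative):

```
%%configure
{
  "--enable-spark-ui": "true",
  "--spark-event-logs-path": "s3://path/to/event/logs/",
  "--enable-metrics": "true",
  "timeout": "60"
}
```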

### Spark jobs (ETL & streaming) magics



| Name | Type | Description | 
| --- | --- | --- | 
| %worker\_type | String | Standard, G.1X, G.2X, G.4X, G.8X, G.12X, G.16X, R.1X, R.2X, R.4X, or R.8X. `%number_of_workers` must be set too. The default `worker_type` is G.1X. | 
| %connections | List |  Specify a comma-separated list of connections to use in the session.   Example:  <pre>%connections my_rds_connection<br />dy_f = glue_context.create_dynamic_frame.from_catalog(database='rds_tables', table_name='sales_table')</pre>  | 
| %extra\_py\_files | List | Comma-separated list of additional Python files from Amazon S3. | 
| %extra\_jars | List | Comma-separated list of additional JARs to include in the cluster. | 
| %spark\_conf | String | Specify custom Spark configurations for your session. Example: `%spark_conf spark.serializer=org.apache.spark.serializer.KryoSerializer`. | 
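
 For example, a Spark ETL session could be sized with these magics in the first cell (the connection name and Amazon S3 path are placeholders):

```
%worker_type G.2X
%number_of_workers 10
%extra_py_files s3://amzn-s3-demo-bucket/helpers.py
%connections my_rds_connection
```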

### Magics for Ray jobs



| Name | Type | Description | 
| --- | --- | --- | 
| %min\_workers | Int |  The minimum number of workers that are allocated to a Ray job. Default: 1.  Example: `%min_workers 2`   | 
| %object\_memory\_head | Int | The percentage of free memory on the instance head node after a warm start. Minimum: 0. Maximum: 100. Example: `%object_memory_head 100`  | 
| %object\_memory\_worker | Int | The percentage of free memory on the instance worker nodes after a warm start. Minimum: 0. Maximum: 100. Example: `%object_memory_worker 100` | 

### Action magics



| Name | Type | Description | 
| --- | --- | --- | 
| %%sql | String |   Run SQL code. All lines after the initial `%%sql` magic will be passed as part of the SQL code.   Example: `%%sql select * from rds_tables.sales_table`  | 
| %matplot | Matplotlib figure |  Visualize your data using the matplotlib library. Example: <pre>import matplotlib.pyplot as plt<br /><br /># Set X-axis and Y-axis values<br />x = [5, 2, 8, 4, 9]<br />y = [10, 4, 8, 5, 2]<br />  <br /># Create a bar chart <br />plt.bar(x, y)<br />  <br /># Show the plot<br />%matplot plt      <br />                </pre>  | 
| %plotly | Plotly figure |  Visualize your data using the plotly library. Example: <pre>import plotly.express as px<br />                  <br />#Create a graphical figure<br />fig = px.line(x=["a","b","c"], y=[1,3,2], title="sample figure")<br /><br />#Show the figure<br />%plotly fig</pre>  | 

## Naming sessions


 AWS Glue interactive sessions are AWS resources and require a name. Names should be unique for each session and may be restricted by your IAM administrators. For more information, see [Interactive sessions with IAM](glue-is-security.md). The Jupyter kernel automatically generates unique session names for you. However, sessions can be named manually in two ways: 

1.  Using the AWS Command Line Interface config file located at `~/.aws/config`. For more information, see [Configuration and credential file settings](https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-files.html). 

1.  Using the `%session_id_prefix` magic. See [Magics supported by AWS Glue interactive sessions for Jupyter](#interactive-sessions-magics2). 

 A session name is generated as follows: 
+ When the prefix and session\_id are provided: the session name will be {prefix}-{UUID}.
+ When nothing is provided: the session name will be {UUID}.

Prefixing session names allows you to recognize your session when listing it in the AWS CLI or console.
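
 The naming rules above can be sketched in Python; the `session_name` helper below is illustrative only and is not part of the AWS Glue API:

```
import uuid

def session_name(prefix=None, session_id=None):
    # Illustrative sketch of the naming rules above: use the provided session
    # ID or generate a random UUID, then prepend the prefix if one is set.
    sid = session_id if session_id is not None else str(uuid.uuid4())
    return f"{prefix}-{sid}" if prefix else sid

print(session_name("analytics", "001"))  # analytics-001
print(session_name())                    # a random UUID
```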

## Specifying an IAM role for interactive sessions


 You must specify an AWS Identity and Access Management (IAM) role to use with AWS Glue ETL code that you run with interactive sessions. 

 The role requires the same IAM permissions as those required to run AWS Glue jobs. See [Create an IAM role for AWS Glue](https://docs.aws.amazon.com/glue/latest/dg/create-an-iam-role.html) for more information on creating a role for AWS Glue jobs and interactive sessions. 

 IAM roles can be specified in two ways: 
+  Using the AWS Command Line Interface config file located at `~/.aws/config` (recommended). For more information, see [Configuring sessions with ~/.aws/config](https://docs.aws.amazon.com/glue/latest/ug/interactive-sessions-magics.html#interactive-sessions-named-profiles). 
**Note**  
 When the `%profile` magic is used, the configuration for `glue_iam_role` of that profile is honored. 
+  Using the `%iam_role` magic. For more information, see [Magics supported by AWS Glue interactive sessions for Jupyter](#interactive-sessions-magics2). 

## Configuring sessions with named profiles


 AWS Glue interactive sessions use the same credentials as the AWS Command Line Interface or boto3, and honor named profiles like the AWS CLI, found in `~/.aws/config` (Linux and macOS) or `%USERPROFILE%\.aws\config` (Windows). For more information, see [Using named profiles](https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-files.html#cli-configure-files-using-profiles). 

 Interactive sessions take advantage of named profiles by allowing the AWS Glue service role and session ID prefix to be specified in a profile. To configure a profile, add a line for the `glue_iam_role` key and/or the `session_id_prefix` key to your named profile, as shown below. The `session_id_prefix` value does not require quotes. For example, to add a session ID prefix, add the line `session_id_prefix=myprefix`. 

```
[default]
region=us-east-1
aws_access_key_id=AKIAIOSFODNN7EXAMPLE 
aws_secret_access_key=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
glue_iam_role=arn:aws:iam::<AccountID>:role/<GlueServiceRole> 
session_id_prefix=<prefix_for_session_names>

[user1] 
region=eu-west-1
aws_access_key_id=AKIAI44QH8DHBEXAMPLE 
aws_secret_access_key=je7MtGbClwBF/2Zp9Utk/h3yCo8nvbEXAMPLEKEY
glue_iam_role=arn:aws:iam::<AccountID>:role/<GlueServiceRoleUser1> 
session_id_prefix=<prefix_for_session_names_for_user1>
```

 If you have a custom method of generating credentials, you can also configure your profile to use the `credential_process` parameter in your `~/.aws/config` file. For example: 

```
[profile developer]
region=us-east-1
credential_process = "/Users/Dave/generate_my_credentials.sh" --username helen
```

 You can find more details about sourcing credentials through the `credential_process` parameter here: [ Sourcing credentials with an external process](https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-sourcing-external.html). 

 If a region or `iam_role` are not set in the profile that you are using, you must specify them using the `%region` and `%iam_role` magics in the first cell that you run. 
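
 For example, if your profile sets neither, the first cell you run might contain the following (the account ID and role name are placeholders):

```
%region eu-west-1
%iam_role arn:aws:iam::123456789012:role/AWSGlueServiceRole
```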

# Converting a script or notebook into an AWS Glue job


 There are two ways you can convert a script or notebook into an AWS Glue job: 
+  Use **nbconvert** to convert your Jupyter `.ipynb` notebook document file into a `.py` file. For more information, see [nbconvert: Convert Notebooks to other formats](https://nbconvert.readthedocs.io/en/latest/). 
+  Upload the file to AWS Glue Studio Notebooks. 
  +  In the AWS Glue Studio console, choose **Jobs** from the navigation menu. 
  +  In the **Create job** section, choose **Jupyter Notebook**. 
  +  In the **Options** section, choose **Upload and edit an existing notebook**. 
  +  Select **Choose file** to upload an `.ipynb` file. 
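
 To see why the **nbconvert** route works, note that an `.ipynb` file is just JSON, and a conversion keeps only its code cells. The sketch below illustrates the idea with a hypothetical `notebook_to_script` helper; for real conversions use **nbconvert**, which also handles magics and cell outputs:

```
import json

def notebook_to_script(ipynb_text):
    # An .ipynb document is JSON with a "cells" list; concatenate the
    # source of the code cells, skipping markdown and raw cells.
    nb = json.loads(ipynb_text)
    code_cells = [c for c in nb.get("cells", []) if c.get("cell_type") == "code"]
    return "\n\n".join("".join(c["source"]) for c in code_cells)

nb = {"cells": [
    {"cell_type": "markdown", "source": ["# notes\n"]},
    {"cell_type": "code", "source": ["print('hello from Glue')\n"]},
]}
print(notebook_to_script(json.dumps(nb)))
```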

# Working with streaming operations in AWS Glue interactive sessions


## Switching streaming session type


 Use the AWS Glue interactive sessions configuration magic, `%streaming`, to define the job you are running and initialize a streaming interactive session. 

## Sampling input stream for interactive development


 To enhance the interactive experience in AWS Glue interactive sessions, a method was added to `GlueContext` that obtains a snapshot of a stream as a static DynamicFrame, which you can then inspect, interact with, and use to implement your workflow. 

 With the `GlueContext` class instance, you can call the method `getSampleStreamingDynamicFrame`. Required arguments for this method are: 
+  `dataFrame`: The Spark Streaming DataFrame 
+  `options`: See available options below 

 Available options include: 
+  **windowSize**: Also called the micro-batch duration. This parameter determines how long a streaming query will wait after the previous batch was triggered. This value must be smaller than `pollingTimeInMs`. 
+  **pollingTimeInMs**: The total length of time the method will run. It fires off at least one micro batch to obtain sample records from the input stream. 
+  **recordPollingLimit**: This parameter lets you limit the total number of records polled from the stream. 
+  (Optional) You can also use `writeStreamFunction` to apply a custom function to every sampled micro batch. See below for examples in Scala and Python. 

**Scala**

```
val sampleBatchFunction = (batchDF: DataFrame, batchId: Long) => {
  // Optional: replace with your own forEachBatch logic
}
val jsonString: String = s"""{"pollingTimeInMs": "10000", "windowSize": "5 seconds"}"""
val dynFrame = glueContext.getSampleStreamingDynamicFrame(YOUR_STREAMING_DF, JsonOptions(jsonString), sampleBatchFunction)
dynFrame.show()
```

**Python**

```
def sample_batch_function(batch_df, batch_id):
    # Optional: replace with your own forEachBatch logic
    pass

options = {
    "pollingTimeInMs": "10000",
    "windowSize": "5 seconds",
}

glue_context.getSampleStreamingDynamicFrame(YOUR_STREAMING_DF, options, sample_batch_function)
```

**Note**  
 When the sampled `DynamicFrame` is empty, it could be caused by a few reasons: 
+  The streaming source is set to "Latest" and no new data has been ingested during the sampling period. 
+  The polling time is not long enough to process the records it ingested. Data won't show up unless the whole batch has been processed. 

## Running streaming applications in interactive sessions


 In AWS Glue interactive sessions, you can run an AWS Glue streaming application just as you would create a streaming application in the AWS Glue console. Because interactive sessions are session-based, encountering exceptions in the runtime does not cause the session to stop. This gives you the added benefit of developing your batch function iteratively. For example: 

```
def batch_function(data_frame, batch_id):
    log.info(data_frame.count())
    invalid_method_call()
glueContext.forEachBatch(frame=streaming_df, batch_function = batch_function, options = {**})
```

 In the example above, we included an invalid method call. Unlike in regular AWS Glue jobs, which would exit the entire application, the user's coding context and definitions are fully preserved and the session is still operational. There is no need to bootstrap a new cluster and rerun all the preceding transformations. This lets you focus on quickly iterating your batch function implementation to obtain the results you want. 

 It is important to note that interactive sessions evaluate each statement in a blocking manner, so a session executes only one statement at a time. Because streaming queries are continuous and never-ending, a session with an active streaming query can't handle any follow-up statements unless the query is interrupted. You can issue the interruption command directly from the Jupyter notebook, and the kernel will handle the cancellation for you. 

 Take the following sequence of statements which are waiting for execution as an example: 

```
Statement 1:
      val number = df.count() 
      #Spark Action with deterministic result
      Result: 5
      
Statement 2:
      streamingQuery.start().awaitTermination()
      #Spark Streaming query that will execute continuously
      Result: Constantly updated with each microbatch
      
Statement 3:
      val number2 = df.count()
      #This will not be executed as previous statement will be running indefinitely
```

# AWS Glue interactive session pricing


 AWS charges for AWS Glue interactive sessions based on how long the session is active and the number of Data Processing Units (DPUs) used. You are charged an hourly rate for the number of DPUs used to run your workloads, billed in increments of one second. AWS Glue interactive sessions assign a default of 5 DPUs and require a minimum of 2 DPUs. There is also a 1-minute minimum billing duration for each interactive session. To see the AWS Glue rates and pricing examples, or to estimate your costs using the AWS Pricing Calculator, see [AWS Glue pricing](https://aws.amazon.com/glue/pricing/). 
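
 As a sketch of the billing arithmetic, the hypothetical helper below applies the per-second billing and 1-minute minimum described above. The per-DPU-hour rate used here is a placeholder; check the pricing page for your Region's actual rate:

```
def session_cost(dpus, seconds, rate_per_dpu_hour):
    # Per-second billing with a 1-minute minimum billing duration.
    billed_seconds = max(seconds, 60)
    return dpus * (billed_seconds / 3600) * rate_per_dpu_hour

# Default 5 DPUs for a 10-minute session at a hypothetical $0.44/DPU-hour
print(round(session_cost(5, 600, 0.44), 4))  # 0.3667
```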

## Configure your AWS Glue interactive sessions


 You can use Jupyter magics in your AWS Glue interactive session to modify your session and configuration parameters. Magics are short commands prefixed with `%` at the start of Jupyter cells that provide a quick and easy way to control your environment. For example, to change the number of workers allocated to your job from the default of 5 to 10, you can specify `%number_of_workers 10`. To configure your session to stop after 10 minutes of idle time instead of the default 2880 minutes, you can specify `%idle_timeout 10`. 

 For the complete list of magics available, see [Configuring AWS Glue interactive sessions for Jupyter and AWS Glue Studio notebooks](https://docs.aws.amazon.com/glue/latest/dg/interactive-sessions-magics.html). 

# Developing and testing AWS Glue job scripts locally
Developing and testing locally

When you develop and test your AWS Glue for Spark job scripts, there are multiple available options:
+ AWS Glue Studio console
  + Visual editor
  + Script editor
  + AWS Glue Studio notebook
+ Interactive sessions
  + Jupyter notebook
+ Docker image
  + Local development
  + Remote development

You can choose any of the above options based on your requirements.

If you prefer no code or less code experience, the AWS Glue Studio visual editor is a good choice.

If you prefer an interactive notebook experience, AWS Glue Studio notebook is a good choice. For more information, see [Using Notebooks with AWS Glue Studio and AWS Glue](https://docs.aws.amazon.com/glue/latest/ug/notebooks-chapter.html). If you want to use your own local environment, interactive sessions is a good choice. For more information, see [Using interactive sessions with AWS Glue](https://docs.aws.amazon.com/glue/latest/dg/interactive-sessions-chapter.html).

If you prefer local/remote development experience, the Docker image is a good choice. This helps you to develop and test AWS Glue for Spark job scripts anywhere you prefer without incurring AWS Glue cost.

If you prefer local development without Docker, installing the AWS Glue ETL library directory locally is a good choice.

## Developing using AWS Glue Studio


The AWS Glue Studio visual editor is a graphical interface that makes it easy to create, run, and monitor extract, transform, and load (ETL) jobs in AWS Glue. You can visually compose data transformation workflows and seamlessly run them on AWS Glue's Apache Spark-based serverless ETL engine. You can inspect the schema and data results in each step of the job. For more information, see the [AWS Glue Studio User Guide](https://docs.aws.amazon.com/glue/latest/ug/what-is-glue-studio.html).

## Developing using interactive sessions


Interactive sessions allow you to build and test applications from the environment of your choice. For more information, see [Using interactive sessions with AWS Glue](https://docs.aws.amazon.com/glue/latest/dg/interactive-sessions-chapter.html).

# Develop and test AWS Glue jobs locally using a Docker image
Developing AWS Glue jobs locally with Docker

 For a production-ready data platform, the development process and CI/CD pipeline for AWS Glue jobs is a key topic. You can flexibly develop and test AWS Glue jobs in a Docker container. AWS Glue hosts Docker images on the Amazon ECR Public Gallery to set up your development environment with additional utilities. You can use your preferred IDE, notebook, or REPL with the AWS Glue ETL library. This topic describes how to develop and test AWS Glue version 5.0 jobs in a Docker container using a Docker image.

## Available Docker images


 The following Docker images are available for AWS Glue on the [Amazon ECR Public Gallery](https://gallery.ecr.aws/glue/aws-glue-libs): 
+  For AWS Glue version 5.0: `public.ecr.aws/glue/aws-glue-libs:5` 
+ For AWS Glue version 4.0: `public.ecr.aws/glue/aws-glue-libs:glue_libs_4.0.0_image_01`
+ For AWS Glue version 3.0: `public.ecr.aws/glue/aws-glue-libs:glue_libs_3.0.0_image_01`
+ For AWS Glue version 2.0: `public.ecr.aws/glue/aws-glue-libs:glue_libs_2.0.0_image_01`

**Note**  
 AWS Glue Docker images are compatible with both x86\_64 and arm64. 

 In this example, we use `public.ecr.aws/glue/aws-glue-libs:5` and run the container on a local machine (Mac, Windows, or Linux). This container image has been tested for AWS Glue version 5.0 Spark jobs. The image contains the following: 
+  Amazon Linux 2023 
+  AWS Glue ETL Library 
+  Apache Spark 3.5.4 
+  Open table format libraries: Apache Iceberg 1.7.1, Apache Hudi 0.15.0, and Delta Lake 3.3.0 
+  AWS Glue Data Catalog Client 
+  Amazon Redshift connector for Apache Spark 
+  Amazon DynamoDB connector for Apache Hadoop 

 To set up your container, pull the image from ECR Public Gallery and then run the container. This topic demonstrates how to run your container with the following methods, depending on your requirements: 
+ `spark-submit`
+ REPL shell (`pyspark`)
+ `pytest`
+ Visual Studio Code

## Prerequisites


Before you start, make sure that Docker is installed and the Docker daemon is running. For installation instructions, see the Docker documentation for [Mac](https://docs.docker.com/docker-for-mac/install/) or [Linux](https://docs.docker.com/engine/install/). The machine running Docker hosts the AWS Glue container. Also make sure that you have at least 7 GB of disk space for the image on the host running Docker.

 For more information about restrictions when developing AWS Glue code locally, see [ Local development restrictions ](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-libraries.html#local-dev-restrictions). 

### Configuring AWS


To enable AWS API calls from the container, set up AWS credentials by following these steps.

1.  [ Create an AWS named profile ](https://docs.aws.amazon.com//cli/latest/userguide/cli-configure-files.html). 

1.  Open `cmd` on Windows or a terminal on Mac/Linux, and run the following command: 

   ```
   PROFILE_NAME="<your_profile_name>"
   ```

In the following sections, we use this AWS named profile.

### Pulling the image


 If you’re running Docker on Windows, choose the Docker icon (right-click) and choose **Switch to Linux containers** before pulling the image. 

Run the following command to pull the image from ECR Public:

```
docker pull public.ecr.aws/glue/aws-glue-libs:5 
```

## Run the container


You can now run a container using this image. You can choose any of the following based on your requirements.

### spark-submit


You can run an AWS Glue job script by running the `spark-submit` command on the container. 

1.  Write your script and save it as `sample.py` under the `/local_path_to_workspace/src/` directory using the following commands: 

   ```
   $ WORKSPACE_LOCATION=/local_path_to_workspace
   $ SCRIPT_FILE_NAME=sample.py
   $ mkdir -p ${WORKSPACE_LOCATION}/src
   $ vim ${WORKSPACE_LOCATION}/src/${SCRIPT_FILE_NAME}
   ```

1.  These variables are used in the `docker run` command below. The sample code (`sample.py`) used in the `spark-submit` command below is included in the appendix at the end of this topic. 

    Run the following command to execute the `spark-submit` command on the container to submit a new Spark application: 

   ```
   $ docker run -it --rm \
       -v ~/.aws:/home/hadoop/.aws \
       -v $WORKSPACE_LOCATION:/home/hadoop/workspace/ \
       -e AWS_PROFILE=$PROFILE_NAME \
       --name glue5_spark_submit \
       public.ecr.aws/glue/aws-glue-libs:5 \
       spark-submit /home/hadoop/workspace/src/$SCRIPT_FILE_NAME
   ```

1. (Optional) Configure `spark-submit` to match your environment. For example, you can pass your dependencies with the `--jars` configuration. For more information, consult [Dynamically Loading Spark Properties](https://spark.apache.org/docs/latest/configuration.html) in the Spark documentation. 

### REPL shell (Pyspark)


 You can run a REPL (read-eval-print loop) shell for interactive development. Run the following command to run the PySpark command on the container and start the REPL shell: 

```
$ docker run -it --rm \
    -v ~/.aws:/home/hadoop/.aws \
    -e AWS_PROFILE=$PROFILE_NAME \
    --name glue5_pyspark \
    public.ecr.aws/glue/aws-glue-libs:5 \
    pyspark
```

 You will see the following output: 

```
Python 3.11.6 (main, Jan  9 2025, 00:00:00) [GCC 11.4.1 20230605 (Red Hat 11.4.1-2)] on linux
Type "help", "copyright", "credits" or "license" for more information.
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.5.4-amzn-0
      /_/

Using Python version 3.11.6 (main, Jan  9 2025 00:00:00)
Spark context Web UI available at None
Spark context available as 'sc' (master = local[*], app id = local-1740643079929).
SparkSession available as 'spark'.
>>>
```

 With this REPL shell, you can code and test interactively. 

### Pytest


 For unit testing, you can use `pytest` with AWS Glue Spark job scripts. Run the following commands for preparation. 

```
$ WORKSPACE_LOCATION=/local_path_to_workspace
$ SCRIPT_FILE_NAME=sample.py
$ UNIT_TEST_FILE_NAME=test_sample.py
$ mkdir -p ${WORKSPACE_LOCATION}/tests
$ vim ${WORKSPACE_LOCATION}/tests/${UNIT_TEST_FILE_NAME}
```

 Run the following command to run `pytest` using `docker run`: 

```
$ docker run -i --rm \
    -v ~/.aws:/home/hadoop/.aws \
    -v $WORKSPACE_LOCATION:/home/hadoop/workspace/ \
    --workdir /home/hadoop/workspace \
    -e AWS_PROFILE=$PROFILE_NAME \
    --name glue5_pytest \
    public.ecr.aws/glue/aws-glue-libs:5 \
    -c "python3 -m pytest --disable-warnings"
```

 Once `pytest` finishes executing unit tests, your output will look something like this: 

```
============================= test session starts ==============================
platform linux -- Python 3.11.6, pytest-8.3.4, pluggy-1.5.0
rootdir: /home/hadoop/workspace
plugins: integration-mark-0.2.0
collected 1 item

tests/test_sample.py .                                                   [100%]

======================== 1 passed, 1 warning in 34.28s =========================
```

### Setting up the container to use Visual Studio Code


 To set up the container with Visual Studio Code, complete the following steps: 

1. Install Visual Studio Code.

1. Install [Python](https://marketplace.visualstudio.com/items?itemName=ms-python.python).

1. Install [Visual Studio Code Remote - Containers](https://code.visualstudio.com/docs/remote/containers).

1. Open the workspace folder in Visual Studio Code.

1. Press `Ctrl+Shift+P` (Windows/Linux) or `Cmd+Shift+P` (Mac).

1. Type `Preferences: Open Workspace Settings (JSON)`.

1. Press Enter.

1. Paste the following JSON and save it.

   ```
   {
       "python.defaultInterpreterPath": "/usr/bin/python3.11",
       "python.analysis.extraPaths": [
           "/usr/lib/spark/python/lib/py4j-0.10.9.7-src.zip",
           "/usr/lib/spark/python/",
           "/usr/lib/spark/python/lib/"
       ]
   }
   ```

 To connect to the container: 

1. Run the Docker container.

   ```
   $ docker run -it --rm \
       -v ~/.aws:/home/hadoop/.aws \
       -v $WORKSPACE_LOCATION:/home/hadoop/workspace/ \
       -e AWS_PROFILE=$PROFILE_NAME \
       --name glue5_pyspark \
       public.ecr.aws/glue/aws-glue-libs:5 \
       pyspark
   ```

1. Start Visual Studio Code.

1.  Choose **Remote Explorer** on the left menu, and choose the container that you started (`glue5_pyspark`, created from the `public.ecr.aws/glue/aws-glue-libs:5` image). 

1.  Right-click and choose **Attach in Current Window**.   
![\[When right-click, a window with the option to Attach in Current Window is presented.\]](http://docs.aws.amazon.com/glue/latest/dg/images/vs-code-other-containers.png)

1.  If the following dialog appears, choose **Got it**.   
![\[A window warning with message "Attaching to a container may execute arbitrary code".\]](http://docs.aws.amazon.com/glue/latest/dg/images/vs-code-warning-got-it.png)

1. Open `/home/hadoop/workspace/`.  
![\[A window drop-down with the option 'workspace' is highlighted.\]](http://docs.aws.amazon.com/glue/latest/dg/images/vs-code-open-workspace.png)

1.  Create an AWS Glue PySpark script and choose **Run**. 

   You will see the successful run of the script.  
![\[The successful run of the script.\]](http://docs.aws.amazon.com/glue/latest/dg/images/vs-code-run-successful-script.png)

## Changes between AWS Glue 4.0 and AWS Glue 5.0 Docker image


 The following are the major changes between the AWS Glue 4.0 and AWS Glue 5.0 Docker images: 
+  In AWS Glue 5.0, there is a single container image for both batch and streaming jobs. This differs from Glue 4.0, where there was one image for batch and another for streaming. 
+  In AWS Glue 5.0, the default user name of the container is `hadoop`. In AWS Glue 4.0, the default user name was `glue_user`. 
+  In AWS Glue 5.0, several additional libraries, including JupyterLab and Livy, have been removed from the image. You can install them manually. 
+  In AWS Glue 5.0, the Iceberg, Hudi, and Delta Lake libraries are all pre-loaded by default, and the environment variable `DATALAKE_FORMATS` is no longer needed. In AWS Glue 4.0, the `DATALAKE_FORMATS` environment variable was used to specify which table formats to load. 

 The above list is specific to the Docker image. To learn more about AWS Glue 5.0 updates, see [Introducing AWS Glue 5.0 for Apache Spark](https://aws.amazon.com/blogs/big-data/introducing-aws-glue-5-0-for-apache-spark/) and [Migrating AWS Glue for Spark jobs to AWS Glue version 5.0](https://docs.aws.amazon.com/glue/latest/dg/migrating-version-50.html). 

## Considerations


 Keep in mind that the following features are not supported when using the AWS Glue container image to develop job scripts locally. 
+  [Job bookmarks](https://docs.aws.amazon.com/glue/latest/dg/monitor-continuations.html) 
+  AWS Glue Parquet writer ([Using the Parquet format in AWS Glue](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-format-parquet-home.html)) 
+  [FillMissingValues transform](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-transforms-fillmissingvalues.html) 
+  [FindMatches transform](https://docs.aws.amazon.com/glue/latest/dg/machine-learning.html#find-matches-transform) 
+  [Vectorized SIMD CSV reader](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-format-csv-home.html#aws-glue-programming-etl-format-simd-csv-reader) 
+  The [customJdbcDriverS3Path](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-connect.html#aws-glue-programming-etl-connect-jdbc) property for loading a JDBC driver from an Amazon S3 path 
+  [AWS Glue Data Quality](https://docs.aws.amazon.com/glue/latest/dg/glue-data-quality.html) 
+  [Sensitive Data Detection](https://docs.aws.amazon.com/glue/latest/dg/detect-PII.html) 
+  AWS Lake Formation permission-based credential vending 

## Appendix: Adding JDBC drivers and Java libraries


 To add a JDBC driver that is not currently available in the container, create a new directory under your workspace with the JAR files you need and mount the directory to `/opt/spark/jars/` in the `docker run` command. JAR files found under `/opt/spark/jars/` within the container are automatically added to the Spark classpath and are available for use during the job run. 

 For example, use the following `docker run` command to add JDBC driver JARs to the PySpark REPL shell. 

```
docker run -it --rm \
    -v ~/.aws:/home/hadoop/.aws \
    -v $WORKSPACE_LOCATION:/home/hadoop/workspace/ \
    -v $WORKSPACE_LOCATION/jars/:/opt/spark/jars/ \
    --workdir /home/hadoop/workspace \
    -e AWS_PROFILE=$PROFILE_NAME \
    --name glue5_jdbc \
    public.ecr.aws/glue/aws-glue-libs:5 \
    pyspark
```

 As highlighted in **Considerations**, the `customJdbcDriverS3Path` connection option cannot be used to import a custom JDBC driver from Amazon S3 in AWS Glue container images. 

# Development endpoints

**Note**  
 **The console experience for dev endpoints has been removed as of March 31, 2023.** Creating, updating, and monitoring dev endpoints is still available via the [Development endpoints API](aws-glue-api-dev-endpoint.md) and [AWS Glue CLI](https://awscli.amazonaws.com/v2/documentation/api/latest/reference/glue/index.html#cli-aws-glue).

 We strongly recommend migrating from dev endpoints to interactive sessions for the reasons listed below. For details on how to migrate from dev endpoints to interactive sessions, see [Migrating from dev endpoints to interactive sessions](https://docs.aws.amazon.com/glue/latest/dg/development-migration-checklist.html). 


| Description | Dev endpoints | Interactive sessions | 
| --- | --- | --- | 
| Glue version support | Supports AWS Glue version 0.9 and 1.0 | Supports AWS Glue version 2.0 and later | 
| Regional availability | Dev endpoints are not available in the Asia Pacific (Jakarta) (ap-southeast-3), Middle East (UAE) (me-central-1), Europe (Spain) (eu-south-2), Europe (Zurich) (eu-central-2), or other new Regions going forward | Interactive sessions are not currently available in the Middle East (UAE) (me-central-1) Region, but may be made available later | 
| Access method to the Spark cluster | Supports SSH, REPL shell, Jupyter notebook, and IDEs (for example, PyCharm) | Supports AWS Glue Studio notebook, Jupyter notebook, various IDEs (for example, Visual Studio Code, PyCharm), and SageMaker AI notebook | 
| Time to first query | Requires 10-15 minutes to set up a Spark cluster | Takes up to 1 minute to set up an ephemeral Spark cluster | 
| Price model | AWS charges for development endpoints based on the time that the endpoint is provisioned and the number of DPUs. Development endpoints do not time out, and there is a 10-minute minimum billing duration for each provisioned development endpoint. Additionally, AWS charges for Jupyter notebooks on Amazon EC2 instances and for SageMaker AI notebooks when you configure them with dev endpoints. | AWS charges for interactive sessions based on the time that the session is active and the number of DPUs. Interactive sessions have configurable idle timeouts, and there is a 1-minute minimum billing duration for each interactive session. AWS Glue Studio notebooks provide a built-in interface for interactive sessions and are offered at no additional cost. | 
| Console experience | Only available via the CLI and API | Available through the AWS Glue console, CLI, and APIs | 
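The minimum billing durations in the table translate into simple arithmetic. The following sketch is illustrative only; it ignores per-DPU-hour rates and any rounding rules, so consult the AWS Glue pricing page for authoritative details.

```
def billed_dpu_hours(runtime_minutes: float, dpus: int, minimum_minutes: float) -> float:
    """DPU-hours billed, given a per-resource minimum billing duration."""
    return max(runtime_minutes, minimum_minutes) / 60.0 * dpus

# A 30-second interactive session with 5 DPUs is billed for the 1-minute minimum.
interactive = billed_dpu_hours(0.5, 5, minimum_minutes=1)

# A 2-minute run on a provisioned dev endpoint is billed for the 10-minute minimum.
dev_endpoint = billed_dpu_hours(2, 5, minimum_minutes=10)
```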

# Migrating from dev endpoints to interactive sessions


 Use the following checklist to determine the appropriate method to migrate from dev endpoints to interactive sessions. 

 **Does your script depend on AWS Glue 0.9 or 1.0 specific features (for example, HDFS or YARN)?** 

 If the answer is yes, see [Migrating AWS Glue jobs to AWS Glue version 3.0](https://docs.aws.amazon.com/glue/latest/dg/migrating-version-30.html) to learn how to migrate from AWS Glue 0.9 or 1.0 to AWS Glue 3.0 and later. 

 **Which method do you use to access your dev endpoint?** 


| If you use this method | Then do this | 
| --- | --- | 
| SageMaker AI notebook, Jupyter notebook, or JupyterLab | Migrate to [AWS Glue Studio notebook](https://docs.aws.amazon.com/glue/latest/dg/interactive-sessions-gs-notebook.html) by downloading your .ipynb files from Jupyter and uploading them to create a new AWS Glue Studio notebook job. Alternatively, you can use [SageMaker AI Studio](https://aws.amazon.com/blogs/machine-learning/prepare-data-at-scale-in-amazon-sagemaker-studio-using-serverless-aws-glue-interactive-sessions/) and select the AWS Glue kernel.  | 
| Zeppelin notebook | Convert the notebook to a Jupyter notebook manually by copying and pasting code, or automatically by using a third-party converter such as ze2nb. Then, use the notebook in an AWS Glue Studio notebook or SageMaker AI Studio.  | 
| IDE |  See [Author AWS Glue jobs with PyCharm using AWS Glue interactive sessions](https://aws.amazon.com/blogs/big-data/author-aws-glue-jobs-with-pycharm-using-aws-glue-interactive-sessions/), or [Using interactive sessions with Microsoft Visual Studio Code](https://docs.aws.amazon.com/glue/latest/dg/interactive-sessions-vscode.html).  | 
| REPL |  Install [AWS Glue interactive sessions](https://docs.aws.amazon.com/glue/latest/dg/interactive-sessions.html) locally, then run the following command:  [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/development-migration-checklist.html)  | 
| SSH | No corresponding option on interactive sessions. Alternatively, you can use a Docker image. To learn more, see [Developing using a Docker image](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-etl-libraries.html#develop-local-docker-image).  | 

The following sections provide information on using dev endpoints to develop jobs in AWS Glue version 1.0.

**Topics**
+ [

# Migrating from dev endpoints to interactive sessions
](development-migration-checklist.md)
+ [

# Developing scripts using development endpoints
](dev-endpoint.md)
+ [

# Managing notebooks
](notebooks-with-glue.md)

# Developing scripts using development endpoints


**Note**  
 Development endpoints are only supported for versions of AWS Glue prior to 2.0. For an interactive environment where you can author and test ETL scripts, use [Notebooks on AWS Glue Studio](https://docs.aws.amazon.com/glue/latest/ug/notebooks-chapter.html). 

AWS Glue can create an environment—known as a *development endpoint*—that you can use to iteratively develop and test your extract, transform, and load (ETL) scripts. You can create, edit, and delete development endpoints using the  AWS Glue console or API.

## Managing your development environment

When you create a development endpoint, you provide configuration values to provision the development environment. These values tell AWS Glue how to set up the network so that you can access the endpoint securely and the endpoint can access your data stores.

You can then create a notebook that connects to the endpoint, and use your notebook to author and test your ETL script. When you're satisfied with the results of your development process, you can create an ETL job that runs your script. With this process, you can add functions and debug your scripts in an interactive manner.

Follow the tutorials in this section to learn how to use your development endpoint with notebooks.

**Topics**
+ [

## Managing your development environment
](#dev-endpoint-managing-dev-environment)
+ [

# Development endpoint workflow
](dev-endpoint-workflow.md)
+ [

# How AWS Glue development endpoints work with SageMaker notebooks
](dev-endpoint-how-it-works.md)
+ [

# Adding a development endpoint
](add-dev-endpoint.md)
+ [

# Accessing your development endpoint
](dev-endpoint-elastic-ip.md)
+ [

# Tutorial: Set up a Jupyter notebook in JupyterLab to test and debug ETL scripts
](dev-endpoint-tutorial-local-jupyter.md)
+ [

# Tutorial: Use a SageMaker AI notebook with your development endpoint
](dev-endpoint-tutorial-sage.md)
+ [

# Tutorial: Use a REPL shell with your development endpoint
](dev-endpoint-tutorial-repl.md)
+ [

# Tutorial: Set up PyCharm professional with a development endpoint
](dev-endpoint-tutorial-pycharm.md)
+ [

# Advanced configuration: sharing development endpoints among multiple users
](dev-endpoint-sharing.md)

# Development endpoint workflow


To use an AWS Glue development endpoint, you can follow this workflow:

1. Create a development endpoint using the API. The endpoint is launched in a virtual private cloud (VPC) with your defined security groups.

1. The API polls the development endpoint until it is provisioned and ready for work. When it's ready, connect to the development endpoint using one of the following methods to create and test AWS Glue scripts.
   + Create a SageMaker AI notebook in your account. For more information about how to create a notebook, see [Authoring code with AWS Glue Studio notebooks](notebooks-chapter.md).
   + Open a terminal window to connect directly to a development endpoint.
   + If you have the professional edition of the JetBrains [PyCharm Python IDE](https://www.jetbrains.com/pycharm/), connect it to a development endpoint and use it to develop interactively. If you insert `pydevd` statements in your script, PyCharm can support remote breakpoints.

1. When you finish debugging and testing on your development endpoint, you can delete it.

# How AWS Glue development endpoints work with SageMaker notebooks

One of the common ways to access your development endpoints is to use [Jupyter](https://jupyter.org/) on SageMaker notebooks. The Jupyter notebook is an open-source web application widely used for visualization, analytics, and machine learning. An AWS Glue SageMaker notebook provides you with a Jupyter notebook experience on AWS Glue development endpoints. In the AWS Glue SageMaker notebook, the Jupyter environment is pre-configured with [SparkMagic](https://github.com/jupyter-incubator/sparkmagic), an open-source Jupyter plugin that submits Spark jobs to a remote Spark cluster. [Apache Livy](https://livy.apache.org) is a service that enables interaction with a remote Spark cluster over a REST API. In the AWS Glue SageMaker notebook, SparkMagic is configured to call the REST API against a Livy server running on an AWS Glue development endpoint. 

The following text flow explains how each component works:

 *AWS Glue SageMaker notebook: (Jupyter → SparkMagic) → (network) →  AWS Glue development endpoint: (Apache Livy → Apache Spark)* 

Once you run a Spark script written in a paragraph of a Jupyter notebook, the Spark code is submitted to the Livy server via SparkMagic, and a Spark job named "livy-session-N" runs on the Spark cluster. This job is called a Livy session. The Spark job runs while the notebook session is alive, and is terminated when you shut down the Jupyter kernel from the notebook or when the session times out. One Spark job is launched per notebook (.ipynb) file.

You can use a single AWS Glue development endpoint with multiple SageMaker notebook instances, and you can create multiple notebook files in each SageMaker notebook instance. When you open a notebook file and run its paragraphs, a Livy session is launched for that file on the Spark cluster via SparkMagic. Each Livy session corresponds to a single Spark job.

## Default behavior for AWS Glue development endpoints and SageMaker notebooks

The Spark jobs run based on the [Spark configuration](https://spark.apache.org/docs/2.4.3/configuration.html). There are multiple ways to set the Spark configuration (for example, the Spark cluster configuration or SparkMagic's configuration).

By default, Spark allocates cluster resources to a Livy session based on the Spark cluster configuration. In AWS Glue development endpoints, the cluster configuration depends on the worker type. The following table shows the common configurations for each worker type.



|  | Standard | G.1X | G.2X | 
| --- | --- | --- | --- | 
|  spark.driver.memory  | 5G | 10G | 20G | 
|  spark.executor.memory  | 5G | 10G | 20G | 
|  spark.executor.cores  | 4 | 8 | 16 | 
|  spark.dynamicAllocation.enabled  | TRUE | TRUE | TRUE | 

The maximum number of Spark executors is automatically calculated from the combination of DPU (or `NumberOfWorkers`) and worker type. 



|  | Standard | G.1X | G.2X | 
| --- | --- | --- | --- | 
| Maximum number of Spark executors |  (DPU - 1) * 2 - 1  |  NumberOfWorkers - 1  |  NumberOfWorkers - 1  | 

For example, if your development endpoint has 10 workers and the worker type is `G.1X`, then you will have 9 Spark executors, and the entire cluster will have 90G of executor memory because each executor has 10G of memory.
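The formulas in the table above can be captured as a small helper function. This is a sketch derived only from this section; the actual capacity calculation is performed by AWS Glue.

```
def max_spark_executors(worker_type: str, capacity: int) -> int:
    """Maximum Spark executors for a dev endpoint.

    capacity is the DPU count for Standard workers, or NumberOfWorkers
    for G.1X / G.2X workers, per the table above.
    """
    if worker_type == "Standard":
        return (capacity - 1) * 2 - 1
    if worker_type in ("G.1X", "G.2X"):
        return capacity - 1
    raise ValueError(f"unknown worker type: {worker_type}")

# 10 G.1X workers -> 9 executors (9 x 10G = 90G total executor memory)
assert max_spark_executors("G.1X", 10) == 9
```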

Regardless of the specified worker type, Spark dynamic resource allocation will be turned on. If a dataset is large enough, Spark may allocate all the executors to a single Livy session since `spark.dynamicAllocation.maxExecutors` is not set by default. This means that other Livy sessions on the same dev endpoint will wait to launch new executors. If the dataset is small, Spark will be able to allocate executors to multiple Livy sessions at the same time.

**Note**  
For more information about how resources are allocated in different use cases and how you set a configuration to modify the behavior, see [Advanced configuration: sharing development endpoints among multiple users](dev-endpoint-sharing.md).

# Adding a development endpoint


Use development endpoints to iteratively develop and test your extract, transform, and load (ETL) scripts in AWS Glue. Working with development endpoints is only available through the AWS Command Line Interface.

1. In a command line window, enter a command similar to the following.

   ```
   aws glue create-dev-endpoint --endpoint-name "endpoint1" --role-arn "arn:aws:iam::account-id:role/role-name" --number-of-nodes "3" --glue-version "1.0" --arguments '{"GLUE_PYTHON_VERSION": "3"}' --region "region-name"
   ```

   This command specifies AWS Glue version 1.0. Because this version supports both Python 2 and Python 3, you can use the `arguments` parameter to indicate the desired Python version. If the `glue-version` parameter is omitted, AWS Glue version 0.9 is assumed. For more information about AWS Glue versions, see the [Glue version job property](add-job.md#glue-version-table).

   For information about additional command line parameters, see [create-dev-endpoint](https://docs.aws.amazon.com/cli/latest/reference/glue/create-dev-endpoint.html) in the *AWS CLI Command Reference*.

1. (Optional) Enter the following command to check the development endpoint status. When the status changes to `READY`, the development endpoint is ready to use.

   ```
   aws glue get-dev-endpoint --endpoint-name "endpoint1"
   ```

# Accessing your development endpoint


When you create a development endpoint in a virtual private cloud (VPC), AWS Glue returns only a private IP address. The public IP address field is not populated. When you create a non-VPC development endpoint, AWS Glue returns only a public IP address.

If your development endpoint has a **Public address**, confirm that it is reachable with the SSH private key for the development endpoint, as in the following example.

```
ssh -i dev-endpoint-private-key.pem glue@public-address
```

Suppose that your development endpoint has a **Private address**, your VPC subnet is routable from the public internet, and its security groups allow inbound access from your client. In this case, follow these steps to attach an *Elastic IP address* to a development endpoint to allow access from the internet.

**Note**  
If you want to use Elastic IP addresses, the subnet that is being used requires an internet gateway associated through the route table.

**To access a development endpoint by attaching an Elastic IP address**

1. Open the AWS Glue console at [https://console.aws.amazon.com/glue/](https://console.aws.amazon.com/glue/).

1. In the navigation pane, choose **Dev endpoints**, and navigate to the development endpoint details page. Record the **Private address** for use in the next step. 

1. Open the Amazon EC2 console at [https://console.aws.amazon.com/ec2/](https://console.aws.amazon.com/ec2/).

1. In the navigation pane, under **Network & Security**, choose **Network Interfaces**. 

1. Search for the **Private DNS (IPv4)** that corresponds to the **Private address** on the AWS Glue console development endpoint details page. 

   You might need to modify which columns are displayed on your Amazon EC2 console. Note the **Network interface ID** (ENI) for this address (for example, `eni-12345678`).

1. On the Amazon EC2 console, under **Network & Security**, choose **Elastic IPs**. 

1. Choose **Allocate new address**, and then choose **Allocate** to allocate a new Elastic IP address.

1. On the **Elastic IPs** page, choose the newly allocated **Elastic IP**. Then choose **Actions**, **Associate address**.

1. On the **Associate address** page, do the following:
   + For **Resource type**, choose **Network interface**.
   + In the **Network interface** box, enter the **Network interface ID** (ENI) for the private address.
   + Choose **Associate**.

1. Confirm that the newly associated Elastic IP address is reachable with the SSH private key that is associated with the development endpoint, as in the following example. 

   ```
   ssh -i dev-endpoint-private-key.pem glue@elastic-ip
   ```

   For information about using a bastion host to get SSH access to the development endpoint’s private address, see the AWS Security Blog post [Securely Connect to Linux Instances Running in a Private Amazon VPC](https://aws.amazon.com/blogs/security/securely-connect-to-linux-instances-running-in-a-private-amazon-vpc/).

# Tutorial: Set up a Jupyter notebook in JupyterLab to test and debug ETL scripts

In this tutorial, you connect a Jupyter notebook in JupyterLab running on your local machine to a development endpoint. You do this so that you can interactively run, debug, and test AWS Glue extract, transform, and load (ETL) scripts before deploying them. This tutorial uses Secure Shell (SSH) port forwarding to connect your local machine to an AWS Glue development endpoint. For more information, see [Port forwarding](https://en.wikipedia.org/wiki/Port_forwarding) on Wikipedia.

## Step 1: Install JupyterLab and Sparkmagic

You can install JupyterLab by using `conda` or `pip`. `conda` is an open-source package management system and environment management system that runs on Windows, macOS, and Linux. `pip` is the package installer for Python.

If you're installing on macOS, you must have Xcode installed before you can install Sparkmagic.

1. Install JupyterLab, Sparkmagic, and the related extensions.

   ```
   $ conda install -c conda-forge jupyterlab
   $ pip install sparkmagic
   $ jupyter nbextension enable --py --sys-prefix widgetsnbextension
   $ jupyter labextension install @jupyter-widgets/jupyterlab-manager
   ```

1. Find the `sparkmagic` installation directory from the `Location` field. 

   ```
   $ pip show sparkmagic | grep Location
   Location: /Users/username/.pyenv/versions/anaconda3-5.3.1/lib/python3.7/site-packages
   ```

1. Change your directory to the one returned for `Location`, and install the kernels for Scala and PySpark.

   ```
   $ cd /Users/username/.pyenv/versions/anaconda3-5.3.1/lib/python3.7/site-packages
   $ jupyter-kernelspec install sparkmagic/kernels/sparkkernel
   $ jupyter-kernelspec install sparkmagic/kernels/pysparkkernel
   ```

1. Download a sample `config` file. 

   ```
   $ curl -o ~/.sparkmagic/config.json https://raw.githubusercontent.com/jupyter-incubator/sparkmagic/master/sparkmagic/example_config.json
   ```

   In this configuration file, you can configure Spark-related parameters like `driverMemory` and `executorCores`.
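   For example, a fragment of `~/.sparkmagic/config.json` that sets the driver memory and executor cores might look like the following (the values are illustrative, not recommendations; check the downloaded sample file for the exact field names):

   ```
   {
       "session_configs": {
           "driver_memory": "1000M",
           "executor_cores": 2
       }
   }
   ```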

## Step 2: Start JupyterLab


Start JupyterLab. Your default web browser opens automatically and shows the URL `http://localhost:8888/lab/workspaces/{workspace_name}`.

```
$ jupyter lab
```

## Step 3: Initiate SSH port forwarding to connect to your development endpoint

Next, use SSH local port forwarding to forward a local port (here, `8998`) to the remote destination that is defined by AWS Glue (`169.254.76.1:8998`). 

1. Open a separate terminal window that gives you access to SSH. In Microsoft Windows, you can use the BASH shell provided by [Git for Windows](https://git-scm.com/downloads), or you can install [Cygwin](https://www.cygwin.com/).

1. Run the following SSH command, modified as follows:
   + Replace `private-key-file-path` with a path to the `.pem` file that contains the private key corresponding to the public key that you used to create your development endpoint.
   + If you're forwarding a different port than `8998`, replace the first `8998` with the local port number that you're actually using. The address `169.254.76.1:8998` is the remote endpoint and doesn't change.
   + Replace `dev-endpoint-public-dns` with the public DNS address of your development endpoint. To find this address, navigate to your development endpoint in the AWS Glue console, choose the name, and copy the **Public address** that's listed on the **Endpoint details** page.

   ```
   ssh -i private-key-file-path -NTL 8998:169.254.76.1:8998 glue@dev-endpoint-public-dns
   ```

   You will likely see a warning message like the following:

   ```
   The authenticity of host 'ec2-xx-xxx-xxx-xx.us-west-2.compute.amazonaws.com (xx.xxx.xxx.xx)'
   can't be established.  ECDSA key fingerprint is SHA256:4e97875Brt+1wKzRko+JflSnp21X7aTP3BcFnHYLEts.
   Are you sure you want to continue connecting (yes/no)?
   ```

   Enter **yes** and leave the terminal window open while you use JupyterLab. 

1. Check that SSH port forwarding is working with the development endpoint correctly.

   ```
   $ curl localhost:8998/sessions
   {"from":0,"total":0,"sessions":[]}
   ```
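   The `/sessions` response is plain JSON, so you can also inspect it programmatically. The following sketch parses the empty response shown above; each entry in `sessions` describes one Livy session (a "livy-session-N" Spark job).

   ```
   import json

   # Response body from `curl localhost:8998/sessions` on a fresh endpoint
   body = '{"from":0,"total":0,"sessions":[]}'

   payload = json.loads(body)
   print(f"{payload['total']} active Livy session(s)")
   for session in payload["sessions"]:
       print(session["id"], session["state"])
   ```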

## Step 4: Run a simple script fragment in a notebook paragraph

Now your notebook in JupyterLab should work with your development endpoint. Enter the following script fragment into your notebook and run it.

1. Check that Spark is running successfully. The following command instructs Spark to calculate `1` and then print the value.

   ```
   spark.sql("select 1").show()
   ```

1. Check if AWS Glue Data Catalog integration is working. The following command lists the tables in the Data Catalog.

   ```
   spark.sql("show tables").show()
   ```

1. Check that a simple script fragment that uses AWS Glue libraries works.

   The following script uses the `persons_json` table metadata in the AWS Glue Data Catalog to create a `DynamicFrame` from your sample data. It then prints out the item count and the schema of this data. 

```
import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
 
# Create a Glue context
glueContext = GlueContext(SparkContext.getOrCreate())
 
# Create a DynamicFrame using the 'persons_json' table
persons_DyF = glueContext.create_dynamic_frame.from_catalog(database="legislators", table_name="persons_json")
 
# Print out information about *this* data
print("Count:  ", persons_DyF.count())
persons_DyF.printSchema()
```

The output of the script is as follows.

```
 Count:  1961
 root
 |-- family_name: string
 |-- name: string
 |-- links: array
 |    |-- element: struct
 |    |    |-- note: string
 |    |    |-- url: string
 |-- gender: string
 |-- image: string
 |-- identifiers: array
 |    |-- element: struct
 |    |    |-- scheme: string
 |    |    |-- identifier: string
 |-- other_names: array
 |    |-- element: struct
 |    |    |-- note: string
 |    |    |-- name: string
 |    |    |-- lang: string
 |-- sort_name: string
 |-- images: array
 |    |-- element: struct
 |    |    |-- url: string
 |-- given_name: string
 |-- birth_date: string
 |-- id: string
 |-- contact_details: array
 |    |-- element: struct
 |    |    |-- type: string
 |    |    |-- value: string
 |-- death_date: string
```

## Troubleshooting
+ During the installation of JupyterLab, if your computer is behind a corporate proxy or firewall, you might encounter HTTP and SSL errors due to custom security profiles managed by corporate IT departments.

  The following is an example of a typical error that occurs when `conda` can't connect to its own repositories:

  ```
  CondaHTTPError: HTTP 000 CONNECTION FAILED for url <https://repo.anaconda.com/pkgs/main/win-64/current_repodata.json>
  ```

  This might happen because your company can block connections to widely used repositories in Python and JavaScript communities. For more information, see [Installation Problems](https://jupyterlab.readthedocs.io/en/stable/getting_started/installation.html#installation-problems) on the JupyterLab website.
+ If you encounter a *connection refused* error when trying to connect to your development endpoint, you might be using a development endpoint that is out of date. Try creating a new development endpoint and reconnecting.

# Tutorial: Use a SageMaker AI notebook with your development endpoint

 In AWS Glue, you can create a development endpoint and then create a SageMaker AI notebook to help develop your ETL and machine learning scripts. A SageMaker AI notebook is a fully managed machine learning compute instance running the Jupyter Notebook application.

1. In the AWS Glue console, choose **Dev endpoints** to navigate to the development endpoints list. 

1. Select the check box next to the name of a development endpoint that you want to use, and on the **Action** menu, choose **Create SageMaker notebook**.

1. Fill out the **Create and configure a notebook** page as follows:

   1. Enter a notebook name.

   1. Under **Attach to development endpoint**, verify the development endpoint.

   1. Create or choose an AWS Identity and Access Management (IAM) role.

      Creating a role is recommended. If you use an existing role, ensure that it has the required permissions. For more information, see [Step 6: Create an IAM policy for SageMaker AI notebooks](create-sagemaker-notebook-policy.md).

   1. (Optional) Choose a VPC, a subnet, and one or more security groups.

   1. (Optional) Choose an AWS Key Management Service encryption key.

   1. (Optional) Add tags for the notebook instance.

1. Choose **Create notebook**. On the **Notebooks** page, choose the refresh icon at the upper right, and continue until the **Status** shows `Ready`.

1. Select the check box next to the new notebook name, and then choose **Open notebook**.

1. Create a new notebook: On the **jupyter** page, choose **New**, and then choose **Sparkmagic (PySpark)**.

   Your screen should now look like the following:  
![\[The jupyter page has a menu bar, toolbar, and a wide text field into which you can enter statements.\]](http://docs.aws.amazon.com/glue/latest/dg/images/sagemaker-notebook.png)

1. (Optional) At the top of the page, choose **Untitled**, and give the notebook a name.

1. To start a Spark application, enter the following command into the notebook, and then in the toolbar, choose **Run**.

   ```
   spark
   ```

   After a short delay, you should see the following response:  
![\[The system response shows Spark application status and outputs the following message: SparkSession available as 'spark'.\]](http://docs.aws.amazon.com/glue/latest/dg/images/spark-command-response.png)

1. Create a dynamic frame and run a query against it: Copy, paste, and run the following code, which outputs the count and schema of the `persons_json` table.

   ```
   import sys
   from pyspark.context import SparkContext
   from awsglue.context import GlueContext
   from awsglue.transforms import *
   glueContext = GlueContext(SparkContext.getOrCreate())
   persons_DyF = glueContext.create_dynamic_frame.from_catalog(database="legislators", table_name="persons_json")
   print("Count:  ", persons_DyF.count())
   persons_DyF.printSchema()
   ```

# Tutorial: Use a REPL shell with your development endpoint

 In AWS Glue, you can create a development endpoint and then invoke a REPL (Read–Evaluate–Print Loop) shell to run PySpark code incrementally so that you can interactively debug your ETL scripts before deploying them.

 To use a REPL on a development endpoint, you must be authorized to SSH to the endpoint. 

1. On your local computer, open a terminal window that can run SSH commands, and paste in the edited SSH command. Run the command.

   Assuming that you accepted AWS Glue version 1.0 with Python 3 for the development endpoint, the output will look like this:

   ```
   Python 3.6.8 (default, Aug  2 2019, 17:42:44)
   [GCC 4.8.5 20150623 (Red Hat 4.8.5-28)] on linux
   Type "help", "copyright", "credits" or "license" for more information.
   SLF4J: Class path contains multiple SLF4J bindings.
   SLF4J: Found binding in [jar:file:/usr/share/aws/glue/etl/jars/glue-assembly.jar!/org/slf4j/impl/StaticLoggerBinder.class]
   SLF4J: Found binding in [jar:file:/usr/lib/spark/jars/slf4j-log4j12-1.7.16.jar!/org/slf4j/impl/StaticLoggerBinder.class]
   SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
   SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
   Setting default log level to "WARN".
   To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
   2019-09-23 22:12:23,071 WARN  [Thread-5] yarn.Client (Logging.scala:logWarning(66)) - Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
   2019-09-23 22:12:26,562 WARN  [Thread-5] yarn.Client (Logging.scala:logWarning(66)) - Same name resource file:/usr/lib/spark/python/lib/pyspark.zip added multiple times to distributed cache
   2019-09-23 22:12:26,580 WARN  [Thread-5] yarn.Client (Logging.scala:logWarning(66)) - Same path resource file:///usr/share/aws/glue/etl/python/PyGlue.zip added multiple times to distributed cache.
   2019-09-23 22:12:26,581 WARN  [Thread-5] yarn.Client (Logging.scala:logWarning(66)) - Same path resource file:///usr/lib/spark/python/lib/py4j-src.zip added multiple times to distributed cache.
   2019-09-23 22:12:26,581 WARN  [Thread-5] yarn.Client (Logging.scala:logWarning(66)) - Same path resource file:///usr/share/aws/glue/libs/pyspark.zip added multiple times to distributed cache.
   Welcome to
         ____              __
        / __/__  ___ _____/ /__
       _\ \/ _ \/ _ `/ __/  '_/
      /__ / .__/\_,_/_/ /_/\_\   version 2.4.3
         /_/
   
   Using Python version 3.6.8 (default, Aug  2 2019 17:42:44)
   SparkSession available as 'spark'.
   >>>
   ```

1. Test that the REPL shell is working correctly by typing the statement `print(spark.version)`. If it displays the Spark version, your REPL is ready to use.

1. Now you can try executing the following simple script, line by line, in the shell:

   ```
   import sys
   from pyspark.context import SparkContext
   from awsglue.context import GlueContext
   from awsglue.transforms import *
   glueContext = GlueContext(SparkContext.getOrCreate())
   persons_DyF = glueContext.create_dynamic_frame.from_catalog(database="legislators", table_name="persons_json")
   print("Count:  ", persons_DyF.count())
   persons_DyF.printSchema()
   ```

# Tutorial: Set up PyCharm professional with a development endpoint

This tutorial shows you how to connect the [PyCharm Professional](https://www.jetbrains.com/pycharm/) Python IDE running on your local machine to a development endpoint so that you can interactively run, debug, and test AWS Glue ETL (extract, transform, and load) scripts before deploying them. The instructions and screen captures in the tutorial are based on PyCharm Professional version 2019.3.

To connect to a development endpoint interactively, you must have PyCharm Professional installed. You can't do this using the free edition.

**Note**  
The tutorial uses Amazon S3 as a data source. If you want to use a JDBC data source instead, you must run your development endpoint in a virtual private cloud (VPC). To connect with SSH to a development endpoint in a VPC, you must create an SSH tunnel. This tutorial does not include instructions for creating an SSH tunnel. For information on using SSH to connect to a development endpoint in a VPC, see [Securely Connect to Linux Instances Running in a Private Amazon VPC](https://aws.amazon.com/blogs/security/securely-connect-to-linux-instances-running-in-a-private-amazon-vpc/) in the AWS security blog.

**Topics**
+ [

## Connecting PyCharm professional to a development endpoint
](#dev-endpoint-tutorial-pycharm-connect)
+ [

## Deploying the script to your development endpoint
](#dev-endpoint-tutorial-pycharm-deploy)
+ [

## Configuring a remote interpreter
](#dev-endpoint-tutorial-pycharm-interpreter)
+ [

## Running your script on the development endpoint
](#dev-endpoint-tutorial-pycharm-debug-run)

## Connecting PyCharm professional to a development endpoint

1. Create a new pure-Python project in PyCharm named `legislators`.

1. Create a file named `get_person_schema.py` in the project with the following content:

   ```
   from pyspark.context import SparkContext
   from awsglue.context import GlueContext
   
   
   def main():
       # Create a Glue context
       glueContext = GlueContext(SparkContext.getOrCreate())
   
       # Create a DynamicFrame using the 'persons_json' table
       persons_DyF = glueContext.create_dynamic_frame.from_catalog(database="legislators", table_name="persons_json")
   
       # Print out information about this data
       print("Count:  ", persons_DyF.count())
       persons_DyF.printSchema()
   
   
   if __name__ == "__main__":
       main()
   ```

1. Do one of the following:
   + For AWS Glue version 0.9, download the AWS Glue Python library file, `PyGlue.zip`, from `https://s3.amazonaws.com/aws-glue-jes-prod-us-east-1-assets/etl/python/PyGlue.zip` to a convenient location on your local machine.
   + For AWS Glue version 1.0 and later, download the AWS Glue Python library file, `PyGlue.zip`, from `https://s3.amazonaws.com/aws-glue-jes-prod-us-east-1-assets/etl-1.0/python/PyGlue.zip` to a convenient location on your local machine.

1. Add `PyGlue.zip` as a content root for your project in PyCharm:
   + In PyCharm, choose **File**, **Settings** to open the **Settings** dialog box. (You can also press `Ctrl+Alt+S`.)
   + Expand the `legislators` project and choose **Project Structure**. Then in the right pane, choose **+ Add Content Root**.
   + Navigate to the location where you saved `PyGlue.zip`, select it, then choose **Apply**.

    The **Settings** screen should look something like the following:  
![\[The PyCharm Settings screen with PyGlue.zip added as a content root.\]](http://docs.aws.amazon.com/glue/latest/dg/images/PyCharm_AddContentRoot.png)

   Leave the **Settings** dialog box open after you choose **Apply**.

1. Configure deployment options to upload the local script to your development endpoint using SFTP (this capability is available only in PyCharm Professional):
   + In the **Settings** dialog box, expand the **Build, Execution, Deployment** section. Choose the **Deployment** subsection.
   + Choose the **+** icon at the top of the middle pane to add a new server. Set its **Type** to `SFTP` and give it a name.
   + Set the **SFTP host** to the **Public address** of your development endpoint, as listed on its details page. (Choose the name of your development endpoint in the AWS Glue console to display the details page). For a development endpoint running in a VPC, set **SFTP host** to the host address and local port of your SSH tunnel to the development endpoint.
   + Set the **User name** to `glue`.
   + Set the **Auth type** to **Key pair (OpenSSH or Putty)**. Set the **Private key file** by browsing to the location of your development endpoint's private key file. Note that PyCharm supports only DSA, RSA, and ECDSA OpenSSH key types, and does not accept keys in PuTTY's private format. You can use an up-to-date version of `ssh-keygen` to generate a key-pair type that PyCharm accepts, using syntax like the following:

     ```
     ssh-keygen -t rsa -f <key_file_name> -C "<your_email_address>"
     ```
   + Choose **Test connection**, and allow the connection to be tested. If the connection succeeds, choose **Apply**.

    The **Settings** screen should now look something like the following:  
![\[The PyCharm Settings screen with an SFTP server defined.\]](http://docs.aws.amazon.com/glue/latest/dg/images/PyCharm_SFTP.png)

   Again, leave the **Settings** dialog box open after you choose **Apply**.

1. Map the local directory to a remote directory for deployment:
   + In the right pane of the **Deployment** page, choose the middle tab at the top, labeled **Mappings**.
   + In the **Deployment Path** column, enter a path under `/home/glue/scripts/` as the deployment path for your project. For example: `/home/glue/scripts/legislators`.
   + Choose **Apply**.

    The **Settings** screen should now look something like the following:  
![\[The PyCharm Settings screen after a deployment mapping.\]](http://docs.aws.amazon.com/glue/latest/dg/images/PyCharm_Mapping.png)

   Choose **OK** to close the **Settings** dialog box.

## Deploying the script to your development endpoint

1. Choose **Tools**, **Deployment**, and then choose the name under which you set up your development endpoint, as shown in the following image:  
![\[The menu item for deploying your script.\]](http://docs.aws.amazon.com/glue/latest/dg/images/PyCharm_Deploy.png)

   After your script has been deployed, the bottom of the screen should look something like the following:  
![\[The bottom of the PyCharm screen after a successful deployment.\]](http://docs.aws.amazon.com/glue/latest/dg/images/PyCharm_Deployed.png)

1. On the menu bar, choose **Tools**, **Deployment**, **Automatic Upload (always)**. Ensure that a check mark appears next to **Automatic Upload (always)**.

   When this option is enabled, PyCharm automatically uploads changed files to the development endpoint.

## Configuring a remote interpreter


Configure PyCharm to use the Python interpreter on the development endpoint.

1. From the **File** menu, choose **Settings**.

1. Expand the project **legislators** and choose **Project Interpreter**.

1. Choose the gear icon next to the **Project Interpreter** list, and then choose **Add**.

1. In the **Add Python Interpreter** dialog box, in the left pane, choose **SSH Interpreter**.

1. Choose **Existing server configuration**, and in the **Deployment configuration** list, choose your configuration.

   Your screen should look something like the following image.  
![\[In the left pane, SSH Interpreter is selected, and in the right pane, the Existing server configuration radio button is selected. The Deployment configuration field contains the configuration name and the message "Remote SDK is saved in IDE settings, so it needs the deployment server to be saved there too. Which do you prefer?" The following are the choices beneath that message: "Create copy of this deployment server in IDE settings" and "Move this server to IDE settings."\]](http://docs.aws.amazon.com/glue/latest/dg/images/PyCharm_Interpreter1.png)

1. Choose **Move this server to IDE settings**, and then choose **Next**.

1. In the **Interpreter** field, change the path to `/usr/bin/gluepython` if you are using Python 2, or to `/usr/bin/gluepython3` if you are using Python 3. Then choose **Finish**.

## Running your script on the development endpoint

To run the script:
+ In the left pane, right-click the file name and choose **Run '*<filename>*'**.

  After a series of messages, the final output should show the count and the schema.

  ```
  Count:   1961
  root
  |-- family_name: string
  |-- name: string
  |-- links: array
  |    |-- element: struct
  |    |    |-- note: string
  |    |    |-- url: string
  |-- gender: string
  |-- image: string
  |-- identifiers: array
  |    |-- element: struct
  |    |    |-- scheme: string
  |    |    |-- identifier: string
  |-- other_names: array
  |    |-- element: struct
  |    |    |-- lang: string
  |    |    |-- note: string
  |    |    |-- name: string
  |-- sort_name: string
  |-- images: array
  |    |-- element: struct
  |    |    |-- url: string
  |-- given_name: string
  |-- birth_date: string
  |-- id: string
  |-- contact_details: array
  |    |-- element: struct
  |    |    |-- type: string
  |    |    |-- value: string
  |-- death_date: string
  
  
  Process finished with exit code 0
  ```

You are now set up to debug your script remotely on your development endpoint.

# Advanced configuration: sharing development endpoints among multiple users

This section explains how to use development endpoints with SageMaker notebooks in typical use cases, including how to share development endpoints among multiple users.

## Single-tenancy configuration


In single-tenant use cases, to simplify the developer experience and avoid contention for resources, it is recommended that each developer use their own development endpoint, sized for the project they are working on. This also simplifies decisions about worker type and DPU count, leaving them to the discretion of the developer and the project they are working on. 

You won't need to manage resource allocation unless you run multiple notebook files concurrently. If you run code in multiple notebook files at the same time, multiple Livy sessions are launched concurrently. To segregate Spark cluster configurations so that multiple Livy sessions can run at the same time, you can follow the steps introduced for multi-tenant use cases.

For example, if your development endpoint has 10 workers and the worker type is `G.1X`, you will have 9 Spark executors, and the entire cluster will have 90 GB of executor memory, because each executor has 10 GB of memory.
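As a quick sanity check, the arithmetic above can be sketched as follows (a minimal illustration; the helper name and the 10 GB-per-executor figure for `G.1X` workers are assumptions of this sketch):

```python
# Rough capacity estimate for a Glue dev endpoint: one worker hosts the
# Spark driver; each remaining worker runs one executor (assumed 10 GB
# of memory per executor for G.1X).
def executor_capacity(number_of_workers, memory_per_executor_gb=10):
    executors = number_of_workers - 1  # the driver consumes one worker
    return executors, executors * memory_per_executor_gb

executors, memory_gb = executor_capacity(10)
print(executors, memory_gb)  # 9 executors, 90 GB of executor memory
```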

Regardless of the specified worker type, Spark dynamic resource allocation will be turned on. If a dataset is large enough, Spark may allocate all the executors to a single Livy session since `spark.dynamicAllocation.maxExecutors` is not set by default. This means that other Livy sessions on the same dev endpoint will wait to launch new executors. If the dataset is small, Spark will be able to allocate executors to multiple Livy sessions at the same time.

**Note**  
For more information about how resources are allocated in different use cases and how you set a configuration to modify the behavior, see [Advanced configuration: sharing development endpoints among multiple users](#dev-endpoint-sharing).

## Multi-tenancy configuration


**Note**  
Development endpoints are intended to emulate the AWS Glue ETL environment as a single-tenant environment. While multi-tenant use is possible, it is an advanced use case, and we recommend that most users maintain a pattern of single tenancy for each development endpoint.

In multi-tenant use cases, you might need to manage resource allocation. The key factor is the number of concurrent users who use a Jupyter notebook at the same time. If your team works in a "follow-the-sun" workflow and there is only one Jupyter user in each time zone, the number of concurrent users is only one, so you won't need to be concerned with resource allocation. However, if your notebook is shared among multiple users and each user submits code on an ad hoc basis, you will need to consider the following points.

To partition Spark cluster resources among multiple users, you can use SparkMagic configurations. There are two different ways to configure SparkMagic.

#### (A) Use the %%configure -f directive


If you want to modify the configuration per Livy session from the notebook, you can run the `%%configure -f` directive on the notebook paragraph.

For example, if you want to run a Spark application with 5 executors, you can run the following command in a notebook paragraph.

```
%%configure -f
{"numExecutors":5}
```

Then you will see only 5 executors running for the job in the Spark UI.

We recommend limiting the maximum number of executors for dynamic resource allocation.

```
%%configure -f
{"conf":{"spark.dynamicAllocation.maxExecutors":"5"}}
```

#### (B) Modify the SparkMagic config file


SparkMagic works based on the [Livy API](https://livy.incubator.apache.org/docs/latest/rest-api.html). SparkMagic creates Livy sessions with configurations such as `driverMemory`, `driverCores`, `executorMemory`, `executorCores`, `numExecutors`, and `conf`. These are the key factors that determine how much of the entire Spark cluster's resources are consumed. SparkMagic lets you provide a config file to specify the parameters that are sent to Livy. You can see a sample config file in this [GitHub repository](https://github.com/jupyter-incubator/sparkmagic/blob/master/sparkmagic/example_config.json).

If you want to modify the configuration across all the Livy sessions from a notebook, you can modify `/home/ec2-user/.sparkmagic/config.json` to add `session_configs`.

To modify the config file on a SageMaker notebook instance, you can follow these steps.

1. Open a SageMaker notebook.

1. Open the Terminal kernel.

1. Run the following commands:

   ```
   sh-4.2$ cd .sparkmagic
   sh-4.2$ ls
   config.json logs
   sh-4.2$ sudo vim config.json
   ```

   For example, you can add these lines to `/home/ec2-user/.sparkmagic/config.json` and restart the Jupyter kernel from the notebook.

   ```
     "session_configs": {
       "conf": {
         "spark.dynamicAllocation.maxExecutors":"5"
       }
     },
   ```
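If you prefer to script this edit rather than use `vim`, a sketch like the following merges the executor cap into the config file (the helper names are hypothetical; only the `session_configs` key and the default config path come from the steps above):

```python
import json
from pathlib import Path

# Default SparkMagic config location on a SageMaker notebook instance.
CONFIG_PATH = Path.home() / ".sparkmagic" / "config.json"

def with_executor_cap(config, max_executors=5):
    # Merge a per-session executor cap into a SparkMagic config dict
    # without disturbing any other settings.
    conf = config.setdefault("session_configs", {}).setdefault("conf", {})
    conf["spark.dynamicAllocation.maxExecutors"] = str(max_executors)
    return config

def cap_executors(path=CONFIG_PATH, max_executors=5):
    # Read, update, and write back the config file. Remember to restart
    # the Jupyter kernel afterward for the change to take effect.
    config = json.loads(path.read_text()) if path.exists() else {}
    path.write_text(json.dumps(with_executor_cap(config, max_executors), indent=2))
```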

### Guidelines and best practices


To avoid resource conflicts, you can use some basic approaches such as:
+ Have a larger Spark cluster by increasing the `NumberOfWorkers` (scaling horizontally) and upgrading the `workerType` (scaling vertically)
+ Allocate fewer resources per user (fewer resources per Livy session)

Your approach will depend on your use case. If you have a larger development endpoint, and there is not a huge amount of data, the possibility of a resource conflict will decrease significantly because Spark can allocate resources based on a dynamic allocation strategy.

As described above, the number of Spark executors is calculated automatically from a combination of the DPU count (or `NumberOfWorkers`) and the worker type. Each Spark application launches one driver and multiple executors, so `NumberOfWorkers = NumberOfExecutors + 1`. The matrix below shows how much capacity you need in your development endpoint based on the number of concurrent users.


****  

| Number of concurrent notebook users | Number of Spark executors you want to allocate per user | Total NumberOfWorkers for your dev endpoint | 
| --- | --- | --- | 
| 3 | 5 | 18 | 
| 10 | 5 | 60 | 
| 50 | 5 | 300 | 
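The matrix follows from applying `NumberOfWorkers = NumberOfExecutors + 1` per user; a small sketch (hypothetical helper name) reproduces the figures:

```python
# Each concurrent user needs one worker for their Spark driver plus
# one worker per executor allocated to them.
def workers_needed(concurrent_users, executors_per_user):
    return concurrent_users * (executors_per_user + 1)

for users in (3, 10, 50):
    print(users, "users ->", workers_needed(users, 5), "workers")
# 3 -> 18, 10 -> 60, 50 -> 300, matching the matrix above
```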

If you want to allocate fewer resources per user, `spark.dynamicAllocation.maxExecutors` (or `numExecutors`) is the easiest parameter to configure as a Livy session parameter. If you set the following configuration in `/home/ec2-user/.sparkmagic/config.json`, SparkMagic assigns a maximum of 5 executors per Livy session, which helps segregate resources per session.

```
"session_configs": {
    "conf": {
      "spark.dynamicAllocation.maxExecutors":"5"
    }
  },
```

Suppose a development endpoint has 18 workers (`G.1X`) and 3 concurrent notebook users. If your session configuration sets `spark.dynamicAllocation.maxExecutors=5`, then each user gets one driver and up to 5 executors, and there won't be any resource conflicts even when multiple notebook paragraphs run at the same time.

#### Trade-offs


With the session configuration `"spark.dynamicAllocation.maxExecutors":"5"`, you avoid resource conflict errors, and you don't need to wait for resource allocation when multiple users access the endpoint concurrently. However, even when many resources are free (for example, when there are no other concurrent users), Spark cannot assign more than 5 executors to your Livy session.

#### Other notes


It is a good practice to stop the Jupyter kernel when you finish using a notebook. This frees resources so that other notebook users can use them immediately, without waiting for the kernel to expire (auto-shutdown).

### Common issues


Even when you follow the guidelines, you might experience the following issues.

#### Session not found


If you try to run a notebook paragraph after your Livy session has been terminated, you will see the following message. To activate a Livy session, restart the Jupyter kernel by choosing **Kernel** > **Restart** in the Jupyter menu, and then run the notebook paragraph again.

```
An error was encountered:
Invalid status code '404' from http://localhost:8998/sessions/13 with error payload: "Session '13' not found."
```

#### Not enough YARN resources


If you try to run a notebook paragraph when your Spark cluster does not have enough resources to start a new Livy session, you will see the following message. Following the guidelines usually avoids this issue, but you might still encounter it. To work around it, check whether any unneeded Livy sessions are still active; if there are, terminate them to free cluster resources. See the next section for details.

```
Warning: The Spark session does not have enough YARN resources to start. 
The code failed because of a fatal error:
    Session 16 did not start up in 60 seconds..

Some things to try:
a) Make sure Spark has enough available resources for Jupyter to create a Spark context.
b) Contact your Jupyter administrator to make sure the Spark magics library is configured correctly.
c) Restart the kernel.
```

### Monitoring and debugging


This section describes techniques for monitoring resources and sessions.

#### Monitoring and debugging cluster resource allocation


You can watch the Spark UI to monitor how many resources are allocated per Livy session and which Spark configurations are in effect for the job. To activate the Spark UI, see [Enabling the Apache Spark Web UI for Development Endpoints](https://docs.aws.amazon.com/glue/latest/dg/monitor-spark-ui-dev-endpoints.html).

(Optional) If you need a real-time view of the Spark UI, you can configure an SSH tunnel against the Spark history server running on the Spark cluster.

```
ssh -i <private-key.pem> -N -L 8157:<development endpoint public address>:18080 glue@<development endpoint public address>
```

You can then open http://localhost:8157 in your browser to view the Spark UI.

#### Free unneeded Livy sessions


Review these procedures to shut down any unneeded Livy sessions from a notebook or a Spark cluster.

**(a). Terminate Livy sessions from a notebook**  
You can shut down the kernel on a Jupyter notebook to terminate unneeded Livy sessions.

**(b). Terminate Livy sessions from a Spark cluster**  
If unneeded Livy sessions are still running, you can shut them down on the Spark cluster.

As a prerequisite for this procedure, you must configure your SSH public key for the development endpoint.

To log in to the Spark cluster, you can run the following command:

```
$ ssh -i <private-key.pem> glue@<development endpoint public address>
```

You can run the following command to see the active Livy sessions:

```
$ yarn application -list
20/09/25 06:22:21 INFO client.RMProxy: Connecting to ResourceManager at ip-255-1-106-206.ec2.internal/172.38.106.206:8032
Total number of applications (application-types: [] and states: [SUBMITTED, ACCEPTED, RUNNING]):2
Application-Id Application-Name Application-Type User Queue State Final-State Progress Tracking-URL
application_1601003432160_0005 livy-session-4 SPARK livy default RUNNING UNDEFINED 10% http://ip-255-1-4-130.ec2.internal:41867
application_1601003432160_0004 livy-session-3 SPARK livy default RUNNING UNDEFINED 10% http://ip-255-1-179-185.ec2.internal:33727
```
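If you want to automate this cleanup, you can extract the Livy application IDs from that listing with a short script. The following sketch assumes the whitespace-separated column layout shown above, where the application name in the second column starts with `livy-session`:

```python
def livy_application_ids(yarn_list_output):
    # Scan each line of `yarn application -list` output; data rows start
    # with the application ID, followed by the application name.
    ids = []
    for line in yarn_list_output.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[1].startswith("livy-session"):
            ids.append(fields[0])
    return ids
```

You could then pass each returned ID to `yarn application -kill`.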

You can then shut down the Livy session with the following command:

```
$ yarn application -kill application_1601003432160_0005
20/09/25 06:23:38 INFO client.RMProxy: Connecting to ResourceManager at ip-255-1-106-206.ec2.internal/255.1.106.206:8032
Killing application application_1601003432160_0005
20/09/25 06:23:39 INFO impl.YarnClientImpl: Killed application application_1601003432160_0005
```

# Managing notebooks

**Note**  
 Development Endpoints are only supported for versions of AWS Glue prior to 2.0. For an interactive environment where you can author and test ETL scripts, use [Notebooks on AWS Glue Studio](https://docs.aws.amazon.com/glue/latest/ug/notebooks-chapter.html). 

A notebook enables interactive development and testing of your ETL (extract, transform, and load) scripts on a development endpoint. AWS Glue provides an interface to SageMaker AI Jupyter notebooks. With AWS Glue, you create and manage SageMaker AI notebooks. You can also open SageMaker AI notebooks from the AWS Glue console.

In addition, you can use Apache Spark with SageMaker AI on AWS Glue development endpoints (but not in AWS Glue ETL jobs). SageMaker Spark is an open-source Apache Spark library for SageMaker AI. For more information, see [Using Apache Spark with Amazon SageMaker](https://docs.aws.amazon.com/sagemaker/latest/dg/apache-spark.html). 


Managing SageMaker AI notebooks with AWS Glue development endpoints is available in the following AWS Regions:


| Region | Code | 
| --- | --- | 
| US East (Ohio) | `us-east-2` | 
| US East (N. Virginia) | `us-east-1` | 
| US West (N. California) | `us-west-1` | 
| US West (Oregon) | `us-west-2` | 
| Asia Pacific (Tokyo) | `ap-northeast-1` | 
| Asia Pacific (Seoul) | `ap-northeast-2` | 
| Asia Pacific (Mumbai) | `ap-south-1` | 
| Asia Pacific (Singapore) | `ap-southeast-1` | 
| Asia Pacific (Sydney) | `ap-southeast-2` | 
| Canada (Central) | `ca-central-1` | 
| Europe (Frankfurt) | `eu-central-1` | 
| Europe (Ireland) | `eu-west-1` | 
| Europe (London) | `eu-west-2` | 

**Topics**