

# Monitoring jobs using the Apache Spark web UI
Monitoring with the Spark UI

You can use the Apache Spark web UI to monitor and debug AWS Glue ETL jobs running on the AWS Glue job system, and also Spark applications running on AWS Glue development endpoints. The Spark UI enables you to check the following for each job:
+ The event timeline of each Spark stage
+ A directed acyclic graph (DAG) of the job
+ Physical and logical plans for SparkSQL queries
+ The underlying Spark environmental variables for each job

For more information about using the Spark Web UI, see [Web UI](https://spark.apache.org/docs/3.3.0/web-ui.html) in the Spark documentation. For guidance on how to interpret Spark UI results to improve the performance of your job, see [Best practices for performance tuning AWS Glue for Apache Spark jobs](https://docs.aws.amazon.com/prescriptive-guidance/latest/tuning-aws-glue-for-apache-spark/introduction.html) in AWS Prescriptive Guidance.

You can see the Spark UI in the AWS Glue console. It is available when an AWS Glue job runs on AWS Glue 3.0 or later versions with logs generated in the Standard (rather than legacy) format, which is the default for newer jobs. If you have log files larger than 0.5 GB, you can enable rolling log support for job runs on AWS Glue 4.0 or later versions to simplify log archiving, analysis, and troubleshooting.

You can enable the Spark UI by using the AWS Glue console or the AWS Command Line Interface (AWS CLI). When you enable the Spark UI, AWS Glue ETL jobs and Spark applications on AWS Glue development endpoints can back up Spark event logs to a location that you specify in Amazon Simple Storage Service (Amazon S3). You can use the backed-up event logs in Amazon S3 with the Spark UI, both in real time as the job is running and after the job is complete. As long as the logs remain in Amazon S3, the Spark UI in the AWS Glue console can view them.

## Permissions


To use the Spark UI in the AWS Glue console, you can use `UseGlueStudio` or add all the individual service APIs. All of the APIs are needed to use the Spark UI completely; however, for fine-grained access, users can access individual Spark UI features by adding only the corresponding service APIs to their IAM permissions.

`RequestLogParsing` is the most critical API because it performs the parsing of logs. The remaining APIs read the respective parsed data. For example, `GetStages` provides access to data about all stages of a Spark job.

The sample policy below lists the Spark UI service APIs mapped to `UseGlueStudio`, and provides access to only the Spark UI features. To add more permissions, such as Amazon S3 and IAM, see [Creating Custom IAM Policies for AWS Glue Studio](https://docs.aws.amazon.com/glue/latest/dg/getting-started-min-privs.html#getting-started-all-gs-privs.html). When using a Spark UI service API, use the following namespace: `glue:<ServiceAPI>`.

------
#### [ JSON ]


```
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowGlueStudioSparkUI",
      "Effect": "Allow",
      "Action": [
        "glue:RequestLogParsing",
        "glue:GetLogParsingStatus",
        "glue:GetEnvironment",
        "glue:GetJobs",
        "glue:GetJob",
        "glue:GetStage",
        "glue:GetStages",
        "glue:GetStageFiles",
        "glue:BatchGetStageFiles",
        "glue:GetStageAttempt",
        "glue:GetStageAttemptTaskList",
        "glue:GetStageAttemptTaskSummary",
        "glue:GetExecutors",
        "glue:GetExecutorsThreads",
        "glue:GetStorage",
        "glue:GetStorageUnit",
        "glue:GetQueries",
        "glue:GetQuery"
      ],
      "Resource": [
        "*"
      ]
    }
  ]
}
```

------
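The `glue:<ServiceAPI>` namespace rule above can be applied programmatically. The following sketch rebuilds the same statement by prefixing each Spark UI service API name with `glue:`; the policy document is what matters — this helper code is illustrative only:

```python
import json

# Spark UI service APIs from the sample policy above.
spark_ui_apis = [
    "RequestLogParsing", "GetLogParsingStatus", "GetEnvironment",
    "GetJobs", "GetJob", "GetStage", "GetStages", "GetStageFiles",
    "BatchGetStageFiles", "GetStageAttempt", "GetStageAttemptTaskList",
    "GetStageAttemptTaskSummary", "GetExecutors", "GetExecutorsThreads",
    "GetStorage", "GetStorageUnit", "GetQueries", "GetQuery",
]

# Each action uses the glue:<ServiceAPI> namespace.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowGlueStudioSparkUI",
            "Effect": "Allow",
            "Action": [f"glue:{api}" for api in spark_ui_apis],
            "Resource": ["*"],
        }
    ],
}

print(json.dumps(policy, indent=2))
```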

## Limitations

+ Spark UI in the AWS Glue console is not available for job runs that occurred before November 20, 2023, because they are in the legacy log format.
+ Spark UI in the AWS Glue console supports rolling logs for AWS Glue 4.0, such as those generated by default in streaming jobs. The maximum sum of all generated rolled log event files is 2 GB. For AWS Glue jobs without rolled log support, the maximum log event file size supported for the Spark UI is 0.5 GB.
+ Serverless Spark UI is not available for Spark event logs stored in an Amazon S3 bucket that can only be accessed by your VPC.
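As a rough sketch of the size limits above, the following checks whether a job run's event log files fit within what the Spark UI in the console supports. The function name and byte arithmetic are illustrative, not a Glue API:

```python
GB = 1024 ** 3

def logs_viewable_in_console(file_sizes_bytes, rolling_logs=True):
    """Check event log sizes against the documented Spark UI limits:
    2 GB total with rolling logs, 0.5 GB per file without."""
    if rolling_logs:
        return sum(file_sizes_bytes) <= 2 * GB
    return all(size <= 0.5 * GB for size in file_sizes_bytes)

# A 1.5 GB total of rolled files is within the 2 GB cap ...
print(logs_viewable_in_console([GB, GB // 2]))                        # True
# ... but a single 0.7 GB file without rolling exceeds 0.5 GB.
print(logs_viewable_in_console([int(0.7 * GB)], rolling_logs=False))  # False
```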

## Example: Apache Spark web UI


This example shows you how to use the Spark UI to understand your job performance. Screenshots show the Spark web UI as provided by a self-managed Spark history server. Spark UI in the AWS Glue console provides similar views. For more information about using the Spark Web UI, see [Web UI](https://spark.apache.org/docs/3.3.0/web-ui.html) in the Spark documentation.

The following is an example of a Spark application that reads from two data sources, performs a join transform, and writes the result to Amazon S3 in Parquet format.

```
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.sql.functions import count, when, expr, col, sum, isnull
from pyspark.sql.functions import countDistinct
from awsglue.dynamicframe import DynamicFrame
 
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
 
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
 
job = Job(glueContext)
job.init(args['JOB_NAME'])
 
df_persons = spark.read.json("s3://awsglue-datasets/examples/us-legislators/all/persons.json")
df_memberships = spark.read.json("s3://awsglue-datasets/examples/us-legislators/all/memberships.json")
 
df_joined = df_persons.join(df_memberships, df_persons.id == df_memberships.person_id, 'fullouter')
df_joined.write.parquet("s3://aws-glue-demo-sparkui/output/")
 
job.commit()
```

The following DAG visualization shows the different stages in this Spark job.

![\[Screenshot of Spark UI showing 2 completed stages for job 0.\]](http://docs.aws.amazon.com/glue/latest/dg/images/spark-ui1.png)


The following event timeline for a job shows the start, execution, and termination of different Spark executors.

![\[Screenshot of Spark UI showing the completed, failed, and active stages of different Spark executors.\]](http://docs.aws.amazon.com/glue/latest/dg/images/spark-ui2.png)


The following screen shows the details of the SparkSQL query plans:
+ Parsed logical plan
+ Analyzed logical plan
+ Optimized logical plan
+ Physical plan for execution

![\[SparkSQL query plans: parsed, analyzed, and optimized logical plan and physical plans for execution.\]](http://docs.aws.amazon.com/glue/latest/dg/images/spark-ui3.png)


**Topics**
+ [Permissions](#monitor-spark-ui-limitations-permissions)
+ [Limitations](#monitor-spark-ui-limitations)
+ [Example: Apache Spark web UI](#monitor-spark-ui-limitations-example)
+ [Enabling the Apache Spark web UI for AWS Glue jobs](monitor-spark-ui-jobs.md)
+ [Launching the Spark history server](monitor-spark-ui-history.md)

# Enabling the Apache Spark web UI for AWS Glue jobs
Enabling the Spark UI for jobs

You can use the Apache Spark web UI to monitor and debug AWS Glue ETL jobs running on the AWS Glue job system. You can configure the Spark UI using the AWS Glue console or the AWS Command Line Interface (AWS CLI).

Every 30 seconds, AWS Glue backs up the Spark event logs to the Amazon S3 path that you specify.

**Topics**
+ [Configuring the Spark UI (console)](#monitor-spark-ui-jobs-console)
+ [Configuring the Spark UI (AWS CLI)](#monitor-spark-ui-jobs-cli)
+ [Configuring the Spark UI for sessions using Notebooks](#monitor-spark-ui-sessions)
+ [Enable rolling logs](#monitor-spark-ui-rolling-logs)

## Configuring the Spark UI (console)
Using the AWS Management Console

Follow these steps to configure the Spark UI by using the AWS Management Console. When you create an AWS Glue job, the Spark UI is enabled by default.

**To turn on the Spark UI when you create or edit a job**

1. Sign in to the AWS Management Console and open the AWS Glue console at [https://console.aws.amazon.com/glue/](https://console.aws.amazon.com/glue/).

1. In the navigation pane, choose **Jobs**.

1. Choose **Add job**, or select an existing one.

1. In **Job details**, open the **Advanced properties**.

1. Under the **Spark UI** tab, choose **Write Spark UI logs to Amazon S3**.

1. Specify an Amazon S3 path for storing the Spark event logs for the job. Note that if you use a security configuration in the job, the encryption also applies to the Spark UI log file. For more information, see [Encrypting data written by AWS Glue](encryption-security-configuration.md).

1. Under **Spark UI logging and monitoring configuration**:
   + Select **Standard** if you are generating logs to view in the AWS Glue console.
   + Select **Legacy** if you are generating logs to view on a Spark history server.
   + You can also choose to generate both.

## Configuring the Spark UI (AWS CLI)
Using the AWS CLI

To generate logs for viewing with the Spark UI in the AWS Glue console, use the AWS CLI to pass the following job parameters to AWS Glue jobs. For more information, see [Using job parameters in AWS Glue jobs](aws-glue-programming-etl-glue-arguments.md).

```
'--enable-spark-ui': 'true',
'--spark-event-logs-path': 's3://s3-event-log-path'
```

To write logs to their legacy locations, set the `--enable-spark-ui-legacy-path` parameter to `"true"`. If you do not want to generate logs in both formats, remove the `--enable-spark-ui` parameter.
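If you create or update jobs programmatically, the same parameters go into the job's default arguments. A minimal sketch, assuming a placeholder bucket name; the resulting mapping can be passed as `DefaultArguments` to, for example, boto3's `create_job`:

```python
import json

# Placeholder S3 path -- substitute your own bucket and prefix.
spark_ui_arguments = {
    "--enable-spark-ui": "true",
    "--spark-event-logs-path": "s3://amzn-s3-demo-bucket/spark-ui-logs/",
}

# The AWS CLI's --default-arguments option takes the same mapping as a
# JSON string:
print(json.dumps(spark_ui_arguments))
```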

## Configuring the Spark UI for sessions using Notebooks
Using Notebooks

**Warning**  
AWS Glue interactive sessions do not currently support Spark UI in the console. Configure a Spark history server.

If you use AWS Glue notebooks, set up the Spark UI configuration before starting the session by using the `%%configure` cell magic:

```
%%configure
{
  "--enable-spark-ui": "true",
  "--spark-event-logs-path": "s3://path"
}
```

## Enable rolling logs


Enabling the Spark UI and rolling log event files for AWS Glue jobs provides several benefits:
+ Rolling log event files – With rolling log event files enabled, AWS Glue generates separate log files for each step of the job execution, making it easier to identify and troubleshoot issues specific to a particular stage or transformation.
+ Better log management – Rolling log event files help in managing log files more efficiently. Instead of having a single, potentially large log file, the logs are split into smaller, more manageable files based on the job execution stages. This can simplify log archiving, analysis, and troubleshooting.
+ Improved fault tolerance – If an AWS Glue job fails or is interrupted, the rolling log event files can provide valuable information about the last successful stage, making it easier to resume the job from that point rather than starting from scratch.
+ Cost optimization – By enabling rolling log event files, you can save on storage costs associated with log files. Instead of storing a single, potentially large log file, you store smaller, more manageable log files, which can be more cost-effective, especially for long-running or complex jobs.

In a new environment, you can explicitly enable rolling logs with the following job parameter:

```
'--conf': 'spark.eventLog.rolling.enabled=true'
```

or

```
'--conf': 'spark.eventLog.rolling.enabled=true --conf spark.eventLog.rolling.maxFileSize=128m'
```

When rolling logs are enabled, `spark.eventLog.rolling.maxFileSize` specifies the maximum size of an event log file before it rolls over. This parameter is optional; if not specified, the default value is 128 MB, and the minimum is 10 MB.
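The bounds above can be sketched as a small validation helper. The parsing and function name are illustrative only, and Spark itself accepts size suffixes other than the `m` handled here:

```python
def rolling_max_file_size_mb(value=None):
    """Resolve a spark.eventLog.rolling.maxFileSize setting in MB.

    Defaults to 128 MB when the optional parameter is omitted; rejects
    values below the 10 MB minimum. Only the 'm' (megabyte) suffix is
    handled in this sketch.
    """
    if value is None:
        return 128  # default when the optional parameter is omitted
    if not value.endswith("m"):
        raise ValueError("this sketch only handles sizes like '128m'")
    size_mb = int(value[:-1])
    if size_mb < 10:
        raise ValueError("minimum rolling log file size is 10 MB")
    return size_mb

print(rolling_max_file_size_mb())        # 128
print(rolling_max_file_size_mb("256m"))  # 256
```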

The maximum sum of all generated rolled log event files is 2 GB. For AWS Glue jobs without rolling log support, the maximum log event file size supported for the Spark UI is 0.5 GB.

You can turn off rolling logs for a streaming job by passing an additional configuration. Note that very large log files may be costly to maintain.

To turn off rolling logs, provide the following configuration:

```
'--spark-ui-event-logs-path': 'true',
'--conf': 'spark.eventLog.rolling.enabled=false'
```

# Launching the Spark history server
Launching the Spark history server

You can use a Spark history server to visualize Spark logs on your own infrastructure. You can see the same visualizations in the AWS Glue console for AWS Glue job runs on AWS Glue 4.0 or later versions with logs generated in the Standard (rather than legacy) format. For more information, see [Monitoring jobs using the Apache Spark web UI](monitor-spark-ui.md).

You can launch the Spark history server using an AWS CloudFormation template that hosts the server on an EC2 instance, or launch it locally using Docker.

**Topics**
+ [Launching the Spark history server and viewing the Spark UI using AWS CloudFormation](#monitor-spark-ui-history-cfn)
+ [Launching the Spark history server and viewing the Spark UI using Docker](#monitor-spark-ui-history-local)

## Launching the Spark history server and viewing the Spark UI using AWS CloudFormation
Launching the Spark history server using AWS CloudFormation

You can use an AWS CloudFormation template to start the Apache Spark history server and view the Spark web UI. These templates are samples that you should modify to meet your requirements.

**To start the Spark history server and view the Spark UI using CloudFormation**

1. Choose one of the **Launch Stack** buttons in the following table. This launches the stack on the CloudFormation console.    
[\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/monitor-spark-ui-history.html)

1. On the **Specify template** page, choose **Next**.

1. On the **Specify stack details** page, enter the **Stack name**. Enter additional information under **Parameters**.

   1. **Spark UI configuration**

      Provide the following information:
      + **IP address range** — The IP address range that can be used to view the Spark UI. If you want to restrict access to a specific IP address range, use a custom value.
      + **History server port** — The port for the Spark UI. You can use the default value.
      + **Event log directory** — Choose the location where Spark event logs are stored from the AWS Glue job or development endpoints. You must use **s3a://** for the event logs path scheme.
      + **Spark package location** — You can use the default value.
      + **Keystore path** — SSL/TLS keystore path for HTTPS. If you want to use a custom keystore file, you can specify the S3 path `s3://path_to_your_keystore_file` here. If you leave this parameter empty, a self-signed certificate based keystore is generated and used.
      + **Keystore password** — Enter an SSL/TLS keystore password for HTTPS.

   1. **EC2 instance configuration**

      Provide the following information:
      + **Instance type** — The type of Amazon EC2 instance that hosts the Spark history server. Because this template launches an Amazon EC2 instance in your account, Amazon EC2 costs are charged to your account separately.
      + **Latest AMI ID** — The AMI ID of Amazon Linux 2 for the Spark history server instance. You can use the default value.
      + **VPC ID** — The virtual private cloud (VPC) ID for the Spark history server instance. You can use any of the VPCs available in your account. Using a default VPC with a [default Network ACL](https://docs.aws.amazon.com/vpc/latest/userguide/vpc-network-acls.html#default-network-acl) is not recommended. For more information, see [Default VPC and Default Subnets](https://docs.aws.amazon.com/vpc/latest/userguide/default-vpc.html) and [Creating a VPC](https://docs.aws.amazon.com/vpc/latest/userguide/working-with-vpcs.html#Create-VPC) in the *Amazon VPC User Guide*.
      + **Subnet ID** — The subnet ID for the Spark history server instance. You can use any of the subnets in your VPC, but your client must be able to reach the subnet over the network. If you want access via the internet, you must use a public subnet that has an internet gateway in its route table.

   1. Choose **Next**.

1. On the **Configure stack options** page, to use the current user credentials for determining how CloudFormation can create, modify, or delete resources in the stack, choose **Next**. You can also specify a role in the **Permissions** section to use instead of the current user permissions, and then choose **Next**.

1. On the **Review** page, review the template. 

   Select **I acknowledge that CloudFormation might create IAM resources**, and then choose **Create stack**.

1. Wait for the stack to be created.

1. Open the **Outputs** tab.

   1. Copy the URL of **SparkUiPublicUrl** if you are using a public subnet.

   1. Copy the URL of **SparkUiPrivateUrl** if you are using a private subnet.

1. Open a web browser, and paste in the URL. This lets you access the server using HTTPS on the specified port. Your browser may not recognize the server's certificate, in which case you have to override its protection and proceed anyway. 

## Launching the Spark history server and viewing the Spark UI using Docker
Launching the Spark history server using Docker

If you prefer local access (rather than hosting an EC2 instance for the Apache Spark history server), you can also use Docker to start the Apache Spark history server and view the Spark UI locally. This Dockerfile is a sample that you should modify to meet your requirements.

 **Prerequisites** 

For information about how to install Docker on your laptop, see the [Docker Engine community](https://docs.docker.com/install/).

**To start the Spark history server and view the Spark UI locally using Docker**

1. Download files from GitHub.

   Download the Dockerfile and `pom.xml` from [ AWS Glue code samples](https://github.com/aws-samples/aws-glue-samples/tree/master/utilities/Spark_UI/).

1. Determine if you want to use your user credentials or federated user credentials to access AWS.
   + To use the current user credentials for accessing AWS, get the values to use for `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` in the `docker run` command. For more information, see [Managing access keys for IAM users](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_access-keys.html) in the *IAM User Guide*.
   + To use SAML 2.0 federated users for accessing AWS, get the values for `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, and `AWS_SESSION_TOKEN`. For more information, see [Requesting temporary security credentials](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_temp_request.html) in the *IAM User Guide*.

1. Determine the location of your event log directory, to use in the `docker run` command.

1. Build the Docker image using the files in the local directory, using the name `glue/sparkui` and the tag `latest`.

   ```
   $ docker build -t glue/sparkui:latest . 
   ```

1. Create and start the docker container.

   In the following commands, use the values obtained previously in steps 2 and 3.

   1. To create the docker container using your user credentials, use a command similar to the following.

      ```
      docker run -itd -e SPARK_HISTORY_OPTS="$SPARK_HISTORY_OPTS -Dspark.history.fs.logDirectory=s3a://path_to_eventlog
       -Dspark.hadoop.fs.s3a.access.key=AWS_ACCESS_KEY_ID -Dspark.hadoop.fs.s3a.secret.key=AWS_SECRET_ACCESS_KEY"
       -p 18080:18080 glue/sparkui:latest "/opt/spark/bin/spark-class org.apache.spark.deploy.history.HistoryServer"
      ```

   1. To create the docker container using temporary credentials, use `org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider` as the provider, and provide the credential values obtained in step 2. For more information, see [Using Session Credentials with TemporaryAWSCredentialsProvider](https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html#Using_Session_Credentials_with_TemporaryAWSCredentialsProvider) in the *Hadoop: Integration with Amazon Web Services* documentation.

      ```
      docker run -itd -e SPARK_HISTORY_OPTS="$SPARK_HISTORY_OPTS -Dspark.history.fs.logDirectory=s3a://path_to_eventlog
       -Dspark.hadoop.fs.s3a.access.key=AWS_ACCESS_KEY_ID -Dspark.hadoop.fs.s3a.secret.key=AWS_SECRET_ACCESS_KEY
       -Dspark.hadoop.fs.s3a.session.token=AWS_SESSION_TOKEN
       -Dspark.hadoop.fs.s3a.aws.credentials.provider=org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider"
       -p 18080:18080 glue/sparkui:latest "/opt/spark/bin/spark-class org.apache.spark.deploy.history.HistoryServer"
      ```
**Note**  
These configuration parameters come from the [Hadoop-AWS Module](https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html). You may need to add specific configuration based on your use case. For example, users in isolated regions need to configure `spark.hadoop.fs.s3a.endpoint`.

1. Open `http://localhost:18080` in your browser to view the Spark UI locally.