

# Monitoring with Amazon CloudWatch
<a name="monitor-cloudwatch"></a>

You can monitor AWS Glue using Amazon CloudWatch, which collects and processes raw data from AWS Glue into readable, near-real-time metrics. These statistics are recorded for a period of two weeks so that you can access historical information for a better perspective on how your jobs and services are performing. By default, AWS Glue metrics data is sent to CloudWatch automatically. For more information, see [What Is Amazon CloudWatch?](https://docs.aws.amazon.com/AmazonCloudWatch/latest/DeveloperGuide/WhatIsCloudWatch.html) in the *Amazon CloudWatch User Guide*, and [AWS Glue metrics](monitoring-awsglue-with-cloudwatch-metrics.md#awsglue-metrics).

**Continuous logging**

AWS Glue also supports real-time continuous logging for AWS Glue jobs. When continuous logging is enabled for a job, you can view the real-time logs on the AWS Glue console or the CloudWatch console dashboard. For more information, see [Logging for AWS Glue jobs](monitor-continuous-logging.md).

 **Observability metrics** 

When **Job observability metrics** is enabled, additional Amazon CloudWatch metrics are generated when the job runs. Use AWS Glue observability metrics to generate insights into what is happening inside your AWS Glue jobs to improve triaging and analysis of issues.

**Topics**
+ [Monitoring AWS Glue using Amazon CloudWatch metrics](monitoring-awsglue-with-cloudwatch-metrics.md)
+ [Setting up Amazon CloudWatch alarms on AWS Glue job profiles](monitor-profile-glue-job-cloudwatch-alarms.md)
+ [Logging for AWS Glue jobs](monitor-continuous-logging.md)
+ [Monitoring with AWS Glue Observability metrics](monitor-observability.md)

# Monitoring AWS Glue using Amazon CloudWatch metrics
<a name="monitoring-awsglue-with-cloudwatch-metrics"></a>

You can profile and monitor AWS Glue operations using AWS Glue job profiler. It collects and processes raw data from AWS Glue jobs into readable, near real-time metrics stored in Amazon CloudWatch. These statistics are retained and aggregated in CloudWatch so that you can access historical information for a better perspective on how your application is performing.

**Note**  
 You may incur additional charges when you enable job metrics and CloudWatch custom metrics are created. For more information, see [Amazon CloudWatch pricing](https://aws.amazon.com/cloudwatch/pricing/). 

## AWS Glue metrics overview
<a name="metrics-overview"></a>

When you interact with AWS Glue, it sends metrics to CloudWatch. You can view these metrics using the AWS Glue console (the preferred method), the CloudWatch console dashboard, or the AWS Command Line Interface (AWS CLI). 

**To view metrics using the AWS Glue console dashboard**

You can view summary or detailed graphs of metrics for a job, or detailed graphs for a job run. 

1. Sign in to the AWS Management Console and open the AWS Glue console at [https://console.aws.amazon.com/glue/](https://console.aws.amazon.com/glue/).

1. In the navigation pane, choose **Job run monitoring**.

1. In **Job runs**, choose **Actions** to stop a job that is currently running, view a job, or rewind a job bookmark.

1. Select a job, then choose **View run details** to view additional information about the job run.

**To view metrics using the CloudWatch console dashboard**

Metrics are grouped first by the service namespace, and then by the various dimension combinations within each namespace.

1. Open the CloudWatch console at [https://console.aws.amazon.com/cloudwatch/](https://console.aws.amazon.com/cloudwatch/).

1. In the navigation pane, choose **Metrics**.

1. Choose the **Glue** namespace.

**To view metrics using the AWS CLI**
+ At a command prompt, use the following command.

  ```
  aws cloudwatch list-metrics --namespace Glue
  ```

AWS Glue reports metrics to CloudWatch every 30 seconds, and the CloudWatch metrics dashboards are configured to display them every minute. The AWS Glue metrics represent delta values from the previously reported values. Where appropriate, metrics dashboards aggregate (sum) the 30-second values to obtain a value for the entire last minute.
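Because these are delta values, you can recover per-minute totals by requesting a Sum statistic over a 60-second period. The following is a sketch using the AWS CLI; the job name `my-etl-job` and the time range are placeholders you must replace with your own values.

```shell
# Sum the 30-second delta values of completed tasks into per-minute totals.
# The job name and time range below are placeholders.
aws cloudwatch get-metric-statistics \
  --namespace Glue \
  --metric-name glue.driver.aggregate.numCompletedTasks \
  --dimensions Name=JobName,Value=my-etl-job Name=JobRunId,Value=ALL Name=Type,Value=count \
  --start-time 2024-01-01T00:00:00Z \
  --end-time 2024-01-01T01:00:00Z \
  --period 60 \
  --statistics Sum
```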

### AWS Glue metrics behavior for Spark jobs
<a name="metrics-overview-spark"></a>

 AWS Glue metrics are enabled at initialization of a `GlueContext` in a script and are generally updated only at the end of an Apache Spark task. They represent the aggregate values across all completed Spark tasks so far.

However, the Spark metrics that AWS Glue passes on to CloudWatch are generally absolute values representing the current state at the time they are reported. AWS Glue reports them to CloudWatch every 30 seconds, and the metrics dashboards generally show the average across the data points received in the last 1 minute.

AWS Glue metric names are all preceded by one of the following types of prefix:
+ `glue.driver.` – Metrics whose names begin with this prefix either represent AWS Glue metrics that are aggregated from all executors at the Spark driver, or Spark metrics corresponding to the Spark driver.
+ `glue.`*executorId*`.` – The *executorId* is the number of a specific Spark executor. It corresponds with the executors listed in the logs.
+ `glue.ALL.` – Metrics whose names begin with this prefix aggregate values from all Spark executors.

## AWS Glue metrics
<a name="awsglue-metrics"></a>

AWS Glue profiles and sends the following metrics to CloudWatch every 30 seconds, and the AWS Glue Metrics Dashboard reports them once a minute:


| Metric | Description | 
| --- | --- | 
|  `glue.driver.aggregate.bytesRead` |  The number of bytes read from all data sources by all completed Spark tasks running in all executors. Valid dimensions: `JobName` (the name of the AWS Glue job), `JobRunId` (the JobRun ID, or `ALL`), and `Type` (count). Valid Statistics: SUM. This metric is a delta value from the last reported value, so on the AWS Glue Metrics Dashboard, a SUM statistic is used for aggregation. Unit: Bytes Can be used to monitor: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/monitoring-awsglue-with-cloudwatch-metrics.html) This metric can be used the same way as the `glue.ALL.s3.filesystem.read_bytes` metric, with the difference that this metric is updated at the end of a Spark task and captures non-S3 data sources as well.  | 
|  `glue.driver.aggregate.elapsedTime` |  The ETL elapsed time in milliseconds (does not include the job bootstrap times). Valid dimensions: `JobName` (the name of the AWS Glue job), `JobRunId` (the JobRun ID, or `ALL`), and `Type` (count). Valid Statistics: SUM. This metric is a delta value from the last reported value, so on the AWS Glue Metrics Dashboard, a SUM statistic is used for aggregation. Unit: Milliseconds Can be used to determine how long a job run takes on average. Some ways to use the data: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/monitoring-awsglue-with-cloudwatch-metrics.html)  | 
|  `glue.driver.aggregate.numCompletedStages` |  The number of completed stages in the job. Valid dimensions: `JobName` (the name of the AWS Glue job), `JobRunId` (the JobRun ID, or `ALL`), and `Type` (count). Valid Statistics: SUM. This metric is a delta value from the last reported value, so on the AWS Glue Metrics Dashboard, a SUM statistic is used for aggregation. Unit: Count Can be used to monitor: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/monitoring-awsglue-with-cloudwatch-metrics.html) Some ways to use the data: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/monitoring-awsglue-with-cloudwatch-metrics.html)  | 
|  `glue.driver.aggregate.numCompletedTasks` |  The number of completed tasks in the job. Valid dimensions: `JobName` (the name of the AWS Glue job), `JobRunId` (the JobRun ID, or `ALL`), and `Type` (count). Valid Statistics: SUM. This metric is a delta value from the last reported value, so on the AWS Glue Metrics Dashboard, a SUM statistic is used for aggregation. Unit: Count Can be used to monitor: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/monitoring-awsglue-with-cloudwatch-metrics.html)  | 
|  `glue.driver.aggregate.numFailedTasks` |  The number of failed tasks. Valid dimensions: `JobName` (the name of the AWS Glue job), `JobRunId` (the JobRun ID, or `ALL`), and `Type` (count). Valid Statistics: SUM. This metric is a delta value from the last reported value, so on the AWS Glue Metrics Dashboard, a SUM statistic is used for aggregation. Unit: Count Can be used to monitor: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/monitoring-awsglue-with-cloudwatch-metrics.html) The data can be used to set alarms for increased failures that might suggest abnormalities in data, cluster, or scripts.  | 
|  `glue.driver.aggregate.numKilledTasks` |  The number of tasks killed. Valid dimensions: `JobName` (the name of the AWS Glue job), `JobRunId` (the JobRun ID, or `ALL`), and `Type` (count). Valid Statistics: SUM. This metric is a delta value from the last reported value, so on the AWS Glue Metrics Dashboard, a SUM statistic is used for aggregation. Unit: Count Can be used to monitor: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/monitoring-awsglue-with-cloudwatch-metrics.html) Some ways to use the data: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/monitoring-awsglue-with-cloudwatch-metrics.html)  | 
|  `glue.driver.aggregate.recordsRead` |  The number of records read from all data sources by all completed Spark tasks running in all executors. Valid dimensions: `JobName` (the name of the AWS Glue job), `JobRunId` (the JobRun ID, or `ALL`), and `Type` (count). Valid Statistics: SUM. This metric is a delta value from the last reported value, so on the AWS Glue Metrics Dashboard, a SUM statistic is used for aggregation. Unit: Count Can be used to monitor: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/monitoring-awsglue-with-cloudwatch-metrics.html) This metric can be used in a similar way to the `glue.ALL.s3.filesystem.read_bytes` metric, with the difference that this metric is updated at the end of a Spark task.  | 
|  `glue.driver.aggregate.shuffleBytesWritten` |  The number of bytes written by all executors to shuffle data between them since the previous report (aggregated by the AWS Glue Metrics Dashboard as the number of bytes written for this purpose during the previous minute). Valid dimensions: `JobName` (the name of the AWS Glue job), `JobRunId` (the JobRun ID, or `ALL`), and `Type` (count). Valid Statistics: SUM. This metric is a delta value from the last reported value, so on the AWS Glue Metrics Dashboard, a SUM statistic is used for aggregation. Unit: Bytes Can be used to monitor: Data shuffle in jobs (large joins, groupBy, repartition, coalesce). Some ways to use the data: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/monitoring-awsglue-with-cloudwatch-metrics.html)  | 
|  `glue.driver.aggregate.shuffleLocalBytesRead` |  The number of bytes read by all executors to shuffle data between them since the previous report (aggregated by the AWS Glue Metrics Dashboard as the number of bytes read for this purpose during the previous minute). Valid dimensions: `JobName` (the name of the AWS Glue job), `JobRunId` (the JobRun ID, or `ALL`), and `Type` (count). Valid Statistics: SUM. This metric is a delta value from the last reported value, so on the AWS Glue Metrics Dashboard, a SUM statistic is used for aggregation. Unit: Bytes Can be used to monitor: Data shuffle in jobs (large joins, groupBy, repartition, coalesce). Some ways to use the data: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/monitoring-awsglue-with-cloudwatch-metrics.html)  | 
|  `glue.driver.BlockManager.disk.diskSpaceUsed_MB` |  The number of megabytes of disk space used across all executors. Valid dimensions: `JobName` (the name of the AWS Glue job), `JobRunId` (the JobRun ID, or `ALL`), and `Type` (gauge). Valid Statistics: Average. This is a Spark metric, reported as an absolute value. Unit: Megabytes Can be used to monitor: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/monitoring-awsglue-with-cloudwatch-metrics.html) Some ways to use the data: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/monitoring-awsglue-with-cloudwatch-metrics.html)  | 
|  `glue.driver.ExecutorAllocationManager.executors.numberAllExecutors` |  The number of actively running job executors. Valid dimensions: `JobName` (the name of the AWS Glue job), `JobRunId` (the JobRun ID, or `ALL`), and `Type` (gauge). Valid Statistics: Average. This is a Spark metric, reported as an absolute value. Unit: Count Can be used to monitor: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/monitoring-awsglue-with-cloudwatch-metrics.html) Some ways to use the data: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/monitoring-awsglue-with-cloudwatch-metrics.html)  | 
|  `glue.driver.ExecutorAllocationManager.executors.numberMaxNeededExecutors` |  The maximum number of (actively running and pending) job executors needed to satisfy the current load. Valid dimensions: `JobName` (the name of the AWS Glue job), `JobRunId` (the JobRun ID, or `ALL`), and `Type` (gauge). Valid Statistics: Maximum. This is a Spark metric, reported as an absolute value. Unit: Count Can be used to monitor: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/monitoring-awsglue-with-cloudwatch-metrics.html) Some ways to use the data: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/monitoring-awsglue-with-cloudwatch-metrics.html)  | 
|  `glue.driver.jvm.heap.usage`  `glue.`*executorId*`.jvm.heap.usage`  `glue.ALL.jvm.heap.usage`  |  The fraction of memory used by the JVM heap (scale: 0-1) for the driver, the executor identified by *executorId*, or ALL executors. Valid dimensions: `JobName` (the name of the AWS Glue job), `JobRunId` (the JobRun ID, or `ALL`), and `Type` (gauge). Valid Statistics: Average. This is a Spark metric, reported as an absolute value. Unit: Percentage Can be used to monitor: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/monitoring-awsglue-with-cloudwatch-metrics.html) Some ways to use the data: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/monitoring-awsglue-with-cloudwatch-metrics.html)  | 
|  `glue.driver.jvm.heap.used`  `glue.`*executorId*`.jvm.heap.used`  `glue.ALL.jvm.heap.used`  |  The number of memory bytes used by the JVM heap for the driver, the executor identified by *executorId*, or ALL executors. Valid dimensions: `JobName` (the name of the AWS Glue job), `JobRunId` (the JobRun ID, or `ALL`), and `Type` (gauge). Valid Statistics: Average. This is a Spark metric, reported as an absolute value. Unit: Bytes Can be used to monitor: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/monitoring-awsglue-with-cloudwatch-metrics.html) Some ways to use the data: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/monitoring-awsglue-with-cloudwatch-metrics.html)  | 
|   `glue.driver.s3.filesystem.read_bytes`  `glue.`*executorId*`.s3.filesystem.read_bytes`  `glue.ALL.s3.filesystem.read_bytes`  |  The number of bytes read from Amazon S3 by the driver, an executor identified by *executorId*, or ALL executors since the previous report (aggregated by the AWS Glue Metrics Dashboard as the number of bytes read during the previous minute). Valid dimensions: `JobName`, `JobRunId`, and `Type` (gauge). Valid Statistics: SUM. This metric is a delta value from the last reported value, so on the AWS Glue Metrics Dashboard a SUM statistic is used for aggregation. The area under the curve on the AWS Glue Metrics Dashboard can be used to visually compare bytes read by two different job runs. Unit: Bytes. Can be used to monitor: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/monitoring-awsglue-with-cloudwatch-metrics.html) Resulting data can be used for: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/monitoring-awsglue-with-cloudwatch-metrics.html)  | 
|   `glue.driver.s3.filesystem.write_bytes`  `glue.`*executorId*`.s3.filesystem.write_bytes`  `glue.ALL.s3.filesystem.write_bytes`  |  The number of bytes written to Amazon S3 by the driver, an executor identified by *executorId*, or ALL executors since the previous report (aggregated by the AWS Glue Metrics Dashboard as the number of bytes written during the previous minute). Valid dimensions: `JobName`, `JobRunId`, and `Type` (gauge). Valid Statistics: SUM. This metric is a delta value from the last reported value, so on the AWS Glue Metrics Dashboard a SUM statistic is used for aggregation. The area under the curve on the AWS Glue Metrics Dashboard can be used to visually compare bytes written by two different job runs. Unit: Bytes Can be used to monitor: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/monitoring-awsglue-with-cloudwatch-metrics.html) Some ways to use the data: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/monitoring-awsglue-with-cloudwatch-metrics.html)  | 
|  `glue.driver.streaming.numRecords` |  The number of records that are received in a micro-batch. This metric is only available for AWS Glue streaming jobs with AWS Glue version 2.0 and above. Valid dimensions: `JobName` (the name of the AWS Glue job), `JobRunId` (the JobRun ID, or `ALL`), and `Type` (count). Valid Statistics: Sum, Maximum, Minimum, Average, Percentile Unit: Count Can be used to monitor: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/monitoring-awsglue-with-cloudwatch-metrics.html)  | 
|  `glue.driver.streaming.batchProcessingTimeInMs` |  The time it takes to process the batches in milliseconds. This metric is only available for AWS Glue streaming jobs with AWS Glue version 2.0 and above. Valid dimensions: `JobName` (the name of the AWS Glue job), `JobRunId` (the JobRun ID, or `ALL`), and `Type` (count). Valid Statistics: Sum, Maximum, Minimum, Average, Percentile Unit: Milliseconds Can be used to monitor: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/monitoring-awsglue-with-cloudwatch-metrics.html)  | 
|  `glue.driver.system.cpuSystemLoad`  `glue.`*executorId*`.system.cpuSystemLoad`  `glue.ALL.system.cpuSystemLoad`  |  The fraction of CPU system load used (scale: 0-1) by the driver, an executor identified by *executorId*, or ALL executors. Valid dimensions: `JobName` (the name of the AWS Glue job), `JobRunId` (the JobRun ID, or `ALL`), and `Type` (gauge). Valid Statistics: Average. This metric is reported as an absolute value. Unit: Percentage Can be used to monitor: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/monitoring-awsglue-with-cloudwatch-metrics.html) Some ways to use the data: [\[See the AWS documentation website for more details\]](http://docs.aws.amazon.com/glue/latest/dg/monitoring-awsglue-with-cloudwatch-metrics.html)  | 

## Dimensions for AWS Glue Metrics
<a name="awsglue-metricdimensions"></a>

AWS Glue metrics use the `Glue` namespace and provide metrics for the following dimensions:


| Dimension | Description | 
| --- | --- | 
|  `JobName`  |  This dimension filters for metrics of all job runs of a specific AWS Glue job.  | 
|  `JobRunId`  |  This dimension filters for metrics of a specific AWS Glue job run by a JobRun ID, or `ALL`.  | 
|  `Type`  |  This dimension filters for metrics by either `count` (an aggregate number) or `gauge` (a value at a point in time).  | 

For more information, see the [Amazon CloudWatch User Guide](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/).
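To see which dimension combinations a given job actually reports, you can filter the metric listing by a single dimension. The following is a sketch; the job name `my-etl-job` is a placeholder.

```shell
# List all Glue metrics reported for one job, across all dimension combinations.
# The job name is a placeholder.
aws cloudwatch list-metrics \
  --namespace Glue \
  --dimensions Name=JobName,Value=my-etl-job
```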

# Setting up Amazon CloudWatch alarms on AWS Glue job profiles
<a name="monitor-profile-glue-job-cloudwatch-alarms"></a>

AWS Glue metrics are also available in Amazon CloudWatch. You can set up alarms on any AWS Glue metric for scheduled jobs. 

A few common scenarios for setting up alarms are as follows:
+ Jobs running out of memory (OOM): Set an alarm when the memory usage exceeds the normal average for either the driver or an executor for an AWS Glue job.
+ Straggling executors: Set an alarm when the number of executors falls below a certain threshold for an extended period of time in an AWS Glue job.
+ Data backlog or reprocessing: Compare the metrics from individual jobs in a workflow using a CloudWatch math expression. You can then trigger an alarm on the resulting expression value (such as the ratio of bytes written by a job and bytes read by a following job).
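As an illustration, the failed-task scenario above can be sketched with the CloudWatch `put-metric-alarm` command. The job name, threshold, and SNS topic ARN below are placeholders.

```shell
# Alarm when any task failures are reported for the job in a 5-minute window.
# The job name, alarm name, and SNS topic ARN are placeholders.
aws cloudwatch put-metric-alarm \
  --alarm-name my-etl-job-failed-tasks \
  --namespace Glue \
  --metric-name glue.driver.aggregate.numFailedTasks \
  --dimensions Name=JobName,Value=my-etl-job Name=JobRunId,Value=ALL Name=Type,Value=count \
  --statistic Sum \
  --period 300 \
  --evaluation-periods 1 \
  --threshold 1 \
  --comparison-operator GreaterThanOrEqualToThreshold \
  --alarm-actions arn:aws:sns:us-east-1:123456789012:my-alerts-topic
```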

For detailed instructions on setting alarms, see [Create or Edit a CloudWatch Alarm](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/ConsoleAlarms.html) in the *[Amazon CloudWatch User Guide](https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/)*. 

For monitoring and debugging scenarios using CloudWatch, see [Job monitoring and debugging](monitor-profile-glue-job-cloudwatch-metrics.md).

# Logging for AWS Glue jobs
<a name="monitor-continuous-logging"></a>

 In AWS Glue 5.0, all jobs have real-time logging capabilities. Additionally, you can specify custom configuration options to tailor the logging behavior. These options include setting the Amazon CloudWatch log group name, the Amazon CloudWatch log stream prefix (which will precede the AWS Glue job run ID and driver/executor ID), and the log conversion pattern for log messages. These configurations allow you to aggregate logs in custom Amazon CloudWatch log groups with different expiration policies. Furthermore, you can analyze the logs more effectively by using custom log stream prefixes and conversion patterns. This level of customization enables you to optimize log management and analysis according to your specific requirements. 

## Logging behavior in AWS Glue 5.0
<a name="monitor-logging-behavior-glue-50"></a>

 By default, system logs, Spark daemon logs, and user AWS Glue Logger logs are written to the `/aws-glue/jobs/error` log group in Amazon CloudWatch. On the other hand, user stdout (standard output) and stderr (standard error) logs are written to the `/aws-glue/jobs/output` log group by default. 

## Custom logging
<a name="monitor-logging-custom"></a>

 You can customize the default log group and log stream prefixes using the following job arguments: 
+  `--custom-logGroup-prefix`: Allows you to specify a custom prefix for the `/aws-glue/jobs/error` and `/aws-glue/jobs/output` log groups. If you provide a custom prefix, the log group names will be in the following format: 
  +  `/aws-glue/jobs/error` will be `<custom prefix>/error` 
  +  `/aws-glue/jobs/output` will be `<custom prefix>/output` 
+  `--custom-logStream-prefix`: Allows you to specify a custom prefix for the log stream names within the log groups. If you provide a custom prefix, the log stream names will be in the following format: 
  +  `jobrunid-driver` will be `<custom log stream prefix>-driver` 
  +  `jobrunid-executorNum` will be `<custom log stream prefix>-executorNum` 
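These arguments are passed like any other AWS Glue job arguments, for example at run time. The following is a sketch; the job name and prefix values are placeholders.

```shell
# Start a job run with custom log group and log stream prefixes.
# The job name and prefix values are placeholders.
aws glue start-job-run \
  --job-name my-etl-job \
  --arguments '{"--custom-logGroup-prefix":"/my-team/glue","--custom-logStream-prefix":"nightly-run"}'
```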

 Validation rules and limitations for custom prefixes: 
+  The entire log stream name must be between 1 and 512 characters long. 
+  The custom prefix itself is restricted to 400 characters. 
+  The custom prefix must match the regular expression pattern `[^:\$1]\$1` (special characters allowed are '\$1', '-', and '/'). 

## Logging application-specific messages using the custom script logger
<a name="monitor-logging-script"></a>

You can use the AWS Glue logger to log any application-specific messages in the script that are sent in real time to the driver log stream.

The following example shows a Python script.

```
from awsglue.context import GlueContext
from pyspark.context import SparkContext

sc = SparkContext()
glueContext = GlueContext(sc)
logger = glueContext.get_logger()
logger.info("info message")
logger.warn("warn message")
logger.error("error message")
```

The following example shows a Scala script.

```
import com.amazonaws.services.glue.log.GlueLogger

object GlueApp {
  def main(sysArgs: Array[String]) {
    val logger = new GlueLogger
    logger.info("info message")
    logger.warn("warn message")
    logger.error("error message")
  }
}
```

## Enabling the progress bar to show job progress
<a name="monitor-logging-progress"></a>

AWS Glue provides a real-time progress bar under the `JOB_RUN_ID-progress-bar` log stream to check AWS Glue job run status. Currently it supports only jobs that initialize `glueContext`. If you run a pure Spark job without initializing `glueContext`, the AWS Glue progress bar does not appear.

The progress bar shows the following progress update every 5 seconds.

```
Stage Number (Stage Name): [> (numCompletedTasks + numActiveTasks) / totalNumOfTasksInThisStage]
```

## Security configuration with Amazon CloudWatch logging
<a name="monitor-security-config-logging"></a>

 When a security configuration is enabled for Amazon CloudWatch logs, AWS Glue creates log groups with specific naming patterns that incorporate the security configuration name. 

### Log group naming with security configuration
<a name="monitor-log-group-naming"></a>

 The default and custom log groups will be as follows: 
+  **Default error log group:** `/aws-glue/jobs/Security-Configuration-Name-role/glue-job-role/error` 
+  **Default output log group:** `/aws-glue/jobs/Security-Configuration-Name-role/glue-job-role/output` 
+  **Custom error log group (AWS Glue 5.0):** `custom-log-group-prefix/Security-Configuration-Name-role/glue-job-role/error` 
+  **Custom output log group (AWS Glue 5.0):** `custom-log-group-prefix/Security-Configuration-Name-role/glue-job-role/output` 

### Required IAM Permissions
<a name="monitor-logging-iam-permissions"></a>

 You need to add the `logs:AssociateKmsKey` permission to your IAM role permissions, if you enable a security configuration with Amazon CloudWatch Logs. If that permission is not included, continuous logging will be disabled. 

 Also, to configure encryption for Amazon CloudWatch Logs, follow the instructions at [Encrypt Log Data in Amazon CloudWatch Logs Using AWS Key Management Service](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/encrypt-log-data-kms.html) in the *Amazon CloudWatch Logs User Guide*. 

### Additional Information
<a name="additional-info"></a>

 For more information on creating security configurations, see [Managing security configurations on the AWS Glue console](https://docs.aws.amazon.com/glue/latest/dg/console-security-configurations.html). 

**Topics**
+ [Logging behavior in AWS Glue 5.0](#monitor-logging-behavior-glue-50)
+ [Custom logging](#monitor-logging-custom)
+ [Logging application-specific messages using the custom script logger](#monitor-logging-script)
+ [Enabling the progress bar to show job progress](#monitor-logging-progress)
+ [Security configuration with Amazon CloudWatch logging](#monitor-security-config-logging)
+ [Enabling continuous logging for AWS Glue 4.0 and earlier jobs](monitor-continuous-logging-enable.md)
+ [Viewing logs for AWS Glue jobs](monitor-continuous-logging-view.md)

# Enabling continuous logging for AWS Glue 4.0 and earlier jobs
<a name="monitor-continuous-logging-enable"></a>

**Note**  
 In AWS Glue 4.0 and earlier versions, continuous logging was an available feature. However, with the introduction of AWS Glue 5.0, all jobs have real-time logging capability. For more details on the logging capabilities and configuration options in AWS Glue 5.0, see [Logging for AWS Glue jobs](https://docs.aws.amazon.com/glue/latest/dg/monitor-continuous-logging.html). 

You can enable continuous logging when you create a new job or edit an existing job in the AWS Glue console, or through the AWS Command Line Interface (AWS CLI).

You can also specify custom configuration options such as the Amazon CloudWatch log group name, the CloudWatch log stream prefix (which precedes the AWS Glue job run ID and driver/executor ID), and the log conversion pattern for log messages. These configurations help you aggregate logs in custom CloudWatch log groups with different expiration policies, and analyze them further with custom log stream prefixes and conversion patterns. 

**Topics**
+ [Using the AWS Management Console](#monitor-continuous-logging-enable-console)
+ [Logging application-specific messages using the custom script logger](#monitor-continuous-logging-script)
+ [Enabling the progress bar to show job progress](#monitor-continuous-logging-progress)
+ [Security configuration with continuous logging](#monitor-logging-encrypt-log-data)

## Using the AWS Management Console
<a name="monitor-continuous-logging-enable-console"></a>

Follow these steps to use the console to enable continuous logging when creating or editing an AWS Glue job.

**To create a new AWS Glue job with continuous logging**

1. Sign in to the AWS Management Console and open the AWS Glue console at [https://console.aws.amazon.com/glue/](https://console.aws.amazon.com/glue/).

1. In the navigation pane, choose **ETL jobs**.

1. Choose **Visual ETL**.

1. In the **Job details** tab, expand the **Advanced properties** section.

1. Under **Continuous logging** select **Enable logs in CloudWatch**.

**To enable continuous logging for an existing AWS Glue job**

1. Open the AWS Glue console at [https://console.aws.amazon.com/glue/](https://console.aws.amazon.com/glue/).

1. In the navigation pane, choose **Jobs**.

1. Choose an existing job from the **Jobs** list.

1. Choose **Action**, **Edit job**.

1. In the **Job details** tab, expand the **Advanced properties** section.

1. Under **Continuous logging** select **Enable logs in CloudWatch**.

### Using the AWS CLI
<a name="monitor-continuous-logging-cli"></a>

To enable continuous logging, pass the following special job parameters to an AWS Glue job, in the same way as other AWS Glue job parameters. For more information, see [Using job parameters in AWS Glue jobs](aws-glue-programming-etl-glue-arguments.md).

```
'--enable-continuous-cloudwatch-log': 'true'
```

You can specify a custom Amazon CloudWatch log group name. If not specified, the default log group name is `/aws-glue/jobs/logs-v2`.

```
'--continuous-log-logGroup': 'custom_log_group_name'
```

You can specify a custom Amazon CloudWatch log stream prefix. If not specified, the default log stream prefix is the job run ID.

```
'--continuous-log-logStreamPrefix': 'custom_log_stream_prefix'
```

You can specify a custom continuous logging conversion pattern. If not specified, the default conversion pattern is `%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n`. Note that the conversion pattern only applies to driver logs and executor logs. It does not affect the AWS Glue progress bar.

```
'--continuous-log-conversionPattern': 'custom_log_conversion_pattern'
```
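Taken together, the parameters above form a single job-arguments map. The following sketch assembles them in Python; the log group name, stream prefix, and their values are hypothetical examples, not required settings:

```python
# Hedged sketch: hypothetical values for the continuous-logging job
# parameters described above. Only '--enable-continuous-cloudwatch-log'
# is required to turn the feature on; the rest override the documented
# defaults.
continuous_logging_args = {
    "--enable-continuous-cloudwatch-log": "true",
    # Default log group if omitted: /aws-glue/jobs/logs-v2
    "--continuous-log-logGroup": "/aws-glue/jobs/my-custom-group",  # hypothetical
    # Default log stream prefix if omitted: the job run ID
    "--continuous-log-logStreamPrefix": "nightly-etl",  # hypothetical
    # Default conversion pattern if omitted (applies to driver and
    # executor logs only, not the progress bar):
    "--continuous-log-conversionPattern": "%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n",
}

print(sorted(continuous_logging_args))
```

A map like this could, for example, be supplied as the default arguments of a job definition or merged into run-time arguments.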

## Logging application-specific messages using the custom script logger
<a name="monitor-continuous-logging-script"></a>

You can use the AWS Glue logger to log any application-specific messages in the script that are sent in real time to the driver log stream.

The following example shows a Python script.

```
from awsglue.context import GlueContext
from pyspark.context import SparkContext

sc = SparkContext()
glueContext = GlueContext(sc)
logger = glueContext.get_logger()
logger.info("info message")
logger.warn("warn message")
logger.error("error message")
```

The following example shows a Scala script.

```
import com.amazonaws.services.glue.log.GlueLogger

object GlueApp {
  def main(sysArgs: Array[String]) {
    val logger = new GlueLogger
    logger.info("info message")
    logger.warn("warn message")
    logger.error("error message")
  }
}
```

## Enabling the progress bar to show job progress
<a name="monitor-continuous-logging-progress"></a>

AWS Glue provides a real-time progress bar under the `JOB_RUN_ID-progress-bar` log stream to check AWS Glue job run status. Currently it supports only jobs that initialize `glueContext`. If you run a pure Spark job without initializing `glueContext`, the AWS Glue progress bar does not appear.

The progress bar shows the following progress update every 5 seconds.

```
[Stage Number (Stage Name): > (numCompletedTasks + numActiveTasks) / totalNumOfTasksInThisStage]
```
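As a rough illustration of that template, a progress line of the documented shape can be parsed with a small regular expression. This is a hypothetical sketch; the exact rendering of the bar characters may differ in practice:

```python
import re

# Hedged sketch: parse a progress-bar line of the documented shape
# "Stage <number> (<name>): ... (<completed> + <active>) / <total>".
PROGRESS_RE = re.compile(
    r"Stage (?P<stage>\d+) \((?P<name>[^)]*)\).*"
    r"\((?P<completed>\d+) \+ (?P<active>\d+)\) / (?P<total>\d+)"
)

def parse_progress(line):
    """Return (stage, completed, active, total), or None if no match."""
    m = PROGRESS_RE.search(line)
    if not m:
        return None
    return (int(m["stage"]), int(m["completed"]),
            int(m["active"]), int(m["total"]))

# Hypothetical sample line in the documented format.
print(parse_progress("Stage 3 (Exchange): > (12 + 4) / 64"))
```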

## Security configuration with continuous logging
<a name="monitor-logging-encrypt-log-data"></a>

If a security configuration is enabled for CloudWatch logs, AWS Glue will create a log group named as follows for continuous logs:

```
<Log-Group-Name>-<Security-Configuration-Name>
```

The default and custom log groups will be as follows:
+ The default continuous log group will be `/aws-glue/jobs/logs-v2-<Security-Configuration-Name>`
+ The custom continuous log group will be `<custom-log-group-name>-<Security-Configuration-Name>`
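The naming rule above can be sketched as a tiny helper. This is a hypothetical illustration (not part of any AWS SDK), assuming the default continuous log group `/aws-glue/jobs/logs-v2` noted earlier:

```python
DEFAULT_LOG_GROUP = "/aws-glue/jobs/logs-v2"  # documented default for continuous logs

def continuous_log_group(custom_group=None, security_configuration=None):
    """Compose the CloudWatch log group name used for continuous logs,
    appending the security configuration name when one is enabled."""
    group = custom_group or DEFAULT_LOG_GROUP
    if security_configuration:
        group = f"{group}-{security_configuration}"
    return group

print(continuous_log_group())
print(continuous_log_group(security_configuration="my-sec-cfg"))  # hypothetical name
```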

If you enable a security configuration with CloudWatch Logs, you need to add the `logs:AssociateKmsKey` permission to your IAM role. If that permission is not included, continuous logging is disabled. To configure encryption for CloudWatch Logs, follow the instructions at [Encrypt Log Data in CloudWatch Logs Using AWS Key Management Service](https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/encrypt-log-data-kms.html) in the *Amazon CloudWatch Logs User Guide*.

For more information on creating security configurations, see [Managing security configurations on the AWS Glue console](console-security-configurations.md).

**Note**  
 You may incur additional charges when you enable logging and additional CloudWatch log events are created. For more information, see [Amazon CloudWatch pricing](https://aws.amazon.com/cloudwatch/pricing/). 

# Viewing logs for AWS Glue jobs
<a name="monitor-continuous-logging-view"></a>

You can view real-time logs using the AWS Glue console or the Amazon CloudWatch console.

**To view real-time logs using the AWS Glue console dashboard**

1. Sign in to the AWS Management Console and open the AWS Glue console at [https://console.aws.amazon.com/glue/](https://console.aws.amazon.com/glue/).

1. In the navigation pane, choose **Jobs**.

1. Add a new job, or start an existing job by choosing **Action**, **Run job**.

   When you start running a job, you navigate to a page that contains information about the running job:
   + The **Logs** tab shows the older aggregated application logs.
   + The **Logs** tab shows a real-time progress bar when the job is running with `glueContext` initialized.
   + The **Logs** tab also contains the **Driver logs**, which capture real-time Apache Spark driver logs, and application logs from the script logged using the AWS Glue application logger when the job is running.

1. For older jobs, you can also view the real-time logs under the **Job History** view by choosing **Logs**. This action takes you to the CloudWatch console that shows all Spark driver, executor, and progress bar log streams for that job run.

**To view real-time logs using the CloudWatch console dashboard**

1. Open the CloudWatch console at [https://console.aws.amazon.com/cloudwatch/](https://console.aws.amazon.com/cloudwatch/).

1. In the navigation pane, choose **Logs**.

1. Choose the **/aws-glue/jobs/logs-v2/** log group.

1. In the **Filter** box, paste the job run ID.

   You can view the driver logs, executor logs, and progress bar (if using the **Standard filter**).

# Monitoring with AWS Glue Observability metrics
<a name="monitor-observability"></a>

**Note**  
AWS Glue Observability metrics is available on AWS Glue 4.0 and later versions.

 Use AWS Glue Observability metrics to generate insights into what is happening inside your AWS Glue for Apache Spark jobs to improve triaging and analysis of issues. Observability metrics are visualized through Amazon CloudWatch dashboards and can be used to help perform root cause analysis for errors and for diagnosing performance bottlenecks. You can reduce the time spent debugging issues at scale so you can focus on resolving issues faster and more effectively. 

 AWS Glue Observability provides Amazon CloudWatch metrics categorized into the following four groups: 
+  **Reliability (error classes)** – easily identify the most common failure reasons in a given time range that you may want to address. 
+  **Performance (skewness)** – identify performance bottlenecks and apply tuning techniques. For example, when you experience degraded performance due to job skewness, you may want to enable Spark Adaptive Query Execution and fine-tune the skew join threshold. 
+  **Throughput (per-source and per-sink throughput)** – monitor trends of data reads and writes. You can also configure Amazon CloudWatch alarms for anomalies. 
+  **Resource utilization (workers, memory, and disk utilization)** – efficiently find jobs with low capacity utilization. You may want to enable AWS Glue auto scaling for those jobs. 

## Getting started with AWS Glue Observability metrics
<a name="monitor-observability-getting-started"></a>

**Note**  
 The new metrics are enabled by default in the AWS Glue Studio console. 

**To configure observability metrics in AWS Glue Studio:**

1. Log in to the AWS Glue console and choose **ETL jobs** from the console menu.

1. Choose a job by clicking on the job name in the **Your jobs** section.

1. Choose the **Job details** tab.

1. Scroll to the bottom and choose **Advanced properties**, then **Job observability metrics**.  
![\[The screenshot shows the Job details tab Advanced properties. The Job observability metrics option is highlighted.\]](http://docs.aws.amazon.com/glue/latest/dg/images/job-details-observability-metrics.png)

**To enable AWS Glue Observability metrics using the AWS CLI:**
+  Add the following key-value pair to the `--default-arguments` map in the input JSON file: 

  ```
  "--enable-observability-metrics": "true"
  ```
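In a scripted workflow, that flag is just one more entry in the job's default-arguments map. A minimal sketch (the other argument names and values are hypothetical examples of what such a map might already contain):

```python
# Hedged sketch: merge the observability flag into an existing
# default-arguments map before supplying it to a job definition.
default_arguments = {
    "--job-language": "python",          # hypothetical existing argument
    "--TempDir": "s3://my-bucket/tmp/",  # hypothetical path
}
default_arguments["--enable-observability-metrics"] = "true"

print(default_arguments["--enable-observability-metrics"])
```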

## Using AWS Glue observability
<a name="monitor-observability-cloudwatch"></a>

 Because AWS Glue observability metrics are provided through Amazon CloudWatch, you can use the Amazon CloudWatch console, AWS CLI, SDK, or API to query the observability metric data points. See [Using Glue Observability for monitoring resource utilization to reduce cost](https://aws.amazon.com/blogs/big-data/enhance-monitoring-and-debugging-for-aws-glue-jobs-using-new-job-observability-metrics/) for an example use case of AWS Glue observability metrics. 

### Using AWS Glue observability in the Amazon CloudWatch console
<a name="monitor-observability-cloudwatch-console"></a>

**To query and visualize metrics in the Amazon CloudWatch console:**

1.  Open the Amazon CloudWatch console and choose **All metrics**. 

1.  Under custom namespaces, choose **AWS Glue**. 

1.  Choose **Job Observability Metrics**, **Observability Metrics Per Source**, or **Observability Metrics Per Sink**. 

1. Search for the specific metric name, job name, and job run ID, and select them.

1. Under the **Graphed metrics** tab, configure your preferred statistic, period, and other options.  
![\[The screenshot shows the Amazon CloudWatch console and metrics graph.\]](http://docs.aws.amazon.com/glue/latest/dg/images/cloudwatch-console-metrics.png)

**To query an Observability metric using AWS CLI:**

1.  Create a metric definition JSON file, and replace `your-Glue-job-name` and `your-Glue-job-run-id` with your own values. 

   ```
   $ cat multiplequeries.json
   [
       {
           "Id": "avgWorkerUtil_0",
           "MetricStat": {
               "Metric": {
                   "Namespace": "Glue",
                   "MetricName": "glue.driver.workerUtilization",
                   "Dimensions": [
                       {
                           "Name": "JobName",
                           "Value": "<your-Glue-job-name-A>"
                       },
                       {
                           "Name": "JobRunId",
                           "Value": "<your-Glue-job-run-id-A>"
                       },
                       {
                           "Name": "Type",
                           "Value": "gauge"
                       },
                       {
                           "Name": "ObservabilityGroup",
                           "Value": "resource_utilization"
                       }
                   ]
               },
               "Period": 1800,
               "Stat": "Minimum",
               "Unit": "None"
           }
       },
       {
           "Id": "avgWorkerUtil_1",
           "MetricStat": {
               "Metric": {
                   "Namespace": "Glue",
                   "MetricName": "glue.driver.workerUtilization",
                   "Dimensions": [
                       {
                           "Name": "JobName",
                           "Value": "<your-Glue-job-name-B>"
                       },
                       {
                           "Name": "JobRunId",
                           "Value": "<your-Glue-job-run-id-B>"
                       },
                       {
                           "Name": "Type",
                           "Value": "gauge"
                       },
                       {
                           "Name": "ObservabilityGroup",
                           "Value": "resource_utilization"
                       }
                   ]
               },
               "Period": 1800,
               "Stat": "Minimum",
               "Unit": "None"
           }
       }
   ]
   ```

1.  Run the `get-metric-data` command: 

   ```
   $ aws cloudwatch get-metric-data --metric-data-queries file://multiplequeries.json \
        --start-time '2023-10-28T18:20' \
        --end-time '2023-10-28T19:10'  \
        --region us-east-1
   {
       "MetricDataResults": [
           {
               "Id": "avgWorkerUtil_0",
               "Label": "<your-label-for-A>",
               "Timestamps": [
                   "2023-10-28T18:20:00+00:00"
               ],
               "Values": [
                   0.06718750000000001
               ],
               "StatusCode": "Complete"
           },
           {
               "Id": "avgWorkerUtil_1",
               "Label": "<your-label-for-B>",
               "Timestamps": [
                   "2023-10-28T18:50:00+00:00"
               ],
               "Values": [
                   0.5959183673469387
               ],
               "StatusCode": "Complete"
           }
       ],
       "Messages": []
   }
   ```
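The same comparison can be done programmatically on the response payload. The sketch below reproduces the sample values from the response above for illustration and picks the run with the lowest minimum worker utilization:

```python
# Hedged sketch: identify the job run with the lowest minimum worker
# utilization from a get-metric-data style response structure.
response = {
    "MetricDataResults": [
        {"Id": "avgWorkerUtil_0", "Values": [0.06718750000000001]},
        {"Id": "avgWorkerUtil_1", "Values": [0.5959183673469387]},
    ]
}

def least_utilized(results):
    """Return (Id, smallest datapoint) across all results with data."""
    return min(
        ((r["Id"], min(r["Values"])) for r in results if r["Values"]),
        key=lambda pair: pair[1],
    )

print(least_utilized(response["MetricDataResults"]))
```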

## Observability metrics
<a name="monitor-observability-metrics-definitions"></a>

 AWS Glue Observability profiles and sends the following metrics to Amazon CloudWatch every 30 seconds. Some of these metrics are visible on the AWS Glue Studio job run monitoring page. 


| Metric | Description | Category | 
| --- | --- | --- | 
| glue.driver.skewness.stage |  Metric Category: job\_performance  The Spark stage execution skewness: this metric is an indicator of how long the maximum task duration in a certain stage is compared to the median task duration in that stage. It captures execution skewness, which might be caused by input data skewness or by a transformation (for example, a skewed join). The values of this metric fall into the range [0, ∞), where 0 means that the ratio of the maximum to the median task execution time, among all tasks in the stage, is less than a certain stage skewness factor. The default stage skewness factor is `5` and it can be overridden via the Spark conf: spark.metrics.conf.driver.source.glue.jobPerformance.skewnessFactor. A stage skewness value of 1 means the ratio is twice the stage skewness factor.  The value of stage skewness is updated every 30 seconds to reflect the current skewness. The value at the end of the stage reflects the final stage skewness. This stage-level metric is used to calculate the job-level metric `glue.driver.skewness.job`. Valid dimensions: JobName (the name of the AWS Glue job), JobRunId (the JobRun ID, or ALL), Type (gauge), and ObservabilityGroup (job\_performance) Valid Statistics: Average, Maximum, Minimum, Percentile Unit: Count  | job\_performance | 
| glue.driver.skewness.job |  Metric Category: job\_performance  Job skewness is the maximum of the weighted skewness of all stages. The stage skewness (glue.driver.skewness.stage) is weighted by stage duration. This avoids the corner case where a very skewed stage actually runs for a very short time relative to other stages (so its skewness is not significant for overall job performance and is not worth the effort to address).  This metric is updated upon completion of each stage, so the last value reflects the actual overall job skewness. Valid dimensions: JobName (the name of the AWS Glue job), JobRunId (the JobRun ID, or ALL), Type (gauge), and ObservabilityGroup (job\_performance) Valid Statistics: Average, Maximum, Minimum, Percentile Unit: Count  | job\_performance | 
| glue.succeed.ALL |  Metric Category: error  Total number of successful job runs, to complete the picture of failure categories. Valid dimensions: JobName (the name of the AWS Glue job), JobRunId (the JobRun ID, or ALL), Type (count), and ObservabilityGroup (error) Valid Statistics: SUM Unit: Count  | error | 
| glue.error.ALL |  Metric Category: error  Total number of job run errors, to complete the picture of failure categories. Valid dimensions: JobName (the name of the AWS Glue job), JobRunId (the JobRun ID, or ALL), Type (count), and ObservabilityGroup (error) Valid Statistics: SUM Unit: Count  | error | 
| glue.error.[error category] |  Metric Category: error  This is actually a set of metrics that are updated only when a job run fails. The error categorization helps with triaging and debugging. When a job run fails, the error causing the failure is categorized and the corresponding error category metric is set to 1. This helps to perform failure analysis over time, as well as error analysis across all jobs, to identify the most common failure categories and start addressing them. AWS Glue has 28 error categories, including OUT\_OF\_MEMORY (driver and executor), PERMISSION, SYNTAX, and THROTTLING error categories. Error categories also include COMPILATION, LAUNCH, and TIMEOUT error categories. Valid dimensions: JobName (the name of the AWS Glue job), JobRunId (the JobRun ID, or ALL), Type (count), and ObservabilityGroup (error) Valid Statistics: SUM Unit: Count  | error | 
| glue.driver.workerUtilization |  Metric Category: resource\_utilization  The percentage of the allocated workers that are actually used. If utilization is low, auto scaling can help. Valid dimensions: JobName (the name of the AWS Glue job), JobRunId (the JobRun ID, or ALL), Type (gauge), and ObservabilityGroup (resource\_utilization) Valid Statistics: Average, Maximum, Minimum, Percentile Unit: Percentage  | resource\_utilization | 
| glue.driver.memory.heap.[available \| used] |  Metric Category: resource\_utilization  The driver's available/used heap memory during the job run. This helps to understand memory usage trends, especially over time, which can help avoid potential failures, in addition to debugging memory-related failures. Valid dimensions: JobName (the name of the AWS Glue job), JobRunId (the JobRun ID, or ALL), Type (gauge), and ObservabilityGroup (resource\_utilization) Valid Statistics: Average Unit: Bytes  | resource\_utilization | 
| glue.driver.memory.heap.used.percentage |  Metric Category: resource\_utilization  The driver's used (%) heap memory during the job run. This helps to understand memory usage trends, especially over time, which can help avoid potential failures, in addition to debugging memory-related failures. Valid dimensions: JobName (the name of the AWS Glue job), JobRunId (the JobRun ID, or ALL), Type (gauge), and ObservabilityGroup (resource\_utilization) Valid Statistics: Average Unit: Percentage  | resource\_utilization | 
| glue.driver.memory.non-heap.[available \| used] |  Metric Category: resource\_utilization  The driver's available/used non-heap memory during the job run. This helps to understand memory usage trends, especially over time, which can help avoid potential failures, in addition to debugging memory-related failures. Valid dimensions: JobName (the name of the AWS Glue job), JobRunId (the JobRun ID, or ALL), Type (gauge), and ObservabilityGroup (resource\_utilization) Valid Statistics: Average Unit: Bytes  | resource\_utilization | 
| glue.driver.memory.non-heap.used.percentage |  Metric Category: resource\_utilization  The driver's used (%) non-heap memory during the job run. This helps to understand memory usage trends, especially over time, which can help avoid potential failures, in addition to debugging memory-related failures. Valid dimensions: JobName (the name of the AWS Glue job), JobRunId (the JobRun ID, or ALL), Type (gauge), and ObservabilityGroup (resource\_utilization) Valid Statistics: Average Unit: Percentage  | resource\_utilization | 
| glue.driver.memory.total.[available \| used] |  Metric Category: resource\_utilization  The driver's available/used total memory during the job run. This helps to understand memory usage trends, especially over time, which can help avoid potential failures, in addition to debugging memory-related failures. Valid dimensions: JobName (the name of the AWS Glue job), JobRunId (the JobRun ID, or ALL), Type (gauge), and ObservabilityGroup (resource\_utilization) Valid Statistics: Average Unit: Bytes  | resource\_utilization | 
| glue.driver.memory.total.used.percentage |  Metric Category: resource\_utilization  The driver's used (%) total memory during the job run. This helps to understand memory usage trends, especially over time, which can help avoid potential failures, in addition to debugging memory-related failures. Valid dimensions: JobName (the name of the AWS Glue job), JobRunId (the JobRun ID, or ALL), Type (gauge), and ObservabilityGroup (resource\_utilization) Valid Statistics: Average Unit: Percentage  | resource\_utilization | 
| glue.ALL.memory.heap.[available \| used] |  Metric Category: resource\_utilization  The executors' available/used heap memory. ALL means all executors. Valid dimensions: JobName (the name of the AWS Glue job), JobRunId (the JobRun ID, or ALL), Type (gauge), and ObservabilityGroup (resource\_utilization) Valid Statistics: Average Unit: Bytes  | resource\_utilization | 
| glue.ALL.memory.heap.used.percentage |  Metric Category: resource\_utilization  The executors' used (%) heap memory. ALL means all executors. Valid dimensions: JobName (the name of the AWS Glue job), JobRunId (the JobRun ID, or ALL), Type (gauge), and ObservabilityGroup (resource\_utilization) Valid Statistics: Average Unit: Percentage  | resource\_utilization | 
| glue.ALL.memory.non-heap.[available \| used] |  Metric Category: resource\_utilization  The executors' available/used non-heap memory. ALL means all executors. Valid dimensions: JobName (the name of the AWS Glue job), JobRunId (the JobRun ID, or ALL), Type (gauge), and ObservabilityGroup (resource\_utilization) Valid Statistics: Average Unit: Bytes  | resource\_utilization | 
| glue.ALL.memory.non-heap.used.percentage |  Metric Category: resource\_utilization  The executors' used (%) non-heap memory. ALL means all executors. Valid dimensions: JobName (the name of the AWS Glue job), JobRunId (the JobRun ID, or ALL), Type (gauge), and ObservabilityGroup (resource\_utilization) Valid Statistics: Average Unit: Percentage  | resource\_utilization | 
| glue.ALL.memory.total.[available \| used] |  Metric Category: resource\_utilization  The executors' available/used total memory. ALL means all executors. Valid dimensions: JobName (the name of the AWS Glue job), JobRunId (the JobRun ID, or ALL), Type (gauge), and ObservabilityGroup (resource\_utilization) Valid Statistics: Average Unit: Bytes  | resource\_utilization | 
| glue.ALL.memory.total.used.percentage |  Metric Category: resource\_utilization  The executors' used (%) total memory. ALL means all executors. Valid dimensions: JobName (the name of the AWS Glue job), JobRunId (the JobRun ID, or ALL), Type (gauge), and ObservabilityGroup (resource\_utilization) Valid Statistics: Average Unit: Percentage  | resource\_utilization | 
| glue.driver.disk.[available\_GB \| used\_GB] |  Metric Category: resource\_utilization  The driver's available/used disk space during the job run. This helps to understand disk usage trends, especially over time, which can help avoid potential failures, in addition to debugging insufficient-disk-space failures. Valid dimensions: JobName (the name of the AWS Glue job), JobRunId (the JobRun ID, or ALL), Type (gauge), and ObservabilityGroup (resource\_utilization) Valid Statistics: Average Unit: Gigabytes  | resource\_utilization | 
| glue.driver.disk.used.percentage |  Metric Category: resource\_utilization  The driver's used (%) disk space during the job run. This helps to understand disk usage trends, especially over time, which can help avoid potential failures, in addition to debugging insufficient-disk-space failures. Valid dimensions: JobName (the name of the AWS Glue job), JobRunId (the JobRun ID, or ALL), Type (gauge), and ObservabilityGroup (resource\_utilization) Valid Statistics: Average Unit: Percentage  | resource\_utilization | 
| glue.ALL.disk.[available\_GB \| used\_GB] |  Metric Category: resource\_utilization  The executors' available/used disk space. ALL means all executors. Valid dimensions: JobName (the name of the AWS Glue job), JobRunId (the JobRun ID, or ALL), Type (gauge), and ObservabilityGroup (resource\_utilization) Valid Statistics: Average Unit: Gigabytes  | resource\_utilization | 
| glue.ALL.disk.used.percentage |  Metric Category: resource\_utilization  The executors' used (%) disk space. ALL means all executors. Valid dimensions: JobName (the name of the AWS Glue job), JobRunId (the JobRun ID, or ALL), Type (gauge), and ObservabilityGroup (resource\_utilization) Valid Statistics: Average Unit: Percentage  | resource\_utilization | 
| glue.driver.bytesRead |  Metric Category: throughput  The number of bytes read per input source in this job run, as well as for ALL sources. This helps to understand the data volume and its changes over time, which helps address issues such as data skewness. Valid dimensions: JobName (the name of the AWS Glue job), JobRunId (the JobRun ID, or ALL), Type (gauge), ObservabilityGroup (throughput), and Source (the source data location) Valid Statistics: Average Unit: Bytes  | throughput | 
| glue.driver.[recordsRead \| filesRead]  |  Metric Category: throughput  The number of records/files read per input source in this job run, as well as for ALL sources. This helps to understand the data volume and its changes over time, which helps address issues such as data skewness. Valid dimensions: JobName (the name of the AWS Glue job), JobRunId (the JobRun ID, or ALL), Type (gauge), ObservabilityGroup (throughput), and Source (the source data location) Valid Statistics: Average Unit: Count  | throughput | 
| glue.driver.partitionsRead  |  Metric Category: throughput  The number of partitions read per Amazon S3 input source in this job run, as well as for ALL sources. Valid dimensions: JobName (the name of the AWS Glue job), JobRunId (the JobRun ID, or ALL), Type (gauge), ObservabilityGroup (throughput), and Source (the source data location) Valid Statistics: Average Unit: Count  | throughput | 
| glue.driver.bytesWritten |  Metric Category: throughput  The number of bytes written per output sink in this job run, as well as for ALL sinks. This helps to understand the data volume and how it evolves over time, which helps address issues such as processing skewness. Valid dimensions: JobName (the name of the AWS Glue job), JobRunId (the JobRun ID, or ALL), Type (gauge), ObservabilityGroup (throughput), and Sink (the sink data location) Valid Statistics: Average Unit: Bytes  | throughput | 
| glue.driver.[recordsWritten \| filesWritten] |  Metric Category: throughput  The number of records/files written per output sink in this job run, as well as for ALL sinks. This helps to understand the data volume and how it evolves over time, which helps address issues such as processing skewness. Valid dimensions: JobName (the name of the AWS Glue job), JobRunId (the JobRun ID, or ALL), Type (gauge), ObservabilityGroup (throughput), and Sink (the sink data location) Valid Statistics: Average Unit: Count  | throughput | 

## Error categories
<a name="monitor-observability-error-categories"></a>


| Error categories | Description | 
| --- | --- | 
| COMPILATION\_ERROR | Errors that arise during the compilation of Scala code. | 
| CONNECTION\_ERROR | Errors that arise when connecting to a service, remote host, database, and so on. | 
| DISK\_NO\_SPACE\_ERROR |  Errors that arise when there is no disk space left on the driver or an executor.  | 
| OUT\_OF\_MEMORY\_ERROR | Errors that arise when there is no memory left on the driver or an executor. | 
| IMPORT\_ERROR | Errors that arise when importing dependencies. | 
| INVALID\_ARGUMENT\_ERROR | Errors that arise when input arguments are invalid or illegal. | 
| PERMISSION\_ERROR | Errors that arise when permission to a service, data, and so on is lacking.  | 
| RESOURCE\_NOT\_FOUND\_ERROR |  Errors that arise when data, a location, and so on does not exist.   | 
| QUERY\_ERROR | Errors that arise from Spark SQL query execution.  | 
| SYNTAX\_ERROR | Errors that arise when there is a syntax error in the script.  | 
| THROTTLING\_ERROR | Errors that arise when hitting a service concurrency limit or exceeding a service quota. | 
| DATA\_LAKE\_FRAMEWORK\_ERROR | Errors that arise from an AWS Glue natively supported data lake framework, such as Hudi or Iceberg. | 
| UNSUPPORTED\_OPERATION\_ERROR | Errors that arise when performing an unsupported operation. | 
| RESOURCES\_ALREADY\_EXISTS\_ERROR | Errors that arise when a resource to be created or added already exists. | 
| GLUE\_INTERNAL\_SERVICE\_ERROR | Errors that arise when there is an AWS Glue internal service issue.  | 
| GLUE\_OPERATION\_TIMEOUT\_ERROR | Errors that arise when an AWS Glue operation times out. | 
| GLUE\_VALIDATION\_ERROR | Errors that arise when a required value cannot be validated for an AWS Glue job. | 
| GLUE\_JOB\_BOOKMARK\_VERSION\_MISMATCH\_ERROR | Errors that arise when the same job runs on the same source bucket and writes to the same or a different destination concurrently (concurrency > 1). | 
| LAUNCH\_ERROR | Errors that arise during the AWS Glue job launch phase. | 
| DYNAMODB\_ERROR | Generic errors that arise from the Amazon DynamoDB service. | 
| GLUE\_ERROR | Generic errors that arise from the AWS Glue service. | 
| LAKEFORMATION\_ERROR | Generic errors that arise from the AWS Lake Formation service. | 
| REDSHIFT\_ERROR | Generic errors that arise from the Amazon Redshift service. | 
| S3\_ERROR | Generic errors that arise from the Amazon S3 service. | 
| SYSTEM\_EXIT\_ERROR | Generic system exit errors. | 
| TIMEOUT\_ERROR | Generic errors that arise when a job fails due to an operation timeout. | 
| UNCLASSIFIED\_SPARK\_ERROR | Generic errors that arise from Spark. | 
| UNCLASSIFIED\_ERROR | The default error category. | 

## Limitations
<a name="monitoring-observability-limitations"></a>

**Note**  
`glueContext` must be initialized to publish the metrics.

 In the Source dimension, the value is either the Amazon S3 path or the table name, depending on the source type. In addition, if the source is JDBC and the query option is used, the query string is set in the Source dimension. If the value is longer than 500 characters, it is trimmed to 500 characters. The following limitations apply to the value: 
+ Non-ASCII characters are removed.
+ If the source name doesn't contain any ASCII characters, it is converted to <non-ASCII input>.
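Those rules can be sketched as a small sanitization function. This is a hypothetical illustration of the documented behavior (including the assumed order of trimming before filtering), not the actual service code:

```python
def sanitize_source_value(value, max_len=500):
    """Trim to max_len characters, drop non-ASCII characters, and fall
    back to the documented placeholder when nothing ASCII remains."""
    value = value[:max_len]  # assumption: trim first, then filter
    ascii_only = "".join(ch for ch in value if ord(ch) < 128)
    return ascii_only if ascii_only else "<non-ASCII input>"

print(sanitize_source_value("s3://my-bucket/データ/part-0001"))  # hypothetical path
print(sanitize_source_value("データ"))
```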

### Limitations and considerations for throughput metrics
<a name="monitoring-observability-considerations"></a>
+  DataFrame and DataFrame-based DynamicFrame (for example, JDBC, or reading from Parquet on Amazon S3) are supported; however, RDD-based DynamicFrame (for example, reading CSV or JSON on Amazon S3) is not supported. Technically, all reads and writes visible on the Spark UI are supported. 
+  The `recordsRead` metric is emitted if the data source is a catalog table and the format is JSON, CSV, text, or Iceberg. 
+  The `glue.driver.throughput.recordsWritten`, `glue.driver.throughput.bytesWritten`, and `glue.driver.throughput.filesWritten` metrics are not available for JDBC and Iceberg tables. 
+  Metrics may be delayed. If a job finishes in about one minute, there may be no throughput metrics in Amazon CloudWatch. 