

# Optimizing query performance using column statistics
<a name="column-statistics"></a>

You can compute column-level statistics for AWS Glue Data Catalog tables in data formats such as Parquet, ORC, JSON, ION, CSV, and XML without setting up additional data pipelines. Column statistics help you to understand data profiles by getting insights about values within a column. 

Data Catalog supports generating statistics for column values such as minimum value, maximum value, total null values, total distinct values, average length of values, and total occurrences of true values. AWS analytical services such as Amazon Redshift and Amazon Athena can use these column statistics to generate query execution plans, and choose the optimal plan that improves query performance.

There are three scenarios for generating column statistics: 

 **Auto**   
AWS Glue supports automatic column statistics generation at the catalog-level so that it can automatically generate statistics for new tables in the AWS Glue Data Catalog. 

**Scheduled**  
AWS Glue supports scheduling column statistics generation so that it can be run automatically on a recurring schedule.   
With scheduled statistics computation, the column statistics task updates the overall table-level statistics, such as min, max, and avg with the new statistics, providing query engines with accurate and up-to-date statistics to optimize query execution. 

**On-demand**  
Use this option to generate column statistics on-demand whenever needed. This is useful for ad-hoc analysis or when statistics need to be computed immediately. 

You can configure to run column statistics generation task using AWS Glue console, AWS CLI, and AWS Glue API operations. When you initiate the process, AWS Glue starts a Spark job in the background and updates the AWS Glue table metadata in the Data Catalog. You can view column statistics using AWS Glue console or AWS CLI or by calling the [GetColumnStatisticsForTable](https://docs.aws.amazon.com/glue/latest/webapi/API_GetColumnStatisticsForTable.html) API operation.

**Note**  
If you're using Lake Formation permissions to control access to the table, the role assumed by the column statistics task requires full table access to generate statistics.

 The following video demonstrates how to enhance query performance using column statistics. 

[![AWS Videos](http://img.youtube.com/vi/https://www.youtube.com/embed/zUHEXJdHUxs?si=HjyhpoALR6RXJz2i/0.jpg)](http://www.youtube.com/watch?v=https://www.youtube.com/embed/zUHEXJdHUxs?si=HjyhpoALR6RXJz2i)


**Topics**
+ [Prerequisites for generating column statistics](column-stats-prereqs.md)
+ [Automatic column statistics generation](auto-column-stats-generation.md)
+ [Generating column statistics on a schedule](generate-column-stats.md)
+ [Generating column statistics on demand](column-stats-on-demand.md)
+ [Viewing column statistics](view-column-stats.md)
+ [Viewing column statistics task runs](view-stats-run.md)
+ [Stopping column statistics task run](stop-stats-run.md)
+ [Deleting column statistics](delete-column-stats.md)
+ [Considerations and limitations](column-stats-notes.md)

# Prerequisites for generating column statistics
<a name="column-stats-prereqs"></a>

To generate or update column statistics, the statistics generation task assumes an AWS Identity and Access Management (IAM) role on your behalf. Based on the permissions granted to the role, the column statistics generation task can read the data from the Amazon S3 data store.

When you configure the column statistics generation task, AWS Glue allows you to create a role that includes the `AWSGlueServiceRole` AWS managed policy plus the required inline policy for the specified data source. 

If you specify an existing role for generating column statistics, ensure that it includes the `AWSGlueServiceRole` policy or equivalent (or a scoped down version of this policy), plus the required inline policies. Follow these steps to create a new IAM role:

**Note**  
 To generate statistics for tables managed by Lake Formation, the IAM role used to generate statistics requires full table access. 

When you configure the column statistics generation task, AWS Glue allows you to create a role that includes the `AWSGlueServiceRole` AWS managed policy plus the required inline policy for the specified data source. You can also create a role and attach the the permissions listed in the policy below, and add that role to the column statistics generation task.

**To create an IAM role for generating column statistics**

1. To create an IAM role, see [Create an IAM role for AWS Glue](https://docs.aws.amazon.com/glue/latest/dg/create-an-iam-role.html).

1. To update an existing role, in the IAM console, go to the IAM role that is being used by the generate column statistics process.

1. In the **Add permissions** section, choose **Attach policies**. In the newly opened browser window, choose `AWSGlueServiceRole` AWS managed policy.

1. You also need to include permissions to read data from the Amazon S3 data location.

   In the **Add permissions** section, choose **Create policy**. In the newly opened browser window, create a new policy to use with your role.

1. In the **Create policy** page, choose the **JSON** tab. Copy the following `JSON` code into the policy editor field.
**Note**  
In the following policies, replace account ID with a valid AWS account, and replace `region` with the Region of the table, and `bucket-name` with the Amazon S3 bucket name.

------
#### [ JSON ]

****  

   ```
   {
       "Version":"2012-10-17",		 	 	 
       "Statement": [
           {
               "Sid": "S3BucketAccess",
               "Effect": "Allow",
               "Action": [
                   "s3:ListBucket",
                   "s3:GetObject"
               ],
               "Resource": [
               	"arn:aws:s3:::amzn-s3-demo-bucket/*",
   							"arn:aws:s3:::amzn-s3-demo-bucket"
               ]
           }
        ]
   }
   ```

------

1. (Optional) If you're using Lake Formation permissions to provide access to your data, the IAM role requires `lakeformation:GetDataAccess` permissions.

------
#### [ JSON ]

****  

   ```
   {
     "Version":"2012-10-17",		 	 	 
     "Statement": [
       {
         "Sid": "LakeFormationDataAccess",
         "Effect": "Allow",
         "Action": "lakeformation:GetDataAccess",
         "Resource": [
           "*"
         ]
       }
     ]
   }
   ```

------

    If the Amazon S3 data location is registered with Lake Formation, and the IAM role assumed by the column statistics generation task doesn't have `IAM_ALLOWED_PRINCIPALS` group permissions granted on the table, the role requires Lake Formation `ALTER` and `DESCRIBE` permissions on the table. The role used for registering the Amazon S3 bucket requires Lake Formation `INSERT` and `DELETE` permissions on the table. 

   If the Amazon S3 data location is not registered with Lake Formation, and the IAM role doesn't have `IAM_ALLOWED_PRINCIPALS` group permissions granted on the table, the role requires Lake Formation `ALTER`, `DESCRIBE`, `INSERT` and `DELETE` permissions on the table. 

1. If you've enabled the catalog-level `Automatic statistics generation` option, the IAM role must have the `glue:UpdateCatalog` permission or the Lake Formation `ALTER CATALOG` permission on the default Data Catalog. You can use the `GetCatalog` operation to verify the catalog properties. 

1. (Optional) The column statistics generation task that writes encrypted Amazon CloudWatch Logs requires the following permissions in the key policy.

------
#### [ JSON ]

****  

   ```
   {
     "Version":"2012-10-17",		 	 	 
     "Statement": [
       {
         "Sid": "CWLogsKmsPermissions",
         "Effect": "Allow",
         "Action": [
           "logs:CreateLogGroup",
           "logs:CreateLogStream",
           "logs:PutLogEvents",
           "logs:AssociateKmsKey"
         ],
         "Resource": [
           "arn:aws:logs:us-east-1:111122223333:log-group:/aws-glue:*"
         ]
       },
       {
         "Sid": "KmsPermissions",
         "Effect": "Allow",
         "Action": [
           "kms:GenerateDataKey",
           "kms:Decrypt",
           "kms:Encrypt"
         ],
         "Resource": [
           "arn:aws:kms:us-east-1:111122223333:key/arn of key used for ETL cloudwatch encryption"
         ],
         "Condition": {
           "StringEquals": {
             "kms:ViaService": [
               "glue.us-east-1.amazonaws.com"
             ]
           }
         }
       }
     ]
   }
   ```

------

1. The role you use to run column statistics must have the `iam:PassRole` permission on the role.

------
#### [ JSON ]

****  

   ```
   {
     "Version":"2012-10-17",		 	 	 
     "Statement": [
       {
         "Effect": "Allow",
         "Action": [
           "iam:PassRole"
         ],
         "Resource": [
           "arn:aws:iam::111122223333:role/columnstats-role-name"
         ]
       }
     ]
   }
   ```

------

1. When you create an IAM role for generating column statistics, that role must also have the following trust policy that enables the service to assume the role. 

------
#### [ JSON ]

****  

   ```
   {
     "Version":"2012-10-17",		 	 	 
     "Statement": [
       {
         "Sid": "TrustPolicy",
         "Effect": "Allow",
         "Principal": {
           "Service": "glue.amazonaws.com"
         },
         "Action": "sts:AssumeRole"
       }
     ]
   }
   ```

------

# Automatic column statistics generation
<a name="auto-column-stats-generation"></a>

Automatic generation of column statistics allows you to schedule and automatically compute statistics on new tables in the AWS Glue Data Catalog. When you enable automatic statistics generation, the Data Catalog discovers new tables with specific data formats such as Parquet, JSON, CSV, XML, ORC, ION, and Apache Iceberg, along with their individual bucket paths. With a one-time catalog configuration, the Data Catalog generates statistics for these tables.

 Data lake administrators can configure the statistics generation by selecting the default catalog in the Lake Formation console, and enabling table statistics using the `Optimization configuration` option. When you create new tables or update existing tables in the Data Catalog, the Data Catalog collects the number of distinct values (NDVs) for Apache Iceberg tables, and additional statistics such as the number of nulls, maximum, minimum, and average length for other supported file formats on a weekly basis. 

If you have configured statistics generation at the table-level or if you have previously deleted the statistics generation settings for a table, those table-specific settings take precedence over the default catalog settings for automatic column statistics generation.

 Automatic statistics generation task analyzes 50% of records in the tables to calculate statistics. Automatic column statistics generation ensures that the Data Catalog maintains weekly metrics that can be used by query engines like Amazon Athena and Amazon Redshift Spectrum for improved query performance and potential cost savings. It allows scheduling statistics generation using AWS Glue APIs or the console, providing an automated process without manual intervention. 

**Topics**
+ [Enabling catalog-level automatic statistics generation](enable-auto-column-stats-generation.md)
+ [Viewing automated table-level settings](view-auto-column-stats-settings.md)
+ [Disabling catalog-level column statistics generation](disable-auto-column-stats-generation.md)

# Enabling catalog-level automatic statistics generation
<a name="enable-auto-column-stats-generation"></a>

You can enable the automatic column statistics generation for all new Apache Iceberg tables and tables in non-OTF table (Parquet, JSON, CSV, XML, ORC, ION) formats in the Data Catalog. After creating the table, you can also explicitly update the column statistics settings manually.

 To update the Data Catalog settings to enable catalog-level, the IAM role used must have the `glue:UpdateCatalog` permission or AWS Lake Formation `ALTER CATALOG` permission on the root catalog. You can use `GetCatalog` API to verify the catalog properties. 

------
#### [ AWS Management Console ]

**To enable the automatic column statistics generation at the account-level**

1. Open the Lake Formation console at [https://console.aws.amazon.com/lakeformation/](https://console.aws.amazon.com/lakeformation/).

1. On the left navigation bar, choose **Catalogs**.

1. On the **Catalog summary** page, choose **Edit** under **Optimization configuration**.   
![\[The screenshot shows the options available to generate column stats.\]](http://docs.aws.amazon.com/glue/latest/dg/images/edit-column-stats-auto.png)

1. On the **Table optimization configuration** page, choose the **Enable automatic statistics generation for the tables of the catalog** option.  
![\[The screenshot shows the options available to generate column stats.\]](http://docs.aws.amazon.com/glue/latest/dg/images/edit-optimization-option.jpg)

1. Choose an existing IAM role or create a new one that has the necessary permissions to run the column statistics task.

1. Choose **Submit**.

------
#### [ AWS CLI ]

You can also enable catalog-level statistics collection through the AWS CLI. To configure table-level statistics collection using AWS CLI, run the following command:

```
aws glue update-catalog --cli-input-json '{
    "name": "123456789012",
    "catalogInput": {
        "description": "Updating root catalog with role arn",
        "catalogProperties": {
            "customProperties": {
                "ColumnStatistics.RoleArn": "arn:aws:iam::"123456789012":role/service-role/AWSGlueServiceRole",
                "ColumnStatistics.Enabled": "true"
            }
        }
    }
}'
```

 The above command calls AWS Glue's `UpdateCatalog` operation, which takes in a `CatalogProperties` structure with the following key-value pairs for catalog-level statistics generation: 
+ ColumnStatistics.RoleArn – IAM role ARN to be used for all tasks triggered for Catalog-level statistics generation
+ ColumnStatistics.Enabled – Boolean indicating whether the catalog-level settings is enabled or disabled

------

# Viewing automated table-level settings
<a name="view-auto-column-stats-settings"></a>

 When catalog-level statistics collection is enabled, anytime an Apache Hive table or Apache Iceberg table is created or updated via the `CreateTable` or `UpdateTable` APIs through AWS Management Console, SDK, or AWS Glue crawler, an equivalent table level setting is created for that table. 

 Tables with automatic statistics generation enabled must follow one of following properties:
+ Use an `InputSerdeLibrary` that begins with org.apache.hadoop and `TableType` equals `EXTERNAL_TABLE`
+ Use an `InputSerdeLibrary` that begins with `com.amazon.ion` and `TableType` equals `EXTERNAL_TABLE`
+ Contain table\$1type: "ICEBERG" in it’s parameters structure. 

 After you create or update a table, you can verify the table details to confirm the statistics generation. The `Statistics generation summary` shows the `Schedule` property set as `AUTO` and `Statistics configuration` value is `Inherited from catalog`. Any table setting with the following setting would be automatically triggered by Glue internally. 

![\[An image of a Hive table with catalog-level statistics collection has been applied and statistics have been collected.\]](http://docs.aws.amazon.com/glue/latest/dg/images/auto-stats-summary.png)


# Disabling catalog-level column statistics generation
<a name="disable-auto-column-stats-generation"></a>

 You can disable automatic column statistics generation for new tables using the AWS Lake Formation console, the `glue:UpdateCatalogSettings` API, or the `glue:DeleteColumnStatisticsTaskSettings` API. 

**To disable the automatic column statistics generation at the account-level**

1. Open the Lake Formation console at [https://console.aws.amazon.com/lakeformation/](https://console.aws.amazon.com/lakeformation/).

1. On the left navigation bar, choose **Catalogs**.

1. On the **Catalog summary** page, choose **Edit** under **Optimization configuration**. 

1. On the **Table optimization configuration** page, unselect the **Enable automatic statistics generation for the tables of the catalog** option.

1. Choose **Submit**.

# Generating column statistics on a schedule
<a name="generate-column-stats"></a>

Follow these steps to configure a schedule for generating column statistics in the AWS Glue Data Catalog using the AWS Glue console, the AWS CLI, or the [CreateColumnStatisticsTaskSettings](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-column-statistics.html#aws-glue-api-crawler-column-statistics-CreateColumnStatisticsTaskSettings) operation.

------
#### [ Console ]

**To generate column statistics using the console**

1. Sign in to the AWS Glue console at [https://console.aws.amazon.com/glue/](https://console.aws.amazon.com/glue/). 

1. Choose Data Catalog tables.

1. Choose a table from the list. 

1. Choose **Column statistics** tab in the lower section of the **Tables** page.

1. You can also choose **Generate on schedule** under **Column statistics** from **Actions**.

1. On the **Generate statistics on schedule** page, configure a recurring schedule for running the column statistics task by choosing the frequency and start time. You can choose the frequency to be hourly, daily, weekly, or define a cron expression to specify the schedule.

   A cron expression is a string representing a schedule pattern, consisting of 6 fields separated by spaces: \$1 \$1 \$1 \$1 \$1 <minute> <hour> <day of month> <month> <day of week> <year> For example, to run a task every day at midnight, the cron expression would be: 0 0 \$1 \$1 ? \$1

   For more information, see [Cron expressions](https://docs.aws.amazon.com/glue/latest/dg/monitor-data-warehouse-schedule.html#CronExpressions).  
![\[The screenshot shows the options available to generate column stats.\]](http://docs.aws.amazon.com/glue/latest/dg/images/generate-column-stats-schedule.png)

1. Next, choose the column option to generate statistics.
   + **All columns** – Choose this option to generate statistics for all columns in the table.
   + **Selected columns** – Choose this option to generate statistics for specific columns. You can select the columns from the drop-down list.

1. Choose an IAM role or create an existing role that has permissions to generate statistics. AWS Glue assumes this role to generate column statistics.

   A quicker approach is to let the AWS Glue console to create a role for you. The role that it creates is specifically for generating column statistics, and includes the `AWSGlueServiceRole` AWS managed policy plus the required inline policy for the specified data source. 

   If you specify an existing role for generating column statistics, ensure that it includes the `AWSGlueServiceRole` policy or equivalent (or a scoped down version of this policy), plus the required inline policies. 

1. (Optional) Next, choose a security configuration to enable at-rest encryption for logs.

1. (Optional) You can choose a sample size by indicating only a specific percent of rows from the table to generate statistics. The default is all rows. Use the up and down arrows to increase or decrease the percent value. 

   We recommend to include all rows in the table to compute accurate statistics. Use sample rows to generate column statistics only when approximate values are acceptable.

1. Choose **Generate statistics** to run the column statistics generation task.

------
#### [ AWS CLI ]

You can use the following AWS CLI example to create a column statistics generation schedule. The database-name, table-name, and role are required parameters, and optional parameters are schedule, column-name-list, catalog-id, sample-size, and security-configuration.

```
aws glue create-column-statistics-task-settings \ 
 --database-name 'database_name' \ 
 --table-name table_name \ 
 --role 'arn:aws:iam::123456789012:role/stats-role' \ 
 --schedule 'cron(0 0-5 14 * * ?)' \ 
 --column-name-list 'col-1' \  
 --catalog-id '123456789012' \ 
 --sample-size '10.0 ' \
 --security-configuration 'test-security'
```

You can generate column statistics also by calling the [StartColumnStatisticsTaskRun](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-column-statistics.html#aws-glue-api-crawler-column-statistics-StartColumnStatisticsTaskRun) operation.

------

# Managing the schedule for column statistics generation
<a name="manage-column-stats-schedule"></a>

You can manage the scheduling operations such as updating, starting, stopping, and deleting schedules for the column statistics generation in AWS Glue. You can use AWS Glue console, AWS CLI, or [AWS Glue column statistics API operations](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-column-statistics.html) to perform these tasks.

**Topics**
+ [Updating the column statistics generation schedule](#update-column-stats-shedule)
+ [Stopping the schedule for column statistics generation](#stop-column-stats-schedule)
+ [Resuming the schedule for column statistics generation](#resume-column-stats-schedule)
+ [Deleting column statistics generation schedule](#delete-column-stats-schedule)

## Updating the column statistics generation schedule
<a name="update-column-stats-shedule"></a>

You can update the schedule to trigger the column statistics generation task after it has been created. You can use the AWS Glue console, AWS CLI, or run the [UpdateColumnStatisticsTaskSettings](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-column-statistics.html#aws-glue-api-crawler-column-statistics-UpdateColumnStatisticsTaskSettings) operation to update the schedule for a table. You can modify the parameters of an existing schedule, such as the schedule type (on-demand, or scheduled) and other optional parameters. 

------
#### [ AWS Management Console ]

**To update the settings for a column statistics generation task**

1. Sign in to the AWS Glue console at [https://console.aws.amazon.com/glue/](https://console.aws.amazon.com/glue/).

1. Choose the table that you want to update from the tables list.

1. In the lower section of the table details page, choose **Column statistics**. 

1. Under **Actions**, choose **Edit** to update the schedule.

1. Make the desired changes to the schedule, and choose **Save**.

------
#### [ AWS CLI ]

 If you are not using AWS Glue's statistics generation feature in the console, you can manually update the schedule using the `update-column-statistics-task-settings` command. The following example shows how to update column statistics using AWS CLI. 

```
aws glue update-column-statistics-task-settings \ 
 --database-name 'database_name' \ 
 --table-name 'table_name' \ 
 --role arn:aws:iam::123456789012:role/stats_role \ 
 --schedule 'cron(0 0-5 16 * * ?)' \ 
 --column-name-list 'col-1' \
 --sample-size '20.0' \  
 --catalog-id '123456789012'\
 --security-configuration 'test-security'
```

------

## Stopping the schedule for column statistics generation
<a name="stop-column-stats-schedule"></a>

 If you no longer need the incremental statistics, you can stop the scheduled generation to save resources and costs. Pausing the schedule doesn't impact the previously generated statistics. You can resume the schedule at your convenience. 

------
#### [ AWS Management Console ]

**To stop the schedule for a column statistics generation task**

1. On AWS Glue console, choose **Tables** under Data Catalog.

1. Select a table with column statistics.

1. On the **Table details** page, choose **Column statistics**.

1. Under **Actions**, choose **Scheduled generation**, **Pause**.

1. Choose **Pause** to confirm.

------
#### [ AWS CLI ]

To stop a column statistics task run schedule using the AWS CLI, you can use the following command: 

```
aws glue stop-column-statistics-task-run-schedule \
 --database-name ''database_name' \
 --table-name 'table_name'
```

Replace the `database_name` and the `table_name` with the actual names of the database and table for which you want to stop the column statistics task run schedule.

------

## Resuming the schedule for column statistics generation
<a name="resume-column-stats-schedule"></a>

 If you've paused the statistics generation schedule, AWS Glue allows you to resume the schedule at your convenience. You can resume the schedule using the AWS Glue console, AWS CLI, or the [StartColumnStatisticsTaskRunSchedule](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-column-statistics.html#aws-glue-api-crawler-column-statistics-StartColumnStatisticsTaskRunSchedule) operation. 

------
#### [ AWS Management Console ]

**To resume the schedule for column statistics generation**

1. On AWS Glue console, choose **Tables** under Data Catalog.

1. Select a table with column statistics.

1. On the **Table details** page, choose **Column statistics**.

1. Under **Actions**, choose **Scheduled generation**, and choose **Resume**.

1. Choose **Resume**to confirm.

------
#### [ AWS CLI ]

Replace the `database_name` and the `table_name` with the actual names of the database and table for which you want to stop the column statistics task run schedule.

```
aws glue start-column-statistics-task-run-schedule \
 --database-name 'database_name' \
 --table-name 'table_name'
```

------

## Deleting column statistics generation schedule
<a name="delete-column-stats-schedule"></a>

 While maintaining up-to-date statistics is generally recommended for optimal query performance, there are specific use cases where removing the automatic generation schedule might be beneficial.
+ If the data remains relatively static, the existing column statistics may remain accurate for an extended period, reducing the need for frequent updates. Deleting the schedule can prevent unnecessary resource consumption and overhead associated with regenerating statistics on unchanging data.
+ When manual control over statistics generation is preferred. By deleting the automatic schedule, administrators can selectively update column statistics at specific intervals or after significant data changes, aligning the process with their maintenance strategies and resource allocation needs. 

------
#### [ AWS Management Console ]

**To delete the schedule for column statistics generation**

1. On AWS Glue console, choose **Tables** under Data Catalog.

1. Select a table with column statistics.

1. On the **Table details** page, choose **Column statistics**.

1. Under **Actions**, choose **Scheduled generation**, **Delete**.

1. Choose **Delete**to confirm.

------
#### [ AWS CLI ]

Replace the `database_name` and the `table_name` with the actual names of the database and table for which you want to stop the column statistics task run schedule.

You can delete column statistics schedule using the [DeleteColumnStatisticsTaskSettings](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-column-statistics.html#aws-glue-api-crawler-column-statistics-DeleteColumnStatisticsTaskSettings) API operation or AWS CLI. The following example shows how to delete the schedule for generating column statistics using AWS Command Line Interface (AWS CLI).

```
aws glue delete-column-statistics-task-settings \
    --database-name 'database_name' \
    --table-name 'table_name'
```

------

# Generating column statistics on demand
<a name="column-stats-on-demand"></a>

You can run the column statistics task for the AWS Glue Data Catalog tables task on-demand without a set schedule. This option is useful for ad-hoc analysis or when statistics need to be computed immediately.

Follow these steps to generate column statistics on demand for the Data Catalog tables using AWS Glue console or AWS CLI.

------
#### [ AWS Management Console ]

**To generate column statistics using the console**

1. Sign in to the AWS Glue console at [https://console.aws.amazon.com/glue/](https://console.aws.amazon.com/glue/). 

1. Choose Data Catalog tables.

1.  Choose a table from the list. 

1. Choose **Generate statistics** under **Actions** menu.

   You can also choose **Generate**, **Generate on demand** option under **Column statistics** tab in the lower section of the **Table** page.

1. Follow steps 7 - 11 in the [Generating column statistics on a schedule](generate-column-stats.md) to generate column statistics for the table.

1. On the **Generate statistics** page, specify the following options:   
![\[The screenshot shows the options available to generate column stats.\]](http://docs.aws.amazon.com/glue/latest/dg/images/generate-column-stats.png)
   + **All columns** – Choose this option to generate statistics for all columns in the table.
   + **Selected columns** – Choose this option to generate statistics for specific columns. You can select the columns from the drop-down list.
   + **IAM role** –Choose **Create a new IAM role** that has the required permission policies to run the column statistics generation task. Choose View permission details to review the policy statement. You can also select an IAM role from the list. For more information about the required permissions, see [Prerequisites for generating column statistics](column-stats-prereqs.md).

     AWS Glue assumes the permissions of the role that you specify to generate statistics. 

     For more information about providing roles for AWS Glue, see [Identity-based policies for AWS Glue.](https://docs.aws.amazon.com/glue/latest/dg/security_iam_service-with-iam.html#security_iam_service-with-iam-id-based-policies).
   + (Optional) Next, choose a security configuration to enable at-rest encryption for logs.
   + **Sample rows** – Choose only a specific percent of rows from the table to generate statistics. The default is all rows. Use the up and down arrows to increase or decrease the percent value.
**Note**  
We recommend to include all rows in the table to compute accurate statistics. Use sample rows to generate column statistics only when approximate values are acceptable.

   Choose **Generate statistics** to run the task.

------
#### [ AWS CLI ]

This command will trigger an column statistics task run for the specified table. You need to provide the database name, table name, an IAM role with permissions to generate statistics, and optionally provide column names and a sample size percentage for the statistics computation.

```
aws glue start-column-statistics-task-run \ 
    --database-name 'database_name \ 
    --table-name 'table_name' \ 
    --role 'arn:aws:iam::123456789012:role/stats-role' \
    --column-name 'col1','col2'  \
    --sample-size 10.0
```

This command will start a task to generate column statistics for the specified table. 

------

## Updating column statistics on demand
<a name="update-column-stats-on-demand"></a>

 Maintaining up-to-date column statistics is crucial for the query optimizer to generate efficient execution plans, ensuring improved query performance, reduced resource consumption, and better overall system performance. This process is particularly important after significant data changes, such as bulk loads or extensive modifications, which can render existing statistics obsolete. 

You need to explicitly run the **Generate statistics** task from the AWS Glue console to refresh the column statistics. Data Catalog doesn't automatically refresh the statistics.

If you are not using AWS Glue's statistics generation feature in the console, you can manually update column statistics using the [UpdateColumnStatisticsForTable](https://docs.aws.amazon.com/glue/latest/webapi/API_UpdateColumnStatisticsForTable.html) API operation or AWS CLI. The following example shows how to update column statistics using AWS CLI.

```
aws glue update-column-statistics-for-table --cli-input-json:

{
    "CatalogId": "111122223333",
    "DatabaseName": "database_name",
    "TableName": "table_name",
    "ColumnStatisticsList": [
        {
            "ColumnName": "col1",
            "ColumnType": "Boolean",
            "AnalyzedTime": "1970-01-01T00:00:00",
            "StatisticsData": {
                "Type": "BOOLEAN",
                "BooleanColumnStatisticsData": {
                    "NumberOfTrues": 5,
                    "NumberOfFalses": 5,
                    "NumberOfNulls": 0
                }
            }
        }
    ]
}
```

# Viewing column statistics
<a name="view-column-stats"></a>

After generating the statistics successfully, Data Catalog stores this information for the cost-based optimizers in Amazon Athena and Amazon Redshift to make optimal choices when running queries. The statistics varies based on the type of the column.

------
#### [ AWS Management Console ]

**To view column statistics for a table**
+ After running column statistics task, the **Column statistics** tab on the **Table details** page shows the statistics for the table.   
![\[The screenshot shows columns generated from the most recent run.\]](http://docs.aws.amazon.com/glue/latest/dg/images/view-column-stats.png)

  The following statistics are available:
  + Column name: Column name used to generate statistics
  + Last updated: Data and time when the statistics were generated
  + Average length: Average length of values in the column
  + Distinct values: Total number of distinct values in the column. We estimate the number of distinct values in a column with 5% relative error.
  + Max value: The largest value in the column.
  + Min value: The smallest value in the column. 
  + Max length: The length of the highest value in the column.
  + Null values: The total number of null values in the column.
  + True values: The total number of true values in the column.
  + False values: The total number of false values in the column.
  + numFiles: The total number of files in the table. This value is available under the **Advanced properties** tab.

------
#### [ AWS CLI ]

The following example shows how to retrieve column statistics using AWS CLI.

```
aws glue get-column-statistics-for-table \
    --database-name database_name \
    --table-name table_name \
    --column-names <column_name>
```

 You can also view the column statistics using the [GetColumnStatisticsForTable](https://docs.aws.amazon.com/glue/latest/webapi/API_GetColumnStatisticsForTable.html) API operation. 

------

# Viewing column statistics task runs
<a name="view-stats-run"></a>

After you run a column statistics task, you can explore the task run details for a table using AWS Glue console, AWS CLI or using [GetColumnStatisticsTaskRuns](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-column-statistics.html#aws-glue-api-crawler-column-statistics-GetColumnStatisticsTaskRun) operation.

------
#### [ Console ]

**To view column statistics task run details**

1. On AWS Glue console, choose **Tables** under Data Catalog.

1. Select a table with column statistics.

1. On the **Table details** page, choose **Column statistics**.

1. Choose **View runs**.

   You can see information about all runs associated with the specified table.  
![\[The screenshot shows the options available to generate column stats.\]](http://docs.aws.amazon.com/glue/latest/dg/images/view-column-stats-task-runs.png)

------
#### [ AWS CLI ]

In the following example, replace values for `DatabaseName` and `TableName` with the actual database and table name.

```
aws glue get-column-statistics-task-runs --input-cli-json file://input.json
{
    "DatabaseName": "database_name",
    "TableName": "table_name"
}
```

------

# Stopping column statistics task run
<a name="stop-stats-run"></a>

You can stop a column statistics task run for a table using AWS Glue console, AWS CLI or using [StopColumnStatisticsTaskRun](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-column-statistics.html#aws-glue-api-crawler-column-statistics-StopColumnStatisticsTaskRun) operation.

------
#### [ Console ]

**To stop a column statistics task run**

1. On AWS Glue console, choose **Tables** under Data Catalog.

1. Select the table with the column statistics task run is in progress.

1. On the **Table details** page, choose **Column statistics**.

1. Choose **Stop**.

   If you stop the task before the run is complete, column statistics won't be generated for the table.

------
#### [ AWS CLI ]

In the following example, replace values for `DatabaseName` and `TableName` with the actual database and table name.

```
aws glue stop-column-statistics-task-run --input-cli-json file://input.json
{
    "DatabaseName": "database_name",
    "TableName": "table_name"
}
```

------

# Deleting column statistics
<a name="delete-column-stats"></a>

You can delete column statistics using the [DeleteColumnStatisticsForTable](https://docs.aws.amazon.com/glue/latest/webapi/API_DeleteColumnStatisticsForTable.html) API operation or AWS CLI. The following example shows how to delete column statistics using AWS Command Line Interface (AWS CLI).

```
aws glue delete-column-statistics-for-table \
    --database-name 'database_name' \
    --table-name 'table_name' \
    --column-name 'column_name'
```

# Considerations and limitations
<a name="column-stats-notes"></a>

The following considerations and limitations apply to generating column statistics.

**Considerations**
+ Using sampling to generate statistics reduces run time, but can generate inaccurate statistics.
+ Data Catalog doesn't store different versions of the statistics.
+ You can only run one statistics generation task at a time per table.
+ If a table is encrypted using customer AWS KMS key registered with Data Catalog, AWS Glue uses the same key to encrypt statistics.

**Column statistics task supports generating statistics:**
+ When the IAM role has full table permissions (IAM or Lake Formation).
+ When the IAM role has permissions on the table using Lake Formation hybrid access mode.

**Column statistics task doesn’t support generating statistics for:**
+ Tables with Lake Formation cell-based access control
+ Transactional data lakes - Linux foundation Delta Lake, Apache Hudi
+ Tables in federated databases - Hive metastore, Amazon Redshift datashares
+ Nested columns, arrays, and struct data types.
+ Table that is shared with you from another account