

 Amazon Redshift will no longer support the creation of new Python UDFs starting with Patch 198. Existing Python UDFs will continue to function until June 30, 2026. For more information, see the [blog post](https://aws.amazon.com/blogs/big-data/amazon-redshift-python-user-defined-functions-will-reach-end-of-support-after-june-30-2026/). 

# Amazon Redshift best practices
<a name="best-practices"></a>

Following, you can find best practices for planning a proof of concept, designing tables, loading data into tables, and writing queries for Amazon Redshift, and also a discussion of working with Amazon Redshift Advisor. 

Amazon Redshift is not the same as other SQL database systems. To fully realize the benefits of the Amazon Redshift architecture, you must specifically design, build, and load your tables to use massively parallel processing, columnar data storage, and columnar data compression. If your data loading and query execution times are longer than you expect, or longer than you want, you might be overlooking key information. 

If you are an experienced SQL database developer, we strongly recommend that you review this topic before you begin developing your Amazon Redshift data warehouse. 

If you are new to developing SQL databases, this topic is not the best place to start. We recommend that you begin by reading [Run commands to define and use a database in your data warehouse](https://docs.aws.amazon.com/redshift/latest/gsg/database-tasks.html) in the *Amazon Redshift Getting Started Guide*, and trying the examples yourself. 

In this topic, you can find an overview of the most important development principles, along with specific tips, examples, and best practices for implementing those principles. No single practice can apply to every application. Evaluate all of your options before finishing a database design. For more information, see [Automatic table optimization](t_Creating_tables.md), [Loading data in Amazon Redshift](t_Loading_data.md), [Query performance tuning](c-optimizing-query-performance.md), and the reference chapters. 

**Topics**
+ [Conduct a proof of concept (POC) for Amazon Redshift](proof-of-concept-playbook.md)
+ [Amazon Redshift best practices for designing tables](c_designing-tables-best-practices.md)
+ [Amazon Redshift best practices for loading data](c_loading-data-best-practices.md)
+ [Amazon Redshift best practices for designing queries](c_designing-queries-best-practices.md)
+ [Follow recommendations from Amazon Redshift Advisor](advisor.md)

# Conduct a proof of concept (POC) for Amazon Redshift
<a name="proof-of-concept-playbook"></a>

Amazon Redshift is a popular cloud data warehouse, offering a fully managed cloud-based service that integrates with an organization’s Amazon Simple Storage Service data lake, real-time streams, machine learning (ML) workflows, transactional workflows, and much more. The following sections guide you through the process of doing a proof of concept (POC) on Amazon Redshift. The information here helps you set goals for your POC and shows how to take advantage of tools that can automate the provisioning and configuration of services for your POC.

**Note**  
For a copy of this information as a PDF, choose the link **Run your own Redshift POC** on the [Amazon Redshift resources](https://aws.amazon.com/redshift/resources/) page.

When doing a POC of Amazon Redshift, you test, prove out, and adopt features ranging from best-in-class security capabilities and elastic scaling to easy integration and ingestion and flexible, decentralized data architecture options.

![\[Shows a depiction of the steps in the proof of concept flow.\]](http://docs.aws.amazon.com/redshift/latest/dg/images/poc-steps-overview.png)


Follow these steps to conduct a successful POC.

## Step 1: Scope your POC
<a name="proof-of-concept-scope"></a>

![\[Shows that the scope step is the current step in the proof of concept flow.\]](http://docs.aws.amazon.com/redshift/latest/dg/images/poc-step1.png)


When conducting a POC, you can choose to use either your own data or benchmarking datasets. When you choose your own data, you run your own queries against it. With benchmarking data, sample queries are provided with the benchmark. If you are not ready to conduct a POC with your own data just yet, see [Use sample datasets](#use-sample-datasets) for more details.

In general, we recommend using two weeks of data for an Amazon Redshift POC.

Start by doing the following:

1. **Identify your business and functional requirements,** then work backwards. Common examples are achieving faster performance, lowering costs, testing a new workload or feature, or comparing Amazon Redshift with another data warehouse.

1. **Set specific targets** which become the success criteria for the POC. For example, from *faster performance*, come up with a list of the top five processes you wish to accelerate, and include the current run times along with your required run time. These can be reports, queries, ETL processes, data ingestion, or whatever your current pain points are.

1. **Identify the specific scope and artifacts** needed to run the tests. What datasets do you need to migrate or continuously ingest into Amazon Redshift, and what queries and processes are needed to run the tests to measure against the success criteria? There are two ways to do this:

   

**Bring your own data**
   + To test your own data, come up with the minimum viable list of data artifacts required to test your success criteria. For example, if your current data warehouse has 200 tables but the reports you want to test need only 20, your POC can run faster by using only that smaller subset of tables.

   

**Use sample datasets**
   + If you don’t have your own datasets ready, you can still get started doing a POC on Amazon Redshift by using the industry-standard benchmark datasets such as [TPC-DS](https://github.com/awslabs/amazon-redshift-utils/tree/master/src/CloudDataWarehouseBenchmark/Cloud-DWB-Derived-from-TPCDS) or [TPC-H](https://github.com/awslabs/amazon-redshift-utils/tree/master/src/CloudDataWarehouseBenchmark/Cloud-DWB-Derived-from-TPCH) and run sample benchmarking queries to harness the power of Amazon Redshift. These datasets can be accessed from within your Amazon Redshift data warehouse after it is created. For detailed instructions on how to access these datasets and sample queries, see [Step 2: Launch Amazon Redshift](#proof-of-concept-launch).

## Step 2: Launch Amazon Redshift
<a name="proof-of-concept-launch"></a>

![\[Shows that the Amazon Redshift launch step is the current step in the proof of concept flow.\]](http://docs.aws.amazon.com/redshift/latest/dg/images/poc-step2.png)


Amazon Redshift accelerates your time to insights with fast, easy, and secure cloud data warehousing at scale. You can start quickly by launching your warehouse on the [Redshift Serverless console](https://console.aws.amazon.com//redshiftv2/home?#serverless-dashboard) and get from data to insights in seconds. With Redshift Serverless, you can focus on delivering on your business outcomes without worrying about managing your data warehouse.

### Set up Amazon Redshift Serverless
<a name="proof-of-concept-setup-serverless"></a>

The first time you use Redshift Serverless, the console leads you through the steps required to launch your warehouse. You might also be eligible for a credit towards your Redshift Serverless usage in your account. For more information about choosing a free trial, see [Amazon Redshift free trial](https://aws.amazon.com/redshift/free-trial/). Follow the steps in the [Creating a data warehouse with Redshift Serverless](https://docs.aws.amazon.com/redshift/latest/gsg/new-user-serverless.html#serverless-console-resource-creation) in the *Amazon Redshift Getting Started Guide* to create a data warehouse with Redshift Serverless. If you do not have a dataset that you would like to load, the guide also contains steps on how to load a sample data set.

If you have previously launched Redshift Serverless in your account, follow the steps in [Creating a workgroup with a namespace](https://docs.aws.amazon.com/redshift/latest/mgmt/serverless-console-workgroups-create-workgroup-wizard.html) in the *Amazon Redshift Management Guide*. After your warehouse is available, you can opt to load the sample data available in Amazon Redshift. For information about using Amazon Redshift query editor v2 to load data, see [Loading sample data](https://docs.aws.amazon.com/redshift/latest/mgmt/query-editor-v2-loading.html#query-editor-v2-loading-sample-data) in the *Amazon Redshift Management Guide*.

If you are bringing your own data instead of loading the sample data set, see [Step 3: Load your data](#proof-of-concept-load-data).

## Step 3: Load your data
<a name="proof-of-concept-load-data"></a>

![\[Shows that the load step is the current step in the proof of concept flow.\]](http://docs.aws.amazon.com/redshift/latest/dg/images/poc-step3.png)


After launching Redshift Serverless, the next step is to load your data for the POC. Whether you are uploading a simple CSV file, ingesting semi-structured data from S3, or streaming data directly, Amazon Redshift provides the flexibility to quickly and easily move the data into Amazon Redshift tables from the source.

Choose one of the following methods to load your data.

### Upload a local file
<a name="proof-of-concept-load-data-local-file"></a>

For quick ingestion and analysis, you can use [Amazon Redshift query editor v2](https://docs.aws.amazon.com/redshift/latest/mgmt/query-editor-v2.html) to easily load data files from your local desktop. It can process files in various formats such as CSV, JSON, AVRO, PARQUET, ORC, and more. As an administrator, to enable your users to load data from a local desktop using query editor v2, you must specify a common Amazon S3 bucket, and the user accounts must be [configured with the proper permissions](https://docs.aws.amazon.com/redshift/latest/mgmt/query-editor-v2-loading.html#query-editor-v2-loading-data-local). You can follow [Data load made easy and secure in Amazon Redshift using Query Editor V2](https://aws.amazon.com/blogs//big-data/data-load-made-easy-and-secure-in-amazon-redshift-using-query-editor-v2/) for step-by-step guidance.

### Load an Amazon S3 file
<a name="proof-of-concept-load-data-s3-file"></a>

To load data from an Amazon S3 bucket into Amazon Redshift, begin by using the [COPY command](https://docs.aws.amazon.com/redshift/latest/dg/t_loading-tables-from-s3.html), specifying the source Amazon S3 location and target Amazon Redshift table. Ensure that the IAM roles and permissions are properly configured to allow Amazon Redshift access to the designated Amazon S3 bucket. Follow [Tutorial: Loading data from Amazon S3](https://docs.aws.amazon.com/redshift/latest/dg/tutorial-loading-data.html) for step-by-step guidance. You can also choose the **Load data** option in query editor v2 to directly load data from your S3 bucket.
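As a sketch of the COPY syntax (the table, bucket, and role names here are placeholders), a basic load from a pipe-delimited file set in Amazon S3 might look like the following:

```
copy sales
from 's3://amzn-s3-demo-bucket/tickit/sales/'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
delimiter '|'
region 'us-east-1';
```

COPY loads every object under the given prefix in parallel, so a prefix that maps to multiple files typically loads faster than a single large file.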

### Continuous data ingestion
<a name="proof-of-concept-load-data-autocopy"></a>

[Autocopy (in preview)](https://docs.aws.amazon.com/redshift/latest/dg/loading-data-copy-job.html) is an extension of the [COPY command](https://docs.aws.amazon.com/redshift/latest/dg/t_loading-tables-from-s3.html) and automates continuous data loading from Amazon S3 buckets. When you create a copy job, Amazon Redshift detects when new Amazon S3 files are created in a specified path, and then loads them automatically without your intervention. Amazon Redshift keeps track of the loaded files to verify that they are loaded only one time. For instructions on how to create copy jobs, see [COPY JOB](r_COPY-JOB.md).

**Note**  
Autocopy is currently in preview and supported only in provisioned clusters in specific AWS Regions. To create a preview cluster for autocopy, see [Create an S3 event integration to automatically copy files from Amazon S3 buckets](loading-data-copy-job.md).
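As an illustrative sketch (the table, bucket, role, and job names here are placeholders; see the COPY JOB reference for the exact options available to you), a copy job extends an ordinary COPY statement:

```
copy public.sales_stage
from 's3://amzn-s3-demo-bucket/staging/'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
format as csv
job create sales_ingest_job
auto on;
```

With AUTO ON, Amazon Redshift monitors the specified path and loads newly arriving files without further intervention.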

### Load your streaming data
<a name="proof-of-concept-load-data-streaming"></a>

Streaming ingestion provides low-latency, high-speed ingestion of stream data from [Amazon Kinesis Data Streams](https://aws.amazon.com/kinesis/data-streams/) and [Amazon Managed Streaming for Apache Kafka](https://aws.amazon.com/msk/) into Amazon Redshift. Amazon Redshift streaming ingestion uses a materialized view, which is updated directly from the stream utilizing [auto refresh](https://docs.aws.amazon.com/redshift/latest/dg/materialized-view-refresh.html#materialized-view-auto-refresh). The materialized view maps to the stream data source. You can perform filtering and aggregations on the stream data as part of the materialized view definition. For step-by-step guidance to load data from a stream, see [Getting started with Amazon Kinesis Data Streams](https://docs.aws.amazon.com/redshift/latest/dg/materialized-view-streaming-ingestion-getting-started.html) or [Getting started with Amazon Managed Streaming for Apache Kafka](https://docs.aws.amazon.com/redshift/latest/dg/materialized-view-streaming-ingestion-getting-started-MSK.html).
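As a minimal sketch of this pattern, assuming a Kinesis data stream named `my-click-stream` and an IAM role with access to it (both placeholders), the external schema and auto-refreshing materialized view might look like the following:

```
create external schema kds
from kinesis
iam_role 'arn:aws:iam::0123456789012:role/MyStreamingRole';

-- The materialized view maps to the stream; auto refresh keeps it current.
create materialized view clickstream_mv auto refresh yes as
select approximate_arrival_timestamp,
       json_parse(kinesis_data) as payload
from kds."my-click-stream";
```

Because the view refreshes automatically, newly arriving records become queryable without a separate load step.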

## Step 4: Analyze your data
<a name="proof-of-concept-analyze"></a>

![\[Shows that the analyze step is the current step in the proof of concept flow.\]](http://docs.aws.amazon.com/redshift/latest/dg/images/poc-step4.png)


After creating your Redshift Serverless workgroup and namespace, and loading your data, you can immediately run queries by opening the **Query editor v2** from the navigation panel of the [Redshift Serverless console](https://console.aws.amazon.com//redshiftv2/home?#serverless-dashboard). You can use query editor v2 to test query functionality or query performance against your own datasets.

### Query using Amazon Redshift query editor v2
<a name="proof-of-concept-setup-analyze-query"></a>

You can access query editor v2 from the Amazon Redshift console. See [Simplify your data analysis with Amazon Redshift query editor v2](https://aws.amazon.com/blogs//big-data/simplify-your-data-analysis-with-amazon-redshift-query-editor-v2/) for a complete guide on how to configure, connect, and run queries with query editor v2.

Alternatively, if you want to run a load test as part of your POC, you can do so by following these steps to install and run Apache JMeter.

### Run a load test using Apache JMeter
<a name="proof-of-concept-setup-analyze-load-test"></a>

To perform a load test that simulates “N” users submitting queries concurrently to Amazon Redshift, you can use [Apache JMeter](https://jmeter.apache.org/), an open-source, Java-based tool.

To install and configure Apache JMeter to run against your Redshift Serverless workgroup, follow the instructions in [Automate Amazon Redshift load testing with the AWS Analytics Automation Toolkit](https://aws.amazon.com/blogs//big-data/automate-amazon-redshift-load-testing-with-the-aws-analytics-automation-toolkit/). It uses the [AWS Analytics Automation toolkit (AAA)](https://github.com/aws-samples/amazon-redshift-infrastructure-automation/tree/main), an open-source utility for dynamically deploying Redshift solutions, to automatically launch these resources. If you have loaded your own data into Amazon Redshift, be sure to perform the Customize SQL step, to make sure you supply the appropriate SQL statements you would like to test against your tables. Test each of these SQL statements one time using query editor v2 to make sure they run without errors.

After you complete customizing your SQL statements and finalizing your test plan, save and run your test plan against your Redshift Serverless workgroup. To monitor the progress of your test, open the [Redshift Serverless console](https://console.aws.amazon.com/redshiftv2/home?#serverless-query-and-database-monitoring), navigate to **Query and database monitoring**, choose the **Query history** tab and view information about your queries.

For performance metrics, choose the **Database performance** tab on the Redshift Serverless console, to monitor metrics such as **Database Connections** and **CPU utilization**. Here you can view a graph to monitor the RPU capacity used and observe how Redshift Serverless automatically scales to meet concurrent workload demands while the load test is running on your workgroup.

![\[Example graph showing average RPU capacity used.\]](http://docs.aws.amazon.com/redshift/latest/dg/images/poc-rpu-capacity-used.png)


Database connections is another useful metric to monitor while the load test runs, to see how your workgroup handles numerous concurrent connections at a given time to meet increasing workload demands.

![\[Example graph showing database connections.\]](http://docs.aws.amazon.com/redshift/latest/dg/images/poc-database-connections.png)


## Step 5: Optimize
<a name="proof-of-concept-optimize"></a>

![\[Shows that the optimize step is the current step in the proof of concept flow.\]](http://docs.aws.amazon.com/redshift/latest/dg/images/poc-step5.png)


Amazon Redshift empowers tens of thousands of users to process exabytes of data every day and power their analytics workloads by offering a variety of configurations and features to support individual use cases. When choosing between these options, customers are looking for tools that help them determine the most optimal data warehouse configuration to support their Amazon Redshift workload.

### Test drive
<a name="proof-of-concept-optimize-test-drive"></a>

You can use [Test Drive](https://github.com/aws/redshift-test-drive/tree/main) to automatically replay your existing workload on potential configurations and analyze the corresponding outputs to evaluate the optimal target to migrate your workload to. See [Find the best Amazon Redshift configuration for your workload using Redshift Test Drive](https://aws.amazon.com/blogs/big-data/find-the-best-amazon-redshift-configuration-for-your-workload-using-redshift-test-drive/) for information about using Test Drive to evaluate different Amazon Redshift configurations.

# Amazon Redshift best practices for designing tables
<a name="c_designing-tables-best-practices"></a>

As you plan your database, certain key table design decisions heavily influence overall query performance. These design choices also have a significant effect on storage requirements, which in turn affects query performance by reducing the number of I/O operations and minimizing the memory required to process queries.

In this section, you can find a summary of the most important design decisions and best practices for optimizing query performance. [Automatic table optimization](t_Creating_tables.md) provides more detailed explanations and examples of table design options.

**Topics**
+ [Choose the best sort key](c_best-practices-sort-key.md)
+ [Choose the best distribution style](c_best-practices-best-dist-key.md)
+ [Let COPY choose compression encodings](c_best-practices-use-auto-compression.md)
+ [Define primary key and foreign key constraints](c_best-practices-defining-constraints.md)
+ [Use the smallest possible column size](c_best-practices-smallest-column-size.md)
+ [Use date/time data types for date columns](c_best-practices-timestamp-date-columns.md)

# Choose the best sort key
<a name="c_best-practices-sort-key"></a>

Amazon Redshift stores your data on disk in sorted order according to the sort key. The Amazon Redshift query optimizer uses sort order when it determines optimal query plans. 

**Note**  
When you use automatic table optimization, you don't need to choose the sort key of your table. For more information, see [Automatic table optimization](t_Creating_tables.md).

Some suggestions for the best approach follow:
+ To have Amazon Redshift choose the appropriate sort order, specify `AUTO` for the sort key. 
+ If recent data is queried most frequently, specify the timestamp column as the leading column for the sort key. 

  Queries are more efficient because they can skip entire blocks that fall outside the time range.
+ If you do frequent range filtering or equality filtering on one column, specify that column as the sort key. 

   Amazon Redshift can skip reading entire blocks of data for that column. It can do so because it tracks the minimum and maximum column values stored on each block and can skip blocks that don't apply to the predicate range.
+ If you frequently join a table, specify the join column as both the sort key and the distribution key. 

  Doing this enables the query optimizer to choose a sort merge join instead of a slower hash join. Because the data is already sorted on the join key, the query optimizer can bypass the sort phase of the sort merge join.
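The first two suggestions above can be sketched as follows (table and column names are illustrative):

```
-- Let Amazon Redshift choose the sort order.
create table sales_auto (
    saleid   integer,
    saletime timestamp,
    amount   decimal(8,2)
)
sortkey auto;

-- Recent data is queried most often, so lead with the timestamp column.
create table sales (
    saleid   integer,
    saletime timestamp,
    amount   decimal(8,2)
)
sortkey (saletime);
```

With `saletime` leading the sort key, queries restricted to a recent time range can skip blocks whose min/max timestamps fall outside the range.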

# Choose the best distribution style
<a name="c_best-practices-best-dist-key"></a>

When you run a query, the query optimizer redistributes the rows to the compute nodes as needed to perform any joins and aggregations. The goal in selecting a table distribution style is to minimize the impact of the redistribution step by locating the data where it needs to be before the query is run. 

**Note**  
When you use automatic table optimization, you don't need to choose the distribution style of your table. For more information, see [Automatic table optimization](t_Creating_tables.md).

Some suggestions for the best approach follow:

1. Distribute the fact table and one dimension table on their common columns.

   Your fact table can have only one distribution key. Any tables that join on another key aren't collocated with the fact table. Choose one dimension to collocate based on how frequently it is joined and the size of the joining rows. Designate both the dimension table's primary key and the fact table's corresponding foreign key as the DISTKEY. 

1. Choose the largest dimension based on the size of the filtered dataset. 

   Only the rows that are used in the join must be distributed, so consider the size of the dataset after filtering, not the size of the table. 

1. Choose a column with high cardinality in the filtered result set. 

   If you distribute a sales table on a date column, for example, you should probably get fairly even data distribution, unless most of your sales are seasonal. However, if you commonly use a range-restricted predicate to filter for a narrow date period, most of the filtered rows occur on a limited set of slices and the query workload is skewed. 

1. Change some dimension tables to use ALL distribution.

   If a dimension table cannot be collocated with the fact table or other important joining tables, you can improve query performance significantly by distributing the entire table to all of the nodes. Using ALL distribution multiplies storage space requirements and increases load times and maintenance operations, so you should weigh all factors before choosing ALL distribution.

To have Amazon Redshift choose the appropriate distribution style, specify `AUTO` for the distribution style. 
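As a sketch of these approaches (all table and column names are illustrative), a collocated fact/dimension pair and an ALL-distributed small dimension might look like the following:

```
-- Fact and dimension share a distribution key, so their join is collocated.
create table fact_sales (
    sale_id integer,
    date_id integer,
    amount  decimal(8,2)
)
distkey (date_id);

create table dim_date (
    date_id integer,
    caldate date
)
distkey (date_id);

-- A small, frequently joined dimension can instead be replicated to every node.
create table dim_region (
    region_id integer,
    name      varchar(30)
)
diststyle all;
```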

For more information about choosing distribution styles, see [Data distribution for query optimization](t_Distributing_data.md).

# Let COPY choose compression encodings
<a name="c_best-practices-use-auto-compression"></a>

You can specify compression encodings when you create a table, but in most cases, automatic compression produces the best results.

ENCODE AUTO is the default for tables. When a table is set to ENCODE AUTO, Amazon Redshift automatically manages compression encoding for all columns in the table. For more information, see [CREATE TABLE](r_CREATE_TABLE_NEW.md) and [ALTER TABLE](r_ALTER_TABLE.md).

The COPY command analyzes your data and applies compression encodings to an empty table automatically as part of the load operation. 

Automatic compression balances overall performance when choosing compression encodings. Range-restricted scans might perform poorly if sort key columns are compressed much more highly than other columns in the same query. As a result, automatic compression chooses a less efficient compression encoding to keep the sort key columns balanced with other columns.

Suppose that your table's sort key is a date or timestamp and the table uses many large varchar columns. In this case, you might get better performance by not compressing the sort key column at all. Run the [ANALYZE COMPRESSION](r_ANALYZE_COMPRESSION.md) command on the table, then use the encodings to create a new table, but leave out the compression encoding for the sort key.
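One way to apply that advice is sketched below (table and column names are illustrative, and the `az64`/`lz4` encodings stand in for whatever ANALYZE COMPRESSION actually recommends for your data):

```
-- Ask Amazon Redshift to suggest encodings for the existing table.
analyze compression sales;

-- Create the new table with the suggested encodings,
-- but keep the sort key column RAW (uncompressed).
create table sales_new (
    saletime timestamp    encode raw,
    listid   integer      encode az64,
    comment  varchar(200) encode lz4
)
sortkey (saletime);
```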

There is a performance cost for automatic compression encoding, but only if the table is empty and does not already have compression encoding. For short-lived tables and tables that you create frequently, such as staging tables, load the table once with automatic compression or run the ANALYZE COMPRESSION command. Then use those encodings to create new tables. You can add the encodings to the CREATE TABLE statement, or use CREATE TABLE LIKE to create a new table with the same encoding. 

For more information, see [Loading tables with automatic compression](c_Loading_tables_auto_compress.md).

# Define primary key and foreign key constraints
<a name="c_best-practices-defining-constraints"></a>

Define primary key and foreign key constraints between tables wherever appropriate. Even though they are informational only, the query optimizer uses those constraints to generate more efficient query plans.

Do not define primary key and foreign key constraints unless your application enforces the constraints. Amazon Redshift does not enforce unique, primary-key, and foreign-key constraints. 
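For example (an illustrative schema), informational constraints are declared with the usual SQL syntax even though Amazon Redshift doesn't enforce them:

```
create table dim_customer (
    customer_id integer primary key,
    name        varchar(100)
);

-- The foreign key is informational only; the optimizer uses it for planning,
-- but the application must guarantee referential integrity.
create table fact_orders (
    order_id    integer primary key,
    customer_id integer references dim_customer (customer_id),
    total       decimal(10,2)
);
```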

See [Table constraints](t_Defining_constraints.md) for additional information about how Amazon Redshift uses constraints.

# Use the smallest possible column size
<a name="c_best-practices-smallest-column-size"></a>

Don't make it a practice to use the maximum column size for convenience. 

Instead, consider the largest values you are likely to store in your columns and size them accordingly. For instance, a CHAR column for storing U.S. state and territory abbreviations used by the post office only needs to be CHAR(2).
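For example (an illustrative schema), each column is sized to the largest value it actually needs to hold:

```
create table customer_address (
    address_id integer,
    state      char(2),   -- two-letter postal abbreviation
    zip        char(5)    -- five-digit ZIP code
);
```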

# Use date/time data types for date columns
<a name="c_best-practices-timestamp-date-columns"></a>

Amazon Redshift stores DATE and TIMESTAMP data more efficiently than CHAR or VARCHAR, which results in better query performance. Use the DATE or TIMESTAMP data type, depending on the resolution you need, rather than a character type when storing date/time information. For more information, see [Datetime types](r_Datetime_types.md).
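As a short sketch (table and column names are illustrative), storing dates natively keeps range predicates cheap:

```
create table event_log (
    event_id   bigint,
    event_date date,
    created_at timestamp
);

-- Range predicates on native datetime columns can use block-level
-- min/max metadata, unlike the same values stored as VARCHAR.
select count(*)
from event_log
where event_date between '2025-01-01' and '2025-01-31';
```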

# Amazon Redshift best practices for loading data
<a name="c_loading-data-best-practices"></a>

Loading very large datasets can take a long time and consume a lot of computing resources. How your data is loaded can also affect query performance. This section presents best practices for loading data efficiently using COPY commands, bulk inserts, and staging tables. 

**Topics**
+ [Learn how to load data with a tutorial](c_best-practices-loading-take-loading-data-tutorial.md)
+ [Use a COPY command to load data](c_best-practices-use-copy.md)
+ [Use a single COPY command to load from multiple files](c_best-practices-single-copy-command.md)
+ [Loading data files](c_best-practices-use-multiple-files.md)
+ [Compressing your data files](c_best-practices-compress-data-files.md)
+ [Verify data files before and after a load](c_best-practices-verifying-data-files.md)
+ [Use a multi-row insert](c_best-practices-multi-row-inserts.md)
+ [Use a bulk insert](c_best-practices-bulk-inserts.md)
+ [Load data in sort key order](c_best-practices-sort-key-order.md)
+ [Load data in sequential blocks](c_best-practices-load-data-in-sequential-blocks.md)
+ [Use time-series tables](c_best-practices-time-series-tables.md)
+ [Schedule around maintenance windows](c_best-practices-avoid-maintenance.md)

# Learn how to load data with a tutorial
<a name="c_best-practices-loading-take-loading-data-tutorial"></a>

[Tutorial: Loading data from Amazon S3](tutorial-loading-data.md) walks you beginning to end through the steps to upload data to an Amazon S3 bucket and then use the COPY command to load the data into your tables. The tutorial includes help with troubleshooting load errors and compares the performance difference between loading from a single file and loading from multiple files.

# Use a COPY command to load data
<a name="c_best-practices-use-copy"></a>

The COPY command loads data in parallel from Amazon S3, Amazon EMR, Amazon DynamoDB, or multiple data sources on remote hosts. COPY loads large amounts of data much more efficiently than using INSERT statements, and stores the data more effectively as well.

 For more information about using the COPY command, see [Loading data from Amazon S3](t_Loading-data-from-S3.md) and [Loading data from an Amazon DynamoDB table](t_Loading-data-from-dynamodb.md).

# Use a single COPY command to load from multiple files
<a name="c_best-practices-single-copy-command"></a>

Amazon Redshift can automatically load in parallel from multiple compressed data files. You can specify the files to be loaded by using an Amazon S3 object prefix or by using a manifest file.

However, if you use multiple concurrent COPY commands to load one table from multiple files, Amazon Redshift is forced to perform a serialized load. This type of load is much slower and requires a VACUUM process at the end if the table has a sort column defined. For more information about using COPY to load data in parallel, see [Loading data from Amazon S3](t_Loading-data-from-S3.md).
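As a sketch (bucket, role, and manifest names are placeholders), a single COPY command can name either an object prefix or a manifest file:

```
-- One COPY command, many files: all objects under the prefix load in parallel.
copy sales
from 's3://amzn-s3-demo-bucket/sales/2025-06/'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole';

-- Alternatively, list the exact files to load in a manifest.
copy sales
from 's3://amzn-s3-demo-bucket/manifests/sales.manifest'
iam_role 'arn:aws:iam::0123456789012:role/MyRedshiftRole'
manifest;
```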

# Loading data files
<a name="c_best-practices-use-multiple-files"></a>

Source-data files come in different formats and use varying compression algorithms. When loading data with the COPY command, Amazon Redshift loads all of the files referenced by the Amazon S3 bucket prefix. (The prefix is a string of characters at the beginning of the object key name.) If the prefix refers to multiple files or files that can be split, Amazon Redshift loads the data in parallel, taking advantage of Amazon Redshift’s MPP architecture. This divides the workload among the nodes in the cluster. In contrast, when you load data from a file that can't be split, Amazon Redshift is forced to perform a serialized load, which is much slower. The following sections describe the recommended way to load different file types into Amazon Redshift, depending on their format and compression. 

## Loading data from files that can be split
<a name="c_best-practices-use-multiple-files-split"></a>

The following files can be automatically split when their data is loaded:
+ an uncompressed CSV file
+ a columnar file (Parquet/ORC)

Amazon Redshift automatically splits files of 128 MB or larger into chunks. Columnar files, specifically Parquet and ORC, aren't split if they're smaller than 128 MB. Amazon Redshift uses slices working in parallel to load the data, which provides fast load performance.

## Loading data from files that can't be split
<a name="c_best-practices-use-multiple-files-comma"></a>

File types such as JSON, or CSV compressed with algorithms that don't support splitting, such as GZIP, aren't automatically split. For these, we recommend manually splitting the data into multiple smaller files of similar size, from 1 MB to 1 GB after compression. Additionally, make the number of files a multiple of the number of slices in your cluster. For more information about how to split your data into multiple files and examples of loading data using COPY, see [Loading data from Amazon S3](https://docs.aws.amazon.com/redshift/latest/dg/t_Loading-data-from-S3.html).

# Compressing your data files
<a name="c_best-practices-compress-data-files"></a>

When you want to compress large load files, we recommend that you use gzip, lzop, bzip2, or Zstandard to compress them and split the data into multiple smaller files.

Specify the GZIP, LZOP, BZIP2, or ZSTD option with the COPY command. The following example loads the TIME table from a pipe-delimited, lzop-compressed file.

```
copy time
from 's3://amzn-s3-demo-bucket/data/timerows.lzo' 
iam_role 'arn:aws:iam::123456789012:role/MyRedshiftRole'
lzop
delimiter '|';
```

There are instances when you don't have to split uncompressed data files. For more information about splitting your data and examples of using COPY to load data, see [Loading data from Amazon S3](t_Loading-data-from-S3.md). 

# Verify data files before and after a load
<a name="c_best-practices-verifying-data-files"></a>

Before you load data from Amazon S3, first verify that your Amazon S3 bucket contains all the correct files, and only those files. For more information, see [Verifying that the correct files are present in your bucket](verifying-that-correct-files-are-present.md). 

After the load operation is complete, query the [STL\_LOAD\_COMMITS](r_STL_LOAD_COMMITS.md) system table to verify that the expected files were loaded. For more information, see [Verifying that the data loaded correctly](verifying-that-data-loaded-correctly.md). 

# Use a multi-row insert
<a name="c_best-practices-multi-row-inserts"></a>

If a COPY command is not an option and you require SQL inserts, use a multi-row insert whenever possible. Data compression is inefficient when you add data only one row or a few rows at a time.

Multi-row inserts improve performance by batching up a series of inserts. The following example inserts three rows into a four-column table using a single INSERT statement. This is still a small insert, shown simply to illustrate the syntax of a multi-row insert.

```
insert into category_stage values
(default, default, default, default),
(20, default, 'Country', default),
(21, 'Concerts', 'Rock', default);
```

For more details and examples, see [INSERT](r_INSERT_30.md). 

# Use a bulk insert
<a name="c_best-practices-bulk-inserts"></a>

Use a bulk insert operation with a SELECT clause for high-performance data insertion.

Use the [INSERT](r_INSERT_30.md) and [CREATE TABLE AS](r_CREATE_TABLE_AS.md) commands when you need to move data or a subset of data from one table into another.

For example, the following INSERT statement selects all of the rows from the CATEGORY table and inserts them into the CATEGORY\_STAGE table.

```
insert into category_stage
(select * from category);
```

The following example creates CATEGORY\_STAGE as a copy of CATEGORY and inserts all of the rows in CATEGORY into CATEGORY\_STAGE. 

```
create table category_stage as
select * from category;
```

# Load data in sort key order
<a name="c_best-practices-sort-key-order"></a>

Load your data in sort key order to avoid needing to vacuum.

If each batch of new data follows the existing rows in your table and the COPY operation is not large enough to trigger certain load optimizations, your data is properly stored in sort order, and you don't need to run a vacuum. You don't need to presort the rows in each load because COPY sorts each batch of incoming data as it loads.

For example, suppose that you load data every day based on the current day's activity. If your sort key is a timestamp column and the COPY operation is not large enough to trigger certain load optimizations, your data is stored in sort order. This order occurs because the current day's data is always appended at the end of the previous day's data. For more information, see [Load your data in sort key order](vacuum-managing-vacuum-times.md#vacuum-load-in-sort-key-order). For more information about vacuum operations, see [Vacuuming tables](https://docs.aws.amazon.com/redshift/latest/dg/t_Reclaiming_storage_space202.html).

# Load data in sequential blocks
<a name="c_best-practices-load-data-in-sequential-blocks"></a>

If you need to add a large quantity of data, load the data in sequential blocks according to sort order to eliminate the need to vacuum. 

For example, suppose that you need to load a table with events from January 2017 to December 2017. Assuming each month is in a single file, load the rows for January, then February, and so on. Your table is completely sorted when your load completes, and you don't need to run a vacuum. For more information, see [Use time-series tables](c_best-practices-time-series-tables.md).

When loading very large datasets, the space required to sort might exceed the total available space. By loading data in smaller blocks, you use much less intermediate sort space during each load. In addition, loading smaller blocks makes it easier to restart if the COPY fails and is rolled back. 
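For example, the monthly loads described above might look like the following sketch, assuming a hypothetical `events` table and one Amazon S3 prefix per month (bucket and role names are placeholders):

```
copy events
from 's3://amzn-s3-demo-bucket/events/2017-01/'
iam_role 'arn:aws:iam::123456789012:role/MyRedshiftRole'
delimiter '|';

copy events
from 's3://amzn-s3-demo-bucket/events/2017-02/'
iam_role 'arn:aws:iam::123456789012:role/MyRedshiftRole'
delimiter '|';

-- ...continue month by month through 2017-12. Each COPY appends rows
-- that follow the existing data in sort key order, so no vacuum is needed.
```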

# Use time-series tables
<a name="c_best-practices-time-series-tables"></a>

If your data has a fixed retention period, you can organize your data as a sequence of time-series tables. In such a sequence, each table is identical but contains data for different time ranges.

You can easily remove old data by running a DROP TABLE command on the corresponding tables. This approach is much faster than running a large-scale DELETE process and saves you from having to run a subsequent VACUUM process to reclaim space. To hide the fact that the data is stored in different tables, you can create a UNION ALL view. When you delete old data, redefine your UNION ALL view to remove the dropped tables. Similarly, as you load new time periods into new tables, add the new tables to the view. To signal the optimizer to skip the scan on tables that don't match the query filter, your view definition filters for the date range that corresponds to each table.

Avoid having too many tables in the UNION ALL view. Each additional table adds a small processing time to the query. Tables don't need to use the same time frame. For example, you might have tables for differing time periods, such as daily, monthly, and yearly.

If you use time-series tables with a timestamp column for the sort key, you effectively load your data in sort key order. Doing this eliminates the need to vacuum to re-sort the data. For more information, see [Load your data in sort key order](vacuum-managing-vacuum-times.md#vacuum-load-in-sort-key-order).
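A minimal sketch of the pattern, using hypothetical monthly tables; the filter in each branch matches that table's date range so the planner can skip tables that don't match the query filter:

```
create view events_all as
select * from events_2025_01
  where event_time >= '2025-01-01' and event_time < '2025-02-01'
union all
select * from events_2025_02
  where event_time >= '2025-02-01' and event_time < '2025-03-01';

-- To retire January, redefine the view without that branch, then drop
-- the table. (A view created WITH NO SCHEMA BINDING lets you drop the
-- table without redefining the view first.)
drop table events_2025_01;
```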

# Schedule around maintenance windows
<a name="c_best-practices-avoid-maintenance"></a>

If scheduled maintenance occurs while a query is running, the query is terminated and rolled back, and you need to restart it. Schedule long-running operations, such as large data loads or VACUUM operations, to avoid maintenance windows. You can also minimize the risk, and make restarts easier when they are needed, by performing data loads in smaller increments and managing the size of your VACUUM operations. For more information, see [Load data in sequential blocks](c_best-practices-load-data-in-sequential-blocks.md) and [Vacuuming tables](t_Reclaiming_storage_space202.md).

# Amazon Redshift best practices for designing queries
<a name="c_designing-queries-best-practices"></a>

To maximize query performance, follow these recommendations when creating queries:
+ Design tables according to best practices to provide a solid foundation for query performance. For more information, see [Amazon Redshift best practices for designing tables](c_designing-tables-best-practices.md).
+ Avoid using `select *`. Include only the columns you specifically need.
+ Use a [CASE conditional expression](r_CASE_function.md) to perform complex aggregations instead of selecting from the same table multiple times.
+ Don't use cross-joins unless absolutely necessary. These joins without a join condition result in the Cartesian product of two tables. Cross-joins are typically run as nested-loop joins, which are the slowest of the possible join types. 
+ Use subqueries in cases where one table in the query is used only for predicate conditions and the subquery returns a small number of rows (less than about 200). The following example uses a subquery to avoid joining the LISTING table.

  ```
  select sum(sales.qtysold)
  from sales
  where salesid in (select listid from listing where listtime > '2008-12-26');
  ```
+ Use predicates to restrict the dataset as much as possible.
+ In the predicate, use the least expensive operators that you can. [Comparison condition](r_comparison_condition.md) operators are preferable to [LIKE](r_patternmatching_condition_like.md) operators. LIKE operators are still preferable to [SIMILAR TO](pattern-matching-conditions-similar-to.md) or [POSIX operators](pattern-matching-conditions-posix.md).
+ Avoid using functions in query predicates. Using them can drive up the cost of the query by requiring large numbers of rows to resolve the intermediate steps of the query.
+ If possible, use a WHERE clause to restrict the dataset. The query planner can then use row order to help determine which records match the criteria, so it can skip scanning large numbers of disk blocks. Without this, the query execution engine must scan participating columns entirely.
+ Add predicates to filter tables that participate in joins, even if the predicates apply the same filters. The query returns the same result set, but Amazon Redshift is able to filter the join tables before the scan step and can then efficiently skip scanning blocks from those tables. Redundant filters aren't needed if you filter on a column that's used in the join condition. 

  For example, suppose that you want to join `SALES` and `LISTING` to find ticket sales for tickets listed after December, grouped by seller. Both tables are sorted by date. The following query joins the tables on their common key and filters for `listing.listtime` values greater than December 1.

  ```
  select listing.sellerid, sum(sales.qtysold)
  from sales, listing
  where sales.salesid = listing.listid
  and listing.listtime > '2008-12-01'
  group by 1 order by 1;
  ```

  The WHERE clause doesn't include a predicate for `sales.saletime`, so the execution engine is forced to scan the entire `SALES` table. If you know the filter would result in fewer rows participating in the join, then add that filter as well. The following example cuts execution time significantly.

  ```
  select listing.sellerid, sum(sales.qtysold)
  from sales, listing
  where sales.salesid = listing.listid
  and listing.listtime > '2008-12-01'
  and sales.saletime > '2008-12-01'
  group by 1 order by 1;
  ```
+ Use sort keys in the GROUP BY clause so the query planner can use more efficient aggregation. A query might qualify for one-phase aggregation when its GROUP BY list contains only sort key columns, one of which is also the distribution key. The sort key columns in the GROUP BY list must include the first sort key, then other sort keys that you want to use in sort key order. For example, it is valid to use the first sort key, the first and second sort keys, the first, second, and third sort keys, and so on. It is not valid to use the first and third sort keys.

  You can confirm the use of one-phase aggregation by running the [EXPLAIN](r_EXPLAIN.md) command and looking for `XN GroupAggregate` in the aggregation step of the query.
+ If you use both GROUP BY and ORDER BY clauses, make sure that you put the columns in the same order in both. That is, use the approach just following.

  ```
  group by a, b, c
  order by a, b, c
  ```

  Don't use the following approach.

  ```
  group by b, c, a
  order by a, b, c
  ```
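To check for one-phase aggregation, you can run EXPLAIN on a query whose GROUP BY list leads with the sort key columns. The following sketch uses hypothetical names and assumes a table with a compound sort key of (`listtime`, `sellerid`), where `listtime` is also the distribution key:

```
explain
select listtime, sellerid, count(*)
from listing_sorted
group by listtime, sellerid
order by listtime, sellerid;

-- Look for "XN GroupAggregate" in the aggregation step of the plan.
-- A GROUP BY list that skips the first sort key typically falls back
-- to a slower "XN HashAggregate" plan instead.
```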

# Follow recommendations from Amazon Redshift Advisor
<a name="advisor"></a>

To help you improve the performance and decrease the operating costs for your Amazon Redshift cluster, Amazon Redshift Advisor offers you specific recommendations about changes to make. Advisor develops its customized recommendations by analyzing performance and usage metrics for your cluster. These tailored recommendations relate to operations and cluster settings. To help you prioritize your optimizations, Advisor ranks recommendations by order of impact.

## How Advisor Works
<a name="advisor-how-it-works"></a>

Advisor bases its recommendations on observations regarding performance statistics or operations data. Advisor develops observations by running tests on your clusters/workgroups to determine if a test value is within a specified range. If the test result is outside of that range, Advisor generates an observation for your cluster. At the same time, Advisor creates a recommendation about how to bring the observed value back into the best-practice range.

For multi-cluster architectures using Amazon Redshift Data Sharing, Advisor provides enhanced optimization by analyzing workload patterns across all clusters/workgroups in your data mesh, including clusters/workgroups in different AWS Regions. When you share tables between producer and consumer clusters/workgroups, Advisor automatically collects query patterns from all consumer endpoints in the data mesh, unless they are explicitly denylisted, and combines them with producer workloads to generate more effective recommendations. This means that your table optimizations, including sort keys, distribution keys, and compression, are based on how your data is actually used across your entire organization, not just on a single cluster. Advisor also supports Amazon Redshift Serverless, automatically maintaining optimization continuity across pause and resume cycles.

For example, suppose that your data warehouse contains tables with suboptimal distribution keys that cause data skew across compute nodes. In this case, Advisor automatically recommends redistributing tables using the DISTKEY parameter to specify a column that evenly distributes data. In another example, suppose that Advisor observes that your cluster has tables without sort keys or with inefficient sort key definitions that result in poor query performance. In this case, Advisor automatically provides recommendations for appropriate sort key columns based on your query patterns to improve data filtering and reduce disk I/O.
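Recommendations like these can usually be applied in place with ALTER TABLE. The following commands are a sketch with hypothetical table and column names, not recommendations for your data:

```
-- Redistribute on a column that spreads rows evenly and is used in joins.
alter table sales alter distkey buyerid;

-- Sort on the column that queries most often filter by, to reduce disk I/O.
alter table sales alter sortkey (saletime);
```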

## Optimizing Data Sharing Architectures
<a name="advisor-data-sharing-optimization"></a>

When you use Amazon Redshift Data Sharing to distribute workloads across multiple clusters/workgroups, Advisor helps you optimize performance across your entire data mesh. Advisor automatically analyzes how shared tables are being queried across all consumer clusters/workgroups. This includes understanding which columns are frequently filtered, which tables are commonly joined together, and how data is being scanned. By considering the complete picture of data usage, Advisor generates recommendations that improve performance for all users of your shared data.

By optimizing tables based on usage patterns across your entire organization rather than a single cluster, you can:
+ Make data-driven optimization decisions based on data access patterns across all clusters/workgroups in the mesh
+ Lower storage costs through more effective compression strategies
+ Improve resource utilization across your data mesh

## Amazon Redshift Regions where Advisor is supported
<a name="advisor-regions"></a>

The Amazon Redshift Advisor feature is available only in the following AWS Regions: 
+ US East (N. Virginia) Region (us-east-1)
+ US East (Ohio) Region (us-east-2)
+ US West (N. California) Region (us-west-1)
+ US West (Oregon) Region (us-west-2) 
+ Africa (Cape Town) Region (af-south-1) 
+ Asia Pacific (Hong Kong) Region (ap-east-1)
+ Asia Pacific (Hyderabad) Region (ap-south-2)
+ Asia Pacific (Jakarta) Region (ap-southeast-3)
+ Asia Pacific (Melbourne) Region (ap-southeast-4)
+ Asia Pacific (Malaysia) Region (ap-southeast-5)
+ Asia Pacific (Mumbai) Region (ap-south-1)
+ Asia Pacific (Osaka) Region (ap-northeast-3)
+ Asia Pacific (Seoul) Region (ap-northeast-2)
+ Asia Pacific (Singapore) Region (ap-southeast-1)
+ Asia Pacific (Sydney) Region (ap-southeast-2)
+ Asia Pacific (Tokyo) Region (ap-northeast-1)
+ Canada (Central) Region (ca-central-1)
+ Canada West (Calgary) Region (ca-west-1)
+ China (Beijing) Region (cn-north-1)
+ China (Ningxia) Region (cn-northwest-1)
+ Europe (Frankfurt) Region (eu-central-1)
+ Europe (Ireland) Region (eu-west-1)
+ Europe (London) Region (eu-west-2)
+ Europe (Milan) Region (eu-south-1)
+ Europe (Paris) Region (eu-west-3)
+ Europe (Spain) Region (eu-south-2)
+ Europe (Stockholm) Region (eu-north-1)
+ Europe (Zurich) Region (eu-central-2)
+ Israel (Tel Aviv) Region (il-central-1)
+ Middle East (Bahrain) Region (me-south-1)
+ Middle East (UAE) Region (me-central-1)
+ South America (São Paulo) Region (sa-east-1)

**Topics**
+ [How Advisor Works](#advisor-how-it-works)
+ [Optimizing Data Sharing Architectures](#advisor-data-sharing-optimization)
+ [Amazon Redshift Regions where Advisor is supported](#advisor-regions)
+ [Viewing Amazon Redshift Advisor recommendations](access-advisor.md)
+ [Amazon Redshift Advisor recommendations](advisor-recommendations.md)

# Viewing Amazon Redshift Advisor recommendations
<a name="access-advisor"></a>

You can access Amazon Redshift Advisor recommendations using the Amazon Redshift console, the Amazon Redshift API, or the AWS CLI. To access recommendations, you must have the `redshift:ListRecommendations` permission attached to your IAM role or identity.

## Viewing Amazon Redshift Advisor recommendations on the Amazon Redshift provisioned console
<a name="access-advisor-console"></a>

You can view Amazon Redshift Advisor recommendations on the AWS Management Console. 

**To view Amazon Redshift Advisor recommendations for Amazon Redshift clusters on the console**

1. Sign in to the AWS Management Console and open the Amazon Redshift console at [https://console.aws.amazon.com/redshiftv2/](https://console.aws.amazon.com/redshiftv2/).

1. On the navigation menu, choose **Advisor**.

1. Expand each recommendation to see more details. On this page, you can sort and group recommendations. 

## Viewing Amazon Redshift Advisor recommendations using Amazon Redshift API operations
<a name="access-advisor-api"></a>

You can list Amazon Redshift Advisor recommendations for Amazon Redshift clusters using the Amazon Redshift API. Typically, you develop an application in the programming language of your choice to call the `redshift:ListRecommendations` API using an AWS SDK. For more information, see [ListRecommendations](https://docs.aws.amazon.com/redshift/latest/APIReference/API_ListRecommendations.html) in the *Amazon Redshift API Reference*.

## Viewing Amazon Redshift Advisor recommendations using AWS Command Line Interface operations
<a name="access-advisor-cli"></a>

You can list Amazon Redshift Advisor recommendations for Amazon Redshift clusters using the AWS Command Line Interface. For more information, see [list-recommendations](https://docs.aws.amazon.com/cli/latest/reference/redshift/list-recommendations.html) in the *AWS CLI Command Reference*.

# Amazon Redshift Advisor recommendations
<a name="advisor-recommendations"></a>

Amazon Redshift Advisor offers recommendations about how to optimize your Amazon Redshift cluster to increase performance and save on operating costs. You can find explanations for each recommendation in the console, as described preceding. You can find further details on these recommendations in the following sections. 

**Topics**
+ [Compress Amazon S3 file objects loaded by COPY](#cluster-compress-s3-recommendation)
+ [Isolate multiple active databases](#isolate-active-dbs-recommendation)
+ [Reallocate workload management (WLM) memory](#reallocate-wlm-recommendation)
+ [Skip compression analysis during COPY](#skip-compression-analysis-recommendation)
+ [Split Amazon S3 objects loaded by COPY](#split-s3-objects-recommendation)
+ [Update table statistics](#update-table-statistics-recommendation)
+ [Enable short query acceleration](#enable-sqa-recommendation)
+ [Alter distribution keys on tables](#alter-diststyle-distkey-recommendation)
+ [Alter sort keys on tables](#alter-sortkey-recommendation)
+ [Alter compression encodings on columns](#alter-compression-encoding-recommendation)
+ [Data type recommendations](#data-type-recommendation)

## Compress Amazon S3 file objects loaded by COPY
<a name="cluster-compress-s3-recommendation"></a>

The COPY command takes advantage of the massively parallel processing (MPP) architecture in Amazon Redshift to read and load data in parallel. It can read files from Amazon S3, DynamoDB tables, and text output from one or more remote hosts. 

When loading large amounts of data, we strongly recommend using the COPY command to load compressed data files from S3. Compressing large datasets saves time uploading the files to Amazon S3. COPY can also speed up the load process by uncompressing the files as they are read. 

**Analysis**

Long-running COPY commands that load large uncompressed datasets often have an opportunity for considerable performance improvement. The Advisor analysis identifies COPY commands that load large uncompressed datasets. In such a case, Advisor generates a recommendation to implement compression on the source files in Amazon S3. 

**Recommendation**

Ensure that each COPY that loads a significant amount of data, or runs for a significant duration, ingests compressed data objects from Amazon S3. You can identify the COPY commands that load large uncompressed datasets from Amazon S3 by running the following SQL command as a superuser.

```
SELECT
    wq.userid, query, exec_start_time AS starttime, COUNT(*) num_files,
    ROUND(MAX(wq.total_exec_time/1000000.0),2) execution_secs,
    ROUND(SUM(transfer_size)/(1024.0*1024.0),2) total_mb,
    SUBSTRING(querytxt,1,60) copy_sql
FROM stl_s3client s
JOIN stl_query q USING (query)
JOIN stl_wlm_query wq USING (query)
WHERE s.userid>1 AND http_method = 'GET'
    AND POSITION('COPY ANALYZE' IN querytxt) = 0
    AND aborted = 0 AND final_state='Completed'
GROUP BY 1, 2, 3, 7
HAVING SUM(transfer_size) = SUM(data_size) 
AND SUM(transfer_size)/(1024*1024) >= 5
ORDER BY 6 DESC, 5 DESC;
```

If the staged data remains in Amazon S3 after you load it, which is common in data lake architectures, storing this data in a compressed form can reduce your storage costs. 

**Implementation tips**
+ The ideal object size is 1–128 MB after compression.
+ You can compress files with gzip, lzop, or bzip2 format.
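Following the earlier lzop example, a COPY of gzip-compressed, pipe-delimited file parts might look like the following sketch (bucket, prefix, and role names are placeholders):

```
copy sales
from 's3://amzn-s3-demo-bucket/sales/part_'
iam_role 'arn:aws:iam::123456789012:role/MyRedshiftRole'
gzip
delimiter '|';
```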

## Isolate multiple active databases
<a name="isolate-active-dbs-recommendation"></a>

As a best practice, we recommend isolating databases in Amazon Redshift from one another. Queries run in a specific database and can't access data from any other database on the cluster. However, the queries that you run in all databases of a cluster share the same underlying cluster storage space and compute resources. When a single cluster contains multiple active databases, their workloads are usually unrelated.

**Analysis**

The Advisor analysis reviews all of the databases on the cluster for active workloads running at the same time. If it finds concurrent active workloads, Advisor generates a recommendation to consider migrating databases to separate Amazon Redshift clusters.

**Recommendation**

Consider moving each actively queried database to a separate dedicated cluster. Using a separate cluster can reduce resource contention and improve query performance. It can do so because it allows you to set the size for each cluster for the storage, cost, and performance needs of each workload. Also, unrelated workloads often benefit from different workload management configurations.

To identify which databases are actively used, you can run this SQL command as a superuser.

```
SELECT database,
  COUNT(*) as num_queries,
  AVG(DATEDIFF(sec,starttime,endtime)) avg_duration,
  MIN(starttime) as oldest_ts,
  MAX(endtime) as latest_ts
FROM stl_query
WHERE userid > 1
GROUP BY database;
```

**Implementation tips**
+ Because a user must connect to each database specifically, and queries can only access a single database, moving databases to separate clusters has minimal impact for users.
+ One option to move a database is to take the following steps: 

  1. Temporarily restore a snapshot of the current cluster to a cluster of the same size.

  1. Delete all databases from the new cluster except the target database to be moved.

  1. Resize the cluster to an appropriate node type and count for the database's workload.

## Reallocate workload management (WLM) memory
<a name="reallocate-wlm-recommendation"></a>

Amazon Redshift routes user queries to query queues for processing; for more information, see [Implementing manual WLM](cm-c-defining-query-queues.md). Workload management (WLM) defines how those queries are routed to the queues. Amazon Redshift allocates each queue a portion of the cluster's available memory. A queue's memory is divided among the queue's query slots. 

When a queue is configured with more slots than the workload requires, the memory allocated to these unused slots goes underutilized. Reducing the configured slots to match the peak workload requirements redistributes the underutilized memory to active slots, and can result in improved query performance. 

**Analysis**

The Advisor analysis reviews workload concurrency requirements to identify query queues with unused slots. Advisor generates a recommendation to reduce the number of slots in a queue when it finds the following:
+ A queue with slots that are completely inactive throughout the analysis.
+ A queue with more than four slots that had at least two inactive slots throughout the analysis.

**Recommendation**

Reducing the configured slots to match peak workload requirements redistributes underutilized memory to active slots. Consider reducing the configured slot count for queues where the slots have never been fully used. To identify these queues, you can compare the peak hourly slot requirements for each queue by running the following SQL command as a superuser.

```
WITH
 generate_dt_series AS (select sysdate - (n * interval '5 second') as dt from (select row_number() over () as n from stl_scan limit 17280)),
 apex AS (
     SELECT iq.dt, iq.service_class, iq.num_query_tasks, count(iq.slot_count) as service_class_queries, sum(iq.slot_count) as service_class_slots
     FROM
         (select gds.dt, wq.service_class, wscc.num_query_tasks, wq.slot_count
         FROM stl_wlm_query wq
         JOIN stv_wlm_service_class_config wscc ON (wscc.service_class = wq.service_class AND wscc.service_class > 5)
         JOIN generate_dt_series gds ON (wq.service_class_start_time <= gds.dt AND wq.service_class_end_time > gds.dt)
         WHERE wq.userid > 1 AND wq.service_class > 5) iq
     GROUP BY iq.dt, iq.service_class, iq.num_query_tasks),
     maxes as (SELECT apex.service_class, trunc(apex.dt) as d, date_part(h,apex.dt) as dt_h, max(service_class_slots) max_service_class_slots
                     from apex group by apex.service_class, apex.dt, date_part(h,apex.dt))
SELECT apex.service_class - 5 AS queue, apex.service_class, apex.num_query_tasks AS max_wlm_concurrency, maxes.d AS day, maxes.dt_h || ':00 - ' || maxes.dt_h || ':59' as hour, MAX(apex.service_class_slots) as max_service_class_slots
FROM apex
JOIN maxes ON (apex.service_class = maxes.service_class AND apex.service_class_slots = maxes.max_service_class_slots)
GROUP BY  apex.service_class, apex.num_query_tasks, maxes.d, maxes.dt_h
ORDER BY apex.service_class, maxes.d, maxes.dt_h;
```

The `max_service_class_slots` column represents the maximum number of WLM query slots in the query queue for that hour. If underutilized queues exist, implement the slot reduction optimization by [modifying a parameter group](https://docs.aws.amazon.com/redshift/latest/mgmt/managing-parameter-groups-console.html#parameter-group-modify), as described in the *Amazon Redshift Management Guide.*

**Implementation tips**
+ If your workload is highly variable in volume, make sure that the analysis captured a peak utilization period. If it didn't, run the preceding SQL repeatedly to monitor peak concurrency requirements. 
+ For more details on interpreting the query results from the preceding SQL code, see the [wlm\_apex\_hourly.sql script](https://github.com/awslabs/amazon-redshift-utils/blob/master/src/AdminScripts/wlm_apex_hourly.sql) on GitHub.

## Skip compression analysis during COPY
<a name="skip-compression-analysis-recommendation"></a>

When you load data into an empty table that has no compression encodings declared, the COPY command applies storage compression automatically. This optimization ensures that data in your cluster is stored efficiently even when loaded by end users. The analysis required to apply compression can require significant time.

**Analysis**

The Advisor analysis checks for COPY operations that were delayed by automatic compression analysis. The analysis determines the compression encodings by sampling the data while it's being loaded. This sampling is similar to that performed by the [ANALYZE COMPRESSION](r_ANALYZE_COMPRESSION.md) command. 

When you load data as part of a structured process, such as in an overnight extract, transform, load (ETL) batch, you can define the compression beforehand. You can also optimize your table definitions to skip this phase permanently without any negative impacts.

**Recommendation**

To improve COPY responsiveness by skipping the compression analysis phase, implement either of the following two options:
+ Use the column `ENCODE` parameter when creating any tables that you load using the COPY command.
+ Turn off compression altogether by supplying the `COMPUPDATE OFF` parameter in the COPY command.

The best solution is generally to use column encoding during table creation, because this approach also maintains the benefit of storing compressed data on disk. You can use the ANALYZE COMPRESSION command to suggest compression encodings, but you must recreate the table to apply these encodings. To automate this process, you can use the AWS [ColumnEncodingUtility](https://github.com/awslabs/amazon-redshift-utils/tree/master/src/ColumnEncodingUtility), found on GitHub. 
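The two options might look like the following sketch; the table, columns, and encoding choices are illustrative only, not recommendations for your data:

```
-- Option 1: declare encodings at creation time. COPY skips its
-- compression analysis, and the data is still stored compressed.
create table stage_events (
    event_id   bigint      encode az64,
    event_name varchar(64) encode zstd,
    event_time timestamp   encode raw sortkey
);

-- Option 2: skip compression analysis entirely, for example on a
-- short-lived staging table.
copy stage_events
from 's3://amzn-s3-demo-bucket/events/'
iam_role 'arn:aws:iam::123456789012:role/MyRedshiftRole'
delimiter '|'
compupdate off;
```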

To identify recent COPY operations that triggered automatic compression analysis, run the following SQL command.

```
  WITH xids AS (
    SELECT xid FROM stl_query WHERE userid>1 AND aborted=0 
    AND querytxt = 'analyze compression phase 1' GROUP BY xid
    INTERSECT SELECT xid FROM stl_commit_stats WHERE node=-1)
SELECT a.userid, a.query, a.xid, a.starttime, b.complyze_sec, 
    a.copy_sec, a.copy_sql
FROM (SELECT q.userid, q.query, q.xid, date_trunc('s',q.starttime) 
    starttime, substring(querytxt,1,100) as copy_sql, 
    ROUND(datediff(ms,starttime,endtime)::numeric / 1000.0, 2) copy_sec
    FROM stl_query q JOIN xids USING (xid)
    WHERE (querytxt ilike 'copy %from%' OR querytxt ilike '% copy %from%') 
    AND querytxt not like 'COPY ANALYZE %') a
LEFT JOIN (SELECT xid, 
    ROUND(sum(datediff(ms,starttime,endtime))::numeric / 1000.0,2) complyze_sec 
    FROM stl_query q JOIN xids USING (xid)
    WHERE (querytxt like 'COPY ANALYZE %' 
    OR querytxt like 'analyze compression phase %') 
    GROUP BY xid ) b ON a.xid = b.xid
WHERE b.complyze_sec IS NOT NULL ORDER BY a.copy_sql, a.starttime;
```

**Implementation tips**
+ Ensure that all tables of significant size created during your ETL processes (for example, staging tables and temporary tables) declare a compression encoding for all columns except the first sort key.
+ Estimate the expected lifetime size of the table being loaded for each of the COPY commands identified by the SQL command preceding. If you are confident that the table will remain extremely small, turn off compression altogether with the `COMPUPDATE OFF` parameter. Otherwise, create the table with explicit compression before loading it with the COPY command.

## Split Amazon S3 objects loaded by COPY
<a name="split-s3-objects-recommendation"></a>

The COPY command takes advantage of the massively parallel processing (MPP) architecture in Amazon Redshift to read and load data from files on Amazon S3. The COPY command loads the data in parallel from multiple files, dividing the workload among the nodes in your cluster. To achieve optimal throughput, we strongly recommend that you divide your data into multiple files to take advantage of parallel processing. 

**Analysis**

The Advisor analysis identifies COPY commands that load large datasets contained in a small number of files staged in Amazon S3. Long-running COPY commands that load large datasets from a few files often have an opportunity for considerable performance improvement. When Advisor identifies that these COPY commands are taking a significant amount of time, it creates a recommendation to increase parallelism by splitting the data into additional files in Amazon S3. 

**Recommendation**

In this case, we recommend the following actions, listed in priority order:

1. Optimize COPY commands that load fewer files than the number of cluster nodes.

1. Optimize COPY commands that load fewer files than the number of cluster slices. 

1. Optimize COPY commands where the number of files is not a multiple of the number of cluster slices.

Certain COPY commands load a significant amount of data or run for a significant duration. For these commands, we recommend that you load a number of data objects from Amazon S3 that is a multiple of the number of slices in the cluster. To identify how many S3 objects each COPY command has loaded, run the following SQL code as a superuser. 

```
SELECT
    query, COUNT(*) num_files,
    ROUND(MAX(wq.total_exec_time/1000000.0),2) execution_secs,
    ROUND(SUM(transfer_size)/(1024.0*1024.0),2) total_mb,
    SUBSTRING(querytxt,1,60) copy_sql
FROM stl_s3client s
JOIN stl_query q USING (query)
JOIN stl_wlm_query wq USING (query)
WHERE s.userid>1 AND http_method = 'GET'
    AND POSITION('COPY ANALYZE' IN querytxt) = 0
    AND aborted = 0 AND final_state='Completed'
GROUP BY query, querytxt
HAVING (SUM(transfer_size)/(1024*1024))/COUNT(*) >= 2
ORDER BY CASE
WHEN COUNT(*) < (SELECT max(node)+1 FROM stv_slices) THEN 1
WHEN COUNT(*) < (SELECT COUNT(*) FROM stv_slices WHERE node=0) THEN 2
ELSE 2+((COUNT(*) % (SELECT COUNT(*) FROM stv_slices))/(SELECT COUNT(*)::DECIMAL FROM stv_slices))
END, (SUM(transfer_size)/(1024.0*1024.0))/COUNT(*) DESC;
```

**Implementation tips**
+ The number of slices in a node depends on the node size of the cluster. For more information about the number of slices in the various node types, see [Clusters and Nodes in Amazon Redshift](https://docs.aws.amazon.com/redshift/latest/mgmt/working-with-clusters.html#rs-about-clusters-and-nodes) in the *Amazon Redshift Management Guide.* 
+ You can load multiple files by specifying a common prefix, or prefix key, for the set, or by explicitly listing the files in a manifest file. For more information about loading files, see [Loading data from compressed and uncompressed files](t_splitting-data-files.md).
+ Amazon Redshift doesn't take file size into account when dividing the workload. Split your load data files so that the files are about equal size, between 1 MB and 1 GB after compression. 
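One way to control exactly which files COPY loads, and in how many pieces, is a manifest file. The following sketch (bucket and file names are hypothetical) lists four compressed, similarly sized files, which a cluster with four slices could load fully in parallel.

```
{
  "entries": [
    {"url": "s3://amzn-s3-demo-bucket/orders/part-0000.gz", "mandatory": true},
    {"url": "s3://amzn-s3-demo-bucket/orders/part-0001.gz", "mandatory": true},
    {"url": "s3://amzn-s3-demo-bucket/orders/part-0002.gz", "mandatory": true},
    {"url": "s3://amzn-s3-demo-bucket/orders/part-0003.gz", "mandatory": true}
  ]
}
```

The COPY command then references the manifest with the `MANIFEST` parameter, for example `COPY orders FROM 's3://amzn-s3-demo-bucket/orders.manifest' IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole' GZIP MANIFEST;`.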

## Update table statistics
<a name="update-table-statistics-recommendation"></a>

Amazon Redshift uses a cost-based query optimizer to choose the optimum execution plan for queries. The cost estimates are based on table statistics gathered using the ANALYZE command. When statistics are out of date or missing, the database might choose a less efficient plan for query execution, especially for complex queries. Maintaining current statistics helps complex queries run in the shortest possible time. 

**Analysis**

The Advisor analysis tracks tables whose statistics are out-of-date or missing. It reviews table access metadata associated with complex queries. If tables that are frequently accessed with complex patterns are missing statistics, Advisor creates a **critical** recommendation to run ANALYZE. If tables that are frequently accessed with complex patterns have out-of-date statistics, Advisor creates a **suggested** recommendation to run ANALYZE.

**Recommendation**

Whenever table content changes significantly, update statistics with ANALYZE. We recommend running ANALYZE whenever a significant number of new data rows are loaded into an existing table with COPY or INSERT commands. We also recommend running ANALYZE whenever a significant number of rows are modified using UPDATE or DELETE commands.

To identify tables with missing or out-of-date statistics, run the following SQL command as a superuser. The results are ordered from largest to smallest table.

```
SELECT
   ti.schema||'.'||ti."table" tablename,
   ti.size table_size_mb,
   ti.stats_off statistics_accuracy
 FROM svv_table_info ti
 WHERE ti.stats_off > 5.00
 ORDER BY ti.size DESC;
```

**Implementation tips**

The default ANALYZE threshold is 10 percent. This default means that the ANALYZE command skips a given table if fewer than 10 percent of the table's rows have changed since the last ANALYZE. As a result, you might choose to issue ANALYZE commands at the end of each ETL process. Taking this approach means that ANALYZE is often skipped but also ensures that ANALYZE runs when needed.
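For example, an ETL job might end with an unconditional ANALYZE (the table name here is hypothetical); the threshold, which you can adjust per session with `analyze_threshold_percent`, causes the command to be skipped when little has changed.

```
-- Skipped automatically when fewer than 10 percent of rows have changed
ANALYZE sales;

-- Optionally tighten the skip threshold for this session
SET analyze_threshold_percent TO 5;
ANALYZE sales;
```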

ANALYZE statistics have the most impact for columns that are used in joins (for example, `JOIN tbl_a ON tbl_a.col_b = tbl_b.col_b`) or as predicates (for example, `WHERE col_b = 'xyz'`). By default, ANALYZE collects statistics for all columns in the table specified. If needed, you can reduce the time required to run ANALYZE by running it only for the columns where it has the most impact. You can run the following SQL command to identify columns used as predicates. You can also let Amazon Redshift choose which columns to analyze by specifying `ANALYZE PREDICATE COLUMNS`. 

```
WITH predicate_column_info as (
SELECT ns.nspname AS schema_name, c.relname AS table_name, a.attnum as col_num,  a.attname as col_name,
        CASE
            WHEN 10002 = s.stakind1 THEN array_to_string(stavalues1, '||') 
            WHEN 10002 = s.stakind2 THEN array_to_string(stavalues2, '||')
            WHEN 10002 = s.stakind3 THEN array_to_string(stavalues3, '||')
            WHEN 10002 = s.stakind4 THEN array_to_string(stavalues4, '||')
            ELSE NULL::varchar
        END AS pred_ts
   FROM pg_statistic s
   JOIN pg_class c ON c.oid = s.starelid
   JOIN pg_namespace ns ON c.relnamespace = ns.oid
   JOIN pg_attribute a ON c.oid = a.attrelid AND a.attnum = s.staattnum)
SELECT schema_name, table_name, col_num, col_name,
       pred_ts NOT LIKE '2000-01-01%' AS is_predicate,
       CASE WHEN pred_ts NOT LIKE '2000-01-01%' THEN (split_part(pred_ts, '||',1))::timestamp ELSE NULL::timestamp END as first_predicate_use,
       CASE WHEN pred_ts NOT LIKE '%||2000-01-01%' THEN (split_part(pred_ts, '||',2))::timestamp ELSE NULL::timestamp END as last_analyze
FROM predicate_column_info;
```

For more information, see [Analyzing tables](t_Analyzing_tables.md).

## Enable short query acceleration
<a name="enable-sqa-recommendation"></a>

Short query acceleration (SQA) prioritizes selected short-running queries ahead of longer-running queries. SQA runs short-running queries in a dedicated space, so that SQA queries aren't forced to wait in queues behind longer queries. SQA only prioritizes queries that are short-running and are in a user-defined queue. With SQA, short-running queries begin running more quickly and users see results sooner. 

If you turn on SQA, you can reduce or eliminate workload management (WLM) queues that are dedicated to running short queries. In addition, long-running queries don't need to contend with short queries for slots in a queue, so you can configure your WLM queues to use fewer query slots. When you use lower concurrency, query throughput is increased and overall system performance is improved for most workloads. For more information, see [Short query acceleration](wlm-short-query-acceleration.md). 

**Analysis**

Advisor checks for workload patterns and reports the number of recent queries where SQA would reduce latency and the daily queue time for SQA-eligible queries.

**Recommendation**

Modify the WLM configuration to turn on SQA. Amazon Redshift uses a machine learning algorithm to analyze each eligible query. Predictions improve as SQA learns from your query patterns. For more information, see [Configuring Workload Management](https://docs.aws.amazon.com/redshift/latest/mgmt/workload-mgmt-config.html). 

When you turn on SQA, WLM sets the maximum runtime for short queries to dynamic by default. We recommend keeping the dynamic setting for SQA maximum runtime. 
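If you manage WLM through a cluster parameter group rather than the console, SQA appears as a `short_query_queue` element in the `wlm_json_configuration` parameter value. The following sketch shows a minimal manual WLM configuration with SQA turned on; the queue settings are illustrative only.

```
[
  {
    "query_group": [],
    "user_group": [],
    "query_concurrency": 5
  },
  {
    "short_query_queue": true
  }
]
```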

**Implementation tips**

To check whether SQA is turned on, run the following query. If the query returns a row, then SQA is turned on.

```
select * from stv_wlm_service_class_config 
where service_class = 14;
```

For more information, see [Monitoring SQA](wlm-short-query-acceleration.md#wlm-monitoring-sqa). 

## Alter distribution keys on tables
<a name="alter-diststyle-distkey-recommendation"></a>

Amazon Redshift distributes table rows throughout the cluster according to the table distribution style. Tables with KEY distribution require a column as the distribution key (DISTKEY). A table row is assigned to a node slice of a cluster based on its DISTKEY column value. 

An appropriate DISTKEY places a similar number of rows on each node slice and is frequently referenced in join conditions. An optimized join occurs when tables are joined on their DISTKEY columns, accelerating query performance.

**Analysis**

Advisor analyzes your cluster’s workload to identify the most appropriate distribution key for the tables that can significantly benefit from a KEY distribution style. 

**Recommendation**

Advisor provides [ALTER TABLE](r_ALTER_TABLE.md) statements that alter the DISTSTYLE and DISTKEY of a table based on its analysis. To realize a significant performance benefit, make sure to implement all SQL statements within a recommendation group. 

Redistributing a large table with ALTER TABLE consumes cluster resources and requires temporary table locks at various times. Implement each recommendation group when other cluster workload is light. For more details on optimizing table distribution properties, see the [Amazon Redshift Engineering's Advanced Table Design Playbook: Distribution Styles and Distribution Keys](https://aws.amazon.com/blogs/big-data/amazon-redshift-engineerings-advanced-table-design-playbook-distribution-styles-and-distribution-keys/). 

For more information about ALTER DISTSTYLE and DISTKEY, see [ALTER TABLE](r_ALTER_TABLE.md). 
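A typical Advisor recommendation group pairs a distribution-style change with the related tables it joins against, along the lines of the following sketch (table and column names are hypothetical):

```
ALTER TABLE sales     ALTER DISTSTYLE KEY DISTKEY customer_id;
ALTER TABLE customers ALTER DISTSTYLE KEY DISTKEY customer_id;
```

Altering both tables to the same DISTKEY is what enables the collocated join; applying only one of the statements forgoes most of the benefit, which is why the full recommendation group should be implemented together.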

**Note**  
If you don't see a recommendation for a table, that doesn't necessarily mean that the current distribution styles are the most appropriate. Advisor doesn't provide recommendations when there isn't enough data or the expected benefit of redistribution is small.   
Advisor recommendations apply to a particular table and don't necessarily apply to a table that contains a column with the same name. Tables that share a column name can have different characteristics for those columns unless data inside the tables is the same.   
If you see recommendations for staging tables that are created or dropped by ETL jobs, modify your ETL processes to use the Advisor recommended distribution keys. 

## Alter sort keys on tables
<a name="alter-sortkey-recommendation"></a>

Amazon Redshift sorts table rows according to the values of the table's [sort key](t_Sorting_data.md) columns. 

Sorting a table on an appropriate sort key can accelerate performance of queries, especially those with range-restricted predicates, by requiring fewer table blocks to be read from disk. 

**Analysis**

Advisor analyzes your cluster's workload over several days to identify a beneficial sort key for your tables. 

**Recommendation**

 Advisor provides two groups of ALTER TABLE statements that alter the sort key of a table based on its analysis: 
+ Statements that alter a table that currently doesn't have a sort key to add a COMPOUND sort key.
+ Statements that alter a sort key from INTERLEAVED to COMPOUND or no sort key.

  Using compound sort keys significantly reduces maintenance overhead. Tables with compound sort keys don't need the expensive VACUUM REINDEX operations that are necessary for interleaved sorts. In practice, compound sort keys are more effective than interleaved sort keys for the vast majority of Amazon Redshift workloads. However, if a table is small, it's more efficient not to have a sort key to avoid sort key storage overhead.
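The two groups of statements take forms along the lines of the following sketch (table and column names are hypothetical):

```
-- Add a compound sort key to a table that has none
ALTER TABLE sales ALTER COMPOUND SORTKEY (sold_date, region);

-- Replace an interleaved sort key with no sort key on a small table
ALTER TABLE dim_region ALTER SORTKEY NONE;
```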

Sorting a large table with ALTER TABLE consumes cluster resources and requires table locks at various times. Implement each recommendation when the cluster's workload is light. For more details on optimizing table sort key configurations, see the [Amazon Redshift Engineering's Advanced Table Design Playbook: Compound and Interleaved Sort Keys](https://aws.amazon.com/blogs/big-data/amazon-redshift-engineerings-advanced-table-design-playbook-compound-and-interleaved-sort-keys/). 

For more information about ALTER SORTKEY, see [ALTER TABLE](r_ALTER_TABLE.md). 

**Note**  
If you don't see a recommendation for a table, that doesn't necessarily mean that the current configuration is the best. Advisor doesn't provide recommendations when there isn't enough data or the expected benefit of sorting is small.   
Advisor recommendations apply to a particular table and don’t necessarily apply to a table that contains a column with the same name and data type. Tables that share column names can have different recommendations based on the data in the tables and the workload. 

## Alter compression encodings on columns
<a name="alter-compression-encoding-recommendation"></a>

Compression is a column-level operation that reduces the size of data when it's stored. Compression is used in Amazon Redshift to conserve storage space and improve query performance by reducing the amount of disk I/O. We recommend an optimal compression encoding for each column based on its data type and on query patterns. With optimal compression, queries can run more efficiently and the database can take up minimal storage space. 

**Analysis**

Advisor performs analysis of your cluster's workload and database schema continually to identify the optimal compression encoding for each table column.

**Recommendation**

Advisor provides ALTER TABLE statements that change the compression encoding of particular columns, based on its analysis. 

Changing column compression encodings with [ALTER TABLE](r_ALTER_TABLE.md) consumes cluster resources and requires table locks at various times. It's best to implement recommendations when the cluster workload is light. 

For reference, [ALTER TABLE examples](r_ALTER_TABLE_examples_basic.md) shows several statements that change the encoding for a column.
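For instance, a recommendation to change a column's encoding takes the form of the following sketch (the table, column, and encoding are hypothetical):

```
ALTER TABLE sales ALTER COLUMN status ENCODE zstd;
```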

**Note**  
Advisor doesn't provide recommendations when there isn't enough data or the expected benefit of changing the encoding is small.

## Data type recommendations
<a name="data-type-recommendation"></a>

Amazon Redshift has a library of SQL data types for various use cases. These include integer types like `INT` and character types like `VARCHAR`. Amazon Redshift stores each type in an optimized way to provide fast access and good query performance. In addition, Amazon Redshift provides functions for specific types that you can use to format query results or perform calculations on them. 

**Analysis**

Advisor performs analysis of your cluster's workload and database schema continually to identify columns that can benefit significantly from a data type change.

**Recommendation**

Advisor provides an `ALTER TABLE` statement that adds a new column with the suggested data type. An accompanying `UPDATE` statement copies data from the existing column to the new column. After you create the new column and load the data, change your queries and ingestion scripts to access the new column. Then you can use the features and functions specialized for the new data type, found in [SQL functions reference](c_SQL_functions.md). 

Copying existing data to the new column can take time. We recommend that you implement each Advisor recommendation when the cluster's workload is light. For the list of available data types, see [Data types](c_Supported_data_types.md). 
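Put together, a recommendation of this kind resembles the following sketch; the table and column names, and the suggested type, are hypothetical.

```
-- Add a column with the suggested data type
ALTER TABLE events ADD COLUMN event_ts_new TIMESTAMP;

-- Copy data from the existing column into the new one
UPDATE events SET event_ts_new = event_ts::timestamp;
```

After the backfill completes and queries are repointed at the new column, the old column can be dropped with `ALTER TABLE events DROP COLUMN event_ts;`.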

**Note**  
Advisor doesn't provide recommendations when there isn't enough data or the expected benefit of changing the data type is small. 