

Amazon Redshift will no longer support the creation of new Python UDFs starting with Patch 198. Existing Python UDFs will continue to function until June 30, 2026. For more information, see the [blog post](https://aws.amazon.com/blogs/big-data/amazon-redshift-python-user-defined-functions-will-reach-end-of-support-after-june-30-2026/).

# Amazon Redshift best practices for loading data
<a name="c_loading-data-best-practices"></a>

Loading very large datasets can take a long time and consume a lot of computing resources. How your data is loaded can also affect query performance. This section presents best practices for loading data efficiently using COPY commands, bulk inserts, and staging tables. 

**Topics**
+ [Learn how to load data with a tutorial](c_best-practices-loading-take-loading-data-tutorial.md)
+ [Use a COPY command to load data](c_best-practices-use-copy.md)
+ [Use a single COPY command to load from multiple files](c_best-practices-single-copy-command.md)
+ [Loading data files](c_best-practices-use-multiple-files.md)
+ [Compressing your data files](c_best-practices-compress-data-files.md)
+ [Verify data files before and after a load](c_best-practices-verifying-data-files.md)
+ [Use a multi-row insert](c_best-practices-multi-row-inserts.md)
+ [Use a bulk insert](c_best-practices-bulk-inserts.md)
+ [Load data in sort key order](c_best-practices-sort-key-order.md)
+ [Load data in sequential blocks](c_best-practices-load-data-in-sequential-blocks.md)
+ [Use time-series tables](c_best-practices-time-series-tables.md)
+ [Schedule around maintenance windows](c_best-practices-avoid-maintenance.md)

# Learn how to load data with a tutorial
<a name="c_best-practices-loading-take-loading-data-tutorial"></a>

[Tutorial: Loading data from Amazon S3](tutorial-loading-data.md) walks you step by step through uploading data to an Amazon S3 bucket and then using the COPY command to load the data into your tables. The tutorial includes help with troubleshooting load errors and compares the performance of loading from a single file with loading from multiple files.

# Use a COPY command to load data
<a name="c_best-practices-use-copy"></a>

The COPY command loads data in parallel from Amazon S3, Amazon EMR, Amazon DynamoDB, or multiple data sources on remote hosts. COPY loads large amounts of data much more efficiently than using INSERT statements, and stores the data more effectively as well.

 For more information about using the COPY command, see [Loading data from Amazon S3](t_Loading-data-from-S3.md) and [Loading data from an Amazon DynamoDB table](t_Loading-data-from-dynamodb.md).
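As a minimal sketch, the following COPY command loads pipe-delimited data from all objects under an Amazon S3 prefix into a SALES table. The bucket, prefix, and IAM role names are placeholders for illustration.

```
copy sales
from 's3://amzn-s3-demo-bucket/data/sales/'
iam_role 'arn:aws:iam::123456789012:role/MyRedshiftRole'
delimiter '|';
```

Because the FROM clause names a key prefix rather than a single object, COPY loads every matching file in parallel.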

# Use a single COPY command to load from multiple files
<a name="c_best-practices-single-copy-command"></a>

Amazon Redshift can automatically load in parallel from multiple compressed data files. You can specify the files to be loaded by using an Amazon S3 object prefix or by using a manifest file.

However, if you use multiple concurrent COPY commands to load one table from multiple files, Amazon Redshift is forced to perform a serialized load. This type of load is much slower and requires a VACUUM process at the end if the table has a sort column defined. For more information about using COPY to load data in parallel, see [Loading data from Amazon S3](t_Loading-data-from-S3.md).
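As a sketch of the manifest approach, a manifest is a JSON file that lists each data file to load. The bucket, file, and role names below are placeholders.

```
{
  "entries": [
    {"url": "s3://amzn-s3-demo-bucket/data/sales.part-001.gz", "mandatory": true},
    {"url": "s3://amzn-s3-demo-bucket/data/sales.part-002.gz", "mandatory": true}
  ]
}
```

A single COPY command then references the manifest with the MANIFEST option:

```
copy sales
from 's3://amzn-s3-demo-bucket/manifests/sales.manifest'
iam_role 'arn:aws:iam::123456789012:role/MyRedshiftRole'
gzip
delimiter '|'
manifest;
```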

# Loading data files
<a name="c_best-practices-use-multiple-files"></a>

Source-data files come in different formats and use varying compression algorithms. When loading data with the COPY command, Amazon Redshift loads all of the files referenced by the Amazon S3 bucket prefix. (The prefix is a string of characters at the beginning of the object key name.) If the prefix refers to multiple files or files that can be split, Amazon Redshift loads the data in parallel, taking advantage of Amazon Redshift’s MPP architecture. This divides the workload among the nodes in the cluster. In contrast, when you load data from a file that can't be split, Amazon Redshift is forced to perform a serialized load, which is much slower. The following sections describe the recommended way to load different file types into Amazon Redshift, depending on their format and compression. 

## Loading data from files that can be split
<a name="c_best-practices-use-multiple-files-split"></a>

The following files can be automatically split when their data is loaded:
+ an uncompressed CSV file
+ a columnar file (Parquet/ORC)

Amazon Redshift automatically splits files of 128 MB or larger into chunks. Columnar files, specifically Parquet and ORC, aren't split if they're less than 128 MB. Amazon Redshift uses slices working in parallel to load the data, which provides fast load performance.

## Loading data from files that can't be split
<a name="c_best-practices-use-multiple-files-comma"></a>

File types such as JSON, or CSV files compressed with algorithms such as GZIP, aren't automatically split. For these, we recommend manually splitting the data into multiple smaller files of similar size, from 1 MB to 1 GB after compression. Additionally, make the number of files a multiple of the number of slices in your cluster. For more information about how to split your data into multiple files and examples of loading data using COPY, see [Loading data from Amazon S3](t_Loading-data-from-S3.md).
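For example, suppose that you split a large pipe-delimited file into parts named venue.txt.1.gz through venue.txt.4.gz, with the part count a multiple of your cluster's slice count. A single COPY command with the common key prefix then loads all of the parts in parallel (the bucket and role names are placeholders):

```
copy venue
from 's3://amzn-s3-demo-bucket/data/venue.txt'
iam_role 'arn:aws:iam::123456789012:role/MyRedshiftRole'
gzip
delimiter '|';
```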

# Compressing your data files
<a name="c_best-practices-compress-data-files"></a>

When you want to compress large load files, we recommend that you use gzip, lzop, bzip2, or Zstandard to compress them and split the data into multiple smaller files.

Specify the GZIP, LZOP, BZIP2, or ZSTD option with the COPY command. This example loads the TIME table from a pipe-delimited lzop file.

```
copy time
from 's3://amzn-s3-demo-bucket/data/timerows.lzo' 
iam_role 'arn:aws:iam::123456789012:role/MyRedshiftRole'
lzop
delimiter '|';
```

There are instances when you don't have to split uncompressed data files. For more information about splitting your data and examples of using COPY to load data, see [Loading data from Amazon S3](t_Loading-data-from-S3.md). 

# Verify data files before and after a load
<a name="c_best-practices-verifying-data-files"></a>

Before you load data from Amazon S3, first verify that your Amazon S3 bucket contains all the correct files, and only those files. For more information, see [Verifying that the correct files are present in your bucket](verifying-that-correct-files-are-present.md). 

After the load operation is complete, query the [STL\_LOAD\_COMMITS](r_STL_LOAD_COMMITS.md) system table to verify that the expected files were loaded. For more information, see [Verifying that the data loaded correctly](verifying-that-data-loaded-correctly.md). 
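For example, the following query is one way to check which files a recent load committed; it assumes the load files share a `venue` key prefix:

```
select query, trim(filename) as filename, curtime, status
from stl_load_commits
where filename like '%venue%'
order by query;
```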

# Use a multi-row insert
<a name="c_best-practices-multi-row-inserts"></a>

If a COPY command is not an option and you require SQL inserts, use a multi-row insert whenever possible. Data compression is inefficient when you add data only one row or a few rows at a time.

Multi-row inserts improve performance by batching up a series of inserts. The following example inserts three rows into a four-column table using a single INSERT statement. This is still a small insert, shown simply to illustrate the syntax of a multi-row insert.

```
insert into category_stage values
(default, default, default, default),
(20, default, 'Country', default),
(21, 'Concerts', 'Rock', default);
```

For more details and examples, see [INSERT](r_INSERT_30.md). 

# Use a bulk insert
<a name="c_best-practices-bulk-inserts"></a>

Use a bulk insert operation with a SELECT clause for high-performance data insertion.

Use the [INSERT](r_INSERT_30.md) and [CREATE TABLE AS](r_CREATE_TABLE_AS.md) commands when you need to move data or a subset of data from one table into another.

For example, the following INSERT statement selects all of the rows from the CATEGORY table and inserts them into the CATEGORY\_STAGE table.

```
insert into category_stage
(select * from category);
```

The following example creates CATEGORY\_STAGE as a copy of CATEGORY and inserts all of the rows in CATEGORY into CATEGORY\_STAGE. 

```
create table category_stage as
select * from category;
```

# Load data in sort key order
<a name="c_best-practices-sort-key-order"></a>

Load your data in sort key order to avoid needing to vacuum.

If each batch of new data follows the existing rows in your table and the COPY operation is not large enough to trigger certain load optimizations, your data is properly stored in sort order, and you don't need to run a vacuum. You don't need to presort the rows in each load because COPY sorts each batch of incoming data as it loads.

For example, suppose that you load data every day based on the current day's activity. If your sort key is a timestamp column and the COPY operation is not large enough to trigger certain load optimizations, your data is stored in sort order. This order occurs because the current day's data is always appended at the end of the previous day's data. For more information, see [Load your data in sort key order](vacuum-managing-vacuum-times.md#vacuum-load-in-sort-key-order). For more information about vacuum operations, see [Vacuuming tables](t_Reclaiming_storage_space202.md).
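As an illustrative sketch of this pattern, the table, bucket, and role names below are placeholders. With a timestamp sort key, a daily COPY of the current day's files appends rows that are already in sort order:

```
create table events (
    event_id   bigint,
    event_time timestamp sortkey,
    detail     varchar(200)
);

copy events
from 's3://amzn-s3-demo-bucket/events/2025-07-01/'
iam_role 'arn:aws:iam::123456789012:role/MyRedshiftRole'
delimiter '|';
```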

# Load data in sequential blocks
<a name="c_best-practices-load-data-in-sequential-blocks"></a>

If you need to add a large quantity of data, load the data in sequential blocks according to sort order to eliminate the need to vacuum. 

For example, suppose that you need to load a table with events from January 2017 to December 2017. Assuming each month is in a single file, load the rows for January, then February, and so on. Your table is completely sorted when your load completes, and you don't need to run a vacuum. For more information, see [Use time-series tables](c_best-practices-time-series-tables.md).

When loading very large datasets, the space required to sort might exceed the total available space. By loading data in smaller blocks, you use much less intermediate sort space during each load. In addition, loading smaller blocks makes it easier to restart if the COPY fails and is rolled back. 
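The monthly loads described above might look like the following sequence of COPY commands, run one at a time in sort order (the table, bucket, and role names are placeholders):

```
copy events
from 's3://amzn-s3-demo-bucket/events/2017-01/'
iam_role 'arn:aws:iam::123456789012:role/MyRedshiftRole'
delimiter '|';

copy events
from 's3://amzn-s3-demo-bucket/events/2017-02/'
iam_role 'arn:aws:iam::123456789012:role/MyRedshiftRole'
delimiter '|';

-- continue month by month through 2017-12
```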

# Use time-series tables
<a name="c_best-practices-time-series-tables"></a>

If your data has a fixed retention period, you can organize your data as a sequence of time-series tables. In such a sequence, each table is identical but contains data for different time ranges.

You can easily remove old data simply by running a DROP TABLE command on the corresponding tables. This approach is much faster than running a large-scale DELETE process and saves you from having to run a subsequent VACUUM process to reclaim space. To hide the fact that the data is stored in different tables, you can create a UNION ALL view. When you delete old data, refine your UNION ALL view to remove the dropped tables. Similarly, as you load new time periods into new tables, add the new tables to the view. To signal the optimizer to skip the scan on tables that don't match the query filter, your view definition filters for the date range that corresponds to each table.
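For example, a UNION ALL view over two monthly tables might look like the following; the date-range filters let the optimizer skip tables that don't match a query's filter (the table and column names are placeholders):

```
create view events_all as
select * from events_2017_01
where event_time >= '2017-01-01' and event_time < '2017-02-01'
union all
select * from events_2017_02
where event_time >= '2017-02-01' and event_time < '2017-03-01';
```

To retire the oldest month, redefine the view without that table before running DROP TABLE on it.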

Avoid having too many tables in the UNION ALL view. Each additional table adds a small processing time to the query. Tables don't need to use the same time frame. For example, you might have tables for differing time periods, such as daily, monthly, and yearly.

If you use time-series tables with a timestamp column for the sort key, you effectively load your data in sort key order. Doing this eliminates the need to vacuum to re-sort the data. For more information, see [Load your data in sort key order](vacuum-managing-vacuum-times.md#vacuum-load-in-sort-key-order).

# Schedule around maintenance windows
<a name="c_best-practices-avoid-maintenance"></a>

If a scheduled maintenance occurs while a query is running, the query is terminated and rolled back and you need to restart it. Schedule long-running operations, such as large data loads or VACUUM operations, to avoid maintenance windows. You can also minimize the risk, and make restarts easier when they are needed, by performing data loads in smaller increments and managing the size of your VACUUM operations. For more information, see [Load data in sequential blocks](c_best-practices-load-data-in-sequential-blocks.md) and [Vacuuming tables](t_Reclaiming_storage_space202.md).