Loading data into Aurora DSQL
Whether you are migrating from an existing database, importing files from Amazon Simple Storage Service, or loading data from your local system, Aurora DSQL provides multiple approaches for getting your data in. This section covers the recommended tools and techniques for data loads of all sizes, from gigabytes to hundreds of terabytes.
Choosing a loading approach
Aurora DSQL supports standard PostgreSQL data loading commands, but loading data efficiently at scale requires handling parallelization, connection management, and error recovery. The following table summarizes your options:
| Approach | Best for | Considerations |
|---|---|---|
| Aurora DSQL Loader - Open source utility that makes it easy to parallelize inserts when using Aurora DSQL | Most data loading scenarios, especially migrations and bulk imports | Handles parallelization, connection pooling, conflict resolution, and IAM authentication automatically. Available as source code or binary. |
| PostgreSQL COPY - Standard PostgreSQL feature | Simple loads when you are already connected via a compatible client | Requires your client to handle IAM authentication; you manage parallelization yourself |
| INSERT transactions - Standard SQL DML | Small datasets or application-driven inserts | Simplest approach but slowest for bulk data |
For most data loading tasks, use the Aurora DSQL Loader. It handles the operational complexity of loading data into a distributed database, including parallel execution across multiple connections and automatic retry of failed operations.
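The loader handles this fan-out for you, but the underlying pattern is easy to picture. The following Python sketch shows one way to parallelize chunked loads with a worker pool; `load_chunk` is a hypothetical stand-in for the real per-connection work, not part of the loader's API:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

# Hypothetical worker: in a real loader, each call would open its own
# database connection and stream one file (or file chunk) into the table.
def load_chunk(path):
    # ... open connection, COPY the chunk, close connection ...
    return path, "loaded"

def load_parallel(paths, workers=4):
    """Fan chunks out across a pool of workers and collect results."""
    results = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(load_chunk, p): p for p in paths}
        for fut in as_completed(futures):
            path, status = fut.result()
            results[path] = status
    return results
```

A production loader layers connection pooling, retries, and conflict handling on top of this basic fan-out, which is why the utility is usually the better choice.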
Aurora DSQL Loader
The Aurora DSQL Loader is an open source utility that parallelizes data loading into Aurora DSQL, handling connection pooling, IAM authentication, and retries for you.
Key features
The Aurora DSQL Loader provides the following capabilities:
- Parallel loading: Configurable worker threads enable concurrent data loading across multiple connections for improved performance.
- Connection pooling: Manages a pool of connections to your Aurora DSQL cluster, handling IAM authentication and connection lifecycle automatically.
- Multiple file format support: Supports CSV (comma-separated values), TSV (tab-separated values), and Apache Parquet columnar format. The loader automatically detects the file format based on the source URI extension.
- Automatic schema inference: When used with the `--if-not-exists` flag, the loader can automatically create tables with appropriate column types based on the data.
- Conflict handling: When your target table has unique constraints, configure how the loader handles conflicts using the `--on-conflict` option: skip duplicates, upsert records, or return an error.
- Fault tolerance: Automatic retries and job resumption ensure that interrupted loads can continue from their stopping point rather than restarting entirely.
- Local and S3 sources: Load data from local file system paths or directly from Amazon S3 buckets using S3 URIs.
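The extension-based format detection mentioned above can be pictured with a short sketch. The mapping below is an assumption for illustration only, not the loader's actual implementation:

```python
from pathlib import Path

# Assumed extension-to-format mapping; the loader's real detection
# rules may differ.
FORMATS = {".csv": "csv", ".tsv": "tsv", ".parquet": "parquet"}

def detect_format(source_uri):
    """Infer the file format from the source URI's extension."""
    suffix = Path(source_uri).suffix.lower()
    try:
        return FORMATS[suffix]
    except KeyError:
        raise ValueError(f"unsupported file extension: {suffix!r}")
```

Because detection keys off the extension, name your files `.csv`, `.tsv`, or `.parquet` rather than relying on content sniffing.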
Prerequisites
Before using the Aurora DSQL Loader, ensure you have the following:
- An active Aurora DSQL cluster with a valid endpoint.
- AWS credentials configured through the AWS CLI (`aws configure`), AWS Single Sign-On (`aws sso login`), or IAM roles.
- IAM permissions: `dsql:DbConnectAdmin` or `dsql:DbConnect` on your Aurora DSQL cluster.
- For S3 sources, appropriate permissions to read from the source bucket.
Installation
Download the latest release from the GitHub releases page, or build the loader from source.
Usage examples
The following examples demonstrate common use cases for the Aurora DSQL Loader.
Example: Loading a local CSV file
This example loads a CSV file from your local file system into an existing table:

```shell
aurora-dsql-loader load \
    --endpoint cluster-id.dsql.region.on.aws \
    --source-uri data.csv \
    --table my_table
```
Example: Loading data from Amazon S3
This example loads a Parquet file from an Amazon S3 bucket:

```shell
aurora-dsql-loader load \
    --endpoint cluster-id.dsql.region.on.aws \
    --source-uri s3://my-bucket/data.parquet \
    --table my_table
```
Example: Automatic table creation
This example creates a new table automatically based on the data schema:

```shell
aurora-dsql-loader load \
    --endpoint cluster-id.dsql.region.on.aws \
    --source-uri data.csv \
    --table my_table \
    --if-not-exists
```
Example: Validating before loading
This example validates your configuration without actually loading data:

```shell
aurora-dsql-loader load \
    --endpoint cluster-id.dsql.region.on.aws \
    --source-uri data.csv \
    --table my_table \
    --dry-run
```
Example: Resuming an interrupted load
If a load operation is interrupted, you can resume it using the job ID from the previous run:

```shell
aurora-dsql-loader load \
    --endpoint cluster-id.dsql.region.on.aws \
    --source-uri data.csv \
    --table my_table \
    --resume-job-id job-id \
    --manifest-dir ./loader-state
```
Note
When resuming, the loader skips most already-completed work but may retry some records. If your target table has unique constraints, use the `--on-conflict` option to handle duplicates: for example, `do-nothing` to skip them or `do-update` to upsert.
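The manifest-based resumption behavior can be illustrated with a simplified sketch. This is not the loader's actual state format; `process_chunk` and the JSON manifest layout below are assumptions for illustration:

```python
import json
from pathlib import Path

def load_with_manifest(chunks, manifest_path, process_chunk):
    """Process chunks, persisting completed IDs so a rerun can skip them.

    `chunks` is an iterable of (chunk_id, data) pairs; `process_chunk` is
    a hypothetical callable standing in for the real per-chunk load.
    """
    manifest = Path(manifest_path)
    done = set(json.loads(manifest.read_text())) if manifest.exists() else set()
    for chunk_id, data in chunks:
        if chunk_id in done:
            continue  # already loaded in a previous run
        process_chunk(data)
        done.add(chunk_id)
        # Persist progress after every chunk so an interruption loses
        # at most the chunk in flight.
        manifest.write_text(json.dumps(sorted(done)))
    return done
```

Because progress is tracked per chunk rather than per row, a resumed run may replay part of the chunk that was in flight when the job stopped, which is why unique constraints and `--on-conflict` matter.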
Command-line options
The Aurora DSQL Loader supports the following command-line options:
- `--endpoint`: (Required) The Aurora DSQL cluster endpoint, for example `cluster-id.dsql.region.on.aws`.
- `--source-uri`: (Required) The path to the data file. Can be a local file path or an S3 URI (for example, `s3://bucket-name/file.parquet`).
- `--table`: (Required) The name of the target table in your Aurora DSQL database.
- `--if-not-exists`: (Optional) Automatically create the target table if it does not exist. The loader infers the schema from the data.
- `--dry-run`: (Optional) Validate the configuration and data without actually loading it into the database.
- `--resume-job-id`: (Optional) Resume a previously interrupted load operation using the specified job ID.
- `--manifest-dir`: (Optional) Directory for storing job state and manifests, used for job resumption.
- `--on-conflict`: (Optional) Specifies how to handle conflicts when inserting rows that violate unique constraints on the target table. Valid values are `error` (return an error), `do-nothing` (skip duplicate rows), and `do-update` (update existing rows with new values).
For a complete list of options and additional configuration parameters, run:
```shell
aurora-dsql-loader load --help
```
Best practices
- Use dry-run for validation: Always test your load configuration with `--dry-run` before loading data into production tables.
- Define unique constraints for resumption: If you need to resume interrupted loads, define unique constraints on your target tables and use the `--on-conflict` option to handle already-loaded records.
- Use Parquet for large datasets: Parquet's columnar format typically provides better compression and faster loading for large datasets than CSV or TSV.
- Preserve manifest directories: Keep the manifest directory for a load job until you confirm the load completed successfully, so that you can resume the job if needed.
Troubleshooting
- Authentication errors: Verify that your AWS credentials are configured correctly and that your IAM identity has the required `dsql:DbConnect` or `dsql:DbConnectAdmin` permissions on the target cluster.
- S3 access errors: Ensure your IAM identity has appropriate S3 read permissions for the source bucket and objects.
- Schema inference errors: When using `--if-not-exists`, ensure your data file has consistent column types. Mixed types in a column may cause schema inference to fail.
- Duplicate key errors on resume: If you encounter duplicate key errors when resuming a load, add unique constraints to your target table so the loader can use `ON CONFLICT DO NOTHING` to skip already-loaded records.
For additional troubleshooting information, see the Aurora DSQL Loader GitHub repository.
Migration pathways
The following sections describe how to migrate data from common source systems into Aurora DSQL.
Migrating from PostgreSQL
To migrate data from an existing PostgreSQL database to Aurora DSQL:
1. Export your data from PostgreSQL to CSV or Parquet format. You can use the PostgreSQL `COPY` command to export each table:

   ```sql
   COPY my_table TO '/path/to/my_table.csv' WITH (FORMAT csv, HEADER true);
   ```

2. Create the target table in Aurora DSQL. You can either create the schema manually or use the loader's `--if-not-exists` flag to infer the schema from your data.

3. Load the exported data using the Aurora DSQL Loader:

   ```shell
   aurora-dsql-loader load \
       --endpoint cluster-id.dsql.region.on.aws \
       --source-uri /path/to/my_table.csv \
       --table my_table
   ```
Tip
For large migrations, consider exporting to Parquet format for better compression and faster loading. Tools like DuckDB can convert CSV files to Parquet efficiently.
Migrating from MySQL
To migrate data from MySQL to Aurora DSQL:
1. Export your data from MySQL to CSV format using `SELECT ... INTO OUTFILE` or a tool like mysqldump with the `--tab` option:

   ```sql
   SELECT * FROM my_table
   INTO OUTFILE '/path/to/my_table.csv'
   FIELDS TERMINATED BY ',' ENCLOSED BY '"'
   LINES TERMINATED BY '\n';
   ```

2. Create the target table in Aurora DSQL with appropriate PostgreSQL-compatible data types.

3. Load the exported data using the Aurora DSQL Loader:

   ```shell
   aurora-dsql-loader load \
       --endpoint cluster-id.dsql.region.on.aws \
       --source-uri /path/to/my_table.csv \
       --table my_table
   ```
Note
MySQL and PostgreSQL have different data type systems. Review your schema and adjust data types as needed when creating tables in Aurora DSQL.
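One practical wrinkle beyond data types: MySQL's `SELECT ... INTO OUTFILE` writes NULL values as `\N`, while PostgreSQL-style CSV tooling typically expects an empty, unquoted field. A small stdlib script can rewrite the export before loading; the empty-field convention is an assumption here, so confirm how your target columns should represent NULL:

```python
import csv

def rewrite_mysql_nulls(src_path, dst_path, null_marker="\\N"):
    """Rewrite MySQL's \\N NULL markers as empty CSV fields.

    Assumes the loader treats an empty, unquoted field as NULL,
    following the PostgreSQL CSV convention.
    """
    with open(src_path, newline="") as src, \
         open(dst_path, "w", newline="") as dst:
        writer = csv.writer(dst, lineterminator="\n")
        for row in csv.reader(src):
            writer.writerow("" if field == null_marker else field
                            for field in row)
```

Run this over each exported file before pointing `--source-uri` at it.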
Loading from Amazon S3
If your data is already in Amazon S3, you can load it directly without downloading to your local system. The Aurora DSQL Loader supports S3 URIs natively:
```shell
aurora-dsql-loader load \
    --endpoint cluster-id.dsql.region.on.aws \
    --source-uri s3://my-bucket/path/to/data.parquet \
    --table my_table
```
Ensure your IAM identity has `s3:GetObject` permission on the source objects.
Using PostgreSQL COPY
If you are already connected to Aurora DSQL through a client that handles IAM authentication, you can use the standard PostgreSQL COPY command to load data. This approach works well for simple, single-threaded loads.
Example: Loading a CSV file with COPY

```sql
COPY my_table FROM '/path/to/data.csv' WITH (FORMAT csv, HEADER true);
```
When using COPY directly, you are responsible for:
- Managing parallelization if loading multiple files or large datasets
- Handling connection management and authentication token refresh
- Implementing retry logic for failed operations
For most use cases, the Aurora DSQL Loader provides a simpler and more robust approach to data loading.
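If you do use COPY directly, the retry responsibility above can be handled with a small wrapper. The following sketch retries a hypothetical `operation` callable with exponential backoff; in real code you would catch only errors you know to be transient, such as dropped connections or transaction conflicts:

```python
import time

def with_retries(operation, attempts=5, base_delay=0.5):
    """Run `operation`, retrying failures with exponential backoff.

    `operation` is a hypothetical callable wrapping a single COPY batch.
    Catching bare Exception here is for brevity; narrow it to the
    transient error types your driver raises.
    """
    for attempt in range(attempts):
        try:
            return operation()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries; surface the error
            time.sleep(base_delay * (2 ** attempt))
```

Because IAM authentication tokens expire, the wrapped operation should also obtain a fresh token when it reopens a connection.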
Additional resources
- Aurora DSQL Loader on GitHub – Source code, documentation, and issue tracking
- Generating an authentication token in Amazon Aurora DSQL – Learn about IAM authentication tokens for Aurora DSQL
- Accessing Aurora DSQL with PostgreSQL-compatible clients – Connect to Aurora DSQL using various clients and tools