
Initial troubleshooting for common PostgreSQL performance issues in RDS for PostgreSQL - Amazon Relational Database Service


This guide covers the four most common performance issues that affect RDS for PostgreSQL databases: table and index bloat, parallel query resource exhaustion, high connection and authentication pressure, and autovacuum tuning. Use this guide as a first-pass diagnostic checklist when you experience performance degradation, before you begin deeper investigation.

Each section describes the symptoms you might observe, provides diagnostic queries to confirm the root cause, and recommends specific remediation steps.

Understanding "nothing changed" performance regressions

PostgreSQL workloads often run without issues for weeks or months, then experience sudden performance degradation even though the application code and query patterns appear unchanged. This happens because the database environment is never truly static — several invisible factors shift over time and can trigger plan changes or resource contention:

  • Bloat accumulation is a workload change. PostgreSQL's multiversion concurrency control (MVCC) retains old row versions until autovacuum reclaims them. When dead tuples accumulate faster than autovacuum can process them, tables and indexes grow physically larger. The query planner may then switch from efficient index scans to sequential scans because the cost estimates shift as the table size increases. Your SQL hasn't changed, but the data the planner sees has.

  • New parameter values are a workload change. A parameterized query that performs well for one range of values can perform poorly when the application begins using a different range. PostgreSQL may reuse a generic execution plan that doesn't account for data skew in the new range, or the planner's statistics may not accurately reflect the distribution for those values. When bloat is also present, the impact compounds — a suboptimal plan now scans significantly more dead data.

  • Statistics can be stale even when autovacuum runs. Autovacuum triggers ANALYZE based on the number of rows inserted or updated, not on whether the data distribution has meaningfully changed. If your application shifts to querying a different value range or time window, the planner's cost estimates may be inaccurate even though autovacuum has run recently.

  • Overall database growth is a workload change. As tables grow over time, the volume of data pages that queries must scan increases. Queries that performed well on smaller tables may develop latency as the table size grows, even when the query logic and indexes remain unchanged. Monitor FreeStorageSpace to track storage growth trends.

When you investigate performance regressions where "nothing changed," treat bloat accumulation, new parameter value ranges, overall database growth, and stale statistics as the most likely root causes. Use the diagnostic steps in this guide to confirm which factor applies.


Quick diagnostic checklist

Use the following ordered triage steps when you first investigate a performance issue:

  1. Check pg_stat_activity. Look at connection count, idle-in-transaction sessions, and long-running queries. For more information, see Essential concepts for RDS for PostgreSQL tuning.

  2. Check for bloat. Look for high n_dead_tup in pg_stat_user_tables and consider using pgstattuple for precise measurement. For more information, see Removing bloat from tables with pg_repack.

  3. Check pg_stat_user_tables. Look for high n_dead_tup values and stale last_autovacuum timestamps. For more information, see Working with PostgreSQL autovacuum on Amazon RDS.

  4. Review EXPLAIN ANALYZE on slow queries. Look for parallel plans and sequential scans on large tables. For more information, see Best practices for parallel queries in RDS for PostgreSQL.

  5. Check CloudWatch and Performance Insights metrics. Review CPU utilization, connection count, IOPS, and freeable memory. For more information, see Monitoring Amazon RDS. For common wait events and corrective actions, see RDS for PostgreSQL wait events.

  6. Review your DB parameter group. Check max_parallel_workers_per_gather and autovacuum settings. For more information, see Tuning PostgreSQL parameters in Amazon RDS and Amazon Aurora.
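Step 1 of the checklist can be covered with a single query against pg_stat_activity. The following sketch groups client sessions by state and surfaces the longest-running transaction and query; it is illustrative, not an exhaustive triage query:

```sql
-- Summarize client sessions by state and flag long-running work.
SELECT state,
       count(*) AS sessions,
       max(now() - xact_start) AS longest_transaction,
       max(now() - query_start) AS longest_query
FROM pg_stat_activity
WHERE backend_type = 'client backend'
GROUP BY state
ORDER BY sessions DESC;
```

A large sessions count for the idle in transaction state, or a longest_transaction measured in hours, is a signal to continue with the connection-pressure and bloat checks below.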

Table and index bloat

Table and index bloat occurs when dead tuples accumulate in your tables faster than autovacuum can reclaim them. Over time, this causes gradual query performance degradation, increased storage usage, and suboptimal query plans.

Symptoms

  • Gradual query performance degradation over weeks or months

  • Storage usage growing despite stable data volume

  • The query planner chooses sequential scans over index scans due to stale statistics

  • High n_dead_tup counts in pg_stat_user_tables

Diagnosis

You can estimate bloat across all tables by querying the system catalog. This approach does not require any extensions:

SELECT schemaname,
       relname,
       n_dead_tup,
       n_live_tup,
       ROUND(n_dead_tup::numeric / GREATEST(n_live_tup, 1) * 100, 2) AS dead_pct,
       last_autovacuum,
       last_autoanalyze
FROM pg_stat_user_tables
WHERE n_dead_tup > 10000
ORDER BY n_dead_tup DESC
LIMIT 20;

To address bloat, you can use the pg_repack extension to reorganize tables and indexes with minimal locking. For more information, see Removing bloat from tables with pg_repack and Remove bloat from Amazon Aurora and RDS for PostgreSQL with pg_repack.
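Before you can run the pg_repack client, the extension must exist in the target database. A minimal sketch:

```sql
-- Enable the pg_repack extension in the database you want to reorganize.
-- RDS for PostgreSQL includes pg_repack in its supported extensions.
CREATE EXTENSION pg_repack;
```

You then run the pg_repack client from a machine that can reach your instance, for example pg_repack -k -h <endpoint> -d <dbname> -t <schema.table>; the -k flag skips the superuser check, which is required on RDS because the master user is not a superuser. The endpoint, database, and table names here are placeholders for your own values.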

Important

Rather than relying on manual maintenance, ensure that autovacuum is enabled and properly tuned for your workload. See Autovacuum tuning for tuning recommendations.

Parallel query resource exhaustion

PostgreSQL can execute queries in parallel to improve performance for large sequential scans and aggregations. However, each parallel worker is a full backend process that counts against max_worker_processes (and the sub-limit max_parallel_workers) and allocates its own work_mem. A single query with 4 parallel workers can consume hundreds of megabytes of memory and significant CPU. Under high concurrency, excessive parallelism can exhaust CPU and memory rapidly.

Common symptoms include sudden CPU spikes, high memory usage per query, and elevated DatabaseConnections in CloudWatch without application changes. You may also observe wait events such as IPC:BgWorkerStartup, IPC:ExecuteGather, and IPC:ParallelFinish. For more information about these wait events, see IPC:parallel wait events.

For most OLTP and high-concurrency production workloads, disable automatic parallelism by setting max_parallel_workers_per_gather = 0 in your DB parameter group. You can then selectively enable parallelism for specific analytics or reporting sessions by setting the parameter per session or per role.
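With the global setting at 0, parallelism can still be granted selectively. The following sketch shows both scopes; the role name is illustrative:

```sql
-- In one analytics session only, allow up to 4 parallel workers per Gather node.
SET max_parallel_workers_per_gather = 4;

-- Or grant it persistently to a dedicated reporting role
-- (reporting_user is a hypothetical role name).
ALTER ROLE reporting_user SET max_parallel_workers_per_gather = 4;
```

Scoping parallelism this way keeps OLTP traffic predictable while still letting large scans and aggregations in reporting sessions benefit from parallel workers.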

For detailed guidance on diagnosing and controlling parallel query behavior, see Best practices for parallel queries in RDS for PostgreSQL.

High connection and authentication pressure

Connection churn — frequent opening and closing of database connections without pooling — creates authentication overhead and can exhaust available connection slots. Idle connections that remain open also consume slots without performing useful work.

Symptoms

  • Elevated total_auth_attempts in Performance Insights monitoring. For more information, see Non-native counters for RDS for PostgreSQL.

  • Slow connection establishment times

  • FATAL: too many connections for role or remaining connection slots are reserved errors

  • CPU spikes correlated with connection churn

Diagnosis

Run the following query to check your current connection state:

SELECT setting::int AS max_connections,
       (SELECT count(*) FROM pg_stat_activity) AS current_connections,
       (SELECT count(*) FROM pg_stat_activity WHERE state = 'idle') AS idle_connections,
       (SELECT count(*) FROM pg_stat_activity WHERE state = 'idle in transaction') AS idle_in_txn
FROM pg_settings
WHERE name = 'max_connections';

A high number of idle or idle in transaction connections relative to max_connections indicates that connections are not being released properly. Idle-in-transaction connections are especially problematic because they hold locks and prevent autovacuum from reclaiming dead tuples.
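To see which sessions are holding transactions open, you can list them oldest first. This sketch truncates the query text for readability; once you identify an offender, pg_terminate_backend(pid) ends its session:

```sql
-- List idle-in-transaction sessions, oldest first.
SELECT pid,
       usename,
       application_name,
       now() - xact_start AS transaction_age,
       left(query, 60) AS last_query
FROM pg_stat_activity
WHERE state = 'idle in transaction'
ORDER BY xact_start;
```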

Remediation

  • Deploy connection pooling. Use PgBouncer or Amazon RDS Proxy to reduce the number of direct connections to your database. Connection pooling reuses existing connections rather than creating new ones for each request.

  • Set idle_in_transaction_session_timeout. This parameter automatically terminates sessions that remain idle in a transaction beyond the specified duration. This prevents long-running idle transactions from holding locks and blocking autovacuum.

  • Review application connection handling. Ensure your application closes connections promptly and does not hold transactions open longer than necessary.
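The timeout in the second remediation step can be set cluster-wide in the DB parameter group, or scoped more narrowly in SQL. The database and role names below are illustrative:

```sql
-- Terminate sessions idle in a transaction for more than 5 minutes.
-- Scoped to one database (mydb is a placeholder name):
ALTER DATABASE mydb SET idle_in_transaction_session_timeout = '5min';

-- Or scoped to one application role (app_user is a placeholder name):
ALTER ROLE app_user SET idle_in_transaction_session_timeout = '5min';
```

Start with a generous value and tighten it once you confirm that no legitimate workload holds transactions open that long.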

Note

Parallel query workers consume CPU and memory. If you observe resource exhaustion alongside parallel query activity, see Parallel query resource exhaustion for guidance on controlling parallel worker usage.

Using Performance Insights wait events for troubleshooting

Performance Insights captures wait events that show where your database is spending time. When you investigate performance issues, wait events help you identify whether the bottleneck is CPU, I/O, locking, network, or inter-process communication. Common wait event categories that appear during the issues described in this guide include:

  • CPU — The session is active on CPU or waiting for CPU. High CPU wait events often correlate with excessive parallelism or inefficient query plans scanning bloated tables.

  • IPC (inter-process communication) — Wait events such as IPC:BgWorkerStartup, IPC:ExecuteGather, and IPC:ParallelFinish indicate parallel query coordination overhead.

  • IO — Wait events such as IO:DataFileRead indicate that queries are reading data from storage because the required pages are not in shared memory. This is common when bloated tables exceed the buffer cache.

  • Lock — Wait events such as Lock:transactionid and Lock:tuple indicate contention between sessions. Idle-in-transaction connections can hold locks that block other queries and autovacuum.

  • Client — Wait events such as Client:ClientRead indicate the database is waiting for the application to send data. High client wait events can indicate connection churn or network latency.

For a complete reference of wait events that commonly indicate performance problems and their recommended corrective actions, see RDS for PostgreSQL wait events.
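Performance Insights shows wait events historically; for a point-in-time view you can query pg_stat_activity directly. A minimal sketch:

```sql
-- Count active sessions by the wait event they are currently in.
SELECT wait_event_type,
       wait_event,
       count(*) AS sessions
FROM pg_stat_activity
WHERE state = 'active'
  AND wait_event IS NOT NULL
GROUP BY wait_event_type, wait_event
ORDER BY sessions DESC;
```

A dominant IPC or Lock category in this output points you to the parallel query and connection-pressure sections of this guide, respectively.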

Autovacuum tuning

Autovacuum is the background process that reclaims dead tuples, prevents table and index bloat, updates planner statistics, and protects against transaction ID wraparound. The default autovacuum settings are conservative and designed for small databases. High-write production workloads almost always require tuning.
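Per-table storage parameters let you tune autovacuum for your hottest tables without changing instance-wide defaults. In this sketch the table name and thresholds are illustrative, not recommendations for every workload:

```sql
-- Make autovacuum more aggressive on one high-write table.
ALTER TABLE orders SET (
    autovacuum_vacuum_scale_factor = 0.01,  -- vacuum after ~1% of rows are dead
    autovacuum_vacuum_cost_delay = 1        -- lower delay (ms) so vacuum finishes faster
);
```

Instance-wide autovacuum parameters are changed through the DB parameter group rather than with ALTER SYSTEM, which RDS does not allow.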

When autovacuum cannot keep up with your write workload, bloat accumulates, planner statistics become stale, and the risk of transaction ID wraparound increases. If age(relfrozenxid) approaches 2 billion, the database shuts down to prevent data corruption.
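You can monitor wraparound risk directly from the catalog. The alert threshold in the comment is a common rule of thumb, not an official limit:

```sql
-- Transaction ID age per database; higher values are closer to wraparound.
-- Alerting well before 2 billion (for example, at 1 billion) is prudent.
SELECT datname,
       age(datfrozenxid) AS xid_age
FROM pg_database
ORDER BY xid_age DESC;
```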

For detailed guidance on tuning autovacuum parameters, monitoring vacuum activity, and configuring per-table overrides, see Working with PostgreSQL autovacuum on Amazon RDS.
