Measuring replication lag among pgactive members - Amazon Relational Database Service

Measuring replication lag among pgactive members

You can use the following query to view the replication lag among the pgactive members. Run this query on every pgactive node to get the full picture.

app=> SELECT * FROM pgactive.pgactive_get_replication_lag_info(); │-[ RECORD 1 ]--------+--------------------------------------------- │node_name | node2-app │node_sysid | 7481018224801653637 │application_name | pgactive:7481018224801653637:send │slot_name | pgactive_16385_7481018224801653637_0_16385__ │active | t │active_pid | 783486 │pending_wal_decoding | 0 │pending_wal_to_apply | 0 │restart_lsn | 0/2108150 │confirmed_flush_lsn | 0/2154690 │sent_lsn | 0/2154690 │write_lsn | 0/2154690 │flush_lsn | 0/2154690 │replay_lsn | 0/2154690 │-[ RECORD 2 ]--------+--------------------------------------------- │node_name | node1-app │node_sysid | 7481018033434600853 │application_name | pgactive:7481018033434600853:send │slot_name | pgactive_16385_7481018033434600853_0_16385__ │active | t │active_pid | 783488 │pending_wal_decoding | 0 │pending_wal_to_apply | 0 │restart_lsn | 0/20F5AD0 │confirmed_flush_lsn | 0/214EF68 │sent_lsn | 0/214EF68 │write_lsn | 0/214EF68 │flush_lsn | 0/214EF68 │replay_lsn | 0/214EF68

Monitor the following diagnostics at a minimum:

active

Set up alerts when active is false, which indicates that the slot isn't currently in use (the subscriber instance has disconnected from the publisher).

pending_wal_decoding

In PostgreSQL's logical replication, WAL files are stored in binary format. The publisher must decode these WAL changes and convert them into logical changes (such as insert, update, or delete operations).

The metric pending_wal_decoding shows the number of WAL files waiting to be decoded on the publisher side.

This number can increase due to these factors:

  • When the subscriber isn't connected, active status will be false and pending_wal_decoding will increase

  • The slot is active, but the publisher can't keep up with the volume of WAL changes

pending_wal_to_apply

The metric pending_wal_apply indicates the number of WAL files waiting to be applied on the subscriber side.

Several factors can prevent the subscriber from applying changes and potentially cause a disk full scenario:

  • Schema differences - for example, when you have changes in the WAL stream for a table named sample, but that table doesn't exist on the subscriber side

  • Values in the primary key columns were updated

  • Secondary unique indexes can cause data divergence