
Replication slot lost after switchover? #25

Open

a-dekker opened this issue Sep 1, 2023 · 7 comments


a-dekker commented Sep 1, 2023

Situation: a PostgreSQL 13 master/standby setup on-premises using repmgr.
We recently added a logical standby, using AWS Database Migration Service (DMS) to replicate to a cloud instance.
To keep the replication from failing after a switchover, I installed pg_failover_slots on both the master and the standby. After adding some pg_hba.conf rules, the logical replication slot is also visible on the standby node.
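
Roughly, the configuration this involves looks like the following (a sketch only; the standby address, user name, and auth method are placeholders, not the exact settings from this cluster):

-- on both primary and standby; takes effect after a restart and replaces any
-- existing shared_preload_libraries entries, so merge with what is already there
ALTER SYSTEM SET shared_preload_libraries = 'pg_failover_slots';
-- on the standby: commonly recommended so the primary does not remove catalog
-- rows that the synchronized copy of the slot still needs
ALTER SYSTEM SET hot_standby_feedback = on;

# pg_hba.conf on the primary: besides the usual "replication" entry, the standby
# must be allowed to connect to the database that owns the logical slot
host    persoon    slot_sync_user    192.0.2.10/32    scram-sha-256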

primary:

SELECT * FROM pg_replication_slots WHERE slot_type = 'logical';
-[ RECORD 1 ]-------+---------------------------------------------------------------
slot_name           | a2ngextskh5dxnxw_00139051_195c95de_355f_4ea8_8739_4185610c15b4
plugin              | test_decoding
slot_type           | logical
datoid              | 139051
database            | persoon
temporary           | f
active              | t
active_pid          | 59662
xmin                | ¤
catalog_xmin        | 175553299
restart_lsn         | 537/392661A8
confirmed_flush_lsn | 537/392661A8
wal_status          | reserved
safe_wal_size       | ¤

standby:

 SELECT * FROM pg_replication_slots WHERE slot_type = 'logical';
-[ RECORD 1 ]-------+---------------------------------------------------------------
slot_name           | a2ngextskh5dxnxw_00139051_195c95de_355f_4ea8_8739_4185610c15b4
plugin              | test_decoding
slot_type           | logical
datoid              | 139051
database            | persoon
temporary           | f
active              | t
active_pid          | 87075
xmin                | ¤
catalog_xmin        | 175553797
restart_lsn         | 537/3A000028
confirmed_flush_lsn | ¤
wal_status          | reserved
safe_wal_size       | ¤

For some reason I have to stop the replication task in AWS first, otherwise the primary instance will not shut down during a switchover (but that is not related to pg_failover_slots). I see the "active" state turn to false on the primary after stopping the replication task, but no change on the standby. After the switchover, the replication slot is lost on both instances, and the replication task goes into an error state after it is restarted.

Any clue why this is not working?

ashucoek (Contributor) commented

An active state of true on the standby indicates that the slot has not yet synchronized and is not safe for use. Hence, when the failover happened, the slot on the new primary was lost, which is expected. Did you see an error like the one below being emitted on the standby?

"still waiting for remote slot %s lsn ... and catalog xmin ... to pass local slot lsn ... and catalog xmin ..."

a-dekker (Author) commented

That is correct, I see this message:
2023-09-22 09:08:19.039 CEST [12114]: [12-1] db=,user=,app=,client= LOG: still waiting for remote slot ldeemxrxardzhhzw_02969648_1c8551d3_3dc8_4715_b9f4_56508bed95cb lsn (559/3D00CB78) and catalog xmin (176374235) to pass local slot lsn (559/3D01CBC0) and catalog xmin (176374441)


AikySay commented Oct 7, 2023

I am confused about this. If I have multiple slots that need to be synchronized and the first slot fails to synchronize, will the rest not be processed?

a-dekker (Author) commented Oct 9, 2023

> An active state of true on standby indicates that the slot has not yet synchronized and is not safe for use.

Any idea why it does not synchronize, or how to debug this?

a-dekker (Author) commented

It looks like the state does end up in the desired 'false' on the standby side after all. Perhaps some commit is needed first, as I was testing in an isolated setup without any data changes.


raman-trantor commented Dec 12, 2023

Hi @ashucoek,
Do you know of any way to force the remote slot information to pass the local slot, or of any logs that would help debug what is wrong?
I have a CNPG cluster with PostgreSQL 16 and the pg_failover_slots extension running for a few days, and I am unable to make the active column on the standby change to false.

All I can see in the logs is the "still waiting for remote slot" line: "2023-12-12 20:27:42.995 UTC,,,309,,6578b8aa.135,100,,2023-12-12 19:46:50 UTC,2/26,0,LOG,00000,"still waiting for remote slot fivetran_slot lsn (0/44B5620) and catalog xmin (751) to pass local slot lsn (0/800E820) and catalog xmin (755)",,,,,,,,,"pg_failover_slots worker","pg_failover_slots worker",,0"

nick-ivanov-edb commented

> I have a CNPG cluster with PGSQL 16 and pg_failover_slots extension running for a few days and I am unable to make the active column on the standby change to false.

You need to have some activity on the primary so that the restart_lsn of the logical slot moves beyond that of the standby.
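
In an otherwise idle cluster that activity can be generated artificially, for example (a sketch to run on the primary in the database that owns the slot; pg_logical_emit_message and pg_switch_wal are standard PostgreSQL functions, and restart_lsn only advances once the logical consumer, DMS or Fivetran here, confirms the new position):

-- write a small amount of WAL for the logical consumer to decode and acknowledge
SELECT pg_logical_emit_message(true, 'failover_slots_probe', 'advance');
-- optionally force a WAL segment switch so the new position is streamed out promptly
SELECT pg_switch_wal();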
