
Replication slot lost after switchover? #25

Open

a-dekker opened this issue Sep 1, 2023 · 7 comments


a-dekker commented Sep 1, 2023

Situation: a PostgreSQL 13 master/standby setup on-premises using repmgr.
We recently added a logical standby, using AWS Database Migration Service (DMS) to replicate to a cloud instance.
To keep the replication from failing after a switchover, I installed pg_failover_slots on both the master and the standby. After adding some pg_hba.conf rules, the logical replication slot is also visible on the standby node.
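
Roughly, the configuration this involves looks like the following (a sketch only; the standby address, user name, and auth method are placeholders, not the exact settings from this cluster):

-- on both primary and standby; takes effect after a restart and replaces any
-- existing shared_preload_libraries entries, so merge with what is already there
ALTER SYSTEM SET shared_preload_libraries = 'pg_failover_slots';
-- on the standby: commonly recommended so the primary does not remove catalog
-- rows that the synchronized copy of the slot still needs
ALTER SYSTEM SET hot_standby_feedback = on;

# pg_hba.conf on the primary: besides the usual "replication" entry, the standby
# must be allowed to connect to the database that owns the logical slot
host    persoon    slot_sync_user    192.0.2.10/32    scram-sha-256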

primary:

SELECT * FROM pg_replication_slots WHERE slot_type = 'logical';
-[ RECORD 1 ]-------+---------------------------------------------------------------
slot_name           | a2ngextskh5dxnxw_00139051_195c95de_355f_4ea8_8739_4185610c15b4
plugin              | test_decoding
slot_type           | logical
datoid              | 139051
database            | persoon
temporary           | f
active              | t
active_pid          | 59662
xmin                | ¤
catalog_xmin        | 175553299
restart_lsn         | 537/392661A8
confirmed_flush_lsn | 537/392661A8
wal_status          | reserved
safe_wal_size       | ¤

standby:

 SELECT * FROM pg_replication_slots WHERE slot_type = 'logical';
-[ RECORD 1 ]-------+---------------------------------------------------------------
slot_name           | a2ngextskh5dxnxw_00139051_195c95de_355f_4ea8_8739_4185610c15b4
plugin              | test_decoding
slot_type           | logical
datoid              | 139051
database            | persoon
temporary           | f
active              | t
active_pid          | 87075
xmin                | ¤
catalog_xmin        | 175553797
restart_lsn         | 537/3A000028
confirmed_flush_lsn | ¤
wal_status          | reserved
safe_wal_size       | ¤

For some reason I have to stop the replication task in AWS first, otherwise the primary instance will not shut down during a switchover (but that is not related to pg_failover_slots). I see the "active" state turn to false on the primary after stopping the replication task, but no change on the standby. After the switchover, the replication slot is lost on both instances, and the replication task goes into an error state after it is restarted.

Any clue why this is not working?

ashucoek (Contributor) commented

An active state of true on the standby indicates that the slot has not yet synchronized and is not safe for use. Hence, when the failover happened, the slot on the new primary was lost, which is expected. Did you see an error like the one below being emitted on the standby?

"still waiting for remote slot %s lsn ... and catalog xmin ... to pass local slot lsn ... and catalog xmin ..."

a-dekker (Author) commented

That is correct, I see this message:
2023-09-22 09:08:19.039 CEST [12114]: [12-1] db=,user=,app=,client= LOG: still waiting for remote slot ldeemxrxardzhhzw_02969648_1c8551d3_3dc8_4715_b9f4_56508bed95cb lsn (559/3D00CB78) and catalog xmin (176374235) to pass local slot lsn (559/3D01CBC0) and catalog xmin (176374441)


AikySay commented Oct 7, 2023

I am confused about this. If I have multiple slots that need to be synchronized and the first slot fails to synchronize, will the rest not be processed?

a-dekker (Author) commented Oct 9, 2023

> An active state of true on standby indicates that the slot has not yet synchronized and is not safe for use.

Any idea why it does not synchronize, or how to debug this?

a-dekker (Author) commented

It looks like the state does end up in the desired 'false' on the standby side after all. Perhaps some commit is needed first, as I was testing in an isolated setup without any data changes.


raman-trantor commented Dec 12, 2023

Hi @ashucoek,
Do you know of any way to force the remote slot information to pass the local slot, or of any logs that would help debug what is wrong?
I have a CNPG cluster with PostgreSQL 16 and the pg_failover_slots extension running for a few days, and I am unable to make the active column on the standby change to false.

All I can see in the logs is the "still waiting for remote slot" line: "2023-12-12 20:27:42.995 UTC,,,309,,6578b8aa.135,100,,2023-12-12 19:46:50 UTC,2/26,0,LOG,00000,"still waiting for remote slot fivetran_slot lsn (0/44B5620) and catalog xmin (751) to pass local slot lsn (0/800E820) and catalog xmin (755)",,,,,,,,,"pg_failover_slots worker","pg_failover_slots worker",,0"

nick-ivanov-edb commented

> I have a CNPG cluster with PGSQL 16 and pg_failover_slots extension running for a few days and I am unable to make the active column on the standby change to false.

You need to have some activity on the primary so that the restart_lsn of the logical slot moves beyond that of the standby.
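
In an otherwise idle cluster that activity can be generated artificially, for example (a sketch to run on the primary in the database that owns the slot; pg_logical_emit_message and pg_switch_wal are standard PostgreSQL functions, and restart_lsn only advances once the logical consumer, DMS or Fivetran here, confirms the new position):

-- write a small amount of WAL for the logical consumer to decode and acknowledge
SELECT pg_logical_emit_message(true, 'failover_slots_probe', 'advance');
-- optionally force a WAL segment switch so the new position is streamed out promptly
SELECT pg_switch_wal();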
