Postgres crash in 9.5.4 once replication is fully caught up #17
Comments
What platform and PG version are you running? It might be high time to build some stress tests. |
It's trusty with pg 9.5.4. |
Also: it doesn't take stress to get it to crash .. it usually crashes pretty much immediately. I'll see if I can get a slick repro case together. |
This appears to be, from my local testing, reproducible once I do an update to a table that doesn't have _apid_scope on it. Does that make any sense? I'm having trouble making postgres dump a core file. |
I did get a stacktrace: |
I managed to get a repro case that involved our schema and test data, hooked up valgrind to postgres, did a bunch of code spelunking with @chuckg, and have a fix. I would like to open a pull request, but I don't seem to see my xid changes on master of the main project, so I'm a little confused as to the state of the code. The diff is here: Basically, there's an off-by-one error in the tuple_to_proto function that's triggered when the expression
is true. After that trips, you index into the array again using [natt], which is now counting off the end of the allocation. I also moved the OutputPluginWrite() call to be outside the MemoryContextSwitchTo() to make it consistent with the other output plugins we could find. I'm not sure if it matters, but it seemed good for safety and consistency's sake. |
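For anyone following along, here is a minimal sketch of the general class of bug described above. It is not the plugin's code: `Column`, its `skipped` flag, and `collect_names` are hypothetical names made up for illustration, and the real skip condition is not shown in this thread. The assumption is that the output array is sized for only the entries that are kept, so indexing it with the input loop counter runs past the allocation as soon as one entry has been skipped.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for a column; not the plugin's real type. */
typedef struct {
    const char *name;
    int skipped; /* hypothetical flag: column is left out of the output */
} Column;

/*
 * Copy the names of non-skipped columns into a freshly allocated array.
 * Stores the array in *out and returns the number of names written,
 * or -1 if the allocation fails. The caller frees *out.
 */
static int collect_names(const Column *cols, int ncols, const char ***out)
{
    int kept = 0;
    for (int i = 0; i < ncols; i++) {
        if (!cols[i].skipped)
            kept++;
    }

    /* The array only has room for the kept columns. */
    const char **names = malloc((kept > 0 ? kept : 1) * sizeof(*names));
    if (names == NULL) {
        *out = NULL;
        return -1;
    }

    int w = 0; /* write index: advances only when a column is kept */
    for (int natt = 0; natt < ncols; natt++) {
        if (cols[natt].skipped)
            continue;
        /*
         * Writing names[natt] here would be the bug: once any column has
         * been skipped, natt can be >= kept and the write lands past the
         * end of the allocation. A separate write index stays in bounds.
         */
        names[w++] = cols[natt].name;
    }

    assert(w == kept);
    *out = names;
    return w;
}
```

With three columns where the middle one is skipped, the array holds two entries, and the input index for the last column (2) would already be out of bounds for it. That is the kind of write valgrind flags as an invalid write past a heap block.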
(I now see what's going on with master. I'll get a clean diff and pull request together tomorrow.) |
Thank you for following up on this! On Tuesday, November 8, 2016, Kris Wehner [email protected] wrote:
Greg Brail | apigee https://apigee.com/ | twitter @gbrail |
One thing I noticed while working on the fix for #4 is that my postgres crashes regularly when (apparently) the replicator is almost caught up with the stream.
In debug mode, the changeserver emits:
DEBU[0580] Received message type CopyData
DEBU[0580] Received change 37106520 for scope
DEBU[0580] Got message type CopyData (100) length 177
DEBU[0580] Received message type CopyData
DEBU[0580] Received change 37126568 for scope
WARN[0580] Error reading from server: EOF
DEBU[0580] Got command 2
WARN[0580] Disconnected from Postgres.
DEBU[0580] Sending message type Terminate (88) length 5
DEBU[0580] Ignoring change 0.0.0 which we already processed
DEBU[0580] Closing TCP connection
The postgres logs look like this:
db_1 | 2016-11-04 23:50:29 UTC [561-1] postgres@seatme LOG: Logical decoding output in protobuf format
db_1 | 2016-11-04 23:50:29 UTC [561-2] postgres@seatme CONTEXT: slot "replication1", output plugin "transicator_output", in the startup callback
db_1 | 2016-11-04 23:50:29 UTC [561-3] postgres@seatme LOG: starting logical decoding for slot "replication1"
db_1 | 2016-11-04 23:50:29 UTC [561-4] postgres@seatme DETAIL: streaming transactions committing after 0/231DF98, reading WAL from 0/22A51C0
db_1 | 2016-11-04 23:50:29 UTC [561-5] postgres@seatme LOG: logical decoding found consistent point at 0/22A51C0
db_1 | 2016-11-04 23:50:29 UTC [561-6] postgres@seatme DETAIL: Logical decoding will begin using saved snapshot.
db_1 | 2016-11-04 23:50:30 UTC [1-330] LOG: server process (PID 561) was terminated by signal 11: Segmentation fault
db_1 | 2016-11-04 23:50:30 UTC [1-331] LOG: terminating any other active server processes
db_1 | 2016-11-04 23:50:30 UTC [1-332] LOG: all server processes terminated; reinitializing
db_1 | 2016-11-04 23:50:30 UTC [562-1] LOG: database system was interrupted; last known up at 2016-11-04 23:50:24 UTC
db_1 | 2016-11-04 23:50:30 UTC [562-2] LOG: database system was not properly shut down; automatic recovery in progress
The behavior is the same with and without my patch for #4. I think finding the memory error is going to require running valgrind against the postgres server itself.