
Postgres crash in 9.5.4 once replication is fully caught up #17

Closed
kriswehner opened this issue Nov 7, 2016 · 8 comments · Fixed by #20

Comments

@kriswehner

One thing I noticed while working on the fix for #4 is that my postgres crashes regularly when (apparently) the replicator is almost caught up with the stream.

In debug mode, the changeserver emits:
DEBU[0580] Received message type CopyData
DEBU[0580] Received change 37106520 for scope
DEBU[0580] Got message type CopyData (100) length 177
DEBU[0580] Received message type CopyData
DEBU[0580] Received change 37126568 for scope
WARN[0580] Error reading from server: EOF
DEBU[0580] Got command 2
WARN[0580] Disconnected from Postgres.
DEBU[0580] Sending message type Terminate (88) length 5
DEBU[0580] Ignoring change 0.0.0 which we already processed
DEBU[0580] Closing TCP connection

The postgres logs look like this:
db_1 | 2016-11-04 23:50:29 UTC [561-1] postgres@seatme LOG: Logical decoding output in protobuf format
db_1 | 2016-11-04 23:50:29 UTC [561-2] postgres@seatme CONTEXT: slot "replication1", output plugin "transicator_output", in the startup callback
db_1 | 2016-11-04 23:50:29 UTC [561-3] postgres@seatme LOG: starting logical decoding for slot "replication1"
db_1 | 2016-11-04 23:50:29 UTC [561-4] postgres@seatme DETAIL: streaming transactions committing after 0/231DF98, reading WAL from 0/22A51C0
db_1 | 2016-11-04 23:50:29 UTC [561-5] postgres@seatme LOG: logical decoding found consistent point at 0/22A51C0
db_1 | 2016-11-04 23:50:29 UTC [561-6] postgres@seatme DETAIL: Logical decoding will begin using saved snapshot.
db_1 | 2016-11-04 23:50:30 UTC [1-330] LOG: server process (PID 561) was terminated by signal 11: Segmentation fault
db_1 | 2016-11-04 23:50:30 UTC [1-331] LOG: terminating any other active server processes
db_1 | 2016-11-04 23:50:30 UTC [1-332] LOG: all server processes terminated; reinitializing
db_1 | 2016-11-04 23:50:30 UTC [562-1] LOG: database system was interrupted; last known up at 2016-11-04 23:50:24 UTC
db_1 | 2016-11-04 23:50:30 UTC [562-2] LOG: database system was not properly shut down; automatic recovery in progress

The behavior is the same with and without my patch for #4. I think the fix is going to require some work with valgrind on the postgres server itself to find out where the memory error is.

@gbrail
Contributor

gbrail commented Nov 7, 2016

What platform and PG version are you running? It might be high time to build some stress tests.

@kriswehner
Author

It's Ubuntu trusty (14.04) with PG 9.5.4.

@kriswehner
Author

Also: it doesn't take stress to get it to crash; it usually crashes pretty much immediately. I'll see if I can get a slick repro case together.

@kriswehner
Author

From my local testing, this appears to be reproducible once I do an update on a table that doesn't have _apid_scope on it. Does that make any sense? I'm having trouble getting postgres to dump a core file.

@kriswehner
Author

I did get a stack trace:
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `postgres: 9.5/main: wal sender process postgres 192.168.99.1(54979) idle '.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 protobuf_c_message_get_packed_size (message=0x2d383265392d3165) at protobuf-c/protobuf-c.c:643
643 ASSERT_IS_MESSAGE(message);
(gdb) bt
#0 protobuf_c_message_get_packed_size (message=0x2d383265392d3165) at protobuf-c/protobuf-c.c:643
#1  0x00007f860872da92 in repeated_field_get_packed_size (member=<optimized out>, count=86,
    field=0x7f8608b3c9f8 <common.change_pb.field_descriptors+504>) at protobuf-c/protobuf-c.c:605
#2  protobuf_c_message_get_packed_size (message=message@entry=0x7ffe545c82c0) at protobuf-c/protobuf-c.c:660
#3  0x00007f8608939051 in common__change_pb__get_packed_size (message=message@entry=0x7ffe545c82c0) at transicator.pb-c.c:106
#4  0x00007f860893896c in transicatorOutputChangeProto (ctx=0x5566c36c9720, txn=<optimized out>, relation=<optimized out>,
    change=0x5566c36da710, state=<optimized out>) at output_proto.c:235
#5  0x00007f8608937c05 in outputChange (ctx=<optimized out>, txn=<optimized out>, relation=<optimized out>,
    change=<optimized out>) at transicator_output.c:111
#6  0x00005566c17c2d9a in change_cb_wrapper (cache=<optimized out>, txn=<optimized out>, relation=<optimized out>,
    change=<optimized out>)
    at /build/postgresql-9.5-lMB55R/postgresql-9.5-9.5.5/build/../src/backend/replication/logical/logical.c:716
#7 0x00005566c17c7a1f in ReorderBufferCommit (rb=0x5566c36da4f0, xid=xid@entry=961, commit_lsn=commit_lsn@entry=37954720,
end_lsn=end_lsn@entry=37954768, commit_time=commit_time@entry=531956369630037, origin_id=origin_id@entry=0,
origin_lsn=origin_lsn@entry=0)

@kriswehner
Author

I managed to get a repro case involving our schema and test data, hooked valgrind up to postgres, did a bunch of code spelunking with @chuckg, and have a fix.

I would like to open a pull request, but I don't seem to see my xid changes on master of the main project, so I'm a little confused about the state of the code.

The diff is here:
master...seatme:fix/off-by-one-crash#diff-80a5322076a3f16e1ab52f2b4c655a28L90

Basically, there's an off-by-one error in the tuple_to_proto function, triggered whenever the expression

if ((attr->attisdropped) || (attr->attnum < 0)) {
        continue;
}

is true. Once that branch trips, the code keeps indexing the output array with [natt], which is now counting off the end of the allocation.
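
To spell that out, here is a minimal sketch of the fixed loop shape. The names "columns" and "make_column_proto" are hypothetical, not the plugin's actual identifiers, and it assumes the output array is sized for the emitted columns only; the point is that the output array needs its own index, which only advances when a column is actually emitted:

int out = 0;    /* index into the output array of column protos */
for (natt = 0; natt < tupdesc->natts; natt++) {
    Form_pg_attribute attr = tupdesc->attrs[natt];

    if ((attr->attisdropped) || (attr->attnum < 0)) {
        continue;    /* dropped or system column: nothing is emitted */
    }

    /* Index with "out", not "natt": once a column has been skipped,
       natt runs ahead of the output array and eventually past its end. */
    columns[out++] = make_column_proto(attr, tuple);
}

Indexing with natt only works while no attribute has ever been dropped, which would explain why the crash shows up on some tables and not others.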

I also moved the OutputPluginWrite() call outside the MemoryContextSwitchTo() to make it consistent with the other output plugins we could find. I'm not sure if it matters, but it seemed good for safety's and consistency's sake.
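
For reference, the shape used by contrib/test_decoding (the sample output plugin that ships with PostgreSQL) is roughly the following, abridged: serialize inside the plugin's private memory context, switch back and reset it, and only then write:

old = MemoryContextSwitchTo(data->context);     /* plugin's own context */

OutputPluginPrepareWrite(ctx, true);
/* ... append the serialized change to ctx->out ... */

MemoryContextSwitchTo(old);
MemoryContextReset(data->context);              /* free per-change allocations */

OutputPluginWrite(ctx, true);                   /* write after switching back */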

@kriswehner
Author

(I now see what's going on with master... I'll get a clean diff and pull request together tomorrow.)

@gbrail
Contributor

gbrail commented Nov 9, 2016

Thank you for following up on this!

