
Postgres crash in 9.5.4 once replication is fully caught up #17

Closed
kriswehner opened this issue Nov 7, 2016 · 8 comments · Fixed by #20

Comments

@kriswehner

One thing I noticed while working on the fix for #4 is that my postgres crashes regularly when (apparently) the replicator is almost caught up with the stream.

In debug mode, the changeserver emits:
DEBU[0580] Received message type CopyData
DEBU[0580] Received change 37106520 for scope
DEBU[0580] Got message type CopyData (100) length 177
DEBU[0580] Received message type CopyData
DEBU[0580] Received change 37126568 for scope
WARN[0580] Error reading from server: EOF
DEBU[0580] Got command 2
WARN[0580] Disconnected from Postgres.
DEBU[0580] Sending message type Terminate (88) length 5
DEBU[0580] Ignoring change 0.0.0 which we already processed
DEBU[0580] Closing TCP connection

The postgres logs look like this:
db_1 | 2016-11-04 23:50:29 UTC [561-1] postgres@seatme LOG: Logical decoding output in protobuf format
db_1 | 2016-11-04 23:50:29 UTC [561-2] postgres@seatme CONTEXT: slot "replication1", output plugin "transicator_output", in the startup callback
db_1 | 2016-11-04 23:50:29 UTC [561-3] postgres@seatme LOG: starting logical decoding for slot "replication1"
db_1 | 2016-11-04 23:50:29 UTC [561-4] postgres@seatme DETAIL: streaming transactions committing after 0/231DF98, reading WAL from 0/22A51C0
db_1 | 2016-11-04 23:50:29 UTC [561-5] postgres@seatme LOG: logical decoding found consistent point at 0/22A51C0
db_1 | 2016-11-04 23:50:29 UTC [561-6] postgres@seatme DETAIL: Logical decoding will begin using saved snapshot.
db_1 | 2016-11-04 23:50:30 UTC [1-330] LOG: server process (PID 561) was terminated by signal 11: Segmentation fault
db_1 | 2016-11-04 23:50:30 UTC [1-331] LOG: terminating any other active server processes
db_1 | 2016-11-04 23:50:30 UTC [1-332] LOG: all server processes terminated; reinitializing
db_1 | 2016-11-04 23:50:30 UTC [562-1] LOG: database system was interrupted; last known up at 2016-11-04 23:50:24 UTC
db_1 | 2016-11-04 23:50:30 UTC [562-2] LOG: database system was not properly shut down; automatic recovery in progress

The behavior is the same with and without my patch for #4. I think the fix is going to require some work with valgrind on the postgres server itself to find out where the memory error is.

@gbrail
Contributor

gbrail commented Nov 7, 2016

What platform and PG version are you running? It might be high time to build some stress tests.

@kriswehner
Author

It's Ubuntu trusty (14.04) with PG 9.5.4.

@kriswehner
Author

Also: it doesn't take stress to get it to crash; it usually crashes pretty much immediately. I'll see if I can get a slick repro case together.

@kriswehner
Author

From my local testing, this appears to be reproducible once I do an update on a table that doesn't have _apid_scope on it. Does that make any sense? I'm having trouble getting postgres to dump a core file.

@kriswehner
Author

I did get a stack trace:
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `postgres: 9.5/main: wal sender process postgres 192.168.99.1(54979) idle '.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 protobuf_c_message_get_packed_size (message=0x2d383265392d3165) at protobuf-c/protobuf-c.c:643
643 ASSERT_IS_MESSAGE(message);
(gdb) bt
#0 protobuf_c_message_get_packed_size (message=0x2d383265392d3165) at protobuf-c/protobuf-c.c:643
#1  0x00007f860872da92 in repeated_field_get_packed_size (member=<optimized out>, count=86,
    field=0x7f8608b3c9f8 <common.change_pb.field_descriptors+504>) at protobuf-c/protobuf-c.c:605
#2  protobuf_c_message_get_packed_size (message=message@entry=0x7ffe545c82c0) at protobuf-c/protobuf-c.c:660
#3  0x00007f8608939051 in common__change_pb__get_packed_size (message=message@entry=0x7ffe545c82c0) at transicator.pb-c.c:106
#4  0x00007f860893896c in transicatorOutputChangeProto (ctx=0x5566c36c9720, txn=<optimized out>, relation=<optimized out>,
    change=0x5566c36da710, state=<optimized out>) at output_proto.c:235
#5  0x00007f8608937c05 in outputChange (ctx=<optimized out>, txn=<optimized out>, relation=<optimized out>,
    change=<optimized out>) at transicator_output.c:111
#6  0x00005566c17c2d9a in change_cb_wrapper (cache=<optimized out>, txn=<optimized out>, relation=<optimized out>,
    change=<optimized out>)
    at /build/postgresql-9.5-lMB55R/postgresql-9.5-9.5.5/build/../src/backend/replication/logical/logical.c:716
#7 0x00005566c17c7a1f in ReorderBufferCommit (rb=0x5566c36da4f0, xid=xid@entry=961, commit_lsn=commit_lsn@entry=37954720,
end_lsn=end_lsn@entry=37954768, commit_time=commit_time@entry=531956369630037, origin_id=origin_id@entry=0,
origin_lsn=origin_lsn@entry=0)

@kriswehner
Author

I managed to get a repro case involving our schema and test data, hooked valgrind up to postgres, did a bunch of code spelunking with @chuckg, and have a fix.

I would like to open a pull request, but I don't seem to see my xid changes on master of the main project, so I'm a little confused about the state of the code.

The diff is here:
master...seatme:fix/off-by-one-crash#diff-80a5322076a3f16e1ab52f2b4c655a28L90

Basically, there's an off-by-one error in the tuple_to_proto function, triggered whenever the expression

if ((attr->attisdropped) || (attr->attnum < 0)) {
        continue;
}

is true. Once that branch trips, the code keeps indexing the output array with [natt], which is now counting off the end of the allocation.
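
To spell that out, here is a minimal sketch of the fixed loop shape. The names "columns" and "make_column_proto" are hypothetical, not the plugin's actual identifiers, and it assumes the output array is sized for the emitted columns only; the point is that the output array needs its own index, which only advances when a column is actually emitted:

int out = 0;    /* index into the output array of column protos */
for (natt = 0; natt < tupdesc->natts; natt++) {
    Form_pg_attribute attr = tupdesc->attrs[natt];

    if ((attr->attisdropped) || (attr->attnum < 0)) {
        continue;    /* dropped or system column: nothing is emitted */
    }

    /* Index with "out", not "natt": once a column has been skipped,
       natt runs ahead of the output array and eventually past its end. */
    columns[out++] = make_column_proto(attr, tuple);
}

Indexing with natt only works while no attribute has ever been dropped, which would explain why the crash shows up on some tables and not others.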

I also moved the OutputPluginWrite() call outside the MemoryContextSwitchTo() to make it consistent with the other output plugins we could find. I'm not sure if it matters, but it seemed good for safety's and consistency's sake.
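
For reference, the shape used by contrib/test_decoding (the sample output plugin that ships with PostgreSQL) is roughly the following, abridged: serialize inside the plugin's private memory context, switch back and reset it, and only then write:

old = MemoryContextSwitchTo(data->context);     /* plugin's own context */

OutputPluginPrepareWrite(ctx, true);
/* ... append the serialized change to ctx->out ... */

MemoryContextSwitchTo(old);
MemoryContextReset(data->context);              /* free per-change allocations */

OutputPluginWrite(ctx, true);                   /* write after switching back */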

@kriswehner
Author

(I now see what's going on with master... I'll get a clean diff and pull request together tomorrow.)

@gbrail
Contributor

gbrail commented Nov 9, 2016

Thank you for following up on this!

