
Fix transaction isolation issues #393

Merged · 8 commits · May 29, 2024

Conversation

@neekolas (Collaborator) commented May 28, 2024

tl;dr

We were getting an error when trying to perform concurrent identity updates, even when they were on different inbox_ids and with different addresses. The error was:

ERROR: could not serialize access due to read/write dependencies among transactions (SQLSTATE 40001)

This PR adds a test to detect these issues and resolves them by lowering the transaction isolation level.


Why was this happening?

I isolated the issue to the GetAllInboxLogs query. When I replace that query with a hardcoded empty result ([]) and run only CreateIdentity updates, the issue goes away entirely. All other behaviour is the same for CreateIdentity, since there are no existing inbox logs (it's the first update).

This post explains how you can run into this scenario with serializable transactions and empty result sets. The gist is that serializable transactions try to lock data as narrowly as your query, but sometimes they can only lock larger portions of the heap. I've confirmed in the query planner that this query sometimes leads to a heap scan that touches 4 rows and sometimes uses an index scan that touches 0 rows.
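
To make the failure mode concrete, here is a minimal two-session sketch (table and column names are assumptions, not the actual schema): two SERIALIZABLE transactions each read an empty result for a different inbox_id, their predicate locks land on the same index page, and the second commit aborts.

```sql
-- Two concurrent SERIALIZABLE transactions, A and B, shown in execution order.

-- A:
BEGIN ISOLATION LEVEL SERIALIZABLE;
SELECT * FROM inbox_log WHERE inbox_id = 'inbox-a';   -- 0 rows; takes a predicate (SIRead) lock

-- B:
BEGIN ISOLATION LEVEL SERIALIZABLE;
SELECT * FROM inbox_log WHERE inbox_id = 'inbox-b';   -- 0 rows; may land on the same index page

-- A:
INSERT INTO inbox_log (inbox_id, sequence_id) VALUES ('inbox-a', 1);
COMMIT;

-- B:
INSERT INTO inbox_log (inbox_id, sequence_id) VALUES ('inbox-b', 1);
COMMIT;   -- ERROR: could not serialize access due to read/write dependencies (SQLSTATE 40001)
```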

What's the fix?

I changed the way we do locking to use pg_advisory_xact_lock instead of relying on SELECT ... FOR UPDATE. The advantage of advisory locks is that they behave the same whether or not there are existing inbox_log entries in the DB, since they don't rely on locking any particular row; the lock is taken on the inbox_id itself. All the locking really buys us is a guarantee that if a transaction detects a conflict at commit, the previous transaction will have finished by the time it retries. Otherwise you could retry 3X before the first transaction ever finished.
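
Roughly, per writer transaction (the hashtext() key derivation and the literal inbox_id here are just for illustration, not necessarily what the code does):

```sql
BEGIN;
-- Serialize all writers for this inbox; hashtext() folds the text inbox_id into
-- the integer key that pg_advisory_xact_lock() expects.
SELECT pg_advisory_xact_lock(hashtext('inbox-id-123'));
-- ... read the existing inbox_log entries, validate the new update, insert it ...
COMMIT;   -- transaction-scoped advisory locks are released automatically on commit/rollback
```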

I also changed the transaction isolation level to repeatable_read. I don't actually think this gives us the protection we need, and it will have to be revisited: two transactions can both run against the same initial set of identity updates, both can be valid on their own, and both will get written.
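
To illustrate the gap (schema and key derivation are assumptions): even with the advisory lock inside the transaction, the second transaction's snapshot is taken by its first statement, before the first transaction commits, so both can validate against the same prior state and both writes land.

```sql
-- Two concurrent REPEATABLE READ transactions for the same inbox, shown in execution order.

-- A:
BEGIN ISOLATION LEVEL REPEATABLE READ;
SELECT pg_advisory_xact_lock(hashtext('inbox-x'));                   -- acquired immediately
SELECT max(sequence_id) FROM inbox_log WHERE inbox_id = 'inbox-x';   -- sees 5

-- B:
BEGIN ISOLATION LEVEL REPEATABLE READ;
SELECT pg_advisory_xact_lock(hashtext('inbox-x'));                   -- blocks on A, but B's snapshot is already taken

-- A:
INSERT INTO inbox_log (inbox_id, sequence_id) VALUES ('inbox-x', 6);
COMMIT;                                                              -- releases the advisory lock; B unblocks

-- B:
SELECT max(sequence_id) FROM inbox_log WHERE inbox_id = 'inbox-x';   -- still sees 5 (old snapshot)
INSERT INTO inbox_log (inbox_id, sequence_id) VALUES ('inbox-x', 6);
COMMIT;                                                              -- succeeds: no shared row was modified
```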

What we could do to get the proper protection is add a nullable last_sequence_id field to each row with a unique index covering (inbox_id, last_sequence_id). That would guarantee Postgres checks for conflicts and fails if two rows are both written claiming the same previous state. But it feels a little heavy-handed, and I'd like to explore lighter ways of getting the same thing.
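
A rough sketch of that shape (column and index names are made up):

```sql
ALTER TABLE inbox_log ADD COLUMN last_sequence_id BIGINT;   -- nullable: NULL for an inbox's first entry
CREATE UNIQUE INDEX inbox_log_inbox_id_last_sequence_id_idx
    ON inbox_log (inbox_id, last_sequence_id);
-- Two concurrent writes claiming the same previous state now contend on the same
-- index entry, and the loser fails with a unique_violation instead of committing.
```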

Another idea is to have a separate table that just stores (inbox_id, last_sequence_id) and have every update write to that table. That would also get the same result, but without bloating our actual table. Postgres will detect conflicts if two transactions try to write to the same row at REPEATABLE READ.
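
Something like (names assumed):

```sql
CREATE TABLE inbox_log_head (
    inbox_id         TEXT   PRIMARY KEY,
    last_sequence_id BIGINT NOT NULL
);

-- Every identity update bumps the head row in the same transaction as the inbox_log insert:
INSERT INTO inbox_log_head (inbox_id, last_sequence_id)
VALUES ($1, $2)
ON CONFLICT (inbox_id) DO UPDATE SET last_sequence_id = EXCLUDED.last_sequence_id;
-- At REPEATABLE READ, the second of two concurrent transactions touching the same head
-- row errors out ("could not serialize access due to concurrent update").
```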

@neekolas requested review from richardhuaaa and insipx May 28, 2024 06:21
@neekolas marked this pull request as ready for review May 28, 2024 06:21
@bwcDvorak added the inbox-id label (Work to support the creation of Inbox IDs) May 28, 2024
@richardhuaaa (Contributor)

Great finds on this investigation, simple solutions for a complex problem. Let me know if I can help anywhere. Have some comments/questions, also learning at the same time as you.

> We were getting an error when trying to perform concurrent identity updates

Are we getting this error in spite of the retries? I would have expected the retries to succeed, otherwise we won't be able to recover in the event that we get a true serialization conflict. Would it be worth adding exponential backoff with jitter, to prevent the conflicting retries happening simultaneously?

> The gist is that serializable transactions try to lock data as narrowly as your query, but sometimes they can only lock larger portions of the heap

The second answer in the post you linked states: "Since the table is small, both a and b would end up on the same index page (the root page). If the index is empty, the whole index will be locked... If you enter some more rows and use values that go on different index pages, there would be no serialization error." Is it simply the case that with a bigger table size we wouldn't have these conflicts, and relying on retries would do the trick while the table is small?

> I also changed the transaction isolation level to repeatable_read. I don't actually think this gives us the protection we need, and it will have to be revisited: two transactions can both run against the same initial set of identity updates, both can be valid on their own, and both will get written.

Isn't this prevented by the advisory lock?

@neekolas (Collaborator, Author)

> Are we getting this error in spite of the retries?

Yes, but increasing the number of retries or decreasing the level of concurrency can get it to pass.

> Is it simply the case that with a bigger table size we wouldn't have these conflicts, and relying on retries would do the trick while the table is small?

That's what I thought, so yesterday I wrote 10k entries into the table and tried it. Same issue. Maybe every empty result set lands on the same page, so it'll always be an issue for CreateInbox but subsequent updates will be fine? Idk

> Isn't this prevented by the advisory lock?

In this case, Postgres isolation is actually too good. Both transactions will be working with a snapshot of the DB state before either transaction committed. This is true even on the lowest isolation level Postgres offers.

@richardhuaaa (Contributor)

> That's what I thought, so yesterday I wrote 10k entries into the table and tried it. Same issue.

Double-checking, that's 10k entries with different inbox_ids, right? At that point I would expect the pg_locks table to switch from recording the relation locktype to the page locktype. Serializable is such a beautiful abstraction that it would be a shame if it falls over just on this simple case.
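
One way to check while the test runs is to watch the predicate locks from a separate session (SIReadLock is the mode SSI uses; this is a generic inspection query, not something in the PR):

```sql
SELECT locktype, relation::regclass AS relation, page, tuple
FROM pg_locks
WHERE mode = 'SIReadLock';
-- With more rows spread across index pages, the locktype column should shift
-- from 'relation' toward 'page'/'tuple'.
```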

> In this case, Postgres isolation is actually too good. Both transactions will be working with a snapshot of the DB state before either transaction committed. This is true even on the lowest isolation level Postgres offers.

Good point! Would it solve the problem if we acquired the advisory lock before the transaction started, and released it after?

@neekolas (Collaborator, Author)

> Good point! Would it solve the problem if we acquired the advisory lock before the transaction started, and released it after?

Definitely might. I haven't tried that yet. There's a bit of nuance to acquiring locks outside of a transaction, since they don't automatically expire. I think you need to set a timeout in the DB to make sure you can't permanently deadlock.
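
A sketch of that variant, under the assumptions that hashtext() is used for the key and that lock_timeout covers the advisory-lock wait; session-level advisory locks outlive transactions, so they need an explicit unlock:

```sql
SET lock_timeout = '5s';                              -- fail fast instead of waiting forever
SELECT pg_advisory_lock(hashtext('inbox-id-123'));    -- session-level: not tied to any transaction

BEGIN;
-- ... read inbox logs, validate, write the identity update ...
COMMIT;

SELECT pg_advisory_unlock(hashtext('inbox-id-123'));  -- must be released on every code path
-- The lock is dropped if the connection closes, but not on ROLLBACK, so the
-- application has to unlock even when the transaction fails.
```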

@neekolas (Collaborator, Author)

@richardhuaaa In any case, WDYT about merging this PR as-is? It will fix our issues with test parallelism and prevent issues with the platform SDK tests.

Then we can tinker with some of the suggestions. It does feel worthwhile to try the lock-before-the-transaction-starts approach and see how it feels.

@richardhuaaa (Contributor) left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Happy to unblock for now! For clarity on the next stage: the last_sequence_id approach also seems fine. My personal order of preference would be: 1) stick with serializable isolation if it works properly with more data, 2) acquire the lock before the transaction if it releases properly in edge cases, 3) use last_sequence_id. But really, any of them seems like it will work; it's just a matter of simplicity going forward.

@neekolas merged commit b0064d6 into main May 29, 2024
3 checks passed
@neekolas deleted the nm/transaction-isolation-issues branch May 29, 2024 20:31