
Fix transaction isolation issues #393

Merged · 8 commits · May 29, 2024

Conversation

@neekolas (Collaborator) commented May 28, 2024

tl;dr

We were getting an error when trying to perform concurrent identity updates, even when they were on different inbox_ids and with different addresses. The error was:

ERROR: could not serialize access due to read/write dependencies among transactions (SQLSTATE 40001)

This PR adds a test to detect these issues and resolves them by lowering the transaction isolation level.


Why was this happening?

I isolated the issue to the GetAllInboxLogs query. When I replace that query with a hardcoded empty result ([]) and run only CreateIdentity updates, the issue goes away entirely. All other behaviour is the same for CreateIdentity, since there are no existing inbox logs (it's the first update).

This post explains how you can run into this scenario with serializable transactions and empty result sets. The gist is that serializable transactions try to lock data as narrowly as your query, but sometimes they can only lock larger portions of the heap. I've confirmed in the query planner that this query sometimes leads to a heap scan that touches 4 rows and sometimes uses an index scan that touches 0 rows.
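
To make the failure mode concrete, here is a minimal two-session sketch (table and column names are assumptions, not the actual schema): two SERIALIZABLE transactions each read an empty result for a different inbox_id, their predicate locks land on the same index page, and the second commit aborts.

```sql
-- Two concurrent SERIALIZABLE transactions, A and B, shown in execution order.

-- A:
BEGIN ISOLATION LEVEL SERIALIZABLE;
SELECT * FROM inbox_log WHERE inbox_id = 'inbox-a';   -- 0 rows; takes a predicate (SIRead) lock

-- B:
BEGIN ISOLATION LEVEL SERIALIZABLE;
SELECT * FROM inbox_log WHERE inbox_id = 'inbox-b';   -- 0 rows; may land on the same index page

-- A:
INSERT INTO inbox_log (inbox_id, sequence_id) VALUES ('inbox-a', 1);
COMMIT;

-- B:
INSERT INTO inbox_log (inbox_id, sequence_id) VALUES ('inbox-b', 1);
COMMIT;   -- ERROR: could not serialize access due to read/write dependencies (SQLSTATE 40001)
```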

What's the fix?

I changed the way we do locking to use pg_advisory_xact_lock instead of relying on SELECT ... FOR UPDATE. The advantage of advisory locks is that they behave the same whether or not there are existing inbox_log entries in the DB, since they don't rely on locking any particular row; the lock is taken on the inbox_id itself. All the locking really buys us is a guarantee that if a transaction detects a conflict at commit, the previous transaction will have finished by the time it retries. Otherwise you could retry 3X before the first transaction ever finished.
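
Roughly, per writer transaction (the hashtext() key derivation and the literal inbox_id here are just for illustration, not necessarily what the code does):

```sql
BEGIN;
-- Serialize all writers for this inbox; hashtext() folds the text inbox_id into
-- the integer key that pg_advisory_xact_lock() expects.
SELECT pg_advisory_xact_lock(hashtext('inbox-id-123'));
-- ... read the existing inbox_log entries, validate the new update, insert it ...
COMMIT;   -- transaction-scoped advisory locks are released automatically on commit/rollback
```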

I also changed the transaction isolation level to repeatable_read. I don't actually think this gives us the protection we need, and it will have to be revisited: two transactions can both run against the same initial set of identity updates, both can be valid on their own, and both will get written.
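
To illustrate the gap (schema and key derivation are assumptions): even with the advisory lock inside the transaction, the second transaction's snapshot is taken by its first statement, before the first transaction commits, so both can validate against the same prior state and both writes land.

```sql
-- Two concurrent REPEATABLE READ transactions for the same inbox, shown in execution order.

-- A:
BEGIN ISOLATION LEVEL REPEATABLE READ;
SELECT pg_advisory_xact_lock(hashtext('inbox-x'));                   -- acquired immediately
SELECT max(sequence_id) FROM inbox_log WHERE inbox_id = 'inbox-x';   -- sees 5

-- B:
BEGIN ISOLATION LEVEL REPEATABLE READ;
SELECT pg_advisory_xact_lock(hashtext('inbox-x'));                   -- blocks on A, but B's snapshot is already taken

-- A:
INSERT INTO inbox_log (inbox_id, sequence_id) VALUES ('inbox-x', 6);
COMMIT;                                                              -- releases the advisory lock; B unblocks

-- B:
SELECT max(sequence_id) FROM inbox_log WHERE inbox_id = 'inbox-x';   -- still sees 5 (old snapshot)
INSERT INTO inbox_log (inbox_id, sequence_id) VALUES ('inbox-x', 6);
COMMIT;                                                              -- succeeds: no shared row was modified
```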

What we could do to get the proper protection is add a nullable last_sequence_id field to each row with a unique index covering (inbox_id, last_sequence_id). That would guarantee Postgres checks for conflicts and fails if two rows are both written claiming the same previous state. But it feels a little heavy-handed, and I'd like to explore lighter ways of getting the same thing.
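
A rough sketch of that shape (column and index names are made up):

```sql
ALTER TABLE inbox_log ADD COLUMN last_sequence_id BIGINT;   -- nullable: NULL for an inbox's first entry
CREATE UNIQUE INDEX inbox_log_inbox_id_last_sequence_id_idx
    ON inbox_log (inbox_id, last_sequence_id);
-- Two concurrent writes claiming the same previous state now contend on the same
-- index entry, and the loser fails with a unique_violation instead of committing.
```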

Another idea is to have a separate table that just stores (inbox_id, last_sequence_id) and have every update write to that table. That would also get the same result, but without bloating our actual table. Postgres will detect conflicts if two transactions try to write to the same row at REPEATABLE READ.
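
Something like (names assumed):

```sql
CREATE TABLE inbox_log_head (
    inbox_id         TEXT   PRIMARY KEY,
    last_sequence_id BIGINT NOT NULL
);

-- Every identity update bumps the head row in the same transaction as the inbox_log insert:
INSERT INTO inbox_log_head (inbox_id, last_sequence_id)
VALUES ($1, $2)
ON CONFLICT (inbox_id) DO UPDATE SET last_sequence_id = EXCLUDED.last_sequence_id;
-- At REPEATABLE READ, the second of two concurrent transactions touching the same head
-- row errors out ("could not serialize access due to concurrent update").
```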

@neekolas requested review from richardhuaaa and insipx May 28, 2024 06:21
@neekolas marked this pull request as ready for review May 28, 2024 06:21
@bwcDvorak added the inbox-id label (Work to support the creation of Inbox IDs) May 28, 2024
@richardhuaaa (Contributor)

Great finds on this investigation, simple solutions for a complex problem. Let me know if I can help anywhere. Have some comments/questions, also learning at the same time as you.

> We were getting an error when trying to perform concurrent identity updates

Are we getting this error in spite of the retries? I would have expected the retries to succeed, otherwise we won't be able to recover in the event that we get a true serialization conflict. Would it be worth adding exponential backoff with jitter, to prevent the conflicting retries happening simultaneously?

> The gist is that serializable transactions try to lock data as narrowly as your query, but sometimes they can only lock larger portions of the heap

The second answer in the post you linked states: "Since the table is small, both a and b would end up on the same index page (the root page). If the index is empty, the whole index will be locked... If you enter some more rows and use values that go on different index pages, there would be no serialization error." Is it simply the case that with a bigger table size we wouldn't have these conflicts, and relying on retries would do the trick while the table is small?

> I also changed the transaction isolation level to repeatable_read. I don't actually think this gives us the protection we need, and it will have to be revisited: two transactions can both run against the same initial set of identity updates, both can be valid on their own, and both will get written.

Isn't this prevented by the advisory lock?

@neekolas (Collaborator, Author)

> Are we getting this error in spite of the retries?

Yes, but increasing the number of retries or decreasing the level of concurrency can get it to pass.

> Is it simply the case that with a bigger table size we wouldn't have these conflicts, and relying on retries would do the trick while the table is small?

That's what I thought, so yesterday I wrote 10k entries into the table and tried it. Same issue. Maybe every empty result set lands on the same page, so it'll always be an issue for CreateInbox but subsequent updates will be fine? Idk

> Isn't this prevented by the advisory lock?

In this case, Postgres isolation is actually too good. Both transactions will be working with a snapshot of the DB state before either transaction committed. This is true even on the lowest isolation level Postgres offers.

@richardhuaaa (Contributor)

> That's what I thought, so yesterday I wrote 10k entries into the table and tried it. Same issue.

Double-checking, that's 10k entries with different inbox_ids, right? At that point I would expect the pg_locks table to switch from recording the relation locktype to the page locktype. Serializable is such a beautiful abstraction that it would be a shame if it falls over just on this simple case.
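
One way to check while the test runs is to watch the predicate locks from a separate session (SIReadLock is the mode SSI uses; this is a generic inspection query, not something in the PR):

```sql
SELECT locktype, relation::regclass AS relation, page, tuple
FROM pg_locks
WHERE mode = 'SIReadLock';
-- With more rows spread across index pages, the locktype column should shift
-- from 'relation' toward 'page'/'tuple'.
```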

> In this case, Postgres isolation is actually too good. Both transactions will be working with a snapshot of the DB state before either transaction committed. This is true even on the lowest isolation level Postgres offers.

Good point! Would it solve the problem if we acquired the advisory lock before the transaction started, and released it after?

@neekolas (Collaborator, Author)

> Good point! Would it solve the problem if we acquired the advisory lock before the transaction started, and released it after?

Definitely might. I haven't tried that yet. There's a bit of nuance to acquiring locks outside of a transaction, since they don't automatically expire. I think you need to set a timeout in the DB to make sure you can't permanently deadlock.
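
A sketch of that variant, under the assumptions that hashtext() is used for the key and that lock_timeout covers the advisory-lock wait; session-level advisory locks outlive transactions, so they need an explicit unlock:

```sql
SET lock_timeout = '5s';                              -- fail fast instead of waiting forever
SELECT pg_advisory_lock(hashtext('inbox-id-123'));    -- session-level: not tied to any transaction

BEGIN;
-- ... read inbox logs, validate, write the identity update ...
COMMIT;

SELECT pg_advisory_unlock(hashtext('inbox-id-123'));  -- must be released on every code path
-- The lock is dropped if the connection closes, but not on ROLLBACK, so the
-- application has to unlock even when the transaction fails.
```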

@neekolas (Collaborator, Author)

@richardhuaaa In any case, WDYT about merging this PR as-is? It will fix our issues with test parallelism and prevent issues with the platform SDK tests.

Then we can tinker with some of the suggestions. It does feel worthwhile to try the lock-before-the-transaction-starts approach and see how it feels.

@richardhuaaa (Contributor) left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Happy to unblock for now! For clarity on the next stage: the last_sequence_id approach also seems fine. My personal order of preference would be: 1) stick with serializable isolation if it works properly with more data, 2) acquire the lock before the transaction if it releases properly in edge cases, 3) use last_sequence_id. But really, any of them seems like it will work; it's just a matter of simplicity going forward.

@neekolas merged commit b0064d6 into main May 29, 2024
3 checks passed
@neekolas deleted the nm/transaction-isolation-issues branch May 29, 2024 20:31