Bug Report: Premature buffering stop during concurrent reparenting can lead to query failures #16438
Comments
@deepthi @vmg This wasn't an issue in v17 and earlier with the `healthcheck` buffer implementation. I'm a bit at a loss as to how this could be fixed. Buffering starts because the
This is actually not exclusive to
I'm trying to reproduce the problem in e2e tests in Vitess, but so far haven't had much success. Here is the test that I have added - #17013. I have tried running this in a loop to see if it fails, but after 237 runs, I haven't found a single failure. @arthurschreiber @deepthi could you take a look and see if there is something I'm missing 🙇
Okay folks, I have been working on this for a couple of days now, and here is my progress report. From @arthurschreiber's comments, I was able to reproduce the problem in #17013. I was then able to find the underlying problem. Basically, the order of steps is something like this -
I looked at the old health check implementation of buffering too and here are a couple of noteworthy observations -
With ☝️ information in mind, I have a proposed fix that handles the problem we're seeing. The underlying issue is that when we start buffering, we're not coordinating with the keyspace event watcher, and are relying on a health check to arrive from a PRIMARY tablet stating that it is going non-serving. This is in general working fine for PRS calls (which is why I was unable to reproduce the bug with multiple simultaneous PRS calls), but it is still an issue if the health check from the primary tablet gets lost when

The proposed fix is to mark the shard non-serving in the keyspace event watcher when we start buffering. This is working really well for PRS calls, because it ensures we only unblock the queries after all shards are serving again (i.e. the keyspace event watcher has seen serving PRIMARY health checks for all shards).

This fix is, however, not working well for the case where the user only sets the primary to read-only by directly changing MySQL. Because the primary tablet doesn't know of this change, it still advertises itself as serving. So even when we mark the shard non-serving (via the new fix), any subsequent health check update from the primary makes the keyspace event watcher mark the shard serving again. This wasn't a problem for the old health check implementation because of its check for a larger primary timestamp: it was just ignoring these updates!

This is where I am at. I can augment the current solution so that, instead of just marking the shard non-serving, it also looks at why we're starting buffering and, for the read-only case, waits for a primary with a higher timestamp, like the old code did. But before I make that leap, I actually wanted to revisit what operations we intend to support for buffering with respect to external reparents (the proposed solution already works for PRS). When users are doing external reparents, do they mark the primary not-serving by calling DemotePrimary? Is this something we enforce? Or is it acceptable for the primary to advertise itself as serving even when the external reparent has started and it is not read-only? If the users call

I have pushed my changes with the fix as a

cc - @deepthi @arthurschreiber
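To make the proposal concrete, here is a minimal Go sketch of the idea and of the read-only flaw described above. It is not the actual Vitess code; every name in it (`KeyspaceEventWatcher`, `MarkShardNotServing`, and so on) is a hypothetical stand-in:

```go
package main

import (
	"fmt"
	"sync"
)

// KeyspaceEventWatcher is a hypothetical stand-in for the component that
// decides when a keyspace is consistent and buffered queries may be released.
type KeyspaceEventWatcher struct {
	mu      sync.Mutex
	serving map[string]bool // shard -> has a serving primary
}

func NewKeyspaceEventWatcher() *KeyspaceEventWatcher {
	return &KeyspaceEventWatcher{serving: make(map[string]bool)}
}

// MarkShardNotServing is the proposed hook: when buffering starts for a
// shard, the watcher itself marks it non-serving instead of waiting for a
// (possibly lost) health check from the demoted primary.
func (kew *KeyspaceEventWatcher) MarkShardNotServing(shard string) {
	kew.mu.Lock()
	defer kew.mu.Unlock()
	kew.serving[shard] = false
}

// OnPrimaryHealthCheck records a health check from a shard's primary. Note
// the flaw discussed above: a read-only primary that still advertises itself
// as serving will flip the shard straight back to serving.
func (kew *KeyspaceEventWatcher) OnPrimaryHealthCheck(shard string, isServing bool) {
	kew.mu.Lock()
	defer kew.mu.Unlock()
	kew.serving[shard] = isServing
}

// Consistent reports whether every known shard has a serving primary;
// buffering only stops once this returns true.
func (kew *KeyspaceEventWatcher) Consistent() bool {
	kew.mu.Lock()
	defer kew.mu.Unlock()
	for _, ok := range kew.serving {
		if !ok {
			return false
		}
	}
	return true
}

func main() {
	kew := NewKeyspaceEventWatcher()
	kew.OnPrimaryHealthCheck("20-30", true)
	kew.OnPrimaryHealthCheck("30-40", true)

	// Buffering starts on 30-40: proactively mark it non-serving.
	kew.MarkShardNotServing("30-40")
	fmt.Println(kew.Consistent()) // false: queries stay buffered

	// A stale "serving" health check from the read-only primary incorrectly
	// marks the keyspace consistent again: the failure mode described above.
	kew.OnPrimaryHealthCheck("30-40", true)
	fmt.Println(kew.Consistent()) // true
}
```

The key design point is that the watcher's view flips to non-serving the moment buffering starts, instead of waiting for a health check that may never arrive.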
This is not how we used external reparenting. We are using
@GuptaManan100 Oh, one more thing. We've encountered this issue even when using PRS. Our configuration allows a one-second grace period for transactions to finish when doing a PlannedReparentShard. During this time, the primary health checks will report that the primary is healthy, but executing queries against the primary outside of a pre-existing transaction will fail and will start to trigger the buffering. I'm not sure your suggested fix will be compatible with this.
@arthurschreiber Yes, you're right. Even during PRS, the primary tablet can send a serving health check while it is in the grace period. I have reworked the solution a little bit to account for both cases. This will handle the

In this new proposed solution, when we receive an error that starts buffering, not only do we mark the shard not-serving in the shard state, we also explicitly ask it to wait for the reparent to complete and to see a tablet with a higher primary timestamp. This solution works as intended, but it has a downside that even the

I'm gonna write this up in the PR description as well, after adding a few more tests. I think this is a worthwhile trade-off to accept, but I still felt it should be explicitly pointed out.
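A rough sketch of this reworked idea, under the same caveat that all names here are hypothetical rather than the actual Vitess API: on top of marking the shard not-serving, the shard state remembers the primary term start time seen when buffering began, and only a health check carrying a strictly higher timestamp can mark the shard serving again:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// shardState remembers the primary term start time that was current when
// buffering began; only a health check from a *newer* primary clears it.
type shardState struct {
	serving            bool
	waitingForReparent bool
	termStartAtBuffer  time.Time
}

type KeyspaceEventWatcher struct {
	mu     sync.Mutex
	shards map[string]*shardState
}

func NewKeyspaceEventWatcher() *KeyspaceEventWatcher {
	return &KeyspaceEventWatcher{shards: make(map[string]*shardState)}
}

// StartBuffering marks the shard non-serving and records that we must see
// a primary with a higher term start time before trusting health checks.
func (kew *KeyspaceEventWatcher) StartBuffering(shard string, currentTermStart time.Time) {
	kew.mu.Lock()
	defer kew.mu.Unlock()
	kew.shards[shard] = &shardState{
		serving:            false,
		waitingForReparent: true,
		termStartAtBuffer:  currentTermStart,
	}
}

// OnPrimaryHealthCheck ignores "serving" updates from the old primary:
// while waiting for a reparent, only a strictly newer term start time
// (i.e. a newly elected primary) can mark the shard serving again.
func (kew *KeyspaceEventWatcher) OnPrimaryHealthCheck(shard string, serving bool, termStart time.Time) {
	kew.mu.Lock()
	defer kew.mu.Unlock()
	s, ok := kew.shards[shard]
	if !ok {
		kew.shards[shard] = &shardState{serving: serving, termStartAtBuffer: termStart}
		return
	}
	if s.waitingForReparent && !termStart.After(s.termStartAtBuffer) {
		return // stale update from the demoted primary: ignore it
	}
	s.serving = serving
	s.waitingForReparent = false
}

func main() {
	kew := NewKeyspaceEventWatcher()
	oldTerm := time.Now()
	kew.StartBuffering("30-40", oldTerm)

	// A "serving" health check from the old (read-only) primary is ignored.
	kew.OnPrimaryHealthCheck("30-40", true, oldTerm)
	fmt.Println(kew.shards["30-40"].serving) // false: still buffering

	// A health check from a newly promoted primary unblocks the shard.
	kew.OnPrimaryHealthCheck("30-40", true, oldTerm.Add(time.Second))
	fmt.Println(kew.shards["30-40"].serving) // true
}
```

This also shows the downside mentioned above: until a newer primary actually appears, every "serving" update is discarded, so a failed or abandoned reparent keeps the shard buffering until the buffer timeout.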
@GuptaManan100 That doesn't sound like a terrible drawback. 😄 If I understand correctly, this will mean that if there's a PRS or external reparent that fails, buffering will continue until the buffer timeout is reached. If the buffer runs full during this time, older queries will be evicted and will potentially be successful as long as the old primary becomes writable again. When the buffer timeout is reached, all queries will be evicted and they will again be potentially successful (depending on the state of the old primary). So, if my assumptions are correct, we'll see a small jump in query latency, but will otherwise not be in a worse position than before. 👍
Yes, 100% correct. |
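For illustration, the eviction behavior just confirmed above could be modeled as a bounded buffer that evicts the oldest entry when full and drains everything when the buffering window times out. The real vtgate buffer is more involved, so treat this purely as a toy sketch:

```go
package main

import "fmt"

// buffer is a toy bounded query buffer: when full, the oldest entry is
// evicted (and retried against whatever primary is available); when the
// buffering window times out, everything is drained at once.
type buffer struct {
	maxSize int
	queries []string
}

// add buffers a query, evicting the oldest one if the buffer is full.
// It returns the evicted query, if any.
func (b *buffer) add(q string) (evicted string, ok bool) {
	if len(b.queries) == b.maxSize {
		evicted, ok = b.queries[0], true
		b.queries = b.queries[1:]
	}
	b.queries = append(b.queries, q)
	return evicted, ok
}

// drain is what happens when the buffer timeout is reached: every buffered
// query is released and retried, successfully or not, depending on the
// state of the old primary.
func (b *buffer) drain() []string {
	out := b.queries
	b.queries = nil
	return out
}

func main() {
	b := &buffer{maxSize: 2}
	b.add("q1")
	b.add("q2")
	if evicted, ok := b.add("q3"); ok {
		fmt.Println("evicted early:", evicted) // q1 retried before the timeout
	}
	fmt.Println("drained at timeout:", b.drain()) // [q2 q3]
}
```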
Overview of the Issue
With the `keyspace_events` buffer implementation, we see that sometimes buffering stops before the failover has actually been detected and processed by the `healthcheck` stream, causing buffered queries to be sent to the demoted primary.

Here's the log output from one `vtgate` process:

I think what's happening here is that the primaries of the `20-30` and `30-40` shards went into read-only mode due to the external failover at roughly the same time, which in turn caused buffering to start on both these shards in quick succession.

Once the primary failover on shard `20-30` was done and Vitess was notified about the new primary via a `TabletExternallyReparented` or `PlannedReparentShard` call, the whole keyspace was detected as being consistent again - including the `30-40` shard, which was still in the midst of a failover. This caused the buffering on both the `20-30` and the `30-40` shards to stop, while the `30-40` shard had not been failed over yet.

Queries that performed write operations against the `30-40` shard started noticeably failing until the failover was finished.
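As a toy illustration of this suspected failure mode (again, not the actual vtgate code; all names are made up), the keyspace-level consistency check stops buffering on every shard as soon as all shards look serving, even though the watcher never observed `30-40` becoming non-serving:

```go
package main

import "fmt"

// Toy model of the suspected failure mode. The keyspace event watcher
// tracks a serving flag per shard and declares the keyspace consistent
// (stopping buffering everywhere) once every shard looks serving.
func main() {
	// Both primaries have gone read-only, and failing writes have started
	// buffering on both shards. Crucially, the watcher never observes a
	// not-serving transition for 30-40: its demoted primary still
	// advertises itself as serving.
	serving := map[string]bool{
		"20-30": false, // not-serving transition was observed
		"30-40": true,  // demoted, but still reports serving
	}
	buffering := map[string]bool{"20-30": true, "30-40": true}

	// The failover of 20-30 completes; its new primary reports serving.
	serving["20-30"] = true

	// Every shard now *looks* serving, so the keyspace is declared
	// consistent and buffering stops on both shards, even though 30-40
	// is still in the middle of its failover.
	consistent := true
	for _, ok := range serving {
		consistent = consistent && ok
	}
	if consistent {
		for shard := range buffering {
			delete(buffering, shard)
			fmt.Printf("stopped buffering on %s\n", shard)
		}
	}
	// Buffered writes are now replayed against the demoted primary of
	// 30-40 and fail until that failover actually finishes.
}
```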
Reproduction Steps
N/A
Binary Version
Operating System and Environment details
Log Fragments