
Bug Report: Premature buffering stop during concurrent reparenting can lead to query failures #16438

Closed
arthurschreiber opened this issue Jul 22, 2024 · 9 comments · Fixed by #17013

Comments

@arthurschreiber
Contributor

arthurschreiber commented Jul 22, 2024

Overview of the Issue

With the keyspace_events buffer implementation, we see that sometimes buffering stops before the failover has actually been detected and processed by the healthcheck stream, causing buffered queries to be sent to the demoted primary.

Here's the log output from one vtgate process:

{"_time":"2024-07-22T13:23:30.708+00:00","message":"Starting buffering for shard: <redacted>/20-30 (window: 5s, size: 1000, max failover duration: 5s) (A failover was detected by this seen error: vttablet: rpc error: code = Code(17) desc = The MySQL server is running with the --super-read-only option so it cannot execute this statement (errno 1290) (sqlstate HY000) (CallerID: issues_pull_requests_rw_1).)"}
{"_time":"2024-07-22T13:23:33.840+00:00","message":"Starting buffering for shard: <redacted>/30-40 (window: 5s, size: 1000, max failover duration: 5s) (A failover was detected by this seen error: vttablet: rpc error: code = Code(17) desc = The MySQL server is running with the --super-read-only option so it cannot execute this statement (errno 1290) (sqlstate HY000) (CallerID: issues_pull_requests_rw_1).)"}
{"_time":"2024-07-22T13:23:33.856+00:00","message":"Adding 1 to PrimaryPromoted counter for target: keyspace:\"<redacted>\" shard:\"20-30\" tablet_type:REPLICA, tablet: <redacted>-0171233832, tabletType: PRIMARY"}
{"_time":"2024-07-22T13:23:33.856+00:00","message":"keyspace event resolved: <redacted>/<redacted> is now consistent (serving: true)"}
{"_time":"2024-07-22T13:23:33.856+00:00","message":"keyspace event resolved: <redacted>/<redacted> is now consistent (serving: true)"}
{"_time":"2024-07-22T13:23:33.856+00:00","message":"keyspace event resolved: <redacted>/<redacted> is now consistent (serving: true)"}
{"_time":"2024-07-22T13:23:33.856+00:00","message":"keyspace event resolved: <redacted>/<redacted> is now consistent (serving: true)"}
{"_time":"2024-07-22T13:23:33.856+00:00","message":"keyspace event resolved: <redacted>/<redacted> is now consistent (serving: true)"}
{"_time":"2024-07-22T13:23:33.856+00:00","message":"keyspace event resolved: <redacted>/<redacted> is now consistent (serving: true)"}
{"_time":"2024-07-22T13:23:33.856+00:00","message":"keyspace event resolved: <redacted>/<redacted> is now consistent (serving: true)"}
{"_time":"2024-07-22T13:23:33.856+00:00","message":"keyspace event resolved: <redacted>/<redacted> is now consistent (serving: true)"}
{"_time":"2024-07-22T13:23:33.856+00:00","message":"keyspace event resolved: <redacted>/<redacted> is now consistent (serving: true)"}
{"_time":"2024-07-22T13:23:33.856+00:00","message":"keyspace event resolved: <redacted>/<redacted> is now consistent (serving: true)"}
{"_time":"2024-07-22T13:23:33.856+00:00","message":"keyspace event resolved: <redacted>/<redacted> is now consistent (serving: true)"}
{"_time":"2024-07-22T13:23:33.856+00:00","message":"keyspace event resolved: <redacted>/<redacted> is now consistent (serving: true)"}
{"_time":"2024-07-22T13:23:33.856+00:00","message":"keyspace event resolved: <redacted>/<redacted> is now consistent (serving: true)"}
{"_time":"2024-07-22T13:23:33.856+00:00","message":"keyspace event resolved: <redacted>/<redacted> is now consistent (serving: true)"}
{"_time":"2024-07-22T13:23:33.856+00:00","message":"keyspace event resolved: <redacted>/<redacted> is now consistent (serving: true)"}
{"_time":"2024-07-22T13:23:33.856+00:00","message":"keyspace event resolved: <redacted>/<redacted> is now consistent (serving: true)"}
{"_time":"2024-07-22T13:23:33.856+00:00","message":"disruption in shard <redacted>/80-90 resolved (serving: true)"}
{"_time":"2024-07-22T13:23:33.856+00:00","message":"disruption in shard <redacted>/c0-d0 resolved (serving: true)"}
{"_time":"2024-07-22T13:23:33.856+00:00","message":"disruption in shard <redacted>/b0-c0 resolved (serving: true)"}
{"_time":"2024-07-22T13:23:33.856+00:00","message":"disruption in shard <redacted>/e0-f0 resolved (serving: true)"}
{"_time":"2024-07-22T13:23:33.856+00:00","message":"disruption in shard <redacted>/-10 resolved (serving: true)"}
{"_time":"2024-07-22T13:23:33.856+00:00","message":"disruption in shard <redacted>/f0- resolved (serving: true)"}
{"_time":"2024-07-22T13:23:33.856+00:00","message":"disruption in shard <redacted>/60-70 resolved (serving: true)"}
{"_time":"2024-07-22T13:23:33.856+00:00","message":"disruption in shard <redacted>/40-50 resolved (serving: true)"}
{"_time":"2024-07-22T13:23:33.856+00:00","message":"disruption in shard <redacted>/d0-e0 resolved (serving: true)"}
{"_time":"2024-07-22T13:23:33.856+00:00","message":"disruption in shard <redacted>/50-60 resolved (serving: true)"}
{"_time":"2024-07-22T13:23:33.856+00:00","message":"disruption in shard <redacted>/90-a0 resolved (serving: true)"}
{"_time":"2024-07-22T13:23:33.856+00:00","message":"disruption in shard <redacted>/10-20 resolved (serving: true)"}
{"_time":"2024-07-22T13:23:33.856+00:00","message":"disruption in shard <redacted>/a0-b0 resolved (serving: true)"}
{"_time":"2024-07-22T13:23:33.856+00:00","message":"disruption in shard <redacted>/70-80 resolved (serving: true)"}
{"_time":"2024-07-22T13:23:33.856+00:00","message":"disruption in shard <redacted>/20-30 resolved (serving: true)"}
{"_time":"2024-07-22T13:23:33.856+00:00","message":"Stopping buffering for shard: <redacted>/20-30 after: 3.1 seconds due to: a primary promotion has been detected. Draining 50 buffered requests now."}
{"_time":"2024-07-22T13:23:33.856+00:00","message":"disruption in shard <redacted>/30-40 resolved (serving: true)"}
{"_time":"2024-07-22T13:23:33.856+00:00","message":"Stopping buffering for shard: <redacted>/30-40 after: 0.0 seconds due to: a primary promotion has been detected. Draining 1 buffered requests now."}
{"_time":"2024-07-22T13:23:33.856+00:00","message":"Draining finished for shard: <redacted>/30-40 Took: 195.979µs for: 1 requests."}
{"_time":"2024-07-22T13:23:33.960+00:00","message":"FailoverTooRecent-<redacted>/30-40: NOT starting buffering for shard: <redacted>/30-40 because the last failover which triggered buffering is too recent (104.215357ms < 1m0s). (A failover was detected by this seen error: Code: CLUSTER_EVENT"}
{"_time":"2024-07-22T13:23:34.016+00:00","message":"Draining finished for shard: <redacted>/20-30 Took: 159.349351ms for: 50 requests."}
{"_time":"2024-07-22T13:23:34.642+00:00","message":"not marking healthy primary <redacted>-0171231759 as Up for <redacted>/20-30 because its PrimaryTermStartTime is smaller than the highest known timestamp from previous PRIMARYs <redacted>-0171233832: -62135596800 < 1721654613 "}
{"_time":"2024-07-22T13:23:36.743+00:00","message":"Adding 1 to PrimaryPromoted counter for target: keyspace:\"<redacted>\" shard:\"30-40\" tablet_type:REPLICA, tablet: <redacted>-0171233041, tabletType: PRIMARY"}
{"_time":"2024-07-22T13:23:36.743+00:00","message":"keyspace event resolved: <redacted>/<redacted> is now consistent (serving: true)"}
{"_time":"2024-07-22T13:23:36.743+00:00","message":"keyspace event resolved: <redacted>/<redacted> is now consistent (serving: true)"}
{"_time":"2024-07-22T13:23:36.743+00:00","message":"keyspace event resolved: <redacted>/<redacted> is now consistent (serving: true)"}
{"_time":"2024-07-22T13:23:36.743+00:00","message":"keyspace event resolved: <redacted>/<redacted> is now consistent (serving: true)"}
{"_time":"2024-07-22T13:23:36.743+00:00","message":"keyspace event resolved: <redacted>/<redacted> is now consistent (serving: true)"}
{"_time":"2024-07-22T13:23:36.743+00:00","message":"keyspace event resolved: <redacted>/<redacted> is now consistent (serving: true)"}
{"_time":"2024-07-22T13:23:36.743+00:00","message":"keyspace event resolved: <redacted>/<redacted> is now consistent (serving: true)"}
{"_time":"2024-07-22T13:23:36.743+00:00","message":"keyspace event resolved: <redacted>/<redacted> is now consistent (serving: true)"}
{"_time":"2024-07-22T13:23:36.743+00:00","message":"keyspace event resolved: <redacted>/<redacted> is now consistent (serving: true)"}
{"_time":"2024-07-22T13:23:36.743+00:00","message":"keyspace event resolved: <redacted>/<redacted> is now consistent (serving: true)"}
{"_time":"2024-07-22T13:23:36.743+00:00","message":"keyspace event resolved: <redacted>/<redacted> is now consistent (serving: true)"}
{"_time":"2024-07-22T13:23:36.743+00:00","message":"keyspace event resolved: <redacted>/<redacted> is now consistent (serving: true)"}
{"_time":"2024-07-22T13:23:36.743+00:00","message":"keyspace event resolved: <redacted>/<redacted> is now consistent (serving: true)"}
{"_time":"2024-07-22T13:23:36.743+00:00","message":"keyspace event resolved: <redacted>/<redacted> is now consistent (serving: true)"}
{"_time":"2024-07-22T13:23:36.743+00:00","message":"keyspace event resolved: <redacted>/<redacted> is now consistent (serving: true)"}
{"_time":"2024-07-22T13:23:36.743+00:00","message":"keyspace event resolved: <redacted>/<redacted> is now consistent (serving: true)"}
{"_time":"2024-07-22T13:23:36.744+00:00","message":"disruption in shard <redacted>/50-60 resolved (serving: true)"}
{"_time":"2024-07-22T13:23:36.744+00:00","message":"disruption in shard <redacted>/60-70 resolved (serving: true)"}
{"_time":"2024-07-22T13:23:36.744+00:00","message":"disruption in shard <redacted>/40-50 resolved (serving: true)"}
{"_time":"2024-07-22T13:23:36.744+00:00","message":"disruption in shard <redacted>/d0-e0 resolved (serving: true)"}
{"_time":"2024-07-22T13:23:36.744+00:00","message":"disruption in shard <redacted>/70-80 resolved (serving: true)"}
{"_time":"2024-07-22T13:23:36.744+00:00","message":"disruption in shard <redacted>/90-a0 resolved (serving: true)"}
{"_time":"2024-07-22T13:23:36.744+00:00","message":"disruption in shard <redacted>/10-20 resolved (serving: true)"}
{"_time":"2024-07-22T13:23:36.744+00:00","message":"disruption in shard <redacted>/a0-b0 resolved (serving: true)"}
{"_time":"2024-07-22T13:23:36.744+00:00","message":"disruption in shard <redacted>/30-40 resolved (serving: true)"}
{"_time":"2024-07-22T13:23:36.744+00:00","message":"disruption in shard <redacted>/20-30 resolved (serving: true)"}
{"_time":"2024-07-22T13:23:36.744+00:00","message":"disruption in shard <redacted>/f0- resolved (serving: true)"}
{"_time":"2024-07-22T13:23:36.744+00:00","message":"disruption in shard <redacted>/80-90 resolved (serving: true)"}
{"_time":"2024-07-22T13:23:36.744+00:00","message":"disruption in shard <redacted>/c0-d0 resolved (serving: true)"}
{"_time":"2024-07-22T13:23:36.744+00:00","message":"disruption in shard <redacted>/b0-c0 resolved (serving: true)"}
{"_time":"2024-07-22T13:23:36.744+00:00","message":"disruption in shard <redacted>/e0-f0 resolved (serving: true)"}
{"_time":"2024-07-22T13:23:36.744+00:00","message":"disruption in shard <redacted>/-10 resolved (serving: true)"}
{"_time":"2024-07-22T13:23:37.282+00:00","message":"not marking healthy primary <redacted>-0171232808 as Up for <redacted>/30-40 because its PrimaryTermStartTime is smaller than the highest known timestamp from previous PRIMARYs <redacted>-0171233041: -62135596800 < 1721654616 "}
{"_time":"2024-07-22T13:23:38.961+00:00","message":"FailoverTooRecent-<redacted>/30-40: skipped 6 log messages"}
{"_time":"2024-07-22T13:23:38.961+00:00","message":"Execute: skipped 3 log messages"}

I think what's happening here is that the primaries of the 20-30 and 30-40 shards went into read-only mode due to the external failover at roughly the same time, which in turn caused buffering to start on both of these shards in quick succession.

Once the primary failover on shard 20-30 was done and Vitess was notified about the new primary via a TabletExternallyReparented or PlannedReparentShard call, the whole keyspace was detected as being consistent again - including the 30-40 shard, which was still in the middle of its failover. This caused buffering on both the 20-30 and 30-40 shards to stop, even though the 30-40 shard had not been failed over yet.

Queries that performed write operations against the 30-40 shard failed noticeably until that failover was finished.

Reproduction Steps

N/A

Binary Version

v17+

Operating System and Environment details

N/A

Log Fragments

N/A
@arthurschreiber arthurschreiber added the Type: Bug, Needs Triage, and Component: VTGate labels and removed the Needs Triage label on Jul 22, 2024
@arthurschreiber
Contributor Author

@deepthi @vmg This wasn't an issue in v17 and earlier with --buffer_implementation=healthcheck - but that implementation was deprecated and removed in v18.

I'm a bit at a loss as to how this could be fixed. Buffering starts because the vtgate notices that the vttablet is in read-only mode (but still serving), but the keyspace event watcher doesn't know about this and instead makes decisions based solely on the serving state of the primary (which in this case happily reports that it's up and healthy, even though it's in read-only mode during the external failover).
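
As a rough illustration of that mismatch, here is a minimal sketch (not the actual Vitess code; `causedByFailover`, `shardHealth`, and `keyspaceConsistent` are made-up names): buffering keys off the error a query hits, while keyspace consistency is derived purely from the primary's serving state.

```go
package sketch

import "strings"

// causedByFailover mimics the kind of error matching that decides a failover
// is in progress and buffering should start (the matching is illustrative).
func causedByFailover(err error) bool {
	if err == nil {
		return false
	}
	msg := err.Error()
	return strings.Contains(msg, "super-read-only") ||
		strings.Contains(msg, "CLUSTER_EVENT")
}

// shardHealth is a stand-in for what the keyspace event watcher tracks: only
// the serving state of the primary, with no notion of "read-only but serving".
type shardHealth struct {
	Serving bool
}

// keyspaceConsistent reports the keyspace as consistent as soon as every
// shard's primary claims to be serving, regardless of read-only status.
func keyspaceConsistent(shards map[string]shardHealth) bool {
	for _, s := range shards {
		if !s.Serving {
			return false
		}
	}
	return true
}
```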

@arthurschreiber
Contributor Author

This is actually not exclusive to TabletExternallyReparented calls. We're seeing the same behaviour when PlannedReparentShard is executed in parallel across multiple shards.

@arthurschreiber arthurschreiber changed the title Bug Report: Premature buffering stop during concurrent external reparenting can lead to query failures Bug Report: Premature buffering stop during concurrent reparenting can lead to query failures Oct 4, 2024
@GuptaManan100
Member

I'm trying to reproduce the problem in e2e tests in Vitess, but so far haven't had much success. Here is the test that I have added: #17013. I have tried running it in a loop to see if it fails, but after 237 runs I haven't found a single failure. @arthurschreiber @deepthi, could you take a look and see if there is something I'm missing? 🙇

@GuptaManan100
Member

GuptaManan100 commented Oct 23, 2024

Okay folks, I have been working on this for a couple of days now, and here is my progress report. From @arthurschreiber's comments, I was able to reproduce the problem in #17013 and then track down the underlying issue. The order of steps is roughly as follows (see the sketch after this list):

  1. A vttablet is marked read-only. Any write that reaches it fails with the read-only error and starts buffering.
  2. Now suppose a PRS happens for a different shard and succeeds.
  3. The keyspace event watcher sees that this PRS has started and records that shard as non-serving.
  4. Later, when the PRS completes, the watcher sees that it has completed. However, it doesn't know that the first vttablet, on a different shard from the one that just completed its PRS, is actually still read-only. So it just stops buffering!
  5. This causes queries targeted at the first vttablet to fail on the application side.
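
A minimal sketch of that race, using made-up types (`shardState`, `keyspaceState`) rather than the real Vitess structures: one shard's completed PRS can make the whole keyspace look consistent and drain every shard's buffer.

```go
package sketch

type shardState struct {
	currentlyServing bool // what the keyspace event watcher believes
	buffering        bool // whether vtgate is buffering writes for this shard
}

type keyspaceState struct {
	shards map[string]*shardState
}

// onShardConsistent is what step 4 above boils down to: when one shard's PRS
// completes, the watcher re-checks only the serving flags. A shard whose
// primary is read-only but still reporting "serving" never blocks this check,
// so buffering is stopped for every shard.
func (ks *keyspaceState) onShardConsistent(shard string) {
	ks.shards[shard].currentlyServing = true
	for _, s := range ks.shards {
		if !s.currentlyServing {
			return // keyspace not consistent yet, keep buffering
		}
	}
	for _, s := range ks.shards {
		s.buffering = false // drain buffers for all shards
	}
}
```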

I looked at the old healthcheck implementation of buffering too, and here are a couple of noteworthy observations (a sketch of the old behavior follows this list):

  1. The previous healthcheck implementation started and stopped buffering per shard. That is, if two PRSs start at the same time and the first finishes before the second, the previous implementation would stop buffering for the first shard when its PRS call ends, and stop buffering for the second shard when the second one's PRS ends. This is in contrast to the new keyspace events implementation, which only stops buffering for an entire keyspace once it is consistent; in this example scenario, it would stop buffering for both shards when the second PRS completes.
  2. The previous healthcheck implementation would only stop buffering on a shard when it saw a new primary with a larger reparent timestamp. The new implementation doesn't check for a larger timestamp; it marks a shard serving whenever it receives a health check from a serving primary tablet.
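
A minimal sketch of observation 2, again with illustrative names rather than the real healthcheck buffer code: buffering for a shard only stops once a serving primary with a newer term start time is seen.

```go
package sketch

type primaryHealth struct {
	serving              bool
	primaryTermStartTime int64 // unix seconds of the current primary's term
}

type shardBuffer struct {
	buffering              bool
	termStartWhenBuffering int64 // term start of the primary we buffered against
}

// onPrimaryHealthCheck stops buffering only when a *newer* serving primary is
// seen, which is how the old per-shard healthcheck buffering behaved.
func (sb *shardBuffer) onPrimaryHealthCheck(h primaryHealth) {
	if !sb.buffering {
		return
	}
	if h.serving && h.primaryTermStartTime > sb.termStartWhenBuffering {
		sb.buffering = false // drain this shard's buffer only
	}
}
```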

With ☝️ in mind, I have a proposed fix that handles the problem we're seeing. The underlying issue is that when we start buffering, we're not coordinating with the keyspace event watcher; we rely on a health check arriving from a PRIMARY tablet stating that it is going non-serving. This generally works fine for PRS calls (which is why I was unable to reproduce the bug with multiple simultaneous PRS calls), but it is still an issue if the health check from the primary tablet gets lost when DemotePrimary is called. In that case, the keyspace event watcher still thinks the shard is serving, and if a different shard completes a PRS call, it sees the entire keyspace as consistent and unblocks all the writes.

The proposed fix is to mark the shard non-serving in the keyspace event watcher when we start buffering. This works really well for PRS calls, because it ensures we only unblock queries after all shards are serving again (i.e. the keyspace event watcher has seen serving PRIMARY health checks for all shards).
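
A minimal sketch of that proposed hook, assuming a hypothetical `MarkShardNotServing` method on the watcher (the real API may differ): the buffer informs the keyspace event watcher the moment buffering starts.

```go
package sketch

// watcher is a stand-in for the keyspace event watcher; MarkShardNotServing
// is a hypothetical method name for the new coordination point.
type watcher interface {
	MarkShardNotServing(keyspace, shard string)
}

type buffer struct {
	kew watcher
}

// startBuffering shows where the proposed fix hooks in: as soon as buffering
// starts for a shard, the keyspace event watcher is told that the shard is
// not serving, so a PRS on another shard can no longer mark the whole
// keyspace consistent and drain this shard's buffer prematurely.
func (b *buffer) startBuffering(keyspace, shard string) {
	// ... existing logic that records buffering state for keyspace/shard ...

	b.kew.MarkShardNotServing(keyspace, shard)
}
```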

This fix, however, does not work well for the case where the user only sets the primary to read-only by changing MySQL directly. Because the primary tablet doesn't know about this change, it still advertises itself as serving. So even when we mark the shard non-serving (via the new fix), any subsequent health check update from the primary makes the keyspace event watcher mark the shard serving again. This wasn't a problem for the healthcheck implementation because of the larger primary timestamp check: it simply ignored those updates!

This is where I am at. I can augment the current solution so that instead of just marking the shard non-serving, it also looks at why we're starting buffering and, for the read-only case, waits for a higher-timestamp primary like the old code did. But before I make that leap, I actually wanted to revisit what operations we intend to support for buffering with respect to external reparents (the proposed solution already works for PRS). When users do external reparents, do they mark the primary not-serving by calling DemotePrimary? Is this something we enforce? Or is it acceptable for the primary to advertise itself as serving even when the external reparent has started and it is now read-only?

If users call DemotePrimary (or do anything else that makes the primary report itself as non-serving while the external reparent is ongoing), then the proposed solution will work. But if not, we're going to need to do something more like what's proposed ☝️. I think it's a good idea to define how external reparents should proceed and what we expect from them. Maybe this already exists and I just don't know of it 😆.

I have pushed my changes with the fix as a WIP commit on #17013. There are also some debug logs sprinkled in the code that I was using to see what was going on.

cc - @deepthi @arthurschreiber

@arthurschreiber
Contributor Author

arthurschreiber commented Oct 23, 2024

> This is where I am at. I can augment the current solution to instead of just marking the shard non-serving, to also see why we're starting buffering and for the read-only case, make it wait for an higher timestamp primary like the old code. But before I make that leap. I actually wanted to revisit, what are the operations we intend to support for buffering wrt to external reparents (the proposed solution already works for PRS). When users are doing external reparents do they mark the primary not-serving by calling DemotePrimary? Is this something we enforce? Or is it acceptable for the primary to advertise itself as serving even when the external reparent has started and it is not read-only?

This is not how we use external reparenting. We use orchestrator to manage unplanned failovers and simply call TabletExternallyReparented in a post-failover hook. I don't think it's feasible to also have to call DemotePrimary before the failover, because if the failover then fails or doesn't happen, we can't easily undo it.

@arthurschreiber
Contributor Author

@GuptaManan100 Oh, one more thing.

We've encountered this issue even when using PRS. Our configuration allows a one-second grace period for transactions to finish when doing a PlannedReparentShard. During this time, the primary health checks report the primary as healthy, but queries executed against the primary outside of a pre-existing transaction fail and start to trigger buffering. I'm not sure your suggested fix will be compatible with this.

@GuptaManan100
Member

@arthurschreiber Yes, you're right. Even during PRS, the primary tablet can send a serving health check when it is in the grace period.

I have reworked the solution a little bit to account for both cases; this will handle both PRS and TabletExternallyReparented. This solution is somewhat similar to what the healthcheck implementation used to do.

In this new proposed solution, when we receive an error that starts buffering, we not only mark the shard not-serving in the shard state, we also explicitly ask it to wait for the reparent to complete and for a tablet with a higher primary timestamp to show up.
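
A minimal sketch of the reworked idea, using made-up fields (`waitForReparent`, `lastPrimaryTermStart`) rather than the actual implementation in #17013: once buffering starts, serving health checks from the old primary are ignored until a primary with a strictly newer term start time appears.

```go
package sketch

// shardInfo holds illustrative fields for the reworked fix.
type shardInfo struct {
	serving              bool
	waitForReparent      bool
	lastPrimaryTermStart int64
}

// onBufferingStarted is invoked when a failover-indicating error starts
// buffering: the shard is marked not-serving and flagged to wait for a
// reparent to complete.
func (s *shardInfo) onBufferingStarted() {
	s.serving = false
	s.waitForReparent = true
}

// onPrimaryHealthCheck ignores "serving" reports from the old (possibly
// read-only) primary while waiting for the reparent; only a primary with a
// strictly newer term start time ends the wait.
func (s *shardInfo) onPrimaryHealthCheck(serving bool, termStart int64) {
	if s.waitForReparent {
		if termStart <= s.lastPrimaryTermStart {
			return
		}
		s.waitForReparent = false
	}
	s.serving = serving
	if termStart > s.lastPrimaryTermStart {
		s.lastPrimaryTermStart = termStart
	}
}
```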

This solution works as intended, but it has a downside that the healthcheck implementation had as well: once buffering has started, if the reparent operation fails, buffering won't be stopped, and we continue to buffer until we reach the buffering timeout!

I'm going to write this up in the PR description as well, after adding a few more tests. I think this is a worthwhile trade-off to accept, but I still felt it should be explicitly pointed out.

@arthurschreiber
Contributor Author

@GuptaManan100 That doesn't sound like a terrible drawback. 😄

If I understand correctly, this will mean that if there's a PRS or external reparent that fails, buffering will continue until the buffer timeout is reached.

If the buffer runs full during this time, older queries will be evicted and may still succeed, as long as the old primary becomes writable again.

When the buffer timeout is reached, all queries will be evicted and they will again be potentially successful (depending on the state of the old primary).

So, if my assumptions are correct, we'll see a small jump in query latency, but will otherwise not be in a worse position than before. 👍
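
For what it's worth, here is a rough model of the buffer behavior assumed above (illustrative only, not the real vtgate buffer; `bufferedRequest` and `shardQueue` are made-up names): oldest entries are evicted and retried when the buffer fills up, and everything is drained when buffering stops or times out.

```go
package sketch

// bufferedRequest represents a query waiting in the buffer; retry re-executes
// it against whatever primary vtgate currently knows about.
type bufferedRequest struct {
	retry func() error
}

// shardQueue is a bounded per-shard buffer.
type shardQueue struct {
	max     int
	pending []bufferedRequest
}

// add buffers a request; if the buffer is already full, the oldest request is
// evicted and retried immediately (it may still succeed if the old primary
// has become writable again).
func (q *shardQueue) add(r bufferedRequest) {
	if len(q.pending) >= q.max {
		oldest := q.pending[0]
		q.pending = q.pending[1:]
		_ = oldest.retry()
	}
	q.pending = append(q.pending, r)
}

// drain is called when buffering stops, whether because a promotion was seen
// or because the buffering window timed out: every buffered request is
// retried against the current primary.
func (q *shardQueue) drain() {
	for _, r := range q.pending {
		_ = r.retry()
	}
	q.pending = nil
}
```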

@GuptaManan100
Member

Yes, 100% correct.
