Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix hearbeatWriter Close being stuck if waiting for a semi-sync ACK #14823

Merged
merged 3 commits into from
Dec 28, 2023

Conversation

GuptaManan100
Copy link
Member

@GuptaManan100 GuptaManan100 commented Dec 19, 2023

Description

This PR fixes the issue pointed out in #14637. On investigating the issue, it was found that if the heartbeat writer is stuck while writing due to the requirement of semi-sync ACKs, it will also block Close() call to itself. This causes a problem since we aren't able to end the primary term, causing the cluster to see 2 primary tablets. VTOrc is unable to repair this failure either, because the problem is in the hearbeat writer of vttablet, and no external RPC can unblock it.

The proposed fix in this PR, is to do what we do for the user queries i.e., when we try to close the heartbeat writer, we try and kill the ongoing write. This ensures that we aren't indefinitely stuck. The code has been written with correctness in mind, and it is still blocking the Close() call from finishing, until we have guaranteed that no write is in progress. This is essential to prevent errant GTIDs from happening.

The heartbeat writer already had a connection pool with all the privileges. Earlier it was being used to run DDL queries in the withDDL struct that we had, but all of that was cleaned up with sidecardb related changes. However, the pool was left even though it wasn't being used. In this PR we have repurposed that pool and use it for kill queries that we need to run.

Related Issue(s)

Checklist

  • "Backport to:" labels have been added if this change should be back-ported to release branches
  • If this change is to be back-ported to previous releases, a justification is included in the PR description
  • Tests were added or are not required
  • Did the new or modified tests pass consistently locally and on CI?
  • Documentation was added or is not required

Deployment Notes

…ck being closed if blocked on a semi-sync ack

Signed-off-by: Manan Gupta <[email protected]>
Copy link
Contributor

@shlomi-noach shlomi-noach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The logic seems correct to me. A couple thoughts:

  • Use Ticker instead of Sleep so that we can break early when context is Done().
  • Feels like we should have a mechanism within Conn to handle that. Can't we use cancelFunc mechanism?

vitess/go/mysql/conn.go

Lines 212 to 214 in af6a08c

// cancel keep the cancel function for the current executing query.
// this is used by `kill [query|connection] ID` command from other connection.
cancel context.CancelFunc

@GuptaManan100
Copy link
Member Author

I have made the changes to replace time.Sleep() with a ticker as you pointed out.

Feels like we should have a mechanism within Conn to handle that. Can't we use cancelFunc mechanism?

I checked the cancelFunc and its usage, and from what I can tell, it is only being used in vtgates. The connection ID that vtgate advertises to the users don't necessarily match with the connection IDs on the vttablet connections (more often they won't, cause one vtgate query would be running on multiple shards, so they'll all have different connection ids). So when a user issues a kill query to vtgate, they are doing so with the connection id that vtgate advertised. So the way we implemented kill was to store the context and the cancel function for all the vtgate queries that are running, and cancel the context on a kill query causing the underlying StreamExecute/Execute RPC calls to be cancelled. This isn't actually killing anything per say. The streamExecute and other RPCs, are handling that on context cancellations on the vttablet side. But all of this does mean, that this is not the correct way to kill a query in the heartbeat writer. I am not sure if what I have written is the best way for killing the queries, but I think @harshit-gangal would be able to advise us on that, and tell us if there is a better alternate. I don't see any explicit Kill methods or functions to use.

@GuptaManan100 GuptaManan100 merged commit 47de203 into vitessio:main Dec 28, 2023
104 checks passed
@GuptaManan100 GuptaManan100 deleted the fix-heartbeat-writer-stuck branch December 28, 2023 03:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Bug Report: vtorc leaves shard with two primaries, both accepting queries
3 participants