Fix hearbeatWriter Close being stuck if waiting for a semi-sync ACK #14823
…ck being closed if blocked on a semi-sync ack Signed-off-by: Manan Gupta <[email protected]>
…as been asked to close Signed-off-by: Manan Gupta <[email protected]>
The logic seems correct to me. A couple of thoughts:
- Use `Ticker` instead of `Sleep` so that we can break early when the context is `Done()`.
- Feels like we should have a mechanism within `Conn` to handle that. Can't we use the `cancelFunc` mechanism?
Lines 212 to 214 in af6a08c
    // cancel keep the cancel function for the current executing query.
    // this is used by `kill [query|connection] ID` command from other connection.
    cancel context.CancelFunc
Signed-off-by: Manan Gupta <[email protected]>
I have made the changes to replace time.Sleep() with a ticker as you pointed out.
I checked the cancelFunc and its usage, and from what I can tell, it is only being used in …
Description
This PR fixes the issue pointed out in #14637. On investigating the issue, it was found that if the heartbeat writer is stuck while writing because it is waiting for a semi-sync ACK, it also blocks the `Close()` call on itself. This causes a problem since we aren't able to end the primary term, causing the cluster to see 2 primary tablets. VTOrc is unable to repair this failure either, because the problem is in the heartbeat writer of vttablet, and no external RPC can unblock it.

The proposed fix in this PR is to do what we do for user queries, i.e., when we try to close the heartbeat writer, we try to kill the ongoing write. This ensures that we aren't stuck indefinitely. The code has been written with correctness in mind: it still blocks the `Close()` call from finishing until we have guaranteed that no write is in progress. This is essential to prevent errant GTIDs.

The heartbeat writer already had a connection pool with all the privileges. Earlier it was used to run `DDL` queries in the `withDDL` struct that we had, but all of that was cleaned up with the sidecardb-related changes. However, the pool was left in place even though it wasn't being used. In this PR we have repurposed that pool and use it for the `kill` queries that we need to run.

Related Issue(s)
Checklist
Deployment Notes