Fix heartbeatWriter Close being stuck if waiting for a semi-sync ACK #14823

GuptaManan100 · 2023-12-19T16:36:39Z

Description

This PR fixes the issue reported in #14637. Investigation showed that when the heartbeat writer is stuck on a write that is waiting for a semi-sync ACK, its Close() call blocks as well. As a result, we cannot end the primary term, and the cluster ends up seeing two primary tablets. VTOrc cannot repair this failure either, because the problem lies in the heartbeat writer inside vttablet, and no external RPC can unblock it.

The proposed fix is to do what we already do for user queries: when we close the heartbeat writer, we try to kill the ongoing write, so that we are never stuck indefinitely. The code is written with correctness in mind: the Close() call still blocks until we have guaranteed that no write is in progress. This is essential to prevent errant GTIDs.
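The shape of this fix can be illustrated with a minimal sketch. It is not the code from this PR: `sketchWriter`, `startWrite`, and `demo` are hypothetical names, and cancelling a context stands in for the kill query that the real writer would issue against MySQL. The sketch only shows the two properties described above: Close interrupts the in-flight write, and Close does not return until that write has finished.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// sketchWriter is a hypothetical stand-in for the heartbeat writer.
type sketchWriter struct {
	mu     sync.Mutex
	kill   context.CancelFunc // set while a write is in flight; stands in for a KILL query
	wg     sync.WaitGroup     // counts writes that are still in progress
	killed int                // writes that ended because they were killed
}

// startWrite launches a write that never completes on its own, mimicking a
// write that waits forever for a semi-sync ACK.
func (w *sketchWriter) startWrite() {
	ctx, cancel := context.WithCancel(context.Background())
	w.mu.Lock()
	w.kill = cancel
	w.mu.Unlock()
	w.wg.Add(1)
	go func() {
		defer w.wg.Done()
		<-ctx.Done() // blocked until the write is killed
		w.mu.Lock()
		w.kill = nil
		w.killed++
		w.mu.Unlock()
	}()
}

// Close kills any in-flight write, then blocks until no write is in progress.
func (w *sketchWriter) Close() {
	w.mu.Lock()
	kill := w.kill
	w.mu.Unlock()
	if kill != nil {
		kill()
	}
	w.wg.Wait()
}

// demo reports whether Close returned within one second and how many writes were killed.
func demo() (closed bool, killed int) {
	w := &sketchWriter{}
	w.startWrite()
	done := make(chan struct{})
	go func() { w.Close(); close(done) }()
	select {
	case <-done:
		closed = true
	case <-time.After(time.Second):
	}
	w.mu.Lock()
	defer w.mu.Unlock()
	return closed, w.killed
}

func main() {
	closed, killed := demo()
	fmt.Println("closed:", closed, "killed writes:", killed)
}
```

Without the `kill()` call in `Close`, `wg.Wait()` would block forever in this sketch, which mirrors the stuck Close() described in the issue.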

The heartbeat writer already had a connection pool with all privileges. It was previously used to run DDL queries through the withDDL struct, but that was cleaned up as part of the sidecardb-related changes, and the pool was left behind unused. This PR repurposes that pool to run the kill queries we need.

Related Issue(s)

Fixes Bug Report: vtorc leaves shard with two primaries, both accepting queries #14637

Checklist

"Backport to:" labels have been added if this change should be back-ported to release branches
If this change is to be back-ported to previous releases, a justification is included in the PR description
Tests were added or are not required
Did the new or modified tests pass consistently locally and on CI?
Documentation was added or is not required

Deployment Notes

shlomi-noach

The logic seems correct to me. A couple thoughts:

Use Ticker instead of Sleep so that we can break early when context is Done().
Feels like we should have a mechanism within Conn to handle that. Can't we use cancelFunc mechanism?

vitess/go/mysql/conn.go

Lines 212 to 214 in af6a08c

    // cancel keep the cancel function for the current executing query.
    // this is used by `kill [query|connection] ID` command from other connection.
    cancel context.CancelFunc

GuptaManan100 · 2023-12-20T16:01:11Z

I have made the changes to replace time.Sleep() with a ticker as you pointed out.
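For readers unfamiliar with the pattern, here is a minimal sketch of polling with a ticker so that the loop can exit as soon as the context is done. It is not the code from this PR: `waitUntil`, `cancelledEarly`, and `pollsUntilReady` are hypothetical helpers written only for illustration.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// waitUntil calls check every interval until it returns true or ctx is done.
// It returns nil when check succeeded and ctx.Err() when the context ended first.
func waitUntil(ctx context.Context, interval time.Duration, check func() bool) error {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		if check() {
			return nil
		}
		select {
		case <-ctx.Done():
			// Unlike time.Sleep(interval), this returns without waiting
			// for the rest of the interval.
			return ctx.Err()
		case <-ticker.C:
		}
	}
}

// cancelledEarly reports whether waitUntil returned an error well before the
// 10 second polling interval elapsed.
func cancelledEarly() bool {
	ctx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
	defer cancel()
	start := time.Now()
	err := waitUntil(ctx, 10*time.Second, func() bool { return false })
	return err != nil && time.Since(start) < 5*time.Second
}

// pollsUntilReady reports whether waitUntil succeeded and how many times check ran.
func pollsUntilReady() (ok bool, calls int) {
	err := waitUntil(context.Background(), time.Millisecond, func() bool {
		calls++
		return calls == 3
	})
	return err == nil, calls
}

func main() {
	fmt.Println("cancelled early:", cancelledEarly())
	ok, calls := pollsUntilReady()
	fmt.Println("ready:", ok, "after checks:", calls)
}
```

With `time.Sleep`, the first case would have waited out the full interval before noticing that the context had expired.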

> Feels like we should have a mechanism within Conn to handle that. Can't we use cancelFunc mechanism?

I checked cancelFunc and its usage, and from what I can tell it is only used in vtgates. The connection ID that vtgate advertises to users doesn't necessarily match the connection IDs on the vttablet connections (more often it won't, because one vtgate query may run on multiple shards, each with a different connection ID). So when a user issues a kill query to vtgate, they do so with the connection ID that vtgate advertised.

The way we implemented kill was therefore to store the context and the cancel function for every running vtgate query, and to cancel the context on a kill query, which causes the underlying StreamExecute/Execute RPC calls to be cancelled. This isn't actually killing anything per se; StreamExecute and the other RPCs handle the context cancellation on the vttablet side.

All of this means that cancelFunc is not the correct way to kill a query in the heartbeat writer. I am not sure that what I have written is the best way to kill the queries, but I think @harshit-gangal can advise us on that and tell us if there is a better alternative. I don't see any explicit Kill methods or functions to use.
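The vtgate-side kill mechanism described in this comment (keep a cancel function per advertised connection ID, and cancel it when a kill query arrives) can be sketched roughly as follows. This is an illustration under assumptions, not vtgate's actual code: `killRegistry`, `begin`, `kill`, and `demoKill` are hypothetical names.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// killRegistry is a hypothetical map from the connection ID advertised to the
// user to the cancel function of the query currently running on it.
type killRegistry struct {
	mu      sync.Mutex
	running map[uint32]context.CancelFunc
}

func newKillRegistry() *killRegistry {
	return &killRegistry{running: make(map[uint32]context.CancelFunc)}
}

// begin registers a running query and returns the context to pass to the
// downstream RPCs, plus a function to call once the query has finished.
func (r *killRegistry) begin(parent context.Context, connID uint32) (context.Context, func()) {
	ctx, cancel := context.WithCancel(parent)
	r.mu.Lock()
	r.running[connID] = cancel
	r.mu.Unlock()
	return ctx, func() {
		r.mu.Lock()
		delete(r.running, connID)
		r.mu.Unlock()
		cancel()
	}
}

// kill cancels the context of the query on connID, if there is one.
// Nothing is killed in MySQL here; downstream calls must react to the cancellation.
func (r *killRegistry) kill(connID uint32) bool {
	r.mu.Lock()
	cancel, ok := r.running[connID]
	r.mu.Unlock()
	if ok {
		cancel()
	}
	return ok
}

// demoKill registers a query on connection 42, kills it, and then tries to
// kill a connection that has no running query.
func demoKill() (found, cancelled, foundUnknown bool) {
	r := newKillRegistry()
	ctx, done := r.begin(context.Background(), 42)
	defer done()
	found = r.kill(42)
	cancelled = ctx.Err() != nil
	foundUnknown = r.kill(7)
	return found, cancelled, foundUnknown
}

func main() {
	found, cancelled, foundUnknown := demoKill()
	fmt.Println(found, cancelled, foundUnknown)
}
```

The sketch makes the limitation visible: cancelling the context only helps if the code holding the connection reacts to it, which is why it does not by itself unblock a write that is stuck inside MySQL waiting for a semi-sync ACK.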

test: add failing test to ensure heartbeat writer is indefinitely stuck being closed if blocked on a semi-sync ack

85c278a

Signed-off-by: Manan Gupta <[email protected]>

GuptaManan100 added Type: Bug Component: Cluster management labels Dec 19, 2023

github-actions bot added this to the v19.0.0 milestone Dec 19, 2023

feat: add logic to heartbeat writer to kill the ongoing write if it has been asked to close

2cb09a6

Signed-off-by: Manan Gupta <[email protected]>

GuptaManan100 marked this pull request as ready for review December 20, 2023 06:33

GuptaManan100 requested review from harshit-gangal, systay, shlomi-noach, rohit-nayak-ps and mattlord as code owners December 20, 2023 06:33

shlomi-noach approved these changes Dec 20, 2023

refactor: use ticker instead of sleep

2bd6cae

Signed-off-by: Manan Gupta <[email protected]>

harshit-gangal reviewed Dec 26, 2023

harshit-gangal approved these changes Dec 26, 2023

GuptaManan100 merged commit 47de203 into vitessio:main Dec 28, 2023
104 checks passed

GuptaManan100 deleted the fix-heartbeat-writer-stuck branch December 28, 2023 03:17
