Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix v18 tablets removed from healthcheck when topo server GetTablet call fails #201

Conversation

brendar
Copy link

@brendar brendar commented Nov 6, 2024

This is the v18 fix for this bug: vitessio#15632

The code for getting tablets for healthchecks was significantly altered between v18 and v19, so even though the bug is similar, the fix is different.

Description
In v18, TopologyWatcher.loadTablets() roughly works like this:

  • Allocates a newTablets map
  • Pulls a list of tablet aliases from the topology server
  • Iterates through the tablet aliases
    • Spawns a goroutine to call GetTablet to get the tablet record
    • If the GetTablet call succeeds, the tablet is added to newTablets
  • Waits for all the GetTablet goroutines to finish
  • Iterates through newTablets:
    • Attempts to find each tablet in tw.tablets (the current set of tracked tablets)
    • If it finds an existing tablet, it will update info to match, otherwise it will add that tablet to the healthcheck
  • Iterates through tw.tablets (the current set of tracked tablets)
    • Attempts to find each tablet in newTablets
    • If it does not find the tablet, it will remove it from the healthcheck

So the bug in v18 is that if a GetTablet call fails, the tablet is not added to newTablets, and thus removed from the healthcheck when iterating tw.tablets.

Changes

  • Pulls in the memorytopo changes from the v19+ fix (link). These allow us to force topo errors on specific calls for testing.
  • Updates TopologyWatcher.loadTablets() to handle GetTablet errors by backfilling newTablets with the current value from tw.tablets. This is the same approach used in the v19+ fix.
  • Pulls in the TestGetTabletErrorDoesNotRemoveFromHealthcheck test from the v19+ fix
  • Adds TestGetTabletNoNodeErrorRemovesFromHealthcheck to cover the case where a tablet is removed from the topo server bewteen the call to getTablets and the call to GetTablet for that alias. That scenario is handled differently in v19+ so I added explicit handling for it in this fix.

@brendar brendar marked this pull request as ready for review November 6, 2024 18:06
@@ -190,6 +190,17 @@ func (tw *TopologyWatcher) loadTablets() {
topologyWatcherOperations.Add(topologyWatcherOpGetTablet, 1)
<-tw.sem // Done; enable next request to run
if err != nil {
if !topo.IsErrType(err, topo.NoNode) {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would the error be a NoNode even when it's a network partition?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, the NoNode error means the path does not exist in zookeeper. We would get a NoNode error if the tablet was removed from the topo server between the call to get all the tablet aliases (tw.getTablets(tw)) and the call to get that tablet (tw.topoServer.GetTablet(tw.ctx, alias)). So we allow the tablet to be removed from the healthcheck on a NoNode error.

@brendar brendar merged commit a46ede6 into v18.0.5-shopify-7-candidate Nov 7, 2024
206 of 207 checks passed
@brendar brendar mentioned this pull request Nov 7, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants