Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

increase vtctlclient backupShard command success rate #14604

Merged
merged 13 commits into from
Nov 30, 2023
19 changes: 17 additions & 2 deletions go/vt/vtctl/grpcvtctldserver/server.go
Original file line number Diff line number Diff line change
Expand Up @@ -430,9 +430,24 @@ func (s *VtctldServer) BackupShard(req *vtctldatapb.BackupShardRequest, stream v
span.Annotate("concurrency", req.Concurrency)
span.Annotate("incremental_from_pos", req.IncrementalFromPos)

tablets, stats, err := reparentutil.ShardReplicationStatuses(ctx, s.ts, s.tmc, req.Keyspace, req.Shard)
shardTablets, stats, err := reparentutil.ShardReplicationStatuses(ctx, s.ts, s.tmc, req.Keyspace, req.Shard)

var tablets []*topo.TabletInfo
// Instead of return on err directly, only return when no healthy tablets at all
if err != nil {
return err
for i, stat := range stats {
// shardTablets[i] and stats[i] is 1:1 mapping
// Healthy shardTablets[i] will be added to tablets
if stat != nil {
tablets = append(tablets, shardTablets[i])
}
}
// Only return err when all tablets have errors
if len(tablets) == 0 {
return err
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there is a small chance that we are doing BackupShard --allow-primary and the shard has exactly one tablet (the primary), and this could incorrectly fail. i'd suggest: if len(tablets) < len(shardTablets) instead

Copy link
Contributor Author

@jwangace jwangace Nov 27, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @ajm188 do you mean that the shard is built only with 1 tablet, so it's the only 1 and primary?

if the code changed to:

if err != nil {
...
    if len(tablets) < len(shardTablets) {
        return err
    }
}

In normal cases (tablets >= 2), then it should equivelant to before and does not fix the bug, becase len(tablets) < len(shardTablets) should always true when err != nil

if err != nil {
    return err
}

Instead, can I suggest tablets should always include the PRIMARY and no matter it's stat nil or non-nil?

shardTablets, stats, err := reparentutil.ShardReplicationStatuses(ctx, s.ts, s.tmc, req.Keyspace, req.Shard)

var tablets []*topo.TabletInfo
// Instead of return on err directly, only return when no healthy tablets at all
if err != nil {
	for i, stat := range stats {
		// Always include TabletType_PRIMARY
		if shardTablets[i].Type == topodatapb.TabletType_PRIMARY {
			tablets = append(tablets, shardTablets[i])
			continue
		}
		// shardTablets[i] and stats[i] is 1:1 mapping
		// Healthy shardTablets[i] will be added to tablets
		if stat != nil {
			tablets = append(tablets, shardTablets[i])
		}
	}
	// Only return err when no usable tablet
	if len(tablets) == 0 {
		return err
	}
} else {
	tablets = shardTablets
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah good point.

do you mean that the shard is built only with 1 tablet, so it's the only 1 and primary?

yes, that's what i mean.

i missed that the if len(tablets) == 0 happened inside the if err != nil check, not outside, so i think this is a moot point. sorry for the confusion!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, given the intricate logic here I'd like to suggest extracting the logic into a separate function and adding a unit test suite to validate expected outcome of different scenarios 🙏

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with Shlomi here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated PR

}
} else {
tablets = shardTablets
}

var (
Expand Down
Loading