Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve error messages in tablet gateway when primary tablets are not serving #125

Conversation

austenLacy
Copy link

@austenLacy austenLacy commented Oct 12, 2023

Description

The error message returned from the tablet gateway is misleading. For example while a primary was not serving and down Vitess was returning current keyspace is being resharded to an application.

I swapped the tablet checks to check if the primary is serving before checking the resharding state.

For example, during a recent incident where a primary mysql was down Vitess was returning current keyspace is being resharded error messages back to an application which is very confusing. If we look at the tablet serving state however we can see that the primary is NOT_SERVING.

ro@cluster(replica)> show vitess_tablets;
+-------------+----------------------------------+-------+------------+-------------+------------------------+---------------+----------------------+
| Cell        | Keyspace                         | Shard | TabletType | State       | Alias                  | Hostname      | PrimaryTermStartTime |
+-------------+----------------------------------+-------+------------+-------------+------------------------+---------------+----------------------+
...
| us-east1    | keyspace1 | -10   | PRIMARY    | NOT_SERVING | us-east1-0000000097    | 247.4.117.26  | 2023-10-11T19:16:00Z |
| us-east1    | keyspace2 | -10   | REPLICA    | NOT_SERVING | us-east1-0000000102    | 247.2.54.58   |                      |
...
+-------------+----------------------------------+-------+------------+-------------+------------------------+---------------+----------------------+
68 rows in set (0.00 sec)

With this change we should at least return an error message of primary is not serving, there could be a reparent operation in progress which is more valuable than a "potential" resharding message.

Comment on lines +304 to 306
if kev.TargetIsBeingResharded(target) {
err = vterrors.Errorf(vtrpcpb.Code_CLUSTER_EVENT, "current keyspace is potentially being resharded")
continue

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this return false if it's not actually being resharded? Feel like we should address the bug at that level?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@shanth96 I considered changing the TargetIsBeingResharded logic, but I'm also not familiar enough with the logic that determines if it's actually being resharded and if we would have enough information to determine that from the keyspace watch events. We can revisit that though if we think this change isn't as helpful.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair enough

@brendar
Copy link

brendar commented Oct 13, 2023

I have reservations about changing the order of error checks and their messages unless we can get this upstreamed. If we need community support for issues in the future, there's going to be confusion if we are mentioning error messages that don't exist in Vitess upstream.

@austenLacy
Copy link
Author

@brendar

I have reservations about changing the order of error checks and their messages unless we can get this upstreamed.

I'm happy to try upstreaming these changes to see what the maintainers think.

@brendar
Copy link

brendar commented Oct 16, 2023

I'm happy to try upstreaming these changes to see what the maintainers think.

Cool, let's try

@austenLacy
Copy link
Author

I found this PR that's in main that improves the heuristic for determining if a reshard is actually in progress. Should definitely help with this issue: vitessio#13856.

related issue: vitessio#13854

@austenLacy austenLacy closed this Nov 16, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants