OnlineDDL: fix scenarios where migration hangs instead of directly failing #14290
Signed-off-by: Shlomi Noach <[email protected]>
…column' as a Code_INTERNAL error (hence unrecoverable) Signed-off-by: Shlomi Noach <[email protected]>
if vterrors.Code(err) == vtrpc.Code_INTERNAL {
	return true
}
This has potentially far reaching effects:
Lines 158 to 161 in 8575b17
// INTERNAL errors. Means some invariants expected by underlying
// system has been broken. If you see one of these errors,
// something is very broken.
INTERNAL = 13;
I typically use that when there's something very unexpected, which may be e.g. some random bit flip or unknown edge case that may be fine on retry. Please see the error code discussion as we should use a code which indicates that you cannot simply retry w/o fixing the underlying issue (FAILED_PRECONDITION seems like the best one).
This will affect all vreplication usage and will almost certainly lead to some unintended changes.
Not debating which name is more appropriate, but I did a quick grep and we are using Code_internal today in both vstreamer and vreplication for conditions that are non-recoverable. FAILED_PRECONDITION is not used anywhere as yet.
This makes sense. I'll change to FAILED_PRECONDITION.
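The check discussed in this thread can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code in this PR: `Code`, `codedError`, `newCodedError`, `codeOf`, and `isUnrecoverable` are hypothetical stand-ins for the `vtrpc` codes and `vterrors` helpers, and the sketch assumes that only FAILED_PRECONDITION is treated as unrecoverable, as agreed above.

```go
package main

import (
	"errors"
	"fmt"
)

// Code is a stand-in for an RPC-style error code (hypothetical, not vtrpc.Code).
type Code int

const (
	CodeUnknown            Code = 2
	CodeFailedPrecondition Code = 9
	CodeInternal           Code = 13
)

// codedError is a hypothetical error type that carries a Code.
type codedError struct {
	code Code
	msg  string
}

func (e *codedError) Error() string { return e.msg }

// newCodedError builds an error tagged with the given code.
func newCodedError(code Code, format string, args ...any) error {
	return &codedError{code: code, msg: fmt.Sprintf(format, args...)}
}

// codeOf extracts the Code from err, looking through wrapped errors.
// Errors that carry no code are reported as CodeUnknown.
func codeOf(err error) Code {
	var ce *codedError
	if errors.As(err, &ce) {
		return ce.code
	}
	return CodeUnknown
}

// isUnrecoverable reports whether retrying is pointless until the
// underlying state has been fixed. Assumption: only FAILED_PRECONDITION
// qualifies; INTERNAL stays retryable.
func isUnrecoverable(err error) bool {
	return codeOf(err) == CodeFailedPrecondition
}

func main() {
	schemaErr := newCodedError(CodeFailedPrecondition,
		"column %s not found in table expressions", "c1")
	wrapped := fmt.Errorf("building table plan: %w", schemaErr)

	fmt.Println(isUnrecoverable(wrapped))                        // wrapped code is still found
	fmt.Println(isUnrecoverable(errors.New("connection reset"))) // no code attached
}
```

Keeping INTERNAL out of the unrecoverable set reflects the concern raised above that INTERNAL is also used for unexpected conditions that may succeed on retry.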
@@ -629,7 +631,7 @@ func (tpb *tablePlanBuilder) analyzeExtraSourcePkCols(colInfos []*ColumnInfo, so
 		if !col.IsGenerated {
 			// We shouldn't get here in any normal scenario. If a column is part of colInfos,
 			// then it must also exist in tpb.colExprs.
-			return fmt.Errorf("column %s not found in table expressions", col.Name)
+			return vterrors.Errorf(vtrpc.Code_INTERNAL, "column %s not found in table expressions", col.Name)
Better to use another code IMO:
Lines 74 to 127 in 8575b17
// INVALID_ARGUMENT indicates client specified an invalid argument.
// Note that this differs from FAILED_PRECONDITION. It indicates arguments
// that are problematic regardless of the state of the system
// (e.g., a malformed file name).
INVALID_ARGUMENT = 3;
// DEADLINE_EXCEEDED means operation expired before completion.
// For operations that change the state of the system, this error may be
// returned even if the operation has completed successfully. For
// example, a successful response from a server could have been delayed
// long enough for the deadline to expire.
DEADLINE_EXCEEDED = 4;
// NOT_FOUND means some requested entity (e.g., file or directory) was
// not found.
NOT_FOUND = 5;
// ALREADY_EXISTS means an attempt to create an entity failed because one
// already exists.
ALREADY_EXISTS = 6;
// PERMISSION_DENIED indicates the caller does not have permission to
// execute the specified operation. It must not be used for rejections
// caused by exhausting some resource (use RESOURCE_EXHAUSTED
// instead for those errors). It must not be
// used if the caller cannot be identified (use Unauthenticated
// instead for those errors).
PERMISSION_DENIED = 7;
// RESOURCE_EXHAUSTED indicates some resource has been exhausted, perhaps
// a per-user quota, or perhaps the entire file system is out of space.
RESOURCE_EXHAUSTED = 8;
// FAILED_PRECONDITION indicates operation was rejected because the
// system is not in a state required for the operation's execution.
// For example, directory to be deleted may be non-empty, an rmdir
// operation is applied to a non-directory, etc.
//
// A litmus test that may help a service implementor in deciding
// between FAILED_PRECONDITION, ABORTED, and UNAVAILABLE:
//  (a) Use UNAVAILABLE if the client can retry just the failing call.
//  (b) Use ABORTED if the client should retry at a higher-level
//      (e.g., restarting a read-modify-write sequence).
//  (c) Use FAILED_PRECONDITION if the client should not retry until
//      the system state has been explicitly fixed. E.g., if an "rmdir"
//      fails because the directory is non-empty, FAILED_PRECONDITION
//      should be returned since the client should not retry unless
//      they have first fixed up the directory by deleting files from it.
//  (d) Use FAILED_PRECONDITION if the client performs conditional
//      REST Get/Update/Delete on a resource and the resource on the
//      server does not match the condition. E.g., conflicting
//      read-modify-write on the same resource.
FAILED_PRECONDITION = 9;
One that reflects the general case. I think FAILED_PRECONDITION might be the right one. Although NOT_FOUND would also fit.
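To make the litmus test quoted above concrete, here is a small sketch that maps each of the three codes to the retry behaviour the comment describes. It is illustrative only: `Code`, `RetryAction`, and `actionFor` are hypothetical names, and the numeric values for ABORTED (10) and UNAVAILABLE (14) follow the standard gRPC numbering rather than the excerpt quoted here.

```go
package main

import "fmt"

// Code is a stand-in for an RPC-style error code (hypothetical type).
type Code int

const (
	CodeFailedPrecondition Code = 9
	CodeAborted            Code = 10 // assumed from gRPC numbering
	CodeUnavailable        Code = 14 // assumed from gRPC numbering
)

// RetryAction describes what a caller should do after a failure.
type RetryAction int

const (
	RetryCall     RetryAction = iota // (a) retry just the failing call
	RetrySequence                    // (b) retry at a higher level
	FixStateFirst                    // (c), (d) do not retry until state is fixed
	Undetermined                     // code not covered by the litmus test
)

// String returns a short human-readable description of the action.
func (a RetryAction) String() string {
	switch a {
	case RetryCall:
		return "retry the failing call"
	case RetrySequence:
		return "retry the enclosing sequence"
	case FixStateFirst:
		return "fix system state before retrying"
	default:
		return "undetermined"
	}
}

// actionFor applies the litmus test to a single code.
func actionFor(c Code) RetryAction {
	switch c {
	case CodeUnavailable:
		return RetryCall
	case CodeAborted:
		return RetrySequence
	case CodeFailedPrecondition:
		return FixStateFirst
	default:
		return Undetermined
	}
}

func main() {
	for _, c := range []Code{CodeUnavailable, CodeAborted, CodeFailedPrecondition} {
		fmt.Printf("code %d: %s\n", int(c), actionFor(c))
	}
}
```

Under this reading, a missing column falls under case (c): retrying cannot succeed until the schema itself is changed.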
Signed-off-by: Shlomi Noach <[email protected]>
Thanks!
Signed-off-by: Matt Lord <[email protected]>
Description
This PR fixes a couple of scenarios where an Online DDL ALTER vitess migration should fail, but instead keeps retrying indefinitely behind the scenes. While retrying, it does not provide any meaningful info in SHOW VITESS_MIGRATIONS. See #14285 and #14289.
With this PR, a few specific scenarios are treated as fatal/unrecoverable, both in Online DDL's state machine and in vreplication. A Code_INTERNAL error is treated as unrecoverable. A couple of schema-related errors are now set as Code_INTERNAL.
I've added an endtoend test that covers the OnlineDDL fix. I'm unsure about a vreplication test.
Related Issue(s)
queued state without proper error message #14285
Checklist
Deployment Notes