Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OnlineDDL: fix scenarios where migration hangs instead of directly failing #14290

Merged
merged 4 commits into from
Oct 17, 2023

Conversation

shlomi-noach
Copy link
Contributor

Description

This PR fixes a couple scenarios where an Online DDL ALTER vitess migration should fail, but instead infinitely keeps retrying behind the scenes. While retrying, it does not provide any meaninful info in SHOW VITESS_MIGRATIONS. See #14285 and #14289

With this PR, a few specific scenarios are treated as fatal/unrecoverable, bot in Online DDL's state machine, as well as in vreplication.

  • With Online DDL: an error in reviewing a migration immediately fails the migration.
  • With vreplication: a Code_INTERNAL error is treated as unrecoverable. A couple schema-related errors are now set as Code_INTERNAL.

I've added an endtoend test that covers the OnlineDDL fix. I'm unsure about a vreplication test.

Related Issue(s)

Checklist

  • "Backport to:" labels have been added if this change should be back-ported
  • Tests were added or are not required
  • Did the new or modified tests pass consistently locally and on the CI
  • Documentation was added or is not required

Deployment Notes

@vitess-bot
Copy link
Contributor

vitess-bot bot commented Oct 16, 2023

Review Checklist

Hello reviewers! 👋 Please follow this checklist when reviewing this Pull Request.

General

  • Ensure that the Pull Request has a descriptive title.
  • Ensure there is a link to an issue (except for internal cleanup and flaky test fixes), new features should have an RFC that documents use cases and test cases.

Tests

  • Bug fixes should have at least one unit or end-to-end test, enhancement and new features should have a sufficient number of tests.

Documentation

  • Apply the release notes (needs details) label if users need to know about this change.
  • New features should be documented.
  • There should be some code comments as to why things are implemented the way they are.
  • There should be a comment at the top of each new or modified test to explain what the test does.

New flags

  • Is this flag really necessary?
  • Flag names must be clear and intuitive, use dashes (-), and have a clear help text.

If a workflow is added or modified:

  • Each item in Jobs should be named in order to mark it as required.
  • If the workflow needs to be marked as required, the maintainer team must be notified.

Backward compatibility

  • Protobuf changes should be wire-compatible.
  • Changes to _vt tables and RPCs need to be backward compatible.
  • RPC changes should be compatible with vitess-operator
  • If a flag is removed, then it should also be removed from vitess-operator and arewefastyet, if used there.
  • vtctl command output order should be stable and awk-able.

@vitess-bot vitess-bot bot added NeedsDescriptionUpdate The description is not clear or comprehensive enough, and needs work NeedsIssue A linked issue is missing for this Pull Request NeedsWebsiteDocsUpdate What it says labels Oct 16, 2023
@github-actions github-actions bot added this to the v19.0.0 milestone Oct 16, 2023
@shlomi-noach shlomi-noach removed NeedsDescriptionUpdate The description is not clear or comprehensive enough, and needs work NeedsWebsiteDocsUpdate What it says NeedsIssue A linked issue is missing for this Pull Request labels Oct 16, 2023
@rohit-nayak-ps rohit-nayak-ps added Component: Online DDL Online DDL (vitess/native/gh-ost/pt-osc) and removed Component: Online DDL Online DDL (vitess/native/gh-ost/pt-osc) labels Oct 16, 2023
go/vt/vttablet/tabletmanager/vreplication/utils.go Outdated Show resolved Hide resolved
Comment on lines 128 to 130
if vterrors.Code(err) == vtrpc.Code_INTERNAL {
return true
}
Copy link
Contributor

@mattlord mattlord Oct 16, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This has potentially far reaching effects:

vitess/proto/vtrpc.proto

Lines 158 to 161 in 8575b17

// INTERNAL errors. Means some invariants expected by underlying
// system has been broken. If you see one of these errors,
// something is very broken.
INTERNAL = 13;

I typically use that when there's something very unexpected, which may be e.g. some random bit flip or unknown edge case that may be fine on retry. Please see the error code discussion as we should use a code which indicates that you cannot simply retry w/o fixing the underlying issue (FAILED_PRECONDITION seems like the best one).

This will affect all vreplication usage and will almost certainly lead to some unintended changes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not debating which name is more appropriate, but I did a quick grep and we are using Code_internal today in both vstreamer and vreplication for conditions that are non-recoverable. FAILED_PRECONDITION is not used anywhere as yet.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes sense. I'll change to FAILED_PRECONDITION.

@@ -629,7 +631,7 @@ func (tpb *tablePlanBuilder) analyzeExtraSourcePkCols(colInfos []*ColumnInfo, so
if !col.IsGenerated {
// We shouldn't get here in any normal scenario. If a column is part of colInfos,
// then it must also exist in tpb.colExprs.
return fmt.Errorf("column %s not found in table expressions", col.Name)
return vterrors.Errorf(vtrpc.Code_INTERNAL, "column %s not found in table expressions", col.Name)
Copy link
Contributor

@mattlord mattlord Oct 16, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better to use another code IMO:

vitess/proto/vtrpc.proto

Lines 74 to 127 in 8575b17

// INVALID_ARGUMENT indicates client specified an invalid argument.
// Note that this differs from FAILED_PRECONDITION. It indicates arguments
// that are problematic regardless of the state of the system
// (e.g., a malformed file name).
INVALID_ARGUMENT = 3;
// DEADLINE_EXCEEDED means operation expired before completion.
// For operations that change the state of the system, this error may be
// returned even if the operation has completed successfully. For
// example, a successful response from a server could have been delayed
// long enough for the deadline to expire.
DEADLINE_EXCEEDED = 4;
// NOT_FOUND means some requested entity (e.g., file or directory) was
// not found.
NOT_FOUND = 5;
// ALREADY_EXISTS means an attempt to create an entity failed because one
// already exists.
ALREADY_EXISTS = 6;
// PERMISSION_DENIED indicates the caller does not have permission to
// execute the specified operation. It must not be used for rejections
// caused by exhausting some resource (use RESOURCE_EXHAUSTED
// instead for those errors). It must not be
// used if the caller cannot be identified (use Unauthenticated
// instead for those errors).
PERMISSION_DENIED = 7;
// RESOURCE_EXHAUSTED indicates some resource has been exhausted, perhaps
// a per-user quota, or perhaps the entire file system is out of space.
RESOURCE_EXHAUSTED = 8;
// FAILED_PRECONDITION indicates operation was rejected because the
// system is not in a state required for the operation's execution.
// For example, directory to be deleted may be non-empty, an rmdir
// operation is applied to a non-directory, etc.
//
// A litmus test that may help a service implementor in deciding
// between FAILED_PRECONDITION, ABORTED, and UNAVAILABLE:
// (a) Use UNAVAILABLE if the client can retry just the failing call.
// (b) Use ABORTED if the client should retry at a higher-level
// (e.g., restarting a read-modify-write sequence).
// (c) Use FAILED_PRECONDITION if the client should not retry until
// the system state has been explicitly fixed. E.g., if an "rmdir"
// fails because the directory is non-empty, FAILED_PRECONDITION
// should be returned since the client should not retry unless
// they have first fixed up the directory by deleting files from it.
// (d) Use FAILED_PRECONDITION if the client performs conditional
// REST Get/Update/Delete on a resource and the resource on the
// server does not match the condition. E.g., conflicting
// read-modify-write on the same resource.
FAILED_PRECONDITION = 9;

One that reflects the general case. I think FAILED_PRECONDITION might be the right one. Although NOT_FOUND would also fit.

Copy link
Contributor

@mattlord mattlord left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

Signed-off-by: Matt Lord <[email protected]>
@shlomi-noach shlomi-noach merged commit 46f6ae5 into vitessio:main Oct 17, 2023
115 checks passed
@shlomi-noach shlomi-noach deleted the onlineddl-fix-failed-queued branch October 17, 2023 07:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
3 participants