
Add a retry to remove the vttablet directory during upgrade/downgrade backup tests #14753

Merged
merged 3 commits into from
Dec 12, 2023

@frouioui (Member) commented Dec 12, 2023

Description

This PR fixes a small issue in the manual upgrade/downgrade backup tests. The issue occurs when deleting the VTTablet directory fails because some of its files are still in use, causing rm -Rf to fail with error code 39 (ENOTEMPTY, "Directory not empty"). It is visible in the logs of the Stop tablets step:

Shutting down tablet zone1-301
Stopping vttablet...
Shutting down mysql zone1-301
Shutting down MySQL for tablet zone1-0000000301...
Removing tablet directory zone1-301
Successfully deleted 1 tablets
rm: cannot remove '/tmp//vt_0000000301': Directory not empty

This silent error causes the Start new tablet and restore step to fail. Because the VTTablet directory was only partially removed, VTTablet assumes it should resume from an existing data directory (which is not what we want). And since most of the directory's contents were already deleted, the my.cnf file no longer exists, which leads to a fatal error (also visible in the logs):

Starting MySQL for tablet zone1-0000000301...
Resuming from existing vttablet dir:
    /tmp//vt_0000000301
E1208 15:33:57.111914   44151 mysqlctl.go:276] failed to find mysql config: couldn't read my.cnf file: open /tmp/vt_0000000301/my.cnf: no such file or directory
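To make this failure mode concrete, here is a simplified sketch of the init-versus-resume decision. It assumes, based only on the "Resuming from existing vttablet dir" log line, that the start script picks its action by checking whether the tablet directory exists. The function names `choose_mysqlctl_action` and `has_my_cnf` are hypothetical, and this is not the actual Vitess script.

```shell
#!/usr/bin/env bash
# Hypothetical sketch, NOT the actual Vitess start script: it assumes the
# script chooses between initializing and resuming by checking whether the
# tablet directory exists.
choose_mysqlctl_action() {
  local tablet_dir="$1"
  if [ -d "$tablet_dir" ]; then
    echo "start"   # directory exists: resume from the existing data dir
  else
    echo "init"    # no directory: initialize a fresh data dir
  fi
}

# Resuming needs my.cnf, so a directory left behind by a partial rm -Rf
# passes the -d check above but fails this one.
has_my_cnf() {
  [ -f "$1/my.cnf" ]
}

# Simulate a partially removed (emptied) tablet directory.
tablet_dir="$(mktemp -d)"
echo "action: $(choose_mysqlctl_action "$tablet_dir")"
if ! has_my_cnf "$tablet_dir"; then
  echo "my.cnf missing: resuming would fail"
fi
rmdir "$tablet_dir"
echo "action after full removal: $(choose_mysqlctl_action "$tablet_dir")"
```

The point of the sketch is that an existence check on the directory cannot distinguish a healthy data dir from one that a failed rm -Rf has mostly emptied.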

Local testing suggests that adding a sleep before the rm -Rf is enough to avoid the error. Rather than a fixed sleep, I added a 30-second retry loop that attempts to remove the VTTablet directory once per second.
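A minimal sketch of such a retry loop, written to run in bash (and in other POSIX-style shells that support `local`). It mirrors the branch structure visible in the review excerpt further down (grep the captured stderr for `Directory not empty`, treat an empty stderr file as success), but the function name `remove_dir_with_retry`, its attempt-count parameter, the abort-on-other-errors branch, and the `LC_ALL=C` setting are illustrative assumptions, not necessarily what the PR does.

```shell
#!/usr/bin/env bash
# Hypothetical helper: retry `rm -Rf` once per second, up to max_attempts times.
remove_dir_with_retry() {
  local dir="$1"
  local max_attempts="${2:-30}"  # 30 attempts at 1s each, roughly a 30-second loop
  local temp_file attempt=1
  temp_file="$(mktemp)"

  while [ "$attempt" -le "$max_attempts" ]; do
    # Capture only stderr; rm -Rf prints nothing on success (including when
    # the directory is already gone). LC_ALL=C keeps the error text in
    # English so the grep below matches.
    LC_ALL=C rm -Rf "$dir" 2> "$temp_file"
    if grep -q 'Directory not empty' "$temp_file"; then
      echo "Directory not empty, retrying..."
    elif [ ! -s "$temp_file" ]; then
      rm -f "$temp_file"    # empty stderr: the directory was removed
      return 0
    else
      cat "$temp_file" >&2  # any other error: report it and stop retrying
      rm -f "$temp_file"
      return 1
    fi
    attempt=$((attempt + 1))
    # Skip the sleep after the final attempt.
    if [ "$attempt" -le "$max_attempts" ]; then
      sleep 1
    fi
  done

  echo "failed to remove $dir after $max_attempts attempts" >&2
  rm -f "$temp_file"
  return 1
}
```

Usage, with the tablet directory from the logs above: `remove_dir_with_retry /tmp/vt_0000000301 || exit 1`. Failing loudly after the last attempt, instead of continuing, is what turns the previously silent error into a visible one.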

In addition, the list of files watched by the manual backup workflows has been updated to include the examples/** path, since our tests use it.

@frouioui requested a review from deepthi as a code owner December 12, 2023 01:03
vitess-bot [bot] (Contributor) commented Dec 12, 2023

Review Checklist

Hello reviewers! 👋 Please follow this checklist when reviewing this Pull Request.

General

  • Ensure that the Pull Request has a descriptive title.
  • Ensure there is a link to an issue (except for internal cleanup and flaky test fixes); new features should have an RFC that documents use cases and test cases.

Tests

  • Bug fixes should have at least one unit or end-to-end test; enhancements and new features should have a sufficient number of tests.

Documentation

  • Apply the release notes (needs details) label if users need to know about this change.
  • New features should be documented.
  • There should be some code comments as to why things are implemented the way they are.
  • There should be a comment at the top of each new or modified test to explain what the test does.

New flags

  • Is this flag really necessary?
  • Flag names must be clear and intuitive, use dashes (-), and have a clear help text.

If a workflow is added or modified:

  • Each item in Jobs should be named in order to mark it as required.
  • If the workflow needs to be marked as required, the maintainer team must be notified.

Backward compatibility

  • Protobuf changes should be wire-compatible.
  • Changes to _vt tables and RPCs need to be backward compatible.
  • RPC changes should be compatible with vitess-operator.
  • If a flag is removed, then it should also be removed from vitess-operator and arewefastyet, if used there.
  • vtctl command output order should be stable and awk-able.

@vitess-bot [bot] added the NeedsDescriptionUpdate, NeedsIssue, and NeedsWebsiteDocsUpdate labels Dec 12, 2023
@frouioui added the Flakes and Backport to: release-16.0 labels and removed the NeedsDescriptionUpdate, NeedsWebsiteDocsUpdate, and NeedsIssue labels Dec 12, 2023
@github-actions [bot] added this to the v19.0.0 milestone Dec 12, 2023
@mattlord (Contributor) left a comment

LGTM! Only had a small nit about quoting the variable when used (it could potentially have whitespace). FWIW, I also think it's more descriptive to say that we're adding a retry rather than a timeout.

I remember running into this same Directory not empty error when dealing with some other backup-related test flakiness some time ago. I think it might have been here: https://github.com/vitessio/vitess/pull/11352/files#diff-4f8ae786620a21df8ced1435fc0318b8abba570dfea4c2177b1ecb725b8c5073L236-R244

Thank you for working on this!


if grep -q 'Directory not empty' $temp_file; then
  echo "Directory not empty, retrying..."
elif [ ! -s $temp_file ]; then
Contributor

We should quote the $temp_file variable everywhere to be safe (can run shellcheck on the file too).

Member Author

Addressed via 8841067
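The quoting nit matters because an unquoted `$temp_file` undergoes word splitting (and pathname expansion) before the command sees it; shellcheck reports this as SC2086. A small demonstration, using a hypothetical path that contains a space and a hypothetical `count_args` helper:

```shell
#!/usr/bin/env bash
# Prints how many arguments it received.
count_args() { echo "$#"; }

temp_file="/tmp/backup test/stderr.log"  # hypothetical path containing a space

count_args $temp_file    # unquoted: split on the space, prints 2
count_args "$temp_file"  # quoted: passed as one argument, prints 1
```

With the unquoted form, `grep -q 'Directory not empty' $temp_file` would try to read two nonexistent files instead of the intended one.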

@frouioui changed the title from "Add timeout to remove the vttablet directory during upgrade/downgrade backup tests" to "Add a retry to remove the vttablet directory during upgrade/downgrade backup tests" Dec 12, 2023
@frouioui (Member Author)

> Only had a small nit about quoting the variable when used (it could potentially have whitespace).

I will go ahead and fix it, thanks!

> I also think it's more descriptive to say that we're adding a retry rather than a timeout

Got it! I modified both the title and the description.

> I remember running into this same Directory not empty error when dealing with some other backup-related test flakiness some time ago. I think it might have been here: #11352 (files)

Nice, thank you! This is useful.

@harshit-gangal mentioned this pull request Dec 12, 2023
frouioui pushed a commit that referenced this pull request Dec 12, 2023
…grade/downgrade backup tests (#14753) (#14758)

Co-authored-by: vitess-bot[bot] <108069721+vitess-bot[bot]@users.noreply.github.com>
frouioui added a commit that referenced this pull request Dec 12, 2023
frouioui added a commit that referenced this pull request Dec 12, 2023
frouioui added a commit that referenced this pull request Dec 12, 2023
…grade/downgrade backup tests (#14753) (#14757)

Co-authored-by: Florent Poinsard
deepthi pushed a commit that referenced this pull request Dec 13, 2023
…grade/downgrade backup tests (#14753) (#14756)

Signed-off-by: Florent Poinsard
Co-authored-by: Florent Poinsard
ejortegau pushed a commit to slackhq/vitess that referenced this pull request Dec 13, 2023
3 participants