Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

mysqld: check for correct return value from WAIT_FOR_EXECUTED_GTID_SET #14739

Closed
wants to merge 1 commit into from

Conversation

deepthi
Copy link
Member

@deepthi deepthi commented Dec 9, 2023

Description

In #14612, we changed how we wait for a primary candidate to catch up to a specific replication position. Specifically for the mysql flavor, instead of using WAIT_UNTIL_SQL_THREAD_AFTER_GTIDS we now use WAIT_FOR_EXECUTED_GTID_SET.
However, the two functions return different values on timeout. The old one returns -1, whereas the new one returns 1.
This was causing us to assume that WaitForSourcePos was successful when in fact it had timed out, leading to the situation described in #14738.

I have ignored the MariaDB flavor since we dropped support in v16. In any case, that flavor is executing this command with no timeout specified, which means it will wait indefinitely and return 0.

Right now the only "affected" version is main. However, because #14612 was back ported to all release branches, this fix also needs to be back ported and needs to go into the next v18 patch release we plan to revert it in all the release branches. Luckily, we haven't actually made any new patch releases after #14612 was merged.

I spent some time trying to come up with a way to unit test this, but ran into the old problem of the complexity of mocking a real mysql. I'll probably spend more time on that later on, but right now we need to get this fixed ASAP.

Proof of the correctness of this change (MySQL only):

mysql> SELECT WAIT_FOR_EXECUTED_GTID_SET('b68c0b3c-964e-11ee-812c-32b90e164722:1-42,b95c3986-964e-11ee-a897-d342cf47f4f4:1-66', 4);
+----------------------------------------------------------------------------------------------------------------------+
| WAIT_FOR_EXECUTED_GTID_SET('b68c0b3c-964e-11ee-812c-32b90e164722:1-42,b95c3986-964e-11ee-a897-d342cf47f4f4:1-66', 4) |
+----------------------------------------------------------------------------------------------------------------------+
|                                                                                                                    1 |
+----------------------------------------------------------------------------------------------------------------------+
1 row in set (4.01 sec)

mysql> SELECT WAIT_UNTIL_SQL_THREAD_AFTER_GTIDS('b68c0b3c-964e-11ee-812c-32b90e164722:1-42,b95c3986-964e-11ee-a897-d342cf47f4f4:1-66', 1);
+-----------------------------------------------------------------------------------------------------------------------------+
| WAIT_UNTIL_SQL_THREAD_AFTER_GTIDS('b68c0b3c-964e-11ee-812c-32b90e164722:1-42,b95c3986-964e-11ee-a897-d342cf47f4f4:1-66', 1) |
+-----------------------------------------------------------------------------------------------------------------------------+
|                                                                                                                          -1 |
+-----------------------------------------------------------------------------------------------------------------------------+
1 row in set, 1 warning (1.01 sec)

Related Issue(s)

Checklist

  • "Backport to:" labels have been added if this change should be back-ported
  • Tests were added or are not required
  • Did the new or modified tests pass consistently locally and on the CI
  • Documentation was added or is not required

Deployment Notes

EDIT: instead of back porting this PR, we are reverting the original PR on all release branches. That gives us time to debate the code changes and get them right on main for v19.

Copy link
Contributor

vitess-bot bot commented Dec 9, 2023

Review Checklist

Hello reviewers! 👋 Please follow this checklist when reviewing this Pull Request.

General

  • Ensure that the Pull Request has a descriptive title.
  • Ensure there is a link to an issue (except for internal cleanup and flaky test fixes), new features should have an RFC that documents use cases and test cases.

Tests

  • Bug fixes should have at least one unit or end-to-end test, enhancement and new features should have a sufficient number of tests.

Documentation

  • Apply the release notes (needs details) label if users need to know about this change.
  • New features should be documented.
  • There should be some code comments as to why things are implemented the way they are.
  • There should be a comment at the top of each new or modified test to explain what the test does.

New flags

  • Is this flag really necessary?
  • Flag names must be clear and intuitive, use dashes (-), and have a clear help text.

If a workflow is added or modified:

  • Each item in Jobs should be named in order to mark it as required.
  • If the workflow needs to be marked as required, the maintainer team must be notified.

Backward compatibility

  • Protobuf changes should be wire-compatible.
  • Changes to _vt tables and RPCs need to be backward compatible.
  • RPC changes should be compatible with vitess-operator
  • If a flag is removed, then it should also be removed from vitess-operator and arewefastyet, if used there.
  • vtctl command output order should be stable and awk-able.

@vitess-bot vitess-bot bot added NeedsDescriptionUpdate The description is not clear or comprehensive enough, and needs work NeedsIssue A linked issue is missing for this Pull Request NeedsWebsiteDocsUpdate What it says labels Dec 9, 2023
@github-actions github-actions bot added this to the v19.0.0 milestone Dec 9, 2023
@deepthi
Copy link
Member Author

deepthi commented Dec 9, 2023

For anyone who is interested, here's how I simulated this

  • run local example, insert some rows
  • change rdonly to replica because we are running with semi-sync
  • connect to a replica mysqld and stop io_thread
  • insert more rows into primary
  • execute the commands listed in the description on the replica where io_thread is stopped

@deepthi
Copy link
Member Author

deepthi commented Dec 9, 2023

@mattlord one thing to check during review is where we still use WaitForPosition etc. in vreplication imports and whether this breaks those for MariaDB / FilePos.
Is FilePos still even relevant, or is it something we should be dropping?

@deepthi
Copy link
Member Author

deepthi commented Dec 9, 2023

For now I've marked this for backports, but we also have the option of reverting the change on release branches while we fix forward on main.
We should consider that option.

@rohit-nayak-ps
Copy link
Contributor

@mattlord one thing to check during review is where we still use WaitForPosition etc. in vreplication imports and whether this breaks those for MariaDB / FilePos. Is FilePos still even relevant, or is it something we should be dropping?

Both the mariadb and file pos wait for position commands return -1 for timeout.
https://dev.mysql.com/doc/refman/5.7/en/miscellaneous-functions.html#function_master-pos-wait
https://mariadb.com/kb/en/master_gtid_wait/

We could enhance the flavor interface to get the value expected for timeout and use that rather than hardcoding it here.

@@ -377,7 +377,7 @@ func (mysqld *Mysqld) WaitSourcePos(ctx context.Context, targetPos replication.P
if result.IsNull() {
return fmt.Errorf("%v(%v) failed: replication is probably stopped", waitCommandName, query)
}
if result.ToString() == "-1" {
if result.ToString() == "1" {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should invert the check here and validate success.

For any functions that return C style integer arguments where one value specifically indicates success, we should always check against that value. I’ve see trying to check against failure values to cause problems too many times.

So either here do a != “0” or switch the return and fallback and do == “0”.

Copy link
Contributor

@mattlord mattlord Dec 9, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this way would work with 8.0 and 8.2 (and MariaDB). 0 is success, non-zero value is failure. And I think that we should add a const for it so that it's more obvious.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good points. I'll push a change.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See later comments from Rohit.

@deepthi deepthi removed NeedsDescriptionUpdate The description is not clear or comprehensive enough, and needs work NeedsWebsiteDocsUpdate What it says NeedsIssue A linked issue is missing for this Pull Request labels Dec 9, 2023
@deepthi
Copy link
Member Author

deepthi commented Dec 9, 2023

While we discuss the right way to fix this for all flavors and work on that, I'm going to revert the backport PRs on release branches. Given that the original PR was not actually fixing a bug, we did not need to backport it.

EDIT: Created #14742 #14743 and #14744 and removed the backport labels.

@rohit-nayak-ps
Copy link
Contributor

rohit-nayak-ps commented Dec 9, 2023

Unfortunately for FilePos the return value can be non-zero on success: The return value is the number of log events the replica had to wait for to advance to the specified position.

We should delegate the return value interpretation to the flavor interface and let that interface return SUCCESS, TIMEOUT or ERROR?

MASTER_POS_WAIT(log_name,log_pos[,timeout][,channel])

This function is useful for control of source-replica synchronization. It blocks until the 
replica has read and applied all updates up to the specified position in the source log. 
The return value is the number of log events the replica had to wait for to advance to 
the specified position. The function returns NULL if the replica SQL thread is not 
started, the replica's source information is not initialized, the arguments are incorrect, 
or an error occurs. It returns -1 if the timeout has been exceeded. If the replica SQL 
thread stops while MASTER_POS_WAIT() is waiting, the function returns NULL. If the 
replica is past the specified position, the function returns immediately.

Copy link
Contributor

@shlomi-noach shlomi-noach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. Trying to understand the logic for backporting / not backporting the fix, wouldn't this be needed in earlier Vitess versions that might need to work with later MySQL versions?

@dbussink
Copy link
Contributor

Closing in favor of #14745

@dbussink dbussink closed this Dec 11, 2023
@dbussink dbussink deleted the ds-fix-14738 branch December 11, 2023 08:37
@deepthi
Copy link
Member Author

deepthi commented Dec 11, 2023

Trying to understand the logic for backporting / not backporting the fix, wouldn't this be needed in earlier Vitess versions that might need to work with later MySQL versions?

We have talked about this before. There is no guarantee that an earlier Vitess version will work with a MySQL version that comes along later. In fact, we only guarantee that it works with the versions that we are providing in the docker images. We probably need to write this down somewhere to avoid confusion / incorrect assumptions.

@shlomi-noach
Copy link
Contributor

There is no guarantee that an earlier Vitess version will work with a MySQL version that comes along later.

Got it! I'll keep that in mind.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
5 participants