Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Auto-set 5s on-demand heartbeat if --enable_heartbeat is disabled #15099

Closed

Conversation

shlomi-noach
Copy link
Contributor

@shlomi-noach shlomi-noach commented Jan 31, 2024

Description

This work started with #14978 and #14980. And while the discussion was about the examples suite, it applies to all environments.

After #14980 was merged, we realized examples could no longer facilitate the tablet throttler with its default setup. That's because the throttler, by default, relies on replication lag calculated from _vt.heartbeat.

And while the throttler can be enabled/disabled dynamically, the heartbeat configuration is set on startup.

Following #14980 we asked ourselves: should the throttler be available for operation in examples? And we agreed that it should -- same as any other component in Vitess. To that effect, the throttler should be able to read valid heartbeat information.

There are multiple ways to go about it, and some considerations are:

  • examples are expected to run on local laptops or otherwise small hosts. examples should not generate any excessive load.
  • Likewise, any testing or even prod Vitess environment should not generate any excessive load.
  • The issue in Bug Report: Vitess local examples high replica heartbeat lag #14978 should be addressed. In Bug Report: Vitess local examples high replica heartbeat lag #14978, the tablet /debug/status page reported lag due to stale heartbeats, and even though there was no actual lag.
  • We don't want to introduce programmatic complexity, such as a dependency or control of the throttler over the heartbeat mechanism (there's already some relationship between the two, it should not be even larger).

The solution in this PR is simple: if --heartbeat_enable is set, great, heartbeats are on, and nothing else to do. But if it is unset, and neither is heartbeat_on_demand_duration, then we set heartbeat_on_demand_duration to 5s. Meaning, there has to be some sort of heartbeat.

As a reminder, the rules for heartbeat_on_demand_duration are:

  • One single injection of a heartbeat value on tablet's startup/Open().
  • A lease of heartbeats, limited in time, e.g. 5s, upon request.
  • The only entity right now that requests heartbeats is the tablet throttler.
  • The throttler will only do so when a consumer (e.g. vreplication) actively asks it for throttling feedback.

In short, with this new 5s on-demand heartbeat default value, the the throttler will be able to work correctly in examples and in any other Vitess setup that is otherwise not configured for heartbeats. Heartbeats will only be generated while the throttler is both enabled and actively engaged with some vreplication workflow or Online DDL operation etc.

And, back to examples, because heartbeat_enable is unset, then heartbeatReader is not enabled, and as result /debug/status reports replication lag based on MySQL replication as opposed to based on _vt.heartbeat. So this PR still respects #14978.

Related Issue(s)

Backporting

Like #15099, this should be backported to 18.0 and 17.0 and for the same reasons.

Checklist

  • "Backport to:" labels have been added if this change should be back-ported to release branches
  • If this change is to be back-ported to previous releases, a justification is included in the PR description
  • Tests were added or are not required
  • Did the new or modified tests pass consistently locally and on CI?
  • Documentation was added or is not required

Deployment Notes

Copy link
Contributor

vitess-bot bot commented Jan 31, 2024

Review Checklist

Hello reviewers! 👋 Please follow this checklist when reviewing this Pull Request.

General

  • Ensure that the Pull Request has a descriptive title.
  • Ensure there is a link to an issue (except for internal cleanup and flaky test fixes), new features should have an RFC that documents use cases and test cases.

Tests

  • Bug fixes should have at least one unit or end-to-end test, enhancement and new features should have a sufficient number of tests.

Documentation

  • Apply the release notes (needs details) label if users need to know about this change.
  • New features should be documented.
  • There should be some code comments as to why things are implemented the way they are.
  • There should be a comment at the top of each new or modified test to explain what the test does.

New flags

  • Is this flag really necessary?
  • Flag names must be clear and intuitive, use dashes (-), and have a clear help text.

If a workflow is added or modified:

  • Each item in Jobs should be named in order to mark it as required.
  • If the workflow needs to be marked as required, the maintainer team must be notified.

Backward compatibility

  • Protobuf changes should be wire-compatible.
  • Changes to _vt tables and RPCs need to be backward compatible.
  • RPC changes should be compatible with vitess-operator
  • If a flag is removed, then it should also be removed from vitess-operator and arewefastyet, if used there.
  • vtctl command output order should be stable and awk-able.

@vitess-bot vitess-bot bot added NeedsBackportReason If backport labels have been applied to a PR, a justification is required NeedsDescriptionUpdate The description is not clear or comprehensive enough, and needs work NeedsIssue A linked issue is missing for this Pull Request NeedsWebsiteDocsUpdate What it says labels Jan 31, 2024
@github-actions github-actions bot added this to the v19.0.0 milestone Jan 31, 2024
@deepthi deepthi requested review from maxenglander and mattlord and removed request for a team, systay and harshit-gangal January 31, 2024 10:49
Signed-off-by: Shlomi Noach <[email protected]>
Copy link

codecov bot commented Jan 31, 2024

Codecov Report

Attention: 427 lines in your changes are missing coverage. Please review.

Comparison is base (eddb39e) 47.29% compared to head (0229785) 47.72%.
Report is 86 commits behind head on main.

Files Patch % Lines
go/vt/mysqlctl/builtinbackupengine.go 12.85% 61 Missing ⚠️
go/vt/vtgate/evalengine/cached_size.go 0.00% 41 Missing ⚠️
go/vt/vtctl/workflow/traffic_switcher.go 0.00% 25 Missing ⚠️
go/vt/vtgate/engine/delete_with_input.go 53.70% 23 Missing and 2 partials ⚠️
go/vt/mysqlctl/backupengine.go 0.00% 24 Missing ⚠️
go/vt/topo/keyspace.go 64.81% 14 Missing and 5 partials ⚠️
go/vt/vtgate/evalengine/expr_tuple_bvar.go 62.50% 14 Missing and 4 partials ⚠️
go/vt/mysqlctl/xtrabackupengine.go 0.00% 15 Missing ⚠️
go/vt/sqlparser/ast_funcs.go 42.30% 15 Missing ⚠️
go/vt/vtenv/cached_size.go 0.00% 14 Missing ⚠️
... and 44 more
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #15099      +/-   ##
==========================================
+ Coverage   47.29%   47.72%   +0.42%     
==========================================
  Files        1137     1155      +18     
  Lines      238684   240275    +1591     
==========================================
+ Hits       112895   114673    +1778     
+ Misses     117168   116999     -169     
+ Partials     8621     8603      -18     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Collaborator

@maxenglander maxenglander left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The description seems pretty geared towards making heartbeats available in examples. With that context in mind, I think changing the default value (and behavior) of Vitess to meet an examples need creates added risk. The default value being dynamic here adds more complexity that will become difficult to reason about when someone encounters behavior they can't explain.

I'm not sure the added risk and complexity is worth it to satisfy an examples need, if we're able to satisfy it just as well by setting an explicit flag in the examples config?

@shlomi-noach
Copy link
Contributor Author

@maxenglander I see your point and it is valid. Still, some further thoughts:

The description seems pretty geared towards making heartbeats available in examples.

Yes and no. As I went back and forth with the configuration, I realized the heartbeat configuration emerged organically but ended up in a bit of a confusing state. Like the fact it takes either of two flags to make heartbeats possible, or the fact that one flag (on_demand_heartbeats) overrides the behavior of the other flag (heartbeat_enable). In retrospect we would have designed this differently. But the end result is that whether on examples or on any other vitess setup, when you enable the throttler, you expect it to be able to work. The idea that the throttler will not work unless you pre-configure some other variable presents an inconsistency, because the throttler can be enabled and disabled dynamically.
What we say in examples applies just as well to any other setup.

A different approach could be that, upon activation, the throttler could programmatically enforce the activation of heartbeats.

Comment on lines 185 to +186
fs.DurationVar(&heartbeatInterval, "heartbeat_interval", 1*time.Second, "How frequently to read and write replication heartbeat.")
fs.DurationVar(&heartbeatOnDemandDuration, "heartbeat_on_demand_duration", 0, "If non-zero, heartbeats are only written upon consumer request, and only run for up to given duration following the request. Frequent requests can keep the heartbeat running consistently; when requests are infrequent heartbeat may completely stop between requests")
fs.DurationVar(&heartbeatOnDemandDuration, "heartbeat_on_demand_duration", 0, "If non-zero, heartbeats are only written upon consumer request, and only run for up to given duration following the request. Frequent requests can keep the heartbeat running consistently. Automatically set to 5s when --heartbeat_enable is unset.")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we make the two flags mutually exclusive in pFlag? That might simplify the logic.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a great idea, irrespective of this PR! I only wonder if it's going to break someone's existing setup?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, it might... 😢 We could potentially add it to the 19.0.0 changelog as a breaking change though.

Maybe we should take a documentation driven approach here. How would we document the new/proposed configuration here? https://vitess.io/docs/19.0/reference/features/tablet-throttler/#configuration

  1. IMO we should remove all of the old tablet level config parts/noise there in the v19 docs
  2. That seems a little misleading now as it says "Enabling the lag throttler also automatically enables heartbeat injection." and it does not note that you must set --heartbeat_enable=false (which I think was true before given the noted issues reported with the examples) and --heartbeat_on_demand_duration to a non-zero value — although it does describe what it is and recommend values for it

@shlomi-noach
Copy link
Contributor Author

I'm not sure the added risk and complexity is worth it to satisfy an examples need, if we're able to satisfy it just as well by setting an explicit flag in the examples config?

In the meantime I'll open a new PR adding the flag in examples scripts.

@shlomi-noach
Copy link
Contributor Author

Followup in #15204.

@shlomi-noach
Copy link
Contributor Author

Meanwhile converting this to Draft as we continue to think about the correct solution.

@shlomi-noach
Copy link
Contributor Author

Please see this followup issue: #15303

@shlomi-noach
Copy link
Contributor Author

Closing for now. Follow #15303 for related work.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
NeedsBackportReason If backport labels have been applied to a PR, a justification is required NeedsDescriptionUpdate The description is not clear or comprehensive enough, and needs work NeedsIssue A linked issue is missing for this Pull Request NeedsWebsiteDocsUpdate What it says
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants