Bug Report: tablet throttler starvation scenario #15397

Closed
shlomi-noach opened this issue Mar 3, 2024 · 1 comment
@shlomi-noach Contributor

Overview of the Issue

We've identified a throttler starvation scenario, where the throttler needlessly rejects a waiting app. This happens in v17 and v18. It does not happen in v19, but only due to an unintentional change, as explained below.

The starvation results from the combination of multiple load-shedding optimizations. Specifically, the scenario can happen when:

  • On-demand heartbeat is set.
  • An app asks for throttling on a REPLICA tablet; e.g. this could be the rowstreamer.
  • For whatever reason, the app doesn't make requests for 1 minute.

Before explaining how the starvation can happen, here's a recap of the throttler's behavior (a rough sketch of this logic follows the list):

  • The PRIMARY polls the replica tablet throttlers and checks whether they've been recently checked. Normally it polls replicas multiple times per second.
  • A replica responds with "I've been recently checked" if some client checked its throttler state within the last 1-2 seconds.
  • If so, the PRIMARY asks for a heartbeat lease.
  • If, for over 1 minute, no "checks" are made on the PRIMARY, and likewise the PRIMARY doesn't see any recent checks on the replicas, the PRIMARY throttler goes into dormant mode. In this mode it saves resources and only polls the replica tablets once per minute.
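
As a rough sketch of the above, with hypothetical names and values that may differ from the actual Vitess implementation:

```go
package throttler

import "time"

// Hypothetical constants approximating the recap above; the actual
// Vitess names and values may differ.
const (
	recentCheckWindow   = 2 * time.Second        // replica counts as "recently checked" within this window
	dormancyThreshold   = time.Minute            // no checks seen for this long => PRIMARY goes dormant
	activePollInterval  = 250 * time.Millisecond // normal polling: multiple times per second
	dormantPollInterval = time.Minute            // dormant polling: once per minute
)

// recentlyChecked is the replica's answer when the PRIMARY polls it.
func recentlyChecked(lastCheck, now time.Time) bool {
	return now.Sub(lastCheck) <= recentCheckWindow
}

// pollInterval decides how often the PRIMARY polls the replicas: if no
// check has been seen anywhere for over a minute, the PRIMARY goes
// dormant, polls once per minute, and stops renewing heartbeat leases.
func pollInterval(lastSeenCheck, now time.Time) time.Duration {
	if now.Sub(lastSeenCheck) > dormancyThreshold {
		return dormantPollInterval
	}
	return activePollInterval
}
```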

The starvation scenario occurs when the PRIMARY is in dormant mode and only polls the replica once per minute. When it polls the replica, it asks whether it was recently checked. However, the client which checks the replica, for its own reasons, does not keep retrying every second; it may choose to retry only once per minute. Chances are, then, that when the PRIMARY polls the replica, the replica will have last been checked some 20s-30s earlier, and the replica responds with "I have not been recently checked". This keeps the PRIMARY in dormant mode, and heartbeats are not renewed.

At some point the PRIMARY's poll collides with the client's check, and the issue resolves itself. However, if the client happens to retry on a 1-minute interval, the two can keep missing each other for hours, until some drift converges them.
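
To make the perpetual miss concrete, here is a small illustrative simulation (not Vitess code), assuming a 2-second "recently checked" window, a dormant PRIMARY polling every 60s, and a client checking every 60s but offset by 30s:

```go
package main

import (
	"fmt"
	"time"
)

// Illustrative simulation of the miss described above (not Vitess code).
// A dormant PRIMARY polls every 60s starting at t=0; the client checks the
// replica every 60s starting at t=30s; "recently checked" means within 2s.
func main() {
	const (
		window      = 2 * time.Second
		pollEvery   = 60 * time.Second
		checkEvery  = 60 * time.Second
		checkOffset = 30 * time.Second
	)
	for i := 0; i < 5; i++ {
		pollAt := time.Duration(i) * pollEvery
		// Find the most recent client check at or before this poll.
		lastCheck := time.Duration(-1)
		for t := checkOffset; t <= pollAt; t += checkEvery {
			lastCheck = t
		}
		if lastCheck < 0 {
			fmt.Printf("poll at %6v: no check yet => stay dormant\n", pollAt)
			continue
		}
		recent := pollAt-lastCheck <= window
		fmt.Printf("poll at %6v: last check %v ago => recently checked: %v\n",
			pollAt, pollAt-lastCheck, recent)
	}
	// Every poll finds the last check 30s old, never within the 2s window,
	// so the PRIMARY stays dormant and heartbeats are never renewed.
}
```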

In v19 this does not happen, due to an unintentional change: the PRIMARY's self-checks unintentionally update the PRIMARY's own last-check timestamp, which means the PRIMARY never goes dormant.

Proposed solutions:

  • Instead of relying on an accidental hit or miss, the replica can proactively let the PRIMARY know it's been checked (see the sketch after this list). This does not have to happen every time the replica gets checked: once per dormant period (1 minute) is sufficient.
  • Another approach would be to force the time intervals to overlap. For example, if "recently checked" applied to any check taking place in the past 1 minute (as opposed to 1-2 seconds), then even a dormant PRIMARY polling the replica once per minute would hit an "I've been recently checked" response.
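
A minimal sketch of the first proposal, with hypothetical names (notifyPrimary stands in for some RPC to the PRIMARY's throttler; it is not an existing Vitess call):

```go
package throttler

import (
	"sync"
	"time"
)

// replicaNotifier sketches the first proposal: when a replica's throttler
// is checked, it proactively notifies the PRIMARY, but at most once per
// dormant period, so the PRIMARY learns about checks even while its own
// polling is dormant.
type replicaNotifier struct {
	mu            sync.Mutex
	lastNotified  time.Time
	notifyPrimary func() // hypothetical: e.g. a gRPC call to the PRIMARY's throttler
}

// onCheck is invoked whenever a client checks this replica's throttler.
func (r *replicaNotifier) onCheck(now time.Time) {
	r.mu.Lock()
	defer r.mu.Unlock()
	// Rate-limit notifications to once per dormant period (1 minute).
	if now.Sub(r.lastNotified) >= time.Minute {
		r.lastNotified = now
		go r.notifyPrimary() // fire-and-forget; don't block the check path
	}
}
```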

Reproduction Steps

Binary Version

- `17`
- `18`
- Does not happen on `19`, due to the unintentional behavior described above.
- `20`, once the unintentional behavior is removed.

Operating System and Environment details

-

Log Fragments

No response
