Bug Report: tablet throttler starvation scenario #15397

Closed
shlomi-noach opened this issue Mar 3, 2024 · 1 comment
@shlomi-noach Contributor

Overview of the Issue

We've identified a throttler starvation scenario, where the throttler needlessly rejects a waiting app. This happens in v17 and v18. It does not happen in v19, but only due to an unintentional change, as explained below.

The starvation results from the combination of multiple load-shedding optimizations. Specifically, the scenario can happen when:

  • On-demand heartbeat is set.
  • An app asks for throttling on a REPLICA tablet; e.g. this could be the rowstreamer.
  • For whatever reason, the app doesn't make requests for 1 minute.

Before explaining how the starvation can happen, here's a recap of the throttler's behavior (a rough sketch of this logic follows the list):

  • The PRIMARY polls the replica tablet throttlers and checks whether they've been recently checked. Normally it polls replicas multiple times per second.
  • A replica responds with "I've been recently checked" if some client checked its throttler state within the last 1-2 seconds.
  • If so, the PRIMARY asks for a heartbeat lease.
  • If, for over 1 minute, no "checks" are made on the PRIMARY, and likewise the PRIMARY doesn't see any recent checks on the replicas, the PRIMARY throttler goes into dormant mode. In this mode it saves resources and only polls the replica tablets once per minute.
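
As a rough sketch of the above, with hypothetical names and values that may differ from the actual Vitess implementation:

```go
package throttler

import "time"

// Hypothetical constants approximating the recap above; the actual
// Vitess names and values may differ.
const (
	recentCheckWindow   = 2 * time.Second        // replica counts as "recently checked" within this window
	dormancyThreshold   = time.Minute            // no checks seen for this long => PRIMARY goes dormant
	activePollInterval  = 250 * time.Millisecond // normal polling: multiple times per second
	dormantPollInterval = time.Minute            // dormant polling: once per minute
)

// recentlyChecked is the replica's answer when the PRIMARY polls it.
func recentlyChecked(lastCheck, now time.Time) bool {
	return now.Sub(lastCheck) <= recentCheckWindow
}

// pollInterval decides how often the PRIMARY polls the replicas: if no
// check has been seen anywhere for over a minute, the PRIMARY goes
// dormant, polls once per minute, and stops renewing heartbeat leases.
func pollInterval(lastSeenCheck, now time.Time) time.Duration {
	if now.Sub(lastSeenCheck) > dormancyThreshold {
		return dormantPollInterval
	}
	return activePollInterval
}
```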

The starvation scenario occurs when the PRIMARY is in dormant mode and only polls the replica once per minute. When it polls the replica, it asks whether it was recently checked. However, the client which checks the replica, for its own reasons, does not keep retrying every second; it may choose to retry only once per minute. Chances are, then, that when the PRIMARY polls the replica, the replica will have last been checked some 20s-30s earlier, and the replica responds with "I have not been recently checked". This keeps the PRIMARY in dormant mode, and heartbeats are not renewed.

At some point the PRIMARY's poll collides with the client's check, and the issue resolves itself. However, if the client happens to retry on a 1-minute interval, the two can keep missing each other for hours, until some drift converges them.
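
To make the perpetual miss concrete, here is a small illustrative simulation (not Vitess code), assuming a 2-second "recently checked" window, a dormant PRIMARY polling every 60s, and a client checking every 60s but offset by 30s:

```go
package main

import (
	"fmt"
	"time"
)

// Illustrative simulation of the miss described above (not Vitess code).
// A dormant PRIMARY polls every 60s starting at t=0; the client checks the
// replica every 60s starting at t=30s; "recently checked" means within 2s.
func main() {
	const (
		window      = 2 * time.Second
		pollEvery   = 60 * time.Second
		checkEvery  = 60 * time.Second
		checkOffset = 30 * time.Second
	)
	for i := 0; i < 5; i++ {
		pollAt := time.Duration(i) * pollEvery
		// Find the most recent client check at or before this poll.
		lastCheck := time.Duration(-1)
		for t := checkOffset; t <= pollAt; t += checkEvery {
			lastCheck = t
		}
		if lastCheck < 0 {
			fmt.Printf("poll at %6v: no check yet => stay dormant\n", pollAt)
			continue
		}
		recent := pollAt-lastCheck <= window
		fmt.Printf("poll at %6v: last check %v ago => recently checked: %v\n",
			pollAt, pollAt-lastCheck, recent)
	}
	// Every poll finds the last check 30s old, never within the 2s window,
	// so the PRIMARY stays dormant and heartbeats are never renewed.
}
```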

In v19 this does not happen, due to an unintentional change: the PRIMARY's self-checks unintentionally update the PRIMARY's own last-check timestamp, which means the PRIMARY never goes dormant.

Proposed solutions:

  • Instead of relying on an accidental hit or miss, the replica can proactively let the PRIMARY know it's been checked (see the sketch after this list). This does not have to happen every time the replica gets checked: once per dormant period (1 minute) is sufficient.
  • Another approach would be to force the time intervals to overlap. For example, if "recently checked" applied to any check taking place in the past 1 minute (as opposed to 1-2 seconds), then even a dormant PRIMARY polling the replica once per minute would hit an "I've been recently checked" response.
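
A minimal sketch of the first proposal, with hypothetical names (notifyPrimary stands in for some RPC to the PRIMARY's throttler; it is not an existing Vitess call):

```go
package throttler

import (
	"sync"
	"time"
)

// replicaNotifier sketches the first proposal: when a replica's throttler
// is checked, it proactively notifies the PRIMARY, but at most once per
// dormant period, so the PRIMARY learns about checks even while its own
// polling is dormant.
type replicaNotifier struct {
	mu            sync.Mutex
	lastNotified  time.Time
	notifyPrimary func() // hypothetical: e.g. a gRPC call to the PRIMARY's throttler
}

// onCheck is invoked whenever a client checks this replica's throttler.
func (r *replicaNotifier) onCheck(now time.Time) {
	r.mu.Lock()
	defer r.mu.Unlock()
	// Rate-limit notifications to once per dormant period (1 minute).
	if now.Sub(r.lastNotified) >= time.Minute {
		r.lastNotified = now
		go r.notifyPrimary() // fire-and-forget; don't block the check path
	}
}
```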

Reproduction Steps

Binary Version

- `17`
- `18`
- Does not happen on `19`, due to the unintentional behavior described above.
- `20`, once the unintentional behavior is removed.

Operating System and Environment details

-

Log Fragments

No response
