We've identified a throttler starvation scenario, where the throttler needlessly rejects a waiting app. This happens in v17 and v18. It does not happen in v19 only due to some unintentional change, as explained below.
The starvation results from a combination of several load-shedding optimizations. Specifically, the scenario can happen when:
- On-demand heartbeat is set.
- An app asks for throttling on a REPLICA tablet, e.g. this could be rowstreamer.
- For whatever reason, the app doesn't make requests for 1 minute.
Before explaining how the starvation can happen, here's a recap of the throttler's behavior:
- The PRIMARY polls the replica tablets' throttlers and checks whether they've been recently checked. Normally it polls replicas multiple times per second.
- A replica responds with "I've been recently checked" if some client checked for throttler state within the last 1-2 seconds.
- If so, the PRIMARY asks for a heartbeat lease.
- If, for over 1min, no "checks" are made on the PRIMARY and likewise the PRIMARY doesn't see any recent checks on the replicas, the PRIMARY throttler goes into dormant mode. In this mode it saves resources, and only polls the replica tablets once per minute.
The starvation scenario occurs when the PRIMARY is in dormant mode and only polls the replica once per minute. When it polls the replica, it asks whether the replica was recently checked. However, the client which checks the replica, for its own reasons, does not keep retrying every second. It may choose to only retry in 1 minute. And so the chances are that, when the PRIMARY polls the replica, the replica will have been checked some 20s-30s ago, and the replica responds with "I have not been recently checked". This keeps the PRIMARY in dormant mode, and it does not renew heartbeats.
At some point the PRIMARY polling collides with the client's check, and the issue is resolved. However, if the client happens to have a 1min retry, the two can keep missing each other for hours, until some shift converges them.
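The missed overlap can be illustrated with a small, self-contained simulation. This is only a sketch and not the throttler's code: the function and constant names are hypothetical, and the 35s client offset and 3-hour duration are arbitrary assumptions chosen to mirror the "checked some 20s-30s ago" situation described above.

```go
package main

import (
	"fmt"
	"time"
)

// Illustrative values only; none of these names come from the throttler itself.
const (
	dormantPollInterval = time.Minute      // dormant PRIMARY polls the replica once per minute
	clientRetryInterval = time.Minute      // client re-checks the replica once per minute
	clientOffset        = 35 * time.Second // assumed phase of the client's checks
	narrowWindow        = 2 * time.Second  // "recently checked" means within the last 1-2 seconds
	wideWindow          = time.Minute      // widened "recently checked" window
	simulated           = 3 * time.Hour    // simulated wall-clock duration
)

// countHits simulates PRIMARY polls at pollInterval, 2*pollInterval, ... and
// client checks at firstCheck, firstCheck+checkInterval, ... It returns how
// many polls found a client check within `window` before the poll.
func countHits(pollInterval, checkInterval, firstCheck, window, total time.Duration) (hits, polls int) {
	for poll := pollInterval; poll <= total; poll += pollInterval {
		polls++
		if poll < firstCheck {
			continue // no client check has happened yet
		}
		// Time of the most recent client check at or before this poll.
		n := (poll - firstCheck) / checkInterval
		lastCheck := firstCheck + n*checkInterval
		if poll-lastCheck <= window {
			hits++
		}
	}
	return hits, polls
}

func main() {
	for _, window := range []time.Duration{narrowWindow, wideWindow} {
		hits, polls := countHits(dormantPollInterval, clientRetryInterval, clientOffset, window, simulated)
		fmt.Printf("window=%v: %d of %d polls saw a recent check\n", window, hits, polls)
	}
}
```

With these assumed values, every poll lands 25s after the latest client check, so none of the 180 polls sees a recent check under the 2s window, while all 180 do under a 1min window.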
In v19 this does not happen due to an unintentional change. In v19 the PRIMARY's self-checks unintentionally flag the PRIMARY's own last check time, which means the PRIMARY does not go dormant.
Proposed solutions:
- Instead of relying on accidental hit or miss, the replica can proactively let the primary know it's been checked. This does not have to happen every time the replica gets checked: once every dormant period (1min) is sufficient.
- Another approach would be to force time intervals to overlap. For example, if "recently checked" applied to any check taking place in the past 1min (as opposed to 1s-2s), then even a dormant PRIMARY polling the replica every 1min will hit an "I've been recently checked" response.
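The first proposal amounts to rate-limiting a replica-to-primary notification to once per dormant period. A minimal sketch of that idea follows; `checkNotifier`, `OnCheck`, and the `notify` callback are hypothetical names for illustration, not the throttler's actual API, and how the replica would actually reach the primary is left out.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// checkNotifier is a hypothetical helper: it forwards "I've been checked"
// to the primary at most once per period.
type checkNotifier struct {
	mu           sync.Mutex
	lastNotified time.Time
	period       time.Duration
	notify       func() // assumed callback that informs the primary
}

// OnCheck is called whenever a client checks the replica's throttler.
// It reports whether the primary was notified for this check.
func (n *checkNotifier) OnCheck(now time.Time) bool {
	n.mu.Lock()
	due := n.lastNotified.IsZero() || now.Sub(n.lastNotified) >= n.period
	if due {
		n.lastNotified = now
	}
	n.mu.Unlock()

	if due {
		n.notify() // called outside the lock so a slow callback does not block checks
	}
	return due
}

// runDemo feeds client checks at fixed offsets and returns how many of them
// resulted in a notification to the primary.
func runDemo() int {
	notified := 0
	n := &checkNotifier{
		period: time.Minute, // the dormant period
		notify: func() { notified++ },
	}
	base := time.Date(2024, 1, 1, 0, 0, 0, 0, time.UTC)
	for _, offset := range []time.Duration{0, 10 * time.Second, 30 * time.Second, 61 * time.Second, 70 * time.Second} {
		n.OnCheck(base.Add(offset))
	}
	return notified
}

func main() {
	fmt.Println("notifications sent:", runDemo())
}
```

In this sketch only the checks at 0s and 61s trigger a notification; the others fall inside the 1min period and are suppressed, which keeps the extra traffic bounded regardless of how often clients check.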
Reproduction Steps
Binary Version
- `17`
- `18`
- Does not happen on `19` due to unintentional behavior.
- Expected on `20`, once the unintentional behavior is removed.
Operating System and Environment details
-
Log Fragments
No response