[server] Increased Wait After Unsubscribe During State Transitions #1213

Open
wants to merge 17 commits into base: main

KaiSernLim
Contributor

@KaiSernLim KaiSernLim commented Oct 3, 2024

Summary

Problem

StoreIngestionTask does not always wait for all in-flight messages to be processed before it transitions the leader-follower state in the PartitionConsumptionState.

waitAfterUnsubscribe() waits up to 10 seconds for the consumer's next poll(), which would indicate that the in-flight messages from the previous poll() have been processed. If the wait times out first, the state transition proceeds while messages may still be in flight, which can lead to state mismatches during leader-to-follower and follower-to-leader transitions. The 10s timeout has been hit 150K times in the past month.
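A minimal sketch of the "wait for the next poll(), up to a timeout" pattern described above. This is not the actual Venice implementation: the class `PollWaitSketch`, the `onPoll()` hook, and the boolean return value are hypothetical names chosen for illustration.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical stand-in for a shared consumer; not Venice's actual class. */
final class PollWaitSketch {
  private final ReentrantLock lock = new ReentrantLock();
  private final Condition polled = lock.newCondition();
  private long pollCount = 0;

  /** Assumed hook: the consumer thread calls this at the start of every poll(). */
  void onPoll() {
    lock.lock();
    try {
      pollCount++;
      polled.signalAll();
    } finally {
      lock.unlock();
    }
  }

  /**
   * Blocks until a new poll is observed or the timeout elapses.
   * Returns true if the next poll was observed, false on timeout or interrupt.
   */
  boolean waitAfterUnsubscribe(long timeoutMs) {
    lock.lock();
    try {
      long observed = pollCount;
      long remainingNanos = TimeUnit.MILLISECONDS.toNanos(timeoutMs);
      // Loop guards against spurious wakeups; awaitNanos returns the time left.
      while (pollCount == observed) {
        if (remainingNanos <= 0) {
          return false; // timed out: in-flight messages may still be pending
        }
        remainingNanos = polled.awaitNanos(remainingNanos);
      }
      return true;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt(); // preserve the interrupt for callers
      return false;
    } finally {
      lock.unlock();
    }
  }
}
```

A `false` return corresponds to the timeout case counted in the paragraph above, where the caller cannot assume the previous batch has been fully processed.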

Mitigation

Several possible solutions were discussed, but all of them could be complicated to implement. As an immediate mitigation, we can increase the timeout so that the consumer more often finishes unsubscribing safely instead of timing out. Based on the consumer_records_producing_to_write_buffer_latency metric, a timeout of 1800s should cover almost 100% of cases.

Changes

  • Added the server config server.wait.after.unsubscribe.timeout.ms, which makes the timeout in waitAfterUnsubscribe() configurable, and increased the timeout:
    • Increased the default timeout from 10s to 1800s (30 min), based on the consumer_records_producing_to_write_buffer_latency metric.
    • During shutdown / termination, when KafkaConsumerService#unsubscribeAll() is called, the timeout remains 10s so that shutdown is not blocked. If the configured value is lower than 10s, the configured value is used instead.
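The timeout selection described in the list above can be sketched roughly as follows. Only the config key and the 10s / 1800s values come from this description; the class `UnsubscribeTimeoutConfig` and its method names are hypothetical, and reading the key from `java.util.Properties` is an assumption made to keep the example self-contained.

```java
import java.util.Properties;

/** Hypothetical helper; not the actual VeniceServerConfig code. */
final class UnsubscribeTimeoutConfig {
  static final String KEY = "server.wait.after.unsubscribe.timeout.ms";
  static final long DEFAULT_TIMEOUT_MS = 1_800_000L; // 1800s / 30 min
  static final long SHUTDOWN_CAP_MS = 10_000L;       // 10s, so shutdown is not blocked

  /** Reads the configured timeout, falling back to the new default. */
  static long configuredTimeoutMs(Properties props) {
    return Long.parseLong(props.getProperty(KEY, Long.toString(DEFAULT_TIMEOUT_MS)));
  }

  /** On shutdown, use the smaller of the configured value and the 10s cap. */
  static long effectiveTimeoutMs(long configuredMs, boolean shuttingDown) {
    return shuttingDown ? Math.min(configuredMs, SHUTDOWN_CAP_MS) : configuredMs;
  }
}
```

With the default config, a state transition would wait up to 1,800,000 ms, while a shutdown-time unsubscribe would wait up to 10,000 ms; a configured value of 5,000 ms would apply in both cases.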

How was this PR tested?

GHCI

Does this PR introduce any user-facing changes?

  • No. You can skip the rest of this section.

@KaiSernLim KaiSernLim changed the title from "Increased Unsubscribe Wait" to "WIP: Increased Unsubscribe Wait" Oct 3, 2024
@KaiSernLim KaiSernLim self-assigned this Oct 4, 2024
@KaiSernLim KaiSernLim changed the title from "WIP: Increased Unsubscribe Wait" to "[server] Increased Unsubscribe Wait" Oct 7, 2024
@KaiSernLim KaiSernLim marked this pull request as ready for review October 7, 2024 20:40
@KaiSernLim
Contributor Author

Closing to consider approach in #1251 instead.

@KaiSernLim KaiSernLim closed this Oct 17, 2024
…ugh `VeniceServerConfig` and directly to the `SharedKafkaConsumer` from `SharedConsumerAssignmentStrategy`. 📸

Added metric `wait_after_unsubscribe_latency` to see how long the wait after unsubscribe is. ⏳
…rite_buffer_latency` metric should be sufficient. 📒
…cs data from `consumer_records_producing_to_write_buffer_latency`. 📬
Contributor

@gaojieliu gaojieliu left a comment

From my understanding, we don't need to change the constructor of KafkaConsumerService or SharedKafkaConsumer; we can just add one param, waitingTimeMS, to the unsubscribe function call. When the SIT calls unsubscribe, it knows when to use a longer timeout and when to use a shorter one.
By default, it should use a short timeout; only for a leader <-> follower transition does the SIT need to pass a long timeout to the unsubscribe call.

Let me know if I have any misunderstanding.
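A rough sketch of the per-call timeout idea in this comment, assuming the 10s and 1800s values from the PR description. The interface `UnsubscribableConsumer`, the enum `UnsubscribeReason`, and the helper methods are hypothetical and do not reflect the real SharedKafkaConsumer or StoreIngestionTask signatures.

```java
/** Hypothetical illustration of passing the wait time per unsubscribe call. */
final class PerCallTimeoutSketch {
  static final long SHORT_TIMEOUT_MS = 10_000L;   // assumed default: 10s
  static final long LONG_TIMEOUT_MS = 1_800_000L; // assumed long wait: 1800s

  /** Why the caller is unsubscribing; names are illustrative only. */
  enum UnsubscribeReason { LEADER_TO_FOLLOWER, FOLLOWER_TO_LEADER, OTHER }

  /** Stand-in for a consumer whose unsubscribe takes a per-call wait time. */
  interface UnsubscribableConsumer {
    void unsubscribe(String topicPartition, long waitingTimeMs);
  }

  /** Long timeout only for leader <-> follower transitions, short otherwise. */
  static long timeoutFor(UnsubscribeReason reason) {
    switch (reason) {
      case LEADER_TO_FOLLOWER:
      case FOLLOWER_TO_LEADER:
        return LONG_TIMEOUT_MS;
      default:
        return SHORT_TIMEOUT_MS;
    }
  }

  /** The caller decides the timeout; the consumer's constructor is unchanged. */
  static void unsubscribe(
      UnsubscribableConsumer consumer, String topicPartition, UnsubscribeReason reason) {
    consumer.unsubscribe(topicPartition, timeoutFor(reason));
  }
}
```

In this shape the timeout decision lives with the caller, so no constructor of the consumer classes has to carry the configuration.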

@KaiSernLim KaiSernLim changed the title from "[server] Increased Unsubscribe Wait" to "[server] Increased Wait After Unsubscribe During State Transitions" Nov 8, 2024
3 participants