
[server] ReadWriteLock on LeaderFollowerState #1251

Closed

Conversation

@KaiSernLim (Contributor) commented Oct 17, 2024


Problem

The problem is that the StoreIngestionTask does not always wait for all inflight messages to be processed before transitioning the PartitionConsumptionState.

As part of the unsubscribing process, waitAfterUnsubscribe() waits up to 10 seconds for the consumer’s next call to poll(). Each poll() fetches a batch of messages from Kafka for processing, and its return indicates that the previous set of inflight messages (fetched under the old partition state) has finished processing. Only at that point can the state transition safely proceed.

If a state transition happens while those inflight messages are still being processed, it can cause incompatibilities that throw an exception, crash the SIT, halt ingestion progress, and leave the partition without a LEADER until the host is restarted.

Here are two examples of such issues; there may be other, as yet unknown, ones:

Leader -> Follower Transition

  1. The Venice Server is the partition leader, so there are inflight messages from the real-time (RT) topic
  2. Helix transitions the partition state to FOLLOWER
  3. SIT fails a sanity check upon encountering an UPDATE message (intended only for leaders) while in the FOLLOWER state (link)

Follower -> Leader Transition

  1. The Venice Server is a partition follower, so the inflight messages are from the local version topic (VT)
  2. Helix transitions the partition state to LEADER
  3. SIT fails a sanity check upon encountering a local VT message before producing back to local VT while in the LEADER state (link)

Alternatively, if the timeout is reached but the unsubscribe was not induced by a state transition, no concrete issue comes to mind, though unknown issues remain possible. Thus, this is primarily a problem during state transitions.

Solution

To solve this problem, we introduce a ReadWriteLock to the LeaderFollowerState. This is the perfect use-case for such a mechanism because the state is read extremely often but updated very infrequently.

Additionally, the consumer must hold the read lock for the entire duration of processing a message, in order to protect against another thread modifying the state mid-processing.
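For illustration, a minimal sketch of such a guarded state is below. The class and enum are simplified stand-ins; only getLeaderFollowerStateLock() and the STANDBY default appear in this PR's diff, and everything else is assumed.

import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Simplified stand-in for the real inner class in PartitionConsumptionState.
class GuardedLeaderFollowerState {
  // Stand-in for the real LeaderFollowerStateType enum.
  enum LeaderFollowerStateType {
    STANDBY, LEADER
  }

  private LeaderFollowerStateType state = LeaderFollowerStateType.STANDBY;
  private final ReadWriteLock rwLock = new ReentrantReadWriteLock();

  // Hot path: the state is read on every record, so readers share the read lock.
  LeaderFollowerStateType getState() {
    rwLock.readLock().lock();
    try {
      return state;
    } finally {
      rwLock.readLock().unlock();
    }
  }

  // Rare path: a state transition takes the write lock, so it waits until every
  // in-flight reader has released the read lock.
  void setState(LeaderFollowerStateType newState) {
    rwLock.writeLock().lock();
    try {
      this.state = newState;
    } finally {
      rwLock.writeLock().unlock();
    }
  }

  // Exposed so the consumer can hold the read lock across the processing of a whole message.
  ReadWriteLock getLeaderFollowerStateLock() {
    return rwLock;
  }
}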

Changes

  • Added a ReadWriteLock to the LeaderFollowerState in PartitionConsumptionState, which guards the usage of the state
  • The consumer maintains the read lock throughout the duration of processing a message in produceToStoreBufferServiceOrKafka()
  • Added a unit test testShouldProcessRecord() which simulates the following scenario:
    • The consumer thread processes a batch of polled messages while another thread modifies the leader-follower state in the PCS.
    • This specifically tests a follower to leader transition and verifies that the leader-follower state in the PCS can't be modified while the consumer thread is processing messages.

Correctness

  • The read lock must be acquired for the duration of the message's processing (on the consumer's code path in produceToStoreBufferServiceOrKafka()) in order for the condition of shouldProcessMessage() to hold (see the demonstration after this list).
    • This means that another thread trying to modify LeaderFollowerState as part of a state transition would need to wait for this consumer thread to finish processing the message and release the read lock.
    • This also applies to the batch version of that method: produceToStoreBufferServiceOrKafkaInBatch().
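To make the interaction concrete, here is a small, self-contained demonstration of the property described above; the thread names and timings are illustrative only and are not taken from the PR.

import java.util.concurrent.locks.ReentrantReadWriteLock;

public class ReadWriteLockDemo {
  public static void main(String[] args) throws InterruptedException {
    ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

    // "Consumer" thread: holds the read lock while it processes a polled batch,
    // as produceToStoreBufferServiceOrKafka() would.
    Thread consumer = new Thread(() -> {
      lock.readLock().lock();
      try {
        Thread.sleep(500); // simulate processing under the current leader-follower state
      } catch (InterruptedException ignored) {
        Thread.currentThread().interrupt();
      } finally {
        lock.readLock().unlock();
      }
    });

    // "State transition" thread: blocks on the write lock until the consumer is done.
    Thread stateTransition = new Thread(() -> {
      lock.writeLock().lock();
      try {
        System.out.println("State changed only after message processing finished");
      } finally {
        lock.writeLock().unlock();
      }
    });

    consumer.start();
    Thread.sleep(50); // let the consumer acquire the read lock first
    stateTransition.start();
    consumer.join();
    stateTransition.join();
  }
}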

Performance Impact

  • Performance is bottlenecked by produceToStoreBufferServiceOrKafka() and its batch version, because that is the only location where the lock is held for a long period of time. If a writer is waiting on the write lock, it must wait for all readers to release their locks. All future readers will also need to wait until the writer is done because the writer has priority, so they are also bottlenecked by produceToStoreBufferServiceOrKafka().
  • Readers should not affect the performance of other readers. Therefore, there should be no performance impact until a state transition occurs and the state needs to be overwritten.
  • Since the writer simply needs to set an enum value, the critical section should be nearly instantaneous once the write lock is acquired, and any slowdown from acquiring it should be minimal.

How was this PR tested?

CI

Does this PR introduce any user-facing changes?

  • No. You can skip the rest of this section.

lluwm previously approved these changes Oct 31, 2024

Thanks for fixing this big issue. It will make the SIT thread more stable. The change looks good and clean to me. Please address the other comments before checking in.

@haoxu07 (Contributor) left a comment:

Thanks for preparing this! Left a few comments, generally this looks good!


LeaderFollowerState() {
this.state = LeaderFollowerStateType.STANDBY;
this.rwLock = new ReentrantReadWriteLock();
Contributor:
Maybe here we should use the Venice flavor of ReentrantReadWriteLock, VeniceReentrantReadWriteLock, for debugging purposes.

Besides, this lock is unfair by default, so it does not prefer writers. For our use case, we may want to prefer writes a bit, since under heavy ingestion we do not want to block a state transition for a long time.
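For reference (an illustrative snippet, not part of this PR), the JDK lock's fairness is chosen at construction time; a fair lock keeps a queued writer from being starved by a steady stream of readers, at some throughput cost.

import java.util.concurrent.locks.ReentrantReadWriteLock;

class LockFairnessExample {
  // Default (non-fair): readers may keep acquiring the lock while a writer is queued,
  // which can delay a state transition under heavy ingestion.
  static final ReentrantReadWriteLock NON_FAIR = new ReentrantReadWriteLock();

  // Fair: threads acquire the lock roughly in arrival order, so the queued writer
  // (the state transition) gets its turn promptly, at the cost of some throughput.
  static final ReentrantReadWriteLock FAIR = new ReentrantReadWriteLock(true);
}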

Contributor Author:
Omg, that class is not used anywhere else in the codebase...

I moved the VeniceReentrantReadWriteLock from venice-controller to venice-client-common and used it in this file.

Contributor:
Regarding making the lock fair or unfair (the default), what's your preference?

* acquire the write lock to modify the leader-follower state, since it would need to wait for this to finish.
*/
final ReadWriteLock leaderFollowerStateLock = partitionConsumptionState.getLeaderFollowerStateLock();
try (AutoCloseableLock ignore = AutoCloseableSingleLock.of(leaderFollowerStateLock.readLock())) {
Contributor:

Here you acquire the lock for the entire loop, but inside produceToStoreBufferServiceOrKafka() the lock is acquired for each message's processing; maybe there is a reason for this?

Contributor Author (Nov 2, 2024):

Because the condition of shouldProcessMessage() needs to hold for the entire duration for which the message is processed by the consumer.

In the batch version (produceToStoreBufferServiceOrKafkaInBatch()), the condition is checked to determine the batches of selected messages, and then processing begins with ingestionBatchProcessor.process(). Thus, it is easiest and safest to protect the entire section with the read lock. Perhaps this can be improved in the future if needed?

* Wait for the main test thread to go past shouldProcessRecord() and reach waitUntilValueSchemaAvailable()
* inside waitReadyToProcessRecord(), then free the main test thread by making the value schema available
*/
Utils.sleep(1000L);
Contributor:

The time you set here is not quite deterministic in a test; I'm not sure if there is a way to make this wait deterministic.

Contributor Author:

Yeah, I would've liked to make the wait way shorter than 1 second, but I can't think of a way to make it deterministic either.

Contributor:

Maybe set a flag inside SIT to indicate that waitUntilValueSchemaAvailable() has been reached, and then wait until that flag is turned on?
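One possible shape for that, sketched with a hypothetical test-only latch (none of these names exist in SIT today):

import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

class DeterministicWaitSketch {
  // Hypothetical latch counted down by a hook or spy at the moment the SIT thread
  // reaches waitUntilValueSchemaAvailable().
  static final CountDownLatch REACHED_SCHEMA_WAIT = new CountDownLatch(1);

  // In the test, instead of Utils.sleep(1000L): block until the event actually happens,
  // with a generous bound only as a safety net.
  static boolean waitForSitToReachSchemaWait() throws InterruptedException {
    return REACHED_SCHEMA_WAIT.await(10, TimeUnit.SECONDS);
  }
}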

@gaojieliu (Contributor) commented:
The code would work from a correctness POV, but I am not convinced that we need this lock in this code path.
The infinite or long wait we discussed can achieve a similar goal with minimal changes, and @lluwm pointed out that we only need the long wait for leader<->follower transitions; for the others, such as a graceful shutdown, maybe we can just terminate the wait immediately.
Here are my reasons not to introduce this lock:

  1. An infinite wait would avoid this race condition completely, in theory.
  2. The performance is roughly the same with or without this lock.
  3. This new lock introduces more complexity to the ingestion path, which is already complex.
  4. We would introduce two mechanisms in this code path (timed wait + lock), which is unnecessary when one solution can solve it.

Please let me know the motivation for this change.

@KaiSernLim (Contributor Author) commented:
Closing and going with the increased-timeout approach in #1213.

@KaiSernLim closed this Nov 6, 2024