
[server] ReadWriteLock on LeaderFollowerState #1251

Closed

Conversation

@KaiSernLim (Contributor) commented Oct 17, 2024


Problem

The problem is that the StoreIngestionTask does not always wait for all inflight messages to be processed before transitioning the PartitionConsumptionState.

As part of the unsubscribing process, waitAfterUnsubscribe() waits up to 10 seconds for the consumer’s next call to poll(). Each poll() fetches a batch of messages from Kafka for processing, and its return indicates that the previous set of inflight messages (fetched under the old partition state) has finished processing. Only at that point can the state transition safely proceed.

If a state transition happens while those inflight messages are still being processed, it can cause incompatibilities that throw an exception, crash the SIT, halt ingestion progress, and leave the partition without a LEADER until the host is restarted.

Here are two examples of such issues; there may be other, as yet unknown, ones:

Leader -> Follower Transition

  1. The Venice Server is the partition leader, so there are inflight messages from the real-time (RT) topic
  2. Helix transitions the partition state to FOLLOWER
  3. SIT fails a sanity check upon encountering an UPDATE message (intended only for leaders) while in the FOLLOWER state (link)

Follower -> Leader Transition

  1. The Venice Server is a partition follower, so the inflight messages are from the local version topic (VT)
  2. Helix transitions the partition state to LEADER
  3. SIT fails a sanity check upon encountering a local VT message before producing back to local VT while in the LEADER state (link)

Alternatively, if the timeout is reached but the unsubscribe was not induced by a state transition, no concrete issue comes to mind, though unknown issues remain possible. Thus, this is primarily a problem during state transitions.

Solution

To solve this problem, we introduce a ReadWriteLock to the LeaderFollowerState. This is the perfect use-case for such a mechanism because the state is read extremely often but updated very infrequently.

Additionally, the consumer must hold the read lock for the entire duration of processing a message, in order to protect against another thread modifying the state mid-processing.
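For illustration, a minimal sketch of such a guarded state is below. The class and enum are simplified stand-ins; only getLeaderFollowerStateLock() and the STANDBY default appear in this PR's diff, and everything else is assumed.

import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Simplified stand-in for the real inner class in PartitionConsumptionState.
class GuardedLeaderFollowerState {
  // Stand-in for the real LeaderFollowerStateType enum.
  enum LeaderFollowerStateType {
    STANDBY, LEADER
  }

  private LeaderFollowerStateType state = LeaderFollowerStateType.STANDBY;
  private final ReadWriteLock rwLock = new ReentrantReadWriteLock();

  // Hot path: the state is read on every record, so readers share the read lock.
  LeaderFollowerStateType getState() {
    rwLock.readLock().lock();
    try {
      return state;
    } finally {
      rwLock.readLock().unlock();
    }
  }

  // Rare path: a state transition takes the write lock, so it waits until every
  // in-flight reader has released the read lock.
  void setState(LeaderFollowerStateType newState) {
    rwLock.writeLock().lock();
    try {
      this.state = newState;
    } finally {
      rwLock.writeLock().unlock();
    }
  }

  // Exposed so the consumer can hold the read lock across the processing of a whole message.
  ReadWriteLock getLeaderFollowerStateLock() {
    return rwLock;
  }
}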

Changes

  • Added a ReadWriteLock to the LeaderFollowerState in PartitionConsumptionState, which guards the usage of the state
  • The consumer maintains the read lock throughout the duration of processing a message in produceToStoreBufferServiceOrKafka()
  • Added a unit test testShouldProcessRecord() which simulates the following scenario:
    • The consumer thread processes a batch of polled messages while another thread modifies the leader-follower state in the PCS.
    • This specifically tests a follower to leader transition and verifies that the leader-follower state in the PCS can't be modified while the consumer thread is processing messages.

Correctness

  • The read lock must be acquired for the duration of the message's processing (on the consumer's code path in produceToStoreBufferServiceOrKafka()) in order for the condition of shouldProcessMessage() to hold (see the demonstration after this list).
    • This means that another thread trying to modify LeaderFollowerState as part of a state transition would need to wait for this consumer thread to finish processing the message and release the read lock.
    • This also applies to the batch version of that method: produceToStoreBufferServiceOrKafkaInBatch().
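To make the interaction concrete, here is a small, self-contained demonstration of the property described above; the thread names and timings are illustrative only and are not taken from the PR.

import java.util.concurrent.locks.ReentrantReadWriteLock;

public class ReadWriteLockDemo {
  public static void main(String[] args) throws InterruptedException {
    ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

    // "Consumer" thread: holds the read lock while it processes a polled batch,
    // as produceToStoreBufferServiceOrKafka() would.
    Thread consumer = new Thread(() -> {
      lock.readLock().lock();
      try {
        Thread.sleep(500); // simulate processing under the current leader-follower state
      } catch (InterruptedException ignored) {
        Thread.currentThread().interrupt();
      } finally {
        lock.readLock().unlock();
      }
    });

    // "State transition" thread: blocks on the write lock until the consumer is done.
    Thread stateTransition = new Thread(() -> {
      lock.writeLock().lock();
      try {
        System.out.println("State changed only after message processing finished");
      } finally {
        lock.writeLock().unlock();
      }
    });

    consumer.start();
    Thread.sleep(50); // let the consumer acquire the read lock first
    stateTransition.start();
    consumer.join();
    stateTransition.join();
  }
}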

Performance Impact

  • Performance is bottlenecked by produceToStoreBufferServiceOrKafka() and its batch version, because that is the only location where the lock is held for a long period of time. If a writer is waiting on the write lock, it must wait for all readers to release their locks. All future readers will also need to wait until the writer is done because the writer has priority, so they are also bottlenecked by produceToStoreBufferServiceOrKafka().
  • Readers should not affect the performance of other readers. Therefore, there should be no performance impact until a state transition occurs and the state needs to be overwritten.
  • Since the writer simply needs to set an enum value, the critical section should be nearly instantaneous once the write lock is acquired, and any slowdown from acquiring it should be minimal.

How was this PR tested?

CI

Does this PR introduce any user-facing changes?

  • No. You can skip the rest of this section.

lluwm previously approved these changes Oct 31, 2024

Thanks for fixing this big issue. It will make the SIT thread more stable. The change looks good and clean to me. Please address the other comments before checking in.

@haoxu07 (Contributor) left a comment:

Thanks for preparing this! Left a few comments, generally this looks good!


LeaderFollowerState() {
this.state = LeaderFollowerStateType.STANDBY;
this.rwLock = new ReentrantReadWriteLock();
Contributor:
Maybe here we should use the Venice flavor of ReentrantReadWriteLock, VeniceReentrantReadWriteLock, for debugging purposes.

Besides, this lock is unfair by default, so it does not prefer writers. For our use case, we may want to prefer writes a bit, since under heavy ingestion we do not want to block a state transition for a long time.
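For reference (an illustrative snippet, not part of this PR), the JDK lock's fairness is chosen at construction time; a fair lock keeps a queued writer from being starved by a steady stream of readers, at some throughput cost.

import java.util.concurrent.locks.ReentrantReadWriteLock;

class LockFairnessExample {
  // Default (non-fair): readers may keep acquiring the lock while a writer is queued,
  // which can delay a state transition under heavy ingestion.
  static final ReentrantReadWriteLock NON_FAIR = new ReentrantReadWriteLock();

  // Fair: threads acquire the lock roughly in arrival order, so the queued writer
  // (the state transition) gets its turn promptly, at the cost of some throughput.
  static final ReentrantReadWriteLock FAIR = new ReentrantReadWriteLock(true);
}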

Contributor Author:
Omg, that class is not used anywhere else in the codebase...

I moved the VeniceReentrantReadWriteLock from venice-controller to venice-client-common and used it in this file.

Contributor:
Regarding making the lock fair or unfair (the default), what's your preference?

* acquire the write lock to modify the leader-follower state, since it would need to wait for this to finish.
*/
final ReadWriteLock leaderFollowerStateLock = partitionConsumptionState.getLeaderFollowerStateLock();
try (AutoCloseableLock ignore = AutoCloseableSingleLock.of(leaderFollowerStateLock.readLock())) {
Contributor:

Here you acquire the lock for the entire loop, but inside produceToStoreBufferServiceOrKafka() the lock is acquired for each message's processing; maybe there is a reason for this?

Contributor Author (Nov 2, 2024):

Because the condition of shouldProcessMessage() needs to hold for the entire duration for which the message is processed by the consumer.

In the batch version (produceToStoreBufferServiceOrKafkaInBatch()), the condition is checked to determine the batches of selected messages, and then processing begins with ingestionBatchProcessor.process(). Thus, it is easiest and safest to protect the entire section with the read lock. Perhaps this can be improved in the future if needed?

* Wait for the main test thread to go past shouldProcessRecord() and reach waitUntilValueSchemaAvailable()
* inside waitReadyToProcessRecord(), then free the main test thread by making the value schema available
*/
Utils.sleep(1000L);
Contributor:

The time you set here is not quite deterministic in a test; I'm not sure if there is a way to make this wait deterministic.

Contributor Author:

Yeah, I would've liked to make the wait way shorter than 1 second, but I can't think of a way to make it deterministic either.

Contributor:

Maybe set a flag inside SIT to indicate that waitUntilValueSchemaAvailable() has been reached, and then wait until that flag is turned on?
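One possible shape for that, sketched with a hypothetical test-only latch (none of these names exist in SIT today):

import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

class DeterministicWaitSketch {
  // Hypothetical latch counted down by a hook or spy at the moment the SIT thread
  // reaches waitUntilValueSchemaAvailable().
  static final CountDownLatch REACHED_SCHEMA_WAIT = new CountDownLatch(1);

  // In the test, instead of Utils.sleep(1000L): block until the event actually happens,
  // with a generous bound only as a safety net.
  static boolean waitForSitToReachSchemaWait() throws InterruptedException {
    return REACHED_SCHEMA_WAIT.await(10, TimeUnit.SECONDS);
  }
}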

@gaojieliu (Contributor) commented:
The code would work from a correctness POV, but I am not convinced that we need this lock in this code path.
The infinite or long wait we discussed can achieve a similar goal with minimal changes, and @lluwm pointed out that we only need the long wait for leader<->follower transitions; for the others, such as a graceful shutdown, maybe we can just terminate the wait immediately.
Here are my reasons not to introduce this lock:

  1. An infinite wait would avoid this race condition completely, in theory.
  2. The performance is roughly the same with or without this lock.
  3. This new lock introduces more complexity to the ingestion path, which is already complex.
  4. We would introduce two mechanisms in this code path (timed wait + lock), which is unnecessary when one solution can solve it.

Please let me know the motivation for this change.

@KaiSernLim (Contributor Author) commented:
Closing and going with the increased-timeout approach in #1213.

@KaiSernLim closed this Nov 6, 2024