
[Enhancement] WriteAheadLog with sequentially callback #1718

Open
superhx opened this issue Aug 3, 2024 · 4 comments
Labels
enhancement New feature or request Stale

Comments

@superhx
Collaborator

superhx commented Aug 3, 2024

Problem

Currently, BlockWALService persists data blocks in parallel, responding to the upper layer with success as soon as a data block is persisted, even if the preceding data block has not yet been persisted. Although this approach provides "better" write latency, it shifts the responsibility of ensuring sequentiality to the upper layer, making the upper-layer logic more complex.

Expectation

Implement a SequentialBlockWALService:

  • Provide sequential-callback semantics (the underlying writes can still be concurrent);
  • Through optimization of locks and models, throughput and latency should match (or even beat) those of the original BlockWALService on an Aliyun ESSD PL1 (20 GB, 120 MiB/s, 2800 IOPS) at write concurrency 8/16/32/64;
  • The WAL format must remain consistent with the previous one and support BlockWALService's recovery logic with data holes, so that it can be a drop-in replacement in the future;
@CLFutureX
Contributor

@superhx Hey, after careful consideration of the current write model on my end, I've found that to retain the current parallel write model while implementing sequential-callback semantics on top of it,
it might be more appropriate to handle this at the stage after the block has been written.

After a block is written to the WAL, consider serializing the process of writing the record to the logCache, to reduce complexity at the upper layer.

  1. Create a single-threaded callbackExecutor.
  2. When the I/O thread completes a write, trigger the Request's callback, hand handleAppendCallback over to the callbackExecutor, and perform the sequential callback within the callbackExecutor.

For example, in com.automq.stream.s3.S3Storage#append0:
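The two steps above can be sketched as follows. This is a minimal, self-contained illustration (not the actual S3Storage code): callback tasks are submitted to a single-threaded executor in request order, and each task waits for its own I/O future, so callbacks always fire sequentially even though the writes complete in arbitrary order. The names (`callbackExecutor`, `ioPool`, `ioDone`) are hypothetical stand-ins.

```java
import java.util.concurrent.*;

// Sketch: I/O threads complete writes in any order, but callbacks are queued
// on a single-threaded executor in request order, so they run sequentially.
public class SequentialCallbackSketch {
    public static void main(String[] args) throws Exception {
        ExecutorService callbackExecutor = Executors.newSingleThreadExecutor();
        ExecutorService ioPool = Executors.newFixedThreadPool(4);
        ConcurrentLinkedQueue<Integer> callbackOrder = new ConcurrentLinkedQueue<>();
        CountDownLatch done = new CountDownLatch(8);

        for (int i = 0; i < 8; i++) {
            final int seq = i;
            CompletableFuture<Void> ioDone = new CompletableFuture<>();
            // Queue the callback task NOW, in request order; it runs only
            // after this request's write has finished.
            callbackExecutor.submit(() -> {
                ioDone.join();            // wait for this request's write
                callbackOrder.add(seq);   // "handleAppendCallback" stand-in
                done.countDown();
            });
            // Simulate parallel block writes finishing out of order.
            ioPool.submit(() -> {
                try { Thread.sleep((long) (Math.random() * 20)); } catch (InterruptedException ignored) { }
                ioDone.complete(null);
            });
        }
        done.await();
        System.out.println(callbackOrder); // always [0, 1, 2, 3, 4, 5, 6, 7]
        ioPool.shutdown();
        callbackExecutor.shutdown();
    }
}
```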

This approach avoids the concurrency control in the callbackSequencer, while also decoupling business-related operations from I/O operations: currently the I/O thread is responsible not only for writing the block but also for handling the request-callback processing chain.

Potential issues: under a large number of requests, processing callbacks may create significant pressure, but for now these are all pure in-memory operations. If delayed ack responses are a concern, the callbackExecutor could instead be designed as a multi-threaded model, routing to threads in the pool by streamId, thereby preserving per-stream ordering.
Alternatively, a dedicated ack-response thread pool could be set up, handling only upward ack requests.
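The multi-threaded variant could look roughly like this sketch (all names hypothetical): callbacks are routed to one of N single-threaded "lanes" by streamId, so callbacks for the same stream stay ordered while different streams ack in parallel.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Sketch: route each callback to a lane chosen by streamId; one lane is
// single-threaded, so per-stream callback order is preserved.
public class StreamRoutedCallbacks {
    private final ExecutorService[] lanes;

    StreamRoutedCallbacks(int n) {
        lanes = new ExecutorService[n];
        for (int i = 0; i < n; i++) {
            lanes[i] = Executors.newSingleThreadExecutor();
        }
    }

    // All callbacks for one streamId land on the same lane, hence run in order.
    void submit(long streamId, Runnable callback) {
        int lane = (int) Math.floorMod(streamId, (long) lanes.length);
        lanes[lane].submit(callback);
    }

    public static void main(String[] args) {
        StreamRoutedCallbacks router = new StreamRoutedCallbacks(4);
        router.submit(42L, () -> System.out.println("stream 42 ack #1"));
        router.submit(42L, () -> System.out.println("stream 42 ack #2")); // same lane, runs after #1
        for (ExecutorService lane : router.lanes) {
            lane.shutdown();
        }
    }
}
```

Ordering is guaranteed only within a stream; two different streams may ack in any relative order, which matches the stream-dimension ordering discussed later in this thread.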

@Chillax-0v0
Contributor

@CLFutureX Hi, I think you might have misunderstood @superhx's idea. In his description,

it shifts the responsibility of ensuring sequentiality to the upper layer, making the upper layer logic more complex

the "upper layer" mentioned here refers to S3Storage rather than the caller of S3Storage.

Specifically, you can see that due to the current non-sequential callbacks, we have added a lot of complex logic to S3Storage, such as WALCallbackSequencer and WALConfirmOffsetCalculator. This has made the S3Storage code overly complex and difficult to maintain.

What we want to do now is to thoroughly refactor BlockWALService to make it call back sequentially, thereby reducing the complexity of S3Storage.

@CLFutureX
Contributor

> @CLFutureX Hi, I think you might have misunderstood @superhx's idea. In his description,
>
> > it shifts the responsibility of ensuring sequentiality to the upper layer, making the upper layer logic more complex
>
> the "upper layer" mentioned here refers to S3Storage rather than the caller of S3Storage.
>
> Specifically, you can see that due to the current non-sequential callbacks, we have added a lot of complex logic to S3Storage, such as WALCallbackSequencer and WALConfirmOffsetCalculator. This has made the S3Storage code overly complex and difficult to maintain.
>
> What we want to do now is to thoroughly refactor BlockWALService to make it call back sequentially, thereby reducing the complexity of S3Storage.

Yeah, this will be a big project.

@CLFutureX
Contributor

CLFutureX commented Aug 20, 2024

@superhx @Chillax-0v0
Hey, please take a look.

Design for Ordered Return with Concurrent Writing

Expectations:

  1. Maintain the concurrent-write model.
  2. Ensure ordered return.
    The order here can be defined at the stream dimension, i.e., guaranteeing the ordering of partitioned writes.


Plan Design:

  1. Based on the existing block, abstract internal logical partitions;
  2. Use a multi-threaded EventLoopGroup model to replace the existing pollBlockScheduler + ioExecutor for data writing;
  3. Add a single IO thread, responsible for executing data force operations while adhering to IOPS limits.

Block

Continuing the current block design, the block remains unchanged externally, but internally it builds logical partitions, routing by streamId to different blockPartitions, so the change is transparent to the outside.

Partition-based distribution:
After a block is filled, it is routed and distributed to different event-loop threads based on its block partitioning.

Calculating a blockPartition's startOffset:
Under the current write logic, writes are based on the block's startOffset, so during the distribution phase the startOffset for each partition must be calculated.

How is it calculated? For example:
blockPartition1.startOffset = block.startOffset;
blockPartition2.startOffset = block.startOffset + alignLargeByBlockSize(blockPartition1.size);
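The offset arithmetic above can be made concrete with a small sketch. The 4 KiB block size and the `alignLargeByBlockSize` implementation here are illustrative assumptions; the real constants and helper live in the WAL implementation.

```java
// Sketch of the partition startOffset calculation (illustrative constants).
public class PartitionOffsets {
    static final long WAL_BLOCK_SIZE = 4096; // assumed 4 KiB alignment unit

    // Round size up to the next multiple of the WAL block size.
    static long alignLargeByBlockSize(long size) {
        return (size + WAL_BLOCK_SIZE - 1) / WAL_BLOCK_SIZE * WAL_BLOCK_SIZE;
    }

    public static void main(String[] args) {
        long blockStartOffset = 8192; // block.startOffset
        long partition1Size = 5000;   // bytes in blockPartition1

        long p1Start = blockStartOffset;
        long p2Start = blockStartOffset + alignLargeByBlockSize(partition1Size);

        // 5000 bytes align up to 8192, so partition2 starts at 8192 + 8192.
        System.out.println("partition1.startOffset = " + p1Start);
        System.out.println("partition2.startOffset = " + p2Start);
    }
}
```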

EventLoopGroup

Parallel writing:
The event loops can write in parallel based on the existing write model.
Completion notification:
A completed write does not mean the data is actually durable; it must wait for the IO thread to execute the force method. So after a write, the CompletableFuture is transferred to an internal queue of completed-but-pending notification items.
In this way, ordered parallel writing per stream is achieved for the upper layer.
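The completed-but-pending queue can be sketched like this (names hypothetical): each event loop parks the acks of its finished writes in a FIFO, and completes them in order only after the IO thread's force has made the bytes durable.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.CompletableFuture;

// Sketch: per-event-loop FIFO of written-but-not-yet-durable acks.
public class PendingNotifications {
    private final Queue<CompletableFuture<Void>> pending = new ArrayDeque<>();

    // Called on the event loop after the write syscall returns.
    void onWritten(CompletableFuture<Void> ack) {
        pending.add(ack);
    }

    // Called after force() completes: drain in FIFO order, so acks are
    // delivered in the order the writes were issued on this loop.
    void onForced() {
        CompletableFuture<Void> ack;
        while ((ack = pending.poll()) != null) {
            ack.complete(null);
        }
    }

    public static void main(String[] args) {
        PendingNotifications loop = new PendingNotifications();
        CompletableFuture<Void> a = new CompletableFuture<>();
        CompletableFuture<Void> b = new CompletableFuture<>();
        loop.onWritten(a);
        loop.onWritten(b);
        System.out.println(a.isDone()); // false: written but not yet forced
        loop.onForced();
        System.out.println(a.isDone() && b.isDone()); // true: durable, acked in order
    }
}
```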

IO thread

IOPS limit for force:
To keep the IOPS limit, the IO thread is responsible for executing a force at most once per 1/3000 s by default. (Further optimization is possible here: when an event loop writes, it updates a globally visible flag, and the IO thread checks this flag before executing force, to avoid empty force calls.)
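The dirty-flag optimization mentioned in the parenthesis could be sketched as follows (names hypothetical; the real WAL would call something like `fileChannel.force(false)` where the comment indicates):

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Sketch: the IO thread ticks at the IOPS budget and forces only when
// at least one event loop has written since the last force.
public class ForcePacer {
    private final AtomicBoolean dirty = new AtomicBoolean(false);

    // Called by event loops after each write.
    void markWritten() {
        dirty.set(true);
    }

    // Called by the IO thread each tick (e.g. every 1/3000 s).
    boolean maybeForce() {
        if (dirty.compareAndSet(true, false)) {
            // fileChannel.force(false) would go here in the real WAL
            return true;
        }
        return false; // nothing new: skip the empty force()
    }

    public static void main(String[] args) {
        ForcePacer pacer = new ForcePacer();
        System.out.println(pacer.maybeForce()); // false: no writes yet
        pacer.markWritten();
        System.out.println(pacer.maybeForce()); // true: flushes the pending write
        System.out.println(pacer.maybeForce()); // false: nothing new since last force
    }
}
```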
Notifying upper layers:
How does the IO thread notify the event loops to complete the corresponding CompletableFutures after a force?
Plan 1: Lock control. While the IO thread executes the force, event loops are barred from further writes. After the force finishes, the IO thread notifies the event-loop group, and each event loop traverses its completed-response collection.
Plan 2: Special task flag. When the IO thread starts a force, it inserts a special marker object into each event loop's completed-notification collection. After the force completes, it notifies the event loops that the force is finished.
On receiving the notification, each event loop traverses its completed-notification collection until it encounters the marker.
In this way, notification is bounded by the task flag, and the WAL achieves parallel writing with ordered return.
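Plan 2 can be sketched with a sentinel object (all names hypothetical): the IO thread enqueues a barrier into each event loop's queue before forcing, and after the force each loop acks entries up to, but not past, the barrier.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Sketch of "Plan 2": a sentinel marks which entries the last force covers.
public class BarrierQueue {
    private static final Object BARRIER = new Object();
    private final Queue<Object> queue = new ArrayDeque<>();

    // Event-loop side: record a written request awaiting durability.
    void addAck(String requestId) {
        queue.add(requestId);
    }

    // IO-thread side, just before force(): mark the durability boundary.
    void addBarrier() {
        queue.add(BARRIER);
    }

    // After force(): ack everything enqueued before the barrier was placed.
    int drainUntilBarrier() {
        int acked = 0;
        Object head;
        while ((head = queue.poll()) != null) {
            if (head == BARRIER) {
                break; // later writes wait for the next force
            }
            acked++; // complete this request's CompletableFuture here
        }
        return acked;
    }

    public static void main(String[] args) {
        BarrierQueue q = new BarrierQueue();
        q.addAck("r1");
        q.addAck("r2");
        q.addBarrier();  // IO thread starts force()
        q.addAck("r3");  // arrives during the force; not covered by it
        System.out.println(q.drainUntilBarrier()); // 2: only r1 and r2 are durable
    }
}
```

Because each queue is consumed by a single event loop and the barrier is just another queue entry, no locking between the IO thread and the event loops is needed for the drain itself, matching the "entire process is lock-free" claim below.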

The benefits of this plan include:

● Ordered writing achieved per partition.
● The entire process is lock-free.
● Clear division of responsibilities and well-defined boundaries at each process node: write operations, event loops, and the IO thread each take on independent responsibilities with clear functional boundaries, and there is no interference between them (unlike the current interference between write operations and polling operations).
● It is simple and easy to maintain.
Following this, the current WALCallbackSequencer class can then be eliminated.

@github-actions github-actions bot added the Stale label Nov 21, 2024