Use non-contiguous buffer for codec and transport #1559

Open · wants to merge 1 commit into master
Conversation

Contributor

@xiaoyawei xiaoyawei commented Oct 25, 2023

Motivation

This PR addresses performance concerns related to memory copying when transferring large data chunks, as outlined in #1558.

Solution

To minimize memory copying, this PR introduces a specialized buffer backed by non-contiguous memory. Taking cues from SliceBuffer in the C++ gRPC implementation, the new buffer structure optimizes memory handling, especially in scenarios such as Arrow Flight that involve transferring substantial data volumes.
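To illustrate the idea, here is a minimal sketch of a non-contiguous buffer, not tonic's actual implementation: the name SliceBuffer is borrowed from the C++ gRPC concept, and Vec&lt;u8&gt; stands in for a refcounted chunk type like bytes::Bytes. Incoming chunks are appended as-is instead of being copied into one contiguous allocation, and reads cross chunk boundaries on demand.

```rust
use std::collections::VecDeque;

/// Hypothetical sketch of a buffer backed by non-contiguous chunks.
/// A real implementation would hold refcounted slices (e.g. bytes::Bytes)
/// and could hand whole chunks out zero-copy; this sketch only shows the
/// chunk-list bookkeeping.
struct SliceBuffer {
    chunks: VecDeque<Vec<u8>>,
    remaining: usize,
}

impl SliceBuffer {
    fn new() -> Self {
        SliceBuffer { chunks: VecDeque::new(), remaining: 0 }
    }

    /// Append a chunk without copying it into a contiguous region.
    fn push(&mut self, chunk: Vec<u8>) {
        self.remaining += chunk.len();
        self.chunks.push_back(chunk);
    }

    /// Total bytes available across all chunks.
    fn remaining(&self) -> usize {
        self.remaining
    }

    /// Copy `dst.len()` bytes out, crossing chunk boundaries as needed.
    fn copy_to_slice(&mut self, dst: &mut [u8]) {
        let mut filled = 0;
        while filled < dst.len() {
            let front = self.chunks.front_mut().expect("buffer underflow");
            let n = (dst.len() - filled).min(front.len());
            dst[filled..filled + n].copy_from_slice(&front[..n]);
            front.drain(..n);
            if front.is_empty() {
                self.chunks.pop_front();
            }
            filled += n;
            self.remaining -= n;
        }
    }
}

fn main() {
    let mut buf = SliceBuffer::new();
    buf.push(vec![1, 2, 3]);
    buf.push(vec![4, 5]);
    // A 4-byte read spans the boundary between the two chunks.
    let mut out = [0u8; 4];
    buf.copy_to_slice(&mut out);
    println!("read {:?}, {} byte(s) left", out, buf.remaining());
}
```

The design point is that the copy happens only when a caller actually needs contiguous bytes; until then, chunks arriving from the transport are kept in place.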

Detailed changes and considerations are discussed in #1558.

Perf benchmark

I ran the existing decoder benchmarks and also put together an end-to-end benchmark for the Arrow Flight scenario (elaborated in #1558).

Decoder benchmark

Comparative results between the tonic master branch and this PR show performance improvements across most scenarios:

Master Branch Results:

test chunk_size_100   ... bench:       1,069 ns/iter (+/- 201) = 940 MB/s
test chunk_size_1005  ... bench:         704 ns/iter (+/- 260) = 1427 MB/s
test chunk_size_500   ... bench:         758 ns/iter (+/- 212) = 1325 MB/s
test message_count_1  ... bench:         681 ns/iter (+/- 33) = 741 MB/s
test message_count_10 ... bench:       2,605 ns/iter (+/- 906) = 1938 MB/s
test message_count_20 ... bench:       4,577 ns/iter (+/- 1,339) = 2206 MB/s
test message_size_10k ... bench:       2,463 ns/iter (+/- 423) = 8124 MB/s
test message_size_1k  ... bench:         921 ns/iter (+/- 173) = 2182 MB/s
test message_size_5k  ... bench:       1,345 ns/iter (+/- 293) = 7442 MB/s

This PR Results:

test chunk_size_100   ... bench:       1,174 ns/iter (+/- 55) = 856 MB/s
test chunk_size_1005  ... bench:         580 ns/iter (+/- 37) = 1732 MB/s
test chunk_size_500   ... bench:         808 ns/iter (+/- 89) = 1243 MB/s
test message_count_1  ... bench:         585 ns/iter (+/- 39) = 863 MB/s
test message_count_10 ... bench:       1,303 ns/iter (+/- 149) = 3875 MB/s
test message_count_20 ... bench:       2,097 ns/iter (+/- 195) = 4816 MB/s
test message_size_10k ... bench:       2,080 ns/iter (+/- 657) = 9620 MB/s
test message_size_1k  ... bench:         660 ns/iter (+/- 72) = 3045 MB/s
test message_size_5k  ... bench:       1,393 ns/iter (+/- 48) = 7185 MB/s

Note that, in alignment with h2's implementation, message_size always equals chunk_size in this benchmark. This PR improves performance in most of these scenarios.
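For context on what the decoder benchmark exercises, here is a hedged sketch of gRPC message framing over non-contiguous chunks. The 5-byte prefix (1-byte compressed flag plus a 4-byte big-endian length) is part of the gRPC wire format; the function below is a hypothetical illustration, not tonic's decoder, and flattens the chunks with an iterator where a real decoder would track positions to avoid re-scanning.

```rust
/// Parse one gRPC frame (1-byte compressed flag + 4-byte big-endian length
/// + payload) out of a sequence of non-contiguous chunks, without first
/// concatenating them into one buffer.
fn parse_frame(chunks: &[&[u8]]) -> Option<(bool, Vec<u8>)> {
    // Iterate bytes across chunk boundaries lazily.
    let mut bytes = chunks.iter().flat_map(|c| c.iter().copied());
    let compressed = bytes.next()? != 0;
    let mut len_be = [0u8; 4];
    for b in len_be.iter_mut() {
        *b = bytes.next()?;
    }
    let len = u32::from_be_bytes(len_be) as usize;
    let payload: Vec<u8> = bytes.take(len).collect();
    // Return None if the frame is still incomplete.
    if payload.len() == len { Some((compressed, payload)) } else { None }
}

fn main() {
    // A frame split across three chunks: flag 0, length 3, payload "abc".
    let chunks: [&[u8]; 3] = [&[0, 0, 0], &[0, 3, b'a'], &[b'b', b'c']];
    if let Some((compressed, payload)) = parse_frame(&chunks) {
        println!("compressed: {}, payload: {:?}", compressed, payload);
    }
}
```

The relevant cost difference is upstream of this logic: with a contiguous buffer every incoming chunk must first be copied into one allocation before framing, whereas a chunk-aware decoder can parse in place.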

Arrow Flight End-to-End Benchmark

Simulated scenarios involve the client invoking the DoExchange RPC method to open a bidirectional streaming channel with the server and exchange multiple FlightData objects.

Scenario: data_body size 64KB, 10 chunks exchanged

  • Master Branch: Throughput: 462 MB/s
  • This PR: Throughput: 600 MB/s

Scenario: data_body size 1MB, 10 chunks exchanged

  • Master Branch: Throughput: 992 MB/s
  • This PR: Throughput: 1375 MB/s

The benchmarks show that this PR improves throughput across the tested scenarios, offering a more efficient solution for data transfer in Arrow Flight use cases and beyond.

@xiaoyawei force-pushed the zero_copy_bytes branch 3 times, most recently from 35cfa8e to f481616 on November 16, 2023.