Arrow Flight Performance -- Rust vs Python (C++) #6670

Open
vichry2 opened this issue Nov 1, 2024 · 13 comments

Labels
question Further information is requested

Comments

@vichry2 commented Nov 1, 2024

Which part is this question about
Arrow flight, FlightDataEncoderBuilder, do_get

Describe your question
Is it expected that Arrow's Python (C++) Flight implementation encodes data more efficiently than arrow-rs?

Additional context
Hello.
After discussion with @alamb, I am filing an issue here.

Unsure if this is a bug, if it's expected, or if there's just an issue with my code, but after running some tests, it seems that Rust's encoding takes more time and resources than Python's.

I am running two servers, one in Python and the other in Rust, with the same simple design:
- Create a Table/RecordBatch before starting the flight service, which the service will hold in memory when running.
- When receiving a request (in do_get), simply provide a view of the data to fl.RecordBatchStream in Python / FlightDataEncoderBuilder in Rust.

Because nothing is really happening on the Python side (just providing a view to a Table), and a single request is not holding the GIL for a significant amount of time, I imagine I'm ultimately measuring the C++ Arrow Flight implementation.
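For reference, the Rust do_get path boils down to something like this (a simplified sketch, not the exact benchmark code; in the real server the resulting stream is returned via Response::new inside the FlightService impl):

```rust
use arrow_array::RecordBatch;
use arrow_flight::encode::FlightDataEncoderBuilder;
use arrow_flight::FlightData;
use futures::stream::{BoxStream, StreamExt, TryStreamExt};
use tonic::Status;

/// Build the do_get response stream for a batch the service already holds
/// in memory. Cloning the batch is cheap: its buffers are Arc'd, not copied.
fn do_get_stream(batch: RecordBatch) -> BoxStream<'static, Result<FlightData, Status>> {
    FlightDataEncoderBuilder::new()
        // Encode the in-memory batch into a stream of FlightData messages
        .build(futures::stream::once(async move { Ok(batch) }))
        .map_err(|e| Status::internal(e.to_string()))
        .boxed()
}
```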

I have run two tests:

  1. A Python script that sends n requests sequentially to each server, consumes the entire stream (flightclient.do_get().read_all()), and displays the average response time for each server.
  2. Using the Locust framework, load testing the maximum RPS capabilities of the servers (used taskset -c to separate Locust users and the server).

I observe the following from the tests:

  1. As the size of the data sent to the client increases, the difference in average response time between the Python and Rust servers also increases (in favor of the Python server).
  2. Similarly, as the amount of data increases, Python is able to achieve a higher RPS than Rust.

The Rust server's CPUs are fully utilized (more than the Python server's in certain cases). After profiling with perf, I am seeing a lot of CPU usage related to memory movement.

You can access my code here: https://github.com/vichry2/flight-benchmark

Thank you for your help!

vichry2 added the question label Nov 1, 2024
@tustvold (Contributor) commented Nov 1, 2024

Just to double check, did you compile the Rust code in release mode?

Presuming that is the case, I suspect what you're running into is - https://docs.rs/arrow-flight/latest/arrow_flight/encode/struct.FlightDataEncoderBuilder.html#method.with_max_flight_data_size

In particular, if you feed a massive RecordBatch into arrow-flight, it will break it up to better match gRPC best practices; I don't believe pyarrow does something similar. The solution here may be to return multiple smaller RecordBatches instead of one gigantic batch.
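For reference, that limit is set on the builder, e.g. (the value here is just illustrative):

```rust
use arrow_flight::encode::FlightDataEncoderBuilder;

// Illustrative only: cap each encoded FlightData message at roughly 2 MB.
// `input` is a stream of Result<RecordBatch, FlightError>.
let flight_data_stream = FlightDataEncoderBuilder::new()
    .with_max_flight_data_size(2 * 1024 * 1024)
    .build(input);
```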

@vichry2 (Author) commented Nov 1, 2024

Yes, I am compiling in release mode.

I have already attempted to chunk the record batches into "optimal sizes" (so that a record batch corresponds to close to 2MB of FlightData). Although this did show some slight improvements, performance did not drastically increase.
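The chunking I tried was along these lines (a sketch, not the exact code in the repo):

```rust
use arrow_array::RecordBatch;

/// Split one large batch into smaller batches of at most `rows_per_chunk` rows,
/// where `rows_per_chunk` is chosen so a chunk encodes to roughly 2 MB of FlightData.
/// Slicing is zero-copy: the chunks share the original batch's buffers.
fn chunk_batch(batch: &RecordBatch, rows_per_chunk: usize) -> Vec<RecordBatch> {
    (0..batch.num_rows())
        .step_by(rows_per_chunk)
        .map(|offset| {
            let len = rows_per_chunk.min(batch.num_rows() - offset);
            batch.slice(offset, len)
        })
        .collect()
}
```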

I've added this to the repo attached to the issue, as well as a README file, so you can reproduce the issue.

Thanks!

@tustvold (Contributor) commented Nov 1, 2024

Perhaps you could capture a trace of the Rust server using hotspot or similar; it may show where the bottleneck is.

@alamb (Contributor) commented Nov 3, 2024

Thank you @vichry2 -- this is great

Right now I am not 100% sure how to reproduce what you are seeing (https://github.com/vichry2/flight-benchmark is great, but I don't have time to understand how to run it exactly as you are).

Could you either:

  1. share the exact commands you are using to run the Rust benchmarks
  2. provide us additional profiling information?

In terms of profiling:

The Rust server's CPUs are fully utilized (using more than Python server in certain cases). After profiling with perf, I am seeing a lot of CPU usage related to memory movement.

Would it be possible to post a screenshot / dump of what you are seeing?

Alternatively, perhaps you can capture a flamegraph with cargo flamegraph (https://github.com/flamegraph-rs/flamegraph) or similar. Here is a short video tutorial of how I do it: https://youtu.be/2z11xtYw_xs

@alamb (Contributor) commented Nov 3, 2024

I strongly suspect there are some unnecessary copies or something going on in Rust, and with a flamegraph or similar we'll be able to fix them quickly.

@vichry2 (Author) commented Nov 3, 2024

Hi @alamb,

Thanks for the flamegraph tutorial!

I've generated a flamegraph for my Rust server which I've attached here: flamegraph

You can also find it on the flight-benchmark repo (assets folder).

As mentioned previously, it appears that the bottleneck is related to memory movement (memmove) during batch encoding.

Additionally, I’ve added commands in the README.md file of my repo that explain how to run the client.py script on the servers. This script compares the average response times for serving a single request in both languages, showing that Python is quicker.

Regarding the screenshot of the heavy CPU usage during the Locust load test, I can get that to you within the week. I prefer using taskset to separate client CPUs from server CPUs, so you can see the server CPUs saturating before the client CPUs using htop. However, on my home computer (MacOS), I haven't found an easy equivalent for that.

@vichry2 (Author) commented Nov 3, 2024

After looking at the problematic function write_buffer, I suspect the extend_from_slice calls are the issue.
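To illustrate what I mean (generic Rust, not arrow-rs code): repeatedly appending into a Vec that was never pre-sized forces reallocations, and each reallocation memmoves everything written so far, whereas reserving up front allocates once.

```rust
// Generic illustration, not arrow-rs code.

fn concat_no_reserve(buffers: &[&[u8]]) -> Vec<u8> {
    let mut out = Vec::new(); // starts with zero capacity
    for buf in buffers {
        out.extend_from_slice(buf); // may realloc and memmove `out` as it grows
    }
    out
}

fn concat_with_reserve(buffers: &[&[u8]]) -> Vec<u8> {
    let total: usize = buffers.iter().map(|b| b.len()).sum();
    let mut out = Vec::with_capacity(total); // single allocation up front
    for buf in buffers {
        out.extend_from_slice(buf); // appends into already-reserved space
    }
    out
}
```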

@tustvold (Contributor) commented Nov 3, 2024

It looks like the IPCDataGenerator is not doing a good job, if any, of estimating buffer sizes ahead of time, and is instead relying on bump allocation. This is what is leading to a large amount of time spent in realloc.

Hooking up buffer estimation based on https://docs.rs/arrow-data/latest/arrow_data/struct.ArrayData.html#method.get_slice_memory_size would likely help.
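A sketch of that kind of estimate (it ignores IPC framing, padding and metadata; just the idea):

```rust
use arrow_array::{Array, RecordBatch};

/// Rough estimate of the encoded data size of a batch, summing the sliced
/// buffer sizes of each column. Ignores IPC framing, padding and metadata.
fn estimate_batch_data_size(batch: &RecordBatch) -> usize {
    batch
        .columns()
        .iter()
        .map(|col| col.to_data().get_slice_memory_size().unwrap_or(0))
        .sum()
}
```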

That being said I'm not aware of a way to avoid at least some memmove, as our gRPC implementation requires the response payloads to be contiguous blocks of memory, and does not have a mechanism to allow vectored writes.

hyperium/tonic#1558

@vichry2 (Author) commented Nov 3, 2024

I was able to reduce the libsystem_malloc overhead visible in the previous flamegraph by doing something similar to what you proposed, @tustvold.
This resulted in the following flamegraph:
flamegraph

I used https://docs.rs/arrow/latest/arrow/record_batch/struct.RecordBatch.html#method.get_array_memory_size to estimate the size of the arrow_data vec in record_batch_to_bytes -- and initialized the vec with Vec::with_capacity(array_memory_size). This would likely overestimate the size of the arrow_data vector created, hence not be optimal for memory consumption, but that's beside the point.
The point is that the bottleneck is related to memory movement, and as you mentioned, if the response requires the data to be in contiguous blocks of memory, I'm not sure there is a solution to this.
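Concretely, the change I tried was along these lines (a sketch, not the exact diff):

```rust
// Pre-size the output Vec from the batch's total in-memory size before any
// buffers are appended. get_array_memory_size includes unused buffer capacity,
// so this tends to over-reserve, but it avoids the repeated realloc/memmove.
let array_memory_size = batch.get_array_memory_size();
let mut arrow_data: Vec<u8> = Vec::with_capacity(array_memory_size);
// ... encoding then appends the batch's buffers into `arrow_data` ...
```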

Do you happen to know why the C++ implementation does not seem to have this overhead? Is contiguous memory not a requirement?

Thanks!

@tustvold (Contributor) commented Nov 3, 2024

Do you happen to know why the C++ implementation does not seem to have this overhead? Is contiguous memory not a requirement?

The C++ gRPC implementation has a mechanism for providing a list of buffers. I'm not familiar with arrow-cpp but I imagine this is what they're using.

We would need a similar mechanism in tonic, and then we could look to hook up the IPC machinery to take advantage of it.

@tustvold (Contributor) commented Nov 3, 2024

Something worth putting into context though is that in most scenarios this copy will be irrelevant, as the transfer will be bottlenecked on network IO.

This won't be the case when running benchmarks against the local machine, but then Arrow Flight is an odd choice for such a deployment, where FFI or shared memory IPC would be more appropriate.

@alamb (Contributor) commented Nov 4, 2024

I used https://docs.rs/arrow/latest/arrow/record_batch/struct.RecordBatch.html#method.get_array_memory_size to estimate the size of the arrow_data vec in record_batch_to_bytes -- and initialized the vec with Vec::with_capacity(array_memory_size). This would likely overestimate the size of the arrow_data vector created, hence not be optimal for memory consumption, but that's beside the point.

Thank you @vichry2 -- is this something we could make a PR in arrow-rs to improve? Or maybe add a documentation example or something?

@vichry2 (Author) commented Nov 5, 2024

Thank you @vichry2 -- is this something we could make a PR in arrow-rs to improve?

I think that would be good. Although it would be just a minor improvement now, if non-contiguous memory buffers are ever implemented in tonic and incorporated into FlightDataEncoder, the recurring malloc calls would become more evident.

I can investigate the difference in memory allocation between an arrow_data Vec initialized with RecordBatch.get_array_memory_size and the current arrow_data Vec, to ensure we don't reserve an excessive amount of unnecessary memory.
