Arrow Flight Performance -- Rust vs Python (C++) #6670
Just to double check: did you compile the Rust code in release mode? Presuming that is the case, I suspect what you're running into is https://docs.rs/arrow-flight/latest/arrow_flight/encode/struct.FlightDataEncoderBuilder.html#method.with_max_flight_data_size. In particular, if you feed a massive `RecordBatch` into arrow-flight, it will break it up to better match gRPC best practices; I don't believe pyarrow does something similar. The solution here may be to return multiple smaller `RecordBatch`es instead of one gigantic batch.
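For illustration, here is a minimal sketch of setting that limit; the 2 MB value and the `encode_batch` wrapper are example choices, not code from this issue:

```rust
use arrow_array::RecordBatch;
use arrow_flight::encode::{FlightDataEncoder, FlightDataEncoderBuilder};
use arrow_flight::error::FlightError;
use futures::stream;

// Sketch: encode one batch, asking the encoder to split its output into
// roughly 2 MB FlightData messages (an example value) so a gigantic
// input batch does not become one gigantic gRPC message.
fn encode_batch(batch: RecordBatch) -> FlightDataEncoder {
    FlightDataEncoderBuilder::new()
        .with_max_flight_data_size(2 * 1024 * 1024)
        .build(stream::iter([Ok::<_, FlightError>(batch)]))
}
```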
Yes, I am compiling in release mode. I have already attempted to chunk the record batches into "optimal sizes" (so that each record batch corresponds to close to 2 MB of `FlightData`). Although this did show some slight improvement, performance did not drastically increase. I've added this to the repo attached to the issue, as well as a README file, so you can reproduce the issue. Thanks!
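For reference, that kind of chunking can be done with zero-copy slices; a sketch, where `rows_per_chunk` is an assumed value pre-computed from the schema so each slice encodes to roughly 2 MB:

```rust
use arrow_array::RecordBatch;

// Sketch: split one large batch into row-range slices. `slice` is
// zero-copy -- each chunk is a view over the same underlying buffers.
fn chunk_batch(batch: &RecordBatch, rows_per_chunk: usize) -> Vec<RecordBatch> {
    assert!(rows_per_chunk > 0);
    (0..batch.num_rows())
        .step_by(rows_per_chunk)
        .map(|start| {
            let len = rows_per_chunk.min(batch.num_rows() - start);
            batch.slice(start, len)
        })
        .collect()
}
```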
Perhaps you could capture a trace of the Rust server using hotspot or similar; it may show where the bottleneck is.
Thank you @vichry2 -- this is great. Right now I am not 100% sure how to reproduce what you are seeing (https://github.com/vichry2/flight-benchmark is great, but I don't have time to figure out how to run it exactly as you are). Could you either:
In terms of profiling:
Would it be possible to post a screenshot / dump of what you are seeing? Alternatively, perhaps you could capture a flamegraph.
I strongly suspect there are some unnecessary copies or something similar going on in Rust, and with a flamegraph or similar we'll be able to fix them quickly.
Hi @alamb, thanks for the flamegraph tutorial! I've generated a flamegraph for my Rust server, which I've attached here. You can also find it in the repo attached to the issue. As mentioned previously, it appears that the bottleneck is related to memory movement (`memmove`). Additionally, I've added commands to the README so you can reproduce the results. Regarding the screenshot of the heavy CPU usage during the Locust load test, I can get that to you within the week.
After observing the problematic function in the flamegraph:
It looks like the `IPCDataGenerator` is not doing a good job (if any) of estimating buffer sizes ahead of time and is relying on bump allocation. This is what is leading to a large amount of time spent in `realloc`. Hooking up buffer estimation based on https://docs.rs/arrow-data/latest/arrow_data/struct.ArrayData.html#method.get_slice_memory_size would likely help. That being said, I'm not aware of a way to avoid at least some `memmove`, as our gRPC implementation requires the response payloads to be contiguous blocks of memory and does not have a mechanism to allow vectored writes.
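To make that concrete, a sketch of what such an estimate could look like; the helper and the assumption that summing per-column slice sizes tracks the IPC buffer footprint are illustrative, not an existing arrow-rs API:

```rust
use arrow_array::{Array, RecordBatch};
use arrow_schema::ArrowError;

// Sketch: approximate the encoded size of a batch by summing the sliced
// memory size of each column's underlying ArrayData; the result could
// be used to size the IPC output buffer up front instead of growing it.
fn estimate_ipc_size(batch: &RecordBatch) -> Result<usize, ArrowError> {
    batch
        .columns()
        .iter()
        .map(|col| col.to_data().get_slice_memory_size())
        .sum()
}
```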
I was able to reduce the time spent in `realloc`. I used https://docs.rs/arrow/latest/arrow/record_batch/struct.RecordBatch.html#method.get_array_memory_size to estimate the size of the encoded data up front. Do you happen to know why the C++ implementation does not seem to have this overhead? Is contiguous memory not a requirement? Thanks!
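A minimal sketch of that pre-allocation idea, assuming you control the output buffer; note the in-memory footprint is only a rough proxy for the encoded size:

```rust
use arrow_array::RecordBatch;

// Sketch: reserve capacity once, using the batch's in-memory size as an
// estimate, so that encoding appends without repeated reallocations.
fn preallocated_output(batch: &RecordBatch) -> Vec<u8> {
    Vec::with_capacity(batch.get_array_memory_size())
}
```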
The C++ gRPC implementation has a mechanism for providing a list of buffers. I'm not familiar with arrow-cpp, but I imagine this is what they're using. We would need a similar mechanism in tonic, and then we could look at hooking up the IPC machinery to take advantage of it.
Something worth putting into context, though, is that in most scenarios this copy will be irrelevant, as the transfer will be bottlenecked on network IO. This won't be the case when running benchmarks against a local machine, but then Arrow Flight is an odd choice for such a deployment, where FFI or shared-memory IPC would be more appropriate.
Thank you @vichry2 -- is this something we could make a PR in arrow-rs to improve? Or maybe add a documentation example or something? |
I think that would be good. Although it's just a minor improvement now, it would pay off if non-contiguous memory buffers are ever implemented in tonic and incorporated into arrow-flight. I can investigate the difference in memory allocation when estimating the size of the encoded `FlightData`.
Which part is this question about
Arrow flight, FlightDataEncoderBuilder, do_get
Describe your question
Is it expected that Arrow's Python (C++) Flight implementation encodes data more efficiently than arrow-rs?
Additional context
Hello.
After discussion with @alamb, I am filing an issue here.
Unsure if this is a bug, or if it's expected, or if there's just an issue with my code, but after running some tests, it seems that Rust's encoding takes more time and resources than Python's.
I am running two servers, one in Python and the other in Rust, with the same simple design:

- Create a `Table`/`RecordBatch` before starting the flight service, which the service will hold in memory while running.
- When receiving a request (in `do_get`), simply provide a view of the data to `fl.RecordBatchStream` in Python / `FlightDataEncoderBuilder` in Rust.

Because nothing is really happening on the Python side (just providing a view to a `Table`), and a single request is not holding the GIL for a significant amount of time, I imagine I'm ultimately measuring the C++ Arrow Flight implementation.
I have run two tests:

1. A test that fetches the data (`flightclient.do_get().read_all()`) and displays the average response time for each server.
2. A Locust load test (using `taskset -c` to separate Locust users and the server).

I observe the following from the tests:
The Rust server's CPUs are fully utilized (using more than the Python server in certain cases). After profiling with `perf`, I am seeing a lot of CPU usage related to memory movement.

You can access my code here: https://github.com/vichry2/flight-benchmark
Thank you for your help!