Arrow Flight Performance -- Rust vs Python (C++) #6670

Open
vichry2 opened this issue Nov 1, 2024 · 13 comments

Labels
question Further information is requested

Comments

@vichry2 commented Nov 1, 2024

Which part is this question about
Arrow flight, FlightDataEncoderBuilder, do_get

Describe your question
Is it expected that Arrow's Python (C++) Flight implementation encodes data more efficiently than arrow-rs?

Additional context
Hello.
After discussion with @alamb, I am filing an issue here.

Unsure if this is a bug, if it's expected, or if there's just an issue with my code, but after running some tests, it seems that Rust's encoding takes more time and resources than Python's.

I am running two servers, one in Python and the other in Rust, with the same simple design:
- Create a Table/RecordBatch before starting the flight service, which the service will hold in memory when running.
- When receiving a request (in do_get), simply provide a view of the data to fl.RecordBatchStream in Python / FlightDataEncoderBuilder in Rust.

Because nothing is really happening on the Python side (just providing a view to a Table), and a single request is not holding the GIL for a significant amount of time, I imagine I'm ultimately measuring the C++ Arrow Flight implementation.
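For reference, the Rust do_get path boils down to something like this (a simplified sketch, not the exact benchmark code; in the real server the resulting stream is returned via Response::new inside the FlightService impl):

```rust
use arrow_array::RecordBatch;
use arrow_flight::encode::FlightDataEncoderBuilder;
use arrow_flight::FlightData;
use futures::stream::{BoxStream, StreamExt, TryStreamExt};
use tonic::Status;

/// Build the do_get response stream for a batch the service already holds
/// in memory. Cloning the batch is cheap: its buffers are Arc'd, not copied.
fn do_get_stream(batch: RecordBatch) -> BoxStream<'static, Result<FlightData, Status>> {
    FlightDataEncoderBuilder::new()
        // Encode the in-memory batch into a stream of FlightData messages
        .build(futures::stream::once(async move { Ok(batch) }))
        .map_err(|e| Status::internal(e.to_string()))
        .boxed()
}
```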

I have run two tests:

  1. A Python script that sends n requests sequentially to each server, consumes the entire stream (flightclient.do_get().read_all()), and displays the average response time for each server.
  2. Using the Locust framework, load testing the maximum RPS capabilities of the servers (used taskset -c to separate Locust users and the server).

I observe the following from the tests:

  1. As the size of the data sent to the client increases, the difference in average response time between the Python and Rust servers also increases (in favor of the Python server).
  2. Similarly, as the amount of data increases, Python is able to achieve a higher RPS than Rust.

The Rust server's CPUs are fully utilized (more than the Python server's in certain cases). After profiling with perf, I am seeing a lot of CPU usage related to memory movement.

You can access my code here: https://github.com/vichry2/flight-benchmark

Thank you for your help!

vichry2 added the question label Nov 1, 2024
@tustvold (Contributor) commented Nov 1, 2024

Just to double check, did you compile the Rust code in release mode?

Presuming that is the case, I suspect what you're running into is - https://docs.rs/arrow-flight/latest/arrow_flight/encode/struct.FlightDataEncoderBuilder.html#method.with_max_flight_data_size

In particular, if you feed a massive RecordBatch into arrow-flight, it will break it up to better match gRPC best practices; I don't believe pyarrow does something similar. The solution here may be to return multiple smaller RecordBatches instead of one gigantic batch.
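For reference, that limit is set on the builder, e.g. (the value here is just illustrative):

```rust
use arrow_flight::encode::FlightDataEncoderBuilder;

// Illustrative only: cap each encoded FlightData message at roughly 2 MB.
// `input` is a stream of Result<RecordBatch, FlightError>.
let flight_data_stream = FlightDataEncoderBuilder::new()
    .with_max_flight_data_size(2 * 1024 * 1024)
    .build(input);
```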

@vichry2 (Author) commented Nov 1, 2024

Yes, I am compiling in release mode.

I have already attempted to chunk the record batches into "optimal sizes" (so that a record batch corresponds to close to 2MB of FlightData). Although this did show some slight improvements, performance did not drastically increase.
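The chunking I tried was along these lines (a sketch, not the exact code in the repo):

```rust
use arrow_array::RecordBatch;

/// Split one large batch into smaller batches of at most `rows_per_chunk` rows,
/// where `rows_per_chunk` is chosen so a chunk encodes to roughly 2 MB of FlightData.
/// Slicing is zero-copy: the chunks share the original batch's buffers.
fn chunk_batch(batch: &RecordBatch, rows_per_chunk: usize) -> Vec<RecordBatch> {
    (0..batch.num_rows())
        .step_by(rows_per_chunk)
        .map(|offset| {
            let len = rows_per_chunk.min(batch.num_rows() - offset);
            batch.slice(offset, len)
        })
        .collect()
}
```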

I've added this to the repo attached to the issue, as well as a README file, so you can reproduce the issue.

Thanks!

@tustvold (Contributor) commented Nov 1, 2024

Perhaps you could capture a trace of the Rust server using hotspot or similar; it may show where the bottleneck is.

@alamb (Contributor) commented Nov 3, 2024

Thank you @vichry2 -- this is great

Right now I am not 100% sure how to reproduce what you are seeing (https://github.com/vichry2/flight-benchmark is great, but I don't have time to understand how to run it exactly as you are).

Could you either:

  1. share the exact commands you are using to run the Rust benchmarks
  2. provide us additional profiling information?

In terms of profiling:

The Rust server's CPUs are fully utilized (using more than Python server in certain cases). After profiling with perf, I am seeing a lot of CPU usage related to memory movement.

Would it be possible to post a screenshot / dump of what you are seeing?

Alternatively, perhaps you can capture a flamegraph with cargo flamegraph (https://github.com/flamegraph-rs/flamegraph) or similar. Here is a short video tutorial of how I do it: https://youtu.be/2z11xtYw_xs

@alamb (Contributor) commented Nov 3, 2024

I strongly suspect there are some unnecessary copies or something going on in Rust, and with a flamegraph or similar we'll be able to fix them quickly.

@vichry2 (Author) commented Nov 3, 2024

Hi @alamb,

Thanks for the flamegraph tutorial!

I've generated a flamegraph for my Rust server which I've attached here: flamegraph

You can also find it on the flight-benchmark repo (assets folder).

As mentioned previously, it appears that the bottleneck is related to memory movement (memmove) during batch encoding.

Additionally, I’ve added commands in the README.md file of my repo that explain how to run the client.py script on the servers. This script compares the average response times for serving a single request in both languages, showing that Python is quicker.

Regarding the screenshot of the heavy CPU usage during the Locust load test, I can get that to you within the week. I prefer using taskset to separate client CPUs from server CPUs, so you can see the server CPUs saturating before the client CPUs using htop. However, on my home computer (MacOS), I haven't found an easy equivalent for that.

@vichry2 (Author) commented Nov 3, 2024

After looking at the problematic function write_buffer, I suspect the extend_from_slice calls are the issue.
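To illustrate what I mean (generic Rust, not arrow-rs code): repeatedly appending into a Vec that was never pre-sized forces reallocations, and each reallocation memmoves everything written so far, whereas reserving up front allocates once.

```rust
// Generic illustration, not arrow-rs code.

fn concat_no_reserve(buffers: &[&[u8]]) -> Vec<u8> {
    let mut out = Vec::new(); // starts with zero capacity
    for buf in buffers {
        out.extend_from_slice(buf); // may realloc and memmove `out` as it grows
    }
    out
}

fn concat_with_reserve(buffers: &[&[u8]]) -> Vec<u8> {
    let total: usize = buffers.iter().map(|b| b.len()).sum();
    let mut out = Vec::with_capacity(total); // single allocation up front
    for buf in buffers {
        out.extend_from_slice(buf); // appends into already-reserved space
    }
    out
}
```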

@tustvold (Contributor) commented Nov 3, 2024

It looks like the IPCDataGenerator is not doing a good job, if any, of estimating buffer sizes ahead of time, and is instead relying on bump allocation. This is what is leading to a large amount of time spent in realloc.

Hooking up buffer estimation based on https://docs.rs/arrow-data/latest/arrow_data/struct.ArrayData.html#method.get_slice_memory_size would likely help.
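A sketch of that kind of estimate (it ignores IPC framing, padding and metadata; just the idea):

```rust
use arrow_array::{Array, RecordBatch};

/// Rough estimate of the encoded data size of a batch, summing the sliced
/// buffer sizes of each column. Ignores IPC framing, padding and metadata.
fn estimate_batch_data_size(batch: &RecordBatch) -> usize {
    batch
        .columns()
        .iter()
        .map(|col| col.to_data().get_slice_memory_size().unwrap_or(0))
        .sum()
}
```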

That being said I'm not aware of a way to avoid at least some memmove, as our gRPC implementation requires the response payloads to be contiguous blocks of memory, and does not have a mechanism to allow vectored writes.

hyperium/tonic#1558

@vichry2 (Author) commented Nov 3, 2024

I was able to reduce the libsystem_malloc overhead visible in the previous flamegraph by doing something similar to what you proposed, @tustvold.
This resulted in the following flamegraph:
flamegraph

I used https://docs.rs/arrow/latest/arrow/record_batch/struct.RecordBatch.html#method.get_array_memory_size to estimate the size of the arrow_data vec in record_batch_to_bytes -- and initialized the vec with Vec::with_capacity(array_memory_size). This would likely overestimate the size of the arrow_data vector created, hence not be optimal for memory consumption, but that's beside the point.
The point is that the bottleneck is related to memory movement, and as you mentioned, if the response requires the data to be in contiguous blocks of memory, I'm not sure there is a solution to this.
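Concretely, the change I tried was along these lines (a sketch, not the exact diff):

```rust
// Pre-size the output Vec from the batch's total in-memory size before any
// buffers are appended. get_array_memory_size includes unused buffer capacity,
// so this tends to over-reserve, but it avoids the repeated realloc/memmove.
let array_memory_size = batch.get_array_memory_size();
let mut arrow_data: Vec<u8> = Vec::with_capacity(array_memory_size);
// ... encoding then appends the batch's buffers into `arrow_data` ...
```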

Do you happen to know why the C++ implementation does not seem to have this overhead? Is contiguous memory not a requirement?

Thanks!

@tustvold (Contributor) commented Nov 3, 2024

Do you happen to know why the C++ implementation does not seem to have this overhead? Is contiguous memory not a requirement?

The C++ gRPC implementation has a mechanism for providing a list of buffers. I'm not familiar with arrow-cpp but I imagine this is what they're using.

We would need a similar mechanism in tonic, and then we could look to hook up the IPC machinery to take advantage of it.

@tustvold (Contributor) commented Nov 3, 2024

Something worth putting into context though is that in most scenarios this copy will be irrelevant, as the transfer will be bottlenecked on network IO.

This won't be the case when running benchmarks against the local machine, but then Arrow Flight is an odd choice for such a deployment, where FFI or shared memory IPC would be more appropriate.

@alamb (Contributor) commented Nov 4, 2024

I used https://docs.rs/arrow/latest/arrow/record_batch/struct.RecordBatch.html#method.get_array_memory_size to estimate the size of the arrow_data vec in record_batch_to_bytes -- and initialized the vec with Vec::with_capacity(array_memory_size). This would likely overestimate the size of the arrow_data vector created, hence not be optimal for memory consumption, but that's beside the point.

Thank you @vichry2 -- is this something we could make a PR in arrow-rs to improve? Or maybe add a documentation example or something?

@vichry2 (Author) commented Nov 5, 2024

Thank you @vichry2 -- is this something we could make a PR in arrow-rs to improve?

I think that would be good. Although it would be just a minor improvement now, if non-contiguous memory buffers are ever implemented in tonic and incorporated into FlightDataEncoder, the recurring malloc calls would become more evident.

I can investigate the difference in memory allocation between an arrow_data Vec initialized with RecordBatch.get_array_memory_size and the current arrow_data Vec, to ensure we don't reserve an excessive amount of unnecessary memory.
