Poor performance #1631
Comments
Which kinds of objects are you working with? Could you share the dtypes/schema if you are working with pandas dataframes/tables? There's a known performance degradation if your pandas dataframe has string columns.
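For readers wanting to check whether their dataframe hits this case, here is a minimal sketch of the dtype inspection implied above (the dataframe and column names are made up for illustration; this is not vineyard's internal logic):

```python
import pandas as pd

# Hypothetical dataframe: string columns typically carry the 'object' dtype,
# which is the case associated with the performance degradation mentioned above.
df = pd.DataFrame({"id": [1, 2, 3],
                   "score": [0.5, 0.7, 0.9],
                   "name": ["a", "b", "c"]})
print(df.dtypes)

# Collect the columns stored as 'object' (i.e. likely Python strings):
string_cols = [c for c in df.columns if df[c].dtype == object]
```

Sharing the output of `df.dtypes` is usually enough to tell whether the string-column slow path applies.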
Hi @sighingnow. Basically, you could try out putting a
Thanks for the information. We'll give it a try to verify the result.
To clarify, the measurements mentioned above are not valid in most cases. The amended benchmarks show a 1.2x ~ 1.5x performance degradation compared with
The underlying reasons for the observed performance gap are:
@sighingnow Please verify the first statement; if so, this issue could be closed.
I have tested the code and done some profiling; the
Hi @qiranq99. Actually, the performance gap has nothing to do with the Cython/pybind11 calls. The gap exists because plasma internally uses multiple threads for concurrent memcpy (6 by default, see also: https://github.com/apache/arrow/blob/apache-arrow-11.0.0/python/pyarrow/_plasma.pyx#L532) while vineyard uses a single thread for memcpy. After enabling concurrent memcpy, vineyard achieves even higher throughput than plasma at the same level of parallelism when putting numpy ndarrays:
The benchmark case and the newly added concurrency control in the Python APIs can be found in #1646. From the results, you can see there are indeed improvements over plasma when putting large tensors. For small tensors, the gap remains because there are still opportunities to further improve the dispatch logic of builders and resolvers. Compared with plasma, vineyard also unlocks opportunities for more complex objects and new object possibilities. The optimization of builders and resolvers is already on our roadmap (issue #727).
The concurrent memcpy is only enabled for copies >= 4MB to optimize the overhead of creating threads. |
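The chunked-copy-with-threshold strategy described above can be sketched as follows. This is a hypothetical illustration in plain Python, not vineyard's actual implementation (which does this in C++, where the threads genuinely run in parallel; in CPython the GIL limits the real speedup of this sketch), and the names `CONCURRENT_COPY_THRESHOLD` and `concurrent_memcpy` are made up for the example:

```python
import os
from concurrent.futures import ThreadPoolExecutor

CONCURRENT_COPY_THRESHOLD = 4 * 1024 * 1024  # 4MB threshold mentioned above
NUM_COPY_THREADS = 6  # plasma's default level of parallelism

def concurrent_memcpy(dst: memoryview, src: memoryview,
                      threshold: int = CONCURRENT_COPY_THRESHOLD,
                      nthreads: int = NUM_COPY_THREADS) -> None:
    """Copy src into dst, splitting large copies across a thread pool."""
    n = len(src)
    if n < threshold or nthreads <= 1:
        # Small copy: the overhead of spinning up threads would dominate.
        dst[:n] = src
        return
    chunk = (n + nthreads - 1) // nthreads  # ceil-divide into nthreads slices

    def copy_chunk(i: int) -> None:
        lo, hi = i * chunk, min((i + 1) * chunk, n)
        dst[lo:hi] = src[lo:hi]

    with ThreadPoolExecutor(max_workers=nthreads) as pool:
        list(pool.map(copy_chunk, range(nthreads)))

# usage: copy an 8MB payload (above the threshold) and a tiny one (below it)
src = os.urandom(8 * 1024 * 1024)
dst = bytearray(len(src))
concurrent_memcpy(memoryview(dst), memoryview(src))
```

The threshold exists because thread creation and scheduling cost a roughly fixed amount per copy, which only amortizes once the payload is large.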
…#1646) Remove the problematic `.buffer` property (as it cannot bind the lifetime of the underlying blob to the memoryview object) and add concurrent support for memcpy for faster object building. Fixes #1631 Signed-off-by: Tao He <[email protected]>
Hi,
Under several benchmarks of putting data objects into the shared store (from 1KB to several GBs), we observed that `vineyard` underperforms `ray` (`plasma`), spending 2x-5x more time. As the data object size grows, the performance issue scales. Are there any specific reasons or sources of overhead?
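A minimal sketch of the kind of benchmark harness described here, for reproducibility. Since neither store is assumed to be running, `put` is a stand-in parameter for the actual `client.put()` of vineyard or plasma, and the function name `bench_put` and the size list are made up for illustration:

```python
import os
import time

def bench_put(put, sizes=(1 << 10, 1 << 20, 16 << 20)):  # 1KB, 1MB, 16MB
    """Time a single put of a random payload at each size; return {size: seconds}."""
    results = {}
    for n in sizes:
        data = os.urandom(n)
        t0 = time.perf_counter()
        put(data)
        results[n] = time.perf_counter() - t0
    return results

# Stand-in "put": copy the payload into a fresh buffer. With a real store,
# replace the lambda with e.g. `client.put` and compare the two result dicts.
timings = bench_put(lambda buf: bytearray(buf))
```

Plotting the per-size timings for both stores side by side makes the 2x-5x gap and its scaling with object size visible directly.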