Poor performance #1631
Comments
Which kinds of objects are you working with? Could you share the dtypes/schema if you are working with pandas dataframes/tables? There's a known performance degradation if your pandas dataframe has string columns.
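For readers wanting to check whether their dataframe hits this case, here is a minimal sketch of the dtype inspection implied above (the dataframe and column names are made up for illustration; this is not vineyard's internal logic):

```python
import pandas as pd

# Hypothetical dataframe: string columns typically carry the 'object' dtype,
# which is the case associated with the performance degradation mentioned above.
df = pd.DataFrame({"id": [1, 2, 3],
                   "score": [0.5, 0.7, 0.9],
                   "name": ["a", "b", "c"]})
print(df.dtypes)

# Collect the columns stored as 'object' (i.e. likely Python strings):
string_cols = [c for c in df.columns if df[c].dtype == object]
```

Sharing the output of `df.dtypes` is usually enough to tell whether the string-column slow path applies.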
Hi @sighingnow. Basically, you could try out putting a
Thanks for the information. We'll give it a try to verify the result.
To clarify, the measurements mentioned above are not valid in most cases. The amended benchmarks show a 1.2x ~ 1.5x performance degradation compared with
The underlying reasons for the observed performance gap are:
@sighingnow Please verify the first statement; if so, this issue could be closed.
I have tested the code and done some profiling; the
Hi @qiranq99. Actually, the performance gap has nothing to do with the Cython/pybind11 calls. The gap exists because plasma internally uses multiple threads for concurrent memcpy (6 by default, see also: https://github.com/apache/arrow/blob/apache-arrow-11.0.0/python/pyarrow/_plasma.pyx#L532) while vineyard uses a single thread for memcpy. After enabling concurrent memcpy, vineyard achieves even higher throughput than plasma at the same level of parallelism when putting numpy ndarrays:
The benchmark case and the newly added concurrency control in the Python APIs can be found in #1646. From the results, you can see there are indeed improvements over plasma when putting large tensors. For small tensors, the gap remains because there are still opportunities to further improve the dispatch logic of builders and resolvers. Compared with plasma, vineyard also unlocks opportunities for more complex objects and new object possibilities. The optimization of builders and resolvers is already on our roadmap (issue #727).
The concurrent memcpy is only enabled for copies >= 4MB to optimize the overhead of creating threads. |
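The chunked-copy-with-threshold strategy described above can be sketched as follows. This is a hypothetical illustration in plain Python, not vineyard's actual implementation (which does this in C++, where the threads genuinely run in parallel; in CPython the GIL limits the real speedup of this sketch), and the names `CONCURRENT_COPY_THRESHOLD` and `concurrent_memcpy` are made up for the example:

```python
import os
from concurrent.futures import ThreadPoolExecutor

CONCURRENT_COPY_THRESHOLD = 4 * 1024 * 1024  # 4MB threshold mentioned above
NUM_COPY_THREADS = 6  # plasma's default level of parallelism

def concurrent_memcpy(dst: memoryview, src: memoryview,
                      threshold: int = CONCURRENT_COPY_THRESHOLD,
                      nthreads: int = NUM_COPY_THREADS) -> None:
    """Copy src into dst, splitting large copies across a thread pool."""
    n = len(src)
    if n < threshold or nthreads <= 1:
        # Small copy: the overhead of spinning up threads would dominate.
        dst[:n] = src
        return
    chunk = (n + nthreads - 1) // nthreads  # ceil-divide into nthreads slices

    def copy_chunk(i: int) -> None:
        lo, hi = i * chunk, min((i + 1) * chunk, n)
        dst[lo:hi] = src[lo:hi]

    with ThreadPoolExecutor(max_workers=nthreads) as pool:
        list(pool.map(copy_chunk, range(nthreads)))

# usage: copy an 8MB payload (above the threshold) and a tiny one (below it)
src = os.urandom(8 * 1024 * 1024)
dst = bytearray(len(src))
concurrent_memcpy(memoryview(dst), memoryview(src))
```

The threshold exists because thread creation and scheduling cost a roughly fixed amount per copy, which only amortizes once the payload is large.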
…#1646) Remove the problematic `.buffer` property (as it cannot bind the lifetime of the underlying blob to the memoryview object) and add concurrent support for memcpy for faster object building. Fixes #1631 Signed-off-by: Tao He <[email protected]>
Hi,
Under several benchmarks of putting data objects into the shared store (from 1KB to several GBs), we observed that `vineyard` underperforms `ray` (`plasma`), spending 2x-5x more time. As the data object size grows, the performance issue scales. Are there any specific reasons or sources of overhead?
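A minimal sketch of the kind of benchmark harness described here, for reproducibility. Since neither store is assumed to be running, `put` is a stand-in parameter for the actual `client.put()` of vineyard or plasma, and the function name `bench_put` and the size list are made up for illustration:

```python
import os
import time

def bench_put(put, sizes=(1 << 10, 1 << 20, 16 << 20)):  # 1KB, 1MB, 16MB
    """Time a single put of a random payload at each size; return {size: seconds}."""
    results = {}
    for n in sizes:
        data = os.urandom(n)
        t0 = time.perf_counter()
        put(data)
        results[n] = time.perf_counter() - t0
    return results

# Stand-in "put": copy the payload into a fresh buffer. With a real store,
# replace the lambda with e.g. `client.put` and compare the two result dicts.
timings = bench_put(lambda buf: bytearray(buf))
```

Plotting the per-size timings for both stores side by side makes the 2x-5x gap and its scaling with object size visible directly.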