Describe the Bug:
This performance issue first surfaced as test timeouts in shift_left_right.pass in #1928.
OpenMP performance of the shift_left and shift_right algorithms degrades significantly on CPUs with large core counts, especially for small-to-medium inputs, where each thread ends up with a very small grain size. I suspect this extends well beyond these two algorithms, so there may be more cases like this. In my opinion, the best option is to benchmark the OpenMP backend across different CPUs and then optimize where required.
oneDPL version: ac39d7e - The version is less relevant as it impacts the stable OMP backend.
Compiler version: less relevant, but I used: Intel(R) oneAPI DPC++/C++ Compiler 2024.2.1
OS: less relevant but I used Ubuntu 22.04
CPU: Intel(R) Xeon(R) Platinum 8480+
Here are the results I saw with shift_left_right.pass at different thread counts. Reproducing them requires a commit prior to #1928, since #1928 works around this issue in the test itself.
| OMP_NUM_THREADS | Runtime (s) |
|-----------------|-------------|
| 1               | 0.621       |
| 2               | 2.850       |
| 4               | 4.060       |
| 8               | 7.426       |
| 16              | 9.489       |
| 32              | 20.765      |
| 64              | 20.077      |
| 128             | 61.985      |
| unset           | 87.79       |
Ultimately, we will likely need more rigorous benchmarks to understand the impact of this issue on different problem sizes and then go from there with optimization work.
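As a starting point for such benchmarks, here is a minimal sketch of a size-sweep timing harness. The names `best_time_seconds` and `sweep_sizes` are hypothetical and not part of oneDPL; the harness only times a callable on vectors of increasing size and makes no assumption about which algorithm or backend the callable uses.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <limits>
#include <numeric>
#include <utility>
#include <vector>

// Hypothetical helper: best-of-`reps` wall time (in seconds) of `op` applied
// to a freshly initialized vector of `n` ints. Initialization is not timed.
template <typename Op>
double best_time_seconds(std::size_t n, int reps, Op op) {
    double best = std::numeric_limits<double>::infinity();
    for (int r = 0; r < reps; ++r) {
        std::vector<int> data(n);
        std::iota(data.begin(), data.end(), 0);
        const auto t0 = std::chrono::steady_clock::now();
        op(data);
        const auto t1 = std::chrono::steady_clock::now();
        best = std::min(best, std::chrono::duration<double>(t1 - t0).count());
    }
    return best;
}

// Hypothetical helper: times `op` for sizes first, first*factor, ... while the
// size does not exceed `last`, and returns (size, seconds) pairs.
template <typename Op>
std::vector<std::pair<std::size_t, double>> sweep_sizes(std::size_t first,
                                                        std::size_t last,
                                                        std::size_t factor,
                                                        int reps, Op op) {
    std::vector<std::pair<std::size_t, double>> results;
    if (first == 0 || factor < 2) return results;  // avoid an endless loop
    for (std::size_t n = first; n <= last; n *= factor) {
        results.emplace_back(n, best_time_seconds(n, reps, op));
    }
    return results;
}
```

To use it, pass a lambda that calls the algorithm under test on the vector, then repeat the sweep under each OMP_NUM_THREADS setting to see where the per-thread grain size becomes too small.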
Expected Behavior:
I would expect the OpenMP backend to better control the number of threads launched along with grain size based on the problem size, so we do not require intervention from the user to avoid the shown performance issues.
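One way this could look, as a rough sketch only: cap the thread count so that each thread receives at least a minimum number of elements. The helper `choose_num_threads`, the example loop `increment_all`, and the `min_grain` threshold are all hypothetical; this is not how the oneDPL OpenMP backend is implemented, and a suitable threshold would have to come from benchmarking.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of oneDPL): pick a thread count such that
// every thread gets at least `min_grain` elements, never exceeding
// `max_threads`. `min_grain` is a tuning assumption, not a measured value.
inline int choose_num_threads(std::size_t n, int max_threads, std::size_t min_grain) {
    if (max_threads <= 1 || min_grain == 0) return 1;
    // floor(n / min_grain), but always at least one thread
    const std::size_t by_grain = std::max<std::size_t>(1, n / min_grain);
    return static_cast<int>(
        std::min<std::size_t>(by_grain, static_cast<std::size_t>(max_threads)));
}

// Illustrative loop: small inputs run on one thread, large inputs may use up
// to `max_threads`. Without OpenMP enabled the pragma is ignored and the loop
// runs serially.
inline void increment_all(std::vector<int>& v, int max_threads, std::size_t min_grain) {
    const int t = choose_num_threads(v.size(), max_threads, min_grain);
    const long long n = static_cast<long long>(v.size());
    #pragma omp parallel for num_threads(t) if(t > 1)
    for (long long i = 0; i < n; ++i) {
        v[static_cast<std::size_t>(i)] += 1;
    }
}
```

With a `min_grain` of 4096, for example, an input of 100 elements stays on a single thread regardless of how many cores the machine has, while an input of 2^20 elements may use all 128 requested threads.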
To Reproduce:
cmake -DCMAKE_CXX_COMPILER=g++ -DCMAKE_CXX_STANDARD=20 -DONEDPL_BACKEND=omp -DONEDPL_DEVICE_TYPE=HOST -DCMAKE_CXX_FLAGS="-DTEST_LONG_RUN=1" -DCMAKE_BUILD_TYPE=Release