Investigate and resolve performance issues in the OpenMP backend on large core count CPUs #1929

Open
mmichel11 opened this issue Nov 1, 2024 · 0 comments

Describe the Bug:
This performance issue was first observed as test timeouts in shift_left_right.pass in #1928, but I suspect it extends beyond this algorithm.

OpenMP performance of the shift_left and shift_right algorithms degrades significantly on CPUs with large core counts, especially for small-to-medium inputs, where each thread ends up with a very small grain size. I believe this potentially extends well beyond these two algorithms, so there may be more cases like this. In my opinion, the best option is to benchmark the OpenMP backend across different CPUs and then optimize where required.
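To make the grain-size concern concrete, here is a minimal sketch of the arithmetic. The helper name `grain_per_thread` and the example sizes are hypothetical and are not taken from oneDPL or from the failing test; the sketch only assumes an even static split of the input across threads.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper for illustration only (not a oneDPL function):
// number of elements each thread receives when n elements are split
// evenly across num_threads threads, rounded up.
inline std::size_t grain_per_thread(std::size_t n, std::size_t num_threads)
{
    assert(num_threads > 0);
    return (n + num_threads - 1) / num_threads;
}
```

Under this assumption, a 1000-element input gives each of 8 threads 125 elements, but each of 128 threads only 8 elements, at which point the cost of launching and synchronizing threads can plausibly outweigh the useful work.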

To Reproduce:

  • CMake command: cmake -DCMAKE_CXX_COMPILER=g++ -DCMAKE_CXX_STANDARD=20 -DONEDPL_BACKEND=omp -DONEDPL_DEVICE_TYPE=HOST -DCMAKE_CXX_FLAGS="-DTEST_LONG_RUN=1" -DCMAKE_BUILD_TYPE=Release
  • oneDPL version: ac39d7e. The exact version is less relevant, as the issue affects the stable OMP backend.
  • Compiler version: less relevant, but I used Intel(R) oneAPI DPC++/C++ Compiler 2024.2.1
  • OS: less relevant, but I used Ubuntu 22.04
  • CPU: Intel(R) Xeon(R) Platinum 8480+

Here are the results I saw with shift_left_right.pass at different thread counts. Reproducing them requires a commit prior to #1928, since #1928 works around this issue in the test itself.

| OMP_NUM_THREADS | Runtime (s) |
|-----------------|-------------|
| 1               | 0.621       |
| 2               | 2.850       |
| 4               | 4.060       |
| 8               | 7.426       |
| 16              | 9.489       |
| 32              | 20.765      |
| 64              | 20.077      |
| 128             | 61.985      |
| unset           | 87.79       |

Ultimately, we will likely need more rigorous benchmarks to understand the impact of this issue on different problem sizes and then go from there with optimization work.
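As a starting point for such benchmarks, a small timing harness along these lines could be used. This is only a sketch: `median_runtime_seconds` is a hypothetical helper, not part of oneDPL or its test suite, and it measures wall-clock time for whatever callable it is given.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical benchmarking helper (not part of oneDPL).
// Calls `work` exactly `reps` times and returns the median wall-clock
// time in seconds (the upper median when `reps` is even).
template <typename Work>
double median_runtime_seconds(Work&& work, int reps)
{
    assert(reps > 0);
    std::vector<double> samples;
    samples.reserve(static_cast<std::size_t>(reps));
    for (int i = 0; i < reps; ++i)
    {
        const auto start = std::chrono::steady_clock::now();
        work();
        const auto stop = std::chrono::steady_clock::now();
        samples.push_back(std::chrono::duration<double>(stop - start).count());
    }
    // Partially sort so the middle element is the median sample.
    const std::size_t mid = samples.size() / 2;
    std::nth_element(samples.begin(), samples.begin() + mid, samples.end());
    return samples[mid];
}
```

In a real benchmark, `work` would wrap the algorithm under test (for example a shift_left call with a parallel execution policy), and the measurement would be repeated over a range of input sizes and OMP_NUM_THREADS values. That part is omitted here because it depends on the oneDPL build configuration.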

Expected Behavior:
I would expect the OpenMP backend to choose the number of threads and the grain size based on the problem size, so that users do not need to intervene to avoid the performance issues shown above.
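One possible shape for such a policy is sketched below. It is an assumption for discussion, not the current or planned oneDPL behavior: `capped_thread_count` and the `min_grain` tuning constant are hypothetical names, and a suitable value for `min_grain` would have to come from benchmarking.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical heuristic (not oneDPL's implementation): use no more
// threads than the input can keep busy, so that each thread receives
// at least min_grain elements whenever more than one thread is used.
// min_grain is an assumed tuning constant, not a oneDPL parameter.
inline std::size_t capped_thread_count(std::size_t n,
                                       std::size_t max_threads,
                                       std::size_t min_grain)
{
    assert(max_threads > 0 && min_grain > 0);
    // Number of full min_grain-sized chunks the input can supply.
    const std::size_t useful = n / min_grain;
    // Always use at least one thread, never more than max_threads.
    return std::min(max_threads, std::max<std::size_t>(useful, 1));
}
```

With `min_grain` set to 256, for instance, a 1000-element input would be limited to 3 threads even when 128 are available, while a 1,000,000-element input would still use all 128. The result could then be passed to the `num_threads` clause of an OpenMP parallel region.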
