performance overhead of cilk_span vs parlay::par_do #135
Comments
I'm starting to look into this issue, but at the moment I only have access to an ARM64 system, and the Parlay scheduler doesn't work correctly there (oftentimes it hangs, sometimes it returns the wrong sum, and at one point it crashed with a failed assertion). I'll take another look once I get access to an Intel box.
I confirmed the behavior on x86. Time with log size 20 is greater with the Cilk version and is quite variable. On my FreeBSD ARM system the parlay version tends to hang.
I managed to run some tests on an Intel machine. I'm still trying to diagnose the issue, but here are some more notes of things I've found so far:
@VoxSciurorum For the |
So I tried to generate a bunch of data to see which cases matter. I added two more versions to the above code: one using a cilk_for with a new reducer, and another serial version; those two versions are sketched below. I first tried generating data for all log sizes between 20 and 30 for 101 trials and was getting a much smaller difference than I was seeing before, so I hypothesized that there was some sort of start-up cost. So I also added the data for 11 trials.

The data can be found here: https://docs.google.com/spreadsheets/d/1F5V629TlNribEs7ipiBOCNFu-eek0fn9oyb3M9FTO74/edit?usp=sharing. Each sheet says in what configuration it was run and how many trials were run. On each sheet I have the time in microseconds for each size for all 4 different sums, and I present the 10%, 50%, and 90% quantiles. I also show the speedup over the serial code each one got, as well as the slowdown from the fastest version.

What I see from the data is that when running only 11 trials, parlay seems to have the most wins in any configuration with NUMA or hyperthreading. This trend is much less pronounced when running 101 trials, but the standard configuration on all the cores is still the worst for Cilk, and no NUMA, no hyperthreads is the best. Let me know if anything about my experiment doesn't make sense, or if it would be helpful for me to run anything else.
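The two extra versions aren't captured in this thread excerpt. Here is a minimal sketch of what a cilk_for-plus-reducer sum and a serial sum could look like, assuming OpenCilk 2.x reducer syntax and a vector of uint64_t (the function names and element type are assumptions, not the original code):

```cpp
#include <cstdint>
#include <vector>
#include <cilk/cilk.h>

// Identity and reduce callbacks for the OpenCilk 2.x reducer hyperobject.
static void zero_u64(void *v) { *static_cast<uint64_t *>(v) = 0; }
static void add_u64(void *left, void *right) {
  *static_cast<uint64_t *>(left) += *static_cast<uint64_t *>(right);
}

// Parallel sum with cilk_for accumulating into a reducer view.
uint64_t sum_cilk_for(const std::vector<uint64_t> &a) {
  uint64_t cilk_reducer(zero_u64, add_u64) total = 0;
  cilk_for (size_t i = 0; i < a.size(); ++i) {
    total += a[i];
  }
  return total;
}

// Plain serial baseline.
uint64_t sum_serial(const std::vector<uint64_t> &a) {
  uint64_t total = 0;
  for (uint64_t x : a) total += x;
  return total;
}
```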
In general, Cilk spawning has less overhead than parlaylib task creation but Cilk scheduling has higher overhead than parlaylib scheduling. This puts Cilk at a disadvantage when very small amounts of work are stolen. We would normally recommend the |
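The recommendation at the end of that comment is cut off above. Purely as an illustrative sketch (not necessarily what was being recommended), one common way to keep stolen tasks from being too small is to coarsen the recursion with a serial base case, so any stolen subtask carries enough work to pay for the steal; the cutoff value here is an assumption:

```cpp
#include <cstdint>
#include <cilk/cilk.h>

// Hypothetical cutoff: below this many elements, sum serially instead of
// spawning further.
static constexpr size_t GRAIN = 2048;

uint64_t sum_coarsened(const uint64_t *a, size_t lo, size_t hi) {
  if (hi - lo <= GRAIN) {
    uint64_t total = 0;
    for (size_t i = lo; i < hi; ++i) total += a[i];
    return total;
  }
  size_t mid = lo + (hi - lo) / 2;
  uint64_t left, right;
  left = cilk_spawn sum_coarsened(a, lo, mid);
  right = sum_coarsened(a, mid, hi);
  cilk_sync;
  return left + right;
}
```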
Describe the bug
I noticed I was getting slower than expected performance in a fairly simple parallel sum and compared it against comparable code using parlaylib's scheduler. I made a minimal example which shows that, for a fairly simple parallel sum, the parlaylib version is about 5x faster for small cases. The difference decreases as the total size increases.
For a vector of length 2^20, Cilk took a median of 600 microseconds to sum the elements, while parlaylib only took about 120 microseconds.
By length 2^25, parlaylib takes about 2,000 microseconds vs. Cilk's 3,500.
By 2^30, parlaylib takes about 78,000 microseconds vs. Cilk's 80,000.
Expected behavior
I would expect fairly similar performance from what I expected to be a compute- or memory-bound program.
OpenCilk version
clang version 14.0.6 (https://github.com/OpenCilk/opencilk-project fc90ded)
System information
NAME="Ubuntu"
VERSION="20.04.1 LTS (Focal Fossa)"
2 Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz
Steps to reproduce (include relevant output)
The code I used is below; you will need to clone the parlaylib repo and put it in the same directory.
compiled with
clang++ -Wall -Wextra -Ofast -std=c++20 -Wno-deprecated-declarations -g -gdwarf-4 -march=native -fopencilk -lpthread -o run main.cpp
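The main.cpp referred to above is not reproduced in this thread. The following is a hypothetical sketch of the kind of comparison described, assuming parlaylib is cloned into the same directory; the include path, serial cutoff, and element type are assumptions, not the original code:

```cpp
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <numeric>
#include <vector>

#include <cilk/cilk.h>
#include "parlaylib/include/parlay/parallel.h"  // assumed relative path

// Serial cutoff for both recursions; the original cutoff (if any) is unknown.
static constexpr size_t CUTOFF = 1024;

// Divide-and-conquer sum using cilk_spawn / cilk_sync.
uint64_t sum_cilk(const uint64_t *a, size_t n) {
  if (n <= CUTOFF) return std::accumulate(a, a + n, uint64_t{0});
  uint64_t left, right;
  left = cilk_spawn sum_cilk(a, n / 2);
  right = sum_cilk(a + n / 2, n - n / 2);
  cilk_sync;
  return left + right;
}

// The same recursion expressed with parlay::par_do.
uint64_t sum_parlay(const uint64_t *a, size_t n) {
  if (n <= CUTOFF) return std::accumulate(a, a + n, uint64_t{0});
  uint64_t left = 0, right = 0;
  parlay::par_do([&] { left = sum_parlay(a, n / 2); },
                 [&] { right = sum_parlay(a + n / 2, n - n / 2); });
  return left + right;
}

int main() {
  const size_t n = size_t{1} << 20;  // log size 20, as in the first data point
  std::vector<uint64_t> a(n, 1);

  // Time a single call and return microseconds.
  auto time_us = [](auto &&f) {
    auto start = std::chrono::steady_clock::now();
    volatile uint64_t r = f();
    auto stop = std::chrono::steady_clock::now();
    (void)r;
    return std::chrono::duration_cast<std::chrono::microseconds>(stop - start)
        .count();
  };

  std::printf("cilk_spawn sum:     %lld us\n",
              (long long)time_us([&] { return sum_cilk(a.data(), n); }));
  std::printf("parlay::par_do sum: %lld us\n",
              (long long)time_us([&] { return sum_parlay(a.data(), n); }));
  return 0;
}
```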