Modernize the threaded assembly howto #1068
Conversation
Codecov Report: All modified and coverable lines are covered by tests ✅

```
@@           Coverage Diff           @@
##           master    #1068   +/-   ##
=======================================
  Coverage   93.72%   93.73%
=======================================
  Files          39       39
  Lines        6011     6017    +6
=======================================
+ Hits         5634     5640    +6
  Misses        377      377
```
I get NaNs every now and then:

```julia
julia> nK4, nf4 = main(; n = 15, ntasks = 8) #src
  0.163506 seconds (141.79 k allocations: 20.707 MiB, 0.02% compilation time)
(1.759428561238959e13, 0.10506179512844058)

julia> nK4, nf4 = main(; n = 15, ntasks = 8) #src
  0.159455 seconds (141.79 k allocations: 20.707 MiB, 0.01% compilation time)
(1.759428561238959e13, 0.10506179512844058)

julia> nK4, nf4 = main(; n = 15, ntasks = 8) #src
  0.158596 seconds (141.79 k allocations: 20.707 MiB, 0.02% compilation time)
(NaN, NaN)
```
Try after fa2f1d2? Curious that it didn't always give trash data though...
Saw the fix, and I am also very confused that it worked in the first place (or why it only failed occasionally). Do you have access to a cluster node (or @KnutAM?) to check scaling? Or some other machine built to scale in the number of threads? Ideally with frequency scaling turned off. We should also check thread pinning to cores to give some recommendations here.
The issue remains on head.
It seems to only happen with 8 and 16 threads though (on an 8-core machine with hyperthreading).
Great to fix this issue!

```julia
@local scratch = ScratchData(dh, K, f, cellvalues)
(; cell_cache, cellvalues, Ke, fe, assembler) = scratch
```
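The race that `@local` fixes can be illustrated outside of OhMyThreads with plain Base tasks: give each task its own scratch buffer so that concurrent writes never alias. This is a minimal sketch with illustrative names (`sum_chunks`, `scratch`), not the howto's actual assembly code:

```julia
# Minimal sketch: per-task scratch buffers avoid the data race that
# sharing a single Ke/fe buffer across tasks would cause.
function sum_chunks(data, ntasks)
    chunks = Iterators.partition(data, cld(length(data), ntasks))
    tasks = map(chunks) do chunk
        Threads.@spawn begin
            scratch = zeros(1)      # task-local scratch, like @local
            for x in chunk
                scratch[1] += x     # only this task ever writes here
            end
            scratch[1]
        end
    end
    return sum(fetch.(tasks))
end

sum_chunks(1:1000, 4)  # 500500, deterministic regardless of scheduling
```

Sharing one `scratch` across all tasks instead would reproduce exactly the kind of intermittent corruption seen above.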
Ah, nice find. That probably explains the large number of allocations too.
The NaN issue seems to be gone now. However, it only removed about 1/3 of the allocations.
For me it reduced the allocations significantly (see the last commit).
These are my timings on the cluster with n=30:

```
Run with nthreads: 1
  8.996648 seconds (902 allocations: 816.172 KiB, 0.00% compilation time)
-----
Run with nthreads: 2
  4.957708 seconds (1.64 k allocations: 1.564 MiB, 0.00% compilation time)
-----
Run with nthreads: 4
  2.547859 seconds (3.12 k allocations: 3.099 MiB, 0.00% compilation time)
-----
Run with nthreads: 8
  1.555725 seconds (6.07 k allocations: 6.168 MiB, 0.00% compilation time)
-----
Run with nthreads: 16
  0.951255 seconds (11.97 k allocations: 12.306 MiB, 0.00% compilation time)
-----
Run with nthreads: 32
  0.614075 seconds (23.78 k allocations: 24.582 MiB, 0.01% compilation time)
```
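From these numbers the strong-scaling speedup and parallel efficiency follow directly (a quick sketch; the times are copied from the output above):

```julia
# Speedup S(p) = t(1) / t(p) and efficiency E(p) = S(p) / p
# for the cluster timings above (n = 30).
nthreads = [1, 2, 4, 8, 16, 32]
times    = [8.996648, 4.957708, 2.547859, 1.555725, 0.951255, 0.614075]

speedup    = times[1] ./ times
efficiency = speedup ./ nthreads
# speedup[end]    ≈ 14.65 out of an ideal 32
# efficiency[end] ≈ 0.458
```

So scaling is near-ideal up to 4 threads and drops to roughly 46% efficiency at 32 threads, which is why pinning and frequency-scaling effects are worth checking.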
```julia
cell_cache = CellCache(dh)
n = ndofs_per_cell(dh)
Ke = zeros(n, n)
fe = zeros(n)
```
Previously we allocated the scratch data in a different way, `fes = [zeros(n_basefuncs) for i in 1:nthreads]` (to avoid cache misses, I believe). Is that not needed anymore? This is definitely cleaner.
This constructor will be called once per task, yes.
```julia
OhMyThreads.@tasks for cellidx in color
    @set scheduler = scheduler
    ## Obtain a task local scratch and unpack it
    @local scratch = ScratchData(dh, K, f, cellvalues_tmp)
```
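For context, here is a self-contained sketch of what a `ScratchData`-style task-local container looks like. The field names follow the destructuring in the diff above, but this stand-in (`ScratchDataSketch`) uses plain arrays rather than Ferrite's actual `CellCache`/`CellValues`/assembler types:

```julia
# Simplified stand-in for the howto's ScratchData: one bundle of
# mutable buffers per task. The real struct would hold Ferrite's
# CellCache, CellValues, and assembler alongside Ke and fe.
struct ScratchDataSketch
    Ke::Matrix{Float64}  # local stiffness matrix buffer
    fe::Vector{Float64}  # local force vector buffer
end

# Constructor, invoked once per task by OhMyThreads' @local.
ScratchDataSketch(n::Integer) = ScratchDataSketch(zeros(n, n), zeros(n))

scratch = ScratchDataSketch(4)
(; Ke, fe) = scratch     # property destructuring, as in the diff
size(Ke), length(fe)     # (4, 4) and 4
```

Because `@local` evaluates this constructor once per task (not once per loop iteration), each task reuses its own buffers across iterations while tasks never share them.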
Is the scratch data only created once per task? It does not look like it, because it is inside the loop, but maybe that is what the macro is for?
With threadpinning (
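On the thread-pinning point raised earlier: a minimal sketch of pinning Julia threads to physical cores with the third-party ThreadPinning.jl package before benchmarking (assumed available; not runnable without it):

```julia
using ThreadPinning   # third-party package, assumed installed

pinthreads(:cores)    # pin one Julia thread per physical core,
                      # avoiding hyperthread siblings and OS migration
threadinfo()          # print the resulting thread-to-core mapping
```

Pinning removes scheduler-induced noise from timings and avoids placing two Julia threads on the same physical core, which matters for the 8/16-thread anomalies seen above.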
This patch rewrites the threaded assembly howto to use OhMyThreads.jl, which is a better interface to multithreading than "raw" `at-threads`. It also adds some more prose and explanations.