use SIMD.jl directly instead of LV.jl for fast_findmin() #84

Draft · wants to merge 6 commits into main

Conversation

@Moelf (Member) commented Oct 27, 2024

fix #83
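
For illustration, a minimal sketch of what a SIMD.jl-based fast_findmin might look like; the function name, the fixed lane width of 4, and the in-lane index recovery are placeholder choices, not necessarily the code on this branch:

using SIMD

function fast_findmin_simd(dij::Vector{Float64}, n::Int)
    lane = 4                                   # assumed SIMD width
    best = Inf
    best_i = 1
    i = 1
    @inbounds while i + lane - 1 <= n
        v = vload(Vec{4, Float64}, dij, i)     # load 4 contiguous values
        m = minimum(v)                         # horizontal minimum of the lane
        if m < best
            best = m
            for k in 0:lane-1                  # recover the winning index within this lane
                if dij[i + k] == m
                    best_i = i + k
                    break
                end
            end
        end
        i += lane
    end
    @inbounds while i <= n                     # scalar tail for leftover elements
        if dij[i] < best
            best = dij[i]
            best_i = i
        end
        i += 1
    end
    return best, best_i
end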

codecov bot commented Oct 27, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 70.59%. Comparing base (eab62d1) to head (fa51402).

Additional details and impacted files
@@            Coverage Diff             @@
##             main      #84      +/-   ##
==========================================
+ Coverage   70.18%   70.59%   +0.40%     
==========================================
  Files          17       17              
  Lines        1164     1180      +16     
==========================================
+ Hits          817      833      +16     
  Misses        347      347              

@graeme-a-stewart (Member) commented:

I will test this, but initial indications are that the performance on Apple silicon is rather bad

@Moelf (Member, author) commented Oct 28, 2024

hmm, I don't have access to Apple Silicon, but luckily we don't run JetReco on aarch64 in HEP yet

@graeme-a-stewart (Member) commented:

Yeah, but it's where I develop! And people testing FCC workflows, for example, do use their Mac laptops...

@graeme-a-stewart (Member) commented Oct 28, 2024

Yeah, it's quite a significant regression on Apple silicon. On my M2...

main

~/.julia/dev/JetReconstruction/examples/ [main*] julia --project instrumented-jetreco.jl --algorithm=AntiKt -R 0.4 ../test/data/events.pp13TeV.hepmc3.gz -m 32
Processed 100 events 32 times
Average time per event 170.6484634375 ± 4.057617987022607 μs
Lowest time per event 166.17292 μs

exorcise_lv

~/.julia/dev/JetReconstruction/examples/ [exorcise_lv] julia --project instrumented-jetreco.jl --algorithm=AntiKt -R 0.4 ../test/data/events.pp13TeV.hepmc3.gz -m 32
Processed 100 events 32 times
Average time per event 201.44234250000002 ± 3.8225514883321954 μs
Lowest time per event 197.07167 μs

I need to test also on x86 and follow-up with further benchmarks of suggestions which came up in the Discourse thread.

@graeme-a-stewart (Member) commented:

Benchmarks for x86_64, AMD Ryzen 7 5700G.

main

pc-sft-fa-0a :: dev/JetReconstruction/examples ‹main› » julia --project instrumented-jetreco.jl --algorithm=AntiKt -R 0.4 ../test/data/events.pp13TeV.hepmc3.gz -m 32
Processed 100 events 32 times
Average time per event 184.5104475 ± 6.358035705629633 μs
Lowest time per event 176.97703 μs

exorcise_lv

pc-sft-fa-0a :: dev/JetReconstruction/examples ‹exorcise_lv› » julia --project instrumented-jetreco.jl --algorithm=AntiKt -R 0.4 ../test/data/events.pp13TeV.hepmc3.gz -m 32
Processed 100 events 32 times
Average time per event 184.13992875 ± 6.5172848018939655 μs
Lowest time per event 177.66260999999997 μs

So it's really doing a good job on x86. We just have to find a way to make it also work well on Apple silicon.

@Moelf (Member, author) commented Oct 28, 2024

do you ever have NaN in this function? because:

julia> @fastmath foldl(min, [1.0, NaN, 0.5])
0.5

julia> foldl(min, [1.0, NaN, 0.5])
NaN

@graeme-a-stewart (Member) commented Oct 28, 2024

No, there can't be a NaN there. I think if you can have NaN then fast math's assumptions are violated and bad things happen.

@Moelf (Member, author) commented Oct 28, 2024

ok yeah, if you don't have NaN and don't care about -0.0 vs. 0.0, I think you can use @fastmath

@graeme-a-stewart (Member) commented:

ok yeah, if you don't have NaN and don't care about -0.0 vs. 0.0, I think you can use @fastmath

There shouldn't be any zeros at all, so -0.0 vs. +0.0 should be moot!
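
For what it's worth, a plain scalar loop leaning on @fastmath could be as simple as the sketch below; naive_fast_findmin is a placeholder name, and it relies on the no-NaN assumption discussed above, so it is not code from either branch:

# Scalar findmin loop; @fastmath lets the compiler reorder/vectorize the
# comparisons, which is only safe because dij is assumed to contain no NaN.
@inline function naive_fast_findmin(dij, n)
    best = dij[1]
    best_i = 1
    @fastmath @inbounds for i in 2:n
        if dij[i] < best
            best = dij[i]
            best_i = i
        end
    end
    return best, best_i
end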

@Moelf (Member, author) commented Oct 29, 2024

honestly, I might need to buy a Mac Mini just so I can test aarch64 performance....

@graeme-a-stewart (Member) commented:

honestly, I might need to buy a Mac Mini just so I can test aarch64 performance....

They are quite sweet little machines (especially now they have M4s) - I am thinking about getting one myself!

@graeme-a-stewart (Member) commented Oct 30, 2024

So I was thinking that if we do not find a generic solution, we can tolerate two ways to implement fast_findmin, switching on Sys.KERNEL (specifically isapple() and friends).

Though it would be good to understand whether it's the OS we should switch on or the CPU architecture - @aoanla would be able to help us run a few tests.
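
Something along these lines, purely as an illustration; fast_findmin_scalar and fast_findmin_simd are placeholder stubs here, and whether the right test is Sys.isapple() or Sys.ARCH is exactly the open question:

# Illustrative stubs standing in for the real scalar and SIMD.jl versions,
# just so the switch below runs as-is.
fast_findmin_scalar(dij, n) = findmin(view(dij, 1:n))
fast_findmin_simd(dij, n) = findmin(view(dij, 1:n))

# Pick one implementation at load time.
const fast_findmin = if Sys.isapple() && Sys.ARCH === :aarch64
    fast_findmin_scalar    # avoid the Apple-silicon regression
else
    fast_findmin_simd      # SIMD.jl path, fast on x86_64
end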

@carstenbauer commented:

honestly, I might need to buy a Mac Mini just so I can test aarch64 performance....

Poor man's workaround: ssh to an aarch64 macOS GitHub runner. ;)

graeme-a-stewart self-requested a review on November 4, 2024
graeme-a-stewart marked this pull request as draft on November 22, 2024
graeme-a-stewart added the Internals label (Changes that affect the internals of the package, but not the public API) on November 22, 2024
@Moelf (Member, author) commented Nov 22, 2024

btw, here's what LV.jl is doing:
https://gist.github.com/Moelf/7432073603f10d7718ef82552d5362dd

Labels: Internals (Changes that affect the internals of the package, but not the public API)
Successfully merging this pull request may close: Test removal of LoopVectorization.jl
3 participants