coll/xhc: Bring in latest developments #12908
base: main
Conversation
Multiple changes, new features, and enhancements.

Notable new features:
- Inlining data in the control path for extra-small messages (a rough sketch of the idea follows below).
- Allreduce goes through Bcast's implementation instead of duplicating it.
- Fine-grained op-wise tuning for hierarchy, chunk size, and cico threshold.
- Reduce reworked & improved, with double buffering and other optimizations.
- When smsc support is not present, handle only small messages (below the cico threshold), instead of none at all.

Signed-off-by: George Katevenis <[email protected]>
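To make the first item concrete, here is a rough, hypothetical sketch of the general "inlining" idea: for very small payloads, the data rides inside the shared-memory control slot that carries the ready flag, so the receiver avoids a separate trip to a cico data buffer. The struct layout, names, and the 32-byte limit are illustrative assumptions, not the actual xhc data structures.

```c
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

#define CTRL_INLINE_MAX 32   /* hypothetical inline-payload capacity */

typedef struct {
    _Atomic uint32_t seq;                 /* ready/sequence flag */
    uint32_t len;                         /* payload length if inlined */
    uint8_t inline_data[CTRL_INLINE_MAX]; /* payload rides with the flag */
} ctrl_slot_t;

/* Writer: publish a small payload together with the flag. */
void ctrl_publish(ctrl_slot_t *slot, const void *buf,
                  uint32_t len, uint32_t seq)
{
    if (len <= CTRL_INLINE_MAX) {
        memcpy(slot->inline_data, buf, len);
        slot->len = len;
    }
    /* Larger messages would go through the regular data path (not shown).
     * Release ordering makes the payload visible before the flag. */
    atomic_store_explicit(&slot->seq, seq, memory_order_release);
}

/* Reader: spin on the flag, then copy the inlined payload out. */
void ctrl_consume(ctrl_slot_t *slot, void *out, uint32_t seq)
{
    while (atomic_load_explicit(&slot->seq, memory_order_acquire) != seq)
        ;   /* spin-wait; a real implementation would yield/back off */
    memcpy(out, slot->inline_data, slot->len);
}
```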
buffer for each iteration without altering it. Since modern processors
implicitly cache data, this can lead to false/unrealistic/unrepresentative
results, given that actual real-world applications do not (usually/optimally!)
perform duplicate operations.
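For illustration, a minimal sketch of what rotating over a pool of buffers (instead of reusing one) might look like in a latency loop. This is not the modified OSU code itself; the pool size, message size, and iteration count are arbitrary assumptions.

```c
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define NBUFS 64        /* illustrative pool size */
#define MSG_SIZE 4096   /* illustrative message size */
#define ITER 1000

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* A pool of buffers; a different one is used on each iteration,
     * so consecutive iterations do not hit the same (cached) memory. */
    char *pool[NBUFS];
    for (int i = 0; i < NBUFS; i++)
        pool[i] = malloc(MSG_SIZE);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();

    for (int it = 0; it < ITER; it++)
        MPI_Bcast(pool[it % NBUFS], MSG_SIZE, MPI_BYTE, 0, MPI_COMM_WORLD);

    double t1 = MPI_Wtime();

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0)
        printf("avg latency: %.3f us\n", (t1 - t0) / ITER * 1e6);

    for (int i = 0; i < NBUFS; i++)
        free(pool[i]);
    MPI_Finalize();
    return 0;
}
```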
I've also observed that using empty buffers doesn't provide accurate latency measurements while benchmarking nt-buffer-transfer (Ref: openucx/ucx#9408) in intra-node use cases.
I appreciate you sharing the modified version of OSU-benchmarks, I will try this out.
I noticed another aspect of the OSU benchmarks: they always use page-aligned buffers for latency measurement. This may not accurately represent real-world MPI applications, which may perform unaligned transfers and thus suffer different penalties, even substantially decreased performance on some AMD CPUs (https://lunnova.dev/articles/ryzen-slow-short-rep-mov/). Have you encountered similar findings in your experience?
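If it helps with experimenting, here is a small hypothetical snippet for probing that concern: allocate a page-aligned region (as the stock benchmarks do), then deliberately offset the pointer handed to the collective so the transfer starts mid-cache-line. The offset and message size are arbitrary; this is not part of any existing benchmark.

```c
#define _POSIX_C_SOURCE 200112L
#include <mpi.h>
#include <stdlib.h>

#define MSG_SIZE (1 << 20)
#define MISALIGN 7         /* illustrative odd byte offset */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* Page-aligned allocation, as the stock benchmarks use... */
    void *base = NULL;
    if (posix_memalign(&base, 4096, MSG_SIZE + MISALIGN) != 0)
        MPI_Abort(MPI_COMM_WORLD, 1);

    /* ...but hand the collective an intentionally unaligned pointer. */
    char *buf = (char *) base + MISALIGN;

    MPI_Bcast(buf, MSG_SIZE, MPI_BYTE, 0, MPI_COMM_WORLD);

    free(base);
    MPI_Finalize();
    return 0;
}
```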
Thanks for the comment; indeed, the stock micro-benchmark implementation is problematic for measuring intra-node latency, especially for things like non-hierarchical broadcast.
No, I haven't done any work around the page-alignment concerns you raise; if you do any testing and find something out of place, send over a result or two :)
@gkatev Do you have a table with performance-improvement numbers for any HPC apps (or any new AI apps) with xHC for intra-node cases, with details like ranks x threads, mapping, binding, machine CPU details, etc.?
Have you had a look at our paper "A framework for hierarchical single-copy MPI collectives on multicore nodes" @ Cluster 2022? It includes a few HPC app benchmarks with XHC. But there have been many improvements to XHC between the version in that paper and the one in this PR.
I went through the paper and I see you are getting good improvements with PiSvM, miniAMR and CNTK. Have you had a chance to test the performance with WRF, NWCHEM, or OpenFOAM?
Of those I've only tried OpenFOAM at some point. If I recall correctly, it didn't make all that heavy use of collectives like Allreduce, Reduce, Bcast, or Barrier. Do you know how heavily WRF and NWCHEM use collectives?
The collective percentage should depend on the type of input used, right? I don't have much insight into how a particular input translates to overall MPI time or to individual collective percentages; an application expert might help, I guess :)