coll/xhc: Bring in latest developments #12908

Open · wants to merge 1 commit into main
Conversation

@gkatev (Contributor) commented Nov 4, 2024

Multiple changes, new features, and enhancements

Notable new features:

  • Inlining data in the control path for extra-small messages.
  • Allreduce goes through Bcast's implementation instead of duplicating it.
  • Fine-grained op-wise tuning for hierarchy, chunk size, and cico threshold.
  • Reduce reworked & improved, with double buffering and other optimizations.
  • When smsc support is not present, handle small messages (below the cico threshold) instead of handling nothing at all.

Signed-off-by: George Katevenis <[email protected]>

The stock micro-benchmarks reuse the same buffer for each iteration without altering it. Since modern processors implicitly cache data, this can lead to unrealistic/unrepresentative results, given that real-world applications do not (usually/optimally!) perform duplicate operations.
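
As a minimal sketch of that idea (not the actual modified benchmark code; the pool size, message size, and iteration count below are arbitrary), the timing loop can rotate through a pool of buffers and alter the data at the root each iteration, so the measured transfers are not served entirely out of cache:

```c
/* Sketch: rotate through a buffer pool and touch the data each iteration,
 * so repeated broadcasts don't keep hitting the same cache-resident buffer. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define POOL_SIZE 16          /* arbitrary: enough buffers to exceed the caches */
#define MSG_SIZE  (1 << 20)
#define ITERS     1000

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char *pool[POOL_SIZE];
    for (int i = 0; i < POOL_SIZE; i++) {
        pool[i] = malloc(MSG_SIZE);
        memset(pool[i], i, MSG_SIZE);
    }

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();

    for (int it = 0; it < ITERS; it++) {
        char *buf = pool[it % POOL_SIZE];   /* different buffer each iteration */
        if (rank == 0)
            buf[it % MSG_SIZE] = (char) it; /* the root alters what it sends */
        MPI_Bcast(buf, MSG_SIZE, MPI_BYTE, 0, MPI_COMM_WORLD);
    }

    double t1 = MPI_Wtime();
    if (rank == 0)
        printf("avg Bcast time: %.3f us\n", (t1 - t0) / ITERS * 1e6);

    for (int i = 0; i < POOL_SIZE; i++)
        free(pool[i]);

    MPI_Finalize();
    return 0;
}
```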

I've also observed that using empty buffers doesn't provide accurate latency measurements while benchmarking nt-buffer-transfer (Ref: openucx/ucx#9408) in intra-node use cases.

I appreciate you sharing the modified version of the OSU benchmarks; I will try it out.

I noticed another aspect of the OSU benchmarks: they always use page-aligned buffers for latency measurement. This may not accurately represent real-world MPI applications, which might use unaligned transfers and potentially suffer different penalties, even substantially decreased performance on some AMD CPUs (https://lunnova.dev/articles/ryzen-slow-short-rep-mov/). Have you encountered similar findings in your experience?
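
If it helps, one way to probe that (a sketch only, unrelated to this PR; the 5-byte offset and message size are arbitrary) is to allocate the buffer page-aligned and then deliberately shift its start by a few bytes before timing the collective; an offset of 0 reproduces the OSU-style aligned case:

```c
/* Sketch: time a collective with a deliberately unaligned buffer.
 * An offset of 0 reproduces the page-aligned (OSU-style) case. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    size_t page     = (size_t) sysconf(_SC_PAGESIZE);
    size_t msg_size = 1 << 20;
    size_t offset   = (argc > 1) ? (size_t) atoi(argv[1]) : 5; /* arbitrary misalignment */

    char *base = NULL;
    if (posix_memalign((void **) &base, page, msg_size + page) != 0)
        MPI_Abort(MPI_COMM_WORLD, 1);
    char *buf = base + offset;              /* intentionally unaligned start */

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int it = 0; it < 1000; it++)
        MPI_Bcast(buf, (int) msg_size, MPI_BYTE, 0, MPI_COMM_WORLD);
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("offset %zu: avg %.3f us\n", offset, (t1 - t0) / 1000 * 1e6);

    free(base);
    MPI_Finalize();
    return 0;
}
```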

gkatev (Contributor, Author) commented

Thanks for the comment. Indeed, the stock micro-benchmark implementation is problematic for measuring intra-node latency, especially for things like non-hierarchical broadcast.

No, I haven't done any work around the page-alignment concerns you raise; if you do any testing and find something out of place, send over a result or two :)


@gkatev Do you have a table with performance-improvement numbers for any HPC apps (or any newer AI apps) with XHC in intra-node cases, with details like ranks × threads, mapping, binding, machine/CPU details, etc.?

gkatev (Contributor, Author) commented

Have you had a look at our paper "A framework for hierarchical single-copy MPI collectives on multicore nodes" @ Cluster 2022? It includes a few HPC application benchmarks with XHC, though there have been many improvements to XHC between the version in that paper and the one in this PR.

@arun-chandran-edarath commented Nov 15, 2024

I went through the paper and I see you are getting good improvements with PiSvM, miniAMR and CNTK. Have you had a chance to test the performance with WRF, NWCHEM, or OpenFOAM?

gkatev (Contributor, Author) commented

Of those, I've only tried OpenFOAM at some point. If I recall correctly, it didn't make particularly heavy use of collectives like Allreduce, Reduce, Bcast, or Barrier. Do you know how heavily WRF and NWCHEM use collectives?


The collective percentage should depend on the type of input used, right? I don't have a good idea of the MPI percentage or the per-collective breakdown for any particular input; an application expert might help, I guess :)
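
For what it's worth, one rough way to get that number for a specific application and input (a sketch only, not tied to this PR) is a small PMPI interposer that times the collectives of interest without modifying the application's source; wrappers for Reduce, Bcast, and Barrier would follow the same pattern:

```c
/* Sketch of a PMPI interposer that measures how much time an application
 * spends in MPI_Allreduce. Build it as a shared library and LD_PRELOAD it,
 * or link it into the application. */
#include <mpi.h>
#include <stdio.h>

static double allreduce_time  = 0.0;
static long   allreduce_calls = 0;

int MPI_Allreduce(const void *sendbuf, void *recvbuf, int count,
                  MPI_Datatype datatype, MPI_Op op, MPI_Comm comm) {
    double t0 = MPI_Wtime();
    int ret = PMPI_Allreduce(sendbuf, recvbuf, count, datatype, op, comm);
    allreduce_time += MPI_Wtime() - t0;
    allreduce_calls++;
    return ret;
}

int MPI_Finalize(void) {
    int rank;
    PMPI_Comm_rank(MPI_COMM_WORLD, &rank);
    /* Report per-rank totals; compare against overall runtime to estimate
     * the collective share for the given input. */
    printf("[rank %d] MPI_Allreduce: %ld calls, %.3f s total\n",
           rank, allreduce_calls, allreduce_time);
    return PMPI_Finalize();
}
```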
