coll/xhc: Bring in latest developments #12908

Open · wants to merge 1 commit into main
Conversation

@gkatev (Contributor) commented Nov 4, 2024

Multiple changes, new features, and enhancements

Notable new features:

  • Inlining data in the control path for extra-small messages.
  • Allreduce goes through Bcast's implementation instead of duplicating it.
  • Fine-grained op-wise tuning for hierarchy, chunk size, and cico threshold.
  • Reduce reworked & improved, with double buffering and other optimizations.
  • When smsc support is not present, handle small messages (below the cico threshold) instead of handling nothing at all.

Signed-off-by: George Katevenis <[email protected]>

The stock micro-benchmarks reuse the same buffer for each iteration without altering it. Since modern processors implicitly cache data, this can lead to unrealistic/unrepresentative results, given that real-world applications do not (usually/optimally!) perform duplicate operations.
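
As a minimal sketch of that idea (not the actual modified benchmark code; the pool size, message size, and iteration count below are arbitrary), the timing loop can rotate through a pool of buffers and alter the data at the root each iteration, so the measured transfers are not served entirely out of cache:

```c
/* Sketch: rotate through a buffer pool and touch the data each iteration,
 * so repeated broadcasts don't keep hitting the same cache-resident buffer. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define POOL_SIZE 16          /* arbitrary: enough buffers to exceed the caches */
#define MSG_SIZE  (1 << 20)
#define ITERS     1000

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char *pool[POOL_SIZE];
    for (int i = 0; i < POOL_SIZE; i++) {
        pool[i] = malloc(MSG_SIZE);
        memset(pool[i], i, MSG_SIZE);
    }

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();

    for (int it = 0; it < ITERS; it++) {
        char *buf = pool[it % POOL_SIZE];   /* different buffer each iteration */
        if (rank == 0)
            buf[it % MSG_SIZE] = (char) it; /* the root alters what it sends */
        MPI_Bcast(buf, MSG_SIZE, MPI_BYTE, 0, MPI_COMM_WORLD);
    }

    double t1 = MPI_Wtime();
    if (rank == 0)
        printf("avg Bcast time: %.3f us\n", (t1 - t0) / ITERS * 1e6);

    for (int i = 0; i < POOL_SIZE; i++)
        free(pool[i]);

    MPI_Finalize();
    return 0;
}
```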

I've also observed that using empty buffers doesn't provide accurate latency measurements while benchmarking nt-buffer-transfer (Ref: openucx/ucx#9408) in intra-node use cases.

I appreciate you sharing the modified version of the OSU benchmarks; I will try it out.

I noticed another aspect of the OSU benchmarks: they always use page-aligned buffers for latency measurement. This may not accurately represent real-world MPI applications, which might use unaligned transfers and potentially suffer different penalties, even substantially decreased performance on some AMD CPUs (https://lunnova.dev/articles/ryzen-slow-short-rep-mov/). Have you encountered similar findings in your experience?
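
If it helps, one way to probe that (a sketch only, unrelated to this PR; the 5-byte offset and message size are arbitrary) is to allocate the buffer page-aligned and then deliberately shift its start by a few bytes before timing the collective; an offset of 0 reproduces the OSU-style aligned case:

```c
/* Sketch: time a collective with a deliberately unaligned buffer.
 * An offset of 0 reproduces the page-aligned (OSU-style) case. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    size_t page     = (size_t) sysconf(_SC_PAGESIZE);
    size_t msg_size = 1 << 20;
    size_t offset   = (argc > 1) ? (size_t) atoi(argv[1]) : 5; /* arbitrary misalignment */

    char *base = NULL;
    if (posix_memalign((void **) &base, page, msg_size + page) != 0)
        MPI_Abort(MPI_COMM_WORLD, 1);
    char *buf = base + offset;              /* intentionally unaligned start */

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int it = 0; it < 1000; it++)
        MPI_Bcast(buf, (int) msg_size, MPI_BYTE, 0, MPI_COMM_WORLD);
    double t1 = MPI_Wtime();

    if (rank == 0)
        printf("offset %zu: avg %.3f us\n", offset, (t1 - t0) / 1000 * 1e6);

    free(base);
    MPI_Finalize();
    return 0;
}
```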

gkatev (Contributor, Author) commented

Thanks for the comment. Indeed, the stock micro-benchmark implementation is problematic for measuring intra-node latency, especially for things like non-hierarchical broadcast.

No, I haven't done any work around the page-alignment concerns you raise; if you do any testing and find something out of place, send over a result or two :)


@gkatev Do you have a table with performance-improvement numbers for any HPC apps (or any newer AI apps) with XHC in intra-node cases, with details like ranks × threads, mapping, binding, machine/CPU details, etc.?

gkatev (Contributor, Author) commented

Have you had a look at our paper "A framework for hierarchical single-copy MPI collectives on multicore nodes" @ Cluster 2022? It includes a few HPC application benchmarks with XHC, though there have been many improvements to XHC between the version in that paper and the one in this PR.

@arun-chandran-edarath commented Nov 15, 2024

I went through the paper and I see you are getting good improvements with PiSvM, miniAMR and CNTK. Have you had a chance to test the performance with WRF, NWCHEM, or OpenFOAM?

gkatev (Contributor, Author) commented

Of those, I've only tried OpenFOAM at some point. If I recall correctly, it didn't make particularly heavy use of collectives like Allreduce, Reduce, Bcast, or Barrier. Do you know how heavily WRF and NWCHEM use collectives?


The collective percentage should depend on the type of input used, right? I don't have a good idea of the MPI percentage or the per-collective breakdown for any particular input; an application expert might help, I guess :)
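
For what it's worth, one rough way to get that number for a specific application and input (a sketch only, not tied to this PR) is a small PMPI interposer that times the collectives of interest without modifying the application's source; wrappers for Reduce, Bcast, and Barrier would follow the same pattern:

```c
/* Sketch of a PMPI interposer that measures how much time an application
 * spends in MPI_Allreduce. Build it as a shared library and LD_PRELOAD it,
 * or link it into the application. */
#include <mpi.h>
#include <stdio.h>

static double allreduce_time  = 0.0;
static long   allreduce_calls = 0;

int MPI_Allreduce(const void *sendbuf, void *recvbuf, int count,
                  MPI_Datatype datatype, MPI_Op op, MPI_Comm comm) {
    double t0 = MPI_Wtime();
    int ret = PMPI_Allreduce(sendbuf, recvbuf, count, datatype, op, comm);
    allreduce_time += MPI_Wtime() - t0;
    allreduce_calls++;
    return ret;
}

int MPI_Finalize(void) {
    int rank;
    PMPI_Comm_rank(MPI_COMM_WORLD, &rank);
    /* Report per-rank totals; compare against overall runtime to estimate
     * the collective share for the given input. */
    printf("[rank %d] MPI_Allreduce: %ld calls, %.3f s total\n",
           rank, allreduce_calls, allreduce_time);
    return PMPI_Finalize();
}
```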
