ch4/shm: fix performance degradation on Sapphire Rapids with Intel Compiler #7150
Open
yfguo wants to merge 1 commit into pmodels:main from yfguo:fix-shm-perf
[Benchmark results: Inter-NUMA, Sunspot, icx]
[Benchmark results: Intra-NUMA, Sunspot, icx]
[Benchmark results: AVX in MPICH configure vs AVX in MPL configure, intra-NUMA, Sunspot, icx]
test:mpich/ch4/ofi
[Benchmark results: Inter-NUMA, TOPO enabled vs disabled, Intel Compiler]
Fix the performance degradation on Intel Sapphire Rapids that appeared after topo-aware SHM was introduced. The problem only occurs when building with the Intel compiler. The cause is that topo-aware SHM defaults to disabled, in which case inter-NUMA messages use regular memcpy, whereas v4.2.2 used a non-temporal copy. Topo-aware SHM was disabled by default because the non-temporal copy results in higher latency for small messages. After more testing on different CPUs (Broadwell, Skylake, Cascade Lake, Ice Lake, Milan), it seems that only Skylake, Cascade Lake, and Ice Lake have this issue with small messages. It is probably OK to make topo-aware SHM default to enabled.
Pull Request Description
@zhenggb72 reported a performance degradation in inter-NUMA SHM communication compared to v4.2.2. The issue was introduced in #7046. MPICH v4.2.2 achieved ~14us latency for a 64KB message, but the latency rose to ~23us after #7046. Setting MPIR_CVAR_CH4_SHM_POSIX_TOPO_ENABLE=true solves the problem.
The issue is caused by a change in the memcpy operation. v4.2.2 uses non-temporal stores for both intra-NUMA and inter-NUMA SHM communication. This was changed to regular memcpy when topo-aware SHM is disabled. The change was made because non-temporal stores have higher latency in intra-NUMA communication on some architectures (see the Milan results below). Non-temporal stores also have higher latency for small inter-NUMA messages on other architectures (Skylake, Cascade Lake, Ice Lake).
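The difference between the two copy paths can be sketched in C as follows. This is a minimal illustration, not the MPICH or MPL implementation: the names copy_nontemporal, use_nontemporal, and shm_copy are hypothetical, and the selection policy (non-temporal stores only for inter-NUMA copies when topo-aware SHM is enabled) is an assumption inferred from the description above. Only standard SSE2 intrinsics are used, with a plain memcpy fallback when SSE2 is not available.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__)
#include <emmintrin.h>  /* SSE2: _mm_loadu_si128, _mm_stream_si128 */
#endif

/* Hypothetical helper (not the MPICH/MPL implementation): copy n bytes
 * using non-temporal (streaming) stores, which hint the CPU to write
 * around the cache instead of filling it with the destination data. */
static void copy_nontemporal(void *dst, const void *src, size_t n)
{
#if defined(__SSE2__)
    char *d = (char *) dst;
    const char *s = (const char *) src;

    /* _mm_stream_si128 needs a 16-byte aligned destination, so copy the
     * unaligned head with regular memcpy first. */
    size_t head = (16 - ((uintptr_t) d & 15)) & 15;
    if (head > n)
        head = n;
    memcpy(d, s, head);
    d += head;
    s += head;
    n -= head;

    while (n >= 16) {
        __m128i v = _mm_loadu_si128((const __m128i *) s);
        _mm_stream_si128((__m128i *) d, v);
        d += 16;
        s += 16;
        n -= 16;
    }
    _mm_sfence();       /* order the streaming stores before later stores */
    memcpy(d, s, n);    /* remaining tail, fewer than 16 bytes */
#else
    memcpy(dst, src, n);        /* fallback when SSE2 is unavailable */
#endif
}

/* Assumed policy, inferred from the description: with topo-aware SHM
 * enabled, only inter-NUMA copies use non-temporal stores; with it
 * disabled, every copy uses regular memcpy. */
static int use_nontemporal(int topo_enabled, int inter_numa)
{
    return topo_enabled && inter_numa;
}

static void shm_copy(void *dst, const void *src, size_t n,
                     int topo_enabled, int inter_numa)
{
    assert(dst != NULL && src != NULL);
    if (use_nontemporal(topo_enabled, inter_numa))
        copy_nontemporal(dst, src, n);
    else
        memcpy(dst, src, n);
}
```

Both paths produce the same bytes in the destination; they differ only in cache behavior, which is why the latency impact depends on message size and architecture.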
After more comprehensive testing on Broadwell, Skylake, Cascade Lake, Ice Lake, Sapphire Rapids, and Milan, I think it is probably OK to make topo-aware SHM default to enabled, which would yield better performance on Sapphire Rapids and Milan. Detailed numbers can be found in the following comments.
This PR also addresses another source of performance degradation observed when building with the Intel compiler. PR #7074 consolidated the SSE2- and AVX-related optimization options into MPL's configure because only MPL explicitly uses them. (This is superseded by #7152.) The consolidation showed no performance degradation with the GNU compiler. With the Intel compilers, however, it does result in some performance degradation (see below). Therefore, we should add the options back to the main configure. Currently, the main configure checks for the availability of SSE2, AVX, and AVX512F, and adds them to CFLAGS. The MPL configure further checks for the specific instructions that are used in MPL.
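One way to check which of these instruction sets actually reached a given translation unit through CFLAGS is to inspect the compiler's predefined macros. The sketch below is not part of MPICH; compiled_isa_mask and the ISA_* constants are hypothetical names. It relies on __SSE2__, __AVX__, and __AVX512F__, which GCC and Clang-based compilers define when the corresponding target options are active.

```c
#include <stdio.h>

/* Hypothetical bit flags for this sketch only. */
enum { ISA_SSE2 = 1, ISA_AVX = 2, ISA_AVX512F = 4 };

/* Report which instruction sets the compiler was allowed to use for
 * this translation unit, based on its predefined macros. */
static int compiled_isa_mask(void)
{
    int mask = 0;
#if defined(__SSE2__)
    mask |= ISA_SSE2;
#endif
#if defined(__AVX__)
    mask |= ISA_AVX;
#endif
#if defined(__AVX512F__)
    mask |= ISA_AVX512F;
#endif
    return mask;
}

/* Print the result in a readable form. */
static void print_compiled_isa(void)
{
    int mask = compiled_isa_mask();
    printf("SSE2=%d AVX=%d AVX512F=%d\n",
           (mask & ISA_SSE2) != 0,
           (mask & ISA_AVX) != 0,
           (mask & ISA_AVX512F) != 0);
}
```

Compiling the same file with the flags used for the main MPICH sources and with those used for MPL would show whether the two builds see the same instruction sets.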
All raw numbers:
2024-shm_bench-arch_comparison.xlsx
Author Checklist
Particularly focus on why, not what. Reference background, issues, test failures, xfail entries, etc.
Commits are self-contained and do not do two things at once.
Commit message is of the form:
module: short description
Commit message explains what's in the commit.
Whitespace checker. Warnings test. Additional tests via comments.
For non-Argonne authors, check contribution agreement.
If necessary, request an explicit comment from your company's PR approval manager.