Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use MPI_Bcast instead of multiple p2p messages to update nest from parent #2059

Merged
merged 34 commits into from
Jan 22, 2024

Conversation

dkokron
Copy link
Contributor

@dkokron dkokron commented Dec 20, 2023

PR Author Checklist:

  • I have linked PR's from all sub-components involved in section below.

  • I am confirming reviews are completed in ALL sub-component PR's.

  • I have run the full RT suite on either Hera/Hercules AND have attached the log to this PR below this line:

  • I have added the list of all failed regression tests to "Anticipated changes" section.

  • I have filled out all sections of the template.

Performance profiling of a HAFS case on NOAA systems revealed significant of time was spent in fill_nested_grid_cpl(). The fill_nested_grid_cpl() routine from FV3/atmos_cubed_sphere/driver/fvGFS/atmosphere.F90 is showing up as a performance bottleneck. This routine gathers a global SST field (6,336,000 bytes) onto rank 0, then broadcasts that field to all ranks in the nest. The code uses point-to-point (p2p) messages (Isend/Recv) from rank 0 to the nest ranks. This communication pattern is maxing out the SlingShot-10 link on the first node resulting in a .15s hit every fifth time step.

The proposed fix is to modify the relevant FV3 code to use a single MPI_Bcast (via mpp_broadcast()) instead of multiple point-to-point messages (via mpp_send/mpp_recv). The use of mpp_broadcast depends on a fix to FMS that was merged on 16 June and is available in version 2023.02 of that package.

This PR depends on

This PR also depends on

I ran the UFS regression suite on acorn and cactus. Both runs resulted in "REGRESSION TEST WAS SUCCESSFUL". I do not have access to Hera/Hercules.

This change is zero-diff. No need to update baselines.

Commit Message

  • use single MPI_Bcast (via mpp_broadcast()) instead of multiple point-to-point messages (via mpp_send/mpp_recv)
  • address issue maxing out the SlingShot-10 link on the first node resulting in a .15s hit every fifth time step

Subcomponents involved:

  • AQM
  • CDEPS
  • CICE
  • CMEPS
  • CMakeModules
  • FV3
  • GOCART
  • HYCOM
  • MOM6
  • NOAHMP
  • WW3
  • stochastic_physics
  • none

Anticipated Changes

Input data

  • No changes are expected to input data.
  • Changes are expected to input data:
    • New input data.
    • Updated input data.

Regression Tests:

  • No changes are expected to any regression test.
  • Changes are expected to the following tests:
Tests effected by changes in this PR:

Libraries

  • Not Needed
  • Needed
    • Create separate issue in JCSDA/spack-stack asking for update to library. Include library name, library version.
    • Add issue link from JCSDA/spack-stack following this item
Code Managers Log
  • This PR is up-to-date with the top of all sub-component repositories except for those sub-components which are the subject of this PR.
  • Move new/updated input data on RDHPCS Hera and propagate input data changes to all supported systems.
    • N/A

Testing Log:

  • RDHPCS
    • Hera
    • Orion
    • Hercules
    • Jet
    • Gaea
    • Derecho
  • WCOSS2
    • Dogwood/Cactus
    • Acorn
  • CI
    • Completed
  • opnReqTest
    • N/A
    • Log attached to comment

Dan Kokron added 18 commits May 5, 2023 00:17
…epends on NOAA-GFDL/FMS#1246 to be functional.  More efficient 'if' test in fill_nested_grid_cpl()
…weather-model into hafs-BcastFillNestedGridCpl

Keeping pace with mainline
…fficient determination of ranks involved in the mpp_broadcast call in fill_nested_grid_cpl routine
@jkbk2004
Copy link
Collaborator

@dkokron can you sync up your branches? We plan to work on this pr by combining with #2066 tomorrow.

@BrianCurtis-NOAA
Copy link
Collaborator

BrianCurtis-NOAA commented Jan 18, 2024

@dkokron you'll want to make sure to pull from ufs-community/ufs-weather-model develop branch and then update your submodule hashes. This includes in your FV3 PR as well bring the updates into UFSWM.

@jkbk2004
Copy link
Collaborator

jkbk2004 commented Jan 18, 2024

@dkokron I think branches got synced ok. Can you check files changed?

@FernandoAndrade-NOAA FernandoAndrade-NOAA added the No Baseline Change No Baseline Change label Jan 19, 2024
@zach1221 zach1221 added the Ready for Commit Queue The PR is ready for the Commit Queue. All checkboxes in PR template have been checked. label Jan 21, 2024
@BrianCurtis-NOAA
Copy link
Collaborator

Skipping WCOSS2 as it's down today. Will be back at 20Z if PR gets delayed past that.

@zach1221
Copy link
Collaborator

@dkokron @jkbk2004 fv3atm is merged. Please update hash and revert gitmodule url.
Hash: NOAA-EMC/fv3atm@6c2b775

@zach1221 zach1221 merged commit adfcede into ufs-community:develop Jan 22, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
No Baseline Change No Baseline Change Ready for Commit Queue The PR is ready for the Commit Queue. All checkboxes in PR template have been checked.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants