Gaea C6 support for UFSWM #2448

Draft
wants to merge 7 commits into
base: develop

Conversation

BrianCurtis-NOAA
Collaborator

Commit Queue Requirements:

  • Fill out all sections of this template.
  • All sub component pull requests have been reviewed by their code managers.
  • Run the full Intel+GNU RT suite (compared to current baselines) on either Hera/Derecho/Hercules
  • Commit 'test_changes.list' from previous step

Description:

This PR will bring in all changes necessary to provide Gaea C6 support for UFSWM

Commit Message:

* UFSWM - Gaea C6 Support

Priority:

  • Normal

Git Tracking

UFSWM:

  • Closes #
  • None

Sub component Pull Requests:

  • None

UFSWM Blocking Dependencies:

  • Blocked by #
  • None

Changes

Regression Test Changes (Please commit test_changes.list):

  • No Baseline Changes. (just adds logs for Gaea C6)

Input data Changes:

  • None.

Library Changes/Upgrades:

  • No Updates

Testing Log:

  • RDHPCS
    • Hera
    • Orion
    • Hercules
    • Jet
    • Gaea
    • Derecho
  • WCOSS2
    • Dogwood/Cactus
    • Acorn
  • CI
  • opnReqTest (complete task if unnecessary)

@BrianCurtis-NOAA
Collaborator Author

The cpld_control_p8 intel test fails by timing out, so the configs still need tweaking to better match the C6 hardware.

There are still plenty of other items to check here; this is just a placeholder for now. Please feel free to send PRs to my fork/branch to add, adjust, or fix anything.

@BrianCurtis-NOAA
Collaborator Author

Also, once things start falling into place, we'll need to make sure intelllvm support is available for C6.

@RatkoVasic-NOAA
Collaborator

@BrianCurtis-NOAA, name change suggestion:

gaea -----> gaea-c5
gaeac6 ---> gaea-c6
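
As a rough illustration of the suggested rename, the mapping could be expressed as a small shell function. This is only a sketch: `normalize_machine_id` is a hypothetical helper name and is not part of the UFSWM scripts.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in the repository): map the current machine IDs
# to the names proposed above and pass any other ID through unchanged.
normalize_machine_id() {
  case "$1" in
    gaea)   echo "gaea-c5" ;;
    gaeac6) echo "gaea-c6" ;;
    *)      echo "$1" ;;
  esac
}

normalize_machine_id gaea     # prints gaea-c5
normalize_machine_id gaeac6   # prints gaea-c6
normalize_machine_id hera     # prints hera
```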

@sanAkel
sanAkel commented Oct 4, 2024

@BrianCurtis-NOAA Shall I retry building with the modulefiles/ufs_gaeac6.intel.lua from this PR?

tests/compile.sh Outdated
@@ -95,7 +98,7 @@ export SUITES
set -ex

# Valid applications
-if [[ ${MACHINE_ID} != gaea ]] || [[ ${RT_COMPILER} != intelllvm ]]; then # skip MOM6SOLO on gaea with intelllvm
+if [[ ${MACHINE_ID} != gaea-c5 && ${MACHINE_ID} != gaea-c6 ]] || [[ ${RT_COMPILER} != intelllvm ]]; then # skip MOM6SOLO on gaea with intelllvm
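
To make the condition that names both gaea-c5 and gaea-c6 easier to check, here is a minimal sketch that wraps it in a function and prints the outcome for a few machine/compiler pairs. The function name `mom6solo_allowed` and the list of machines tried are assumptions for illustration only; the condition itself is copied from the line above.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper, not part of tests/compile.sh.
# Returns 0 when the MOM6SOLO block would be entered, 1 when it is skipped.
mom6solo_allowed() {
  local MACHINE_ID=$1 RT_COMPILER=$2
  if [[ ${MACHINE_ID} != gaea-c5 && ${MACHINE_ID} != gaea-c6 ]] || [[ ${RT_COMPILER} != intelllvm ]]; then
    return 0
  fi
  return 1
}

# Print the decision for every combination of a few sample inputs.
for machine in hera gaea-c5 gaea-c6; do
  for compiler in intel intelllvm; do
    if mom6solo_allowed "${machine}" "${compiler}"; then
      echo "${machine} ${compiler}: block entered"
    else
      echo "${machine} ${compiler}: block skipped"
    fi
  done
done
```

Only the two Gaea machines combined with intelllvm skip the block; every other pair enters it.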
Collaborator

Why do we even need this logic here, adding or not adding -DMOM6SOLO=ON? As far as I know, we do not regression test MOM6SOLO. Can we remove this block of code entirely from this script?

Collaborator Author

Good question. It was added there for a reason, and I don't recall if we ever RT'd MOM6SOLO. @junwang-noaa do you recall what this block of code was used for?

Collaborator

If I remember correctly, this is to support standalone MOM testing. @jiandewang Do you know why MOM6 SOLO does not work on gaea?

Hi! I'm new to the UFS, but AFAIK nobody seems to use -DMOM6SOLO=ON, though I would defer to @junwang-noaa.

@sanAkel Oct 4, 2024

@junwang-noaa My understanding from @jiandewang is that he (and others) are no longer routinely testing MOM solo config; I have always built using instructions at MOM6-examples

Collaborator

> If I remember correctly, this is to support standalone MOM testing. @jiandewang Do you know why MOM6 SOLO does not work on gaea?

It was added here many years ago and we never tried this SOLO on any platform. My understanding is with nuopc_cap it has to be coupled with something.

Collaborator

> @junwang-noaa My understanding from @jiandewang is that he (and others) are no longer routinely testing MOM solo config; I have always built using instructions at MOM6-examples

Yes, we use MOM6-examples to run a standalone test when some debugging is needed (to help GFDL narrow down an issue when one of their big PRs is not working as expected in UWM).

Collaborator

So you do not use tests/compile.sh to build standalone test, is that correct?

Collaborator

> So you do not use tests/compile.sh to build standalone test, is that correct?

correct

@DusanJovic-NOAA Oct 4, 2024
Collaborator

Then we should remove it from compile.sh

@BrianCurtis-NOAA
Collaborator Author

cpld_control_p8 fails with:

  5: MPICH ERROR [Rank 5] [job id 207188364.0] [Fri Oct  4 13:33:08 2024] [c6n0210] - Abort(941244175) (rank 5 in comm 0): Fatal error in PMPI_Win_create: Other MPI error, error stack:
  5: PMPI_Win_create(294)......: MPI_Win_create(base=0x7ffe81f20fe0, size=0, disp_unit=1, MPI_INFO_NULL, comm=0xc4000060, win=0x7ffe81f2113c) failed
  5: MPID_Win_create(89).......:
  5: MPIDIG_mpi_win_create(872):
  5: win_allgather(246)........: OFI mr_enable failed (ofi_win.c:246:win_allgather:Address already in use)

and control_p8 runs to completion:

0: * . * . * . * . * . * . * . * . * . * . * . * . * . * . * . * . * . * . * . * . 
  0: *****************RESOURCE STATISTICS*******************************
  0: The total amount of wall time                        = 853.216145
  0: The total amount of time in user mode                = 216.242551
  0: The total amount of time in sys mode                 = 410.041583
  0: The maximum resident set size (KB)                   = 1720560
  0: Number of page faults without I/O activity           = 131391
  0: Number of page faults with I/O activity              = 173
  0: Number of times filesystem performed INPUT           = 1024
  0: Number of times filesystem performed OUTPUT          = 0
  0: Number of Voluntary Context Switches                 = 16903
  0: Number of InVoluntary Context Switches               = 9006
  0: *****************END OF RESOURCE STATISTICS*************************
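
When comparing runs while tuning the C6 configs, it can help to pull the wall time out of this statistics block automatically. The following is a minimal sketch under stated assumptions: `wall_time_from_log` is a hypothetical helper, and it assumes the log keeps the `The total amount of wall time = <seconds>` wording shown above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read a log on stdin and print the wall time in seconds.
wall_time_from_log() {
  awk -F'=' '/The total amount of wall time/ {
    gsub(/[[:space:]]/, "", $2)   # strip the padding around the number
    print $2
  }'
}

# Two lines copied from the control_p8 output above.
sample_log='  0: The total amount of wall time                        = 853.216145
  0: The total amount of time in user mode                = 216.242551'

wall_time_from_log <<< "${sample_log}"   # prints 853.216145
```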

@BrianCurtis-NOAA
Collaborator Author

@DusanJovic-NOAA this look ok?:

diff --git a/tests/compile.sh b/tests/compile.sh
index 2c3c7796..26e3a788 100755
--- a/tests/compile.sh
+++ b/tests/compile.sh
@@ -97,17 +97,6 @@ SUITES=$(grep -Po "\-DCCPP_SUITES=\K[^ ]*" <<< "${MAKE_OPT}")
 export SUITES
 set -ex
 
-# Valid applications
-if [[ ${MACHINE_ID} != gaea-c5 && ${MACHINE_ID} != gaea-c6 ]] || [[ ${RT_COMPILER} != intelllvm ]]; then # skip MOM6SOLO on gaea with intelllvm
-  if [[ "${MAKE_OPT}" == *"-DAPP=S2S"* ]]; then
-      CMAKE_FLAGS+=" -DMOM6SOLO=ON"
-  fi
-
-  if [[ "${MAKE_OPT}" == *"-DAPP=NG-GODAS"* ]]; then
-      CMAKE_FLAGS+=" -DMOM6SOLO=ON"
-  fi
-fi
-
 CMAKE_FLAGS=$(set -e; trim "${CMAKE_FLAGS}")
 echo "CMAKE_FLAGS = ${CMAKE_FLAGS}"
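
For readers unfamiliar with the `grep -Po ... \K` idiom in the hunk header above, here is a small sketch of what it extracts. The `MAKE_OPT` value and the suite names `suite_a,suite_b` are made up for illustration, and `extract_suites` is a hypothetical wrapper. The `-P` option requires a grep built with PCRE support (GNU grep usually is).

```shell
#!/usr/bin/env bash
# Hypothetical wrapper around the expression used in tests/compile.sh.
# -o prints only the matched text; \K drops everything matched before it,
# so only the value after "-DCCPP_SUITES=" (up to the next space) remains.
extract_suites() {
  grep -Po "\-DCCPP_SUITES=\K[^ ]*" <<< "$1"
}

# Illustrative input only; real MAKE_OPT strings come from the test configs.
MAKE_OPT="-DAPP=S2S -DCCPP_SUITES=suite_a,suite_b -D32BIT=ON"
SUITES=$(extract_suites "${MAKE_OPT}")
echo "SUITES = ${SUITES}"   # prints SUITES = suite_a,suite_b
```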

@DusanJovic-NOAA
Collaborator

@DusanJovic-NOAA this look ok?:

diff --git a/tests/compile.sh b/tests/compile.sh
index 2c3c7796..26e3a788 100755
--- a/tests/compile.sh
+++ b/tests/compile.sh
@@ -97,17 +97,6 @@ SUITES=$(grep -Po "\-DCCPP_SUITES=\K[^ ]*" <<< "${MAKE_OPT}")
 export SUITES
 set -ex
 
-# Valid applications
-if [[ ${MACHINE_ID} != gaea-c5 && ${MACHINE_ID} != gaea-c6 ]] || [[ ${RT_COMPILER} != intelllvm ]]; then # skip MOM6SOLO on gaea with intelllvm
-  if [[ "${MAKE_OPT}" == *"-DAPP=S2S"* ]]; then
-      CMAKE_FLAGS+=" -DMOM6SOLO=ON"
-  fi
-
-  if [[ "${MAKE_OPT}" == *"-DAPP=NG-GODAS"* ]]; then
-      CMAKE_FLAGS+=" -DMOM6SOLO=ON"
-  fi
-fi
-
 CMAKE_FLAGS=$(set -e; trim "${CMAKE_FLAGS}")
 echo "CMAKE_FLAGS = ${CMAKE_FLAGS}"

Yes.
