By default, use MKL as virtual provider for blas/lapack/fftw with Intel compilers (classic and llvm-based/oneapi); update site configs to revert to openblas/fftw as needed; skip wgrib2 with Intel oneapi; bump odc to 1.5.2 #1226

climbfuji · 2024-08-07T21:21:06Z

Summary

Describe the changes made in this PR and why they are needed.

Unrelated change but needed for gcc@13 support: bump odc from 1.4.6 to 1.5.2.
Split configs/common/packages.yaml into a compiler-independent configs/common/packages.yaml and compiler-dependent configs/common/packages_${COMPILER}.yaml; use openblas and fftw as virtual providers for blas, lapack, fftw-api with gnu@ and apple-clang@; use intel-oneapi-mkl with intel@ and oneapi@.
Site config updates for all sites: split packages.yaml into packages_${COMPILER}.yaml and add Intel MKL as external package for intel@ and oneapi@ compilers. Please follow the examples for blackpearl, narwhal, nautilus. Note that updating the site config does not imply testing the update (see section "Testing" below for which tests were done).

Update 2024/08/09: Certain site configs were modified to retain the openblas/fftw configuration with Intel by default - see list below. Steps to switch to the new default MKL configuration are documented in each site's packages_*.yaml.

List of sites:

Corresponding documentation updates.
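
As a rough illustration of the split described above, the compiler-dependent provider settings and a site-level external MKL entry could look like the sketch below. It uses Spack's standard `providers` and `externals` keys in packages.yaml, but it is not copied from the files in this PR: the file names in the comments follow the packages_${COMPILER}.yaml pattern, and the MKL version, prefix, and `buildable: false` setting are hypothetical placeholders that each site would replace with its own values.

```yaml
# Sketch only: configs/common/packages_gcc.yaml (same idea for apple-clang).
packages:
  all:
    providers:
      blas: [openblas]
      lapack: [openblas]
      fftw-api: [fftw]
---
# Sketch only: configs/common/packages_intel.yaml (same idea for oneapi).
packages:
  all:
    providers:
      blas: [intel-oneapi-mkl]
      lapack: [intel-oneapi-mkl]
      fftw-api: [intel-oneapi-mkl]
---
# Sketch only: site-level packages_intel.yaml registering MKL as an external.
# The version and prefix below are placeholders, not values from this PR.
packages:
  intel-oneapi-mkl:
    buildable: false  # assumption: do not build MKL from source on this site
    externals:
    - spec: intel-oneapi-mkl@2023.1.0
      prefix: /opt/intel/oneapi
```

A site that retains openblas/fftw with Intel would presumably keep the first set of providers in its Intel-specific file instead; the exact steps are the ones documented in each site's packages_*.yaml.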

Testing

@fmahebert @srherbener @RatkoVasic-NOAA @natalie-perlin @AlexanderRichert-NOAA I "assigned" this PR to you in case you want to test the updated site configs (see list above) on the system(s) that you are responsible for - this is optional, because we'll go through all platforms in a few weeks anyway when we roll out spack-stack-1.8.0.

Built unified environment on blackpearl with [email protected] and [email protected] (@climbfuji)
Built neptune standalone environment on Narwhal with [email protected] and [email protected] (@climbfuji)
Built unified environment on Orion with [email protected], with openblas (default per site config) and with MKL (manually changed) (@climbfuji)
CI

Applications affected

All

Systems affected

All

Dependencies

n/a

Issue(s) addressed

Resolves #759

Checklist

This PR addresses one issue/problem/enhancement, or has a very good reason for not doing so.
These changes have been tested on the affected systems and applications.
~~All dependency PRs/issues have been resolved and this PR can be merged.~~

climbfuji · 2024-08-12T11:47:41Z

@AlexanderRichert-NOAA I have two environments on Orion:

/work2/noaa/jcsda/dheinzel/spack-stack-feature-oneapi_intel_use_mkl/envs/ue-intel-2021.9.0/install/modulefiles/Core

and

/work2/noaa/jcsda/dheinzel/spack-stack-feature-oneapi_intel_use_mkl/envs/ue-intel-2021.9.0-mkl/install/modulefiles/Core

climbfuji · 2024-08-12T21:25:40Z

@AlexanderRichert-NOAA Here is a paper from Intel demonstrating better performance with MKL: https://www.intel.com/content/www/us/en/developer/articles/technical/performance-comparison-of-openblas-and-intel-math-kernel-library-in-r.html

climbfuji · 2024-08-13T02:55:12Z

@AlexanderRichert-NOAA I tried using the current ufs-weather-model develop branch with this spack-stack PR and the two test installs on Orion noted above. I tried the cpld_control_p8_mixedmode_intel regression test. Both the openblas and the mkl builds segfault after/in the ww3 initialization:

> cat out
...
180:        Wave model ...
180:  WW3 log written to /work/noaa/stmp/dheinzel/stmp/dheinzel/FV3_RT/rt_2304865/cpl
180:  d_control_p8_mixedmode_intel_test_mkl/./log.ww3

> cat err
150: WARNING from PE     0: Unused line in INPUT/MOM_input : ODA_VINC_VAR = 'v_inc'
150:
150:
150: WARNING from PE     0: Unused line in INPUT/MOM_input : ODA_INCUPD_NHOURS = 6
150:
 77: Abort(805361423) on node 77 (rank 77 in comm 0): Fatal error in PMPI_Init_thread: Other MPI error, error stack:
 77: MPI_Init_thread(307): Cannot call MPI_INIT or MPI_INIT_THREAD more than once

I checked, and esmf is linked statically while mapl is linked as a shared library. I don't know why/how that changed. Remember that spack-stack develop uses newer ESMF and MAPL versions, and I also had to put this workaround in the ufs-weather-model top-level CMakeLists.txt to be able to run cmake and make:

diff --git a/CMakeLists.txt b/CMakeLists.txt
index e5fdd1e8..5c7a974a 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -148,6 +148,9 @@ endif()

 find_package(NetCDF 4.7.4 REQUIRED C Fortran)
 find_package(ESMF 8.3.0 MODULE REQUIRED)
+if (NOT TARGET ESMF::ESMF)
+       add_library(ESMF::ESMF ALIAS esmf)
+endif ()
 if(FMS)
   find_package(FMS 2022.04 REQUIRED COMPONENTS R4 R8)
   if(APP MATCHES "^(HAFSW)$")

Will try to force static mapl linking tomorrow.

climbfuji · 2024-08-13T13:30:36Z

@AlexanderRichert-NOAA Update. I rebuilt mapl as static libraries and linked against that version. With both MKL and openblas, the ufs-weather-model still aborts with the same error and in the same place as described above. I am certain this is unrelated to the changes in this PR; it must have something to do with the update of some packages from spack-stack-1.6.0 to spack-stack-develop.

climbfuji · 2024-08-14T21:09:04Z

@AlexanderRichert-NOAA Yet another update. I compiled the spack-stack develop unified environment, then built ufs-weather-model against it and ran the same test as above. It failed in the same place. Therefore, the problem is not related to this PR, and the PR shouldn't be held up by it.

AlexanderRichert-NOAA · 2024-08-14T21:18:34Z

Have you run other UWM RTs with it?

climbfuji · 2024-08-14T21:25:31Z

> Have you run other UWM RTs with it?

I haven't.

climbfuji · 2024-08-14T21:30:57Z

One thing I had to do in order to compile was:

index e5fdd1e8..5c7a974a 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -148,6 +148,9 @@ endif()

 find_package(NetCDF 4.7.4 REQUIRED C Fortran)
 find_package(ESMF 8.3.0 MODULE REQUIRED)
+if (NOT TARGET ESMF::ESMF)
+       add_library(ESMF::ESMF ALIAS esmf)
+endif ()
 if(FMS)
   find_package(FMS 2022.04 REQUIRED COMPONENTS R4 R8)
   if(APP MATCHES "^(HAFSW)$")
diff --git a/tests/logs/RegressionTests_orion.log b/tests/logs/RegressionTests_orion.log
index b08d50a9..eee3eda5 100644

This is because of the findESMF mismatches etc. discussed elsewhere. I don't think this is the cause of the problem, though ...

I'll run one basic regression test next (not sure when I'll get to it).

climbfuji · 2024-08-15T15:27:06Z

@AlexanderRichert-NOAA control_c48 runs with spack-stack-dev and with this branch. With the openblas config (default as per this PR), the control_c48 results are bit-for-bit (b4b) identical to those from spack-stack-dev. The mkl run is still stuck in the queue (Orion had a power outage).

See ufs-community/ufs-weather-model#2399 where Dusan reports that he gets the same errors I had above with the coupled run when using esmf 8.6.1 and mapl 2.46.2 in spack-stack-1.6.0.

I think at this point there is no reason to hold up this PR.

climbfuji · 2024-08-15T16:31:33Z

Thanks @AlexanderRichert-NOAA

climbfuji · 2024-08-15T20:46:36Z

@AlexanderRichert-NOAA I know this has been merged already, but for the sake of completeness: the control_c48 run with MKL is b4b identical to the openblas run. Of course, this may not be the case for the fully coupled model, but at least for the standalone atmosphere with some stochastics (cellular automata) it is.

climbfuji added 5 commits August 6, 2024 17:42

63bb917 Create packages_{apple-clang,gcc,intel,oneapi}.yaml and move compiler-specific settings out of packages.yaml
b9be674 Update blackpearl site config: remove oneapi settings (now in common packages), add mirror
b69c4e0 Update .gitmodules and submodule pointer for spack for code review and testing
8089b57 Temporarily remove wgrib2 requirement from various virtual packages for '%oneapi'
2b9ccd2 First pass to update configs/sites/tier1/narwhal

climbfuji mentioned this pull request Aug 7, 2024

For gcc@13: Update GFE packages (yafyaml etc) from spack develop and add [email protected] JCSDA/spack#457 (Merged)

b586a26 Update formatting in configs/common/packages_*.yaml

climbfuji added 3 commits August 7, 2024 15:57

be60b11 Update nautilus site config: use packages_COMPILER.yaml
38ed6fb Remove unused external intel-oneapi-tbb from Narwhal site config
6e9b458 Update CI workflows to use Intel MKL with Intel and OneAPI compilers

climbfuji commented Aug 8, 2024 on configs/common/packages.yaml (outdated, resolved)

climbfuji added 7 commits August 8, 2024 08:56

148c4d9 Merge branch 'develop' of https://github.com/jcsda/spack-stack into feature/oneapi_intel_use_mkl
5b3c534 Update submodule pointer for spack
18ae9fc Add libfabric to LD_LIBRARY_PATH in narwhal compiler config
bc82b78 Bump odc to 1.5.2 in configs/common/packages.yaml and configs/containers/specs/jedi-ci.yaml
35120ca Merge branch 'develop' of https://github.com/jcsda/spack-stack into feature/oneapi_intel_use_mkl
108ba70 For Intel Classic, pin gettext to 0.21.1 since 0.22.5 doesn't build
7ab7ef0 Revert .gitmodules and update submodule pointer for spack

climbfuji assigned climbfuji, fmahebert, srherbener, RatkoVasic-NOAA, natalie-perlin and AlexanderRichert-NOAA Aug 9, 2024

climbfuji added 5 commits August 8, 2024 21:43

7967004 Update atlantis site config
7c2def4 Update S4 site config
ba3f738 Update Orion site config
ba44456 Work in progress: update doc/source/NewSiteConfigs.rst and doc/source/PreConfiguredSites.rst
6f08ead Update Hercules site config
ee49e3d Merge branch 'develop' of https://github.com/jcsda/spack-stack into feature/oneapi_intel_use_mkl

AlexanderRichert-NOAA mentioned this pull request Aug 12, 2024

spack-stack Intel will use MKL for FFTW/BLAS/LAPACK functionality NOAA-EMC/GSI#780 (Open)

climbfuji marked this pull request as ready for review August 12, 2024 21:47

climbfuji requested a review from AlexanderRichert-NOAA August 12, 2024 21:47

climbfuji added the enhancement New feature or request label Aug 12, 2024

climbfuji mentioned this pull request Aug 12, 2024

"Revert" Intel deprecation flag changes in spack's lib/spack/env/cc and pull in improved solution from spack mainline #1238 (Merged)

climbfuji added 2 commits August 13, 2024 07:20

10ea19d Update site configs for Hera and Jet (use openblas for now with Intel)
e95796d Merge branch 'develop' of https://github.com/jcsda/spack-stack into feature/oneapi_intel_use_mkl
139dd43 Update site configs for aws-pcluster, derecho, discover-scu16, discover-scu17

climbfuji commented Aug 14, 2024 on .github/workflows/ubuntu-ci-x86_64-intel.yaml (resolved)

climbfuji added 3 commits August 14, 2024 07:32

9b0d8cf Update site configs for acorn, gaea-c5, gaea-c6
7506192 Update site configs for noaa-{aws,azure,gcloud}
30c7027 Final update of doc/source/PreConfiguredSites.rst

climbfuji mentioned this pull request Aug 15, 2024

Should we update FindESMF.cmake and maybe the name of the imported esmf target ufs-community/ufs-weather-model#2399 (Open)

AlexanderRichert-NOAA approved these changes Aug 15, 2024

climbfuji merged commit 7168fec into JCSDA:develop Aug 15, 2024
8 checks passed

climbfuji deleted the feature/oneapi_intel_use_mkl branch August 15, 2024 17:03
