Building with the spack-stack unified environment on non-production machines #589
I have established a testing environment on Hera referencing a test build of the spack-stack unified environment. So far, this is working well. All regression tests run to completion. I am running the regression tests against the branch to be merged in #571. There are some differences between the two, with the hwrf, global_4denvar, global_3denvar, global_3dvar, and rtma tests all failing cost and/or penalty comparison tests, with values differing in the 1-5% range on the first iteration. One potential source for differences is a newer version of the BUFR library. The spack-stack test install is running version 12.0.0 versus the hpc-stack-based install that is compiled against version 11.7.0. Running with the older version with spack-stack would determine if this is a source of difference. |
@arunchawla-NOAA @RussTreadon-NOAA @hu5970 @ulmononian @mark-a-potts |
Dave ran ctests on Hera using Output from the loproc global_4denvar for updat ( The initial temperature (obs-ges) innovation statistics (fort.203) differ for prepbufr report types 120, 132, 180, and 182. These report types are virtual temperatures. Bias and rms are larger for observations closer to the surface. Differences decrease moving up in the atmosphere. Innovation statistics for report types 130, 131, 133, 134, 136, and 136 are identical. These report types are sensible temperatures. This suggests something odd happens when processing virtual temperature observations with Prints added to
returns Recompiling the code with Do we need to change or modify the call to |
Write Replace
with
in The
This failure, however, was due to the maximum allowable threshold time check. The updat
This may reflect the additional prints in the Comparison of the analysis results shows that the Not sure why replacing a real(8) argument with an integer(4) argument was needed in this call to My working directories on Hera are
Other ctests have not yet been run with the above change. |
Hera ctests with
All failures except
and
The
As a test, replace
Recompile and rerun
Stop here and let others follow up / investigate more. |
Also tagging @jbathegit @edwardhartnett @wx20jjung @dtkleist for awareness. |
@DavidHuber-NOAA and @RussTreadon-NOAA thanks for your tests on this! For now I would focus on building and testing with a spack-stack that has bufr/11.7.0. I am going to bring this to the attention of @jbathegit @jack-woollen and @edwardhartnett to see if they can shed more light on Russ's results. |
It looks like you figured the issue out on your own. This was indeed a planned change within v12.0.0, as noted in the release notes...
This was done intentionally as part of the simplification down to one library build. Note that in previous versions of ufbqcd, the 3rd argument returned was just a real, not an explicit real*8. So it was only a real*8 in the _8 and _d builds because of the compile flags for those builds, and not because of some intrinsic need within the library itself. The return value of the 3rd argument is never larger than 255 so it fits easily inside of any real, and in fact it's always a whole number so there's really no reason not to return it as an integer and make everything simpler. |
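To illustrate why a caller that still declares the third ufbqcd argument as real(8) can see a nonsensical value once the library returns an integer, here is a minimal Python sketch. It is not library code: it only mimics the memory reinterpretation, assuming a little-endian layout and assuming the unused upper four bytes happen to be zero. The value 9 is a hypothetical code, chosen only because the return value is described above as a small whole number.

```python
import struct


def read_as_real8(int4_value: int) -> float:
    """Reinterpret memory written as a 4-byte integer as an 8-byte real."""
    # The callee stores a 4-byte integer at the start of the caller's
    # 8-byte slot. The other four bytes are assumed to be zero here; in a
    # real program they would hold whatever was already in memory.
    buffer = struct.pack("<i", int4_value) + b"\x00" * 4
    # The caller then reads all eight bytes as an IEEE-754 double.
    (as_real8,) = struct.unpack("<d", buffer)
    return as_real8


code = 9  # hypothetical small whole-number return value
seen_by_caller = read_as_real8(code)
print("library wrote:", code)
print("caller read  :", seen_by_caller)
print("values match :", seen_by_caller == float(code))
```

Under these assumptions the caller reads a tiny denormal number rather than the intended value, which is consistent with the fix in this thread being a change to the declared type in the calling code.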
I feel the need to contribute my two cents in supporting backward compatibility with downstream codes as an important principle in software management. Considering reasons not to make such a change as this, seems to me there's not a good enough compensating reason to do it. |
I hear what you're saying Jack, and I totally agree that backwards compatibility is an important principle in software management. But in this case, this is no different than where we've already agreed that _8 users will now need to make a change to their application codes to add a call to setim8b, and to make sure that all calls to library functions which return integers are now explicitly declared as integer*4. In other words, with a major release such as this (where the X of the version X.Y.Z is changing), it's understood that downstream users may need to make some adjustments to their application codes, and this ufbqcd change is just another example of that, and we've already documented this in the accompanying release notes for _8 and _d users. Note also that, if we hadn't included this ufbqcd change, then as software managers we'd have had to instead develop new internal routines similar to x48 and x84, but for reals instead of integers, since this routine (along with its sister routine ufbqcp) were the only cases of user-callable routines where a real argument was being passed that wasn't already explicitly declared as real*8. And we couldn't have just changed the library to now make that 3rd argument an explicit real*8, because doing so would have similarly impacted any existing _4 users. So bottom line, no matter what we did here, it would have impacted some of our users, and this is just one of the consequences of our earlier decision to eliminate the previous _8 and _d builds and consolidate everything down to a single _4 build. |
On a separate but related note, I did note the discussion where some of the spack-stack test cases appear to be running slower with the new 12.0.0 build than with the previous 11.7.0 build. That's definitely not a feature that we intentionally introduced into the new build, nor is it something that we saw in our own regression testing for the library, so if it's still an issue of concern then we'd definitely like to better understand what part(s) of the library may be responsible. Is there any way that some sort of profiling tool might better pinpoint where those differences are (e.g. in which library functions or lines of code) for your test cases? |
Also took the opportunity to update prod-util loads for each modulefile and delete the gsi_common_wcoss2.lua modulefile, which was not incorporated with the Intel 2022 upgrade PR. NOAA-EMC#589
I ran regression tests with spack-stack/1.4.1 on Hera against the current develop branch with the following results:
Looking into the netcdf_fv3_regional case, differences first appear in initial radiance penalties. This seems like a CRTM issue, though I am having trouble tracking it down. The CRTM directories are The CRTM fix directories are The contents of the fix directories are identical. The hpc-stack source code for CRTM is located here: So there doesn't seem to be an obvious difference in CRTM. The library versions do differ between the two stacks. In particular, hdf5/netcdf 1.14.0/4.9.2 and nemsio in spack-stack is still at 2.5.2 and includes w3nco, but I don't think these should have an impact. hpc-stack
spack-stack
@RussTreadon-NOAA Would you mind taking a look at the RTs on Hera? My regression test directory is in |
A check of the run directories shows the crtm coefficients to be identical between the updat and contrl runs. A check of Are the same compiler flags used to build the hpc-stack and spack-stack crtm libraries? |
@AlexanderRichert-NOAA Could you point me to the build log for spack-stack/1.4.1 CRTM/2.4.0 on Hera? I'm seeing some differences between it and the hpc-stack version. |
Sure-- /scratch1/NCEPDEV/nems/role.epic/spack-stack/spack-stack-1.4.1/envs/unified-env/install/intel/2021.5.0/crtm-2.4.0-wpiygpl/.spack/spack-build-out.txt (and there's other build-related info in that directory) |
I won't look at it any further unless you'd like me to, but for what it's worth the difference that stands out to me is the CMAKE_BUILD_TYPE (hpc-stack=Release, spack-stack=RelWithDebInfo), and we've seen some differences in other codes based on that setting. |
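One quick way to compare the build type between two build logs is to search them for the CMAKE_BUILD_TYPE definition. The sketch below is only an illustration: the two log excerpts are hypothetical one-line samples, not the contents of the actual Hera logs, and it assumes the option appears on the cmake command line either as `-DCMAKE_BUILD_TYPE=<type>` or with a `:STRING` type annotation.

```python
import re

# Matches CMAKE_BUILD_TYPE=Release as well as CMAKE_BUILD_TYPE:STRING=Release.
_BUILD_TYPE = re.compile(r"CMAKE_BUILD_TYPE(?::\w+)?=(\w+)")


def find_build_types(log_text: str) -> list[str]:
    """Return the distinct CMAKE_BUILD_TYPE values found in a build log."""
    return sorted(set(_BUILD_TYPE.findall(log_text)))


# Hypothetical excerpts standing in for the real build logs.
hpc_stack_log = "cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=/hypothetical/prefix .."
spack_stack_log = "'cmake' '-G' 'Unix Makefiles' '-DCMAKE_BUILD_TYPE:STRING=RelWithDebInfo' '..'"

print("hpc-stack  :", find_build_types(hpc_stack_log))
print("spack-stack:", find_build_types(spack_stack_log))
```

To use it on a real log, read the file into a string and pass it to `find_build_types`; an empty list means the option was not set explicitly on the command line.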
@jbathegit @edwardhartnett I just ran a simple test on wc2 reading a large satellite file, which is basically all the gsi does with bufrlib, and got much the same result David did in his tests. With bl11 it took 2m5s, with bl12 2m45s. Timing changes a little from run to run but it generally takes something like 40s longer with bl12. My test literally just read the data using ireadmg and ireadsb then stopped. I'll proceed to comment out the im8b test blocks for the routines involved in the read process and see what happens. The fix may just be something like separate paths through the lower level routines for 4 and 8 byte cases. Depends on how many routines that turns out to be. |
Moving the tests to hera, using gfortran, v12 was actually a little bit faster than v11. So how do you compile the bufrlib with ifort on hera? |
@jack-woollen Hmm, interesting. I compiled version 12 of bufrlib two ways -- |
Can this problem be demonstrated in a bufrlib test program? That is, can we get a simple, one-file program, which runs slower under v12 than v11? |
@edwardhartnett It sounded like Jack had done just that in this comment. |
Timing these codes can be fiddly, depending on what platform it is tested on, and when. WCOSS2 has demonstrated widely varying timings, via the unix time command. At first the results looked like v12 was slower than v11. That's when I wrote the comment above. But when I ran it over and over and over the results became more murky. Then on Sunday I tried timing on hera, where v12 was a couple seconds faster than v11, but compiling with gfortran. Then I went back to WC2 (and ifort) and got results like hera, v12 faster by a hair. So far my simple timing test only checks reading a large and compressed (mtiasi) file. I'm thinking running the GSI observer with a full set of data will give a better comparison for that code. Working on it now. The elephant in the room for bufrlib timing may be the prepobs process, since it does reading, writing and arithmetic. I'll look at that also. My question about compiling bufrlib on hera with ifort refers to when I download bufrlib from git, the cmake default is to use gfortran. I'm wondering what to change so it compiles with ifort. Any help with that is appreciated. |
@jack-woollen Ah, understood. To compile with Intel on Hera, you only need to load the intel/2022.1.2 module then run cmake. Cmake will detect the Intel compiler and create a makefile based on it. |
Thanks to @DavidHuber-NOAA's help with the gsi cmake setup I have been able to make timing comparisons of running the gsi observer mode with BF-11.7.0 and BF-12.0.0. The observer reads all the BUFR datasets into the program and then will stop if the variable miter=0. So it exercises most if not all the bufr coding within the gsi, and it only takes a few minutes to complete. Trying this and trying that has revealed what looks like roughly half of the timing issue apparent in the gsi runs using bufrlib version 12. Subroutine upb8.f, which is called when unpacking compressed bufr datasets, benefitted from some optimization specifically for the 4byte case. This subroutine was not introduced as part of the mods to the single library build, but rather to accommodate the new WMO approved feature of allowing numeric elements to fill up to 64 bits, instead of just 32 as it has been until now. The fix for upb8.f can be reviewed on dogwood in directory /lfs/h2/emc/global/noscrub/Jack.Woollen/bufrtime/bufr_v12.0.0/NCEPLIBS-bufr/src. Following is a list of timing results before and after the mods to upb8.

Before upb8 change
The total amount of time in user mode = 268.026405 ~40s
The total amount of time in user mode = 267.477490 ~42s
The total amount of time in user mode = 267.551686 ~43s
The total amount of time in user mode = 266.931843 ~50s
The total amount of time in user mode = 268.897811 ~47s

After upb8 change
The total amount of time in user mode = 268.612845 ~22s
The total amount of time in user mode = 274.917766 ~14s
The total amount of time in user mode = 265.291353 ~25s
The total amount of time in user mode = 265.125857 ~25s
The total amount of time in user mode = 267.488541 ~23s |
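As a rough check on the "roughly half" estimate, the second figure on each line above (the value prefixed with ~) can be averaged before and after the upb8 change. The thread does not label that column, so treating it as the per-run timing being compared is an assumption; the numbers themselves are copied from the comment.

```python
from statistics import mean

# Second figure on each timing line, in seconds (column meaning is assumed).
before_upb8 = [40, 42, 43, 50, 47]
after_upb8 = [22, 14, 25, 25, 23]

before_mean = mean(before_upb8)
after_mean = mean(after_upb8)
reduction = 1 - after_mean / before_mean

print(f"mean before: {before_mean:.1f} s")
print(f"mean after : {after_mean:.1f} s")
print(f"reduction  : {reduction:.0%}")
```

With only five runs per case the spread is large (14 s to 25 s after the change), so the mean is an indication rather than a precise measurement.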
@jack-woollen Fantastic! Thanks for the update. I will run this optimization through the GSI regression tests. |
@jack-woollen I also have to add my kudos - well done sir! Your change to add an I'll go ahead and set up a PR to add this to the library baseline for the next release, and I'll let you know when it's ready. |
BTW, and just for my own FYI, how were you doing your timings? Were you using calls to the intrinsic routines Or maybe this is more of a question for @DavidHuber-NOAA or others? |
@jbathegit For the gsi there is the timing printed from the w3 routine I think. For other tests I use the unix time command. In either case the user time is more stable than the wall time for comparisons. |
Thanks Jack. Do you happen to know which w3 routine they're using? Maybe w3utcdat? And by "user time", I presume you mean CPU time? |
Jeff, I don't know what timer is used in gsi. It isn't w3utcdat, at least I can't find that in the code. Whatever it is it reports like this: RESOURCE STATISTICS************** Of course the unix timer reports: real 0m0.012s In either case I think the user time is more stable than wall time. It is probably more like cpu time. |
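The distinction between wall time and CPU time discussed above can be sketched in Python. This is not how the GSI or the unix `time` command measures anything; it is a generic illustration using `time.perf_counter` for wall-clock time and `time.process_time` for CPU time (user plus system, so close to but not identical to the `user` line). The workload `busy_work` is a hypothetical stand-in for a BUFR read, and the median over repeated runs reflects the run-to-run variation mentioned earlier.

```python
import time
from statistics import median


def time_call(func, repeats=5):
    """Run func several times and return median wall and CPU times in seconds."""
    wall, cpu = [], []
    for _ in range(repeats):
        wall_start = time.perf_counter()  # wall clock, affected by system load
        cpu_start = time.process_time()   # CPU time used by this process only
        func()
        cpu.append(time.process_time() - cpu_start)
        wall.append(time.perf_counter() - wall_start)
    return {"wall_median": median(wall), "cpu_median": median(cpu), "runs": repeats}


def busy_work(n=200_000):
    """Hypothetical CPU-bound stand-in for the code being timed."""
    return sum(i * i for i in range(n))


result = time_call(busy_work, repeats=5)
print(f"median wall time: {result['wall_median']:.4f} s")
print(f"median CPU time : {result['cpu_median']:.4f} s")
```

On a shared or I/O-bound system the wall time can vary widely between runs while the CPU time stays comparatively steady, which is one reason to compare CPU (or user) time when judging a code change.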
I ran regression tests against spack-stack 1.5.1 on Orion. Multiple tests failed timethresh tests. Following the suggestion in NOAA-EMC/global-workflow#2044, I tried adding --cpus-per-task to the srun call, but this still did not help. Additionally, the
The I then ran regression tests on Hera using spack-stack/1.5.1. No tests failed for timethresh exceedances. It seems that the slowdowns seen on Orion are relegated to just that system and only for spack-stack/1.5.1. However, 4 tests failed for non-reproducible cost function results on both systems ( |
@DavidHuber-NOAA FYI, I copied the latest souped up test version of bufr_v12 over to hera and compiled it with intel/2022.1.2, which is consistent with the bufr/11.7.0 module, and ran the time test reading through a 827MB mtiasi file. The results are below. They are a little better than I see on wcoss2. The updated v12 bufr working set is found on hera in /scratch1/NCEPDEV/global/Jack.Woollen/bufrtime/bufr_v12.0.0. /scratch2/NCEPDEV/nwprod/hpc-stack/libs/hpc-stack/intel-2022.1.2/bufr/11.7.0/lib64/libbufr_4.a /scratch1/NCEPDEV/global/Jack.Woollen/bufrtime/bufr_v12.0.0/build/path1/lib64/libbufr_4.a |
@jack-woollen @AlexanderRichert-NOAA Those are some very promising times. I will give that a try in the regression tests on Hera later today. And yes, I tried these tests out on Hera as well with spack-stack/1.5.1. There isn't a long runtime issue on that system, but there are reproducibility issues. I looked into it this morning and it appears that CRTM is still being built with |
I recompiled both develop and spack-stack on Orion and reran the regression tests this morning. The time results were much improved, though somewhat erratic. Perhaps I forgot to recompile after updating the modulefiles yesterday or perhaps there is a system issue. That said, four tests still failed their timethresh tests:
For comparison, below are the same runtimes on Hera. The runtimes are obviously slower for spack-stack on Hera (I'm guessing due to optimization settings), but not to the same extent as Orion.
I'm going to double check that I have all of the right libraries and then try rerunning the tests on Orion on /work2 just to see if that makes a difference. |
Here's a summary of the latest results as shown above. The orion side has some non-bufr issues for sure. I think given the latest optimization of the bufr code, and if I understand correctly that these tests are only variations of running the gsi executable, where the bufr data is read just once and done, then any difference between develop and spack-stack of more than a very few seconds is due to something other than the bufrlib. That might include i/o or cpu bottlenecks or other hardware or software factors which could be in play. |
@jack-woollen Agreed that these do not have to do with BUFR (I'm trying to keep the BUFR-related tests in #642). I believe the differences on Hera are due to optimization differences of the libraries between develop (hpc-stack) and spack-stack. Looking into this some more, it appears the only library that is compiled with RelWithDebInfo that is used by the GSI is the CRTM, so it appears the time differences are (likely) just related to that library. This is an issue being tracked in JCSDA/spack-stack#827. The differences on Orion are something more and I think I/O may be a significant issue. The /work filesystem has always been slow and perhaps the lower optimization of spack-stack is magnified if additional disk I/O accompanies lower optimization. Indeed, I just finished a test on /work2 and the very large time differences between the HAFS tests go away. There's some difference, but it is much smaller:
I think I am satisfied with these timings on Orion. I'm going to move on to Hercules, Jet, and S4. |
I built this branch on S4. I won't be able to test that system until the global workflow is updated since the regression test dataset has restricted data on it which is not allowed on S4. |
This includes netcdf/4.9.2 and hdf5/1.14.0. However, it does not include the upgrades in spack-stack on the other systems for ncdiag/1.1.2, ip/4.3.0, and w3emc/2.10.0.
With the upgrade to HDF5 version 1.14.0 ongoing (#563), this would be a good time to move to the spack-stack unified environment, which also incorporates this version of HDF5 and netCDF version 4.9.2. Some additions are needed to the unified environment (in particular, ncio, ncdiag, and bufr), but these can be added manually as they already exist in spack. This is also desired for the upcoming UFS SRW release, which will incorporate the GSI.