Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix initialization of doubles in VerticalProfileMod #6649

Merged

Conversation

peterdschwartz
Copy link
Contributor

@peterdschwartz peterdschwartz commented Sep 27, 2024

Variables that are r8 were initialized as single-precision, potentially causing inconsistent failures with the sums not adding to 1.0_r8.

Also, fixed a syntax error in CH4Mod for spval.

[BFB]

@peterdschwartz
Copy link
Contributor Author

I can't verifiy that this indeed fixes the bug on mappy. Someone with access to mappy would have to test it a few times due to the intermittent nature of the failure.

But this code is bugged, so it makes sense to me merge it to see if it's fixed using CDash.

@rljacob
Copy link
Member

rljacob commented Sep 30, 2024

@peterdschwartz you can merge this to next today.

peterdschwartz added a commit that referenced this pull request Sep 30, 2024
)

Variables that are r8 were initialized as single-precision, potentially causing inconsistent failures with the sums not adding to 1.0_r8.

Also, fixed a syntax error in CH4Mod for spval.

Fixes #6650
[BFB]
@peterdschwartz
Copy link
Contributor Author

merged to next.

Tested [BFB] with intel and gnu compilers on pm-cpu

@peterdschwartz
Copy link
Contributor Author

@rljacob I noticed that ERS.r05_r05.ICNPRDCTCBC.mappy_gnu.elm-cbudget is also failing occasionally with same VerticalProfileMod sum error message.

Since that test still failed overnight on next with this PR, I re-checked and noticed I missed a couple of if-branches that have single-precision (as below). I made a new commit and will re-merge today.

else 
   froot_prof(p,1) = 1./dzsoi_decomp(1)
   croot_prof(p,1) = 1./dzsoi_decomp(1)
   leaf_prof(p,1) = 1./dzsoi_decomp(1)
   stem_prof(p,1) = 1./dzsoi_decomp(1) 
end if 

Test error message:

 profile sums:    1.0000000000000000        1.0000000000000000        1.0000000000000000       0.99923994737367017     
 ENDRUN: ERROR: sum-1 > deltaERROR in /home/e3sm-jenkins/jenkins-ws/workspace/mappy_e3sm_next/E3SM/components/elm/src/biogeochem/VerticalProfileMod.F90 at line 340                                                                                                                                                                                                                                                                                                                                                                                          
 ERROR: Unknown error submitted to shr_abort_abort.

@rljacob
Copy link
Member

rljacob commented Oct 1, 2024

Should the 3.0.1 tag wait for this?

@peterdschwartz
Copy link
Contributor Author

@rljacob I'm not sure. The code is more correct this way, but practically, it seems to only manifest on one machine/compiler and we can't guarantee that this will fix that specifically.

peterdschwartz added a commit that referenced this pull request Oct 1, 2024
@peterdschwartz
Copy link
Contributor Author

re-merged to next

@rljacob
Copy link
Member

rljacob commented Oct 2, 2024

Was this supposed to help with ERS.r05_r05.IELM.mappy_gnu.elm-V2_ELM_MOSART_features ? It flipped from DIFF yesterday to FAIL today.

@peterdschwartz
Copy link
Contributor Author

peterdschwartz commented Oct 2, 2024

Yes, this PR was aiming at fixing random fails for ERS.r05_r05.IELM.mappy_gnu.elm-V2_ELM_MOSART_features and ERS.r05_r05.ICNPRDCTCBC.mappy_gnu.elm-cbudget. So, I assume that the FAIL today will go back to being a DIFF on its own. Same as the elm-cbudget test failed yesterday but passes today.

The randomness, only affecting one machine/compiler, and happening to two different tests is what makes me think it's some floating-point issue.

Been trying to figure out the next step this morning and It's difficult to go more in-depth without being able to reproduce this behavior on pm-cpu. I attempted to run a debug version of the mosart test ERS_D.r05_r05.IELM.mappy_gnu.elm-V2_ELM_MOSART_features but that ran into a floating point exception while reading in MOSART restart file so there's NaNs in some of the MOSART variables (I'm guessing) which has to be unrelated.

/global/u2/p/pschwar3/integration/E3SM/components/mosart/src/riverroute/RtmRestFile.F90
code:
            if (abs(rtmCTL%erout(n,nt))   > 1.e30) then <=----- THIS LINE
                 rtmCTL%erout(n,nt) = 0.
             end if

Potential next step:
Make another commit that puts more variable information in the error message and hopefully find a red flag.

There are several checks to avoid dividing-by-zero like
(rootfr_tot > 0._r8) .and. (surface_prof_tot > 0._r8)
which could be causing instability if those variables are essentially zero.
Changing those checks to
(rootfr_tot > smallparameter) .and. (surface_prof_tot > smallparameter)
is another thing to try, but it's pretty awkward relying on nightly test results to debug this

@rljacob
Copy link
Member

rljacob commented Oct 2, 2024

@jgfouca may be able to help speed up the development since he could run a test on your branch on mappy.

Note that I reset next so this PR is no longer on it.

@jgfouca
Copy link
Member

jgfouca commented Oct 3, 2024

@rljacob , @peterdschwartz , I am happy to kick off some runs on mappy. I assume I will need to checkout this branch. How is this for a test:

% for i in $(seq 10); do
  ./create_test ERS.r05_r05.IELM.mappy_gnu.elm-V2_ELM_MOSART_features --wait
done

I assume we are looking for FAILs, not DIFFs, so no need for baseline comparison.

@rljacob
Copy link
Member

rljacob commented Oct 9, 2024

@peterdschwartz want do you want to do with this PR? It still fixes things that should be fixed even if it doesn't fix #6650

@peterdschwartz
Copy link
Contributor Author

peterdschwartz commented Oct 9, 2024

Right, I will push one more commit that just adds useful information to the error message and start merging it tomorrow

@rljacob rljacob added this to the maint-3.0 milestone Oct 9, 2024
peterdschwartz added a commit that referenced this pull request Oct 10, 2024
)

Variables that are r8 were initialized as single-precision, potentially causing inconsistent failures with the sums not adding to 1.0_r8.
Added more output to errmsg to aid in understanding transient failures.
Also, fixed a syntax error in CH4Mod for spval.

[BFB]
@rljacob
Copy link
Member

rljacob commented Oct 14, 2024

@peterdschwartz you can merge this to master.

@peterdschwartz peterdschwartz merged commit 12f489f into E3SM-Project:master Oct 14, 2024
9 checks passed
@peterdschwartz
Copy link
Contributor Author

merged to master

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants