ELM-FATES Restart Error #1234

Open · rgknox opened this issue Aug 12, 2024 · 33 comments

rgknox commented Aug 12, 2024

@ckoven detected a problem restarting the model in between spin-up phases with ELM-FATES. I believe the last phase to complete was an AD spinup, which is usually followed by another pre-industrial run without accelerated decomposition constants. This type of run typically restarts by specifying the finidat file. @ckoven, can you provide more details and perhaps share the script that was used to generate the error?
The model fails during two-stream radiation, with what appear to be uninitialized values.

 Warning. small nocomp patch wasnt able to find another patch to fuse with.
270:            0           0  4.157895100850780E-003
 19: forrtl: error (65): floating invalid
 19: Image              PC                Routine            Line        Source             
 19: libpthread-2.31.s  00001479D53EF910  Unknown               Unknown  Unknown
 19: e3sm.exe           000000000422A710  twostreammlpemod_         365  TwoStreamMLPEMod.F90
 19: e3sm.exe           00000000042332F8  twostreammlpemod_         562  TwoStreamMLPEMod.F90
 19: e3sm.exe           000000000421CF76  fatestwostreamuti         353  FatesTwoStreamUtilsMod.F90
 19: e3sm.exe           00000000042066A2  fatesradiationdri         432  FatesRadiationDriveMod.F90
 19: e3sm.exe           0000000000BD93ED  elmfatesinterface        2220  elmfates_interfaceMod.F90
 19: e3sm.exe           00000000009C446F  elm_driver_mp_elm         722  elm_driver.F90
 19: e3sm.exe           000000000096124B  lnd_comp_mct_mp_l         617  lnd_comp_mct.F90
 19: e3sm.exe           00000000004AE943  component_mod_mp_         757  component_mod.F90
 19: e3sm.exe           0000000000469F0A  cime_comp_mod_mp_        2963  cime_comp_mod.F90
 19: e3sm.exe           00000000004971EB  MAIN__                    153  cime_driver.F90
 19: e3sm.exe           0000000000433CFD  Unknown               Unknown  Unknown
 19: libc-2.31.so       00001479D4E3C24D  __libc_start_main     Unknown  Unknown
 19: e3sm.exe           0000000000433C2A  Unknown               Unknown  Unknown
MOSART decomp info proc =       383 begr =    258526 endr =    259200 numr =       675
270:  Warning. small nocomp patch wasnt able to find another patch to fuse with.
270:            0           0  4.157895100850780E-003
345:  GETRDN
345:    27.2995144373317     
345:   0.823811719313825     
345:   0.000000000000000E+000
345:   1.289252691832906E-008
345:   0.000000000000000E+000  2.328840873721851E-014
345:                      NaN  4.143972961881559E-002
345:   3.510709829585243E-014  1.289246852282203E-008
345:    75.0546825251975     
345:    61.7607829908001     
345:    1.00000000000000     
345:    1.00000000000000     
345:  ENDRUN:
345:  ERROR in TwoStreamMLPEMod.F90 at line 388     

Combined with the write-statement code, the log values map to:

write(log_unit,*) "GETRDN"
scelg%Kb                = 27.2995144373317
scelb%a                 = 0.823811719313825
vai                     = 0.000000000000000E+000
scelb%Ad                = 1.289252691832906E-008
scelb%B1                = 0.000000000000000E+000   scelb%B2           = 2.328840873721851E-014
scelb%lambda1_beam      = NaN                      scelb%lambda2_beam = 4.143972961881559E-002
scelb%lambda1_diff      = 3.510709829585243E-014   scelb%lambda2_diff = 1.289246852282203E-008
this%band(ib)%Rbeam_atm = 75.0546825251975
this%band(ib)%Rdiff_atm = 61.7607829908001
exp(-scelg%Kb*vai)      = 1
exp(scelb%a*vai)        = 1

rgknox commented Aug 12, 2024

What stands out to me, @ckoven, is that there is no vegetated area in this element. If that is true, there should be no element at all, unless it is an "air" element, but that seems unlikely: the Kb value is not 0.5, which is the nominal value given to the air element.

rgknox changed the title from "ELM-FATES Two Stream Restart Error" to "ELM-FATES Restart Error" on Aug 12, 2024

ckoven commented Aug 12, 2024

Thanks @rgknox. I have a minimal run script that reproduces the bug with just one year of runtime; I paste it below. Currently I am using this tag, which combines a few different things; I can try something closer to main as well.

#!/usr/bin/env bash

SRCDIR=$HOME/E3SM/components/elm/src/
cd ${SRCDIR}
GITHASH1=`git log -n 1 --format=%h`
cd external_models/fates
GITHASH2=`git log -n 1 --format=%h`

STAGE=STEP_ONE
#STAGE=STEP_TWO

if [ "$STAGE" = "STEP_ONE" ]; then
    SETUP_CASE=fates_e3sm_perlmttr_4x5_test_step1
elif [ "$STAGE" = "STEP_TWO" ]; then
    PRIOR_CASE=fates_e3sm_perlmttr_4x5_test_step1
    SETUP_CASE=fates_e3sm_perlmttr_4x5_test_step2
fi
    
CASE_NAME=${SETUP_CASE}_${GITHASH1}_${GITHASH2}
basedir=$HOME/E3SM/cime/scripts

cd $basedir
export RES=f45_f45
project=m2467

./create_newcase -case ${CASE_NAME} -res ${RES} -compset IELMFATES -mach pm-cpu -project $project

cd $CASE_NAME

ncgen -o fates_params_default_${GITHASH2}.nc ${SRCDIR}/external_models/fates/parameter_files/fates_params_default.cdl

if [ "$STAGE" = "STEP_ONE"  ]; then

    ./xmlchange RUN_STARTDATE=0001-01-01
    ./xmlchange NTASKS=-3
    ./xmlchange STOP_N=1
    ./xmlchange REST_N=1
    ./xmlchange STOP_OPTION=nyears
    ./xmlchange JOB_QUEUE=debug
    ./xmlchange JOB_WALLCLOCK_TIME=00:30:00
    
    cat > user_nl_elm <<EOF
flandusepftdat = '/global/homes/c/cdkoven/scratch/inputdata/fates_landuse_pft_map_4x5_20240206.nc'
use_fates_luh = .true.
use_fates_nocomp = .true.
use_fates_fixed_biogeog = .true.
fates_paramfile = '${basedir}/${CASE_NAME}/fates_params_default_${GITHASH2}.nc'
use_fates_sp = .false.
fates_spitfire_mode = 1
fates_harvest_mode = 'no_harvest'
use_fates_potentialveg = .true.
fluh_timeseries = ''
EOF

elif [ "$STAGE" = "STEP_TWO" ]; then

    ./xmlchange RUN_STARTDATE=0001-01-01
    ./xmlchange RESUBMIT=0
    ./xmlchange NTASKS=-3
    ./xmlchange STOP_N=1
    ./xmlchange REST_N=1
    ./xmlchange STOP_OPTION=nyears
    ./xmlchange JOB_WALLCLOCK_TIME=00:30:00
    
    cat > user_nl_elm <<EOF
flandusepftdat = '/global/homes/c/cdkoven/scratch/inputdata/fates_landuse_pft_map_4x5_20240206.nc'
use_fates_luh = .true.
use_fates_nocomp = .true.
use_fates_fixed_biogeog = .true.
fates_paramfile = '${basedir}/${CASE_NAME}/fates_params_default_${GITHASH2}.nc'
use_fates_sp = .false.
fates_spitfire_mode = 1
fates_harvest_mode = 'no_harvest'
use_fates_potentialveg = .true.
fluh_timeseries = ''
finidat = '/global/homes/c/cdkoven/scratch/e3sm_scratch/pm-cpu/${PRIOR_CASE}_${GITHASH1}_${GITHASH2}/run/${PRIOR_CASE}_${GITHASH1}_${GITHASH2}.elm.r.0002-01-01-00000.nc'
EOF

fi


./case.setup
./case.build
./case.submit

glemieux commented:

@ckoven I ran the above script on Perlmutter using e3sm commit bde8cf51ab and fates tag sci.1.77.2_api.36.0.0. I was able to successfully run both stages without issue. For good measure, I also adapted the script for CTSM and tested it on Derecho without issue as well.

Per our discussion this morning, would you point me to Rosie's parameter file that you were using so that I can test that next?


ckoven commented Aug 14, 2024

@glemieux the updated parameter file is in commit 391d6e9


glemieux commented Aug 14, 2024

@ckoven I'm getting passing cases for both stages using the updated parameter file with the same e3sm commit and fates tag as noted above.

I noticed that the fluh_timeseries is being set to '' in stage 2. Is that correct for this test?

perlmutter scratch locations:

/global/homes/g/glemieux/scratch/e3sm_scratch/pm-cpu/replicate_RFparam_step1_bde8cf51ab_240366de
/global/homes/g/glemieux/scratch/e3sm_scratch/pm-cpu/replicate_RFparam_step2_bde8cf51ab_240366de


ckoven commented Aug 21, 2024

Just some updates on this. This is a bit more maddening a bug than I had originally perceived. I've now been able to replicate it a few different ways, and it is possibly related to #1237.

The closest-to-main configuration I've gotten is via tag 731bf6e, which adds three canopy layers and Rosie's parameter file on top of a recent-ish main. If I run the above script, the step-one part works fine. Step two fails in one of two ways, depending on whether or not DEBUG=TRUE in the xml files. In debug mode, it crashes on reading the restart file with the following:

270:  Warning. small nocomp patch wasnt able to find another patch to fuse with.
270:            0           0  4.157895100850780E-003
 30: forrtl: error (65): floating invalid
 30: Image              PC                Routine            Line        Source             
 30: libpthread-2.31.s  0000154235BEF910  Unknown               Unknown  Unknown
 30: e3sm.exe           000000000421D239  twostreammlpemod_         369  TwoStreamMLPEMod.F90
 30: e3sm.exe           0000000004225D4C  twostreammlpemod_         562  TwoStreamMLPEMod.F90
 30: e3sm.exe           000000000420F9C8  fatestwostreamuti         338  FatesTwoStreamUtilsMod.F90
 30: e3sm.exe           00000000041FA476  fatesradiationdri         432  FatesRadiationDriveMod.F90
 30: e3sm.exe           0000000000BD93ED  elmfatesinterface        2220  elmfates_interfaceMod.F90
 30: e3sm.exe           00000000009C446F  elm_driver_mp_elm         722  elm_driver.F90
 30: e3sm.exe           000000000096124B  lnd_comp_mct_mp_l         617  lnd_comp_mct.F90
 30: e3sm.exe           00000000004AE943  component_mod_mp_         757  component_mod.F90
 30: e3sm.exe           0000000000469F0A  cime_comp_mod_mp_        2963  cime_comp_mod.F90
 30: e3sm.exe           00000000004971EB  MAIN__                    153  cime_driver.F90
 30: e3sm.exe           0000000000433CFD  Unknown               Unknown  Unknown
 30: libc-2.31.so       000015423563C24D  __libc_start_main     Unknown  Unknown
 30: e3sm.exe           0000000000433C2A  Unknown               Unknown  Unknown

If debug is not true, then it makes it several months into the run before crashing with a longer error message, the first few lines of which are below. This is the same failure mode that goes away in the longer continuous runs if I apply the promotion-reordering fix in #1237. But the really weird part is that if I apply the promotion-reordering fix in #1237 to this script, then the model crashes in the way I originally encountered in this bug.

141:  Total canopy flux balance not closing in TwoStrteamMLPEMod:Solve
141:  Relative Error, delta/(Rbeam_atm+Rdiff_atm) :  0.895974502301465     
141:  Max Error:   1.000000000000000E-006
141:  ib:            1
141:  scattering coeff:   0.178200000000000     
141:  Breakdown:           2
141:                 1           1
141:        0.164447326258415               12  0.467184519382494     
141:        0.391912784682456       0.929927898941276       0.966422554120326     

So, TL;DR, this does not seem to be related to the grazing code. I will try to go back closer to main on this: first to see whether the crash in debug mode happens with just two potential canopy layers but the modified parameter file, and also to see whether it happens on main with default parameters.


ckoven commented Aug 21, 2024

OK, I looked more into the minimal set of things needed to trigger this. I get the same failure mode (crash during initialization from a prior, otherwise-identical, one-year run's restart file when debug is set to true) with nclmax=2 and the modified parameter file. It doesn't happen on main with the default parameter file. But if I set the parameter file to use two-stream, then it happens. So somehow that seems to be either causing the crash or exposing an error generated somewhere else.


ckoven commented Aug 21, 2024

One thing that doesn't make sense to me is how, when debug=true in the xml file, it sees a floating invalid on line 369, but when debug=false in the xml file, it doesn't get trapped by the NaN check on line 375. @rgknox, do you have any idea how that might happen?


rgknox commented Aug 21, 2024

That NaN check is inside a local debug block, which is only active when the debug logical in the file is hard-coded to true:

https://github.com/NGEET/fates/blob/main/radiation/TwoStreamMLPEMod.F90#L36

Huh... but yeah, it's set to true. I suppose that routine is not doing what it is supposed to be doing.

I wonder if the simple equivalency test that we commented out will catch it:

https://github.com/NGEET/fates/blob/main/radiation/TwoStreamMLPEMod.F90#L374
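
For reference, here is a minimal, self-contained sketch of the two styles of check being discussed (hypothetical variable name; this is not the actual FATES code). A NaN is the only value that compares unequal to itself, so the commented-out "equivalency test" can catch it even when a guarded debug path misses it. Note also that aggressive optimization flags can fold x /= x to .false., which is one way such a check can silently stop working:

program nan_check_sketch
  ! Sketch only: demonstrates the self-inequality test vs ieee_is_nan().
  use, intrinsic :: ieee_arithmetic, only : ieee_is_nan, ieee_value, ieee_quiet_nan
  implicit none
  real(8) :: lambda1_beam

  ! Stand-in for a bad solver result.
  lambda1_beam = ieee_value(lambda1_beam, ieee_quiet_nan)

  if (lambda1_beam /= lambda1_beam) then
     write(*,*) 'equivalency test: lambda1_beam is NaN'
  end if
  if (ieee_is_nan(lambda1_beam)) then
     write(*,*) 'ieee_is_nan: lambda1_beam is NaN'
  end if
end program nan_check_sketch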


ckoven commented Aug 21, 2024

I wonder if the simple equivalency test that we commented out will catch it:

Thanks, I can try that.


ckoven commented Aug 21, 2024

OK that worked to catch the error without setting debug in the xml file:

206:  GETRDN
206:    27.5512011283433     
206:   0.826620635513254     
206:   0.000000000000000E+000
206:   0.000000000000000E+000
206:   0.000000000000000E+000  0.000000000000000E+000
206:   0.000000000000000E+000  0.000000000000000E+000
206:                      NaN  0.000000000000000E+000
206:    44.7335246941840     
206:    51.3457894877309     
206:    1.00000000000000     
206:    1.00000000000000     
206:  ENDRUN:
206:  ERROR in TwoStreamMLPEMod.F90 at line 388                                      
206:                                                                                 
206:                                                                                 
206:                                                                                 
206:                                                                                 
206:                                                                                 
206:                                        
206:  ERROR: Unknown error submitted to shr_abort_abort.
206: Image              PC                Routine            Line        Source             
206: e3sm.exe           0000000001455CBD  shr_abort_mod_mp_         114  shr_abort_mod.F90
206: e3sm.exe           0000000000FD65B3  twostreammlpemod_         258  TwoStreamMLPEMod.F90
206: e3sm.exe           0000000000FD703F  twostreammlpemod_         562  TwoStreamMLPEMod.F90
206: e3sm.exe           0000000000FD3AA0  fatestwostreamuti         338  FatesTwoStreamUtilsMod.F90
206: e3sm.exe           0000000000FD1064  fatesradiationdri         432  FatesRadiationDriveMod.F90
206: e3sm.exe           00000000005F4950  elmfatesinterface        2220  elmfates_interfaceMod.F90
206: e3sm.exe           00000000005725D1  elm_driver_mp_elm         722  elm_driver.F90
206: e3sm.exe           0000000000558570  lnd_comp_mct_mp_l         617  lnd_comp_mct.F90
206: e3sm.exe           000000000045D42E  component_mod_mp_         757  component_mod.F90
206: e3sm.exe           0000000000436D14  cime_comp_mod_mp_        2963  cime_comp_mod.F90
206: e3sm.exe           000000000045D0C2  MAIN__                    153  cime_driver.F90
206: e3sm.exe           000000000043447D  Unknown               Unknown  Unknown
206: libc-2.31.so       000014D219E3C24D  __libc_start_main     Unknown  Unknown
206: e3sm.exe           00000000004343AA  Unknown               Unknown  Unknown
206: MPICH ERROR [Rank 206] [job id 29651724.0] [Wed Aug 21 15:40:26 2024] [nid006876] - Abort(1001) (rank 206 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 1001) - process 206
206: 
206: aborting job:
206: application called MPI_Abort(MPI_COMM_WORLD, 1001) - process 206


ckoven commented Aug 21, 2024

Another thing that I see and am not sure how to interpret: I thought this was happening during the first timestep, but it isn't; it's a bit afterwards, as seen in the lnd.log:

dtime_sync=         1800  dtime_elm=         1800  mod =            0
 Beginning timestep   : 0001-01-01_00:00:00
 --WARNING-- skipping CN balance check for first timestep
 --WARNING-- skipping CN balance check for first timestep
    Completed timestep: 0001-01-01_00:00:00
 Beginning timestep   : 0001-01-01_00:30:00
 --WARNING-- skipping CN balance check for first timestep
 FATES dynamics start
 FATES dynamics complete
 --WARNING-- skipping CN balance check for first timestep
    Completed timestep: 0001-01-01_00:30:00
 Beginning timestep   : 0001-01-01_01:00:00
    Completed timestep: 0001-01-01_01:00:00
<CRASH>


ckoven commented Aug 21, 2024

To add further confusion, I tried adding a NaN check for scelb%lambda1_diff, the variable showing up as a NaN above, in the one place it appears on the LHS (via ckoven@e4612c0), but it didn't catch anything. :(


rgknox commented Aug 21, 2024

That scattering element has nonsense values in it anyway, so I'm not surprised it's generating nonsense results for a scattering solution. For instance, it has a beam optical depth of exactly zero, which should be impossible if there is material to scatter.

Here is where the calculation is made:
https://github.com/NGEET/fates/blob/main/radiation/TwoStreamMLPEMod.F90#L984

My best guess is that we have a situation where there are cohorts with no stem or leaf area. No leaf area is normal (e.g., deciduous trees), but no stem area should be impossible... unless this uses the new grass allometry that has no SAI or LAI?

I know that line 984 above does divide by the leaf and stem area, so the result should be an inf or a NaN, but it's possible the compiler is protecting the division because it is a 0 divided by a 0.
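
A hedged sketch of the 0/0 hazard being described (hypothetical names; this is not the actual line 984). Under IEEE rules 0/0 yields NaN, but an optimizer can rewrite the expression so no invalid-operation trap ever fires; an explicit near-zero guard makes the fallback intentional rather than accidental:

function safe_kb(numerator, lai, sai) result(kb)
  ! Sketch only: guard a leaf+stem-area division against 0/0.
  implicit none
  integer, parameter :: r8 = selected_real_kind(12)
  real(r8), intent(in) :: numerator, lai, sai
  real(r8) :: kb
  real(r8), parameter :: nearzero = 1.0e-30_r8  ! assumed tolerance
  real(r8), parameter :: kb_air   = 0.5_r8      ! nominal air-element Kb from this thread

  if (lai + sai > nearzero) then
     kb = numerator / (lai + sai)
  else
     kb = kb_air   ! no material to scatter: treat as an air element
  end if
end function safe_kb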


ckoven commented Aug 21, 2024

Thanks @rgknox. So if I work upstream, setting traps to try to identify scattering elements where lai and sai are both zero, that might help identify what is going on?


rgknox commented Aug 22, 2024

I would put a trap right after the lai and sai of the element are calculated; if the total is less than nearzero, I would fail and print out the cohort involved (a hedged sketch follows at the end of this comment).

Like here:
https://github.com/NGEET/fates/blob/main/radiation/FatesTwoStreamUtilsMod.F90#L213

Are you using Xiulin's new grass allometry? That could create a cohort with no leaf or stem area, I think. If that is the case, we just need to put in a provision when we are creating the elements to make that element an "air" element.
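
Something like the following fragment, as a sketch only (variable names are taken from the diagnostics earlier in this thread; the endrun/errMsg abort follows the usual FATES pattern, and the placement would be just after the element's lai and sai are set at the link above):

if (scelg%lai + scelg%sai < nearzero) then
   ! Zero-area scattering element: should have been converted to an air element.
   write(log_unit,*) 'two-stream element has near-zero lai+sai'
   write(log_unit,*) ' pft: ', scelg%pft
   write(log_unit,*) ' lai: ', scelg%lai, ' sai: ', scelg%sai
   call endrun(msg='zero-area scattering element '//errMsg(sourcefile, __LINE__))
end if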


ckoven commented Aug 22, 2024

Thanks again @rgknox. OK, I added some more diagnostics to the error message it was crashing at before, because I noticed that if VAI + LAI was near zero, the logic just above the line you point to should already have made the pft of the scattering element the air_ft. So anyway, this seems even weirder.

206:  GETRDN
206:    27.5512011283433     
206:   0.826620635513254     
206:   0.000000000000000E+000
206:   0.000000000000000E+000
206:   0.000000000000000E+000  0.000000000000000E+000
206:   0.000000000000000E+000  0.000000000000000E+000
206:                      NaN  0.000000000000000E+000
206:    44.7335246941840     
206:    51.3457894877309     
206:    1.00000000000000     
206:    1.00000000000000     
206:  ican           1
206:  icol          10
206:  ib           1
206:  scelg%pft           5
206:  scelg%lai   1.99931012632750     
206:  scelg%sai  0.184394986794637     
206:  ENDRUN:
206:  ERROR in TwoStreamMLPEMod.F90 at line 394                                      
206:                                                                                 
206:                                                                                 
206:                                                                                 
206:                                                                                 
206:                                                                                 
206:                                        
206:  ERROR: Unknown error submitted to shr_abort_abort.
206: Image              PC                Routine            Line        Source             
206: e3sm.exe           000000000145613D  shr_abort_mod_mp_         114  shr_abort_mod.F90
206: e3sm.exe           0000000000FD6802  twostreammlpemod_         258  TwoStreamMLPEMod.F90
206: e3sm.exe           0000000000FD728F  twostreammlpemod_         568  TwoStreamMLPEMod.F90
206: e3sm.exe           0000000000FD3AA0  fatestwostreamuti         338  FatesTwoStreamUtilsMod.F90
206: e3sm.exe           0000000000FD1064  fatesradiationdri         432  FatesRadiationDriveMod.F90
206: e3sm.exe           00000000005F4950  elmfatesinterface        2220  elmfates_interfaceMod.F90
206: e3sm.exe           00000000005725D1  elm_driver_mp_elm         722  elm_driver.F90
206: e3sm.exe           0000000000558570  lnd_comp_mct_mp_l         617  lnd_comp_mct.F90
206: e3sm.exe           000000000045D42E  component_mod_mp_         757  component_mod.F90
206: e3sm.exe           0000000000436D14  cime_comp_mod_mp_        2963  cime_comp_mod.F90
206: e3sm.exe           000000000045D0C2  MAIN__                    153  cime_driver.F90
206: e3sm.exe           000000000043447D  Unknown               Unknown  Unknown
206: libc-2.31.so       000014E2CB83C24D  __libc_start_main     Unknown  Unknown
206: e3sm.exe           00000000004343AA  Unknown               Unknown  Unknown


rgknox commented Aug 22, 2024

The vai that it is reporting on the third line is the integrated vegetation depth (a top-down coordinate) at which it is being asked to report downwelling radiation in this element. So it is reporting downwelling radiation at the top of the element. The element itself does have non-zero lai and sai, so that is good and may help clear up some confusion. I'll look at this more and see if I can figure anything out.


rgknox commented Aug 22, 2024

Maybe the sun is at a very low inclination angle, which is why the Kb term is so high. Could that be making life difficult for the solver?

For instance, we have exponentials containing Kb * LAI, which could create math operations of e^(±100) or greater. Maybe we just need to cap Kb... Even e^(±300) is still representable in an 8-byte real, but terms like that could still make life tough on the matrix inversion.

https://github.com/NGEET/fates/blob/main/radiation/TwoStreamMLPEMod.F90#L506

Try reducing kb_max to something like 10?
https://github.com/NGEET/fates/blob/main/radiation/TwoStreamMLPEMod.F90#L74
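
A self-contained sketch of why the cap helps (the Kb and vai values are taken from the failing element above; kb_max = 10 is the experiment being suggested, not a validated setting). Neither exponential overflows, but their spread is what strains the solve:

program kb_condition_sketch
  ! Sketch only: shows the dynamic range of exp(+-Kb*vai) before/after capping.
  implicit none
  integer, parameter :: r8 = selected_real_kind(12)
  real(r8), parameter :: kb_max = 10._r8   ! proposed experimental cap
  real(r8) :: kb, vai

  kb  = 27.3_r8   ! Kb from the failing element (low sun angle)
  vai = 2.0_r8    ! roughly lai + sai for these cohorts
  write(*,*) 'uncapped: ', exp(-kb*vai), exp(kb*vai)   ! ~2e-24 vs ~5e+23
  kb = min(kb, kb_max)
  write(*,*) 'capped:   ', exp(-kb*vai), exp(kb*vai)   ! ~2e-9  vs ~5e+8
end program kb_condition_sketch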

But I think it would be good to also get more info on why some of these other terms are zero. For instance, let's look at everything that builds the "Ad" term...

https://github.com/NGEET/fates/blob/main/radiation/TwoStreamMLPEMod.F90#L1245C11-L1266C82

Can you add print statements to your current fail statement that include more of the parameters from the lines above:

write(log_unit,*) "om = " ,scelb%om
write(log_unit,*) "betab = ",scelb%betab
write(log_unit,*) "Kd = ",scelg%Kd
write(log_unit,*) "betad = ",scelb%betad
write(log_unit,*)  "Rbeam0 = ", scelb%Rbeam0
write(log_unit,*)  "b2 = ", -(scelg%Kd*(1._r8-scelb%om)*(1._r8-2._r8*scelb%betab)+scelg%Kb) * &
               scelb%om*scelg%Kb*scelb%Rbeam0
write(log_unit,*)  "b1 = ",  -(scelg%Kd*(1._r8-scelb%om+2._r8*scelb%om*scelb%betad) + &
               (1._r8-2._r8*scelb%betab)*scelg%Kb) * &
               scelb%om*scelg%Kb*scelb%Rbeam0
write(log_unit,*) "ncols = ",this%n_col(:)


ckoven commented Aug 22, 2024

Thanks @rgknox! The tag that has all these diagnostics is dc16e85.


rgknox commented Aug 22, 2024

In the spirit of this very special issue number, I'd like to remind everyone of this scene from Spaceballs:

https://www.youtube.com/watch?v=a6iW-8xPw3k

This is also a good reminder about password quality while we are at it.


rgknox commented Aug 28, 2024

So I dumped all the cohorts when the error is generated (in the GetRdDn() function), and found two things that I thought were quite strange...

  1. This is the fifth of five cohorts. According to the two-stream code, there is only one canopy layer, and the fractional area of those cohorts sums to exactly 1. This should be incredibly unlikely: the area of an upper canopy layer should sum to exactly 1 only when there are one or more layers below it, and if there really is only one layer, it is unlikely that the cohorts "exactly" fill it up, so one element should be an air element... But why? @ckoven, is this possibly a new feature of the potential veg? Could there be exactly one canopy layer, perfectly full? If not, there must be some issue where we are incorrectly dropping the cohorts in the next layer from the scattering-element construct. I'll run a test where I dump out the patch info as well.

  2. The Rbeam0 value is 1.0 in the first four cohorts but 0 in the fifth, problematic cohort, even though the scattering coefficients are almost the same for all five. Rbeam0 is the unit downwelling beam radiation at the top of the cohort; for a single canopy layer this is trivial, and it should be 1.0. This value is calculated early in %Solve(), the core two-stream routine. So this tells me that, for some reason, the fifth cohort simply did not run %Solve() prior to this call. Since the scattering coefficients look normal-ish, I suppose everything but the solve was called...

          0           0  4.157895100850780E-003
286:  2S Error detected, rd_dn_top, aborting
286:  2S Error detected, rd_dn_top, aborting
286:  Fail in FatesRadationDriveMod:FatesSunShadeFracs
286:  cl,icol:           1           5
286:  Dumping Two-stream elements for band            1
286:  
286:  rbeam atm:    29.5579614957216     
286:  rdiff_atm:    42.9459994856711     
286:  alb grnd diff:   9.000000000000000E-002
286:  alb grnd beam:   9.000000000000000E-002
286:  cosz:   1.000000000000000E-003
286:  snow fraction:   0.000000000000000E+000
286:  lat:    6.00000000000000     
286:  lon:    100.000000000000     
286:  --           1           1 --
286:  pft:           1
286:  area:   3.562978277523072E-002
286:  lai,sai:   0.718175877697489       7.639246369664410E-002
286:  Kb:    27.2118427933894     
286:  Kb leaf:    30.0000000000000     
286:  Kd:   0.894163653939458     
286:  Rb0:    1.00000000000000     
286:  om:   0.173941877430036     
286:  betad:   0.581748361676105     
286:  betab:  0.514701010412914     
286:  a:   0.824160069016057     
286:  Unit RDiff Down @ bottom:   0.596150858145936     
286:  Unit RDiff Up @ bottom:   5.727878568803392E-002
286:  Unit Rbeam @ bottom:   4.072194874583690E-010
286:  --           1           2 --
286:  pft:           1
286:  area:   0.442848625209875     
286:  lai,sai:   0.706447156690642       7.538929105828590E-002
286:  Kb:    27.2036486058113     
286:  Kb leaf:    30.0000000000000     
286:  Kd:   0.894196739857111     
286:  Rb0:    1.00000000000000     
286:  om:   0.173953462315922     
286:  betad:   0.581799168375515     
286:  betab:  0.514706710772222     
286:  a:   0.824192626928066     
286:  Unit RDiff Down @ bottom:   0.602430852217088     
286:  Unit RDiff Up @ bottom:   5.794090738181571E-002
286:  Unit Rbeam @ bottom:   5.795298226319605E-010
286:  --           1           3 --
286:  pft:           1
286:  area:   0.498234570044878     
286:  lai,sai:   0.667889860777080       6.880463429500017E-002
286:  Kb:    27.2915035908341     
286:  Kb leaf:    30.0000000000000     
286:  Kd:   0.893842005141536     
286:  Rb0:    1.00000000000000     
286:  om:   0.173829253543993     
286:  betad:   0.581254085198995     
286:  betab:  0.514645772066091     
286:  a:   0.823843549515680     
286:  Unit RDiff Down @ bottom:   0.624984864491763     
286:  Unit RDiff Up @ bottom:   5.852228564702393E-002
286:  Unit Rbeam @ bottom:   1.854778680687521E-009
286:  --           1           4 --
286:  pft:           1
286:  area:   1.160590394903519E-002
286:  lai,sai:   0.706047587206787       7.060497184133467E-002
286:  Kb:    27.3636291292104     
286:  Kb leaf:    30.0000000000000     
286:  Kd:   0.893550781685804     
286:  Rb0:    1.00000000000000     
286:  om:   0.173727282955254     
286:  betad:   0.580806010432847     
286:  betab:  0.514596036238539     
286:  a:   0.823556964276842     
286:  Unit RDiff Down @ bottom:   0.605037438714731     
286:  Unit RDiff Up @ bottom:   5.796553503290591E-002
286:  Unit Rbeam @ bottom:   5.893313139001513E-010
286:  --           1           5 --
286:  pft:           1
286:  area:   1.168111802098158E-002
286:  lai,sai:   0.705831843410712       7.058318434107121E-002
286:  Kb:    27.3636363636364     
286:  Kb leaf:    30.0000000000000     
286:  Kd:   0.893550752475144     
286:  Rb0:   0.000000000000000E+000
286:  om:   0.173727272727273     
286:  betad:   0.580805965463108     
286:  betab:  0.514596031263025     
286:  a:   0.823556935531130     
286:  Unit RDiff Down @ bottom:  -1.213382324716519E+042
286:  Unit RDiff Up @ bottom:   3.336319600306448E-113
286:  Unit Rbeam @ bottom:   0.000000000000000E+000
286:  ENDRUN:
286:  ERROR in FatesRadiationDriveMod.F90 at line 467     


ckoven commented Aug 28, 2024

@rgknox that is very weird that there is only one exactly full canopy layer. I don't see how that could happen unless all understory cohorts are getting killed somehow. I don't know enough about the call sequence to understand how point 2 could have happened.


ckoven commented Aug 29, 2024

There is some discussion right now on the CLM call about the distinction between hybrid runs (equivalent to setting finidat to a prior run's restart file) and branch runs (equivalent to continuing from a restart file). Given that this crash happens only in the hybrid-type case, it might be helpful to know exactly what happens differently during a hybrid-case restart read versus a branch-case restart read.

Edit: noting that this is controlled by the nsrest variable.
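
A hedged sketch of the branching involved (the constant names and values follow my recollection of the CLM/ELM convention and should be checked against the actual elm_varctl module before relying on them):

program nsrest_sketch
  ! Sketch only: how startup/hybrid, continue, and branch runs are distinguished.
  implicit none
  integer, parameter :: nsrStartup  = 0  ! cold start, or hybrid start when finidat is set
  integer, parameter :: nsrContinue = 1  ! within-case restart
  integer, parameter :: nsrBranch   = 3  ! branch run
  integer :: nsrest

  nsrest = nsrStartup
  select case (nsrest)
  case (nsrStartup)
     ! Hybrid runs come through here: state is read via the finidat path,
     ! so cold-start-style initialization code can also be reachable.
     write(*,*) 'startup/hybrid path'
  case (nsrContinue, nsrBranch)
     ! State is read via the restart-file path.
     write(*,*) 'restart-file path'
  end select
end program nsrest_sketch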


ckoven commented Aug 29, 2024

Just as a sanity check, I ran a slightly modified version of the above script as both a RUN_TYPE=hybrid and a RUN_TYPE=branch case; the former crashed while the latter did not. I hadn't actually run a branch test before, only a restart within a case, so I wanted to confirm that doing so would in fact pass, which it did. So clearly there is some logic specific to this type of initialization from restart files that is generating this.


rgknox commented Sep 4, 2024

@ckoven, does this configuration modify, add, or remove any patches or cohorts during the restart procedure, aside from reading in the patch and cohort information from the restart file?

If so, we need to be careful to call update_3dpatch_radiation() after everything is settled:

https://github.com/E3SM-Project/E3SM/blob/master/components/elm/src/main/elmfates_interfaceMod.F90#L1926-L1928

That call to update_3dpatch_radiation() is where the scattering-element info is prepared and the solver is run for the first time. If we perform this action and then modify the patch structure in any way before exiting initialization, it will be problematic.
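
The constraint, as a runnable sketch with stub subroutines (the real calls live in elmfates_interfaceMod.F90 and the stub names are hypothetical; only the ordering matters here):

program restart_order_sketch
  ! Sketch only: scattering elements must be built after the last structural change.
  implicit none
  call read_restart_patches_and_cohorts()  ! state from the restart file
  call adjust_patch_structure()            ! hypothetical: any fusion/pruning after the read
  call update_3dpatch_radiation_stub()     ! must come last, before timestepping begins
contains
  subroutine read_restart_patches_and_cohorts()
  end subroutine read_restart_patches_and_cohorts
  subroutine adjust_patch_structure()
  end subroutine adjust_patch_structure
  subroutine update_3dpatch_radiation_stub()
  end subroutine update_3dpatch_radiation_stub
end program restart_order_sketch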


ckoven commented Sep 4, 2024

@rgknox I don't believe there is anything in the restart sequence that differs in this configuration from the standard full-FATES configuration.


rgknox commented Sep 4, 2024

It looks like in a hybrid run, the following initialization sequence, which is perhaps meant for a cold start, is run:

https://github.com/NGEET/fates/blob/main/main/EDInitMod.F90#L469

Is it possible that something was run twice, or inadvertently, and that is creating problems during the hybrid run?


ckoven commented Sep 4, 2024

Thanks @rgknox. Does this line here evaluate as .true. in a hybrid run? I don't think we should be entering the code block in that case.


rgknox commented Sep 4, 2024

I believe it would indeed evaluate as true, based on this:

https://github.com/E3SM-Project/E3SM/blob/master/components/elm/src/main/elmfates_interfaceMod.F90#L482-L487

I'd like to run a test to verify though...


ckoven commented Sep 6, 2024

@rgknox that section of code in set_site_properties isn't being triggered before the crash, I guess because the call sequence to it is within this block, which excludes hybrid and finidat='filename' cases: https://github.com/E3SM-Project/E3SM/blob/master/components/elm/src/main/elm_initializeMod.F90#L1023-L1037

EDIT: actually, I think it only excludes finidat, but not hybrid?


rgknox commented Sep 7, 2024

The finidat will not be ' ', so this block of code will not be triggered for either branch or hybrid runs. I'm just guessing, but my expectation was that we would not want cold-start-type initializations to happen in this situation; those initializations would happen during the restart() call in elmfates_interfaceMod.F90.

On a side note, I don't like that string compare of finidat == ' '. It seems awfully vulnerable that an unspecified finidat has to be empty with exactly one space. I feel like we should do a trim and a string-length count of one or less.
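
A sketch of the suggested test (one note: Fortran's == already pads the shorter string with blanks, so ' ' matches any all-blank value, but an explicit len_trim() makes the "unset" check self-documenting and independent of the default's contents):

program finidat_check_sketch
  ! Sketch only: treat any all-blank finidat as "not specified".
  implicit none
  character(len=256) :: finidat

  finidat = '   '   ! e.g., an unset namelist default with several blanks
  if (len_trim(finidat) == 0) then
     write(*,*) 'finidat not specified: cold-start-style initialization'
  else
     write(*,*) 'finidat specified: initialize from ', trim(finidat)
  end if
end program finidat_check_sketch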


ckoven commented Sep 30, 2024

Just wanted to note here that @rgknox's fixes on the two branches below appear to fix the issue.

https://github.com/rgknox/fates/tree/twostream-restart-bugfixes
https://github.com/rgknox/E3SM/tree/twostr-rest-bugfix
