FATES landuse default fluh_timeseries crashes FatesColdLUH test on izumi #1224

Closed
glemieux opened this issue Jul 16, 2024 · 2 comments

glemieux commented Jul 16, 2024

This was discovered while testing #1223. The FatesColdLUH2 test in the fates suite fails at the RUN phase, very early in the run. From the lnd.log file, it looks like the finundated read upper bound step isn't reporting the correct file that it is reading from, but I think that might be a red herring. Note that this doesn't appear to be an issue on derecho or perlmutter.

I can confirm that switching the fluh_timeseries to an older file with a shorter time length does not present this issue. That said, file size does not appear to be the cause, based on running the test case with a copy of the same file truncated to a shorter time length. I will also note that the older file uses the classic netCDF format, whereas the newer file is cdf5. I'm not sure how relevant that is, though, as the flandusepftdat file used in this test does not present an issue when used in conjunction with the older fluh_timeseries file.
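For anyone who wants to double-check the classic versus cdf5 distinction mentioned above, here is a minimal sketch that reads the magic bytes at the start of a file. The function name `netcdf_format` and the labels it returns are my own for illustration and are not part of CTSM, FATES, or the land use data tool. It only inspects the first 8 bytes, so it assumes the signature sits at offset 0.

```python
# Hypothetical helper: classify a netCDF file by its leading magic bytes.
# Classic-family files start with b"CDF" followed by a version byte;
# netCDF-4 files are HDF5 files and start with the HDF5 signature.
_CDF_VERSIONS = {
    1: "classic (CDF-1)",
    2: "64-bit offset (CDF-2)",
    5: "64-bit data (CDF-5)",
}
_HDF5_SIGNATURE = b"\x89HDF\r\n\x1a\n"


def netcdf_format(path):
    """Return a short label for the on-disk format of the file at `path`."""
    with open(path, "rb") as f:
        header = f.read(8)
    # header[3] is an int in Python 3, matching the keys of _CDF_VERSIONS.
    if len(header) >= 4 and header[:3] == b"CDF":
        return _CDF_VERSIONS.get(header[3], "unknown CDF version")
    # Assumption: only the signature at offset 0 is checked.
    if header == _HDF5_SIGNATURE:
        return "netCDF-4 (HDF5)"
    return "not a recognized netCDF file"
```

Calling `netcdf_format` on the older and newer fluh_timeseries files should show whether they really differ in format. If the netCDF utilities are installed, `ncdump -k <file>` reports the same kind of information.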

It is possible that the newer file, which was generated with the fates land use tool, introduces an issue because of an update made to the tool since the original default was created (the original file was created when the tool was still part of the fates repository). Issue NGEET/tools-fates-landusedata#5 tracks the investigation of potential causes on that side.

The log file results are below:

lnd.log

successfully initialized sdat
(shr_strdata_readstrm) opening   : /fs/cgd/csm/inputdata/lnd/clm2/paramdata/finundated_inversiondata_0.9x1.25_c170706.nc
(shr_strdata_readstrm) setting pio descriptor : /fs/cgd/csm/inputdata/lnd/clm2/paramdata/finundated_inversiondata_0.9x1.25_c170706.nc
(shr_strdata_set_stream_iodesc) setting iodesc for : FWS_TWS_A with dimlens(1), dimlens(2) =      288       192   variable as time dimension time
(shr_strdata_readstrm) reading file lb: /fs/cgd/csm/inputdata/lnd/clm2/paramdata/finundated_inversiondata_0.9x1.25_c170706.nc       1
(shr_strdata_readstrm) reading file ub: /fs/cgd/csm/inputdata/lnd/clm2/

cesm.log

Obtained 10 stack frames.
/scratch/cluster/glemieux/ctsm-tests/tests_0716-152356iz/ERS_D_Ld3.f45_f45_mg37.I2000Clm50FatesCruRsGs.izumi_nag.clm-FatesColdLUH2.0716-152356iz/bld/cesm.exe() [0x336f214]
/scratch/cluster/glemieux/ctsm-tests/tests_0716-152356iz/ERS_D_Ld3.f45_f45_mg37.I2000Clm50FatesCruRsGs.izumi_nag.clm-FatesColdLUH2.0716-152356iz/bld/cesm.exe() [0x336f748]
/scratch/cluster/glemieux/ctsm-tests/tests_0716-152356iz/ERS_D_Ld3.f45_f45_mg37.I2000Clm50FatesCruRsGs.izumi_nag.clm-FatesColdLUH2.0716-152356iz/bld/cesm.exe() [0x336fcc8]
/scratch/cluster/glemieux/ctsm-tests/tests_0716-152356iz/ERS_D_Ld3.f45_f45_mg37.I2000Clm50FatesCruRsGs.izumi_nag.clm-FatesColdLUH2.0716-152356iz/bld/cesm.exe() [0x33727f9]
/scratch/cluster/glemieux/ctsm-tests/tests_0716-152356iz/ERS_D_Ld3.f45_f45_mg37.I2000Clm50FatesCruRsGs.izumi_nag.clm-FatesColdLUH2.0716-152356iz/bld/cesm.exe(PIOc_openfile+0x11) [0x336e611]
/scratch/cluster/glemieux/ctsm-tests/tests_0716-152356iz/ERS_D_Ld3.f45_f45_mg37.I2000Clm50FatesCruRsGs.izumi_nag.clm-FatesColdLUH2.0716-152356iz/bld/cesm.exe() [0x33230e9]
/scratch/cluster/glemieux/ctsm-tests/tests_0716-152356iz/ERS_D_Ld3.f45_f45_mg37.I2000Clm50FatesCruRsGs.izumi_nag.clm-FatesColdLUH2.0716-152356iz/bld/cesm.exe() [0xa2a117]
/scratch/cluster/glemieux/ctsm-tests/tests_0716-152356iz/ERS_D_Ld3.f45_f45_mg37.I2000Clm50FatesCruRsGs.izumi_nag.clm-FatesColdLUH2.0716-152356iz/bld/cesm.exe() [0xaa737f]
/scratch/cluster/glemieux/ctsm-tests/tests_0716-152356iz/ERS_D_Ld3.f45_f45_mg37.I2000Clm50FatesCruRsGs.izumi_nag.clm-FatesColdLUH2.0716-152356iz/bld/cesm.exe() [0xb3ef07]
/scratch/cluster/glemieux/ctsm-tests/tests_0716-152356iz/ERS_D_Ld3.f45_f45_mg37.I2000Clm50FatesCruRsGs.izumi_nag.clm-FatesColdLUH2.0716-152356iz/bld/cesm.exe() [0xa8a293]
[cli_1]: aborting job:
application called MPI_Abort(MPI_COMM_WORLD, -1) - process 1
[[email protected]] HYDT_bscd_pbs_wait_for_completion (tools/bootstrap/external/pbs_wait.c:67): tm_poll(obit_event) failed with TM error 17002
[[email protected]] HYDT_bsci_wait_for_completion (tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion

ekluzek commented Jul 16, 2024

@glemieux that last bit in the error about the launcher is something that tells me to resubmit, and usually it resolves itself. I think I've only ever had to resubmit one more time for it to resolve.

But if you are getting this consistently on every submission (try a good four times or so), it must mean something real. The first thing that springs to mind is to try the intel and gnu compilers. You might also try with fewer processors.

Hmmm....

@glemieux

Closing to move this to CTSM as it is either an issue there or an issue with the fates land use data tool.
See ESCOMP/CTSM#2653 and NGEET/tools-fates-landusedata#5, respectively.
