MPI timeout (?) sometimes in Izumi nag tests #2800

Open
samsrabin opened this issue Sep 30, 2024 · 4 comments
Labels
bug something is working incorrectly

Comments

@samsrabin
Collaborator

samsrabin commented Sep 30, 2024

Brief summary of bug

Some Izumi nag tests intermittently fail in the run phase with "Warning: Floating underflow occurred" in cesm.log. Re-submitting usually fixes it.

General bug information

CTSM version you are using: ctsm5.3.002 (but this has happened to me before, not just with this tag)

Does this bug cause significantly incorrect results in the model's science? No?

Configurations affected: Izumi nag

Details of bug

Affected tests (today, that is):

ERI_D_Ld9_P48x1.f10_f10_mg37.I2000Clm50Sp.izumi_nag.clm-reduceOutput
ERP_D_Ld9.f10_f10_mg37.I1850Clm60BgcCrop.izumi_nag.clm-clm60cam7LndTuningModeLDust
ERS_D_Ld15.f45_f45_mg37.I2000Clm50FatesRs.izumi_nag.clm-FatesColdTwoStream
SMS_D_Ld5.f45_f45_mg37.I2000Clm60Fates.izumi_nag.clm-FatesCold

Unfortunately there's no useful traceback, so I'm not sure what's going on. However, it always happens after the "CTSM: end of main integration loop" message is printed in lnd.log.

@samsrabin added the bug and next labels on Sep 30, 2024
@ekluzek
Collaborator

ekluzek commented Oct 2, 2024

@samsrabin floating underflow is something that we should expect in CTSM as a natural occurrence. Something just got so small it was truncated to exactly zero. In some codes that could be a problem, but it's not something we manage. So the underflow isn't the real issue here.
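As a small illustration of the underflow behavior described above, here is a sketch in Python using ordinary IEEE 754 doubles. It is not CTSM code and it does not reproduce how the nag runtime reports the warning; it only shows a result becoming too small to represent and ending up as exactly zero.

```python
import sys

# Smallest positive normal double; values below this are subnormal.
tiny = sys.float_info.min

# The exact product (about 1e-400) is far below the smallest
# representable double, so the result is rounded to exactly zero.
product = 1e-200 * 1e-200

# Gradual underflow: halving the smallest normal value gives a
# nonzero subnormal number, stored with reduced precision.
subnormal = tiny / 2

# Halving the smallest subnormal value (about 5e-324) rounds to zero.
vanished = 5e-324 / 2

print(product, subnormal > 0.0, vanished)
```

None of these operations raises an error in Python; the value is silently flushed toward zero, which is why an underflow warning on its own says little about why a run failed.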

I think you are talking about the cases that close with MPI timeout launcher errors, as discussed for example in #1317, right? My suspicion is an MPI race condition that only happens randomly. There are also cases where the MPI timeout launcher error is a legit issue in the code.

I wanted to make sure we are talking about the same thing; if so, I think we should change the title. We can also talk about this tomorrow, since it has the next label on it.

@samsrabin
Collaborator Author

Yeah, that's what I'm talking about, although the messaging looks different. What I could do is just add the new messaging in a comment to that issue and close this one, so future searches will find it.

@ekluzek
Collaborator

ekluzek commented Oct 2, 2024

Sounds good. The tail of cesm.log for a case I just resubmitted looks like this:

[1] 208 at [0x000000000d143160], src/mpid/ch3/src/mpid_vc.c[110]
[1] 96 at [0x000000000d153480], src/util/procmap/local_proc.c[93]
[1] 96 at [0x000000000d143f10], src/util/procmap/local_proc.c[92]
[1] 208 at [0x000000000d153310], src/mpid/ch3/src/mpid_vc.c[110]
[1] 96 at [0x00000000084db040], src/util/procmap/local_proc.c[93]
[1] 96 at [0x00000000084daf40], src/util/procmap/local_proc.c[92]
[1] 504 at [0x0000000008212700], src/mpi/comm/commutil.c[328]
[1] 504 at [0x0000000008212460], src/mpi/comm/commutil.c[328]
[1] 504 at [0x0000000008211f20], src/mpi/comm/commutil.c[328]
[1] 208 at [0x000000000820f6b0], src/mpid/ch3/src/mpid_vc.c[110]
Warning: Floating underflow occurred
[[email protected]] HYDT_bscd_pbs_wait_for_completion (tools/bootstrap/external/pbs_wait.c:67): tm_poll(obit_event) failed with TM error 17002
[[email protected]] HYDT_bsci_wait_for_completion (tools/bootstrap/src/bsci_wait.c:23): launcher returned error waiting for completion
[[email protected]] HYD_pmci_wait_for_completion (pm/pmiserv/pmiserv_pmci.c:218): launcher returned error waiting for completion
[[email protected]] main (ui/mpich/mpiexec.c:340): process manager error waiting for completion

If I search for issues or PRs with "launcher returned error waiting for completion" I find a few that cover it.
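For triage, the same search string can be used to tell this failure apart from a run that only printed the underflow warning. The following is a minimal sketch, assuming the tail of cesm.log is available as a list of lines; `classify_log_tail` and its labels are hypothetical names for illustration, not part of CIME or CTSM. The sample lines are copied from the log tail above, with the bracketed host prefix omitted.

```python
# Strings taken from the cesm.log tail quoted in this thread.
LAUNCHER_SIGNATURE = "launcher returned error waiting for completion"
UNDERFLOW_WARNING = "Warning: Floating underflow occurred"


def classify_log_tail(lines):
    """Return a coarse label for the tail of a cesm.log (hypothetical helper)."""
    has_launcher_error = any(LAUNCHER_SIGNATURE in line for line in lines)
    has_underflow = any(UNDERFLOW_WARNING in line for line in lines)
    # The launcher error takes priority: the underflow warning is
    # expected and is not treated as the cause of the failure.
    if has_launcher_error:
        return "launcher-error"
    if has_underflow:
        return "underflow-warning-only"
    return "other"


sample_tail = [
    "[1] 208 at [0x000000000820f6b0], src/mpid/ch3/src/mpid_vc.c[110]",
    "Warning: Floating underflow occurred",
    "HYDT_bsci_wait_for_completion (tools/bootstrap/src/bsci_wait.c:23): "
    "launcher returned error waiting for completion",
]

print(classify_log_tail(sample_tail))
```

A label of "launcher-error" only says the log matches the signature discussed here; it does not distinguish a random MPI failure from a legitimate problem in the code.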

@samsrabin
Collaborator Author

Actually, I'm just going to leave this one open. The other issue looked like it was a consistent thing, whereas now it's random. I'll change the title.

@samsrabin changed the title from "Floating underflow in Izumi nag tests" to "MPI timeout (?) sometimes in Izumi nag tests" on Oct 2, 2024
@samsrabin removed the next label on Oct 2, 2024