Running MONC with DEPHY forcings on ARCHER2 #51
@sjboeing wrote 11/5/2021:
@MarkUoLeeds wrote 11/5/2021:
@sjboeing wrote 11/5/2021:
Having just spoken to @sjboeing about this: there are currently some issues with MPI-related crashes with "PTHREADS". @sjboeing, here's what I wrote to Nick Brown back in 2017 on getting more useful errors back from MPI in Fortran:
The change I suggested was never made, but I still think it would be very useful. What do you think? @cemac-ccs, I am wondering what you think about adding this (i.e. changing the error handling to …)
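The specific change is cut off above, but the usual way to get error codes returned from MPI in Fortran, rather than an immediate job abort, is to switch the communicator's error handler from the default MPI_ERRORS_ARE_FATAL to MPI_ERRORS_RETURN and then check the returned `ierr` after each call. A minimal sketch, assuming this is the change being discussed (this is not MONC's actual code):

```fortran
program mpi_err_demo
  use mpi
  implicit none
  integer :: ierr, reslen
  character(len=MPI_MAX_ERROR_STRING) :: errstring

  call MPI_Init(ierr)

  ! The default handler (MPI_ERRORS_ARE_FATAL) aborts the whole job on any
  ! MPI error; MPI_ERRORS_RETURN makes calls hand back an error code instead,
  ! so the caller can report something more useful than a bare SIGABRT.
  call MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN, ierr)

  ! ... MPI calls as usual; after each one, check ierr:
  if (ierr /= MPI_SUCCESS) then
    call MPI_Error_string(ierr, errstring, reslen, ierr)
    print *, 'MPI error: ', errstring(1:reslen)
  end if

  call MPI_Finalize(ierr)
end program mpi_err_demo
```

Note that MPI_ERRORS_RETURN only guarantees that errors are reported, not that the program is in a recoverable state afterwards, so this is mainly useful for diagnostics rather than error recovery.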
Update:

```
*** Error in `/lus/cls01095/work/n02/n02/sboeing/monc/./build/bin/monc_driver.exe': malloc(): memory corruption: 0x0000000001be1b30 ***

Program received signal SIGABRT: Process abort signal.

Backtrace for this error:
#0  0x2b9bdb12659f in ???
#1  0x2b9bdb126520 in ???
#2  0x2b9bdb127b00 in ???
#3  0x2b9bdb169956 in ???
#4  0x2b9bdb170172 in ???
#5  0x2b9bdb173548 in ???
#6  0x2b9bdb174fd6 in ???
#7  0x4d71c9 in ???
#8  0x4cc24b in ???
#9  0x488b14 in ???
#10 0x406480 in ???
#11 0x2b9bdb111349 in ???
#12 0x4064b9 in ???
	at ../sysdeps/x86_64/start.S:120
#13 0xffffffffffffffff in ???

mlx5: nid001878: got completion with error:
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000010 00000000 00000000 00000000
00000000 00008a12 0a00abe2 a48f5cd3
...

MPICH ERROR [Rank 126] [job id 277588.0] [Fri May 21 05:04:11 2021] [unknown] [nid001513] - Abort(136471695) (rank 126 in comm 0): Fatal error in PMPI_Test: Other MPI error, error stack:
PMPI_Test(205)................: MPI_Test(request=0x888e14, flag=0x7fffb184b42c, status=0x7fffb184b460) failed
MPIR_Test(85).................:
MPIR_Test_impl(39)............:
MPIDI_Progress_test(72).......:
MPIDI_OFI_handle_cq_error(902): OFI poll failed (ofi_events.h:904:MPIDI_OFI_handle_cq_error:Input/output error)

Program received signal SIGABRT: Process abort signal.
```
That's frustrating @sjboeing. The clouds are looking good though! Did you run this with openmpi or mvapich? edit: actually, I think that is a different issue that Chris was addressing there: #44
Yes, and I think the same is true for openmpi vs cray-mpich on ARCHER2.
Thanks
Chris
Chris Symonds | Software Development Scientist | CEMAC | School of Earth and Environment | University of Leeds

From: Leif Denby, sent 23 May 2021:

> That's frustrating @sjboeing. The clouds are looking good though!
> Did you run this with openmpi or mvapich? I was just wondering because I think @cemac-ccs said mvapich works better on ARC4
This is using Chris's scripts with minor modifications on ARCHER2 (so cray-mpich). One parameter that may need changing is the thread_pool number in the IO configuration, which is currently set to 500.
I've also had this MPICH error a few times now; did you happen to make any progress with it?
I'm creating this issue to track progress on getting EUREC4A cases defined through DEPHY forcings running on ARCHER2 (@sjboeing is doing this work primarily, not me).