Running MONC with DEPHY forcings on ARCHER2 #51

Open · leifdenby opened this issue May 20, 2021 · 10 comments

@leifdenby
Collaborator

leifdenby commented May 20, 2021

I'm creating this issue to track progress in getting the EUREC4A cases defined through DEPHY forcings running on ARCHER2 (@sjboeing is doing this work primarily, not me).

@leifdenby
Collaborator Author

@sjboeing wrote 11/5/2021:

Just a quick heads up to let you know that the current version of MONC with CASIM and changes for DEPHY seems to be running successfully on ARCHER2. Thanks all for the hard work: I will do some more testing and then bring this onto the git repository.

In terms of the setup, I am currently using 108 compute cores and 18 IO cores per node on ARCHER2. This is a ratio of 6:1, and allows us to make use of almost the entire node. Grids that have a multiple of 54 grid points in each dimension (factors of 3*3*3*2*2 for the FFTs) should be a good match. I was running a 108*108 domain with 200 m grid spacing yesterday, and will be looking to upgrade this to 540*540 @ 150 m grid spacing soon. Eventually, we will want something more like 2160*2160 @ 100 m (after initial validation, decisions on aerosols/CDNC, integration of SOCRATES, and resolution of a pending issue on surface pressure).

@leifdenby
Collaborator Author

@MarkUoLeeds wrote 11/5/2021:

I find it really odd that there is no multiple of "NUMA regions" in your calculations. On ARCHER2 there are 8 NUMA regions, each with 16 cores; 15 MONCs per IO would fit that well. I understand there is a requirement for the grids to match the MPI decomposition, and 120 MONC procs might not work. Perhaps 8 IOs per node is also too low for your resolution? Your factors seem to relate to 9s.
Did you try the recommended (by me) 15 MONCs per IO?
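
To make the arithmetic in the two comments above concrete, here is a small standalone sketch (illustrative only, not MONC or run-script code; the 128-core node size is ARCHER2's standard compute node) showing how many MONC and IO cores each MONC:IO ratio gives per node:

```fortran
! Illustrative sketch (not MONC code): for a 128-core ARCHER2 node, show how
! many compute (MONC) and IO cores a given MONC:IO ratio yields, and how many
! cores are left idle.
program core_split
  implicit none
  integer, parameter :: cores_per_node = 128
  integer, parameter :: ratios(3) = (/ 6, 15, 16 /)
  integer :: i, ratio, groups, moncs, ios

  do i = 1, size(ratios)
    ratio  = ratios(i)
    groups = cores_per_node / (ratio + 1)   ! each group is <ratio> MONCs plus 1 IO core
    moncs  = groups * ratio
    ios    = groups
    write(*, '(i3, a, i4, a, i3, a, i3, a)') ratio, ':1 ratio -> ', moncs, &
         ' MONC cores, ', ios, ' IO cores, ', cores_per_node - moncs - ios, ' idle'
  end do
end program core_split
```

With a 6:1 ratio this reproduces the 108 MONCs + 18 IO cores above (2 cores idle), while 15:1 gives 120 MONCs + 8 IO cores, i.e. one IO server per 16-core NUMA region and no idle cores.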

@leifdenby
Collaborator Author

@sjboeing wrote 11/5/2021:

I thought I would keep a small number of MONCs per IO at first, but possibly keeping things in the same NUMA region is more important, as you say (I find it hard to predict things with the IO server). I will give it a go; in that case we can use domain sizes that are a multiple of 120, which would be nice.

@leifdenby
Collaborator Author

leifdenby commented May 20, 2021

Having just spoken to @sjboeing about this, there are currently some issues with MPI-related crashes in the IO server ("Pthreads error in IO server, error code=-2"). (@sjboeing could you add a few more details below?)

@sjboeing here's what I wrote to Nick Brown back in 2017 on getting more useful errors back from MPI in Fortran:

As regards getting better error messages when MONC fails with an MPI-related error, I looked into what you suggested (I also found the lecture notes from a course I attended during my MPhil, http://people.ds.cam.ac.uk/nmm1/MPI/Notes/notes_06.pdf (archive: http://web.archive.org/web/20181008171031/http://people.ds.cam.ac.uk/nmm1/MPI/Notes/notes_06.pdf), which were really helpful). Because the default behaviour of MPI is to die on any error, and because the Cray Fortran compiler doesn't support producing tracebacks (both gfortran and ifort do…), it is particularly hard to work out exactly where MONC went wrong.

Have you considered changing the default error handling from MPI_ERRORS_ARE_FATAL to MPI_ERRORS_RETURN and then using the value of ierr in each MPI call in io/src/mpicommunication.F90? This would make it much clearer exactly which MPI call caused the issue, and each call to the subroutines in io/src/mpicommunication.F90 could provide a string indicating which underlying IO-server operation was using MPI at the time (in effect providing a poor man's stack trace). I already did the latter for the NetCDF error checker in model_core/src/utils/netcdf_misc.F90, which makes it much easier to work out what goes wrong with NetCDF files.

The change I suggested was never made, but I still think this would be very useful. What do you think? @cemac-ccs I am wondering what you think about adding this (i.e. changing the error handling to MPI_ERRORS_RETURN and checking the return code from every call; we could then include the name of the module and subroutine in the error message we display before MONC dies).
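
As a concrete illustration of that suggestion, here is a minimal standalone sketch (not MONC code; the program and the check_mpi helper are made up for the example) that switches the communicator to MPI_ERRORS_RETURN and wraps an MPI call with a check that reports the call site before aborting:

```fortran
! Minimal sketch: return MPI error codes instead of aborting, then report
! the failing call site and the MPI error text (a "poor man's stack trace").
program mpi_error_demo
  use mpi
  implicit none
  integer :: ierr, rank

  call MPI_Init(ierr)
  ! Return error codes to the caller rather than dying inside MPI
  call MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN, ierr)

  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call check_mpi(ierr, "mpi_error_demo::MPI_Comm_rank")

  call MPI_Finalize(ierr)

contains

  ! Print the calling context and the MPI error string, then abort cleanly
  subroutine check_mpi(code, context)
    integer, intent(in) :: code
    character(len=*), intent(in) :: context
    character(len=MPI_MAX_ERROR_STRING) :: msg
    integer :: msg_len, ierr_local

    if (code /= MPI_SUCCESS) then
      call MPI_Error_string(code, msg, msg_len, ierr_local)
      write(*, '(a)') 'MPI error in '//trim(context)//': '//msg(1:msg_len)
      call MPI_Abort(MPI_COMM_WORLD, code, ierr_local)
    end if
  end subroutine check_mpi

end program mpi_error_demo
```

The Cray compiler still won't produce a traceback, but a message like this at least pins down which MPI call failed and in which routine.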

@sjboeing
Contributor

sjboeing commented May 20, 2021

Here is a snapshot from a 108*108 grid point run at 200 m, running on a single ARCHER2 node:
[image: cloud top height field]

@sjboeing
Contributor

sjboeing commented May 21, 2021

Update:
With 16 MONCs per IO on 15 nodes, the simulation crashed at the very start with a pthreads error ("Pthreads error in IO server, error code=-2").
With 6 MONCs per IO, it got about halfway through the simulation before crashing with an MPI error; see below.


*** Error in `/lus/cls01095/work/n02/n02/sboeing/monc/./build/bin/monc_driver.exe': malloc(): memory corruption: 0x0000000001be1b30 ***

Program received signal SIGABRT: Process abort signal.

Backtrace for this error:
#0  0x2b9bdb12659f in ???
#1  0x2b9bdb126520 in ???
#2  0x2b9bdb127b00 in ???
#3  0x2b9bdb169956 in ???
#4  0x2b9bdb170172 in ???
#5  0x2b9bdb173548 in ???
#6  0x2b9bdb174fd6 in ???
#7  0x4d71c9 in ???
#8  0x4cc24b in ???
#9  0x488b14 in ???
#10  0x406480 in ???
#11  0x2b9bdb111349 in ???
#12  0x4064b9 in ???
        at ../sysdeps/x86_64/start.S:120
#13  0xffffffffffffffff in ???
mlx5: nid001878: got completion with error:
00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000
00000010 00000000 00000000 00000000
00000000 00008a12 0a00abe2 a48f5cd3
...
MPICH ERROR [Rank 126] [job id 277588.0] [Fri May 21 05:04:11 2021] [unknown] [nid001513] - Abort(136471695) (rank 126 in comm 0): Fatal error in PMPI_Test: Other MPI error, error stack:
PMPI_Test(205)................: MPI_Test(request=0x888e14, flag=0x7fffb184b42c, status=0x7fffb184b460) failed
MPIR_Test(85).................: 
MPIR_Test_impl(39)............: 
MPIDI_Progress_test(72).......: 
MPIDI_OFI_handle_cq_error(902): OFI poll failed (ofi_events.h:904:MPIDI_OFI_handle_cq_error:Input/output error)


Program received signal SIGABRT: Process abort signal.

@leifdenby
Collaborator Author

leifdenby commented May 23, 2021

That's frustrating @sjboeing. The clouds are looking good though!

Did you run this with OpenMPI or MVAPICH? I was just wondering because I think @cemac-ccs said MVAPICH works better on ARC4.

edit: actually, I think that is a different issue that Chris was addressing there: #44

@cemac-ccs
Collaborator

cemac-ccs commented May 25, 2021 via email

@sjboeing
Contributor

This is using Chris' scripts with minor modifications on ARCHER2 (so cray-mpich). One parameter which may need changing is the thread_pool number in the IO configuration, which is currently set to 500.

@eers1
Collaborator

eers1 commented Oct 7, 2021

I've also had this MPICH error a few times now; did you happen to make any progress with it?
