PDLIB Block Explicit scheme: Fix MPI restart reproducibility #1134

Open
MatthewMasarik-NOAA opened this issue Nov 29, 2023 · 25 comments · May be fixed by #1271
Labels
bug Something isn't working

Comments

@MatthewMasarik-NOAA
Collaborator

Describe the bug
When using the Block-explicit solver, runs are not MPI restart reproducible. That is, if you perform a run with MPI tasks = X and then use one of its restart files to initiate a separate run with MPI tasks = Y, the two runs do not produce identical results.

To Reproduce
You can create this situation by starting with the regtest ww3_ufs1.1/unstr, and turning the Block explicit scheme on by selecting grid b. From here you can set up the restart functionality by using the ww3_ufs1.2 03Z restart test as a template. Here's how the jobcard commands could look for differing MPI counts, <X> and <Y>:

# MPI tasks:  X
./bin/run_cmake_test -b slurm -o all -S -T -s MPI -s PDLIB -i input_unstr -w work_<X> -g b -f -p srun -n <X> ../model ww3_ufs1.1

# MPI tasks:  Y
mkdir -p ww3_ufs1.1/work_<Y>
cp -v ww3_ufs1.1/work_<X>/20210401.030000.restart.ww3 ww3_ufs1.1/work_<Y>/restart.ww3
./bin/run_cmake_test -b slurm -o all -S -T -s MPI -s PDLIB -i input_unstr -w work_<Y> -g b -f -p srun -n <Y> ../model ww3_ufs1.1

Expected behavior
The expected behavior is that the results are identical between these runs (i.e., running test.comp reports that ALL files are identical). However, when running test.comp for different values of <X> and <Y>, the results are not identical (out_grd, out_pnt, and the restart files differ).
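
For a quick check that is independent of test.comp, the restart files in the two work directories can also be compared byte for byte. This is a minimal sketch, assuming the work_<X>/work_<Y> layout created by the commands above:

# Compare each restart from the work_<X> run against its counterpart in work_<Y>.
# cmp -s returns 0 only if the two files are byte-for-byte identical.
for f in ww3_ufs1.1/work_<X>/*.restart.ww3; do
  base=$(basename "$f")
  if cmp -s "$f" "ww3_ufs1.1/work_<Y>/$base"; then
    echo "identical: $base"
  else
    echo "DIFFERENT: $base"
  fi
done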

Screenshots
test.comp output for <X>=16, <Y>=11:

     *****************************************************************
     ***  compare WAVEWATCH III ww3_ufs1.1/work_16 and ww3_ufs1.1/work_11  ***
     *****************************************************************
                                                                               
Warning: 20210401.030000.restart.ww3 is different                              
Warning: 20210401.060000.restart.ww3 is different                              
Warning: 20210401.090000.restart.ww3 is different                              
Warning: 20210401.120000.restart.ww3 is different                              
Warning: 20210401.150000.restart.ww3 is different                              
Warning: 20210401.180000.restart.ww3 is different                              
Warning: 20210401.210000.restart.ww3 is different                              
Warning: 20210402.000000.restart.ww3 is different                              
Warning: out_grd.ww3 is different                                              
Warning: out_grd.ww3.txt is different                                          
Warning: out_pnt.ww3 is different                                              
Warning: out_pnt.ww3.txt is different                                          
Warning: ww3.2021.nc is different
ww3.2021_tab.nc is identical
exe/ww3_bounc is identical                                                      
exe/ww3_bound is identical                                                      
exe/ww3_gint is identical                                                      
exe/ww3_grib is identical                                                      
exe/ww3_grid is identical                                                      
exe/ww3_gspl is identical                                                      
exe/ww3_multi is identical                                                      
exe/ww3_ounf is identical                                                      
exe/ww3_ounp is identical                                                      
exe/ww3_prep is identical                                                      
exe/ww3_prnc is identical                                                      
exe/ww3_shel is identical                                                      
exe/ww3_strt is identical                                                      
exe/ww3_systrk is identical                                                    
exe/ww3_trck is identical                                                      
exe/ww3_trnc is identical                                                      
exe/ww3_uprstr is identical                                                    
                                                                               
   ****************************************************************************
 ***  end of WAVEWATCH III compare tests with the same version of the code   **
   ****************************************************************************

Additional context
N/A.

@MatthewMasarik-NOAA added the bug label Nov 29, 2023
@MatthewMasarik-NOAA
Collaborator Author

MatthewMasarik-NOAA commented Mar 14, 2024

hi @aronroland

I'm working on resolving the b4b differences in the restart files for PDLIB block-explicit, and one thing I noticed is that the restart file sizes vary when you vary the MPI count. In this case I am not doing a warm-start run using a 03Z restart file from a separate run, as described above. Here I'm only doing two separate runs of the ww3_ufs1.1/unstr regtest with grid b (for this example, MPI counts 20 and 40) and then comparing the restart files. You can see the difference in KB here:

[hfe04 regtests ]$ du -k ww3_ufs1.1/work_unstr_b_[24]0/restart001.ww3

11752	ww3_ufs1.1/work_unstr_b_20/restart001.ww3
6588	ww3_ufs1.1/work_unstr_b_40/restart001.ww3

Looking into this, I found something in w3iorsmd.F90 I have a question about. The first term in the IF statement on line 630 below always evaluates to true when IOSTYP=0 for the unstructured schemes, so that branch is always taken and execution never reaches the ELSE branch starting on line 654, which includes the call to UNST_PDLIB_WRITE_TO_FILE on line 662. Is this intentional?

WW3/model/src/w3iorsmd.F90

Lines 630 to 663 in f66b6d4

      IF ( .NOT.IOSFLG .OR. (NAPROC.EQ.1.AND.NAPRST.EQ.1) ) THEN
#ifdef W3_MPI
        DO JSEA=1, NSEAL
          CALL INIT_GET_ISEA(ISEA, JSEA)
          NREC = ISEA + 2
          RPOS = 1_8 + LRECL*(NREC-1_8)
          WRITEBUFF(:) = 0.
          WRITEBUFF(1:NSPEC) = VA(1:NSPEC,JSEA)
          WRITE (NDSR,POS=RPOS,ERR=803,IOSTAT=IERR) WRITEBUFF
        END DO
#else
        DO JSEA=1, NSEA
          ISEA = JSEA
          NREC = ISEA + 2
          RPOS = 1_8 + LRECL*(NREC-1_8)
          WRITEBUFF(:) = 0.
          WRITEBUFF(1:NSPEC) = VA(1:NSPEC,JSEA)
          WRITE (NDSR,POS=RPOS,ERR=803,IOSTAT=IERR) WRITEBUFF
        END DO
#endif
!
! I/O server version writing of spectra ( !/MPI )
!
#ifdef W3_MPI
      ELSE
!
        IF (LPDLIB) THEN
#endif
#ifdef W3_TIMINGS
          CALL PRINT_MY_TIME("Before UNST_PDLIB_WRITE_TO_FILE")
#endif
#ifdef W3_PDLIB
          CALL UNST_PDLIB_WRITE_TO_FILE(NDSR)
#endif

thanks

@aronroland
Collaborator

I just checked on my side and I can confirm that it is not b4b, which is quite strange. Our commit from a long time ago with the fixes for the block explicit passed the UFS test case and was b4b at that time. As for the restart issues, I need to look at them in more detail; it has been a very long time since I worked on this part of the code.

@JessicaMeixner-NOAA
Collaborator

@aronroland everything but the restart file itself reproduces. I believe this has always been the case with the block-explicit. Should we be calling UNST_PDLIB_WRITE_TO_FILE when writing restart files for the block explicit case?

@aronroland
Collaborator

@aronroland please give me some time to look at it ...

@aronroland
Collaborator

Hello, I have reviewed the code and it is intentional: we basically tried to reproduce the behavior of the structured scheme within the unstructured schemes as defined by IOSTYP = 0 and 1. IOSTYP = 2 and 3 were not streamlined for the unstructured DD part. For the problem above I would follow the IOSTYP = 0 path. Is it b4b for the simple explicit schemes? There should not be any difference between the block explicit and the other explicit schemes as far as the writing of the restart files is concerned. If all other files for the block explicit are b4b, which is the case as I have understood it, there is only little room left: it needs to be traced from w3_wave down to the call that writes the restart file, and there should be no difference whether block explicit or explicit. The call to UNST_PDLIB_WRITE_TO_FILE is done for IOSTYP = 1.

@aronroland
Collaborator

I have tested the different output methods from above and found that IOSTYP = 0 results in about 10% improved runtime (the test case was tp2.6).

@MatthewMasarik-NOAA
Collaborator Author

Hello @aronroland, thank you for looking into this and offering insight. I will do some more checking on my end based on what you've said. I have a quick follow-up, though.

Regarding your question about whether it is b4b for the simple explicit scheme: I tested that scheme (grid a) with different MPI counts, and it is not b4b.

One clarification question: you suggested we use IOSTYP=0. In that case UNST_PDLIB_WRITE_TO_FILE will not be called, and the spectra will only be written directly from w3iors (the structured path)?

As I said I will do some tests on my side and will let you know what I find.

@aronroland
Collaborator

Yes, as I see it, IOSTYP = 0 does a direct write, and IOSTYP = 1 is analogous to the WW3 logic for the structured grids.

@MatthewMasarik-NOAA
Collaborator Author

Hi @aronroland,
Okay, I confirmed the program flow through w3iors when using IOSTYP=0 vs. IOSTYP=1.

When IOSTYP=0, this block of code gets executed for writing spectra:

WW3/model/src/w3iorsmd.F90

Lines 630 to 639 in f66b6d4

      IF ( .NOT.IOSFLG .OR. (NAPROC.EQ.1.AND.NAPRST.EQ.1) ) THEN
#ifdef W3_MPI
        DO JSEA=1, NSEAL
          CALL INIT_GET_ISEA(ISEA, JSEA)
          NREC = ISEA + 2
          RPOS = 1_8 + LRECL*(NREC-1_8)
          WRITEBUFF(:) = 0.
          WRITEBUFF(1:NSPEC) = VA(1:NSPEC,JSEA)
          WRITE (NDSR,POS=RPOS,ERR=803,IOSTAT=IERR) WRITEBUFF
        END DO

When IOSTYP=1, however, the ELSE branch is taken, which calls the unstructured write routine (UNST_PDLIB_WRITE_TO_FILE):

WW3/model/src/w3iorsmd.F90

Lines 654 to 663 in f66b6d4

      ELSE
!
        IF (LPDLIB) THEN
#endif
#ifdef W3_TIMINGS
          CALL PRINT_MY_TIME("Before UNST_PDLIB_WRITE_TO_FILE")
#endif
#ifdef W3_PDLIB
          CALL UNST_PDLIB_WRITE_TO_FILE(NDSR)
#endif

I did runs for each IOSTYP with two arbitrary MPI task counts, 20 and 40. Displaying the sizes of the restart files shows that the restarts generated with IOSTYP=0 vary in size with the number of MPI tasks.

[hfe02 ww3_ufs1.1]$ du -k IOSTYP_*/restart001.ww3     
11752   IOSTYP_0.work_unstr_b_20/restart001.ww3
6588    IOSTYP_0.work_unstr_b_40/restart001.ww3
190336  IOSTYP_1.work_unstr_b_20/restart001.ww3
190336  IOSTYP_1.work_unstr_b_40/restart001.ww3

Is it expected that you would get different-sized restart files for different MPI task counts when using IOSTYP=0?
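
One thing that might help narrow this down (a minimal sketch; the paths come from the du listing above, and the check is just a generic Linux diagnostic, not anything WW3-specific): since the IOSTYP=0 branch writes records at computed positions with WRITE(...,POS=RPOS), comparing the apparent file size with the allocated size on disk can show whether the smaller file is sparse (holes between written records) or is genuinely missing trailing records.

# For each IOSTYP=0 restart, compare apparent size (stat) with allocated size (du).
# If apparent sizes match but du differs, the smaller file is sparse;
# if apparent sizes differ, records at the end of the file were never written.
for f in IOSTYP_0.work_unstr_b_20/restart001.ww3 \
         IOSTYP_0.work_unstr_b_40/restart001.ww3; do
  printf '%s  apparent=%s bytes  allocated=%s KB\n' \
    "$f" "$(stat -c %s "$f")" "$(du -k "$f" | cut -f1)"
done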

@aronroland
Collaborator

aronroland commented Mar 27, 2024

Hi @MatthewMasarik-NOAA, indeed this is strange, and I wonder whether both restart files work. Did you check this? In any case, it is something that needs to be cleaned up.

@MatthewMasarik-NOAA
Collaborator Author

Hi @aronroland, I have already done several kinds of checking on these restart files. To answer whether both restarts work: they both warm-start a run and finish without issue, but the output fields from those runs aren't reproducible.

I can confirm that in either case (IOSTYP=0 or 1) the restart files themselves are not b4b for differing MPI counts. The out_grd from warm-starting with a differing MPI count is also not b4b.

It's concerning that for IOSTYP=0 the restart file size changes with the MPI count. Do you still think IOSTYP=0 is the best path forward?

@JessicaMeixner-NOAA
Collaborator

@MatthewMasarik-NOAA can you share the runs where the output fields of the runs are not reproducible? This is not what I was expecting or have seen in my testing with IOSTYP=0 (I would have to double-check that, though).

@MatthewMasarik-NOAA
Collaborator Author

@MatthewMasarik-NOAA can you share the runs where the output fields of the runs are not reproducible.

@JessicaMeixner-NOAA, please clarify what you mean by 'runs where the output fields of the runs are not reproducible.' I qualified those as warm-start runs from a restart file from a run with a different MPI count. Was that your interpretation?

@MatthewMasarik-NOAA
Collaborator Author

I was referring to the situation that this issue was started for:

When using the Block-explicit solver, the runs are not MPI restart reproducible. Meaning, if you perform a run using MPI tasks=X, and use one of those restart files to initiate a separate run, which uses MPI tasks=Y, the runs are not reproducible.

@JessicaMeixner-NOAA
Collaborator

I thought the other output fields (gridded, point), i.e., everything except the restart file itself, replicate.

@MatthewMasarik-NOAA
Collaborator Author

MatthewMasarik-NOAA commented Mar 28, 2024

They do not. The test.comp output listing them is in the issue description above.

@aronroland
Collaborator

Hi @MatthewMasarik-NOAA, @JessicaMeixner-NOAA, at the time when @aliabdolali and I worked on it, it was b4b; otherwise it could not have been merged. As for the restart file size, I think it should not differ for a different number of tasks. This does not look right.

@aliabdolali
Contributor

Back in the day it was tested by the EMC code managers and merged after review and approval, and from what I recall, all of the ufs_1.1, ufs1.1_unstr, ufs1.2, and ufs1.3 tests have MPI and restart reproducibility checks. If not, I think you just need to activate them in matrix.base. Let me know if you need me to show you where I added them.
I'd suggest going back to that hash and testing it; if it was b4b, march forward to find the problematic PR (e.g., with git bisect, as sketched below). Otherwise let us know and we will take a look according to our priority list.
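
A minimal git bisect sketch of that "march forward" search (the hashes are hypothetical placeholders; the per-step check would follow the two-task-count run_cmake_test / test.comp pattern shown earlier in this issue):

# <good-hash> is the last commit known to be b4b (placeholder);
# <bad-hash> is a current commit that fails the reproducibility check.
cd WW3
git bisect start
git bisect bad <bad-hash>
git bisect good <good-hash>
# At each step: rebuild, rerun the regtest with two MPI task counts
# (e.g. the run_cmake_test commands from the issue description),
# compare the two work directories with test.comp, then mark the commit:
git bisect good    # or: git bisect bad
git bisect reset   # when finished, return to the original checkout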

@DeniseWorthen also did some checks in the coupled application.

@DeniseWorthen
Contributor

@aliabdolali I don't recall ever trying a case where I generated a restart file running w/ a certain decomp, used it to restart a different decomp and compared answers. If I understand correctly, that is the issue being discussed here.

@JessicaMeixner-NOAA
Collaborator

I'm going to run some tests and will report back tomorrow.

@aliabdolali
Contributor

@aliabdolali I don't recall ever trying a case where I generated a restart file running w/ a certain decomp, used it to restart a different decomp and compared answers. If I understand correctly, that is the issue being discussed here.

I do not remember exactly, Denise, but I remember you were checking the ghost nodes for b4b reproducibility; it might have been for MPI rather than restart, or vice versa. It has been a long time since we did it.

@DeniseWorthen
Contributor

@aliabdolali I set up a test case for the ATM-WAV RT test in UFS using the small mesh we did a lot of initial testing on. I ran out 2 hours on 10 PEs, and used the restart after 1 hour to start up either a 10-PE case or a 14-PE case. All three cases have identical ATM and mediator restart files at the end of 2 hours. I did run in debug mode, which might prevent some optimizations (i.e., order of operations) that could conceivably change answers. But this basic test appears to pass for me.

@aliabdolali
Contributor

Super, @DeniseWorthen, that helps us a lot. Let's see the outcome of the standalone tests; we can compare and see where the differences are. It might be in the configuration.

@aronroland
Collaborator

For the future, I suggest that we add sanity tests for the source term and solver parts; that way we will be able to find out more easily where an issue occurs. For the UG part of the code we have "solver_coherence" introduced within a cpp pragma. I will do some checks based on the ideas of @MathieuDutSik.

@JessicaMeixner-NOAA
Collaborator

Hi everyone - thanks for your patience as I ran extra tests. I did not repeat the coupled tests since I got the same results as @DeniseWorthen before, and she confirmed that, so I focused on standalone tests. I do see what @MatthewMasarik-NOAA saw: out_grd has b4b differences in the standalone set-up.

The differences are limited to the initial time step for the restart and are in wind, current, Charnock, and Ustar. For wind we have a known issue, #218, that explains why this is different, and I suspect this is also the reason for current. For Charnock and Ustar, I honestly don't remember whether those had differences in the structured case; that is something we could check. They are both written to the restart file, so that could be a place to look further for answer changes. But all other fields at the initial time are the same in the gridded output, and all fields at all other times are the same, so I think we are getting the same results as before; we just didn't look at the first time step previously because of the wind issue.

@aronroland additional sanity checks are always great. My suspicion is that something is being written to the restart file that is not actually used, and that is causing the differences; they're just really hard to track down in the binary file output.
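
One way to make those binary differences a bit less opaque (a minimal sketch; the two restart paths and the LRECL value are placeholders, and the record mapping follows the NREC = ISEA + 2 logic in the w3iorsmd excerpt quoted above): list the differing byte offsets with cmp -l and bucket them into LRECL-sized records, which points at the header records or sea points whose spectra differ.

# REST_A/REST_B are two restarts to compare; LRECL is the record length in bytes
# used by w3iorsmd for this grid (the value below is a hypothetical placeholder).
REST_A=ww3_ufs1.1/work_16/20210401.060000.restart.ww3
REST_B=ww3_ufs1.1/work_11/20210401.060000.restart.ww3
LRECL=4520

# cmp -l prints one line per differing byte: "<1-based offset> <octal A> <octal B>".
# Bucket the offsets into records; per the quoted code, spectra for sea point ISEA
# are written to record ISEA + 2.
cmp -l "$REST_A" "$REST_B" \
  | awk -v lrecl="$LRECL" '{ rec[int(($1 - 1) / lrecl) + 1]++ }
        END { for (r in rec) print r, rec[r] }' \
  | sort -n \
  | awk '{ printf "record %s: %s differing bytes\n", $1, $2 }'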
