PDLIB Block Explicit scheme: Fix MPI restart reproducibility #1134

Open
MatthewMasarik-NOAA opened this issue Nov 29, 2023 · 25 comments · May be fixed by #1271
Labels
bug Something isn't working

Comments

@MatthewMasarik-NOAA
Collaborator

Describe the bug
When using the Block-explicit solver, runs are not MPI restart reproducible. That is, if you perform a run with MPI tasks = X and then use one of its restart files to initiate a separate run with MPI tasks = Y, the two runs do not produce identical results.

To Reproduce
You can create this situation by starting with the regtest ww3_ufs1.1/unstr, and turning the Block explicit scheme on by selecting grid b. From here you can set up the restart functionality by using the ww3_ufs1.2 03Z restart test as a template. Here's how the jobcard commands could look for differing MPI counts, <X> and <Y>:

# MPI tasks:  X
./bin/run_cmake_test -b slurm -o all -S -T -s MPI -s PDLIB -i input_unstr -w work_<X> -g b -f -p srun -n <X> ../model ww3_ufs1.1

# MPI tasks:  Y
mkdir -p ww3_ufs1.1/work_<Y>
cp -v ww3_ufs1.1/work_<X>/20210401.030000.restart.ww3 ww3_ufs1.1/work_<Y>/restart.ww3
./bin/run_cmake_test -b slurm -o all -S -T -s MPI -s PDLIB -i input_unstr -w work_<Y> -g b -f -p srun -n <Y> ../model ww3_ufs1.1

Expected behavior
The expected behavior is that the results are identical between these runs (i.e., running test.comp reports that ALL files are identical). However, when running test.comp for different values of <X> and <Y>, the results are not identical (out_grd, out_pnt, and the restart files differ).
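
For a quick check that is independent of test.comp, the restart files in the two work directories can also be compared byte for byte. This is a minimal sketch, assuming the work_<X>/work_<Y> layout created by the commands above:

# Compare each restart from the work_<X> run against its counterpart in work_<Y>.
# cmp -s returns 0 only if the two files are byte-for-byte identical.
for f in ww3_ufs1.1/work_<X>/*.restart.ww3; do
  base=$(basename "$f")
  if cmp -s "$f" "ww3_ufs1.1/work_<Y>/$base"; then
    echo "identical: $base"
  else
    echo "DIFFERENT: $base"
  fi
done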

Screenshots
test.comp output for <X>=16, <Y>=11:

     *****************************************************************
     ***  compare WAVEWATCH III ww3_ufs1.1/work_16 and ww3_ufs1.1/work_11  ***
     *****************************************************************
                                                                               
Warning: 20210401.030000.restart.ww3 is different                              
Warning: 20210401.060000.restart.ww3 is different                              
Warning: 20210401.090000.restart.ww3 is different                              
Warning: 20210401.120000.restart.ww3 is different                              
Warning: 20210401.150000.restart.ww3 is different                              
Warning: 20210401.180000.restart.ww3 is different                              
Warning: 20210401.210000.restart.ww3 is different                              
Warning: 20210402.000000.restart.ww3 is different                              
Warning: out_grd.ww3 is different                                              
Warning: out_grd.ww3.txt is different                                          
Warning: out_pnt.ww3 is different                                              
Warning: out_pnt.ww3.txt is different                                          
Warning: ww3.2021.nc is different
ww3.2021_tab.nc is identical
exe/ww3_bounc is identical                                                      
exe/ww3_bound is identical                                                      
exe/ww3_gint is identical                                                      
exe/ww3_grib is identical                                                      
exe/ww3_grid is identical                                                      
exe/ww3_gspl is identical                                                      
exe/ww3_multi is identical                                                      
exe/ww3_ounf is identical                                                      
exe/ww3_ounp is identical                                                      
exe/ww3_prep is identical                                                      
exe/ww3_prnc is identical                                                      
exe/ww3_shel is identical                                                      
exe/ww3_strt is identical                                                      
exe/ww3_systrk is identical                                                    
exe/ww3_trck is identical                                                      
exe/ww3_trnc is identical                                                      
exe/ww3_uprstr is identical                                                    
                                                                               
   ****************************************************************************
 ***  end of WAVEWATCH III compare tests with the same version of the code   **
   ****************************************************************************

Additional context
N/A.

@MatthewMasarik-NOAA added the bug label Nov 29, 2023
@MatthewMasarik-NOAA
Collaborator Author

MatthewMasarik-NOAA commented Mar 14, 2024

hi @aronroland

I'm working on resolving the b4b differences in the restart files for PDLIB block-explicit, and one thing I noticed is that the restart file sizes vary when you vary the MPI count. In this case I am not doing a warm-start run using a 03Z restart file from a separate run, as described above. Here I'm only doing two separate runs of the ww3_ufs1.1/unstr regtest with grid b (for this example, MPI counts 20 and 40) and then comparing the restart files. You can see the difference in KB here:

[hfe04 regtests ]$ du -k ww3_ufs1.1/work_unstr_b_[24]0/restart001.ww3

11752	ww3_ufs1.1/work_unstr_b_20/restart001.ww3
6588	ww3_ufs1.1/work_unstr_b_40/restart001.ww3

Looking into this, I found something in w3iorsmd.F90 I have a question about. The first term in the IF statement on line 630 below always evaluates to true when IOSTYP=0 for the unstructured schemes, so that branch is always taken and execution never reaches the ELSE branch starting on line 654, which includes the call to UNST_PDLIB_WRITE_TO_FILE on line 662. Is this intentional?

WW3/model/src/w3iorsmd.F90

Lines 630 to 663 in f66b6d4

      IF ( .NOT.IOSFLG .OR. (NAPROC.EQ.1.AND.NAPRST.EQ.1) ) THEN
#ifdef W3_MPI
        DO JSEA=1, NSEAL
          CALL INIT_GET_ISEA(ISEA, JSEA)
          NREC = ISEA + 2
          RPOS = 1_8 + LRECL*(NREC-1_8)
          WRITEBUFF(:) = 0.
          WRITEBUFF(1:NSPEC) = VA(1:NSPEC,JSEA)
          WRITE (NDSR,POS=RPOS,ERR=803,IOSTAT=IERR) WRITEBUFF
        END DO
#else
        DO JSEA=1, NSEA
          ISEA = JSEA
          NREC = ISEA + 2
          RPOS = 1_8 + LRECL*(NREC-1_8)
          WRITEBUFF(:) = 0.
          WRITEBUFF(1:NSPEC) = VA(1:NSPEC,JSEA)
          WRITE (NDSR,POS=RPOS,ERR=803,IOSTAT=IERR) WRITEBUFF
        END DO
#endif
!
! I/O server version writing of spectra ( !/MPI )
!
#ifdef W3_MPI
      ELSE
!
        IF (LPDLIB) THEN
#endif
#ifdef W3_TIMINGS
          CALL PRINT_MY_TIME("Before UNST_PDLIB_WRITE_TO_FILE")
#endif
#ifdef W3_PDLIB
          CALL UNST_PDLIB_WRITE_TO_FILE(NDSR)
#endif

thanks

@aronroland
Collaborator

I just checked on my side and I can confirm that it is not b4b, which is quite strange. Our commit from a long time ago with the fixes for the block explicit passed the UFS test case and was b4b at that time. As for the restart issues, I need to look at them in more detail; it has been a very long time since I worked on this part of the code.

@JessicaMeixner-NOAA
Collaborator

@aronroland everything but the restart file itself reproduces. I believe this has always been the case with the block-explicit. Should we be calling UNST_PDLIB_WRITE_TO_FILE when writing restart files for the block explicit case?

@aronroland
Collaborator

@aronroland please give me some time to look at it ...

@aronroland
Collaborator

Hello, I have reviewed the code and it is intentional: we basically tried to reproduce the behavior of the structured scheme within the unstructured schemes as defined by IOSTYP = 0 and 1. IOSTYP = 2 and 3 were not streamlined for the unstructured DD part. For the problem above I would follow the IOSTYP = 0 path. Is it b4b for the simple explicit schemes? There should not be any difference between the block explicit and the other explicit schemes as far as the writing of the restart files is concerned. If all other files for the block explicit are b4b, which is the case as I have understood it, there is only little room left: it needs to be traced from w3_wave down to the call that writes the restart file, and there should be no difference whether block explicit or explicit. The call to UNST_PDLIB_WRITE_TO_FILE is done for IOSTYP = 1.

@aronroland
Collaborator

I have tested the different output methods from above and found that IOSTYP = 0 results in about 10% improved runtime (the test case was tp2.6).

@MatthewMasarik-NOAA
Collaborator Author

Hello @aronroland, thank you for looking into this and offering insight. I will do some more checking on my end based on what you've said. I have a quick follow-up, though.

Regarding your question about whether it is b4b for the simple explicit scheme: I tested that scheme (grid a) with different MPI counts, and it is not b4b.

One clarification question: you suggested we use IOSTYP=0. In that case UNST_PDLIB_WRITE_TO_FILE will not be called, and the spectra will only be written directly from w3iors (the structured path)?

As I said I will do some tests on my side and will let you know what I find.

@aronroland
Collaborator

Yes, as I see it, IOSTYP = 0 does a direct write, and IOSTYP = 1 is analogous to the WW3 logic for the structured grids.

@MatthewMasarik-NOAA
Collaborator Author

Hi @aronroland,
Okay, I confirmed the program flow through w3iors when using IOSTYP=0 vs. IOSTYP=1.

When IOSTYP=0, this block of code gets executed for writing spectra:

WW3/model/src/w3iorsmd.F90

Lines 630 to 639 in f66b6d4

      IF ( .NOT.IOSFLG .OR. (NAPROC.EQ.1.AND.NAPRST.EQ.1) ) THEN
#ifdef W3_MPI
        DO JSEA=1, NSEAL
          CALL INIT_GET_ISEA(ISEA, JSEA)
          NREC = ISEA + 2
          RPOS = 1_8 + LRECL*(NREC-1_8)
          WRITEBUFF(:) = 0.
          WRITEBUFF(1:NSPEC) = VA(1:NSPEC,JSEA)
          WRITE (NDSR,POS=RPOS,ERR=803,IOSTAT=IERR) WRITEBUFF
        END DO

When IOSTYP=1, however, the ELSE branch is taken, which calls the unstructured write routine (UNST_PDLIB_WRITE_TO_FILE):

WW3/model/src/w3iorsmd.F90

Lines 654 to 663 in f66b6d4

      ELSE
!
        IF (LPDLIB) THEN
#endif
#ifdef W3_TIMINGS
          CALL PRINT_MY_TIME("Before UNST_PDLIB_WRITE_TO_FILE")
#endif
#ifdef W3_PDLIB
          CALL UNST_PDLIB_WRITE_TO_FILE(NDSR)
#endif

I did runs for each IOSTYP with two arbitrary MPI task counts, 20 and 40. Displaying the sizes of the restart files shows that the restarts generated with IOSTYP=0 vary in size with the number of MPI tasks.

[hfe02 ww3_ufs1.1]$ du -k IOSTYP_*/restart001.ww3     
11752   IOSTYP_0.work_unstr_b_20/restart001.ww3
6588    IOSTYP_0.work_unstr_b_40/restart001.ww3
190336  IOSTYP_1.work_unstr_b_20/restart001.ww3
190336  IOSTYP_1.work_unstr_b_40/restart001.ww3

Is it expected that you would get different-sized restart files for different MPI task counts when using IOSTYP=0?
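
One thing that might help narrow this down (a minimal sketch; the paths come from the du listing above, and the check is just a generic Linux diagnostic, not anything WW3-specific): since the IOSTYP=0 branch writes records at computed positions with WRITE(...,POS=RPOS), comparing the apparent file size with the allocated size on disk can show whether the smaller file is sparse (holes between written records) or is genuinely missing trailing records.

# For each IOSTYP=0 restart, compare apparent size (stat) with allocated size (du).
# If apparent sizes match but du differs, the smaller file is sparse;
# if apparent sizes differ, records at the end of the file were never written.
for f in IOSTYP_0.work_unstr_b_20/restart001.ww3 \
         IOSTYP_0.work_unstr_b_40/restart001.ww3; do
  printf '%s  apparent=%s bytes  allocated=%s KB\n' \
    "$f" "$(stat -c %s "$f")" "$(du -k "$f" | cut -f1)"
done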

@aronroland
Collaborator

aronroland commented Mar 27, 2024

Hi @MatthewMasarik-NOAA, indeed this is strange, and I wonder whether both restart files work. Did you check this? In any case, it is something that needs to be cleaned up.

@MatthewMasarik-NOAA
Collaborator Author

Hi @aronroland, I have already done several kinds of checking on these restart files. To answer whether both restarts work: they both warm-start a run and finish without issue, but the output fields from those runs aren't reproducible.

I can confirm that in either case (IOSTYP=0 or 1) the restart files themselves are not b4b for differing MPI counts. The out_grd from warm-starting with a differing MPI count is also not b4b.

It's concerning that for IOSTYP=0 the restart file size changes with the MPI count. Do you still think IOSTYP=0 is the best path forward?

@JessicaMeixner-NOAA
Collaborator

@MatthewMasarik-NOAA can you share the runs where the output fields of the runs are not reproducible? This is not what I was expecting or have seen in my testing with IOSTYP=0 (I would have to double-check that, though).

@MatthewMasarik-NOAA
Collaborator Author

@MatthewMasarik-NOAA can you share the runs where the output fields of the runs are not reproducible.

@JessicaMeixner-NOAA, please clarify what you mean by 'runs where the output fields of the runs are not reproducible.' I qualified those as warm-start runs from a restart file from a run with a different MPI count. Was that your interpretation?

@MatthewMasarik-NOAA
Collaborator Author

I was referring to the situation that this issue was started for:

When using the Block-explicit solver, the runs are not MPI restart reproducible. Meaning, if you perform a run using MPI tasks=X, and use one of those restart files to initiate a separate run, which uses MPI tasks=Y, the runs are not reproducible.

@JessicaMeixner-NOAA
Collaborator

I thought the other output fields (gridded, point), i.e., everything except the restart file itself, replicate.

@MatthewMasarik-NOAA
Collaborator Author

MatthewMasarik-NOAA commented Mar 28, 2024

They do not. The test.comp output listing them is in the issue description above.

@aronroland
Collaborator

Hi @MatthewMasarik-NOAA, @JessicaMeixner-NOAA, at the time when @aliabdolali and I worked on it, it was b4b; otherwise it could not have been merged. As for the restart file size, I think it should not differ for a different number of tasks. This does not look right.

@aliabdolali
Contributor

Back in the day it was tested by the EMC code managers and merged after review and approval, and from what I recall, all of the ufs_1.1, ufs1.1_unstr, ufs1.2, and ufs1.3 tests have MPI and restart reproducibility checks. If not, I think you just need to activate them in matrix.base. Let me know if you need me to show you where I added them.
I'd suggest going back to that hash and testing it; if it was b4b, march forward to find the problematic PR (e.g., with git bisect, as sketched below). Otherwise let us know and we will take a look according to our priority list.
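
A minimal git bisect sketch of that "march forward" search (the hashes are hypothetical placeholders; the per-step check would follow the two-task-count run_cmake_test / test.comp pattern shown earlier in this issue):

# <good-hash> is the last commit known to be b4b (placeholder);
# <bad-hash> is a current commit that fails the reproducibility check.
cd WW3
git bisect start
git bisect bad <bad-hash>
git bisect good <good-hash>
# At each step: rebuild, rerun the regtest with two MPI task counts
# (e.g. the run_cmake_test commands from the issue description),
# compare the two work directories with test.comp, then mark the commit:
git bisect good    # or: git bisect bad
git bisect reset   # when finished, return to the original checkout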

@DeniseWorthen also did some checks in the coupled application.

@DeniseWorthen
Contributor

@aliabdolali I don't recall ever trying a case where I generated a restart file running w/ a certain decomp, used it to restart a different decomp and compared answers. If I understand correctly, that is the issue being discussed here.

@JessicaMeixner-NOAA
Collaborator

I'm going to run some tests and will report back tomorrow.

@aliabdolali
Contributor

@aliabdolali I don't recall ever trying a case where I generated a restart file running w/ a certain decomp, used it to restart a different decomp and compared answers. If I understand correctly, that is the issue being discussed here.

I do not remember exactly, Denise, but I remember you were checking the ghost nodes for b4b reproducibility; it might have been for MPI rather than restart, or vice versa. It has been a long time since we did it.

@DeniseWorthen
Contributor

@aliabdolali I set up a test case for the ATM-WAV RT test in UFS using the small mesh we did a lot of initial testing on. I ran out 2 hours on 10 PEs, and used the restart after 1 hour to start up either a 10-PE case or a 14-PE case. All three cases have identical ATM and mediator restart files at the end of 2 hours. I did run in debug mode, which might prevent some optimizations (i.e., order of operations) that could conceivably change answers. But this basic test appears to pass for me.

@aliabdolali
Contributor

Super, @DeniseWorthen, that helps us a lot. Let's see the outcome of the standalone tests; we can compare and see where the differences are. It might be in the configuration.

@aronroland
Collaborator

For the future, I suggest that we add sanity tests for the source term and solver parts; that way we will be able to find out more easily where an issue occurs. For the UG part of the code we have "solver_coherence" introduced within a cpp pragma. I will do some checks based on the ideas of @MathieuDutSik.

@JessicaMeixner-NOAA
Collaborator

Hi everyone - thanks for your patience as I ran extra tests. I did not repeat the coupled tests since I got the same results as @DeniseWorthen before, and she confirmed that, so I focused on standalone tests. I do see what @MatthewMasarik-NOAA saw: out_grd has b4b differences in the standalone set-up.

The differences are limited to the initial time step for the restart and are in wind, current, Charnock, and Ustar. For wind we have a known issue, #218, that explains why this is different, and I suspect this is also the reason for current. For Charnock and Ustar, I honestly don't remember whether those had differences in the structured case; that is something we could check. They are both written to the restart file, so that could be a place to look further for answer changes. But all other fields at the initial time are the same in the gridded output, and all fields at all other times are the same, so I think we are getting the same results as before; we just didn't look at the first time step previously because of the wind issue.

@aronroland additional sanity checks are always great. My suspicion is that something is being written to the restart file that is not actually used, and that is causing the differences; they're just really hard to track down in the binary file output.
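
One way to make those binary differences a bit less opaque (a minimal sketch; the two restart paths and the LRECL value are placeholders, and the record mapping follows the NREC = ISEA + 2 logic in the w3iorsmd excerpt quoted above): list the differing byte offsets with cmp -l and bucket them into LRECL-sized records, which points at the header records or sea points whose spectra differ.

# REST_A/REST_B are two restarts to compare; LRECL is the record length in bytes
# used by w3iorsmd for this grid (the value below is a hypothetical placeholder).
REST_A=ww3_ufs1.1/work_16/20210401.060000.restart.ww3
REST_B=ww3_ufs1.1/work_11/20210401.060000.restart.ww3
LRECL=4520

# cmp -l prints one line per differing byte: "<1-based offset> <octal A> <octal B>".
# Bucket the offsets into records; per the quoted code, spectra for sea point ISEA
# are written to record ISEA + 2.
cmp -l "$REST_A" "$REST_B" \
  | awk -v lrecl="$LRECL" '{ rec[int(($1 - 1) / lrecl) + 1]++ }
        END { for (r in rec) print r, rec[r] }' \
  | sort -n \
  | awk '{ printf "record %s: %s differing bytes\n", $1, $2 }'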
