history and restart write are hanging (or running too slow) in the ATM+SCH+WW3 case #1
Comments
@yunfangsun There are some configuration options on the PIO side that we could try in order to optimize the I/O and prevent CMEPS from hanging when it writes history and restart files. @jedwards4b and @DeniseWorthen might have some ideas. @jedwards4b, I wonder if there is any specific I/O option we could test with this high-resolution case on the CMEPS side?
@yunfangsun and all, I am transferring this issue to CMEPS since it seems to be related to it.
@jedwards4b and @DeniseWorthen Since this is UFS, the relevant part that sets the PIO options is https://github.com/ESCOMP/CMEPS/blob/e84e8a1f4fbe4073e82435c72459352de6077bb2/mediator/med_io_mod.F90#L177. There you can see the default PIO options for UFS/CMEPS.
@yunfangsun The following are the options used by default.
It seems that it is not using parallel I/O. You could try enabling parallel I/O by switching the PIO type (e.g., to pnetcdf).
You could also try a different number of I/O tasks.
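A hedged sketch of what such settings might look like in the mediator attributes of ufs.configure. The attribute names pio_typename and pio_stride here are assumptions modeled on the pio_numiotasks setting discussed later in this thread; verify them against med_io_mod.F90 and your build's defaults before copying:

```
MED_attributes::
  pio_typename = pnetcdf   # parallel I/O instead of the serial default
  pio_numiotasks = 64      # number of dedicated I/O tasks
  pio_stride = 4           # spacing of I/O tasks across the mediator PETs
::
```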
Outside of CESM, you can set the PIO options (number of I/O tasks, etc.) via config. See line 184 in 437d5e6.
Also, a question: why are you having mediator history files written? That is a lot of I/O! Normally history files are used for debugging or diagnosing field-exchange issues; they are not used in production runs.
@DeniseWorthen I agree with you. During development, I am activating mediator history and restart to check the exchanged fields, but @yunfangsun could disable them in his production runs. On the other hand, it would be nice to figure out the issue with the mediator I/O, so that once we need to debug something with the high-res case, it would be available. I bet it is related to the serial I/O (which is the default).
I think that there is an issue with the history write alarm, nothing to do with PIO - I am working on that today.
@uturuncoglu Yes, I agree that you need to switch to pnetcdf at a minimum. I also believe there is an issue with WW3 restarts when using PDLIB. So another idea might be testing without the WW3 restarts.
@DeniseWorthen Thanks. I think we need to open an issue on our end to track the WW3 restart problem. We did not test the restart capability of the coastal-specific ocean models, so we don't know whether they will restart correctly in a coupled application. I have raised this a couple of times in our internal meetings, but at this point we don't have enough resources to check them. @janahaddad maybe you could create an issue for the restart capability, and once we have time we could focus on the restart capability of ROMS and SCHISM.
@uturuncoglu I understand about not testing restart capability at this time. The issue I've heard about second-hand is that restart writing when using PDLIB is very, very slow. If you don't need WW3 restarts, I wouldn't write them.
@DeniseWorthen Okay. Thanks for the clarification and your help. I thought there was an issue in the restart files themselves; good to know. Maybe this is more problematic for high-resolution cases. @yunfangsun you could also try disabling the writing of restart files, as @DeniseWorthen suggested, to see if there is any performance improvement.
Hi @uturuncoglu,
Thank you for the help.
@yunfangsun From your mediator.log file, I see that it is still using the default of 4 iotasks. If you have 1600 cores for CMEPS, I would suggest trying a higher number of iotasks. With a stride of 4 you should be able to get 400 I/O tasks. Set the number of tasks with pio_numiotasks.
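Denise's arithmetic can be sketched as a tiny helper (hypothetical, not part of any UFS tooling) for picking an upper bound on pio_numiotasks from the mediator PET count and stride:

```python
def max_iotasks(pet_count: int, stride: int) -> int:
    """Largest pio_numiotasks that fits when every stride-th
    PET is designated as an I/O task."""
    if pet_count < 1 or stride < 1:
        raise ValueError("pet_count and stride must be positive")
    return pet_count // stride

# 1600 mediator PETs with a stride of 4 allows up to 400 I/O tasks
print(max_iotasks(1600, 4))  # -> 400
```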
@yunfangsun That is great news. Glad it worked. I agree with @DeniseWorthen about increasing the number of tasks for I/O. If you have time and don't mind, could you do a couple of runs (different numbers of I/O tasks, stride, etc.) and collect some timing results? It would be nice to change one parameter at a time to see its effect. I think that would be very helpful for the future, and we could use it as a reference for other cases. In production runs, we could disable mediator history and restart, or write them only at the end of the simulation, to optimize the I/O further.
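The one-parameter-at-a-time sweep suggested above can be planned in advance. A small hypothetical helper (the parameter names and values are illustrative) that enumerates the run configurations:

```python
from copy import deepcopy

def one_at_a_time(baseline, variations):
    """Return run configs that each change exactly one
    parameter relative to the baseline."""
    runs = []
    for key, values in variations.items():
        for value in values:
            if value == baseline.get(key):
                continue  # the baseline itself is a separate run
            cfg = deepcopy(baseline)
            cfg[key] = value
            runs.append(cfg)
    return runs

baseline = {"pio_numiotasks": 4, "pio_stride": 4}
runs = one_at_a_time(baseline, {"pio_numiotasks": [8, 64, 400],
                                "pio_stride": [8]})
print(len(runs))  # -> 4 runs in addition to the baseline
```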
Hi @uturuncoglu, Sure, I will do the test when Hercules is back online.
Hi @uturuncoglu and @DeniseWorthen, I have tried different pio_numiotasks settings: pio_numiotasks = 8 The speeds show no difference among the three cases.
@yunfangsun Thanks for running the additional tests. The results are a little bit interesting. I wonder about your frequency for writing history and restart files; I think if you increase that frequency you might start seeing some difference. If you don't mind, could you check your configuration?
Hi @uturuncoglu The frequency for the history and restart files is the same for all the tests, as follows:
@yunfangsun The history and restart file intervals are configured via ufs.configure and are not related to the run sequence. Could you share your ufs.configure?
Hi @uturuncoglu You can check out the case in /work2/noaa/nos-surge/yunfangs/stmp/yunfangs/FV3_RT/rt_324728_atmschww3_06132024/coastal_ian_hsofs_atm2sch2ww3_intel_5_4 The part related to history and restart is as follows:
@yunfangsun It seems that you are writing a history file every hour and a restart every 12 hours. If you don't mind, could you confirm this from your run directory?
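For context, an hourly-history / 12-hourly-restart setup like the one being confirmed here would look roughly like this in ufs.configure. This is a sketch only: restart_n = 12 is quoted later in the thread, while the history attribute names and the attribute block are assumptions to be checked against the actual file:

```
ALLCOMP_attributes::       # block name is an assumption; may differ
  history_option = nhours
  history_n = 1            # mediator history every hour
  restart_option = nhours
  restart_n = 12           # mediator restart every 12 hours
::
```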
Hi @uturuncoglu, Yes, I can confirm it.
@uturuncoglu @saeed-moghimi-noaa @janahaddad @pvelissariou1
This is to document a bug related to med_phases_history_write and med_phases_restart_write.
The original run sequence for ufs_atm2sch2ww3 is as follows:
It works for the coastal_ike_shinnecock_atm2sch2ww3 case; however, when this configuration is applied to a higher-resolution mesh (HSOFS, 1.8 million nodes), the simulation stopped after 1 simulated hour on 1600 cores. This case is located at /work2/noaa/nos-surge/yunfangs/stmp/yunfangs/FV3_RT/rt_324728_atmschww3_06132024/coastal_ian_hsofs_atm2sch2ww3_intel on Hercules.
Then I tried removing MED med_phases_history_write and, using 200 cores for the same case, it stopped after a 16-hour simulation (/work2/noaa/nos-surge/yunfangs/stmp/yunfangs/FV3_RT/rt_324728_atmschww3_06132024/coastal_ian_hsofs_atm2sch2ww3_intel_test3). With the same configuration but 6000 cores, the case stopped after a 12-hour simulation (/work2/noaa/nos-surge/yunfangs/stmp/yunfangs/FV3_RT/rt_324728_atmschww3_06132024/coastal_ian_hsofs_atm2sch2ww3_intel_test4).
Although restart_n = 12 is in this ufs.configure, once the restart was turned off by restart_option = never, the case with 6000 cores could finish the full 17-day simulation (/work2/noaa/nos-surge/yunfangs/stmp/yunfangs/FV3_RT/rt_324728_atmschww3_06132024/coastal_ian_hsofs_atm2sch2ww3_intel_test4_1).
The problem is that in the 200-core run the restart is also set to restart_n = 12, yet the simulation stopped at 16 hours. It seems med_phases_history_write and med_phases_restart_write cannot function well in this case (both have to be turned off). The question, then, is how to correctly configure this ufs.configure.
The details are at oceanmodeling/ufs-weather-model#103
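Taken together, the workaround described above amounts to two edits in ufs.configure, sketched here as comments rather than a verified configuration:

```
# 1) Run sequence: delete the mediator history phase
#      MED med_phases_history_write
# 2) Attributes: disable mediator restart writes
restart_option = never
```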