history and restart write are hanging (or running too slow) in the ATM+SCH+WW3 case #1
Comments
@yunfangsun There are some configuration options on the PIO side that we could try in order to optimize the I/O and prevent CMEPS from hanging when it writes history and restart files. @jedwards4b and @DeniseWorthen might have some ideas. @jedwards4b, I wonder if there is any specific I/O option we could test with this high-resolution case on the CMEPS side?
@yunfangsun and all, I am transferring this issue to CMEPS since it seems to be related to it.
@jedwards4b and @DeniseWorthen Since this is UFS, the relevant part that sets the PIO options is https://github.com/ESCOMP/CMEPS/blob/e84e8a1f4fbe4073e82435c72459352de6077bb2/mediator/med_io_mod.F90#L177. There you can see the default PIO options for UFS/CMEPS.
@yunfangsun The following are the options used by default.
It seems that it is not using parallel I/O. You could try enabling parallel I/O by switching the PIO type (e.g., to pnetcdf).
You could also try a different number of I/O tasks.
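A hedged sketch of what such settings might look like in the mediator attributes of ufs.configure. The attribute names pio_typename and pio_stride here are assumptions modeled on the pio_numiotasks setting discussed later in this thread; verify them against med_io_mod.F90 and your build's defaults before copying:

```
MED_attributes::
  pio_typename = pnetcdf   # parallel I/O instead of the serial default
  pio_numiotasks = 64      # number of dedicated I/O tasks
  pio_stride = 4           # spacing of I/O tasks across the mediator PETs
::
```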
Outside of CESM, you can set the PIO options (number of I/O tasks, etc.) via config. See line 184 in 437d5e6.
Also, a question: why are you having mediator history files written? That is a lot of I/O! Normally history files are used for debugging or diagnosing field-exchange issues; they are not used in production runs.
@DeniseWorthen I agree with you. During development, I am activating mediator history and restart to check the exchanged fields, but @yunfangsun could disable them in his production runs. On the other hand, it would be nice to figure out the issue with the mediator I/O, so that once we need to debug something with the high-res case, it would be available. I bet it is related to the serial I/O (which is the default).
I think that there is an issue with the history write alarm, nothing to do with PIO - I am working on that today.
@uturuncoglu Yes, I agree that you need to switch to pnetcdf at a minimum. I also believe there is an issue with WW3 restarts when using PDLIB. So another idea might be testing without the WW3 restarts.
@DeniseWorthen Thanks. I think we need to open an issue on our end to track the WW3 restart problem. We did not test the restart capability of the coastal-specific ocean models, so we don't know whether they will restart correctly in a coupled application. I have raised this a couple of times in our internal meetings, but at this point we don't have enough resources to check them. @janahaddad maybe you could create an issue for the restart capability, and once we have time we could focus on the restart capability of ROMS and SCHISM.
@uturuncoglu I understand about not testing restart capability at this time. The issue I've heard about second-hand is that restart writing when using PDLIB is very, very slow. If you don't need WW3 restarts, I wouldn't write them.
@DeniseWorthen Okay. Thanks for the clarification and your help. I thought there was an issue in the restart files themselves; good to know. Maybe this is more problematic for high-resolution cases. @yunfangsun you could also try disabling the writing of restart files, as @DeniseWorthen suggested, to see if there is any performance improvement.
Hi @uturuncoglu,
Thank you for the help.
@yunfangsun From your mediator.log file, I see that it is still using the default of 4 iotasks. If you have 1600 cores for CMEPS, I would suggest trying a higher number of iotasks. With a stride of 4 you should be able to get 400 I/O tasks. Set the number of tasks with pio_numiotasks.
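Denise's arithmetic can be sketched as a tiny helper (hypothetical, not part of any UFS tooling) for picking an upper bound on pio_numiotasks from the mediator PET count and stride:

```python
def max_iotasks(pet_count: int, stride: int) -> int:
    """Largest pio_numiotasks that fits when every stride-th
    PET is designated as an I/O task."""
    if pet_count < 1 or stride < 1:
        raise ValueError("pet_count and stride must be positive")
    return pet_count // stride

# 1600 mediator PETs with a stride of 4 allows up to 400 I/O tasks
print(max_iotasks(1600, 4))  # -> 400
```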
@yunfangsun That is great news. Glad it worked. I agree with @DeniseWorthen about increasing the number of tasks for I/O. If you have time and don't mind, could you do a couple of runs (different numbers of I/O tasks, stride, etc.) and collect some timing results? It would be nice to change one parameter at a time to see its effect. I think that would be very helpful for the future, and we could use it as a reference for other cases. In production runs, we could disable mediator history and restart, or write them only at the end of the simulation, to optimize the I/O further.
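The one-parameter-at-a-time sweep suggested above can be planned in advance. A small hypothetical helper (the parameter names and values are illustrative) that enumerates the run configurations:

```python
from copy import deepcopy

def one_at_a_time(baseline, variations):
    """Return run configs that each change exactly one
    parameter relative to the baseline."""
    runs = []
    for key, values in variations.items():
        for value in values:
            if value == baseline.get(key):
                continue  # the baseline itself is a separate run
            cfg = deepcopy(baseline)
            cfg[key] = value
            runs.append(cfg)
    return runs

baseline = {"pio_numiotasks": 4, "pio_stride": 4}
runs = one_at_a_time(baseline, {"pio_numiotasks": [8, 64, 400],
                                "pio_stride": [8]})
print(len(runs))  # -> 4 runs in addition to the baseline
```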
Hi @uturuncoglu, Sure, I will do the test when Hercules is back online.
Hi @uturuncoglu and @DeniseWorthen, I have tried different pio_numiotasks settings: pio_numiotasks = 8 The speeds show no difference among the three cases.
@yunfangsun Thanks for running the additional tests. The results are a little bit interesting. I wonder about your frequency for writing history and restart files; I think if you increase that frequency you might start seeing some difference. If you don't mind, could you check your configuration?
Hi @uturuncoglu The frequency for the history and restart files is the same for all the tests, as follows:
@yunfangsun The history and restart file intervals are configured via ufs.configure and are not related to the run sequence. Could you share your ufs.configure?
Hi @uturuncoglu You can check out the case in /work2/noaa/nos-surge/yunfangs/stmp/yunfangs/FV3_RT/rt_324728_atmschww3_06132024/coastal_ian_hsofs_atm2sch2ww3_intel_5_4 The part related to history and restart is as follows:
@yunfangsun It seems that you are writing a history file every hour and a restart every 12 hours. If you don't mind, could you confirm this from your run directory?
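For context, an hourly-history / 12-hourly-restart setup like the one being confirmed here would look roughly like this in ufs.configure. This is a sketch only: restart_n = 12 is quoted later in the thread, while the history attribute names and the attribute block are assumptions to be checked against the actual file:

```
ALLCOMP_attributes::       # block name is an assumption; may differ
  history_option = nhours
  history_n = 1            # mediator history every hour
  restart_option = nhours
  restart_n = 12           # mediator restart every 12 hours
::
```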
Hi @uturuncoglu, Yes, I can confirm it.
@uturuncoglu @saeed-moghimi-noaa @janahaddad @pvelissariou1
This is to document a bug related to med_phases_history_write and med_phases_restart_write.
The original run sequence for ufs_atm2sch2ww3 is as follows:
It works for the coastal_ike_shinnecock_atm2sch2ww3 case; however, when this configuration is applied to a higher-resolution mesh (HSOFS, 1.8 million nodes), the simulation stopped after 1 simulated hour on 1600 cores. This case is located at /work2/noaa/nos-surge/yunfangs/stmp/yunfangs/FV3_RT/rt_324728_atmschww3_06132024/coastal_ian_hsofs_atm2sch2ww3_intel on Hercules.
Then I tried removing MED med_phases_history_write and, using 200 cores for the same case, it stopped after a 16-hour simulation (/work2/noaa/nos-surge/yunfangs/stmp/yunfangs/FV3_RT/rt_324728_atmschww3_06132024/coastal_ian_hsofs_atm2sch2ww3_intel_test3). With the same configuration but 6000 cores, the case stopped after a 12-hour simulation (/work2/noaa/nos-surge/yunfangs/stmp/yunfangs/FV3_RT/rt_324728_atmschww3_06132024/coastal_ian_hsofs_atm2sch2ww3_intel_test4).
Although restart_n = 12 is in this ufs.configure, once the restart was turned off by restart_option = never, the case with 6000 cores could finish the full 17-day simulation (/work2/noaa/nos-surge/yunfangs/stmp/yunfangs/FV3_RT/rt_324728_atmschww3_06132024/coastal_ian_hsofs_atm2sch2ww3_intel_test4_1).
The problem is that in the 200-core run the restart is also set to restart_n = 12, yet the simulation stopped at 16 hours. It seems med_phases_history_write and med_phases_restart_write cannot function well in this case (both have to be turned off). The question, then, is how to correctly configure this ufs.configure.
The details are at oceanmodeling/ufs-weather-model#103
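Taken together, the workaround described above amounts to two edits in ufs.configure, sketched here as comments rather than a verified configuration:

```
# 1) Run sequence: delete the mediator history phase
#      MED med_phases_history_write
# 2) Attributes: disable mediator restart writes
restart_option = never
```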