
Add parallel netcdf read/write from EnKF for sfc files (paranc option) #707 #709

Merged
merged 14 commits into NOAA-EMC:develop on Mar 19, 2024

Conversation

@tsga (Contributor) commented Feb 26, 2024

DUE DATE for merger of this PR into develop is 4/8/2024 (six weeks after PR creation).

Description
Currently, when the soil analysis is used we turn off parallel read/write, since the parallel routines to read and write the soil states and increments have not been coded. However, the default in operations is parallel read/write (paranc=.true.).
This feature addition enables reading and writing the land states either in parallel or in serial (as in the past), based on the user configuration.
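For illustration, a minimal sketch of the user-facing toggle (the namelist group name nam_enkf is an assumption here; only the paranc flag itself is established by this PR):

    &nam_enkf
       paranc = .true.   ! .true. = parallel netCDF read/write of sfc files (operational default); .false. = serial, as before
    /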

Fixes #707

Type of change

  • New feature (non-breaking change which adds functionality)

How Has This Been Tested?

All 7 regression tests passed on Hera.

Start testing: Feb 26 16:52 UTC

1/7 Testing: global_4denvar
Test time = 2711.52 sec
Test Passed.

2/7 Testing: rtma
Test time = 1999.66 sec
Test Passed.

3/7 Testing: rrfs_3denvar_glbens
Test time = 3678.76 sec
Test Passed.

4/7 Testing: netcdf_fv3_regional
Test time = 672.00 sec
Test Passed.

5/7 Testing: hafs_4denvar_glbens
Test time = 1825.27 sec
Test Passed.

6/7 Testing: hafs_3denvar_hybens
Test time = 2484.58 sec
Test Passed.

7/7 Testing: global_enkf
Test time = 2749.58 sec
Test Passed.

End testing: Feb 26 21:21 UTC

Checklist

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • New and existing tests pass with my changes
  • Any dependent changes have been merged and published

@RussTreadon-NOAA (Contributor) commented

@tsga , two questions:

  1. Who do you want as peer reviewers for this PR? Two peer reviews and approvals are needed.
  2. Do you have WCOSS2 access? ctests must be run on WCOSS2 and at least one NOAA RDHPCS machine. You ran ctests on Hera, which satisfies the latter. The former still needs to be completed.

@tsga (Contributor, Author) commented Feb 27, 2024

@RussTreadon-NOAA, I think @ClaraDraper-NOAA and @jswhit would be the right people to do the reviews.

I have a WCOSS account, but I am having a hard time logging into it. I will post the test results here as soon as I manage to run them.

@ClaraDraper-NOAA (Contributor) left a review comment

@tsga A few small changes. Also, if you have not already, can you please run the soil analysis using both the parallel and non-parallel read/write versions and check that the results are the same? Should we expect bit-compatibility?

src/enkf/controlvec.f90 (review comment, outdated, resolved)
src/enkf/controlvec.f90 (review comment, outdated, resolved)
src/enkf/gridio_gfs.f90 (review comment, outdated, resolved)
@tsga (Contributor, Author) commented Mar 12, 2024

@tsga A few small changes. Also, if you have not already, can you please run the soil analysis using both the parallel and non-parallel read/write versions and check that the results are the same? Should we expect bit-compatibility?

@ClaraDraper-NOAA thank you. I have now made the changes you recommended.
The soil moisture increments are the same for both the parallel and non-parallel versions (checked using ncdiff and diff). The writeincrements_paranc output has three additional variables (rwmr_inc, snmr_inc, grle_inc(lev, lat, lon)), but the variables common to both the parallel and non-parallel versions have the same values.
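For reference, the comparison was along the following lines (a sketch only; the file names are placeholders, not the actual paths):

    ncdiff -O incr_paranc.nc incr_serial.nc incr_diff.nc            # NCO difference of the two increment files
    ncdump incr_diff.nc | less                                      # variables common to both should be all zeros
    diff <(ncdump -h incr_paranc.nc) <(ncdump -h incr_serial.nc)    # compare headers/variable lists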

@tsga (Contributor, Author) commented Mar 12, 2024

@tsga , two questions:

  1. Who do you want as peer reviewers for this PR? Two peer reviews and approvals are needed.
  2. Do you have WCOSS2 access? ctests must be run on WCOSS2 and at least one NOAA RDHPCS machine. You ran ctests on Hera, which satisfies the latter. The former still needs to be completed.

@RussTreadon-NOAA the "global_4denvar" test keeps failing due to "permission denied on file access".

The full log file is at: /lfs/h2/emc/da/noscrub/tseganeh.gichamo/GSI/logerr_global_4denvar

@RussTreadon-NOAA (Contributor) commented

@tsga , file prepbufr_profl is a restricted access (rstprod) file. I checked your Cactus account. You do not belong to the rstprod group.

russ.treadon@clogin05:/u> groups tseganeh.gichamo
tseganeh.gichamo : emc da global backupsys

To learn more about restricted data, including how to request access, go to https://www.nco.ncep.noaa.gov/pmb/docs/restricted_data/

@tsga (Contributor, Author) commented Mar 12, 2024

@RussTreadon-NOAA, thank you! I got added to rstprod, and the test passes now.

@ClaraDraper-NOAA (Contributor) commented

@tsga A few small changes. Also, if you have not already, can you please run the soil analysis using both the parallel and non-parallel read/write versions and check that the results are the same? Should we expect bit-compatibility?

@ClaraDraper-NOAA thank you. I have now made the changes you recommended. The soil moisture increments are the same for both the parallel and non-parallel versions (checked using ncdiff and diff). The writeincrements_paranc output has three additional variables (rwmr_inc, snmr_inc, grle_inc(lev, lat, lon)), but the variables common to both the parallel and non-parallel versions have the same values.

Great, thanks.

@RussTreadon-NOAA self-assigned this Mar 18, 2024

@RussTreadon-NOAA (Contributor) left a review comment

I only looked at syntax, not science. Changes look OK.

The existing global_enkf ctest does not exercise the new paranc option, right? We should consider updating global_enkf at some point to ensure that the new functionality added by this PR is not broken by subsequent PRs.

src/enkf/controlvec.f90 (review comment, outdated, resolved)
src/enkf/gridio_gfs.f90 (review comment, outdated, resolved)
@ClaraDraper-NOAA (Contributor) commented

I only looked at syntax, not science. Changes look OK.

The existing global_enkf ctest does not exercise the new paranc option, right? We should consider updating global_enkf at some point to ensure that the new functionality added by this PR is not broken by subsequent PRs.

@RussTreadon-NOAA paranc is used in operations, so I'd expect it to be used in the existing ctests. What this PR does is allow us to use paranc with the soil analysis (up to now, I've just been turning it off). I haven't yet added a ctest for the soil analysis, but that's where we'd want to test this code (I think?).

@RussTreadon-NOAA (Contributor) commented

Thank you @ClaraDraper-NOAA. Could we turn on the soil analysis for this case by changing namelist options and updating input files? global_enkf runs a 10-member ensemble at C48L127. If setting up a soil analysis test isn't trivial, it can be captured in a new issue and a subsequent PR.

@ClaraDraper-NOAA (Contributor) commented

Thank you @ClaraDraper-NOAA. Could we turn on the soil analysis for this case by changing namelist options and updating input files? global_enkf runs a 10-member ensemble at C48L127. If setting up a soil analysis test isn't trivial, it can be captured in a new issue and a subsequent PR.

Adding a ctest is probably a good idea. I've not added a ctest to GSI before, but I don't think it'd be too hard. The output of turning on the soil analysis is an sfc increment file with soil moisture and soil temperature increments. The inputs are modified convinfo and anavinfo files and some namelist variables.

@tsga - my global_workflow PR has been merged, so you can pull the necessary changes from that.

@RussTreadon-NOAA (Contributor) commented

@ClaraDraper-NOAA and @tsga , we do not want to add another ctest. GSI ctests are actually regression tests. They are not quick to run.

If we can turn on the soil analysis in the current C48L127 global_enkf case, great. If not, we can open a new issue to update the global_4denvar and global_enkf ctests to mimic GFS v17. The GSI needs modifications (issue #719) to run with Thompson microphysics. A new C96/C48L127 GFS v17 case could be created to cover both soil analysis (global_enkf) and atmospheric DA (global_4denvar).

@RussTreadon-NOAA (Contributor) commented

@tsga , are you done working on tsga:feature/paranc_sfc? May this PR be sent to the GSI handling review team for final approval?

@tsga (Contributor, Author) commented Mar 19, 2024

@tsga , are you done working on tsga:feature/paranc_sfc? May this PR be sent to the GSI handling review team for final approval?

@RussTreadon-NOAA, yes I am done. Thank you.

@RussTreadon-NOAA (Contributor) commented

One final run of ctests on WCOSS2 (Cactus) yielded the following results:

Test project /lfs/h2/emc/da/noscrub/russ.treadon/git/gsi/pr709/build
    Start 1: global_4denvar
    Start 2: rtma
    Start 3: rrfs_3denvar_glbens
    Start 4: netcdf_fv3_regional
    Start 5: hafs_4denvar_glbens
    Start 6: hafs_3denvar_hybens
    Start 7: global_enkf
1/7 Test #4: netcdf_fv3_regional ..............   Passed  484.18 sec
2/7 Test #3: rrfs_3denvar_glbens ..............   Passed  486.74 sec
3/7 Test #7: global_enkf ......................   Passed  853.02 sec
4/7 Test #2: rtma .............................   Passed  968.82 sec
5/7 Test #6: hafs_3denvar_hybens ..............   Passed  1213.51 sec
6/7 Test #5: hafs_4denvar_glbens ..............   Passed  1333.14 sec
7/7 Test #1: global_4denvar ...................   Passed  1682.81 sec

100% tests passed, 0 tests failed out of 7

Total Test time (real) = 1682.82 sec

@RussTreadon-NOAA (Contributor) commented

Orion ctests

Test project /work2/noaa/da/rtreadon/git/gsi/pr709/build
    Start 1: global_4denvar
    Start 2: rtma
    Start 3: rrfs_3denvar_glbens
    Start 4: netcdf_fv3_regional
    Start 5: hafs_4denvar_glbens
    Start 6: hafs_3denvar_hybens
    Start 7: global_enkf
1/7 Test #4: netcdf_fv3_regional ..............   Passed  604.22 sec
2/7 Test #3: rrfs_3denvar_glbens ..............***Failed  846.41 sec
3/7 Test #7: global_enkf ......................   Passed  849.27 sec
4/7 Test #2: rtma .............................***Failed  1088.60 sec
5/7 Test #6: hafs_3denvar_hybens ..............   Passed  1461.27 sec

rrfs_3denvar_glbens failed due to

The runtime for rrfs_3denvar_glbens_hiproc_updat is 148.812041 seconds.  This has exceeded maximum allowable threshold time of 113.789052 seconds, resulting in Failure of timethresh2 the regression test.

gsi.x wall times for the various jobs are

rrfs_3denvar_glbens_hiproc_contrl/stdout:The total amount of wall time                        = 75.859368
rrfs_3denvar_glbens_hiproc_updat/stdout:The total amount of wall time                        = 148.812041
rrfs_3denvar_glbens_loproc_contrl/stdout:The total amount of wall time                        = 200.123084
rrfs_3denvar_glbens_loproc_updat/stdout:The total amount of wall time                        = 256.197317

While the updat times are considerably higher than contrl, the /work fileset in which the jobs ran can result in highly variable run times.

The rtma test failed due to

The runtime for rtma_loproc_updat is 273.179041 seconds.  This has exceeded maximum allowable threshold time of 263.744633 seconds, resulting in Failure time-thresh of the regression test.

A check of gsi.x wall times shows

rtma_hiproc_contrl/stdout:The total amount of wall time                        = 220.252255
rtma_hiproc_updat/stdout:The total amount of wall time                        = 212.423120
rtma_loproc_contrl/stdout:The total amount of wall time                        = 239.767849
rtma_loproc_updat/stdout:The total amount of wall time                        = 273.179041

The hiproc_updat ran faster than the contrl. The opposite is true for the loproc jobs.

The hafs_4denvar_glbens test failed due to

The runtime for hafs_4denvar_glbens_loproc_updat is 535.172052 seconds.  This has exceeded maximum allowable threshold time of 454.057084 seconds, resulting in Failure time-thresh of the regression test.

The gsi.x wall times are

hafs_4denvar_glbens_hiproc_contrl/stdout:The total amount of wall time                        = 309.854839
hafs_4denvar_glbens_hiproc_updat/stdout:The total amount of wall time                        = 304.649220
hafs_4denvar_glbens_loproc_contrl/stdout:The total amount of wall time                        = 412.779168
hafs_4denvar_glbens_loproc_updat/stdout:The total amount of wall time                        = 535.172052

The hiproc wall times are comparable. The updat wall time is considerably higher than contrl for the loproc configuration.

While not desirable, none of these failures are regarded as fatal given the impact the Orion /work fileset has on job run times.

@ClaraDraper-NOAA (Contributor) commented

@ClaraDraper-NOAA and @tsga , we do not want to add another ctest. GSI ctests are actually regression tests. They are not quick to run.

If we can turn on the soil analysis in the current C48L127 global_enkf case, great. If not, we can open a new issue to update the global_4denvar and global_enkf ctests to mimic GFS v17. The GSI needs modifications (issue #719) to run with Thompson microphysics. A new C96/C48L127 GFS v17 case could be created to cover both soil analysis (global_enkf) and atmospheric DA (global_4denvar).

In that case, I suggest we open a new issue to update the two tests that you identified. Doing that together with the new microphysics works for me.

@RussTreadon-NOAA (Contributor) commented

@ShunLiu-NOAA , @hu5970 , and @CoryMartin-NOAA : This PR is ready for merger into develop

Clara & Jeff reviewed and approved the changes. Ctests were run and yield acceptable results on WCOSS2 (Cactus), Hera, and Orion.

@RussTreadon-NOAA merged commit 4e8107c into NOAA-EMC:develop on Mar 19, 2024
4 checks passed