Assimilate GMI in GSI (#689) #692

Merged
merged 11 commits into from
Mar 12, 2024
Merged

Conversation

xincjin-NOAA
Contributor

@xincjin-NOAA xincjin-NOAA commented Jan 24, 2024

DUE DATE for merger of this PR into develop is 3/6/2024 (six weeks after PR creation).

Description
This pull request is related to #689.
Resolves #689

The original code for assimilating GMI in GSI is not working properly.
The main changes are:

  • modified and simplified the superobbing algorithm in ssmis_spatial_average_mod.f90;
  • various changes in read_gmi.f90, e.g., increased the maxobs value;
  • removed clw impacts on bias predictors 6 and 7 in setuprad.f90;
  • modified the clw impact on the bias predictors in radiance_mod.f90;
  • modified the function deter_sfc_gmi so that it can be applied to a larger domain;
  • added a condition in function gmi_37pol_diff to exclude unreasonable data.

Type of change

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)

How Has This Been Tested?

The changes have been verified in a few experiments with more than two months of running time.

Checklist

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • New and existing tests pass with my changes
  • Any dependent changes have been merged and published

@xincjin-NOAA
Contributor Author

@RussTreadon-NOAA Could you add @emilyhcliu , @ADCollard , and @azadeh-gh as reviewers?

I have run the regression tests on Hera, and all 7 tests passed.

@RussTreadon-NOAA
Contributor

Please post ctest results in this PR or the originating issue, #689. Ctests need to be run on WCOSS2, Hera, Orion, and Hercules. This PR will be closed and returned to the developer unless at least two peer reviews with approvals and ctests are completed.

@xincjin-NOAA
Contributor Author

  1. The ctest on Orion passed.

  2. There is one failure on WCOSS2:

The following tests FAILED:

      1 - global_4denvar (Failed)

It seems a few files do not exist, e.g.

amsuabufr_db -> /lfs/h2/emc/da/noscrub/russ.treadon/CASES/regtest/gfs/prod/gdas.20221109/00/atmos/gdas.t00z.amuadb.tm00.bufr_d

The tmpdir is /lfs/h2/emc/ptmp/xin.c.jin/GSI/tmpreg_global_4denvar/global_4denvar_loproc_updat>

  3. There are two failures on Hercules (under review).

@xincjin-NOAA
Contributor Author

The two ctest failures on Hercules are:

  1. hafs_3denvar:
    The runtime for hafs_3denvar_hybens_loproc_updat is 254.941863 seconds and is within the maximum allowable operational time of 1200 seconds, continuing with regression test.

The runtime for hafs_3denvar_hybens_loproc_updat is 254.941863 seconds and is within the allowable threshold time of 413.289465 seconds, continuing with regression test.

The runtime for hafs_3denvar_hybens_hiproc_updat is 200.579084 seconds and is within the allowable threshold time of 331.662442 seconds, continuing with regression test.

The memory for hafs_3denvar_hybens_loproc_updat is 2532516 KBs and is within the maximum allowable memory of 2780342 KBs, continuing with regression test.

The results (penalty) between the two runs (hafs_3denvar_hybens_loproc_updat and hafs_3denvar_hybens_loproc_contrl) are reproducible.

The fv3_dynvars are reproducible
The fv3_sfcdata are reproducible
The fv3_tracer are reproducible
The results between the two runs (hafs_3denvar_hybens_loproc_updat and hafs_3denvar_hybens_loproc_contrl) are reproducible since the corresponding results are identical.

The results (penalty) between the two runs (hafs_3denvar_hybens_loproc_updat and hafs_3denvar_hybens_hiproc_updat) are reproducible

The fv3_sfcdata are reproducible
The fv3_tracer are reproducible
The results between the two runs (hafs_3denvar_hybens_loproc_updat and hafs_3denvar_hybens_hiproc_updat) are not reproducible Thus, the case has Failed siganl of the regression tests.

  2. hafs_4denvar:

The runtime for hafs_4denvar_glbens_loproc_updat is 328.091246 seconds and is within the maximum allowable operational time of 1200 seconds, continuing with regression test.

The runtime for hafs_4denvar_glbens_loproc_updat is 328.091246 seconds. This has exceeded maximum allowable threshold time of 326.818903 seconds, resulting in Failure time-thresh of the regression test.

The runtime for hafs_4denvar_glbens_hiproc_updat is 254.226137 seconds and is within the allowable threshold time of 282.596399 second, continuing with regression test.

The memory for hafs_4denvar_glbens_loproc_updat is 2858764 KBs and is within the maximum allowable memory of 3199819 KBs, continuing with regression test.

The results (penalty) between the two runs (hafs_4denvar_glbens_loproc_updat and hafs_4denvar_glbens_loproc_contrl) are reproducible.

The fv3_dynvars are reproducible
The fv3_sfcdata are reproducible
The fv3_tracer are reproducible
The results between the two runs (hafs_4denvar_glbens_loproc_updat and hafs_4denvar_glbens_loproc_contrl) are reproducible
since the corresponding results are identical.

The results (penalty) between the two runs (hafs_4denvar_glbens_loproc_updat and hafs_4denvar_glbens_hiproc_updat) are reproducible

The fv3_dynvars are reproducible
The fv3_sfcdata are reproducible
The fv3_tracer are reproducible
The results between the two runs (hafs_4denvar_glbens_loproc_updat and hafs_4denvar_glbens_hiproc_updat) are reproducible
since the corresponding results are identical.

Any comments on how to deal with these failures?

@RussTreadon-NOAA
Contributor

WCOSS2 failure

stdout in /lfs/h2/emc/ptmp/xin.c.jin/GSI/tmpreg_global_4denvar/global_4denvar_loproc_updat contains the following traceback

forrtl: Permission denied
forrtl: severe (9): permission to access file denied, unit 15, file /lfs/h2/emc/ptmp/xin.c.jin/GSI/tmpreg_global_4denvar/global_4denvar_loproc_up\
dat/prepbufr_profl
Image              PC                Routine            Line        Source
gsi.x              00000000020006BB  Unknown               Unknown  Unknown
gsi.x              000000000201C3E0  Unknown               Unknown  Unknown
gsi.x              00000000008D30AB  read_obsmod_mp_re         201  read_obs.F90
gsi.x              00000000008C75C6  read_obsmod_mp_re        1092  read_obs.F90
gsi.x              00000000007F1C0C  observermod_mp_se         331  observer.F90
gsi.x              00000000010A5223  glbsoi_                   222  glbsoi.f90
gsi.x              00000000006572D7  gsisub_                   200  gsisub.F90
gsi.x              00000000004137FD  gsimod_mp_gsimain        2414  gsimod.F90

File prepbufr_profl is a restricted access (rstprod) file

-rw-r----- 1 russ.treadon rstprod 15192992 Nov 11  2022 prepbufr_profl

@xincjin-NOAA , you do not belong to the WCOSS2 rstprod group.

xin.c.jin : emc da global backupsys

You need to request rstprod access. Please visit NCO's Restricted Data Information page to learn how to request rstprod access.
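
Group membership of the kind checked above can also be verified before launching the ctests. The following is a minimal Python sketch, not part of GSI or its regression suite; the helper names `missing_groups` and `current_user_groups` are hypothetical, and the group list is the one reported for xin.c.jin in this comment.

```python
import os


def missing_groups(required, user_groups):
    """Return the required group names that are absent from user_groups."""
    have = set(user_groups)
    return [name for name in required if name not in have]


def current_user_groups():
    """Names of the groups of the current process (POSIX systems only)."""
    import grp  # POSIX-only module, imported lazily

    names = []
    for gid in os.getgroups():
        try:
            names.append(grp.getgrgid(gid).gr_name)
        except KeyError:
            continue  # gid with no name entry
    return names


# Group list reported for xin.c.jin in this comment.
reported = ["emc", "da", "global", "backupsys"]
print(missing_groups(["rstprod"], reported))  # ['rstprod']
```

On a login node, `missing_groups(["rstprod"], current_user_groups())` would perform the same check for whoever runs the script.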

@RussTreadon-NOAA
Contributor

Hercules failure

The hafs_3denvar_hybens failure is a known problem on Hercules. On a fairly regular basis the contents of analysis file fv3_dynvars vary between the loproc and hiproc runs of gsi.x. Repeating the ctest can yield a Passed or another Failed. GSI issue #697 was opened to report this reproducibility problem. The regional DA team is aware of this problem.

The hafs_4denvar_glbens failure is not a fatal fail. A check of the gsi.x wall times for the various runs shows

hafs_4denvar_glbens_hiproc_contrl/stdout:The total amount of wall time                        = 256.905818
hafs_4denvar_glbens_hiproc_updat/stdout:The total amount of wall time                        = 254.226137
hafs_4denvar_glbens_loproc_contrl/stdout:The total amount of wall time                        = 297.108094
hafs_4denvar_glbens_loproc_updat/stdout:The total amount of wall time                        = 328.091246

The loproc_updat ran 31 seconds longer than the loproc_contrl. The work/ fileset in which the ctests ran is known to suffer from latency issues. The above range of gsi.x times is acceptable on Hercules.
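
The comparison above can be reproduced from the grep output. A minimal Python sketch, assuming the `run/stdout:The total amount of wall time = N` line format shown above; the function names are illustrative and are not part of the GSI regression scripts.

```python
import re

# Matches lines such as
# "hafs_..._loproc_updat/stdout:The total amount of wall time   = 328.091246"
WALL_RE = re.compile(
    r"^(?P<run>[^/\s]+)/stdout:The total amount of wall time\s*=\s*(?P<sec>[\d.]+)"
)


def parse_wall_times(text):
    """Map run directory name -> wall time in seconds."""
    times = {}
    for line in text.splitlines():
        m = WALL_RE.match(line.strip())
        if m:
            times[m.group("run")] = float(m.group("sec"))
    return times


def updat_minus_contrl(times):
    """Return {case prefix: updat time - contrl time} for each complete pair."""
    diffs = {}
    for run, sec in times.items():
        if run.endswith("_updat"):
            base = run[: -len("_updat")]
            contrl = times.get(base + "_contrl")
            if contrl is not None:
                diffs[base] = sec - contrl
    return diffs


hercules = """\
hafs_4denvar_glbens_hiproc_contrl/stdout:The total amount of wall time                        = 256.905818
hafs_4denvar_glbens_hiproc_updat/stdout:The total amount of wall time                        = 254.226137
hafs_4denvar_glbens_loproc_contrl/stdout:The total amount of wall time                        = 297.108094
hafs_4denvar_glbens_loproc_updat/stdout:The total amount of wall time                        = 328.091246
"""

for case, diff in updat_minus_contrl(parse_wall_times(hercules)).items():
    print(f"{case}: updat - contrl = {diff:+.2f} s")
```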

@RussTreadon-NOAA
Contributor

Please post Hera ctests results when they are available.

Since this PR adds code to assimilate GMI, we need confirmation either in this PR or in the originating issue, #689 , that the changes in this PR result in gsi.x and enkf.x correctly assimilating GMI. At some point we should update at least the global ctests to a case which assimilates GMI. Otherwise, we risk the possibility of future GSI PRs breaking the GMI assimilation capability this PR adds.

@xincjin-NOAA
Contributor Author

The ctests on Hera passed 6 tests. The test that did not pass reports the following:

[Xin.C.Jin@hfe08 regression]$ more hafs_4denvar_glbens_regression_results.txt
The runtime for hafs_4denvar_glbens_loproc_updat is 360.988536 seconds and is within the maximum allowable operational time of 1200 seconds, continuing with regression test.

The runtime for hafs_4denvar_glbens_loproc_updat is 360.988536 seconds. This has exceeded maximum allowable threshold time of 329.516453 seconds, resulting in Failure time-thresh of the regression test.

The runtime for hafs_4denvar_glbens_hiproc_updat is 263.256873 seconds. This has exceeded maximum allowable threshold time of 260.885010 seconds, resulting in Failure of timethresh2 the regression test.

The memory for hafs_4denvar_glbens_loproc_updat is 2882748 KBs and is within the maximum allowable memory of 3244072 KBs, continuing with regression test.

The results (penalty) between the two runs (hafs_4denvar_glbens_loproc_updat and hafs_4denvar_glbens_loproc_contrl) are reproducible.

The fv3_dynvars are reproducible
The fv3_sfcdata are reproducible
The fv3_tracer are reproducible
The results between the two runs (hafs_4denvar_glbens_loproc_updat and hafs_4denvar_glbens_loproc_contrl) are reproducible
since the corresponding results are identical.

The results (penalty) between the two runs (hafs_4denvar_glbens_loproc_updat and hafs_4denvar_glbens_hiproc_updat) are reproducible

The fv3_dynvars are reproducible
The fv3_sfcdata are reproducible
The fv3_tracer are reproducible
The results between the two runs (hafs_4denvar_glbens_loproc_updat and hafs_4denvar_glbens_hiproc_updat) are reproducible
since the corresponding results are identical.
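
The two timing failures in the results file above can be pulled out and quantified. A minimal Python sketch, assuming the message wording shown in this results file; the regular expression and function name are illustrative and are not taken from the GSI regression scripts.

```python
import re

# Matches only the "exceeded" messages, not the "within threshold" ones.
EXCEEDED = re.compile(
    r"The runtime for (?P<run>\S+) is (?P<sec>[\d.]+) seconds\. "
    r"This has exceeded maximum allowable threshold time of "
    r"(?P<limit>[\d.]+) seconds"
)


def timing_failures(text):
    """Return (run, seconds, limit, overshoot) for each exceeded threshold."""
    out = []
    for m in EXCEEDED.finditer(text):
        sec, limit = float(m.group("sec")), float(m.group("limit"))
        out.append((m.group("run"), sec, limit, sec - limit))
    return out


results = """\
The runtime for hafs_4denvar_glbens_loproc_updat is 360.988536 seconds and is within the maximum allowable operational time of 1200 seconds, continuing with regression test.
The runtime for hafs_4denvar_glbens_loproc_updat is 360.988536 seconds. This has exceeded maximum allowable threshold time of 329.516453 seconds, resulting in Failure time-thresh of the regression test.
The runtime for hafs_4denvar_glbens_hiproc_updat is 263.256873 seconds. This has exceeded maximum allowable threshold time of 260.885010 seconds, resulting in Failure of timethresh2 the regression test.
"""

for run, sec, limit, over in timing_failures(results):
    print(f"{run}: {over:.2f} s over ({100 * over / limit:.1f}% of threshold)")
```

For the numbers posted here, the loproc run overshoots its threshold by about 31.47 seconds and the hiproc run by about 2.37 seconds.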

@RussTreadon-NOAA
Contributor

@xincjin-NOAA , both of the failed checks for the Hera hafs_4denvar_glbens ctest involved wall time. Please examine the wall times for all four hafs_4denvar_glbens runs to see if any timings look anomalous.

@xincjin-NOAA
Contributor Author

The wall times are as follows:
hafs_4denvar_glbens_hiproc_contrl/stdout:The total amount of wall time = 237.168191
hafs_4denvar_glbens_hiproc_updat/stdout:The total amount of wall time = 263.256873
hafs_4denvar_glbens_loproc_contrl/stdout:The total amount of wall time = 299.560412
hafs_4denvar_glbens_loproc_updat/stdout:The total amount of wall time = 360.988536

The updat runs are slower than the contrl ones.

The breakdown for the loproc runs is as follows:

hafs_4denvar_glbens_loproc_contrl/stdout
The total amount of wall time = 299.560412
The total amount of time in user mode = 259.579043
The total amount of time in sys mode = 27.728789

hafs_4denvar_glbens_loproc_updat/stdout
The total amount of wall time = 360.988536
The total amount of time in user mode = 289.760696
The total amount of time in sys mode = 19.428450

I guess I don't have enough knowledge to judge if they are normal or not.

@RussTreadon-NOAA
Contributor

Some options to consider

  1. run ctest hafs_4denvar_glbens on Hera at a different time of day (e.g., early morning, late evening) to see if wall clock times improve for updat.
  2. check gsi.x wall times for other ctests on Hera. Is similar behavior observed?
  3. check hafs_4denvar_glbens wall times for gsi.x on other machines. Is similar behavior observed?

@xincjin-NOAA
Contributor Author

hafs_4denvar_glbens_hiproc_contrl/stdout:The total amount of wall time = 267.771050
hafs_4denvar_glbens_hiproc_updat/stdout:The total amount of wall time = 245.148727
hafs_4denvar_glbens_loproc_contrl/stdout:The total amount of wall time = 303.032343
hafs_4denvar_glbens_loproc_updat/stdout:The total amount of wall time = 340.304098

These are the results from a new ctest in which I EXCHANGED the locations of contrl and updat. This means that updat now represents the GSI code from the develop branch.

I am not sure whether you can check the git branches in the directories /scratch1/NCEPDEV/da/Xin.C.Jin/git/GSI and /scratch1/NCEPDEV/da/Xin.C.Jin/git/develop.

It seems that the wall time is related to the order of the test runs.
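
One way to separate the effect of the code from the effect of the test slot is to average the two Hera runs both ways. A minimal Python sketch using the wall times posted in this thread; it assumes that in the first run updat was built from gmi_new and contrl from develop, and that the second run swapped them as described above. The record layout and function name are illustrative, and two runs are too few for a firm conclusion.

```python
from statistics import mean

# Wall times (seconds) from the two Hera ctest runs posted in this thread.
# ASSUMPTION: run 1 used updat = gmi_new, contrl = develop; run 2 swapped them.
runs = [
    {"proc": "hiproc", "slot": "contrl", "code": "develop", "sec": 237.168191},
    {"proc": "hiproc", "slot": "updat", "code": "gmi_new", "sec": 263.256873},
    {"proc": "loproc", "slot": "contrl", "code": "develop", "sec": 299.560412},
    {"proc": "loproc", "slot": "updat", "code": "gmi_new", "sec": 360.988536},
    {"proc": "hiproc", "slot": "contrl", "code": "gmi_new", "sec": 267.771050},
    {"proc": "hiproc", "slot": "updat", "code": "develop", "sec": 245.148727},
    {"proc": "loproc", "slot": "contrl", "code": "gmi_new", "sec": 303.032343},
    {"proc": "loproc", "slot": "updat", "code": "develop", "sec": 340.304098},
]


def mean_by(records, proc, key):
    """Mean wall time for one proc configuration, grouped by 'slot' or 'code'."""
    groups = {}
    for rec in records:
        if rec["proc"] == proc:
            groups.setdefault(rec[key], []).append(rec["sec"])
    return {name: mean(values) for name, values in groups.items()}


for proc in ("loproc", "hiproc"):
    for key in ("slot", "code"):
        summary = ", ".join(
            f"{name}={sec:.2f}" for name, sec in sorted(mean_by(runs, proc, key).items())
        )
        print(f"{proc} grouped by {key}: {summary}")
```

If the slot means differ more than the code means for a given configuration, the ordering of the runs is the more likely explanation for that configuration.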

@xincjin-NOAA
Contributor Author

The ctests on WCOSS2 passed.

@xincjin-NOAA
Contributor Author

xincjin-NOAA commented Feb 15, 2024

As for updating the global ctests to include GMI assimilation: do we need to run the ctests again after this update? I ask because the ctests will fail.

@RussTreadon-NOAA
Contributor

Reminder: Due date for merger of this PR into develop is 3/6/2024. This is two weeks from today (2/21/2024). @xincjin-NOAA, please reach out to your peer reviewers. We need at least two peer reviews with approval. If this PR is not merged into develop by 3/6/2024, the PR will be closed and returned to @xincjin-NOAA.

@RussTreadon-NOAA
Contributor

ctest note

The ObsProc team may be able to generate GMI bufr dump files for portions of February 2024. If this is possible, we should update ctests global_4denvar and global_enkf to a near real-time case. Currently these tests use data from 2022110900.

@RussTreadon-NOAA
Contributor

@xincjin-NOAA , please bring xincjin-NOAA:gmi_new up to date with the current head of NOAA-EMC/GSI develop. Your forked branch xincjin-NOAA:gmi_new is 10 commits behind the current head of develop.

@xincjin-NOAA
Contributor Author

xincjin-NOAA commented Feb 23, 2024

@RussTreadon-NOAA , I have updated xincjin-NOAA:gmi_new. Thanks for the reminder.

Contributor

@ADCollard ADCollard left a comment

Looks good

@RussTreadon-NOAA
Contributor

RussTreadon-NOAA commented Feb 29, 2024

Update global ctest case to 2024022300

C96C48L127 background files to run global_4denvar and global_enkf for 2024022300 have been created. The 2024022300 observations include a non-zero GMI bufr dump file.

The following files were modified in a working copy of xincjin-NOAA:gmi_new on Hera

  • regression/global_4denvar.sh - paths and file prefixes updated to operational naming convention
  • regression/global_enkf.sh - paths and file prefixes updated to operational naming convention
  • regression/regression_namelists.sh - set dsfcalc to 0 for atms_n21
  • regression/regression_var.sh - change global_adate date from 2022110900 to 2024022300

The global_4denvar and global_enkf ctests were run with gmi_new at a0acbee as updat and develop at 7c4a571 as the contrl. Both tests Passed on Hera.

Test project /scratch1/NCEPDEV/da/Russ.Treadon/git/gsi/pr692/build
    Start 1: global_4denvar
1/1 Test #1: global_4denvar ...................   Passed  3120.35 sec

100% tests passed, 0 tests failed out of 1

Total Test time (real) = 3121.55 sec
Hera(hfe01):/scratch1/NCEPDEV/da/Russ.Treadon/git/gsi/pr692/build$ ctest -R global_enkf
Test project /scratch1/NCEPDEV/da/Russ.Treadon/git/gsi/pr692/build
    Start 7: global_enkf
1/1 Test #7: global_enkf ......................   Passed  1984.88 sec

100% tests passed, 0 tests failed out of 1

Total Test time (real) = 1984.89 sec

Files needed to run the 20240223 global case have been rsync'd to the CASES directory on Orion (Hercules) and WCOSS2 (Cactus and Dogwood).

@xincjin-NOAA , I recommend that you update

  • global_4denvar.sh
  • global_enkf.sh
  • regression_namelists.sh
  • regression_var.sh

in your gmi_new branch with my modified files in Hera /scratch1/NCEPDEV/da/Russ.Treadon/git/gsi/pr692/regression

Tagging @ADCollard since you asked about updating the GSI global ctests. We may want to tweak global ctest namelist variables in regression_namelists.sh. Note: I did not change regression_namelists_db.sh since I almost never use this file.

@RussTreadon-NOAA
Contributor

@xincjin-NOAA , I looked at /lfs/h2/emc/da/noscrub/xin.c.jin/develop. Your develop adds abi2km code to setuprad.f90. I looked in /lfs/h2/emc/da/noscrub/xin.c.jin/gmi_new. setuprad.f90 in this working copy differs from what is committed to branch gmi_new. Am I looking in the correct directories?

@xincjin-NOAA
Contributor Author

@RussTreadon-NOAA Yes, the directories are correct. I am changing this file and running some other tests now.

@RussTreadon-NOAA
Contributor

OK, so my result using develop at fca6bea and gmi_new at 224cb8d could possibly be valid. If true, this is at odds with previous comments in this PR.

@xincjin-NOAA
Contributor Author

After refactoring the code and reapplying the reverted commit, all ctests on WCOSS2 passed (develop installed at fca6bea and gmi_new at 8078902).

Test project /lfs/h2/emc/da/noscrub/xin.c.jin/gmi_new/build
Start 1: global_4denvar
Start 2: rtma
Start 3: rrfs_3denvar_glbens
Start 4: netcdf_fv3_regional
Start 5: hafs_4denvar_glbens
Start 6: hafs_3denvar_hybens
Start 7: global_enkf
1/7 Test #4: netcdf_fv3_regional .............. Passed 604.42 sec
2/7 Test #3: rrfs_3denvar_glbens .............. Passed 849.27 sec
3/7 Test #7: global_enkf ...................... Passed 1153.56 sec
4/7 Test #2: rtma ............................. Passed 1272.22 sec
5/7 Test #6: hafs_3denvar_hybens .............. Passed 1334.87 sec
6/7 Test #5: hafs_4denvar_glbens .............. Passed 1519.23 sec
7/7 Test #1: global_4denvar ................... Passed 2104.26 sec

I will test on the other platforms next.

@xincjin-NOAA
Contributor Author

@RussTreadon-NOAA @ADCollard @emilyhcliu @TingLei-NOAA After the latest update to this PR, all ctests passed on WCOSS2, Hera, and Orion. There is one failure on Hercules:

Hercules:

Test project /work/noaa/da/xinjin/git/gmi_new/build
Start 1: global_4denvar
Start 2: rtma
Start 3: rrfs_3denvar_glbens
Start 4: netcdf_fv3_regional
Start 5: hafs_4denvar_glbens
Start 6: hafs_3denvar_hybens
Start 7: global_enkf
1/7 Test #4: netcdf_fv3_regional .............. Passed 486.81 sec
2/7 Test #3: rrfs_3denvar_glbens .............. Passed 606.71 sec
3/7 Test #7: global_enkf ...................... Passed 733.78 sec
4/7 Test #2: rtma ............................. Passed 967.92 sec
5/7 Test #6: hafs_3denvar_hybens ..............***Failed 1095.98 sec
6/7 Test #5: hafs_4denvar_glbens .............. Passed 1335.46 sec
7/7 Test #1: global_4denvar ................... Passed 1682.78 sec

86% tests passed, 1 tests failed out of 7

Total Test time (real) = 1682.79 sec

The following tests FAILED:
6 - hafs_3denvar_hybens (Failed)
Errors while running CTest
Output from these tests are in: /work/noaa/da/xinjin/git/gmi_new/build/Testing/Temporary/LastTest.log
Use "--rerun-failed --output-on-failure" to re-run the failed cases verbosely.

The results between the two runs (hafs_3denvar_hybens_loproc_updat and hafs_3denvar_hybens_hiproc_updat) are not reproducible.

I am not sure if this is a known issue.
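
A quick way to list failures from a ctest summary like the one above is sketched below in Python. This is an illustration only; the regular expression covers just the `Passed` and `***Failed` line formats that appear in this thread, and the function names are hypothetical.

```python
import re

# Matches lines such as
# "5/7 Test #6: hafs_3denvar_hybens ..............***Failed 1095.98 sec"
TEST_RE = re.compile(
    r"^\s*\d+/\d+\s+Test\s+#(?P<num>\d+):\s+(?P<name>\S+)\s+\.+\s*"
    r"(?P<status>Passed|\*{3}Failed)\s+(?P<sec>[\d.]+)\s+sec"
)


def parse_ctest_summary(text):
    """Return a list of (test number, name, passed, seconds) tuples."""
    rows = []
    for line in text.splitlines():
        m = TEST_RE.match(line)
        if m:
            rows.append(
                (
                    int(m.group("num")),
                    m.group("name"),
                    m.group("status") == "Passed",
                    float(m.group("sec")),
                )
            )
    return rows


def failed_tests(rows):
    """Names of the tests that did not pass."""
    return [name for _, name, passed, _ in rows if not passed]


hercules_summary = """\
1/7 Test #4: netcdf_fv3_regional .............. Passed 486.81 sec
2/7 Test #3: rrfs_3denvar_glbens .............. Passed 606.71 sec
3/7 Test #7: global_enkf ...................... Passed 733.78 sec
4/7 Test #2: rtma ............................. Passed 967.92 sec
5/7 Test #6: hafs_3denvar_hybens ..............***Failed 1095.98 sec
6/7 Test #5: hafs_4denvar_glbens .............. Passed 1335.46 sec
7/7 Test #1: global_4denvar ................... Passed 1682.78 sec
"""

print(failed_tests(parse_ctest_summary(hercules_summary)))  # ['hafs_3denvar_hybens']
```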

@RussTreadon-NOAA
Contributor

Thank you @xincjin-NOAA for refactoring the code. It's great to see reproducible results once again on WCOSS2. @TingLei-NOAA is working on the Hercules hafs_3denvar_hybens failure.

@RussTreadon-NOAA
Contributor

@xincjin-NOAA , please request a re-review from the peer reviewers for this PR. Your refactored changes need to be reviewed.

Contributor

@RussTreadon-NOAA RussTreadon-NOAA left a comment

Only minor comments, otherwise the code changes look good from a coding perspective.

What about the science perspective? If gmi is assimilated, does the refactored code yield the intended results? The global_4denvar test only processes gmi in monitor mode. It does not yet assimilate gmi data.

Review comments (resolved):

  • src/gsi/clw_mod.f90 (outdated)
  • src/gsi/deter_sfc_mod.f90
  • src/gsi/deter_sfc_mod.f90 (outdated)
  • src/gsi/setuprad.f90 (outdated)

@xincjin-NOAA
Contributor Author

@RussTreadon-NOAA, from the science perspective, if GMI is assimilated, the refactored code will yield the intended results.

Contributor

@RussTreadon-NOAA RussTreadon-NOAA left a comment

Thank you @xincjin-NOAA for removing unnecessary computations.

@RussTreadon-NOAA
Contributor

@xincjin-NOAA , have you requested re-reviews from Emily, Andrew, and Azadeh? If not, please do so. This PR has passed its due date.

@emilyhcliu
Contributor

The code changes due to GMI look good, and they do (should) not change regression results. Approved!
Later, when we have GMI in operation, we should turn on GMI in the regression tests.

Contributor

@RussTreadon-NOAA RussTreadon-NOAA left a comment

Given two peer review approvals and documentation of ctests results on WCOSS2, Hera, Orion, and Hercules, approve and pass this PR onto GSI Handling Review team for merger into develop.

@RussTreadon-NOAA RussTreadon-NOAA merged commit f282a94 into NOAA-EMC:develop Mar 12, 2024
4 checks passed
@xincjin-NOAA
Contributor Author

@ALL Thanks, everyone, for bringing this PR to a close!

6 participants