[production/AQM.v7] concatenate_nexus_post_split.py - unstable output file size #775

Closed
lgannoaa opened this issue May 3, 2023 · 7 comments
Labels
bug Something isn't working

Comments

@lgannoaa

lgannoaa commented May 3, 2023

We found that the utility $HOMEaqm/sorc/arl_nexus/utils/python/concatenate_nexus_post_split.py sometimes generates a corrupted NEXUS_Expt_combined.nc file. The file is normally about 858 MB, but in some tests the corrupted file was only 572 MB.

The nexus_post_split job uses concatenate_nexus_post_split.py to create NEXUS_Expt_combined.nc, which is then read by $HOMEaqm/sorc/arl_nexus/utils/python/make_nexus_output_pretty.py within the same job. make_nexus_output_pretty.py failed with an AssertionError and did not generate the output file NEXUS_Expt_pretty.nc. However, nexus_post_split still completed because it lacks the necessary exception handling.

Because neither the corrupted output nor the AssertionError was caught, nexus_post_split completed normally. The forecast job then failed because it could not find the NEXUS_Expt_pretty.nc file. Error:

  • create_symlink_to_file.sh[124]: print_err_msg_exit 'Cannot create symlink to specified target file because the latter does
    not exist or is not a file:

Recommend that the developers run a stability test on the concatenate_nexus_post_split.py utility.
Recommend that the developers add exception handling to the ex-script of the nexus_post_split job, exregional_nexus_post_split.sh.
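
The second recommendation could look roughly like the sketch below. The actual ex-script, exregional_nexus_post_split.sh, is a shell script, so this is only a Python illustration of the same idea: run a step, treat a non-zero exit status as fatal, and refuse to continue if the expected output file is missing or suspiciously small. The names `run_and_verify`, `StepFailed`, and the `min_bytes` threshold are hypothetical and do not come from the workflow.

```python
import os
import subprocess
import sys


class StepFailed(RuntimeError):
    """Raised when a step exits non-zero or leaves a missing/undersized output."""


def run_and_verify(cmd, output_path, min_bytes):
    """Run cmd, then confirm output_path exists and holds at least min_bytes.

    cmd is an argument list for subprocess.run; min_bytes is a hypothetical
    lower bound that the caller must choose for the expected output.
    """
    result = subprocess.run(cmd)
    if result.returncode != 0:
        raise StepFailed(f"{cmd[0]} exited with status {result.returncode}")
    if not os.path.isfile(output_path):
        raise StepFailed(f"{output_path} was not created")
    size = os.path.getsize(output_path)
    if size < min_bytes:
        raise StepFailed(
            f"{output_path} is {size} bytes, expected at least {min_bytes}"
        )
    return size
```

With a check like this after each utility, a truncated NEXUS_Expt_combined.nc (for example 572 MB when about 858 MB is expected) would stop the job at the step that produced it, rather than surfacing later in the forecast job. The right value for `min_bytes` would have to be chosen by the developers.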

Machines affected

wcoss

Debug Output Saved in /lfs/h2/emc/global/noscrub/lin.gan/canned/concatenate_nexus_post_split_debug

aqm_nexus_post_split_00.o56988419-BAD (nexus_post_split job log from a run that should have failed because of the bad output file and the AssertionError)
NEXUS_Expt_combined.nc-BAD (The corrupted file)
aqm_nexus_post_split_00.o57049044-GOOD (a rerun of the nexus_post_split job log that created the same file with correct file size)
NEXUS_Expt_combined.nc-GOOD (The good output generated from rerun)

@lgannoaa lgannoaa added the bug Something isn't working label May 3, 2023
@JianpingHuang-NOAA

@chan-hoo Do you remember when we merged the updated concatenate_nexus_post_split.py into the workflow?

@bbakernoaa
Contributor

@JianpingHuang-NOAA I think it had to be around December of last year

@bbakernoaa
Contributor

bbakernoaa commented May 4, 2023

I just ran a test using the community code with 72-hour forecasts at 6-hourly cycles and didn't see an issue.

path on wcoss here: /lfs/h2/emc/stmp/barry.baker/aqm_community_aqmna13

-rw-r--r-- 1 barry.baker emc 4.0G May  4 15:03 2023040100/INPUT/aqm.t00z.NEXUS_Expt.nc
-rw-r--r-- 1 barry.baker emc 4.0G May  4 15:03 2023040106/INPUT/aqm.t06z.NEXUS_Expt.nc
-rw-r--r-- 1 barry.baker emc 4.0G May  4 15:03 2023040112/INPUT/aqm.t12z.NEXUS_Expt.nc
-rw-r--r-- 1 barry.baker emc 4.0G May  4 15:03 2023040118/INPUT/aqm.t18z.NEXUS_Expt.nc
-rw-r--r-- 1 barry.baker emc 4.0G May  4 15:02 2023040200/INPUT/aqm.t00z.NEXUS_Expt.nc
-rw-r--r-- 1 barry.baker emc 4.0G May  4 15:02 2023040206/INPUT/aqm.t06z.NEXUS_Expt.nc
-rw-r--r-- 1 barry.baker emc 4.0G May  4 15:02 2023040212/INPUT/aqm.t12z.NEXUS_Expt.nc
-rw-r--r-- 1 barry.baker emc 4.0G May  4 15:03 2023040218/INPUT/aqm.t18z.NEXUS_Expt.nc
-rw-r--r-- 1 barry.baker emc 4.0G May  4 15:03 2023040300/INPUT/aqm.t00z.NEXUS_Expt.nc
-rw-r--r-- 1 barry.baker emc 4.0G May  4 15:03 2023040306/INPUT/aqm.t06z.NEXUS_Expt.nc
-rw-r--r-- 1 barry.baker emc 4.0G May  4 15:02 2023040312/INPUT/aqm.t12z.NEXUS_Expt.nc
-rw-r--r-- 1 barry.baker emc 4.0G May  4 15:03 2023040318/INPUT/aqm.t18z.NEXUS_Expt.nc
-rw-r--r-- 1 barry.baker emc 4.0G May  4 15:03 2023040400/INPUT/aqm.t00z.NEXUS_Expt.nc
-rw-r--r-- 1 barry.baker emc 4.0G May  4 15:03 2023040406/INPUT/aqm.t06z.NEXUS_Expt.nc
-rw-r--r-- 1 barry.baker emc 4.0G May  4 15:03 2023040412/INPUT/aqm.t12z.NEXUS_Expt.nc
-rw-r--r-- 1 barry.baker emc 4.0G May  4 15:03 2023040418/INPUT/aqm.t18z.NEXUS_Expt.nc
-rw-r--r-- 1 barry.baker emc 4.0G May  4 15:03 2023040500/INPUT/aqm.t00z.NEXUS_Expt.nc
-rw-r--r-- 1 barry.baker emc 4.0G May  4 15:03 2023040506/INPUT/aqm.t06z.NEXUS_Expt.nc
-rw-r--r-- 1 barry.baker emc 4.0G May  4 15:02 2023040512/INPUT/aqm.t12z.NEXUS_Expt.nc
-rw-r--r-- 1 barry.baker emc 4.0G May  4 15:03 2023040518/INPUT/aqm.t18z.NEXUS_Expt.nc
-rw-r--r-- 1 barry.baker emc 4.0G May  4 15:03 2023040600/INPUT/aqm.t00z.NEXUS_Expt.nc
-rw-r--r-- 1 barry.baker emc 4.0G May  4 15:03 2023040606/INPUT/aqm.t06z.NEXUS_Expt.nc
-rw-r--r-- 1 barry.baker emc 4.0G May  4 15:03 2023040612/INPUT/aqm.t12z.NEXUS_Expt.nc
-rw-r--r-- 1 barry.baker emc 4.0G May  4 15:03 2023040618/INPUT/aqm.t18z.NEXUS_Expt.nc
-rw-r--r-- 1 barry.baker emc 4.0G May  4 15:03 2023040700/INPUT/aqm.t00z.NEXUS_Expt.nc
-rw-r--r-- 1 barry.baker emc 4.0G May  4 15:03 2023040706/INPUT/aqm.t06z.NEXUS_Expt.nc
-rw-r--r-- 1 barry.baker emc 4.0G May  4 15:02 2023040712/INPUT/aqm.t12z.NEXUS_Expt.nc
-rw-r--r-- 1 barry.baker emc 4.0G May  4 15:03 2023040718/INPUT/aqm.t18z.NEXUS_Expt.nc
-rw-r--r-- 1 barry.baker emc 4.0G May  4 15:02 2023040800/INPUT/aqm.t00z.NEXUS_Expt.nc
-rw-r--r-- 1 barry.baker emc 4.0G May  4 15:03 2023040806/INPUT/aqm.t06z.NEXUS_Expt.nc
-rw-r--r-- 1 barry.baker emc 4.0G May  4 15:03 2023040812/INPUT/aqm.t12z.NEXUS_Expt.nc
-rw-r--r-- 1 barry.baker emc 4.0G May  4 15:02 2023040818/INPUT/aqm.t18z.NEXUS_Expt.nc
-rw-r--r-- 1 barry.baker emc 4.0G May  4 15:03 2023040900/INPUT/aqm.t00z.NEXUS_Expt.nc
-rw-r--r-- 1 barry.baker emc 4.0G May  4 15:03 2023040906/INPUT/aqm.t06z.NEXUS_Expt.nc
-rw-r--r-- 1 barry.baker emc 4.0G May  4 15:03 2023040912/INPUT/aqm.t12z.NEXUS_Expt.nc
-rw-r--r-- 1 barry.baker emc 4.0G May  4 15:03 2023040918/INPUT/aqm.t18z.NEXUS_Expt.nc
-rw-r--r-- 1 barry.baker emc 4.0G May  4 15:03 2023041000/INPUT/aqm.t00z.NEXUS_Expt.nc
-rw-r--r-- 1 barry.baker emc 4.0G May  4 15:03 2023041006/INPUT/aqm.t06z.NEXUS_Expt.nc
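
A byte-level comparison may be more sensitive than a human-readable listing, since sizes rounded to 4.0G could hide smaller differences. The following is a minimal sketch of such a check, assuming the directory layout shown in the listing above; `find_size_outliers` and the 10% tolerance are hypothetical choices, not part of the workflow.

```python
import statistics
from pathlib import Path


def find_size_outliers(paths, tolerance=0.1):
    """Return (path, size) pairs whose size is far from the median size.

    A file is flagged when its size in bytes differs from the median of
    all given files by more than tolerance (a fraction of the median).
    """
    sizes = {Path(p): Path(p).stat().st_size for p in paths}
    if not sizes:
        return []
    median = statistics.median(sizes.values())
    return [
        (path, size)
        for path, size in sizes.items()
        if abs(size - median) > tolerance * median
    ]
```

It could be called as `find_size_outliers(Path(root).glob("*/INPUT/aqm.t*z.NEXUS_Expt.nc"))`, where `root` is the experiment directory. An empty result means every cycle produced a file of roughly the same size. A matching size does not prove that a file's contents are valid.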

@chan-hoo chan-hoo changed the title concatenate_nexus_post_split.py - unstable output file size [production/AQM.v7] concatenate_nexus_post_split.py - unstable output file size May 8, 2023
@chan-hoo
Collaborator

The error check has already been added to the command line:

@JianpingHuang-NOAA

Is there any relationship between this issue and issue #86 in EMC/AQM?

@lgannoaa Is this issue resolved? Have you seen this happen again recently?

@lgannoaa
Author

This issue has not been tested yet. We need to merge the latest package and test it with both AQM realtime parallel and ecflow.
Please keep it open.

@JianpingHuang-NOAA

JianpingHuang-NOAA commented May 16, 2023 via email

@chan-hoo chan-hoo reopened this May 16, 2023
michelleharrold pushed a commit to michelleharrold/ufs-srweather-app that referenced this issue Jun 7, 2023
…fs-community#775)

* update input namelist of chgres_cube

* update diag_table templates

* update scripts

* back to original

* specify miniconda version on Jet
michelleharrold pushed a commit to michelleharrold/ufs-srweather-app that referenced this issue Jun 7, 2023
* Bug fix with FIELD_TABLE_FN

* Modify crontab management, use config_defaults.sh.

* Add status badge.

* Update cheyenne crontab management.

* source lmod-setup

* Add main to set_predef_grid

* Bug fix in predef_grid

* Don't import dead params.

* Fix bug in resetting VERBOSE

* Minor fix in INI config.

* Construct var_defns components from dictionary.

* Allow also lower case variables to be exported.

* Updates to python workflow due to PR ufs-community#776

* Use python versions of link_fix and set_FV3_sfc in job script.

* Use python versions of create_diag/model.

* Some fixes addressing Christina's suggestions.

* Delete shell workflow

* Append pid to temp files.

* Update scripts to work with the latest hashes of UFS_UTILS and UPP (ufs-community#775)

* update input namelist of chgres_cube

* update diag_table templates

* update scripts

* back to original

* specify miniconda version on Jet

* Remove -S option from link_fix call.

* Fixes due to merge

* Cosmoetic changes.

Co-authored-by: Chan-Hoo.Jeon-NOAA <[email protected]>
4 participants