Get zstd compression in netcdf on wcoss2 operation #2319

Open
Hang-Lei-NOAA opened this issue Jun 11, 2024 · 25 comments · May be fixed by #2444
Labels
enhancement New feature or request

Comments

@Hang-Lei-NOAA

Hang-Lei-NOAA commented Jun 11, 2024

Description

Zstd compression in netcdf has been tested before. We now need it on wcoss2.

We will fully test on our end, then get zstd installed on the operational machines and deliver the complete packages.

@Hang-Lei-NOAA Hang-Lei-NOAA added the enhancement New feature or request label Jun 11, 2024
@Hang-Lei-NOAA
Author

Hang-Lei-NOAA commented Jun 20, 2024

@BrianCurtis-NOAA I have added a zstd-enabled netcdf and hdf5 combination on acorn. Please test it:
module use /lfs/h1/emc/nceplibs/noscrub/hpc-stack/libs/hpc-stack/modulefiles/compiler/intel/19.1.3.304
module load zstd

module use /lfs/h1/emc/nceplibs/noscrub/hpc-stack/libs/hpc-stack/modulefiles/mpi/intel/19.1.3.304/cray-mpich/8.1.9
module load hdf5/1.14.0
module load netcdf/4.9.2

This has reproduced the UFS compression tests previously done by Dusan.

@DusanJovic-NOAA
Collaborator

@Hang-Lei-NOAA Please point us to the version of modulefiles/ufs_acorn.intel.lua that you used for testing. Thanks.

@Hang-Lei-NOAA
Author

Hi all, please copy both files:
ufs_common: /lfs/h1/emc/nceplibs/noscrub/Hang.Lei/works/ufscompression/develop/modulefiles/ufs_common.lua
ufs_acorn.intel: /lfs/h1/emc/nceplibs/noscrub/Hang.Lei/works/ufscompression/develop/modulefiles/ufs_acorn.intel.lua

@DusanJovic-NOAA
Collaborator

Hi, All, Please copy both ufs_common /lfs/h1/emc/nceplibs/noscrub/Hang.Lei/works/ufscompression/develop/modulefiles/ufs_common.lua ufs_acorn.intel /lfs/h1/emc/nceplibs/noscrub/Hang.Lei/works/ufscompression/develop/modulefiles/ufs_acorn.intel.lua

Thanks. Where is 'zstd' module loaded?

@Hang-Lei-NOAA
Author

@DusanJovic-NOAA Please update the ufs_common.lua file again.

@junwang-noaa
Collaborator

@Hang-Lei-NOAA @edwardhartnett since acorn is still not available, is there a way we can move this forward? Several UFS applications are waiting to run experiments with this feature. Thank you!

@Hang-Lei-NOAA
Author

Hang-Lei-NOAA commented Jul 22, 2024 via email

@BrianCurtis-NOAA
Collaborator

@Hang-Lei-NOAA I've used the modulefiles from your runs with Dusan a while ago, and the tests that use ZSTANDARD_LEVEL=5 have passed. Do we need to run the full suite as well, or should this be enough to move things forward?

@Hang-Lei-NOAA
Author

Hang-Lei-NOAA commented Aug 5, 2024 via email

@edwardhartnett
Contributor

Is there more testing that will be done, or is the UFS team confident this works?

@BrianCurtis-NOAA
Collaborator

fail_test_control_p8_atmlnd_debug_intel
fail_test_control_p8_atmlnd_intel
fail_test_control_p8_atmlnd_sbs_intel
fail_test_datm_cdeps_lnd_era5_intel
fail_test_datm_cdeps_lnd_gswp3_intel

All of these failed because they hit the wallclock limit.

/lfs/h1/emc/nems/noscrub/brian.curtis/git/BrianCurtis-NOAA/ufs-weather-model/netcdf_zstd/tests

This is the last bit of output before it stops running, which seems early in the process:

NOTE from PE     0: MPP_IO_SET_STACK_SIZE: stack size set to     131072.
NOTE from PE     0: MPP_DOMAINS_SET_STACK_SIZE: stack size set to  3000000.
 num_files=           2
 num_file=           1 filename_base= atm output_file= netcdf
 num_file=           2 filename_base= sfc output_file= netcdf
 grid_id=            1  output_grid= cubed_sphere_grid
 ideflate=           0
 quantize_mode=quantize_bitround quantize_nsd=           0
 zstandard_level=           0 
 af wrtState reconcile, FBcount=           8
 af get wrtfb=output_atm_bilinear rc=           0
 af get wrtfb=output_restart_fv_core.res rc=           0
 af get wrtfb=output_restart_fv_srf_wnd.res rc=           0
 af get wrtfb=output_restart_fv_tracer.res rc=           0
 af get wrtfb=output_restart_phy_data rc=           0
 af get wrtfb=output_restart_sfc_data rc=           0
 af get wrtfb=output_sfc_bilinear rc=           0
 af get wrtfb=output_sfc_nearest_stod rc=           0
 in fv3cap init, time wrtcrt/regrdst   1.52245288199993
 in fv3 cap init, output_startfh=  0.0000000E+00  iau_offset=           0
 output_fh=  0.2000000       1.000000       2.000000       3.000000
   4.000000       5.000000       6.000000       7.000000       8.000000
   9.000000       10.00000       11.00000       12.00000       13.00000
   14.00000       15.00000       16.00000       17.00000       18.00000
   19.00000       20.00000       21.00000       22.00000       23.00000
   24.00000     lflname_fulltime= F
 fcst_advertise, cpl_grid_id=           1
 fcst_realize, cpl_grid_id=           1
 zeroing coupling accumulated fields at kdt=            1

None of these tests use IDEFLATE=1 or ZSTANDARD_LEVEL=5, but maybe they need it with these lib changes?
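For anyone checking other runs, here is a minimal sketch of how the compression settings could be pulled out of the write-component output quoted above. It assumes the log keeps the `name=   value` layout shown in this comment; the function names are made up for illustration and are not part of the UFS test system.

```python
import re

# Matches integer settings such as "ideflate=           0" as they appear
# in the write-component section of the forecast log quoted above.
SETTING_RE = re.compile(r"\b(ideflate|zstandard_level|quantize_nsd)\s*=\s*(-?\d+)")


def compression_settings(log_text):
    """Return the last value seen for each compression-related setting."""
    settings = {}
    for name, value in SETTING_RE.findall(log_text):
        settings[name] = int(value)
    return settings


def uses_compression(settings):
    """True if zlib (ideflate) or zstd (zstandard_level) compression is on."""
    return settings.get("ideflate", 0) > 0 or settings.get("zstandard_level", 0) > 0


# Three lines copied from the log excerpt in this comment.
sample = """\
 ideflate=           0
 quantize_mode=quantize_bitround quantize_nsd=           0
 zstandard_level=           0
"""
found = compression_settings(sample)
print(found, uses_compression(found))
```

Run against the excerpt above, this reports that neither ideflate nor zstandard_level is enabled, which matches the observation that these tests do not request compression.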

@junwang-noaa
Collaborator

@Hang-Lei-NOAA Did you build ESMF with netcdf/zstd? It seems the PIO issue shows up in this ESMF build too. Can you check how ESMF is built with netcdf/zlib? Since the atm-land tests do not use compression at all (both ideflate and zstandard_level are 0), they should not be affected, yet we now see the PIO issue.

20240821 211108.437 WARNING          PET150 ESMCI_PIO_Handler.C:1404 ESMCI::PIO_Handler::openOneTileF  Unable to open existing file: INPUT/oro_data.tile1.nc, (PIO/PNetCDF error = NetCDF: Attempt to use feature that was not turned on when netCDF was built.)
20240821 211108.438 WARNING          PET150 ESMCI_PIO_Handler.C:1404 ESMCI::PIO_Handler::openOneTileF  Unable to open existing file: INPUT/oro_data.tile2.nc, (PIO/PNetCDF error = NetCDF: Attempt to use feature that was not turned on when netCDF was built.)
20240821 211108.438 WARNING          PET150 ESMCI_PIO_Handler.C:1404 ESMCI::PIO_Handler::openOneTileF  Unable to open existing file: INPUT/oro_data.tile3.nc, (PIO/PNetCDF error = NetCDF: Attempt to use feature that was not turned on when netCDF was built.)
20240821 211108.439 WARNING          PET150 ESMCI_PIO_Handler.C:1404 ESMCI::PIO_Handler::openOneTileF  Unable to open existing file: INPUT/oro_data.tile4.nc, (PIO/PNetCDF error = NetCDF: Attempt to use feature that was not turned on when netCDF was built.)
20240821 211108.440 WARNING          PET150 ESMCI_PIO_Handler.C:1404 ESMCI::PIO_Handler::openOneTileF  Unable to open existing file: INPUT/oro_data.tile5.nc, (PIO/PNetCDF error = NetCDF: Attempt to use feature that was not turned on when netCDF was built.)
20240821 211108.440 WARNING          PET150 ESMCI_PIO_Handler.C:1404 ESMCI::PIO_Handler::openOneTileF  Unable to open existing file: INPUT/oro_data.tile6.nc, (PIO/PNetCDF error = NetCDF: Attempt to use feature that was not turned on when netCDF was built.)
20240821 211108.440 ERROR            PET150 ESMCI_PIO_Handler.C:617 ESMCI::PIO_Handler::arrayReadOne Unable to read from file  - file not open
20240821 211108.440 ERROR            PET150 ESMCI_IO_Handler.C:405 ESMCI::IO_Handler::arrayRead() Unable to read from file  - Internal subroutine call returned Error
20240821 211108.440 ERROR            PET150 ESMCI_IO.C:382 ESMCI::IO::read() Unable to read from file  - Internal subroutine call returned Error
20240821 211108.440 ERROR            PET150 ESMCI_IO.C:282 ESMCI::IO::read() Unable to read from file  - Internal subroutine call returned Error
20240821 211108.440 ERROR            PET150 ESMCI_IO_F.C:210 c_esmc_ioread() Unable to read from file  - Internal subroutine call returned Error
20240821 211108.440 ERROR            PET150 ESMF_IO.F90:397 ESMF_IOAddArray() Unable to read from file  - Internal subroutine call returned Error
20240821 211108.440 ERROR            PET150 ESMF_FieldBundle.F90:14436 ESMF_FieldBundleRead() Unable to read from file  - Internal subroutine call returned Error
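To see at a glance which input files hit the "feature that was not turned on" error (netCDF's NC_ENOTBUILT message), a PET log like the one above could be summarized with something like the following. This is only a sketch: it assumes entries start with the `YYYYMMDD HHMMSS.mmm LEVEL PETn` prefix seen here and that any line without that prefix is a wrapped continuation of the previous entry. The helper names are hypothetical and not part of ESMF.

```python
import re
from collections import Counter

# Assumed entry prefix, taken from the excerpt above:
# "20240821 211108.437 WARNING          PET150 <message>"
ENTRY_RE = re.compile(r"^\d{8} \d{6}\.\d{3}\s+(WARNING|ERROR|INFO)\s+(PET\d+)\s+(.*)$")
OPEN_FAIL_RE = re.compile(r"Unable to open existing file: (\S+?),")
NOT_BUILT = "Attempt to use feature that was not turned on when netCDF was built"


def join_entries(lines):
    """Glue wrapped continuation lines back onto the entry they belong to."""
    entries = []
    for line in lines:
        line = line.rstrip("\n")
        if ENTRY_RE.match(line) or not entries:
            entries.append(line)
        else:
            entries[-1] += line
    return entries


def summarize(lines):
    """Count entries per log level and list files that failed with NOT_BUILT."""
    levels = Counter()
    not_built_files = []
    for entry in join_entries(lines):
        match = ENTRY_RE.match(entry)
        if not match:
            continue
        levels[match.group(1)] += 1
        opened = OPEN_FAIL_RE.search(entry)
        if opened and NOT_BUILT in entry:
            not_built_files.append(opened.group(1))
    return levels, not_built_files


# Two entries copied from the log above; the first one is wrapped over two lines.
sample = [
    "20240821 211108.437 WARNING          PET150 ESMCI_PIO_Handler.C:1404 "
    "ESMCI::PIO_Handler::openOneTileF  Unable to open existing file: "
    "INPUT/oro_data.tile1.nc, (PIO/PNetCDF error = ",
    "NetCDF: Attempt to use feature that was not turned on when netCDF was built.)",
    "20240821 211108.440 ERROR            PET150 ESMCI_PIO_Handler.C:617 "
    "ESMCI::PIO_Handler::arrayReadOne Unable to read from file  - file not open",
]
levels, files = summarize(sample)
print(dict(levels), files)
```

In a real run the lines would come from the PET log file of the failing task, for example by passing an open file object to `summarize`.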

@Hang-Lei-NOAA
Author

Hang-Lei-NOAA commented Aug 23, 2024 via email

@junwang-noaa
Collaborator

@Hang-Lei-NOAA Can you check if the esmf library is loaded correctly in Brian's testing?

/lfs/h1/emc/nems/noscrub/brian.curtis/git/BrianCurtis-NOAA/ufs-weather-model/netcdf_zstd/modulefiles/ufs_acorn.intel.lua

@Hang-Lei-NOAA
Author

Hang-Lei-NOAA commented Aug 23, 2024 via email

@BrianCurtis-NOAA
Collaborator

You're correct, they're ignored on WCOSS2. This testing is specifically for WCOSS2, since spack-stack is the official source for Acorn. I would say then that this is a successful test on Acorn for netcdf and zstd. @junwang-noaa do you agree?

@junwang-noaa
Collaborator

@Hang-Lei-NOAA I am not asking you to fix the atmlnd case. These tests currently work on Acorn in the develop branch with the spack-stack libraries (please see the links below); they are skipped on wcoss2 and NOAA cloud. They failed on acorn when Brian tested the model with the new acorn module file that has the zstd netcdf library updates.

https://github.com/ufs-community/ufs-weather-model/blob/develop/modulefiles/ufs_acorn.intel.lua

https://github.com/ufs-community/ufs-weather-model/blob/develop/tests/rt.conf#L307

https://github.com/ufs-community/ufs-weather-model/blob/develop/tests/logs/RegressionTests_acorn.log#L282

@BrianCurtis-NOAA
Collaborator

@junwang-noaa my acorn modulefile is modified from the wcoss2 modulefile, so it acts most like WCOSS2 instead of Acorn.

@BrianCurtis-NOAA
Collaborator

I also didn't want to hack the rt system too much just to make it run only the wcoss2 tests, so it runs the Acorn tests.

@Hang-Lei-NOAA
Author

Hang-Lei-NOAA commented Aug 23, 2024 via email

@junwang-noaa
Collaborator

@Hang-Lei-NOAA would you please post the RT test log from your modified new module files?

@Hang-Lei-NOAA
Author

Hang-Lei-NOAA commented Aug 26, 2024 via email

@BrianCurtis-NOAA
Collaborator

Hang's modified modulefile (with .brian at the end) shows the same results as before, but since those tests are not run on WCOSS2, we should be able to proceed. @junwang-noaa are you OK with this?

@Hang-Lei-NOAA
Author

The UFS model develop branch runs on acorn, but it loads the ufs_wcoss2.intel.lua file at run time.

@BrianCurtis-NOAA
Collaborator

Yes, for clarification, I'm using a modified modulefile from WCOSS2 but running the acorn RT tests. For future testing, it might be helpful to have a setup that tricks rt.sh into thinking it's running wcoss2 tests.

@BrianCurtis-NOAA BrianCurtis-NOAA linked a pull request Sep 23, 2024 that will close this issue