Fix multiprocessing and consolidate QC #68

kyleredilla · 2024-10-02T17:01:03Z

The initial goal of this PR was simply to improve the reliability of multiprocessing in two places: scraping the grid info from files in the batch file generation step of the regridding pipeline, and sanity checking all regridded files in the QC section. The behavior we keep seeing with multiprocessing (or the concurrent.futures paradigm) is that whatever method is used to dispatch a function that operates on multiple netCDF files can sometimes hang indefinitely. This PR takes some steps to improve this, but total reliability here seems to be out of scope for now. Simply processing fewer files by breaking things up into smaller groups, as we have begun doing for the prefect flows (with v1_1 and v1_2 variables etc), seems like a sane way forward for now.

The QC step that was checking every single new file has also been changed to check only a random subset of the files, which should help with the hanging symptoms. You will notice that the quality control step that was originally outside of the QC notebook has been moved into that notebook, so that we only have one QC product to evaluate following a flow run.
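As a rough illustration of the two mitigations described above (smaller groups of files and a random QC subset), here is a minimal sketch. The helper names `chunked`, `sample_files`, and `run_in_batches` are hypothetical and are not taken from this repo; the actual implementation in the PR may differ.

```python
import random
from concurrent.futures import ProcessPoolExecutor


def chunked(items, size):
    """Yield successive lists of at most `size` items."""
    items = list(items)
    for start in range(0, len(items), size):
        yield items[start:start + size]


def sample_files(files, n, seed=None):
    """Return a random subset of at most `n` files (hypothetical helper)."""
    files = list(files)
    rng = random.Random(seed)
    return rng.sample(files, min(n, len(files)))


def run_in_batches(func, files, batch_size, max_workers=4):
    """Apply `func` to every file, using a fresh worker pool per batch.

    `func` must be a picklable, module-level function.
    """
    results = []
    for batch in chunked(files, batch_size):
        # Assumption of this sketch: one short-lived pool per batch.
        with ProcessPoolExecutor(max_workers=max_workers) as pool:
            results.extend(pool.map(func, batch))
    return results
```

Starting a new pool for each batch is a design choice of this sketch, not something the PR states. It bounds how many files any one pool touches, but it does not by itself guarantee that a worker cannot hang.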

To test, run a regridding flow using prefect, probably for a subset of variables and frequencies (such as monthly v1_2), and check out the quality control notebook.

cstephen

I like the idea of folding all of the QC work into a single spot, and this looks great overall! I ran into a few small issues along the way though:

While running this Prefect flow using the v1_config branch of the prefect repo, it complained that the --qc_notebook option was missing from the QC command and quit/stalled out. I replaced the system call here https://github.com/ua-snap/prefect/blob/58843c3f0070bd8ef9094326d826b58e92136bfa/regridding/regridding_functions.py#L394 with the following and it seemed to work after that:

f"export PATH=$PATH:/opt/slurm-22.05.4/bin:/opt/slurm-22.05.4/sbin:$HOME/miniconda3/bin && python {run_qc_script} --qc_notebook '{visual_qc_notebook}' --conda_init_script '{conda_init_script}' --conda_env_name '{conda_env_name}' --cmip6_directory '{cmip6_directory}' --output_directory '{output_directory}' --repo_regridding_directory '{repo_regridding_directory}' --vars '{vars}' --freqs '{freqs}' --models '{models}' --scenarios '{scenarios}'"
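For reference, a command like the one above could also be assembled from a mapping of options, which makes a missing flag such as `--qc_notebook` fail early on the calling side. This is only a sketch: `build_qc_command` and its `required` parameter are hypothetical, and the option names are copied from the command above, not from the QC script's argument parser.

```python
import shlex


def build_qc_command(run_qc_script, options, required=("qc_notebook",)):
    """Return a shell-safe `python <script> --flag value ...` string.

    `options` maps flag names (without leading dashes) to values.
    """
    missing = [name for name in required if name not in options]
    if missing:
        raise ValueError(f"Missing required options: {', '.join(missing)}")
    parts = ["python", str(run_qc_script)]
    for flag, value in options.items():
        parts.extend([f"--{flag}", str(value)])
    # shlex.join quotes each part, so values with spaces stay one argument.
    return shlex.join(parts)
```

The `export PATH=...` prefix from the original system call would still need to be prepended to the returned string.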

The QC notebook complained that error_file was not defined in a couple places. See PR code review comments.
After removing the error_file references to run the QC notebook to completion, the random src vs. regrid files it chose to inspect produced the following error:
```
AssertionError: No files found for regridded file clt_Amon_MPI-ESM1-2-HR_historical_regrid_196201-196212.nc in /beegfs/CMIP6/arctic-cmip6/CMIP6/CMIP/DKRZ/MPI-ESM1-2-HR/historical with */Amon/clt/*/*/clt_Amon_MPI-ESM1-2-HR_historical_*.nc.
```
My second run chose a different set of random files & succeeded, however. And it looks great!! Once those small issues are fixed, I think this is good to merge.
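To help narrow down a "No files found" error like the one above, a small helper can report which level of the glob pattern stops matching. This is a debugging sketch only: the function names are hypothetical, the base directory and pattern would be the ones printed in the error message, and how the notebook actually performs the lookup is not shown in this thread.

```python
from pathlib import Path


def find_source_files(base_dir, pattern):
    """Return all paths under `base_dir` matching the glob `pattern`, sorted."""
    return sorted(Path(base_dir).glob(pattern))


def first_failing_level(base_dir, pattern):
    """Return the shortest pattern prefix that matches nothing, or None.

    For "*/Amon/clt/*/*/file_*.nc" this tries "*", then "*/Amon",
    then "*/Amon/clt", and so on.
    """
    parts = pattern.split("/")
    for depth in range(1, len(parts) + 1):
        prefix = "/".join(parts[:depth])
        if not any(Path(base_dir).glob(prefix)):
            return prefix
    return None
```

If `first_failing_level` returns a prefix that ends at a directory level, the source directory layout differs from what the pattern expects. If it returns the full pattern, only the file name part is failing to match.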

cstephen · 2024-10-14T20:19:53Z

regridding/qc.ipynb

+    "print(f\"QC process complete: {error_count} errors found.\")\n",
+    "if len(ds_errors) > 0:\n",
+    "    print(\n",
+    "        f\"Errors in opening some datasets. {len(ds_errors)} files could not be opened. See {str(error_file)} for error log.\"\n",


The QC notebook complained that error_file was not defined here and execution stopped. I got around this temporarily just by removing the reference to error_file here.

cstephen · 2024-10-14T20:20:14Z

regridding/qc.ipynb

    "    )\n",
+    "if len(value_errors) > 0:\n",
+    "    print(\n",
+    "        f\"Errors in dataset values. {len(value_errors)} files have regridded values outside of source file range. See {str(error_file)} for error log.\"\n",


Same as my above comment, error_file was not defined here either.
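One way to avoid the undefined `error_file` in both places is to build the summary messages in a single function where the error log path is an optional argument. This is a minimal sketch assuming `ds_errors` and `value_errors` are lists, as the diff suggests; `qc_summary` is a hypothetical name, and the PR may simply drop the reference instead.

```python
def qc_summary(error_count, ds_errors, value_errors, error_file=None):
    """Build the QC summary lines; the error log hint is added only if given."""
    suffix = f" See {error_file} for error log." if error_file is not None else ""
    lines = [f"QC process complete: {error_count} errors found."]
    if len(ds_errors) > 0:
        lines.append(
            "Errors in opening some datasets. "
            f"{len(ds_errors)} files could not be opened.{suffix}"
        )
    if len(value_errors) > 0:
        lines.append(
            "Errors in dataset values. "
            f"{len(value_errors)} files have regridded values "
            f"outside of source file range.{suffix}"
        )
    return lines
```

In the notebook this could be used as `print("\n".join(qc_summary(error_count, ds_errors, value_errors)))`.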

Joshdpaul and others added 16 commits August 30, 2024 15:46

move multiprocessing out of for loop

82eaee6

add qc_config and job array to qc sbatch

f879224

add print statement to track file names/times

f92c051

use actual variable count in sbatch params

0c1a6b7

Combine qc script and notebook and simplify code

7bf8b11

drop refs to visual qc for runner script

5896faa

make qc scripts and notebook consistent

67f9f9f

small fixes for regridding qc

db5ce3d

remove unused args in qc runner

8a80efe

pull subsampling code into qc module

bdc7ef3

checkpoint for script to combine regridded data for rasdaman

60dc1b4

finalize script to combine regridded files for rasdaman

1aadf51

fix regridding batch files script to handle MPI-M institution ID

f09963a

add empty variables if missing

6001ac5

remove rasdaman preprocessing script for monthly common cmip6

9b90485

remove unused dict from regridding config

0cce56a

kyleredilla requested a review from cstephen October 2, 2024 17:01

kyleredilla changed the title ~~Fix multiprocessing~~ Fix multiprocessing and consolidate QC Oct 2, 2024

cstephen requested changes Oct 14, 2024

kyleredilla added 11 commits December 10, 2024 10:16

print job_id for prefect ssh to parse

aa91734

try command in place of conda_init_script

7172b07

print job IDs for regrid runner

89c2de7

print list of job ids as space-separated string

6bbf6a0

drop crop from target dataset

369dfe6

ensure lon dim is 1D when sorting

7cdfe9d

disable tryexcept for regrid call

e635d47

check for latlon dims before fixing

22f4675

add interp method as top level parameter

6725514

fix interp_method top level parameter

c06a64b

add missing kwarg

81ef5d1

kyleredilla added 9 commits December 11, 2024 16:07

fix positional arg

31593b2

fix script arg

407ab7e

fix script arg

16a28b0

print regrid qc slurm job id

78adf16

fix regrid qc sbatch script

3ca121f

drop ref to error file

33b8b13

regridding qc overhaul for generic target grid

7564311

clean up regrid qc nb

b94848b

drop bnds variables first in rasdafy

be821c2
