Is there a way to cmorize out-of-memory datasets ? #166
Comments
Hi @aulemahal,
Yes, that's right. The
Absolutely, that is definitely on my roadmap, since it is quite straightforward to implement. E.g., in my driving scripts I use something like this:

```python
file_chunksize = {
    "1hr": "A",
    "3hr": "A",
    "6hr": "A",
    "day": "5A",
    "mon": "10A",
}

def get_chunks(ds, freq, **kwargs):
    # split the dataset along time into file-sized chunks
    _, chunks = zip(*ds.resample(time=file_chunksize[freq], **kwargs))
    return chunks
```

where I cmorize the output chunks.
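A minimal sketch of what that resample-based split does, using a hypothetical toy dataset (the variable name and frequencies here are illustrative, not py-cordex API):

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical toy dataset: three years of daily data.
time = pd.date_range("2000-01-01", "2002-12-31", freq="D")
ds = xr.Dataset(
    {"tas": ("time", np.zeros(time.size, dtype="float32"))},
    coords={"time": time},
)

# Iterating over a resample object yields (label, group) pairs;
# zip(*...) transposes them into labels and yearly sub-datasets.
labels, chunks = zip(*ds.resample(time="A"))
print(len(chunks))          # one chunk per year
print(chunks[0].time.size)  # 366 days in the leap year 2000
```

Each chunk is itself an `xr.Dataset` covering one output file's time span, so the cmorization step can be applied per chunk.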
My very words! I have struggled a lot with cmorization in the past (see, e.g., also this discussion), and I actually started a project (xcmor) where I have implemented some of my insights. The goal is exactly what you mentioned, e.g., to cmorize lazily, have a dynamic implementation of those rules that cmor implements, and give the user more flexibility in the actual output format and storage options. Finally, this should also replace the
Ah! Happy to see that I'm not the only one struggling with the unification of those shiny high-level tools (xarray) and the ancient, solid-as-titanium ones (cdo, nco, cmor). ;) I'll try out
Hi!
I'm tasked with making our CORDEX (Ouranos MRCC5) data publishable. Hourly files at NAM-11 are, of course, quite large. Opening a single year of tas, I get an array of shape (8759, 628, 655), which would take 13.4 GB of RAM (float32). Of course, xarray and dask can help me here, and I could in theory process this chunk by chunk. However, it seems at first glance that the cmorize tools in `py-cordex` will load the data, making dask useless.

I think I see that the in-memory requirement comes from `cmor` itself, but I am asking here as this is where the xarray-compatible implementation is. Sorry if this isn't the best channel.

What do others do in that situation? Is enough RAM a hard requirement to use cmor?
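For context, this is the kind of lazy, dask-backed workflow I have in mind (the file pattern and chunk sizes below are placeholders, not py-cordex API):

```python
import numpy as np
import xarray as xr

# Opening real files lazily would look like (path is a placeholder):
# ds = xr.open_mfdataset("tas_NAM-11_2000-*.nc", chunks={"time": 24 * 31})

# The same idea on a small in-memory dataset, chunked with dask so that
# operations stream through one block of data at a time:
ds = xr.Dataset(
    {"tas": (("time", "y", "x"), np.zeros((48, 4, 5), dtype="float32"))}
).chunk({"time": 24})
print(ds.tas.chunks)  # ((24, 24), (4,), (5,))
```

As long as every downstream step preserves laziness, only one chunk needs to fit in memory; the question is whether the cmorization step can be made to preserve it.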
Similarly, the 1-year-per-file rule comes from the CORDEX file spec (I have access to the Feb 2023 draft). My data is stored in monthly netCDFs. Could the standardization process be done on the full dataset (all simulated years), with the multiple files then written out afterwards? The one-year subsetting could even be automatic, based on the specs.
Finally, it seems to me that all this would be much easier if there was a function that takes in an xarray dataset and returns a standardized, cmorized xarray dataset (or datasets), which I could then save with `xr.save_mfdataset`. Does that exist?

Thanks, and sorry for the long issue that's not a real issue.
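For illustration, the write-out half of that idea can be sketched with plain xarray: split an (assumed already-standardized) multi-year dataset into one piece per year and write all files in one call. Dataset contents and file names below are placeholders:

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical stand-in for an already-standardized multi-year dataset.
time = pd.date_range("2000-01-01", "2001-12-31", freq="D")
ds = xr.Dataset(
    {"tas": ("time", np.zeros(time.size, dtype="float32"))},
    coords={"time": time},
)

# One (year, sub-dataset) pair per simulated year.
years, datasets = zip(*ds.groupby("time.year"))
paths = [f"tas_{y}.nc" for y in years]  # file naming is a placeholder
# xr.save_mfdataset(datasets, paths)  # writes all yearly files at once
```

The missing piece would then only be the standardization function itself.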