Processing entire time period with Dask is slower than processing smaller time chunks in series. #8833
claytharrison asked this question in Q&A (unanswered)
The process

I have many ordered satellite swath files (netCDF4) with just an observations (`obs`) dimension and a variable containing the ID of the grid point each observation was taken from. I am creating grid-point-wise aggregations of the files over chunks of time (e.g. weekly means).

To do this, I open the files from the desired time range as a multi-file dataset, then use flox's `xarray_reduce` to calculate the means for each grid point over the desired time chunks.

Single swath file footprint (truncated)
Multifile-dataset footprint (truncated)
This results in a Dataset with a `time_chunks` dimension and a `location_id` dimension, holding the (e.g.) mean value of the desired variables for each location over each time chunk:

Grouped dataset footprint (two weeks)
I then use `groupby('time_chunks')` and `save_mfdataset` to save a file to disk for each time chunk.

Code example
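In outline, the pipeline looks roughly like the sketch below. This is a minimal illustration, not my exact code: the file pattern, the `sigma0` variable name, the example dates, and the weekly binning are placeholders, and the per-observation timestamp variable is assumed to be called `time`.

```python
import numpy as np
import pandas as pd
import xarray as xr
from flox.xarray import xarray_reduce

# Open every swath file in the desired period as one multi-file dataset.
# By default, each source file becomes its own dask chunk along obs.
ds = xr.open_mfdataset(
    "swaths/*.nc",              # illustrative file pattern
    combine="nested",
    concat_dim="obs",
    parallel=True,
)

# Label each observation with the index of the weekly bin its timestamp
# falls into (the date range is an arbitrary example).
week_starts = pd.date_range("2020-01-06", periods=27, freq="7D")
labels = np.searchsorted(week_starts.values, ds["time"].values, side="right") - 1
time_chunks = xr.DataArray(labels, dims="obs", name="time_chunks")

# flox wants concrete group labels (or expected_groups) to size the output,
# so load the per-observation location IDs into memory.
location_id = ds["location_id"].compute()

# Grid-point-wise weekly mean: group by location_id and time_chunks at once.
grouped = xarray_reduce(
    ds[["sigma0"]],             # sigma0 is a placeholder variable name
    location_id,
    time_chunks,
    func="mean",
)

# Write one netCDF file per time chunk.
chunk_ids, datasets = zip(*grouped.groupby("time_chunks"))
paths = [f"weekly_mean_{int(i)}.nc" for i in chunk_ids]
xr.save_mfdataset(datasets, paths)
```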
The problem
My problem is that when I use large amounts of source data (e.g. several months at a time), the computation time blows up far beyond what the extra data alone should cost.
For example, aggregating two weeks of data into week-long chunks takes about 5 minutes on my machine. At that rate, it should take about an hour to process 6 months of data by just tossing two-week chunks into the script in series. But if I toss in all six months at once, the process takes ten hours instead.
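For context, "tossing two-week chunks into the script in series" just means a driver loop roughly like this sketch, where `select_files` and `process_window` are hypothetical stand-ins for the file selection and the open/reduce/save pipeline above:

```python
import pandas as pd

# Feed two-week windows through the same open -> reduce -> save pipeline
# one after another, instead of opening the whole period at once.
start = pd.Timestamp("2020-01-06")   # arbitrary example dates
end = pd.Timestamp("2020-07-06")
window = pd.Timedelta(days=14)

t = start
while t < end:
    files = select_files(t, min(t + window, end))  # hypothetical helper returning file paths
    process_window(files)                          # hypothetical: the open/reduce/save steps above
    t += window
```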
Surely I'm doing something wrong here. Is there anything I can do in terms of rechunking to make things more efficient?
I already tried a few different values for `chunks` in `open_mfdataset`, but nothing seemed to be much better than the default (where each source file is its own chunk). I also tried rechunking the grouped array, since by default the `time_chunks` dimension ends up as a single chunk (sorry, bad naming there...). But that didn't make much difference either; the kind of thing I tried is sketched below.

Any pointers would be greatly appreciated!
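Roughly what those chunking experiments looked like (the chunk sizes are arbitrary example values, and `grouped` is the `xarray_reduce` output from the sketch above):

```python
import xarray as xr

# Explicit chunk sizes at open time, instead of one dask chunk per source
# file; the chunks dict is applied to each input file before concatenation.
# The obs chunk size is an arbitrary example value.
ds = xr.open_mfdataset(
    "swaths/*.nc",
    combine="nested",
    concat_dim="obs",
    chunks={"obs": 1_000_000},
)

# Rechunk the grouped result from the pipeline sketch above so that each
# time chunk is its own dask chunk before save_mfdataset writes one file
# per group.
grouped = grouped.chunk({"time_chunks": 1})
```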
Reply (excerpt):

Otherwise the approach is quite nice and exactly how I would write it! Nice work!