Hello, I have been a user of xarray for about a year and have recently been making an effort to better understand chunking in order to make my workflows more efficient (in my case, pulling satellite imagery from STAC catalogs with stackstac and processing it in various ways).

For my needs, I have generally found pretty good performance by having each time step in my stack be its own chunk containing the full spatial extent and all spectral bands. In other words, loading my stack into memory after dask lazy processing is faster when I force the chunk size to one full time step (all bands, full spatial extent) per chunk.

In doing some testing, I noticed that my stacks were getting rechunked within my processing chain. I narrowed it down to the groupby median() call I use to combine duplicate observations. However, I found that using other groupby functions such as mean() or max() does not result in rechunking; see the example sketch below.

In this case, I will probably just change median() to mean() and forget about it, since 99% of the combinations are between just two observations anyway. But I wanted to make this post to learn more about best practices for chunking and how I could make my workflows more efficient. Thank you for taking the time to read this!
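Here is a minimal sketch of the kind of comparison I mean, using a small synthetic dask-backed DataArray in place of my real stackstac stack (the dates, sizes, and band count are just illustrative, and the exact chunk layouts may differ across xarray/dask/flox versions):

```python
import numpy as np
import pandas as pd
import xarray as xr

# Stand-in for a stackstac stack: (time, band, y, x), with duplicate
# timestamps and one full time step per chunk. Sizes/dates are made up.
times = pd.to_datetime(
    ["2021-06-01", "2021-06-01", "2021-06-11", "2021-06-11", "2021-06-21"]
)
da = xr.DataArray(
    np.random.rand(5, 4, 256, 256),
    dims=("time", "band", "y", "x"),
    coords={"time": times},
).chunk({"time": 1})

print("input chunks: ", da.chunks)

# mean()/max() leave the other axes chunked as they were (in my testing)
print("mean chunks:  ", da.groupby("time").mean().chunks)

# median() is where I see dask rechunk the stack
print("median chunks:", da.groupby("time").median().chunks)
```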
-
Medians, and quantiles in general, can only be calculated exactly if you sort or partition the array in memory. (That's not totally true: you can trade off the amount of data in memory against the number of passes over the data, but I'm having trouble finding the reference now. In any case, no one has implemented that for xarray/dask.)
That means you have to rechunk. Dask does this for you and ends up rearranging the other axes.
In this case, since you know these are duplicates, you could just do groupby("time").first()?
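Roughly, on an illustrative synthetic stack (dates, sizes, and band count made up, not your real data), that would look something like:

```python
import numpy as np
import pandas as pd
import xarray as xr

# Illustrative stack with duplicate timestamps, one time step per chunk
times = pd.to_datetime(["2021-06-01", "2021-06-01", "2021-06-11"])
da = xr.DataArray(
    np.random.rand(3, 4, 128, 128),
    dims=("time", "band", "y", "x"),
    coords={"time": times},
).chunk({"time": 1})

# first() just takes the first observation per timestamp, so there is
# no need to sort/partition within each group the way median() requires
deduped = da.groupby("time").first()
print(deduped.chunks)
```

The resulting chunk layout may still depend on your xarray/dask versions, but it avoids the sort that median() needs.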