Hello, I have been a user of xarray for about a year and have recently been making an effort to better understand chunking in order to make my workflows more efficient (in my case, pulling satellite imagery from STAC catalogs with stackstac and processing it in various ways).

For my needs, I have generally found pretty good performance by having each time step in my stack be its own chunk containing the full spatial extent and all spectral bands. In other words, loading my stack into memory after dask lazy processing is faster when I force the chunk size to one full time step (all bands, full spatial extent) per chunk.

In doing some testing, I noticed that my stacks were getting rechunked within my processing chain. I narrowed it down to the groupby median() call I use to combine duplicate observations. However, I found that using other groupby functions such as mean() or max() does not result in rechunking; see the example sketch below.

In this case, I will probably just change median() to mean() and forget about it, since 99% of the combinations are between just two observations anyway. But I wanted to make this post to learn more about best practices for chunking and how I could make my workflows more efficient. Thank you for taking the time to read this!
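Here is a minimal sketch of the kind of comparison I mean, using a small synthetic dask-backed DataArray in place of my real stackstac stack (the dates, sizes, and band count are just illustrative, and the exact chunk layouts may differ across xarray/dask/flox versions):

```python
import numpy as np
import pandas as pd
import xarray as xr

# Stand-in for a stackstac stack: (time, band, y, x), with duplicate
# timestamps and one full time step per chunk. Sizes/dates are made up.
times = pd.to_datetime(
    ["2021-06-01", "2021-06-01", "2021-06-11", "2021-06-11", "2021-06-21"]
)
da = xr.DataArray(
    np.random.rand(5, 4, 256, 256),
    dims=("time", "band", "y", "x"),
    coords={"time": times},
).chunk({"time": 1})

print("input chunks: ", da.chunks)

# mean()/max() leave the other axes chunked as they were (in my testing)
print("mean chunks:  ", da.groupby("time").mean().chunks)

# median() is where I see dask rechunk the stack
print("median chunks:", da.groupby("time").median().chunks)
```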
-
Medians, and quantiles in general, can only be calculated exactly if you sort or partition the array in memory. (That's not totally true: you can trade off the amount of data in memory against the number of passes over the data, but I'm having trouble finding the reference now. In any case, no one has implemented that for xarray/dask.)
That means you have to rechunk. Dask does this for you and ends up rearranging the other axes.
In this case, since you know these are duplicates, you could just do groupby("time").first()?
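Roughly, on an illustrative synthetic stack (dates, sizes, and band count made up, not your real data), that would look something like:

```python
import numpy as np
import pandas as pd
import xarray as xr

# Illustrative stack with duplicate timestamps, one time step per chunk
times = pd.to_datetime(["2021-06-01", "2021-06-01", "2021-06-11"])
da = xr.DataArray(
    np.random.rand(3, 4, 128, 128),
    dims=("time", "band", "y", "x"),
    coords={"time": times},
).chunk({"time": 1})

# first() just takes the first observation per timestamp, so there is
# no need to sort/partition within each group the way median() requires
deduped = da.groupby("time").first()
print(deduped.chunks)
```

The resulting chunk layout may still depend on your xarray/dask versions, but it avoids the sort that median() needs.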