-
If you're opening a single non-chunked netCDF file, for example, then the whole variable's array still has to be loaded into memory before your computation can proceed. The IO cost of loading that data into memory is likely what takes the majority of the time, which is why applying the processing step to only a small slice doesn't lead to much speedup.
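To make this concrete, here is a minimal stdlib-only sketch (not real xarray code; the file lengths and step counts are made-up illustration values). An "eager" reader pays IO for every timestep before slicing, while a "lazy" reader (what you get from xarray when you open a dataset with a `chunks=` argument, so the variable is backed by dask) only reads the timesteps the slice actually needs:

```python
# Illustrative sketch: why slicing after an eager load doesn't save time.
# "steps_read" stands in for the IO cost of pulling timesteps off disk.

def eager_read(n_time, want):
    """Load all n_time steps into memory, then keep only `want` of them."""
    steps_read = n_time          # whole variable is read before slicing
    return steps_read, want      # (IO cost, data actually used)

def lazy_read(n_time, want):
    """Read only the `want` requested steps (chunked / dask-style access)."""
    steps_read = want            # IO cost depends only on the slice
    return steps_read, want

big, small = 500, 100            # e.g. time lengths of the two files
sliced = 20                      # time steps kept after slicing

# Eager: IO cost still scales with the full time dimension, so the big
# file stays ~5x slower even though the slices end up the same size.
assert eager_read(big, sliced)[0] == 500
assert eager_read(small, sliced)[0] == 100

# Lazy: IO cost depends only on the slice, so both files cost the same.
assert lazy_read(big, sliced)[0] == lazy_read(small, sliced)[0] == 20
```

Under this assumption, the practical fix is to slice *before* the data is materialized, e.g. by opening the file lazily and selecting the time range first.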
-
Hi,
I have two files; one is bigger simply because its "time" dimension is about 5 times longer. Accordingly, the computations I apply to the larger file take roughly 5 times longer on average than on the smaller file. However, when I slice the time dimension of both files down to the same smaller length, the computation times are still very different: the sliced data from the larger file takes ~5 times longer even though all the dimensions are now equal.
I am guessing that even after slicing, the larger file still keeps more data in memory?
Once the file is read by xr.open_dataset (graf_data), the operations are as follows: