Processing entire time period with Dask is slower than processing smaller time chunks in series. #8833
claytharrison asked this question in Q&A (unanswered)
The process

I have many ordered satellite swath files (netCDF4) with just an observations (`obs`) dimension and a variable containing the ID of the grid point each observation was taken from. I am creating grid-point-wise aggregations of the files over chunks of time (e.g. weekly means).

To do this, I open the files from the desired time range as a multi-file dataset, then use flox's `xarray_reduce` to calculate the means for each grid point over the desired time chunks.

Single swath file footprint (truncated)
Multifile-dataset footprint (truncated)
This results in a Dataset with a `time_chunks` dimension and a `location_id` dimension, holding the (e.g.) mean value of the desired variables for each location over each time chunk:

Grouped dataset footprint (two weeks)
I then use `groupby('time_chunks')` and `save_mfdataset` to save a file to disk for each time chunk.

Code example
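In outline, the pipeline looks roughly like the sketch below. This is a minimal illustration, not my exact code: the file pattern, the `sigma0` variable name, the example dates, and the weekly binning are placeholders, and the per-observation timestamp variable is assumed to be called `time`.

```python
import numpy as np
import pandas as pd
import xarray as xr
from flox.xarray import xarray_reduce

# Open every swath file in the desired period as one multi-file dataset.
# By default, each source file becomes its own dask chunk along obs.
ds = xr.open_mfdataset(
    "swaths/*.nc",              # illustrative file pattern
    combine="nested",
    concat_dim="obs",
    parallel=True,
)

# Label each observation with the index of the weekly bin its timestamp
# falls into (the date range is an arbitrary example).
week_starts = pd.date_range("2020-01-06", periods=27, freq="7D")
labels = np.searchsorted(week_starts.values, ds["time"].values, side="right") - 1
time_chunks = xr.DataArray(labels, dims="obs", name="time_chunks")

# flox wants concrete group labels (or expected_groups) to size the output,
# so load the per-observation location IDs into memory.
location_id = ds["location_id"].compute()

# Grid-point-wise weekly mean: group by location_id and time_chunks at once.
grouped = xarray_reduce(
    ds[["sigma0"]],             # sigma0 is a placeholder variable name
    location_id,
    time_chunks,
    func="mean",
)

# Write one netCDF file per time chunk.
chunk_ids, datasets = zip(*grouped.groupby("time_chunks"))
paths = [f"weekly_mean_{int(i)}.nc" for i in chunk_ids]
xr.save_mfdataset(datasets, paths)
```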
The problem
My problem is that when I use large amounts of source data (e.g. several months at a time), the computation time blows up far beyond what the extra data alone should cost.
For example, aggregating two weeks of data into week-long chunks takes about 5 minutes on my machine. At that rate, it should take about an hour to process 6 months of data by just tossing two-week chunks into the script in series. But if I toss in all six months at once, the process takes ten hours instead.
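For context, "tossing two-week chunks into the script in series" just means a driver loop roughly like this sketch, where `select_files` and `process_window` are hypothetical stand-ins for the file selection and the open/reduce/save pipeline above:

```python
import pandas as pd

# Feed two-week windows through the same open -> reduce -> save pipeline
# one after another, instead of opening the whole period at once.
start = pd.Timestamp("2020-01-06")   # arbitrary example dates
end = pd.Timestamp("2020-07-06")
window = pd.Timedelta(days=14)

t = start
while t < end:
    files = select_files(t, min(t + window, end))  # hypothetical helper returning file paths
    process_window(files)                          # hypothetical: the open/reduce/save steps above
    t += window
```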
Surely I'm doing something wrong here. Is there anything I can do in terms of rechunking to make things more efficient?
I already tried a few different values for `chunks` in `open_mfdataset`, but nothing seemed to be much better than the default (where each source file is its own chunk). I also tried rechunking the grouped array, since by default the `time_chunks` dimension ends up as a single chunk (sorry, bad naming there...). But that didn't make much difference either; the kind of thing I tried is sketched below.

Any pointers would be greatly appreciated!
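Roughly what those chunking experiments looked like (the chunk sizes are arbitrary example values, and `grouped` is the `xarray_reduce` output from the sketch above):

```python
import xarray as xr

# Explicit chunk sizes at open time, instead of one dask chunk per source
# file; the chunks dict is applied to each input file before concatenation.
# The obs chunk size is an arbitrary example value.
ds = xr.open_mfdataset(
    "swaths/*.nc",
    combine="nested",
    concat_dim="obs",
    chunks={"obs": 1_000_000},
)

# Rechunk the grouped result from the pipeline sketch above so that each
# time chunk is its own dask chunk before save_mfdataset writes one file
# per group.
grouped = grouped.chunk({"time_chunks": 1})
```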
Reply (excerpt):

Otherwise the approach is quite nice and exactly how I would write it! Nice work!