
Initial minimal working Cubed example for "map-reduce" #352

Merged (15 commits, Apr 23, 2024)

Conversation

tomwhite (Contributor)

This is a first step to implementing #224.

I added a separate code path to the Dask one, in cubed_groupby_agg, since it is sufficiently different (for example, the combine step in Cubed manages memory in a different way).
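For readers unfamiliar with the pattern, the "map-reduce" groupby being implemented here splits into three stages: a per-chunk partial reduction, a combine of intermediates, and a finalize. The sketch below illustrates those stages in plain NumPy; the names (chunk_reduce, combine, finalize) and the dict-of-arrays intermediate format are illustrative assumptions, not flox's or Cubed's actual API.

```python
import numpy as np

def chunk_reduce(values, labels, ngroups):
    # Map step: per-chunk partial sums and counts for each group label.
    sums = np.zeros(ngroups)
    counts = np.zeros(ngroups)
    np.add.at(sums, labels, values)
    np.add.at(counts, labels, 1)
    return {"sum": sums, "count": counts}

def combine(a, b):
    # Combine step: merge intermediates from two chunks elementwise.
    return {key: a[key] + b[key] for key in a}

def finalize(intermediate):
    # Finalize step: turn intermediates into the final result (here, a mean).
    return intermediate["sum"] / intermediate["count"]

# Two "chunks" of (values, group labels), as if split across workers.
chunks = [
    (np.array([1.0, 2.0, 3.0]), np.array([0, 1, 0])),
    (np.array([4.0, 5.0]), np.array([1, 1])),
]
partials = [chunk_reduce(v, lab, ngroups=2) for v, lab in chunks]
merged = partials[0]
for p in partials[1:]:
    merged = combine(merged, p)
result = finalize(merged)  # per-group means
```

The PR's point is that the combine stage is where Cubed differs from Dask (Cubed manages the memory of this step explicitly), which is why a separate cubed_groupby_agg path was added rather than reusing dask_groupby_agg.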

This PR relies on cubed-dev/cubed#442, which adds the ability to specify the size of the grouping axis.

I added a unit test based on the Dask one, which passes for a few cases, but there are plenty more to support (NaNs, fill values, sorting, etc.).

I have included a Jupyter notebook in this PR, which shouldn't be merged, but shows the code working with a cut-down version of the example in https://flox.readthedocs.io/en/latest/user-stories/climatology-hourly.html. (I haven't tried running anything at scale yet.)

Interested to get your feedback @dcherian and @TomNicholas!

Review threads on flox/core.py (resolved).
```python
if not has_dask:
    if has_cubed:
        if method is None:
            method = "map-reduce"
```
Collaborator
Will also need an assert reindex is True in _validate_reindex.
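A hypothetical sketch of what such a check could look like. The function name, arguments, and error message below are illustrative assumptions, not flox's actual _validate_reindex signature; the idea is simply to force reindex=True on the Cubed path and reject an explicit reindex=False.

```python
def validate_reindex(reindex, is_cubed_array):
    """Return a validated reindex flag.

    Illustrative sketch: Cubed's map-reduce path requires reindexing each
    chunk's intermediate to the full set of expected groups, so reindex
    must end up True when the input is a cubed array.
    """
    if is_cubed_array:
        if reindex is False:
            # An explicit opt-out cannot be honoured on this path.
            raise NotImplementedError(
                "reindex=False is not supported with cubed arrays."
            )
        reindex = True
    return reindex
```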

tomwhite (Contributor, Author)

Yeah, not sure where that would go exactly.

@dcherian (Collaborator)

Thanks @tomwhite! Did you have any thoughts on how to refactor this so that we can share code between the dask and cubed paths? Is it feasible to have just the combine stage be different between the two, by applying a little adapter function that transforms between dicts and structured arrays?

@tomwhite (Contributor, Author)

Thanks for reviewing @dcherian! The _finalize_results change is a good improvement!

Did you have any thoughts on how to refactor this so that we can share code between the dask and cubed paths? Is it feasible to have just the combine stage be different between the two, by applying a little adapter function that transforms between dicts and structured arrays?

This might be possible, but it would obviously be more work. The logic for map-reduce, blockwise, and cohorts is combined in dask_groupby_agg, which complicates things. Cubed has a slightly different way to do map-reduce, blockwise may be the same (not sure yet), and Cubed may not need cohorts (#224 (comment)).
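The adapter idea discussed above could be bridged with two small converters between the dict-of-arrays intermediates and a NumPy structured array. This is a hypothetical sketch, not code from the PR; dict_to_structured and structured_to_dict are illustrative names, and it assumes all intermediates share one shape.

```python
import numpy as np

def dict_to_structured(d):
    """Pack a dict of same-shaped arrays into one structured array."""
    names = list(d)
    dtype = [(name, d[name].dtype) for name in names]
    out = np.empty(d[names[0]].shape, dtype=dtype)
    for name in names:
        out[name] = d[name]
    return out

def structured_to_dict(arr):
    """Unpack a structured array back into a dict of plain arrays."""
    return {name: np.ascontiguousarray(arr[name]) for name in arr.dtype.names}

# Round-trip example with sum/count intermediates for two groups.
intermediates = {"sum": np.array([4.0, 11.0]), "count": np.array([2.0, 3.0])}
packed = dict_to_structured(intermediates)
roundtrip = structured_to_dict(packed)
```

With converters like these, only the combine stage would need to know which container format the execution backend prefers.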

@tomwhite tomwhite marked this pull request as ready for review April 18, 2024 14:02
@tomwhite (Contributor, Author)

I made a release of Cubed with the changes needed by this PR. I've also tried adding the Cubed tests to CI, so we'll see if that works.

I had a look at the grouping by multiple variables (2D) case, but it's not trivial so I'd rather do it as a follow up (#353).

@dcherian (Collaborator)

I'll take a look soon, I promise.

In the meantime, it would be good to add that notebook as documentation.

@tomwhite (Contributor, Author)

I'll take a look soon, I promise.

Thanks!

In the meantime, it would be good to add that notebook as documentation.

Where would be a good place to add it, do you think?

@dcherian (Collaborator)

Under tricks and stories is fine; it's got a collection of notebooks.

Longer term we can update the "Duck Array Support" page.

@dcherian (Collaborator)

Also, I took a look and this looks good to me. I experimented with some refactoring, but I agree that it'd be good to see what's needed for blockwise/cohorts before we actually refactor things.

@dcherian dcherian merged commit 90393df into xarray-contrib:main Apr 23, 2024
19 checks passed
dcherian added a commit that referenced this pull request May 2, 2024
* main: (64 commits)
  import `normalize_axis_index` from `numpy.lib` on `numpy>=2` (#364)
  Optimize `min_count` when `expected_groups` is not provided. (#236)
  Use threadpool for finding labels in chunk (#327)
  Manually fuse reindexing intermediates with blockwise reduction for cohorts. (#300)
  Bump codecov/codecov-action from 4.1.1 to 4.3.1 (#362)
  Add cubed notebook for hourly climatology example using "map-reduce" method (#356)
  Optimize bitmask finding for chunk size 1 and single chunk cases (#360)
  Edits to climatology doc (#361)
  Fix benchmarks (#358)
  Trim CI (#355)
  [pre-commit.ci] pre-commit autoupdate (#350)
  Initial minimal working Cubed example for "map-reduce" (#352)
  Bump codecov/codecov-action from 4.1.0 to 4.1.1 (#349)
  `method` heuristics: Avoid dot product as much as possible (#347)
  Fix nanlen with strings (#344)
  Fix direct quantile reduction (#343)
  Fix upstream-dev CI, silence warnings (#341)
  Bump codecov/codecov-action from 4.0.0 to 4.1.0 (#338)
  Fix direct reductions of Xarray objects (#339)
  Test with py3.12 (#336)
  ...