[FEA] Support `COUNT_ALL` and `COUNT_VALID` as `reduce` aggregations #13756

mythrocks · 2023-07-25T22:42:31Z

This is a follow-up item that arose from #13727.

It turns out that COUNT_ALL and COUNT_VALID are not supported in cudf::reduce() as reduce_aggregations.
(This is likely because their values can trivially computed from the results of column::size() and column::null_count().)

However, this causes obfuscation in any code that attempts to dispatch reduce aggregations generically. E.g. Here.

It would be good to have cudf::reduce() support COUNT_ALL and COUNT_VALID natively, so as not to require special handling in the calling code.

The text was updated successfully, but these errors were encountered:

davidwendt · 2023-07-31T14:52:20Z

I disagree with this change. I think cudf::reduce has become too monolithic and difficult to maintain as more aggregations are added. Special rules have to be applied and adjusted (and documented) concerning combinations of parameters and parameter types. For example, certain aggregations do not accept init values and others only accept specific output_types.

I'd like to see libcudf split up the reduce algorithms into functional names like cudf::reduction::any(), cudf::reduction::all(), cudf::reduction::max(), etc
Internally the reduce/aggregation types are implemented as cudf::reduction::detail::any(), cudf::reduction::detail::all(), cudf::reduction::detail::max(), etc
This will make it easier to maintain and document as well as add new reductions in the future without breaking the monolithic API. I can open a separate issue to further discuss this more practically.

Regardless, I don't think we should provide a convenience wrapper on the existing cudf::column::size() or cudf::column::null_count()

mythrocks · 2023-08-02T17:09:53Z

Thank you for your perspective, @davidwendt.

Special rules have to be applied and adjusted (and documented) concerning combinations of parameters and parameter types.

I see what you mean.

What's your view on the code such as the following?

  auto const reduce_results = [&] {
    auto const return_dtype = cudf::detail::target_type(input.type(), aggr.kind);
    if (aggr.kind == aggregation::COUNT_ALL) {
      return cudf::make_fixed_width_scalar(input.size(), stream);
    } else if (aggr.kind == aggregation::COUNT_VALID) {
      return cudf::make_fixed_width_scalar(input.size() - input.null_count(), stream);
    } else {
      return cudf::reduction::detail::reduce(input,
                                             *convert_to<cudf::reduce_aggregation>(aggr),
                                             return_dtype,
                                             std::nullopt,
                                             stream,
                                             rmm::mr::get_current_device_resource());
    }
  }();

This is the special-casing that I would like to avoid. If cudf::reduce() worked with COUNT, there would be less friction here.

CC @harrism (who had advice on this in the past).

ttnghia · 2023-08-02T17:29:31Z

From my perspective, putting the commonly used feature at the lower level (libcudf) is better than at the application level. Why? Because we can just call libcudf API and avoid reimplementing that same feature multiple times in multiple applications, even such implementation is very simple.

bdice · 2023-08-02T17:39:45Z

If we were able to eliminate the enum aggregation::Kind by moving to a functional interface like cudf::reduction::any() then I would support the expectation that users should call cudf::column::size() as @davidwendt proposes. We wouldn't need a cudf::reduction::count(). I think that this interface is the direction we ought to go, but I haven't scoped any of what that work would require.

However, as it stands with aggregation::Kind today, I am not a fan of a design that allows only certain aggregations to be used as a reduction. If the reduction API accepts an enum aggregation::Kind, it should support all values of that enum (to the extent possible) even if they are redundant with an existing function, for "application level" / "special casing" reasons as @ttnghia and @mythrocks pointed out above.

Given the tension between these two potential next steps, I would encourage spending more effort towards the functional API that @davidwendt proposed, but I wouldn't oppose expanding the existing reduction API's aggregation::Kind support if we think there is enough justification, particularly potential for reuse in other parts of libcudf's aggregation code paths.

mythrocks added feature request New feature or request Needs Triage Need team to review and classify labels Jul 25, 2023

mythrocks self-assigned this Jul 25, 2023

mythrocks mentioned this issue Jul 25, 2023

Switch fully unbounded window functions to use aggregations #13727

Merged

3 tasks

GregoryKimball added 0 - Backlog In queue waiting for assignment libcudf Affects libcudf (C++/CUDA) code. and removed Needs Triage Need team to review and classify labels Aug 18, 2023

GregoryKimball added this to the Aggregations continuous improvement milestone Aug 18, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[FEA] Support `COUNT_ALL` and `COUNT_VALID` as `reduce` aggregations #13756

[FEA] Support `COUNT_ALL` and `COUNT_VALID` as `reduce` aggregations #13756

mythrocks commented Jul 25, 2023 •

edited

Loading

davidwendt commented Jul 31, 2023

mythrocks commented Aug 2, 2023

ttnghia commented Aug 2, 2023 •

edited

Loading

bdice commented Aug 2, 2023 •

edited

Loading

[FEA] Support COUNT_ALL and COUNT_VALID as reduce aggregations #13756

[FEA] Support COUNT_ALL and COUNT_VALID as reduce aggregations #13756

Comments

mythrocks commented Jul 25, 2023 • edited Loading

davidwendt commented Jul 31, 2023

mythrocks commented Aug 2, 2023

ttnghia commented Aug 2, 2023 • edited Loading

bdice commented Aug 2, 2023 • edited Loading

[FEA] Support `COUNT_ALL` and `COUNT_VALID` as `reduce` aggregations #13756

[FEA] Support `COUNT_ALL` and `COUNT_VALID` as `reduce` aggregations #13756

mythrocks commented Jul 25, 2023 •

edited

Loading

ttnghia commented Aug 2, 2023 •

edited

Loading

bdice commented Aug 2, 2023 •

edited

Loading