Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA] Support COUNT_ALL and COUNT_VALID as reduce aggregations #13756

Open
mythrocks opened this issue Jul 25, 2023 · 4 comments
Open

[FEA] Support COUNT_ALL and COUNT_VALID as reduce aggregations #13756

mythrocks opened this issue Jul 25, 2023 · 4 comments
Assignees
Labels
0 - Backlog In queue waiting for assignment feature request New feature or request libcudf Affects libcudf (C++/CUDA) code.

Comments

@mythrocks
Copy link
Contributor

mythrocks commented Jul 25, 2023

This is a follow-up item that arose from #13727.

It turns out that COUNT_ALL and COUNT_VALID are not supported in cudf::reduce() as reduce_aggregations.
(This is likely because their values can trivially computed from the results of column::size() and column::null_count().)

However, this causes obfuscation in any code that attempts to dispatch reduce aggregations generically. E.g. Here.

It would be good to have cudf::reduce() support COUNT_ALL and COUNT_VALID natively, so as not to require special handling in the calling code.

@mythrocks mythrocks added feature request New feature or request Needs Triage Need team to review and classify labels Jul 25, 2023
@mythrocks mythrocks self-assigned this Jul 25, 2023
@davidwendt
Copy link
Contributor

I disagree with this change. I think cudf::reduce has become too monolithic and difficult to maintain as more aggregations are added. Special rules have to be applied and adjusted (and documented) concerning combinations of parameters and parameter types. For example, certain aggregations do not accept init values and others only accept specific output_types.

I'd like to see libcudf split up the reduce algorithms into functional names like cudf::reduction::any(), cudf::reduction::all(), cudf::reduction::max(), etc
Internally the reduce/aggregation types are implemented as cudf::reduction::detail::any(), cudf::reduction::detail::all(), cudf::reduction::detail::max(), etc
This will make it easier to maintain and document as well as add new reductions in the future without breaking the monolithic API. I can open a separate issue to further discuss this more practically.

Regardless, I don't think we should provide a convenience wrapper on the existing cudf::column::size() or cudf::column::null_count()

@mythrocks
Copy link
Contributor Author

Thank you for your perspective, @davidwendt.

Special rules have to be applied and adjusted (and documented) concerning combinations of parameters and parameter types.

I see what you mean.

What's your view on the code such as the following?

  auto const reduce_results = [&] {
    auto const return_dtype = cudf::detail::target_type(input.type(), aggr.kind);
    if (aggr.kind == aggregation::COUNT_ALL) {
      return cudf::make_fixed_width_scalar(input.size(), stream);
    } else if (aggr.kind == aggregation::COUNT_VALID) {
      return cudf::make_fixed_width_scalar(input.size() - input.null_count(), stream);
    } else {
      return cudf::reduction::detail::reduce(input,
                                             *convert_to<cudf::reduce_aggregation>(aggr),
                                             return_dtype,
                                             std::nullopt,
                                             stream,
                                             rmm::mr::get_current_device_resource());
    }
  }();

This is the special-casing that I would like to avoid. If cudf::reduce() worked with COUNT, there would be less friction here.

CC @harrism (who had advice on this in the past).

@ttnghia
Copy link
Contributor

ttnghia commented Aug 2, 2023

From my perspective, putting the commonly used feature at the lower level (libcudf) is better than at the application level. Why? Because we can just call libcudf API and avoid reimplementing that same feature multiple times in multiple applications, even such implementation is very simple.

@bdice
Copy link
Contributor

bdice commented Aug 2, 2023

If we were able to eliminate the enum aggregation::Kind by moving to a functional interface like cudf::reduction::any() then I would support the expectation that users should call cudf::column::size() as @davidwendt proposes. We wouldn't need a cudf::reduction::count(). I think that this interface is the direction we ought to go, but I haven't scoped any of what that work would require.

However, as it stands with aggregation::Kind today, I am not a fan of a design that allows only certain aggregations to be used as a reduction. If the reduction API accepts an enum aggregation::Kind, it should support all values of that enum (to the extent possible) even if they are redundant with an existing function, for "application level" / "special casing" reasons as @ttnghia and @mythrocks pointed out above.

Given the tension between these two potential next steps, I would encourage spending more effort towards the functional API that @davidwendt proposed, but I wouldn't oppose expanding the existing reduction API's aggregation::Kind support if we think there is enough justification, particularly potential for reuse in other parts of libcudf's aggregation code paths.

@GregoryKimball GregoryKimball added 0 - Backlog In queue waiting for assignment libcudf Affects libcudf (C++/CUDA) code. and removed Needs Triage Need team to review and classify labels Aug 18, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
0 - Backlog In queue waiting for assignment feature request New feature or request libcudf Affects libcudf (C++/CUDA) code.
Projects
Status: In Progress
Development

No branches or pull requests

5 participants