Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] JIT Groupby Apply idxmax/idxmin reductions return incorrect values when the data is all NaN #13832

Open
brandon-b-miller opened this issue Aug 8, 2023 · 1 comment
Assignees
Labels
bug Something isn't working numba Numba issue Python Affects Python cuDF API.

Comments

@brandon-b-miller
Copy link
Contributor

Describe the bug
When performing an idxmax or idxmin reduction on a dataframe column during JIT groupby apply, pandas returns NaN as the index label corresponding to the answer where as we return the index of the start of the group.

Steps/Code to reproduce bug

import pandas as pd
import cudf
df = pd.DataFrame({
    'a': [1, 1, 1, 2, 2, 2],
    'b': [float('nan')] * 6
})

gdf = cudf.from_pandas(df, nan_as_null=False)

expect = df.groupby('a').apply(lambda x: x['b'].idxmax())
got = gdf.groupby('a').apply(lambda x: x['b'].idxmax(), engine='jit')

print(expect)
print(got)
a
1   NaN
2   NaN
dtype: float64
a
1    0
2    3
dtype: int64

Expected behavior
Ideally we'd match pandas.

Environment overview (please complete the following information)
Bare Metal, 23.10

Additional context
Originally came up here, and then again here. This problem stems from the dtype of the answer being data dependent in pandas. In most cases, the idx_{max,min} functions return an int64 if the index is of type int64, however this edge case of all NaN returns a Nan which is of float type. This poses a compatibility problem for the JIT engine as numba decides the types of all the variables in the input code up front, and currently an idx_{max,min} operation returns an int64. This leads to three options in my mind:

  1. Return some kind of "sensible" int (current state). This leads to edge cases where our results differ from pandas.
  2. Type idxmax and idxmin operations to return a float, e.g. cast the resulting integer to a float and return nan in the edge case, correctly. This trades a value mismatch for a dtype mismatch.
  3. Raise in the edge case. This would require some engineering, related to [FEA] Properly raise when attempting to cast NA to bool inside UDFs #8774.
@brandon-b-miller brandon-b-miller added bug Something isn't working numba Numba issue Python Affects Python cuDF API. labels Aug 8, 2023
@brandon-b-miller brandon-b-miller self-assigned this Aug 8, 2023
@wence-
Copy link
Contributor

wence- commented Aug 9, 2023

  1. idxmin/max must be in [0, 2**31) (cudf::size_type is a 4 byte signed int). So if the return value is an unsigned type, you can use (uint32_t)-1 to signal a NaN result, and postprocess. The kernel could return whether or not it needs post-processing if you want to avoid passing over the data many times if it is unnecessary.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working numba Numba issue Python Affects Python cuDF API.
Projects
Status: Todo
Status: In Progress
Development

No branches or pull requests

2 participants