[BUG] JIT Groupby Apply `idxmax`/`idxmin` reductions return incorrect values when the data is all NaN #13832

brandon-b-miller · 2023-08-08T14:55:52Z

Describe the bug
When performing an idxmax or idxmin reduction on a dataframe column during JIT groupby apply, pandas returns NaN as the index label corresponding to the answer where as we return the index of the start of the group.

Steps/Code to reproduce bug

import pandas as pd
import cudf
df = pd.DataFrame({
    'a': [1, 1, 1, 2, 2, 2],
    'b': [float('nan')] * 6
})

gdf = cudf.from_pandas(df, nan_as_null=False)

expect = df.groupby('a').apply(lambda x: x['b'].idxmax())
got = gdf.groupby('a').apply(lambda x: x['b'].idxmax(), engine='jit')

print(expect)
print(got)

a
1   NaN
2   NaN
dtype: float64
a
1    0
2    3
dtype: int64

Expected behavior
Ideally we'd match pandas.

Environment overview (please complete the following information)
Bare Metal, 23.10

Additional context
Originally came up here, and then again here. This problem stems from the dtype of the answer being data dependent in pandas. In most cases, the idx_{max,min} functions return an int64 if the index is of type int64, however this edge case of all NaN returns a Nan which is of float type. This poses a compatibility problem for the JIT engine as numba decides the types of all the variables in the input code up front, and currently an idx_{max,min} operation returns an int64. This leads to three options in my mind:

Return some kind of "sensible" int (current state). This leads to edge cases where our results differ from pandas.
Type idxmax and idxmin operations to return a float, e.g. cast the resulting integer to a float and return nan in the edge case, correctly. This trades a value mismatch for a dtype mismatch.
Raise in the edge case. This would require some engineering, related to [FEA] Properly raise when attempting to cast NA to bool inside UDFs #8774.

The text was updated successfully, but these errors were encountered:

wence- · 2023-08-09T10:10:14Z

idxmin/max must be in [0, 2**31) (cudf::size_type is a 4 byte signed int). So if the return value is an unsigned type, you can use (uint32_t)-1 to signal a NaN result, and postprocess. The kernel could return whether or not it needs post-processing if you want to avoid passing over the data many times if it is unnecessary.

brandon-b-miller added bug Something isn't working numba Numba issue Python Affects Python cuDF API. labels Aug 8, 2023

brandon-b-miller self-assigned this Aug 8, 2023

brandon-b-miller mentioned this issue Aug 8, 2023

Return a Series from JIT GroupBy apply, rather than a DataFrame #13820

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BUG] JIT Groupby Apply `idxmax`/`idxmin` reductions return incorrect values when the data is all NaN #13832

[BUG] JIT Groupby Apply `idxmax`/`idxmin` reductions return incorrect values when the data is all NaN #13832

brandon-b-miller commented Aug 8, 2023

wence- commented Aug 9, 2023

[BUG] JIT Groupby Apply idxmax/idxmin reductions return incorrect values when the data is all NaN #13832

[BUG] JIT Groupby Apply idxmax/idxmin reductions return incorrect values when the data is all NaN #13832

Comments

brandon-b-miller commented Aug 8, 2023

wence- commented Aug 9, 2023

[BUG] JIT Groupby Apply `idxmax`/`idxmin` reductions return incorrect values when the data is all NaN #13832

[BUG] JIT Groupby Apply `idxmax`/`idxmin` reductions return incorrect values when the data is all NaN #13832