Describe the bug
When performing an idxmax or idxmin reduction on a dataframe column during JIT groupby apply, pandas returns NaN as the index label for a group whose values are all NaN, whereas we return the index of the start of the group.
Steps/Code to reproduce bug
pandas result:
a
1   NaN
2   NaN
dtype: float64
cuDF result:
a
1    0
2    3
dtype: int64
Expected behavior
Ideally we'd match pandas.
Environment overview (please complete the following information)
Bare Metal, 23.10
Additional context
Originally came up here, and then again here. This problem stems from the dtype of the answer being data dependent in pandas. In most cases, the idx_{max,min} functions return an int64 if the index is of type int64; however, in this edge case of all NaN they return NaN, which is of float type. This poses a compatibility problem for the JIT engine, because numba decides the types of all the variables in the input code up front, and currently an idx_{max,min} operation returns an int64. This leads to three options in my mind:
Return some kind of "sensible" int (current state). This leads to edge cases where our results differ from pandas.
Type idxmax and idxmin operations to return a float, e.g. cast the resulting integer to a float and correctly return NaN in the edge case. This trades a value mismatch for a dtype mismatch.
idxmin/max must be in [0, 2**31) (cudf::size_type is a 4 byte signed int). So if the return value is an unsigned type, you can use (uint32_t)-1 to signal a NaN result, and postprocess. The kernel could return whether or not it needs post-processing if you want to avoid passing over the data many times if it is unnecessary.
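To make the sentinel idea concrete, here is a minimal pure-Python sketch, not cuDF or numba code. The names `NAN_SENTINEL`, `idxmax_position`, `grouped_idxmax`, and `postprocess` are hypothetical, and it assumes positions are counted within each group. The "kernel" always returns an integer, so its return type is fixed up front, and the float/NaN conversion only happens in a separate pass when a flag says it is needed.

```python
import math

# Stand-in for (uint32_t)-1. Valid row positions lie in [0, 2**31),
# so this value can never collide with a real answer.
NAN_SENTINEL = 0xFFFFFFFF


def idxmax_position(values):
    """Type-stable 'kernel': always returns an int.

    Returns the position of the first maximum among non-NaN values,
    or NAN_SENTINEL when every value is NaN.
    """
    best = NAN_SENTINEL
    for pos, v in enumerate(values):
        if math.isnan(v):
            continue  # skip NaN, as a skipna-style reduction would
        if best == NAN_SENTINEL or v > values[best]:
            best = pos
    return best


def grouped_idxmax(groups):
    """Run the kernel per group and report whether post-processing is needed."""
    positions = {key: idxmax_position(vals) for key, vals in groups.items()}
    needs_post = any(p == NAN_SENTINEL for p in positions.values())
    return positions, needs_post


def postprocess(positions):
    """Convert to float and replace the sentinel with NaN (assumed pandas-like result)."""
    return {
        key: (math.nan if p == NAN_SENTINEL else float(p))
        for key, p in positions.items()
    }


groups = {1: [math.nan, math.nan], 2: [1.0, 3.0, 2.0]}
positions, needs_post = grouped_idxmax(groups)
result = postprocess(positions) if needs_post else positions
```

In this sketch the dtype change is confined to `postprocess`, which runs outside the compiled function and only when at least one group was all NaN, so the common case keeps its integer result and avoids an extra pass over the data.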