-
Notifications
You must be signed in to change notification settings - Fork 902
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Simplify implementation of interval_range() and fix behaviour for floating freq
#13844
Simplify implementation of interval_range() and fix behaviour for floating freq
#13844
Conversation
(1.0, None, 2.5, 2), | ||
], | ||
) | ||
def test_interval_range_floating(start, stop, freq, periods): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are the only new tests introduced in this PR. The other tests have been moved to this file from elsewhere.
if start is None: | ||
start = end - freq * periods | ||
elif freq is None: | ||
quotient, remainder = divmod((end - start).value, periods.value) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
divmod
seems not to work with cudf.Scalar
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you do all this manipulation on the host before scalarising at the end, that would have far fewer syncs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have tests written already that pass Scalar
objects as input :(
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The scalar caching should save us here, no? There shouldn't be a sync until device_value
is called later on. .value
should just return the cached host value if the scalar was constructed from a host scalar
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, minor copyright comment.
[ | ||
(0.0, None, 0.2, 5), | ||
(0.0, 1.0, None, 5), | ||
# (0.0, 1.0, 0.2, None), # Pandas returns only 4 intervals here |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@galipremsagar what's the right way to handle tests that will eventually pass with Pandas 2.x?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggested a change below: https://github.com/rapidsai/cudf/pull/13844/files#r1290106880
[ | ||
(0.0, None, 0.2, 5), | ||
(0.0, 1.0, None, 5), | ||
# (0.0, 1.0, 0.2, None), # Pandas returns only 4 intervals here |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# (0.0, 1.0, 0.2, None), # Pandas returns only 4 intervals here | |
pytest.param( | |
(0.0, 1.0, 0.2, None), | |
marks=pytest.mark.xfail( | |
condition=not PANDAS_GE_210, | |
reason="https://github.com/pandas-dev/pandas/pull/54477", | |
), | |
) |
We can use this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We might want to introduce PANDAS_GE_210
variable in cudf.core._compat.py
too
Co-authored-by: GALI PREM SAGAR <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does it make sense to copy all the scalars to device, only to manipulate and copy back. Would it be better do do all the preprocessing on the host and then move stuff once?
if start is None: | ||
start = end - freq * periods | ||
elif freq is None: | ||
quotient, remainder = divmod((end - start).value, periods.value) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you do all this manipulation on the host before scalarising at the end, that would have far fewer syncs.
python/cudf/cudf/core/index.py
Outdated
periods = cudf.Scalar(int((end - start) / freq)) | ||
elif end is None: | ||
end = start + periods * freq | ||
|
||
if any( | ||
not _is_non_decimal_numeric_dtype(x.dtype) if x is not None else False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At this point all args must be not None.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fix'd
step=freq.device_value, | ||
) | ||
left_col = bin_edges.slice(0, len(bin_edges) - 1) | ||
right_col = bin_edges.slice(1, len(bin_edges)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're doing linspace
here with arange
-like calls, are we sure the edge cases are handled correctly (specifically if periods
has not been provided).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, what edge cases would you be looking for?
To clarify the behaviour, when periods=
hasn't been specified, Pandas does something like arange(start, end, freq)
:
In [2]: pd.interval_range(start=1, end=3.0, freq=0.7)
Out[2]: IntervalIndex([(1.0, 1.7], (1.7, 2.4]], dtype='interval[float64, right]')
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah ok, they are also wrong then:
In [10]: pd.interval_range(start=1, end=3.0, freq=0.5)
Out[10]: IntervalIndex([(1.0, 1.5], (1.5, 2.0], (2.0, 2.5], (2.5, 3.0]], dtype='interval[float64, right]')
In [11]: pd.interval_range(start=1, end=3.0, freq=0.1)
Out[11]: IntervalIndex([(1.0, 1.1055555555555556], (1.1055555555555556, 1.211111111111111], (1.211111111111111, 1.3166666666666667], (1.3166666666666667, 1.4222222222222223], (1.4222222222222223, 1.5277777777777777] ... (2.3722222222222222, 2.477777777777778], (2.477777777777778, 2.583333333333333], (2.583333333333333, 2.688888888888889], (2.688888888888889, 2.7944444444444443], (2.7944444444444443, 2.9]], dtype='interval[float64, right]')
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note how I asked for a frequency of 0.1 (so the intervals should be (1, 1.1]
etc...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, because the implementation does loads of stuff with floats that suffers from rounding errors
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup should hopefully be fixed in 2.1 in pandas to just do an arange in these cases pandas-dev/pandas#54477
/merge |
Closes #13843
Closes #13847
This PR simplifies the implementation of
interval_range()
and fixes a few different bugs in the process. It also moves all tests for interval indexes totests/indexes/test_interval.py
. Finally, while working on this PR, I ran into #13847; a fix for that is also included in this PR.Checklist