-
-
Notifications
You must be signed in to change notification settings - Fork 18k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
String dtype: propagate NaNs as False in predicate methods (eg .str.startswith) #59616
String dtype: propagate NaNs as False in predicate methods (eg .str.startswith) #59616
Conversation
Does this modify categorical behavior as well? I assume not, but the reason I ask is having come across #36241 when looking for motivation for the current design I'm somewhat unsure about changing this, mostly from a fear of changing things without comprehensively solving them. On one hand, this aligns behavior with floating point and temporal types, but moves away from the nullable extension type behavior. I assume in its current state it also introduces a discrepancy between string and categorical, so whether this is a net positive or not is hard to say |
Ah, good question. It seems indeed not:
I updated the tests also covering |
Really hard to say; regardless of which path we choose, there is going to be some inconsistency in missing value handling (until PDEP-16) I normally prefer punting fixes like this until they can be done more comprehensively, but overall I'm +/0 on this one. Worth getting input from the wider team since this might be a breaking change. I think @rhshadrach has expressed interest in maintaining the status quo for the object-dtype with NaN in the past (sorry if misquoting!) |
To be clear, this PR is not changing the behaviour if you have object dtype, only for the NaN-variant of StringDtype (I know that many cases of object dtype will of course become that new str dtype in 3.0, but just to be explicit about the scope of the current changes)
FWIW, there are certainly good arguments either way for propagating NaNs vs using False for the new default string dtype, but if we choose one or the other behaviour, I personally don't really see a reason to not do that consistently for |
Definitely agree on that point |
…ates-nan-propagation
…ates-nan-propagation
…d categorical behaviour
Updated this after some of the recent refactor PRs, and to align the categorical[str] behaviour to also propagate NaNs as False. |
What is an example operation with floating point types that is comparable?
No - only that we be diligent in understanding the impact on users when making changes, and evaluate them carefully. |
I think I see, from the linked issue
So a comparable operation would be I'm onboard with the two phase approach to strings:
and I see this as being a part of that two phase approach. |
@WillAyd @jbrockmendel this is ready for review |
Should we include the object dtype in this? I do worry that there is going to be an intermediate phase where users have a lot of |
For object dtype the situation is a bit more complex, because object dtype can contain anything, not just strings and missing values. And in that case, we default to return NaN for anything that is not a string, and do this in all methods and not just boolean predicates:
Of course we can say that for boolean predicates, we also return But I would personally start with doing that for just StringDtype. It's certainly a good point and we can discuss that more (maybe on the issue?), but in any case it will be easier to do that in a separate PR, because changing it for object dtype is not something we want to backport to 2.3 |
Sounds good. I don't have a strong preference - just trying to think through what inconsistencies this will surface. On board with the change if the rest of the team is |
…ates-nan-propagation
452e992
to
e401b55
Compare
Another ping here. I have to update this to resolve conflicts, but the diff itself should still be reviewable. |
…ates-nan-propagation
Owee, I'm MrMeeseeks, Look at me. There seem to be a conflict, please backport manually. Here are approximate instructions:
And apply the correct labels and milestones. Congratulations — you did some good work! Hopefully your backport PR will be tested by the continuous integration and merged soon! Remember to remove the If these instructions are inaccurate, feel free to suggest an improvement. |
…tartswith) (pandas-dev#59616) (cherry picked from commit 88554d0)
Manual backport -> #60014 |
This ensures to propagate NaN as
False
for all string predicate methods (i.e. the ones that typically (absent of NA values) return a boolean result) if using the NaN-variant string dtype.Additionally, for the few methods that have an
na
keyword that controls the value to propagate, I updated the default tono_default
. There is some related discussion about how thisna
keyword should be treated if it is passed a non-boolean value in #59561Still have to update docstrings and type annotations.
xref #54792