String dtype: propagate NaNs as False in predicate methods (eg .str.startswith) #59616

jorisvandenbossche · 2024-08-26T15:54:26Z

This ensures that NaN is propagated as False for all string predicate methods (i.e. the ones that, in the absence of NA values, return a boolean result) when using the NaN-variant string dtype.
Additionally, for the few methods that have an na keyword that controls the value to propagate, I updated the default to no_default. There is some related discussion in #59561 about how this na keyword should be treated if it is passed a non-boolean value.
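
For illustration, the interaction between a sentinel default and the na keyword can be modelled in plain Python. This is a minimal sketch and not the pandas implementation: the `_NoDefault` class, the `is_missing` helper, and the list-based `startswith` function are all hypothetical names introduced here.

```python
import math


class _NoDefault:
    """Hypothetical stand-in for a 'no value was passed' sentinel."""

    def __repr__(self):
        return "<no_default>"


no_default = _NoDefault()


def is_missing(value):
    # Treat None and float NaN as missing values.
    return value is None or (isinstance(value, float) and math.isnan(value))


def startswith(values, prefix, na=no_default):
    # If `na` was not passed, missing entries become False;
    # an explicitly passed `na` is used as the fill value instead.
    fill = False if na is no_default else na
    return [fill if is_missing(v) else v.startswith(prefix) for v in values]


default_result = startswith(["apple", None, "banana"], "a")
explicit_result = startswith(["apple", None, "banana"], "a", na=True)
print(default_result)   # [True, False, False]
print(explicit_result)  # [True, True, False]
```

Using a sentinel rather than `None` as the default lets the function tell "nothing passed" apart from an explicit `na=None`.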

Still have to update docstrings and type annotations.

closes API: string dtype propagation of NaNs in predicate methods (eg .str.startswith) #54805
Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added type annotations to new arguments/methods/functions.

xref #54792

WillAyd · 2024-08-27T13:05:53Z

Does this modify categorical behavior as well? I assume not, but the reason I ask is having come across #36241 when looking for motivation for the current design

I'm somewhat unsure about changing this, mostly from a fear of changing things without comprehensively solving them. On one hand, this aligns behavior with floating point and temporal types, but moves away from the nullable extension type behavior. I assume in its current state it also introduces a discrepancy between string and categorical, so whether this is a net positive or not is hard to say

jorisvandenbossche · 2024-08-27T13:49:43Z

> Does this modify categorical behavior as well? I assume not, but the reason I ask is having come across #36241 when looking for motivation for the current design

Ah, good question. It seems indeed not:

>>> pd.Series(["a", None], dtype=pd.StringDtype(na_value=np.nan)).str.startswith("a")
0     True
1    False
dtype: bool

>>> pd.Series(["a", None], dtype=pd.StringDtype(na_value=np.nan)).astype("category").str.startswith("a")
0     True
1      NaN
dtype: object

I updated the tests that also cover category, but it seems I did that a bit blindly, without actually considering the category case.
Now, given that the result is a plain bool array (or object if there are missing values, when starting with object dtype categories), and not a categorical of bool categories, I think the end result should just be the same for both examples above?
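
To make the categorical case concrete, here is a rough plain-Python model, assuming a categorical is stored as a list of unique categories plus integer codes where -1 marks a missing value. The function name `categorical_startswith` is hypothetical, and the sketch only shows what "same result as the string dtype" would mean; it does not reproduce pandas internals.

```python
def categorical_startswith(categories, codes, prefix):
    # Evaluate the predicate once per unique category.
    per_category = [c.startswith(prefix) for c in categories]
    # Expand through the codes; a code of -1 (missing) maps to False,
    # so the result is a plain list of booleans.
    return [per_category[code] if code != -1 else False for code in codes]


categories = ["a", "b"]
codes = [0, 1, -1, 0]  # represents ["a", "b", <missing>, "a"]
cat_result = categorical_startswith(categories, codes, "a")
print(cat_result)  # [True, False, False, True]
```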

WillAyd · 2024-08-27T14:43:31Z

> I think the end result should just be the same for both examples above?

Really hard to say; regardless of which path we choose, there is going to be some inconsistency in missing value handling (until PDEP-16)

I normally prefer punting fixes like this until they can be done more comprehensively, but overall I'm +/0 on this one.

Worth getting input from the wider team since this might be a breaking change. I think @rhshadrach has expressed interest in maintaining the status quo for the object-dtype with NaN in the past (sorry if misquoting!)

jorisvandenbossche · 2024-08-27T16:50:31Z

> maintaining the status quo for the object-dtype with NaN in the past

To be clear, this PR is not changing the behaviour if you have object dtype, only for the NaN-variant of StringDtype (I know that many cases of object dtype will of course become that new str dtype in 3.0, but just to be explicit about the scope of the current changes)

> I think the end result should just be the same for both examples above?

> Really hard to say; regardless of which path we choose, there is going to be some inconsistency in missing value handling (until PDEP-16)

FWIW, there are certainly good arguments either way for propagating NaNs vs using False for the new default string dtype, but if we choose one or the other behaviour, I personally don't really see a reason to not do that consistently for category[str].
(meaning, if going through with this PR, I think I should also update string categories to follow the same change)

WillAyd · 2024-08-27T16:52:29Z

> (meaning, if going through with this PR, I think I should also update string categories to follow the same change)

Definitely agree on that point

jorisvandenbossche · 2024-09-06T18:35:45Z

Updated this after some of the recent refactor PRs, and to align the categorical[str] behaviour to also propagate NaNs as False.

rhshadrach · 2024-09-07T10:02:30Z

> this [PR] aligns behavior with floating point and temporal types

What is an example operation with floating point types that is comparable?

> I think @rhshadrach has expressed interest in maintaining the status quo for the object-dtype with NaN in the past (sorry if misquoting!)

No - only that we be diligent in understanding the impact on users when making changes, and evaluate them carefully.

rhshadrach · 2024-09-07T10:08:54Z

I think I see, from the linked issue

"String columns will follow NaN-semantics for missing values, where NaN gives False in boolean operations such as comparisons or predicates", i.e. ser == "a"

So a comparable operation would be ser > 0.
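
The float analogy can be checked with plain Python floats, which follow IEEE 754 comparison rules: any ordered comparison involving NaN evaluates to False. This is a small sketch using a list in place of a Series; the variable names are illustrative only.

```python
nan = float("nan")
values = [1.5, nan, -2.0]

# Element-wise "greater than zero": the NaN entry yields False, not a missing value.
greater_than_zero = [v > 0 for v in values]
print(greater_than_zero)  # [True, False, False]

# Equality behaves the same way: NaN is not equal to anything, including itself.
equals_itself = [v == v for v in values]
print(equals_itself)  # [True, False, True]
```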

I'm onboard with the two phase approach to strings:

1. Introduce NaN-semantic strings in a consistent manner
2. Transition to NA-semantics (across all dtypes)

and I see this as being a part of that two phase approach.

jorisvandenbossche · 2024-09-09T20:55:56Z

@WillAyd @jbrockmendel this is ready for review

WillAyd · 2024-09-09T21:50:16Z

Should we include the object dtype in this? I do worry that there is going to be an intermediate phase where users have a lot of .astype(object) calls lying around that now have different NA-semantics than what they get by default.

jorisvandenbossche · 2024-09-10T06:39:30Z

For object dtype the situation is a bit more complex, because object dtype can contain anything, not just strings and missing values. In that case, we default to returning NaN for anything that is not a string, and we do this in all methods, not just boolean predicates:

>>> pd.Series(["string", 10, np.nan], dtype=object).str.startswith("str")
0    True
1     NaN
2     NaN
dtype: object

>>> pd.Series(["string", 10, np.nan], dtype=object).str.upper()
0    STRING
1       NaN
2       NaN
dtype: object

Of course we can say that for boolean predicates, we also return False for anything that is not a string and so would have [True, False, False] in the case above.
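
The two policies for non-string values can be contrasted with a small plain-Python model. This is only a sketch under the assumption that every non-string element gets one fill value; `startswith_object` and its `non_string` parameter are hypothetical and not part of pandas.

```python
import math


def startswith_object(values, prefix, non_string=float("nan")):
    # Strings are tested normally; everything else gets the fill value.
    return [
        v.startswith(prefix) if isinstance(v, str) else non_string
        for v in values
    ]


data = ["string", 10, float("nan")]

# Policy 1: non-strings become NaN.
current = startswith_object(data, "str")

# Policy 2: non-strings become False for a boolean predicate.
alternative = startswith_object(data, "str", non_string=False)
print(alternative)  # [True, False, False]
```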

But I would personally start with doing that for just StringDtype. It's certainly a good point and we can discuss that more (maybe on the issue?), but in any case it will be easier to do that in a separate PR, because changing it for object dtype is not something we want to backport to 2.3

WillAyd · 2024-09-10T12:17:41Z

Sounds good. I don't have a strong preference - just trying to think through what inconsistencies this will surface.

On board with the change if the rest of the team is

jorisvandenbossche · 2024-10-09T19:08:40Z

Another ping here. I have to update this to resolve conflicts, but the diff itself should still be reviewable.

lumberbot-app · 2024-10-10T13:04:52Z

Owee, I'm MrMeeseeks, Look at me.

There seems to be a conflict, please backport manually. Here are approximate instructions:

Checkout backport branch and update it.

git checkout 2.3.x
git pull

Cherry pick the first parent branch of this PR on top of the older branch:

git cherry-pick -x -m1 88554d0ca77c7b80605a34f9ece838b834db8720

You will likely have some merge/cherry-pick conflict here, fix them and commit:

git commit -am 'Backport PR #59616: String dtype: propagate NaNs as False in predicate methods (eg .str.startswith)'

Push to a named branch:

git push YOURFORK 2.3.x:auto-backport-of-pr-59616-on-2.3.x

Create a PR against branch 2.3.x, I would have named this PR:

"Backport PR #59616 on branch 2.3.x (String dtype: propagate NaNs as False in predicate methods (eg .str.startswith))"

And apply the correct labels and milestones.

Congratulations — you did some good work! Hopefully your backport PR will be tested by the continuous integration and merged soon!

Remember to remove the Still Needs Manual Backport label once the PR gets merged.

If these instructions are inaccurate, feel free to suggest an improvement.

jorisvandenbossche · 2024-10-10T13:18:04Z

Manual backport -> #60014

…ethods (eg .str.startswith) (#59616) (#60014) * String dtype: propagate NaNs as False in predicate methods (eg .str.startswith) (#59616) (cherry picked from commit 88554d0) * ignore object dtype inference warnings

String dtype: propagate NaNs as False in predicate methods (eg .str.startswith)

9620e00

jorisvandenbossche added API Design Strings String extension data type and string data labels Aug 26, 2024

jorisvandenbossche mentioned this pull request Aug 26, 2024

API: string dtype propagation of NaNs in predicate methods (eg .str.startswith) #54805

Closed

jorisvandenbossche added 4 commits August 26, 2024 18:02

use no_default for ArrowEA._str_endswith as well

b06764e

update type annotations

b235735

update docstrings

562118e

more type annotations

ef05ade

jorisvandenbossche added this to the 2.3 milestone Aug 27, 2024

jorisvandenbossche added 3 commits August 27, 2024 11:04

Merge remote-tracking branch 'upstream/main' into string-dtype-predicates-nan-propagation

7d2a746

test and fix startswith/endswith

b9612fc

test ismethods

cf242a2

jorisvandenbossche requested review from jbrockmendel and WillAyd August 27, 2024 10:22

jorisvandenbossche mentioned this pull request Aug 27, 2024

REF (string): de-duplicate str_endswith, startswith #59568

Merged

jorisvandenbossche added 8 commits August 31, 2024 19:36

Merge remote-tracking branch 'upstream/main' into string-dtype-predicates-nan-propagation

f9ffff7

fix warnings

ad0d6e1

try fix typing

bf02000

Merge remote-tracking branch 'upstream/main' into string-dtype-predicates-nan-propagation

b650064

follow same behaviour for categorical[str]

377ff3a

simplify fill_null calls for string[pyarrow] case

2dfd50b

fix na_value handling for categorical case + update tests for expected categorical behaviour

adf2b99

fix typing + fix conversion for old pyarrow

ddd531a

Merge remote-tracking branch 'upstream/main' into string-dtype-predicates-nan-propagation

e401b55

jorisvandenbossche force-pushed the string-dtype-predicates-nan-propagation branch from 452e992 to e401b55 Compare September 16, 2024 09:01

WillAyd approved these changes Oct 9, 2024

Merge remote-tracking branch 'upstream/main' into string-dtype-predicates-nan-propagation

3cb1b55

jorisvandenbossche merged commit 88554d0 into pandas-dev:main Oct 10, 2024
51 checks passed

lumberbot-app bot added the Still Needs Manual Backport label Oct 10, 2024

jorisvandenbossche deleted the string-dtype-predicates-nan-propagation branch October 10, 2024 13:12

jorisvandenbossche added a commit to jorisvandenbossche/pandas that referenced this pull request Oct 10, 2024

String dtype: propagate NaNs as False in predicate methods (eg .str.startswith) (pandas-dev#59616) (cherry picked from commit 88554d0)

c3aa924

jorisvandenbossche mentioned this pull request Oct 10, 2024

[backport 2.3.x] String dtype: propagate NaNs as False in predicate methods (eg .str.startswith) (#59616) #60014

Merged

jorisvandenbossche added backported and removed Still Needs Manual Backport labels Oct 10, 2024
