API: string dtype propagation of NaNs in predicate methods (eg .str.startswith) #54805

Closed
Tracked by #54792
jorisvandenbossche opened this issue Aug 28, 2023 · 5 comments
Labels
API Design Strings String extension data type and string data

@jorisvandenbossche
Member

From #54533 (comment) (and relevant for new String dtype #54792)

Currently, when a string column has missing values and you call one of the string methods that return a boolean series (such as .str.startswith(..)), the NaN or None values are preserved, and the result is an object-dtype series containing a mix of True/False and NaN/None. This is true for the current default object dtype with strings, but also for the specific StringDtype:

>>> pd.Series(["a", "b", None], dtype="object").str.startswith("a")
0     True
1    False
2     None
dtype: object

>>> pd.Series(["a", "b", None], dtype="string[pyarrow_numpy]").str.startswith("a")
0     True
1    False
2      NaN
dtype: object

The missing values are also preserved when using the "nullable" version of StringDtype (with string_storage of "python" or "pyarrow") or using the ArrowDtype("string"), but there the result has a nullable boolean dtype instead of object dtype:

>>> pd.Series(["a", "b", None], dtype="string[python]").str.startswith("a")
0     True
1    False
2     <NA>
dtype: boolean

Here, this makes sense and doesn't pose any usability problems, because the resulting boolean dtype is also nullable.

But in the first two examples, where the resulting boolean dtype would be the numpy bool dtype, we fall back to object-dtype when missing values are present.
And this gives some usability issues for the result. For example, you can't use that result for boolean indexing:

>>> ser = pd.Series(["a", "b", None], dtype="string[pyarrow_numpy]")
>>> ser[ser.str.startswith("a")]
...
ValueError: Cannot mask with non-boolean array containing NA / NaN values
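As a point of comparison, the existing `na` keyword of `.str.startswith` already lets a caller opt in to a plain bool result today. The snippet below is a minimal sketch of that workaround using object dtype (chosen here only because it behaves the same across pandas versions); it is not the behaviour change proposed in this issue.

```python
import pandas as pd

ser = pd.Series(["a", "b", None], dtype="object")

# na=False fills the missing entry with False instead of propagating it,
# so the result is a NumPy bool mask rather than an object-dtype series.
mask = ser.str.startswith("a", na=False)
print(mask.dtype)  # bool

# A bool mask without missing values can be used for boolean indexing.
filtered = ser[mask]
print(filtered.tolist())  # ['a']
```

The proposal amounts to making this the default for the NaN-based string dtype, so that the keyword is no longer needed for filtering.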

I know this has been long standing behaviour for the object dtype way of using strings (and I can't remember getting too many complaints about this?). But when making the change for 3.0 with the new default string dtype, I think we do have a chance to make this easier to work with, and ensure those methods always return a bool dtype (by using False instead of propagating NaN).
On the other hand, for the nullable versions of the string dtype, we probably want to keep the propagating behaviour, and so that would introduce a new inconsistency between the different string storage types (but, this is also an inconsistency that already exists for other cases, such as comparison operators like == propagating NA vs giving False).
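The existing inconsistency for comparison operators can be seen with a small sketch. It compares object dtype against the nullable `string[python]` dtype used earlier in this issue; the variable names are only for illustration.

```python
import pandas as pd

values = ["a", "b", None]

# object dtype: the missing value compares as False, giving a NumPy bool result
obj_eq = pd.Series(values, dtype="object") == "a"

# nullable StringDtype: the missing value propagates as <NA>,
# and the result has the nullable "boolean" dtype
nullable_eq = pd.Series(values, dtype="string[python]") == "a"

print(obj_eq.dtype, obj_eq.tolist())
print(nullable_eq.dtype, nullable_eq.isna().tolist())
```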

@WillAyd
Member

WillAyd commented Aug 1, 2024

> I know this has been long standing behaviour for the object dtype way of using strings (and I can't remember getting too many complaints about this?). But when making the change for 3.0 with the new default string dtype, I think we do have a chance to make this easier to work with, and ensure those methods always return a bool dtype (by using False instead of propagating NaN).

To be honest I don't see the point in trying to change this behavior - we can go down a rabbit hole trying to optimize how this behavior works with NaN, but all the while the pd.NA variant works just fine. I think development effort would be better spent towards leveraging the latter and probably working towards PDEP-16 #58988

@jorisvandenbossche
Member Author

We discussed this briefly in a meeting at some point, but replying here explicitly as well.

> we can go down a rabbit hole trying to optimize how this behavior works with NaN, but all the while the pd.NA variant works just fine.

While the pd.NA variant works fine, that is because indexing with boolean dtype with NAs works fine. While indexing with object dtype with NaN does not work at all.

As a result, a typical use case for those predicate methods, filtering your dataframe or series, does not work with NaN:

# works fine for NA-variant
>>> ser = pd.Series(["a", "b", None], dtype=pd.StringDtype(na_value=pd.NA))
>>> ser[ser.str.startswith("a")]
0    a
dtype: string

# does not work for NaN-variant currently
>>> ser = pd.Series(["a", "b", None], dtype=pd.StringDtype(na_value=np.nan))
>>> ser[ser.str.startswith("a")]
...
ValueError: Cannot mask with non-boolean array containing NA / NaN values
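The difference comes down to the dtype of the mask rather than to the string methods themselves. The sketch below builds the two kinds of mask by hand (the mask values and variable names are illustrative) and applies them to the same series.

```python
import numpy as np
import pandas as pd

ser = pd.Series(["a", "b", None], dtype="object")

# Nullable boolean mask: the NA entry is treated as False when indexing.
nullable_mask = pd.Series([True, False, pd.NA], dtype="boolean")
kept = ser[nullable_mask]
print(kept.tolist())  # ['a']

# Object-dtype mask holding True/False/NaN: indexing is rejected.
object_mask = pd.Series([True, False, np.nan], dtype="object")
try:
    ser[object_mask]
    object_mask_error = None
except ValueError as exc:
    object_mask_error = str(exc)
print(object_mask_error)
```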

So we do indeed introduce yet another behaviour specifically for those string predicate methods: returning False for NaN in the input, instead of preserving NaN (object dtype) or NA (NA-string dtype). But this does ensure consistency with other predicate operations within the string dtype, and with other default dtypes. The PDEP words this as "String columns will follow NaN-semantics for missing values, where NaN gives False in boolean operations such as comparisons or predicates", i.e. ser == "a" already returns False for NaNs in ser. It also ensures consistency in full workflow cases like the example above.

On the point of development effort: the choice of which value is propagated is mostly centralized in _str_map, so the actual implementation change is relatively small (I do expect the bigger change will be updating the tests).
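To illustrate why a single centralized mapping function keeps the change small, here is a rough sketch. `map_predicate` is a hypothetical, heavily simplified stand-in written for this comment; it is not pandas' `_str_map`, and its signature is an assumption. The only point is that the fill value for missing entries decides whether the result can be a NumPy bool array.

```python
import numpy as np


def map_predicate(values, predicate, na_value=False):
    """Hypothetical helper: apply `predicate` to each non-missing value."""
    # Treat None and float NaN as missing (NaN is the only value != itself).
    missing = [v is None or (isinstance(v, float) and v != v) for v in values]
    results = [
        na_value if is_missing else bool(predicate(v))
        for v, is_missing in zip(values, missing)
    ]
    # A NumPy bool array is only possible when the fill value is a real boolean;
    # any other fill (None, NaN) forces a fall back to object dtype.
    dtype = bool if isinstance(na_value, (bool, np.bool_)) else object
    return np.array(results, dtype=dtype)


data = ["a", "b", None]
as_bool = map_predicate(data, lambda s: s.startswith("a"))
as_object = map_predicate(data, lambda s: s.startswith("a"), na_value=None)
print(as_bool.dtype, as_bool.tolist())
print(as_object.dtype, as_object.tolist())
```

In this sketch, switching the default fill value is a one-line change, which is the sense in which the implementation effort is small.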

@WillAyd
Member

WillAyd commented Aug 26, 2024

> While the pd.NA variant works fine, that is because indexing with boolean dtype with NAs works fine. While indexing with object dtype with NaN does not work at all.

This is another good motivating factor for PDEP-13, I think. If we had a logical boolean data type that was always returned from this method, we wouldn't have to deal with the nuances of the object-backed data type.

@jorisvandenbossche
Member Author

Yep, certainly!

Draft PR -> #59616

@jorisvandenbossche jorisvandenbossche added this to the 2.3 milestone Aug 26, 2024
@jorisvandenbossche
Member Author

Closed by #59616
