API: string dtype propagation of NaNs in predicate methods (eg .str.startswith) #54805

jorisvandenbossche · 2023-08-28T12:39:17Z

From #54533 (comment) (and relevant for new String dtype #54792)

Currently, when having a string column with missing values, and calling one of the string methods that return a boolean series (such as .str.startswith(..)), the NaN or None values are preserved, and the result is an object-dtype series containing a mix of True/False and NaN/None. This is true for the current default object dtype with strings, but also for the specific StringDtype:

>>> pd.Series(["a", "b", None], dtype="object").str.startswith("a")
0     True
1    False
2     None
dtype: object

>>> pd.Series(["a", "b", None], dtype="string[pyarrow_numpy]").str.startswith("a")
0     True
1    False
2      NaN
dtype: object

Missing values are also propagated when using the "nullable" version of StringDtype (with string_storage of "python" or "pyarrow") or using the ArrowDtype("string"), but there the result has the nullable boolean dtype instead of object dtype:

>>> pd.Series(["a", "b", None], dtype="string[python]").str.startswith("a")
0     True
1    False
2     <NA>
dtype: boolean

Here, this makes sense and doesn't pose any usability problems, because the resulting boolean dtype is also nullable.

But in the first two examples, where the resulting boolean dtype would be the numpy bool dtype, we fall back to object-dtype when missing values are present.
And this gives some usability issues for the result. For example, you can't use that result for boolean indexing:

>>> ser = pd.Series(["a", "b", None], dtype="string[pyarrow_numpy]")
>>> ser[ser.str.startswith("a")]
...
ValueError: Cannot mask with non-boolean array containing NA / NaN values
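For reference, a minimal sketch of the workaround available today: the predicate methods accept an `na` argument that sets the value used for missing entries, so passing `na=False` gives a plain bool mask that can be used for indexing. The sketch uses object dtype so it does not depend on a particular string dtype being available; the variable names are only illustrative.

```python
import pandas as pd

# Object-dtype strings with a missing value, as in the examples above.
ser = pd.Series(["a", "b", None], dtype="object")

# Explicitly fill missing entries with False instead of propagating them.
mask = ser.str.startswith("a", na=False)

# The mask contains no missing values, so boolean indexing works.
filtered = ser[mask]
print(mask.tolist())      # [True, False, False]
print(filtered.tolist())  # ['a']
```

The proposal in this issue would amount to making this the default for the NaN-based string dtype, so the explicit `na=False` is no longer needed.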

I know this has been long standing behaviour for the object dtype way of using strings (and I can't remember getting too many complaints about this?). But when making the change for 3.0 with the new default string dtype, I think we do have a chance to make this easier to work with, and ensure those methods always return a bool dtype (by using False instead of propagating NaN).
On the other hand, for the nullable versions of the string dtype, we probably want to keep the propagating behaviour, and so that would introduce a new inconsistency between the different string storage types (but, this is also an inconsistency that already exists for other cases, such as comparison operators like == propagating NA vs giving False).
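The existing inconsistency for comparison operators mentioned above can be illustrated with a short sketch, assuming only the object dtype and the "string[python]" dtype already used in the examples:

```python
import pandas as pd

values = ["a", "b", None]

# Object dtype: the missing value compares as False, result is numpy bool.
eq_object = pd.Series(values, dtype="object") == "a"

# Nullable string dtype: the missing value propagates as NA,
# result is the nullable boolean dtype.
eq_nullable = pd.Series(values, dtype="string[python]") == "a"

print(eq_object.tolist())          # [True, False, False]
print(eq_nullable.isna().tolist()) # [False, False, True]
```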

WillAyd · 2024-08-01T03:19:42Z

> I know this has been long standing behaviour for the object dtype way of using strings (and I can't remember getting too many complaints about this?). But when making the change for 3.0 with the new default string dtype, I think we do have a chance to make this easier to work with, and ensure those methods always return a bool dtype (by using False instead of propagating NaN).

To be honest I don't see the point in trying to change this behavior - we can go down a rabbit hole trying to optimize how this behavior works with NaN, but all the while the pd.NA variant works just fine. I think development effort would be better spent towards leveraging the latter and probably working towards PDEP-16 #58988

jorisvandenbossche · 2024-08-26T15:49:20Z

We discussed this briefly in a meeting at some point, but replying here explicitly as well.

> we can go down a rabbit hole trying to optimize how this behavior works with NaN, but all the while the pd.NA variant works just fine.

While the pd.NA variant works fine, that is because indexing with boolean dtype with NAs works fine. While indexing with object dtype with NaN does not work at all.

As a result, a typical use case for those predicate methods, filtering your dataframe or series, does not work with NaN:

>>> import numpy as np
>>> import pandas as pd

# works fine for NA-variant
>>> ser = pd.Series(["a", "b", None], dtype=pd.StringDtype(na_value=pd.NA))
>>> ser[ser.str.startswith("a")]
0    a
dtype: string

# does not work for NaN-variant currently
>>> ser = pd.Series(["a", "b", None], dtype=pd.StringDtype(na_value=np.nan))
>>> ser[ser.str.startswith("a")]
...
ValueError: Cannot mask with non-boolean array containing NA / NaN values

So we do indeed introduce yet another behaviour specifically for those string predicate methods: returning False for NaN in the input, instead of preserving NaN (object dtype) or NA (the NA-variant of the string dtype). But this ensures consistency with other predicate operations within the string dtype, and with other default dtypes. The PDEP words this as "String columns will follow NaN-semantics for missing values, where NaN gives False in boolean operations such as comparisons or predicates"; for example, ser == "a" already returns False for NaNs in ser. It also ensures consistency in full workflows like the example above.

On the point of development effort: the choice of which value is propagated is mostly centralized in _str_map, so the actual implementation change is relatively small (I do expect the bigger change will be updating the tests).

WillAyd · 2024-08-26T15:53:45Z

> While the pd.NA variant works fine, that is because indexing with boolean dtype with NAs works fine. While indexing with object dtype with NaN does not work at all.

This is another good motivating factor for PDEP-13 I think. If we had a logical boolean data type that was always returned from this method we wouldn't have to deal with the nuances of the object-backed data type

jorisvandenbossche · 2024-08-26T15:55:25Z

Yep, certainly!

Draft PR -> #59616

jorisvandenbossche · 2024-10-28T15:03:01Z

Closed by #59616

jorisvandenbossche added API Design Strings String extension data type and string data labels Aug 28, 2023

This was referenced Aug 28, 2023

TRACKER: new default String dtype (pyarrow-backed, numpy NaN semantics) #54792

Open

Implement Arrow String Array that is compatible with NumPy semantics #54533

Merged

jorisvandenbossche mentioned this issue May 7, 2024

PDEP-14: Dedicated string data type for pandas 3.0 #58551

Merged

jorisvandenbossche mentioned this issue Aug 26, 2024

String dtype: propagate NaNs as False in predicate methods (eg .str.startswith) #59616

Merged

4 tasks

jorisvandenbossche added this to the 2.3 milestone Aug 26, 2024

jorisvandenbossche closed this as completed Oct 28, 2024
