API: string dtype propagation of NaNs in predicate methods (eg .str.startswith) #54805
Comments
To be honest I don't see the point in trying to change this behavior - we can go down a rabbit hole trying to optimize how this behavior works with NaN, but all the while the pd.NA variant works just fine. I think development effort would be better spent towards leveraging the latter and probably working towards PDEP-16 #58988
We discussed this briefly in a meeting at some point, but replying here explicitly as well.
The pd.NA variant works fine only because indexing with a boolean dtype containing NAs works fine, whereas indexing with an object dtype containing NaN does not work at all. As a result, a typical use case for those predicate methods, filtering your DataFrame or Series, does not work with NaN:
# works fine for NA-variant
>>> ser = pd.Series(["a", "b", None], dtype=pd.StringDtype(na_value=pd.NA))
>>> ser[ser.str.startswith("a")]
0 a
dtype: string
# does not work for NaN-variant currently
>>> ser = pd.Series(["a", "b", None], dtype=pd.StringDtype(na_value=np.nan))
>>> ser[ser.str.startswith("a")]
...
ValueError: Cannot mask with non-boolean array containing NA / NaN values
So while we indeed introduce yet another behaviour specifically for those string predicate methods (returning False for NaN in the input, instead of preserving NaN for object dtype or NA for the NA-string dtype), this does ensure consistency with other predicate operations within the string dtype (and with other default dtypes; worded in the PDEP as "String columns will follow NaN-semantics for missing values, where NaN gives False in boolean operations such as comparisons or predicates", i.e.
On the point of development effort: this behaviour of which value is propagated is mostly centralized in
This is another good motivating factor for PDEP-13 I think. If we had a logical boolean data type that was always returned from this method we wouldn't have to deal with the nuances of the object-backed data type
Yep, certainly! Draft PR -> #59616
Closed by #59616
From #54533 (comment) (and relevant for new String dtype #54792)
Currently, when you have a string column with missing values and call one of the string methods that return a boolean series (such as .str.startswith(..)), the NaN or None values are preserved, and the result is an object-dtype series containing a mix of True/False and NaN/None. This is true for the current default object dtype with strings, but also for the specific StringDtype:
This behaviour is also present when using the "nullable" version of StringDtype (with string_storage of "python" or "pyarrow") or using the ArrowDtype("string"):
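As a rough illustration of the two cases described above (not the snippets from the original post), the following sketch compares an object-dtype series with the nullable string dtype. It assumes the "string" alias selects the NA-variant of StringDtype; the variable names are made up for this example, and exact behaviour may differ between pandas versions.

```python
import pandas as pd

# Object-dtype strings: the missing value is carried through to the result,
# so the result cannot be a plain numpy bool array and falls back to object.
obj = pd.Series(["a", "b", None], dtype=object)
obj_result = obj.str.startswith("a")
print(obj_result.dtype)  # object

# Nullable string dtype (assumed here to be what the "string" alias selects):
# the missing value propagates as NA inside a nullable boolean result.
na_ser = pd.Series(["a", "b", None], dtype="string")
na_result = na_ser.str.startswith("a")
print(na_result.dtype)  # boolean
```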
Here, this makes sense and doesn't pose any usability problems, because the resulting boolean dtype is also nullable.
But in the first two examples, where the resulting boolean dtype would be the numpy bool dtype, we fall back to object-dtype when missing values are present.
And this gives some usability issues for the result. For example, you can't use that result for boolean indexing:
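A minimal sketch of that failure with an object-dtype series, together with the existing `na=False` argument of `.str.startswith` as a workaround. This is an illustration written for this text, not the example from the original post, and the variable names are hypothetical.

```python
import pandas as pd

ser = pd.Series(["a", "b", None], dtype=object)

# The mask is object dtype because the missing value is preserved.
mask = ser.str.startswith("a")

error_message = None
try:
    ser[mask]
except ValueError as exc:
    # pandas refuses to index with a mask that contains missing values.
    error_message = str(exc)
    print(error_message)

# Asking the predicate to report False for missing values yields a real
# bool mask, which can be used for filtering.
filled_mask = ser.str.startswith("a", na=False)
filtered = ser[filled_mask]
print(filtered.tolist())
```

Passing `na=False` at every call site gives the same result that the proposal would make the default for the new string dtype.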
I know this has been long-standing behaviour for the object dtype way of using strings (and I can't remember getting too many complaints about this?). But when making the change for 3.0 with the new default string dtype, I think we do have a chance to make this easier to work with, and ensure those methods always return a bool dtype (by using False instead of propagating NaN).
On the other hand, for the nullable versions of the string dtype, we probably want to keep the propagating behaviour, and so that would introduce a new inconsistency between the different string storage types (but this is also an inconsistency that already exists for other cases, such as comparison operators like == propagating NA vs giving False).