
String dtype: overview of breaking behaviour changes #59328

Open
Tracked by #54792
jorisvandenbossche opened this issue Jul 26, 2024 · 19 comments
Labels
API Design Strings String extension data type and string data

@jorisvandenbossche
Member

jorisvandenbossche commented Jul 26, 2024

In the context of the new default string dtype in 3.0 (#54792 / PDEP-14), currently enabled with pd.options.future.infer_string = True, there are a bunch of breaking changes that we will have to document.
In preparation for documenting, I want to use this issue to list all the behaviour changes we are aware of (or run into), and potentially to discuss whether we actually want some of those changes.

First, there are a few obvious breaking changes that are also mentioned in the PDEP (and that are the main goals of the change):

  • Constructors and IO methods will now infer string data as a str dtype, instead of using object dtype.
  • Code that checks the dtype (e.g. ser.dtype == object) assuming object dtype will break
  • The missing value sentinel is now always NaN, and for example no longer None (we still accept None as input, but it will be converted to NaN)

But additionally, there are some other, less obvious changes or secondary consequences (or changes that have long existed with the opt-in string dtype but will now become relevant for everyone).
I am starting to list some of them here (please add comments with other examples if you think of more).
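To make the dtype-checking bullet above concrete, a small sketch of a forward-compatible check (assuming pandas.api.types.is_string_dtype, which accepts both object-backed strings and the new str dtype):

```python
import pandas as pd
from pandas.api.types import is_string_dtype

ser = pd.Series(["foo", "bar"])

# Fragile: assumes object dtype, breaks once strings infer as "str"
legacy_check = ser.dtype == object

# Forward-compatible: True for object dtype and the new str dtype alike
robust_check = is_string_dtype(ser.dtype)
```

This way the same check passes regardless of whether the infer_string option is enabled.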

astype(str) preserving missing values (no longer converting NaN to a string "nan")

This is a long standing "bug" (or at least generally agreed undesirable behaviour), as discussed in #25353.
Currently something like pd.Series(["foo", np.nan]).astype(str) would essentially convert every element to a string, including the missing values:

>>> ser = pd.Series(["foo", np.nan], dtype=object)
>>> ser
0    foo
1    NaN
dtype: object
>>> ser.astype(str)
0    foo
1    nan
dtype: object
>>> ser.astype(str).values
array(['foo', 'nan'], dtype=object)

Generally we expect missing values to propagate in astype(). As a result of making str an alias for the new default string dtype (#59685), this will now follow a different code path, making use of the general StringDtype construction, which does preserve missing values:

>>> pd.options.future.infer_string = True
>>> ser = pd.Series(["foo", np.nan], dtype=object)
>>> ser.astype(str)
0    foo
1    NaN
dtype: str
>>> ser.astype(str).values
<StringArrayNumpySemantics>
['foo', nan]
Length: 2, dtype: str
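Code that relied on the old behaviour (missing values becoming the literal string "nan") can restore it explicitly. A sketch that behaves the same before and after the change, since fillna is a no-op once astype(str) has already stringified the NaN:

```python
import numpy as np
import pandas as pd

ser = pd.Series(["foo", np.nan], dtype=object)

# Old behaviour: astype(str) already yields the string "nan"; fillna is a no-op.
# New str dtype: astype(str) preserves NaN, and fillna then replaces it.
result = ser.astype(str).fillna("nan")
```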


Mixed dtype operations

Any code that previously relied on object dtype allowing mixed types will be affected where the initial data is now inferred as string dtype. Because the string dtype is strict about allowing only strings, certain workflows will no longer work (unless users explicitly keep using object dtype).

For example, setitem with a non string:

>>> ser = pd.Series(["a", "b"], dtype="object")
>>> ser[0] = 1
>>> ser
0    1
1    b
dtype: object

>>> pd.options.future.infer_string = True
>>> ser = pd.Series(["a", "b"])
>>> ser[0] = 1
...
TypeError: Scalar must be NA or str

The same happens if you try to fill a column of strings and missing values with a non-string:

>>> pd.Series(["a", None]).fillna(0)
...
TypeError: Invalid value '0' for dtype string

Update: the above is kept working with upcasting to object dtype (see #60296)
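Users who need mixed-type columns can opt out by requesting object dtype explicitly. A minimal sketch (this behaves the same with or without the new default, since object dtype accepts arbitrary Python objects):

```python
import pandas as pd

# Explicit object dtype keeps the permissive mixed-type behaviour
ser = pd.Series(["a", "b"], dtype=object)
ser[0] = 1  # allowed: object dtype accepts any Python object

filled = pd.Series(["a", None], dtype=object).fillna(0)
```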

Numeric aggregations

With object dtype strings, we do allow sum and prod in certain cases:

>>> pd.Series(["a", "b"], dtype="object").sum()
'ab'
>>> pd.Series(["a", "b"], dtype="string").sum()
...
TypeError: Cannot perform reduction 'sum' with string dtype

# prod works only with a single string (the other values may be missing)
>>> pd.Series(["a"], dtype="object").prod()
'a'
>>> pd.Series(["a"], dtype="string").prod()
...
TypeError: Cannot perform reduction 'prod' with string dtype

Based on the discussion below, we decided to keep sum() working (#59853 is adding that functionality to string dtype), but prod() is fine to start raising.

Note: due to a pyarrow implementation limitation, the sum result is limited to 2GB, see https://github.com/pandas-dev/pandas/pull/59853/files#r1794090618 (given this is the size of a single Python string, that seems very unlikely to be hit in practice)

For any()/all() (which does work for object dtype, but didn't for the already existing StringDtype), we decided to keep this working for now for the new default string dtype, see #51939, #54591

Invalid unicode input

>>> pd.options.future.infer_string = False
>>> pd.Series(["\ud83d"])
0    \ud83d
dtype: object

>>> pd.options.future.infer_string = True
>>> pd.Series(["\ud83d"])
...
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud83d' in position 0: surrogates not allowed

Users that want to keep the previous behaviour can explicitly specify dtype=object to keep working with object dtype.
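For completeness, a sketch of that workaround: a lone surrogate is a valid Python str object, so object dtype stores it without any UTF-8 encoding step:

```python
import pandas as pd

# Lone surrogate: valid as a Python str, but not encodable as UTF-8
ser = pd.Series(["\ud83d"], dtype=object)
```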

@harshmarke

I am willing to work on this task.

@jorisvandenbossche
Member Author

@harshmarke there is not (yet) something directly actionable in this issue to work on. This issue is for now just meant to keep track of and discuss changes that we will need to document at some later point.

@Dr-Irv
Contributor

Dr-Irv commented Aug 28, 2024

Numeric aggregations
With object dtype strings, we do allow sum and prod in certain cases:

During the dev call on 8/28, @jbrockmendel brought up this issue. @mroeschke, @jbrockmendel and I were slightly in favor of keeping the current behavior (allowing .sum() to work), but think that the whole core team should be asked.

@rhshadrach
Member

I am also in favor of allowing .sum() to work, but I think .prod() should always raise (even on a group of size 1).

@jorisvandenbossche
Member Author

One data point is that we have been disallowing this for the nullable StringDtype and ArrowDtype(string) for quite a while, and (as far as I am aware / could find) no one raised an issue about this.

On the other hand, we explicitly do allow addition between two string operands (like ser + "string") for string concatenation, and then it seems most consistent to also allow sum (for prod we don't have that analogy, because multiplication is not allowed for two string operands, only between string and int).
That sounds like a good reason to allow sum and disallow prod.

@rhshadrach
Member

I have personally had use cases where I wanted to summarize sequence data, e.g.

df = pd.DataFrame(
    {
        "group": ["A", "B", "A", "A", "B", "C"],
        "location": ["0", "1", "3", "4", "2", "5"],
    }
)
result = df.assign(location=df["location"] + ", ").groupby("group")["location"].sum().str[:-2]

print(result)
# group
# A    0, 3, 4
# B       1, 2
# C          5
# Name: location, dtype: object

@jorisvandenbossche
Member Author

Given the above feedback, let's add a sum() implementation for StringDtype (while doing that, I also noticed that the current support for summing strings in object dtype is limited, because once you have missing values it errors).
PR for that -> #59853

Sidenote: I think there might be room for a "specialized" string concatenation (reduction) method (a reduction variant of str.join()), where you can specify the separator to join the strings (this would be similar to string_agg in PostgreSQL: https://www.postgresql.org/docs/current/functions-aggregate.html).
Then df.assign(location=df["location"] + ", ").groupby("group")["location"].sum().str[:-2] could be simplified to something like df.groupby("group")["location"].str_concat(sep=",")
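Until such a method exists, a similar result can already be had by aggregating with the built-in str.join (a sketch reusing the example DataFrame from the earlier comment; str_concat itself is only a hypothetical name):

```python
import pandas as pd

df = pd.DataFrame(
    {
        "group": ["A", "B", "A", "A", "B", "C"],
        "location": ["0", "1", "3", "4", "2", "5"],
    }
)

# ", ".join is applied to each group's values in their original order
result = df.groupby("group")["location"].agg(", ".join)
```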

@jorisvandenbossche
Member Author

I also updated the top post to add a section about astype(str) preserving missing values (no longer converting NaN to a string "nan"), which is a consequence of making str an alias, so astype(str) casts to the StringDtype.
While a breaking change, this would resolve a long-standing undesired behaviour of astype(str), as discussed in #25353

@WillAyd
Member

WillAyd commented Sep 22, 2024

Sidenote: I think there might be room for a "specialized" string concatenation (reduction) method (reduction variant of str.join()), where you can specify the marker to join the strings (this would be similar as string_agg in PostgreSQL:

I like the idea of having a dedicated function for this; I think most users don't expect sum to concatenate strings, and more often than not it can be a huge performance hit

@rhshadrach
Member

rhshadrach commented Sep 23, 2024

I think most users don't expect sum to concatenate strings

Why do you think this? 'x' + 'y' gives you 'xy'. What do you think are users expecting?

It seems to me sum is well known for concatenating strings. Having to search the API for what the particular package calls their string concatenation function is an annoyance I have experienced, albeit a minor one. It does seem to me that we should have sum concatenate strings. However, I would be okay with a join method (perhaps a different name). I think this should be implemented in groupby, resample, and window as well.

@WillAyd
Member

WillAyd commented Sep 24, 2024

Why do you think this? 'x' + 'y' gives you 'xy'. What do you think are users expecting?

Python is not consistent in how this is handled. Using the built-in sum function will throw:

>>> "a" + "b" + "c"
'abc'
>>> sum(["a", "b", "c"])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unsupported operand type(s) for +: 'int' and 'str'

I agree that join makes the most sense, since that is also the idiomatic approach in Python to handle string concatenation of a sequence
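A quick illustration of that asymmetry in plain Python (no pandas involved):

```python
# The + operator concatenates strings...
assert "a" + "b" + "c" == "abc"

# ...but the built-in sum refuses them, even with a str start value,
# and its error message points to the idiomatic alternative
try:
    sum(["a", "b", "c"], "")
except TypeError as exc:
    message = str(exc)  # mentions using ''.join(seq) instead

joined = ", ".join(["a", "b", "c"])
```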

@rhshadrach
Member

rhshadrach commented Sep 24, 2024

Python is not consistent in how this is handled. Using the built-in sum operator will throw:

The reason Python does this is for performance: the generic implementation of sum is highly inefficient with strings. Since Python does implement + for strings, I do not think one can argue Python is averse to defining a sum of strings.

@simonjayhawkins
Member

The same happens if you try to fill a column of strings and missing values with a non-string:

>>> pd.Series(["a", None]).fillna(0)
...
TypeError: Invalid value '0' for dtype string

following #60296 this can now be removed?

@simonjayhawkins
Member

There may be some value in adding the breaking changes to the documentation (instead of tracking them here) so that we can link from the 2.3 release notes

@jorisvandenbossche
Member Author

following #60296 this can now be removed?

Good point, that indeed now works again (I would personally find that a good change, but it's something we could definitely do through a deprecation cycle, so no need to change it now)

There maybe some value to adding the breaking changes to the documentation (instead of tracking here)

Yes, see the first sentence of the top post, this issue is gathering the changes with the goal of documenting them

@simonjayhawkins
Member

following #60296 this can now be removed?

Good point, that indeed now works again (I would personally find that a good change, but it's something we could definitely do through a deprecation cycle, so no need to change it now)

yep. the comment was to update the OP not to deprecate/change anything

There maybe some value to adding the breaking changes to the documentation (instead of tracking here)

Yes, see the first sentence of the top post, this issue is gathering the changes with the goal of documenting them

so the milestone on this issue should be 2.3 and not 3.0?

@simonjayhawkins
Member

For any()/all() (which does work for object dtype, but didn't for the already existing StringDtype), we decided to keep this working for now for the new default string dtype, see #51939, #54591

for missing values, bool(None) is False, bool(np.nan) is True, and bool(pd.NA) raises a TypeError.

We ignore missing values in .any and .all by default, and for the object array we treat None as a missing value but don't coerce it to np.nan like the new default does.

so we will have the following change

>>> pd.options.future.infer_string = False
>>> pd.Series(["a", "b", None]).all()
True
>>> 
>>> all(pd.Series(["a", "b", None]))
False
>>> 
>>> pd.Series(["a", "b", None]).all(skipna=False)
False
>>> 
>>> pd.options.future.infer_string = True
>>> pd.Series(["a", "b", None]).all()
True
>>> 
>>> all(pd.Series(["a", "b", None]))
True
>>> 
>>> pd.Series(["a", "b", None]).all(skipna=False)
True
>>> 

is this too nuanced to include in the breaking-changes docs? or would it be included in the "The missing value sentinel is now always NaN, and for example no longer None (we still accept None as input, but it will be converted to NaN)" section?

@jorisvandenbossche
Member Author

jorisvandenbossche commented Nov 17, 2024

or would it be included in the "The missing value sentinel is now always NaN, and for example no longer None (we still accept None as input, but it will be converted to NaN)" section?

We could mention it there indeed, because this is not actually related to our .any()/.all() methods but to the any/all builtins, at which point it is indeed a consequence of using a different scalar sentinel (which has the unfortunate behaviour of bool(NaN) being True ..)
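The sentinel difference is easy to verify with the builtins alone (plain Python, no pandas):

```python
nan = float("nan")

# None is falsy, NaN is truthy -- so swapping the sentinel flips
# what the any/all builtins (which don't skip missing values) return
assert bool(None) is False
assert bool(nan) is True

assert all(["a", "b", None]) is False  # None short-circuits all()
assert all(["a", "b", nan]) is True    # NaN does not
```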

@jorisvandenbossche
Member Author

jorisvandenbossche commented Nov 17, 2024

because this is not actually related to our .any()/.all() methods, but to the any/all builtins,

Sorry, that was a bit too optimistic: also with None in the object dtype and using skipna=False, the all() method changes behaviour.
I.e. pd.Series(["a", "b", None]).all(skipna=False) changing from False to True is the behaviour change. I personally find that a strange change, but we have long documented that "If skipna is False, then NA are treated as True, because these are not equal to zero." (I would personally expect missing values to be treated as False.)

And also before, pd.Series(["a", "b", np.nan], dtype=object).all(skipna=False) with object dtype but with NaN instead of None already did give True (i.e. matching what the string dtype will do). And in many cases we did create object dtype data with NaNs (e.g in read_csv, although then read_parquet gives you None)

@jorisvandenbossche jorisvandenbossche modified the milestones: 3.0, 2.3 Nov 17, 2024