
String dtype: overview of breaking behaviour changes #59328

Open
Tracked by #54792
jorisvandenbossche opened this issue Jul 26, 2024 · 19 comments
Labels
API Design Strings String extension data type and string data

@jorisvandenbossche
Member

jorisvandenbossche commented Jul 26, 2024

In the context of the new default string dtype in 3.0 (#54792 / PDEP-14), currently enabled with pd.options.future.infer_string = True, there are a bunch of breaking changes that we will have to document.
In preparation for documenting, I want to use this issue to list all the behaviour changes we are aware of (or run into), and potentially to discuss whether we actually want some of those changes.

First, there are a few obvious breaking changes that are also mentioned in the PDEP (and that are the main goals of the change):

  • Constructors and IO methods will now infer string data as a str dtype, instead of using object dtype.
  • Code that checks the dtype (e.g. ser.dtype == object) assuming object dtype will break
  • The missing value sentinel is now always NaN, and for example no longer None (we still accept None as input, but it will be converted to NaN)

But additionally, there are some other, less obvious changes or secondary consequences (or changes that have long existed with the opt-in string dtype but will now become relevant for everyone).
I am starting to list some of them here (please add comments with other examples if you think of more).
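To make the dtype-checking bullet above concrete, a small sketch of a forward-compatible check (assuming pandas.api.types.is_string_dtype, which accepts both object-backed strings and the new str dtype):

```python
import pandas as pd
from pandas.api.types import is_string_dtype

ser = pd.Series(["foo", "bar"])

# Fragile: assumes object dtype, breaks once strings infer as "str"
legacy_check = ser.dtype == object

# Forward-compatible: True for object dtype and the new str dtype alike
robust_check = is_string_dtype(ser.dtype)
```

This way the same check passes regardless of whether the infer_string option is enabled.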

astype(str) preserving missing values (no longer converting NaN to a string "nan")

This is a long standing "bug" (or at least generally agreed undesirable behaviour), as discussed in #25353.
Currently something like pd.Series(["foo", np.nan]).astype(str) would essentially convert every element to a string, including the missing values:

>>> ser = pd.Series(["foo", np.nan], dtype=object)
>>> ser
0    foo
1    NaN
dtype: object
>>> ser.astype(str)
0    foo
1    nan
dtype: object
>>> ser.astype(str).values
array(['foo', 'nan'], dtype=object)

Generally we expect missing values to propagate in astype(). As a result of making str an alias for the new default string dtype (#59685), this will now follow a different code path, making use of the general StringDtype construction, which does preserve missing values:

>>> pd.options.future.infer_string = True
>>> ser = pd.Series(["foo", np.nan], dtype=object)
>>> ser.astype(str)
0    foo
1    NaN
dtype: str
>>> ser.astype(str).values
<StringArrayNumpySemantics>
['foo', nan]
Length: 2, dtype: str
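Code that relied on the old behaviour (missing values becoming the literal string "nan") can restore it explicitly. A sketch that behaves the same before and after the change, since fillna is a no-op once astype(str) has already stringified the NaN:

```python
import numpy as np
import pandas as pd

ser = pd.Series(["foo", np.nan], dtype=object)

# Old behaviour: astype(str) already yields the string "nan"; fillna is a no-op.
# New str dtype: astype(str) preserves NaN, and fillna then replaces it.
result = ser.astype(str).fillna("nan")
```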


Mixed dtype operations

Any code that previously relied on object dtype allowing mixed types will be affected where the initial data is now inferred as string dtype. Because the string dtype is strict about allowing only strings, certain workflows will no longer work (unless users explicitly keep using object dtype).

For example, setitem with a non string:

>>> ser = pd.Series(["a", "b"], dtype="object")
>>> ser[0] = 1
>>> ser
0    1
1    b
dtype: object

>>> pd.options.future.infer_string = True
>>> ser = pd.Series(["a", "b"])
>>> ser[0] = 1
...
TypeError: Scalar must be NA or str

The same happens if you try to fill a column of strings and missing values with a non-string:

>>> pd.Series(["a", None]).fillna(0)
...
TypeError: Invalid value '0' for dtype string

Update: the above is kept working with upcasting to object dtype (see #60296)
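Users who need mixed-type columns can opt out by requesting object dtype explicitly. A minimal sketch (this behaves the same with or without the new default, since object dtype accepts arbitrary Python objects):

```python
import pandas as pd

# Explicit object dtype keeps the permissive mixed-type behaviour
ser = pd.Series(["a", "b"], dtype=object)
ser[0] = 1  # allowed: object dtype accepts any Python object

filled = pd.Series(["a", None], dtype=object).fillna(0)
```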

Numeric aggregations

With object dtype strings, we do allow sum and prod in certain cases:

>>> pd.Series(["a", "b"], dtype="object").sum()
'ab'
>>> pd.Series(["a", "b"], dtype="string").sum()
...
TypeError: Cannot perform reduction 'sum' with string dtype

# prod works only with a single string (the other values may be missing)
>>> pd.Series(["a"], dtype="object").prod()
'a'
>>> pd.Series(["a"], dtype="string").prod()
...
TypeError: Cannot perform reduction 'prod' with string dtype

Based on the discussion below, we decided to keep sum() working (#59853 is adding that functionality to string dtype), but prod() is fine to start raising.

Note: due to a pyarrow implementation limitation, the sum result is limited to 2GB, see https://github.com/pandas-dev/pandas/pull/59853/files#r1794090618 (given this is the size of a single Python string, that seems very unlikely to be hit in practice)

For any()/all() (which does work for object dtype, but didn't for the already existing StringDtype), we decided to keep this working for now for the new default string dtype, see #51939, #54591

Invalid unicode input

>>> pd.options.future.infer_string = False
>>> pd.Series(["\ud83d"])
0    \ud83d
dtype: object

>>> pd.options.future.infer_string = True
>>> pd.Series(["\ud83d"])
...
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud83d' in position 0: surrogates not allowed

Users that want to keep the previous behaviour can explicitly specify dtype=object to keep working with object dtype.
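For completeness, a sketch of that workaround: a lone surrogate is a valid Python str object, so object dtype stores it without any UTF-8 encoding step:

```python
import pandas as pd

# Lone surrogate: valid as a Python str, but not encodable as UTF-8
ser = pd.Series(["\ud83d"], dtype=object)
```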

@harshmarke

I am willing to work on this task.

@jorisvandenbossche
Member Author

@harshmarke there is not (yet) something directly actionable in this issue to work on. This issue is for now just meant to keep track of and discuss changes that we will need to document at some later point.

@Dr-Irv
Contributor

Dr-Irv commented Aug 28, 2024

Numeric aggregations
With object dtype strings, we do allow sum and prod in certain cases:

During the dev call on 8/28, @jbrockmendel brought up this issue. @mroeschke, @jbrockmendel and I were slightly in favor of keeping the current behavior (allowing .sum() to work), but think that the whole core team should be asked.

@rhshadrach
Member

I am also in favor of allowing .sum() to work, but I think .prod() should always raise (even on a group of size 1).

@jorisvandenbossche
Member Author

One data point is that we have been disallowing this for the nullable StringDtype and ArrowDtype(string) for quite a while, and (as far as I am aware / could find) no one raised an issue about this.

On the other hand, we explicitly do allow addition between two string operands (like ser + "string") for string concatenation, and then it seems most consistent to also allow sum (for prod we don't have that analogy, because multiplication is not allowed for two string operands, only between string and int).
That sounds like a good reason to allow sum and disallow prod.

@rhshadrach
Member

I have personally had use cases where I wanted to summarize sequence data, e.g.

df = pd.DataFrame(
    {
        "group": ["A", "B", "A", "A", "B", "C"],
        "location": ["0", "1", "3", "4", "2", "5"],
    }
)
result = df.assign(location=df["location"] + ", ").groupby("group")["location"].sum().str[:-2]

print(result)
# group
# A    0, 3, 4
# B       1, 2
# C          5
# Name: location, dtype: object

@jorisvandenbossche
Member Author

Given the above feedback, let's add a sum() implementation for StringDtype (while doing that, I also noticed that the current support for summing strings in object dtype is limited, because once you have missing values it errors).
PR for that -> #59853

Sidenote: I think there might be room for a "specialized" string concatenation (reduction) method (a reduction variant of str.join()), where you can specify the separator to join the strings (this would be similar to string_agg in PostgreSQL: https://www.postgresql.org/docs/current/functions-aggregate.html).
Then df.assign(location=df["location"] + ", ").groupby("group")["location"].sum().str[:-2] could be simplified to something like df.groupby("group")["location"].str_concat(sep=",")
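Until such a method exists, a similar result can already be had by aggregating with the built-in str.join (a sketch reusing the example DataFrame from the earlier comment; str_concat itself is only a hypothetical name):

```python
import pandas as pd

df = pd.DataFrame(
    {
        "group": ["A", "B", "A", "A", "B", "C"],
        "location": ["0", "1", "3", "4", "2", "5"],
    }
)

# ", ".join is applied to each group's values in their original order
result = df.groupby("group")["location"].agg(", ".join)
```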

@jorisvandenbossche
Member Author

I also updated the top post to add a section about astype(str) preserving missing values (no longer converting NaN to a string "nan"), which is a consequence of making str an alias, so astype(str) casts to the StringDtype.
While a breaking change, this would resolve a long-standing undesired behaviour of astype(str), as discussed in #25353

@WillAyd
Member

WillAyd commented Sep 22, 2024

Sidenote: I think there might be room for a "specialized" string concatenation (reduction) method (reduction variant of str.join()), where you can specify the marker to join the strings (this would be similar as string_agg in PostgreSQL:

I like the idea of having a dedicated function for this; I think most users don't expect sum to concatenate strings, and more often than not it can be a huge performance hit

@rhshadrach
Member

rhshadrach commented Sep 23, 2024

I think most users don't expect sum to concatenate strings

Why do you think this? 'x' + 'y' gives you 'xy'. What do you think are users expecting?

It seems to me sum is well known for concatenating strings. Having to search the API for what the particular package calls their string concatenation function is an annoyance I have experienced, albeit a minor one. It does seem to me that we should have sum concatenate strings. However, I would be okay with a join method (perhaps a different name). I think this should be implemented in groupby, resample, and window as well.

@WillAyd
Member

WillAyd commented Sep 24, 2024

Why do you think this? 'x' + 'y' gives you 'xy'. What do you think are users expecting?

Python is not consistent in how this is handled. Using the built-in sum function will throw:

>>> "a" + "b" + "c"
'abc'
>>> sum(["a", "b", "c"])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unsupported operand type(s) for +: 'int' and 'str'

I agree that join makes the most sense, since that is also the idiomatic approach in Python to handle string concatenation of a sequence
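A quick illustration of that asymmetry in plain Python (no pandas involved):

```python
# The + operator concatenates strings...
assert "a" + "b" + "c" == "abc"

# ...but the built-in sum refuses them, even with a str start value,
# and its error message points to the idiomatic alternative
try:
    sum(["a", "b", "c"], "")
except TypeError as exc:
    message = str(exc)  # mentions using ''.join(seq) instead

joined = ", ".join(["a", "b", "c"])
```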

@rhshadrach
Member

rhshadrach commented Sep 24, 2024

Python is not consistent in how this is handled. Using the built-in sum operator will throw:

The reason Python does this is for performance: the generic implementation of sum is highly inefficient with strings. Since Python does implement + for strings, I do not think one can argue Python is averse to defining a sum of strings.

@simonjayhawkins
Member

The same happens if you try to fill a column of strings and missing values with a non-string:

>>> pd.Series(["a", None]).fillna(0)
...
TypeError: Invalid value '0' for dtype string

following #60296 this can now be removed?

@simonjayhawkins
Member

There may be some value in adding the breaking changes to the documentation (instead of tracking them here) so that we can link from the 2.3 release notes

@jorisvandenbossche
Member Author

following #60296 this can now be removed?

Good point, that indeed now works again (I would personally find that a good change, but it's something we could definitely do through a deprecation cycle, so no need to change it now)

There maybe some value to adding the breaking changes to the documentation (instead of tracking here)

Yes, see the first sentence of the top post, this issue is gathering the changes with the goal of documenting them

@simonjayhawkins
Member

following #60296 this can now be removed?

Good point, that indeed now works again (I would personally find that a good change, but it's something we could definitely do through a deprecation cycle, so no need to change it now)

yep. the comment was to update the OP not to deprecate/change anything

There maybe some value to adding the breaking changes to the documentation (instead of tracking here)

Yes, see the first sentence of the top post, this issue is gathering the changes with the goal of documenting them

so the milestone on this issue should be 2.3 and not 3.0?

@simonjayhawkins
Member

For any()/all() (which does work for object dtype, but didn't for the already existing StringDtype), we decided to keep this working for now for the new default string dtype, see #51939, #54591

for missing values, bool(None) is False, bool(np.nan) is True, and bool(pd.NA) raises a TypeError.

We ignore missing values in .any and .all by default, and for the object array we treat None as a missing value but don't coerce it to np.nan like the new default does.

so we will have the following change

>>> pd.options.future.infer_string = False
>>> pd.Series(["a", "b", None]).all()
True
>>> 
>>> all(pd.Series(["a", "b", None]))
False
>>> 
>>> pd.Series(["a", "b", None]).all(skipna=False)
False
>>> 
>>> pd.options.future.infer_string = True
>>> pd.Series(["a", "b", None]).all()
True
>>> 
>>> all(pd.Series(["a", "b", None]))
True
>>> 
>>> pd.Series(["a", "b", None]).all(skipna=False)
True
>>> 

is this too nuanced to include in the breaking-changes docs? or would it be included in the "The missing value sentinel is now always NaN, and for example no longer None (we still accept None as input, but it will be converted to NaN)" section?

@jorisvandenbossche
Member Author

jorisvandenbossche commented Nov 17, 2024

or would it be included in the "The missing value sentinel is now always NaN, and for example no longer None (we still accept None as input, but it will be converted to NaN)" section?

We could mention it there indeed, because this is not actually related to our .any()/.all() methods but to the any/all builtins, at which point it is indeed a consequence of using a different scalar sentinel (which has the unfortunate behaviour of bool(NaN) being True ..)
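The sentinel difference is easy to verify with the builtins alone (plain Python, no pandas):

```python
nan = float("nan")

# None is falsy, NaN is truthy -- so swapping the sentinel flips
# what the any/all builtins (which don't skip missing values) return
assert bool(None) is False
assert bool(nan) is True

assert all(["a", "b", None]) is False  # None short-circuits all()
assert all(["a", "b", nan]) is True    # NaN does not
```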

@jorisvandenbossche
Member Author

jorisvandenbossche commented Nov 17, 2024

because this is not actually related to our .any()/.all() methods, but to the any/all builtins,

Sorry, that was a bit too optimistic: also with None in the object dtype and using skipna=False, the all() method changes behaviour.
I.e. pd.Series(["a", "b", None]).all(skipna=False) changing from False to True is the behaviour change. I personally find that a strange change, but we have long documented that "If skipna is False, then NA are treated as True, because these are not equal to zero." (I would personally expect missing values to be treated as False.)

And also before, pd.Series(["a", "b", np.nan], dtype=object).all(skipna=False) with object dtype but with NaN instead of None already did give True (i.e. matching what the string dtype will do). And in many cases we did create object dtype data with NaNs (e.g in read_csv, although then read_parquet gives you None)

@jorisvandenbossche jorisvandenbossche modified the milestones: 3.0, 2.3 Nov 17, 2024