GH-43683: [Python] Use pandas StringDtype when enabled (pandas 3+) #44195
@github-actions crossbow submit test-conda-python-3.11-pandas-nightly-numpy-nightly
Revision: 56b61f2 Submitted crossbow builds: ursacomputing/crossbow @ actions-c85b742ef7
python/pyarrow/tests/test_pandas.py
e1 = pd.DataFrame(
    {'a': a_values},
    index=pd.RangeIndex(0, 8, step=2, name='qux'),
    columns=pd.Index(['a'], dtype=object)
Does the column type created with the dict argument differ from this?
This test specifically uses old metadata that records the dtype of the columns as object dtype, and pyarrow then tries to restore it that way.
The question is whether we should do that, though. Every file written from a pandas DataFrame before pandas 3.0 will have that metadata, so maybe we should specifically ignore object dtype here if the inferred type says the labels are all strings, so that users consistently get a columns Index object using str dtype.
Hmm that's tricky but I think going with the str data type as you suggested is better; I would expect that is a better UX in over 99% of instances
OK, changed this to ensure we actually use a str dtype columns Index object, even if the pandas metadata of the pyarrow table says that the original table was using object dtype.
This ensures that all existing files will use (with pandas >= 3) the default str dtype for the columns. The trade-off is that if you explicitly want to use object dtype with strings, this will no longer roundtrip (pandas -> pyarrow/parquet -> pandas).
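The rule described in this thread can be sketched in plain Python. This is a simplified model, not pyarrow's actual implementation: the function name `resolve_columns_dtype` and its parameters are hypothetical, and dtypes are represented as plain strings rather than real pandas dtypes.

```python
def resolve_columns_dtype(metadata_dtype, labels, infer_string=True):
    """Pick the dtype for the restored columns Index (hypothetical helper).

    metadata_dtype: dtype name recorded in the stored pandas metadata.
    labels: the column labels being restored.
    infer_string: whether the pandas default string dtype is enabled
        (assumed here to be the case with pandas >= 3).
    """
    all_strings = all(isinstance(label, str) for label in labels)
    if infer_string and metadata_dtype == "object" and all_strings:
        # Ignore the recorded object dtype: files written before
        # pandas 3.0 always carry it for string column labels.
        return "str"
    # Otherwise restore whatever the metadata recorded.
    return metadata_dtype


print(resolve_columns_dtype("object", ["a", "b"]))                # str
print(resolve_columns_dtype("object", ["a", 1]))                  # object
print(resolve_columns_dtype("object", ["a"], infer_string=False)) # object
```

In this model, a DataFrame whose columns were deliberately kept as object dtype with string labels is indistinguishable from an older file, which is the roundtrip trade-off mentioned above.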
@github-actions crossbow submit test-conda-python-3.11-pandas-nightly-numpy-nightly
Revision: 84b8234 Submitted crossbow builds: ursacomputing/crossbow @ actions-ac3103d3ba
@github-actions crossbow submit test-conda-python-3.11-pandas-nightly-numpy-nightly
Revision: e5db09f Submitted crossbow builds: ursacomputing/crossbow @ actions-3c389cd49e
@github-actions crossbow submit -g python
Revision: f9f960f Submitted crossbow builds: ursacomputing/crossbow @ actions-883577486f
Thanks for the PR @jorisvandenbossche . I'll try to further review (and try to understand better) once the upstream issue is fixed and CI is not failing :)
    pa.types.is_string(field.type)
    or pa.types.is_large_string(field.type)
    or pa.types.is_string_view(field.type)
) and field.name not in categories:
I am curious how categories were interpreted before inferring the new string type. Was this just not taken into account on the Arrow side?
If field.name in categories is true, that means the user asked to convert this column to a categorical dtype on the pandas side. This is handled on the C++ side by dictionary-encoding the column, so in this case we don't have to specify any custom pandas extension dtype here: our conversion layer will then convert that dictionary-encoded column to a pandas categorical.
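The condition under review can be modelled as follows. This is a minimal sketch that stands in for the pyarrow code rather than reproducing it: Arrow types are represented by their names as plain strings, and `fields_needing_string_dtype` is a hypothetical helper, not a pyarrow function.

```python
# Arrow type names treated as string-like in this simplified model.
STRING_LIKE = {"string", "large_string", "string_view"}


def fields_needing_string_dtype(schema, categories=()):
    """Return names of columns that should get the pandas string dtype.

    schema: list of (field_name, arrow_type_name) pairs.
    categories: column names the user asked to convert to categorical.
    """
    categories = set(categories)
    return [
        name
        for name, type_name in schema
        # Columns requested as categorical are skipped: they are
        # dictionary-encoded elsewhere and become pandas categoricals.
        if type_name in STRING_LIKE and name not in categories
    ]


schema = [("id", "int64"), ("name", "string"), ("tag", "large_string")]
print(fields_needing_string_dtype(schema, categories=["tag"]))  # ['name']
```

Without the `categories` argument, both `name` and `tag` would be selected, which mirrors why the check on `field.name not in categories` is needed.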
I added an xfail, so the CI should no longer be failing (but note that the nightly builds have had an unrelated failure for a while).
Co-authored-by: Raúl Cumplido <[email protected]>
@github-actions crossbow submit -g python
Revision: a7e5e34 Submitted crossbow builds: ursacomputing/crossbow @ actions-dfa36b7aca
With the latest run, the failing test is the dlpack one that is failing on
I don't think there was an issue opened for the dlpack error, so I've opened:
Rationale for this change
With pandas' PDEP-14 proposal, pandas is planning to introduce a default string dtype in pandas 3.0 (instead of the current object dtype).
This will become the default in pandas 3.0, and can be enabled with an option in the upcoming pandas 2.3 (pd.options.future.infer_string = True). To prepare for that, we should start using that string dtype in to_pandas() conversions when that option is enabled.
What changes are included in this PR?
to_pandas() calls use the default string dtype of pandas for string-like columns (string, large_string, string_view)
Are these changes tested?
It is tested in the pandas-nightly crossbow build.
There is still one failure that is because of a bug on the pandas side (pandas-dev/pandas#59879)
Are there any user-facing changes?