Parquet metadata persistence of DataFrame.attrs #54346
Conversation
On one hand this is not ideal, as the round trip is not complete. On the other hand, the current implementation already converts non-string column names to strings when writing a DataFrame to a parquet file, so this is consistent with existing behaviour. Similar to the existing note that duplicate column names and non-string column names are not supported, we just need to add a note saying that non-string attrs keys are not supported and will be converted to strings.
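The lossy step being discussed can be reproduced with plain JSON encoding, which (as an assumption here) is how the attrs dict ends up serialized into parquet metadata: JSON object keys must be strings, so int keys are stringified and the round trip does not restore them.

```python
import json

# attrs dict with a non-string key
attrs = {1: "one", "unit": "meters"}

# JSON object keys must be strings, so the int key 1 becomes "1"
encoded = json.dumps(attrs)
print(encoded)              # {"1": "one", "unit": "meters"}
print(json.loads(encoded))  # {'1': 'one', 'unit': 'meters'}
```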
Oh yeah. Can't really think of another approach other than typecasting keys back to int if possible, but that doesn't seem feasible.
@SanjithChockan I think it would be fine. The attrs keys are not the only thing that gets converted to strings when saving to a parquet file.
I'll fix the failing checks and wait for someone else to review to see if this approach is okay.
doc/source/whatsnew/v2.1.0.rst
Outdated
@@ -176,8 +176,8 @@ Other enhancements
- Performance improvement in :func:`concat` with homogeneous ``np.float64`` or ``np.float32`` dtypes (:issue:`52685`)
- Performance improvement in :meth:`DataFrame.filter` when ``items`` is given (:issue:`52941`)
- Reductions :meth:`Series.argmax`, :meth:`Series.argmin`, :meth:`Series.idxmax`, :meth:`Series.idxmin`, :meth:`Index.argmax`, :meth:`Index.argmin`, :meth:`DataFrame.idxmax`, :meth:`DataFrame.idxmin` are now supported for object-dtype objects (:issue:`4279`, :issue:`18021`, :issue:`40685`, :issue:`43697`)
- Added ``PANDAS_ATTRS`` to :attr:`Schema.metadata` in :class:`PyArrowImpl` for parquet metadata persistence using pyarrow engine (:issue:`54346`)
Suggested change:
- Added ``PANDAS_ATTRS`` to :attr:`Schema.metadata` in :class:`PyArrowImpl` for parquet metadata persistence using pyarrow engine (:issue:`54346`)
+ :meth:`DataFrame.to_parquet` and :func:`read_parquet` will now write and read ``attrs`` respectively (:issue:`54346`)
done!
Thanks @SanjithChockan
fastparquet here - would have appreciated at least some notification of this.
Ah, sorry @martindurant. The original request mentioned pyarrow, so fastparquet slipped through the review. Happy to have PRs to add this to the fastparquet engine too!
@mroeschke, does this PR include the ability to read/write column-level metadata from
It's also a data sharing requirement of FAIR Data Principles. |
I believe not, no. This PR only supports attrs from |
DataFrame.attrs seems not to be attached to the pyarrow schema metadata, as attrs is experimental. Added it to the pyarrow table (schema.metadata) so it persists when the parquet file is read back. One issue I am facing: the DataFrame.attrs dictionary can't have ints as keys, because encoding it for pyarrow converts them to strings, but I'm not sure if this is a problem.
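As a rough illustration of the approach described above (a simplified sketch, not the actual pandas internals; the ``PANDAS_ATTRS`` key name comes from this PR, while the helper functions and the plain-dict stand-in for ``schema.metadata`` are hypothetical), the attrs dict is JSON-encoded and stashed under a metadata key on write, then decoded on read:

```python
import json

PANDAS_ATTRS = b"PANDAS_ATTRS"

def encode_attrs(attrs: dict) -> dict:
    # Serialize attrs to JSON bytes for the metadata mapping; non-string
    # keys are stringified here, which is the lossy step discussed above.
    return {PANDAS_ATTRS: json.dumps(attrs).encode("utf-8")}

def decode_attrs(metadata: dict) -> dict:
    # Restore attrs from the metadata mapping, if the key is present.
    raw = metadata.get(PANDAS_ATTRS)
    return json.loads(raw) if raw is not None else {}

metadata = encode_attrs({"source": "sensor-7", 1: "stringified"})
print(decode_attrs(metadata))  # {'source': 'sensor-7', '1': 'stringified'}
```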