Parquet metadata persistence of DataFrame.attrs #54346
Conversation
On one hand this is not ideal, as the round trip is not complete. On the other hand, the current implementation already converts non-string column names to strings when writing a DataFrame to a parquet file, so this is consistent with existing behaviour. Similar to the existing note that duplicate column names and non-string column names are not supported, we just need to add a note saying that non-string attrs keys are not supported and will be converted to strings.
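The lossy step being discussed can be reproduced with plain JSON encoding, which (as an assumption here) is how the attrs dict ends up serialized into parquet metadata: JSON object keys must be strings, so int keys are stringified and the round trip does not restore them.

```python
import json

# attrs dict with a non-string key
attrs = {1: "one", "unit": "meters"}

# JSON object keys must be strings, so the int key 1 becomes "1"
encoded = json.dumps(attrs)
print(encoded)              # {"1": "one", "unit": "meters"}
print(json.loads(encoded))  # {'1': 'one', 'unit': 'meters'}
```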
Oh yeah. Can't really think of another approach other than typecasting keys back to int if possible, but that doesn't seem feasible.
@SanjithChockan I think it would be fine. The attrs keys are not the only thing that gets converted to strings when saving to a parquet file.
I'll fix the failing checks and wait for someone else to review to see if this approach is okay.
doc/source/whatsnew/v2.1.0.rst
Outdated
@@ -176,8 +176,8 @@ Other enhancements
- Performance improvement in :func:`concat` with homogeneous ``np.float64`` or ``np.float32`` dtypes (:issue:`52685`)
- Performance improvement in :meth:`DataFrame.filter` when ``items`` is given (:issue:`52941`)
- Reductions :meth:`Series.argmax`, :meth:`Series.argmin`, :meth:`Series.idxmax`, :meth:`Series.idxmin`, :meth:`Index.argmax`, :meth:`Index.argmin`, :meth:`DataFrame.idxmax`, :meth:`DataFrame.idxmin` are now supported for object-dtype objects (:issue:`4279`, :issue:`18021`, :issue:`40685`, :issue:`43697`)
- Added ``PANDAS_ATTRS`` to :attr:`Schema.metadata` in :class:`PyArrowImpl` for parquet metadata persistence using pyarrow engine (:issue:`54346`)
Suggested change:
- Added ``PANDAS_ATTRS`` to :attr:`Schema.metadata` in :class:`PyArrowImpl` for parquet metadata persistence using pyarrow engine (:issue:`54346`)
+ :meth:`DataFrame.to_parquet` and :func:`read_parquet` will now write and read ``attrs`` respectively (:issue:`54346`)
done!
Thanks @SanjithChockan
fastparquet here - would have appreciated at least some notification of this.
Ah, sorry @martindurant. The original request mentioned pyarrow, so fastparquet slipped through the review. Happy to have PRs to add this to the fastparquet engine too!
@mroeschke, does this PR include the ability to read/write column-level metadata from
It's also a data sharing requirement of FAIR Data Principles. |
I believe not, no. This PR only supports attrs from |
DataFrame.attrs seems not to be attached to the pyarrow schema metadata, as attrs is experimental. Added it to the pyarrow table (schema.metadata) so it persists when the parquet file is read back. One issue I am facing: the DataFrame.attrs dictionary can't have ints as keys, because encoding it for pyarrow converts them to strings, but I'm not sure if this is a problem.
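As a rough illustration of the approach described above (a simplified sketch, not the actual pandas internals; the ``PANDAS_ATTRS`` key name comes from this PR, while the helper functions and the plain-dict stand-in for ``schema.metadata`` are hypothetical), the attrs dict is JSON-encoded and stashed under a metadata key on write, then decoded on read:

```python
import json

PANDAS_ATTRS = b"PANDAS_ATTRS"

def encode_attrs(attrs: dict) -> dict:
    # Serialize attrs to JSON bytes for the metadata mapping; non-string
    # keys are stringified here, which is the lossy step discussed above.
    return {PANDAS_ATTRS: json.dumps(attrs).encode("utf-8")}

def decode_attrs(metadata: dict) -> dict:
    # Restore attrs from the metadata mapping, if the key is present.
    raw = metadata.get(PANDAS_ATTRS)
    return json.loads(raw) if raw is not None else {}

metadata = encode_attrs({"source": "sensor-7", 1: "stringified"})
print(decode_attrs(metadata))  # {'source': 'sensor-7', '1': 'stringified'}
```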