
ENH: Expose to_pandas_kwargs in read_parquet for pyarrow engine #49236

Closed
1 of 3 tasks
TomAugspurger opened this issue Oct 21, 2022 · 7 comments · Fixed by #59654
Labels: Arrow (pyarrow functionality), Enhancement, good first issue, IO Parquet (parquet, feather)

@TomAugspurger (Contributor) commented Oct 21, 2022

Feature Type

  • [x] Adding new functionality to pandas

  • [ ] Changing existing functionality in pandas

  • [ ] Removing existing functionality in pandas

Problem Description

I want to read a parquet file but have control over how the pyarrow Table is converted to a pandas DataFrame, by specifying a to_pandas_kwargs argument that read_parquet forwards to Table.to_pandas().

import pyarrow as pa
import pyarrow.parquet as pq
import datetime

# write
arr = pa.array([datetime.datetime(1600, 1, 1)], type=pa.timestamp("us"))
table = pa.table([arr], names=["timestamp"])
pq.write_table(table, "test.parquet")

# read
import pandas as pd
pd.read_parquet("test.parquet")

That raises with:

---------------------------------------------------------------------------
ArrowInvalid                              Traceback (most recent call last)
Cell In [43], line 10
      7 pq.write_table(table, "test.parquet")
      9 import pandas as pd
---> 10 pd.read_parquet("test.parquet")

File /srv/conda/envs/notebook/lib/python3.10/site-packages/pandas/io/parquet.py:501, in read_parquet(path, engine, columns, storage_options, use_nullable_dtypes, **kwargs)
    454 """
    455 Load a parquet object from the file path, returning a DataFrame.
    456 
   (...)
    497 DataFrame
    498 """
    499 impl = get_engine(engine)
--> 501 return impl.read(
    502     path,
    503     columns=columns,
    504     storage_options=storage_options,
    505     use_nullable_dtypes=use_nullable_dtypes,
    506     **kwargs,
    507 )

File /srv/conda/envs/notebook/lib/python3.10/site-packages/pandas/io/parquet.py:249, in PyArrowImpl.read(self, path, columns, use_nullable_dtypes, storage_options, **kwargs)
    242 path_or_handle, handles, kwargs["filesystem"] = _get_path_or_handle(
    243     path,
    244     kwargs.pop("filesystem", None),
    245     storage_options=storage_options,
    246     mode="rb",
    247 )
    248 try:
--> 249     result = self.api.parquet.read_table(
    250         path_or_handle, columns=columns, **kwargs
    251     ).to_pandas(**to_pandas_kwargs)
    252     if manager == "array":
    253         result = result._as_manager("array", copy=False)

File /srv/conda/envs/notebook/lib/python3.10/site-packages/pyarrow/array.pxi:823, in pyarrow.lib._PandasConvertible.to_pandas()

File /srv/conda/envs/notebook/lib/python3.10/site-packages/pyarrow/table.pxi:3913, in pyarrow.lib.Table._to_pandas()

File /srv/conda/envs/notebook/lib/python3.10/site-packages/pyarrow/pandas_compat.py:818, in table_to_blockmanager(options, table, categories, ignore_metadata, types_mapper)
    816 _check_data_column_metadata_consistency(all_columns)
    817 columns = _deserialize_column_index(table, all_columns, column_indexes)
--> 818 blocks = _table_to_blocks(options, table, categories, ext_columns_dtypes)
    820 axes = [columns, index]
    821 return BlockManager(blocks, axes)

File /srv/conda/envs/notebook/lib/python3.10/site-packages/pyarrow/pandas_compat.py:1168, in _table_to_blocks(options, block_table, categories, extension_columns)
   1163 def _table_to_blocks(options, block_table, categories, extension_columns):
   1164     # Part of table_to_blockmanager
   1165 
   1166     # Convert an arrow table to Block from the internal pandas API
   1167     columns = block_table.column_names
-> 1168     result = pa.lib.table_to_blocks(options, block_table, categories,
   1169                                     list(extension_columns.keys()))
   1170     return [_reconstruct_block(item, columns, extension_columns)
   1171             for item in result]

File /srv/conda/envs/notebook/lib/python3.10/site-packages/pyarrow/table.pxi:2602, in pyarrow.lib.table_to_blocks()

File /srv/conda/envs/notebook/lib/python3.10/site-packages/pyarrow/error.pxi:100, in pyarrow.lib.check_status()

ArrowInvalid: Casting from timestamp[us] to timestamp[ns] would result in out of bounds timestamp: -11676096000000000

The solution, in pyarrow, is to pass timestamp_as_object=True in the call to .to_pandas().

Feature Description

Add a new parameter to read_parquet (technically only for the pyarrow engine, but documented on read_parquet itself):

pd.read_parquet(
    path: 'FilePath | ReadBuffer[bytes]',
    engine: 'str' = 'auto',
    columns: 'list[str] | None' = None,
    storage_options: 'StorageOptions' = None,
    use_nullable_dtypes: 'bool' = False,
    to_pandas_kwargs: 'dict[str, Any] | None' = None,
    **kwargs,
) -> 'DataFrame'
    """
    to_pandas_kwargs:
        Additional keyword arguments passed to :meth:`pyarrow.Table.to_pandas` to control
        how the pyarrow Table is converted to a pandas DataFrame. By default,
        the `use_nullable_dtypes` option controls whether the `types_mapper` argument
        is set.
    """

Alternative Solutions

Just use pyarrow :)

Additional Context

No response

@TomAugspurger TomAugspurger added Enhancement IO Parquet parquet, feather Needs Triage Issue that has not been reviewed by a pandas team member Arrow pyarrow functionality labels Oct 21, 2022
@lithomas1 lithomas1 removed the Needs Triage Issue that has not been reviewed by a pandas team member label Oct 23, 2022
@lithomas1 (Member) commented:

xref #34823 for read_csv.

@FilipRazek commented:

take

@akashthemosh commented:

take

akashthemosh added a commit to akashthemosh/pandas that referenced this issue Jun 12, 2024
Adds the `to_pandas_kwargs` parameter to `pd.read_parquet` to allow passing arguments to `pyarrow.Table.to_pandas`. This addresses issues that may arise during Parquet-to-DataFrame conversion, such as handling microsecond timestamps.

Fixes pandas-dev#49236
@EduardAkhmetshin (Contributor) commented:

take

@jorisvandenbossche (Member) commented Nov 12, 2024

In one of the closed PRs, @WillAyd raised that he finds this a weird API and that it ties us to the pyarrow API (#57044 (comment)), and it was suggested to update the documentation instead.

Like @phofl, I am still +1 on adding this keyword here. It does depend on pyarrow's API (we could also make the keyword name even more specific, e.g. pyarrow_to_pandas_kwargs), but we are quite tied to pyarrow anyway, since we already pass any **kwargs through to pyarrow.parquet.read_table. So this essentially just allows you to pass kwargs to the other pyarrow call as well.

> Why not just encourage users to call pa.parquet.read_table directly in such a case?

That is not exactly equivalent, though: pandas handles the path differently (e.g. URL support), sets up some default type mappers (especially relevant now with the string dtype), and handles attrs.

@WillAyd (Member) commented Nov 13, 2024

Sounds good, Joris. My objection was pretty soft, so I'm happy to have this progressed.

@kleinhenz (Contributor) commented:

I think #59654 is ready if the consensus is to go ahead with this. To me, since we are already passing arguments through to read_table, it makes sense to expose this as well.
