
ENH: Expose to_pandas_kwargs in read_parquet for pyarrow engine #49236

Closed
1 of 3 tasks
TomAugspurger opened this issue Oct 21, 2022 · 7 comments · Fixed by #59654
Labels: Arrow (pyarrow functionality), Enhancement, good first issue, IO Parquet (parquet, feather)

@TomAugspurger (Contributor) commented Oct 21, 2022

Feature Type

  • [x] Adding new functionality to pandas

  • [ ] Changing existing functionality in pandas

  • [ ] Removing existing functionality in pandas

Problem Description

I want to read a parquet file but have control over how the pyarrow Table is converted to a pandas DataFrame, by specifying a to_pandas_kwargs argument that read_parquet forwards to Table.to_pandas().

import pyarrow as pa
import pyarrow.parquet as pq
import datetime

# write
arr = pa.array([datetime.datetime(1600, 1, 1)], type=pa.timestamp("us"))
table = pa.table([arr], names=["timestamp"])
pq.write_table(table, "test.parquet")

# read
import pandas as pd
pd.read_parquet("test.parquet")

That raises with:

---------------------------------------------------------------------------
ArrowInvalid                              Traceback (most recent call last)
Cell In [43], line 10
      7 pq.write_table(table, "test.parquet")
      9 import pandas as pd
---> 10 pd.read_parquet("test.parquet")

File /srv/conda/envs/notebook/lib/python3.10/site-packages/pandas/io/parquet.py:501, in read_parquet(path, engine, columns, storage_options, use_nullable_dtypes, **kwargs)
    454 """
    455 Load a parquet object from the file path, returning a DataFrame.
    456 
   (...)
    497 DataFrame
    498 """
    499 impl = get_engine(engine)
--> 501 return impl.read(
    502     path,
    503     columns=columns,
    504     storage_options=storage_options,
    505     use_nullable_dtypes=use_nullable_dtypes,
    506     **kwargs,
    507 )

File /srv/conda/envs/notebook/lib/python3.10/site-packages/pandas/io/parquet.py:249, in PyArrowImpl.read(self, path, columns, use_nullable_dtypes, storage_options, **kwargs)
    242 path_or_handle, handles, kwargs["filesystem"] = _get_path_or_handle(
    243     path,
    244     kwargs.pop("filesystem", None),
    245     storage_options=storage_options,
    246     mode="rb",
    247 )
    248 try:
--> 249     result = self.api.parquet.read_table(
    250         path_or_handle, columns=columns, **kwargs
    251     ).to_pandas(**to_pandas_kwargs)
    252     if manager == "array":
    253         result = result._as_manager("array", copy=False)

File /srv/conda/envs/notebook/lib/python3.10/site-packages/pyarrow/array.pxi:823, in pyarrow.lib._PandasConvertible.to_pandas()

File /srv/conda/envs/notebook/lib/python3.10/site-packages/pyarrow/table.pxi:3913, in pyarrow.lib.Table._to_pandas()

File /srv/conda/envs/notebook/lib/python3.10/site-packages/pyarrow/pandas_compat.py:818, in table_to_blockmanager(options, table, categories, ignore_metadata, types_mapper)
    816 _check_data_column_metadata_consistency(all_columns)
    817 columns = _deserialize_column_index(table, all_columns, column_indexes)
--> 818 blocks = _table_to_blocks(options, table, categories, ext_columns_dtypes)
    820 axes = [columns, index]
    821 return BlockManager(blocks, axes)

File /srv/conda/envs/notebook/lib/python3.10/site-packages/pyarrow/pandas_compat.py:1168, in _table_to_blocks(options, block_table, categories, extension_columns)
   1163 def _table_to_blocks(options, block_table, categories, extension_columns):
   1164     # Part of table_to_blockmanager
   1165 
   1166     # Convert an arrow table to Block from the internal pandas API
   1167     columns = block_table.column_names
-> 1168     result = pa.lib.table_to_blocks(options, block_table, categories,
   1169                                     list(extension_columns.keys()))
   1170     return [_reconstruct_block(item, columns, extension_columns)
   1171             for item in result]

File /srv/conda/envs/notebook/lib/python3.10/site-packages/pyarrow/table.pxi:2602, in pyarrow.lib.table_to_blocks()

File /srv/conda/envs/notebook/lib/python3.10/site-packages/pyarrow/error.pxi:100, in pyarrow.lib.check_status()

ArrowInvalid: Casting from timestamp[us] to timestamp[ns] would result in out of bounds timestamp: -11676096000000000

The solution, in pyarrow, is to pass timestamp_as_object=True in the call to .to_pandas().

Feature Description

Add a new parameter to read_parquet (technically only for the pyarrow engine, but documented on read_parquet itself):

pd.read_parquet(
    path: 'FilePath | ReadBuffer[bytes]',
    engine: 'str' = 'auto',
    columns: 'list[str] | None' = None,
    storage_options: 'StorageOptions' = None,
    use_nullable_dtypes: 'bool' = False,
    to_pandas_kwargs: 'dict[str, Any] | None' = None,
    **kwargs,
) -> 'DataFrame'
    """
    to_pandas_kwargs:
        Additional keyword arguments passed to :meth:`pyarrow.Table.to_pandas` to control
        how the pyarrow Table is converted to a pandas DataFrame. By default,
        the `use_nullable_dtypes` option controls whether the `types_mapper` argument
        is set.
    """

Alternative Solutions

Just use pyarrow :)

Additional Context

No response

@TomAugspurger TomAugspurger added Enhancement IO Parquet parquet, feather Needs Triage Issue that has not been reviewed by a pandas team member Arrow pyarrow functionality labels Oct 21, 2022
@lithomas1 lithomas1 removed the Needs Triage Issue that has not been reviewed by a pandas team member label Oct 23, 2022
@lithomas1 (Member) commented:

xref #34823 for read_csv.

@FilipRazek commented:

take

@akashthemosh commented:

take

akashthemosh added a commit to akashthemosh/pandas that referenced this issue Jun 12, 2024
Adds the `to_pandas_kwargs` parameter to `pd.read_parquet` to allow passing arguments to `pyarrow.Table.to_pandas`. This addresses issues that may arise during Parquet-to-DataFrame conversion, such as handling microsecond timestamps.

Fixes pandas-dev#49236
@EduardAkhmetshin (Contributor) commented:

take

@jorisvandenbossche (Member) commented Nov 12, 2024

In one of the closed PRs, @WillAyd raised that he finds this a weird API and that it ties us to the pyarrow API (#57044 (comment)), and it was suggested to update the documentation instead.

Like @phofl, I am still +1 on adding this keyword here. It does depend on pyarrow's API (we could also make the keyword name even more specific, e.g. pyarrow_to_pandas_kwargs), but we are quite tied to pyarrow anyway, since we already pass any **kwargs through to pyarrow.parquet.read_table. So this essentially just allows you to pass kwargs to the other pyarrow call as well.

> Why not just encourage users to call pa.parquet.read_table directly in such a case?

That is not exactly equivalent, though: pandas handles the path differently (e.g. URL support), sets up some default type mappers (especially relevant now with the string dtype), and handles attrs.

@WillAyd (Member) commented Nov 13, 2024

Sounds good, Joris. My objection was pretty soft, so I'm happy to have this progressed.

@kleinhenz (Contributor) commented:

I think #59654 is ready if the consensus is to go ahead with this. To me, since we are already passing arguments through to read_table, it makes sense to expose this as well.
