Docs for pyarrow reader / writer #46

TomAugspurger · 2024-04-24T14:51:31Z

This adds docs for the pyarrow reader / writer.

I think most users will benefit from this being the primary method of producing stac-geoparquet, because it has the best support for nested data.

For analytics, where all the features of a library like geopandas will be desired, I'll want to work on pandas-dev/pandas#57411 which prevents (geo)pandas from reading these parquet files.

README.md

kylebarron · 2024-04-24T15:03:26Z

> For analytics, where all the features of a library like geopandas will be desired, I'll want to work on pandas-dev/pandas#57411 which prevents (geo)pandas from reading these parquet files.

Interesting... I suppose I've never hit that because I'm not often saving data with the embedded Pandas-specific metadata. If you write this table without the Pandas-specific metadata it loads fine:

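import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Build a list<int64> column; pa.table() attaches no pandas-specific metadata.
list_int = pa.list_(pa.int64())
col = pa.array([[1, 1], [2, 2]], type=list_int)
table = pa.table({'col': col})
pq.write_table(table, "ex.parquet")

# Read back through pyarrow and convert using pyarrow-backed dtypes.
df = pq.read_table("ex.parquet").to_pandas(types_mapper=pd.ArrowDtype)
print(repr(df))
print(df.dtypes)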

gives:

     col
0  [1 1]
1  [2 2]

col    list<element: int64>[pyarrow]
dtype: object

Maybe the quickest way to unblock ourselves would be to have a helper function to GeoPandas that ignores any embedded GeoPandas metadata?

TomAugspurger · 2024-04-24T15:06:53Z

Oh, sorry, I must have been testing against the dtype_backend="pyarrow" implementation. That should produce the same output, aside from the pandas-specific metadata, which as you say is the issue.

>>> geopandas.read_parquet("items.parquet").head()

works fine, so I think we can ignore that issue.

kylebarron · 2024-04-24T15:09:16Z

> must have been testing against the dtype_backend="pyarrow" implementation

I'm not 100% sure here, but I think passing types_mapper=pd.ArrowDtype into the pyarrow.Table.to_pandas call gives the same result as using dtype_backend="pyarrow". The output dtype is still pyarrow-backed:

col    list<element: int64>[pyarrow]

kylebarron · 2024-04-24T21:55:33Z

Thanks!

Docs for pyarrow reader / writer

60c7f97

kylebarron reviewed Apr 24, 2024

kylebarron mentioned this pull request Apr 24, 2024

Move arrow-based code into arrow module #47

Merged

Tom Augspurger added 2 commits April 24, 2024 16:16

Merge remote-tracking branch 'origin/main' into user/tom/doc-arrow

2a77ac3

fixup

d3a1cf9

TomAugspurger merged commit 2e7cef7 into main Apr 24, 2024
1 check passed

TomAugspurger deleted the user/tom/doc-arrow branch April 24, 2024 21:46
