Docs for pyarrow reader / writer #46

Merged
merged 3 commits into from
Apr 24, 2024


TomAugspurger
Collaborator

This adds docs for the pyarrow reader / writer.

I think most users will benefit from this being the primary method of producing stac-geoparquet, because it has the best support for nested data.

For analytics, where all the features of a library like geopandas will be desired, I'll want to work on pandas-dev/pandas#57411 which prevents (geo)pandas from reading these parquet files.

@kylebarron
Collaborator

kylebarron commented Apr 24, 2024

For analytics, where all the features of a library like geopandas will be desired, I'll want to work on pandas-dev/pandas#57411 which prevents (geo)pandas from reading these parquet files.

Interesting... I suppose I've never hit that because I'm not often saving data with the embedded Pandas-specific metadata. If you write this table without the Pandas-specific metadata it loads fine:

import pyarrow as pa
import pandas as pd
import pyarrow.parquet as pq

list_int = pa.list_(pa.int64())
col = pa.array([[1, 1], [2, 2]], type=list_int)
table = pa.table({'col': col})
pq.write_table(table, "ex.parquet")

df = pq.read_table("ex.parquet").to_pandas(types_mapper=pd.ArrowDtype)
print(repr(df))
print(df.dtypes)

gives:

     col
0  [1 1]
1  [2 2]
col    list<element: int64>[pyarrow]
dtype: object

Maybe the quickest way to unblock ourselves would be a helper function that loads the table into GeoPandas while ignoring any embedded pandas-specific metadata?

@TomAugspurger
Collaborator Author

Oh, sorry, I must have been testing against the dtype_backend="pyarrow" implementation. That should produce the same output, aside from the pandas-specific metadata, which as you say is the issue.

>>> geopandas.read_parquet("items.parquet").head()

works fine, so I think we can ignore that issue.

@kylebarron
Collaborator

kylebarron commented Apr 24, 2024

must have been testing against the dtype_backend="pyarrow" implementation

I'm not 100% sure here, but I thought that passing types_mapper=pd.ArrowDtype into the pyarrow.Table.to_pandas call gives the same result as using dtype_backend="pyarrow". The output dtype is still pyarrow-backed:

col    list<element: int64>[pyarrow]

@TomAugspurger TomAugspurger merged commit 2e7cef7 into main Apr 24, 2024
1 check passed
@TomAugspurger TomAugspurger deleted the user/tom/doc-arrow branch April 24, 2024 21:46
@kylebarron
Collaborator

Thanks!
