STAC Interoperability with Arrow #37
Thanks! I'm trying to round trip some NAIP items from the PC:

import pystac_client
import stac_geoparquet.to_parquet
import stac_geoparquet.from_arrow
import stac_geoparquet.to_arrow
items = list(
pystac_client.Client.open("https://planetarycomputer.microsoft.com/api/stac/v1")
.search(collections="naip", max_items=4)
.items_as_dicts()
)
table = stac_geoparquet.to_arrow.parse_stac_items_to_arrow(items)
items2 = list(stac_geoparquet.from_arrow.stac_table_to_items(table))

and am hitting

TypeError Traceback (most recent call last)
Cell In[11], line 12
6 items = list(
7 pystac_client.Client.open("https://planetarycomputer.microsoft.com/api/stac/v1")
8 .search(collections="naip", max_items=4)
9 .items_as_dicts()
10 )
11 table = stac_geoparquet.to_arrow.parse_stac_items_to_arrow(items)
---> 12 items2 = list(stac_geoparquet.from_arrow.stac_table_to_items(table))
File ~/src/stac-utils/stac-geoparquet/stac_geoparquet/from_arrow.py:27, in stac_table_to_items(table)
24 # Convert WKB geometry column to GeoJSON, and then assign the geojson geometry when
25 # converting each row to a dictionary.
26 for batch in table.to_batches():
---> 27 geoms = shapely.from_wkb(batch["geometry"])
28 geojson_strings = shapely.to_geojson(geoms)
30 # RecordBatch is missing a `drop()` method, so we keep all columns other than
31 # geometry instead
File ~/src/stac-utils/stac-geoparquet/.direnv/python-3.10.10/lib/python3.10/site-packages/shapely/io.py:320, in from_wkb(geometry, on_invalid, **kwargs)
316 # ensure the input has object dtype, to avoid numpy inferring it as a
317 # fixed-length string dtype (which removes trailing null bytes upon access
318 # of array elements)
319 geometry = np.asarray(geometry, dtype=object)
--> 320 return lib.from_wkb(geometry, invalid_handler, **kwargs)
TypeError: Expected bytes or string, got dict

Seems like the

In [18]: table["geometry"].to_pylist()
Out[18]:
[{'coordinates': [[[-65.683663, 18.184851],
[-65.684718, 18.253643],
[-65.75386, 18.25266],
[-65.752778, 18.183872],
[-65.683663, 18.184851]]],
'type': 'Polygon'},
{'coordinates': [[[-65.746142, 18.184853],
[-65.747222, 18.253666],
[-65.816382, 18.25266],
[-65.815275, 18.183852],
[-65.746142, 18.184853]]],
'type': 'Polygon'},
{'coordinates': [[[-65.558704, 18.309849],
[-65.559716, 18.378606],
[-65.628821, 18.377663],
[-65.627781, 18.30891],
[-65.558704, 18.309849]]],
'type': 'Polygon'},
{'coordinates': [[[-65.496227, 18.309844],
[-65.497215, 18.378583],
[-65.566297, 18.377663],
[-65.565282, 18.308929],
[-65.496227, 18.309844]]],
'type': 'Polygon'}]
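
The traceback shows `shapely.from_wkb` being handed dicts, and the `to_pylist()` output confirms that the geometry column still holds GeoJSON-style dicts rather than WKB bytes. As a rough illustration of what the WKB step is expected to produce, here is a minimal standard-library sketch that encodes a 2D GeoJSON Polygon as little-endian WKB and decodes it again. The helper names `polygon_to_wkb` and `wkb_to_polygon` are hypothetical and are not part of stac-geoparquet or shapely; in practice a library such as shapely would do this conversion.

```python
import struct

WKB_LITTLE_ENDIAN = 1  # byte-order flag used in the WKB header
WKB_POLYGON = 3        # WKB geometry type code for a 2D Polygon


def polygon_to_wkb(geometry: dict) -> bytes:
    """Encode a 2D GeoJSON Polygon dict as little-endian WKB (illustrative only)."""
    if geometry.get("type") != "Polygon":
        raise ValueError("this sketch only handles Polygon geometries")
    rings = geometry["coordinates"]
    # Header: byte order, geometry type, number of rings.
    parts = [struct.pack("<BII", WKB_LITTLE_ENDIAN, WKB_POLYGON, len(rings))]
    for ring in rings:
        # Each ring is a point count followed by (x, y) pairs of doubles.
        parts.append(struct.pack("<I", len(ring)))
        for x, y in ring:
            parts.append(struct.pack("<dd", x, y))
    return b"".join(parts)


def wkb_to_polygon(data: bytes) -> dict:
    """Decode the WKB produced by polygon_to_wkb back into a GeoJSON-style dict."""
    byte_order, geom_type, n_rings = struct.unpack_from("<BII", data, 0)
    if byte_order != WKB_LITTLE_ENDIAN or geom_type != WKB_POLYGON:
        raise ValueError("expected a little-endian 2D Polygon")
    offset = 9  # 1 byte order + 4 type + 4 ring count
    rings = []
    for _ in range(n_rings):
        (n_points,) = struct.unpack_from("<I", data, offset)
        offset += 4
        ring = []
        for _ in range(n_points):
            x, y = struct.unpack_from("<dd", data, offset)
            offset += 16
            ring.append([x, y])
        rings.append(ring)
    return {"type": "Polygon", "coordinates": rings}


# First geometry from the output above.
first_geometry = {
    "type": "Polygon",
    "coordinates": [[
        [-65.683663, 18.184851],
        [-65.684718, 18.253643],
        [-65.75386, 18.25266],
        [-65.752778, 18.183872],
        [-65.683663, 18.184851],
    ]],
}
wkb = polygon_to_wkb(first_geometry)
```

A column of values like `wkb` is bytes, which is what the error message says the decoder expects; a column of values like `first_geometry` is what triggers the `TypeError`.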
Overall this looks good to me using some manual tests.
Speaking of tests, thoughts on adding some basic ones, mainly making sure that round-trip between list[Item] <-> Table works? Do you want to wait for #39 to tackle tests?
Co-authored-by: Tom Augspurger <[email protected]>
Definitely. The nice part about this Arrow work is that it's a direct in-memory counterpart to the Parquet schema. So we can mainly test the Arrow interop and get the Parquet functionality for free, without having to test that step as rigorously.
Yeah that sounds good.
Great, thanks!
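
A round-trip test along the lines discussed above could be sketched as follows. This is only an outline under stated assumptions: `assert_round_trip`, `to_columns`, and `from_columns` are hypothetical names, and the two converters are simple dict-of-lists stand-ins so the example runs with the standard library alone. In the real test suite the converters would be the functions from the snippet earlier in this thread (`parse_stac_items_to_arrow` and `stac_table_to_items`).

```python
import json
from typing import Any, Callable, Iterable


def canonical(item: dict) -> str:
    """Serialize an item with sorted keys so key order does not affect comparison."""
    return json.dumps(item, sort_keys=True)


def assert_round_trip(
    items: list,
    to_table: Callable[[list], Any],
    from_table: Callable[[Any], Iterable[dict]],
) -> None:
    """Check that items -> table -> items gives back the original items."""
    restored = list(from_table(to_table(items)))
    assert len(restored) == len(items), "item count changed during round trip"
    for original, result in zip(items, restored):
        assert canonical(original) == canonical(result), (
            f"item {original.get('id')!r} changed during round trip"
        )


# Placeholder converters (assumption: every item has the same top-level keys).
def to_columns(items: list) -> dict:
    keys = list(items[0]) if items else []
    return {key: [item[key] for item in items] for key in keys}


def from_columns(table: dict) -> Iterable[dict]:
    n_rows = len(next(iter(table.values()))) if table else 0
    for i in range(n_rows):
        yield {key: column[i] for key, column in table.items()}


# Made-up items for illustration, not real NAIP records.
sample_items = [
    {"id": "item-1", "type": "Feature", "properties": {"gsd": 0.6}},
    {"id": "item-2", "type": "Feature", "properties": {"gsd": 1.0}},
]
assert_round_trip(sample_items, to_columns, from_columns)
```

Passing the converters in as arguments keeps the comparison logic in one place, so the same check could be reused once Parquet read/write helpers exist.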
This is a cleanup of #27, which implemented a work-in-progress converter to and from Arrow memory, originally done during the STAC sprint.
Change list
- `schema` argument for advanced users who know the schema of their STAC items. Note that this schema is applied after conversion to WKB but before any other conversions.
- `bbox` column converted to a struct-type column to align with GeoParquet 1.1.

This approach may be preferred in some cases. It should be more memory efficient than the existing pandas approach, it's minimally manual (basically we offload all schema inference into the `pa.array` constructor), and it enforces strict schemas via the inferred Arrow schema. In future work, we could also save memory with dictionary-encoded columns.

This also should be interoperable with the Arrow support in Pandas v2, which GeoPandas also supports.
This mostly supersedes #27 but is created as a separate PR as it deletes the WIP `streaming.py` implementation from that PR.
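
The `bbox` change in the list above can be pictured with a short sketch that maps a STAC bbox list onto named struct fields. It assumes the field names `xmin`, `ymin`, `xmax`, `ymax` (plus `zmin`, `zmax` for 3D) used by the GeoParquet 1.1 bounding-box covering; the helpers `bbox_to_struct` and `struct_to_bbox` are hypothetical and not taken from this PR, and the sample values are simply the envelope of the first polygon shown in the conversation.

```python
def bbox_to_struct(bbox: list) -> dict:
    """Map a STAC bbox list to a dict with named fields (2D or 3D)."""
    if len(bbox) == 4:
        xmin, ymin, xmax, ymax = bbox
        return {"xmin": xmin, "ymin": ymin, "xmax": xmax, "ymax": ymax}
    if len(bbox) == 6:
        # 3D bboxes list all minimum axes first, then all maximum axes.
        xmin, ymin, zmin, xmax, ymax, zmax = bbox
        return {
            "xmin": xmin, "ymin": ymin, "zmin": zmin,
            "xmax": xmax, "ymax": ymax, "zmax": zmax,
        }
    raise ValueError(f"bbox must have 4 or 6 elements, got {len(bbox)}")


def struct_to_bbox(fields: dict) -> list:
    """Inverse of bbox_to_struct: rebuild the positional list form."""
    if "zmin" in fields:
        return [
            fields["xmin"], fields["ymin"], fields["zmin"],
            fields["xmax"], fields["ymax"], fields["zmax"],
        ]
    return [fields["xmin"], fields["ymin"], fields["xmax"], fields["ymax"]]


# Illustrative bbox: envelope of the first polygon in the thread.
sample_bbox = [-65.75386, 18.183872, -65.683663, 18.253643]
sample_struct = bbox_to_struct(sample_bbox)
```

Named fields let a reader filter on, say, `xmin` alone without unpacking a positional list, which is one reason a struct layout may be convenient for columnar storage.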