
STAC Interoperability with Arrow #37

Merged 18 commits on Apr 17, 2024

Conversation

@kylebarron (Collaborator) commented Apr 16, 2024

This is a cleanup of #27, which implemented a work-in-progress converter to and from Arrow memory, originally written during the STAC sprint.

Change list

  • Adds new functions that parse STAC Items from dicts or from a newline-delimited JSON file to an Arrow table
    • Supports an optional schema argument for advanced users who know the schema of their STAC items. Note that this schema is applied after conversion to WKB but before any other conversions.
  • Adds new functions that convert the Arrow table back to dicts or to a newline-delimited JSON file
  • The Arrow table stores geometries as WKB, so a single table can hold STAC Items with differing geometry types.
  • Converts the bbox column to a struct-type column to align with GeoParquet 1.1

This approach may be preferred in some cases. It should be more memory efficient than the existing pandas approach, it requires minimal manual handling (essentially all schema inference is offloaded to the pa.array constructor), and it enforces a strict schema via the inferred Arrow schema. In future work, we could also save memory with dictionary-encoded columns.

This should also be interoperable with the Arrow support in pandas v2, which GeoPandas supports as well.

This mostly supersedes #27 but is created as a separate PR as it deletes the WIP streaming.py implementation from that PR.

@kylebarron changed the title from "stac from arrow" to "STAC Interoperability with Arrow" on Apr 16, 2024
@TomAugspurger (Collaborator)

Thanks!

I'm trying to round trip some NAIP items from the PC:

import pystac_client
import stac_geoparquet.to_parquet
import stac_geoparquet.from_arrow
import stac_geoparquet.to_arrow

items = list(
    pystac_client.Client.open("https://planetarycomputer.microsoft.com/api/stac/v1")
    .search(collections="naip", max_items=4)
    .items_as_dicts()
)
table = stac_geoparquet.to_arrow.parse_stac_items_to_arrow(items)
items2 = list(stac_geoparquet.from_arrow.stac_table_to_items(table))

and am hitting

TypeError                                 Traceback (most recent call last)
Cell In[11], line 12
      6 items = list(
      7     pystac_client.Client.open("https://planetarycomputer.microsoft.com/api/stac/v1")
      8     .search(collections="naip", max_items=4)
      9     .items_as_dicts()
     10 )
     11 table = stac_geoparquet.to_arrow.parse_stac_items_to_arrow(items)
---> 12 items2 = list(stac_geoparquet.from_arrow.stac_table_to_items(table))

File ~/src/stac-utils/stac-geoparquet/stac_geoparquet/from_arrow.py:27, in stac_table_to_items(table)
     24 # Convert WKB geometry column to GeoJSON, and then assign the geojson geometry when
     25 # converting each row to a dictionary.
     26 for batch in table.to_batches():
---> 27     geoms = shapely.from_wkb(batch["geometry"])
     28     geojson_strings = shapely.to_geojson(geoms)
     30     # RecordBatch is missing a `drop()` method, so we keep all columns other than
     31     # geometry instead

File ~/src/stac-utils/stac-geoparquet/.direnv/python-3.10.10/lib/python3.10/site-packages/shapely/io.py:320, in from_wkb(geometry, on_invalid, **kwargs)
    316 # ensure the input has object dtype, to avoid numpy inferring it as a
    317 # fixed-length string dtype (which removes trailing null bytes upon access
    318 # of array elements)
    319 geometry = np.asarray(geometry, dtype=object)
--> 320 return lib.from_wkb(geometry, invalid_handler, **kwargs)

TypeError: Expected bytes or string, got dict

Seems like the geometry column is GeoJSON-like, but should be WKB?

In [18]: table["geometry"].to_pylist()
Out[18]: 
[{'coordinates': [[[-65.683663, 18.184851],
    [-65.684718, 18.253643],
    [-65.75386, 18.25266],
    [-65.752778, 18.183872],
    [-65.683663, 18.184851]]],
  'type': 'Polygon'},
 {'coordinates': [[[-65.746142, 18.184853],
    [-65.747222, 18.253666],
    [-65.816382, 18.25266],
    [-65.815275, 18.183852],
    [-65.746142, 18.184853]]],
  'type': 'Polygon'},
 {'coordinates': [[[-65.558704, 18.309849],
    [-65.559716, 18.378606],
    [-65.628821, 18.377663],
    [-65.627781, 18.30891],
    [-65.558704, 18.309849]]],
  'type': 'Polygon'},
 {'coordinates': [[[-65.496227, 18.309844],
    [-65.497215, 18.378583],
    [-65.566297, 18.377663],
    [-65.565282, 18.308929],
    [-65.496227, 18.309844]]],
  'type': 'Polygon'}]

@TomAugspurger (Collaborator) left a comment

Overall this looks good to me using some manual tests.


Speaking of tests, thoughts on adding some basic ones, mainly making sure that round-trip between list[Item] <-> Table works? Do you want to wait for #39 to tackle tests?

Co-authored-by: Tom Augspurger <[email protected]>
@kylebarron (Collaborator, Author)

Speaking of tests, thoughts on adding some basic ones, mainly making sure that round-trip between list[Item] <-> Table works?

Definitely. The nice part about this Arrow work is that it's a direct in-memory counterpart to the Parquet schema. So we can mainly test the Arrow interop and get the Parquet functionality for free, without having to test that step as rigorously.

Do you want to wait for #39 to tackle tests?

Yeah that sounds good.

@TomAugspurger (Collaborator)

Great, thanks!

@TomAugspurger TomAugspurger merged commit 7152f5e into stac-utils:main Apr 17, 2024
1 check passed
@kylebarron kylebarron mentioned this pull request Apr 17, 2024