STAC Interoperability with Arrow #37

kylebarron · 2024-04-16T02:58:43Z

This is a cleanup of #27, which implemented a work-in-progress converter to and from Arrow memory and was originally written during the STAC sprint.

Change list

- Adds new functions that parse STAC Items from dicts or from a newline-delimited JSON file to an Arrow table
  - Supports an optional schema argument for advanced users who know the schema of their STAC items. Note that this schema is applied after conversion to WKB but before any other conversions.
- Adds new functions that convert the Arrow table back to dicts or to a newline-delimited JSON file
- The Arrow table stores geometries as WKB to easily allow STAC Items with differing geometry types.
- Converts bbox column to a struct-type column to align with GeoParquet 1.1
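As a rough illustration of the bbox change, the sketch below maps a 2D STAC bbox list to the xmin/ymin/xmax/ymax struct layout that GeoParquet 1.1 uses, and back again. It uses plain dicts rather than Arrow, the helper names are hypothetical and not part of this PR, and the sample values are only illustrative.

```python
def bbox_to_struct(bbox):
    """Map a 2D bbox list [xmin, ymin, xmax, ymax] to a struct-like dict.

    Hypothetical helper for illustration; 3D (six-element) bboxes are not handled.
    """
    if len(bbox) != 4:
        raise ValueError("this sketch only handles 2D bounding boxes")
    xmin, ymin, xmax, ymax = bbox
    return {"xmin": xmin, "ymin": ymin, "xmax": xmax, "ymax": ymax}


def struct_to_bbox(struct):
    """Inverse of bbox_to_struct: rebuild the list form used in STAC JSON."""
    return [struct["xmin"], struct["ymin"], struct["xmax"], struct["ymax"]]


# Illustrative values only.
bbox = [-65.75386, 18.183872, -65.683663, 18.253643]
bbox_struct = bbox_to_struct(bbox)
assert struct_to_bbox(bbox_struct) == bbox
```

Named fields make each bound addressable as its own column in a columnar format, which a flat list does not.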

This approach may be preferred in some cases. It should be more memory efficient than the existing pandas approach, it needs very little manual code (schema inference is essentially offloaded to the pa.array constructor), and it enforces a strict schema through the inferred Arrow schema. In future work, we could also save memory with dictionary-encoded columns.

This should also be interoperable with the Arrow support in Pandas v2, which GeoPandas also supports.

This mostly supersedes #27 but is created as a separate PR as it deletes the WIP streaming.py implementation from that PR.

TomAugspurger · 2024-04-17T18:36:58Z

Thanks!

I'm trying to round trip some NAIP items from the PC:

import pystac_client
import stac_geoparquet.to_parquet
import stac_geoparquet.from_arrow
import stac_geoparquet.to_arrow

items = list(
    pystac_client.Client.open("https://planetarycomputer.microsoft.com/api/stac/v1")
    .search(collections="naip", max_items=4)
    .items_as_dicts()
)
table = stac_geoparquet.to_arrow.parse_stac_items_to_arrow(items)
items2 = list(stac_geoparquet.from_arrow.stac_table_to_items(table))

and am hitting

TypeError                                 Traceback (most recent call last)
Cell In[11], line 12
      6 items = list(
      7     pystac_client.Client.open("https://planetarycomputer.microsoft.com/api/stac/v1")
      8     .search(collections="naip", max_items=4)
      9     .items_as_dicts()
     10 )
     11 table = stac_geoparquet.to_arrow.parse_stac_items_to_arrow(items)
---> 12 items2 = list(stac_geoparquet.from_arrow.stac_table_to_items(table))

File ~/src/stac-utils/stac-geoparquet/stac_geoparquet/from_arrow.py:27, in stac_table_to_items(table)
     24 # Convert WKB geometry column to GeoJSON, and then assign the geojson geometry when
     25 # converting each row to a dictionary.
     26 for batch in table.to_batches():
---> 27     geoms = shapely.from_wkb(batch["geometry"])
     28     geojson_strings = shapely.to_geojson(geoms)
     30     # RecordBatch is missing a `drop()` method, so we keep all columns other than
     31     # geometry instead

File ~/src/stac-utils/stac-geoparquet/.direnv/python-3.10.10/lib/python3.10/site-packages/shapely/io.py:320, in from_wkb(geometry, on_invalid, **kwargs)
    316 # ensure the input has object dtype, to avoid numpy inferring it as a
    317 # fixed-length string dtype (which removes trailing null bytes upon access
    318 # of array elements)
    319 geometry = np.asarray(geometry, dtype=object)
--> 320 return lib.from_wkb(geometry, invalid_handler, **kwargs)

TypeError: Expected bytes or string, got dict

Seems like the geometry column is GeoJSON-like, but should be WKB?

In [18]: table["geometry"].to_pylist()
Out[18]: 
[{'coordinates': [[[-65.683663, 18.184851],
    [-65.684718, 18.253643],
    [-65.75386, 18.25266],
    [-65.752778, 18.183872],
    [-65.683663, 18.184851]]],
  'type': 'Polygon'},
 {'coordinates': [[[-65.746142, 18.184853],
    [-65.747222, 18.253666],
    [-65.816382, 18.25266],
    [-65.815275, 18.183852],
    [-65.746142, 18.184853]]],
  'type': 'Polygon'},
 {'coordinates': [[[-65.558704, 18.309849],
    [-65.559716, 18.378606],
    [-65.628821, 18.377663],
    [-65.627781, 18.30891],
    [-65.558704, 18.309849]]],
  'type': 'Polygon'},
 {'coordinates': [[[-65.496227, 18.309844],
    [-65.497215, 18.378583],
    [-65.566297, 18.377663],
    [-65.565282, 18.308929],
    [-65.496227, 18.309844]]],
  'type': 'Polygon'}]
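For context on what the geometry column is expected to hold: WKB is a binary encoding, so each value should be `bytes` rather than a dict, which is why `shapely.from_wkb` rejects the output above. Below is a minimal stdlib-only sketch that encodes a 2D GeoJSON Polygon as little-endian WKB. `polygon_to_wkb` is a hypothetical helper written for illustration; it is not the code in this PR (the traceback shows from_arrow.py relying on shapely for the reverse direction).

```python
import struct

WKB_POLYGON = 3  # WKB geometry type code for a 2D Polygon


def polygon_to_wkb(geometry):
    """Encode a 2D GeoJSON Polygon dict as little-endian WKB bytes.

    Illustrative sketch only: no Z/M coordinates, no other geometry types.
    """
    if geometry["type"] != "Polygon":
        raise ValueError("this sketch only handles Polygon geometries")
    rings = geometry["coordinates"]
    # Header: byte-order flag (1 = little-endian), uint32 type, uint32 ring count.
    parts = [struct.pack("<BII", 1, WKB_POLYGON, len(rings))]
    for ring in rings:
        # Each ring: uint32 point count followed by (x, y) pairs of float64.
        parts.append(struct.pack("<I", len(ring)))
        for x, y in ring:
            parts.append(struct.pack("<dd", x, y))
    return b"".join(parts)


# First geometry from the output above.
geometry = {
    "type": "Polygon",
    "coordinates": [[
        [-65.683663, 18.184851],
        [-65.684718, 18.253643],
        [-65.75386, 18.25266],
        [-65.752778, 18.183872],
        [-65.683663, 18.184851],
    ]],
}
wkb = polygon_to_wkb(geometry)
assert isinstance(wkb, bytes)
```

Because the geometry type is carried inside each value, a single binary column can hold Points, Polygons, and MultiPolygons side by side.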

stac_geoparquet/to_arrow.py

TomAugspurger

Overall this looks good to me using some manual tests.

Speaking of tests, thoughts on adding some basic ones, mainly making sure that round-trip between list[Item] <-> Table works? Do you want to wait for #39 to tackle tests?

Co-authored-by: Tom Augspurger <[email protected]>

kylebarron · 2024-04-17T19:11:46Z

> Speaking of tests, thoughts on adding some basic ones, mainly making sure that round-trip between list[Item] <-> Table works?

Definitely. The nice part about this Arrow work is that it's a direct in-memory counterpart to the Parquet schema. So we can mainly test the Arrow interop and get the Parquet functionality for free, without having to test that step as rigorously.

> Do you want to wait for #39 to tackle tests?

Yeah that sounds good.
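A round-trip test of the kind discussed here could be shaped roughly as follows. This is a sketch under assumptions: `assert_round_trip` is a hypothetical helper, and the newline-delimited JSON functions are stand-in converters so the example runs without the library. In a real test, `encode` and `decode` would presumably be `parse_stac_items_to_arrow` and `stac_table_to_items` from the snippet earlier in the thread, and exact equality may need loosening if the conversion normalizes values such as timestamps.

```python
import json


def to_ndjson(items):
    """Stand-in encoder: serialize items as newline-delimited JSON."""
    return "\n".join(json.dumps(item) for item in items)


def from_ndjson(text):
    """Stand-in decoder: parse newline-delimited JSON back into dicts."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def assert_round_trip(items, encode, decode):
    """Check that decode(encode(items)) reproduces the original items."""
    result = list(decode(encode(items)))
    assert len(result) == len(items), "item count changed during round trip"
    for original, restored in zip(items, result):
        assert restored == original, f"item {original.get('id')!r} changed"
    return result


# Small illustrative items; real tests would use full STAC Items.
items = [
    {"id": "a", "bbox": [0.0, 0.0, 1.0, 1.0], "properties": {"gsd": 0.6}},
    {"id": "b", "bbox": [1.0, 1.0, 2.0, 2.0], "properties": {"gsd": 0.6}},
]
restored = assert_round_trip(items, to_ndjson, from_ndjson)
```

Passing the converters in as arguments keeps the same assertion reusable for the dict, newline-delimited JSON, and Parquet paths.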

TomAugspurger · 2024-04-17T19:31:24Z

Great, thanks!

kylebarron and others added 11 commits September 26, 2023 15:25

stac from arrow

ec8d8de

convert timestamps

aa20a8f

add ciso8601

c38446c

parse to arrow

4d196a5

rename to to_arrow.py

5612e6e

Convert back to json lines

de69b61

wip chunked jsonl reader

37b21f3

Avoid JSON in _items_to_arrow

bc40430

Merge branch 'main' into kyle/stac-geoarrow

dd2c6a8

Updates to to-arrow conversion

ee889e3

Remove streaming for now

52c3849

kylebarron changed the title ~~stac from arrow~~ STAC Interoperability with Arrow Apr 16, 2024

kylebarron added 5 commits April 15, 2024 23:03

lint

ce48700

lint

5d72770

Convert bbox column to struct layout

713cdd1

Convert bbox back to list before writing to JSON

f5ac44e

Use ISO WKB

e82e2ea

kylebarron mentioned this pull request Apr 16, 2024

Write to Parquet with GeoParquet 1.1 metadata #40

Merged

Lint

7d6a75b

TomAugspurger reviewed Apr 17, 2024

stac_geoparquet/to_arrow.py (outdated)

TomAugspurger approved these changes Apr 17, 2024

Update to_arrow.py

19fe8c0

Co-authored-by: Tom Augspurger <[email protected]>

TomAugspurger merged commit 7152f5e into stac-utils:main Apr 17, 2024
1 check passed

kylebarron mentioned this pull request Apr 17, 2024

stac-sprint updates #27

Merged

1 task
