
stac-sprint updates #27

Merged: 9 commits into stac-utils:main on Apr 17, 2024

Conversation

@kylebarron (Collaborator) commented Sep 26, 2023

  • Use pyarrow.json.read_json() to construct the STAC table from newline-delimited JSON (see the sketch below)
  • Bring properties up to the top level
  • Convert GeoJSON geometry to WKB
  • Parse timestamp columns

Todo:

  • Ensure geoparquet metadata
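
As a bit of context for the first bullet, a minimal sketch of the ndjson-to-Arrow step (not the code in this PR; "items.ndjson" is a placeholder path):

from pyarrow import json as pa_json

# read_json infers one Arrow schema across every line of the file, giving a
# struct-typed column per nested STAC field (e.g. "properties", "assets").
table = pa_json.read_json("items.ndjson")
print(table.schema)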

@kylebarron kylebarron marked this pull request as draft September 26, 2023 20:00
@kylebarron (Collaborator Author):

With the stac-geoparquet layout we're spec-ing out, if you want RGB COG URLs, the only data you have to load are those three asset columns (plus whatever you want to filter on); every other column never even gets read from disk. On this Parquet table of 18,000 STAC Items, reading these 6 columns takes 250 ms:

import pyarrow.parquet as pq
columns = [
    "assets.red.href",
    "assets.green.href",
    "assets.blue.href",
    "geometry",
    "created",
    "eo:cloud_cover",
]
pq.read_table("tmp.parquet", columns=columns)
pyarrow.Table
href: string
href: string
href: string
geometry: binary
created: timestamp[us, tz=UTC]
eo:cloud_cover: double
----
href: [["https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/13/R/DN/2022/9/S2A_13RDN_20220927_0_L2A/B04.tif","https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/13/R/DN/2022/9/S2A_13RDN_20220907_0_L2A/B04.tif","https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/13/R/DN/2022/9/S2B_13RDN_20220912_0_L2A/B04.tif","https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/13/R/DN/2022/9/S2B_13RDN_20220915_0_L2A/B04.tif","https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/13/R/DN/2022/9/S2A_13RDN_20220920_0_L2A/B04.tif",...,"https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/13/P/BM/2022/5/S2B_13PBM_20220515_0_L2A/B04.tif","https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/13/P/BM/2022/5/S2B_13PBM_20220525_0_L2A/B04.tif","https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/13/P/BM/2022/5/S2A_13PBM_20220530_0_L2A/B04.tif","https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/13/P/BM/2022/5/S2B_13PBM_20220505_0_L2A/B04.tif","https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/13/P/BM/2022/5/S2A_13PBM_20220510_0_L2A/B04.tif"]]
href: [["https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/13/R/DN/2022/9/S2A_13RDN_20220927_0_L2A/B03.tif","https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/13/R/DN/2022/9/S2A_13RDN_20220907_0_L2A/B03.tif","https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/13/R/DN/2022/9/S2B_13RDN_20220912_0_L2A/B03.tif","https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/13/R/DN/2022/9/S2B_13RDN_20220915_0_L2A/B03.tif","https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/13/R/DN/2022/9/S2A_13RDN_20220920_0_L2A/B03.tif",...,"https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/13/P/BM/2022/5/S2B_13PBM_20220515_0_L2A/B03.tif","https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/13/P/BM/2022/5/S2B_13PBM_20220525_0_L2A/B03.tif","https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/13/P/BM/2022/5/S2A_13PBM_20220530_0_L2A/B03.tif","https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/13/P/BM/2022/5/S2B_13PBM_20220505_0_L2A/B03.tif","https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/13/P/BM/2022/5/S2A_13PBM_20220510_0_L2A/B03.tif"]]
href: [["https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/13/R/DN/2022/9/S2A_13RDN_20220927_0_L2A/B02.tif","https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/13/R/DN/2022/9/S2A_13RDN_20220907_0_L2A/B02.tif","https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/13/R/DN/2022/9/S2B_13RDN_20220912_0_L2A/B02.tif","https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/13/R/DN/2022/9/S2B_13RDN_20220915_0_L2A/B02.tif","https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/13/R/DN/2022/9/S2A_13RDN_20220920_0_L2A/B02.tif",...,"https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/13/P/BM/2022/5/S2B_13PBM_20220515_0_L2A/B02.tif","https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/13/P/BM/2022/5/S2B_13PBM_20220525_0_L2A/B02.tif","https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/13/P/BM/2022/5/S2A_13PBM_20220530_0_L2A/B02.tif","https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/13/P/BM/2022/5/S2B_13PBM_20220505_0_L2A/B02.tif","https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/13/P/BM/2022/5/S2A_13PBM_20220510_0_L2A/B02.tif"]]
geometry: [[01030000000100000007000000FCFBBA88D7625AC0057F4F664FD43D40A3B1840089395AC028DD456296D43D402F9809EE98395AC0A3B1E961E5D63C401AE5A48BF5725AC0D064BCD04DD63C40857E2C58426F5AC0216E47488B0F3D4065180718B16D5AC068C6306CC2293D40FCFBBA88D7625AC0057F4F664FD43D40,01030000000100000006000000E45ECAAF8E635AC028567B594CD43D40A3B1840089395AC028DD456296D43D402F9809EE98395AC0A3B1E961E5D63C403C1667D6A0735AC04D71A5BE49D63C40F5620436A96F5AC03408B2C599133D40E45ECAAF8E635AC028567B594CD43D40,010300000001000000070000009AC32445A1625AC0B5DF9A4A50D43D40A3B1840089395AC028DD456296D43D402F9809EE98395AC0A3B1E961E5D63C40CF37AB69B7725AC05D805F474FD63C409E9C2949236F5AC0768B7424DC0D3D4013F5FD1E166D5AC0334C1518DA2F3D409AC32445A1625AC0B5DF9A4A50D43D40,010300000001000000050000009B824B2744825AC03B5F09FD8ED33D402744665FBE3C5AC0215ECF4598D43D4066CDAE08C74E5AC0FE3BF35E39D73C40B840AC11A1815AC04F23368923D63C409B824B2744825AC03B5F09FD8ED33D40,010300000001000000050000009B824B2744825AC03B5F09FD8ED33D404DAB1BB9D23C5AC0A4FABC4D98D43D40F5D2CB2EDB4E5AC0174E815039D73C40CC40AC11A1815AC03B43368923D63C409B824B2744825AC03B5F09FD8ED33D40,...,]]
created: [[2022-11-06 07:34:19.473097,2022-11-03 20:46:10.304505,2022-11-06 10:10:45.866648,2022-11-06 07:45:15.042230,2022-11-06 07:21:45.102993,...,2022-11-06 06:20:22.989608,2022-11-06 09:44:47.224868,2022-11-06 06:20:42.045835,2022-11-03 17:34:44.331725,2022-11-05 21:18:55.467216]]
eo:cloud_cover: [[29.692835,68.197918,0.258553,0.003515,21.436284,...,5.392413,99.998921,78.555667,93.079418,95.566863]]

@kylebarron (Collaborator Author):

I'll say the state of this PR is:

  • WIP efficient STAC -> GeoParquet and GeoParquet -> STAC conversion

    • Most efficient with newline-delimited JSON STAC input on disk
    • Should be much more efficient than the existing implementation; it doesn't go through Pandas or Python dictionaries
  • WIP chunked STAC -> GeoParquet for massive STAC collections

  • 💩 Schema resolution. This is the hard part

    • Parquet needs to know the output schema before starting to write the Parquet file
    • Every Parquet batch needs to have the exact same schema
    • Tested on all sentinel-cogs STAC Items from UTM zone 13. One Item had data_coverage instead of sentinel:data_coverage, and writing crashed because a new column appeared in the table.
    • Schema resolution (ensuring a single schema for all items) is really hard; one possible approach is sketched after this list.
  • Running the schema resolution process can also act as a check that your static collection is "tidy", so it may be useful in itself for collection-level statistics.
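
This PR doesn't do this yet, but one way to approach the schema-resolution problem, assuming all chunks fit in memory, is to unify the chunk schemas up front and pad missing fields with nulls before writing. A rough sketch (unify_and_write and tables are hypothetical names, not part of this PR):

import pyarrow as pa
import pyarrow.parquet as pq

def unify_and_write(tables: list[pa.Table], path: str) -> None:
    # One schema for the whole file: unify_schemas merges field lists and
    # raises if the same field shows up with incompatible types.
    schema = pa.unify_schemas([t.schema for t in tables])

    with pq.ParquetWriter(path, schema) as writer:
        for t in tables:
            # Pad fields missing from this chunk with all-null arrays so every
            # batch written to the Parquet file has exactly the same schema.
            columns = [
                t.column(f.name).cast(f.type)
                if f.name in t.schema.names
                else pa.nulls(len(t), type=f.type)
                for f in schema
            ]
            writer.write_table(pa.table(columns, schema=schema))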

@bitner (Contributor) commented Oct 2, 2023

I was playing around with some of this with an eye toward integrating pyarrow for reading data for the pypgstac loader, and one issue I found right away is that it's not just Parquet that needs the output schema up front. As far as I can tell, just using pyarrow.json.read_json() against an ndjson file of Items will fail on GeoJSON that has different geometry types.

@TomAugspurger (Collaborator) commented Oct 6, 2023

@kylebarron bc40430 has a commit that converts the Items to a pyarrow Table in memory rather than going through JSON. There might be some issues with it if any of the Items have different properties, but I'm OK with ignoring that for now.

Just confirming: this doesn't do Arrow to Parquet yet? We need a helper that writes the GeoParquet metadata?

Comment on lines +59 to +69
# TODO: Handle STAC items with different schemas
# This will fail if any of the items is missing a field since the arrays
# will be different lengths.
d = defaultdict(list)

for item in items:
    for k, v in item.items():
        d[k].append(v)

arrays = {k: pa.array(v) for k, v in d.items()}
t = pa.table(arrays)
kylebarron (Collaborator Author):
I don't think you even have to do this...? You're manually constructing a dictionary of arrays, when you can pass the list of dicts directly.

import pyarrow as pa
d1 = {
    'a': 1,
    'b': {
        'c': 'foo'
    }
}
d2 = {
    'a': 2,
    'b': {
        'c': 'bar'
    }
}
pa.array([d1, d2])
# <pyarrow.lib.StructArray object at 0x12665ba60>
# -- is_valid: all not null
# -- child 0 type: int64
#   [
#     1,
#     2
#   ]
# -- child 1 type: struct<c: string>
#   -- is_valid: all not null
#   -- child 0 type: string
#     [
#       "foo",
#       "bar"
#     ]

Obviously this requires that the dicts have the same schema, but that's a requirement for now anyway.

kylebarron (Collaborator Author):

> Obviously this requires that the dicts have the same schema, but that's a requirement for now anyway.

Actually, this handles schema resolution automatically, just like pa.json.read_json does. The only downside is that all the input data has to fit in memory at once for the schema resolution:

import pyarrow as pa
d1 = {
    'a': 1,
    'b': {
        'c': 'foo'
    }
}
d2 = {
    'a': 2,
    'b': {
        'c': 'bar',
        'd': 'baz'
    }
}
pa.array([d1, d2])
# <pyarrow.lib.StructArray object at 0x105db5000>
# -- is_valid: all not null
# -- child 0 type: int64
#   [
#     1,
#     2
#   ]
# -- child 1 type: struct<c: string, d: string>
#   -- is_valid: all not null
#   -- child 0 type: string
#     [
#       "foo",
#       "bar"
#     ]
#   -- child 1 type: string
#     [
#       null,
#       "baz"
#     ]

@kylebarron (Collaborator Author):

> will fail on GeoJSON that has different geometry types

Yeah... this just isn't going to work with the default Arrow reader, because it doesn't support automatically inferring type unions. An easy workaround is to manually preprocess the GeoJSON geometry into a WKB-encoded bytes object before passing the items into pa.array(), though of course that won't work with the direct-from-disk JSON reader.
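
A rough sketch of that preprocessing workaround, assuming shapely is available (geometry_to_wkb and the two example Items are purely illustrative):

import pyarrow as pa
from shapely.geometry import shape

def geometry_to_wkb(item: dict) -> dict:
    # Return a copy of a STAC Item with its GeoJSON geometry replaced by WKB
    # bytes, so pyarrow sees a single binary type regardless of geometry type.
    item = dict(item)
    item["geometry"] = shape(item["geometry"]).wkb
    return item

# Mixed geometry types that would otherwise trip up pa.array()'s inference:
items = [
    {"id": "a", "geometry": {"type": "Point", "coordinates": [0.0, 0.0]}},
    {"id": "b", "geometry": {"type": "Polygon", "coordinates": [
        [[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 0.0]]]}},
]
arr = pa.array([geometry_to_wkb(i) for i in items])
print(arr.type)  # e.g. struct<id: string, geometry: binary>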

> Just confirming: this doesn't do Arrow to Parquet yet? We need a helper that writes the GeoParquet metadata?

Yeah, you can use these couple of lines, importing from geopandas.io.arrow, if you don't want to copy the code in manually. Then you can just use pyarrow.parquet.write_table once the table has the geo metadata on it.
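
If you'd rather skip the geopandas helper entirely, the core of it is attaching a "geo" entry to the Arrow schema metadata before writing. A minimal sketch, with the metadata following the GeoParquet 1.0.0 spec but leaving geometry_types, bbox, and crs unfilled (write_geoparquet is a hypothetical helper, not part of this PR):

import json
import pyarrow as pa
import pyarrow.parquet as pq

def write_geoparquet(table: pa.Table, path: str, geometry_column: str = "geometry") -> None:
    # Minimal file-level GeoParquet metadata; real code should also derive
    # geometry_types, bbox, crs, etc. from the data.
    geo_metadata = {
        "version": "1.0.0",
        "primary_column": geometry_column,
        "columns": {geometry_column: {"encoding": "WKB", "geometry_types": []}},
    }
    existing = table.schema.metadata or {}
    metadata = {**existing, b"geo": json.dumps(geo_metadata).encode("utf-8")}
    pq.write_table(table.replace_schema_metadata(metadata), path)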

Comment on lines +102 to +104
struct_arr = pa.StructArray.from_arrays(
    batch.columns, fields=properties_column_fields
)
kylebarron (Collaborator Author):

Note to self, this is slightly clearer.

Suggested change:
- struct_arr = pa.StructArray.from_arrays(
-     batch.columns, fields=properties_column_fields
- )
+ struct_arr = batch.to_struct_array()

@TomAugspurger merged commit dd2c6a8 into stac-utils:main on Apr 17, 2024
@kylebarron (Collaborator Author):

#37 was intended to supersede this PR, so this PR wasn't intended to be merged.

It's also a little hard to inspect the commit history because I'm so used to squashing PRs. I'd be strongly in favor of updating the repo settings to enforce squash merges, if you were so inclined.

@TomAugspurger (Collaborator):

Ahhh my bad. Yeah, I do want a linear commit history so I'll require squash merges (I need to get permission on this repo first...).

@kylebarron (Collaborator Author) commented Apr 17, 2024

Since this was merged after #37, I actually can't tell if merging this PR had any effect. I seem to still see my latest changes from #37 in the main branch.

I guess it only had the effect of muddling the commit history a bit more?

Requiring squash merges is always the first thing I do on any repo I create 😄. I hate that it's not the default.

@TomAugspurger (Collaborator):

FWIW, I didn't actually do anything on this PR. GitHub apparently decided it was a subset of #37 when I merged that and auto-closed this?

Squash merges are now the only allowed option in this repo.

And you have write permission now too!

@kylebarron (Collaborator Author):

> FWIW, I didn't actually do anything on this PR. GitHub apparently decided it was a subset of #37 when I merged that and auto-closed this?

Ok, that makes more sense! But it's weird that (at least for me) GitHub shows this PR as merged rather than closed.
