
stac-sprint updates #27

Merged: 9 commits into stac-utils:main on Apr 17, 2024

Conversation

@kylebarron (Collaborator) commented Sep 26, 2023

  • Use pyarrow.json.read_json() to construct the STAC table from newline-delimited JSON (see the sketch below)
  • Bring properties up to the top level
  • Convert GeoJSON geometry to WKB
  • Parse timestamp columns

Todo:

  • Ensure geoparquet metadata
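
As a bit of context for the first bullet, a minimal sketch of the ndjson-to-Arrow step (not the code in this PR; "items.ndjson" is a placeholder path):

from pyarrow import json as pa_json

# read_json infers one Arrow schema across every line of the file, giving a
# struct-typed column per nested STAC field (e.g. "properties", "assets").
table = pa_json.read_json("items.ndjson")
print(table.schema)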

@kylebarron kylebarron marked this pull request as draft September 26, 2023 20:00
@kylebarron (Collaborator Author):

With the stac-geoparquet layout we're spec-ing out, if you want RGB COG URLs, the only data you have to load are those three asset columns (plus whatever you want to filter on); every other column never even gets read from disk. On this Parquet table of 18,000 STAC Items, reading these 6 columns takes 250 ms:

import pyarrow.parquet as pq
columns = [
    "assets.red.href",
    "assets.green.href",
    "assets.blue.href",
    "geometry",
    "created",
    "eo:cloud_cover",
]
pq.read_table("tmp.parquet", columns=columns)
pyarrow.Table
href: string
href: string
href: string
geometry: binary
created: timestamp[us, tz=UTC]
eo:cloud_cover: double
----
href: [["https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/13/R/DN/2022/9/S2A_13RDN_20220927_0_L2A/B04.tif","https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/13/R/DN/2022/9/S2A_13RDN_20220907_0_L2A/B04.tif","https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/13/R/DN/2022/9/S2B_13RDN_20220912_0_L2A/B04.tif","https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/13/R/DN/2022/9/S2B_13RDN_20220915_0_L2A/B04.tif","https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/13/R/DN/2022/9/S2A_13RDN_20220920_0_L2A/B04.tif",...,"https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/13/P/BM/2022/5/S2B_13PBM_20220515_0_L2A/B04.tif","https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/13/P/BM/2022/5/S2B_13PBM_20220525_0_L2A/B04.tif","https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/13/P/BM/2022/5/S2A_13PBM_20220530_0_L2A/B04.tif","https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/13/P/BM/2022/5/S2B_13PBM_20220505_0_L2A/B04.tif","https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/13/P/BM/2022/5/S2A_13PBM_20220510_0_L2A/B04.tif"]]
href: [["https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/13/R/DN/2022/9/S2A_13RDN_20220927_0_L2A/B03.tif","https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/13/R/DN/2022/9/S2A_13RDN_20220907_0_L2A/B03.tif","https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/13/R/DN/2022/9/S2B_13RDN_20220912_0_L2A/B03.tif","https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/13/R/DN/2022/9/S2B_13RDN_20220915_0_L2A/B03.tif","https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/13/R/DN/2022/9/S2A_13RDN_20220920_0_L2A/B03.tif",...,"https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/13/P/BM/2022/5/S2B_13PBM_20220515_0_L2A/B03.tif","https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/13/P/BM/2022/5/S2B_13PBM_20220525_0_L2A/B03.tif","https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/13/P/BM/2022/5/S2A_13PBM_20220530_0_L2A/B03.tif","https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/13/P/BM/2022/5/S2B_13PBM_20220505_0_L2A/B03.tif","https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/13/P/BM/2022/5/S2A_13PBM_20220510_0_L2A/B03.tif"]]
href: [["https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/13/R/DN/2022/9/S2A_13RDN_20220927_0_L2A/B02.tif","https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/13/R/DN/2022/9/S2A_13RDN_20220907_0_L2A/B02.tif","https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/13/R/DN/2022/9/S2B_13RDN_20220912_0_L2A/B02.tif","https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/13/R/DN/2022/9/S2B_13RDN_20220915_0_L2A/B02.tif","https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/13/R/DN/2022/9/S2A_13RDN_20220920_0_L2A/B02.tif",...,"https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/13/P/BM/2022/5/S2B_13PBM_20220515_0_L2A/B02.tif","https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/13/P/BM/2022/5/S2B_13PBM_20220525_0_L2A/B02.tif","https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/13/P/BM/2022/5/S2A_13PBM_20220530_0_L2A/B02.tif","https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/13/P/BM/2022/5/S2B_13PBM_20220505_0_L2A/B02.tif","https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/13/P/BM/2022/5/S2A_13PBM_20220510_0_L2A/B02.tif"]]
geometry: [[01030000000100000007000000FCFBBA88D7625AC0057F4F664FD43D40A3B1840089395AC028DD456296D43D402F9809EE98395AC0A3B1E961E5D63C401AE5A48BF5725AC0D064BCD04DD63C40857E2C58426F5AC0216E47488B0F3D4065180718B16D5AC068C6306CC2293D40FCFBBA88D7625AC0057F4F664FD43D40,01030000000100000006000000E45ECAAF8E635AC028567B594CD43D40A3B1840089395AC028DD456296D43D402F9809EE98395AC0A3B1E961E5D63C403C1667D6A0735AC04D71A5BE49D63C40F5620436A96F5AC03408B2C599133D40E45ECAAF8E635AC028567B594CD43D40,010300000001000000070000009AC32445A1625AC0B5DF9A4A50D43D40A3B1840089395AC028DD456296D43D402F9809EE98395AC0A3B1E961E5D63C40CF37AB69B7725AC05D805F474FD63C409E9C2949236F5AC0768B7424DC0D3D4013F5FD1E166D5AC0334C1518DA2F3D409AC32445A1625AC0B5DF9A4A50D43D40,010300000001000000050000009B824B2744825AC03B5F09FD8ED33D402744665FBE3C5AC0215ECF4598D43D4066CDAE08C74E5AC0FE3BF35E39D73C40B840AC11A1815AC04F23368923D63C409B824B2744825AC03B5F09FD8ED33D40,010300000001000000050000009B824B2744825AC03B5F09FD8ED33D404DAB1BB9D23C5AC0A4FABC4D98D43D40F5D2CB2EDB4E5AC0174E815039D73C40CC40AC11A1815AC03B43368923D63C409B824B2744825AC03B5F09FD8ED33D40,...,]]
created: [[2022-11-06 07:34:19.473097,2022-11-03 20:46:10.304505,2022-11-06 10:10:45.866648,2022-11-06 07:45:15.042230,2022-11-06 07:21:45.102993,...,2022-11-06 06:20:22.989608,2022-11-06 09:44:47.224868,2022-11-06 06:20:42.045835,2022-11-03 17:34:44.331725,2022-11-05 21:18:55.467216]]
eo:cloud_cover: [[29.692835,68.197918,0.258553,0.003515,21.436284,...,5.392413,99.998921,78.555667,93.079418,95.566863]]

@kylebarron (Collaborator Author):

I'll say the state of this PR is:

  • WIP efficient STAC -> GeoParquet and GeoParquet -> STAC conversion

    • Most efficient with newline-delimited JSON STAC input on disk
    • Should be much more efficient than the existing implementation; it doesn't go through Pandas or Python dictionaries
  • WIP chunked STAC -> GeoParquet for massive STAC collections

  • 💩 Schema resolution. This is the hard part

    • Parquet needs to know the output schema before starting to write the Parquet file
    • Every Parquet batch needs to have the exact same schema
    • Tested on all sentinel-cogs STAC Items from UTM zone 13. One Item had data_coverage instead of sentinel:data_coverage, and writing crashed because a new column appeared in the table.
    • Schema resolution (ensuring a single schema for all items) is really hard; one possible approach is sketched after this list.
  • Running the schema resolution process can also act as a check that your static collection is "tidy", so it may be useful in itself for collection-level statistics.
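
This PR doesn't do this yet, but one way to approach the schema-resolution problem, assuming all chunks fit in memory, is to unify the chunk schemas up front and pad missing fields with nulls before writing. A rough sketch (unify_and_write and tables are hypothetical names, not part of this PR):

import pyarrow as pa
import pyarrow.parquet as pq

def unify_and_write(tables: list[pa.Table], path: str) -> None:
    # One schema for the whole file: unify_schemas merges field lists and
    # raises if the same field shows up with incompatible types.
    schema = pa.unify_schemas([t.schema for t in tables])

    with pq.ParquetWriter(path, schema) as writer:
        for t in tables:
            # Pad fields missing from this chunk with all-null arrays so every
            # batch written to the Parquet file has exactly the same schema.
            columns = [
                t.column(f.name).cast(f.type)
                if f.name in t.schema.names
                else pa.nulls(len(t), type=f.type)
                for f in schema
            ]
            writer.write_table(pa.table(columns, schema=schema))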

@bitner (Contributor) commented Oct 2, 2023

I was playing around with some of this with an eye toward integrating pyarrow for reading data for the pypgstac loader, and one issue I found right away is that it's not just Parquet that needs the output schema up front. As far as I can tell, just using pyarrow.json.read_json() against an ndjson file of Items will fail on GeoJSON that has different geometry types.

@TomAugspurger (Collaborator) commented Oct 6, 2023

@kylebarron bc40430 has a commit that converts the Items to a pyarrow Table in memory rather than going through JSON. There might be some issues with it if any of the Items have different properties, but I'm OK with ignoring that for now.

Just confirming: this doesn't do Arrow to Parquet yet? We need a helper that writes the GeoParquet metadata?

Comment on lines +59 to +69
# TODO: Handle STAC items with different schemas
# This will fail if any of the items is missing a field since the arrays
# will be different lengths.
d = defaultdict(list)

for item in items:
    for k, v in item.items():
        d[k].append(v)

arrays = {k: pa.array(v) for k, v in d.items()}
t = pa.table(arrays)
kylebarron (Collaborator Author):
I don't think you even have to do this...? You're manually constructing a dictionary of arrays, when you can pass the list of dicts directly.

import pyarrow as pa
d1 = {
    'a': 1,
    'b': {
        'c': 'foo'
    }
}
d2 = {
    'a': 2,
    'b': {
        'c': 'bar'
    }
}
pa.array([d1, d2])
# <pyarrow.lib.StructArray object at 0x12665ba60>
# -- is_valid: all not null
# -- child 0 type: int64
#   [
#     1,
#     2
#   ]
# -- child 1 type: struct<c: string>
#   -- is_valid: all not null
#   -- child 0 type: string
#     [
#       "foo",
#       "bar"
#     ]

Obviously this requires that the dicts have the same schema, but that's a requirement for now anyway.

kylebarron (Collaborator Author):

> Obviously this requires that the dicts have the same schema, but that's a requirement for now anyway.

Actually, this handles schema resolution automatically, just like pa.json.read_json does. The only downside is that all the input data has to fit in memory at once for the schema resolution:

import pyarrow as pa
d1 = {
    'a': 1,
    'b': {
        'c': 'foo'
    }
}
d2 = {
    'a': 2,
    'b': {
        'c': 'bar',
        'd': 'baz'
    }
}
pa.array([d1, d2])
# <pyarrow.lib.StructArray object at 0x105db5000>
# -- is_valid: all not null
# -- child 0 type: int64
#   [
#     1,
#     2
#   ]
# -- child 1 type: struct<c: string, d: string>
#   -- is_valid: all not null
#   -- child 0 type: string
#     [
#       "foo",
#       "bar"
#     ]
#   -- child 1 type: string
#     [
#       null,
#       "baz"
#     ]

@kylebarron (Collaborator Author):

> will fail on GeoJSON that has different geometry types

Yeah... this just isn't going to work with the default Arrow reader, because it doesn't support automatically inferring type unions. An easy workaround is to manually preprocess the GeoJSON geometry into a WKB-encoded bytes object before passing the items into pa.array(), though of course that won't work with the direct-from-disk JSON reader.
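
A rough sketch of that preprocessing workaround, assuming shapely is available (geometry_to_wkb and the two example Items are purely illustrative):

import pyarrow as pa
from shapely.geometry import shape

def geometry_to_wkb(item: dict) -> dict:
    # Return a copy of a STAC Item with its GeoJSON geometry replaced by WKB
    # bytes, so pyarrow sees a single binary type regardless of geometry type.
    item = dict(item)
    item["geometry"] = shape(item["geometry"]).wkb
    return item

# Mixed geometry types that would otherwise trip up pa.array()'s inference:
items = [
    {"id": "a", "geometry": {"type": "Point", "coordinates": [0.0, 0.0]}},
    {"id": "b", "geometry": {"type": "Polygon", "coordinates": [
        [[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 0.0]]]}},
]
arr = pa.array([geometry_to_wkb(i) for i in items])
print(arr.type)  # e.g. struct<id: string, geometry: binary>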

> Just confirming: this doesn't do Arrow to Parquet yet? We need a helper that writes the GeoParquet metadata?

Yeah, you can use these couple of lines, importing from geopandas.io.arrow, if you don't want to copy the code in manually. Then you can just use pyarrow.parquet.write_table once the table has the geo metadata on it.
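
If you'd rather skip the geopandas helper entirely, the core of it is attaching a "geo" entry to the Arrow schema metadata before writing. A minimal sketch, with the metadata following the GeoParquet 1.0.0 spec but leaving geometry_types, bbox, and crs unfilled (write_geoparquet is a hypothetical helper, not part of this PR):

import json
import pyarrow as pa
import pyarrow.parquet as pq

def write_geoparquet(table: pa.Table, path: str, geometry_column: str = "geometry") -> None:
    # Minimal file-level GeoParquet metadata; real code should also derive
    # geometry_types, bbox, crs, etc. from the data.
    geo_metadata = {
        "version": "1.0.0",
        "primary_column": geometry_column,
        "columns": {geometry_column: {"encoding": "WKB", "geometry_types": []}},
    }
    existing = table.schema.metadata or {}
    metadata = {**existing, b"geo": json.dumps(geo_metadata).encode("utf-8")}
    pq.write_table(table.replace_schema_metadata(metadata), path)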

Comment on lines +102 to +104
struct_arr = pa.StructArray.from_arrays(
    batch.columns, fields=properties_column_fields
)
kylebarron (Collaborator Author):

Note to self, this is slightly clearer.

Suggested change:
- struct_arr = pa.StructArray.from_arrays(
-     batch.columns, fields=properties_column_fields
- )
+ struct_arr = batch.to_struct_array()

@TomAugspurger merged commit dd2c6a8 into stac-utils:main on Apr 17, 2024
@kylebarron (Collaborator Author):

#37 was intended to supersede this PR, so this PR wasn't intended to be merged.

It's also a little hard to inspect the commit history because I'm so used to squashing PRs. I'd be strongly in favor of updating the repo settings to enforce squash merges, if you were so inclined.

@TomAugspurger (Collaborator):

Ahhh my bad. Yeah, I do want a linear commit history so I'll require squash merges (I need to get permission on this repo first...).

@kylebarron (Collaborator Author) commented Apr 17, 2024

Since this was merged after #37, I actually can't tell if merging this PR had any effect. I seem to still see my latest changes from #37 in the main branch.

I guess it only had the effect of muddling the commit history a bit more?

Requiring squash merges is always the first thing I do on any repo I create 😄. I hate that it's not the default.

@TomAugspurger (Collaborator):

FWIW, I didn't actually do anything on this PR. GitHub apparently decided it was a subset of #37 when I merged that and auto-closed this?

Squash merges are now the only allowed option in this repo.

And you have write permission now too!

@kylebarron (Collaborator Author):

> FWIW, I didn't actually do anything on this PR. GitHub apparently decided it was a subset of #37 when I merged that and auto-closed this?

Ok, that makes more sense! But it's weird that (at least for me) GitHub shows this PR as merged rather than closed.
