stac-sprint updates #27
Conversation
For example, you can read just a subset of the columns with this:

import pyarrow.parquet as pq

columns = [
    "assets.red.href",
    "assets.green.href",
    "assets.blue.href",
    "geometry",
    "created",
    "eo:cloud_cover",
]
pq.read_table("tmp.parquet", columns=columns)
I'll say the state of this PR is:
I was playing around with some of this with an eye towards integrating pyarrow for reading data for the pypgstac loader, and one issue I found right away is that it's not just Parquet that needs the output schema up front. As far as I can tell, just using pyarrow.json.read_json() against item ndjson records will fail on GeoJSON that has mixed geometry types.
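As a hedged illustration (not from this PR), pyarrow.json.read_json can be handed an explicit schema through ParseOptions so that it doesn't have to infer one; the file name and fields below are made up. This only helps when a single schema can describe every record, so it still doesn't solve the mixed geometry-type problem, which is what motivates the WKB preprocessing discussed further down.

import pyarrow as pa
import pyarrow.json as pj

# Hypothetical subset of a STAC item schema; real items have many more fields.
schema = pa.schema([
    ("id", pa.string()),
    ("collection", pa.string()),
    ("properties", pa.struct([
        ("datetime", pa.string()),
        ("eo:cloud_cover", pa.float64()),
    ])),
])
opts = pj.ParseOptions(explicit_schema=schema, unexpected_field_behavior="ignore")
table = pj.read_json("items.ndjson", parse_options=opts)  # one JSON object per line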
@kylebarron bc40430 has a commit to convert the items to a pyarrow Table in memory rather than through JSON. There might be some issues with it if any of the items have different properties, but I'm OK with ignoring that for now. Just confirming: this doesn't do Arrow to Parquet yet? We need a helper that writes the GeoParquet metadata?
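For that "helper that writes the GeoParquet metadata", a minimal sketch could look like the following. It assumes the geometry column is already WKB-encoded binary, and the metadata fields shown are simplified; the GeoParquet spec is the authority on what actually belongs there.

import json
import pyarrow as pa
import pyarrow.parquet as pq

def write_geoparquet(table: pa.Table, path: str, geometry_column: str = "geometry") -> None:
    # GeoParquet stores its file-level metadata as a JSON string under the
    # "geo" key of the Parquet schema metadata.
    geo_meta = {
        "version": "1.0.0",
        "primary_column": geometry_column,
        "columns": {geometry_column: {"encoding": "WKB", "geometry_types": []}},
    }
    existing = table.schema.metadata or {}
    metadata = {**existing, b"geo": json.dumps(geo_meta).encode("utf-8")}
    pq.write_table(table.replace_schema_metadata(metadata), path)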
# TODO: Handle STAC items with different schemas
# This will fail if any of the items is missing a field since the arrays
# will be different lengths.
d = defaultdict(list)

for item in items:
    for k, v in item.items():
        d[k].append(v)

arrays = {k: pa.array(v) for k, v in d.items()}
t = pa.table(arrays)
I don't think you even have to do this...? You're manually constructing a dictionary of arrays, when you can pass the list of dicts directly.
import pyarrow as pa
d1 = {
'a': 1,
'b': {
'c': 'foo'
}
}
d2 = {
'a': 2,
'b': {
'c': 'bar'
}
}
pa.array([d1, d2])
# <pyarrow.lib.StructArray object at 0x12665ba60>
# -- is_valid: all not null
# -- child 0 type: int64
# [
# 1,
# 2
# ]
# -- child 1 type: struct<c: string>
# -- is_valid: all not null
# -- child 0 type: string
# [
# "foo",
# "bar"
# ]
obviously this requires that the dicts have the same schema, but this is a requirement for now anyways
obviously this requires that the dicts have the same schema, but this is a requirement for now anyways
Actually, this handles schema resolution automatically, just like pa.json.read_json does. The only downside is that all the input data has to fit in memory at once for the schema resolution:
import pyarrow as pa
d1 = {
'a': 1,
'b': {
'c': 'foo'
}
}
d2 = {
'a': 2,
'b': {
'c': 'bar',
'd': 'baz'
}
}
pa.array([d1, d2])
# <pyarrow.lib.StructArray object at 0x105db5000>
# -- is_valid: all not null
# -- child 0 type: int64
# [
# 1,
# 2
# ]
# -- child 1 type: struct<c: string, d: string>
# -- is_valid: all not null
# -- child 0 type: string
# [
# "foo",
# "bar"
# ]
# -- child 1 type: string
# [
# null,
# "baz"
# ]
Yeah... this just isn't going to work with the default arrow reader, because it doesn't support automatically inferring type unions. An easy workaround can be to manually preprocess the GeoJSON geometry into a WKB-encoded binary value before building the Arrow table.
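A minimal sketch of that preprocessing, assuming shapely is available (the helper name is made up): replacing each item's nested GeoJSON geometry dict with WKB bytes gives every item the same binary-typed geometry field, so schema resolution no longer has to unify differently-nested coordinate lists.

import pyarrow as pa
from shapely.geometry import shape

def items_to_table(items: list[dict]) -> pa.Table:
    prepared = []
    for item in items:
        item = dict(item)  # shallow copy so the caller's items aren't mutated
        item["geometry"] = shape(item["geometry"]).wkb  # GeoJSON dict -> WKB bytes
        prepared.append(item)
    return pa.Table.from_pylist(prepared)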
Yeah, you can use these couple lines, importing from
struct_arr = pa.StructArray.from_arrays(
    batch.columns, fields=properties_column_fields
)
Note to self, this is slightly clearer
struct_arr = batch.to_struct_array()
#37 was intended to supersede this PR, so this PR wasn't intended to be merged. It's also a little hard to inspect the commit history because I'm so used to squashing PRs. I'd be strongly in favor of updating the repo settings to enforce squash merges if you were so inclined.
Ahhh my bad. Yeah, I do want a linear commit history so I'll require squash merges (I need to get permission on this repo first...).
Since this was merged after #37, I actually can't tell if merging this PR had any effect. I seem to still see my latest changes from #37 in the main branch. I guess it only had the effect of muddling the commit history a bit more? Requiring squash merges is always the first thing I do on any repo I create 😄. I hate that it's not the default.
FWIW, I didn't actually do anything on this PR. GitHub apparently decided it was a subset of #37 when I merged that and auto-closed this? Squash merges are now the only allowed option in this repo. And you have write permission now too!
Ok, that makes more sense! But it's weird that (at least for me) GitHub shows this PR as merged and not closed.
pyarrow.json.read_json() to construct STAC table from newline-delimited JSON
Todo: