Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update spec to document storing all GeoJSON properties as WKB #51

Merged
merged 1 commit into from
Jun 2, 2024

Conversation

kylebarron
Copy link
Collaborator

Some extensions also store GeoJSON geometries. I'm primarily thinking of proj:geometry, but maybe there are others?

This updates the spec to recommend all GeoJSON geometries be stored as WKB. It's not uncommon for a STAC Collection that mostly holds Polygon geometries to occasionally have some MultiPolygon geometries, for example over the international date line. It's much easier to handle these edge cases when all geometries are stored as WKB.

@TomAugspurger
Copy link
Collaborator

I think this makes sense to me (since this is geo-parquet, we already know the reader can parse WKB), but let's give @cholmes a week or so to comment.

The main difficulty this imposes on readers / writers is knowing which fields ought to be encoded as WKB. At some point it might be helpful to collect the list of known extensions writing geometry into a document (maybe here).

@kylebarron
Copy link
Collaborator Author

The main difficulty this imposes on readers / writers is knowing which fields ought to be encoded as WKB. At some point it might be helpful to collect the list of known extensions writing geometry into a document (maybe here).

I agree. I don't know STAC well enough... are there any other extensions that store a geometry?

@bitner
Copy link
Contributor

bitner commented May 20, 2024

Whether there are any other common extensions now that have a geometry, I think we need to assume that any field could include geometry whether in an official extension

I think this goes back to having two options:

  1. having the schema up front which would be much better performing and we could just parse the schema to know what fields need to be converted to/from wkb
  2. If we don't have the schema up front, I think we need to assume that any field could be a geometry and we just need to look for a type and coordinates fields - this will again be much slower than having the schema (or a superset of the schema) up front

@cholmes
Copy link
Contributor

cholmes commented May 20, 2024

I think this makes sense to me (since this is geo-parquet, we already know the reader can parse WKB), but let's give @cholmes a week or so to comment.

sorry! Missed this before - too many github notifications. Looks good to me. I don't think there are other STAC extensions that store geometries, indeed it took me a moment to remember that the 'proj' one did, and I think that one is sorta special - it's an alternate representation of the same core geometry. STAC is mostly not about the geometry, it's about the language for the other properties.

I think it may even be reasonable to assume that no other fields would include a geometry - I certainly don't see the value of adding a geometry in most extensions. If someone has an extra geometry and really wants it recognized then they can make the case for it.

@kylebarron
Copy link
Collaborator Author

  1. having the schema up front which would be much better performing and we could just parse the schema to know what fields need to be converted to/from wkb

I want to highlight that this in particular is more involved than just knowing the physical schema, because it requires first knowing which keys in the JSON to convert to WKB.

We can't directly and reliably parse any GeoJSON geometry as-is if there is any chance of having multiple levels of coordinates nesting. In theory we could do this via a union type, where we define that the geometry type will be a union over List[Float64] for points or List[List[Float64]] for MultiPoint/LineString, etc. But no Arrow libraries support parsing JSON with a union type in the schema. Arrow C++/pyarrow error when a union is passed in the schema. And I asked on the arrow-rs discord and not only is it not currently supported, but they said it would be incredibly difficult to support (and would overly complicate the json reading code)

  1. If we don't have the schema up front, I think we need to assume that any field could be a geometry and we just need to look for a type and coordinates fields - this will again be much slower than having the schema (or a superset of the schema) up front

huh, so we literally would just scan all the nested keys of every json input looking for something that has type and coordinates and nothing else? I suppose that would work, but it sounds really slow.

Maybe in practice for now, we should assume that proj:geometry will be the only other geometry field, and revisit this if we find STAC data in the wild with other geometry fields.

@TomAugspurger
Copy link
Collaborator

I think we can ignore all the worries I raised earlier. Initially, I thought that we might have a coordination problem between different readers and writers all having to agree on which fields are WKB encoded. But fortunately, we've solved that with geoparquet's columns file metadata (nice how that works out). As long as the writer correctly captures the fact that proj:geometry is a geometry column in the geoparquet metadata, then any geoparquet reader will be able to read it.

We can certainly still recommend that writers encoded geometries (and maybe so maybe the "should" in the PR is the right word to use).

@kylebarron
Copy link
Collaborator Author

As long as the writer correctly captures the fact that proj:geometry is a geometry column in the geoparquet metadata, then any geoparquet reader will be able to read it.

That's a good thing to bring up. We don't currently check for this when writing GeoParquet. Another question is: should this proj:geometry column in the GeoParquet metadata have a CRS assigned? Often times the CRS will be different for each row in the column, however.

@kylebarron
Copy link
Collaborator Author

With #56 we mark a proj:geometry column as a geometry column in the GeoParquet metadata. I think this PR is good to merge then.

@kylebarron kylebarron merged commit c968f15 into main Jun 2, 2024
1 check passed
@kylebarron kylebarron deleted the kyle/proj-geometry-wkb-spec branch June 2, 2024 13:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants