# Add GeoArrow encoding as an option to the specification #189
New example file (+14 lines): `geo` metadata for a geometry column using the `"point"` encoding.
```
{
  "geo": {
    "columns": {
      "geometry": {
        "encoding": "point",
        "geometry_types": [
          "Point"
        ]
      }
    },
    "primary_column": "geometry",
    "version": "1.1.0-dev"
  }
}
```
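(Not part of the diff.) For context, this JSON lives under the `geo` key of the Parquet file's key-value metadata. A minimal pyarrow sketch for inspecting it; the file name is a made-up placeholder for a file carrying the metadata above:

```python
import json
import pyarrow.parquet as pq

# GeoParquet stores its metadata as a JSON string under the b"geo" key
# of the Parquet file key-value metadata.
file_meta = pq.read_metadata("example_point.parquet")  # hypothetical file name
geo = json.loads(file_meta.metadata[b"geo"])
assert geo["columns"]["geometry"]["encoding"] == "point"
print(geo["primary_column"], geo["version"])
```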
Changes to the GeoParquet specification document (this is version 1.1.0-dev of the GeoParquet specification):

## Geometry columns

Removed:
> Geometry columns MUST be stored using the `BYTE_ARRAY` parquet type. They MUST be encoded as [WKB](https://en.wikipedia.org/wiki/Well-known_text_representation_of_geometry#Well-known_binary).
>
> Implementation note: when using the ecosystem of Arrow libraries, Parquet types such as `BYTE_ARRAY` might not be directly accessible. Instead, the corresponding Arrow data type can be `Arrow::Type::BINARY` (for arrays whose elements can be indexed through a 32-bit index) or `Arrow::Type::LARGE_BINARY` (64-bit index). It is recommended that GeoParquet readers are compatible with both data types, and writers preferably use `Arrow::Type::BINARY` (thus limiting row groups to content smaller than 2 GB) for broader compatibility.

Added:

> Geometry columns MUST be encoded as [WKB](https://en.wikipedia.org/wiki/Well-known_text_representation_of_geometry#Well-known_binary) or using the single-geometry type encodings based on the [GeoArrow](https://geoarrow.org/) specification.
>
> See the [encoding](#encoding) section below for more details.

**Review comment:** Can we keep the information of the first sentence somewhere? (i.e. that for a WKB encoding, the geometry column MUST be stored using the `BYTE_ARRAY` parquet type; you kept the "Implementation note" just below that also mentions it.)

**Reply:** Done! We could also add some more in here about the Parquet physical description of how nesting works (but maybe in a future PR?)
In the table of per-column metadata fields, the description of the `encoding` field is updated (it previously read: "Currently only `"WKB"` is supported."):

| Field Name | Type | Description |
| -------------- | ------------ | ----------- |
| encoding | string | **REQUIRED.** Name of the geometry encoding format. Currently `"WKB"`, `"point"`, `"linestring"`, `"polygon"`, `"multipoint"`, `"multilinestring"`, and `"multipolygon"` are supported. |
| geometry_types | \[string] | **REQUIRED.** The geometry types of all geometries, or an empty array if they are not known. |
| crs | object\|null | [PROJJSON](https://proj.org/specifications/projjson.html) object representing the Coordinate Reference System (CRS) of the geometry. If the field is not provided, the default CRS is [OGC:CRS84](https://www.opengis.net/def/crs/OGC/1.3/CRS84), which means the data in this column must be stored in longitude, latitude based on the WGS84 datum. |
| orientation | string | Winding order of exterior ring of polygons. If present must be `"counterclockwise"`; interior rings are wound in opposite order. If absent, no assertions are made regarding the winding order. |
#### encoding

Removed:

> This is the binary format that the geometry is encoded in. The string `"WKB"`, signifying Well Known Binary, is the only current option, but future versions of the spec may support alternative encodings. This SHOULD be the ["OpenGIS® Implementation Specification for Geographic information - Simple feature access - Part 1: Common architecture"](https://portal.ogc.org/files/?artifact_id=18241) WKB representation (using codes for 3D geometry types in the \[1001,1007\] range). This encoding is also consistent with the one defined in the ["ISO/IEC 13249-3:2016 (Information technology - Database languages - SQL multimedia and application packages - Part 3: Spatial)"](https://www.iso.org/standard/60343.html) standard.

Added:

This is the memory layout used to encode geometries in the geometry column.
Supported values:
- `"WKB"`
- one of `"point"`, `"linestring"`, `"polygon"`, `"multipoint"`, `"multilinestring"`, `"multipolygon"`
##### WKB

The preferred option for maximum portability is `"WKB"`, signifying [Well Known Binary](https://en.wikipedia.org/wiki/Well-known_text_representation_of_geometry#Well-known_binary). This SHOULD be the ["OpenGIS® Implementation Specification for Geographic information - Simple feature access - Part 1: Common architecture"](https://portal.ogc.org/files/?artifact_id=18241) WKB representation (using codes for 3D geometry types in the \[1001,1007\] range). This encoding is also consistent with the one defined in the ["ISO/IEC 13249-3:2016 (Information technology - Database languages - SQL multimedia and application packages - Part 3: Spatial)"](https://www.iso.org/standard/60343.html) standard.

Note that the current version of the spec only allows for a subset of WKB: 2D or 3D geometries of the standard geometry types (Point, LineString, Polygon, MultiPoint, MultiLineString, MultiPolygon, and GeometryCollection). This means that M values and non-linear geometry types are not yet supported.

WKB geometry columns MUST be stored using the `BYTE_ARRAY` parquet type.

Implementation note: when using WKB encoding with the ecosystem of Arrow libraries, Parquet types such as `BYTE_ARRAY` might not be directly accessible. Instead, the corresponding Arrow data type can be `Arrow::Type::BINARY` (for arrays whose elements can be indexed through a 32-bit index) or `Arrow::Type::LARGE_BINARY` (64-bit index). It is recommended that GeoParquet readers are compatible with both data types, and writers preferably use `Arrow::Type::BINARY` (thus limiting row groups to content smaller than 2 GB) for broader compatibility.
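(Not part of the diff.) A minimal sketch of the above with pyarrow and shapely, where the file name is assumed for illustration; `pa.binary()` maps to the Parquet `BYTE_ARRAY` physical type and to `Arrow::Type::BINARY`:

```python
import json
import pyarrow as pa
import pyarrow.parquet as pq
from shapely.geometry import Point

# WKB-encode the geometries into a 32-bit-offset binary array (BYTE_ARRAY).
wkb_array = pa.array([Point(1.0, 2.0).wkb, Point(3.0, 4.0).wkb], type=pa.binary())
table = pa.table({"geometry": wkb_array})

geo = {
    "version": "1.1.0-dev",
    "primary_column": "geometry",
    "columns": {"geometry": {"encoding": "WKB", "geometry_types": ["Point"]}},
}
# Attach the GeoParquet metadata under the "geo" key and write the file.
table = table.replace_schema_metadata({b"geo": json.dumps(geo).encode("utf-8")})
pq.write_table(table, "points_wkb.parquet")  # assumed file name
```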
##### Native encodings (based on GeoArrow)

Using the single-geometry type encodings (i.e., `"point"`, `"linestring"`, `"polygon"`, `"multipoint"`, `"multilinestring"`, `"multipolygon"`) may provide better performance and enable readers to leverage more features of the Parquet format to accelerate geospatial queries (e.g., row group-level min/max statistics). These encodings correspond to the extension name suffixes in the [GeoArrow metadata specification for extension names](https://geoarrow.org/extension-types#extension-names) and signify the memory layout used by the geometry column. GeoParquet uses the separated (struct) representation of coordinates for single-geometry type encodings because this encoding results in useful column statistics when row groups and/or files contain related features.
The actual coordinates of the geometries MUST be stored as native numbers, i.e. using the `DOUBLE` parquet type in a (repeated) group of fields (exact repetition depending on the geometry type).
For the `"point"` geometry type, this results in a struct of two fields for x | ||
and y coordinates (in case of 2D geometries): | ||
|
||
```
// "point" geometry column as simple field with two child fields for x and y
optional group geometry {
  required double x;
  required double y;
}
```
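(Not part of the diff.) A minimal pyarrow sketch that produces this `"point"` layout, with an assumed file name; marking the child fields `nullable=False` is what makes them come out as `required` in the Parquet schema:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Separated (struct) point representation: one required DOUBLE field per dimension.
point_type = pa.struct([
    pa.field("x", pa.float64(), nullable=False),
    pa.field("y", pa.float64(), nullable=False),
])
geometry = pa.array([{"x": 1.0, "y": 2.0}, {"x": 3.0, "y": 4.0}], type=point_type)

pq.write_table(pa.table({"geometry": geometry}), "points_native.parquet")
# Prints a schema matching the snippet above:
#   optional group geometry { required double x; required double y; }
print(pq.read_metadata("points_native.parquet").schema)
```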
For the other geometry types, those x and y coordinate values MUST be embedded in repeated groups (`LIST` logical parquet type). For example, for the `"multipolygon"` geometry type (the review comments on this snippet follow after the code block):
```
// "multipolygon" geometry column with multiple levels of nesting
optional group geometry (List) {
  // the parts of the MultiPolygon
  repeated group list {
    optional group element (List) {
      // the rings of one Polygon
      repeated group list {
        optional group element (List) {
          // the list of coordinates of one ring
          repeated group list {
            optional group element {
              required double x;
              required double y;
            }
          }
        }
      }
    }
  }
}
```
**Review comment (on `repeated group list`):** Is this always named `list`?

**Reply:** I mentioned this issue above at #189 (comment). The Parquet spec specifies that those names should be "list" and "element" (https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#lists), but it also mentions that for backwards compatibility you should be able to read lists with other names.

Until recently, PyArrow / Arrow C++ actually created such "non-compliant" files, but there is a flag to control this, for which the default has been changed to create compliant files by default since pyarrow 13.0 (apache/arrow#29781). Illustrating this:

```python
import pyarrow as pa
import pyarrow.parquet as pq

point_type = pa.struct(
    [pa.field("x", pa.float64(), nullable=False), pa.field("y", pa.float64(), nullable=False)]
)
polygon_type = pa.list_(
    pa.field("rings", pa.list_(pa.field("vertices", point_type, nullable=False)), nullable=False)
)
multipolygon_type = pa.list_(pa.field("polygons", polygon_type, nullable=False))
table = pa.table(
    {"geometry": pa.array([[[[{"x": 1.0, "y": 2.0}, {"x": 2.0, "y": 3.0}]]]], type=multipolygon_type)}
)
```

```python
>>> pq.write_table(table, "test_multipolygon_non_nullable-compliant.parquet")
>>> pq.read_metadata("test_multipolygon_non_nullable-compliant.parquet").schema
<pyarrow._parquet.ParquetSchema object at 0x7efda2333b40>
required group field_id=-1 schema {
  optional group field_id=-1 geometry (List) {
    repeated group field_id=-1 list {
      required group field_id=-1 element (List) {
        repeated group field_id=-1 list {
          required group field_id=-1 element (List) {
            repeated group field_id=-1 list {
              required group field_id=-1 element {
                required double field_id=-1 x;
                required double field_id=-1 y;
              }
            }
          }
        }
      }
    }
  }
}

>>> pq.write_table(table, "test_multipolygon_non_nullable-non_compliant.parquet", use_compliant_nested_type=False)
>>> pq.read_metadata("test_multipolygon_non_nullable-non_compliant.parquet").schema
<pyarrow._parquet.ParquetSchema object at 0x7efda135fd00>
required group field_id=-1 schema {
  optional group field_id=-1 geometry (List) {
    repeated group field_id=-1 list {
      required group field_id=-1 polygons (List) {
        repeated group field_id=-1 list {
          required group field_id=-1 rings (List) {
            repeated group field_id=-1 list {
              required group field_id=-1 vertices {
                required double field_id=-1 x;
                required double field_id=-1 y;
              }
            }
          }
        }
      }
    }
  }
}
```

Specifically for Arrow C++ (or other implementations that do this), we do store the original Arrow schema in the Parquet metadata, and so we could preserve the custom names on roundtrip. But it seems we are not doing that currently.

But this actually raises the question: for the GeoArrow spec itself, how "required" are those custom field names? Are you supposed to rename them after reading a compliant Parquet file that will not have those names?

To get back to that initial question: it will be something like `"list"` / `"element"`.

**Reply:** 👍 My implementation will likely rename field names to the GeoArrow prescribed ones.
**Review comment (on `optional group element (List)`):** This is verbose, but this matches how the Parquet docs specify a List type: https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#lists. In the Spatial Parquet paper (https://arxiv.org/pdf/2209.02158.pdf), they made this shorter, essentially by nesting repeated groups directly (translating their syntax to our layout). While that is clearer / more readable for a reader not super familiar with Parquet details, it also has value to stick with the way the Parquet docs do this (and e.g. how printing the Parquet schema of a file with pyarrow will also do this).
**Review comment:** Another element here is that the Parquet spec says to use the generic "list" and "element" field names, which means we don't use the more descriptive "polygons" / "rings" / "vertices" names as in GeoArrow (until recently, pyarrow would honor the names in the Arrow schema and write a "non-compliant" Parquet file, but now it defaults to writing a compliant one, which you can still turn off with `use_compliant_nested_type=False`).
// the list of coordinates of one ring | ||
repeated group list { | ||
optional group element { | ||
required double x; | ||
required double y; | ||
**Review comment (on lines +135 to +136):** At the moment I only made the inner x and y "required" here, but I think I should also mark the other sub-levels (for polygons/rings/vertices) as required? I think we said only the geometry itself can be null in the GeoArrow spec.

**Reply:** Changed all to "required" except for the top-level geometry group. Another question is how strictly this should be followed. For example, PyArrow will write those to Parquet as "optional" anyway, even if there are no nulls, unless you ensure your actual Arrow schema also explicitly uses non-nullable fields.

**Reply:** I think it would be a little strange if the schema written by pyarrow for a GeoArrow encoded array didn't match the text here. Does marking them as required have a storage optimization?

**Reply:** From a quick test with tiny data, you save a bit of space. PyArrow can write the schema as shown here, as long as you ensure your fields are marked as non-nullable in the Arrow schema. But I don't think we are being strict about doing that in our geoarrow implementations (in general in PyArrow this nullability flag is not used much, but I think other ecosystems use it more diligently).

**Reply:** We can probably make sure that everything except the outer layer is written as non-nullable fields regardless of how GeoArrow deals with that. In the proof-of-concept there's a step where the extension information is stripped, at which point we could also set the non-nullable-ness if this really does matter. Once wrapped in an extension type, I don't think the explicit non-nullable-ness of the storage is all that useful. The assumptions of buffer data are governed by the extension type, and we say there are no nulls allowed except at the outer level.

**Reply:** Looking at the Parquet docs on nulls (https://parquet.apache.org/docs/file-format/nulls/) and on data pages (https://parquet.apache.org/docs/file-format/data-pages/): essentially, if a field is required, it can skip some data being stored (the definition levels for that field). Although given that it is run-length encoded, this data will also be tiny in case there are no nulls (because then essentially it's just one constant being run-length encoded, I assume). See also https://arrow.apache.org/blog/2022/10/05/arrow-parquet-encoding-part-1/.

**Reply:** For the actual spec here, I think we can recommend `required`.

**Reply:** Yeah that sounds fine to me. Maybe say something that there MUST be no null values for child arrays, even if a field is marked as `optional`.

**Reply:** OK, see the last commit (88ae045), I tried to clarify that only top-level nulls are allowed, but that the schema itself can still use `optional` fields. (Another good reason to not be strict on this is that it is quite hard to create data that conforms to this with pyarrow. For example, if you want to convert existing data to a schema with proper `nullable=False` fields, just casting doesn't yet work in pyarrow: apache/arrow#33592.)
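(Not part of the diff.) To illustrate the row-group statistics motivation mentioned at the start of the native encodings section: with these encodings, each coordinate field is its own Parquet column chunk, so per-row-group min/max statistics can drive bounding-box filtering. A sketch, reusing the assumed file from the `"point"` example:

```python
import pyarrow.parquet as pq

# With the native encodings, geometry.x and geometry.y are separate Parquet
# column chunks, so per-row-group min/max statistics can be used to skip
# row groups that cannot intersect a query bounding box.
md = pq.read_metadata("points_native.parquet")  # assumed file name
for rg in range(md.num_row_groups):
    row_group = md.row_group(rg)
    for col in range(row_group.num_columns):
        chunk = row_group.column(col)
        stats = chunk.statistics
        if stats is not None and stats.has_min_max:
            print(chunk.path_in_schema, stats.min, stats.max)
```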
#### Coordinate axis order

The axis order of the coordinates in WKB stored in a GeoParquet file follows the de facto standard for axis order in WKB and is therefore always (x, y), where x is easting or longitude and y is northing or latitude. This ordering explicitly overrides the axis order as specified in the CRS. This follows the precedent of [GeoPackage](https://geopackage.org); see the [note in their spec](https://www.geopackage.org/spec130/#gpb_spec).
PR discussion:

**Comment:** Does this mean that a geometry column with mixed geometry types cannot be encoded as GeoArrow?
**Reply:** Not until apache/parquet-format#44 is merged (if ever).
**Reply:** It can (or will be, once we sort out the details of geoarrow/geoarrow#43), although it's unclear exactly how we'd do that in Parquet or if it would be useful in Parquet. In any case, it would be a future addition!
**Comment:** Columns with mixed geometry values are quite common for most query engines with geospatial support. Most of the time geometry columns have the umbrella type "geometry" or "geography", and it is not practical to first resolve the subtypes of the geometries before writing out parquet files. I'd look forward to a columnar encoding supporting mixed geometry types as well as geometry collections.
**Reply:** Arrow and Parquet are two different specs. Arrow has a union type, which allows for mixed geometry types in a single column while maintaining constant-time access to any coordinate. Parquet does not today have a union type, so it's impossible to write the `Geometry` and `GeometryCollection` arrays in geoarrow/geoarrow#43 natively to Parquet.

GeoArrow implementations are able to statically know whether they have singly-typed geometries in a column, in which case they can write one of the 6 primitive types. GeoArrow implementations will have to fall back to WKB-encoded geometries for mixed-type columns. I don't see how this is something we could realistically change, unless we essentially re-implement union handling in a struct, which would be a big ask for implementors.
**Reply:** All good points! I think Kyle put the summary best. In the context of this PR, that would mean that the column option `geoarrow_type` could in the future be set to `"geoarrow.mixed"`.

I don't think we anticipated that writing mixed geometries in geoarrow to Parquet would be the main use-case. If this is an important use, please chime in on geoarrow/geoarrow#43 with some details! We definitely don't want to represent mixed geometries in a way that prevents them being used.
**Reply:** This is only true if there's no ability for a user to supply any information about the encoding. If there is, `write_geoparquet(..., encoding = geoarrow("geoarrow.point"))` should do it. Typically the user does know this information (even if the database does not).
**Reply:** This discussion is making me think that the GeoParquet spec should not be defined in terms of GeoArrow. Rather, it should be defined as "native type" or "flat type" or similar. Then a sentence in prose can mention that it overlaps partially with the GeoArrow spec.

I'm also becoming convinced that the serialized form need not exactly overlap with GeoArrow. On the topic of mixed arrays specifically, as possibly the only one who has written an implementation of GeoArrow mixed arrays, I've become an even stronger proponent of using an Arrow union type for GeoArrow mixed arrays because of its ability to know geometry types statically. So I think the best way forward for GeoParquet (for a future PR) would be to discuss a "struct-union" approach for GeoParquet that is not the same in-memory representation as GeoArrow.

I think changing nomenclature will also make it clearer to non-arrow-based implementations that reading and writing this "native" encoding of GeoParquet is not dependent on using Arrow internally.

So my recommendation would be to take out most references to geoarrow from this PR. I.e., we don't want the metadata key called `geoarrow_type` if there's a possibility that the GeoParquet union type is not the same as the GeoArrow union type.
**Reply:** I actually think the strength of this PR is the strong delegation to the GeoArrow specification: I don't think we should be developing two specifications, particularly since we have very little engagement on the GeoArrow spec already. We've spent quite a bit of time documenting the memory layouts for geoarrow in that specification and I don't think it would be productive to copy/paste those here and maintain them independently. I also don't think it would be productive to link to the GeoArrow specification for documentation of all the memory layouts but very pointedly not call it GeoArrow.

It may be that representing mixed geometry is not important in the context of GeoParquet (maybe WKB is just as fast in the context of compression + IO + Parquet's list type? Have we checked?), or it may be that there is a common memory representation that makes sense for both specifications that will improve interoperability (although that would involve quite a lot of reimplementation on Kyle's end 😬).

I don't want us to lose track of the main point here, which is that this PR is mostly about enabling very efficient representations of single-type geometries, which are very commonly the types of files that you might want to put in a giant S3 bucket and scan efficiently.
**Reply:** We could go back to the question about which "encoding" values to allow, and instead of a generic `"geoarrow"` option (with an additional "geoarrow_type" key to then be more specific), have actual encoding options `"point"`, `"linestring"`, `"polygon"`, etc. (i.e. one of the options we initially discussed was also "geoarrow.point", "geoarrow.linestring", etc., but then just dropping the "geoarrow." prefix).

For the rest, it is mostly a question about how to document this: how to phrase this exactly in the specification, how strongly to tie it to geoarrow (or just reference it as mostly similar), how much to duplicate the details of the memory layout, etc. But that's all "details" about the best way to document it, while keeping the actual specification (what ends up in the metadata in a file) agnostic to geoarrow.

I think I am somewhat convinced by @kylebarron's points on this, and like the idea of having the actual spec changes not use "geoarrow" (and then we can still debate how much to use the term in the explanation of it). For example, as long as the specification would exactly overlap (or be a strict subset of geoarrow), we can still point to the geoarrow spec for the details to avoid too much duplication (and having to maintain two versions). And this is also easy to change in the future if we would want to introduce differences.

On the other hand, for an implementation of GeoParquet in some library that has nothing to do with Arrow (doesn't use an Arrow implementation under the hood), the "geoarrow" name is also somewhat uninformative, when strictly looking at it from a GeoParquet point of view.
**Reply:** Perhaps it's worth crafting a new PR that uses the language you all are hoping for, with a draft implementation? I don't currently have the bandwidth to do that but am happy to review!