Commit

add explanation about corresponding Parquet types
jorisvandenbossche committed Mar 11, 2024
1 parent 6dc23dc commit 5a996f8
Showing 1 changed file with 43 additions and 1 deletion.
44 changes: 43 additions & 1 deletion format-specs/geoparquet.md
@@ -97,10 +97,52 @@ WKB geometry columns MUST be stored using the `BYTE_ARRAY` parquet type.

Implementation note: when using WKB encoding with the ecosystem of Arrow libraries, Parquet types such as `BYTE_ARRAY` might not be directly accessible. Instead, the corresponding Arrow data type can be `Arrow::Type::BINARY` (for arrays whose elements can be indexed through a 32-bit index) or `Arrow::Type::LARGE_BINARY` (64-bit index). It is recommended that GeoParquet readers be compatible with both data types, and that writers preferably use `Arrow::Type::BINARY` (thus limiting row groups to content smaller than 2 GB) for broader compatibility.
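
As an illustration of the 2 GB constraint, the following minimal sketch groups WKB values into chunks whose total byte size stays within the range of a 32-bit offset, so that each chunk could be held in a `BINARY` array. The function name `split_for_binary` and the `limit` parameter are hypothetical and not part of GeoParquet or any Arrow library; a real writer would normally control this through its row group size settings.

```python
INT32_MAX = 2**31 - 1  # largest offset a 32-bit signed index can represent


def split_for_binary(wkb_values, limit=INT32_MAX):
    """Group WKB blobs into chunks whose total size fits 32-bit offsets.

    Hypothetical helper: `limit` is exposed only so the behaviour can be
    demonstrated with small inputs.
    """
    chunks, current, size = [], [], 0
    for blob in wkb_values:
        if len(blob) > limit:
            raise ValueError("a single value exceeds the 32-bit offset range")
        # Start a new chunk when adding this blob would overflow the limit.
        if current and size + len(blob) > limit:
            chunks.append(current)
            current, size = [], 0
        current.append(blob)
        size += len(blob)
    if current:
        chunks.append(current)
    return chunks
```

With a small artificial limit, `split_for_binary([b"aaaa", b"bb", b"ccc"], limit=5)` places the first value in one chunk and the remaining two in a second chunk.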

##### Geometry type specific encodings (based on GeoArrow)
##### Native encodings (based on GeoArrow)

Using the single-geometry type encodings (i.e., `"point"`, `"linestring"`, `"polygon"`, `"multipoint"`, `"multilinestring"`, `"multipolygon"`) may provide better performance and enable readers to leverage more features of the Parquet format to accelerate geospatial queries (e.g., row group-level min/max statistics). These encodings correspond to the extension name suffix in the [GeoArrow metadata specification for extension names](https://geoarrow.org/extension-types#extension-names), which signifies the memory layout used by the geometry column. GeoParquet uses the separated (struct) representation of coordinates for single-geometry type encodings because this representation results in useful column statistics when row groups and/or files contain related features.

The actual coordinates of the geometries MUST be stored as native numbers, i.e. using
the `DOUBLE` parquet type in a (repeated) group of fields (exact repetition depending
on the geometry type).

For the `"point"` geometry type, this results in a struct of two fields for x
and y coordinates (in case of 2D geometries):

```
// "point" geometry column as simple field with two child fields for x and y
optional group geometry {
  required double x;
  required double y;
}
```
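
To illustrate why separate `x` and `y` fields yield useful statistics, here is a minimal sketch in plain Python that mimics row group-level min/max statistics and uses them to skip row groups for a bounding-box query. The function names and the fixed `row_group_size` are assumptions made for illustration; a real Parquet writer computes these statistics itself, and this sketch ignores null geometries.

```python
def row_group_stats(points, row_group_size):
    """Return (min_x, max_x, min_y, max_y) for each row group of (x, y) tuples."""
    stats = []
    for start in range(0, len(points), row_group_size):
        group = points[start:start + row_group_size]
        xs = [x for x, _ in group]
        ys = [y for _, y in group]
        stats.append((min(xs), max(xs), min(ys), max(ys)))
    return stats


def groups_intersecting(stats, xmin, ymin, xmax, ymax):
    """Return indices of row groups whose x/y ranges overlap the query box."""
    return [
        i
        for i, (gx0, gx1, gy0, gy1) in enumerate(stats)
        if gx0 <= xmax and gx1 >= xmin and gy0 <= ymax and gy1 >= ymin
    ]
```

When nearby features are written into the same row group, a query box typically overlaps only a few row groups and the rest can be skipped without decoding any coordinates.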

For the other geometry types, those x and y coordinate values MUST be embedded
in repeated groups (`LIST` logical parquet type). For example, for the
`"multipolygon"` geometry type:

```
// "multipolygon" geometry column with multiple levels of nesting
optional group geometry (List) {
  // the parts of the MultiPolygon
  repeated group list {
    optional group element (List) {
      // the rings of one Polygon
      repeated group list {
        optional group element (List) {
          // the list of coordinates of one ring
          repeated group list {
            optional group element {
              required double x;
              required double y;
            }
          }
        }
      }
    }
  }
}
```
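
The two schemas above differ only in how many list levels wrap the x/y struct. The following sketch generates the schema text for each encoding from its nesting depth. The depths for the encodings not shown above (`"linestring"`, `"polygon"`, `"multipoint"`, `"multilinestring"`) are an assumption based on the GeoArrow memory layouts, and the names `NESTING` and `schema_for` are hypothetical. Only 2D coordinates are handled.

```python
# Number of list levels around the coordinate struct per encoding
# (assumed from the GeoArrow layouts; only point and multipolygon
# are spelled out in the text above).
NESTING = {
    "point": 0,
    "linestring": 1,
    "multipoint": 1,
    "polygon": 2,
    "multilinestring": 2,
    "multipolygon": 3,
}


def _node(depth, indent, header):
    pad = "  " * indent
    if depth == 0:
        # Innermost level: the struct holding the coordinate values.
        return [
            f"{pad}{header} {{",
            f"{pad}  required double x;",
            f"{pad}  required double y;",
            f"{pad}}}",
        ]
    # One list level: an annotated group wrapping a repeated group.
    return (
        [f"{pad}{header} (List) {{", f"{pad}  repeated group list {{"]
        + _node(depth - 1, indent + 2, "optional group element")
        + [f"{pad}  }}", f"{pad}}}"]
    )


def schema_for(encoding, name="geometry"):
    """Return the Parquet schema text for a 2D geometry column."""
    return "\n".join(_node(NESTING[encoding], 0, f"optional group {name}"))
```

Calling `print(schema_for("multipolygon"))` reproduces the nesting of the example above without its comments, and `schema_for("point")` reproduces the flat struct.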

#### Coordinate axis order

The axis order of the coordinates in WKB stored in a GeoParquet file follows the de facto standard for axis order in WKB and is therefore always (x, y), where x is easting or longitude and y is northing or latitude. This ordering explicitly overrides the axis order as specified in the CRS. This follows the precedent of [GeoPackage](https://geopackage.org); see the [note in their spec](https://www.geopackage.org/spec130/#gpb_spec).
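
As a small illustration of the (x, y) ordering, the sketch below encodes and decodes a 2D WKB point using only the Python standard library. The helper names `point_wkb` and `read_point_wkb` are hypothetical; in practice a geometry library would produce the WKB. Note that longitude is written first even for a CRS that formally defines latitude-first axis order.

```python
import struct


def point_wkb(x, y):
    """Encode a 2D point as little-endian WKB: x (easting/longitude) first."""
    # 1 byte byte-order flag (1 = little endian), uint32 geometry type
    # (1 = Point), then x and y as 64-bit floats.
    return struct.pack("<BIdd", 1, 1, x, y)


def read_point_wkb(buf):
    """Decode a 2D WKB point and return (x, y)."""
    endian = "<" if buf[0] == 1 else ">"
    geom_type, x, y = struct.unpack(endian + "Idd", buf[1:21])
    if geom_type != 1:
        raise ValueError("not a 2D WKB Point")
    return x, y


# Amsterdam: longitude 4.9 is x, latitude 52.37 is y.
wkb = point_wkb(4.9, 52.37)
```

Here `read_point_wkb(wkb)` returns the pair with longitude first, regardless of the axis order declared by the CRS.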