diff --git a/format-specs/spatial-type-guidance.md b/format-specs/spatial-type-guidance.md index aaf82ac..932ab7c 100644 --- a/format-specs/spatial-type-guidance.md +++ b/format-specs/spatial-type-guidance.md @@ -15,7 +15,8 @@ By **Spatial Type**, we are referring to a library, specification, or file forma that intends to represent points, lines, polygons, or combinations thereof and relate them to a position on the surface of the earth[^1]. We will also constrain this definition to only include attempts to *formalize* the representation of this -data within an existing type system or format. Essentially, if you are storing data that you +data within an existing type system or format that is otherwise *not* a dedicated +spatial type. Essentially, if you are storing data in a non-spatial format that you intend to be put on a map, and you expect somebody else to interpret it in the same way that you did, that data comes from a spatial type implementation. @@ -28,21 +29,27 @@ celestial body. We'll discuss this later. information when converting spatial data from elsewhere. The main thing this means if your Spatial Type intends to interoperate with any other Spatial Type, you need to provide a mechanism to locate a full Coordinate Reference System definition. + For example, [GeoPandas](https://geopandas.org) stores a coordinate reference system + at the type level of each array of geometries. - **Defer to existing standards**: Spatial data can be complicated and there is a lot of prior art to draw from. Deferring to existing standards for well-worn topics like Coordinate Reference System representation and serialized representation of geometry increases the probability that other tools will be able to easily interact with your - convention. + convention. For example, [GeoParquet](https://geoparquet.org) stores the coordinate + reference system as [PROJJSON](#projjson). + and stores geometries as [well-known binary](#well-known-binary). - **Leverage the capabilities of the format**: Every format has a unique set of - features and limitations. There is sometimes a tension between deferring to existing standards - and ensuring that users of the format can leverage all the great things that - can be done. + features and limitations and there is sometimes a tension between deferring to existing + standards and ensuring that users of the format can leverage existing tools that work with + the format. For example, [GeoJSON](https://geojson.org) encodes geometries as JavaScript + objects instead of a previously existing ## Prior art GeoParquet is far from the first attempt to formalize the representation of spatial data to an existing type system/storage format. A non-exhaustive list of established -Spatial Type implementations we will refer to in this document include: +Spatial Type implementations (in no particular order) we will refer to in this document +include: - [PostGIS](https://postgis.net/) (PostgreSQL) - [GeoParquet](https://geoparquet.org) (Parquet) @@ -87,12 +94,72 @@ to the surface of the earth[^2]. [^2]: Or another celestial body. Again, details forthcoming. -## Types? +## Components of a data type + +Data types describe the user-facing meaning of each element in some collection of +objects (e.g., a column in a table or element in an array). For example, +a [pandas dtype](https://pandas.pydata.org/docs/user_guide/basics.html#basics-dtypes), +a [Parquet LogicalType](https://parquet.apache.org/docs/file-format/types/logicaltypes/), +or an [Arrow Data Type](https://arrow.apache.org/docs/format/Columnar.html#data-types). +Formats use the concept of a data type to separate the arrangement of physical bytes +from the meaning of those bytes. For example, many formats encode dates as the number +of days since January 1, 1970, storing only the type name at the top level and using it +to interpret some sequence of integers. + +Most modern type systems allow types to have parameters (e.g., Arrow/Parquet/DuckDB +decimals can be parameterized with a precision and scale at the type level instead +of storing those values alongside each element); however, many older type systems +(e.g., PostgreSQL, SQLite) only have the opportunity to store a type name or identifier. +This requires that all information required to interpret a type lives with each element +(which, depending on the type, can involve a lot of repetition). + +If the format allows, spatial data types should be parameterized with a Coordinate +Reference System to avoid unnecessarily repeating this information. This also allows +better interopearability with other spatial data types that store the Coordinate +Reference System in this way (e.g., +[GeoPandas](#), +[R/sf](https://r-spatial.github.io/sf/reference/st_crs.html), +[GeoArrow](https://geoarrow.org/extension-types.html#extension-metadata), +[GeoParquet](https://github.com/opengeospatial/geoparquet/blob/v1.1.0/format-specs/geoparquet.md#crs) +). + +There are cases where a format does not allow a type to be parameterized (i.e., a name +or identifier is all that is possible). PostgreSQL is an example where this is the case: +as a consumer of PostgreSQL using a non-spatial client, all the information you have +is the name or ID of the type ("geometry" or "geography"). The Spatial Type implementation +for PostgreSQL, PostGIS, uses two mechanisms to make the CRS of a column known: + +- It stores a CRS identifier (an integer) with every element that refers to a + `spatial_ref_sys` table that contains an extended string definition of that CRS. +- It keeps a record of all + [geometry columns as a VIEW](https://postgis.net/docs/using_postgis_dbmanagement.html#geometry_columns) and includes CRS identifier (an integer) that refers to the + `spatial_ref_sys` table. + +For clients reading a table directly, it is possible to look up the type-level CRS +by querying the `geometry_columns` and `spatial_ref_sys` tables; however, for clients +executing a complex query, it is necessary to inspect the integer that is sent alongside +each encoded geometry in the result (which is almost always the exact same value for +every element in the result). An important consequence of this workaround is that +once the context of the original database connection is lost, there is no mechanism +to obtain the full CRS definition to pass on to another Spatial Type[^3]. + +[^3]: To mitigate this problem, PostGIS does use generally consistent identifiers +for common CRSes such that it is usually possible to guess the CRS based on the +integer identifier. +## Coordinate Reference Systems -## Coordinate Reference Systems https://macwright.com/lonlat/ + +### PROJJSON + + +## Geometries + +## WKB + +