From 0591f6816ccd71c0c3e24773c1eaeb68d86897c8 Mon Sep 17 00:00:00 2001 From: Szehon Ho Date: Wed, 11 Sep 2024 14:10:05 -0700 Subject: [PATCH] More fixes --- format/spec.md | 52 +++++++++++++++++++++++++------------------------- 1 file changed, 26 insertions(+), 26 deletions(-) diff --git a/format/spec.md b/format/spec.md index f082c6dc6e4e..2c7a770d5926 100644 --- a/format/spec.md +++ b/format/spec.md @@ -207,9 +207,9 @@ Notes: 1. Timestamp values _without time zone_ represent a date and time of day regardless of zone: the time value is independent of zone adjustments (`2017-11-16 17:10:34` is always retrieved as `2017-11-16 17:10:34`). 2. Timestamp values _with time zone_ represent a point in time: values are stored as UTC and do not retain a source time zone (`2017-11-16 17:10:34 PST` is stored/retrieved as `2017-11-17 01:10:34 UTC` and these values are considered identical). 3. Character strings must be stored as UTF-8 encoded byte arrays. -4. Coordinate reference system, i.e. mapping of how coordinates refer to precise locations on earth. Defaults to "OGC:CRS84". Fixed and cannot be changed by schema evolution. -5. Coordinate reference system encoding, value specifying type of coordinate reference system field. Must be set if C is set. Defaults to "PROJJSON". Fixed and cannot be changed by schema evolution. -6. Edges, interpretation for edges within geometry object, i.e. whether the edge between points represent a straight cartesian line or the shortest line on the sphere. This applies to all non-point geometry objects. Fixed and cannot be changed by schema evolution. +4. Crs (coordinate reference system), i.e. mapping of how coordinates refer to precise locations on earth. Defaults to "OGC:CRS84". Fixed and cannot be changed by schema evolution. +5. Crs-encoding, value specifying type of crs field. Must be set if crs is set. Defaults to "PROJJSON". Fixed and cannot be changed by schema evolution. +6. 
Edges, value specifying interpretation for edges within geometry object, i.e. whether the edge between points represent a straight cartesian line or the shortest line on the sphere. This applies to all non-point geometry objects. Defaults to "planar". Fixed and cannot be changed by schema evolution. For details on how to serialize a schema to JSON, see Appendix C. @@ -466,28 +466,28 @@ The schema of a manifest file is a struct called `manifest_entry` with the follo `data_file` is a struct with the following fields: -| v1 | v2 | Field id, name | Type | Description | -| ---------- | ---------- |-----------------------------------|------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| -| | _required_ | **`134 content`** | `int` with meaning: `0: DATA`, `1: POSITION DELETES`, `2: EQUALITY DELETES` | Type of content stored by the data file: data, equality deletes, or position deletes (all v1 files are data files) | -| _required_ | _required_ | **`100 file_path`** | `string` | Full URI for the file with FS scheme | -| _required_ | _required_ | **`101 file_format`** | `string` | String file format name, avro, orc or parquet | -| _required_ | _required_ | **`102 partition`** | `struct<...>` | Partition data tuple, schema based on the partition spec output using partition field ids for the struct field ids | -| _required_ | _required_ | **`103 record_count`** | `long` | Number of records in this file | -| _required_ | _required_ | **`104 file_size_in_bytes`** | `long` | Total file size in bytes | -| _required_ | | ~~**`105 block_size_in_bytes`**~~ | `long` | **Deprecated. Always write a default in v1. Do not write in v2.** | -| _optional_ | | ~~**`106 file_ordinal`**~~ | `int` | **Deprecated. 
Do not write.** | -| _optional_ | | ~~**`107 sort_columns`**~~ | `list<112: int>` | **Deprecated. Do not write.** | -| _optional_ | _optional_ | **`108 column_sizes`** | `map<117: int, 118: long>` | Map from column id to the total size on disk of all regions that store the column. Does not include bytes necessary to read other columns, like footers. Leave null for row-oriented formats (Avro) | -| _optional_ | _optional_ | **`109 value_counts`** | `map<119: int, 120: long>` | Map from column id to number of values in the column (including null and NaN values) | -| _optional_ | _optional_ | **`110 null_value_counts`** | `map<121: int, 122: long>` | Map from column id to number of null values in the column | -| _optional_ | _optional_ | **`137 nan_value_counts`** | `map<138: int, 139: long>` | Map from column id to number of NaN values in the column | -| _optional_ | _optional_ | **`111 distinct_counts`** | `map<123: int, 124: long>` | Map from column id to number of distinct values in the column; distinct counts must be derived using values in the file by counting or using sketches, but not using methods like merging existing distinct counts | -| _optional_ | _optional_ | **`125 lower_bounds`** | `map<126: int, 127: binary>` | Map from column id to lower bound in the column serialized as binary [1]. Each value must be less than or equal to all non-null, non-NaN values in the column for the file [2] For Geometry type, this is a Point composed of the min value of each dimension in all Points in the Geometry. | -| _optional_ | _optional_ | **`128 upper_bounds`** | `map<129: int, 130: binary>` | Map from column id to upper bound in the column serialized as binary [1]. Each value must be greater than or equal to all non-null, non-Nan values in the column for the file [2] For Geometry type, this is a Point composed of the max value of each dimension in all Points in the Geometry. 
| -| _optional_ | _optional_ | **`131 key_metadata`** | `binary` | Implementation-specific key metadata for encryption | -| _optional_ | _optional_ | **`132 split_offsets`** | `list<133: long>` | Split offsets for the data file. For example, all row group offsets in a Parquet file. Must be sorted ascending | -| | _optional_ | **`135 equality_ids`** | `list<136: int>` | Field ids used to determine row equality in equality delete files. Required when `content=2` and should be null otherwise. Fields with ids listed in this column must be present in the delete file | -| _optional_ | _optional_ | **`140 sort_order_id`** | `int` | ID representing sort order for this file [3]. | +| v1 | v2 | Field id, name | Type | Description | +| ---------- | ---------- |-----------------------------------|------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| +| | _required_ | **`134 content`** | `int` with meaning: `0: DATA`, `1: POSITION DELETES`, `2: EQUALITY DELETES` | Type of content stored by the data file: data, equality deletes, or position deletes (all v1 files are data files) | +| _required_ | _required_ | **`100 file_path`** | `string` | Full URI for the file with FS scheme | +| _required_ | _required_ | **`101 file_format`** | `string` | String file format name, avro, orc or parquet | +| _required_ | _required_ | **`102 partition`** | `struct<...>` | Partition data tuple, schema based on the partition spec output using partition field ids for the struct field ids | +| _required_ | _required_ | **`103 record_count`** | `long` | Number of records in this file | +| _required_ | _required_ | **`104 file_size_in_bytes`** | `long` | Total file size in bytes | +| _required_ 
| | ~~**`105 block_size_in_bytes`**~~ | `long` | **Deprecated. Always write a default in v1. Do not write in v2.** | +| _optional_ | | ~~**`106 file_ordinal`**~~ | `int` | **Deprecated. Do not write.** | +| _optional_ | | ~~**`107 sort_columns`**~~ | `list<112: int>` | **Deprecated. Do not write.** | +| _optional_ | _optional_ | **`108 column_sizes`** | `map<117: int, 118: long>` | Map from column id to the total size on disk of all regions that store the column. Does not include bytes necessary to read other columns, like footers. Leave null for row-oriented formats (Avro) | +| _optional_ | _optional_ | **`109 value_counts`** | `map<119: int, 120: long>` | Map from column id to number of values in the column (including null and NaN values) | +| _optional_ | _optional_ | **`110 null_value_counts`** | `map<121: int, 122: long>` | Map from column id to number of null values in the column | +| _optional_ | _optional_ | **`137 nan_value_counts`** | `map<138: int, 139: long>` | Map from column id to number of NaN values in the column | +| _optional_ | _optional_ | **`111 distinct_counts`** | `map<123: int, 124: long>` | Map from column id to number of distinct values in the column; distinct counts must be derived using values in the file by counting or using sketches, but not using methods like merging existing distinct counts | +| _optional_ | _optional_ | **`125 lower_bounds`** | `map<126: int, 127: binary>` | Map from column id to lower bound in the column serialized as binary [1]. Each value must be less than or equal to all non-null, non-NaN values in the column for the file [2] For geometry type, this is a WKB-encoded point composed of the min value of each dimension among all component points of all geometry objects for the file. | +| _optional_ | _optional_ | **`128 upper_bounds`** | `map<129: int, 130: binary>` | Map from column id to upper bound in the column serialized as binary [1]. 
Each value must be greater than or equal to all non-null, non-Nan values in the column for the file [2] For geometry type, this is a WKB-encoded point composed of the max value of each dimension among all component points of all geometry objects for the file. | +| _optional_ | _optional_ | **`131 key_metadata`** | `binary` | Implementation-specific key metadata for encryption | +| _optional_ | _optional_ | **`132 split_offsets`** | `list<133: long>` | Split offsets for the data file. For example, all row group offsets in a Parquet file. Must be sorted ascending | +| | _optional_ | **`135 equality_ids`** | `list<136: int>` | Field ids used to determine row equality in equality delete files. Required when `content=2` and should be null otherwise. Fields with ids listed in this column must be present in the delete file | +| _optional_ | _optional_ | **`140 sort_order_id`** | `int` | ID representing sort order for this file [3]. | Notes: @@ -1449,6 +1449,6 @@ When processing point in time queries implementations should use "snapshot-log" ## Appendix G: Geospatial Notes -The Geometry class hierarchy and WKB serialization is defined by [OpenGIS Implementation Specification for Geographic information – Simple feature access – Part 1: Common architecture, Version 1.2.1](https://portal.ogc.org/files/?artifact_id=25355), from [Open Geospatial Consortium](https://www.ogc.org/standard/sfa/). +The Geometry class hierarchy and WKB serialization (ISO WKB supporting XY, XYZ, XYM, XYZM) is defined by [OpenGIS Implementation Specification for Geographic information – Simple feature access – Part 1: Common architecture, Version 1.2.1](https://portal.ogc.org/files/?artifact_id=25355), from [Open Geospatial Consortium](https://www.ogc.org/standard/sfa/). Future versions of this spec may also used if the WKB representation remains wire-compatible.
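
Reviewer note: the revised `lower_bounds` / `upper_bounds` wording for geometry columns can be illustrated with a small sketch. It is not taken from any Iceberg implementation. The function names `encode_wkb_point` and `geometry_bounds` are hypothetical, the geometries are assumed to be already decoded into lists of XY component points, and little-endian byte order is an assumption, since the patch does not state which byte order writers should use.

```python
import struct
from typing import Iterable, Sequence, Tuple

Point = Tuple[float, float]


def encode_wkb_point(x: float, y: float) -> bytes:
    """Encode an XY point as WKB (hypothetical helper).

    Layout: 1 byte for byte order (1 = little-endian, an assumption here),
    a uint32 geometry type (1 = Point), then x and y as float64.
    """
    return struct.pack("<BIdd", 1, 1, x, y)


def geometry_bounds(geometries: Iterable[Sequence[Point]]) -> Tuple[bytes, bytes]:
    """Return (lower, upper) bounds as WKB points (hypothetical helper).

    Each geometry is given as the flat list of its component points.
    The lower bound takes the min of each dimension across every point of
    every geometry; the upper bound takes the max.
    """
    xs, ys = [], []
    for points in geometries:
        for x, y in points:
            xs.append(x)
            ys.append(y)
    if not xs:
        raise ValueError("no component points to compute bounds from")
    lower = encode_wkb_point(min(xs), min(ys))
    upper = encode_wkb_point(max(xs), max(ys))
    return lower, upper


# Two geometries: a two-point line and a single point.
lower, upper = geometry_bounds([[(1.0, 5.0), (3.0, 2.0)], [(-2.0, 4.0)]])
```

Note that the lower bound here is the point (-2.0, 2.0), which is not a component point of either input geometry: each dimension is reduced independently. Handling of XYZ, XYM and XYZM coordinates and of NaN coordinates is left out of this sketch.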