From 1d81b7a0347ae424a90402c7122c194c087552e0 Mon Sep 17 00:00:00 2001
From: Gene Pang <77996944+gene-db@users.noreply.github.com>
Date: Wed, 6 Nov 2024 09:54:34 -0800
Subject: [PATCH] Clarify Variant specification (#457)

* [FOLLOWUP] Clarify Variant details

* address feedback

* minor fix
---
 VariantEncoding.md  | 17 ++++++++++++-----
 VariantShredding.md | 20 +++++++++++++++++---
 2 files changed, 29 insertions(+), 8 deletions(-)
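
Reviewer note (not part of the patch content): the UTF-8 requirement added to the metadata dictionary can be checked mechanically while decoding. The sketch below is only an illustration under stated assumptions: `decode_dictionary` is a hypothetical helper name, it takes the metadata bytes that follow the 1-byte header, and `offset_size` is passed in by the caller rather than parsed from the header bits.

```python
def decode_dictionary(body: bytes, offset_size: int) -> list[str]:
    """Decode <dictionary_size> <offset>* <bytes> and enforce UTF-8.

    Assumption: `body` starts right after the 1-byte metadata header and
    `offset_size` (1 to 4) was already derived from that header.
    """
    def read_uint(pos: int) -> int:
        chunk = body[pos:pos + offset_size]
        if len(chunk) != offset_size:
            raise ValueError("truncated metadata")
        return int.from_bytes(chunk, "little")

    dictionary_size = read_uint(0)
    # The offset list holds dictionary_size + 1 little-endian values.
    offsets = [read_uint(offset_size * (1 + i))
               for i in range(dictionary_size + 1)]
    data = body[offset_size * (dictionary_size + 2):]

    # First offset is always 0; last offset is the total length of `bytes`.
    if offsets[0] != 0 or offsets[-1] != len(data):
        raise ValueError("invalid first or last offset")

    strings = []
    for start, end in zip(offsets, offsets[1:]):
        if start > end:
            raise ValueError("offsets must not decrease")
        # bytes.decode raises UnicodeDecodeError on non-UTF-8 input.
        strings.append(data[start:end].decode("utf-8"))
    return strings
```

For instance, with `offset_size = 1`, the body `bytes([2, 0, 1, 3]) + "a\u00e9".encode("utf-8")` describes two strings at offsets 0 and 1 with a total length of 3, while a dictionary whose only string is the byte `0xFF` would be rejected during decoding.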
diff --git a/VariantEncoding.md b/VariantEncoding.md
index 1eac3bcbe..c6d2d1135 100644
--- a/VariantEncoding.md
+++ b/VariantEncoding.md
@@ -93,6 +93,7 @@ Next, is an `offset` list, which contains `dictionary_size + 1` values.
 Each `offset` is a little-endian value of `offset_size` bytes, and represents the starting byte offset of the i-th string in `bytes`.
 The first `offset` value will always be `0`, and the last `offset` value will always be the total length of `bytes`.
 The last part of the metadata is `bytes`, which stores all the string values in the dictionary.
+All string values must be UTF-8 encoded.
 
 ## Metadata encoding grammar
 
@@ -107,7 +108,7 @@ offset_size_minus_one: 2-bit value providing the number of bytes per dictionary
 dictionary_size: `offset_size` bytes. little-endian value indicating the number of strings in the dictionary
 dictionary: <offset>* <bytes>
 offset: `offset_size` bytes. little-endian value indicating the starting position of the ith string in `bytes`. The list should contain `dictionary_size + 1` values, where the last value is the total length of `bytes`.
-bytes: dictionary string values
+bytes: UTF-8 encoded dictionary string values
 ```
 
 Notes:
@@ -209,7 +210,7 @@ The [primitive types table](#encoding-types) shows the encoding format for each
 
 ### Value Data for Short string (`basic_type`=1)
 
-When `basic_type` is `1`, `value_data` is the sequence of bytes that represents the string.
+When `basic_type` is `1`, `value_data` is the sequence of UTF-8 encoded bytes that represents the string.
 
 ### Value Data for Object (`basic_type`=2)
 
@@ -337,7 +338,7 @@ object_header: (is_large << 4 | field_id_size_minus_one << 2 | field_offset_size
 array_header: (is_large << 2 | field_offset_size_minus_one)
 value_data:  <primitive_val> | <short_string_val> | <object_val> | <array_val>
 primitive_val: see table for binary representation
-short_string_val: bytes
+short_string_val: UTF-8 encoded bytes
 object_val: <num_elements> <field_id>* <field_offset>* <fields>
 array_val: <num_elements> <field_offset>* <fields>
 num_elements: a 1 or 4 byte little-endian value (depending on is_large in <object_header>/<array_header>)
@@ -403,11 +404,17 @@ The *Logical Type* column indicates logical equivalence of physically encoded ty
 For example, a user expression operating on a string value containing "hello" should behave the same, whether it is encoded with the short string optimization, or long string encoding.
 Similarly, user expressions operating on an *int8* value of 1 should behave the same as a decimal16 with scale 2 and unscaled value 100.
 
-# Field ID order and uniqueness
+# String values must be UTF-8 encoded
+
+All strings within the Variant binary format must be UTF-8 encoded.
+This includes the dictionary key string values, the "short string" values, and the "long string" values.
+
+# Object field ID order and uniqueness
 
 For objects, field IDs and offsets must be listed in the order of the corresponding field names, sorted lexicographically.
-Note that the fields themselves are not required to follow this order.
+Note that the field values themselves are not required to follow this order.
 As a result, offsets will not necessarily be listed in ascending order.
+Leaving the order of the field values unconstrained gives writers flexibility when constructing Variant values.
 
 An implementation may rely on this field ID order in searching for field names.
 E.g. a binary search on field IDs (combined with metadata lookups) may be used to find a field with a given field.
diff --git a/VariantShredding.md b/VariantShredding.md
index 51160a9bc..31e1f5289 100644
--- a/VariantShredding.md
+++ b/VariantShredding.md
@@ -91,7 +91,7 @@ optional group variant_col {
 # Parquet Layout
 
 The `array` and `object` fields represent Variant array and object types, respectively.
-Arrays must use the three-level list structure described in https://github.com/apache/parquet-format/blob/master/LogicalTypes.md.
+Arrays must use the three-level list structure described in [LogicalTypes.md](LogicalTypes.md).
 
 An `object` field must be a group.
 Each field name of this inner group corresponds to the Variant value's object field name.
@@ -143,6 +143,17 @@ There are two main motivations for including the `variant_value` column:
 1) In a case where there are rare type mismatches (for example, a numeric field with rare strings like “n/a”), we allow the field to be shredded, which could still be a significant performance benefit compared to fetching and decoding the full value/metadata binary.
 2) Since there is a single schema per file, there would be no easy way to recover from a type mismatch encountered late in a file write. Parquet files can be large, and buffering all file data before starting to write could be expensive. Including a variant column for every field guarantees we can adhere to the requested shredding schema.
 
+# Top-level metadata
+
+Any values stored in a shredded `variant_value` field may have dictionary IDs referring to the metadata.
+There is one metadata value for the entire Variant record, and that is stored in the top-level `metadata` field.
+This means that any `variant_value` value in the shredded representation contains only the "value" portion of the [Variant Binary Encoding](VariantEncoding.md).
+
+The metadata is kept at the top level, instead of being shredded along with the shredded variant values, because this provides:
+* A simpler shredding scheme and specification. No need for additional struct-of-binary values, or a custom concatenated binary scheme for `variant_value`.
+* Simpler and faster shredding on write. No need to rebuild the metadata, or re-encode IDs for `variant_value`.
+* Simpler and faster Variant reconstruction. No need to re-encode IDs for `variant_value`.
+
 # Data Skipping
 
 Shredded columns are expected to store statistics in the same format as a normal Parquet column.
@@ -154,11 +165,14 @@ This specification is not strict about what values may be stored in `variant_val
 # Shredding Semantics
 
 Reconstruction of Variant value from a shredded representation is not expected to produce a bit-for-bit identical binary to the original unshredded value.
-For example, the order of fields in the binary may change, as may the physical representation of scalar values.
+For example, in a reconstructed Variant value, the order of object field values may be different from the original binary.
+This is allowed since the [Variant Binary Encoding](VariantEncoding.md#object-field-id-order-and-uniqueness) does not require an ordering of the field values, but the field IDs will still be ordered lexicographically according to the corresponding field names.
 
+The physical representation of scalar values may also be different in the reconstructed Variant binary.
 In particular, the [Variant Binary Encoding](VariantEncoding.md) considers all integer and decimal representations to represent a single logical type.
+This flexibility makes shredding applicable in more scenarios, while still preserving all values losslessly.
 As a result, it is valid to shred a decimal into a decimal column with a different scale, or to shred an integer as a decimal, as long as no numeric precision is lost.
-For example, it would be valid to write the value 123 to a Decimal(9, 2) column, but the value 1.234 would need to be written to the **variant_value** column.
+For example, it would be valid to write the value 123 to a Decimal(9, 2) column, but the value 1.234 would need to be written to the `variant_value` column.
 When reconstructing, it would be valid for a reader to reconstruct 123 as an integer, or as a Decimal(9, 2).
 Engines should not depend on the physical type of a Variant value, only the logical type.