Fix errors and ambiguities in the storage format specification. (#5101)
[SC-49755](https://app.shortcut.com/tiledb-inc/story/49755/incorrect-storage-format-specification-of-encrypted-chunk-metadata)

[SC-49761](https://app.shortcut.com/tiledb-inc/story/49761/ambiguities-the-filter-pipeline-format-specification)

This PR updates the storage format specification to fix some errors and
ambiguities. Specifically:

* The specification for encrypted tiles defined two structures:
`AESPart` for data parts and the more complex `AESPartMD` for metadata
parts. The latter, however, was never used; the Core has in fact been
treating data and metadata parts identically.[^1][^2] The
specification of `AESPartMD` was removed.
* The dictionary filter was specified in one place as serializing
options and in another as not serializing them. It does serialize
them[^3], and the specification was updated accordingly.
* It was ambiguous whether the phrase "Used for" in the _Reinterpret
datatype_ field meant that, for the non-delta compression filters, the
field is serialized and ignored or not serialized at all. The latter is
true[^4], and the ambiguity was resolved by splitting the options table
into two: one for non-delta compression filters and one for delta
compression filters.

[^1]:
https://github.com/TileDB-Inc/TileDB/blob/e95cf44ed4bc06fe1bdf43bcc6e5fc1f3a4afd28/tiledb/sm/filter/encryption_aes256gcm_filter.h#L46-L78
[^2]:
https://github.com/TileDB-Inc/TileDB/blob/392270851c0e39925f43ec22810498e6dd09334e/tiledb/sm/filter/encryption_aes256gcm_filter.cc#L166-L170
[^3]:
https://github.com/TileDB-Inc/TileDB/blob/392270851c0e39925f43ec22810498e6dd09334e/tiledb/sm/enums/compressor.h#L64-L65
[^4]:
https://github.com/TileDB-Inc/TileDB/blob/392270851c0e39925f43ec22810498e6dd09334e/tiledb/sm/filter/compression_filter.cc#L713-L716

---
TYPE: FORMAT
DESC: Update the storage format specification to fix errors and
ambiguities in the serialized options and tiles of the compression and
encryption filters.
teo-tsirpanis authored Jul 1, 2024
1 parent 388af4d commit 303c0e6
Showing 3 changed files with 16 additions and 32 deletions.
17 changes: 13 additions & 4 deletions format_spec/filter_pipeline.md
@@ -42,13 +42,22 @@ The filter options are configuration parameters for the filters that do not chan

### Main Compressor Options

For the compression filters \(any of the filter types `TILEDB_FILTER_{GZIP,ZSTD,LZ4,RLE,BZIP2,DOUBLE_DELTA,DELTA,DICTIONARY}`\) the filter options have internal format:
For the main compression filters \(any of the filter types `TILEDB_FILTER_{GZIP,ZSTD,LZ4,RLE,BZIP2,DICTIONARY}`\) the filter options have internal format:

| **Field** | **Type** | **Description** |
| :--- | :--- | :--- |
| Compressor type | `uint8_t` | Type of compression \(e.g. `TILEDB_BZIP2`\) |
| Compressor type | `uint8_t` | Type of compression \(e.g. `TILEDB_FILTER_BZIP2`\) |
| Compression level | `int32_t` | Compression level used \(ignored by some compressors\). |
| Reinterpret datatype | `uint8_t` | Type to reinterpret data prior to compression. Used for DOUBLE_DELTA and DELTA only. |

### Delta Compressor Options

For the `TILEDB_FILTER_DELTA` and `TILEDB_FILTER_DOUBLE_DELTA` compression filters the filter options have internal format:

| **Field** | **Type** | **Description** |
| :--- | :--- | :--- |
| Compressor type | `uint8_t` | Type of compression \(e.g. `TILEDB_FILTER_DELTA`\) |
| Compression level | `int32_t` | Ignored |
| Reinterpret datatype | `uint8_t` | Type to reinterpret data prior to compression. |
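
The two option layouts above can be sketched as a small parser. This is only an illustration, not the TileDB implementation: it assumes the fields are packed back to back with no padding and stored little-endian, and the function name `parse_compressor_options` and its `is_delta` flag are hypothetical. The numeric type values used in any call are placeholders, not real enum values.

```python
import struct

# Assumed encodings: little-endian, no padding between fields.
# Main compressors: compressor type (uint8_t) + compression level (int32_t).
MAIN_LAYOUT = struct.Struct("<Bi")
# Delta compressors add a trailing reinterpret datatype (uint8_t).
DELTA_LAYOUT = struct.Struct("<BiB")


def parse_compressor_options(buf: bytes, is_delta: bool) -> dict:
    """Decode serialized compressor options (hypothetical helper).

    `is_delta` stands in for "the filter type is DELTA or DOUBLE_DELTA";
    how the caller learns the filter type is outside this sketch.
    """
    if is_delta:
        ctype, level, reinterpret = DELTA_LAYOUT.unpack_from(buf)
        size = DELTA_LAYOUT.size
    else:
        ctype, level = MAIN_LAYOUT.unpack_from(buf)
        reinterpret = None  # not serialized at all for main compressors
        size = MAIN_LAYOUT.size
    return {
        "compressor_type": ctype,
        "compression_level": level,  # ignored by the delta compressors
        "reinterpret_datatype": reinterpret,
        "size": size,  # number of bytes consumed
    }
```

Under these assumptions the main layout occupies 5 bytes and the delta layout 6, which makes the difference between "serialized and ignored" and "not serialized at all" visible in the byte count.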

### Bit-width Reduction Options

@@ -78,4 +87,4 @@ The filter options for `TILEDB_FILTER_POSITIVE_DELTA` has internal format:

### Other Filter Options

The remaining filters \(`TILEDB_FILTER_{BITSHUFFLE,BYTESHUFFLE,CHECKSUM_MD5,CHECKSUM_256,XOR,DICTIONARY}`\) do not serialize any options.
The remaining filters \(`TILEDB_FILTER_{BITSHUFFLE,BYTESHUFFLE,CHECKSUM_MD5,CHECKSUM_256,XOR}`\) do not serialize any options.
2 changes: 1 addition & 1 deletion format_spec/filters/dictionary_encoding.md
@@ -6,7 +6,7 @@ The Dictionary Encoding filter compresses losslessly string data by creating a s
As an example in pseudocode:

```
input_data = "HG543232", "HG543232", "HG543232", "HG54", "HG54", "A", "HG543232", "HG54"]
input_data = ["HG543232", "HG543232", "HG543232", "HG54", "HG54", "A", "HG543232", "HG54"]
# apply dictionary encoding ->
dictionary = ["HG543232", "HG54", "A"]
output_data = [0, 0, 0, 1, 1, 2, 0, 1]
29 changes: 2 additions & 27 deletions format_spec/tile.md
@@ -193,38 +193,13 @@ The encryption filter metadata have the following on-disk format:
| :--- | :--- | :--- |
| Num metadata parts | `uint32_t` | Number of encrypted metadata parts |
| Num data parts | `uint32_t` | Number of encrypted data parts |
| AES Metadata Part 1 | `AESPartMD` | Metadata part 1 |
| AES Metadata Part 1 | `AESPart` | Metadata part 1 |
||||
| AES Metadata Part N | `AESPartMD` | Metadata part N |
| AES Metadata Part N | `AESPart` | Metadata part N |
| AES Data Part 1 | `AESPart` | Data part 1 |
||||
| AES Data Part N | `AESPart` | Data part N |
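
With the updated rows, where metadata parts are typed `AESPart` just like data parts, a reader can use one part decoder for both groups. The sketch below illustrates that idea only. It assumes little-endian `uint32_t` counts, and because the `AESPart` layout is not shown in this diff, the part decoder is passed in as a callback. The names `read_encryption_metadata` and `toy_read_part` are hypothetical, and `toy_read_part` is a stand-in format, not the real `AESPart` layout.

```python
import struct
from typing import Callable, List, Tuple

# A part reader takes (buffer, offset) and returns (part, next_offset).
PartReader = Callable[[bytes, int], Tuple[bytes, int]]


def read_encryption_metadata(
    buf: bytes, read_part: PartReader
) -> Tuple[List[bytes], List[bytes], int]:
    """Read the two counts, then every part with the same reader."""
    # Assumed: two little-endian uint32_t counts at the start.
    num_md_parts, num_data_parts = struct.unpack_from("<II", buf, 0)
    offset = 8
    md_parts: List[bytes] = []
    data_parts: List[bytes] = []
    for _ in range(num_md_parts):
        part, offset = read_part(buf, offset)
        md_parts.append(part)
    for _ in range(num_data_parts):
        part, offset = read_part(buf, offset)
        data_parts.append(part)
    return md_parts, data_parts, offset


def toy_read_part(buf: bytes, offset: int) -> Tuple[bytes, int]:
    """Hypothetical stand-in, NOT the AESPart layout:
    a uint32 length followed by that many payload bytes."""
    (length,) = struct.unpack_from("<I", buf, offset)
    start = offset + 4
    return buf[start:start + length], start + length
```

The point of passing a single `read_part` callback is that no separate metadata-specific structure is needed, which matches the removal of `AESPartMD` described in the commit message.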

The `AESPartMD` field has the following on-disk format:

| **Field** | **Type** | **Description** |
| :--- | :--- | :--- |
| Num metadata parts | `uint32_t` | Number of metadata parts |
| Num data parts | `uint32_t` | Number of data parts |
| Plaintext length for metadata part 1 | `uint32_t` | Number of bytes of plaintext metadata part 1 |
| Ciphertext length for metadata part 1 | `uint32_t` | Number of bytes of ciphertext metadata part 1 |
| IV bytes for metadata part 1 | `uint32_t` | Number of bytes of AES-256-GCM IV bytes for metadata part 1 |
| Tag bytes for metadata part 1 | `uint32_t` | Number of bytes of AES-256-GCM tag for metadata part 1 |
||||
| Plaintext length for metadata part N | `uint32_t` | Number of bytes of plaintext metadata part N |
| Ciphertext length for metadata part N | `uint32_t` | Number of bytes of ciphertext metadata part N |
| IV bytes for metadata part N | `uint32_t` | Number of bytes of AES-256-GCM IV bytes for metadata part N |
| Tag bytes for metadata part N | `uint32_t` | Number of bytes of AES-256-GCM tag for metadata part N |
| Plaintext length for data part 1 | `uint32_t` | Number of bytes of plaintext data part 1 |
| Ciphertext length for data part 1 | `uint32_t` | Number of bytes of ciphertext data part 1 |
| IV bytes for data part 1 | `uint32_t` | Number of bytes of AES-256-GCM IV bytes for data part 1 |
| Tag bytes for data part 1 | `uint32_t` | Number of bytes of AES-256-GCM tag for data part 1 |
||||
| Plaintext length for data part N | `uint32_t` | Number of bytes of plaintext data part N |
| Ciphertext length for data part N | `uint32_t` | Number of bytes of ciphertext data part N |
| IV bytes for data part N | `uint32_t` | Number of bytes of AES-256-GCM IV bytes for data part N |
| Tag bytes for data part N | `uint32_t` | Number of bytes of AES-256-GCM tag for data part N |

The original metadata is **not** included in the metadata output.

The `AESPart` field has the following on-disk format: