From f7497e91afe29798967bc86597aeed174b0633e5 Mon Sep 17 00:00:00 2001 From: Nicholas Gates Date: Mon, 18 Mar 2024 15:32:37 +0000 Subject: [PATCH] README (#102) * README * README * Flatten * Flatten * lots of small copyedits --------- Co-authored-by: Will Manning --- LICENSE-APACHE => LICENSE | 0 README.md | 164 ++++++++++++++++++++++++++++++-------- 2 files changed, 131 insertions(+), 33 deletions(-) rename LICENSE-APACHE => LICENSE (100%) diff --git a/LICENSE-APACHE b/LICENSE similarity index 100% rename from LICENSE-APACHE rename to LICENSE diff --git a/README.md b/README.md index 6f6a70ad8f..95c4ea25ca 100644 --- a/README.md +++ b/README.md @@ -5,46 +5,144 @@ [![Documentation](https://docs.rs/vortex-rs/badge.svg)](https://docs.rs/vortex-array) [![Rust](https://img.shields.io/badge/rust-1.76.0%2B-blue.svg?maxAge=3600)](https://github.com/fulcrum-so/vortex) -An in-memory format for 1-dimensional array data. +Vortex is an Apache Arrow-compatible toolkit for working with compressed array data. We are using Vortex to develop a +next-generation file format for multidimensional arrays called Spiral. -Vortex is a maximally [Apache Arrow](https://arrow.apache.org/) compatible data format that aims to separate logical and -physical representation of data, and allow pluggable physical layout. +> [!CAUTION] +> This library is very much a work in progress! -Array operations are separately defined in terms of their semantics, dealing only with logical types and physical layout -that defines exact ways in which values are transformed. +The major components of Vortex are (will be!): -# Logical Types +* **Logical Types** - a schema definition that makes no assertions about physical layout. +* **Encodings** - a pluggable set of physical layouts. Vortex ships with several state-of-the-art lightweight +compression codecs that have the potential to support GPU decompression. +* **Compression** - recursive compression based on stratified samples of the input. +* **Compute** - basic compute kernels that can operate over compressed data. Note that Vortex does not intend to become + a full-fledged compute engine, but rather to provide the ability to implement basic compute operations as may be + required for efficient scanning & pushdown operations. +* **Statistics** - each array carries around lazily computed summary statistics, optionally populated at read-time. + These are available to compute kernels as well as to the compressor. +* **Serde** - zero-copy serialization. Designed to work well both on-disk and over-the-wire. -Vortex type system only conveys semantic meaning of the array data without prescribing physical layout. When operating -over arrays you can focus on semantics of the operation. Separately you can provide low level implementation dependent -on particular physical operation. +## Overview: Logical vs Physical -``` -Null: all null array -Bool: Single bit value -Integer: Fixed width signed/unsigned number. Supports 8, 16, 32, 64 bit widths -Float: Fixed width floating point number. Supports 16, 32, 64 bit float types -Decimal: Fixed width decimal with specified precision (total number of digits) and scale (number of digits after decimal point) -Instant: An instantaneous point on the time-line. Number of seconds/miliseconds/microseconds/nanoseconds from epoch -LocalDate: A date without a time-zone -LocalTime: A time without a time-zone -ZonedDateTime: A data and time including ISO-8601 timezone -List: Sequence of items of same type -Map: Key, value mapping -Struct: Named tuple of types -``` +One of the core principles in Vortex is separation of the logical from the physical. -# Physical Encodings +A Vortex array is defined by a logical data type (i.e., the type of scalar elements) as well as a physical encoding +(the type of the array itself). Vortex ships with several built-in encodings, as well as several extension encodings. -Vortex calls array implementations encodings, they encode the physical layout of the data. Encodings are recurisvely -nested, i.e. encodings contain other encodings. For every array you have their value data type and the its encoding that -defines how operations will be performed. By default necessary encodings to zero copy convert to and from Apache Arrow -are included in the package. +The built-in encodings are primarily designed to model the Apache Arrow in-memory format, enabling us to construct Vortex +arrays with zero-copy from Arrow arrays. There are also several built-in encodings (e.g., `sparse` and `chunked`) that +are useful building blocks for other encodings. +The included extension encodings are mostly designed to model compressed in-memory arrays, such as run-length or +dictionary encoding. -When performing operations they're dispatched on the encodings to provide specialized implementation. +## Components -## Compression +### Logical Types -The advantage of separating physical layout from the semantic of the data is compression. Vortex can compress data -without requiring changes to the logical operations. To support efficient data access we focus on lightweight -compression algorithms only falling back to general purpose compressors for binary data. +The Vortex type-system is still in flux. The current set of logical types is: + +* Null +* Bool +* Integer +* Float +* Decimal +* Binary +* UTF8 +* List +* Struct +* Date/Time/DateTime/Duration: TODO (in-progress, currently partially supported) +* FixedList: TODO +* Union: TODO + +### Canonical/Flat Encodings + +Vortex includes a base set of "flat" encodings that are designed to be zero-copy with Apache Arrow. These are the canonical +representations of each of the logical data types. The canonical encodings currently supported are: + +* Null +* Bool +* Primitive (Integer, Float) +* Struct +* VarBin +* VarBinView +* ...with more to come + +### Compressed Encodings + +Vortex includes a set of compressed encodings that can hold compression in-memory arrays allowing us to defer +compression. These are: + +* BitPacking +* Constant +* Chunked +* Dictionary +* Frame-of-Reference +* Run-end +* RoaringUInt +* RoaringBool +* Sparse +* ZigZag + +### Compression + +Vortex's compression scheme is based on the [BtrBlocks](https://www.cs.cit.tum.de/fileadmin/w00cfj/dis/papers/btrblocks.pdf) paper. + +Roughly, for each chunk of data a sample is taken and a set of encodings are attempted. The best-performing encoding +is then chosen to encode the entire chunk. This sounds like it would be very expensive, but given basic statistics +about a chunk, it is possible to cheaply rule out many encodings and ensure the search space does not explode in size. + +### Compute + +Vortex provides the ability for each encoding to override the implementation of a compute function to avoid +decompressing where possible. For example, filtering a dictionary-encoded UTF8 array can be more cheaply performed by +filtering the dictionary first. + +Note that Vortex does not intend to become a full-fledged compute engine, but rather to provide the ability to +implement basic compute operations as may be required for efficient scanning & operation pushdown. + +### Statistics + +Vortex arrays carry lazily-computed summary statistics. Unlike other array libraries, these statistics can be populated +from disk formats such as Parquet and preserved all the way into a compute engine. Statistics are available to compute +kernels as well as to the compressor. + +The current statistics are: + +* BitWidthFreq +* TrailingZeroFreq +* IsConstant +* IsSorted +* IsStrictSorted +* Max +* Min +* RunCount +* TrueCount +* NullCount + +### Serialization / Deserialization (Serde) + +TODO + +## Vs Apache Arrow + +It is important to note that Vortex and Arrow have different design goals. As such, it is somewhat +unfair to make any comparison at all. But given both can be used as array libraries, it is worth noting the differences. + +Vortex is designed to be maximally compatible with Apache Arrow. All Arrow arrays can be converted into Vortex arrays +with zero-copy. And a Vortex array constructed from an Arrow array can be converted back to Arrow, again with zero-copy. + +Vortex explicitly separates logical types from physical encodings, distinguishing it from Arrow. This allows +Vortex to model more complex arrays while still exposing a logical interface. For example, Vortex can model a UTF8 +`ChunkedArray` where the first chunk is run-length encoded and the second chunk is dictionary encoded. +In Arrow, `RunLengthArray` and `DictionaryArray` are separate incompatible types, and so cannot be combined in this way. + +## Contributing + +While we hope to turn Vortex into a community project, its current rapid rate of change makes taking contributions +without prior discussion infeasible. If you are interested in contributing, please open an issue to discuss your ideas. + +## License + +Licensed under the Apache License, Version 2.0 (the "License"). \ No newline at end of file