README (#102)
* README

* README

* Flatten

* Flatten

* lots of small copyedits

---------

Co-authored-by: Will Manning <[email protected]>
gatesn and lwwmanning authored Mar 18, 2024
1 parent 2c07f6c commit f7497e9
Showing 2 changed files with 131 additions and 33 deletions.
File renamed without changes.
164 changes: 131 additions & 33 deletions README.md
[![Documentation](https://docs.rs/vortex-rs/badge.svg)](https://docs.rs/vortex-array)
[![Rust](https://img.shields.io/badge/rust-1.76.0%2B-blue.svg?maxAge=3600)](https://github.com/fulcrum-so/vortex)

Vortex is an Apache Arrow-compatible toolkit for working with compressed array data. We are using Vortex to develop a
next-generation file format for multidimensional arrays called Spiral.

> [!CAUTION]
> This library is very much a work in progress!

The major components of Vortex are (will be!):

* **Logical Types** - a schema definition that makes no assertions about physical layout.
* **Encodings** - a pluggable set of physical layouts. Vortex ships with several state-of-the-art lightweight
compression codecs that have the potential to support GPU decompression.
* **Compression** - recursive compression based on stratified samples of the input.
* **Compute** - basic compute kernels that can operate over compressed data. Note that Vortex does not intend to become
a full-fledged compute engine, but rather to provide the ability to implement basic compute operations as may be
required for efficient scanning & pushdown operations.
* **Statistics** - each array carries around lazily computed summary statistics, optionally populated at read-time.
These are available to compute kernels as well as to the compressor.
* **Serde** - zero-copy serialization. Designed to work well both on-disk and over-the-wire.

## Overview: Logical vs Physical

One of the core principles in Vortex is separation of the logical from the physical.

A Vortex array is defined by a logical data type (i.e., the type of scalar elements) as well as a physical encoding
(the type of the array itself). Vortex ships with several built-in encodings, as well as several extension encodings.

The built-in encodings are primarily designed to model the Apache Arrow in-memory format, enabling us to construct Vortex
arrays with zero-copy from Arrow arrays. There are also several built-in encodings (e.g., `sparse` and `chunked`) that
are useful building blocks for other encodings.

The included extension encodings are mostly designed to model compressed in-memory arrays, such as run-length or
dictionary encoding.
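
To make the separation concrete, here is a minimal Rust sketch, hypothetical rather than Vortex's actual API, in which
two arrays share a logical type but differ in physical encoding:

```rust
// Hypothetical sketch (not Vortex's actual API): the logical type describes
// the values, while the encoding describes the physical layout of the array.
#[derive(Clone, Debug, PartialEq)]
enum DType {
    Bool,
    Int { width: u8, signed: bool },
}

trait Array {
    fn dtype(&self) -> &DType;             // logical: what the elements mean
    fn encoding_id(&self) -> &'static str; // physical: how they are laid out
    fn len(&self) -> usize;
}

// Same logical type (signed 64-bit integers), two different physical layouts.
struct PrimitiveArray { dtype: DType, values: Vec<i64> }
struct ConstantArray { dtype: DType, value: i64, len: usize }

impl Array for PrimitiveArray {
    fn dtype(&self) -> &DType { &self.dtype }
    fn encoding_id(&self) -> &'static str { "primitive" }
    fn len(&self) -> usize { self.values.len() }
}

impl Array for ConstantArray {
    fn dtype(&self) -> &DType { &self.dtype }
    fn encoding_id(&self) -> &'static str { "constant" }
    fn len(&self) -> usize { self.len }
}

fn main() {
    let int64 = DType::Int { width: 64, signed: true };
    let a = PrimitiveArray { dtype: int64.clone(), values: vec![1, 2, 3] };
    let b = ConstantArray { dtype: int64, value: 0, len: 1_000_000 };
    assert_eq!(a.dtype(), b.dtype());             // same logical type...
    assert_ne!(a.encoding_id(), b.encoding_id()); // ...different encodings
}
```

Compute written against `Array` sees only the logical type; encodings can provide specialized implementations behind
the same interface.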

## Components

### Logical Types

The Vortex type-system is still in flux. The current set of logical types is below, with a rough Rust sketch of the
corresponding enum after the list:

* Null
* Bool
* Integer
* Float
* Decimal
* Binary
* UTF8
* List
* Struct
* Date/Time/DateTime/Duration: TODO (in-progress, currently partially supported)
* FixedList: TODO
* Union: TODO
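
As a rough illustration of the shape of such a type system (hypothetical; the actual Vortex definition differs, e.g.
around nullability), the list above could be modeled as a Rust enum:

```rust
// Hypothetical sketch of the logical types above; not Vortex's actual
// definition (which also handles nullability, widths, etc. differently).
enum DType {
    Null,
    Bool,
    Int { width: u8, signed: bool },
    Float { width: u8 },
    Decimal { precision: u8, scale: i8 },
    Binary,
    Utf8,
    List(Box<DType>),
    Struct(Vec<(String, DType)>),
    // TODO per the list above: Date/Time/DateTime/Duration, FixedList, Union.
}
```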

### Canonical/Flat Encodings

Vortex includes a base set of "flat" encodings that are designed to be zero-copy with Apache Arrow. These are the canonical
representations of each of the logical data types. The canonical encodings currently supported are listed below, with a
sketch of the `VarBin` layout after the list:

* Null
* Bool
* Primitive (Integer, Float)
* Struct
* VarBin
* VarBinView
* ...with more to come
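
As an illustration of what "flat" means here, a minimal sketch of a `VarBin`-style layout, mirroring Arrow's
variable-length binary encoding: a monotonic offsets buffer plus a single contiguous data buffer (illustrative only):

```rust
// Illustrative flat VarBin layout, mirroring Arrow's variable-length binary
// encoding: element i is data[offsets[i]..offsets[i + 1]].
struct VarBinArray {
    offsets: Vec<i32>, // len + 1 monotonically increasing entries
    data: Vec<u8>,     // all values' bytes, concatenated
}

impl VarBinArray {
    fn get(&self, i: usize) -> &[u8] {
        &self.data[self.offsets[i] as usize..self.offsets[i + 1] as usize]
    }
}

fn main() {
    let arr = VarBinArray { offsets: vec![0, 5, 5, 10], data: b"helloworld".to_vec() };
    assert_eq!(arr.get(0), b"hello");
    assert_eq!(arr.get(1), b""); // zero-length element, no extra storage
    assert_eq!(arr.get(2), b"world");
}
```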

### Compressed Encodings

Vortex includes a set of compressed encodings that can hold compressed data in memory, allowing us to defer
decompression. These are listed below, with a sketch of run-end encoding after the list:

* BitPacking
* Constant
* Chunked
* Dictionary
* Frame-of-Reference
* Run-end
* RoaringUInt
* RoaringBool
* Sparse
* ZigZag
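
For example, here is a minimal sketch of run-end encoding (illustrative only, not Vortex's implementation): each run's
value is stored once, together with the exclusive end offset of that run:

```rust
// Illustrative run-end encoding: one entry per run of equal values, storing
// the run's value and its (exclusive) end index in the decoded array.
fn run_end_encode(values: &[i64]) -> (Vec<i64>, Vec<u32>) {
    let mut run_values = Vec::new();
    let mut run_ends = Vec::new();
    for (i, &v) in values.iter().enumerate() {
        if run_values.last() != Some(&v) {
            run_values.push(v); // a new run starts here
            run_ends.push(i as u32 + 1);
        } else {
            *run_ends.last_mut().unwrap() = i as u32 + 1; // extend current run
        }
    }
    (run_values, run_ends)
}

fn run_end_decode(run_values: &[i64], run_ends: &[u32]) -> Vec<i64> {
    let mut out = Vec::new();
    let mut start = 0;
    for (&v, &end) in run_values.iter().zip(run_ends) {
        out.extend(std::iter::repeat(v).take((end - start) as usize));
        start = end;
    }
    out
}

fn main() {
    let data = vec![7, 7, 7, 2, 2, 9];
    let (vals, ends) = run_end_encode(&data);
    assert_eq!(vals, vec![7, 2, 9]);
    assert_eq!(ends, vec![3, 5, 6]);
    assert_eq!(run_end_decode(&vals, &ends), data);
}
```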

### Compression

Vortex's compression scheme is based on the [BtrBlocks](https://www.cs.cit.tum.de/fileadmin/w00cfj/dis/papers/btrblocks.pdf) paper.

Roughly, for each chunk of data, a sample is taken and a set of candidate encodings is attempted. The best-performing
encoding is then chosen to encode the entire chunk. This sounds like it would be very expensive, but given basic
statistics about a chunk, it is possible to cheaply rule out many encodings and ensure the search space does not
explode in size.
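
A sketch of that idea, with hypothetical names and a deliberately crude cost model (not Vortex's actual API):
compress a small sample under each candidate scheme and keep the winner for the whole chunk:

```rust
// Illustrative sketch in the spirit of BtrBlocks; all names are hypothetical.
trait Scheme {
    fn name(&self) -> &'static str;
    /// Estimated compressed size, in bytes, of the given values.
    fn compressed_size(&self, values: &[i64]) -> usize;
}

struct Plain;
impl Scheme for Plain {
    fn name(&self) -> &'static str { "plain" }
    fn compressed_size(&self, v: &[i64]) -> usize { v.len() * 8 }
}

struct RunEnd;
impl Scheme for RunEnd {
    fn name(&self) -> &'static str { "run-end" }
    fn compressed_size(&self, v: &[i64]) -> usize {
        let runs = 1 + v.windows(2).filter(|w| w[0] != w[1]).count();
        runs * 12 // 8 bytes per run value + 4 bytes per run end
    }
}

/// Pick the scheme that compresses a small sample of the chunk best.
fn choose_scheme<'a>(schemes: &'a [Box<dyn Scheme>], chunk: &[i64]) -> &'a dyn Scheme {
    let k = (chunk.len() / 64).max(1); // sample roughly 64 elements
    let sample: Vec<i64> = chunk.iter().step_by(k).copied().collect();
    schemes
        .iter()
        .map(|s| s.as_ref())
        .min_by_key(|s| s.compressed_size(&sample))
        .expect("at least one candidate scheme")
}

fn main() {
    let chunk = vec![5i64; 10_000];
    let schemes: Vec<Box<dyn Scheme>> = vec![Box::new(Plain), Box::new(RunEnd)];
    assert_eq!(choose_scheme(&schemes, &chunk).name(), "run-end");
}
```

This is where per-chunk statistics help: for example, a run count near the chunk length cheaply rules out run-end
encoding before any sampling.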

### Compute

Vortex provides the ability for each encoding to override the implementation of a compute function to avoid
decompressing where possible. For example, filtering a dictionary-encoded UTF8 array can be more cheaply performed by
filtering the dictionary first.
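
A sketch of that dictionary case (illustrative types, not Vortex's API): the predicate runs once per unique value,
and the result is broadcast through the codes:

```rust
// Illustrative dictionary-encoded filter: evaluate the predicate once per
// unique dictionary value, not once per element.
struct DictArray {
    codes: Vec<u32>,     // one code per element
    values: Vec<String>, // unique values
}

fn filter_mask(arr: &DictArray, pred: impl Fn(&str) -> bool) -> Vec<bool> {
    // O(|values|) predicate evaluations instead of O(|codes|).
    let keep: Vec<bool> = arr.values.iter().map(|v| pred(v)).collect();
    arr.codes.iter().map(|&c| keep[c as usize]).collect()
}

fn main() {
    let arr = DictArray {
        codes: vec![0, 1, 0, 2, 1],
        values: vec!["apple".into(), "banana".into(), "cherry".into()],
    };
    let mask = filter_mask(&arr, |v| v.starts_with('b'));
    assert_eq!(mask, vec![false, true, false, false, true]);
}
```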

Note that Vortex does not intend to become a full-fledged compute engine, but rather to provide the ability to
implement basic compute operations as may be required for efficient scanning & operation pushdown.

### Statistics

Vortex arrays carry lazily-computed summary statistics. Unlike other array libraries, these statistics can be populated
from disk formats such as Parquet and preserved all the way into a compute engine. Statistics are available to compute
kernels as well as to the compressor.
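
A minimal sketch of the lazy computation, assuming a non-empty primitive array (illustrative; not Vortex's
implementation):

```rust
use std::cell::OnceCell;

// Illustrative lazily-computed statistics: computed on first request and
// cached alongside the array. Assumes a non-empty array for brevity.
struct PrimitiveArray {
    values: Vec<i64>,
    min: OnceCell<i64>,
    max: OnceCell<i64>,
}

impl PrimitiveArray {
    fn new(values: Vec<i64>) -> Self {
        Self { values, min: OnceCell::new(), max: OnceCell::new() }
    }

    fn min(&self) -> i64 {
        *self.min.get_or_init(|| self.values.iter().copied().min().unwrap())
    }

    fn max(&self) -> i64 {
        *self.max.get_or_init(|| self.values.iter().copied().max().unwrap())
    }
}

fn main() {
    let arr = PrimitiveArray::new(vec![3, -1, 7, 2]);
    assert_eq!(arr.min(), -1); // computed on first call, cached after
    assert_eq!(arr.max(), 7);
}
```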

The current statistics are:

* BitWidthFreq
* TrailingZeroFreq
* IsConstant
* IsSorted
* IsStrictSorted
* Max
* Min
* RunCount
* TrueCount
* NullCount

### Serialization / Deserialization (Serde)

TODO

## Vs Apache Arrow

It is important to note that Vortex and Arrow have different design goals, so any direct comparison is somewhat
unfair. But given that both can be used as array libraries, the differences are worth noting.

Vortex is designed to be maximally compatible with Apache Arrow. All Arrow arrays can be converted into Vortex arrays
with zero-copy. And a Vortex array constructed from an Arrow array can be converted back to Arrow, again with zero-copy.

Vortex explicitly separates logical types from physical encodings, distinguishing it from Arrow. This allows
Vortex to model more complex arrays while still exposing a logical interface. For example, Vortex can model a UTF8
`ChunkedArray` where the first chunk is run-length encoded and the second chunk is dictionary encoded.
In Arrow, `RunLengthArray` and `DictionaryArray` are separate incompatible types, and so cannot be combined in this way.
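
A sketch of that example with hypothetical types (neither Arrow's nor Vortex's actual definitions): both chunks expose
the same logical UTF8 interface while choosing different physical layouts:

```rust
// Hypothetical sketch: chunks share one logical type (UTF8) while each
// chunk picks its own physical encoding.
trait Utf8Array {
    fn len(&self) -> usize;
    fn get(&self, i: usize) -> &str;
}

struct RunEndUtf8 { values: Vec<String>, run_ends: Vec<usize> }
struct DictUtf8 { codes: Vec<u32>, values: Vec<String> }

impl Utf8Array for RunEndUtf8 {
    fn len(&self) -> usize { *self.run_ends.last().unwrap_or(&0) }
    fn get(&self, i: usize) -> &str {
        // Index of the first run whose (exclusive) end is past i.
        let run = self.run_ends.partition_point(|&end| end <= i);
        &self.values[run]
    }
}

impl Utf8Array for DictUtf8 {
    fn len(&self) -> usize { self.codes.len() }
    fn get(&self, i: usize) -> &str { &self.values[self.codes[i] as usize] }
}

struct ChunkedUtf8 { chunks: Vec<Box<dyn Utf8Array>> }

impl ChunkedUtf8 {
    fn len(&self) -> usize { self.chunks.iter().map(|c| c.len()).sum() }
}

fn main() {
    let chunked = ChunkedUtf8 {
        chunks: vec![
            Box::new(RunEndUtf8 { values: vec!["a".into()], run_ends: vec![3] }),
            Box::new(DictUtf8 { codes: vec![0, 1, 0], values: vec!["x".into(), "y".into()] }),
        ],
    };
    assert_eq!(chunked.len(), 6);
    assert_eq!(chunked.chunks[1].get(2), "x");
}
```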

## Contributing

While we hope to turn Vortex into a community project, its current rapid rate of change makes taking contributions
without prior discussion infeasible. If you are interested in contributing, please open an issue to discuss your ideas.

## License

Licensed under the Apache License, Version 2.0 (the "License").
