Remove unsafe
gatesn committed Mar 26, 2024
2 parents c2fef84 + a95cbe9 commit 9bd5ba6
Showing 12 changed files with 337 additions and 222 deletions.
55 changes: 31 additions & 24 deletions README.md
@@ -11,33 +11,34 @@ next-generation columnar file format for multidimensional arrays called Spiral.
> [!CAUTION]
> This library is still under rapid development and is very much a work in progress!
>
-> Some key features are not yet implemented, the API will almost certainly change in breaking ways, and we cannot yet guarantee correctness in all cases.
+> Some key features are not yet implemented, the API will almost certainly change in breaking ways, and we cannot
+> yet guarantee correctness in all cases.
The major components of Vortex are (will be!):

* **Logical Types** - a schema definition that makes no assertions about physical layout.
-* **Encodings** - a pluggable set of physical layouts. Vortex ships with several state-of-the-art lightweight
-compression codecs that have the potential to support GPU decompression.
+* **Encodings** - a pluggable set of physical layouts. Vortex ships with several state-of-the-art lightweight
+  compression codecs that have the potential to support GPU decompression.
* **Compression** - recursive compression based on stratified samples of the input.
* **Compute** - basic compute kernels that can operate over compressed data. Note that Vortex does not intend to become
a full-fledged compute engine, but rather to provide the ability to implement basic compute operations as may be
required for efficient scanning & pushdown operations.
* **Statistics** - each array carries around lazily computed summary statistics, optionally populated at read-time.
These are available to compute kernels as well as to the compressor.
-* **Serde** - zero-copy serialization. Designed to work well both on-disk and over-the-wire.
+* **Serde** - zero-copy serialization. Useful as a building block in creating IPC or file formats that contain
+  compressed arrays.

## Overview: Logical vs Physical

One of the core principles in Vortex is separation of the logical from the physical.

-A Vortex array is defined by a logical data type (i.e., the type of scalar elements) as well as a physical encoding
+A Vortex array is defined by a logical data type (i.e., the type of scalar elements) as well as a physical encoding
(the type of the array itself). Vortex ships with several built-in encodings, as well as several extension encodings.

-The built-in encodings are primarily designed to model the Apache Arrow in-memory format, enabling us to construct Vortex
-arrays with zero-copy from Arrow arrays. There are also several built-in encodings (e.g., `sparse` and `chunked`) that
-are useful building blocks for other encodings.
-The included extension encodings are mostly designed to model compressed in-memory arrays, such as run-length or
-dictionary encoding.
+The built-in encodings are primarily designed to model the Apache Arrow in-memory format, enabling us to construct
+Vortex arrays with zero-copy from Arrow arrays. There are also several built-in encodings (e.g., `sparse` and
+`chunked`) that are useful building blocks for other encodings. The included extension encodings are mostly designed
+to model compressed in-memory arrays, such as run-length or dictionary encoding.
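This logical/physical split can be sketched in a few lines of Rust. The types below (`DType`, `Encoding`, `FlatInt32`, `ConstantInt32`) are hypothetical illustrations, not the actual Vortex API: two arrays with different physical layouts expose the same logical type.

```rust
// Hypothetical sketch of separating logical type from physical encoding.
#[derive(Debug, PartialEq)]
enum DType {
    Int32,
}

trait Encoding {
    /// The logical type says nothing about the physical layout.
    fn dtype(&self) -> DType;
    fn scalar_at(&self, index: usize) -> i32;
}

// A "flat" encoding: a plain buffer of values.
struct FlatInt32(Vec<i32>);

impl Encoding for FlatInt32 {
    fn dtype(&self) -> DType {
        DType::Int32
    }
    fn scalar_at(&self, index: usize) -> i32 {
        self.0[index]
    }
}

// A constant encoding: one value logically repeated `len` times.
struct ConstantInt32 {
    value: i32,
    len: usize,
}

impl Encoding for ConstantInt32 {
    fn dtype(&self) -> DType {
        DType::Int32
    }
    fn scalar_at(&self, _index: usize) -> i32 {
        self.value
    }
}
```

Compute kernels can then be written against `Encoding` without knowing, or decompressing, the physical layout underneath.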

## Components

@@ -47,21 +48,21 @@ The Vortex type-system is still in flux. The current set of logical types is:

* Null
* Bool
-* Integer
-* Float
-* Decimal
+* Integer(8, 16, 32, 64)
+* Float(16, b16, 32, 64)
* Binary
* UTF8
-* List
* Struct
+* Decimal: TODO
+* Date/Time/DateTime/Duration: TODO (in-progress, currently partially supported)
+* List: TODO
+* FixedList: TODO
+* Union: TODO

### Canonical/Flat Encodings

-Vortex includes a base set of "flat" encodings that are designed to be zero-copy with Apache Arrow. These are the canonical
-representations of each of the logical data types. The canonical encodings currently supported are:
+Vortex includes a base set of "flat" encodings that are designed to be zero-copy with Apache Arrow. These are the
+canonical representations of each of the logical data types. The canonical encodings currently supported are:

* Null
* Bool
@@ -76,7 +77,7 @@ representations of each of the logical data types. The canonical encodings currently supported are:
Vortex includes a set of compressed encodings that can hold compressed in-memory arrays, allowing us to defer
decompression. These are:

-* BitPacking
+* BitPacked
* Constant
* Chunked
* Dictionary
@@ -89,12 +90,13 @@ compression. These are:

### Compression

-Vortex's compression scheme is based on the [BtrBlocks](https://www.cs.cit.tum.de/fileadmin/w00cfj/dis/papers/btrblocks.pdf) paper.
+Vortex's compression scheme is based on
+the [BtrBlocks](https://www.cs.cit.tum.de/fileadmin/w00cfj/dis/papers/btrblocks.pdf) paper.

-Roughly, for each chunk of data, a sample of at least ~1% of the data is taken. Compression is then attempted (recursively)
-with a set of lightweight encodings. The best-performing combination of encodings is then chosen to encode the entire chunk.
-This sounds like it would be very expensive, but given basic statistics about a chunk, it is possible to cheaply prune
-many encodings and ensure the search space does not explode in size.
+Roughly, for each chunk of data, a sample of at least ~1% of the data is taken. Compression is then attempted (
+recursively) with a set of lightweight encodings. The best-performing combination of encodings is then chosen to encode
+the entire chunk. This sounds like it would be very expensive, but given basic statistics about a chunk, it is
+possible to cheaply prune many encodings and ensure the search space does not explode in size.
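The sampling idea can be sketched as follows. This is an illustrative toy, not the Vortex compressor: the function names and the two candidate "encodings" (run-length vs. plain) are hypothetical, and the real implementation samples recursively over many more codecs.

```rust
// Estimate the size of a run-length encoding: one (value, length) pair per run.
fn estimate_rle_size(data: &[i32]) -> usize {
    let mut runs = 0;
    let mut prev: Option<i32> = None;
    for &v in data {
        if prev != Some(v) {
            runs += 1;
            prev = Some(v);
        }
    }
    runs * 2 * std::mem::size_of::<i32>()
}

// A plain encoding stores every value as-is.
fn estimate_plain_size(data: &[i32]) -> usize {
    data.len() * std::mem::size_of::<i32>()
}

/// Pick the cheaper encoding by measuring a small sample (here every 100th
/// element, roughly 1% of the chunk) rather than the whole chunk.
fn choose_encoding(chunk: &[i32]) -> &'static str {
    let sample: Vec<i32> = chunk.iter().step_by(100).copied().collect();
    if estimate_rle_size(&sample) < estimate_plain_size(&sample) {
        "rle"
    } else {
        "plain"
    }
}
```

A chunk of repeated values selects run-length encoding from the sample alone; a chunk of distinct values falls back to plain, without ever compressing the full chunk twice.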

### Compute

@@ -126,7 +128,12 @@ The current statistics are:

### Serialization / Deserialization (Serde)

-TODO
+Vortex serde is currently in the design phase. The goals of this implementation are:
+
+* Support scanning (column projection + row filter) with zero-copy and zero heap allocation.
+* Support random access in constant time.
+* Forward statistical information (such as sortedness) to consumers.
+* Provide a building block for file format authors to store compressed array data.

## Vs Apache Arrow

@@ -143,7 +150,7 @@ In Arrow, `RunLengthArray` and `DictionaryArray` are separate incompatible types

## Contributing

-While we hope to turn Vortex into a community project, its current rapid rate of change makes taking contributions
+While we hope to turn Vortex into a community project, its current rapid rate of change makes taking contributions
without prior discussion infeasible. If you are interested in contributing, please open an issue to discuss your ideas.

## License
134 changes: 129 additions & 5 deletions vortex-array/src/array/mod.rs
@@ -32,6 +32,7 @@ use crate::compute::ArrayCompute;
use crate::formatter::{ArrayDisplay, ArrayFormatter};
use crate::serde::{ArraySerde, EncodingSerde};
use crate::stats::Stats;
+use crate::validity::{ArrayValidity, Validity};

pub mod bool;
pub mod chunked;
@@ -110,7 +111,6 @@ macro_rules! impl_array {
};
}

-use crate::validity::{ArrayValidity, Validity};
pub use impl_array;

impl ArrayCompute for ArrayRef {
@@ -151,6 +151,24 @@ impl ArrayCompute for ArrayRef {
}
}

impl ArrayValidity for ArrayRef {
fn nullability(&self) -> Nullability {
self.as_ref().nullability()
}

fn validity(&self) -> Option<Validity> {
self.as_ref().validity()
}

fn logical_validity(&self) -> Option<Validity> {
self.as_ref().logical_validity()
}

fn is_valid(&self, index: usize) -> bool {
self.as_ref().is_valid(index)
}
}

impl Array for ArrayRef {
fn as_any(&self) -> &dyn Any {
self.as_ref().as_any()
@@ -201,15 +219,121 @@ impl Array for ArrayRef {
}
}

-impl ArrayValidity for ArrayRef {
+impl ArrayDisplay for ArrayRef {
fn fmt(&self, fmt: &'_ mut ArrayFormatter) -> std::fmt::Result {
ArrayDisplay::fmt(self.as_ref(), fmt)
}
}

impl<'a, T: ArrayCompute> ArrayCompute for &'a T {
fn as_arrow(&self) -> Option<&dyn AsArrowArray> {
T::as_arrow(self)
}

fn as_contiguous(&self) -> Option<&dyn AsContiguousFn> {
T::as_contiguous(self)
}

fn cast(&self) -> Option<&dyn CastFn> {
T::cast(self)
}

fn flatten(&self) -> Option<&dyn FlattenFn> {
T::flatten(self)
}

fn fill_forward(&self) -> Option<&dyn FillForwardFn> {
T::fill_forward(self)
}

fn patch(&self) -> Option<&dyn PatchFn> {
T::patch(self)
}

fn scalar_at(&self) -> Option<&dyn ScalarAtFn> {
T::scalar_at(self)
}

fn search_sorted(&self) -> Option<&dyn SearchSortedFn> {
T::search_sorted(self)
}

fn take(&self) -> Option<&dyn TakeFn> {
T::take(self)
}
}

impl<'a, T: ArrayValidity> ArrayValidity for &'a T {
fn nullability(&self) -> Nullability {
T::nullability(self)
}

fn validity(&self) -> Option<Validity> {
-self.as_ref().validity()
+T::validity(self)
}

fn logical_validity(&self) -> Option<Validity> {
T::logical_validity(self)
}

fn is_valid(&self, index: usize) -> bool {
T::is_valid(self, index)
}
}

-impl ArrayDisplay for ArrayRef {
+impl<'a, T: Array + Clone> Array for &'a T {
fn as_any(&self) -> &dyn Any {
T::as_any(self)
}

fn into_any(self: Arc<Self>) -> Arc<dyn Any + Send + Sync> {
T::into_any(Arc::new((*self).clone()))
}

fn to_array(&self) -> ArrayRef {
T::to_array(self)
}

fn into_array(self) -> ArrayRef {
self.to_array()
}

fn len(&self) -> usize {
T::len(self)
}

fn is_empty(&self) -> bool {
T::is_empty(self)
}

fn dtype(&self) -> &DType {
T::dtype(self)
}

fn stats(&self) -> Stats {
T::stats(self)
}

fn slice(&self, start: usize, stop: usize) -> VortexResult<ArrayRef> {
T::slice(self, start, stop)
}

fn encoding(&self) -> EncodingRef {
T::encoding(self)
}

fn nbytes(&self) -> usize {
T::nbytes(self)
}

fn serde(&self) -> Option<&dyn ArraySerde> {
T::serde(self)
}
}

impl<'a, T: ArrayDisplay> ArrayDisplay for &'a T {
fn fmt(&self, fmt: &'_ mut ArrayFormatter) -> std::fmt::Result {
-ArrayDisplay::fmt(self.as_ref(), fmt)
+ArrayDisplay::fmt(*self, fmt)
}
}
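The new `impl<'a, T: …> … for &'a T` blocks in this diff all follow one pattern: implement a trait for `&T` by forwarding every method to `T`, so references satisfy the same bounds as owned values. A self-contained sketch of that pattern, using a hypothetical `Length` trait rather than the Vortex traits:

```rust
trait Length {
    fn len(&self) -> usize;
    fn is_empty(&self) -> bool {
        self.len() == 0
    }
}

struct Chunk(Vec<u8>);

impl Length for Chunk {
    fn len(&self) -> usize {
        self.0.len()
    }
}

// The blanket impl: any `&T` where `T: Length` is itself `Length`.
// Each method simply delegates to `T`, as the diff does with `T::len(self)`.
impl<'a, T: Length> Length for &'a T {
    fn len(&self) -> usize {
        T::len(self)
    }
}

// A bound that consumes `T: Length` by value now accepts `&Chunk` too.
fn len_by_value<T: Length>(value: T) -> usize {
    value.len()
}
```

Without the blanket impl, `len_by_value(&chunk)` would fail to compile; with it, callers can pass either owned values or references.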

2 changes: 0 additions & 2 deletions vortex-array/src/array/primitive/mod.rs
@@ -221,8 +221,6 @@ impl PrimitiveArray {
}
}

-pub type PrimitiveIter<'a, T> = ArrayIter<dyn ArrayAccessor<T>, T>;

#[derive(Debug)]
pub struct PrimitiveEncoding;

51 changes: 51 additions & 0 deletions vortex-array/src/array/varbin/accessor.rs
@@ -0,0 +1,51 @@
use num_traits::AsPrimitive;

use crate::accessor::ArrayAccessor;
use crate::array::downcast::DowncastArrayBuiltin;
use crate::array::varbin::VarBinArray;
use crate::array::Array;
use crate::compute::flatten::flatten_primitive;
use crate::compute::scalar_at::scalar_at;
use crate::match_each_native_ptype;
use crate::validity::ArrayValidity;

fn offset_at(array: &dyn Array, index: usize) -> usize {
if let Some(parray) = array.maybe_primitive() {
match_each_native_ptype!(parray.ptype(), |$P| {
parray.typed_data::<$P>()[index].as_()
})
} else {
scalar_at(array, index).unwrap().try_into().unwrap()
}
}

impl<'a> ArrayAccessor<&'a [u8]> for &'a VarBinArray {
fn value(&self, index: usize) -> Option<&'a [u8]> {
if self.is_valid(index) {
let start = offset_at(self.offsets(), index);
let end = offset_at(self.offsets(), index + 1);
Some(&self.bytes().as_primitive().buffer()[start..end])
} else {
None
}
}
}

impl<'a> ArrayAccessor<Vec<u8>> for &'a VarBinArray {
fn value(&self, index: usize) -> Option<Vec<u8>> {
if self.is_valid(index) {
let start = offset_at(self.offsets(), index);
let end = offset_at(self.offsets(), index + 1);

let slice_bytes = self.bytes().slice(start, end).unwrap();
Some(
flatten_primitive(&slice_bytes)
.unwrap()
.typed_data::<u8>()
.to_vec(),
)
} else {
None
}
}
}
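The core of the new accessor is the offsets-based lookup: value `i` lives at `bytes[offsets[i]..offsets[i + 1]]`, guarded by a validity check. A self-contained sketch of that lookup (the `MiniVarBin` type is hypothetical; the real `VarBinArray` also supports non-primitive offset arrays, hence the `offset_at` fallback through `scalar_at`):

```rust
struct MiniVarBin {
    bytes: Vec<u8>,
    // offsets.len() == number of values + 1; value i spans offsets[i]..offsets[i + 1].
    offsets: Vec<usize>,
    validity: Vec<bool>,
}

impl MiniVarBin {
    /// Zero-copy access to value `i`, or None if the slot is null.
    fn value(&self, index: usize) -> Option<&[u8]> {
        if self.validity[index] {
            let start = self.offsets[index];
            let end = self.offsets[index + 1];
            Some(&self.bytes[start..end])
        } else {
            None
        }
    }
}
```

The borrowed-slice return mirrors the `ArrayAccessor<&'a [u8]>` impl above; the `ArrayAccessor<Vec<u8>>` impl trades the zero-copy borrow for an owned buffer.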