Remove unsafe
gatesn committed Mar 26, 2024
2 parents c2fef84 + a95cbe9 commit 9bd5ba6
Showing 12 changed files with 337 additions and 222 deletions.
55 changes: 31 additions & 24 deletions README.md
@@ -11,33 +11,34 @@ next-generation columnar file format for multidimensional arrays called Spiral.
> [!CAUTION]
> This library is still under rapid development and is very much a work in progress!
>
-> Some key features are not yet implemented, the API will almost certainly change in breaking ways, and we cannot yet guarantee correctness in all cases.
+> Some key features are not yet implemented, the API will almost certainly change in breaking ways, and we cannot
+> yet guarantee correctness in all cases.
The major components of Vortex are (will be!):

* **Logical Types** - a schema definition that makes no assertions about physical layout.
-* **Encodings** - a pluggable set of physical layouts. Vortex ships with several state-of-the-art lightweight
-compression codecs that have the potential to support GPU decompression.
+* **Encodings** - a pluggable set of physical layouts. Vortex ships with several state-of-the-art lightweight
+  compression codecs that have the potential to support GPU decompression.
* **Compression** - recursive compression based on stratified samples of the input.
* **Compute** - basic compute kernels that can operate over compressed data. Note that Vortex does not intend to become
a full-fledged compute engine, but rather to provide the ability to implement basic compute operations as may be
required for efficient scanning & pushdown operations.
* **Statistics** - each array carries around lazily computed summary statistics, optionally populated at read-time.
These are available to compute kernels as well as to the compressor.
-* **Serde** - zero-copy serialization. Designed to work well both on-disk and over-the-wire.
+* **Serde** - zero-copy serialization. Useful as a building block in creating IPC or file formats that contain
+  compressed arrays.

## Overview: Logical vs Physical

One of the core principles in Vortex is separation of the logical from the physical.

-A Vortex array is defined by a logical data type (i.e., the type of scalar elements) as well as a physical encoding
+A Vortex array is defined by a logical data type (i.e., the type of scalar elements) as well as a physical encoding
(the type of the array itself). Vortex ships with several built-in encodings, as well as several extension encodings.

-The built-in encodings are primarily designed to model the Apache Arrow in-memory format, enabling us to construct Vortex
-arrays with zero-copy from Arrow arrays. There are also several built-in encodings (e.g., `sparse` and `chunked`) that
-are useful building blocks for other encodings.
-The included extension encodings are mostly designed to model compressed in-memory arrays, such as run-length or
-dictionary encoding.
+The built-in encodings are primarily designed to model the Apache Arrow in-memory format, enabling us to construct
+Vortex arrays with zero-copy from Arrow arrays. There are also several built-in encodings (e.g., `sparse` and
+`chunked`) that are useful building blocks for other encodings. The included extension encodings are mostly designed
+to model compressed in-memory arrays, such as run-length or dictionary encoding.
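This logical/physical split can be sketched in a few lines of Rust. The types below (`DType`, `Encoding`, `FlatInt32`, `ConstantInt32`) are hypothetical illustrations, not the actual Vortex API: two arrays with different physical layouts expose the same logical type.

```rust
// Hypothetical sketch of separating logical type from physical encoding.
#[derive(Debug, PartialEq)]
enum DType {
    Int32,
}

trait Encoding {
    /// The logical type says nothing about the physical layout.
    fn dtype(&self) -> DType;
    fn scalar_at(&self, index: usize) -> i32;
}

// A "flat" encoding: a plain buffer of values.
struct FlatInt32(Vec<i32>);

impl Encoding for FlatInt32 {
    fn dtype(&self) -> DType {
        DType::Int32
    }
    fn scalar_at(&self, index: usize) -> i32 {
        self.0[index]
    }
}

// A constant encoding: one value logically repeated `len` times.
struct ConstantInt32 {
    value: i32,
    len: usize,
}

impl Encoding for ConstantInt32 {
    fn dtype(&self) -> DType {
        DType::Int32
    }
    fn scalar_at(&self, _index: usize) -> i32 {
        self.value
    }
}
```

Compute kernels can then be written against `Encoding` without knowing, or decompressing, the physical layout underneath.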

## Components

@@ -47,21 +48,21 @@ The Vortex type-system is still in flux. The current set of logical types is:

* Null
* Bool
-* Integer
-* Float
-* Decimal
+* Integer(8, 16, 32, 64)
+* Float(16, b16, 32, 64)
* Binary
* UTF8
-* List
* Struct
+* Decimal: TODO
+* Date/Time/DateTime/Duration: TODO (in-progress, currently partially supported)
+* List: TODO
+* FixedList: TODO
+* Union: TODO

### Canonical/Flat Encodings

-Vortex includes a base set of "flat" encodings that are designed to be zero-copy with Apache Arrow. These are the canonical
-representations of each of the logical data types. The canonical encodings currently supported are:
+Vortex includes a base set of "flat" encodings that are designed to be zero-copy with Apache Arrow. These are the
+canonical representations of each of the logical data types. The canonical encodings currently supported are:

* Null
* Bool
@@ -76,7 +77,7 @@ representations of each of the logical data types. The canonical encodings currently supported are:
Vortex includes a set of compressed encodings that can hold compressed in-memory arrays, allowing us to defer
decompression. These are:

-* BitPacking
+* BitPacked
* Constant
* Chunked
* Dictionary
@@ -89,12 +90,13 @@ compression. These are:

### Compression

-Vortex's compression scheme is based on the [BtrBlocks](https://www.cs.cit.tum.de/fileadmin/w00cfj/dis/papers/btrblocks.pdf) paper.
+Vortex's compression scheme is based on
+the [BtrBlocks](https://www.cs.cit.tum.de/fileadmin/w00cfj/dis/papers/btrblocks.pdf) paper.

-Roughly, for each chunk of data, a sample of at least ~1% of the data is taken. Compression is then attempted (recursively)
-with a set of lightweight encodings. The best-performing combination of encodings is then chosen to encode the entire chunk.
-This sounds like it would be very expensive, but given basic statistics about a chunk, it is possible to cheaply prune
-many encodings and ensure the search space does not explode in size.
+Roughly, for each chunk of data, a sample of at least ~1% of the data is taken. Compression is then attempted (
+recursively) with a set of lightweight encodings. The best-performing combination of encodings is then chosen to encode
+the entire chunk. This sounds like it would be very expensive, but given basic statistics about a chunk, it is
+possible to cheaply prune many encodings and ensure the search space does not explode in size.
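The sampling idea can be sketched as follows. This is an illustrative toy, not the Vortex compressor: the function names and the two candidate "encodings" (run-length vs. plain) are hypothetical, and the real implementation samples recursively over many more codecs.

```rust
// Estimate the size of a run-length encoding: one (value, length) pair per run.
fn estimate_rle_size(data: &[i32]) -> usize {
    let mut runs = 0;
    let mut prev: Option<i32> = None;
    for &v in data {
        if prev != Some(v) {
            runs += 1;
            prev = Some(v);
        }
    }
    runs * 2 * std::mem::size_of::<i32>()
}

// A plain encoding stores every value as-is.
fn estimate_plain_size(data: &[i32]) -> usize {
    data.len() * std::mem::size_of::<i32>()
}

/// Pick the cheaper encoding by measuring a small sample (here every 100th
/// element, roughly 1% of the chunk) rather than the whole chunk.
fn choose_encoding(chunk: &[i32]) -> &'static str {
    let sample: Vec<i32> = chunk.iter().step_by(100).copied().collect();
    if estimate_rle_size(&sample) < estimate_plain_size(&sample) {
        "rle"
    } else {
        "plain"
    }
}
```

A chunk of repeated values selects run-length encoding from the sample alone; a chunk of distinct values falls back to plain, without ever compressing the full chunk twice.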

### Compute

@@ -126,7 +128,12 @@ The current statistics are:

### Serialization / Deserialization (Serde)

-TODO
+Vortex serde is currently in the design phase. The goals of this implementation are:
+
+* Support scanning (column projection + row filter) with zero-copy and zero heap allocation.
+* Support random access in constant time.
+* Forward statistical information (such as sortedness) to consumers.
+* Provide a building block for file format authors to store compressed array data.

## Vs Apache Arrow

@@ -143,7 +150,7 @@ In Arrow, `RunLengthArray` and `DictionaryArray` are separate incompatible types

## Contributing

-While we hope to turn Vortex into a community project, its current rapid rate of change makes taking contributions
+While we hope to turn Vortex into a community project, its current rapid rate of change makes taking contributions
without prior discussion infeasible. If you are interested in contributing, please open an issue to discuss your ideas.

## License
134 changes: 129 additions & 5 deletions vortex-array/src/array/mod.rs
@@ -32,6 +32,7 @@ use crate::compute::ArrayCompute;
use crate::formatter::{ArrayDisplay, ArrayFormatter};
use crate::serde::{ArraySerde, EncodingSerde};
use crate::stats::Stats;
+use crate::validity::{ArrayValidity, Validity};

pub mod bool;
pub mod chunked;
@@ -110,7 +111,6 @@ macro_rules! impl_array {
};
}

-use crate::validity::{ArrayValidity, Validity};
pub use impl_array;

impl ArrayCompute for ArrayRef {
@@ -151,6 +151,24 @@ impl ArrayCompute for ArrayRef {
}
}

impl ArrayValidity for ArrayRef {
fn nullability(&self) -> Nullability {
self.as_ref().nullability()
}

fn validity(&self) -> Option<Validity> {
self.as_ref().validity()
}

fn logical_validity(&self) -> Option<Validity> {
self.as_ref().logical_validity()
}

fn is_valid(&self, index: usize) -> bool {
self.as_ref().is_valid(index)
}
}

impl Array for ArrayRef {
fn as_any(&self) -> &dyn Any {
self.as_ref().as_any()
@@ -201,15 +219,121 @@ impl Array for ArrayRef {
}
}

-impl ArrayValidity for ArrayRef {
+impl ArrayDisplay for ArrayRef {
fn fmt(&self, fmt: &'_ mut ArrayFormatter) -> std::fmt::Result {
ArrayDisplay::fmt(self.as_ref(), fmt)
}
}

impl<'a, T: ArrayCompute> ArrayCompute for &'a T {
fn as_arrow(&self) -> Option<&dyn AsArrowArray> {
T::as_arrow(self)
}

fn as_contiguous(&self) -> Option<&dyn AsContiguousFn> {
T::as_contiguous(self)
}

fn cast(&self) -> Option<&dyn CastFn> {
T::cast(self)
}

fn flatten(&self) -> Option<&dyn FlattenFn> {
T::flatten(self)
}

fn fill_forward(&self) -> Option<&dyn FillForwardFn> {
T::fill_forward(self)
}

fn patch(&self) -> Option<&dyn PatchFn> {
T::patch(self)
}

fn scalar_at(&self) -> Option<&dyn ScalarAtFn> {
T::scalar_at(self)
}

fn search_sorted(&self) -> Option<&dyn SearchSortedFn> {
T::search_sorted(self)
}

fn take(&self) -> Option<&dyn TakeFn> {
T::take(self)
}
}

impl<'a, T: ArrayValidity> ArrayValidity for &'a T {
fn nullability(&self) -> Nullability {
T::nullability(self)
}

fn validity(&self) -> Option<Validity> {
-self.as_ref().validity()
+T::validity(self)
}

fn logical_validity(&self) -> Option<Validity> {
T::logical_validity(self)
}

fn is_valid(&self, index: usize) -> bool {
T::is_valid(self, index)
}
}

-impl ArrayDisplay for ArrayRef {
+impl<'a, T: Array + Clone> Array for &'a T {
fn as_any(&self) -> &dyn Any {
T::as_any(self)
}

fn into_any(self: Arc<Self>) -> Arc<dyn Any + Send + Sync> {
T::into_any(Arc::new((*self).clone()))
}

fn to_array(&self) -> ArrayRef {
T::to_array(self)
}

fn into_array(self) -> ArrayRef {
self.to_array()
}

fn len(&self) -> usize {
T::len(self)
}

fn is_empty(&self) -> bool {
T::is_empty(self)
}

fn dtype(&self) -> &DType {
T::dtype(self)
}

fn stats(&self) -> Stats {
T::stats(self)
}

fn slice(&self, start: usize, stop: usize) -> VortexResult<ArrayRef> {
T::slice(self, start, stop)
}

fn encoding(&self) -> EncodingRef {
T::encoding(self)
}

fn nbytes(&self) -> usize {
T::nbytes(self)
}

fn serde(&self) -> Option<&dyn ArraySerde> {
T::serde(self)
}
}

impl<'a, T: ArrayDisplay> ArrayDisplay for &'a T {
fn fmt(&self, fmt: &'_ mut ArrayFormatter) -> std::fmt::Result {
-ArrayDisplay::fmt(self.as_ref(), fmt)
+ArrayDisplay::fmt(*self, fmt)
}
}
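The new `impl<'a, T: …> … for &'a T` blocks in this diff all follow one pattern: implement a trait for `&T` by forwarding every method to `T`, so references satisfy the same bounds as owned values. A self-contained sketch of that pattern, using a hypothetical `Length` trait rather than the Vortex traits:

```rust
trait Length {
    fn len(&self) -> usize;
    fn is_empty(&self) -> bool {
        self.len() == 0
    }
}

struct Chunk(Vec<u8>);

impl Length for Chunk {
    fn len(&self) -> usize {
        self.0.len()
    }
}

// The blanket impl: any `&T` where `T: Length` is itself `Length`.
// Each method simply delegates to `T`, as the diff does with `T::len(self)`.
impl<'a, T: Length> Length for &'a T {
    fn len(&self) -> usize {
        T::len(self)
    }
}

// A bound that consumes `T: Length` by value now accepts `&Chunk` too.
fn len_by_value<T: Length>(value: T) -> usize {
    value.len()
}
```

Without the blanket impl, `len_by_value(&chunk)` would fail to compile; with it, callers can pass either owned values or references.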

2 changes: 0 additions & 2 deletions vortex-array/src/array/primitive/mod.rs
@@ -221,8 +221,6 @@ impl PrimitiveArray {
}
}

-pub type PrimitiveIter<'a, T> = ArrayIter<dyn ArrayAccessor<T>, T>;

#[derive(Debug)]
pub struct PrimitiveEncoding;

51 changes: 51 additions & 0 deletions vortex-array/src/array/varbin/accessor.rs
@@ -0,0 +1,51 @@
use num_traits::AsPrimitive;

use crate::accessor::ArrayAccessor;
use crate::array::downcast::DowncastArrayBuiltin;
use crate::array::varbin::VarBinArray;
use crate::array::Array;
use crate::compute::flatten::flatten_primitive;
use crate::compute::scalar_at::scalar_at;
use crate::match_each_native_ptype;
use crate::validity::ArrayValidity;

fn offset_at(array: &dyn Array, index: usize) -> usize {
if let Some(parray) = array.maybe_primitive() {
match_each_native_ptype!(parray.ptype(), |$P| {
parray.typed_data::<$P>()[index].as_()
})
} else {
scalar_at(array, index).unwrap().try_into().unwrap()
}
}

impl<'a> ArrayAccessor<&'a [u8]> for &'a VarBinArray {
fn value(&self, index: usize) -> Option<&'a [u8]> {
if self.is_valid(index) {
let start = offset_at(self.offsets(), index);
let end = offset_at(self.offsets(), index + 1);
Some(&self.bytes().as_primitive().buffer()[start..end])
} else {
None
}
}
}

impl<'a> ArrayAccessor<Vec<u8>> for &'a VarBinArray {
fn value(&self, index: usize) -> Option<Vec<u8>> {
if self.is_valid(index) {
let start = offset_at(self.offsets(), index);
let end = offset_at(self.offsets(), index + 1);

let slice_bytes = self.bytes().slice(start, end).unwrap();
Some(
flatten_primitive(&slice_bytes)
.unwrap()
.typed_data::<u8>()
.to_vec(),
)
} else {
None
}
}
}
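The core of the new accessor is the offsets-based lookup: value `i` lives at `bytes[offsets[i]..offsets[i + 1]]`, guarded by a validity check. A self-contained sketch of that lookup (the `MiniVarBin` type is hypothetical; the real `VarBinArray` also supports non-primitive offset arrays, hence the `offset_at` fallback through `scalar_at`):

```rust
struct MiniVarBin {
    bytes: Vec<u8>,
    // offsets.len() == number of values + 1; value i spans offsets[i]..offsets[i + 1].
    offsets: Vec<usize>,
    validity: Vec<bool>,
}

impl MiniVarBin {
    /// Zero-copy access to value `i`, or None if the slot is null.
    fn value(&self, index: usize) -> Option<&[u8]> {
        if self.validity[index] {
            let start = self.offsets[index];
            let end = self.offsets[index + 1];
            Some(&self.bytes[start..end])
        } else {
            None
        }
    }
}
```

The borrowed-slice return mirrors the `ArrayAccessor<&'a [u8]>` impl above; the `ArrayAccessor<Vec<u8>>` impl trades the zero-copy borrow for an owned buffer.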