Skip to content

Latest commit

 

History

History
316 lines (214 loc) · 12.4 KB

README.md

File metadata and controls

316 lines (214 loc) · 12.4 KB

A Guide to Contiguous Data in Rust

Many modern languages have collections called "array," "slice," or "vector." Rust has all three, plus many third-party libraries! This is an opinionated guide that tries to help you choose the best way to store contiguous data in Rust. It covers tools from Rust's core and standard library as well as third-party crates that meet more specific needs.

Contiguous data is when multiple pieces of data are stored next to each other in memory. It is often a great way to store collections, because it can provide great cache locality and branch prediction.

Tradeoffs

These are things to think about when choosing a technique for storing contiguous data.

Mutability

There are a few broad levels of mutability for contiguous data in Rust:

  • Completely immutable
  • Fixed length, mutable contents
  • Mutable length and contents

Allocation

Some solutions can be used with const and static, some solutions can use the memory of the stack, and some need to use memory allocated by an allocator (I'll call this "the heap" although I don't think that it technically has to be a heap).

Splitting

There are several ways that you can split contiguous data in Rust. Reviewing References and Borrowing from TRPL might be helpful to understanding this better. Since any contiguous data can be turned into a slice, the slice splitting rules generally apply for borrowed data. The bytes crate provides an interesting way to split owned data.

Extend vs. refuse

When you run out of memory in your contiguous data, should the data structure ask to allocate more memory or should it refuse to give you more memory? If you want to intentionally use a capped amount of memory, it might be better to refuse to extend the memory. This could be useful in embedded and/or real-time programming.

Types

Most of the solutions below can store any type but there are a couple exceptions.

Solutions

The solutions are roughly in order of increasing power, according to the Principle of Least Power. So, consider using the first solution that will solve your problem.

Fixed-size array: [T; N]

When you create a fixed-size array in Rust, you must specify the size (N) at compile time.

Given a mutable fixed-size array, you can change the content, but there is no way to change the size.

const MY_DATA: [i8; 3] = [1, 2, 3];

fn main() {
    let mut my_data = [2; 3];
    my_data[0] = 1;
    my_data[2] = 3;
    assert_eq!(MY_DATA, my_data);
}

Because the size is known at compile time, a fixed-size array can be stored anywhere, even to be used as a const. However, it might not be a good idea to store large fixed-size arrays on the stack because it could cause a stack overflow.

The length of the array is actually part of its type. This means that the following code, for example, won't compile because you can't compare two variables with different types:

const MY_DATA_3: [i8; 3] = [1, 2, 3];
const MY_DATA_4: [i8; 4] = [1, 2, 3, 4];

fn main() {
    assert_eq!(MY_DATA_3, MY_DATA_4);
}

An array cannot be split.

See also Array types from The Rust Reference.

Slice: &[T] or &mut [T]

In Rust, a slice is "a dynamically-sized view into a contiguous sequence." In contrast to an array, the length of a slice isn't known at compile time.

Given a mutable slice, you can change the content, but there is no way to change the length. When we take a mutable slice to an array, we're actually modifying the data in the array itself. That's why my_array needs to be declared as mut below.

fn main() {
    let mut my_array: [i8; 3] = [1, 2, 0];
    let my_slice: &mut [i8] = &mut my_array[1..];
    my_slice[1] = 3;
    assert_eq!(my_array, [1, 2, 3]);
}

Unlike with an an array, two slices of different lengths are the same type, so the code below works:

const MY_SLICE: &[i8] = &[1, 2, 3];

fn main() {
    let my_slice: &[i8] = &[1, 2, 3, 4];
    assert_ne!(MY_SLICE, my_slice);
}

A slice can refer to memory anywhere including as a const.

You can always create as many overlapping shared slices (&[T]) as you want, although you can't mutate them:

const MY_DATA: [i8; 3] = [2; 3];

fn main() {
    // Compare two overlapping slices
    assert_eq!(&MY_DATA[..2], &MY_DATA[1..]);
}

You can split a mutable slice into multiple non-overlapping mutable slices:

fn main() {
    let my_data = &mut [0, 2, 3, 0];
    let (left, right) = my_data.split_at_mut(2);
    left[0] = 1;
    right[1] = 4;
    assert_eq!(my_data, &[1, 2, 3, 4]);
}

See also TRPL on slices.

Boxed slice: Box<[T]>

With a boxed slice, you can create arrays at run time without knowing the length at compile time. In most cases, you could probably just use Vec instead. However, boxed slices do have a few use cases:

  • Writing a custom buffer (e.g. std::io::BufReader)
  • If you want to store data slightly more efficiently than a Vec (a boxed slice doesn't store a capacity)
  • If you generally want to prevent users from resizing the data

The code below won't compile; you can't collect an iterator into a fixed-size array because the number of elements in an iterator isn't generally known at compile time.

fn main() {
    let my_data: [i8; 3] = [1, 2, 3].iter().cloned().collect();
}

...but you can collect into a boxed slice:

const MY_DATA: [i8; 3] = [1, 2, 3];


fn main() {
    let my_data: Box<[i8]> = MY_DATA.iter().cloned().collect();

    // We need to take a slice b/c we can't compare boxed slices and arrays directly
    assert_eq!(&my_data[..], MY_DATA);
}

Because the length isn't known at compile time, the data for a boxed slice can only be allocated with an allocator.

Vec

std::vec::Vec is the std way to do variable-length contiguous data. When you try to add data and there's not enough room, the data structure will allocate more memory for the data.

fn main() {
    let mut vec = Vec::new();
    vec.push(1);
    vec.push(2);
    vec.push(3);
    
    assert_eq!(vec.len(), 3);
    assert_eq!(vec, [1, 2, 3]);
}

Because the length is dynamic and not known at compile time, the data for a Vec can only live on the heap.

We can split a Vec into two separate Vecs using the split_off method. However, doing this (surprisingly?) copies the data. According to user hanna-kruppe on the Rustlang forum,:

But as far as I know, there’s currently no way to tell the allocator “Hey I got this piece of memory from you, now please pretend you gave it to me as two separate, contiguous allocations”

See also [Storing Lists of Values with Vectors](Storing Lists of Values with Vectors) from The Rust Programming Language.

smallvec

The smallvec provides an interface that is very close to Vec, but it will store small numbers of items on the stack instead of allocating on the heap. This is good if you suspect your data structure will normally be quite small but it may need to grow occasionally.

Whereas an array with type [T; 4] stores exactly 4 elements, a SmallVec of type SmallVec[T; 4] can store more than 4 elements. If it has 4 or fewer elements, it will be stored on the stack; otherwise, it will be stored on the heap.

use smallvec::{SmallVec, smallvec};
    
fn main() {
    // This SmallVec can hold up to 4 items on the stack:
    let mut v: SmallVec<[i32; 4]> = smallvec![1, 2, 0, 4];
    
    // It will automatically move its contents to the heap if
    // contains more than four items:
    v.push(5);
    
    // SmallVec points to a slice, so you can use normal slice
    // indexing and other methods to access its contents:
    v[2] = 3;
    assert_eq!(&[1, 2, 3, 4, 5], &v[..]);
}

See also When is it “morally” correct to use smallvec? from The Rust Programming Language Forum.

arrayvec

This crate will let you store a Vec inside of an array, but it won't let you exceed the length of the underlying array. This means that the data can live in the data segment, on the stack, or on the heap.

arrayvec

use arrayvec::ArrayVec;

fn main() {
    let mut array = ArrayVec::<[_; 2]>::new();
    assert_eq!(array.try_push(1), Ok(()));
    assert_eq!(array.try_push(2), Ok(()));
    assert!(array.try_push(3).is_err());
    assert_eq!(&array[..], &[1, 2]);
}

tinyvec

tinyvec provides 100%-safe alternatives to both arrayvec and smallvec. It works pretty much the same way except that the types must implement Default. Note that it doesn't provide the exact same APIs.

use tinyvec::ArrayVec;

fn main() {
    let mut array = ArrayVec::<[_; 2]>::new();
    array.push(1);
    array.push(2);
}
use tinyvec::ArrayVec;

fn main() {
    let mut array = ArrayVec::<[_; 2]>::new();
    array.push(1);
    array.push(2);
    array.push(3);
}

VecDeque

std::collections::VecDeque is "a double-ended queue implemented with a growable ring buffer." It's pretty similar to a Vec except that it can efficiently push and pop from both front and back. This means that VecDeque can be used as a queue or as a double-ended queue. See std::collections docs for more details.

use std::collections::VecDeque;

fn main() {
    let mut vec_deque = VecDeque::with_capacity(3);
    vec_deque.push_back(0);
    vec_deque.push_back(2);
    vec_deque.push_back(3);
    assert_eq!(Some(0), vec_deque.pop_front());
    vec_deque.push_front(1);

    assert_eq!(&vec_deque, &[1, 2, 3]);
}

Like a Vec, the VecDeque data lives on the heap.

The bytes crate

bytes provides Bytes, "an efficient container for storing and operating on contiguous slices of memory." One of its signature features is that, unlike Vec, it allows you to split ownership of data without copying. Unlike the other tools in this guide, the bytes crate can't store types T, it can only store u8.

use bytes::{BytesMut, BufMut};

fn main() {
    let mut whole = BytesMut::with_capacity(1024);
    whole.put(&[1u8, 2, 3, 4] as &[u8]);
    
    let mut right = whole.split_off(2);
    let mut left = whole;
    
    std::thread::spawn(move || {
        right[1] = 6;
        assert_eq!(&right[..], [3, 6]);
    });
    
    left[0] = 5;
    assert_eq!(&left[..], [5, 2]);
}

The data for the bytes data structures will live on the heap.

Honorable Mention

These are a few techniques that are good to know about, but the use cases are pretty narrow or there are major drawbacks.

  • A boxed array (Box<[T; N]>) is not the same as a boxed slice (Box<[T]>). You probably don't want to use a boxed array; among other things, there are a couple major ergonomics drawbacks at the time of writing. See also stack overflow.
  • Slice Deque is a double-ended queue that uses operating system virtual memory trickery to let you see the entire contents of the queue as a slice.
  • tendril is a crate from the Servo org boasting really advanced features for zero-copy parsing.
  • ndarray and nalgebra for math-heavy and multidimensional arrays.

More Resources

Thanks

Thanks to soruh_c10 on Twitch and s3bk and mejrs on users.rust-lang.org for suggestions and corrections!