Update cargo docs, update crate to version 0.3.0 for publishing (#20)
* Updates to cargo docs

* Skip second README doc test (requires first to have run and execution order isn't guaranteed)

* Update to version 0.3.0 for publishing to Cargo

List @Keats, @GSGerritsen, and @boydgreenfield as maintainers.

* Update to mmap-bitvec 0.4.1

This fixes an issue introduced by
rust-lang/rust#98112 in 1.70+ that otherwise
breaks pointer dereferencing in `mmap-bitvec`.

* Ignore notebook .python-version files
boydgreenfield authored Oct 5, 2023
1 parent 1604535 commit d131df2
Showing 7 changed files with 80 additions and 48 deletions.
12 changes: 6 additions & 6 deletions .github/workflows/ci.yml
@@ -2,20 +2,20 @@ name: CI
on:
push:
branches:
- master
- main
pull_request:

jobs:
tests:
runs-on: ubuntu-latest
steps:
- name: Checkout
uses: actions/checkout@master
uses: actions/checkout@main

- uses: actions-rs/toolchain@v1
with:
profile: minimal
toolchain: 1.60.0
toolchain: stable
override: true

- name: version info
@@ -28,7 +28,7 @@ jobs:
runs-on: ubuntu-latest
steps:
- name: Checkout
uses: actions/checkout@master
uses: actions/checkout@main

- uses: actions-rs/toolchain@v1
with:
@@ -46,7 +46,7 @@ jobs:
runs-on: ubuntu-latest
steps:
- name: Checkout
uses: actions/checkout@master
uses: actions/checkout@main

- uses: actions-rs/toolchain@v1
with:
@@ -63,7 +63,7 @@ jobs:
runs-on: ubuntu-latest
steps:
- name: Checkout
uses: actions/checkout@master
uses: actions/checkout@main

- uses: actions-rs/toolchain@v1
with:
1 change: 1 addition & 0 deletions .gitignore
@@ -4,3 +4,4 @@ Cargo.lock
.DS_Store
.idea/
old/
docs/notebook/.python-version
18 changes: 15 additions & 3 deletions Cargo.toml
@@ -1,12 +1,24 @@
[package]
name = "bfield"
version = "0.2.1"
authors = ["Roderick Bovee <[email protected]>"]
description = "B-field datastructure implementation in Rust"
version = "0.3.0"
authors = ["Vincent Prouillet <[email protected]>", "Gerrit Gerritsen <[email protected]>", "Nick Greenfield <[email protected]>"]
homepage = "https://github.com/onecodex/rust-bfield/"
repository = "https://github.com/onecodex/rust-bfield/"
readme = "README.md"
keywords = ["B-field", "probabilistic data structures"]
categories = ["data-structures"]
edition = "2018"
license = "Apache 2.0"
exclude = [
".gitignore",
".github/*",
"docs/*",
]

[dependencies]
bincode = "1"
mmap-bitvec = "0.4.0"
mmap-bitvec = "0.4.1"
murmurhash3 = "0.0.5"
serde = { version = "1.0", features = ["derive"] }
once_cell = "1.3.1"
6 changes: 4 additions & 2 deletions README.md
@@ -1,10 +1,12 @@
# `rust-bfield`, an implementation of the B-field probabilistic key-value data structure

[![Crates.io Version](https://img.shields.io/crates/v/bfield.svg)](https://crates.io/crates/bfield)

The B-field is a novel, probabilistic data structure for storing key-value pairs (or, said differently, it is a probabilistic associative array or map). B-fields support insertion (`insert`) and lookup (`get`) operations, and share a number of mathematical and performance properties with the well-known [Bloom filter](https://doi.org/10.1145/362686.362692).

At [One Codex](https://www.onecodex.com), we use the `rust-bfield` crate in bioinformatics applications to efficiently store associations between billions of $k$-length nucleotide substrings (["k-mers"](https://en.wikipedia.org/wiki/K-mer)) and [their taxonomic identity](https://www.ncbi.nlm.nih.gov/taxonomy) _**using only 6-7 bytes per `(kmer, value)` pair**_ for up to 100,000 unique taxonomic IDs (distinct values) and a 0.1% error rate. We hope others are able to use this library (or implementations in other languages) for applications in bioinformatics and beyond.

> _Note: In the [Implementation Details](#implementation-details) section below, we detail the use of this B-field implementation in Rust and use `code` formatting and English parameter names (e.g., we discuss the B-field being a data structure for storing `(key, value)` pairs). In the following [Formal Data Structure Details](#formal-data-structure-details) section, we detail the design and mechanics of the B-field using mathematical notation (i.e., we discuss it as an associate array mapping a set of_ $(x, y)$ _pairs). The generated Rust documentation includes both notations for ease of reference._
> _Note: In the [Implementation Details](#implementation-details) section below, we detail the use of this B-field implementation in Rust and use `code` formatting and English parameter names (e.g., we discuss the B-field being a data structure for storing `(key, value)` pairs). In the following [Formal Data Structure Details](#formal-data-structure-details) section, we detail the design and mechanics of the B-field using mathematical notation (i.e., we discuss it as an associate array mapping a set of_ $(x, y)$ _pairs). The [generated Rust documentation](https://docs.rs/bfield/latest/bfield/) includes both notations for ease of reference._
## Implementation Details

@@ -73,7 +75,7 @@ for p in 0..4u32 {

* After creation, a B-field can optionally be loaded from a directory containing the produced `mmap` and related files with the `load` function. And once created or loaded, a B-field can be directly queried using the `get` function, which will either return `None`, `Indeterminate`, or `Some(BFieldValue)` (which is currently an alias for `Some(u32)` see [limitations](#⚠️-current-limitations-of-the-rust-bfield-implementation) below for more details):

```rust
```rust no_run
use bfield::BField;

// Load based on filename of the first array ".0.bfd"
77 changes: 43 additions & 34 deletions src/bfield.rs
@@ -7,7 +7,7 @@ use serde::Serialize;

use crate::bfield_member::{BFieldLookup, BFieldMember, BFieldVal};

/// The struct holding the various bfields
/// The `struct` holding the `BField` primary and secondary bit arrays.
pub struct BField<T> {
members: Vec<BFieldMember<T>>,
read_only: bool,
@@ -18,18 +18,26 @@ unsafe impl<T> Send for BField<T> {}
unsafe impl<T> Sync for BField<T> {}

impl<T: Clone + DeserializeOwned + Serialize> BField<T> {
/// The (complicated) method to create a bfield.
/// The bfield files will be created in `directory` with the given `filename` and the
/// suffixes `(0..n_secondaries).bfd`
/// `size` is the primary bfield size, subsequent bfield sizes will be determined by
/// `secondary_scaledown` and `max_scaledown`.
/// If you set `in_memory` to true, remember to call `persist_to_disk` when it's built to
/// A (rather complex) method for creating a `BField`.
///
/// This will create a series of `BField` bit array files in `directory` with the given `filename` and the
/// suffixes `(0..n_secondaries).bfd`. If you set `in_memory` to true, remember to call `persist_to_disk` once it's built to
/// save it.
/// The params are the following in the paper:
/// `n_hashes` -> k
/// `marker_width` -> v (nu)
/// `n_marker_bits` -> κ (kappa)
/// `secondary_scaledown` -> β (beta)
///
/// The following parameters are required. See the [README.md](https://github.com/onecodex/rust-bfield/)
/// for additional details as well as the
/// [parameter selection notebook](https://github.com/onecodex/rust-bfield/blob/main/docs/notebook/calculate-parameters.ipynb)
/// for helpful guidance in picking optimal parameters.
/// - `size` is the primary `BField` size, subsequent `BField` sizes will be determined
/// by the `secondary_scaledown` and `max_scaledown` parameters
/// - `n_hashes`. The number of hash functions _k_ to use.
/// - `marker_width` or v (nu). The length of the bit-string to use for
/// - `n_marker_bits` or κ (kappa). The number of 1s to set in each v-length bit-string (also its Hamming weight).
/// - `secondary_scaledown` or β (beta). The scaling factor to use for each subsequent `BField` size.
/// - `max_scaledown`. A maximum scaling factor to use for secondary `BField` sizes, since β raised to the power of
/// `n_secondaries` can be impractically/needlessly small.
/// - `n_secondaries`. The number of secondary `BField`s to create.
/// - `in_memory`. Whether to create the `BField` in memory or on disk.
#[allow(clippy::too_many_arguments)]
pub fn create<P>(
directory: P,
@@ -84,7 +92,7 @@ impl<T: Clone + DeserializeOwned + Serialize> BField<T> {
})
}

/// Loads the bfield given the path to the "main" db path (eg the one ending with `0.bfd`).
/// Loads the `BField` given the path to the primary array data file (eg the one ending with `0.bfd`).
pub fn load<P: AsRef<Path>>(main_db_path: P, read_only: bool) -> Result<Self, io::Error> {
let mut members = Vec::new();
let mut n = 0;
@@ -126,8 +134,8 @@ impl<T: Clone + DeserializeOwned + Serialize> BField<T> {
Ok(BField { members, read_only })
}

/// Write the current bfields to disk.
/// Only useful if you are creating a bfield in memory
/// Write the current `BField` to disk.
/// Only useful if you are creating a `BField` in memory.
pub fn persist_to_disk(self) -> Result<Self, io::Error> {
let mut members = Vec::with_capacity(self.members.len());
for m in self.members {
@@ -139,32 +147,32 @@ impl<T: Clone + DeserializeOwned + Serialize> BField<T> {
})
}

/// Returns (n_hashes, marker_width, n_marker_bits, Vec<size of each member>)
/// Returns `(n_hashes, marker_width, n_marker_bits, Vec<size of each member>)`.
pub fn build_params(&self) -> (u8, u8, u8, Vec<usize>) {
let (_, n_hashes, marker_width, n_marker_bits) = self.members[0].info();
let sizes = self.members.iter().map(|i| i.info().0).collect();
(n_hashes, marker_width, n_marker_bits, sizes)
}

/// Returns the params given at build time to the bfields
/// Returns the params given at build time to the `BField` arrays.
pub fn params(&self) -> &Option<T> {
&self.members[0].params.other
}

/// This doesn't actually update the file, so we can use it to e.g.
/// simulate params on an old legacy file that may not actually have
/// them set.
/// ⚠️ Method for setting parameters without actually updating any files on disk. **Only useful for supporting legacy file formats
/// in which these parameters are not saved.**
pub fn mock_params(&mut self, params: T) {
self.members[0].params.other = Some(params);
}

/// This allows an insert of a value into the b-field after the entire
/// b-field build process has been completed.
///
/// It has the very bad downside of potentially knocking other keys out
/// of the b-field by making them indeterminate (which will make them fall
/// back to the secondaries where they don't exist and thus it'll appear
/// as if they were never inserted to begin with)
/// ⚠️ Method for inserting a value into a `BField`
/// after it has been fully built and finalized.
/// **This method should be used with extreme care**
/// as it does not guarantee that keys are properly propagated
/// to secondary arrays and therefore may make lookups of previously
/// set values return an indeterminate result in the primary array,
/// then causing fallback to the secondary arrays where they were never
/// inserted (and returning a false negative).
pub fn force_insert(&self, key: &[u8], value: BFieldVal) {
debug_assert!(!self.read_only, "Can't insert into read_only bfields");
for secondary in &self.members {
@@ -174,8 +182,8 @@ impl<T: Clone + DeserializeOwned + Serialize> BField<T> {
}
}

/// Insert the given key/value at the given pass
/// Returns whether the value was inserted during this call, eg will return `false` if
/// Insert the given key/value at the given pass (1-indexed `BField` array/member).
/// Returns whether the value was inserted during this call, i.e., will return `false` if
/// the value was already present.
pub fn insert(&self, key: &[u8], value: BFieldVal, pass: usize) -> bool {
debug_assert!(!self.read_only, "Can't insert into read_only bfields");
@@ -195,8 +203,8 @@ impl<T: Clone + DeserializeOwned + Serialize> BField<T> {
true
}

/// Returns the value of the given key if found, None otherwise.
/// If the value is indeterminate, we still return None.
/// Returns the value of the given key if found, `None` otherwise.
/// The current implementation also returns `None` for indeterminate values.
pub fn get(&self, key: &[u8]) -> Option<BFieldVal> {
for secondary in self.members.iter() {
match secondary.get(key) {
@@ -210,8 +218,8 @@ impl<T: Clone + DeserializeOwned + Serialize> BField<T> {
None
}

/// Get the info of each member
/// Returns Vec<(size, n_hashes, marker_width, n_marker_bits)>
/// Get the info of each secondary array (`BFieldMember`) in the `BField`.
/// Returns `Vec<(size, n_hashes, marker_width, n_marker_bits)>`.
pub fn info(&self) -> Vec<(usize, u8, u8, u8)> {
self.members.iter().map(|m| m.info()).collect()
}
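
The `get` doc comment in the hunk above describes a fall-through: member arrays are consulted in order, and an indeterminate answer from one array defers to the next. A minimal standalone sketch of that control flow is given here. The `Lookup` enum and the `cascade_get` function are hypothetical names introduced for illustration; they are not the crate's `BFieldLookup` type or `BField::get`, and the per-array results are passed in directly rather than computed from hashes.

```rust
/// Hypothetical per-array lookup result (an assumption, not the crate's type).
#[derive(Debug, Clone, Copy, PartialEq)]
enum Lookup {
    /// The array decoded a single value for the key.
    Found(u32),
    /// The array could not decide; defer to the next array.
    Indeterminate,
    /// The key is not present in this array.
    Absent,
}

/// Walk the per-array results in order (primary first) and stop at the
/// first definite answer. If every array is indeterminate, return `None`.
fn cascade_get(results: &[Lookup]) -> Option<u32> {
    for result in results {
        match result {
            Lookup::Indeterminate => continue,
            Lookup::Found(value) => return Some(*value),
            Lookup::Absent => return None,
        }
    }
    None
}
```

Under these assumptions, a key that is indeterminate in the primary array but was never written to a secondary array resolves to `None`, which is the false-negative risk the `force_insert` doc comment warns about.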
@@ -304,6 +312,7 @@ mod tests {
}
}

// Causes cargo test to run doc tests on all `rust` code blocks
#[doc = include_str!("../README.md")]
#[cfg(doctest)]
pub struct ReadmeDoctests;
struct ReadmeDoctests;
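
The new `create` documentation in this file says that secondary array sizes follow from `size`, `secondary_scaledown` (β) and `max_scaledown`, with `max_scaledown` bounding how small the arrays can get. One plausible reading is sketched here. The `member_sizes` helper is hypothetical and not part of the crate; it assumes each array is β times the previous one, clamped from below at `size * max_scaledown`, and that `n_secondaries` counts every array including the primary (matching the `(0..n_secondaries).bfd` suffixes). The crate's actual rounding and clamping may differ.

```rust
/// Hypothetical helper (an assumption, not the crate's API): compute the
/// size of each member array, primary first.
fn member_sizes(
    size: usize,
    secondary_scaledown: f64,
    max_scaledown: f64,
    n_secondaries: u8,
) -> Vec<usize> {
    // Smallest size any array is allowed to shrink to.
    let floor = size as f64 * max_scaledown;
    let mut sizes = Vec::with_capacity(n_secondaries as usize);
    let mut current = size as f64;
    for _ in 0..n_secondaries {
        sizes.push(current as usize);
        // Shrink by beta for the next array, but never below the floor.
        current = (current * secondary_scaledown).max(floor);
    }
    sizes
}
```

With `size = 1_000_000`, `secondary_scaledown = 0.5`, `max_scaledown = 0.25` and four arrays, this sketch yields 1,000,000, then 500,000, then 250,000 twice, since the fourth array would otherwise drop below the floor.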
4 changes: 2 additions & 2 deletions src/combinatorial.rs
@@ -63,9 +63,9 @@ pub fn unrank(marker: u128) -> usize {
value as usize
}

/// (Hopefully) fast implementation of a binomial
/// (Hopefully) fast implementation of a binomial.
///
/// This uses a preset group of equations for k < 8 and then falls back to a
/// This function uses a preset group of equations for k < 8 and then falls back to a
/// multiplicative implementation that tries to prevent overflows while
/// maintaining all results as exact integers.
#[inline]
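
The doc comment in this hunk mentions a multiplicative binomial that avoids overflow while keeping every intermediate result an exact integer. The function body is not shown in the diff, so the following is only a sketch of the general technique, not the crate's code. The `binomial` name, the `Option` return type and the `u128` intermediate are all assumptions made for this example, and the preset equations for k < 8 are omitted.

```rust
use std::convert::TryFrom;

/// Hypothetical multiplicative binomial coefficient C(n, k).
/// Returns `None` if the result (or an intermediate product) overflows.
fn binomial(n: u64, k: u64) -> Option<u64> {
    if k > n {
        return Some(0);
    }
    // C(n, k) == C(n, n - k); use the smaller k to shorten the loop.
    let k = k.min(n - k);
    let mut result: u128 = 1;
    for i in 1..=k {
        // Before this step, result == C(n - k + i - 1, i - 1).
        // Multiplying by (n - k + i) and dividing by i gives C(n - k + i, i),
        // and the division is always exact, so no rounding occurs.
        let factor = u128::from(n - k + i);
        result = result.checked_mul(factor)? / u128::from(i);
    }
    u64::try_from(result).ok()
}
```

Multiplying before dividing is what keeps each step exact; the wider intermediate type and the checked multiplication are one way to detect overflow instead of wrapping silently.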
10 changes: 9 additions & 1 deletion src/lib.rs
@@ -1,7 +1,15 @@
#![deny(missing_docs)]

//! The bfield datastructure, implemented in Rust.
//! The B-field datastructure, implemented in Rust.
//! A space-efficient, probabilistic data structure and storage and retrieval method for key-value information.
//! These Rust docs represent some minimal documentation of the crate itself.
//! See the [Github README](https://github.com/onecodex/rust-bfield) for an
//! extensive write-up, including the math and design underpinning the B-field
//! data structure, guidance on B-field parameter selection, as well as usage
//! examples.[^1]
//!
//! [^1]: These are not embeddable in the Cargo docs as they include MathJax,
//! which is currently unsupported.

mod bfield;
mod bfield_member;
