-
Notifications
You must be signed in to change notification settings - Fork 80
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Feat: on-disk RevIndex based on RocksDB #2230
Conversation
Codecov ReportAttention:
Additional details and impacted files@@ Coverage Diff @@
## latest #2230 +/- ##
==========================================
+ Coverage 86.23% 86.30% +0.07%
==========================================
Files 130 135 +5
Lines 14756 15301 +545
Branches 2622 2622
==========================================
+ Hits 12725 13206 +481
- Misses 1731 1795 +64
Partials 300 300
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. |
9e491ef
to
0cf8af3
Compare
77f6e35
to
872c83a
Compare
872c83a
to
4dc51b7
Compare
39ddd3a
to
e2c9e92
Compare
Fixes #4 #6 Closes #10 Bring a few bits from sourmash-bio/sourmash#2230 to support the proper query config (scaled considering downsampling, both of queries and DB sigs).
e2c9e92
to
3ff22b7
Compare
da587db
to
2819307
Compare
2819307
to
8210966
Compare
8210966
to
0a1eee0
Compare
28948a1
to
112b4f1
Compare
@bluegenes can you remind me of the relationship between this branch and the pyo3_dependency on commit ff1092f (which may no longer exist in this repo...?) @luizirber my main request is that you update the description of the PR so that we can refer back to this in the future. Specific thoughts for things to add (or maybe questions) -
thanks! |
I stopped rebasing this branch and sourmash-bio/sourmash_plugin_branchwater#134 uses this PR instead, and can flip to use
PR description updated!
👍
Well... there is a light FFI exposing what is now It is confusing to have these different RevIndex impls in Rust, and I started converging them, but the work is not completely done yet. Nonetheless, I kept the current tests for the Python RevIndex working, but there are not many of them. #2727 is a better starting point for how |
thanks! @bluegenes I leave this in your capable hands ;) |
Per out of band conversation: merge now, release sourmash-rs 0.12.0, integrate/test/fix in sourmash-bio/sourmash_plugin_branchwater#134 |
[rocksdb-eval](https://github.com/luizirber/2022-06-26-rocksdb-eval) is the mastiff PoC, and since sourmash-bio/sourmash#2230 inherited the sourmash parts (the Index implementation) it makes sense to move some operational tools (like `index`, `update`,`check`, and `convert`) here and archive the repo.
A few useful things from the branchwater plugin perspective: - add getters to `num`, `scaled`, and `n_hashes` in `Record` - add misc documentation, based in part on descriptions in #2230 - add `#[derive Clone]` to `Collection` and `CollectionSet` - add `Collection::from_rocksdb(...)` --------- Co-authored-by: Luiz Irber <[email protected]> Co-authored-by: Luiz Irber <[email protected]>
On-disk RevIndex based on RocksDB, initially implemented in https://github.com/luizirber/2022-06-26-rocksdb-eval
This is the index/core data structure backing https://mastiff.sourmash.bio
There are many changes in the Rust code, so bumping the version to
0.12.0
.This is mostly not exposed thru the FFI yet. Tests from the from the in-memory
RevIndex
(greyhound) from #1238 were kept working, but it is not well supported (doesn't allow saving/loading from disk, for example), and should be wholly replaced bysourmash::index::revindex::disk_revindex
(the on-disk RevIndex) in the future.It is confusing to have these different RevIndex impls in Rust, and I started converging them, but the work is not completely done yet.
#2727 is a better starting point for how
Index
abc/trait should work acrosss Python/Rust, and I started moving the Rust indices to start from aLinearIndex
and later specialize into aRevIndex
, which will make easier to translate the work from #2727 for future indices across FFI.A couple of new concepts introduced in this PR:
a
Collection
is aManifest
+Storage
. So a zip file like the ones for GTDB databases fit this easily (storage =ZipStorage
, manifest is read from the zipfile), but a file paths list does too (manifest built from the file paths, storage =FSStorage
). This goes in a bit of different direction than thoughts onStorage
, manifests, etc. #1901, which was extendingStorage
to support more functionality. I thinkStorage
should stay pretty bare and mostly deal with loading/saving data, but not much knowledge of what data is there (this is covered withManifest
).a
CollectionSet
is a consistent collection of signatures. Consistent here means: same k-size, downsample-compatible for scaled, same moltype. You can create aCollectionSet
by running.select()
on aCollection
.CollectionSet
is required for building indices (because we shouldn't be building indices mixing k-size/moltype), and greatly simplifies the logic in many places because it is not necessary to check for compatibility.LinearIndex
was rewritten based onCollection
(and also thegreyhound
/branchwater
parallelism), and this supports the "parallel search without an index" use case. There is no index construction per se here, pretty much just a thin layer on top ofCollection
implementing functionality expected from indices.Manifest
,Selection
, andPicklist
are still incomplete, but the relevant function definitions are in place, need to barrage it with tests (and potentially exposing to Python and reusing the ones there in Feature: implement FFI for manifest, picklist and selection in Rust #2726)Feature
Manifest
,Selection
, andPicklist
followingthe Python API.
Collection
is a new abstraction for working with a set of signatures. Acollection needs a
Storage
for holding the signatures (on-disk, in-memory,or remotely), and a
Manifest
to describe the metadata for each signature.sourmash::index::revindex::disk_revindex
with the on-diskRevIndex implementation based on RocksDB.
iter
anditer_mut
methods forSignature
.load_sig
andsave_sig
methods toStorage
trait for higher-level datamanipulation and caching.
spec
method toStorage
to allow constructing a concreteStorage
froma string description.
InnerStorage
for synchronizing parallel access toStorage
implementations.
MemStorage
for keeping signatures in-memory (mostly for debugging andtesting).
Refactor
HashFunctions
variants to follow camel-case, soMurmur64Protein
instead ofmurmur64_protein
LinearIndex
is now implemented as a thin layer on top ofCollection
.GatherResult
tosourmash::index
module.sourmash::index::revindex
tosourmash::index::mem_revindex
(this isthe Greyhound version of revindex, in-memory only). It was also refactored
internally to build a version of a
LinearIndex
that will be merged in thefuture with
sourmash::index::LinearIndex
select
method fromIndex
trait into a separateSelect
trait,and implement it for
Signature
based on the newSelection
API.SigStore
intosourmash::storage
module, and remove the generic. Nowit always stores
Signature
. Also implementSelect
for it.Build
branchwater
feature (enabled by default), which can be disabled by downstream projects to limit bringing heavy dependencies like rocksdbrkyv
feature (disabled by default), makingMinHash
serializablewith the
rkyv
crate.musllinux
wheels (need to figure out how to build rocksdb for it)