Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

support standalone manifests containing zip files #266

Closed
ctb opened this issue Mar 5, 2024 · 5 comments · Fixed by #430
Closed

support standalone manifests containing zip files #266

ctb opened this issue Mar 5, 2024 · 5 comments · Fixed by #430

Comments

@ctb
Copy link
Collaborator

ctb commented Mar 5, 2024

build a manifest containing zip files with sig collect:

sourmash sig collect podar-ref/1.fa.sig.zip -o manifest-with-zipfiles.csv -F csv

try to run manysearch on it:

sourmash scripts manysearch manifest-with-zipfiles.csv manifest-with-zipfiles.csv -o xxx.csv

and you will get:

Reading query(s) from: 'manifest-with-zipfiles.csv'
Loaded 1 query signature(s)
thread '<unnamed>' panicked at src/manysearch.rs:32:90:
called `Result::unwrap()` on an `Err` value: Error: Failed to load query record: CP001941.1 Aciduliprofundum boonei T469, complete genome
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Traceback (most recent call last):
  File "/Users/t/miniforge3/envs/py311/bin/sourmash", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/Users/t/dev/sourmash/src/sourmash/__main__.py", line 20, in main
    retval = mainmethod(args)
             ^^^^^^^^^^^^^^^^
  File "/Users/t/dev/pyo3_branchwater/src/python/sourmash_plugin_branchwater/__init__.py", line 71, in main
    status = sourmash_plugin_branchwater.do_manysearch(args.query_paths,
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
pyo3_runtime.PanicException: called `Result::unwrap()` on an `Err` value: Error: Failed to load query record: CP001941.1 Aciduliprofundum boonei T469, complete genome

Also note that we should load files from within manifests as if they are relative to the manifest dir per sourmash-bio/sourmash#3054 and sourmash-bio/sourmash#3008 (comment).

Originally noted in #237.

@ctb
Copy link
Collaborator Author

ctb commented Jun 19, 2024

note: #364 temporarily removes the text from the docs that suggests using manifest CSVs.

@ctb
Copy link
Collaborator Author

ctb commented Aug 18, 2024

I dug into this some more today. The chain of code is:

In utils.rs, collection_from_manifest loads the manifest as a Collection:

Ok(Collection::new(
manifest,
InnerStorage::new(
FSStorage::builder()
.fullpath("".into())
.subdir("".into())
.build(),
),
))
}

then, the resulting Collection is used in load_sketches, also in utils.rs:

.filter_map(|(_idx, record)| {
let sig = collection.sig_from_record(record).ok()?;
let selected_sig = sig.clone().select(selection).ok()?;
let minhash = selected_sig.minhash()?.clone();
Some(SmallSignature {
location: record.internal_location().to_string(),
name: sig.name(),
md5sum: sig.md5sum(),
minhash,
})
})

this, in turn, hits collection.sig_from_record(record) from sourmash collection.rs:

    pub fn sig_from_record(&self, record: &Record) -> Result<SigStore> {
        let match_path = record.internal_location().as_str();
        let selection = Selection::from_record(record)?;
        let sig = self.storage.load_sig(match_path)?.select(&selection)?;
        assert_eq!(sig.signatures.len(), 1);
        Ok(sig)

which executes load_sig from FSStorage in storage/mod.rs

    fn load_sig(&self, path: &str) -> Result<SigStore> {
        let raw = self.load(path)?;
        let sig = Signature::from_reader(&mut &raw[..])?
            // TODO: select the right sig?
            .swap_remove(0);


        Ok(sig.into())

which only understands JSON files stored on the file system.

It seems to me that the key problem with dealing with this generically (i.e. in sourmash, letting manifests load sketches from other .zip files) is this set of calls,

self.storage.load_sig(match_path)?.select(&selection)?;

Here, load_sig must return precisely one SigStore object corresponding to one sketch, but we are not providing it with the information necessary to precisely locate The Right Sketch. I think it's built around the idea that there will be one signature (with multiple sketches, but all from the same sequence) per file?

(Then again, this puzzles me, because we do support loading multiple sketches from one .sig file, so ... more digging needed ;))

There is a separate problem in that collection_from_manifest creates a manifest with sketches that cannot be resolved using load_sketches (and also load_sketches is erroring out silently when it can't load).

Anyway: in looking at a short-term fix, I think I need to override the collection_from_manifest function in some way. Exploring ideas now.

@ctb
Copy link
Collaborator Author

ctb commented Aug 20, 2024

over in sourmash-bio/sourmash#3303, I am trying out the following code (in impl Collection in collection.rs:

    pub fn sig_from_record2(&self, record: &Record) -> Result<SigStore> {
        eprintln!("fetching: {:?}", record);
        let match_path = record.internal_location().as_str();
        Ok(match match_path {
            x if x.ends_with(".sig") || x.ends_with(".sig.gz") => {
                let selection = Selection::from_record(record)?;
                let sig = self.storage.load_sig(x)?.select(&selection)?;
                assert_eq!(sig.signatures.len(), 1);
                sig
            }
            x if x.ends_with(".zip") => {
                let zipcoll = Collection::from_zipfile(x)?;


                let ziprec = zipcoll.manifest().iter().find(|r| {
                    r.md5() == record.md5() && r.name() == record.name()
                }).unwrap();
                eprintln!("ziprec: {}", ziprec.md5());


                let sig = zipcoll.sig_from_record(ziprec)?;
                sig


//                todo!("more zip better")
            }
            _ => todo!("unknown, dying now")
        })
    }

This implementation is clearly Bad, and violates the way Collection and Storage work together: the current design is that Collection has a single Storage, and that Storage is responsible for loading signatures. Here I am building a new Collection instead. Since it's called within each load, it's also expensive!

We could imagine implementing a Storage that is designed for a Collection that loaded sketches from multiple files, multiple types of files, etc. Unfortunately, storage.load(...) doesn't get the Record. If it did, it would have enough information to implement this within a new Storage struct. But I think @luizirber chose this route intentionally, based on the PR description in sourmash-bio/sourmash#2230:

a Collection is a Manifest + Storage. So a zip file like the ones for GTDB databases fit this easily (storage = ZipStorage, manifest is read from the zipfile), but a file paths list does too (manifest built from the file paths, storage = FSStorage). This goes in a bit of different direction than #1901, which was extending Storage to support more functionality. I think Storage should stay pretty bare and mostly deal with loading/saving data, but not much knowledge of what data is there (this is covered with Manifest).

A sourmash focused solution would be to add a new implementation of Collection, somehow - like a subclass, but, y'know, in Rust terms. I'm honestly not sure what that would look like, or if it's possible to do so without modifying the definition of Collection. It would probably involve traits somehow but my Rust abilities just aren't there yet :).

One solution that would work entirely within the branchwater plugin: build our own Collection-style superstructure on top of sourmash Collection and Storage, and then use that as a way to figure out what to do in sourmash core, once it's robust and tested here.

@ctb
Copy link
Collaborator Author

ctb commented Aug 20, 2024

Also: right now it's hard to see the Rust Manifest struct being used for standalone manifests and pathlists; it just seems semantically incompatible. This is definitely my fault, because of the overloading of internal_location that I did in the Python implementation to support both external files (for standalone manifests and pathlists) and internal files (for e.g. zipfiles). But it is worth mentioning.

And, also, supports the idea of building a new type of Collection that would look kind of like MultiIndex in the Python layer - https://github.com/sourmash-bio/sourmash/blob/d3ae5daffb66a794d859e59f774a54383bceb0ac/src/sourmash/index/__init__.py#L912.

@ctb
Copy link
Collaborator Author

ctb commented Aug 23, 2024

#430 has settled on the MultiCollection implementation with only a few oddities being ironed out ;)

@ctb ctb closed this as completed in #430 Oct 15, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
1 participant