how to intersect sig details for template signatures? #3322

Open
bluegenes opened this issue Sep 13, 2024 · 1 comment

bluegenes commented Sep 13, 2024

In directsketch, I'd like to be able to resume from failure while writing sketches. The main challenge is figuring out how to intersect the manifest of existing sketches with the signature templates I'm using to build new signatures.

The main things I'd ideally check are ksize, moltype, scaled, num, and with_abundance, and it'd be great to have those be hashable so we can easily check that they all match (I can check filename and name separately). However, many of these are not directly accessible once we build signature templates, because the info is inside signatures, which is a vector that can contain multiple sketches. We could use get_sketch to get the single sketch, but that loads the sketch, which we want to avoid.

@ctb comment: how much of this is caused by not having good getters on signatures? I feel like this kind of problem crops up frequently, is there something we can change or add? Main thing is we don’t want to read whole sketch unless necessary.

Getters that allow quickly pulling out ksize, scaled, num, moltype, abund would help me select which sig templates to keep.
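
As a rough sketch of what hashable parameters could enable (everything here is hypothetical: `SketchParams` and `missing_params` are illustrative names, not existing sourmash types or functions), the resume check reduces to a set lookup once the five parameters can be pulled out cheaply:

```rust
use std::collections::HashSet;

/// Hypothetical parameter key (not a sourmash type). Deriving `Hash` and `Eq`
/// lets the whole parameter set be used as a set or map key.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct SketchParams {
    pub ksize: u32,
    pub moltype: String,
    pub scaled: u64,
    pub num: u32,
    pub with_abundance: bool,
}

/// Return the requested parameter sets that are not already present in
/// `existing`, preserving the order of `requested`.
pub fn missing_params(
    requested: &[SketchParams],
    existing: &[SketchParams],
) -> Vec<SketchParams> {
    // Collect what has already been written into a set for O(1) lookups.
    let done: HashSet<&SketchParams> = existing.iter().collect();
    requested
        .iter()
        .filter(|p| !done.contains(p))
        .cloned()
        .collect()
}
```

Here `existing` would be filled from the manifest of sketches already written and `requested` from the templates, so neither side needs to load a sketch. Name and filename are assumed to be matched separately, as described above.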

Other, hacky solutions:

  • Pass inParams (or ComputeParams, if I can sort out how to switch to that) instead of sig templates to the sketching function. This means we'd have to rebuild sig templates each time, rather than building once and cloning as needed.
  • Continue passing in sigs as templates, but also build a Collection (to allow building a Manifest) out of the template sigs, and then implement a PartialEq that only looks at these parameters to facilitate a simple intersection?
    • would Record even work, given that many items are empty?
    • we can't modify Records, right? So this collection would not work as a series of template sigs that we can add to, only as one we could select sigs from before making a new blank sig to build.
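
To illustrate the second option, here is a minimal sketch of a parameter-only `PartialEq`. It assumes a hypothetical `TemplateRecord` struct whose fields are made up for illustration and do not mirror sourmash's actual `Record`. Fields a template cannot know yet are left as `None` and are ignored in the comparison:

```rust
/// Hypothetical record-like struct; sourmash's own Record has different fields.
#[derive(Debug, Clone)]
pub struct TemplateRecord {
    pub ksize: u32,
    pub moltype: String,
    pub scaled: u64,
    pub num: u32,
    pub with_abundance: bool,
    // Unknown for a template until a sketch is actually built.
    pub name: Option<String>,
    pub filename: Option<String>,
    pub md5: Option<String>,
}

/// Equality looks only at the sketch parameters, so an empty template
/// compares equal to a fully populated record built from it.
impl PartialEq for TemplateRecord {
    fn eq(&self, other: &Self) -> bool {
        self.ksize == other.ksize
            && self.moltype == other.moltype
            && self.scaled == other.scaled
            && self.num == other.num
            && self.with_abundance == other.with_abundance
    }
}

/// Keep only the templates with no parameter match among `existing`.
pub fn select_unbuilt<'a>(
    templates: &'a [TemplateRecord],
    existing: &[TemplateRecord],
) -> Vec<&'a TemplateRecord> {
    templates
        .iter()
        .filter(|t| !existing.iter().any(|e| e == *t))
        .collect()
}
```

The linear scan is quadratic in the worst case, which is probably acceptable for a handful of templates per genome. A custom `Hash` over the same five fields would be needed to make this a set operation.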

Thinking about it now, I think a mutable Collection-style object might be even better. Collection itself is designed to just load and select on sig collections. But if we had a similar struct that would allow building the collection as we go and would allow PartialEq on just sig params, we could reduce the overhead of having to build each signature and then build the Record for each signature so we can build a Manifest that we can write. We could also implement write methods for this collection to simplify and standardize sig writing.

For this sort of collection, allowing Record items to be mutable may also facilitate changing e.g. location when we are copying sigs from one storage to another, e.g. tmpdir *sig.gz files to a *.zip file without rebuilding the record.
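
A minimal sketch of the mutable collection idea, under stated assumptions: `BuildRecord` and `GrowableCollection` are hypothetical names with made-up fields, and no sourmash API is used. Records are added as sketches are built, and locations can be rewritten in place when the sigs move to different storage:

```rust
/// Hypothetical mutable record; the field names are illustrative only.
#[derive(Debug, Clone, PartialEq)]
pub struct BuildRecord {
    pub name: String,
    pub ksize: u32,
    pub moltype: String,
    pub scaled: u64,
    pub internal_location: String,
}

/// Hypothetical collection that grows as sketches are built.
#[derive(Debug, Default)]
pub struct GrowableCollection {
    records: Vec<BuildRecord>,
}

impl GrowableCollection {
    pub fn new() -> Self {
        Self::default()
    }

    /// Add a record as soon as its sketch has been built.
    pub fn push(&mut self, record: BuildRecord) {
        self.records.push(record);
    }

    pub fn len(&self) -> usize {
        self.records.len()
    }

    pub fn is_empty(&self) -> bool {
        self.records.is_empty()
    }

    /// Rewrite every record's location in place, e.g. when moving sigs from
    /// a temporary directory into an archive, without rebuilding any record.
    pub fn relocate<F>(&mut self, mut new_location: F)
    where
        F: FnMut(&BuildRecord) -> String,
    {
        for rec in self.records.iter_mut() {
            let loc = new_location(&*rec);
            rec.internal_location = loc;
        }
    }

    /// Current locations, in insertion order.
    pub fn locations(&self) -> Vec<&str> {
        self.records
            .iter()
            .map(|r| r.internal_location.as_str())
            .collect()
    }
}
```

For example, `coll.relocate(|rec| format!("signatures/{}.sig.gz", rec.name))` would update every stored location in one pass. The target path layout in that call is an assumption for illustration, not a sourmash convention.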

thoughts @luizirber @ctb?

@bluegenes

Note: I've now introduced BuildCollection and associated structs to handle this. I'm liking the structs so far; comments welcome.

sourmash-bio/sourmash_plugin_directsketch#101
