Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

consider using sourmash core ComputeParameters instead of BuildParams #113

Closed
bluegenes opened this issue Oct 7, 2024 · 2 comments · Fixed by #127
Closed

consider using sourmash core ComputeParameters instead of BuildParams #113

bluegenes opened this issue Oct 7, 2024 · 2 comments · Fixed by #127

Comments

@bluegenes
Copy link
Collaborator

bluegenes commented Oct 7, 2024

With #112, BuildParams gets closer to sourmash::cmd::ComputeParameters.

Differences:

  • ComputeParameters have multiple ksizes, which makes them non 1:1 with records.
    • I suppose i could just implement a from_param_str that builds computeparams with single ksizes, which then would be a 1:1 drop-in.
  • BuildParams uses the HashFunctions enum when reading in a command line parameter string. Not necessary, but kinda nice.
bluegenes added a commit that referenced this issue Oct 8, 2024
- refactor for reusability: move all generalized sketch building utils
--> `utils/buildutils.rs`
- do minor refactoring to use `sourmash::encodings::HashFunctions` for
`moltype`, created `Abund` enum.
- requires sourmash-bio/sourmash#3344 to use
`Hash` for `sourmash::encodings::HashFunctions`

Related:
-
#113
@bluegenes
Copy link
Collaborator Author

bluegenes commented Oct 9, 2024

Actually, I think the main barrier right now is the sig filtering strategy. Currently, we store a hashed version of the BuildParams in the template Record. I then use that to filter out any BuildParams that we find in already-generated sketches.

Change filter to use an intersect_manifest approach (but inverse, keeping any non-matches), I think we could use the ComputeParameters without issue -- we would just iterate across ksizes while creating the template signatures + records in BuildCollection, which we already have to do anyway.

potential strategy:

  • when reading existing records, store HashMap of name: mini manifest to avoid iterating through the entire manifest for each signature name?
  • implement intersect_manifest for BuildManifest and BuildCollection
  • I think this would need a PartialEq between BuildRecord and standard Record that only checks a subset of params (e.g. ksize, scaled, NOT md5sum)

@bluegenes
Copy link
Collaborator Author

This is in progress in #127

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
1 participant