sourmash scripts singlesketch output differs from sourmash sketch #538
Try adding |
That works, thanks. Then the two signatures are identical. So, if you want consistency, should Or, should the same default behaviour be added to If not we can just close this issue. |
I'll think on't!
One thing that we may need to write more clearly in the docs here: the branchwater plugin is not intended to be fully identical to sourmash in its particulars. Our medium term goal is to integrate the speed improvements from branchwater into sourmash proper, but that involves a still rather tremendous amount of detail work 😭.
The reason is that sourmash has much more functionality, and many more tests covering many more edge cases. Satisfying all of those edge cases will take a lot! In the meantime, the branchwater plugin offers massive speed advantages and is helping us identify exactly what we need to add to the higher level sourmash Rust API. It's also served as a testbed for some of us to learn Rust ;).
That all having been said: the core comparisons (search, containment, and gather) should yield identical results. If they don't, that is a bug! And if there are features in sourmash that you need in branchwater, please let us know! |
After some head-scratching I realised that this "trivial" difference in the signatures is important with branchwater. This might be better filed as a separate issue (just say the word). Classical
❯ sourmash compare all_sigs.csv --csv sourmash.csv --estimate-ani --containment -k=31
== This is sourmash version 4.8.11. ==
== Please cite Irber et. al (2024), doi:10.21105/joss.06830. ==
loaded 3 signatures total.
0-MGV-GENOME-0264... [1. 0.996 0.984]
1-MGV-GENOME-0266... [0.998 1. 0.982]
2-OP073605.fasta [1. 0.996 1. ]
min similarity in matrix: 0.982
However,
❯ sourmash scripts manysearch -m DNA -o manysearch.csv all_sigs.csv all_sigs.csv --scaled 300 -k 31 --debug
== This is sourmash version 4.8.11. ==
== Please cite Irber et. al (2024), doi:10.21105/joss.06830. ==
=> sourmash_plugin_branchwater 0.9.11; cite Irber et al., doi: 10.1101/2022.11.02.514947
ksize: 31 / scaled: 300 / moltype: DNA / threshold: 0.01
searching all sketches in 'all_sigs.csv' against 'all_sigs.csv' using 8 threads
selection scaled: Some(300)
Reading query(s) from: 'all_sigs.csv'
Loaded 3 query signature(s)
Reading search(s) from: 'all_sigs.csv'
Loaded 3 search signature(s)
DONE. Processed 3 search sigs
...manysearch is done! results in 'manysearch.csv'
query p_genome avg_abund p_metag metagenome name
-------- -------- --------- ------- ---------------
MGV-GENOME-026457 100.0% N/A N/A MGV-GENOME-026457
MGV-GENOME-026457 99.1% N/A N/A OP073605.fasta
MGV-GENOME-026457 93.7% N/A N/A MGV-GENOME-026645
MGV-GENOME-026645 88.9% N/A N/A MGV-GENOME-026457
MGV-GENOME-026645 88.9% N/A N/A OP073605.fasta
MGV-GENOME-026645 100.0% N/A N/A MGV-GENOME-026645
OP073605.fasta 60.1% N/A N/A MGV-GENOME-026457
OP073605.fasta 100.0% N/A N/A OP073605.fasta
OP073605.fasta 56.8% N/A N/A MGV-GENOME-026645
If I generated the
❯ rm *.sig *.csv
❯ sourmash sketch dna -p "k=31,scaled=300" OP073605.fasta -o OP073605.sig
...
❯ sourmash sketch dna -p "k=31,scaled=300" MGV-GENOME-0264574.fas -o MGV-GENOME-0264574.sig
...
❯ sourmash sketch dna -p "k=31,scaled=300" MGV-GENOME-0266457.fna -o MGV-GENOME-0266457.sig
...
❯ sourmash sig collect --quiet -F csv -o all_sigs.csv *.sig
Loading signature information from MGV-GENOME-0264574.sig.
Loading signature information from MGV-GENOME-0266457.sig.
Loading signature information from OP073605.sig.
saved 3 manifest rows to 'all_sigs.csv'
❯ sourmash scripts manysearch -m DNA -o manysearch.csv all_sigs.csv all_sigs.csv --scaled 300 -k 31 --debug
== This is sourmash version 4.8.11. ==
== Please cite Irber et. al (2024), doi:10.21105/joss.06830. ==
=> sourmash_plugin_branchwater 0.9.11; cite Irber et al., doi: 10.1101/2022.11.02.514947
ksize: 31 / scaled: 300 / moltype: DNA / threshold: 0.01
searching all sketches in 'all_sigs.csv' against 'all_sigs.csv' using 8 threads
selection scaled: Some(300)
Reading query(s) from: 'all_sigs.csv'
Error: No query signatures loaded, exiting.
The error message is unhelpful here. |
well that's a straight up bug, methinks ;). |
Fascinating. The problem only occurs when you use a manifest, it seems. It works fine with a .sig or a .sig.zip file. I've tracked it down in |
(I've figured it out - sourmash-bio/sourmash#3434) |
Signature::name() returns filename() and then md5sum() if .name is empty.
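The fallback order described here can be sketched in a few lines of Python. This is only an illustration of the behaviour as stated in the comment above, not the actual Rust code in sourmash: the `SketchRecord` class and `display_name` function are hypothetical names invented for this example.

```python
from dataclasses import dataclass


@dataclass
class SketchRecord:
    """Hypothetical stand-in for a signature's metadata (not the sourmash API)."""
    name: str = ""
    filename: str = ""
    md5sum: str = ""


def display_name(rec: SketchRecord) -> str:
    # Prefer the explicit name; if it is empty fall back to the filename,
    # and if that is empty too, fall back to the md5sum.
    return rec.name or rec.filename or rec.md5sum


# A record with no name, as 'sourmash sketch' produced in this thread.
unnamed = SketchRecord(filename="viral_example/OP073605.fasta", md5sum="abc123")
# A record with a name, as 'singlesketch' produced.
named = SketchRecord(name="OP073605.fasta",
                     filename="viral_example/OP073605.fasta", md5sum="abc123")

print(display_name(unnamed))  # viral_example/OP073605.fasta
print(display_name(named))    # OP073605.fasta
```

Under this fallback, two sketches of the same genome can report different names depending on whether `name` was set, which is one way a name-based lookup could stop matching.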
sourmash-bio/sourmash#3441
The intersect manifest bug is fixed in the newly released v0.9.12. Enjoy! |
I would expect sourmash scripts singlesketch and sourmash sketch to give identical signature files. However, they differ slightly (in the metadata).
Using viral genome https://github.com/pyani-plus/pyani-plus/blob/main/tests/fixtures/viral_example/OP073605.fasta as a test case. This is deliberately not in the current directory to clarify the issue (see below). Testing here on macOS with Python 3.12:
We expect the signatures to be identical, but they are not:
The only difference is singlesketch has added ,"name":"OP073605.fasta" which seems almost redundant given the "filename":"viral_example/OP073605.fasta" entry.
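A metadata difference like the one reported here can be located with a small comparison helper. The sketch below is a minimal illustration that works on already-parsed records; the `metadata_diff` function is a hypothetical helper, and the two toy dictionaries only mimic the `name` and `filename` fields quoted in this issue rather than reproducing real signature files.

```python
def metadata_diff(rec_a: dict, rec_b: dict, ignore=("signatures",)) -> dict:
    """Return {key: (value_in_a, value_in_b)} for top-level keys that differ.

    Keys listed in `ignore` are skipped; a key missing from one record
    is reported with None on that side.
    """
    keys = (set(rec_a) | set(rec_b)) - set(ignore)
    return {
        key: (rec_a.get(key), rec_b.get(key))
        for key in sorted(keys)
        if rec_a.get(key) != rec_b.get(key)
    }


# Toy records modelled on the fields quoted in this issue (assumed layout).
from_sketch = {
    "filename": "viral_example/OP073605.fasta",
    "signatures": [],
}
from_singlesketch = {
    "filename": "viral_example/OP073605.fasta",
    "name": "OP073605.fasta",
    "signatures": [],
}

print(metadata_diff(from_sketch, from_singlesketch))
# {'name': (None, 'OP073605.fasta')}
```

To use this on real output, each file would first need to be parsed, for example with `json.load`, assuming the `.sig` file is uncompressed JSON holding a list of such records.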