mash pre-cluster sketch size? #137
Comments
Hi Jianshu,
Best,
Hello Matt, Thanks for the quick response. What if I want to pre-cluster at 85% ANI and then run the exact ANI step at 90%? With a sketch size of 1000, the Mash estimate will never really approximate 85%, but rather 88% or so (a small sketch size tends to underestimate ANI, so a pre-cluster at 85% ANI, as you suggested, could correspond to a larger true ANI value). Two genomes that are actually around 90% ANI could therefore be placed in different primary clusters; the exact FastANI comparison would then never see this pair, and the dereplication result may not be what the user expects. Do you see my point? For dereplication at very high ANI, like 95%, there is no problem, because the pre-clustering never has to reach that resolution. This only arises when we want to dereplicate at a lower ANI, like 90% or 85%. Thanks, Jianshu
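To put rough numbers on the resolution concern above, here is a minimal sketch, not dRep or Mash code. It assumes the published Mash distance formula D = -(1/k) ln(2j / (1 + j)), ANI ≈ 1 - D, Mash's default k = 21, and a simple binomial approximation for sampling noise in the number of shared hashes. The function names are hypothetical, and the sketch models sampling noise only, not any systematic bias.

```python
import math


def jaccard_from_ani(ani, k=21):
    """Invert D = -(1/k) * ln(2j / (1 + j)) with D = 1 - ANI to get the Jaccard index j."""
    x = math.exp(-k * (1.0 - ani))  # x equals 2j / (1 + j)
    return x / (2.0 - x)


def ani_from_shared(shared, sketch_size, k=21):
    """Mash-style ANI estimate from the number of shared hashes in a sketch."""
    if shared <= 0:
        return 0.0  # no shared hashes: distance 1, i.e. ANI 0
    j = shared / sketch_size
    return 1.0 + math.log(2.0 * j / (1.0 + j)) / k


def summarize(ani, sketch_size, k=21):
    """Expected shared hashes, their spread, and the ANI range at +/- 2 standard deviations."""
    j = jaccard_from_ani(ani, k)
    expected = j * sketch_size
    # Assumption: shared-hash count treated as binomial(sketch_size, j).
    sd = math.sqrt(sketch_size * j * (1.0 - j))
    low = ani_from_shared(expected - 2.0 * sd, sketch_size, k)
    high = ani_from_shared(expected + 2.0 * sd, sketch_size, k)
    return expected, sd, low, high


if __name__ == "__main__":
    for s in (1000, 10000, 100000):
        expected, sd, low, high = summarize(0.85, s)
        print(f"s={s}: ~{expected:.1f} shared hashes (sd {sd:.1f}), "
              f"ANI range {low:.3f} to {high:.3f}")
```

Under these assumptions, a pair at 85% ANI shares only about 22 of 1000 hashes, so a handful of hashes either way moves the estimate noticeably; the range narrows as the sketch size grows.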
Hi Jianshu, Ah I see- I understand now. In that circumstance it would certainly make sense to increase the Best,
Hi Matt, True, in most cases users want to dereplicate at a higher ANI, so speed matters more. I was in a case where I wanted to cluster at 85% ANI, so the pre-cluster threshold should be 80% or so. Even with a sketch size of 10^4, Mash is still much faster than FastANI, even though the overall process will take a long time. So yes, this is just a reminder that this can happen and that we should be cautious, plus a note that users who want a lower pre-cluster ANI value should increase the sketch size. Does that sound reasonable? I got strange dereplication results at 85% compared to using FastANI only. Thanks, Jianshu
I see, this does make sense and does sound reasonable. I'll look into adding a warning like this during the next dRep update.
Dear dRep team,
The use of a Mash sketch size of 1000 here is confusing to me:
def run_mash_on_genome_chunks(genome_chunks, mash_exe, sketch_folder, MASH_folder, logdir, **kwargs):
    dry = kwargs.get('dry', False)
    p = kwargs.get('processors', 6)
    MASH_s = kwargs.get('MASH_sketch', 1000)
    multi_round = kwargs.get('multiround_primary_clustering', True)
If you check the FastANI paper, Table 2, a sketch size of 1000 agrees very poorly with both traditional BLAST-based ANI and FastANI on nearly all datasets. At least 10^4, or 10^5, is a good choice, so that the pre-cluster ANI is close to the FastANI or traditional ANI value. Even with 10^5 (Figure 1(a)), below 80% Mash is still only an approximation and not close to the real ANI values. Any idea why the sketch size is 1000, which works only for very distantly related genomes? For pre-clustering at any ANI value above 80%, 1000 is far from enough. It would be nice to have sketch size and k-mer size options that are passed on to Mash.
Thanks,
Jianshu
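As an illustration of the request above, here is a minimal sketch of passing a sketch size and k-mer size through to `mash sketch`, which accepts `-s` (sketch size), `-k` (k-mer size), and `-o` (output prefix). The helper name `mash_sketch_command` and its defaults are hypothetical and are not part of dRep; the function only builds the argument list and does not run Mash.

```python
def mash_sketch_command(genomes, out_prefix, sketch_size=10000, kmer_size=21, mash_exe="mash"):
    """Build a `mash sketch` argument list with explicit sketch and k-mer sizes.

    Hypothetical helper for illustration; it does not execute anything.
    """
    if sketch_size < 1 or kmer_size < 1:
        raise ValueError("sketch_size and kmer_size must be positive integers")
    if not genomes:
        raise ValueError("at least one genome file is required")
    return [
        mash_exe, "sketch",
        "-s", str(sketch_size),  # number of min-hashes kept per genome
        "-k", str(kmer_size),    # k-mer length used for hashing
        "-o", out_prefix,        # Mash appends .msh to this prefix
        *genomes,
    ]


if __name__ == "__main__":
    cmd = mash_sketch_command(["genomeA.fasta", "genomeB.fasta"], "chunk_0")
    print(" ".join(cmd))
```

The resulting list could be handed to `subprocess.run(cmd, check=True)` on a machine where Mash is installed.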