Proposal: LULU ASV post-clustering curation #609
Comments
I suggest (as in #608) VSEARCH instead of BLASTN; I think it could save a lot of resources and energy. We recently had a Slack discussion about post-denoising clustering (https://nfcore.slack.com/archives/CEA7TBJGJ/p1690893776838869), so that seems to be something people want. In that discussion, another tool --
Swarm is another tool I haven't tried. I'll test it out and look into the literature to see which tool seems more popular/better.
I just read this interesting article (https://archimer.ifremer.fr/doc/00688/80057/83060.pdf). The authors used DADA2 to produce ASVs, ran Swarm on the ASVs to get OTUs, and then ran LULU to curate both the ASVs and the OTUs, checking which method produced the most accurate results. To sum up their argument, the best method depends on your dataset and taxa of interest. So maybe it's best to have both Swarm and LULU as optional steps.
I noticed the qiime2 modules have these lines
Is it acceptable for me to do something similar when I only have a Docker container for a tool, but no conda package?
If the Docker container is maintained over time (which can't be taken for granted when no conda package is available) and there is no way around it (such as choosing a similar tool that provides both conda and a container), then yes. In the example you quote it's QIIME2, which is a very popular tool, so I think that warrants the decision here.
I've looked into this issue a bit more. Problem: LULU doesn't have a conda package, and while there is a Docker container that I'm using, it isn't part of the quay.io/biocontainers registry. Is there a better method of post-clustering? Is LULU popular?
So from what I can see, it seems like LULU is a relatively popular tool for post-clustering. Do we need a tool designed for post-clustering curation, or can we use any clustering tool (e.g., Swarm) to functionally perform a similar role?
So maybe post-clustering with Swarm should be a separate issue. That's something I'll think about more. The Docker container I'm using isn't part of the biocontainers registry, but it does work. I can look into the process of getting a container added to the biocontainers repository if that would give the rest of the Ampliseq team more confidence, but I won't do that if it's not necessary. I have most of the code written locally to add this feature to Ampliseq, so it's just the container/conda issue that's holding me back.
I am with @erikrikarddaniel that VSEARCH is a fine tool, and it is also properly containerized. And it can cluster sequences. Papers from software developers about their own tools should typically be taken with a bit of skepticism; independent benchmarks are usually better. Benchmarks are sometimes hard to generalize, but they are still the best we have for deciding on tools, of course. I had a quick look at bioconda: Swarm is listed, as are AMPtk and apscale. Possibly those last two containers have all the requirements for LULU, so that's also a way to get it ;) The next step, if you really want to go with LULU, is to ask the LULU devs to add it to bioconda. Otherwise, the #bioconda channel in the nf-core Slack might have experts who can help.
I think for now I'll try VSEARCH because there is already an nf-core module for VSEARCH_CLUSTER. I have noticed a bug with this module, so I'll fix the bug in the module and try adding the fixed module to the pipeline.
I'm closing this issue for now. I've added VSEARCH instead of LULU for ASV post-clustering.
Description of feature
I have a LULU subworkflow that I can add to Ampliseq. The subworkflow uses blastn to create the match list for LULU, then uses LULU for post-clustering curation. The input files for the subworkflow are an ASV FASTA file and a TSV file. The TSV file is similar to the DADA2_table.tsv file that Ampliseq already produces. The output file is a curated version of that TSV file. I feel this should be easy enough to add to Ampliseq.
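For anyone unfamiliar with the match list step, here is a minimal sketch of what it could look like. It is not the subworkflow code. It assumes the match list is a three-column, tab-separated table (query ID, subject ID, percent identity) taken from an all-vs-all comparison of the ASVs. The function name `build_match_list`, the 84% identity cutoff, and the dropping of self-hits are illustrative assumptions only.

```python
import csv
import io


def build_match_list(hits_tsv: str, min_identity: float = 84.0) -> list[tuple[str, str, float]]:
    """Keep (query, subject, percent identity) pairs that pass the cutoff.

    hits_tsv is assumed to hold tab-separated rows whose first three
    columns are query ID, subject ID and percent identity.
    """
    matches = []
    for row in csv.reader(io.StringIO(hits_tsv), delimiter="\t"):
        if len(row) < 3:
            continue  # skip blank or malformed lines
        query, subject, identity = row[0], row[1], float(row[2])
        if query == subject:
            continue  # a sequence always matches itself (illustrative filter)
        if identity < min_identity:
            continue  # below the assumed identity cutoff
        matches.append((query, subject, identity))
    return matches


# Hypothetical hits for three ASVs
example = "ASV1\tASV1\t100.0\nASV2\tASV1\t97.5\nASV3\tASV1\t80.0\n"
print(build_match_list(example))
```

With the hypothetical hits above, only the ASV2/ASV1 pair survives: the self-hit and the 80% hit are dropped.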