build_kraken add_existing fails at least for bacteria group #4
Comments
That's true. I think I also patched kraken2. Have you tried this fix? Also make sure to rerun the full database build. It's a bit unclear what will happen if you run add_existing on a cached build. I could try the new k2 script from kraken2 instead. But it isn't really maintained either.
That fix also didn't do it, but using
Good to know. Maybe we should switch to it then. Is the k2 command installed in the conda version by default?
I don't think it is. In my container installation (which uses conda), [...]. For
Okay that's fine. Can also just add a step to the pipeline that finds k2.
Or have the user set it via
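The two options above (a pipeline step that finds k2, or a user-provided setting) could be combined roughly as in the following sketch. It is only an illustration, not what the pipeline does: the environment variable name `K2_PATH` and the helper `find_k2` are hypothetical, since the thread does not say which setting was meant. It assumes a Unix-like system where `k2` is an executable script.

```python
import os
import shutil
from typing import Mapping, Optional

# Hypothetical override variable; the thread does not name the actual setting.
K2_ENV_VAR = "K2_PATH"


def find_k2(env: Optional[Mapping[str, str]] = None,
            search_path: Optional[str] = None) -> Optional[str]:
    """Return the path to an executable k2 script, or None if none is found."""
    env = os.environ if env is None else env

    # 1. An explicit user override wins if it points to an executable file.
    override = env.get(K2_ENV_VAR)
    if override and os.path.isfile(override) and os.access(override, os.X_OK):
        return override

    # 2. Otherwise fall back to a PATH lookup (search_path=None means $PATH).
    return shutil.which("k2", path=search_path)


if __name__ == "__main__":
    location = find_k2()
    print(location if location else "k2 not found; set K2_PATH or add k2 to PATH")
```

Checking the override first keeps the behaviour predictable in containers where several kraken2 installations may be present; returning `None` instead of raising lets the caller decide whether to fall back to kraken2-build.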
So, everything has downloaded fine. Unfortunately, the
Yeah, sorry. It's a bit tricky with Kraken and I haven't found a good way to cache only the files changed by the addition steps. As you figured out yourself, you can just build again, but if you start the pipeline it will download again and that can break the db. What size does your
Yea, I get that -- it's not completely straightforward. Given the fact that the nextflow cache can be super error-prone, I would probably tend toward a self-contained installation script (python/bash) if I had to do it myself. That would allow for much better control over certain steps. Re the [...]. Your
Okay, I'll get on this. I will keep future changes in a separate branch until they are tested to avoid those issues. I agree that this is a hard problem for caching. The nextflow cache is pretty good, it's more the algorithm itself. A workaround would be to recommend not using the resume option or bundling the additions. I wouldn't go with a bash script because it wouldn't resolve the root issue (ending up with a broken DB if intermediate steps fail) and you would lose the resource management and monitoring. Though you could probably also make the cache work by figuring out which files exactly are being changed and only tracking those. This would for instance ensure that a failed kraken2-build run won't invalidate the fasta file addition.
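To find out which files a given step actually changes, one option is to snapshot the database directory before and after the step and compare. The sketch below is a generic illustration of that idea, not part of the pipeline; `snapshot` and `changed_files` are hypothetical helper names, and it hashes file contents, which is simple but slow for large Kraken databases.

```python
import hashlib
from pathlib import Path
from typing import Dict, Set


def snapshot(db_dir: Path) -> Dict[str, str]:
    """Map each file under db_dir (relative path) to the SHA-256 of its content."""
    digests: Dict[str, str] = {}
    for path in sorted(db_dir.rglob("*")):
        if path.is_file():
            h = hashlib.sha256()
            with path.open("rb") as fh:
                # Read in 1 MiB chunks so large files do not need to fit in memory.
                for chunk in iter(lambda: fh.read(1 << 20), b""):
                    h.update(chunk)
            digests[path.relative_to(db_dir).as_posix()] = h.hexdigest()
    return digests


def changed_files(before: Dict[str, str], after: Dict[str, str]) -> Set[str]:
    """Return files that were added, removed, or modified between two snapshots."""
    return {
        name
        for name in before.keys() | after.keys()
        if before.get(name) != after.get(name)
    }
```

Taking one snapshot before an addition step and one after it gives the set of files that step touches, which are the candidates to declare as tracked outputs. For multi-gigabyte databases, comparing file size and modification time instead of content hashes would be a cheaper, less strict substitute.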
The integer parsing should be fixed now.
Makes perfect sense
No matter what part of it -- from experience it doesn't do what it should in too many cases, even with [...]. I don't have concrete evidence for this, but some of my recent experiments (aside from trying to run MEDI) give me reason to think that using a directory with changing content as a process input (such as the db directory, which is modified and passed around between processes in the MEDI installation workflows) may cause problems with the cache as well.
Yea, that is a bit of a stretch, though. Each failure would mean having to start from scratch. People would give up on the installation quite quickly.
The resource management and monitoring wouldn't matter so much for the installation routine, though.
Yes, that is how it works. Basically nextflow runs a
I think I don't understand what you mean by installation routine. Did you mean the Kraken2 installation?
I mean
When running `build_kraken`, the `add_existing` process fails with messages such as [...]
This seems to be a known, yet unsolved, issue with kraken2, see DerrickWood/kraken2#518 and DerrickWood/kraken2#797.
Patching line 46 in `rsync_from_ncbi.pl` doesn't do the trick, nor does running with the latest kraken2 version (2.1.3). I haven't yet tested regression to earlier versions.
This is of course rather an issue for the kraken2 author(s), but it currently prevents the `kraken2` database installation for `medi`.