Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

build_kraken add_existing fails at least for bacteria group #4

Open
cschu opened this issue Feb 11, 2024 · 14 comments
Open

build_kraken add_existing fails at least for bacteria group #4

cschu opened this issue Feb 11, 2024 · 14 comments

Comments

@cschu
Copy link
Contributor

cschu commented Feb 11, 2024

When running build_kraken, the add_existing process fails with messages such as

rsync_from_ncbi.pl: unexpected FTP path (new server?) for https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/000/762/265/GCF_000762265.1_ASM76226v1

This seems to be a known, yet unsolved, issue with kraken2, s. here

DerrickWood/kraken2#518

and

DerrickWood/kraken2#797

Patching line 46 in rsync_from_ncbi.pl doesn't do the trick, nor does running with the latest kraken2 version (2.1.3). I haven't yet tested regression to earlier versions.

This is of course rather an issue for the kraken2 author(s), but it currently prevents the kraken2 database installation for medi.

@cdiener
Copy link
Collaborator

cdiener commented Feb 12, 2024

That's true. I think I also patched kraken2. Have you tried this fix? Also make sure to rerun the full database build. It's a bit unclear what will happen if you run add_existing on a cached build. I could try the new k2 script from kraken2 instead. But it isn't really maintained either.

@cschu
Copy link
Contributor Author

cschu commented Feb 12, 2024

That fix also didn't do it, but using k2 seems to work (although I will only be able confirm, when viral and bacteria have finished downloading.)

@cdiener
Copy link
Collaborator

cdiener commented Feb 12, 2024

Good to know. Maybe we should switch to it then. Is the k2 command installed in the conda version by default?

@cschu
Copy link
Contributor Author

cschu commented Feb 12, 2024

I don't think it is. In my container installation (which uses conda), k2 (and k2mask) are located in the libexec folder so I softlinked them to a location that's in the PATH.

For k2mask to run in this setup, I've also had to modify LD_LIBRARY_PATH, so it includes the /path/to/conda/lib folder.

@cdiener
Copy link
Collaborator

cdiener commented Feb 12, 2024

Okay that's fine. Can also just add a step to the pipeline that finds k2.

@cschu
Copy link
Contributor Author

cschu commented Feb 12, 2024

Or have the user set it via params.

@cschu
Copy link
Contributor Author

cschu commented Feb 19, 2024

So, everything has downloaded fine. Unfortunately, the kraken-build process failed (upcoming issue/PR) and thanks to nextflow's stupid caching behaviour (despite having cache = "lenient" in my run.config) it started to download everything again. At that point I just ran the remaining commands manually. However, I have the feeling that this didn't go correctly, as the self_classify step had every input sequence as unclassified, which I don't assume is expected.

@cdiener
Copy link
Collaborator

cdiener commented Feb 22, 2024

Yeah, sorry. It's a bit tricky with Kraken and I haven't found out a good way to cache only the files changed by the addition steps. Like you figured out yourself, you can just build again but if you start the pipeline it will download again and that can break the db.

What size does your hash.k2d have? It should be around 500GB otherwise the new download probably reset the db.

@cschu
Copy link
Contributor Author

cschu commented Feb 25, 2024

Yea, I get that -- it's not completely straightforward. Given the fact that the nextflow cache can be super error-prone, I would probably tend to a self-contained installation script (python/bash) if I had to do it myself. That would allow for much better control over certain steps.

Re the hash.k2d -- that explains a lot now..

Your --max-db-size ${Integer.parseInt(params.max_db_size)*Integer(1e9)} caused a nextflow error for me (with the latest 23.something version, both the Integer.parseInt and the Integer(1e9)) and then I accidentally replaced it with 1000000 instead of 1000000000 (so the hash is 500MB... urgh...). Time to rebuild again.

@cdiener
Copy link
Collaborator

cdiener commented Feb 26, 2024

Okay, I'll get on this. I will keep future changes in a separate branch until they are tested to avoid those issues.

I agree that this is a hard problem for caching. The nextflow cache is pretty good, it's more the algorithm itself. A workaround would be to recommend not using the resume option or bundling the additions. I wouldn't go with a bash script because it wouldn't resolve the root issue (ending up with a broken DB if intermediate steps fail) and you would lose the resource management and monitoring.

Though you could probably also make the cache work by figuring out which files exactly are being changed and only tracking those. This would for instance ensure that a failed kraken2-build run won't invalidate the fasta file addition.

@cdiener
Copy link
Collaborator

cdiener commented Feb 27, 2024

The integer parsing should be fixed now.

@cschu
Copy link
Contributor Author

cschu commented Feb 28, 2024

Okay, I'll get on this. I will keep future changes in a separate branch until they are tested to avoid those issues.

Makes perfect sense

The nextflow cache is pretty good, it's more the algorithm itself.

No matter what part of it -- from experience it doesn't do what it should in too many cases, even with lenient option set. In the past 3 years I have witnessed the waste of literally tens of thousands of computations of various sizes due to resume-failures (Nevertheless, nf is still my go-to workflow manager as Snakemake is worse and all the other options are not really options...)

I don't have concrete evidence for this, but some of my recent experiments (aside from trying to run MEDI) give me reason to think that using as process input a directory with changing content (such as the db directory, which is modified and passed around between processes in the MEDI installation workflows) may cause problems with the cache as well.

A workaround would be to recommend not using the resume option or bundling the additions.

Yea, that is a bit of a stretch, though. Each failure would mean having to start from scratch. People would give up on the installation quite quickly.

I wouldn't go with a bash script because it wouldn't resolve the root issue (ending up with a broken DB if intermediate steps fail) you would lose the resource management and monitoring.

The resource management and monitoring wouldn't matter so much for the installation routine, though.
I think with a bit of dedicated error handling (assuming that kraken-build or k2 fail gracefully), one could even get the individual download (--add-to-library/--download-library) steps to work.

@cdiener
Copy link
Collaborator

cdiener commented Feb 28, 2024

I don't have concrete evidence for this, but some of my recent experiments (aside from trying to run MEDI) give me reason to think that using as process input a directory with changing content (such as the db directory, which is modified and passed around between processes in the MEDI installation workflows) may cause problems with the cache as well.

Yes, that is how it works. Basically nextflow runs a stat on the input and if it changes it considers the cache invalidated. Kind of makes sense because Nextflow can't know which changes are okay and which ones are not for maintaining the cache. I probably just need to dig more into what is actually modified by -add-to/download-library and only cache those parts which should resolve the issue with build.

The resource management and monitoring wouldn't matter so much for the installation routine, though.
I think with a bit of dedicated error handling (assuming that kraken-build or k2 fail gracefully), one could even get the individual download (--add-to-library/--download-library) steps to work.

I think I don't understand what you mean by installation routine. Did you mean the Kraken2 installation?

@cschu
Copy link
Contributor Author

cschu commented Feb 28, 2024

I think I don't understand what you mean by installation routine. Did you mean the Kraken2 installation?

I mean build_kraken.nf - i.e. part 2 of the MEDI "database installation routine" :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants