
New module: Kraken2/Bracken on Unaligned Sequences for Contamination Detection #1388

Merged — 38 commits merged into nf-core:dev on Sep 19, 2024

Conversation

@egreenberg7 (Contributor) commented Sep 19, 2024

Replace #1351

Closes #271. This contribution adds Kraken2/Bracken as an optional quality-control step to the rnaseq pipeline for the HISAT2 and STAR/Salmon aligners. Contamination is a widely reported issue in RNA-sequencing data, and metagenomics tools can help address it. Kraken2 is particularly strong at detecting low levels of pathogens, which makes it well suited to this task. This PR adds the option of providing a Kraken2 database to perform taxonomic classification on unaligned reads.

Note: If the --bracken-precision parameter is set to something other than 'S', the current MultiQC version does not work properly. In future versions of MultiQC, this will not be an issue (see this MultiQC bug fix).
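For illustration, a run with the new screening step enabled might look like the sketch below. The parameter names (`--contaminant_screening`, `--kraken_db`, `--bracken_precision`) and their values are assumptions based on this discussion, not the confirmed released interface; check docs/usage.md for the exact parameters.

```bash
# Sketch only: nf-core/rnaseq with the optional Kraken2/Bracken screen
# on unaligned reads. Parameter names below are assumptions from this
# discussion -- see docs/usage.md for the released interface.
nextflow run nf-core/rnaseq \
    -profile docker \
    --input samplesheet.csv \
    --outdir results \
    --aligner star_salmon \
    --contaminant_screening kraken2_bracken \
    --kraken_db /path/to/k2_pluspf \
    --bracken_precision S
```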

PR checklist

  • This comment contains a description of changes (with reason).
  • If you've fixed a bug or added code that should be tested, add tests!
  • If you've added a new tool - have you followed the pipeline conventions in the contribution docs
  • If necessary, also make a PR on the nf-core/rnaseq branch on the nf-core/test-datasets repository.
  • Make sure your code lints (nf-core lint).
  • Ensure the test suite passes (nextflow run . -profile test,docker --outdir <OUTDIR>).
  • Check for unexpected warnings in debug mode (nextflow run . -profile debug,test,docker --outdir <OUTDIR>).
  • Usage Documentation in docs/usage.md is updated.
  • Output Documentation in docs/output.md is updated.
  • CHANGELOG.md is updated.
  • README.md is updated (including new tool citations and authors/contributors).

EDIT: by @maxulysse adding link to previous PR

github-actions bot commented Sep 19, 2024

nf-core lint overall result: Passed ✅ ⚠️

Posted for pipeline commit 02f65ab

  • ✅ 174 tests passed
  • ❔ 9 tests were ignored
  • ❗ 7 tests had warnings

❗ Test warnings:

  • files_exist - File not found: assets/multiqc_config.yml
  • files_exist - File not found: .github/workflows/awstest.yml
  • files_exist - File not found: .github/workflows/awsfulltest.yml
  • pipeline_todos - TODO string in main.nf: Optionally add in-text citation tools to this list.
  • pipeline_todos - TODO string in main.nf: Optionally add bibliographic entries to this list.
  • pipeline_todos - TODO string in main.nf: Only uncomment below if logic in toolCitationText/toolBibliographyText has been filled!
  • pipeline_todos - TODO string in methods_description_template.yml: #Update the HTML below to your preferred methods description, e.g. add publication citation for this pipeline

❔ Tests ignored:

✅ Tests passed:

Run details

  • nf-core/tools version 2.14.1
  • Run at 2024-09-19 16:16:49

@egreenberg7 (Contributor Author)

This is a reset of PR #1351 due to a merge issue. I have copied the main discussion of its functionality here; other comments on some of the smaller details of the code can be found in that PR.

From @MatthiasZepper

Impressive work! I think that some users will indeed like this feature, but it may also produce false-positive hits.

Without doing an in-depth review, I am already missing some usage documentation on this feature. Since the quality of the reference database is paramount, pipeline users should be given some guidance on how to generate an appropriate reference database:

The TCGA read data were analyzed with the Kraken program (13), a very fast algorithm that assigns reads to a taxon using exact matches of 31 base pairs (bp) or longer. The Kraken program is highly accurate, but it depends critically on the database of genomes to which it compares each read. Poore et al. used a database containing 59,974 microbial genomes, of which 5,503 were viruses and 54,471 were bacteria or archaea, including many draft genomes. Notably, their Kraken database did not include the human genome, nor did it include common vector sequences. This dramatically increased the odds for human DNA sequences present in the TCGA reads to be falsely reported as matching microbial genomes. This problem can be mitigated by including the human genome and using only complete bacterial genomes in the Kraken database.

(from Major data analysis errors invalidate cancer microbiome findings)

Also, I wonder if another tool like Sylph would have been sufficiently accurate for a fast screening, since a Kraken2/Bracken run is of course computationally heavy and this is not a metatranscriptomic pipeline.

I think this is interesting functionality, but Kraken2/Bracken would generally not be my first choice of tool here, as it is computationally heavy and very dependent on an appropriate reference database.

Therefore, I suggest a contaminant-screening parameter that would allow choosing other tools as well in the future, with a default value of 'off'. This parameter would enable the functionality independently of the save_unaligned parameter, which should only decide whether the files are published.

Apart from the remarks in the code, I think this PR also needs better documentation in usage.md and an update to the metro map.

But generally speaking, I support this addition to the pipeline.

From @egreenberg7

I had not heard of Sylph specifically before, but while Kraken2/Bracken is one of the more computationally expensive options, the literature suggests it has the best recall for contaminating species and is the fastest (amortized).

  • See Ye et al., 2019, where Kraken2 is shown to be the best-performing k-mer method (and Metaphlan4 the best-performing marker-based method).
  • Kibegwa et al., 2020, where Kraken2 performs better than MG-RAST specifically.
  • Lindgreen et al., 2016, a somewhat older article where Kraken was the best of the tools available at the time.
  • And lastly, the article I mentioned in my first comment, which shows Kraken2 performs much better than Metaphlan for detecting low levels of pathogens (Pereira-Marques et al., 2024).
As for Sylph, while the bioRxiv paper does look promising, it is such a new tool that it would require more development to incorporate into the pipeline (we would have to write a new MultiQC module and a new nf-core module for it), and it may be worth waiting until it is more widely established and there are more benchmarks against other tools. Alternatively, in the long run, we could offer two options users can choose from: Kraken2 as a more expensive but higher-recall tool, and Sylph or something else as a less expensive one. Still, Kraken2's computational expense is largely driven by database size, and various size-restricted databases are available for those with that concern (see the pre-constructed indices).

In terms of the database, I agree that it can have a significant impact. I was presuming that people would primarily use the standard database or the PlusPF database from the pre-constructed databases mentioned above. Both of these include human and vector genomes, which would avoid some of the issues mentioned in Major data analysis errors invalidate cancer microbiome findings. I considered putting an option to build the Kraken2 standard database within the pipeline (primarily for an indexing-only run), but it seemed sort of silly given the computational expense and the availability of pre-constructed databases.
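To illustrate why an in-pipeline build seemed impractical: constructing the standard database yourself would look roughly like the sketch below (the `kraken2-build` command and its flags are real; the exact download size and resource needs vary, but it is a very large download, which is why a pre-constructed index is usually preferable).

```bash
# Building the Kraken2 standard database from scratch (archaea, bacteria,
# viral, plasmid, human, UniVec_Core). This is a very large download and
# needs substantial RAM/disk -- a pre-built index is usually preferable.
kraken2-build --standard --threads 16 --db k2_standard
```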

Another source for database construction could be the genomes in the OpenContami database. Since this consists only of known contaminants from a statistically rigorous contamination-detection procedure, it would probably avoid most of the bacteria from extreme environments (though one would have to make sure to also add the human and vector genomes). (If you haven't seen OpenContami before, it is based on Bowtie2 alignments followed by BLASTing unknown sequences. As a result, it is very computationally expensive, and the code for it isn't available on GitHub.)

For some other discussions of database construction, see:

  • In Smith et al., they discuss how including pertinent genomes increases Kraken2's accuracy in rumen samples. This, however, primarily matters when uncommon species are present, as in metagenomics. I did some profiling of Kraken2 on common contaminants listed on the OpenContami website with datasets generated by InSilicoSeq, and the PlusPF database did a good job of classifying almost all of them at the genus level.
  • In Baud & Kennedy, 2024, they present an algorithm, Moonbase, that constructs a Kraken2 database based on an initial Metaphlan3 run in order to improve results. This, however, would raise two issues in our situation. First, using Metaphlan to construct the database would likely forfeit some of Kraken2's recall advantages over Metaphlan. Second, building the database rather than taking it pre-made as input would increase computational expense.
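As an aside, the kind of in-silico check described above can be sketched as follows; genome and output names are placeholders, and the InSilicoSeq flags should be checked against its documentation.

```bash
# Simulate reads from known contaminant genomes with InSilicoSeq, then
# classify them with Kraken2 to see how well they are recovered.
# File names are placeholders.
iss generate \
    --genomes contaminant_genomes.fasta \
    --model miseq \
    --n_reads 500000 \
    --output simulated/contam
```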
Some questions: if we added a section on the database to be used, would that go in usage.md, output.md, or elsewhere?

Lastly, on false positives: I was of the opinion that we should use a higher-recall but lower-precision tool like Kraken2 to err on the side of caution, and leave it to researchers to use their judgement when interpreting results. It may be worth simply adding a warning to the documentation that very small numbers of non-host reads should not be immediately accepted as truth. In the InSilicoSeq profiling I mentioned, there were generally (though not always) significant differences between the read counts of actual contaminants and false positives (and for pure human samples, at least with in-silico data, there were no false positives when HISAT2 was followed by Kraken2).
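To make the read-count point concrete, here is a small, hypothetical helper (not part of the PR) that flags taxa in a six-column Kraken2-style report (percent, clade reads, direct reads, rank code, NCBI taxid, name) whose directly assigned read counts clear a threshold, leaving low-count hits for manual judgement:

```python
# Hypothetical helper, not part of the PR: separate well-supported taxa
# from sparse hits in a Kraken2-style report by a read-count threshold.

def abundant_taxa(report_lines, min_reads=100, ranks=("G", "S")):
    """Return (name, direct_reads) for taxa at the given rank codes with
    at least `min_reads` reads assigned directly to them."""
    hits = []
    for line in report_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 6:
            continue
        _pct, _clade_reads, direct_reads, rank, _taxid, name = fields[:6]
        if rank in ranks and int(direct_reads) >= min_reads:
            hits.append((name.strip(), int(direct_reads)))
    return hits

# Toy report in the six-column layout described above.
report = [
    "99.10\t991000\t991000\tS\t9606\t  Homo sapiens",
    "0.85\t8500\t8500\tS\t562\t  Escherichia coli",
    "0.05\t500\t3\tS\t1280\t  Staphylococcus aureus",
]

print(abundant_taxa(report))
# → [('Homo sapiens', 991000), ('Escherichia coli', 8500)]
```

The three-read Staphylococcus hit is deliberately left out at the default threshold, mirroring the suggestion that tiny non-host counts warrant scepticism rather than automatic acceptance.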

One other note: I chose the --confidence-level 0.05 based on Pereira-Marques et al., 2024 and --minimum-hit-groups 3 based on Lu et al., 2022, a protocol paper for finding pathogens with Kraken2.
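For reference, those settings correspond to a Kraken2 invocation roughly like the one below (in the Kraken2 CLI the confidence threshold flag is spelled `--confidence`; the database path and file names are placeholders).

```bash
# Sketch: classify unaligned read pairs with the thresholds discussed above.
kraken2 \
    --db /path/to/kraken2_db \
    --confidence 0.05 \
    --minimum-hit-groups 3 \
    --paired --gzip-compressed \
    --report sample.kraken2.report.txt \
    --output sample.kraken2.out \
    unmapped_1.fastq.gz unmapped_2.fastq.gz
```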

From @Shaun-Regenbaum

Just adding my two cents. I helped Ezra in person, so many of my comments are not here, but I think what is here is fantastic and pretty close to ready for merging. I totally agree on the default being not to run this, but I think its inclusion is a positive change.

In my eyes, this is the start of supporting alternative QC checks in the pipeline, which have been largely ignored in the bioinformatics community up until now. Even if running these is computationally heavy right now, simply adding them as optional steps can set a good precedent and a gold standard for how to add further custom QC to the pipeline (both for the open-source community and industry).

In fact, I think there is room for the community to optimize these kinds of pipelines to make them light enough to run by default without incurring too much cost or time.

As a side note, we are going to start using this internally in our university lab (and company) as an additional sanity check. I am also considering using it to conduct a large-scale meta-analysis of the state of RNA-seq data (especially human) to estimate contamination across the field, as I don't believe that has been done. I am starting to write some potential grants for this now.

Overall, I just want to give massive props to @egreenberg7 for his work. He has a bright future, and hopefully we can get his name on a couple of cool papers in the near future :)

From @davidecarlson

While this is possibly not the right place to say it, I just wanted to note that I love this potential addition to the pipeline. I have to check RNA-seq data for contaminants more often than I would like (currently I use nf-core taxprofiler for this purpose), but having this option within the rnaseq pipeline would be fabulous!

@egreenberg7 (Contributor Author)

Before the next release, the new metro map will need to be animated

@maxulysse (Member)

Thanks for updating the subway map, I'll update the animated subway map in a separate PR

@maxulysse (Member)

It looks good to me, and you already have approval from @MatthiasZepper, but I'd like a confirmation from @pinin4fjords as well before going forward.

@Shaun-Regenbaum (Contributor) left a comment


Congrats on all the hard work Ezra!

@Shaun-Regenbaum Shaun-Regenbaum merged commit da7b999 into nf-core:dev Sep 19, 2024
40 checks passed
@maxulysse (Member)

I would have liked a confirmation by @pinin4fjords as I said in my comment, but that ship has sailed 🚀.

@Shaun-Regenbaum Thanks for approval and merging.

@Shaun-Regenbaum (Contributor) commented Sep 20, 2024

Sorry, I didn't see your comment before; my bad, I was a bit too trigger-happy.
Edit: actually, after looking at the timestamps, I merged before your comments.

@maxulysse
Copy link
Member

I commented 2 hours before you approved and merged.

@maxulysse (Member)

We've had merging cowboys before, so you're just following the tradition from the best of us. cc @apeltzer

@Shaun-Regenbaum (Contributor)

Ah you are right, I apologize again. 😅
