New module: Kraken2/Bracken on Unaligned Sequences for Contamination Detection #1388

Merged
merged 38 commits into from
Sep 19, 2024
Commits
3e0793f
Install kraken2 and bracken
egreenberg7 Jul 30, 2024
a7281e5
Change nextflow configurations
egreenberg7 Jul 30, 2024
734919e
Initial addition of kraken and bracken to pipeline
egreenberg7 Jul 30, 2024
8695f50
Fix include statements
egreenberg7 Jul 31, 2024
44ba282
Fix include statements
egreenberg7 Jul 31, 2024
3657e3e
Include module config files
egreenberg7 Aug 5, 2024
7ab90f6
Use presence of kraken_db to determine bracken run, add save params
egreenberg7 Aug 5, 2024
39cfdef
Fix multiqc for Bracken/Kraken
egreenberg7 Aug 5, 2024
2e9c243
Add Bracken/Kraken citations
egreenberg7 Aug 5, 2024
03314c9
Fix bracken config
egreenberg7 Aug 6, 2024
7913cbc
Update bracken
egreenberg7 Aug 7, 2024
063b4ac
Adjust multiqc configs for updated bracken
egreenberg7 Aug 7, 2024
075bca8
Add input validation and update scheme
egreenberg7 Aug 7, 2024
a0a225c
Change default Bracken precision to species
egreenberg7 Aug 7, 2024
bd4ea1c
Debug input validation
egreenberg7 Aug 7, 2024
7789827
Documentation/output image
egreenberg7 Aug 7, 2024
336a7b0
Update changelog
egreenberg7 Aug 7, 2024
a1e1d99
Update Kraken2 module
egreenberg7 Aug 9, 2024
33c9483
Linting
egreenberg7 Aug 9, 2024
102152e
Change to --contaminant_screening param
egreenberg7 Aug 15, 2024
6dd7521
Fixing save unaligned default
egreenberg7 Aug 15, 2024
fdd85ad
Debugging
egreenberg7 Aug 15, 2024
c0099d4
Update usage
egreenberg7 Aug 15, 2024
c25aa50
Provide motivation for Kraken2 parameters
egreenberg7 Aug 15, 2024
0063c2d
Fix typo
egreenberg7 Aug 16, 2024
53f46b2
Update metro map
egreenberg7 Aug 16, 2024
2dc5746
Merge branch 'dev' into dev
Shaun-Regenbaum Aug 18, 2024
2a322c4
Update schema
egreenberg7 Aug 19, 2024
75a10f7
Update docs/usage.md
egreenberg7 Aug 19, 2024
cfc8945
Change output directory for kraken2/bracken
egreenberg7 Aug 19, 2024
69f2d0a
Update Changelog
egreenberg7 Sep 10, 2024
68b9a21
Another changelog fix
egreenberg7 Sep 10, 2024
b4332d9
Linting fix
egreenberg7 Sep 10, 2024
bc193df
Merge conflicts
egreenberg7 Sep 19, 2024
fd9b449
Merge branch 'dev' of github.com:egreenberg7/rnaseq into dev
egreenberg7 Sep 19, 2024
41bcd9c
Update hisat2 patch
egreenberg7 Sep 19, 2024
3e0b3e9
(Hopefully) final linting fix
egreenberg7 Sep 19, 2024
02f65ab
Change PR number
egreenberg7 Sep 19, 2024
28 changes: 28 additions & 0 deletions CHANGELOG.md
@@ -7,8 +7,36 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

### Enhancements & fixes

- [PR #1388](https://github.com/nf-core/rnaseq/pull/1388) - Adding Kraken2/Bracken on unaligned reads as an additional quality control step to detect sample contamination
- [PR #1186](https://github.com/nf-core/rnaseq/pull/1186) - Bump pipeline version to 3.16.0dev

### Parameters

| Old parameter | New parameter |
| ------------- | --------------------------- |
| | `--contaminant_screening` |
| | `--kraken_db` |
| | `--save_kraken_assignments` |
| | `--save_kraken_unassigned` |
| | `--bracken_precision` |

> **NB:** Parameter has been **updated** if both old and new parameter information is present.
>
> **NB:** Parameter has been **added** if just the new parameter information is present.
>
> **NB:** Parameter has been **removed** if new parameter information isn't present.

### Software dependencies

| Dependency | Old version | New version |
| ---------- | ----------- | ----------- |
| `Kraken2` | ----------- | 2.1.3 |
| `Bracken` | ----------- | 2.9 |

> **NB:** Dependency has been **updated** if both old and new version information is present.
>
> **NB:** Dependency has been **added** if just the new version information is present.
>
> **NB:** Dependency has been **removed** if new version information isn't present.

## [[3.15.1](https://github.com/nf-core/rnaseq/releases/tag/3.15.1)] - 2024-09-16

### Enhancements & fixes
8 changes: 8 additions & 0 deletions CITATIONS.md
@@ -16,6 +16,10 @@

> Quinlan AR, Hall IM. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 2010 Mar 15;26(6):841-2. doi: 10.1093/bioinformatics/btq033. Epub 2010 Jan 28. PubMed PMID: 20110278; PubMed Central PMCID: PMC2832824.

- [Bracken](https://doi.org/10.7717/peerj-cs.104)

> Lu, J., Breitwieser, F. P., Thielen, P., & Salzberg, S. L. (2017). Bracken: estimating species abundance in metagenomics data. PeerJ. Computer Science, 3(e104), e104. https://doi.org/10.7717/peerj-cs.104

- [fastp](https://www.ncbi.nlm.nih.gov/pubmed/30423086/)

> Chen S, Zhou Y, Chen Y, Gu J. fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics. 2018 Sep 1;34(17):i884-i890. doi: 10.1093/bioinformatics/bty560. PubMed PMID: 30423086; PubMed Central PMCID: PMC6129281.
@@ -38,6 +42,10 @@

> Kim D, Paggi JM, Park C, Bennett C, Salzberg SL. Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype Graph-based genome alignment and genotyping with HISAT2 and HISAT-genotype. Nat Biotechnol. 2019 Aug;37(8):907-915. doi: 10.1038/s41587-019-0201-4. Epub 2019 Aug 2. PubMed PMID: 31375807.

- [Kraken2](https://doi.org/10.1186/s13059-019-1891-0)

> Wood, D. E., Lu, J., & Langmead, B. (2019). Improved metagenomic analysis with Kraken 2. Genome Biology, 20(1), 257. https://doi.org/10.1186/s13059-019-1891-0

- [MultiQC](https://pubmed.ncbi.nlm.nih.gov/27312411/)

> Ewels P, Magnusson M, Lundin S, Käller M. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics. 2016 Oct 1;32(19):3047-8. doi: 10.1093/bioinformatics/btw354. Epub 2016 Jun 16. PubMed PMID: 27312411; PubMed Central PMCID: PMC5039924.
1 change: 1 addition & 0 deletions README.md
@@ -46,6 +46,7 @@
3. [`dupRadar`](https://bioconductor.org/packages/release/bioc/html/dupRadar.html)
4. [`Preseq`](http://smithlabresearch.org/software/preseq/)
5. [`DESeq2`](https://bioconductor.org/packages/release/bioc/html/DESeq2.html)
6. [`Kraken2`](https://ccb.jhu.edu/software/kraken2/) -> [`Bracken`](https://ccb.jhu.edu/software/bracken/) on unaligned sequences; _optional_
15. Pseudoalignment and quantification ([`Salmon`](https://combine-lab.github.io/salmon/) or [`Kallisto`](https://pachterlab.github.io/kallisto/); _optional_)
16. Present QC for raw read, alignment, gene biotype, sample similarity, and strand-specificity checks ([`MultiQC`](http://multiqc.info/), [`R`](https://www.r-project.org/))

Binary file added docs/images/bracken-top-n-plot.png
Binary file modified docs/images/nf-core-rnaseq_metro_map_grey.png
414 changes: 235 additions & 179 deletions docs/images/nf-core-rnaseq_metro_map_grey.svg
22 changes: 21 additions & 1 deletion docs/output.md
@@ -40,6 +40,7 @@ The pipeline is built using [Nextflow](https://www.nextflow.io/) and processes d
- [Preseq](#preseq) - Estimation of library complexity
- [featureCounts](#featurecounts) - Read counting relative to gene biotype
- [DESeq2](#deseq2) - PCA plot and sample pairwise distance heatmap and dendrogram
- [Kraken2/Bracken](#kraken2bracken) - Taxonomic classification of unaligned reads
- [MultiQC](#multiqc) - Present QC for raw reads, alignment, read counting and sample similarity
- [Pseudoalignment and quantification](#pseudoalignment-and-quantification)
- [Salmon](#pseudoalignment) - Wicked fast gene and isoform quantification relative to the transcriptome
@@ -656,6 +657,25 @@ The plot on the left hand side shows the standard PC plot - notice the variable

<p align="center"><img src="images/mqc_deseq2_clustering.png" alt="MultiQC - DESeq2 sample similarity plot" width="600"></p>

### Kraken2/Bracken

<details markdown="1">
<summary>Output files</summary>

- `<ALIGNER>/contaminants/kraken2/kraken_reports`
  - `*.kraken2.report.txt`: Classification of unaligned reads in the Kraken report format. See the [kraken2 manual](https://github.com/DerrickWood/kraken2/wiki/Manual#output-formats) for more details.
  - `*.classified*.fastq.gz`: If `--save_kraken_assignments` is specified, a FASTQ file for each sample in which every classified read is annotated with its taxonomic identification from Kraken2.
  - `*.unclassified*.fastq.gz`: If `--save_kraken_unassigned` is specified, a FASTQ file containing all reads that were not classified by Kraken2.
- `<ALIGNER>/contaminants/bracken/`
  - `*.kraken2.report_bracken.txt`: Kraken-style reports of the Bracken abundance estimate results. See the [kraken2 manual](https://github.com/DerrickWood/kraken2/wiki/Manual#output-formats) for more details.
  - `*.tsv`: Summary of the estimated reads for each taxon at the given classification level, including the corrections Bracken made to the Kraken2 counts.

</details>

[Kraken2](https://ccb.jhu.edu/software/kraken2/) is a taxonomic classification tool that uses k-mer matches paired with a lowest common ancestor (LCA) algorithm to assign reads to taxa. [Bracken](https://ccb.jhu.edu/software/bracken/) is a statistical method that generates abundance estimates from the Kraken2 output. Both tools are run on unaligned sequences to detect potential contamination of samples. MultiQC reports the top 5 taxa detected at the level of classification used for Bracken, with toggles available for higher taxonomic levels. If Bracken is skipped, MultiQC will report the top 5 species detected by Kraken2.

![MultiQC - Bracken top species plot](images/bracken-top-n-plot.png)
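
For readers who want to inspect a report outside of MultiQC, the following is a minimal sketch of how the top taxa could be pulled from a Kraken-style report. It assumes the standard six-column, tab-separated report layout (percentage, clade reads, direct reads, rank code, taxonomy ID, indented name) with no minimizer columns. The function name `top_taxa` and the example data are hypothetical and are not part of the pipeline.

```python
from typing import List, Tuple


def top_taxa(report_text: str, rank: str = "S", n: int = 5) -> List[Tuple[str, int]]:
    """Return the n taxa at the given rank code with the most clade-level reads.

    Assumes the standard six-column Kraken report layout (an assumption;
    reports written with minimizer data have extra columns).
    """
    taxa = []
    for line in report_text.splitlines():
        fields = line.split("\t")
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        clade_reads = int(fields[1])   # reads assigned to this taxon or below
        rank_code = fields[3].strip()  # e.g. "S" for species, "G" for genus
        name = fields[5].strip()       # names are indented by tree depth
        if rank_code == rank:
            taxa.append((name, clade_reads))
    taxa.sort(key=lambda t: t[1], reverse=True)
    return taxa[:n]


# Made-up example report, for illustration only.
EXAMPLE_REPORT = "\n".join([
    " 40.00\t400\t400\tU\t0\tunclassified",
    " 60.00\t600\t0\tR\t1\troot",
    " 50.00\t500\t500\tS\t9606\t      Homo sapiens",
    "  8.00\t80\t80\tS\t562\t      Escherichia coli",
    "  2.00\t20\t20\tS\t4932\t      Saccharomyces cerevisiae",
])

if __name__ == "__main__":
    for name, reads in top_taxa(EXAMPLE_REPORT, n=2):
        print(f"{name}\t{reads}")
```

Passing a different rank code (for example `"G"`) would give the equivalent ranking at a higher taxonomic level, provided the report contains rows at that rank.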

### MultiQC

<details markdown="1">
@@ -675,7 +695,7 @@ Results generated by MultiQC collate pipeline QC from supported tools i.e. FastQ

### Pseudoalignment

The principal output files are the same between Salmon and Kallsto:
The principal output files are the same between Salmon and Kallisto:

<details markdown="1">
<summary>Output files</summary>
8 changes: 8 additions & 0 deletions docs/usage.md
@@ -296,6 +296,14 @@ Notes:

By default, the input GTF file will be filtered to ensure that sequence names correspond to those in the genome fasta file, and to remove rows with empty transcript identifiers. Filtering can be bypassed completely where you are confident it is not necessary, using the `--skip_gtf_filter` parameter. If you just want to skip the 'transcript_id' checking component of the GTF filtering script used in the pipeline this can be disabled specifically using the `--skip_gtf_transcript_filter` parameter.

## Contamination screening options

The pipeline provides the option to scan unaligned reads for contamination from other species using [Kraken2](https://ccb.jhu.edu/software/kraken2/), with the possibility of applying corrections from [Bracken](https://ccb.jhu.edu/software/bracken/). Since running Bracken is not computationally expensive, we recommend always using it to refine the abundance estimates generated by Kraken2.

It is important to note that the accuracy of Kraken2 is [highly dependent on the database](https://doi.org/10.1099/mgen.0.000949) used. Specifically, it is [crucial](https://doi.org/10.1128/mbio.01607-23) to ensure that the host genome is included in the database. If you are particularly concerned about certain contaminants, it may be beneficial to use a smaller, more focused database containing primarily those contaminants instead of the full standard database. Various pre-built databases [are available for download](https://benlangmead.github.io/aws-indexes/k2), and instructions for building a custom database can be found in the [Kraken2 documentation](https://github.com/DerrickWood/kraken2/blob/master/docs/MANUAL.markdown). Additionally, genomes of contaminants detected in previous sequencing experiments are available on the [OpenContami website](https://openlooper.hgc.jp/opencontami/help/help_oct.php).

While Kraken2 is capable of detecting low-abundance contaminants in a sample, false positives can occur. Therefore, if only a very small number of reads from a contaminating species are detected, these results should be interpreted with caution.
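
One way to apply this caution in practice is to separate taxa with very few estimated reads from the rest before interpreting a result. The sketch below assumes a Bracken output table with its usual tab-separated header (`name`, `taxonomy_id`, `taxonomy_lvl`, `kraken_assigned_reads`, `added_reads`, `new_est_reads`, `fraction_total_reads`). The function name `flag_low_support`, the `min_reads` threshold of 10, and the example rows are hypothetical choices for illustration, not values recommended by the pipeline.

```python
import csv
import io
from typing import Dict, List


def flag_low_support(bracken_tsv: str, min_reads: int = 10) -> Dict[str, List[str]]:
    """Split taxa into 'supported' and 'low_support' by estimated read count.

    min_reads is an arbitrary illustrative cutoff (an assumption); a sensible
    value depends on sequencing depth and the database used.
    """
    reader = csv.DictReader(io.StringIO(bracken_tsv), delimiter="\t")
    result: Dict[str, List[str]] = {"supported": [], "low_support": []}
    for row in reader:
        reads = int(row["new_est_reads"])  # Bracken's corrected read estimate
        key = "supported" if reads >= min_reads else "low_support"
        result[key].append(row["name"])
    return result


# Made-up example table, for illustration only.
EXAMPLE_TSV = "\n".join([
    "name\ttaxonomy_id\ttaxonomy_lvl\tkraken_assigned_reads\tadded_reads\tnew_est_reads\tfraction_total_reads",
    "Homo sapiens\t9606\tS\t480\t20\t500\t0.85763",
    "Escherichia coli\t562\tS\t75\t5\t80\t0.13722",
    "Cutibacterium acnes\t1747\tS\t3\t0\t3\t0.00515",
])

if __name__ == "__main__":
    print(flag_low_support(EXAMPLE_TSV))
```

Taxa that land in `low_support` are not necessarily false positives; the split only marks which hits deserve a closer look before drawing conclusions.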

## Running the pipeline

The typical command for running the pipeline is as follows:
13 changes: 12 additions & 1 deletion modules.json
@@ -15,6 +15,11 @@
"git_sha": "06c8865e36741e05ad32ef70ab3fac127486af48",
"installed_by": ["modules"]
},
"bracken/bracken": {
"branch": "master",
"git_sha": "c214fad97b328eb6d6233f779be9ba44814a9136",
"installed_by": ["modules"]
},
"cat/fastq": {
"branch": "master",
"git_sha": "06c8865e36741e05ad32ef70ab3fac127486af48",
@@ -68,7 +73,8 @@
"hisat2/align": {
"branch": "master",
"git_sha": "ad30f90cfc383dfaa505771d24f9e292c53157ab",
"installed_by": ["fastq_align_hisat2"]
"installed_by": ["fastq_align_hisat2"],
"patch": "modules/nf-core/hisat2/align/hisat2-align.diff"
},
"hisat2/build": {
"branch": "master",
@@ -90,6 +96,11 @@
"git_sha": "06c8865e36741e05ad32ef70ab3fac127486af48",
"installed_by": ["modules", "quantify_pseudo_alignment"]
},
"kraken2/kraken2": {
"branch": "master",
"git_sha": "a13d5d945742a60bbef6e5c177e81cda540f75dc",
"installed_by": ["modules"]
},
"multiqc": {
"branch": "master",
"git_sha": "06c8865e36741e05ad32ef70ab3fac127486af48",
7 changes: 7 additions & 0 deletions modules/nf-core/bracken/bracken/environment.yml

Some generated files are not rendered by default. Learn more about how customized files appear on GitHub.

55 changes: 55 additions & 0 deletions modules/nf-core/bracken/bracken/main.nf

51 changes: 51 additions & 0 deletions modules/nf-core/bracken/bracken/meta.yml

13 changes: 13 additions & 0 deletions modules/nf-core/bracken/bracken/nextflow.config

5 changes: 5 additions & 0 deletions modules/nf-core/bracken/bracken/tests/genus_test.config