Merge pull request #29 from bhattlab/improvements
feat: v2.0 Improvements
bsiranosian authored Jun 7, 2023
2 parents dd2928e + fad5e9c commit cb730eb
Showing 754 changed files with 182,105 additions and 326 deletions.
5 changes: 5 additions & 0 deletions .gitignore
@@ -0,0 +1,5 @@
.snakemake/
large_data/
Rplots.pdf
tests/db
tests/test_output/scripts
21 changes: 21 additions & 0 deletions LICENESE.txt
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2023 Benjamin Siranosian

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
65 changes: 53 additions & 12 deletions README.md
@@ -6,11 +6,6 @@ A Snakemake pipeline wrapper of the Kraken2 short read metagenomic classification
## Introduction
[Kraken2](http://ccb.jhu.edu/software/kraken/) is a fast, memory-efficient short read classification system. It assigns a taxonomic identification to each sequencing read by using the lowest common ancestor (LCA) of matching genomes in the database. Using [Bracken](https://github.com/jenniferlu717/Bracken/) on top of Kraken2 provides accurate estimates of the proportions of different species. This guide covers the basics; the full [Kraken2 manual](http://ccb.jhu.edu/software/kraken/MANUAL.html) has much more detail.
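
For orientation, a Kraken2 report is a tab-delimited table with one row per taxon. The two lines below are illustrative only (the counts are invented), but the column layout — percentage of reads in the clade, clade read count, reads assigned directly to the taxon, rank code, NCBI taxonomy ID, and indented name — follows the standard report format:
```
 35.20  35200  120    G  816  Bacteroides
 30.10  30100  30100  S  821    Bacteroides vulgatus
```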


## Table of contents
1. [Installation](manual/installation.md)
2. [Usage](manual/usage.md)
@@ -22,18 +17,43 @@ Also as of this update, the NCBI taxonomy information used by Kraken is filtered
8. [GCTx data parsing](manual/gctx.md)

## Quickstart
### Install
If you're in the Bhatt lab, most of this work will take place on the SCG cluster. External users should set this pipeline up on their infrastructure of choice: cloud, HPC, or even a laptop will work for processing small datasets. You will have to [download or build a database](manual/db_construction.md), set the database options, and create the [sample input files](manual/usage.md). All steps of this pipeline are containerized, meaning only `snakemake` and `singularity` are required to run all tools.

If you're in the Bhatt lab, use [these instructions](https://github.com/bhattlab/bhattlab_workflows/blob/master/manual/setup.md) to set up snakemake and set up a profile to submit jobs to the SCG cluster. External users should follow these instructions:

1. [Install mambaforge](https://github.com/conda-forge/miniforge#mambaforge).
2. Create a fresh environment with `snakemake` or add it to an existing environment:
```
mamba create --name snakemake --channel conda-forge --channel bioconda snakemake
```
3. [Install singularity](https://docs.sylabs.io/guides/latest/user-guide/quick_start.html#quick-installation-steps).
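
To confirm that both tools are on your `PATH` before going further, a quick sanity check:
```
snakemake --version     # should print a version number
singularity --version
```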

Then, clone this repo in a convenient location.
```
cd ~/projects
git clone https://github.com/bhattlab/kraken2_classification.git
cd kraken2_classification
```

### Run with test data
A Kraken2 database is required to use this pipeline. Pre-built databases can be downloaded from [Ben Langmead's site](https://benlangmead.github.io/aws-indexes/k2). As an example, we download the standard database capped at 8GB of memory use and unpack it into a folder to use with the tests:
```
cd kraken2_classification/tests
wget https://genome-idx.s3.amazonaws.com/kraken/k2_standard_08gb_20230605.tar.gz
mkdir db
tar -C db -xvf k2_standard_08gb_20230605.tar.gz
```
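
After unpacking, the `db` directory should contain the three core Kraken2 database files (the standard bundles also ship Bracken files alongside them):
```
ls db
# core Kraken2 files: hash.k2d  opts.k2d  taxo.k2d
```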

A small test dataset from [Yassour et al. (2018)](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6091882/) is included in this repo: 10,000 reads from several timepoints of a mother-infant pair. Even with such low coverage, the differences in microbiome composition are apparent in clustering and taxonomic barplots. Launch an end-to-end test run with a command like so:
```
# Launch this from the kraken2_classification directory
snakemake -s Snakefile --configfile tests/test_config/config_pe.yaml -j1 --use-singularity
```

The script `tests/run_tests.sh` checks that the basic functionality of the pipeline executes as expected.
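
Run it from the repository root, assuming `snakemake` and `singularity` are available:
```
bash tests/run_tests.sh
```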

### Run with real-world data
Copy the `config.yaml` file into the working directory for your samples. Change the options to suit your project. The main input is the `sample_reads_file` which defines the mapping from sample names to sequencing reads. See [Usage](manual/usage.md) for more detail.
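
As an illustration only (the authoritative format is described in [Usage](manual/usage.md)), a `sample_reads_file` is a tab-delimited map from sample name to read files, along these lines:
```
# hypothetical sample_reads_file for two paired-end samples (tab-delimited)
mother_1    reads/mother_1_R1.fq.gz    reads/mother_1_R2.fq.gz
infant_1    reads/infant_1_R1.fq.gz    reads/infant_1_R2.fq.gz
```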

On the Bhatt lab SCG cluster, you can then launch the workflow with a snakemake command.
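
The sketch below is illustrative only: it assumes a snakemake cluster profile named `scg` has been configured (the profile name and Snakefile path are assumptions, not fixed by this repo):
```
# hypothetical launch; replace the profile name and paths with your own setup
snakemake -s path/to/Snakefile --configfile config.yaml --profile scg --use-singularity
```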
Expand All @@ -51,6 +71,11 @@ After running the workflow and you're satisfied the results, run the cleanup com
snakemake cleanup -s path/to/Snakefile --configfile config.yaml
```

### Run analysis on existing data
If you have a collection of Kraken/Bracken reports and just want to run the downstream analysis in this pipeline, you can provide the `sample_reports_file` in the config, which maps sample names to Kraken and Bracken report files. See `tests/test_config/config_downstream_only_bracken.yaml` for an example. Then, launch the pipeline with `Snakefile_downstream_only`, tuning the filtering and job submission parameters to meet your needs.
```
snakemake -s Snakefile_downstream_only --configfile tests/test_config/config_downstream_only_bracken.yaml -j1 --use-singularity
```
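
For illustration, the file referenced by `sample_reports_file` could look like the sketch below; the file name and column order here are assumptions, so check the test config above for the authoritative layout:
```
# hypothetical report map: sample name, kraken report, bracken report (tab-delimited)
sample_a    reports/sample_a.krak.report    reports/sample_a.krak_bracken.report
sample_b    reports/sample_b.krak.report    reports/sample_b.krak_bracken.report
```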

## Parsing output reports
The Kraken reports `classification/sample.krak.report`, Bracken reports `sample.krak_bracken.report`, and the data matrices or GCTx objects in the `processed_results` folder are the best starting points for downstream analysis. See [Downstream processing and plotting](manual/downstream_plotting.md) for details on using the data in R.
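
As a quick shell-level example (assuming the standard six-column, tab-delimited Kraken2 report layout), species-level rows can be pulled out and ranked by read fraction like this:
```
# keep species-level rows (rank code "S"), sort by percentage of reads, show the top 10
awk -F'\t' '$4 == "S"' classification/sample.krak.report | sort -t$'\t' -k1,1nr | head
```
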
@@ -109,3 +134,19 @@ _Taxonomic barplot_

_PCoA plot example_
![pcoa_plot](images/pcoa_plot.png "PCoA plot")

## Changelog

### 2023-06-07
v2.0 (breaking changes introduced to configuration files and the way parameters are used).
This set of changes modernizes the pipeline:
* All steps are now available with containerized execution
* Created a separate pipeline, `Snakefile_downstream_only`, that runs only the downstream analysis steps from a list of report files
* Added a small test dataset and improved test execution
* Various code and README/manual changes
* Added a license file

### 2019-09-01
The outputs of this pipeline have been vastly improved! Both internal and saved data now use the GCTx data format from the [cmapR](https://github.com/cmap/cmapR) package. A GCT object is a data matrix with associated row and column metadata, which allows consistent metadata to live with the classification data: taxonomy information on the rows and sample metadata on the columns. See section [8. GCTx data processing](manual/gctx.md) for more information and tools for working with the new implementation.

Also as of this update, the NCBI taxonomy information used by Kraken is filtered and improved somewhat before any data or figures are saved. For example, many taxonomy levels that were previously labeled simply "environmental samples" are now named with their parent taxa name to remove ambiguity. Also, levels without a proper rank designation (listed with an abbreviation and a number in the kraken report) have been forced into a specific rank when nothing was below them. This makes the taxonomy "technically incorrect", but much more practically useful in these cases. Contact me with any questions. The full list of changes is described in [Additional considerations](manual/extra.md).
