Merge pull request #29 from bhattlab/improvements
feat: v2.0 Improvements
bsiranosian authored Jun 7, 2023
2 parents dd2928e + fad5e9c commit cb730eb
Showing 754 changed files with 182,105 additions and 326 deletions.
5 changes: 5 additions & 0 deletions .gitignore
@@ -0,0 +1,5 @@
.snakemake/
large_data/
Rplots.pdf
tests/db
tests/test_output/scripts
21 changes: 21 additions & 0 deletions LICENESE.txt
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2023 Benjamin Siranosian

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
65 changes: 53 additions & 12 deletions README.md
@@ -6,11 +6,6 @@ A Snakemake pipeline wrapper of the Kraken2 short read metagenomic classification
## Introduction
[Kraken2](http://ccb.jhu.edu/software/kraken/) is a fast, memory-efficient short read classification system. It assigns a taxonomic identification to each sequencing read by using the lowest common ancestor (LCA) of matching genomes in the database. Using [Bracken](https://github.com/jenniferlu717/Bracken/) on top of Kraken2 provides accurate estimates of the proportions of different species. This guide covers the basics; the full [Kraken2 manual](http://ccb.jhu.edu/software/kraken/MANUAL.html) has much more detail.
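
For orientation, a Kraken2 report is a tab-delimited table with one row per taxon. The two lines below are illustrative only (the counts are invented), but the column layout — percentage of reads in the clade, clade read count, reads assigned directly to the taxon, rank code, NCBI taxonomy ID, and indented name — follows the standard report format:
```
 35.20  35200  120    G  816  Bacteroides
 30.10  30100  30100  S  821    Bacteroides vulgatus
```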


## Table of contents
1. [Installation](manual/installation.md)
2. [Usage](manual/usage.md)
@@ -22,18 +17,43 @@ Also as of this update, the NCBI taxonomy information used by Kraken is filtered
8. [GCTx data parsing](manual/gctx.md)

## Quickstart
### Install
If you're in the Bhatt lab, most of this work will take place on the SCG cluster. External users should set this pipeline up on their infrastructure of choice: cloud, HPC, or even a laptop will work for processing small datasets. You will have to [download or build a database](manual/db_construction.md), set the database options, and create the [sample input files](manual/usage.md). All steps of this pipeline are containerized, meaning only `snakemake` and `singularity` are required to run all tools.

If you're in the Bhatt lab, use [these instructions](https://github.com/bhattlab/bhattlab_workflows/blob/master/manual/setup.md) to set up snakemake and set up a profile to submit jobs to the SCG cluster. External users should follow these instructions:

1. [Install mambaforge](https://github.com/conda-forge/miniforge#mambaforge).
2. Create a fresh environment with `snakemake` or add it to an existing environment:
```
mamba create --name snakemake --channel conda-forge --channel bioconda snakemake
```
3. [Install singularity](https://docs.sylabs.io/guides/latest/user-guide/quick_start.html#quick-installation-steps).
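
To confirm that both tools are on your `PATH` before going further, a quick sanity check:
```
snakemake --version     # should print a version number
singularity --version
```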

Then, clone this repo in a convenient location.
```
cd ~/projects
git clone https://github.com/bhattlab/kraken2_classification.git
cd kraken2_classification
```

### Run with test data
A Kraken2 database is required to use this pipeline. Pre-built databases can be downloaded from [Ben Langmead's site](https://benlangmead.github.io/aws-indexes/k2). As an example, we download the standard database capped at 8GB of memory use and unpack it into a folder to use with the tests:
```
cd kraken2_classification/tests
wget https://genome-idx.s3.amazonaws.com/kraken/k2_standard_08gb_20230605.tar.gz
mkdir db
tar -C db -xvf k2_standard_08gb_20230605.tar.gz
```
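
After unpacking, the `db` directory should contain the three core Kraken2 database files (the standard bundles also ship Bracken files alongside them):
```
ls db
# core Kraken2 files: hash.k2d  opts.k2d  taxo.k2d
```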

A small test dataset from [Yassour et al. (2018)](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6091882/) is included in this repo: 10,000 reads from several timepoints of a mother-infant pair. Even with such low coverage, the differences in microbiome composition are apparent in clustering and taxonomic barplots. Launch an end-to-end test run with a command like so:
```
# Launch this from the kraken2_classification directory
snakemake -s Snakefile --configfile tests/test_config/config_pe.yaml -j1 --use-singularity
```

The script `tests/run_tests.sh` checks that the basic functionality of the pipeline executes as expected.
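
Run it from the repository root, assuming `snakemake` and `singularity` are available:
```
bash tests/run_tests.sh
```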

### Run with real-world data
Copy the `config.yaml` file into the working directory for your samples. Change the options to suit your project. The main input is the `sample_reads_file` which defines the mapping from sample names to sequencing reads. See [Usage](manual/usage.md) for more detail.
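
As an illustration only (the authoritative format is described in [Usage](manual/usage.md)), a `sample_reads_file` is a tab-delimited map from sample name to read files, along these lines:
```
# hypothetical sample_reads_file for two paired-end samples (tab-delimited)
mother_1    reads/mother_1_R1.fq.gz    reads/mother_1_R2.fq.gz
infant_1    reads/infant_1_R1.fq.gz    reads/infant_1_R2.fq.gz
```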

On the Bhatt lab SCG cluster, you can then launch the workflow with a snakemake command.
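
The sketch below is illustrative only: it assumes a snakemake cluster profile named `scg` has been configured (the profile name and Snakefile path are assumptions, not fixed by this repo):
```
# hypothetical launch; replace the profile name and paths with your own setup
snakemake -s path/to/Snakefile --configfile config.yaml --profile scg --use-singularity
```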
Expand All @@ -51,6 +71,11 @@ After running the workflow and you're satisfied the results, run the cleanup com
snakemake cleanup -s path/to/Snakefile --configfile config.yaml
```

### Run analysis on existing data
If you have a collection of Kraken/Bracken reports and just want to run the downstream analysis in this pipeline, you can provide the `sample_reports_file` in the config, which maps sample names to Kraken and Bracken report files. See `tests/test_config/config_downstream_only_bracken.yaml` for an example. Then, launch the pipeline with `Snakefile_downstream_only`, tuning the filtering and job submission parameters to meet your needs.
```
snakemake -s Snakefile_downstream_only --configfile tests/test_config/config_downstream_only_bracken.yaml -j1 --use-singularity
```
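
For illustration, the file referenced by `sample_reports_file` could look like the sketch below; the file name and column order here are assumptions, so check the test config above for the authoritative layout:
```
# hypothetical report map: sample name, kraken report, bracken report (tab-delimited)
sample_a    reports/sample_a.krak.report    reports/sample_a.krak_bracken.report
sample_b    reports/sample_b.krak.report    reports/sample_b.krak_bracken.report
```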

## Parsing output reports
The Kraken reports `classification/sample.krak.report`, Bracken reports `sample.krak_bracken.report`, and the data matrices or GCTx objects in the `processed_results` folder are the best starting points for downstream analysis. See [Downstream processing and plotting](manual/downstream_plotting.md) for details on using the data in R.
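
As a quick shell-level example (assuming the standard six-column, tab-delimited Kraken2 report layout), species-level rows can be pulled out and ranked by read fraction like this:
```
# keep species-level rows (rank code "S"), sort by percentage of reads, show the top 10
awk -F'\t' '$4 == "S"' classification/sample.krak.report | sort -t$'\t' -k1,1nr | head
```
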
@@ -109,3 +134,19 @@ _Taxonomic barplot_

_PCoA plot example_
![pcoa_plot](images/pcoa_plot.png "PCoA plot")

## Changelog

### 2023-06-07
v2.0 (breaking changes introduced to configuration files and the way parameters are used).
This set of changes modernizes the pipeline:
* All steps are now available with containerized execution
* Created a separate pipeline, `Snakefile_downstream_only`, that runs only the downstream analysis steps from a list of report files
* Added a small test dataset and improved test execution
* Various code and README/manual changes
* Added a license file

### 2019-09-01
The outputs of this pipeline have been vastly improved! Both internal and saved data now use the GCTx data format from the [cmapR](https://github.com/cmap/cmapR) package. A GCT object is a data matrix with associated row and column metadata, which allows consistent metadata to live with the classification data: taxonomy information on the rows and sample metadata on the columns. See section [8. GCTx data processing](manual/gctx.md) for more information and tools for working with the new implementation.

Also as of this update, the NCBI taxonomy information used by Kraken is filtered and improved somewhat before any data or figures are saved. For example, many taxonomy levels that were previously labeled simply "environmental samples" are now named with their parent taxa name to remove ambiguity. Also, levels without a proper rank designation (listed with an abbreviation and a number in the kraken report) have been forced into a specific rank when nothing was below them. This makes the taxonomy "technically incorrect", but much more practically useful in these cases. Contact me with any questions. The full list of changes is described in [Additional considerations](manual/extra.md).
