Assnake is a system for Illumina NGS data analysis, visualization and management. It allows you to go from raw reads to biological insights in just a few commands.
As of the pre-alpha release, Assnake is capable of full-blown metagenomic data analysis (both shotgun WGS and amplicon 16S included!) and some RNA-seq analysis.
Assnake was born in an effort to provide a user-friendly, scalable and reproducible system for NGS data analysis that is accessible to researchers without advanced computer skills, while keeping the system flexible and easy to modify and extend with your own pipelines or analysis code.
- assnake-core-preprocessing
- assnake-core-taxonomy
- assnake-core-assembly
- assnake-core-mapping
- assnake-core-binning
- assnake-dada2
- assnake-core-transcriptome
- Install through conda. (For now, installation is only available by cloning the GitHub repo.)
- Initialize your installation (1 minute)
assnake init start --just-do-it
- Navigate to the folder where you want to store the data for your study and initialize an Assnake Dataset (1 minute)
assnake dataset init
- Import reads into the Dataset (1 minute)
assnake dataset import-reads -d {DATASET_NAME} -r {FOLDER_WITH_READS}
- Run the pipeline of your choice!
assnake result dada2-full -d {DATASET_NAME} gather -j 8 --run
Chances are, you work with more than one NGS dataset, and maybe you are not the only one working on a server/cluster. How many times have you found yourself searching for data all over the filesystem and asking your colleagues if they remember where someone put it? When you finally find the raw data, the result files and code are most probably in a state of creative chaos, and gathering them together may be a tough task. If you want to compare results of different studies, validate your findings using external data or provide your readers with an easy way to reproduce your analysis, all your code and files should be in strict order. Moreover, it is hard to keep track of all software dependencies and conflicts and of the versions of tools and pipelines, and deploying your environment on a new machine may be a real pain in the ass. ASSnake solves all these problems in a user-friendly and extendable way by allowing you to catalogue your data and to run and reproduce pipelines and statistical analyses.
- You store reads of your Samples inside Datasets.
Technically, Datasets are just folders on your file system, and you need to tell Assnake about their existence.
Assnake assumes that raw reads are stored inside `{PATH_TO_DATASET_FOLDER}/{YOUR_DATASET_NAME}/reads/raw`. So, for example, you may have your reads stored at `/home/ozzy/bat_microbiome/reads/raw`. Here `/home/ozzy` is the prefix of your Dataset and `bat_microbiome` is its name. You need to register your dataset inside Assnake with `assnake dataset create -f /home/ozzy -d bat_microbiome`.
Don't be afraid, Assnake is very careful and will never overwrite or delete your data!
- Data needs quality control and cleaning before being analyzed. When you preprocess your reads, say, by removing low-quality or contaminant sequences, you get new read files. They are stored inside `{fs_prefix}/{df}/reads/{preprocessing}` folders. So, the name of the folder is the name of your preprocessing! For raw reads the preprocessing name is `raw`, while the name `raw__tmtic_def` means that raw reads were preprocessed with Trimmomatic with default parameters. The `preprocessing` name fully describes all the steps that were applied to the reads inside the folder; steps are separated by a double underscore (`__`).
- Everything that generates meaningful and useful data produces a Result, which you can request from Assnake using the `assnake result <RESULT_NAME>` command. For example, the command `assnake result fastqc --df bat_microbiome run` will produce FastQC reports for all samples in the Dataset bat_microbiome. By the way, preprocessing also gives you a Result, in the form of reads: `{df_sample}_R1.fastq.gz` and `{df_sample}_R2.fastq.gz`.
- Results are provided by SnakeModules. They implement all the logic connected with a Result, and one module can implement any number of Results. For example, the assnake-dada2 module implements the DADA2 pipeline. It extends Assnake with 2 Results: `dada2-filter-and-trim` for trimming reads and `dada2-full` for running the full pipeline, which generates a table of ASV (Amplicon Sequence Variant (link to github issue about the term)) abundances across samples and a table with the taxonomic annotation of the ASVs. It also exposes an API that lets you easily access, manipulate and visualize this data (heatmaps, barplots, PCA plots) from Python or R (notebooks are great!). More on that in the Extending Assnake section (not written yet).
- Pipelines are built from several Results that come from SnakeModules.
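The folder and naming conventions from the list above can be sketched in a few lines of Python. This is only an illustration of the layout, not Assnake's own code: the function names `reads_dir`, `preprocessing_steps` and `paired_read_files` are hypothetical helpers made up for this example.

```python
from pathlib import PurePosixPath
from typing import List, Tuple


def reads_dir(fs_prefix: str, df: str, preprocessing: str = "raw") -> PurePosixPath:
    """Build {fs_prefix}/{df}/reads/{preprocessing} (hypothetical helper)."""
    return PurePosixPath(fs_prefix) / df / "reads" / preprocessing


def preprocessing_steps(preprocessing: str) -> List[str]:
    """Split a preprocessing name into its steps on the double underscore."""
    return preprocessing.split("__")


def paired_read_files(df_sample: str) -> Tuple[str, str]:
    """File names of a paired-end sample, following the pattern in the text."""
    return (f"{df_sample}_R1.fastq.gz", f"{df_sample}_R2.fastq.gz")


print(reads_dir("/home/ozzy", "bat_microbiome"))
# /home/ozzy/bat_microbiome/reads/raw
print(preprocessing_steps("raw__tmtic_def"))
# ['raw', 'tmtic_def']
```

Reading a preprocessing name from left to right gives the order in which the steps were applied, starting from `raw`.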
These four concepts are the foundation of Assnake. These assumptions are pretty general and applicable to many types of data. Actually, Snakemake is a framework for creating data-processing pipelines with a primary focus on NGS and other omics data.
Assnake draws inspiration mainly from Anvio (incorporated into Assnake) and QIIME-2.
Assnake uses Snakemake as a workflow subsystem, thus it gets Snakemake's ability to run on all kinds of servers, clusters, personal computers and in the cloud.
Assnake is entirely open source and is built with great open-source tools. We encourage the community to join Assnake's initiative in creating open, reproducible, user-friendly, and scalable omics data analysis and management.
- Install conda https://docs.conda.io/projects/conda/en/latest/user-guide/install/linux.html
I. Download the installer package using wget. You need the Linux x64 version.
II. Run the downloaded file with bash
bash Miniconda3-latest-Linux-x86_64.sh
and follow the instructions.
III. Close and reopen your terminal for the changes to take effect.
IV. Verify the installation by executing `conda list`.
- Create new environment
conda create -n assnake python=3.6
- Activate your new environment
source activate assnake
- Navigate to some directory on your file system, for example your home directory
cd ~
- Clone this repository to your computer using
git clone https://github.com/Fedorov113/assnake.git
- Enter the newly created assnake directory
cd assnake
- Install the package using
pip install -e ./
- Verify your installation by running
assnake --help
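If one of the steps above fails, it can help to check which of the command-line tools are actually visible on your `PATH`. The following is a minimal sketch using only the Python standard library; the helper name `missing_tools` is made up for this example, and the tool list simply mirrors the commands used in the steps above.

```python
import shutil

# Executables that the installation steps above call from the shell.
REQUIRED_TOOLS = ("conda", "git", "assnake")


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools that cannot be found on PATH (hypothetical helper)."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All required tools were found.")
```

Note that `assnake` is only expected to be found while the assnake conda environment is activated.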
Call `assnake init start`.
It will ask you which directory you would like to use for the Assnake database. Choose a folder on your file system with at least 20 GB of free space. If the folder does not exist yet, Assnake will create it.
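Before choosing the database folder you may want to check how much free space it has. Here is a small sketch with the standard library; `free_gb` and `has_enough_space` are hypothetical helper names, and the 20 GB threshold is taken from the text above (counted here in decimal gigabytes, which is an assumption).

```python
import shutil
from pathlib import Path

REQUIRED_GB = 20  # threshold from the text above


def free_gb(path):
    """Free space, in decimal GB, on the file system that would hold `path`.

    If `path` does not exist yet, the nearest existing parent is checked.
    """
    p = Path(path).expanduser().resolve()
    while not p.exists():
        p = p.parent
    return shutil.disk_usage(p).free / 1e9


def has_enough_space(path, required_gb=REQUIRED_GB):
    """True if the file system behind `path` has at least `required_gb` free."""
    return free_gb(path) >= required_gb


if __name__ == "__main__":
    candidate = "~/assnake_db"  # example location, pick your own
    print(f"{free_gb(candidate):.1f} GB free, enough: {has_enough_space(candidate)}")
```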
Now you can start processing your data!
Now we need to register or create a dataset in Assnake. Run `assnake dataset create --df <DATASET_NAME> --fs_prefix <FOLDER WHERE TO STORE DATA>`.
The full folder name looks like this: `{fs_prefix}/{df}`. Say I want to create a dataset named `miseq_sop` and put it in `/home/fedorov/bio`. I will call `assnake dataset create --df miseq_sop --fs_prefix /home/fedorov/bio`. The folder `/home/fedorov/bio/miseq_sop` will be created on the file system if it does not exist already. If you already have such a folder, that is totally OK.
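The "create it if missing, leave it alone if present" behaviour described above is what `Path.mkdir(parents=True, exist_ok=True)` gives you in Python. The sketch below only illustrates that behaviour in a temporary directory; it is not Assnake's implementation, `ensure_dataset_dir` is a made-up name, and the real command additionally registers the dataset with Assnake.

```python
import tempfile
from pathlib import Path


def ensure_dataset_dir(fs_prefix, df):
    """Create {fs_prefix}/{df} if needed; an existing folder is left untouched."""
    path = Path(fs_prefix) / df
    path.mkdir(parents=True, exist_ok=True)
    return path


# Demonstration in a throwaway directory instead of a real fs_prefix.
with tempfile.TemporaryDirectory() as fs_prefix:
    first = ensure_dataset_dir(fs_prefix, "miseq_sop")
    (first / "notes.txt").write_text("existing data")
    second = ensure_dataset_dir(fs_prefix, "miseq_sop")  # second call is a no-op
    print(first == second, (second / "notes.txt").read_text())
    # True existing data
```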
Now run assnake dataset info -d <DATASET_NAME>
This version is for Vlad and dada2, so we need only
- https://github.com/Fedorov113/assnake-dada2
- https://github.com/Fedorov113/assnake-core-preprocessing
Just clone these repositories and install each of them with `pip install -e ./` while the assnake conda env is activated (`source activate assnake`).
After installation, run `assnake result request --help` and you will see the newly available results.
Run the commands without the `< >` symbols around your dataset name.
- Run `assnake result request dada2-filter-and-trim -d <YOUR_DATASET> -p raw run --threads 1 --jobs 4 --run`. This will filter your reads by quality with default parameters, using 4 jobs in parallel and 1 thread per job.
- Execute `assnake result request dada2-full -d <YOUR_DATASET> -p raw__dada2fat_def run -t 4 -j 1 --run`. Now we run 1 job with 4 threads. If you know that your machine has more cores available, feel free to use them and increase threads or jobs.
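To pick values for threads and jobs, a rough rule is that the peak core usage is jobs multiplied by threads per job. The sketch below assumes exactly that (all jobs running at once, each using all of its threads); the helper names are hypothetical and nothing here calls Assnake itself.

```python
import os


def cores_needed(jobs, threads):
    """Peak number of cores if all jobs run at once with `threads` each."""
    return jobs * threads


def fits_on_machine(jobs, threads, available=None):
    """Check a jobs/threads combination against the available cores."""
    if available is None:
        available = os.cpu_count() or 1  # cpu_count() may return None
    return cores_needed(jobs, threads) <= available


# The two commands above use the same number of cores in different shapes.
print(cores_needed(jobs=4, threads=1))  # 4
print(cores_needed(jobs=1, threads=4))  # 4
print(fits_on_machine(jobs=4, threads=2, available=8))  # True
```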
You are done! You can find the dada2 results at `{FS_PREFIX}/{YOUR_DF}/dada2/sample_set/learn_erros__def/seqtab_nochim__20.rds` and `{FS_PREFIX}/{YOUR_DF}/dada2/sample_set/learn_erros__def/taxa_20.rds`.
Just load these files in R using the `readRDS()` function.
Feature | Assnake | Anvio | QIIME2 | nf-core | snakePipes | StaG | SqueezeMeta | Sunbeam | MetaWRAP |
---|---|---|---|---|---|---|---|---|---|
Link | https://github.com/ASSNAKE/assnake | http://merenlab.org/software/anvio/ | https://qiime2.org/ | https://github.com/nf-core | https://academic.oup.com/bioinformatics/article/35/22/4757/5499080 | https://github.com/ctmrbio/stag-mwc | https://github.com/jtamames/SqueezeMeta | https://github.com/sunbeam-labs/sunbeam | https://github.com/bxlab/metaWRAP |
Data Management | YES | PARTIAL | NO | NO | ? | NO | PARTIAL | YES | NO |
Modularity (without editing source code) | FULL | PARTIAL? | FULL | FULL | NO | NO | NO | FULL | NO |
Metagenomic pipeline | YES | ONLY BINNING | ONLY 16s | In development | NO | YES | Only assembly based | YES | YES |
Workflow system | Snakemake | Snakemake | NONE | Nextflow | Snakemake | Snakemake | Snakemake | Snakemake | NONE |
The command-line interface is built using the Click library. Assnake relies heavily on pandas for data flow and management and, of course, on Snakemake.
A Result has only 2 required files: `workflow.smk`, which holds all the Snakemake rules and necessary code, and `wc_config.yaml`, which holds all the necessary wildcard strings.
You can build and structure your project as you like; if you need subworkflows, just gather them all in `workflow.smk` using Snakemake's `include` directive.
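When laying out a new Result, a quick check that both required files are in place can save a confusing error later. This is a minimal sketch, assuming the two files sit directly in the Result's folder (the text does not say where exactly they must live); `missing_result_files` is a hypothetical helper, not part of Assnake.

```python
import tempfile
from pathlib import Path

# The two files named above as required for every Result.
REQUIRED_FILES = ("workflow.smk", "wc_config.yaml")


def missing_result_files(result_dir):
    """Return the required files that are absent from `result_dir`."""
    directory = Path(result_dir)
    return [name for name in REQUIRED_FILES if not (directory / name).is_file()]


# Demonstration with a throwaway folder that has only one of the two files.
with tempfile.TemporaryDirectory() as result_dir:
    (Path(result_dir) / "workflow.smk").write_text("# snakemake rules go here\n")
    print(missing_result_files(result_dir))
    # ['wc_config.yaml']
```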
If your Result has one of the standard inputs, like `illumina_sample`, the CLI command for invoking the Result can be built for you automatically. Or you can take full control and pass your CLI invocation command as a property when creating the Result class instance.
Results are packed into SnakeModules, which are parsed by Assnake later on. You just need to import all the Results you created and add them to the SnakeModule when creating the SnakeModule class instance (they are passed in a list now; this will be changed to a dict). Assnake learns about Results by dynamically parsing SnakeModules and programmatically importing all the necessary parts.
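The relationship between Results and SnakeModules can be pictured with a toy model. The classes below are stand-ins written for this illustration only: they share names with the concepts above, but their fields and the `collect_results` function are assumptions and do not reproduce Assnake's real classes or constructor signatures.

```python
from dataclasses import dataclass, field  # dataclasses need Python 3.7+
from typing import Dict, List


@dataclass
class Result:
    """Toy stand-in for a Result: a name plus its two required files."""
    name: str
    workflow_file: str = "workflow.smk"
    wc_config_file: str = "wc_config.yaml"


@dataclass
class SnakeModule:
    """Toy stand-in for a SnakeModule: a named list of Results."""
    name: str
    results: List[Result] = field(default_factory=list)


def collect_results(modules: List[SnakeModule]) -> Dict[str, Result]:
    """Gather the Results of all modules into one lookup table by name."""
    registry: Dict[str, Result] = {}
    for module in modules:
        for result in module.results:
            if result.name in registry:
                raise ValueError(f"duplicate result name: {result.name}")
            registry[result.name] = result
    return registry


dada2 = SnakeModule(
    "assnake-dada2",
    [Result("dada2-filter-and-trim"), Result("dada2-full")],
)
print(sorted(collect_results([dada2])))
# ['dada2-filter-and-trim', 'dada2-full']
```

Keying the lookup table by Result name is one simple way to make name clashes between modules visible early.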