Snakemake workflow for genome assembly evaluation.
If you use this workflow in a paper, don't forget to give credit to the authors by citing the URL of this (original) repository and, if available, its DOI (see above; currently N/A).
- Per Unneberg (@percyfal)
- align transcripts to a reference sequence with gmap
- estimate gene body coverage with genecovr
- run quast, jellyfish, busco, and kraken2
- summarize quality metrics with MultiQC
- (WIP): run some of the steps for adding reads, coverage files, and sequences to the blobtoolkit viewer
- Create a new GitHub repository using this workflow as a template.
- Clone the newly created repository to your local system, into the place where you want to perform the data analysis.
Configure the workflow according to your needs by editing the files in the `config/` folder. Adjust `config.yaml` to configure the workflow execution (see the Configuration section below for more details). In addition, edit the following files:
- `assemblies.tsv`: assembly definition file which lists species, version, and assembly fasta file. Mandatory.
- `transcripts.tsv`: transcript definition file that lists transcript fasta files and connects them to a given species. Optional.
- `reads.tsv`: raw sequence read files. Optional.
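As a minimal illustration, an `assemblies.tsv` file could look like the example below. The species name, version tags, and paths are placeholders (they match the example data used in the Configuration section further down); columns are tab-separated.

```
species	version	fasta
foo	v1	resources/assembly.v1.fasta
foo	v2	resources/assembly.v2.fasta
```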
NOTE: the `config` directory doesn't have to be in the workflow source directory, in which case snakemake must be invoked with the full path to the Snakefile:

```
snakemake -s /path/to/manticore-smk/workflow/Snakefile
```
Install Snakemake using conda:

```
conda create -c bioconda -c conda-forge -n snakemake snakemake
```
For installation details, see the instructions in the Snakemake documentation.
Activate the conda environment:

```
conda activate snakemake
```
Test your configuration by performing a dry-run via

```
snakemake --use-conda -n
```
Execute the workflow locally via

```
snakemake --use-conda --cores $N
```

using `$N` cores, or run it in a cluster environment via

```
snakemake --use-conda --cluster qsub --jobs 100
```

or

```
snakemake --use-conda --drmaa --jobs 100
```
You can also use a Snakemake profile for fine-tuning executions. For instance, to use the slurm profile, run

```
cookiecutter https://github.com/Snakemake-Profiles/slurm.git
snakemake --use-conda --profile slurm --jobs 100
```
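A profile is a directory whose `config.yaml` maps long-form snakemake command-line options to default values. As a sketch only (the actual keys generated by the cookiecutter template may differ), a profile at `~/.config/snakemake/slurm/config.yaml` could contain:

```
# Hypothetical profile defaults; each key is the long form of a
# snakemake command-line flag.
jobs: 100
use-conda: true
printshellcmds: true
restart-times: 1
```

With this in place, `snakemake --profile slurm` picks up these defaults without repeating them on the command line.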
If you not only want to fix the software stack but also the underlying OS, use

```
snakemake --use-conda --use-singularity
```

in combination with any of the modes above. See the Snakemake documentation for further details.
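For example, containerized execution could be combined with the profile invocation shown earlier (a sketch; adjust the profile name and job count to your setup):

```
snakemake --use-conda --use-singularity --profile slurm --jobs 100
```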
After successful execution, you can create a self-contained interactive HTML report with all results via:

```
snakemake --report report.html
```
The report contains documentation and results from the workflow.
In case you have also changed or added steps, please consider contributing them back to the original repository:
- Fork the original repo to a personal or lab account.
- Clone the fork to your local system, to a different place than where you ran your analysis.
- Copy the modified files from your analysis to the clone of your fork, e.g., `cp -r workflow path/to/fork`. Make sure to not accidentally copy config file contents or sample sheets. Instead, manually update the example config files if necessary.
- Commit and push your changes to your fork.
- Create a pull request against the original repository.
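In shell terms, the steps above might look roughly as follows; the account, repository name, and paths are placeholders:

```
# Clone the fork to a fresh location, separate from the analysis directory
git clone git@github.com:<your-account>/<your-fork>.git
cd <your-fork>
# Copy the modified workflow files and review the diff so that no config
# contents or sample sheets sneak in
cp -r /path/to/analysis/workflow .
git diff
git commit -am "Describe your changes"
git push
```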
For a quick overview of example configuration files, see `config/config.yaml` and the test configuration `.test/config/config.yaml`.
All configuration files are evaluated against configuration schemas. The validation ensures configuration keys are populated, which minimizes configuration overhead for the user. The schemas are self-documented and define and describe all available configuration options.
As an example, `workflow/schemas/assemblies.schema.yaml` defines a tabular sample input file format for individual assemblies. There are four defined properties: `id` (identifier), `species`, `version`, and `fasta`, of which `species`, `version`, and `fasta` are required (`id` will be auto-generated as `species`_`version` if missing).
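To give a feel for what such a schema looks like, here is a sketch consistent with the description above; consult `workflow/schemas/assemblies.schema.yaml` itself for the authoritative definition:

```
# Sketch only; the real schema may differ in layout and metadata.
$schema: "http://json-schema.org/draft-07/schema#"
description: assembly definition file format
type: object
properties:
  id:
    type: string
    description: identifier; auto-generated as <species>_<version> if missing
  species:
    type: string
  version:
    type: string
  fasta:
    type: string
required:
  - species
  - version
  - fasta
```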
See the tutorial Understanding JSON Schema for an accessible introduction to schemas.
The workflow is configured by defining up to three input data files and what applications to run. The input files are `assemblies`, `transcripts`, and `reads`, where only `assemblies` is mandatory:
```
assemblies: config/assemblies.tsv
transcripts: config/transcripts.tsv
reads: config/reads.tsv
```
Each input data file is a tab-separated file listing file names and associated metadata. See the schemas for details or tests for examples.
The tabular input files can also be represented as lists in yaml format. For instance, the following tabular data

```
species	version	fasta
foo	v1	resources/assembly.v1.fasta
foo	v2	resources/assembly.v2.fasta
```
would be represented as follows in yaml format:

```
- species: foo
  description: Version 1 assembly of foo
  version: v1
  fasta: resources/assembly.v1.fasta
- species: foo
  description: Version 2 assembly of foo
  version: v2
  fasta: resources/assembly.v2.fasta
```
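Presumably, the list form then replaces the file path under the corresponding key in `config.yaml`; a hypothetical sketch:

```
# Hypothetical inline alternative to assemblies: config/assemblies.tsv
assemblies:
  - species: foo
    version: v1
    fasta: resources/assembly.v1.fasta
```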
Tool configurations are listed below.
The purpose of the `genecovr` property is to define what assemblies and transcripts to use for gene body coverage calculations. Briefly, it consists of sections that either list an input file to `genecovr` or define a combination of assemblies and transcripts:
```
genecovr:
  dataset1:
    csvfile: config/genecovr.csv
  dataset2:
    assemblies: ["foo_v2", "foo_v1"]
    transcripts: ["A", "B"]
```
The `genecovr.csv` file consists of the columns `dataset` (an identifier name), `psl`, which gives a path to a psl file of mapped transcripts, `assembly`, which points to an assembly file, and `trxset`, which points to a transcript fasta file. The `assemblies` and `transcripts` keys list assembly and transcript identifiers as defined in the input assembly and transcript files described in the previous section.
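As an illustration, a `genecovr.csv` file with these four columns could look like the following; all identifiers and paths are made up, and the psl files are assumed to come from the gmap alignment step:

```
dataset,psl,assembly,trxset
foo_v1_A,results/foo_v1/A.psl,resources/assembly.v1.fasta,resources/transcripts.A.fasta
foo_v2_A,results/foo_v2/A.psl,resources/assembly.v2.fasta,resources/transcripts.A.fasta
```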
Busco needs a lineage to run, which is currently the only property needed here:

```
busco:
  lineage: viridiplantae_odb10
  ids: ["foo_v2", "foo_v1"]
```
Jellyfish will count kmers of given sizes. The configuration section lists what kmer sizes to use:

```
jellyfish:
  kmer: [21]
  ids: ["foo_v2", "foo_v1"]
```
Quast will calculate quality metrics of an assembly:

```
quast:
  ids: ["foo_v2", "foo_v1"]
```
kraken2 assigns taxonomic ids to sequences and is used for contamination screening. Analyses are performed in parallel over windows (default size 10kb). The results show the distribution of taxids over windows:

```
kraken2:
  ids: ["foo_v2"]
  db: /path/to/Kraken2/database
  window_size: 20000
  npartitions: 50
```
MultiQC doesn't have a configuration section per se. Instead, it will collect and plot results for the following applications:
- busco
- jellyfish
- quast
- kraken2
Plot configurations can be tweaked in a MultiQC configuration file, `multiqc_config.yaml`.
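As a small example of such tweaks (only two of MultiQC's many configuration options are shown, and the title is made up), `multiqc_config.yaml` might contain:

```
# Illustrative MultiQC settings; see the MultiQC documentation for the
# full set of options.
title: "Assembly evaluation report"
module_order:
  - quast
  - busco
```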
Every rule has a corresponding configuration entry, with keywords for `options` and `envmodules`. Java rules have an additional keyword `java_options`. To modify rule configuration for a key, add the corresponding keyword in the `rules` section under the rule name. For instance, to change `options` for `jellyfish_count_chunk`, add
```
rules:
  jellyfish_count_chunk:
    options: --size 20G --canonical
```
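Similarly, the `envmodules` keyword takes a list of environment modules to load for a rule; the module names below are hypothetical and site-specific:

```
rules:
  jellyfish_count_chunk:
    envmodules:
      - bioinfo-tools
      - jellyfish
```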
Test cases are in the subfolder `.test`. They are automatically executed via continuous integration with GitHub Actions. To run the tests, cd to `.test` and issue
```
snakemake --use-conda --conda-frontend mamba \
    --show-failed-logs --cores 2 --conda-cleanup-pkgs cache \
    -s ../workflow/Snakefile \
    --wrapper-prefix file://$(pwd)/../workflow/wrappers
```
Once the test run has finished, create a report and view it:
```
snakemake --cores 1 -s ../workflow/Snakefile \
    --wrapper-prefix file://$(pwd)/../workflow/wrappers \
    --report report.html
```