-
Notifications
You must be signed in to change notification settings - Fork 8
Database generation tutorial
This is a tutorial for creating custom databases.
Please first read the README for instructions on general setup of Struo2.
This tutorial will use the following:
- a small set of reference genomes
- GTDB taxonomy
- UniRef90
See the README for alternatives.
Software dependency installation via conda is described in the README.
# Create a directory to hold all of the necessary Struo2 data files
# The directory location can be anywhere on your file system
OUTDIR=./data/
mkdir -p $OUTDIR
wget --directory-prefix $OUTDIR http://ftp.tue.mpg.de/ebio/projects/struo2/GTDB_release95/taxdump/names.dmp
wget --directory-prefix $OUTDIR http://ftp.tue.mpg.de/ebio/projects/struo2/GTDB_release95/taxdump/nodes.dmp
These are only needed for creating HUMAnN3 databases.
Note: you will need humann3 installed via conda or other means
To download the humann3 uniref90 database:
humann_databases --download uniref uniref90_diamond $OUTDIR
See the README for a description.
wget --directory-prefix $OUTDIR http://ftp.tue.mpg.de/ebio/projects/struo2/install/uniref_2019.01/uniref50-90.pkl
These genomes will be used for database construction.
The genome assembly fasta files will be used directly for Kraken2/Bracken.
For the "genes" database, genes will be called for each genome via PROKKA
,
and then clustered via mmseqs2
.
For HUMAnN3, the genes will be annotated by default against UniRef via either mmseqs2
or diamond
.
wget --directory-prefix $OUTDIR http://ftp.tue.mpg.de/ebio/projects/struo2/dev_data/genomes/GTDBr95_n10.tar.gz
tar -pzxvf $OUTDIR/GTDBr95_n10.tar.gz --directory $OUTDIR
This table lists all of the reference genomes. See this example table.
Note: you only need the following columns in the table (all others are ignored):
- samples_col: 'ncbi_organism_name'
- accession_col: 'accession'
- fasta_file_path_col: 'fasta_file_path'
- taxID_col: 'gtdb_taxid'
- taxonomy_col: 'gtdb_taxonomy'
The config.yaml file sets the Struo2 pipeline parameters.
Below is a config.yaml that will work with our dataset.
#-- email notifications of pipeline success/failure (use "Skip" to deactivate) --#
email: None
#-- I/O --#
# file listing samples and associated data
samples_file: data/GTDBr95_n10/GTDBr95_n10.tsv
## column names in samples table
samples_col: 'ncbi_organism_name'
accession_col: 'accession'
fasta_file_path_col: 'fasta_file_path'
taxID_col: 'gtdb_taxid' # or 'ncbi_species_taxid'
taxonomy_col: 'gtdb_taxonomy' # or 'ncbi_taxonomy'
# output location
output_dir: tests/pipeline_output/GTDBr95_n10/
# temporary file directory (your username will be added automatically)
tmp_dir: /ebio/abt3_scratch/
#-- databases to create --#
# Replace "Create" with "Skip" to skip creation of any of these
# Note that braken relies on the kraken2 database
databases:
kraken2: Create
bracken: Create
genes: Create
humann3_bowtie2: Create
humann3_diamond: Create
# Name of UniRef clustering (uniref90 or uniref50)
## "uniref90" highly recommended
uniref_name: uniref50
# Name of the humann3 diamond database to be created
## This must match the naming allowed by humann3 (eg., "uniref90_201901.dmnd")
dmnd_name: uniref50_201901.dmnd
# Index mapping UniRef90 clusters to UniRef50 (saves time vs re-annotating)
## This is skipped if annotating with UniRef50 instead of UniRef90
cluster_idx: data/uniref50-90.pkl
#-- if custom NCBI/GTDB taxdump files, "Skip" if standard NCBI taxdump --#
# Used for kraken taxonomy & metaphlan
names_dmp: data/taxdump/names.dmp
nodes_dmp: data/taxdump/nodes.dmp
#-- keep intermediate files required for re-creating DBs (eg., w/ more genomes) --#
# If "True", the intermediate files are saved to `output_dir`
# Else, the intermediate files are temporarily stored in `temp_folder`
keep_intermediate: True
#-- software parameters --#
# `vsearch_per_genome` = per-genome gene clustering
# use "Skip" at the start of any param to skip (if possible to skip)
# for humann3, use either mmseqs or diamond (mmseqs gets priority if neither skipped)
# for humann3::mmseqs_search::run, --num-iterations must be >=2
params:
ionice: -c 3
bracken:
build_kmer: 35
build_read_lens:
- 100
- 150
genes:
prodigal: ""
vsearch_per_genome: --id 0.97 --strand both --qmask none --fasta_width 0
mmseqs_cluster: --min-seq-id 0.9 -c 0.8
mmseqs_cluster_method: linclust # or "cluster", which is slower
humann3:
batches: 2
mmseqs_search:
db: Skip #data/mmmseq2/uniref50 # UniRef90 is recommended!
index: -s 6
run: -e 1e-3 --max-accept 1 --max-seqs 100 --num-iterations 2 --start-sens 1 --sens-steps 3 -s 6
diamond:
db: data/uniref90_201901.dmnd
run: --evalue 1e-3 --query-cover 80 --id 50 --max-target-seqs 1 --block-size 4 --index-chunks 2
propagate_annotations: --min-cov 80 --min-pident 90
#-- snakemake pipeline --#
pipeline:
snakemake_folder: ./
script_folder: ./bin/scripts/
name: Struo2_db-create
config: create
- Add an email if you want notifications when the pipeline completes/fails
- The parameters under
databases:
determine which databases to create (e.g., you can skip HUMAnN3 database construction). -
batches:
determines the number of gene batches for parallel mmseqs2 or diamond jobs for annotating all genes.- The gene sequence fasta is shuffled, split, and each split is annotated in a separate job.
- The "Skip" for
db:
undermmseqs_search
will skip gene annotation with mmseqs2 and instead use diamond. -
propagate_annotations:
propagates UniRef90 to UniRef50, if applicable. - You shouldn't need to edit the
pipeline:
parameters.
See the snakemake docs for general instructions.
First, a dry run:
snakemake --use-conda -j -Fqn
Now, an actual run with 4 cores:
snakemake --use-conda -j 2 -F
See the README for running snakemake on a cluster (recommended).
See the README for details on the output.