Database updating tutorial: adding genes
This is a tutorial on updating a Struo2-generated custom database.
Please first read the README for instructions on general setup of Struo2.
This tutorial will use the following:
- A small set of genes to add to the existing custom databases
- GTDB taxonomy
  - NCBI taxonomy is an alternative
- The databases generated from the database generation tutorial
  - See the ftp site for custom GTDB databases
This tutorial assumes that you've created your custom databases in the ./data/
directory.
```bash
# set the location of the database files
OUTDIR=./data/
mkdir -p $OUTDIR
```
These are the genes that you will add to the existing databases.
```bash
wget --directory-prefix $OUTDIR http://ftp.tue.mpg.de/ebio/projects/struo2/dev_data/genes/UniRef50_n5.tar.gz
tar -pzxvf $OUTDIR/UniRef50_n5.tar.gz --directory $OUTDIR
```
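After extracting, it can help to quickly verify the gene files before adding them to the database. A minimal Python sketch (the `count_fasta_records` helper and the commented-out example path are illustrative, not part of Struo2):

```python
import gzip

def count_fasta_records(path):
    """Count sequences in a (possibly gzipped) FASTA file by counting '>' header lines."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as fh:
        return sum(1 for line in fh if line.startswith(">"))

# Example (assumes the tutorial's OUTDIR=./data/):
# print(count_fasta_records("data/UniRef50/genome_reps_filtered.fna.gz"))
```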
You need either the amino acid or nucleotide version of the genes.
Nucleotide is better, since it can be translated to amino acid via `mmseqs translatenucs`.
If only amino acid sequences are provided, they will be reverse-translated via `mmseqs translateaa`.
Plass and linclust can be used to create a gene set from your metagenomes, which you could then add to the existing custom GTDB databases.
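To illustrate what the nucleotide-to-amino-acid translation step does, here is a minimal sketch using the standard codon table (illustration only; the pipeline itself uses `mmseqs translatenucs`, which also handles frames, ambiguous bases, etc.):

```python
from itertools import product

# Standard codon table, built from the conventional T/C/A/G layout
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W"   # TTT ... TGG
         "LLLLPPPPHHQQRRRR"   # CTT ... CGG
         "IIIMTTTTNNKKSSRR"   # ATT ... AGG
         "VVVVAAAADDEEGGGG")  # GTT ... GGG
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

def translate(nuc):
    """Translate a nucleotide CDS to amino acids (stop codons become '*')."""
    nuc = nuc.upper()
    return "".join(CODON_TABLE[nuc[i:i+3]] for i in range(0, len(nuc) - 2, 3))

print(translate("ATGGCT"))  # -> MA
```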
You also need a metadata table for the new gene sequences. The following columns are required:
- `seq_uuid`
  - A UUID is recommended, but any unique ID should work
- `seq_orig_name`
  - The original name/annotation of the gene
- `domain`
  - Taxonomic level
  - All values can be blank/empty
- `phylum`
  - Taxonomic level
  - All values can be blank/empty
- `class`
  - Taxonomic level
  - All values can be blank/empty
- `order`
  - Taxonomic level
  - All values can be blank/empty
- `family`
  - Taxonomic level
  - All values can be blank/empty
- `genus`
  - Taxonomic level
- `species`
  - Taxonomic level
- `taxid`
  - All values can be blank/empty
- `genome_name`
  - Genome origin of the gene (e.g., assembly accession), if available
  - All values can be blank/empty
- `genome_length_bp`
  - Total assembly length (base pairs) of the genome-of-origin
  - Useful for normalization
  - All values can be blank/empty
Note: the order of the columns in the table does not matter.
No samples table is needed, since only genes will be added.
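As a sketch, a minimal metadata table could be generated like this (the gene names and output path are hypothetical; taxonomy fields are left blank here, which the format allows for most columns):

```python
import csv
import gzip
import uuid

COLUMNS = ["seq_uuid", "seq_orig_name", "domain", "phylum", "class", "order",
           "family", "genus", "species", "taxid", "genome_name", "genome_length_bp"]

def make_metadata_rows(gene_names):
    """One row per gene: a fresh UUID, the original name, and blank optional fields."""
    for name in gene_names:
        row = dict.fromkeys(COLUMNS, "")
        row["seq_uuid"] = str(uuid.uuid4())
        row["seq_orig_name"] = name
        yield row

# Hypothetical gene names; write a gzipped tab-delimited table
with gzip.open("new_genes_metadata.txt.gz", "wt", newline="") as fh:
    writer = csv.DictWriter(fh, fieldnames=COLUMNS, delimiter="\t")
    writer.writeheader()
    writer.writerows(make_metadata_rows(["geneA", "geneB"]))
```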
The snakemake pipeline config is a bit different for updating a database versus generating a new database.
Below is an example config:
```yaml
#-- email notifications of pipeline success/failure (use "Skip" to deactivate) --#
email: None

#-- databases to update --#
# Replace "Create" with "Skip" to skip creation of any of these
# Skipping kraken2/bracken, since those are just genome-based, and we are adding genes to the database
databases:
  kraken2: Create           # <-- automatically skipped if no new genomes are provided
  bracken: Create           # <-- automatically skipped if no new genomes are provided
  genes: Create
  humann3_bowtie2: Create
  humann3_diamond: Create

#-- Input --#
#--- If just a set of gene sequences to add ---#
# If you have nucleotide/amino-acid gene sequences formatted for humann
# If translate = True, missing nuc or AA seqs will be (rev)translated from the other, else those seqs are not used
new_genes:
  amino_acid: data/UniRef50/genome_reps_filtered.faa.gz
  nucleotide: data/UniRef50/genome_reps_filtered.fna.gz
  metadata: data/genome_reps_filtered.txt.gz
  translate: True

#--- If a set of genomes to add ---#
# file listing samples and associated data
samples_file: Skip                        # <-- Not needed, since no new genomes
## column names in samples table
samples_col: 'ncbi_organism_name'         # <-- Not used, since no new genomes
accession_col: 'accession'                # <-- Not used, since no new genomes
fasta_file_path_col: 'fasta_file_path'    # <-- Not used, since no new genomes
taxID_col: 'gtdb_taxid'                   # <-- Not used, since no new genomes
taxonomy_col: 'gtdb_taxonomy'             # <-- Not used, since no new genomes

# Saved databases that will be updated
kraken2_db:
  library: tests/output/GTDBr95_n10/kraken2/library/
  taxonomy: tests/output/GTDBr95_n10/kraken2/taxonomy/
genes_db:
  genes:
    mmseqs_db: tests/output/GTDBr95_n10/genes/genes_db.tar.gz
    amino_acid: tests/output/GTDBr95_n10/genes/genome_reps_filtered.faa.gz
    nucleotide: tests/output/GTDBr95_n10/genes/genome_reps_filtered.fna.gz
    metadata: tests/output/GTDBr95_n10/genes/genome_reps_filtered.txt.gz
  cluster:
    mmseqs_db: tests/output/GTDBr95_n10/genes/cluster/clusters_db.tar.gz
humann_db:
  query:
    hits: tests/output/GTDBr95_n10/humann3/annotation_hits.gz
  cluster:
    reps: tests/output/GTDBr95_n10/genes/cluster/clusters_reps.faa.gz
    membership: tests/output/GTDBr95_n10/genes/cluster/clusters_membership.tsv.gz

#-- Output --#
# output location
output_dir: tests/output/GTDBr95_n10-n5/

# Name of UniRef clustering (uniref90 or uniref50)
## "uniref90" highly recommended (but takes longer)!
uniref_name: uniref50
# Name of the humann3 diamond database to create
## This must match naming allowed by humann3
dmnd_name: uniref50_201901.dmnd   # UniRef90 is recommended
# Index mapping UniRef90 clusters to UniRef50 (saves time vs re-annotating)
## Skip if annotating with UniRef50
cluster_idx: data/uniref50-90.pkl

# temporary file directory (your username will be added automatically)
tmp_dir: tmp/db_update_tmp/

#-- if custom NCBI/GTDB taxdump files, "Skip" if standard NCBI taxdump --#
# Used for kraken taxonomy & metaphlan
names_dmp: data/taxdump/names.dmp
nodes_dmp: data/taxdump/nodes.dmp

#-- keep intermediate files required for re-creating DBs (e.g., w/ more genomes) --#
# If "True", the intermediate files are saved to `output_dir`
# Else, the intermediate files are temporarily stored in `temp_folder`
keep_intermediate: True

#-- software parameters --#
# `vsearch_per_genome` = per-genome gene clustering
# for humann3, use either mmseqs or diamond (mmseqs gets priority if neither is skipped)
# for humann3::mmseqs_search::run, --num-iterations must be >= 2
params:
  ionice: -c 3
  bracken:
    build_kmer: 35
    build_read_lens:
      - 100
      - 150
  genes:
    prodigal: ""
    vsearch_per_genome: --id 0.97 --strand both --qmask none --fasta_width 0
    mmseqs_cluster_update: --min-seq-id 0.9 -c 0.8 -s 4.0
  humann3:
    batches: 2
    filter_existing: --min-pident 0   # any existing genes w/ pident < cutoff will be re-queried
    mmseqs_search:
      db: data/UniRef90/uniref90
      index: -s 6
      run: -e 1e-3 --max-accept 1 --max-seqs 100 --num-iterations 2 --start-sens 1 --sens-steps 3 -s 6
    diamond:
      db: Skip   #data/uniref90_ec-filtered/uniref90_ec_filt_201901.dmnd
      run: --evalue 1e-3 --query-cover 80 --id 90 --max-target-seqs 1 --block-size 4 --index-chunks 2
    propagate_annotations: --min-cov 80 --min-pident 90

#-- snakemake pipeline --#
pipeline:
  snakemake_folder: ./
  script_folder: ./bin/scripts/
  name: Struo2_db-update
  config: update
```
See here for general notes about the config, regardless of database creation or updating.
- `kraken2_db:`, `genes_db:`, and `humann_db:` specify the locations of the existing database files
  - See the database generation tutorial for how to generate the files
- WARNING: use a different `output_dir:` than where the existing databases are located; otherwise, the database files may be overwritten!
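One way to guard against accidental overwriting is a small pre-flight check comparing the resolved paths before running (a hypothetical helper, not part of Struo2):

```python
import os

def assert_distinct_output_dir(output_dir, existing_db_dirs):
    """Raise if output_dir resolves to any directory holding the existing databases."""
    out = os.path.realpath(output_dir)
    for d in existing_db_dirs:
        if out == os.path.realpath(d):
            raise ValueError(f"output_dir {output_dir!r} would overwrite {d!r}")

# Paths from the example config: distinct, so this passes silently
assert_distinct_output_dir("tests/output/GTDBr95_n10-n5/",
                           ["tests/output/GTDBr95_n10/"])
```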
See the snakemake docs for general instructions.
First, a dry run:
```bash
snakemake --use-conda -j -Fqn
```
Now, an actual run with 2 cores:
```bash
snakemake --use-conda -j 2 -F
```
See the README for running snakemake on a cluster (recommended).
See the README for details on the output.
A good quick sanity check is to compare the size of the updated databases versus the size of the original database files. The updated file sizes should be larger.
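That check can be scripted; a sketch with a hypothetical helper (the commented-out paths follow the example config):

```python
import os

def updated_is_larger(original, updated):
    """Return True if the updated database file is larger than the original."""
    return os.path.getsize(updated) > os.path.getsize(original)

# Example (paths from the tutorial's config):
# updated_is_larger("tests/output/GTDBr95_n10/genes/genome_reps_filtered.faa.gz",
#                   "tests/output/GTDBr95_n10-n5/genes/genome_reps_filtered.faa.gz")
```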