# GTDB database download and usage
This tutorial describes how to download and utilize the custom databases generated by Struo2 from various GTDB releases.
See the main README for a list of databases.
For this tutorial, we will be downloading and using the GTDB r202 metagenome profiling databases. All files can be found at the Struo2 ftp site.
You can use the `database_download.py` utility script in `./util_scripts/` to download pre-built custom Struo2 databases.
An example of downloading GTDB-r202 Kraken2/Bracken databases (and associated files):
```bash
# requires the `requests` and `bs4` python packages
# using 4 threads in this example
./util_scripts/database_download.py -t 4 -r 202 -d kraken2 metadata taxdump phylogeny -- custom_dbs
```
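The download script works by scraping the HTTP directory listings on the ftp site. The sketch below illustrates the general approach on a hard-coded example listing; it uses only the Python standard library (the real script uses `requests` and `bs4`, and its internals may differ).

```python
# Sketch: collecting file links from an HTTP directory index page,
# similar in spirit to what database_download.py does.
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect href targets from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href":
                    self.links.append(value)

# Example directory listing, as typically served for a database directory
index_html = """
<html><body>
<a href="../">../</a>
<a href="hash.k2d">hash.k2d</a>
<a href="opts.k2d">opts.k2d</a>
<a href="taxo.k2d">taxo.k2d</a>
<a href="k2d.md5">k2d.md5</a>
</body></html>
"""

parser = LinkCollector()
parser.feed(index_html)
# Keep only files, skipping the parent-directory link
files = [l for l in parser.links if not l.endswith("/")]
print(files)  # ['hash.k2d', 'opts.k2d', 'taxo.k2d', 'k2d.md5']
```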
## Taxdump files

The taxdump files are used for creating/updating the Kraken2 database. By default, the pipeline uses custom GTDB taxIDs generated with gtdb_to_taxdump from the GTDB taxonomy. To download the custom taxdump files:
```bash
# $DBDIR = directory where the database files will be saved
wget --directory-prefix $DBDIR http://ftp.tue.mpg.de/ebio/projects/struo2/GTDB_release202/taxdump/taxdump.tar.gz
wget --directory-prefix $DBDIR http://ftp.tue.mpg.de/ebio/projects/struo2/GTDB_release202/taxdump/taxdump.tar.gz.md5
md5sum --check $DBDIR/taxdump.tar.gz.md5
tar -pzxvf $DBDIR/taxdump.tar.gz --directory $DBDIR
```
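The extracted archive follows the NCBI taxdump convention (`nodes.dmp`, `names.dmp`, etc., with fields separated by `\t|\t` and lines terminated by `\t|`). A minimal sketch of reading `nodes.dmp`, using toy records with made-up taxIDs in place of the real file:

```python
# Toy nodes.dmp records: taxID | parent taxID | rank
# (real GTDB taxdump files are large; the records here are invented)
toy_nodes = [
    "1\t|\t1\t|\tno rank\t|",
    "2\t|\t1\t|\tsuperkingdom\t|",
    "3\t|\t2\t|\tphylum\t|",
]

parent = {}   # taxID -> parent taxID
rank = {}     # taxID -> rank
for line in toy_nodes:
    fields = line.rstrip("\t|").split("\t|\t")
    taxid, parent_id, rnk = fields[0], fields[1], fields[2]
    parent[taxid] = parent_id
    rank[taxid] = rnk

# Walk from a node back to the root (the root is its own parent)
lineage = ["3"]
while parent[lineage[-1]] != lineage[-1]:
    lineage.append(parent[lineage[-1]])
print(lineage)  # ['3', '2', '1']
```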
## UniRef databases

UniRef databases are used for annotating genes. UniRef IDs are required for HUMAnN3. You do not need any UniRef databases if you are only creating Kraken2/Bracken databases.
### mmseqs UniRef database(s)

See the mmseqs2 wiki for details on database downloading.
```bash
# you must have mmseqs2 installed
# Example of downloading UniRef50 (WARNING: slow!)
mmseqs databases --remove-tmp-files 1 --threads 4 UniRef50 $DBDIR/mmseqs2/UniRef50 data/mmseqs2_TMP
```
### HUMAnN3 UniRef DIAMOND database(s)

See the "Download a translated search database" section of the HUMAnN3 docs.
```bash
# Example download of the UniRef50 DIAMOND database
wget --directory-prefix $DBDIR http://huttenhower.sph.harvard.edu/humann_data/uniprot/uniref_annotated/uniref50_annotated_v201901.tar.gz
tar -pzxvf $DBDIR/uniref50_annotated_v201901.tar.gz --directory $DBDIR
```
### UniRef90 to UniRef50 mapping (optional, but recommended)

This mapping file is used to map annotations from UniRef90 clusters to UniRef50 clusters. It allows you to annotate genes against UniRef90 only and then map those annotations to UniRef50 cluster IDs, instead of annotating against both UniRef90 and UniRef50, which would require substantially more querying of genes against UniRef.
```bash
wget --directory-prefix $DBDIR http://ftp.tue.mpg.de/ebio/projects/struo2/install/uniref_2019.01/uniref50-90.pkl
```
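A short sketch of using the mapping file. I am assuming here that `uniref50-90.pkl` deserializes to a dict-like mapping of UniRef90 cluster IDs to UniRef50 cluster IDs (the IDs below are made up); inspect the actual object after loading the real file.

```python
import io
import pickle

# Stand-in for the downloaded uniref50-90.pkl (toy mapping, invented IDs)
toy_mapping = {
    "UniRef90_A0A009E7L4": "UniRef50_A0A009E7L4",
    "UniRef90_Q9ZHI2": "UniRef50_P0A9P0",
}
buf = io.BytesIO(pickle.dumps(toy_mapping))

# With the real file: uniref90_to_50 = pickle.load(open("uniref50-90.pkl", "rb"))
uniref90_to_50 = pickle.load(buf)

# Map a list of UniRef90 annotations down to UniRef50 cluster IDs
annots = ["UniRef90_A0A009E7L4", "UniRef90_Q9ZHI2"]
mapped = [uniref90_to_50[a] for a in annots]
print(mapped)  # ['UniRef50_A0A009E7L4', 'UniRef50_P0A9P0']
```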
## Kraken2 database files

```bash
wget --directory-prefix $DBDIR http://ftp.tue.mpg.de/ebio/projects/struo2/GTDB_release202/kraken2/database.kraken
wget --directory-prefix $DBDIR http://ftp.tue.mpg.de/ebio/projects/struo2/GTDB_release202/kraken2/database.kraken.md5
md5sum --check $DBDIR/database.kraken.md5
wget --directory-prefix $DBDIR http://ftp.tue.mpg.de/ebio/projects/struo2/GTDB_release202/kraken2/hash.k2d
wget --directory-prefix $DBDIR http://ftp.tue.mpg.de/ebio/projects/struo2/GTDB_release202/kraken2/opts.k2d
wget --directory-prefix $DBDIR http://ftp.tue.mpg.de/ebio/projects/struo2/GTDB_release202/kraken2/taxo.k2d
wget --directory-prefix $DBDIR http://ftp.tue.mpg.de/ebio/projects/struo2/GTDB_release202/kraken2/k2d.md5
md5sum --check $DBDIR/k2d.md5
```
## Bracken database files

You only need the read length that matches your reads (e.g., "100mers" for 100 bp reads).

```bash
# 100 bp
wget --directory-prefix $DBDIR http://ftp.tue.mpg.de/ebio/projects/struo2/GTDB_release202/bracken/database100mers.kmer_distrib
wget --directory-prefix $DBDIR http://ftp.tue.mpg.de/ebio/projects/struo2/GTDB_release202/bracken/database100mers.kraken
# 150 bp
wget --directory-prefix $DBDIR http://ftp.tue.mpg.de/ebio/projects/struo2/GTDB_release202/bracken/database150mers.kmer_distrib
wget --directory-prefix $DBDIR http://ftp.tue.mpg.de/ebio/projects/struo2/GTDB_release202/bracken/database150mers.kraken
# md5sum check
wget --directory-prefix $DBDIR http://ftp.tue.mpg.de/ebio/projects/struo2/GTDB_release202/bracken/database_mers.md5
md5sum --check $DBDIR/database_mers.md5
```
## HUMAnN3 database files

You can choose between UniRef50 and UniRef90. The latter will be more sensitive, but HUMAnN3 will take substantially longer to complete.
```bash
# bowtie2 database
wget --directory-prefix $DBDIR http://ftp.tue.mpg.de/ebio/projects/struo2/GTDB_release202/humann3/uniref50/all_genes_annot.1.bt2l
wget --directory-prefix $DBDIR http://ftp.tue.mpg.de/ebio/projects/struo2/GTDB_release202/humann3/uniref50/all_genes_annot.2.bt2l
wget --directory-prefix $DBDIR http://ftp.tue.mpg.de/ebio/projects/struo2/GTDB_release202/humann3/uniref50/all_genes_annot.3.bt2l
wget --directory-prefix $DBDIR http://ftp.tue.mpg.de/ebio/projects/struo2/GTDB_release202/humann3/uniref50/all_genes_annot.4.bt2l
wget --directory-prefix $DBDIR http://ftp.tue.mpg.de/ebio/projects/struo2/GTDB_release202/humann3/uniref50/all_genes_annot.rev.1.bt2l
wget --directory-prefix $DBDIR http://ftp.tue.mpg.de/ebio/projects/struo2/GTDB_release202/humann3/uniref50/all_genes_annot.rev.2.bt2l
wget --directory-prefix $DBDIR http://ftp.tue.mpg.de/ebio/projects/struo2/GTDB_release202/humann3/uniref50/bt2l.md5
md5sum --check $DBDIR/bt2l.md5
```
```bash
# DIAMOND database
wget --directory-prefix $DBDIR http://ftp.tue.mpg.de/ebio/projects/struo2/GTDB_release202/humann3/uniref50/protein_database/uniref50_201901.dmnd
wget --directory-prefix $DBDIR http://ftp.tue.mpg.de/ebio/projects/struo2/GTDB_release202/humann3/uniref50/protein_database/uniref50_201901.md5
md5sum --check $DBDIR/uniref50_201901.md5
# HUMAnN3 expects the DIAMOND database in its own subdirectory
mkdir -p $DBDIR/protein_database
mv $DBDIR/uniref50_201901.dmnd $DBDIR/protein_database/uniref50_201901.dmnd
```
## Example usage

```bash
# assuming paired-end input reads
kraken2 --db $DBDIR --report sample.kreport --output kraken2_output --paired {input.read1} {input.read2}
# assuming 150 bp reads
bracken -r 150 -d $DBDIR -i sample.kreport -o bracken_output
```
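A quick way to inspect the Bracken results is to sort taxa by estimated abundance. The columns below follow the standard Bracken output format (`name`, `taxonomy_id`, `taxonomy_lvl`, `kraken_assigned_reads`, `added_reads`, `new_est_reads`, `fraction_total_reads`); the taxa and counts are invented for illustration.

```python
import csv
import io

# Toy Bracken output table (tab-delimited); with the real file,
# replace io.StringIO(...) with open("bracken_output")
toy_bracken = """\
name\ttaxonomy_id\ttaxonomy_lvl\tkraken_assigned_reads\tadded_reads\tnew_est_reads\tfraction_total_reads
Bacteroides fragilis\t817\tS\t8000\t200\t8200\t0.410
Escherichia coli\t562\tS\t9000\t500\t9500\t0.475
"""

rows = list(csv.DictReader(io.StringIO(toy_bracken), delimiter="\t"))
# Sort species by estimated abundance, most abundant first
rows.sort(key=lambda r: float(r["fraction_total_reads"]), reverse=True)
print([(r["name"], r["fraction_total_reads"]) for r in rows])
```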
```bash
humann3 \
  --nucleotide-database $DBDIR \
  --protein-database $DBDIR/protein_database \
  --input-format fastq \
  --output-basename hm3 \
  --input {input.reads} \
  --output output
```
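With `--output-basename hm3`, the main HUMAnN3 output includes a gene-family table (`hm3_genefamilies.tsv`). It is stratified: each community-level row may be followed by `|`-separated per-taxon rows, which you often want to filter out before downstream analysis. A sketch using an invented toy table (real feature IDs and abundances will differ):

```python
import io

# Toy HUMAnN3 gene-family table; with the real file,
# replace io.StringIO(...) with open("output/hm3_genefamilies.tsv")
toy_table = """\
# Gene Family\thm3_Abundance-RPKs
UNMAPPED\t120.0
UniRef50_P0A9P0\t80.0
UniRef50_P0A9P0|g__Escherichia.s__Escherichia_coli\t60.0
UniRef50_P0A9P0|unclassified\t20.0
"""

community = {}  # keep only community-level (unstratified) rows
for line in io.StringIO(toy_table):
    if line.startswith("#"):
        continue
    feature, abund = line.rstrip("\n").split("\t")
    if "|" not in feature:  # "|" marks a taxon-stratified row
        community[feature] = float(abund)
print(community)  # {'UNMAPPED': 120.0, 'UniRef50_P0A9P0': 80.0}
```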