Skip to content

Database generation tutorial

Nick Youngblut edited this page May 23, 2022 · 16 revisions

This is a tutorial for creating custom databases.

Please first read the README for instructions on general setup of Struo2.

Setup

This tutorial will use the following:

  • a small set of reference genomes
  • GTDB taxonomy
  • UniRef90

See the README for alternatives.

Struo2 installation

Software dependency installation via conda is described in the README.

Database download

First, set a location where all of the database files will be downloaded:

# Create a directory to hold all of the necessary Struo2 data files
# The directory location can be anywhere on your file system
OUTDIR=./data/
mkdir -p $OUTDIR

taxdump files

These are custom taxdump files for the GTDB taxonomy.

wget --directory-prefix $OUTDIR http://ftp.tue.mpg.de/ebio/projects/struo2/GTDB_release95/taxdump/taxdump.tar.gz
tar -pzxvf $OUTDIR/taxdump.tar.gz --directory $OUTDIR

UniRef

These are only needed for creating HUMAnN3 databases.

Annotation databases

You can use any one of the following:

  • mmseqs2 (recommended)
    • UniRef50
    • UniRef90 (recommended)
  • DIAMOND
    • UniRef50
    • UniRef90
    • UniRef90_EC-filtered
      • See the HUMAnN3 for more info about this database.

All databases are from the 2019.01 UniRef release (matches HUMAnN3).

To download the mmseqs2 UniRef90 database:

wget --directory-prefix $OUTDIR http://ftp.tue.mpg.de/ebio/projects/struo2/install/uniref_2019.01/UniRef90_mmseqs.tar.gz
tar -pzxvf $OUTDIR/UniRef90_mmseqs.tar.gz --directory $OUTDIR
# the process is similar for uniref50

Note: you may want to use screen or use a smaller database, since the uncompression will take a while

UniRef50-90 Index

This is used to map UniRef90 IDs to UniRef50 IDs. See the README for a description.

wget --directory-prefix $OUTDIR http://ftp.tue.mpg.de/ebio/projects/struo2/install/uniref_2019.01/uniref50-90.pkl

Getting reference genomes

These genomes will be used for database construction.

The genome assembly fasta files will be used directly for Kraken2/Bracken.

For the "genes" database, genes will be called for each genome via PROKKA, and then clustered via mmseqs2.

For HUMAnN3, the genes will be annotated by default against UniRef via either mmseqs2 or diamond.

A collection of 10 genomes

wget --directory-prefix $OUTDIR http://ftp.tue.mpg.de/ebio/projects/struo2/dev_data/genomes/GTDBr95_n10.tar.gz
tar -pzxvf $OUTDIR/GTDBr95_n10.tar.gz --directory $OUTDIR

Running the pipeline

Samples table

This table lists all of the reference genomes. See this example table.

Note: you only need the following columns in the table (all others are ignored):

  • samples_col: 'ncbi_organism_name'
  • accession_col: 'accession'
  • fasta_file_path_col: 'fasta_file_path'
  • taxID_col: 'gtdb_taxid'
  • taxonomy_col: 'gtdb_taxonomy'

Config file

The config.yaml file sets the Struo2 pipeline parameters.

Below is a config.yaml that will work with our dataset.

#-- email notifications of pipeline success/failure (use "Skip" to deactivate) --#
email: None

#-- I/O --#
# file listing samples and associated data
samples_file: data/GTDBr95_n10/GTDBr95_n10.tsv

## column names in samples table
samples_col: 'ncbi_organism_name'
accession_col: 'accession'
fasta_file_path_col: 'fasta_file_path'
taxID_col: 'gtdb_taxid'          # or 'ncbi_species_taxid'
taxonomy_col: 'gtdb_taxonomy'    # or 'ncbi_taxonomy'

# output location
output_dir: tests/output/GTDBr95_n10/

# temporary file directory (your username will be added automatically)
tmp_dir: /ebio/abt3_scratch/

#-- databases to create --#
# Replace "Create" with "Skip" to skip creation of any of these
# Note that braken relies on the kraken2 database
databases:
  kraken2: Create
  bracken: Create
  genes: Create
  humann3_bowtie2: Create
  humann3_diamond: Create

# Name of UniRef clustering (uniref90 or uniref50)
## "uniref90" highly recommended
uniref_name: uniref90
# Name of the humann3 diamond database to be created
## This must match the naming allowed by humann3 (eg., "uniref90_201901.dmnd")
dmnd_name: uniref90_201901.dmnd
# Index mapping UniRef90 clusters to UniRef50 (saves time vs re-annotating)
## This is skipped if annotating with UniRef50 instead of UniRef90
cluster_idx: data/uniref50-90.pkl

#-- if custom NCBI/GTDB taxdump files, "Skip" if standard NCBI taxdump --#
# Used for kraken taxonomy & metaphlan
names_dmp: data/taxdump/names.dmp
nodes_dmp: data/taxdump/nodes.dmp

#-- keep intermediate files required for re-creating DBs (eg., w/ more genomes) --#
# If "True", the intermediate files are saved to `output_dir`
# Else, the intermediate files are temporarily stored in `temp_folder`
keep_intermediate: True

#-- software parameters --#
# `vsearch_per_genome` = per-genome gene clustering
# use "Skip" at the start of any param to skip (if possible to skip)
# for humann3, use either mmseqs or diamond (mmseqs gets priority if neither skipped)
# for humann3::mmseqs_search::run, --num-iterations must be >=2
params:
  ionice: -c 3
  bracken:
    build_kmer: 35
    build_read_lens:
      - 100
      - 150
  genes:
    prodigal: ""
    vsearch_per_genome: --id 0.97 --strand both --qmask none --fasta_width 0
    mmseqs_cluster: --min-seq-id 0.9 -c 0.8
    mmseqs_cluster_method: linclust     # or "cluster", which is slower
  humann3:
    batches: 2
    mmseqs_search:
      db: data/UniRef90/uniref90
      index: -s 6
      run: -e 1e-3 --max-accept 1 --max-seqs 100 --num-iterations 2 --start-sens 1 --sens-steps 3 -s 6
    diamond:
      db: Skip #data/uniref90_ec-filtered/uniref90_ec_filt_201901.dmnd
      run: --evalue 1e-3 --query-cover 80 --id 50 --max-target-seqs 1 --block-size 4 --index-chunks 2
    propagate_annotations: --min-cov 80 --min-pident 90

#-- snakemake pipeline --#
pipeline:
  snakemake_folder: ./
  script_folder: ./bin/scripts/
  name: Struo2_db-create
  config: create

Some notes about the config

  • Add an email if you want notifications when the pipeline completes/fails
  • The parameters under databases: determine which databases to create (e.g., you can skip HUMAnN3 database construction).
  • batches: determines the number of gene batches for parallel mmseqs2 or diamond jobs for annotating all genes.
    • The gene sequence fasta is shuffled, split, and each split is annotated in a separate job.
  • The "Skip" for db: under mmseqs_search will skip gene annotation with mmseqs2 and instead use diamond.
  • propagate_annotations: propagates UniRef90 to UniRef50, if applicable.
  • You shouldn't need to edit the pipeline: parameters.

Pipeline run

See the snakemake docs for general instructions.

First, a dry run:

snakemake --use-conda -j -Fqn

Now, an actual run with 4 cores:

snakemake --use-conda -j 2 -F

See the README for running snakemake on a cluster (recommended).

Output

See the README for details on the output.