Skip to content

Database updating tutorial: adding genomes

Nick Youngblut edited this page May 23, 2022 · 10 revisions

This is a tutorial for updating >= Struo2-generated custom database.

Please first read the README for instructions on general setup of Struo2.

Setup

This tutorial will use the following:

Database

This tutorial assumes that you've created your custom databases in the ./data/ directory.

# Create a directory to hold all of the necessary Struo2 data files
# The directory location can be anywhere on your file system
OUTDIR=./data/
mkdir -p $OUTDIR

Reference genomes

These are the genomes that you will add to the existing databases.

wget --directory-prefix $OUTDIR http://ftp.tue.mpg.de/ebio/projects/struo2/dev_data/genomes/GTDBr95_n5.tar.gz
tar -pzxvf $OUTDIR/GTDBr95_n5.tar.gz --directory $OUTDIR

Samples table

This table lists all of the reference genomes. See this example table.

Note: you only need the following columns in the table (all others are ignored):

  • samples_col: 'ncbi_organism_name'
  • accession_col: 'accession'
  • fasta_file_path_col: 'fasta_file_path'
  • taxID_col: 'gtdb_taxid'
  • taxonomy_col: 'gtdb_taxonomy'

Config

The snakemake pipeline config is a bit different for updating a database versus generating a new database.

Below is an example config:

#-- email notifications of pipeline success/failure (use "Skip" to deactivate) --#
email: None

#-- databases to update --#
# Replace "Create" with "Skip" to skip creation of any of these
# Note that braken relies on the kraken2 database
databases:
  kraken2: Create
  bracken: Create
  genes: Skip
  humann3_bowtie2: Skip
  humann3_diamond: Skip

#-- Input --#
#--- If just a set of gene sequences to add ---#
# If you have nucleotide/amino-acid gene sequences formatted for humann
# If translate = True, missing nuc or AA seqs will be (rev)translated from the other, else seqs not used
new_genes:   # <-- skipping for this tutorial, since we are just adding new genomes
  amino_acid: Skip #data/UniRef50/genome_reps_filtered.faa.gz
  nucleotide: Skip #data/UniRef50/genome_reps_filtered.fna.gz
  metadata: Skip #data/genome_reps_filtered.txt.gz
  translate: False

#--- If a set of genomes to add ---#
# file listing samples and associated data
samples_file: data/GTDBr95_n10/GTDBr95_n5.tsv

## column names in samples table
samples_col: 'ncbi_organism_name'
accession_col: 'accession'
fasta_file_path_col: 'fasta_file_path'
taxID_col: 'gtdb_taxid'          # or 'ncbi_species_taxid'
taxonomy_col: 'gtdb_taxonomy'    # or 'ncbi_taxonomy' 

# Saved databases that will be updated
kraken2_db:
  library:  tests/output_GTDBr95_n10/kraken2/library/
  taxonomy: tests/output_GTDBr95_n10/kraken2/taxonomy/
genes_db:
  genes:
    mmseqs_db:  tests/output_GTDBr95_n10/genes/genes_db.tar.gz
    amino_acid: tests/output_GTDBr95_n10/genes/genome_reps_filtered.faa.gz
    nucleotide: tests/output_GTDBr95_n10/genes/genome_reps_filtered.fna.gz
    metadata:   tests/output_GTDBr95_n10/genes/genome_reps_filtered.txt.gz
  cluster:
    mmseqs_db:  tests/output_GTDBr95_n10/genes/cluster/clusters_db.tar.gz    
humann_db:
  query:
    hits: tests/output_GTDBr95_n10/humann3/annotation_hits.gz
  cluster:
    reps: tests/output_GTDBr95_n10/genes/cluster/clusters_reps.faa.gz
    membership: tests/output_GTDBr95_n10/genes/cluster/clusters_membership.tsv.gz

#-- Output --#
# output location
output_dir: tests/output_GTDBr95_n10-n5/

# Name of UniRef clustering (uniref90 or uniref50)
## "uniref90" highly recommended
uniref_name: uniref50
# Name of the humann3 diamond database to create
## This must match naming allowed by humann3
dmnd_name: uniref50_201901.dmnd   # UniRef90 is recommended
# Index mapping UniRef90 clusters to UniRef50 (saves time vs re-annotating)
## Skip if annotating with UniRef50
cluster_idx: data/uniref50-90.pkl

# temporary file directory (your username will be added automatically)
tmp_dir: tmp/db_update_tmp/

#-- if custom NCBI/GTDB taxdump files, "Skip" if standard NCBI taxdump --#
# Used for kraken taxonomy & metaphlan
names_dmp: data/taxdump/names.dmp
nodes_dmp: data/taxdump/nodes.dmp

#-- keep intermediate files required for re-creating DBs (eg., w/ more genomes) --#
# If "True", the intermediate files are saved to `output_dir`
# Else, the intermediate files are temporarily stored in `temp_folder`
keep_intermediate: True

#-- software parameters --#
# `vsearch_per_genome` = per-genome gene clustering
# for humann3, use either mmseqs or diamond (mmseqs gets priority if neither skipped)
# for humann3::mmseqs_search::run, --num-iterations must be >=2
params:
  ionice: -c 3
  bracken:
    build_kmer: 35
    build_read_lens:
      - 100
      - 150
  genes:
    prodigal: ""
    vsearch_per_genome: --id 0.97 --strand both --qmask none --fasta_width 0
    mmseqs_cluster_update: --min-seq-id 0.9 -c 0.8 -s 4.0    
  humann3:
    batches: 2
    filter_existing: --min-pident 0  # any existing genes w/ < cutoff with be re-queried
    mmseqs_search:
      db: data/mmseqs2/uniref50   # UniRef90 is recommended!
      index: -s 6
      run: -e 1e-3 --max-accept 1 --max-seqs 100 --num-iterations 2 --start-sens 1 --sens-steps 3 -s 6
    diamond:
      db: Skip #data/uniref50_201901.dmnd   # UniRef90 is recommended!
      run: --evalue 1e-3 --query-cover 80 --id 90 --max-target-seqs 1 --block-size 4 --index-chunks 2
    propagate_annotations: --min-cov 80 --min-pident 90

#-- snakemake pipeline --#
pipeline:
  snakemake_folder: ./
  script_folder: ./bin/scripts/
  name: Struo2_db-update
  config: update

See here for general notes about the config, regardless of database creation or updating.

Notes specific to database updating:

  • kraken2_db:, genes_db:, and humann_db: specify the locations for the existing database files
  • WARNING: use a different output_dir: other than where the existing databases are located; otherwise, the database files may be over-written!

Pipeline run

See the snakemake docs for general instructions.

First, a dry run:

snakemake --use-conda -j -Fqn

Now, an actual run with 4 cores:

snakemake --use-conda -j 2 -F

See the README for running snakemake on a cluster (recommended).

Output

See the README for details on the output.

A good quick sanity check is to compare the size of the updated databases versus the size of the original database files. The updated file sizes should be larger.

Clone this wiki locally