Database updating tutorial: adding genes
This is a tutorial on updating a Struo2-generated custom database.
Please first read the README for instructions on general setup of Struo2.
This tutorial will use the following:
- A small set of genes to add to the existing custom databases
- GTDB taxonomy
  - NCBI taxonomy is an alternative
- The databases generated from the database generation tutorial
  - See the ftp site for custom GTDB databases
This tutorial assumes that you've created your custom databases in the ./data/
directory.
```bash
# set the location of the database files
OUTDIR=./data/
mkdir -p $OUTDIR
```
These are the genes that you will add to the existing databases.
```bash
wget --directory-prefix $OUTDIR http://ftp.tue.mpg.de/ebio/projects/struo2/dev_data/genes/UniRef50_n5.tar.gz
tar -pzxvf $OUTDIR/UniRef50_n5.tar.gz --directory $OUTDIR
```
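After extracting, it can help to quickly verify the gene files before adding them to the database. A minimal Python sketch (the `count_fasta_records` helper and the commented-out example path are illustrative, not part of Struo2):

```python
import gzip

def count_fasta_records(path):
    """Count sequences in a (possibly gzipped) FASTA file by counting '>' header lines."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as fh:
        return sum(1 for line in fh if line.startswith(">"))

# Example (assumes the tutorial's OUTDIR=./data/):
# print(count_fasta_records("data/UniRef50/genome_reps_filtered.fna.gz"))
```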
You need either the amino acid or nucleotide version of the genes.
Nucleotide is better, since it can be translated to amino acid via `mmseqs translatenucs`.
If only amino acid sequences are provided, they will be reverse-translated via `mmseqs translateaa`.
Plass and linclust can be used to create a gene set from your metagenomes, which you could then add to the existing custom GTDB databases.
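To illustrate what the nucleotide-to-amino-acid translation step does, here is a minimal sketch using the standard codon table (illustration only; the pipeline itself uses `mmseqs translatenucs`, which also handles frames, ambiguous bases, etc.):

```python
from itertools import product

# Standard codon table, built from the conventional T/C/A/G layout
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W"   # TTT ... TGG
         "LLLLPPPPHHQQRRRR"   # CTT ... CGG
         "IIIMTTTTNNKKSSRR"   # ATT ... AGG
         "VVVVAAAADDEEGGGG")  # GTT ... GGG
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

def translate(nuc):
    """Translate a nucleotide CDS to amino acids (stop codons become '*')."""
    nuc = nuc.upper()
    return "".join(CODON_TABLE[nuc[i:i+3]] for i in range(0, len(nuc) - 2, 3))

print(translate("ATGGCT"))  # -> MA
```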
You also need a metadata table for the new gene sequences. The following columns are required:
- `seq_uuid`
  - A UUID is recommended, but any unique ID should work
- `seq_orig_name`
  - The original name/annotation of the gene
- `domain`
  - Taxonomic level
  - All values can be blank/empty
- `phylum`
  - Taxonomic level
  - All values can be blank/empty
- `class`
  - Taxonomic level
  - All values can be blank/empty
- `order`
  - Taxonomic level
  - All values can be blank/empty
- `family`
  - Taxonomic level
  - All values can be blank/empty
- `genus`
  - Taxonomic level
- `species`
  - Taxonomic level
- `taxid`
  - All values can be blank/empty
- `genome_name`
  - Genome origin of the gene (e.g., assembly accession), if available
  - All values can be blank/empty
- `genome_length_bp`
  - Total assembly length (base pairs) of the genome-of-origin
  - Useful for normalization
  - All values can be blank/empty
Note: the order of the columns in the table does not matter.
No samples table is needed, since only genes will be added.
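As a sketch, a minimal metadata table could be generated like this (the gene names and output path are hypothetical; taxonomy fields are left blank here, which the format allows for most columns):

```python
import csv
import gzip
import uuid

COLUMNS = ["seq_uuid", "seq_orig_name", "domain", "phylum", "class", "order",
           "family", "genus", "species", "taxid", "genome_name", "genome_length_bp"]

def make_metadata_rows(gene_names):
    """One row per gene: a fresh UUID, the original name, and blank optional fields."""
    for name in gene_names:
        row = dict.fromkeys(COLUMNS, "")
        row["seq_uuid"] = str(uuid.uuid4())
        row["seq_orig_name"] = name
        yield row

# Hypothetical gene names; write a gzipped tab-delimited table
with gzip.open("new_genes_metadata.txt.gz", "wt", newline="") as fh:
    writer = csv.DictWriter(fh, fieldnames=COLUMNS, delimiter="\t")
    writer.writeheader()
    writer.writerows(make_metadata_rows(["geneA", "geneB"]))
```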
The snakemake pipeline config is a bit different for updating a database versus generating a new database.
Below is an example config:
```yaml
#-- email notifications of pipeline success/failure (use "Skip" to deactivate) --#
email: None

#-- databases to update --#
# Replace "Create" with "Skip" to skip creation of any of these
# Skipping kraken2/bracken, since those are just genome-based, and we are adding genes to the database
databases:
  kraken2: Create           # <-- automatically skipped if no new genomes are provided
  bracken: Create           # <-- automatically skipped if no new genomes are provided
  genes: Create
  humann3_bowtie2: Create
  humann3_diamond: Create

#-- Input --#
#--- If just a set of gene sequences to add ---#
# If you have nucleotide/amino-acid gene sequences formatted for humann
# If translate = True, missing nuc or AA seqs will be (rev)translated from the other, else those seqs are not used
new_genes:
  amino_acid: data/UniRef50/genome_reps_filtered.faa.gz
  nucleotide: data/UniRef50/genome_reps_filtered.fna.gz
  metadata: data/genome_reps_filtered.txt.gz
  translate: True

#--- If a set of genomes to add ---#
# file listing samples and associated data
samples_file: Skip                        # <-- Not needed, since no new genomes
## column names in samples table
samples_col: 'ncbi_organism_name'         # <-- Not used, since no new genomes
accession_col: 'accession'                # <-- Not used, since no new genomes
fasta_file_path_col: 'fasta_file_path'    # <-- Not used, since no new genomes
taxID_col: 'gtdb_taxid'                   # <-- Not used, since no new genomes
taxonomy_col: 'gtdb_taxonomy'             # <-- Not used, since no new genomes

# Saved databases that will be updated
kraken2_db:
  library: tests/output/GTDBr95_n10/kraken2/library/
  taxonomy: tests/output/GTDBr95_n10/kraken2/taxonomy/
genes_db:
  genes:
    mmseqs_db: tests/output/GTDBr95_n10/genes/genes_db.tar.gz
    amino_acid: tests/output/GTDBr95_n10/genes/genome_reps_filtered.faa.gz
    nucleotide: tests/output/GTDBr95_n10/genes/genome_reps_filtered.fna.gz
    metadata: tests/output/GTDBr95_n10/genes/genome_reps_filtered.txt.gz
  cluster:
    mmseqs_db: tests/output/GTDBr95_n10/genes/cluster/clusters_db.tar.gz
humann_db:
  query:
    hits: tests/output/GTDBr95_n10/humann3/annotation_hits.gz
  cluster:
    reps: tests/output/GTDBr95_n10/genes/cluster/clusters_reps.faa.gz
    membership: tests/output/GTDBr95_n10/genes/cluster/clusters_membership.tsv.gz

#-- Output --#
# output location
output_dir: tests/output/GTDBr95_n10-n5/

# Name of UniRef clustering (uniref90 or uniref50)
## "uniref90" highly recommended (but takes longer)!
uniref_name: uniref50
# Name of the humann3 diamond database to create
## This must match naming allowed by humann3
dmnd_name: uniref50_201901.dmnd   # UniRef90 is recommended
# Index mapping UniRef90 clusters to UniRef50 (saves time vs re-annotating)
## Skip if annotating with UniRef50
cluster_idx: data/uniref50-90.pkl

# temporary file directory (your username will be added automatically)
tmp_dir: tmp/db_update_tmp/

#-- if custom NCBI/GTDB taxdump files, "Skip" if standard NCBI taxdump --#
# Used for kraken taxonomy & metaphlan
names_dmp: data/taxdump/names.dmp
nodes_dmp: data/taxdump/nodes.dmp

#-- keep intermediate files required for re-creating DBs (e.g., w/ more genomes) --#
# If "True", the intermediate files are saved to `output_dir`
# Else, the intermediate files are temporarily stored in `temp_folder`
keep_intermediate: True

#-- software parameters --#
# `vsearch_per_genome` = per-genome gene clustering
# for humann3, use either mmseqs or diamond (mmseqs gets priority if neither is skipped)
# for humann3::mmseqs_search::run, --num-iterations must be >= 2
params:
  ionice: -c 3
  bracken:
    build_kmer: 35
    build_read_lens:
      - 100
      - 150
  genes:
    prodigal: ""
    vsearch_per_genome: --id 0.97 --strand both --qmask none --fasta_width 0
    mmseqs_cluster_update: --min-seq-id 0.9 -c 0.8 -s 4.0
  humann3:
    batches: 2
    filter_existing: --min-pident 0   # any existing genes w/ pident < cutoff will be re-queried
    mmseqs_search:
      db: data/UniRef90/uniref90
      index: -s 6
      run: -e 1e-3 --max-accept 1 --max-seqs 100 --num-iterations 2 --start-sens 1 --sens-steps 3 -s 6
    diamond:
      db: Skip   #data/uniref90_ec-filtered/uniref90_ec_filt_201901.dmnd
      run: --evalue 1e-3 --query-cover 80 --id 90 --max-target-seqs 1 --block-size 4 --index-chunks 2
    propagate_annotations: --min-cov 80 --min-pident 90

#-- snakemake pipeline --#
pipeline:
  snakemake_folder: ./
  script_folder: ./bin/scripts/
  name: Struo2_db-update
  config: update
```
See here for general notes about the config, regardless of database creation or updating.
- `kraken2_db:`, `genes_db:`, and `humann_db:` specify the locations of the existing database files
  - See the database generation tutorial for how to generate the files
- WARNING: use a different `output_dir:` than where the existing databases are located; otherwise, the database files may be overwritten!
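One way to guard against accidental overwriting is a small pre-flight check comparing the resolved paths before running (a hypothetical helper, not part of Struo2):

```python
import os

def assert_distinct_output_dir(output_dir, existing_db_dirs):
    """Raise if output_dir resolves to any directory holding the existing databases."""
    out = os.path.realpath(output_dir)
    for d in existing_db_dirs:
        if out == os.path.realpath(d):
            raise ValueError(f"output_dir {output_dir!r} would overwrite {d!r}")

# Paths from the example config: distinct, so this passes silently
assert_distinct_output_dir("tests/output/GTDBr95_n10-n5/",
                           ["tests/output/GTDBr95_n10/"])
```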
See the snakemake docs for general instructions.
First, a dry run:
```bash
snakemake --use-conda -j -Fqn
```
Now, an actual run with 2 cores:
```bash
snakemake --use-conda -j 2 -F
```
See the README for running snakemake on a cluster (recommended).
See the README for details on the output.
A good quick sanity check is to compare the size of the updated databases versus the size of the original database files. The updated file sizes should be larger.
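That check can be scripted; a sketch with a hypothetical helper (the commented-out paths follow the example config):

```python
import os

def updated_is_larger(original, updated):
    """Return True if the updated database file is larger than the original."""
    return os.path.getsize(updated) > os.path.getsize(original)

# Example (paths from the tutorial's config):
# updated_is_larger("tests/output/GTDBr95_n10/genes/genome_reps_filtered.faa.gz",
#                   "tests/output/GTDBr95_n10-n5/genes/genome_reps_filtered.faa.gz")
```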