Skip to content

Latest commit

 

History

History
424 lines (336 loc) · 17.1 KB

README.md

File metadata and controls

424 lines (336 loc) · 17.1 KB

metashot/aweMAGs is an automated workflow for large-scale microbial eukaryotic MAGs (metagenome assembled genomes) analysis. Although aweMAGs has been specifically designed for eukaryotic genomes, it can also work on prokaryotic or viral MAGs.

Reproducibility is guaranteed by Nextflow and versioned Docker images.

Note: This workflow is not intended for classifying and annotating "finished" SAGs (single amplified genomes) or MAGs. The "finished" category is reserved for genomes that can be assembled with extensive manual review and editing.

MetaShot Home

Please cite

Davide Albanese, Claudia Coleine, Laura Selbmann, Claudio Donati. aweMAGs: a fully automated workflow for quality assessment and annotation of eukaryotic genomes from metagenomes. bioRxiv 2023.02.08.527609; doi: https://doi.org/10.1101/2023.02.08.527609

Table of contents

Main features

aweMAGs is a container-enabled Nextflow pipeline for an automated quality assessment, dereplication, gene discovery, taxonomic and functional annotation of eukariotic genomes/MAGs. It can run out-of-the-box on any platform that supports Nextflow, Docker or Singularity, including computing clusters or batch infrastructures in the cloud. Main features:

  • Completeness, contamination estimates and basic assambly statistics using BUSCO1 v5 and BBTools;
  • Dereplication with dRep2;
  • Multiple sequence alignment (MSA) of BUSCO single copy genes (SGCs) with MUSCLE3 v5; phylogenetic tree inference with RAxML4;
  • Fast taxonomic classification using MMseqs2 taxonomy5;
  • Sensitive, reference-based gene discovery with MetaEuk6;
  • Fast functional annotation using EggNOG-mapper7;
  • Internal transcriber spacers extraction using ITSx8;
  • Automatic download of databases;
  • Summary tables for genome quality, taxonomy predictions and functional annotations.

Software included:

Software Version
BUSCO 5.1.3
BBTools 38.79
dRep 2.6.2
MUSCLE 5.1
RAxML 8.2.12
MMSeq2 13
MetaEuk 5
EggNOG-mapper 2.1.5
ITSx 1.1.2

Quick start

Install Docker (or Singulariry) and Nextflow (see Dependencies)

Example 1

Full pipeline, auto lineage mode (eukaryotes)

nextflow run metashot/awemags -r 1.0.0 \
  --genomes '*.fa' \
  --outdir results

Using this command, the reference databases (BUSCO, MMseqs2 and eggNOG) will be downloaded automatically from the Internet.

Example 2

  1. Manually download the MMseqs2 Swiss-Prot database for the taxonomic classification;
  2. Manually download the BUSCO fungal database;
  3. Extract fungal genomes only, skip gene prediction and EggNOG annotation.
# MMseq2 database download (Swiss-Prot)
alias mmseqs='docker run --rm --user $(id -u):$(id -g) -v $PWD:/wd -w /wd metashot/mmseqs2:13-1 mmseqs'
#alias mmseqs='singularity exec -B $PWD docker://metashot/mmseqs2:13-1 mmseqs'
mkdir mmseqs2_db
mmseqs databases UniProtKB/Swiss-Prot mmseqs2_db/swissprot tmp

# BUSCO database download (fungi)
wget https://busco-data.ezlab.org/v5/data/lineages/fungi_odb10.2021-06-28.tar.gz \
  --no-check-certificate -O fungi_odb10.tar.gz
tar -zxf fungi_odb10.tar.gz

# Run aweMAGs
nextflow run metashot/awemags -r 1.0.0 \
  --genomes '*.fa' \
  --outdir results \
  --lineage ./fungi_odb10 \
  --mmseqs_db mmseqs2_db/swissprot \
  --skip_genepred \
  --skip_eggnog

Documentation

Options and default values are decladed in nextflow.config.

Input and output

Options

  • --genomes: input genomes/bins in FASTA format (default "data/*.fa").
  • --outdir: output directory (default "results").

Quality assessment

Quality assessment is performed for each input genome/metagenomic bin using BUSCO v5 and the BBTools's statswrapper program.

Options

  • --busco_db: BUSCO database path for offline mode. (default 'none': download from Internet).
  • --lineage: BUSCO lineage or lineage mode (default "auto-euk"). Accepted values are:
    • "auto-euk", "auto-prok", "auto" (auto lineage mode)
    • a dataset name (e.g. "fungi" or "fungi_odb10") or
    • a path (e.g. "/home/user/fungi_odb10") .

Outputs

  • quality.tsv: summary of genomes quality (including completeness, contamination, N50 ...).
  • busco: main BUSCO's original output files.
  • statswrapper: original BBTools's statswrapper output.

Genomes filtering

This step discards sequences that do not meet the filtering (completeness/contamination) criteria.

Options

  • --skip_filtering: skip genomes filtering.
  • --min_completeness: discard sequences with less than min_completeness% completeness (default 50). Completeness is computed as the fraction of complete BUSCOs genes (single-copy or duplicate) found for a particular lineage.
  • --max_contamination: discard sequences with more than max_contamination% contamination (default 10). Contamination (better "redundancy" in this case) is computed as # duplicated BUSCOs / # complete BUSCOs.

Outputs

  • filtered: this folder contains the genomes filtered according to --min_completeness and --max_contamination options.

Dereplication

In this step, input genomes will be clustered using a defined average nucleotide identity (ANI) threshold (by default 99% ANI, parameter --ani_thr 0.99). For each cluster, the genome with the higher score is selected as representative. The score will be computed using the following formula:

score = completeness - 5 x contamination + 0.5 x log(N50)

If genomes filtering was not performed (--skip_filtering parameter), all the input genomes will be analyzed and the following formula will be used:

score =  log(size)

Options

  • --skip_dereplication: skip the dereplication step.
  • --ani_thr: ANI threshold for dereplication (> 0.90, default 0.99).
  • --min_overlap: minimum overlap fraction between genomes (default 0.3).

Outputs

  • derep_info.tsv: dereplication summary, a TSV file containing three columns:
    • Genome: genome filename
    • Cluster: the cluster ID (from 0 to N-1)
    • Representative: is this genome the cluster representative?
  • representatives: this folder contains the representative genomes
  • drep: original data tables, figures and log of dRep.

Single-copy genes (SCG) MSA

When the BUSCO auto-lineage search is deactivaded (e.g. --lineage fungi and not auto, auto-prok or auto-euk) the single-copy core genes predicted by BUSCO will be aligned using MUSCLE v5.

Options

  • --skip_msa: skip BUSCO SCG MSA

Outputs

  • sgc/faa: this folder contains the single-copy genes (one SCG per FASTA file)
  • sgc/msa: contains the MSA for each SCG

Phylogenetic tree inference (requires MSA)

For each SCG MSA, columns represented in <50% of the genomes or columns with less than 25% or more than 95% amino acid consensus are trimmed in order to remove sites with weak phylogenetic signals9. To reduce total number of columns selected for tree inference, the alignment was further trimmed by randomly selecting floor( MAX_NCOLS / N_SGC ) columns, where MAX_NCOLS is the maximum number of columns for the final MSA (parameter --max_ncols, default 5000) and N_SGC is the total number of BUSCO SGC for the specified lineage. Finally, the trimmed MSAs are concatenated into a single MSA and the phylogenomic tree is inferred using RAxML. Two modes are available for RAxML:

  • default mode: construct a maximum likelihood (ML) tree. This mode runs the default RAxML tree search algorithm10 and perform multiple searches for the best tree (10 distinct randomized MP trees by default, see the parameter --raxml_nsearch). The following RAxML parameters will be used:

    -f d -m PROTGAMMAWAG -N [RAXML_NSEARCH] -p 42
  • rbs mode: assess the robustness of inference and construct a ML tree. This mode runs the rapid bootstrapping full analysis11. The bootstrap convergence criterion or the number of bootstrap searches can be specified with the parameter --raxml_nboot. The following parameters will be used:

    -f a -m PROTGAMMAWAG -N [RAXML_NBOOT] -p 42 -x 43

Options

  • --skip_tree skip phylogenetic tree inference.
  • --max_ncols: maximum number of MSA columns (taken randomly) for tree inference (default 5000).
  • --seed_cols: random seed for the random selection of the columns (default 42).
  • --raxml_mode: RAxML mode , "default": default RAxML tree search algorithm or "rbs": rapid bootstrapping full analysis.
  • --raxml_nsearch: "default" mode only: number of inferences on the original alignment using distinct randomized MP trees (if ­10 is specified, RAxML will compute 10 distinct ML trees starting from 10 distinct randomized maximum parsimony starting trees) (default 10).
  • --raxml_nboot: "rbs" mode only. Bootstrap convergence criterion or number of bootstrap searches (see -I and -#/-N options in RAxML) (default "autoMRE").

Outputs

  • tree/trim_msa: this folder contains the trimmed MSA for each SCG.
  • tree/concat.msa.faa: the concatenated MSA used for the tree inference.
  • tree/RAxML_bestTree.run: the best-scoring ML tree of a thorough ML analysis.
  • tree/RAxML_bipartitions.run: the best-scoring ML tree with the BS support values (from 0 to 100, when --raxml_mode rbs).

Taxonomic classification

This step takes advantage of the MMseqs2 easy-taxonomy workflow to predict the taxonomy of each input genome. Before classification, for each genome contigs are concatenated into a single pseudochromosome using the sequence NNNNNCATTCCATTCATTAATTAATTAATGAATGAATGNNNNN as separator, which provides a stop codon and a start site in all six reading frames12.

This step requires a MMseqs2 database augmented with taxonomic information (see MMseqs2 database).

Options

  • --skip_taxonomy: skip the taxonomic classification

Outputs

  • taxonomy.tsv: a single TSV file containing the classification summary
  • mmseqs: directory containing the original MMseqs2 taxonomy files (including the Kraken style reports)

Gene discovery/prediction

Gene prediction is performed using the MetaEuk easy-predict workflow.

This step requires a MMseqs2 database (see MMseqs2 database).

Options

  • --skip_genepred: skip the gene prediction

Outputs

  • metaeuk: directory containing the original MetaEuk files (including the predicted protein sequences and the GFF files)

The MMseqs2 database

The available databases are listed at https://github.com/soedinglab/mmseqs2/wiki#downloading-databases (default "UniProtKB/Swiss-Prot").

Options

  • --mmseqs_db: MMseqs2 database path (used by MMseqs2 and MetaEuk) (default "none": download from Internet). See the mmseqs_db_name parameter.
  • --mmseqs_db_name: MMseqs2 database name, used when mmseqs_db_path = "none".

EggNOG

Translated CDS are functionally annotated using EggNOG-mapper against the eggNOG Orthologous Groups (OGs) database 13 v5.0. The eggNOG database integrates functional annotations collected from several sources, including Gene Ontology (GO) terms, KEGG functional orthologs and COG categories.

Options

  • --skip_eggnog: skip EggNOG-mapper annotation.
  • --eggnog_db: eggNOG v5.0 database dir. (default "none": download from Internet).
  • --eggnog_db_mem: store the eggNOG sqlite DB into memory (~44GB memory required) increasing the annotation speed.

Outputs

  • eggnog_tables: this folder contains the count matrix for each transferred annotation.
  • eggnog: directory containing the original EggNOG-mapper files.

ITSx

Extract ITS1, 5.8S and ITS2 regions, as well as full-length ITS sequences.

Options

  • --skip_itsx: skip ITSx.

Outputs

  • itsx: ITSx output for each input genome.

Resource limits

  • --max_cpus: maximum number of CPUs for each process (default 8)
  • --max_memory: maximum memory for each process (default 240.GB)
  • --max_time: maximum time for each process (default 240.h)

See also System requirements.

System requirements

Please refer to System requirements for the complete list of system requirements options.

Footnotes

  1. Manni, Mosè, Matthew R. Berkeley, Mathieu Seppey, Felipe A. Simão, and Evgeny M. Zdobnov. 2021. "BUSCO Update: Novel and Streamlined Workflows along with Broader and Deeper Phylogenetic Coverage for Scoring of Eukaryotic, Prokaryotic, and Viral Genomes." Molecular Biology and Evolution 38 (10): 4647–54.

  2. Olm, Matthew R., Christopher T. Brown, Brandon Brooks, and Jillian F. Banfield. 2017. "dRep: A Tool for Fast and Accurate Genomic Comparisons That Enables Improved Genome Recovery from Metagenomes through de-Replication." The ISME Journal 11 (12): 2864–68.

  3. Edgar, Robert C. 2022. "High-Accuracy Alignment Ensembles Enable Unbiased Assessments of Sequence Homology and Phylogeny." bioRxiv. https://doi.org/10.1101/2021.06.20.449169.

  4. Stamatakis, Alexandros. 2014. "RAxML Version 8: A Tool for Phylogenetic Analysis and Post-Analysis of Large Phylogenies." Bioinformatics 30 (9): 1312–13.

  5. Steinegger, Martin, and Johannes Söding. "MMseqs2 Enables Sensitive Protein Sequence Searching for the Analysis of Massive Data Sets." Nature Biotechnology 35 (11): 1026–28.

  6. Karin, Eli Levy, Milot Mirdita, and Johannes Söding. 2020. "MetaEuk—sensitive, High-Throughput Gene Discovery, and Annotation for Large-Scale Eukaryotic Metagenomics." Microbiome. https://doi.org/10.1186/s40168-020-00808-x.

  7. Cantalapiedra, Carlos P., Ana Hernández-Plaza, Ivica Letunic, Peer Bork, and Jaime Huerta-Cepas. 2021. "eggNOG-Mapper v2: Functional Annotation,Orthology Assignments, and Domain Prediction at the Metagenomic Scale." Molecular Biology and Evolution 38 (12): 5825–29.

  8. Bengtsson-Palme, Johan, Martin Ryberg, Martin Hartmann, Sara Branco, Zheng Wang, Anna Godhe, Pierre De Wit, et al. 2013. "Improved Software Detection and Extraction of ITS1 and ITS2 from Ribosomal ITS Sequences of Fungi and Other Eukaryotes for Analysis of Environmental Sequencing Data." Methods in Ecology and Evolution. https://doi.org/10.1111/2041-210x.12073.

  9. Rinke, C., Chuvochina, M., Mussig, A.J. et al. A standardized archaeal taxonomy for the Genome Taxonomy Database. Nat Microbiol 6, 946–959 (2021). https://doi.org/10.1038/s41564-021-00918-8

  10. Stamatakis A., Blagojevic F., Nikolopoulos D.S. et al. Exploring New Search Algorithms and Hardware for Phylogenetics: RAxML Meets the IBM Cell. J VLSI Sign Process Syst Sign Im 48, 271–286 (2007). Link.

  11. Stamatakis A., Hoover P., Rougemont J. A Rapid Bootstrap Algorithm for the RAxML Web Servers. Systematic Biology, Volume 57, Issue 5, October 2008, Pages 758–771, Link.

  12. Tettelin, Hervé, et al. Genome analysis of multiple pathogenic isolates of Streptococcus agalactiae: implications for the microbial "pan-genome". Proceedings of the National Academy of Sciences 102.39 (2005): 13950-13955. https://www.pnas.org/doi/10.1073/pnas.0506758102

  13. Huerta-Cepas, Jaime, Damian Szklarczyk, Davide Heller, Ana Hernández-Plaza, Sofia K. Forslund, Helen Cook, Daniel R. Mende, et al. "eggNOG 5.0: A Hierarchical, Functionally and Phylogenetically Annotated Orthology Resource Based on 5090 Organisms and 2502 Viruses." Nucleic Acids Research. https://doi.org/10.1093/nar/gky1085.