Skip to content

🚦🧬 Binning Metagenomic Contigs via Composition, Coverage and Assembly Graphs (NAL mod with Pyrodigal and PyHMMSearch)

License

Notifications You must be signed in to change notification settings

new-atlantis-labs/metacoag-nal

 
 

Repository files navigation

MetaCoAG logo MetaCoAG logo

MetaCoAG: Binning Metagenomic Contigs via Composition, Coverage and Assembly Graphs

Modified by Josh L. Espinoza for Pyrodigal and PyHMMSearch support.

DOI DOI

MetaCoAG is a metagenomic contig binning tool that makes use of the connectivity information found in assembly graphs, apart from the composition and coverage information. MetaCoAG makes use of single-copy marker genes along with a graph matching technique and a label propagation technique to bin contigs. MetaCoAG is tested on contigs obtained from next-generation sequencing (NGS) data. Currently, MetaCoAG supports contigs assembled using metaSPAdes and MEGAHIT, and recently we have added support for Flye assemblies (has not been tested extensively).

For detailed instructions on installation, usage and visualisation, please refer to the documentation hosted at Read the Docs.

For original implementation, please refer to https://github/metagentools/MetaCoAG

Dependencies

MetaCoAG installation requires Python 3.7 or above. You will need the following python dependencies to run MetaCoAG and related support scripts. The latest tested versions of the dependencies are listed as well.

Modifications

MetaCoAG NAL mod allows for precomputed gene predictions, cleans up intermediate files, and uses the following tools to scan for single-copy marker genes. These tools have been tested on the following versions.

Installing NewAtlantis Labs modified MetaCoAG

Install via pip

pip install https://github.com/jolespin/metacoag-nal/releases/download/1.2.3rc2/metacoag-1.2.3rc2.tar.gz

Test the setup

After setting up, run the following command to ensure that metacoag is working.

metacoag --help
Usage: metacoag [OPTIONS]

  MetaCoAG: Binning Metagenomic Contigs via Composition, Coverage and Assembly
  Graphs

Options:
  --assembler [spades|megahit|megahitc|flye|custom]
                                  name of the assembler used. (Supports
                                  SPAdes, MEGAHIT and Flye)  [required]
  --graph PATH                    path to the assembly graph file  [required]
  --contigs PATH                  path to the contigs file  [required]
  --abundance PATH                path to the abundance file  [required]
  --paths PATH                    path to the contigs.paths (metaSPAdes) or
                                  assembly.info (metaFlye) file
  --output PATH                   path to the output folder  [required]
  --proteins TEXT                 path to proteins fasta in Prodigal format
  --proteins_to_contigs TEXT      Tab-delimited file mapping proteins to
                                  contigs [id_protein]<tab>[id_contig].  If
                                  --proteins and provided without
                                  --proteins_to_contigs then id_protein
                                  formatting is assumed to be
                                  [id_contig]_[gene_number]
  --hmm TEXT                      path to marker.hmm[.gz] file.  [default:
                                  auxiliary/marker.hmm.gz]
  --hmm_marker_field [accession|name]
                                  HMM reference type  [default: accession]
  --score_type [full|domain]      Score reflects full sequence or domain only
                                  [default: full]
  --threshold_method [gathering|noise|trusted|e]
                                  Cutoff threshold method  [default: trusted]
  --evalue FLOAT                  E-value threshold .  [default: 10]
  --min_length INTEGER            minimum length of contigs to consider for
                                  binning.  [default: 1000]
  --p_intra FLOAT RANGE           minimum probability of an edge matching to
                                  assign to the same bin.  [default: 0.1;
                                  0<=x<=1]
  --p_inter FLOAT RANGE           maximum probability of an edge matching to
                                  create a new bin.  [default: 0.01; 0<=x<=1]
  --d_limit INTEGER               distance limit for contig matching.
                                  [default: 20]
  --depth INTEGER                 depth to consider for label propagation.
                                  [default: 10]
  --n_mg INTEGER                  total number of marker genes.  [default:
                                  108]
  --bin_mg_threshold FLOAT RANGE  minimum fraction of marker genes that should
                                  be present in a bin.  [default: 0.33333;
                                  0<=x<=1]
  --min_bin_size INTEGER          minimum size of a bin to output in base
                                  pairs (bp).  [default: 200000]
  --delimiter [,|;|$'\t'|" "]     delimiter for output results. Supports a
                                  comma (,), a semicolon (;), a tab ($'\t'), a
                                  space (" ") and a pipe (|) .  [default: ,]
  --nthreads INTEGER              number of threads to use.  [default: 8]
  -v, --version                   Show the version and exit.
  --help                          Show this message and exit.

Example Usage

metaSPAdes

metacoag --assembler spades --graph /path/to/graph_file.gfa --contigs /path/to/contigs.fasta --paths /path/to/paths_file.paths --abundance /path/to/abundance.tsv --output /path/to/output_folder

MEGAHIT

metacoag --assembler megahit --graph /path/to/graph_file.gfa --contigs /path/to/contigs.fasta --abundance /path/to/abundance.tsv --output /path/to/output_folder

metaFlye

metacoag --assembler flye --graph /path/to/assembly_graph.gfa --contigs /path/to/assembly.fasta --paths /path/to/assembly_info.txt --abundance /path/to/abundance.tsv --output /path/to/output_folder

Adding precomputed gene predictions

metacoag --assembler megahit --graph /path/to/graph_file.gfa --contigs /path/to/contigs.fasta --abundance /path/to/abundance.tsv --output /path/to/output_folder --proteins /path/to/proteins.faa

Adding precomputed gene predictions that do not follow Prodigal formatting

metacoag --assembler megahit --graph /path/to/graph_file.gfa --contigs /path/to/contigs.fasta --abundance /path/to/abundance.tsv --output /path/to/output_folder --proteins /path/to/proteins.faa --proteins_to_contigs /path/to/proteins_to_contigs.tsv

Citation

If you use MetaCoAG in your work, please cite the following publications.

Mallawaarachchi, V., Lin, Y. (2022). MetaCoAG: Binning Metagenomic Contigs via Composition, Coverage and Assembly Graphs. In: Pe'er, I. (eds) Research in Computational Molecular Biology. RECOMB 2022. Lecture Notes in Computer Science(), vol 13278. Springer, Cham. DOI: https://doi.org/10.1007/978-3-031-04749-7_5

Vijini Mallawaarachchi and Yu Lin. Accurate Binning of Metagenomic Contigs Using Composition, Coverage, and Assembly Graphs. Journal of Computational Biology 2022 29:12, 1357-1376. DOI: https://doi.org/10.1089/cmb.2022.0262

Funding

Initial implementation of MetaCoAG is funded by an Essential Open Source Software for Science Grant from the Chan Zuckerberg Initiative.

About

🚦🧬 Binning Metagenomic Contigs via Composition, Coverage and Assembly Graphs (NAL mod with Pyrodigal and PyHMMSearch)

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Python 93.9%
  • Euphoria 4.3%
  • Shell 1.8%