From e24bb2a62926092a8f3afee623dcad22e8d4bf23 Mon Sep 17 00:00:00 2001 From: Mikhail Schelkunov Date: Thu, 10 Aug 2023 14:38:28 +0300 Subject: [PATCH] Update README.md --- README.md | 60 +++++++++++++++++++++++++++---------------------------- 1 file changed, 29 insertions(+), 31 deletions(-) diff --git a/README.md b/README.md index c5a5ac3..df4006d 100644 --- a/README.md +++ b/README.md @@ -1,15 +1,11 @@ **Mabs** is a genome assembly tool which optimizes parameters of genome assemblers Hifiasm and Flye.

-The core idea of Mabs is to optimize parameters of a genome assembler in order to make an assembly where **protein-coding genes** are assembled more accurately than when the assembler is run with its default parameters.

+The core idea of Mabs is to optimize parameters of a genome assembler to make an assembly where **protein-coding genes** are assembled more accurately than when the assembler is run with its default parameters.

Briefly, Mabs works as follows: 1) It makes a series of genome assemblies by Hifiasm or Flye, using different values of parameters of these programs. Mabs uses a couple of tricks to accelerate the assembly process. 2) For each genome assembly, Mabs evaluates the quality of BUSCO genes' assembly using a special metric that I call "AG". For how AG is calculated, see [calculate_AG](#calculate_ag). 3) The genome assembly with the largest AG is considered the best. -For details about the algorithm of Mabs see [a preprint on BioRxiv](https://www.biorxiv.org/content/10.1101/2022.12.19.521016v2). - -

-
If you have tried Mabs, please let me know via Issues (https://github.com/shelkmike/Mabs/issues), even if you experienced no problems. Please, write about how Mabs compares with other assemblers you tried and whether you have any recommendations for future versions of Mabs.
-
+Mabs is, on average, 3 times slower than Hifiasm or Flye, but usually produces better or equal assemblies. For details, see [a preprint on BioRxiv](https://www.biorxiv.org/content/10.1101/2022.12.19.521016v2). ## Table of Contents @@ -24,7 +20,11 @@ For details about the algorithm of Mabs see [a preprint on BioRxiv](https://www.

## Installation Mabs requires Python 3, Perl 5, GCC, Zlib-dev, Make.
-To install Mabs, download the latest version from [Releases](https://github.com/shelkmike/Mabs/releases), then extract the archive and run
+Mabs should be installed in two steps.
+1. Install Python libraries Pandas, Plotnine, SciPy. This can be done, for example, by running the following two commands one after another:
+`pip3 install --upgrade --user pip`
+`pip3 install --upgrade --user Pandas Plotnine scipy`
+2. Download the latest version of Mabs from [Releases](https://github.com/shelkmike/Mabs/releases). Extract the archive and run
`bash install.sh`

## How to use @@ -54,7 +54,7 @@ Example 2:


#### b) Mabs-flye Mabs-flye is intended for Nanopore reads and PacBio CLR reads (also known as "old PacBio reads"). Similarly to Mabs-hifiasm, Mabs-flye requires two values: -1. A path to reads, provided via options "--nanopore_reads", "--pacbio_clr_reads" or "--pacbio_hifi_reads". If you have several read datasets created by different technologies, these options can be used simultaneously. Keep in mind that if you have only HiFi reads, it's better to use Mabs-hifiasm. +1. A path to reads, provided via options "--nanopore_reads", "--pacbio_clr_reads" or "--pacbio_hifi_reads". If you have several read datasets created by different technologies, these options can be used simultaneously. Keep in mind that if you have only HiFi reads, it's usually better to use Mabs-hifiasm. 2. A BUSCO dataset, provided via options "--download_busco_dataset" or "--local_busco_dataset". For details, see "2." in the description of Mabs-hifiasm above. To see the full list of options, run @@ -70,7 +70,7 @@ Example 2:


#### c) The output of Mabs Both Mabs-hifiasm and Mabs-flye have a similar output structure. Both of them create a folder which, by default, is named "Mabs_results". The name can be changed via the "--output_folder" option. The two main files that a user may need are: -1) ./Mabs_results/mabs_logs.txt
+1) ./Mabs_results/mabs_log.txt
This file contains information on how Mabs-hifiasm or Mabs-flye run and whether they encountered any problems. 2) ./Mabs_results/The_best_assembly/assembly.fasta
These are the contigs you need. @@ -84,15 +84,15 @@ or These two commands assemble the first chromosome of Saccharomyces cerevisiae, which is approximately 200 kbp. If after the assembly finishes you see a file ./Mabs_results/The_best_assembly/assembly.fasta which is slightly larger than 200 KB, Mabs works correctly.

## calculate_AG -Besides Mabs-hifiasm and Mabs-flye, Mabs contains a third tool, named calculate_AG. Its purpose is to assess genome assembly quality. +Besides Mabs-hifiasm and Mabs-flye, Mabs contains a third tool, named **calculate_AG**. Its purpose is to assess the genome assembly quality.

calculate_AG is used internally by Mabs-hifiasm and Mabs-flye, but also can be used externally if a user wants to assess the quality of some assembly.

-The main concept in calculate_AG is "AG", which is a metric of gene assembly quality used by Mabs. "AG" is short for "the number of Accurately assembled Genes". It is calculated as a sum of the following two values:
-a) The number of genes in single-copy BUSCO orthogroups.
-b) The number of genes in true multicopy orthogroups. "True multicopy" means that there is more than one gene in these orthogroups not because of assembly errors, but because these genes are actual paralogs. In contrast, the number of genes in false multicopy orthogroups (the orthogroups where genes' duplications are because of assembly errors) is not included in AG.

+The main concept in calculate_AG is **"AG"**, which is the metric of gene assembly quality used by Mabs. "AG" is short for "the number of Accurately assembled Genes". It is calculated as the sum of the following two values:
+a) The number of genes in **single-copy BUSCO orthogroups**.
+b) The number of genes in **true multicopy orthogroups**. "True multicopy" means that there is more than one gene in these orthogroups not because of assembly errors, but because these genes are actual paralogs. In contrast, the number of genes in **false multicopy orthogroups** (the orthogroups where genes' duplications are because of assembly errors) is not included in AG.

AG, in my opinion, may be a better metric of gene assembly quality than BUSCO results, because BUSCO does not differentiate true multicopy orthogroups and false multicopy orthogroups, combining them into a single "D" category.
-A frequent cause of false multicopy orthogroups are haplotypic duplications, when two alleles of a gene are erroneously assembled as paralogs.
-calculate_AG differentiates true multicopy orthogroups from false multicopy orthogroups based on gene coverage, since if a duplication is an assembly error, the gene coverage should be decreased.
+A frequent cause of false multicopy orthogroups are **haplotypic duplications**, when two alleles of a gene are erroneously assembled as paralogs.
+calculate_AG differentiates true multicopy orthogroups from false multicopy orthogroups based on gene coverage, since if a duplication is an assembly error, the gene coverage should be **decreased**.

Basically, the larger AG is, the better the assembly is.

@@ -105,39 +105,37 @@ For more options, run
The main file produced by calculate_AG is ./AG_calculation_results/AG.txt . It contains a single number which is the AG. In addition, calculate_AG produces figures gene_coverage_distribution.svg and gene_coverage_distribution.png which look like this:
![image with a relatively bad assembly](http://mikeshelk.site/Diff/Files_for_GitHub/Mabs/a_relatively_bad_assembly.png)
-This type of diagrams is called sinaplot, see https://cran.r-project.org/web/packages/sinaplot/vignettes/SinaPlot.html . The sinaplot produced by calculate_AG helps to evaluate the assembly quality visually. While the coverage distribution of genes from single-copy orthogroups is unimodal, the coverage distribution of genes from multicopy orthogroups can be bimodal because genes that were erroneously duplicated have twice as low coverage as they should have. In the perfect assembly, the coverage distribution of genes from multicopy orthogroups is identical to the coverage distribution of genes from single-copy orthogroups. The picture above is for a rather bad assembly. Below is the picture made by calculate_AG for a better assembly of the same genome:
+This type of diagrams is called sinaplot, see https://cran.r-project.org/web/packages/sinaplot/vignettes/SinaPlot.html . The sinaplot produced by calculate_AG helps to evaluate the assembly quality visually. While the coverage distribution of genes from single-copy orthogroups is unimodal, the coverage distribution of genes from multicopy orthogroups can be bimodal because genes that were erroneously duplicated have **twice as low** coverage as they should have. In the perfect assembly, the coverage distribution of genes from multicopy orthogroups is identical to the coverage distribution of genes from single-copy orthogroups. The picture above is for a rather bad assembly. Below is the picture made by calculate_AG for a better assembly of the same genome:
![image with a relatively good assembly](http://mikeshelk.site/Diff/Files_for_GitHub/Mabs/a_relatively_good_assembly.png)


-The recommended usage of calculate_AG is to compare the quality of assemblies of a single genome made by different genome assemblers, or made by a single assembler with different parameters. Besides the value of AG (in the file ./AG_calculation_results/AG.txt), calculate_AG also provides the exact numbers of genes in single-copy orthogroups, in true multicopy orthogroups, and in false multicopy orthogroups; the corresponding values can be found at the end of the file ./AG_calculation_results/logs.txt. +The right distribution usually has fewer genes than the left (because BUSCO orthogroups are usually single-copy). However, Calculate_AG draws these two distribution such that they have the same area (but different density of points), to make a visual comparison of their shapes easier.

+The recommended usage of calculate_AG is to compare the quality of assemblies of a single genome made by different genome assemblers, or made by a single assembler with different parameters. Besides the value of AG (in the file ./AG_calculation_results/AG.txt), calculate_AG also writes the exact numbers of genes in single-copy orthogroups, in true multicopy orthogroups, and in false multicopy orthogroups; the corresponding values can be found at the end of the file ./AG_calculation_results/log.txt.

## Questions and Answers -1. Is Mabs useful?
-In my experience, Mabs-hifiasm is currently the best tool for genome assembly using PacBio HiFi reads.
-For Oxford Nanopore reads, I strongly recommend to try [NextDenovo](https://github.com/Nextomics/NextDenovo). In my experience, NextDenovo often makes better assemblies than Flye and Mabs-flye. -2. Should assemblies produced by Mabs be polished afterwards?
+1. Should assemblies produced by Mabs be polished afterwards?
The assemblies made by Mabs-hifiasm are accurate already. The assemblies made by Mabs-flye require polishing by accurate reads. "Accurate reads" are reads of Illumina, MGI, or PacBio HiFi. Good programs for polishing are, for example, [HyPo](https://github.com/kensung-lab/hypo), [POLCA](https://github.com/alekseyzimin/masurca), [Racon](https://github.com/lbcb-sci/racon). -3. How to assemble a genome using high-accuracy Nanopore reads?
+2. How to assemble a genome using high-accuracy Nanopore reads?
If you have Nanopore reads with really high accuracy (≥99%), I advise trying both Mabs-hifiasm and Mabs-flye. -4. What is the program "Modified_hifiasm" used by Mabs?
+3. What is the program "Modified_hifiasm" used by Mabs?
Modified_hifiasm is a special version of Hifiasm, where I added an option "--only-primary". With this option, Modified_hifiasm stops after creating the file with the primary assembly. Usage of Modified_hifiasm makes Mabs-hifiasm faster than when using the original Hifiasm. -5. Is it worth using Mabs if I don't expect a high number of haplotypic duplications?
+4. Is it worth using Mabs if I don't expect a high number of haplotypic duplications?
Though the primary purpose of Mabs is creation of assemblies with few haplotypic duplications, it may be useful even if you don't expect many haplotypic duplications. Since Mabs optimizes parameters of Hifiasm or Flye to maximize the gene assembly quality, in most cases it will produce assemblies better than or equal to Hifiasm or Flye. -6. Can Mabs be used to assemble metagenomes?
+5. Can Mabs be used to assemble metagenomes?
No. When evaluating which genes were assembled correctly and which were assembled incorrectly, Mabs relies on their coverage. In a metagenomic sequencing different genomes have different coverage, which makes Mabs useless. -7. Can Mabs be used to assemble haploid genomes, for example bacterial?
+6. Can Mabs be used to assemble haploid genomes, for example bacterial?
Yes. Though, I don't expect Mabs to be much better than Hifiasm and Flye for haploid genomes since haploid genome assemblies cannot have haplotypic duplications. -8. Can Mabs-hifiasm perform a trio binning assembly like Hifiasm?
+7. Can Mabs-hifiasm perform a trio binning assembly like Hifiasm?
Yes. You'll need to make "pat.yak" and "mat.yak" files as described in the readme of Hifiasm (https://github.com/chhylp123/hifiasm) and then provide them to Mabs via "--additional_hifiasm_parameters [-1 pat.yak -2 mat.yak]". -9. Should additional programs for haplotypic duplications removal (such as Purge_dups) be applied to assemblies made by Mabs?
+8. Should additional programs for haplotypic duplications removal (such as Purge_dups) be applied to assemblies made by Mabs?
In my experience, you can improve assemblies made by Mabs-flye by Purge_dups. However, in my experience, Purge_dups has detrimendal effect on assemblies made by Mabs-hifiasm. Still, you can try and see for yourself. -10. The option "--download_busco_dataset" fails to download a BUSCO dataset. What should I do?
+9. The option "--download_busco_dataset" fails to download a BUSCO dataset. What should I do?
This can happen if http://mikeshelk.site and, consequently, http://mikeshelk.site/Data/BUSCO_datasets/Latest/ is currently not accessible for some reason. To deal with this problem, manually download a file from http://busco-data.ezlab.org/v5/data/lineages/ and provide it to Mabs via the option "--local_busco_dataset". -11. What does "Mabs" mean?
+10. What does "Mabs" mean?
Funny to say, but "Mabs" means "Miniasm-based Assembler which maximizes Busco Score". That's because:
a) Mabs 1 was based on Miniasm instead of Hifiasm and Flye.
Miniasm takes as input a set of read overlaps produced by a program like Minimap2. Provided a file with overlaps, Miniasm performs assembly very quickly. The prominent speed of Miniasm allows exploring the parameter space more thoroughly than when using Hifiasm or Flye, which are 1-2 orders of magnitude slower. However, I later realized that the algorithm of Miniasm is inferior to the algorithms of Hifiasm and Flye, and even a more thorough exploration of a parameter space usually doesn't make Miniasm assemblies better than assemblies of Hifiasm and Flye. Therefore, I created Mabs 2 that uses Hifiasm and Flye. Mabs 1 worked in a 4-dimensional parameter space (optimized 4 different parameters of Miniasm), while Mabs 2 works in a 1-dimensional parameter space.

b) "Busco Score" is because very early versions of Mabs simply maximized BUSCO's "S" (the number of single-copy genes). However, I quickly realized that maximization of S may lead to collapsing of close paralogs, because it transfers them from the "multicopy" category to the "single-copy" category, thus increasing S. To deal with this problem, I started to classify multicopy genes into true multicopy (TM) and false multicopy (FM), and devised AG as a target for maximization, which is a sum of S and TM. -12. How to cite Mabs?
+11. How to cite Mabs?
Cite the preprint [https://www.biorxiv.org/content/10.1101/2022.12.19.521016v2](https://www.biorxiv.org/content/10.1101/2022.12.19.521016v2).
In addition, if you used Mabs-hifiasm you may cite the article about Hifiasm ([https://pubmed.ncbi.nlm.nih.gov/33526886/](https://pubmed.ncbi.nlm.nih.gov/33526886/)) and if you used Mabs-flye you may cite the article about Flye ([https://pubmed.ncbi.nlm.nih.gov/30936562/](https://pubmed.ncbi.nlm.nih.gov/30936562/)) since Mabs is heavily based on these programs.