Merge branch 'master' of https://github.com/arangrhie/merfin

arangrhie · Jul 14, 2021 · f048b5b · f048b5b
2 parents 98028ee + 67a75b7
commit f048b5b
Showing 1 changed file with 128 additions and 88 deletions.
diff --git a/README.md b/README.md
@@ -1,6 +1,6 @@
 # Merfin
 
-k-mer-based assembly and variant calling evaluation for improved consensus accuracy.
+Improved variant filtering and polishing via k-mer validation
 
 ## Installation
 
@@ -12,21 +12,39 @@ cd merfin/src
 make -j 12
 ```
 
-`meryl` and `meryl-utility` are installed as submodules.
+During installation, `meryl` and `meryl-utility` are cloned as submodules, and `meryl`, `merfin` will be installed under `merfin/build/bin/`.
+
+* Recommended: [GenomeScope 2.0](https://github.com/tbenavi1/genomescope2.0) for polishing and assembly evaluation
 
 ## Running Merfin
 
-Merfin can be used to assess collapsed or duplicated region of the assembly (`-hist`, `-dump`) or to evaluate variant calls (`-vmer`). QV estimates for all scaffolds will also be generated with `-hist` and `-dump`.
+Merfin can be used to:
+* filter any variant calls for accurate genotyping (`-filter`: reference is from another individual, i.e. GRCh38)
+* assess collapsed or duplicated region of the assembly (`-hist` or `-dump`)
+* QV* for all scaffolds and the assembly (`-hist`)
+* K* completeness (`-completeness`)
+* filter variant calls for polishing (`-polish`: reference is from the same individual)
+
+
+### Determine kmer copy numbers
+
+Except for the `-filter` mode, a haploid/diploid peak estimate must be provided (`-peak`).
+
 
-### Determine kmer copy numbers ###
+This can be obtained from the kmer histogram or computed using tools such as [GenomeScope 2.0](https://github.com/tbenavi1/genomescope2.0) (kcov value).
 
-In all cases a haploid/diploid peak estimate must be provided (`-peak`), either from the kmer histogram, or computed using [Genomescope 2.0](https://github.com/gf777/genomescope2.0) (kcov).
 
-As a rule of thumb, the `-peak` should be:
-- haploid, if the reference used for read mapping and variant calling contains both the primary and the haplotigs, or both haplotypes of a trio
-- diploid (i.e. twice the haploid peak), for haploid representations of diploid genomes
+As a rule of thumb, the `-peak` should be chosen from:
+- haploid peak: partially-phased (assembly has both primary and alternate haplotigs) or fully-phased assemblies (i.e. trio-binning, trio-hifiasm, ...).
+- diploid (i.e. twice the haploid peak): pseudo-haploid assembly (haploid representation of a diploid genome)
 
-Optionally, a custom lookup table of kmer multiplicities with fitted copy numbers and probabilities can be provided (`-lookup`). The lookup table is generated when running our modified version of [Genomescope 2.0](https://github.com/gf777/genomescope2.0). This is recommended, and can significantly improve the accuracy of all analyses:
+
+In addition to the `-peak`, we recommend to provide a lookup table of kmer multiplicities with fitted copy numbers and probabilities (`-prob`).
+
+The lookup table can be generated with `--fitted_hist` option we added to [GenomeScope 2.0](https://github.com/tbenavi1/genomescope2.0). The probability is estimated for 0 to 4-copy kmer multiplicity and is prioritized over the estimates from `-peak`. Providing the fitted probability significantly improves the accuracy of all analyses.
+
+
+Once downloaded and installed, run GenomeScope using the following options:
 
 ```
 Rscript genomescope.R -i <kmer_histogram>  -k <k_size> -o <output_folder> -p <ploidy> --fitted_hist [ploidy] [verbose]
@@ -36,29 +54,32 @@ ploidy          haploid/diploid (default = 2)
 --fitted_hist     generates fitted hist plot and lookup_table.txt
 ```
 
-### Assess collapses/duplications ###
+### Assess collapses and duplications
 
-The output of `-dump` can be further converted to `.Wig/.bw` tracks for visualization on IGV with:
+The output of `-dump` can be further converted to `.wig/.bw` tracks for visualization on IGV or UCSC Genome Browser with:
 
-```
-awk 'BEGIN{print "track autoScale=on"}{if($1!=chr){chr=$1; print "variableStep chrom="chr" span=1"};if($3!=0){printf $2+1"\t"$5"\n"}}' $dump_output > $dump_output.Wig
-#Generate chromosome sizes:
+```shell
+awk 'BEGIN{print "track autoScale=on"}{if($1!=chr){chr=$1; print "variableStep chrom="chr" span=1"};if($3!=0){printf $2+1"\t"$5"\n"}}' $dump_output > $dump_output.wig
+# Generate chromosome sizes:
 awk '$0 ~ ">" {if (NR > 1) {print c;} c=0;printf substr($0,2,100) "\t"; } $0 !~ ">" {c+=length($0);} END { print c; }' genome.fasta > chr.sizes
-#Convert to bigWig:
+# Convert to bigWig:
 wigToBigWig $dump_output.Wig chr.sizes $dump_output.bw
 ```
 
+
 #### Assess correlation between data types
-If you have multiple data types (e.g. HiFi and Illumina) you can also assess the K* agreement between them genome-wide generating a cartesian plot. Use the same output of `-dump` with the scripts under `scripts/cartesian_plot`.
+For multiple data types (e.g. HiFi and Illumina), the genome-wide K* agreement can be assessed by generating a cartesian plot.
+Use the same output of `-dump` with the scripts under `scripts/cartesian_plot`.
+
 
-### Assess per base QV ###
-Merfin will quickly produce [Merqury](https://github.com/marbl/merqury) QV estimates for each scaffold and genome-wide averages when `-hist` is used. Merqury QV estimate consider only kmers missing from the read sets. In addition, Merfin produces a QV* estimate, which accounts also for kmers that are seen in excess with respect to their expected multiplicity predicted from the reads.
+### Assess per base QV
+Merfin quickly produces [Merqury](https://github.com/marbl/merqury) QV estimates for each scaffold and genome-wide averages when `-hist` is used. Merqury QV estimate consider only kmers missing from the read sets. In addition, Merfin produces a QV* estimate, which accounts also for kmers that are seen in excess with respect to their expected multiplicity predicted from the reads.
 
-These analyses can be further refined when the lookup table is provided (`-lookup`, further details under `scripts/lookup`), in which case 0 to 4-copy kmer multiplicity estimates are corrected using [Genomescope 2.0](http://qb.cshl.edu/genomescope/genomescope2.0/) kmer frequency modelling to account for the real kmer distribution. Missing kmers then include plausible low frequency kmers. 0 to 4-copy kmer multiplicity estimates are weighted for the probability that the multiplicity estimate was correct. 
+These analyses can be further refined when the lookup table is provided (`-prob`),  Missing kmers then include plausible low frequency kmers. 0 to 4-copy kmer multiplicity estimates are weighted for the probability that the multiplicity estimate was correct. 
 
-### Filter variant calls for polishing ###
+### Filter variant calls for polishing
 
-The input `.vcf` can be supplied with the `-vmer` option. The ouput will include only the variants that passed the Merfin screening.
+The input `.vcf` can be supplied with the `-polish` option. The includes variants that passed the Merfin screening.
 
 Once the filtered `.vcf` is generated, the assembly can be polished with:
 
@@ -68,63 +89,99 @@ bcftools index $merfin_output.polish.vcf.gz
 bcftools consensus $merfin_output.polish.vcf.gz -f assembly.fasta -H 1 > polished_assembly.fasta # -H 1 applies only first allele from GT at each position
 ```
 
-Two set of similar scripts for further parallelization on HPC (slurm) are available under `scripts/parallel1` and `scripts/parallel2`.
-
 Merfin is still under active development. Feel free to reach out to us if you have any question.
 
-### Helper ###
+### Best practices
+
+A more detailed wiki will be posted soon!
+
+### Helper
 
 ```
 cd ../build/bin/
 ./merfin
 
-usage: ./merfin <report-type>      \
+usage: merfin <report-type>        \
          -sequence <seq.fasta>     \
-         -seqmers  <seq.meryl>     \
          -readmers <read.meryl>    \
          -peak     <haploid_peak>  \
-         -lookup   <lookup_table>  \
+         -prob     <lookup_table>  \
          -vcf      <input.vcf>     \
-         -output   <output>           
+         -output   <output>        
 
   Predict the kmer consequences of variant calls <input.vcf> given the consensus sequence <seq.fasta>
   and lookup the k-mer multiplicity in the consensus sequence <seq.meryl> and in the reads <read.meryl>.
 
   Input -sequence and -vcf files can be FASTA or FASTQ; uncompressed, gz, bz2 or xz compressed
 
-  Each input database can be filtered by value.  More advanced filtering
+  Each readmers can be filtered by value.  More advanced filtering
   requires a new database to be constructed using meryl.
-    -min   m    Ignore kmers with value below m
-    -max   m    Ignore kmers with value above m
-    -threads t  Multithreading for meryl lookup table construction, dump and hist.
+    -min     m     Ignore kmers with value below m
+    -max     m     Ignore kmers with value above m
+    -threads t     Multithreading for meryl lookup table construction, dump and hist.
 
   Memory usage can be limited, within reason, by sacrificing kmer lookup
   speed.  If the lookup table requires more memory than allowed, the program
   exits with an error.
-  -memory1 m   Don't use more than m GB memory for loading seqmers
-  -memory2 m   Don't use more than m GB memory for loading readmers
-    
-  -lookup optional input vector of probabilities.
+    -memory  m     Don't use more than m GB memory for loading mers
+
+  For k* based evaluation and polishing, -peak is required with optional -prob.
+    -peak    m     Required input to hard set copy 1 and infer multiplicity to copy number (recommended).
+    -prob    file  Optional input vector of probabilities. Adjust multiplicity to copy number
+                   in case both -prob and -peak are provided, -prob takes higher priority
+                   than -peak for multiplicity listed in the vector table.
+
+  By default, <seq.fasta>.meryl will be generated unless -seqmers is provided.
+    -seqmers seq.meryl  Optional input for pre-built sequence meryl db
 
   Exactly one report type must be specified.
 
+
+  -filter
+   Filter variants within distance k and their combinations by missing k-mers.
+   Assumes the reference (-sequence) is from a different individual.
+   Required: -sequence, -readmers, -vcf, and -output
+   Optional: -comb <N>  set the max N of combinations of variants to be evaluated (default: 15)
+             -nosplit   without this options combinations larger than N are split
+             -debug     output a debug log, into <output>.THREAD_ID.debug.gz
+
+   Output: <output>.filter.vcf : variants chosen.
+
+
+  -polish
+   Score each variant, or variants within distance k and their combinations by k*.
+   Assumes the reference (-sequence) is from the same individual.
+
+   Required: -sequence, -readmers, -peak, -vcf, and -output
+   Optional: -comb <N>    set the max N of combinations of variants to be evaluated (default: 15)
+             -nosplit     without this options combinations larger than N are split
+             -prob <file> use probabilities to adjust multiplicity to copy number (recommended)
+             -debug       output a debug log, into <output>.THREAD_ID.debug.gz
+
+   Output: <output>.polish.vcf : variants chosen.
+     use bcftools view -Oz <output>.polish.vcf and bcftools consensus -H 1 -f <seq.fata> to polish.
+     first ALT in heterozygous alleles are usually better supported by avg. |k*|.
+
+
   -hist
    Generate a 0-centered k* histogram for sequences in <input.fasta>.
-   Positive k* values are expected collapsed copies.
-   Negative k* values are expected expanded  copies.
-   Closer to 0 means the expected and found k-mers are well balenced, 1:1.
-   Reports QV at the end, in stderr.
-   Required: -sequence, -seqmers, -readmers, -peak, and -output.
-   Optional: -lookup <probabilities> use probabilities to adjust multiplicity to copy number
+     Positive k* values are expected collapsed copies.
+     Negative k* values are expected expanded  copies.
+     Closer to 0 means the expected and found k-mers are well balenced, 1:1.
+
+   Required: -sequence, -readmers, -peak, and -output.
+   Optional: -prob <file>  use probabilities to adjust multiplicity to copy number (recommended)
 
    Output: k* <tab> frequency
+           Reports QV at the end, in stderr.
 
 
   -dump
    Dump readK, asmK, and k* per bases (k-mers) in <input.fasta>.
-   Required: -sequence, -seqmers, -readmers, -peak, and -output
-   Optional: -skipMissing will skip the missing kmer sites to be printed
-             -lookup <probabilities> use probabilities to adjust multiplicity to copy number
+
+   Required: -sequence, -readmers, -peak, and -output
+   Optional: -skipMissing  skip the missing kmer sites to be printed
+             -prob <file>  use probabilities to adjust multiplicity to copy number (recommended)
 
    Output: seqName <tab> seqPos <tab> readK <tab> asmK <tab> k*
       seqName    - name of the sequence this kmer is from
@@ -134,51 +191,30 @@ usage: ./merfin <report-type>      \
       k*         - 0-centered k* value
 
 
-  -vmer
-   Score each variant, or variants within distance k and their combinations by k*.
-   Required: -sequence, -seqmers, -readmers, -peak, -vcf, and -output
-   Optional: -comb <N>  set the max N of combinations of variants to be evaluated (default: 15)
-             -nosplit   without this options combinations larger than N are split
-             -by-kstar  output variants by kstar. *experimental*
-                        if chosen, use bcftools to compress and index, and consensus -H 1 -f <seq.fata> to polish.
-                        first ALT in heterozygous alleles are better supported by avg. |k*|.
-             -lookup <probabilities> use probabilities to adjust multiplicity to copy number
-
-   Output files: <output>.debug and <output>.polish.vcf
-    <output>.debug : some useful info for debugging.
-                     seqName <tab> varMerStart <tab> varMerEnd <tab> varMerSeq <tab> score <tab> path
-      varMerID    - unique numbering, starting from 0
-      varMerRange - seqName:start-end. position (0-based) of the variant (s), including sequences upstream and downstream of k-1 bp
-      varMerSeq   - combination of variant sequence to evalute
-      numMissings - total number of missing kmers
-      min k*      - minimum of all |k*| for non-missing kmers. -1 when all kmers are missing.
-      max k*      - maximum of all |k*| for non-missing kmers. -1 when all kmers are missing.
-      avg k*      - average of all |k*| for non-missing kmers. -1 when all kmers are missing.
-      median k*   - median  of all |k*| for non-missing kmers. -1 when all kmers are missing.
-      record      - vcf record with <tab> replaced to <space>. only non-reference alleles are printed with GT being 1/1.
-
-    <output>.polish.vcf : variants chosen.
-     use bcftools view -Oz <output>.polish.vcf and bcftools consensus -H 2 -f <seq.fata> to polish.
-     ALT alleles are favored with more support compared to the REF allele.
-```
-
-Example run:
+  -completeness
+   Compute kmer completeness using expected copy numbers for all kmers.
+
+
+  Optional output from -debug in -filter and -polish:
+   <output>.THREAD_ID.debug.gz : some useful info for debugging.
+      seqName <tab> varMerStart <tab> varMerEnd <tab> varMerSeq <tab> score <tab> path
+      varMerID                - unique numbering, starting from 0
+      varMerRange             - seqName:start-end. position (0-based) of the variant (s),
+                                including sequences upstream and downstream of k-1 bp
+      varMerSeq               - combination of variant sequence to evalute
+      numMissings             - total number of missing kmers
+      min k*                  - minimum of all |k*| for non-missing kmers. -1 when all kmers are missing.
+      max k*                  - maximum of all |k*| for non-missing kmers. -1 when all kmers are missing.
+      median k*               - median  of all |k*| for non-missing kmers. -1 when all kmers are missing.
+      avg k*                  - average of all |k*| for non-missing kmers. -1 when all kmers are missing.
+      avg ref-alt k*          - difference between reference and alternate average k*.
+      delta kmer multiplicity - cumulative sum of kmer multiplicity variation.
+                                positive values imply recovered kmers, while
+                                negative values imply overrepresented kmers introduced.
+      record                  - vcf record with <tab> replaced to <space>.
+                                only non-reference alleles are printed with GT being 1/1.
 ```
-// Get histogram and QV specific to chrX - QV is correct. Histogram will be biased for collapses
-merfin -hist -memory1 2 -memory2 16 -sequence chrX.fasta -seqmers chrX.meryl -readmers chrX.read.meryl/ -peak 104 -output out.chrX.hist
 
-// Get histogram and QV specific to chrX - Histogram is correct. QV will be biased
-merfin -hist -memory1 2 -memory2 16 -sequence chrX.fasta -seqmers asm.meryl -readmers chrX.read.meryl/ -peak 104 -output out.chrX.hist
-
-// Get histogram and QV for the full assembly - Histogram and QV correct
-merfin -hist -memory1 12 -memory2 120 -sequence asm.fasta -seqmers asm.meryl -readmers read.meryl/ -peak 104 -output out.chrX.hist
-
-// Dump readK, asmK, k* for each position of chrX
-merfin -dump -memory1 2 -memory2 16 -sequence chrX.fasta -seqmers asm.meryl -readmers chrX.read.meryl/ -peak 104 -output out.dump.gz
-
-// Score each variant call and sort
-merfin -vmer -memory1 2 -memory2 16 -sequence chrX.fasta -seqmers asm.meryl -readmers chrX.read.meryl/ -peak 104 -vcf chrX.tiny.vcf -output out.dump.gz
-```
 
 ## Acknowledgements
 This code was developed as part of the T2T consortium chm13-polishing working group by the following individuals:
@@ -187,3 +223,7 @@ This code was developed as part of the T2T consortium chm13-polishing working gr
 * Brian Walenz
 * Sergey Koren
 * Adam Phillippy
+
+With special thanks for their support to integrate `--fitted_hist` option in GenomeScope 2.0:
+* Rhyker Ranallo-Benavidez
+* Michael Schatz