Update help message
arangrhie committed Dec 2, 2022
1 parent 39806a7 commit 6da8424
Showing 2 changed files with 64 additions and 3 deletions.
16 changes: 13 additions & 3 deletions README.md
@@ -27,7 +27,15 @@ Merfin can be used to:
* assess collapsed or duplicated regions of the assembly (`-hist` or `-dump`)
* QV* for all scaffolds and the assembly (`-hist`)
* K* completeness (`-completeness`)
* filter variant calls for polishing (`-polish`: reference is from the same individual)
* filter variant calls for polishing
- Reference is from the same individual: `-polish` (uses `k*`)
- Reference is partially from the same individual, or copy-number estimates are unstable. These modes disable `k*`:
- `-better`: Almost identical to `-polish`, with `k*` disabled (deprecated).
- `-loose`: Remove variants only when the number of missing (error) k-mers increases. Neutral alternative paths that score equally to the reference path are *included*.
- `-strict`: Include variants only when the number of missing (error) k-mers decreases. Neutral alternative paths that score equally to the reference path are *excluded*.
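
As a minimal sketch of how the `-loose` and `-strict` rules listed above differ, the C++ snippet below compares the number of missing (error) k-mers on the reference path with that of a candidate ALT path. The names `Mode` and `keepAltPath` are hypothetical and do not come from the Merfin source; the only behavior assumed is what the README text states.

```cpp
#include <cstdint>

// Hypothetical illustration only; these names are not part of Merfin.
enum class Mode { Loose, Strict };

// Returns true when a candidate ALT path would be kept.
//   refMissing: missing (error) k-mers along the reference path
//   altMissing: missing (error) k-mers along the candidate ALT path
// Both modes keep a path that lowers the count and drop one that raises it.
// They differ only for neutral paths (equal counts): loose keeps, strict drops.
constexpr bool keepAltPath(Mode mode, std::uint64_t refMissing, std::uint64_t altMissing) {
  return altMissing < refMissing ||
         (altMissing == refMissing && mode == Mode::Loose);
}
```

Under these assumptions the two modes disagree only on ties: `keepAltPath(Mode::Loose, 5, 5)` is true, while `keepAltPath(Mode::Strict, 5, 5)` is false.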

### Updates
* 2022-12-02 Added the `-better`, `-loose`, and `-strict` modes used for polishing the T2T-HG002XY assemblies


### Determine kmer copy numbers
@@ -93,7 +101,7 @@ bcftools index $merfin_output.polish.vcf.gz
bcftools consensus $merfin_output.polish.vcf.gz -f assembly.fasta -H 1 > polished_assembly.fasta # -H 1 applies only first allele from GT at each position
```

Merfin is still under active development. Feel free to reach out to us if you have any question.
The `-better`, `-loose`, and `-strict` modes were developed for polishing the T2T-HG002XY chromosomes, where the reference used for aligning reads was created with T2T-CHM13v1.1 autosomes. We recommend `-loose` mode when the variant call set is highly curated. More details can be found in [this preprint](https://doi.org/10.1101/2022.12.01.518724).


### Helper
@@ -235,4 +243,6 @@ With special thanks for their support to integrate `--fitted_hist` option in Gen
* Michael Schatz

## Citation
Formenti, G., Rhie, A., Walenz, B.P. et al. Merfin: improved variant filtering, assembly evaluation and polishing via k-mer validation. Nat Methods (2022). https://doi.org/10.1038/s41592-022-01445-y
- Formenti, G., Rhie, A., Walenz, B.P. et al. Merfin: improved variant filtering, assembly evaluation and polishing via k-mer validation. Nat Methods (2022). https://doi.org/10.1038/s41592-022-01445-y

- `-better`, `-loose`, and `-strict` modes: Rhie A, Nurk S, Cechova M, Hoyt S, Taylor DJ et al. The complete sequence of a human Y chromosome. bioRxiv (2022). https://doi.org/10.1101/2022.12.01.518724
51 changes: 51 additions & 0 deletions src/merfin/merfin.C
@@ -233,6 +233,57 @@ main(int32 argc, char **argv) {
fprintf(stderr, " Required: -sequence, -readmers, -peak, -vcf, and -output\n");
fprintf(stderr, " Optional: -comb <N> set the max N of combinations of variants to be evaluated (default: 15)\n");
// fprintf(stderr, " -keep-het keep het calls (default: only hom calls are evaluated)\n");
fprintf(stderr, " -nosplit without this option, combinations larger than N are split\n");
fprintf(stderr, " -prob <file> use probabilities to adjust multiplicity to copy number (recommended)\n");
fprintf(stderr, " -debug output a debug log, into <output>.THREAD_ID.debug.gz\n");
fprintf(stderr, "\n");
fprintf(stderr, " Output: <output>.polish.vcf : variants chosen.\n");
fprintf(stderr, " use bcftools view -Oz <output>.polish.vcf and bcftools consensus -H 1 -f <seq.fasta> to polish.\n");
fprintf(stderr, " first ALT in heterozygous alleles are usually better supported by avg. |k*|.\n");
fprintf(stderr, "\n\n");
fprintf(stderr, " -loose (least conservative)\n");
fprintf(stderr, " Score each variant, or variants within distance k and their combinations without k*.\n");
fprintf(stderr, " Assumes the reference (-sequence) is partially from the same individual.\n");
fprintf(stderr, " Remove variants only when the num. missing (error) k-mers increase.\n");
fprintf(stderr, " Neutral alternative paths that score equally to the reference path are included.\n");
fprintf(stderr, " If multiple candidate paths tie, path with most ALT calls gets chosen.\n");
fprintf(stderr, "\n");
fprintf(stderr, " Required: -sequence, -readmers, -peak, -vcf, and -output\n");
fprintf(stderr, " Optional: -comb <N> set the max N of combinations of variants to be evaluated (default: 15)\n");
fprintf(stderr, " -nosplit without this option, combinations larger than N are split\n");
fprintf(stderr, " -prob <file> use probabilities to adjust multiplicity to copy number (recommended)\n");
fprintf(stderr, " -debug output a debug log, into <output>.THREAD_ID.debug.gz\n");
fprintf(stderr, "\n");
fprintf(stderr, " Output: <output>.polish.vcf : variants chosen.\n");
fprintf(stderr, " use bcftools view -Oz <output>.polish.vcf and bcftools consensus -H 1 -f <seq.fasta> to polish.\n");
fprintf(stderr, " first ALT in heterozygous alleles are usually better supported by avg. |k*|.\n");
fprintf(stderr, "\n\n");
fprintf(stderr, " -strict (most conservative)\n");
fprintf(stderr, " Score each variant, or variants within distance k and their combinations without k*.\n");
fprintf(stderr, " Assumes the reference (-sequence) is partially from the same individual.\n");
fprintf(stderr, " Include variants only when the num. missing (error) k-mers decrease.\n");
fprintf(stderr, " Neutral alternative paths that score equally to the reference path are excluded.\n");
fprintf(stderr, " If multiple candidate paths tie, path with least ALT calls gets chosen.\n");
fprintf(stderr, "\n");
fprintf(stderr, " Required: -sequence, -readmers, -peak, -vcf, and -output\n");
fprintf(stderr, " Optional: -comb <N> set the max N of combinations of variants to be evaluated (default: 15)\n");
fprintf(stderr, " -nosplit without this option, combinations larger than N are split\n");
fprintf(stderr, " -prob <file> use probabilities to adjust multiplicity to copy number (recommended)\n");
fprintf(stderr, " -debug output a debug log, into <output>.THREAD_ID.debug.gz\n");
fprintf(stderr, "\n");
fprintf(stderr, " Output: <output>.polish.vcf : variants chosen.\n");
fprintf(stderr, " use bcftools view -Oz <output>.polish.vcf and bcftools consensus -H 1 -f <seq.fasta> to polish.\n");
fprintf(stderr, " first ALT in heterozygous alleles are usually better supported by avg. |k*|.\n");
fprintf(stderr, "\n\n");
fprintf(stderr, " -better (legacy, nearly identical to -polish without k*)\n");
fprintf(stderr, " Score each variant, or variants within distance k and their combinations without k*.\n");
fprintf(stderr, " Assumes the reference (-sequence) is partially from the same individual.\n");
fprintf(stderr, " Include variants only when the num. missing (error) k-mers decrease.\n");
fprintf(stderr, " Neutral alternative paths that score equally to the reference path are excluded.\n");
fprintf(stderr, " If multiple candidate paths tie, the longest path is chosen.\n");
fprintf(stderr, "\n");
fprintf(stderr, " Required: -sequence, -readmers, -peak, -vcf, and -output\n");
fprintf(stderr, " Optional: -comb <N> set the max N of combinations of variants to be evaluated (default: 15)\n");
fprintf(stderr, " -nosplit without this option, combinations larger than N are split\n");
fprintf(stderr, " -prob <file> use probabilities to adjust multiplicity to copy number (recommended)\n");
fprintf(stderr, " -debug output a debug log, into <output>.THREAD_ID.debug.gz\n");
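
The tie-breaking rules stated in the `-loose` and `-strict` help text can be sketched as follows. This is an illustration under stated assumptions, not Merfin's implementation: `CandidatePath`, `TieBreak`, and `choosePath` are hypothetical names, and ranking candidates by fewest missing (error) k-mers before breaking ties is an assumption inferred from the help text. The `-better` rule (longest path wins a tie) is not covered.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical illustration only; these names are not part of Merfin.
struct CandidatePath {
  std::uint64_t missingKmers;  // missing (error) k-mers along this path
  std::size_t   altCalls;      // number of ALT calls the path applies
};

// MostAlt mirrors the -loose help text, LeastAlt mirrors -strict.
enum class TieBreak { MostAlt, LeastAlt };

// Returns the index of the chosen path. Precondition: paths is non-empty.
// Assumption: the path with the fewest missing k-mers wins; the tie-break
// rule applies only among paths with equal counts.
std::size_t choosePath(const std::vector<CandidatePath>& paths, TieBreak tie) {
  std::size_t best = 0;
  for (std::size_t i = 1; i < paths.size(); ++i) {
    const CandidatePath& p = paths[i];
    const CandidatePath& b = paths[best];
    if (p.missingKmers < b.missingKmers) { best = i; continue; }
    if (p.missingKmers > b.missingKmers) continue;
    // Equal scores: prefer more ALT calls (loose) or fewer (strict).
    const bool preferP = (tie == TieBreak::MostAlt) ? p.altCalls > b.altCalls
                                                    : p.altCalls < b.altCalls;
    if (preferP) best = i;
  }
  return best;
}
```

With three candidates scoring 3, 1, and 1 missing k-mers and carrying 0, 1, and 2 ALT calls, `TieBreak::MostAlt` selects the third path and `TieBreak::LeastAlt` selects the second.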