-
Notifications
You must be signed in to change notification settings - Fork 5
Best practices for Merfin
Merfin is a k-mer based tool for variant call filtering, polishing, and assembly evaluation. Provided here are the best-practices with examples from our paper.
Count k-mers from the sequencing reads using Meryl. A more detailed description is available at Merqury.
Merfin works best with Illumina reads. If only HiFi reads are available, that could be used as an alternate option however less recommended due to its known homopolymer or microsatellite repeat sequencing biases.
After constructing the k-mer db, we recommend excluding k-mers with frequency 1 in high coverage (>30x) sequencing reads for computational acceleration and removal of erroneous k-mers.
# Construct k-mer db
meryl count k=$k reads.fastq.gz output reads.meryl
# Collect histogram for GenomeScope
meryl histogram reads.meryl > reads.hist
# Exclude frequency = 1 k-mers
meryl greater-than 1 reads.meryl output reads.gt1.meryl
A small example from HG002 that was described in our paper can be found here along with a smaller test example.
The variant call sets were submitted to PrecisionFDA challenge II, in which the raw (default) calls were obtained through GATK variant calling pipeline. Variant callings were performed on Illumina reads (ill.vcf) and combined Illumina, HiFi, and ONT reads (multi.vcf). Post-filtered with Merfin are also provided.
For test-running Merfin, download:
- chr20.fasta.gz
- HG002.k21.gt1.meryl.tar.gz
- test.vcf.gz
Once downloaded, decompress the meryl db. For example, using pigz:
pigz -d HG002.k21.gt1.meryl.tar.gz
tar -xf HG002.k21.gt1.meryl.tar
Finally, run Merfin:
# Calls from Illumina
merfin -filter -sequence chr20.fasta.gz \
-readmers HG002.k21.gt1.meryl \
-vcf ill.vcf.gz \
-output test.merfin
This generates chr20.fasta.gz.meryl as seqmers and loads both readmers and seqmers. Alternatively, the seqmers could be generated beforehand and re-used:
meryl count k=21 chr20.fasta.gz output chr20.meryl
merfin -filter -sequence chr20.fasta.gz \
-seqmers chr20.meryl \
-readmers HG002.k21.gt1.meryl \
-vcf ill.vcf.gz \
-output test.merfin
It takes ~1 hour to process the dataset with 24 cpus and ~30 GB of memory. Note parallelization takes place per sequence entry (i.e. chromosome, scaffolds or contigs).
Here, we are taking the intermediate Archocentrus centrarchus genome assembly presented in one of the VGP assemblies we polished as an example.
The data can be downloaded from here. We need:
- fArcCen1.21.meryl.tar.gz
- gs2/lookup_table.txt
- gs2/model.txt (or linear_plot.png. optional, for the
kcov
orkmercov
) - arrow/fArcCen1_s4.fasta
- arrow/fArcCen1_s4.reshaped.vcf.gz
As we are going to use the sequence k-mer database multiple times, let’s count the k-mers first:
meryl count k=21 fArcCen1_s4.fasta output fArcCen1_s4.fasta.meryl
This is optional, and the fArcCen1_s4.fasta.meryl
can be provided as -seqmers
in the following examples instead of -sequence
.
Dump readK (Kr), asmK (KC), and K* per bases (k-mers) in the sequence.
merfin -dump \
-sequence fArcCen1_s4.fasta \
-readmers fArcCen1.21.meryl \
-peak 10.8 \
-prob lookup_table.txt \
-output test.dump
It takes ~45 minutes to process with 24 cpus and a maximum of 110 GB of memory.
-
For advanced users: it is one option to use
-skipMissing
for not displaying erroneous k-mers in the track.
The output test.dump contains 5 columns:
seqName - name of the sequence this k-mer is from
seqPos - start position (0-based) of the k-mer in the sequence
readK - normalized read copies (read multiplicity / peak)
asmK - assembly copies as found in <seq.meryl>
K* - 0-centered K* value
This file can be converted to a .wig
or .bw
format with:
dump_output=test.dump
asm_fa=fArcCen1_s4.fasta
# Convert to .wig
awk 'BEGIN {print "track autoScale=on"} { if ($1!=chr) { chr=$1; print "variableStep chrom="chr" span=1"}; if ( $3!=0 ) { printf $2+1"\t"$5"\n" } }' $dump_output > $dump_output.wig
# Generate chromosome sizes:
awk '$0 ~ ">" {if (NR > 1) { print c; } c=0; printf substr($0,2,100) "\t"; } $0 !~ ">" {c+=length($0);} END { print c; }' $asm_fa > $asm_fa.sizes
# Convert to bigWig:
wigToBigWig $dump_output.wig $asm_fa.sizes $dump_output.bw
The wigToBigWig
is part of the UCSC Kent Tools.
This generates a 0-centered K* histogram for sequence.
merfin -hist \
-sequence fArcCen1_s4.fasta \
-readmers fArcCen1.21.meryl \
-peak 10.8 \
-prob lookup_table.txt \
-output test.hist 2> test.hist.log
It takes ~10 minutes to process with 24 cpus and a maximum of 64 GB of memory.
The output contains two columns:
K* value <tab> frequency
- Positive k* values are expected collapsed copies.
- Negative k* values are expected expanded copies.
- Closer to 0 means the expected and found k-mers are well balanced, 1:1.
QV* is printed at the end of the log; test.hist.log
.
K-mers not found in reads (missing) : 64922262
K-mers overly represented in assembly: 248125467.40
K-mers found in the assembly: 1984603394
Missing QV: 28.01
Merfin QV*: 20.89
Compute k-mer completeness using expected copy numbers.
merfin -completeness \
-seqmers fArcCen1_s4.fasta.meryl \
-readmers fArcCen1.21.meryl \
-prob lookup_table.txt \
-peak 10.8 2> test.completeness
It takes ~10 minutes to process with 24 cpus and a maximum of 69.1 GB of memory.
The completeness is printed at the end of test.completeness
:
TOTAL readK: 2069659896.00
TOTAL undrcpy: 436891737.00000
COMPLETENESS: 0.78891
Each values mean:
TOTAL readK: total kmers in reads
TOTAL undrcpy: number of kmers under the expected copy number
COMPLETENESS: completeness
Score each variant, or variants within distance k and their combinations by K*.
Merfin expects a proper GT field to be present under the first SAMPLE column.
In this example, we use the output from Arrow. By default, Arrow is missing a GT field, so we added it using this script:
bash reshape_arrow.sh arrow.vcf
Through experiments for haplotype switching, we found the heterozygous calls were not so useful in partially phased pseudo-haplotype assemblies and recommended removing them before supplying to Merfin.
This can be done with bcftools:
bcftools view -Oz -i ‘(GT="AA" || GT="Aa")' in.vcf > out.vcf.gz
In addition, spurious variant calls could be removed with low stringency filters, i.e. QUAL>1
, which will improve computation time, especially in freebayes outputs.
With the input .vcf
prepared, we are ready to run Merfin:
merfin -polish \
-sequence fArcCen1_s4.fasta \
-readmers fArcCen1.21.meryl \
-peak 10.8 \
-prob lookup_table.txt \
-vcf fArcCen1_s4.reshaped.vcf.gz \
-output test
It takes ~1.7 hours to process with 24 cpus and a maximum of 68.0 GB of memory used.
Alternatively, we can run Merfin to only include k-mers with frequency > 1, which ensures at least two k-mers (possibly from two reads) support the alternate variant.
Again, we can filter the readmers:
meryl greater-than 1 fArcCen1.21.meryl output fArcCen1.21.gt1.meryl
And use it to run Merfin:
merfin -polish \
-sequence fArcCen1_s4.fasta \
-readmers fArcCen1.21.gt1.meryl \
-peak 10.8 \
-prob lookup_table.txt \
-vcf fArcCen1_s4.reshaped.vcf.gz \
-output test.gt1
It takes ~1.5 hours to process with 24 cpus and a maximum of 24.3 GB of memory.
Finally, we can generate a consensus fasta files with bcftools
:
bcftools view -Oz test.merfin.vcf > test.merfin.vcf.gz
bcftools index test.merfin.vcf.gz
bcftools consensus -H1 --chain -f fArcCen1_s4.fasta test.merfin.vcf.gz > test.merfin.fasta
And restart the evaluation.
Happy Merfin!