Releases: waveygang/wfmash

wfmash 0.8.2 - Pasticcione

13 Apr 16:34

Buildable Source Tarball: wfmash-v0.8.2.tar.gz

This introduces:

  • updates in how wfmash is compiled/built to ensure greater inter-system compatibility;
  • adaptive penalties for the alignment, with more permissive wflambda/wflign parameters.

wfmash 0.8.1 - Divergenza

28 Mar 12:40

Buildable Source Tarball: wfmash-v0.8.1.tar.gz

This introduces:

  • a fix for a bug in mapping filtering for short sequences;
  • a default segment size (-s) of 10 kb;
  • alignment penalties that are now fixed, regardless of the requested mapping identity (-p): this strongly reduces the runtime and leads to much more compressed representations of the alignments between sequences.

pensiero divergente

23 Mar 16:29
@ekg

Buildable Source Tarball: wfmash-v0.8.0.tar.gz

wfmash is now substantially better at mapping and alignment at very high sequence divergences. This involves many changes relative to v0.7.0.

mashmap3

The mapping module has been largely rewritten to allow for mappings to span large structural variation. We now apply multiple merging passes in 2D over the query/target mapping matrix (mashmap2 used a 1D approach in the query). The first unites mappings found within 2x the segment length (wfmash -s). Subsequently, multiple rounds of greedy merging and plane-sweep filtering merge the closest mappings on a near diagonal within a given chaining gap (wfmash -c). We finally filter the mappings at 5x segment length (wfmash -l) rather than 3x in previous releases.
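The idea of merging in 2D can be illustrated with a toy sketch. This is not wfmash's implementation: the `Mapping` type, the `merge_within_gap` function, and the single greedy pass over forward-strand mappings are all assumptions made for illustration. It merges a mapping into the previous one only when it is close in both the query and the target coordinates.

```python
from dataclasses import dataclass


@dataclass
class Mapping:
    # Hypothetical record type: half-open intervals on query and target.
    q_start: int
    q_end: int
    t_start: int
    t_end: int


def merge_within_gap(mappings, max_gap):
    """Greedily merge neighbouring mappings that lie within max_gap of the
    previous merged mapping in BOTH query and target (a 2D criterion)."""
    merged = []
    for m in sorted(mappings, key=lambda x: (x.q_start, x.t_start)):
        if merged:
            last = merged[-1]
            q_gap = m.q_start - last.q_end
            t_gap = m.t_start - last.t_end
            if abs(q_gap) <= max_gap and abs(t_gap) <= max_gap:
                # Extend the previous mapping to cover the new one.
                merged[-1] = Mapping(
                    last.q_start,
                    max(last.q_end, m.q_end),
                    min(last.t_start, m.t_start),
                    max(last.t_end, m.t_end),
                )
                continue
        merged.append(m)
    return merged


example = [
    Mapping(0, 5000, 100, 5100),
    Mapping(7000, 12000, 7200, 12200),        # close to the first in both axes
    Mapping(500000, 505000, 90000, 95000),    # far away: stays separate
]
result = merge_within_gap(example, max_gap=10000)
print(result)
```

A 1D approach would only look at the query gap; requiring closeness on the target axis as well is what keeps unrelated hits from being joined.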

The updated mapping merging also allows us to make a sparser first mapping step, as segment mapping drop-outs can be spanned using this approach. This allows us to use relatively sparse minimizer selection, which reduces the number of candidate (usually erroneous) mappings to consider.

We have also applied world minimizers, which are unbiased and faster to compute than window minimizers. To ensure efficient performance, we implement a much stronger filter on repetitive minimizers, filtering out the top 0.5% of most-frequent minimizers, which is now configurable with wfmash -H.
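As a rough illustration of frequency-based filtering, the sketch below drops the most frequent distinct minimizers from a list. It is an assumption-laden toy, not wfmash's code: the function name `drop_most_frequent` is hypothetical, and interpreting the threshold as a fraction of distinct minimizers is one possible reading of the behaviour described above.

```python
from collections import Counter


def drop_most_frequent(minimizers, top_fraction=0.005):
    """Remove every occurrence of the most frequent distinct minimizers.

    top_fraction is the share of distinct minimizers to discard
    (0.005 corresponds to the 0.5% mentioned above).
    """
    counts = Counter(minimizers)
    n_drop = int(len(counts) * top_fraction)
    too_frequent = {m for m, _ in counts.most_common(n_drop)}
    return [m for m in minimizers if m not in too_frequent]


# Tiny example with an exaggerated fraction so the effect is visible.
hashes = ["A"] * 10 + ["B", "C", "D"]
kept = drop_most_frequent(hashes, top_fraction=0.25)
print(kept)
```

With only a handful of distinct values, the 0.5% default drops nothing, which is why the example uses a larger fraction.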

divergence-adaptive wflign

This release also features improvements to the base-level alignment that are essential for sensitive alignment at high divergence. We now rely more heavily on the wflign matrix, which leads to a more complete exploration of alignment possibilities. Alignment parameters (such as dynamic programming scoring for WFA, the maximum sketch distance used to evaluate a local alignment, and the maximum allowed alignment score) are now set as a function of the mashmap-based identity.

testing approach

To develop this release, we tested on sequence collections with up to 30% divergence. We ensured that adjustments worked on a series of test cases drawn from humans, yeast, E. coli, potato, and fish, including a scale-up test to an all-vs-all alignment of 45 fish assemblies.

user considerations

In contrast to previous versions, wfmash v0.8.0 is less sensitive to particular segment length settings. The meaning of -p, or the minimum pairwise identity of the mappings, is also somewhat softened, because mappings can now span very large gaps, up to --chain-gap which defaults to 100x the segment length. Very long segment lengths of 50-100kb are probably less necessary, and we're seeing good performance at 5kb to 20kb segment lengths.

The increase in the minimum mapping length filter (from 3x to 5x segment length) reflects increased sensitivity and also potential errors caused by these changes.
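The multipliers quoted in these notes (2x for the first merge, 5x for the minimum mapping length, 100x for the chain gap) can be turned into concrete numbers for a given segment length. The helper below is only arithmetic on those stated ratios; the function name and dictionary keys are hypothetical and are not wfmash options.

```python
def derived_lengths(segment_length):
    """Lengths implied by the multipliers stated in the v0.8.0 notes."""
    return {
        "first_merge_distance": 2 * segment_length,
        "min_mapping_length": 5 * segment_length,
        "chain_gap": 100 * segment_length,
    }


# A 5 kb segment length, the low end of the range reported to work well.
lengths = derived_lengths(5_000)
print(lengths)
```

For a 5 kb segment length this gives a 25 kb minimum mapping length and a 500 kb chain gap, which shows how far apart two chained mappings may be.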

An additional concern is that users seeking to map against extremely repetitive sequences may need to set -H lower. Increasing -s can also span gaps caused by repeats and derive alignments for them. Alignments that focus strongly on repetitive regions may still need special parameter tuning. The default settings are now focused on obtaining reasonable homology maps for pangenome and pan-clade alignment problems.

visualization of wflign alignment matrix

Parameter tuning was assisted by visualizations of the wflign alignment matrix (high-order, over 256bp wfmash -W-length segments). These show regions compared using kmer Jaccard similarity in gray, successful alignment attempts in green, and failed alignments in blue.
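For readers unfamiliar with the kmer Jaccard comparison used for the gray cells, here is a minimal exact version. It is a sketch for intuition only: wfmash works with sketches rather than full kmer sets, and the function names here are hypothetical.

```python
def kmers(seq, k):
    """Set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_jaccard(a, b, k):
    """Exact Jaccard similarity between the kmer sets of two sequences."""
    ka, kb = kmers(a, k), kmers(b, k)
    union = ka | kb
    if not union:
        return 0.0  # neither sequence is long enough to contain a kmer
    return len(ka & kb) / len(union)


print(kmer_jaccard("ACGTACGT", "ACGTACGT", 3))
print(kmer_jaccard("ACGT", "ACGA", 3))
```

Identical segments score 1.0, and the score falls as fewer kmers are shared, which is what makes it usable as a cheap pre-check before attempting an alignment.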

Two 1Mbp regions of yeast genomes:

image

... and the full alignment matrix (pafplot):

image

Two fish chromosomes at ~25% pairwise divergence in aligned regions (wfmash -p 70 -s 20k).

image

And a few alignments through human lipoprotein A (LPA):

image

image

image

wfmash 0.7.0 - Educazione

09 Sep 16:06
a438d5d

Buildable Source Tarball: wfmash-v0.7.0.tar.gz

This release introduces a large number of updates:

  • the mapping parameters (window size and kmer size) are adaptive with respect to the requested segment identity;
  • the alignment parameters (mismatch/gap penalties and the max mash distance heuristic) are adaptive with respect to the estimated identity for each mapping region;
  • WFA was updated to the latest WFAv2, which includes important memory usage optimizations;
  • wflign / wflambda are upgraded to WFAv2, leading to a strong reduction of the memory usage;
  • alignment accuracy is improved during the patching, thanks to the reduced memory usage of the new WFAv2;
  • robin-hood structures are applied, to improve runtimes;
  • matches and (part of the) mismatches are cached in wflign, improving the runtime at the cost of a small memory overhead;
  • pure-WFA alignment is performed for short sequences (and short mapping regions in long sequences);
  • ends-free WFA for head/tail patching, replacing edlib;
  • fixed a reduction bug in the WFA library;
  • input PAF from other aligners is supported.

wfmash 0.6.1 - Handy

26 Jul 10:17
3b786e0

Buildable Source Tarball: wfmash-v0.6.1.tar.gz

This (little) release includes:

  • handy parameters (#89);
  • a buildable source tarball;
  • a little compiling fix.

sparsify and use low-memory WFA

16 Jul 17:13
@ekg

Here, we sparsify the wflign problem, and then patch through the gaps using a low-memory version of WFA (cheers @smarco!).

sensitive mapping and stable wflign-ing

17 May 09:52
@ekg
37b9e71

A number of changes in wfmash have completed the alignment patching in wflign, rendering it stable and memory-thrifty enough to safely apply to large genomes. The mapping in general has also been improved by targeting a smaller windowSize parameter, and capping it at 256 to not generate confusion when mapping large segments.

wavefront inception: the alignment patching

22 Apr 22:33
@ekg
1e10586

With this version, alignments are patched with WFA (for unaligned regions where the short axis is up to 8kb) and edlib (for very short unaligned stretches). Edlib in semiglobal mode is used to patch up the heads and tails of the alignments. Previous versions had significant dropouts in alignments, but with these changes the issue is largely resolved.

wavefront inception: the trace-merging

15 Jan 14:09
@ekg
8ff9d7d
Compare
Choose a tag to compare

This point release updates wflign to emit a single merged alignment for each mapping. The output is compact and ready for eventual adaptation to SAM output.

wavefront inception

11 Jan 13:55
@ekg
dd8799a

wfmash is now synced with edyeet, and an update to wflign lets us use WFA to obtain base-level alignment with affine gap costs. This is more biologically plausible than the edit-distance-based alignment provided by edlib.
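A small worked comparison shows why affine gap costs behave differently from edit distance. The penalty values below (gap open 6, gap extend 2) are illustrative assumptions and are not claimed to be the values wfmash uses; the function names are hypothetical.

```python
def edit_distance_gap_cost(gap_lengths):
    """Under edit distance every gapped base costs 1, however it is grouped."""
    return sum(gap_lengths)


def affine_gap_cost(gap_lengths, gap_open=6, gap_extend=2):
    """Under affine costs each gap pays an opening penalty once,
    plus an extension penalty per gapped base."""
    return sum(gap_open + gap_extend * n for n in gap_lengths)


one_long_gap = [10]          # a single 10 bp indel
ten_short_gaps = [1] * 10    # ten scattered 1 bp indels

print(edit_distance_gap_cost(one_long_gap), edit_distance_gap_cost(ten_short_gaps))
print(affine_gap_cost(one_long_gap), affine_gap_cost(ten_short_gaps))
```

Edit distance cannot tell the two cases apart, while the affine model prefers the single long indel, which matches how insertions and deletions tend to arise as single events.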

Alignment runtime increases by 2-3x, depending on the divergence rate given by -p[%], --map-pct-id=[%], with higher thresholds experiencing a lower relative slowdown.

wfmash uses both wavefronts and mash distance (locality sensitive hashing) in two contexts. For mapping, it uses MashMap2's algorithm. For base-level alignment, it uses wflign, which is WFλ with λ = WFA guided heuristically with mash distance.
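For context, mash distance is commonly computed from a kmer Jaccard estimate j and the kmer size k as D = -(1/k) ln(2j / (1 + j)), the formula introduced with the Mash tool. The sketch below implements that formula; whether wfmash applies exactly this form at every step is an assumption, and the function name is hypothetical.

```python
import math


def mash_distance(jaccard, k):
    """Mash distance from a Jaccard estimate and kmer size.

    Returns 1.0 when no kmers are shared, since the logarithm
    is undefined at a Jaccard of zero.
    """
    if jaccard <= 0.0:
        return 1.0
    return -math.log(2.0 * jaccard / (1.0 + jaccard)) / k


print(mash_distance(1.0, 16))
print(mash_distance(0.5, 16))
```

A distance near zero means near-identical sequences, and an approximate identity can be read off as one minus the distance.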