Merge pull request #4 from ekg/wavefront-inception
Wavefront inception
ekg authored Jan 11, 2021
2 parents 5c6ae5d + 5a0768f commit dd8799a
Showing 218 changed files with 46,794 additions and 459 deletions.
7 changes: 4 additions & 3 deletions .gitignore
@@ -19,10 +19,11 @@ mashmap
mashmap-align
wfmash

#build directories
src/common/wflign/build
build

#Others
autoParam
Makefile
configure
*.cache
*~
\#*
45 changes: 45 additions & 0 deletions CMakeLists.txt
@@ -0,0 +1,45 @@
cmake_minimum_required(VERSION 3.2 FATAL_ERROR)
project(wflign VERSION 0.0.1)

include(GNUInstallDirs)
include(CheckCXXCompilerFlag)

set(CMAKE_CXX_STANDARD 17)
set(CMAKE_CXX_STANDARD_REQUIRED ON) # Falling back to a different standard is not allowed.
set(CMAKE_CXX_EXTENSIONS OFF) # Make sure no compiler-specific features are used.
set(CMAKE_RUNTIME_OUTPUT_DIRECTORY ${CMAKE_BINARY_DIR}/bin)
set(CMAKE_BUILD_TYPE Release)
set(CMAKE_C_FLAGS "${CMAKE_C_FLAGS} -O3 -mcx16 -march=native -g")
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -O3 -mcx16 -march=native -g")

set(CMAKE_ARCHIVE_OUTPUT_DIRECTORY ${CMAKE_BINARY_DIR}/lib)
set(CMAKE_LIBRARY_OUTPUT_DIRECTORY ${CMAKE_BINARY_DIR}/lib)
set(CMAKE_RUNTIME_OUTPUT_DIRECTORY ${CMAKE_BINARY_DIR}/bin)

#add_subdirectory(src/common/wflign/deps/WFA EXCLUDE_FROM_ALL)
#add_subdirectory(src/common/wflign/deps/wflambda EXCLUDE_FROM_ALL)
add_subdirectory(src/common/wflign EXCLUDE_FROM_ALL)

add_executable(wfmash
src/yeet/yeet_main.cpp) #$(CXX) $(CXXFLAGS) $(CPPFLAGS) -Lsrc/common/wflign/build/lib -Isrc/common/wflign/deps -Isrc/common/wflign/deps/wflambda -Isrc/common/wflign/deps/patchmap -Isrc/common $(SOURCE_1) -o wfmash @mathlib@ -lstdc++ -lz -lm -lpthread -ledlib -lwflambda -lwflign

target_include_directories(wfmash PRIVATE
src
src/common/wflign/deps
src/common/wflign/deps/wflambda
src/common/wflign/deps/patchmap
src/common/wflign/deps/WFA
src/common)

target_link_libraries(wfmash
gsl
gslcblas
m
pthread
libwflign_static
edlib
wflambda
wfa
z)

install(TARGETS wfmash DESTINATION bin)
48 changes: 0 additions & 48 deletions INSTALL.txt

This file was deleted.

20 changes: 20 additions & 0 deletions LICENSE
Original file line number Diff line number Diff line change
@@ -0,0 +1,20 @@
The MIT License (MIT)

Copyright (c) 2020 Erik Garrison

Permission is hereby granted, free of charge, to any person obtaining a copy of
this software and associated documentation files (the "Software"), to deal in
the Software without restriction, including without limitation the rights to
use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of
the Software, and to permit persons to whom the Software is furnished to do so,
subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS
FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR
COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER
IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN
CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
47 changes: 0 additions & 47 deletions LICENSE.txt

This file was deleted.

26 changes: 0 additions & 26 deletions Makefile.in

This file was deleted.

52 changes: 20 additions & 32 deletions README.md
@@ -2,24 +2,18 @@

_A DNA sequence read mapper based on mash distances and the wavefront alignment algorithm._

`wfmash` is a fork of [MashMap](https://github.com/marbl/MashMap) that implements base-level alignment using the wavefront alignment algorithm [WFA](https://github.com/smarco/WFA).
It completes an alignment module in MashMap and extends it to enable multithreaded operation.
A single command-line interface simplifies usage.
The [PAF](https://github.com/lh3/miniasm/blob/master/PAF.md) output format is harmonized and made equivalent to that in [minimap2](https://github.com/lh3/minimap2), and has been validated as input to [seqwish](https://github.com/ekg/seqwish).
`wfmash` is also a fork of `edyeet`, which uses edlib to obtain an edit-distance based alignment, which is fast but may not be appropriate for many biological applications.
`wfmash` is a fork of [MashMap](https://github.com/marbl/MashMap) that implements base-level alignment using [WFA](https://github.com/Martinsos/WFA), via the [`wflign`](https://github.com/ekg/wflign) tiled wavefront global alignment algorithm.
It completes MashMap with a high-performance alignment module capable of computing base-level alignments for very large sequences.

## process

Each query sequence is broken into pieces defined by `-s[N], --segment-length=[N]`.
Each query sequence is broken into non-overlapping pieces defined by `-s[N], --segment-length=[N]`.
These segments are then mapped using MashMap's sliding minhash mapping algorithm and subsequent filtering steps.
To reduce memory, a temporary file is used to store initial mappings.
Each mapping location is then used as a target for alignment using WFA.
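
The segmentation step described above can be illustrated with a minimal sketch. It is not wfmash's implementation: the function name `split_into_segments` is hypothetical, and keeping the shorter trailing remainder as its own piece is an assumption, since the text does not say how the final piece is handled.

```python
from typing import List, Tuple


def split_into_segments(sequence: str, segment_length: int) -> List[Tuple[int, str]]:
    """Return (offset, piece) pairs that cover the sequence without overlap.

    Assumption: a trailing remainder shorter than segment_length is kept
    as a final short piece. wfmash may treat it differently.
    """
    if segment_length <= 0:
        raise ValueError("segment_length must be positive")
    return [
        (start, sequence[start:start + segment_length])
        for start in range(0, len(sequence), segment_length)
    ]


# A 10 bp toy query split with a segment length of 4.
for offset, piece in split_into_segments("ACGTACGTAC", 4):
    print(offset, piece)
```

Each piece carries its offset so that a mapping found for a segment can be placed back on the original query coordinates.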

The resulting alignments always contain extended CIGARs in the `cg:Z:*` tag.
Approximate mapping (equivalent to `MashMap`) can be obtained with `-m, --approx-map`.

Mapping merging is disabled by default, as aligning merged approximate mappings with WFA under reasonable identity bounds can generate very long runtimes.
However, merging can be useful in some settings and is enabled with `-M, --merge-mappings`.
Approximate mapping (equivalent to `MashMap2`) can be obtained with `-m, --approx-map`.
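
As a rough illustration of what the extended CIGAR in the `cg:Z:*` tag makes possible, the sketch below tallies operation lengths from one tag. The helper name `cigar_op_counts` and the example tag are made up for illustration, and the identity shown (matches divided by all `=`, `X`, `I`, and `D` columns) is only one of several possible definitions, not necessarily the one wfmash reports.

```python
import re
from collections import Counter

# One CIGAR operation: a run length followed by an operation character.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def cigar_op_counts(tag: str) -> Counter:
    """Sum the run lengths of each CIGAR operation in a cg:Z: tag."""
    prefix = "cg:Z:"
    if not tag.startswith(prefix):
        raise ValueError("expected a cg:Z: tag")
    counts = Counter()
    for length, op in CIGAR_OP.findall(tag[len(prefix):]):
        counts[op] += int(length)
    return counts


# Hypothetical tag: '=' is a match, 'X' a mismatch, 'I'/'D' are indels.
counts = cigar_op_counts("cg:Z:10=1X5=2I8=")
matches = counts["="]
columns = sum(counts[op] for op in "=XID")
print(matches, columns, matches / columns)
```

Because extended CIGARs separate `=` from `X`, mismatches can be counted directly without consulting the sequences.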

Sketching, mapping, and alignment are all run in parallel using a configurable number of threads.
The number of threads must be set manually, using `-t`, and defaults to 1.
@@ -35,31 +29,15 @@ Seven parameters shape the length, number, identity, and alignment divergence of

The first three affect the structure of the mashmap2 mappings:

* `-s[N], --segment-length=[N]` is the length of the mapped and aligned segment
* `-s[N], --segment-length=[N]` is the length of the mapped and aligned segment (when `-N` is not set)
* `-N, --no-split` avoids splitting queries into segments, and instead maps them in their full length
* `-p[%], --map-pct-id=[%]` is the percentage identity minimum in the _mapping_ step
* `-n[N], --n-secondary=[N]` is the maximum number of mappings (and alignments) to report for each segment

### alignment settings

The last four essential parameters control the WFA alignment process and filter its output.

WF-min and WF-diff prune unlikely solutions from the set in consideration:

* `-l[N], --wf-min=[N]` is the number of wavefronts required to trigger reduction
* `-d[N], --wf-diff=[N]` prunes wavefronts that are more than WF-diff cells (on the diagonal) behind the max wavefront

The exact WFA may be computed if desired, which requires more time and memory but is equivalent to affine Needleman-Wunsch.
(Note that WFA already has adaptive features due to its formulation.)

* `-e, --exact-wfa` compute the exact WFA, don't use adaptive wavefront reduction

An alignment identity filter can be used to remove very low-quality alignments:

* `-a[N], --align-wf-id=[N]` is a minimum identity metric used to filter alignments (defaults to `-p` if unset)
* `-n[N], --n-secondary=[N]` is the maximum number of mappings (and alignments) to report for each segment above `segment-length` (the number of mappings for sequences shorter than the segment length is defined by `-S[N], --n-short-secondary=[N]`, and defaults to 1)

### all-to-all mapping

During all-to-all mapping, `-X` can additionally help us by removing self mappings from the reported set.
Together, these settings allow us to precisely define an alignment space to consider.
During all-to-all mapping, `-X` can additionally help us by removing self mappings from the reported set, and `-Y` extends this capability to prevent mapping between sequences with the same name prefix.
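
The effect of these two options can be pictured as a filter over PAF records, where column 1 holds the query name and column 6 the target name. This is a sketch for illustration only: wfmash applies the filtering during mapping rather than on its output, the function `keep_mapping` is hypothetical, and treating the prefix as the text before the first occurrence of a delimiter such as `#` is an assumption, since the text above does not define what counts as a name prefix.

```python
def keep_mapping(paf_line: str, skip_self: bool = True, prefix_delim: str = "") -> bool:
    """Decide whether a PAF record survives self/prefix filtering.

    skip_self mimics the effect described for -X.
    prefix_delim mimics the effect described for -Y; the delimiter-based
    definition of a prefix is an assumption made for this sketch.
    """
    fields = paf_line.rstrip("\n").split("\t")
    query_name, target_name = fields[0], fields[5]
    if skip_self and query_name == target_name:
        return False
    if prefix_delim:
        query_prefix = query_name.split(prefix_delim)[0]
        target_prefix = target_name.split(prefix_delim)[0]
        if query_prefix == target_prefix:
            return False
    return True


def make_record(query_name: str, target_name: str) -> str:
    """Build a toy 12-column PAF line with placeholder coordinates."""
    return "\t".join(
        [query_name, "100", "0", "100", "+", target_name, "100", "0", "100", "100", "100", "60"]
    )


print(keep_mapping(make_record("chr1", "chr1")))
print(keep_mapping(make_record("sampleA#chr1", "sampleA#chr2"), prefix_delim="#"))
print(keep_mapping(make_record("sampleA#chr1", "sampleB#chr1"), prefix_delim="#"))
```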

## examples

@@ -84,13 +62,21 @@ wfmash -X query.fa query.fa >aln.paf
## sequence indexing

`wfmash` provides a progress log that estimates time to completion.

This depends on determining the total query sequence length.
To prevent lags when starting a mapping process, users should apply `samtools faidx` to index query and target FASTA sequences.
The `.fai` indexes are then used to quickly compute the sum of query lengths.
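
To show why the index makes this fast, the sketch below sums sequence lengths straight from a `.fai` file, whose second tab-separated column is the sequence length. It only illustrates the idea and is not wfmash's code; the function name `total_length_from_fai` and the demo records are made up for the example.

```python
import os
import tempfile


def total_length_from_fai(fai_path: str) -> int:
    """Sum the sequence lengths (column 2) listed in a FASTA .fai index."""
    total = 0
    with open(fai_path) as fai:
        for line in fai:
            if line.strip():
                total += int(line.split("\t")[1])
    return total


# Demo with a throwaway index holding two made-up records
# (name, length, offset, bases per line, bytes per line).
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "query.fa.fai")
    with open(demo_path, "w") as handle:
        handle.write("seq1\t1000\t6\t60\t61\n")
        handle.write("seq2\t250\t1030\t60\t61\n")
    print(total_length_from_fai(demo_path))
```

Reading the index touches one short line per sequence, so the total is available without scanning the FASTA file itself.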

## installation

Follow [`INSTALL.txt`](INSTALL.txt) to compile and install wfmash.
The build is orchestrated with cmake:

```
cmake -H. -Bbuild && cmake --build build -- -j 16
```

The `wfmash` binary will be in `build/bin`.
To clean up, just remove the build directory.

## <a name="publications"></a>publications

@@ -99,3 +85,5 @@ Follow [`INSTALL.txt`](INSTALL.txt) to compile and install wfmash.
- **Chirag Jain, Sergey Koren, Alexander Dilthey, Adam M. Phillippy, and Srinivas Aluru**. ["A Fast Adaptive Algorithm for Computing Whole-Genome Homology Maps"](https://doi.org/10.1093/bioinformatics/bty597). *Bioinformatics (ECCB issue)*, 2018.

- **Chirag Jain, Alexander Dilthey, Sergey Koren, Srinivas Aluru, and Adam M. Phillippy**. ["A fast approximate algorithm for mapping long reads to large reference databases."](https://link.springer.com/chapter/10.1007/978-3-319-56970-3_5) In *International Conference on Research in Computational Molecular Biology*, Springer, Cham, 2017.

- **Martin Šošić and Mile Šikić** ["Edlib: a C/C ++ library for fast, exact sequence alignment using edit distance"](https://doi.org/10.1093/bioinformatics/btw753), *Bioinformatics*, 2017.
1 change: 0 additions & 1 deletion bootstrap.sh

This file was deleted.

57 changes: 0 additions & 57 deletions configure.ac

This file was deleted.

6 changes: 2 additions & 4 deletions scripts/paf2dotplot
@@ -818,7 +818,7 @@ sub WriteGP ($$)
}
else {
$xrange = 0;
print GFILE "set bmargin 5\n";
print GFILE "set bmargin 15\n";
print GFILE "set xtics rotate \( \\\n";
foreach $xlabel ( sort { $rref->{$a}[0] <=> $rref->{$b}[0] } @refk ) {
$xrange += $rref->{$xlabel}[1];
@@ -827,7 +827,6 @@ sub WriteGP ($$)
print GFILE " \"$dir$xlabel\" $tic.0, \\\n";
}
print GFILE " \"\" $xrange.0 \\\n\)\n";
$xlabel = "REF";
}

#-- set tics, determine labels, ranges (qry)
@@ -841,7 +840,7 @@ sub WriteGP ($$)
}
else {
$yrange = 0;
print GFILE "set lmargin 5\n";
print GFILE "set lmargin 35\n";
print GFILE "set ytics \( \\\n";
foreach $ylabel ( sort { $qref->{$a}[0] <=> $qref->{$b}[0] } @qryk ) {
$yrange += $qref->{$ylabel}[1];
@@ -850,7 +849,6 @@ sub WriteGP ($$)
print GFILE " \"$dir$ylabel\" $tic.0, \\\n";
}
print GFILE " \"\" $yrange.0 \\\n\)\n";
$ylabel = "QRY";
}

#-- determine borders