diff --git a/.travis.yml b/.travis.yml
new file mode 100644
index 00000000..b355f1d8
--- /dev/null
+++ b/.travis.yml
@@ -0,0 +1,15 @@
+language: c++
+
+os: linux
+
+dist: bionic
+
+compiler: gcc
+
+before_install:
+- sudo apt-get install -y valgrind
+
+script:
+- make
+- export PATH=$PWD/bin:$PATH
+- git clone https://github.com/frederic-mahe/swarm-tests.git && cd swarm-tests && bash ./run_all_tests.sh | tee tests.log && ! grep -q FAIL tests.log
diff --git a/README.md b/README.md
index dc23c5f7..c748594e 100644
--- a/README.md
+++ b/README.md
@@ -1,3 +1,5 @@
+[![Build Status](https://travis-ci.org/torognes/swarm.svg?branch=swarm3)](https://travis-ci.org/torognes/swarm)
+
 # swarm
 
 A robust and fast clustering method for amplicon-based studies.
@@ -16,7 +18,18 @@ To help users, we describe
 starting from raw fastq files, clustering with **swarm** and producing
 a filtered OTU table.
 
-swarm 2.0 introduces several novelties and improvements over swarm
+swarm 3.0 introduces:
+* a much faster default algorithm,
+* a reduced memory footprint,
+* binaries for Windows x86-64, GNU/Linux ARM 64, and GNU/Linux POWER8,
+* updated, hardened, and thoroughly tested code.
+
+Please note that:
+* strict dereplication of input sequences is now mandatory,
+* the \-\-seeds option (\-w) now outputs results sorted by decreasing
+  abundance, and then by alphabetical order of sequence labels.
+
+swarm 2.0 introduced several novelties and improvements over swarm
 1.0:
 * built-in breaking phase now performed automatically,
 * possibility to output OTU representatives in fasta format (option
@@ -24,13 +37,13 @@ swarm 2.0 introduces several novelties and improvements over swarm
 * fast algorithm now used by default for *d* = 1 (linear time
   complexity),
 * a new option called *fastidious* that refines *d* = 1 results and
-  reduces the number of small OTUs,
+  reduces the number of small OTUs.
 
 ## Common misconceptions
 
 **swarm** is a single-linkage clustering method, with some superficial
-  similarities with other clustering methods (e.g.,
-  [Huse et al, 2010](http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2909393/)). **swarm**'s
+  similarities with other clustering methods (e.g., [Huse et al,
+  2010](http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2909393/)). **swarm**'s
   novelty is its iterative growth process and the use of sequence
   abundance values to delineate OTUs. **swarm** properly delineates
   large OTUs (high recall), and can distinguish OTUs with as little as
@@ -76,8 +89,8 @@ cgtcgtcgtcgtcgt
 
 where sequence identifiers are unique and end with a value indicating
 the number of occurrences of the sequence (e.g., `_1000`). Alternative
-format is possible with the option `-z`, please see the
-[user manual](https://github.com/torognes/swarm/blob/master/man/swarm_manual.pdf). Swarm
+format is possible with the option `-z`, please see the [user
+manual](https://github.com/torognes/swarm/blob/master/man/swarm_manual.pdf). Swarm
 **requires** each fasta entry to present a number of occurrences to
 work properly. That crucial information can be produced during the
 [dereplication](#dereplication-mandatory) step.
@@ -87,7 +100,7 @@ Use `swarm -h` to get a short help, or see the
   for a complete description of input/output formats and command line
   options.
 
-The memory footprint of **swarm** is roughly 1.6 times the size of the
+The memory footprint of **swarm** is roughly 0.6 times the size of the
 input fasta file. When using the fastidious option, memory footprint
 can increase significantly. See options `-c` and `-y` to control and
 cap swarm's memory consumption.
@@ -210,15 +223,10 @@ from two different sets have the same hash code, it means that the
 sequences they represent are identical.
 
 If for some reason your fasta entries don't have abundance values, and
-you still want to run swarm, you can easily add fake abundance values:
-
-```sh
-sed '/^>/ s/$/_1/' amplicons.fasta > amplicons_with_abundances.fasta
-```
-
-Alternatively, you may specify a default abundance value with
-**swarm**'s `--append-abundance` (`-a`) option to be used when
-abundance information is missing from a sequence.
+you still want to run swarm (not recommended), you can specify a
+default abundance value with **swarm**'s `--append-abundance` (`-a`)
+option to be used when abundance information is missing from a
+sequence.
 
 
 ### Launch swarm ###
@@ -305,15 +313,6 @@ rm "${AMPLICONS}"
 ```
 
 
-## Troubleshooting ##
-
-If **swarm** exits with an error message saying `This program
-requires a processor with SSE2`, your computer is too old to run
-**swarm** (or based on a non x86-64 architecture). **swarm** only runs
-on CPUs with the SSE2 instructions, i.e. most Intel and AMD CPUs
-released since 2004.
-
-
 ## Citation ##
 
 To cite **swarm**, please refer to:
@@ -333,7 +332,7 @@ You are welcome to:
 
 * submit suggestions and bug-reports at: https://github.com/torognes/swarm/issues
 * send a pull request on: https://github.com/torognes/swarm/
-* compose a friendly e-mail to: Frédéric Mahé <mahe@rhrk.uni-kl.de> and Torbjørn Rognes <torognes@ifi.uio.no>
+* compose a friendly e-mail to: Frédéric Mahé <frederic.mahe@cirad.fr> and Torbjørn Rognes <torognes@ifi.uio.no>
 
 
 ## Third-party pipelines ##
@@ -356,7 +355,7 @@ You are welcome to:
 If you want to try alternative free and open-source clustering
 methods, here are some links:
 
-* [VSEARCH](https://github.com/torognes/vsearch)
+* [vsearch](https://github.com/torognes/vsearch)
 * [Oligotyping](http://merenlab.org/projects/oligotyping/)
 * [DNAclust](http://dnaclust.sourceforge.net/)
 * [Sumaclust](http://metabarcoding.org/sumatra)
@@ -365,6 +364,11 @@ methods, here are some links:
 
 ## Version history ##
 
+### version 3.0 ###
+
+**swarm** 3.0 is much faster when _d_ = 1, and consumes less memory.
+Strict dereplication is now mandatory.
+
 ### version 2.2.2 ###
 
 **swarm** 2.2.2 fixes a bug causing Swarm to wait forever in very rare
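
Review note on "strict dereplication is now mandatory": the check below is a minimal sketch of what that requirement means for an input file, not swarm's own implementation. The helper name `find_duplicate_sequences` is hypothetical. It assumes, following the manual, that sequences are compared case-insensitively and that 'U' is treated as 'T'.

```python
import io


def find_duplicate_sequences(fasta_handle):
    """Return (label, first_label) pairs for sequences seen more than once.

    Hypothetical helper, not part of swarm. Sequences are uppercased and
    'U' is replaced with 'T' before comparison (assumption taken from the
    manual's description of the input format).
    """
    seen = {}        # normalized sequence -> label of its first occurrence
    duplicates = []  # (label, label of first occurrence)
    label, chunks = None, []

    def flush():
        # Register the entry collected so far, if any
        if label is None:
            return
        sequence = "".join(chunks).upper().replace("U", "T")
        if sequence in seen:
            duplicates.append((label, seen[sequence]))
        else:
            seen[sequence] = label

    for line in fasta_handle:
        line = line.strip()
        if line.startswith(">"):
            flush()
            label, chunks = line[1:], []
        elif line:
            chunks.append(line)  # sequences may span several lines
    flush()
    return duplicates


example = io.StringIO(">a_3\nACGT\n>b_2\nacgu\n>c_1\nAC\nGA\n")
print(find_duplicate_sequences(example))
```

An empty list suggests the file is strictly dereplicated under these assumptions; any pair returned points at an entry that repeats an earlier sequence.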
diff --git a/man/swarm.1 b/man/swarm.1
index a22101bb..28fa5b08 100644
--- a/man/swarm.1
+++ b/man/swarm.1
@@ -1,5 +1,5 @@
 .\" ============================================================================
-.TH swarm 1 "December 12, 2017" "version 2.2.2" "USER COMMANDS"
+.TH swarm 1 "October 24, 2019" "version 3.0.0" "USER COMMANDS"
 .\" ============================================================================
 .SH NAME
 swarm \(em find clusters of nearly-identical nucleotide amplicons
@@ -110,8 +110,9 @@ results obtained during the clustering process allows \fBswarm\fR to
 avoid most of the amplicon comparisons needed in a naïve approach. To
 speed up the remaining amplicon comparisons, \fBswarm\fR implements an
 extremely fast Needleman-Wunsch algorithm making use of the Streaming
-SIMD Extensions (SSE2) of modern x86-64 CPUs. If SSE2 instructions are
-not available, \fBswarm\fR exits with an error message.
+SIMD Extensions (SSE2) of modern x86-64 CPUs, or NEON instructions of
+ARM-64 CPUs. On x86-64, if SSE2 instructions are not available,
+\fBswarm\fR exits with an error message.
 .PP
 \fBswarm\fR can read nucleotide amplicons in fasta format from a
 normal file or from the standard input (using a pipe or a
@@ -138,7 +139,19 @@ defined as a string of [ACGT] or [ACGU] symbols (case insensitive, 'U'
 is replaced with 'T' internally), starting after the end of the header
 line and ending before the next header line or the file end;
 \fBswarm\fR silently removes newline symbols ('\\n' or '\\r') and
-exits with an error message if any other symbol is present.
+exits with an error message if any other symbol is present. Lastly, if
+sequences are not all unique, i.e. were not properly dereplicated,
+\fBswarm\fR exits with an error message.
+.PP
+Clusters are written to output files (specified with \-i, \-o, \-s and
+\-u) by decreasing abundance of their seed sequences, and then by
+alphabetical order of seed sequence labels. An exception to that is
+the \-w (\-\-seeds) output, which is sorted by decreasing \fIcluster
+abundance\fR (sum of abundances of all sequences in the cluster), and
+then by alphabetical order of seed sequence labels. This is
+particularly useful for post-clustering steps, such as \fIde novo\fR
+chimera detection, that require clusters to be sorted by decreasing
+abundances.
 .\" ----------------------------------------------------------------------------
 .SS General options
 .TP 9
@@ -286,7 +299,7 @@ in situations where writing to \fIstandard error\fR is problematic
 output clustering results to \fIfilename\fR. Results consist of a list
 of OTUs, one OTU per line. An OTU is a list of amplicon headers
 separated by spaces. That output format can be modified by the option
-\-\-mothur (\-r). Default is to write to standard output.
+\-\-mothur (\-r). Default is to write to \fIstandard output\fR.
 .TP
 .B \-r\fP,\fB\ \-\-mothur
 output clustering results in a format compatible with Mothur. That
@@ -305,7 +318,7 @@ total abundance of amplicons in the OTU,
 .IP \n+[step].
 label of the initial seed (header without abundance annotations),
 .IP \n+[step].
-initial seed abundance,
+abundance of the initial seed,
 .IP \n+[step].
 number of amplicons with an abundance of 1 in the OTU,
 .IP \n+[step].
@@ -363,13 +376,15 @@ output OTU representative sequences to \fIfilename\fR in fasta
 format. The abundance value of each OTU representative is the sum of
 the abundances of all the amplicons in the OTU. Fasta headers are
 formated as follows: '>label_\fIinteger\fR',
-or '>label;size=\fIinteger\fR;' if the \-z option is used.
+or '>label;size=\fIinteger\fR;' if the \-z option is used. Sequences
+are uppercased and sorted by decreasing abundance, and then by
+alphabetical order of sequence labels.
 .TP
 .B \-z\fP,\fB\ \-\-usearch\-abundance
 accept amplicon abundance values in usearch/vsearch's style
 (>label;size=\fIinteger\fR[;]). That option influences the abundance
-annotation style used in swarm's standard output (\-o), as well as the
-ouput of options \-r, \-u and \-w.
+annotation style used in swarm's \fIstandard output\fR (\-o), as well
+as the output of options \-r, \-u and \-w.
 .LP
 .\" ----------------------------------------------------------------------------
 .SS Pairwise alignment advanced options
@@ -410,7 +425,7 @@ zcat myfile.fasta.gz | \\
         \-t 4 \\
         \-f \\
         \-w myfile.representatives.fasta \\
-        \-o myfile.swarms
+        \-o /dev/null
 .RE
 .EE
 .\" ============================================================================
@@ -475,7 +490,7 @@ License along with this program.  If not, see
 .\" ============================================================================
 .SH SEE ALSO
 \fBswipe\fR, an extremely fast Smith-Waterman database search tool by
-Torbjørn Rognes (available from
+Torbjørn Rognes (available at
 .UR https://github.com/torognes/swipe
 .UE ).
 .PP
@@ -492,8 +507,17 @@ New features and important modifications of \fBswarm\fR (short lived
 or minor bug releases are not mentioned):
 .RS
 .TP
+.BR v3.0.0\~ "released October 24, 2019"
+Version 3.0.0 introduces a faster algorithm for \fId\fR = 1, and a
+reduced memory footprint. Swarm has been ported to Windows x86-64,
+GNU/Linux ARM 64, and GNU/Linux POWER8. Internal code has been
+modernized, hardened, and thoroughly tested. Strict dereplication of
+input sequences is now mandatory. The \-\-seeds option (\-w) now
+outputs results sorted by decreasing abundance, and then by
+alphabetical order of sequence labels.
+.TP
 .BR v2.2.2\~ "released December 12, 2017"
-Version 2.2.2 fixes a bug that would cause Swarm to wait forever in
+Version 2.2.2 fixes a bug that would cause swarm to wait forever in
 very rare cases when multiple threads were used.
 .TP
 .BR v2.2.1\~ "released October 27, 2017"
@@ -527,7 +551,7 @@ bug only applies when \fId\fR > 1.
 .BR v2.1.10\~ "released December 22, 2016"
 Version 2.1.10 fixes two bugs related to gap penalties of alignments.
 The first bug may lead to wrong aligments and similarity percentages
-reported in UCLUST (.uc) files. The second bug makes Swarm use a
+reported in UCLUST (.uc) files. The second bug makes swarm use a
 slightly higher gap extension penalty than specified. The default gap
 extension penalty used have actually been 4.5 instead of 4.
 .TP
@@ -679,10 +703,10 @@ not. Only basic SSE2 instructions are now required to run \fBswarm\fR.
 .TP
 .BR v1.2.4\~ "released January 30, 2014"
 Version 1.2.4 introduces an option \-\-break\-swarms to output all
-pairs of amplicons with \fId\fR differences to standard error. That
-option is used by the companion script `swarm_breaker.py` to refine
-\fBswarm\fR results. The syntax of the inline assembly code is changed
-for compatibility with more compilers.
+pairs of amplicons with \fId\fR differences to \fIstandard
+error\fR. That option is used by the companion script
+`swarm_breaker.py` to refine \fBswarm\fR results. The syntax of the
+inline assembly code is changed for compatibility with more compilers.
 .TP
 .BR v1.2\~ "released May 16, 2013"
 Version 1.2 greatly improves speed by using alignment-free comparisons
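
Review note on the two output orderings described in the manual (seed abundance for \-i, \-o, \-s and \-u; cluster abundance for \-w): the sketch below illustrates both with made-up data. The function `order_clusters` and the tuple layout are hypothetical. It assumes plain ASCII labels, for which Python string comparison matches a byte-wise comparison such as `strcmp`.

```python
def order_clusters(clusters, by_cluster_abundance=False):
    """Sort clusters by decreasing abundance, then by seed label.

    clusters: list of (seed_label, abundances) tuples, where abundances
    lists the abundance of each member and the seed comes first.
    Hypothetical data layout, for illustration only.
    """
    def key(cluster):
        seed_label, abundances = cluster
        # Either the sum over all members, or the seed abundance alone
        mass = sum(abundances) if by_cluster_abundance else abundances[0]
        # Negate the abundance to sort it in decreasing order while
        # keeping labels in increasing (alphabetical) order
        return (-mass, seed_label)

    return sorted(clusters, key=key)


clusters = [
    ("seq_b", [10, 1, 1]),  # seed abundance 10, cluster abundance 12
    ("seq_a", [10, 5]),     # seed abundance 10, cluster abundance 15
    ("seq_c", [12]),        # seed abundance 12, cluster abundance 12
]

by_seed = [label for label, _ in order_clusters(clusters)]
by_cluster = [label for label, _ in order_clusters(clusters, True)]
print(by_seed)
print(by_cluster)
```

With this data the two orderings differ: seq_c leads when sorting by seed abundance, while seq_a leads when sorting by cluster abundance, and ties fall back to the label.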
diff --git a/man/swarm_manual.pdf b/man/swarm_manual.pdf
index c0b705c6..a460296e 100644
Binary files a/man/swarm_manual.pdf and b/man/swarm_manual.pdf differ
diff --git a/scripts/amplicon_contingency_table.py b/scripts/amplicon_contingency_table.py
index dc40bac1..1753f21b 100644
--- a/scripts/amplicon_contingency_table.py
+++ b/scripts/amplicon_contingency_table.py
@@ -1,15 +1,13 @@
-#!/usr/bin/env python
+#!/usr/bin/env python3
 # -*- coding: utf-8 -*-
 """
     Read all fasta files and build a sorted amplicon contingency
-    table. Usage: python amplicon_contingency_table.py samples_*.fas
+    table. Usage: python3 amplicon_contingency_table.py samples_*.fas
 """
 
-from __future__ import print_function
-
-__author__ = "Frédéric Mahé <mahe@rhrk.uni-kl.fr>"
-__date__ = "2016/03/12"
-__version__ = "$Revision: 2.1"
+__author__ = "Frédéric Mahé <frederic.mahe@cirad.fr>"
+__date__ = "2019/09/24"
+__version__ = "$Revision: 3.0"
 
 import os
 import sys
@@ -35,7 +33,7 @@ def fasta_parse():
         sample = os.path.basename(fasta_file)
         sample = os.path.splitext(sample)[0]
         samples[sample] = samples.get(sample, 0) + 1
-        with open(fasta_file, "rU") as fasta_file:
+        with open(fasta_file, "r") as fasta_file:
             for line in fasta_file:
                 if line.startswith(">"):
                     amplicon, abundance = line.strip(">;\n").split(separator)
@@ -65,7 +63,7 @@ def main():
     all_amplicons, amplicons2samples, samples = fasta_parse()
 
     # Sort amplicons by decreasing abundance (and by amplicon name)
-    sorted_all_amplicons = sorted(all_amplicons.iteritems(),
+    sorted_all_amplicons = sorted(iter(all_amplicons.items()),
                                   key=operator.itemgetter(1, 0))
     sorted_all_amplicons.reverse()
 
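
Review note on the sorting call above: in Python 3, `dict.items()` returns a view that `sorted()` accepts directly, so the `iter()` wrapper is not required. The sketch below, using made-up amplicon names, suggests that a single call with `reverse=True` gives the same result as sorting and then reversing, given that dictionary keys are unique and no two items can therefore share a sort key.

```python
import operator

# Made-up data: amplicon name -> total abundance
all_amplicons = {"amp_b": 5, "amp_a": 5, "amp_c": 9}

# Form used in the patch: sort in increasing order, then reverse
as_in_patch = sorted(iter(all_amplicons.items()),
                     key=operator.itemgetter(1, 0))
as_in_patch.reverse()

# Compact form: one call, decreasing order on (abundance, name)
compact = sorted(all_amplicons.items(),
                 key=operator.itemgetter(1, 0),
                 reverse=True)

print(as_in_patch)
print(compact)
```

Note that both forms order ties by decreasing amplicon name (amp_b before amp_a here), which is worth keeping in mind when comparing the table with swarm's own alphabetical tie-breaking.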
diff --git a/scripts/graph_plot.py b/scripts/graph_plot.py
index 8fedeea7..3922ad46 100644
--- a/scripts/graph_plot.py
+++ b/scripts/graph_plot.py
@@ -1,29 +1,24 @@
-#!/usr/bin/env python
+#!/usr/bin/env python3
 # -*- coding: utf-8 -*-
 """
     Visualize the internal structure of a swarm (color vertices by
-    abundance). Requires the module igraph and python 2.7+.
-
-    Limitations: amplicons grafted with the fastidious option will be
-    discarded and will not be visualized.
+    abundance). Requires the module igraph and python 3.
 """
 
-from __future__ import print_function
-
-__author__ = "Frédéric Mahé <mahe@rhrk.uni-kl.fr>"
-__date__ = "2016/11/09"
-__version__ = "$Revision: 3.1"
+__author__ = "Frédéric Mahé <frederic.mahe@cirad.fr>"
+__date__ = "2019/09/24"
+__version__ = "$Revision: 4.0"
 
 import sys
 import os.path
 from igraph import Graph, plot
 from optparse import OptionParser
 
-#*****************************************************************************#
+# *************************************************************************** #
 #                                                                             #
 #                                  Functions                                  #
 #                                                                             #
-#*****************************************************************************#
+# *************************************************************************** #
 
 
 def option_parse():
@@ -76,7 +71,7 @@ def parse_files(swarms, internal_structure, OTU, drop):
     """
     # List amplicon ids and abundances
     amplicons = list()
-    with open(swarms, "rU") as swarms:
+    with open(swarms, "r") as swarms:
         for i, swarm in enumerate(swarms):
             if i == OTU - 1:
                 # Deal with ";size=" in a rather clumsy way... but it works
@@ -100,7 +95,7 @@ def parse_files(swarms, internal_structure, OTU, drop):
 
     # List pairwise relations
     relations = list()
-    with open(internal_structure, "rU") as internal_structure:
+    with open(internal_structure, "r") as internal_structure:
         print("Parsing amplicon relationships", file=sys.stdout)
         for line in internal_structure:
             # Get the first four elements of the line
@@ -138,7 +133,7 @@ def build_graph(amplicons, relations):
 
     amplicon_ids = [amplicon[0] for amplicon in amplicons]
     abundances = [int(amplicon[1]) for amplicon in amplicons]
-    minimum, maximum = min(abundances), max(abundances)
+    maximum = max(abundances)
 
     # Determine canvas size
     if len(abundances) < 500:
@@ -214,11 +209,11 @@ def main():
     return
 
 
-#*****************************************************************************#
+# *************************************************************************** #
 #                                                                             #
 #                                     Body                                    #
 #                                                                             #
-#*****************************************************************************#
+# *************************************************************************** #
 
 if __name__ == '__main__':
 
diff --git a/src/Makefile b/src/Makefile
index 4bf7773e..867ddfa8 100644
--- a/src/Makefile
+++ b/src/Makefile
@@ -1,54 +1,86 @@
 # SWARM
 #
-# Copyright (C) 2012-2017 Torbjorn Rognes and Frederic Mahe
-# 
+# Copyright (C) 2012-2019 Torbjorn Rognes and Frederic Mahe
+#
 # This program is free software: you can redistribute it and/or modify
 # it under the terms of the GNU Affero General Public License as
 # published by the Free Software Foundation, either version 3 of the
 # License, or (at your option) any later version.
-# 
+#
 # This program is distributed in the hope that it will be useful,
 # but WITHOUT ANY WARRANTY; without even the implied warranty of
 # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 # GNU Affero General Public License for more details.
-# 
+#
 # You should have received a copy of the GNU Affero General Public License
 # along with this program.  If not, see <http://www.gnu.org/licenses/>.
-# 
-# Contact: Torbjorn Rognes <torognes@ifi.uio.no>, 
-# Department of Informatics, University of Oslo, 
+#
+# Contact: Torbjorn Rognes <torognes@ifi.uio.no>,
+# Department of Informatics, University of Oslo,
 # PO Box 1080 Blindern, NO-0316 Oslo, Norway
 
 # Makefile for SWARM
 
 # Profiling options
-#COMMON=-pg -g
-COMMON=-g
+# PROFILING=-pg
+PROFILING=
+
+# Machine specific
+MACHINE=$(shell uname -m)
+ifeq ($(MACHINE), x86_64)
+	ARCHOPT = -march=x86-64 -mtune=generic -std=c++11
+	EXTRAOBJ = ssse3.o
+else ifeq ($(MACHINE), aarch64)
+	ARCHOPT = -march=armv8-a+simd -mtune=generic \
+	          -flax-vector-conversions -std=c++11
+	EXTRAOBJ =
+else ifeq ($(MACHINE), ppc64le)
+	ARCHOPT = -mcpu=power8 -std=gnu++11
+	EXTRAOBJ =
+endif
 
-LIBS=-lpthread
-LINKFLAGS=$(COMMON)
+# OS specific
+ifeq ($(CXX), x86_64-w64-mingw32-g++)
+	LIBS = -lpthread -lpsapi
+	WARNOPT =
+	LINKOPT = -static
+else
+	LIBS = -lpthread
+	WARNOPT = -pedantic
+	LINKOPT =
+endif
 
-CXX=g++
-WARNINGS=-Wall -Wsign-compare -Wextra -pedantic -Wno-long-long
-CXXFLAGS=$(COMMON) $(WARNINGS) -O3 -msse2 -mtune=core2 -Icityhash
+WARNINGS = -Wall -Wextra $(WARNOPT)
+#	-Weverything -Wno-c++98-compat -Wno-c++98-compat-pedantic
+
+COMMON=$(PROFILING) -g -flto -O3 $(ARCHOPT)
+
+LINKFLAGS=$(COMMON) $(LINKOPT)
+
+CXXFLAGS=$(COMMON) $(WARNINGS)
 
 PROG=swarm
 
 OBJS=swarm.o db.o search8.o search16.o nw.o matrix.o util.o scan.o \
-	algo.o algod1.o qgram.o ssse3.o derep.o arch.o cityhash/city.o
+	algo.o algod1.o qgram.o derep.o arch.o city.o \
+	zobrist.o bloompat.o bloomflex.o variants.o hashtable.o \
+	$(EXTRAOBJ)
 
-DEPS=Makefile swarm.h bitmap.h bloom.h cityhash/config.h cityhash/city.h \
-	threads.h
+DEPS=Makefile swarm.h city.h citycrc.h \
+	threads.h zobrist.h bloompat.h bloomflex.h variants.h hashtable.h
 
 all : $(PROG)
 
-swarm : $(OBJS)
+swarm : $(OBJS) $(DEPS)
 	$(CXX) $(LINKFLAGS) -o $@ $(OBJS) $(LIBS)
 	mkdir -p ../bin
 	cp -a swarm ../bin
 
 clean :
-	rm -rf swarm *.o *~ ../bin/ gmon.out cityhash/*.o ../man/*~ ../*~
+	rm -rf swarm *.o *~ gmon.out
+
+%.o : %.cc $(DEPS)
+	$(CXX) $(CXXFLAGS) -c -o $@ $<
 
 ssse3.o : ssse3.cc $(DEPS)
 	$(CXX) $(CXXFLAGS) -mssse3 -c -o $@ $<
diff --git a/src/algo.cc b/src/algo.cc
index 3a8af887..68a421af 100644
--- a/src/algo.cc
+++ b/src/algo.cc
@@ -1,7 +1,7 @@
 /*
     SWARM
 
-    Copyright (C) 2012-2017 Torbjorn Rognes and Frederic Mahe
+    Copyright (C) 2012-2019 Torbjorn Rognes and Frederic Mahe
 
     This program is free software: you can redistribute it and/or modify
     it under the terms of the GNU Affero General Public License as
@@ -23,20 +23,18 @@
 
 #include "swarm.h"
 
-#define BITS 8
+static uint64_t count_comparisons_8;
+static uint64_t count_comparisons_16;
 
-static unsigned long count_comparisons_8;
-static unsigned long count_comparisons_16;
-
-static unsigned long targetcount;
-static unsigned long * targetindices;
-static unsigned long * targetampliconids;
-static unsigned long * scores;
-static unsigned long * diffs;
-static unsigned long * alignlengths;
-static unsigned long * qgramamps;
-static unsigned long * qgramdiffs;
-static unsigned long * qgramindices;
+static uint64_t targetcount;
+static uint64_t * targetindices;
+static uint64_t * targetampliconids;
+static uint64_t * scores;
+static uint64_t * diffs;
+static uint64_t * alignlengths;
+static uint64_t * qgramamps;
+static uint64_t * qgramdiffs;
+static uint64_t * qgramindices;
 
 static struct ampliconinfo_s
 {
@@ -47,95 +45,130 @@ static struct ampliconinfo_s
   unsigned radius; /* actual diff from initial seed */
 } * amps;
 
-static unsigned long swarmed;
-static unsigned long seeded;
+static uint64_t swarmed;
+static uint64_t seeded;
+
+struct swarminfo_t
+{
+  uint64_t mass;
+  unsigned int seed;
+  int dummy; /* alignment padding only */
+};
+
+int compare_mass_seed(const void * a, const void * b);
+
+int compare_mass_seed(const void * a, const void * b)
+{
+  const swarminfo_t * x = static_cast<const struct swarminfo_t *>(a);
+  const swarminfo_t * y = static_cast<const struct swarminfo_t *>(b);
+
+  uint64_t m = x->mass;
+  uint64_t n = y->mass;
+
+  if (m > n)
+    return -1;
+  else if (m < n)
+    return +1;
+  else
+    return strcmp(db_getheader(x->seed), db_getheader(y->seed));
+}
 
 void algo_run()
 {
+  search_begin();
+
   count_comparisons_8 = 0;
   count_comparisons_16 = 0;
 
 #ifdef VERBOSE
-  unsigned long searches = 0;
-  unsigned long estimates = 0;
+  uint64_t searches = 0;
+  uint64_t estimates = 0;
 #endif
 
-  unsigned long largestswarm = 0;
+  uint64_t largestswarm = 0;
 
-  unsigned long maxgenerations = 0;
+  uint64_t maxgenerations = 0;
 
-  unsigned long amplicons = db_getsequencecount();
-  unsigned long longestamplicon = db_getlongestsequence();
+  uint64_t amplicons = db_getsequencecount();
+  uint64_t longestamplicon = db_getlongestsequence();
 
   db_qgrams_init();
 
   qgram_diff_init();
 
-  amps = (struct ampliconinfo_s *) xmalloc(amplicons * sizeof(struct ampliconinfo_s));
-
-  targetampliconids = (unsigned long *) xmalloc(amplicons * 
-                                                sizeof(unsigned long));
-  targetindices = (unsigned long *) xmalloc(amplicons * sizeof(unsigned long));
-  scores = (unsigned long *) xmalloc(amplicons * sizeof(unsigned long));
-  diffs = (unsigned long *) xmalloc(amplicons * sizeof(unsigned long));
-  alignlengths = (unsigned long *) xmalloc(amplicons * sizeof(unsigned long));
-
-  qgramamps = (unsigned long *) xmalloc(amplicons * sizeof(unsigned long));
-  qgramdiffs = (unsigned long *) xmalloc(amplicons * sizeof(unsigned long));
-  qgramindices = (unsigned long *) xmalloc(amplicons * sizeof(unsigned long));
-
-  unsigned long * hits = (unsigned long *) xmalloc(amplicons *
-                                                   sizeof(unsigned long));
-
-  unsigned long diff_saturation = MIN(255 / penalty_mismatch,
-                                      255 / (penalty_gapopen + 
-                                             penalty_gapextend));
-
-  unsigned char * dir = 0;
-  unsigned long * hearray = 0;
+  amps = static_cast<struct ampliconinfo_s *>
+    (xmalloc(amplicons * sizeof(struct ampliconinfo_s)));
+
+  targetampliconids = static_cast<uint64_t *>
+    (xmalloc(amplicons * sizeof(uint64_t)));
+  targetindices = static_cast<uint64_t *>
+    (xmalloc(amplicons * sizeof(uint64_t)));
+  scores = static_cast<uint64_t *>
+    (xmalloc(amplicons * sizeof(uint64_t)));
+  diffs = static_cast<uint64_t *>
+    (xmalloc(amplicons * sizeof(uint64_t)));
+  alignlengths = static_cast<uint64_t *>
+    (xmalloc(amplicons * sizeof(uint64_t)));
+
+  qgramamps = static_cast<uint64_t *>
+    (xmalloc(amplicons * sizeof(uint64_t)));
+  qgramdiffs = static_cast<uint64_t *>
+    (xmalloc(amplicons * sizeof(uint64_t)));
+  qgramindices = static_cast<uint64_t *>
+    (xmalloc(amplicons * sizeof(uint64_t)));
+
+  uint64_t * hits = static_cast<uint64_t *>
+    (xmalloc(amplicons * sizeof(uint64_t)));
+
+  uint64_t diff_saturation
+    = static_cast<uint64_t>(MIN(255 / penalty_mismatch,
+                                255 / (penalty_gapopen +
+                                       penalty_gapextend)));
+
+  unsigned char * dir = nullptr;
+  uint64_t * hearray = nullptr;
 
   if (uclustfile)
     {
-      dir = (unsigned char *) xmalloc(longestamplicon*longestamplicon);
-      hearray = (unsigned long *) xmalloc(2 * longestamplicon *
-                                          sizeof(unsigned long));
+      dir = static_cast<unsigned char *>
+        (xmalloc(longestamplicon*longestamplicon));
+      hearray = static_cast<uint64_t *>
+        (xmalloc(2 * longestamplicon * sizeof(uint64_t)));
     }
 
   /* set ampliconid for all */
-  for(unsigned long i=0; i<amplicons; i++)
-    {
-      amps[i].ampliconid = i;
-    }
+  for(unsigned int i=0; i<amplicons; i++)
+    amps[i].ampliconid = i;
 
   /* always search in 8 bit mode unless resolution is very high */
-  
-  unsigned long bits;
 
-  if ((unsigned long)opt_differences <= diff_saturation)
+  int bits;
+
+  if (static_cast<uint64_t>(opt_differences) <= diff_saturation)
     bits = 8;
   else
     bits = 16;
- 
+
   seeded = 0;
   swarmed = 0;
 
-  unsigned long swarmid = 0;
-  
+  unsigned int swarmid = 0;
+
   progress_init("Clustering:       ", amplicons);
   while (seeded < amplicons)
     {
 
       /* process each initial seed */
-      
+
       swarmid++;
 
-      unsigned long swarmsize = 0;
-      unsigned long amplicons_copies = 0;
-      unsigned long singletons = 0;
-      unsigned long hitcount = 0;
-      unsigned long maxradius = 0;
-      unsigned long maxgen = 0;
-      unsigned long seedindex;
+      uint64_t swarmsize = 0;
+      uint64_t amplicons_copies = 0;
+      uint64_t singletons = 0;
+      uint64_t hitcount = 0;
+      uint64_t maxradius = 0;
+      uint64_t maxgen = 0;
+      uint64_t seedindex;
 
       seedindex = seeded;
       seeded++;
@@ -143,11 +176,11 @@ void algo_run()
       amps[seedindex].swarmid = swarmid;
       amps[seedindex].generation = 0;
       amps[seedindex].radius = 0;
-     
-      unsigned long seedampliconid = amps[seedindex].ampliconid;
+
+      uint64_t seedampliconid = amps[seedindex].ampliconid;
       hits[hitcount++] = seedampliconid;
-      
-      unsigned long abundance = db_getabundance(seedampliconid);
+
+      uint64_t abundance = db_getabundance(seedampliconid);
       amplicons_copies += abundance;
       if (abundance == 1)
         singletons++;
@@ -160,9 +193,9 @@ void algo_run()
 
       targetcount = 0;
 
-      unsigned long listlen = 0;
+      uint64_t listlen = 0;
 
-      for(unsigned long i=0; i < amplicons-swarmed; i++)
+      for(uint64_t i=0; i < amplicons-swarmed; i++)
         {
           unsigned ampid = amps[swarmed+i].ampliconid;
           if ((opt_no_otu_breaking) || (db_getabundance(ampid) <= abundance))
@@ -177,23 +210,23 @@ void algo_run()
 #ifdef VERBOSE
       estimates += listlen;
 #endif
-      
-      for(unsigned long i=0; i < listlen; i++)
+
+      for(uint64_t i=0; i < listlen; i++)
         {
-          unsigned poolampliconid = qgramamps[i];
-          long diff = qgramdiffs[i];
-          amps[swarmed+i].diffestimate = diff;
-          if (diff <= opt_differences)
+          uint64_t poolampliconid = qgramamps[i];
+          uint64_t diff = qgramdiffs[i];
+          amps[swarmed+i].diffestimate = static_cast<unsigned int>(diff);
+          if (diff <= static_cast<uint64_t>(opt_differences))
             {
               targetindices[targetcount] = swarmed+i;
               targetampliconids[targetcount] = poolampliconid;
               targetcount++;
             }
         }
-      
+
       if (targetcount > 0)
         {
-          search_do(seedampliconid, targetcount, targetampliconids, 
+          search_do(seedampliconid, targetcount, targetampliconids,
                     scores, diffs, alignlengths, bits);
 #ifdef VERBOSE
           searches++;
@@ -204,11 +237,15 @@ void algo_run()
           else
             count_comparisons_16 += targetcount;
 
-          for(unsigned long t=0; t<targetcount; t++)
+          for(uint64_t t=0; t<targetcount; t++)
             {
 #if 0
-              printf("seed: %lu target: %lu score: %lu "
-                     "diffs: %lu alignlen: %lu bits: %lu\n",
+              printf("seed: %" PRIu64
+                     " target: %" PRIu64
+                     " score: %" PRIu64
+                     " diffs: %" PRIu64
+                     " alignlen: %" PRIu64
+                     " bits: %d\n",
                      seedampliconid,
                      targetampliconids[t],
                      scores[t],
@@ -217,19 +254,19 @@ void algo_run()
                      bits);
 #endif
 
-              unsigned diff = diffs[t];
+              uint64_t diff = diffs[t];
 
-              if (diff <= (unsigned long) opt_differences)
+              if (diff <= static_cast<uint64_t>(opt_differences))
                 {
-                  unsigned i = targetindices[t];
+                  uint64_t i = targetindices[t];
 
                   /* move the target (i) to the position (swarmed)
                      of the first unswarmed amplicon in the pool */
-                  
+
                   if (swarmed < i)
                     {
                       struct ampliconinfo_s temp = amps[i];
-                      for(unsigned j=i; j>swarmed; j--)
+                      for(uint64_t j=i; j>swarmed; j--)
                         {
                           amps[j] = amps[j-1];
                         }
@@ -240,7 +277,7 @@ void algo_run()
                   amps[swarmed].generation = 1;
                   if (maxgen < 1)
                     maxgen = 1;
-                  amps[swarmed].radius = diff;
+                  amps[swarmed].radius = static_cast<unsigned int>(diff);
                   if (diff > maxradius)
                     maxradius = diff;
 
@@ -249,11 +286,15 @@ void algo_run()
 
                   if (opt_internal_structure)
                     {
-                      fprint_id_noabundance(internal_structure_file, seedampliconid);
+                      fprint_id_noabundance(internal_structure_file,
+                                            seedampliconid);
                       fprintf(internal_structure_file, "\t");
-                      fprint_id_noabundance(internal_structure_file, poolampliconid);
-                      fprintf(internal_structure_file, "\t%u", diff);
-                      fprintf(internal_structure_file, "\t%lu\t1", swarmid);
+                      fprint_id_noabundance(internal_structure_file,
+                                            poolampliconid);
+                      fprintf(internal_structure_file, "\t%" PRIu64, diff);
+                      fprintf(internal_structure_file,
+                              "\t%u\t1",
+                              swarmid);
                       fprintf(internal_structure_file, "\n");
                     }
 
@@ -266,7 +307,7 @@ void algo_run()
 
                   swarmed++;
                 }
-            }  
+            }
 
 
           while (seeded < swarmed)
@@ -276,11 +317,11 @@ void algo_run()
 
               unsigned subseedampliconid;
               unsigned subseedradius;
-          
-              unsigned long subseedindex;
-              unsigned long subseedgeneration;
-              unsigned long subseedabundance;
-          
+
+              uint64_t subseedindex;
+              uint64_t subseedgeneration;
+              uint64_t subseedabundance;
+
               subseedindex = seeded;
               subseedampliconid = amps[subseedindex].ampliconid;
               subseedradius = amps[subseedindex].radius;
@@ -288,15 +329,16 @@ void algo_run()
               subseedabundance = db_getabundance(subseedampliconid);
 
               seeded++;
-          
+
               targetcount = 0;
-          
-              unsigned long subseedlistlen=0;
-              for(unsigned long i=swarmed; i<amplicons; i++)
+
+              uint64_t subseedlistlen=0;
+              for(uint64_t i=swarmed; i<amplicons; i++)
                 {
-                  unsigned long targetampliconid = amps[i].ampliconid;
-                  if ((amps[i].diffestimate <= subseedradius + opt_differences) &&
-                      ((opt_no_otu_breaking) || 
+                  uint64_t targetampliconid = amps[i].ampliconid;
+                  if ((amps[i].diffestimate <=
+                       subseedradius + opt_differences) &&
+                      ((opt_no_otu_breaking) ||
                        (db_getabundance(targetampliconid)
                         <= subseedabundance)))
                     {
@@ -306,24 +348,24 @@ void algo_run()
                     }
                 }
 
-              qgram_diff_fast(subseedampliconid, subseedlistlen, qgramamps, 
+              qgram_diff_fast(subseedampliconid, subseedlistlen, qgramamps,
                               qgramdiffs);
 
 #ifdef VERBOSE
               estimates += subseedlistlen;
 #endif
 
-              for(unsigned long i=0; i < subseedlistlen; i++)
-                if ((long)qgramdiffs[i] <= opt_differences)
+              for(uint64_t i=0; i < subseedlistlen; i++)
+                if (qgramdiffs[i] <= static_cast<uint64_t>(opt_differences))
                   {
                     targetindices[targetcount] = qgramindices[i];
                     targetampliconids[targetcount] = qgramamps[i];
                     targetcount++;
                   }
-          
+
               if (targetcount > 0)
                 {
-                  search_do(subseedampliconid, targetcount, targetampliconids, 
+                  search_do(subseedampliconid, targetcount, targetampliconids,
                             scores, diffs, alignlengths, bits);
 #ifdef VERBOSE
                   searches++;
@@ -333,15 +375,15 @@ void algo_run()
                     count_comparisons_8 += targetcount;
                   else
                     count_comparisons_16 += targetcount;
-            
-                  for(unsigned long t=0; t<targetcount; t++)
+
+                  for(uint64_t t=0; t<targetcount; t++)
                     {
-                      unsigned diff = diffs[t];
-              
-                      if (diff <= (unsigned long) opt_differences)
+                      uint64_t diff = diffs[t];
+
+                      if (diff <= static_cast<uint64_t>(opt_differences))
                         {
-                          unsigned i = targetindices[t];
-                
+                          uint64_t i = targetindices[t];
+
                           /* find correct position in list */
 
                           /* move the target (i) to the position (swarmed)
@@ -350,8 +392,8 @@ void algo_run()
                              but unseeded part of the list, so that the
                              swarmed amplicons are ordered by id */
 
-                          unsigned long targetampliconid = amps[i].ampliconid;
-                          unsigned pos = swarmed;
+                          uint64_t targetampliconid = amps[i].ampliconid;
+                          uint64_t pos = swarmed;
 
                           while ((pos > seeded) &&
                                  (amps[pos-1].ampliconid > targetampliconid) &&
@@ -361,7 +403,7 @@ void algo_run()
                           if (pos < i)
                             {
                               struct ampliconinfo_s temp = amps[i];
-                              for(unsigned j=i; j>pos; j--)
+                              for(uint64_t j=i; j>pos; j--)
                                 {
                                   amps[j] = amps[j-1];
                                 }
@@ -369,10 +411,12 @@ void algo_run()
                             }
 
                           amps[pos].swarmid = swarmid;
-                          amps[pos].generation = subseedgeneration + 1;
+                          amps[pos].generation
+                            = static_cast<unsigned int>(subseedgeneration + 1);
                           if (maxgen < amps[pos].generation)
                             maxgen = amps[pos].generation;
-                          amps[pos].radius = subseedradius + diff;
+                          amps[pos].radius
+                            = static_cast<unsigned int>(subseedradius + diff);
                           if (amps[pos].radius > maxradius)
                             maxradius = amps[pos].radius;
 
@@ -381,11 +425,15 @@ void algo_run()
 
                           if (opt_internal_structure)
                             {
-                              fprint_id_noabundance(internal_structure_file, subseedampliconid);
+                              fprint_id_noabundance(internal_structure_file,
+                                                    subseedampliconid);
                               fprintf(internal_structure_file, "\t");
-                              fprint_id_noabundance(internal_structure_file, poolampliconid);
-                              fprintf(internal_structure_file, "\t%u", diff);
-                              fprintf(internal_structure_file, "\t%lu\t%lu", swarmid, subseedgeneration + 1);
+                              fprint_id_noabundance(internal_structure_file,
+                                                    poolampliconid);
+                              fprintf(internal_structure_file, "\t%" PRIu64, diff);
+                              fprintf(internal_structure_file,
+                                      "\t%u\t%" PRIu64,
+                                      swarmid, subseedgeneration + 1);
                               fprintf(internal_structure_file, "\n");
                             }
 
@@ -398,11 +446,11 @@ void algo_run()
 
                           swarmed++;
                         }
-                    }  
+                    }
                 }
             }
         }
-      
+
       if (swarmsize > largestswarm)
         largestswarm = swarmsize;
 
@@ -412,43 +460,44 @@ void algo_run()
 
       if (uclustfile)
         {
-          fprintf(uclustfile, "C\t%lu\t%lu\t*\t*\t*\t*\t*\t",
+          fprintf(uclustfile, "C\t%u\t%" PRIu64 "\t*\t*\t*\t*\t*\t",
                   swarmid-1, swarmsize);
           fprint_id(uclustfile, seedampliconid);
           fprintf(uclustfile, "\t*\n");
-          
-          fprintf(uclustfile, "S\t%lu\t%lu\t*\t*\t*\t*\t*\t",
+
+          fprintf(uclustfile, "S\t%u\t%u\t*\t*\t*\t*\t*\t",
                   swarmid-1, db_getsequencelen(seedampliconid));
           fprint_id(uclustfile, seedampliconid);
           fprintf(uclustfile, "\t*\n");
           fflush(uclustfile);
 
-          for(unsigned long i=1; i<hitcount; i++)
+          for(uint64_t i=1; i<hitcount; i++)
             {
-              unsigned long hit = hits[i];
-              
+              uint64_t hit = hits[i];
+
               char * dseq = db_getsequence(hit);
-              char * dend = dseq + db_getsequencelen(hit);
+              int64_t dlen = db_getsequencelen(hit);
               char * qseq = db_getsequence(seedampliconid);
-              char * qend = qseq + db_getsequencelen(seedampliconid);
+              int64_t qlen = db_getsequencelen(seedampliconid);
 
-              unsigned long nwscore = 0;
-              unsigned long nwdiff = 0;
-              char * nwalignment = NULL;
-              unsigned long nwalignmentlength = 0;
+              int64_t nwscore = 0;
+              int64_t nwdiff = 0;
+              char * nwalignment = nullptr;
+              int64_t nwalignmentlength = 0;
 
-              nw(dseq, dend, qseq, qend, 
+              nw(dseq, dlen, qseq, qlen,
                  score_matrix_63, penalty_gapopen, penalty_gapextend,
                  & nwscore, & nwdiff, & nwalignmentlength, & nwalignment,
-                 dir, hearray, 0, 0);
-              
-              double percentid = 100.0 * (nwalignmentlength - 
-                                          nwdiff) / nwalignmentlength;
-              
-              fprintf(uclustfile, "H\t%lu\t%lu\t%.1f\t+\t0\t0\t%s\t",
-                      swarmid-1, db_getsequencelen(hit), percentid, 
+                 dir, reinterpret_cast<int64_t *>(hearray), 0, 0);
+
+              double percentid
+                = 100.0 * static_cast<double>(nwalignmentlength - nwdiff)
+                / static_cast<double>(nwalignmentlength);
+
+              fprintf(uclustfile, "H\t%u\t%u\t%.1f\t+\t0\t0\t%s\t",
+                      swarmid-1, db_getsequencelen(hit), percentid,
                       nwdiff > 0 ? nwalignment : "=");
-              
+
               fprint_id(uclustfile, hit);
               fprintf(uclustfile, "\t");
               fprint_id(uclustfile, seedampliconid);
@@ -456,34 +505,36 @@ void algo_run()
               fflush(uclustfile);
 
               if (nwalignment)
-                free(nwalignment);
+                xfree(nwalignment);
             }
 
         }
-      
+
 
       if (statsfile)
         {
           abundance = db_getabundance(seedampliconid);
 
-          fprintf(statsfile, "%lu\t%lu\t", swarmsize, amplicons_copies);
+          fprintf(statsfile, "%" PRIu64 "\t%" PRIu64 "\t",
+                  swarmsize, amplicons_copies);
           fprint_id_noabundance(statsfile, seedampliconid);
-          fprintf(statsfile, "\t%lu\t%lu\t%lu\t%lu\n", 
+          fprintf(statsfile,
+                  "\t%" PRIu64 "\t%" PRIu64 "\t%" PRIu64 "\t%" PRIu64 "\n",
                   abundance, singletons, maxgen, maxradius);
         }
       progress_update(seeded);
     }
   progress_done();
-  
+
   if (uclustfile)
     {
-      free(dir);
-      free(hearray);
+      xfree(dir);
+      xfree(hearray);
     }
 
 
   /* output results */
-  
+
   if (amplicons > 0)
     {
       char sep_amplicons;
@@ -494,7 +545,8 @@ void algo_run()
           /* mothur list file output */
           sep_amplicons = ',';
           sep_swarms = '\t';
-          fprintf(outfile, "swarm_%ld\t%lu\t", opt_differences, swarmid);
+          fprintf(outfile, "swarm_%" PRId64 "\t%u\t",
+                  opt_differences, swarmid);
         }
       else
         {
@@ -504,11 +556,11 @@ void algo_run()
         }
 
       fprint_id(outfile, amps[0].ampliconid);
-      long previd = amps[0].swarmid;
+      int64_t previd = amps[0].swarmid;
 
-      for (unsigned long i=1; i<amplicons; i++)
+      for (uint64_t i=1; i<amplicons; i++)
         {
-          long id = amps[i].swarmid;
+          int64_t id = amps[i].swarmid;
           if (id == previd)
             fputc(sep_amplicons, outfile);
           else
@@ -525,52 +577,62 @@ void algo_run()
 
   if ((opt_seeds) && (amplicons > 0))
     {
-      progress_init("Writing seeds:    ", amplicons);
-
-      unsigned long mass = 0;
+      uint64_t swarmcount = 0;
+      progress_init("Sorting seeds:    ", amplicons);
+      struct swarminfo_t * swarminfo = static_cast<struct swarminfo_t *>
+        (xmalloc(swarmed * sizeof(struct swarminfo_t)));
+      uint64_t mass = 0;
       unsigned previd = amps[0].swarmid;
       unsigned seed = amps[0].ampliconid;
       mass += db_getabundance(seed);
-
-      for (unsigned long i=1; i<amplicons; i++)
+      for (uint64_t i=1; i<amplicons; i++)
         {
           unsigned id = amps[i].swarmid;
-
           if (id != previd)
             {
-              fprintf(fp_seeds, ">");
-              fprint_id_with_new_abundance(fp_seeds, seed, mass);
-              fprintf(fp_seeds, "\n");
-              db_fprintseq(fp_seeds, seed, 0);
-
+              swarminfo[swarmcount].seed = seed;
+              swarminfo[swarmcount].mass = mass;
+              swarmcount++;
               mass = 0;
               seed = amps[i].ampliconid;
             }
-
           mass += db_getabundance(amps[i].ampliconid);
           previd = id;
           progress_update(i);
         }
+      swarminfo[swarmcount].seed = seed;
+      swarminfo[swarmcount].mass = mass;
+      swarmcount++;
+      qsort(swarminfo, swarmcount, sizeof(swarminfo_t), compare_mass_seed);
+      progress_done();
 
-      fprintf(fp_seeds, ">");
-      fprint_id_with_new_abundance(fp_seeds, seed, mass);
-      fprintf(fp_seeds, "\n");
-      db_fprintseq(fp_seeds, seed, 0);
+      progress_init("Writing seeds:    ", swarmcount);
+      for (uint64_t i = 0; i < swarmcount; i++)
+        {
+          uint64_t swarm_mass = swarminfo[i].mass;
+          unsigned int swarm_seed = swarminfo[i].seed;
 
+          fprintf(fp_seeds, ">");
+          fprint_id_with_new_abundance(fp_seeds, swarm_seed, swarm_mass);
+          fprintf(fp_seeds, "\n");
+          db_fprintseq(fp_seeds, swarm_seed, 0);
+          progress_update(i);
+        }
+      xfree(swarminfo);
       progress_done();
     }
 
 
-  free(qgramdiffs);
-  free(qgramamps);
-  free(qgramindices);
-  free(hits);
-  free(alignlengths);
-  free(diffs);
-  free(scores);
-  free(targetindices);
-  free(targetampliconids);
-  free(amps);
+  xfree(qgramdiffs);
+  xfree(qgramamps);
+  xfree(qgramindices);
+  xfree(hits);
+  xfree(alignlengths);
+  xfree(diffs);
+  xfree(scores);
+  xfree(targetindices);
+  xfree(targetampliconids);
+  xfree(amps);
 
   db_qgrams_done();
 
@@ -578,34 +640,35 @@ void algo_run()
 
   fprintf(logfile, "\n");
 
-  fprintf(logfile, "Number of swarms:  %lu\n", swarmid);
+  fprintf(logfile, "Number of swarms:  %u\n", swarmid);
 
-  fprintf(logfile, "Largest swarm:     %lu\n", largestswarm);
+  fprintf(logfile, "Largest swarm:     %" PRIu64 "\n", largestswarm);
 
-  fprintf(logfile, "Max generations:   %lu\n", maxgenerations);
+  fprintf(logfile, "Max generations:   %" PRIu64 "\n", maxgenerations);
 
 #ifdef VERBOSE
   fprintf(logfile, "\n");
 
-  fprintf(logfile, "Estimates:         %lu\n", estimates);
+  fprintf(logfile, "Estimates:         %" PRIu64 "\n", estimates);
 
-  fprintf(logfile, "Searches:          %lu\n", searches);
+  fprintf(logfile, "Searches:          %" PRIu64 "\n", searches);
 
   fprintf(logfile, "\n");
 
-  fprintf(logfile, "Comparisons (8b):  %lu (%.2lf%%)\n",
-          count_comparisons_8, (200.0 * count_comparisons_8 / 
+  fprintf(logfile, "Comparisons (8b):  %" PRIu64 " (%.2lf%%)\n",
+          count_comparisons_8, (200.0 * count_comparisons_8 /
                                 amplicons / (amplicons+1)));
 
-  fprintf(logfile, "Comparisons (16b): %lu (%.2lf%%)\n",
+  fprintf(logfile, "Comparisons (16b): %" PRIu64 " (%.2lf%%)\n",
           count_comparisons_16, (200.0 * count_comparisons_16 /
                                  amplicons / (amplicons+1)));
 
-  fprintf(logfile, "Comparisons (tot): %lu (%.2lf%%)\n",
+  fprintf(logfile, "Comparisons (tot): %" PRIu64 " (%.2lf%%)\n",
           count_comparisons_8 + count_comparisons_16,
           (200.0 * (count_comparisons_8 + count_comparisons_16) /
            amplicons / (amplicons+1)));
 #endif
 
+  search_end();
 }
 
diff --git a/src/algod1.cc b/src/algod1.cc
index 402cf22f..025916c4 100644
--- a/src/algod1.cc
+++ b/src/algod1.cc
@@ -1,7 +1,7 @@
 /*
     SWARM
 
-    Copyright (C) 2012-2017 Torbjorn Rognes and Frederic Mahe
+    Copyright (C) 2012-2019 Torbjorn Rognes and Frederic Mahe
 
     This program is free software: you can redistribute it and/or modify
     it under the terms of the GNU Affero General Public License as
@@ -29,517 +29,173 @@
 
 #include "swarm.h"
 
-#define HASH hash_cityhash64
-#define HASHFILLFACTOR 0.5
-#define POWEROFTWO
-//#define HASHSTATS
-
 /* Information about each amplicon */
 
 static struct ampinfo_s
 {
-  int swarmid;
-  int parent;
-  int generation; 
-  int next;       /* amp id of next amplicon in swarm */
-  int graft_cand; /* amp id of potential grafting parent (fastidious) */
-} * ampinfo = 0;
+  unsigned int swarmid;
+  unsigned int parent;
+  unsigned int generation;
+  unsigned int next;       /* amp id of next amplicon in swarm */
+  unsigned int graft_cand; /* amp id of potential grafting parent (fastid.) */
+  unsigned int link_start;
+  unsigned int link_count;
+} * ampinfo = nullptr;
 
 /* Information about each swarm (OTU) */
 
 static struct swarminfo_s
 {
-  int seed; /* amplicon id of the initial seed of this swarm */
-  int last; /* amplicon id of the last seed in this swarm */
-  int size; /* total number of amplicons in this swarm */
-  int singletons; /* number of amplicons with abundance 1 */
-  int maxgen; /* the generation of the amplicon farthest from the seed */
-  long mass; /* the sum of abundances of amplicons in this swarm */
-  long sumlen; /* sum of length of amplicons in swarm */
+  uint64_t mass; /* the sum of abundances of amplicons in this swarm */
+  uint64_t sumlen; /* sum of length of amplicons in swarm */
+  unsigned int seed; /* amplicon id of the initial seed of this swarm */
+  unsigned int last; /* amplicon id of the last seed in this swarm */
+  unsigned int size; /* total number of amplicons in this swarm */
+  unsigned int singletons; /* number of amplicons with abundance 1 */
+  unsigned int maxgen; /* the generation of the amplicon farthest from seed */
   bool attached; /* this is a small swarm attached to a large (fastidious) */
-} * swarminfo = 0;
+  char dummy[3]; /* alignment padding only */
+} * swarminfo = nullptr;
 
-static long swarminfo_alloc = 0;
+static struct graft_cand
+{
+  unsigned int parent;
+  unsigned int child;
+} * graft_array = nullptr;
 
 /* Information about potential grafts */
-static long graft_candidates = 0;
+static int64_t graft_candidates = 0;
 static pthread_mutex_t graft_mutex;
- 
-#define NO_SWARM (-1)
 
-static int current_swarm_tail;
+#define NO_SWARM (UINT_MAX)
+
+static unsigned int current_swarm_tail = 0;
 
-static unsigned long hash_tablesize = 0;
+static uint64_t swarminfo_alloc = 0;
 
 /* overall statistics */
-static int maxgen = 0;
-static int largest = 0;
-static long swarmcount_adjusted = 0;
+static unsigned int maxgen = 0;
+static unsigned int largest = 0;
+static uint64_t swarmcount_adjusted = 0;
 
 /* per swarm statistics */
-static unsigned long singletons = 0;
-static unsigned long abundance_sum = 0; /* = mass */
-static int swarmsize = 0;
-static int swarm_maxgen = 0;
-static unsigned long swarm_sumlen = 0;
-
-static struct thread_info_s
-{
-  pthread_t pthread;
-  pthread_mutex_t workmutex;
-  pthread_cond_t workcond;
-  int work;
-  unsigned char * varseq;
-  int seed;
-  unsigned long mut_start;
-  unsigned long mut_length;
-  int * hits_data;
-  int hits_alloc;
-  int hits_count;
-} * ti;
-
-static pthread_attr_t attr;
-
-#ifdef HASHSTATS
-unsigned long probes = 0;
-unsigned long hits = 0;
-unsigned long success = 0;
-unsigned long tries  = 0;
-unsigned long bingo = 0;
-unsigned long collisions = 0;
-#endif
-
-static int hash_shift;
-static unsigned long hash_mask;
-static unsigned char * hash_occupied = 0;
-static unsigned long * hash_values = 0;
-static int * hash_data = 0;
+static unsigned int singletons = 0;
+static uint64_t abundance_sum = 0; /* = mass */
+static unsigned int swarmsize = 0;
+static unsigned int swarm_maxgen = 0;
+static uint64_t swarm_sumlen = 0;
 
-static int * global_hits_data = 0;
-static int global_hits_alloc = 0;
-static int global_hits_count = 0;
+static unsigned int * global_hits_data = nullptr;
+static unsigned int global_hits_alloc = 0;
+static unsigned int global_hits_count = 0;
 
-static unsigned long threads_used = 0;
-
-inline unsigned int hash_getindex(unsigned long hash)
-{
-#ifdef POWEROFTWO
-  return hash & hash_mask;
-#else
-  return hash % hash_tablesize;
-#endif
-}
+static unsigned int longestamplicon = 0;
 
-inline unsigned int hash_getnextindex(unsigned int j)
-{
-#ifdef POWEROFTWO
-  return (j+1) & hash_mask;
-#else
-  return (j+1) % hash_tablesize;
-#endif
-}
+static unsigned int amplicons = 0;
 
-void hash_alloc(unsigned long amplicons)
-{
-  hash_tablesize = 1;
-  hash_shift = 0;
-  while (amplicons > HASHFILLFACTOR * hash_tablesize)
-    {
-      hash_tablesize <<= 1;
-      hash_shift++;
-    }
-  hash_mask = hash_tablesize - 1;
-  
-  hash_occupied =
-    (unsigned char *) xmalloc((hash_tablesize + 63) / 8);
-  memset(hash_occupied, 0, (hash_tablesize + 63) / 8);
+static pthread_mutex_t heavy_mutex;
+static uint64_t heavy_variants = 0;
+static uint64_t heavy_progress = 0;
+static uint64_t heavy_amplicon_count = 0;
+static unsigned int heavy_amplicon = 0;
 
-  hash_values =
-    (unsigned long *) xmalloc(hash_tablesize * sizeof(unsigned long));
+static pthread_mutex_t light_mutex;
+static uint64_t light_variants = 0;
+static uint64_t light_progress = 0;
+static uint64_t light_amplicon_count = 0;
+static unsigned int light_amplicon = 0;
 
-  hash_data =
-    (int *) xmalloc(hash_tablesize * sizeof(int));
-}
+static uint64_t network_alloc = 1024 * 1024;
+static unsigned int * network = nullptr;
+static unsigned int network_count = 0;
+static pthread_mutex_t network_mutex;
+static unsigned int network_amp = 0;
 
-void hash_free()
-{
-  free(hash_occupied);
-  free(hash_values);
-  free(hash_data);
-}
+static struct bloom_s * bloom_a = nullptr; // Bloom filter for amplicons
 
-inline void hash_set_occupied(unsigned int j)
-{
-  hash_occupied[j >> 3] |= (1 << (j & 7));
-}
+static struct bloomflex_s * bloom_f = nullptr; // Huge Bloom filter for fastidious
 
-inline int hash_is_occupied(unsigned int j)
+void attach(unsigned int seed, unsigned int amp);
+void add_graft_candidate(unsigned int seed, unsigned int amp);
+int compare_grafts(const void * a, const void * b);
+unsigned int attach_candidates(unsigned int amplicon_count);
+bool hash_check_attach(char * seq,
+                       unsigned int seqlen,
+                       struct var_s * var,
+                       unsigned int seed);
+void check_heavy_var(struct bloomflex_s * bloom,
+                     char * varseq,
+                     unsigned int seed,
+                     uint64_t * m,
+                     uint64_t * v,
+                     struct var_s * variant_list,
+                     struct var_s * variant_list2);
+void check_heavy_thread(int64_t t);
+uint64_t mark_light_var(struct bloomflex_s * bloom,
+                        unsigned int seed,
+                        struct var_s * variant_list);
+void mark_light_thread(int64_t t);
+void check_variants(unsigned int seed,
+                    var_s * variant_list,
+                    unsigned int * hits_data,
+                    unsigned int * hits_count);
+void network_thread(int64_t t);
+void process_seed(unsigned int seed);
+int compare_amp(const void * a, const void * b);
+int compare_mass(const void * a, const void * b);
+
+
+inline bool check_amp_identical(unsigned int amp1,
+                                unsigned int amp2)
 {
-  return hash_occupied[j >> 3] & (1 << (j & 7));
-}
+  unsigned int amp1_seqlen = db_getsequencelen(amp1);
+  unsigned int amp2_seqlen = db_getsequencelen(amp2);
 
-inline void hash_set_value(unsigned int j, unsigned long hash)
-{
-  hash_values[j] = hash;
-}
+  if (amp1_seqlen == amp2_seqlen)
+    {
+      char * amp1_sequence = db_getsequence(amp1);
+      char * amp2_sequence = db_getsequence(amp2);
 
-inline int hash_compare_value(unsigned int j, unsigned long hash)
-{
-  return (hash_values[j] == hash);
+      return memcmp(amp1_sequence,
+                    amp2_sequence,
+                    nt_bytelength(amp1_seqlen)) == 0;
+    }
+  else
+    return false;
 }
 
-inline void hash_insert(int amp,
-                        unsigned char * key,
-                        unsigned long keylen)
+inline void hash_insert(unsigned int amp)
 {
-  unsigned long hash = HASH(key, keylen);
-  unsigned int j = hash_getindex(hash);
-  
   /* find the first empty bucket */
-  while (hash_is_occupied(j))
-    j = hash_getnextindex(j);
-  
-  hash_set_occupied(j);
-  hash_set_value(j, hash);
-  hash_data[j] = amp;
-}
-
-void find_variant_matches(unsigned long thread,
-                          unsigned char * seq,
-                          unsigned long seqlen,
-                          int seed)
-{
-  unsigned long max_abundance;
-
-  if (opt_no_otu_breaking)
-    max_abundance = ULONG_MAX;
-  else
-    max_abundance = db_getabundance(seed);
-
-  /* compute hash and corresponding hash table index */
-
-  unsigned long hash = HASH(seq, seqlen);
-  unsigned int j = hash_getindex(hash);
-
-  /* find matching buckets */
-
-#ifdef HASHSTATS
-  tries++;
-  probes++;
-#endif
-
+  uint64_t hash = db_gethash(amp);
+  uint64_t j = hash_getindex(hash);
+  bool duplicate = false;
   while (hash_is_occupied(j))
     {
-#ifdef HASHSTATS
-      hits++;
-#endif
-      if (hash_compare_value(j, hash))
-        {
-#ifdef HASHSTATS
-          success++;
-#endif
-          
-          /* check if not already swarmed */
-          int amp = hash_data[j];
-          struct ampinfo_s * bp = ampinfo + amp;
-          if ((bp->swarmid == NO_SWARM) && 
-              (db_getabundance(amp) <= max_abundance))
-            {
-              unsigned long ampseqlen = db_getsequencelen(amp);
-              unsigned char * ampseq = (unsigned char *) db_getsequence(amp);
-              
-              /* make sure sequences are identical even though hashes are */
-              if ((ampseqlen == seqlen) && (!memcmp(ampseq, seq, seqlen)))
-                {
-#ifdef HASHSTATS
-                  bingo++;
-#endif
-
-                  struct thread_info_s * tip = ti + thread;
-
-                  if (tip->hits_count + 1 > tip->hits_alloc)
-                    {
-                      tip->hits_alloc <<= 1;
-                      tip->hits_data = (int*)realloc(tip->hits_data,
-                                                     tip->hits_alloc *
-                                                     sizeof(int));
-                    }
-
-                  tip->hits_data[tip->hits_count++] = amp;
-                }
-#ifdef HASHSTATS
-              else
-                {
-                  collisions++;
-                  
-                  fprintf(logfile, "Hash collision between ");
-                  fprint_id_noabundance(logfile, seed);
-                  fprintf(logfile, " and ");
-                  fprint_id_noabundance(logfile, amp);
-                  fprintf(logfile, ".\n");
-                }
-#endif
-            }
-        }
+      if (hash_compare_value(j, hash) &&
+          check_amp_identical(amp, hash_get_data(j)))
+        duplicate = true;
       j = hash_getnextindex(j);
-#ifdef HASHSTATS
-      probes++;
-#endif
-    }
-}
-
-void generate_variants(unsigned long thread,
-                       int seed,
-                       unsigned long start,
-                       unsigned long len)
-{
-  /* 
-     Generate all possible variants involving mutations from position start
-     and extending len nucleotides. Insertions in front of those positions
-     are included, but not those after. Positions are zero-based.
-     The range may extend beyond the length of the sequence indicating
-     that inserts at the end of the sequence should be generated.
-
-     The last thread will handle insertions at the end of the sequence,
-     as well as identical sequences (no mutations).
-  */
-
-  unsigned char * varseq = ti[thread].varseq;
-
-  unsigned char * seq = (unsigned char*) db_getsequence(seed);
-  unsigned long seqlen = db_getsequencelen(seed);
-  unsigned long end = MIN(seqlen,start+len);
-
-  ti[thread].hits_count = 0;
-
-  /* make an exact copy */
-  memcpy(varseq, seq, seqlen);
-  
-#if 1
-  /* identical non-variant */
-  if (thread == threads_used - 1)
-    find_variant_matches(thread, varseq, seqlen, seed);
-#endif
-
-  /* substitutions */
-  for(unsigned int i=start; i<end; i++)
-    {
-      for (int v=1; v<5; v++)
-        if (v != seq[i])
-          {
-            varseq[i] = v;
-            find_variant_matches(thread, varseq, seqlen, seed);
-          }
-      varseq[i] = seq[i];
-    }
-
-  /* deletions */
-  memcpy(varseq, seq, start);
-  if (start < seqlen-1)
-    memcpy(varseq+start, seq+start+1, seqlen-start-1);
-  for(unsigned int i=start; i<end; i++)
-    {
-      if ((i==0) || (seq[i] != seq[i-1]))
-        {
-          find_variant_matches(thread, varseq, seqlen-1, seed);      
-        }
-      varseq[i] = seq[i];
-    }
-  
-  /* insertions */
-  memcpy(varseq, seq, start);
-  memcpy(varseq+start+1, seq+start, seqlen-start);
-  for(unsigned int i=start; i<start+len; i++)
-    {
-      for(int v=1; v<5; v++)
-        {
-          if((i==seqlen) || (v != seq[i]))
-            {
-              varseq[i] = v;
-              find_variant_matches(thread, varseq, seqlen+1, seed);
-            }
-        }
-      if (i<seqlen)
-        varseq[i] = seq[i];
-    }
-}
-
-void * worker(void * vp)
-{
-  long t = (long) vp;
-  struct thread_info_s * tip = ti + t;
-
-  pthread_mutex_lock(&tip->workmutex);
-
-  /* loop until signalled to quit */
-  while (tip->work >= 0)
-    {
-      /* wait for work available */
-      while (tip->work == 0)
-        pthread_cond_wait(&tip->workcond, &tip->workmutex);
-      if (tip->work > 0)
-        {
-          generate_variants(t, tip->seed, tip->mut_start, tip->mut_length);
-          tip->work = 0;
-          pthread_cond_signal(&tip->workcond);
-        }
     }
 
-  pthread_mutex_unlock(&tip->workmutex);
-  return 0;
-}
-
-void threads_init()
-{
-  pthread_attr_init(&attr);
-  pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_JOINABLE);
-        
-  /* allocate memory for thread info, incl the variant sequences */
-  unsigned long longestamplicon = db_getlongestsequence();
-  ti = (struct thread_info_s *) 
-    xmalloc(opt_threads * sizeof(struct thread_info_s));
-  
-  /* init and create worker threads */
-  for(long t=0; t<opt_threads; t++)
-    {
-      struct thread_info_s * tip = ti + t;
-      tip->varseq = (unsigned char*) xmalloc(longestamplicon+1);
-      tip->hits_alloc = 7 * longestamplicon + 4;
-      tip->hits_data = (int*) xmalloc(tip->hits_alloc * sizeof(int));
-      tip->work = 0;
-      pthread_mutex_init(&tip->workmutex, NULL);
-      pthread_cond_init(&tip->workcond, NULL);
-      if (pthread_create(&tip->pthread, &attr, worker, (void*)(long)t))
-        fatal("Cannot create thread");
-    }
-}
+  if (duplicate)
+    duplicates_found++;
 
-void threads_done()
-{
-  /* finish and clean up worker threads */
-  for(long t=0; t<opt_threads; t++)
-    {
-      struct thread_info_s * tip = ti + t;
-      
-      /* tell worker to quit */
-      pthread_mutex_lock(&tip->workmutex);
-      tip->work = -1;
-      pthread_cond_signal(&tip->workcond);
-      pthread_mutex_unlock(&tip->workmutex);
-
-      /* wait for worker to quit */
-      if (pthread_join(tip->pthread, NULL))
-        fatal("Cannot join thread");
-
-      pthread_cond_destroy(&tip->workcond);
-      pthread_mutex_destroy(&tip->workmutex);
-      free(tip->varseq);
-      free(tip->hits_data);
-    }
-
-  free(ti);
-
-  pthread_attr_destroy(&attr);
-}
+  hash_set_occupied(j);
+  hash_set_value(j, hash);
+  hash_set_data(j, amp);
 
-void add_amp_to_swarm(int amp)
-{
-  /* add to swarm */
-  ampinfo[current_swarm_tail].next = amp;
-  current_swarm_tail = amp;
+  bloom_set(bloom_a, hash);
 }
 
-void process_seed(int subseed)
-{
-  unsigned long seqlen = db_getsequencelen(subseed);
-
-  threads_used = opt_threads;
-  if (threads_used > seqlen + 1)
-    threads_used = seqlen+1;
-
-  /* prepare work for the threads */
-  unsigned long start = 0;
-  for(unsigned long t=0; t<threads_used; t++)
-    {
-      struct thread_info_s * tip = ti + t;
-      unsigned long length =
-        (seqlen - start + threads_used - t) / (threads_used - t);
-      tip->seed = subseed;
-      tip->mut_start = start;
-      tip->mut_length = length;
-      start += length;
-      
-      pthread_mutex_lock(&tip->workmutex);
-      tip->work = 1;
-      pthread_cond_signal(&tip->workcond);
-      pthread_mutex_unlock(&tip->workmutex);
-    }
-
-  /* wait for threads to finish their work */
-  for(unsigned int t=0; t<threads_used; t++)
-    {
-      struct thread_info_s * tip = ti + t;
-      pthread_mutex_lock(&tip->workmutex);
-      while (tip->work > 0)
-        pthread_cond_wait(&tip->workcond, &tip->workmutex);
-      pthread_mutex_unlock(&tip->workmutex);
-    }
-
-  /* join hits from the threads */
-
-  for(unsigned int t=0; t<threads_used; t++)
-    {
-      if (global_hits_count + ti[t].hits_count > global_hits_alloc)
-        {
-          while (global_hits_count + ti[t].hits_count > global_hits_alloc)
-            global_hits_alloc <<= 1;
-          global_hits_data = (int*)xrealloc(global_hits_data,
-                                            global_hits_alloc * sizeof(int));
-        }
-      for(int i=0; i < ti[t].hits_count; i++)
-        {
-          long amp = ti[t].hits_data[i];
-
-          /* add to list for this generation */
-          global_hits_data[global_hits_count++] = amp;
-
-          /* update info */
-          ampinfo[amp].swarmid = ampinfo[subseed].swarmid;
-          ampinfo[amp].generation = ampinfo[subseed].generation + 1;
-          ampinfo[amp].parent = subseed;
-        }
-    }
-}
 
-void update_stats(int amp)
-{
-  /* update swarm stats */
-  struct ampinfo_s * bp = ampinfo + amp;
+/******************** FASTIDIOUS START ********************/
 
-  swarmsize++;
-  if (bp->generation > swarm_maxgen)
-    swarm_maxgen = bp->generation;
-  unsigned long abundance = db_getabundance(amp);
-  abundance_sum += abundance;
-  if (abundance == 1)
-    singletons++;
-  swarm_sumlen += db_getsequencelen(amp);
-}
 
-void attach(int seed, int amp)
+void attach(unsigned int seed, unsigned int amp)
 {
   /* graft light swarm (amp) on heavy swarm (seed) */
 
-#if 0
-  fprintf(logfile, 
-          "\nGrafting light swarm with amplicon %d on "
-          "heavy swarm with amplicon %d (swarm ids: %d %d)\n",
-          amp,
-          seed,
-          ampinfo[amp].swarmid,
-          ampinfo[seed].swarmid);
-#endif
-
   swarminfo_s * hp = swarminfo + ampinfo[seed].swarmid;
   swarminfo_s * lp = swarminfo + ampinfo[amp].swarmid;
 
@@ -564,7 +220,7 @@ void attach(int seed, int amp)
   swarmcount_adjusted--;
 }
 
-void add_graft_candidate(int seed, int amp)
+void add_graft_candidate(unsigned int seed, unsigned int amp)
 {
   pthread_mutex_lock(&graft_mutex);
   graft_candidates++;
@@ -573,16 +229,10 @@ void add_graft_candidate(int seed, int amp)
   pthread_mutex_unlock(&graft_mutex);
 }
 
-struct graft_cand
-{
-  int parent;
-  int child;
-} * graft_array;
-
 int compare_grafts(const void * a, const void * b)
 {
-  struct graft_cand * x = (struct graft_cand *) a;
-  struct graft_cand * y = (struct graft_cand *) b;
+  const struct graft_cand * x = static_cast<const struct graft_cand *>(a);
+  const struct graft_cand * y = static_cast<const struct graft_cand *>(b);
   if (x->parent < y->parent)
     return -1;
   else if (x->parent > y->parent)
@@ -596,24 +246,24 @@ int compare_grafts(const void * a, const void * b)
       return 0;
 }
 
-int attach_candidates(int amplicons)
+unsigned int attach_candidates(unsigned int amplicon_count)
 {
   /* count pairs */
-  int pair_count = 0;
-  for(int i=0; i < amplicons; i++)
+  unsigned int pair_count = 0;
+  for(unsigned int i=0; i < amplicon_count; i++)
     if (ampinfo[i].graft_cand != NO_SWARM)
       pair_count++;
 
-  int grafts = 0;
+  unsigned int grafts = 0;
   progress_init("Grafting light swarms on heavy swarms", pair_count);
 
   /* allocate memory */
-  graft_array = (struct graft_cand *) 
-    xmalloc(pair_count * sizeof(struct graft_cand));
+  graft_array = static_cast<struct graft_cand *>
+    (xmalloc(pair_count * sizeof(struct graft_cand)));
 
   /* fill in */
-  int j = 0;
-  for(int i=0; i < amplicons; i++)
+  unsigned int j = 0;
+  for(unsigned int i=0; i < amplicon_count; i++)
     if (ampinfo[i].graft_cand != NO_SWARM)
       {
         graft_array[j].parent = ampinfo[i].graft_cand;
@@ -625,10 +275,10 @@ int attach_candidates(int amplicons)
   qsort(graft_array, pair_count, sizeof(struct graft_cand), compare_grafts);
 
   /* attach in order */
-  for(int i=0; i < pair_count; i++)
+  for(unsigned int i=0; i < pair_count; i++)
     {
-      int parent = graft_array[i].parent;
-      int child  = graft_array[i].child;
+      unsigned int parent = graft_array[i].parent;
+      unsigned int child  = graft_array[i].child;
 
       if (swarminfo[ampinfo[child].swarmid].attached)
         {
@@ -644,18 +294,20 @@ int attach_candidates(int amplicons)
       progress_update(i+1);
     }
   progress_done();
-  free(graft_array);
+  xfree(graft_array);
   return grafts;
 }
 
 bool hash_check_attach(char * seq,
-                       unsigned long seqlen,
-                       int seed)
+                       unsigned int seqlen,
+                       struct var_s * var,
+                       unsigned int seed)
 {
-  /* compute hash and corresponding hash table index */
+  /* seed is the original large swarm seed */
 
-  unsigned long hash = HASH((unsigned char*)seq, seqlen);
-  unsigned int j = hash_getindex(hash);
+  /* get variant hash and corresponding hash table index */
+  uint64_t hash = var->hash;
+  uint64_t j = hash_getindex(hash);
 
   /* find matching buckets */
 
@@ -664,266 +316,349 @@ bool hash_check_attach(char * seq,
       if (hash_compare_value(j, hash))
         {
           /* check that mass is below threshold */
-          int amp = hash_data[j];
+          unsigned int amp = hash_get_data(j);
 
-          struct swarminfo_s * smallp = swarminfo + ampinfo[amp].swarmid;
-          
-          if (smallp->mass < opt_boundary)
+          /* make absolutely sure sequences are identical */
+          char * ampseq = db_getsequence(amp);
+          unsigned int ampseqlen = db_getsequencelen(amp);
+          if (check_variant(seq, seqlen, var, ampseq, ampseqlen))
             {
-              unsigned long ampseqlen = db_getsequencelen(amp);
-              unsigned char * ampseq = (unsigned char *) db_getsequence(amp);
-
-              /* make absolutely sure sequences are identical */
-              if ((ampseqlen == seqlen) && (!memcmp(ampseq, seq, seqlen)))
-                {
-                  add_graft_candidate(seed, amp);
-                  return 1;
-                }
+              add_graft_candidate(seed, amp);
+              return true;
             }
         }
       j = hash_getnextindex(j);
     }
-  return 0;
+  return false;
 }
 
-#if 0
+inline uint64_t check_heavy_var_2(char * seq,
+                                  unsigned int seqlen,
+                                  unsigned int seed,
+                                  struct var_s * variant_list)
+{
+  /* Check second generation microvariants of the heavy swarm amplicons
+     and see if any of them are identical to a light swarm amplicon. */
 
-/* never used */
+  uint64_t matches = 0;
+  unsigned int variant_count = 0;
 
-long expected_variant_count(char * seq, int len)
-{
-  int c = 0;
-  for(int i=1; i<len; i++)
-    if (seq[i] != seq[i-1])
-      c++;
-  return 6*len+5+c;
-}
+  uint64_t hash = zobrist_hash(reinterpret_cast<unsigned char *>(seq), seqlen);
+  generate_variants(seq, seqlen, hash,
+                    variant_list, & variant_count, false);
 
-#endif
+  for(unsigned int i=0; i < variant_count; i++)
+    if (bloom_get(bloom_a, variant_list[i].hash) &&
+        hash_check_attach(seq, seqlen, variant_list + i, seed))
+      matches++;
+
+  return matches;
+}
 
-long fastidious_mark_small_var(BloomFilter * bloom,
-                               char * varseq,
-                               int seed)
+void check_heavy_var(struct bloomflex_s * bloom,
+                     char * varseq,
+                     unsigned int seed,
+                     uint64_t * m,
+                     uint64_t * v,
+                     struct var_s * variant_list,
+                     struct var_s * variant_list2)
 {
   /*
-    add all microvariants of seed to Bloom filter
-
-    bloom is a BloomFilter in which to enter the variants
-    buffer is a buffer large enough to hold all sequences + 1 insertion
+    bloom is a bloom filter in which to check the variants
+    varseq is a buffer large enough to hold all sequences + 1 insertion
     seed is the original seed
+    m is where to store number of matches
+    v is where to store number of variants
+    variant_list and variant_list2 are lists to hold the 1st and 2nd
+    generation of microvariants
   */
 
-  long variants = 0;
-
-  unsigned char * seq = (unsigned char*) db_getsequence(seed);
-  unsigned long seqlen = db_getsequencelen(seed);
+  /*
+    Generate microvariants of the heavy swarm amplicons, forming
+    "virtual" amplicons. Check with the bloom filter if any
+    of these are identical to the microvariants of the
+    light swarm amplicons. If there is a match we have a potential
+    link. To find which light amplicon it could link to, we have
+    to generate the second generation microvariants and check
+    these against the light swarm amplicons.
+  */
 
-  /* make an exact copy */
-  memcpy(varseq, seq, seqlen);
+  unsigned int variant_count = 0;
+  uint64_t matches = 0;
 
-  /* substitutions */
-  for(unsigned int i=0; i<seqlen; i++)
-    {
-      for (int v=1; v<5; v++)
-        if (v != seq[i])
-          {
-            varseq[i] = v;
-            bloom->set(varseq, seqlen);
-            variants++;
-          }
-      varseq[i] = seq[i];
-    }
+  char * sequence = db_getsequence(seed);
+  unsigned int seqlen = db_getsequencelen(seed);
+  uint64_t hash = db_gethash(seed);
+  generate_variants(sequence, seqlen, hash,
+                    variant_list, & variant_count, false);
 
-  /* deletions */
-  if (seqlen > 1)
-    memcpy(varseq, seq+1, seqlen-1);
-  for(unsigned int i=0; i<seqlen; i++)
+  for(unsigned int i = 0; i < variant_count; i++)
     {
-      if ((i==0) || (seq[i] != seq[i-1]))
+      struct var_s * var = variant_list + i;
+      if (bloomflex_get(bloom, var->hash))
         {
-          bloom->set(varseq, seqlen-1);
-          variants++;
+          unsigned int varlen = 0;
+          generate_variant_sequence(sequence, seqlen,
+                                    var, varseq, & varlen);
+          matches += check_heavy_var_2(varseq,
+                                       varlen,
+                                       seed,
+                                       variant_list2);
         }
-      varseq[i] = seq[i];
     }
 
-  /* insertions */
-  memcpy(varseq+1, seq, seqlen);
-  for(unsigned int i=0; i<seqlen+1; i++)
+  *m = matches;
+  *v = variant_count;
+}
+
+void check_heavy_thread(int64_t t)
+{
+  (void) t;
+
+  struct var_s * variant_list = static_cast<struct var_s *>
+    (xmalloc(sizeof(struct var_s) * (7 * longestamplicon + 4)));
+  struct var_s * variant_list2 = static_cast<struct var_s *>
+    (xmalloc(sizeof(struct var_s) * (7 * (longestamplicon+1) + 4)));
+
+  size_t size = 8 * ((db_getlongestsequence() + 2 + 31) / 32);
+  char * buffer1 = static_cast<char *>(xmalloc(size));
+  pthread_mutex_lock(&heavy_mutex);
+  while ((heavy_amplicon < amplicons) &&
+         (heavy_progress < heavy_amplicon_count))
     {
-      for(int v=1; v<5; v++)
+      unsigned int a = heavy_amplicon++;
+      if (swarminfo[ampinfo[a].swarmid].mass >=
+          static_cast<uint64_t>(opt_boundary))
         {
-          if((i==seqlen) || (v != seq[i]))
-            {
-              varseq[i] = v;
-              bloom->set(varseq, seqlen+1);
-              variants++;
-            }
+          progress_update(++heavy_progress);
+          pthread_mutex_unlock(&heavy_mutex);
+          uint64_t m, v;
+          check_heavy_var(bloom_f, buffer1, a, &m, &v,
+                          variant_list, variant_list2);
+          pthread_mutex_lock(&heavy_mutex);
+          heavy_variants += v;
         }
-      if (i<seqlen)
-        varseq[i] = seq[i];
     }
-#if 0
-  long e = expected_variant_count((char*)seq, seqlen);
-  if (variants != e)
-    fprintf(logfile, "Incorrect number of variants: %ld Expected: %ld\n", variants, e);
-#endif
-  return variants;
+  pthread_mutex_unlock(&heavy_mutex);
+  xfree(buffer1);
+
+  xfree(variant_list2);
+  xfree(variant_list);
 }
 
-long fastidious_check_large_var_2(char * seq,
-                                  size_t seqlen,
-                                  char * varseq,
-                                  int seed)
+uint64_t mark_light_var(struct bloomflex_s * bloom,
+                        unsigned int seed,
+                        struct var_s * variant_list)
 {
-  /* generate second generation variants from seq of length seqlen.
-     Use buffer varseq for variants.
-     The original sequences came from seed */
+  /*
+    add all microvariants of seed to Bloom filter
 
-  long matches = 0;
+    bloom is a bloom filter in which to enter the variants
+    seed is the original seed
+  */
 
-  /* make an exact copy */
-  memcpy(varseq, seq, seqlen);
+  hash_insert(seed);
 
-  /* substitutions */
-  for(unsigned int i=0; i<seqlen; i++)
-    {
-      for (int v=1; v<5; v++)
-        if (v != seq[i])
-          {
-            varseq[i] = v;
-            if (hash_check_attach(varseq, seqlen, seed))
-              matches++;
-          }
-      varseq[i] = seq[i];
-    }
+  unsigned int variant_count = 0;
+
+  char * sequence = db_getsequence(seed);
+  unsigned int seqlen = db_getsequencelen(seed);
+  uint64_t hash = db_gethash(seed);
+  generate_variants(sequence, seqlen, hash,
+                    variant_list, & variant_count, false);
 
-  /* deletions */
-  if (seqlen > 1)
-    memcpy(varseq, seq+1, seqlen-1);
-  for(unsigned int i=0; i<seqlen; i++)
+  for(unsigned int i = 0; i < variant_count; i++)
+    bloomflex_set(bloom, variant_list[i].hash);
+
+  return variant_count;
+}
+
+void mark_light_thread(int64_t t)
+{
+  (void) t;
+
+  struct var_s * variant_list = static_cast<struct var_s *>
+    (xmalloc(sizeof(struct var_s) * (7 * longestamplicon + 4)));
+
+  pthread_mutex_lock(&light_mutex);
+  while (light_progress < light_amplicon_count)
     {
-      if ((i==0) || (seq[i] != seq[i-1]))
+      unsigned int a = light_amplicon--;
+      if (swarminfo[ampinfo[a].swarmid].mass <
+          static_cast<uint64_t>(opt_boundary))
         {
-          if (hash_check_attach(varseq, seqlen-1, seed))
-            matches++;
+          progress_update(++light_progress);
+          pthread_mutex_unlock(&light_mutex);
+          uint64_t v = mark_light_var(bloom_f, a, variant_list);
+          pthread_mutex_lock(&light_mutex);
+          light_variants += v;
         }
-      varseq[i] = seq[i];
     }
+  pthread_mutex_unlock(&light_mutex);
+
+  xfree(variant_list);
+}
+
+
+/******************** FASTIDIOUS END ********************/
+
+
+inline void find_variant_matches(unsigned int seed,
+                                 var_s * var,
+                                 unsigned int * hits_data,
+                                 unsigned int * hits_count)
+{
+  if (! bloom_get(bloom_a, var->hash))
+    return;
+
+  /* compute hash table index from the variant hash */
+
+  uint64_t j = hash_getindex(var->hash);
 
-  /* insertions */
-  memcpy(varseq+1, seq, seqlen);
-  for(unsigned int i=0; i<seqlen+1; i++)
+  /* find matching buckets */
+
+  while (hash_is_occupied(j))
     {
-      for(int v=1; v<5; v++)
+      if (hash_compare_value(j, var->hash))
         {
-          if((i==seqlen) || (v != seq[i]))
-            {
-              varseq[i] = v;
-              if (hash_check_attach(varseq, seqlen+1, seed))
-                matches++;
-            }
+          unsigned int amp = hash_get_data(j);
+
+          /* avoid self */
+          if (seed != amp)
+            if (opt_no_otu_breaking ||
+                (db_getabundance(seed) >= db_getabundance(amp)))
+              {
+                char * seed_sequence = db_getsequence(seed);
+                unsigned int seed_seqlen = db_getsequencelen(seed);
+
+                char * amp_sequence = db_getsequence(amp);
+                unsigned int amp_seqlen = db_getsequencelen(amp);
+
+                if (check_variant(seed_sequence, seed_seqlen,
+                                  var,
+                                  amp_sequence, amp_seqlen))
+                  {
+                    hits_data[(*hits_count)++] = amp;
+                    break;
+                  }
+              }
         }
-      if (i<seqlen)
-        varseq[i] = seq[i];
+      j = hash_getnextindex(j);
     }
-  return matches;
 }
 
-void fastidious_check_large_var(BloomFilter * bloom,
-                                char * varseq,
-                                char * buffer2,
-                                int seed,
-                                long * matches_p,
-                                long * variants_p)
+void check_variants(unsigned int seed,
+                    var_s * variant_list,
+                    unsigned int * hits_data,
+                    unsigned int * hits_count)
 {
-  /*
-    bloom is a BloomFilter in which to enter the variants
-    buffer1 is a buffer large enough to hold all sequences + 1 insertion
-    buffer2 is a buffer large enough to hold all sequences + 2 insertions
-    seed is the original seed
-    matches_p is where to store number of matches
-    variants_p is where to store number of variants
-  */
+  unsigned int variant_count = 0;
+  * hits_count = 0;
+
+  char * sequence = db_getsequence(seed);
+  unsigned int seqlen = db_getsequencelen(seed);
+  uint64_t hash = db_gethash(seed);
+  generate_variants(sequence, seqlen, hash,
+                    variant_list, & variant_count, true);
 
-  long variants = 0;
-  long matches = 0;
+  for(unsigned int i = 0; i < variant_count; i++)
+    find_variant_matches(seed, variant_list + i, hits_data, hits_count);
+}
+
+void network_thread(int64_t t)
+{
+  (void) t;
 
-  unsigned char * seq = (unsigned char*) db_getsequence(seed);
-  unsigned long seqlen = db_getsequencelen(seed);
+  unsigned int hits_count = 0;
+  unsigned int * hits_data = static_cast<unsigned int *>
+    (xmalloc((7 * longestamplicon + 5) * sizeof(unsigned int)));
 
-  /* make an exact copy */
-  memcpy(varseq, seq, seqlen);
+  struct var_s * variant_list = static_cast<struct var_s *>
+    (xmalloc((7 * longestamplicon + 5) * sizeof(struct var_s)));
 
-  /* substitutions */
-  for(unsigned int i=0; i<seqlen; i++)
+  pthread_mutex_lock(&network_mutex);
+  while (network_amp < amplicons)
     {
-      for (int v=1; v<5; v++)
-        if (v != seq[i])
-          {
-            varseq[i] = v;
-            variants++;
-            if (bloom->get(varseq, seqlen))
-              matches += fastidious_check_large_var_2(varseq,
-                                                      seqlen,
-                                                      buffer2,
-                                                      seed);
-          }
-      varseq[i] = seq[i];
-    }
+      unsigned int amp = network_amp++;
+      progress_update(amp);
 
-  /* deletions */
-  if (seqlen > 1)
-    memcpy(varseq, seq+1, seqlen-1);
-  for(unsigned int i=0; i<seqlen; i++)
-    {
-      if ((i==0) || (seq[i] != seq[i-1]))
+      pthread_mutex_unlock(&network_mutex);
+
+      hits_count = 0;
+      check_variants(amp, variant_list, hits_data, & hits_count);
+      pthread_mutex_lock(&network_mutex);
+
+      ampinfo[amp].link_start = network_count;
+      ampinfo[amp].link_count = hits_count;
+
+      if (network_count + hits_count > network_alloc)
         {
-          variants++;
-          if (bloom->get(varseq, seqlen-1))
-            matches += fastidious_check_large_var_2(varseq,
-                                                    seqlen-1,
-                                                    buffer2,
-                                                    seed);
+          while (network_count + hits_count > network_alloc)
+            network_alloc += 1024 * 1024;
+          network = static_cast<unsigned int*>
+            (xrealloc(network, network_alloc * sizeof(unsigned int)));
         }
-      varseq[i] = seq[i];
+
+      for(unsigned int i=0; i < hits_count; i++)
+        network[network_count++] = hits_data[i];
+    }
+  pthread_mutex_unlock(&network_mutex);
+
+  xfree(variant_list);
+  xfree(hits_data);
+}
+
+void process_seed(unsigned int seed)
+{
+  /* update swarm stats */
+  struct ampinfo_s * bp = ampinfo + seed;
+
+  swarmsize++;
+  if (bp->generation > swarm_maxgen)
+    swarm_maxgen = bp->generation;
+  uint64_t abundance = db_getabundance(seed);
+  abundance_sum += abundance;
+  if (abundance == 1)
+    singletons++;
+  swarm_sumlen += db_getsequencelen(seed);
+
+  unsigned int s = ampinfo[seed].link_start;
+  unsigned int c = ampinfo[seed].link_count;
+
+  if (global_hits_count + c > global_hits_alloc)
+    {
+      while (global_hits_count + c > global_hits_alloc)
+        global_hits_alloc += 4096;
+      global_hits_data = static_cast<unsigned int *>
+        (xrealloc(global_hits_data, global_hits_alloc * sizeof(unsigned int)));
     }
 
-  /* insertions */
-  memcpy(varseq+1, seq, seqlen);
-  for(unsigned int i=0; i<seqlen+1; i++)
+  for(unsigned int i = 0; i < c; i++)
     {
-      for(int v=1; v<5; v++)
+      unsigned int amp = network[s + i];
+
+      if (ampinfo[amp].swarmid == NO_SWARM)
         {
-          if((i==seqlen) || (v != seq[i]))
-            {
-              varseq[i] = v;
-              variants++;
-              if (bloom->get(varseq, seqlen+1))
-                matches += fastidious_check_large_var_2(varseq,
-                                                        seqlen+1,
-                                                        buffer2,
-                                                        seed);
-            }
+          global_hits_data[global_hits_count++] = amp;
+
+          /* update info */
+          ampinfo[amp].swarmid = ampinfo[seed].swarmid;
+          ampinfo[amp].generation = ampinfo[seed].generation + 1;
+          ampinfo[amp].parent = seed;
         }
-      if (i<seqlen)
-        varseq[i] = seq[i];
     }
-  *matches_p = matches;
-  *variants_p = variants;
-
-#if 0
-  long e = expected_variant_count((char*)seq, seqlen);
-  if (variants != e)
-    fprintf(logfile, "Incorrect number of variants: %ld Expected: %ld\n", variants, e);
-#endif
 }
 
-
 int compare_amp(const void * a, const void * b)
 {
-  int * x = (int*) a;
-  int * y = (int*) b;
+  /*
+    Swarm checks that all amplicon sequences are unique (strictly
+    dereplicated input data), so distinct amplicons with the same
+    sequence are not expected at this stage.
+  */
+
+  const unsigned int * x = static_cast<const unsigned int*>(a);
+  const unsigned int * y = static_cast<const unsigned int*>(b);
   if (*x < *y)
     return -1;
   else if (*x > *y)
@@ -932,112 +667,110 @@ int compare_amp(const void * a, const void * b)
     return 0;
 }
 
-static pthread_mutex_t light_mutex;
-static long light_variants;
-static long light_progress;
-static long light_amplicon_count;
-static int light_amplicon;
-BloomFilter * bloomp;
-
-void mark_light_thread(long t)
+int compare_mass(const void * a, const void * b)
 {
-  (void) t;
+  const swarminfo_s * x = swarminfo + *(static_cast<const unsigned int *>(a));
+  const swarminfo_s * y = swarminfo + *(static_cast<const unsigned int *>(b));
 
-  char * buffer1 = (char*) xmalloc(db_getlongestsequence() + 2);
-  pthread_mutex_lock(&light_mutex);
-  while (light_progress < light_amplicon_count)
-    {
-      int a = light_amplicon--;
-      if (swarminfo[ampinfo[a].swarmid].mass < opt_boundary)
-        {
-          progress_update(++light_progress);
-          pthread_mutex_unlock(&light_mutex);
-          long v = fastidious_mark_small_var(bloomp, buffer1, a);
-          pthread_mutex_lock(&light_mutex);
-          light_variants += v;
-        }
-    }
-  pthread_mutex_unlock(&light_mutex);
-  free(buffer1);
-}
+  uint64_t m = x->mass;
+  uint64_t n = y->mass;
 
-static pthread_mutex_t heavy_mutex;
-static long heavy_variants;
-static long heavy_progress;
-static long heavy_amplicon_count;
-static int heavy_amplicon;
-static long amplicons;
+  if (m > n)
+    return -1;
+  else if (m < n)
+    return +1;
+  else
+    return strcmp(db_getheader(x->seed), db_getheader(y->seed));
+}
 
-void check_heavy_thread(long t)
+inline void add_amp_to_swarm(unsigned int amp)
 {
-  (void) t;
-
-  char * buffer1 = (char*) xmalloc(db_getlongestsequence() + 2);
-  char * buffer2 = (char*) xmalloc(db_getlongestsequence() + 3);
-  pthread_mutex_lock(&heavy_mutex);
-  while ((heavy_amplicon < amplicons) && (heavy_progress < heavy_amplicon_count))
-    {
-      int a = heavy_amplicon++;
-      if (swarminfo[ampinfo[a].swarmid].mass >= opt_boundary)
-        {
-          progress_update(++heavy_progress);
-          pthread_mutex_unlock(&heavy_mutex);
-          long m, v;
-          fastidious_check_large_var(bloomp, buffer1, buffer2, a, &m, &v);
-          pthread_mutex_lock(&heavy_mutex);
-          heavy_variants += v;
-        }
-    }
-  pthread_mutex_unlock(&heavy_mutex);
-  free(buffer2);
-  free(buffer1);
+  /* add to swarm */
+  ampinfo[current_swarm_tail].next = amp;
+  current_swarm_tail = amp;
 }
 
 void algo_d1_run()
 {
-  unsigned long longestamplicon = db_getlongestsequence();
+  longestamplicon = db_getlongestsequence();
   amplicons = db_getsequencecount();
 
-  threads_init();
-
-  ampinfo = (struct ampinfo_s *)
-    xmalloc(amplicons * sizeof(struct ampinfo_s));
+  ampinfo = static_cast<struct ampinfo_s *>
+    (xmalloc(amplicons * sizeof(struct ampinfo_s)));
 
-  global_hits_alloc = longestamplicon * 7 + 4;
-  global_hits_data = (int *) xmalloc(global_hits_alloc * sizeof(int));
+  global_hits_alloc = longestamplicon * 7 + 4 + 1;
+  global_hits_data = static_cast<unsigned int *>
+    (xmalloc(global_hits_alloc * sizeof(unsigned int)));
 
   /* compute hash for all amplicons and store them in a hash table */
-  
+
   hash_alloc(amplicons);
+  bloom_a = bloom_init(hash_get_tablesize());
+
+  duplicates_found = 0;
 
   progress_init("Hashing sequences:", amplicons);
   for(unsigned int i=0; i<amplicons; i++)
     {
-      unsigned long seqlen = db_getsequencelen(i);
-      unsigned char * seq = (unsigned char *) db_getsequence(i);
       struct ampinfo_s * bp = ampinfo + i;
       bp->generation = 0;
       bp->swarmid = NO_SWARM;
       bp->next = NO_SWARM;
       bp->graft_cand = NO_SWARM;
-      hash_insert(i, seq, seqlen);
+      hash_insert(i);
       progress_update(i);
+      if (duplicates_found)
+        break;
+    }
+
+  if (duplicates_found)
+    {
+      fprintf(logfile,
+              "\n\n"
+              "Error: some fasta entries have identical sequences.\n"
+              "Swarm expects dereplicated fasta files.\n"
+              "Such files can be produced with swarm or vsearch:\n"
+              " swarm -d 0 -w derep.fasta -o /dev/null input.fasta\n"
+              "or\n"
+              " vsearch --derep_fulllength input.fasta --sizein --sizeout --output derep.fasta\n");
+      exit(1);
     }
+
   progress_done();
 
-  unsigned char * dir = 0;
-  unsigned long * hearray = 0;
+  unsigned char * dir = nullptr;
+  uint64_t * hearray = nullptr;
 
   if (uclustfile)
     {
-      dir = (unsigned char *) xmalloc(longestamplicon*longestamplicon);
-      hearray = (unsigned long *)
-        xmalloc(2 * longestamplicon * sizeof(unsigned long));
+      dir = static_cast<unsigned char *>
+        (xmalloc(longestamplicon*longestamplicon));
+      hearray = static_cast<uint64_t *>
+        (xmalloc(2 * longestamplicon * sizeof(uint64_t)));
     }
-  
+
+  /* for all amplicons, generate list of matching amplicons */
+
+  network = static_cast<unsigned int*>
+    (xmalloc(network_alloc * sizeof(unsigned int)));
+  network_count = 0;
+
+  pthread_mutex_init(&network_mutex, nullptr);
+  network_amp = 0;
+  progress_init("Building network: ", amplicons);
+  ThreadRunner * network_tr = new ThreadRunner(static_cast<int>(opt_threads),
+                                               network_thread);
+  network_tr->run();
+  delete network_tr;
+  pthread_mutex_destroy(&network_mutex);
+
+  progress_done();
+
   /* for each non-swarmed amplicon look for subseeds ... */
-  long swarmcount = 0;
+
+  unsigned int swarmcount = 0;
   progress_init("Clustering:       ", amplicons);
+
   for(unsigned int seed = 0; seed < amplicons; seed++)
     {
       struct ampinfo_s * ap = ampinfo + seed;
@@ -1050,10 +783,10 @@ void algo_d1_run()
           ap->generation = 0;
           ap->parent = NO_SWARM;
           ap->next = NO_SWARM;
-                              
+
           /* link up this initial seed in the list of swarms */
           current_swarm_tail = seed;
-          
+
           /* initialize swarm stats */
           swarmsize = 0;
           swarm_maxgen = 0;
@@ -1061,8 +794,6 @@ void algo_d1_run()
           singletons = 0;
           swarm_sumlen = 0;
 
-          update_stats(seed);
-          
           /* init list */
           global_hits_count = 0;
 
@@ -1071,31 +802,31 @@ void algo_d1_run()
 
           /* sort hits */
           qsort(global_hits_data, global_hits_count,
-                sizeof(int), compare_amp);
-          
+                sizeof(unsigned int), compare_amp);
+
           /* add subseeds on list to current swarm */
-          for(int i = 0; i < global_hits_count; i++)
+          for(unsigned int i = 0; i < global_hits_count; i++)
             add_amp_to_swarm(global_hits_data[i]);
-          
+
           /* find later generation matches */
-          int subseed = ap->next;
+          unsigned int subseed = ap->next;
           while(subseed != NO_SWARM)
             {
               /* process all subseeds of this generation */
               global_hits_count = 0;
+
               while(subseed != NO_SWARM)
                 {
                   process_seed(subseed);
-                  update_stats(subseed);
                   subseed = ampinfo[subseed].next;
                 }
-              
+
               /* sort all of this generation */
               qsort(global_hits_data, global_hits_count,
-                    sizeof(int), compare_amp);
-              
+                    sizeof(unsigned int), compare_amp);
+
               /* add them to the swarm */
-              for(int i = 0; i < global_hits_count; i++)
+              for(unsigned int i = 0; i < global_hits_count; i++)
                 add_amp_to_swarm(global_hits_data[i]);
 
               /* start with most abundant amplicon of next generation */
@@ -1108,11 +839,9 @@ void algo_d1_run()
           if (swarmcount >= swarminfo_alloc)
             {
               /* allocate memory for more swarms... */
-              swarminfo_alloc += 1000;
-              swarminfo = 
-                (struct swarminfo_s *) xrealloc (swarminfo,
-                                                 swarminfo_alloc *
-                                                 sizeof(swarminfo_s));
+              swarminfo_alloc += 1024;
+              swarminfo = static_cast<struct swarminfo_s *>
+                (xrealloc(swarminfo, swarminfo_alloc * sizeof(swarminfo_s)));
             }
 
           struct swarminfo_s * sp = swarminfo + swarmcount;
@@ -1125,7 +854,7 @@ void algo_d1_run()
           sp->maxgen = swarm_maxgen;
           sp->last = current_swarm_tail;
           sp->attached = false;
-          
+
           /* update overall stats */
           if (swarmsize > largest)
             largest = swarmsize;
@@ -1138,6 +867,11 @@ void algo_d1_run()
     }
   progress_done();
 
+  xfree(global_hits_data);
+
+  xfree(network);
+  network = nullptr;
+
   swarmcount_adjusted = swarmcount;
 
   /* fastidious */
@@ -1146,21 +880,21 @@ void algo_d1_run()
     {
       fprintf(logfile, "\n");
       fprintf(logfile, "Results before fastidious processing:\n");
-      fprintf(logfile, "Number of swarms:  %ld\n", swarmcount);
-      fprintf(logfile, "Largest swarm:     %d\n", largest);
+      fprintf(logfile, "Number of swarms:  %u\n", swarmcount);
+      fprintf(logfile, "Largest swarm:     %u\n", largest);
       fprintf(logfile, "\n");
 
-      long small_otus = 0;
-      long amplicons_in_small_otus = 0;
-      long nucleotides_in_small_otus = 0;
+      uint64_t small_otus = 0;
+      uint64_t amplicons_in_small_otus = 0;
+      uint64_t nucleotides_in_small_otus = 0;
 
-      progress_init("Counting amplicons in heavy and light swarms", 
+      progress_init("Counting amplicons in heavy and light swarms",
                     swarmcount);
 
-      for(long i = 0; i < swarmcount; i++)
+      for(uint64_t i = 0; i < swarmcount; i++)
         {
           struct swarminfo_s * sp = swarminfo + i;
-          if (sp->mass < opt_boundary)
+          if (sp->mass < static_cast<uint64_t>(opt_boundary))
             {
               amplicons_in_small_otus += sp->size;
               nucleotides_in_small_otus += sp->sumlen;
@@ -1170,16 +904,16 @@ void algo_d1_run()
         }
       progress_done();
 
-      long amplicons_in_large_otus = amplicons - amplicons_in_small_otus;
-      long large_otus = swarmcount - small_otus;
+      uint64_t amplicons_in_large_otus = amplicons - amplicons_in_small_otus;
+      uint64_t large_otus = swarmcount - small_otus;
 
-      fprintf(logfile, "Heavy swarms: %ld, with %ld amplicons\n",
+      fprintf(logfile, "Heavy swarms: %" PRIu64 ", with %" PRIu64 " amplicons\n",
               large_otus, amplicons_in_large_otus);
-      fprintf(logfile, "Light swarms: %ld, with %ld amplicons\n",
+      fprintf(logfile, "Light swarms: %" PRIu64 ", with %" PRIu64 " amplicons\n",
               small_otus, amplicons_in_small_otus);
-      fprintf(logfile, "Total length of amplicons in light swarms: %ld\n",
+      fprintf(logfile, "Total length of amplicons in light swarms: %" PRIu64 "\n",
               nucleotides_in_small_otus);
-      
+
       if ((small_otus == 0) || (large_otus == 0))
         {
           fprintf(logfile, "Only light or heavy swarms found - "
@@ -1190,30 +924,43 @@ void algo_d1_run()
           /* m: total size of Bloom filter in bits */
           /* k: number of hash functions */
           /* n: number of entries in the bloom filter */
-          /* here: k=12 and m/n=18, that is 18 bits/entry */
-          
-          long bits = opt_bloom_bits; /* 18 */
-          long k = int(bits * 0.693);    /* 12 */
-          long m = bits * 7 * nucleotides_in_small_otus;
-          
-          long memtotal = arch_get_memtotal();
-          long memused = arch_get_memused();
+          /* here: k=6 and m/n=16, that is 16 bits/entry */
+
+          uint64_t bits = static_cast<uint64_t>(opt_bloom_bits); /* 16 */
+
+          // int64_t k = int(bits * 0.693);    /* 11 */
+          unsigned int k = static_cast<unsigned int>(4 * bits / 10); /* 6 */
+          if (k < 1)
+            k = 1;
+
+          uint64_t m = bits * 7 * nucleotides_in_small_otus;
+
+          uint64_t memtotal = arch_get_memtotal();
+          uint64_t memused = arch_get_memused();
 
           if (opt_ceiling)
             {
-              long memrest = 1024 * 1024 * opt_ceiling - memused;
-              long new_bits = 8 * memrest / (7 * nucleotides_in_small_otus);
+              uint64_t memrest
+                = 1024 * 1024 * static_cast<uint64_t>(opt_ceiling) - memused;
+              uint64_t new_bits = 8 * memrest / (7 * nucleotides_in_small_otus);
               if (new_bits < bits)
                 {
                   if (new_bits < 2)
                     fatal("Insufficient memory remaining for Bloom filter");
                   fprintf(logfile, "Reducing memory used for Bloom filter due to --ceiling option.\n");
                   bits = new_bits;
-                  k = int(bits * 0.693);
+                  // k = int(bits * 0.693);
+                  k = static_cast<unsigned int>(4 * bits / 10);
+                  if (k < 1)
+                    k = 1;
+
                   m = bits * 7 * nucleotides_in_small_otus;
                 }
             }
 
+          if (m < 64)
+            m = 64;
+
           if (memused + m/8 > memtotal)
             {
               fprintf(logfile, "WARNING: Memory usage will probably exceed total amount of memory available.\n");
@@ -1221,13 +968,19 @@ void algo_d1_run()
             }
 
           fprintf(logfile,
-                  "Bloom filter: bits=%ld, m=%ld, k=%ld, size=%.1fMB\n",
-                  bits, m, k, 1.0 * m / (8*1024*1024));
-          
-          bloomp = new BloomFilter(m, k);
-          char * buffer1 = (char*) xmalloc(db_getlongestsequence() + 2);
-          char * buffer2 = (char*) xmalloc(db_getlongestsequence() + 3);
-          
+                  "Bloom filter: bits=%" PRIu64 ", m=%" PRIu64 ", k=%u, size=%.1fMB\n",
+                  bits, m, k, static_cast<double>(m) / (8*1024*1024));
+
+
+          bloom_f = bloomflex_init(m/8, k);
+
+
+          /* Empty the old hash and bloom filter
+             before we reinsert only the light swarm amplicons */
+
+          hash_zap();
+          bloom_zap(bloom_a);
+
           progress_init("Adding light swarm amplicons to Bloom filter",
                         amplicons_in_small_otus);
 
@@ -1235,87 +988,54 @@ void algo_d1_run()
           /* but stop when all amplicons in small otus are processed */
 
           light_variants = 0;
-                        
-#if 1
-          pthread_mutex_init(&light_mutex, NULL);
+
+          pthread_mutex_init(&light_mutex, nullptr);
           light_progress = 0;
           light_amplicon_count = amplicons_in_small_otus;
           light_amplicon = amplicons - 1;
-          ThreadRunner * tr = new ThreadRunner(opt_threads, mark_light_thread);
+          ThreadRunner * tr = new ThreadRunner(static_cast<int>(opt_threads),
+                                               mark_light_thread);
           tr->run();
           delete tr;
           pthread_mutex_destroy(&light_mutex);
-#else
-          int a = amplicons - 1;
-          long x = 0;
-          while (x < amplicons_in_small_otus)
-            {
-              if (swarminfo[ampinfo[a].swarmid].mass < opt_boundary)
-                {
-                  light_variants += fastidious_mark_small_var(bloomp, buffer1, a);
-                  x++;
-                  progress_update(x);
-                }
-              a--;
-            }
-#endif
 
           progress_done();
 
           fprintf(logfile,
-                  "Generated %ld variants from light swarms\n", light_variants);
-                    
+                  "Generated %" PRIu64 " variants from light swarms\n",
+                  light_variants);
+
           progress_init("Checking heavy swarm amplicons against Bloom filter",
                         amplicons_in_large_otus);
-          
+
           /* process amplicons in order from most to least abundant */
           /* but stop when all amplicons in large otus are processed */
 
-          pthread_mutex_init(&graft_mutex, NULL);
+          pthread_mutex_init(&graft_mutex, nullptr);
 
           heavy_variants = 0;
 
-#if 1
-          pthread_mutex_init(&heavy_mutex, NULL);
+          pthread_mutex_init(&heavy_mutex, nullptr);
           heavy_progress = 0;
           heavy_amplicon_count = amplicons_in_large_otus;
           heavy_amplicon = 0;
-          ThreadRunner * heavy_tr = new ThreadRunner(opt_threads, 
-                                                     check_heavy_thread);
+          ThreadRunner * heavy_tr
+            = new ThreadRunner(static_cast<int> (opt_threads),
+                               check_heavy_thread);
           heavy_tr->run();
           delete heavy_tr;
           pthread_mutex_destroy(&heavy_mutex);
-#else
-          long i = 0;
-          
-          for(int a = 0; (a < amplicons) && (i < amplicons_in_large_otus); a++)
-            {
-              int swarmid = ampinfo[a].swarmid;
-              int mass = swarminfo[swarmid].mass;
-              if (mass >= opt_boundary)
-                {
-                  long m, v;
-                  fastidious_check_large_var(bloomp, buffer1, buffer2, 
-                                             a, &m, &v);
-                  heavy_variants += v;
-                  progress_update(++i);
-                }
-            }
-#endif
 
           progress_done();
 
-          free(buffer1);
-          free(buffer2);
-
-          delete bloomp;
+          bloomflex_exit(bloom_f);
 
           pthread_mutex_destroy(&graft_mutex);
 
-          fprintf(logfile, "Heavy variants: %ld\n", heavy_variants);
-          fprintf(logfile, "Got %ld graft candidates\n", graft_candidates);
-          int grafts = attach_candidates(amplicons);
-          fprintf(logfile, "Made %d grafts\n", grafts);
+          fprintf(logfile, "Heavy variants: %" PRIu64 "\n", heavy_variants);
+          fprintf(logfile, "Got %" PRId64 " graft candidates\n", graft_candidates);
+          unsigned int grafts = attach_candidates(amplicons);
+          fprintf(logfile, "Made %u grafts\n", grafts);
           fprintf(logfile, "\n");
         }
     }
@@ -1326,15 +1046,15 @@ void algo_d1_run()
   progress_init("Writing swarms:   ", swarmcount);
 
   if (opt_mothur)
-    fprintf(outfile, "swarm_%ld\t%ld", opt_differences, swarmcount_adjusted);
+    fprintf(outfile, "swarm_%" PRId64 "\t%" PRIu64, opt_differences, swarmcount_adjusted);
 
-  for(int i = 0; i < swarmcount; i++)
+  for(unsigned int i = 0; i < swarmcount; i++)
     {
       if (!swarminfo[i].attached)
         {
-          int seed = swarminfo[i].seed;
-          for (int a = seed;
-               a >= 0;
+          unsigned int seed = swarminfo[i].seed;
+          for (unsigned int a = seed;
+               a != NO_SWARM;
                a = ampinfo[a].next)
             {
               if (opt_mothur)
@@ -1359,7 +1079,7 @@ void algo_d1_run()
 
   if (opt_mothur)
     fputc('\n', outfile);
-  
+
   progress_done();
 
 
@@ -1368,11 +1088,19 @@ void algo_d1_run()
   if (opt_seeds)
     {
       progress_init("Writing seeds:    ", swarmcount);
-      for(int i=0; i < swarmcount; i++)
+
+      unsigned int * sorter = static_cast<unsigned int *>
+        (xmalloc(swarmcount * sizeof(unsigned int)));
+      for(unsigned int i = 0; i < swarmcount; i++)
+        sorter[i] = i;
+      qsort(sorter, swarmcount, sizeof(unsigned int), compare_mass);
+
+      for(unsigned int j=0; j < swarmcount; j++)
         {
+          unsigned int i = sorter[j];
           if (!swarminfo[i].attached)
             {
-              int seed = swarminfo[i].seed;
+              unsigned int seed = swarminfo[i].seed;
               fprintf(fp_seeds, ">");
               fprint_id_with_new_abundance(fp_seeds, seed, swarminfo[i].mass);
               fprintf(fp_seeds, "\n");
@@ -1380,6 +1108,9 @@ void algo_d1_run()
             }
-          progress_update(i+1);
+          progress_update(j+1);
         }
+
+      xfree(sorter);
+
       progress_done();
     }
 
@@ -1392,43 +1123,46 @@ void algo_d1_run()
 
       progress_init("Writing structure:", swarmcount);
 
-      for(unsigned int swarmid = 0; swarmid < swarmcount ; swarmid++)
+      for(unsigned int swarmid = 0; swarmid < swarmcount; swarmid++)
         {
           if (!swarminfo[swarmid].attached)
             {
-              int seed = swarminfo[swarmid].seed;
+              unsigned int seed = swarminfo[swarmid].seed;
 
               struct ampinfo_s * bp = ampinfo + seed;
 
-              for (int a = bp->next;
-                   a >= 0;
+              for (unsigned int a = bp->next;
+                   a != NO_SWARM;
                    a = ampinfo[a].next)
                 {
-                  long graft_parent = ampinfo[a].graft_cand;
+                  uint64_t graft_parent = ampinfo[a].graft_cand;
                   if (graft_parent != NO_SWARM)
                     {
-                      fprint_id_noabundance(internal_structure_file, graft_parent);
+                      fprint_id_noabundance(internal_structure_file,
+                                            graft_parent);
                       fprintf(internal_structure_file, "\t");
                       fprint_id_noabundance(internal_structure_file, a);
                       fprintf(internal_structure_file,
-                              "\t%d\t%u\t%d\n",
+                              "\t%d\t%u\t%u\n",
                               2,
                               cluster_no + 1,
                               ampinfo[graft_parent].generation + 1);
                     }
 
-                  long parent = ampinfo[a].parent;
+                  uint64_t parent = ampinfo[a].parent;
                   if (parent != NO_SWARM)
                     {
-                      int diff = 1;
+                      unsigned int diff = 1;
                       if (duplicates_found)
                         {
-                          unsigned long parentseqlen = db_getsequencelen(parent);
-                          unsigned long ampseqlen = db_getsequencelen(a);
+                          uint64_t parentseqlen = db_getsequencelen(parent);
+                          uint64_t ampseqlen = db_getsequencelen(a);
                           if (parentseqlen == ampseqlen)
                             {
-                              unsigned char * parentseq = (unsigned char *) db_getsequence(parent);
-                              unsigned char * ampseq = (unsigned char *) db_getsequence(a);
+                              unsigned char * parentseq = reinterpret_cast
+                                <unsigned char *>(db_getsequence(parent));
+                              unsigned char * ampseq = reinterpret_cast
+                                <unsigned char *>(db_getsequence(a));
                               if (memcmp(parentseq, ampseq, parentseqlen) == 0)
                                 diff = 0;
                             }
@@ -1437,7 +1171,7 @@ void algo_d1_run()
                       fprintf(internal_structure_file, "\t");
                       fprint_id_noabundance(internal_structure_file, a);
                       fprintf(internal_structure_file,
-                              "\t%d\t%u\t%d\n",
+                              "\t%u\t%u\t%u\n",
                               diff,
                               cluster_no + 1,
                               ampinfo[a].generation);
@@ -1460,62 +1194,63 @@ void algo_d1_run()
 
       progress_init("Writing UCLUST:   ", swarmcount);
 
-      for(unsigned int swarmid = 0; swarmid < swarmcount ; swarmid++)
+      for(unsigned int swarmid = 0; swarmid < swarmcount; swarmid++)
         {
           if (!swarminfo[swarmid].attached)
             {
-              int seed = swarminfo[swarmid].seed;
-              
+              unsigned int seed = swarminfo[swarmid].seed;
+
               struct ampinfo_s * bp = ampinfo + seed;
-              
-              fprintf(uclustfile, "C\t%u\t%d\t*\t*\t*\t*\t*\t",
+
+              fprintf(uclustfile, "C\t%u\t%u\t*\t*\t*\t*\t*\t",
                       cluster_no,
                       swarminfo[swarmid].size);
               fprint_id(uclustfile, seed);
               fprintf(uclustfile, "\t*\n");
-              
-              fprintf(uclustfile, "S\t%u\t%lu\t*\t*\t*\t*\t*\t",
+
+              fprintf(uclustfile, "S\t%u\t%u\t*\t*\t*\t*\t*\t",
                       cluster_no,
                       db_getsequencelen(seed));
               fprint_id(uclustfile, seed);
               fprintf(uclustfile, "\t*\n");
-              
-              for (int a = bp->next; 
-                   a >= 0;
+
+              for (unsigned int a = bp->next;
+                   a != NO_SWARM;
                    a = ampinfo[a].next)
                 {
                   char * dseq = db_getsequence(a);
-                  char * dend = dseq + db_getsequencelen(a);
+                  int64_t dlen = db_getsequencelen(a);
                   char * qseq = db_getsequence(seed);
-                  char * qend = qseq + db_getsequencelen(seed);
-                  
-                  unsigned long nwscore = 0;
-                  unsigned long nwdiff = 0;
-                  char * nwalignment = NULL;
-                  unsigned long nwalignmentlength = 0;
-                  
-                  nw(dseq, dend, qseq, qend,
+                  int64_t qlen = db_getsequencelen(seed);
+
+                  int64_t nwscore = 0;
+                  int64_t nwdiff = 0;
+                  char * nwalignment = nullptr;
+                  int64_t nwalignmentlength = 0;
+
+                  nw(dseq, dlen, qseq, qlen,
                      score_matrix_63, penalty_gapopen, penalty_gapextend,
                      & nwscore, & nwdiff, & nwalignmentlength, & nwalignment,
-                     dir, hearray, 0, 0);
-                  
-                  double percentid = 100.0 * (nwalignmentlength - nwdiff) /
-                    nwalignmentlength;
-                  
+                     dir, reinterpret_cast<int64_t *>(hearray), 0, 0);
+
+                  double percentid
+                    = 100.0 * static_cast<double>(nwalignmentlength - nwdiff)
+                    / static_cast<double>(nwalignmentlength);
+
                   fprintf(uclustfile,
-                          "H\t%u\t%lu\t%.1f\t+\t0\t0\t%s\t",
+                          "H\t%u\t%u\t%.1f\t+\t0\t0\t%s\t",
                           cluster_no,
                           db_getsequencelen(a),
-                          percentid, 
+                          percentid,
                           nwdiff > 0 ? nwalignment : "=");
-                  
+
                   fprint_id(uclustfile, a);
                   fprintf(uclustfile, "\t");
                   fprint_id(uclustfile, seed);
                   fprintf(uclustfile, "\n");
-                  
+
                   if (nwalignment)
-                    free(nwalignment);
+                    xfree(nwalignment);
                 }
 
               cluster_no++;
@@ -1530,14 +1265,14 @@ void algo_d1_run()
   if (statsfile)
     {
       progress_init("Writing stats:    ", swarmcount);
-      for(long i = 0; i < swarmcount; i++)
+      for(uint64_t i = 0; i < swarmcount; i++)
         {
           swarminfo_s * sp = swarminfo + i;
           if (!sp->attached)
             {
-              fprintf(statsfile, "%d\t%ld\t", sp->size, sp->mass);
+              fprintf(statsfile, "%u\t%" PRIu64 "\t", sp->size, sp->mass);
               fprint_id_noabundance(statsfile, sp->seed);
-              fprintf(statsfile, "\t%lu\t%d\t%d\t%d\n",
+              fprintf(statsfile, "\t%" PRIu64 "\t%u\t%u\t%u\n",
                       db_getabundance(sp->seed),
                       sp->singletons, sp->maxgen, sp->maxgen);
             }
@@ -1546,35 +1281,31 @@ void algo_d1_run()
       progress_done();
     }
 
-
   fprintf(logfile, "\n");
-  fprintf(logfile, "Number of swarms:  %ld\n", swarmcount_adjusted);
-  fprintf(logfile, "Largest swarm:     %d\n", largest);
-  fprintf(logfile, "Max generations:   %d\n", maxgen);
-
-  threads_done();
+  fprintf(logfile, "Number of swarms:  %" PRIu64 "\n", swarmcount_adjusted);
+  fprintf(logfile, "Largest swarm:     %u\n", largest);
+  fprintf(logfile, "Max generations:   %u\n", maxgen);
 
+  bloom_exit(bloom_a);
   hash_free();
 
-  if(swarminfo)
-    free(swarminfo);
-
-  free(ampinfo);
+  if (swarminfo)
+    xfree(swarminfo);
 
-  free(global_hits_data);
+  xfree(ampinfo);
 
   if (uclustfile)
     {
-      free(dir);
-      free(hearray);
+      xfree(dir);
+      xfree(hearray);
     }
 
 #ifdef HASHSTATS
-  fprintf(logfile, "Tries: %lu\n", tries);
-  fprintf(logfile, "Probes: %lu\n", probes);
-  fprintf(logfile, "Hits: %lu\n", hits);
-  fprintf(logfile, "Success: %lu\n", success);
-  fprintf(logfile, "Bingo: %lu\n", bingo);
-  fprintf(logfile, "Collisions: %lu\n", collisions);
+  fprintf(logfile, "Tries:      %12lu\n", tries);
+  fprintf(logfile, "Bloom m:    %12lu\n", bloom_matches);
+  fprintf(logfile, "Hits:       %12lu\n", hits);
+  fprintf(logfile, "Success:    %12lu\n", success);
+  fprintf(logfile, "Bingo:      %12lu\n", bingo);
+  fprintf(logfile, "Collisions: %12lu\n", collisions);
 #endif
 }
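Review note on the Bloom filter sizing in the fastidious branch of algo_d1_run(): the arithmetic can be sketched in isolation as below. This is a minimal sketch, not the code in the patch. The names `bloom_params`, `pick_k` and `size_bloom` are hypothetical, and `budget_bytes` stands in for the value the patch derives from `opt_ceiling` and `arch_get_memused()`.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical container for the three values the patch writes to the log.
struct bloom_params
{
  uint64_t bits;   // bits per entry
  unsigned int k;  // number of bits set per entry
  uint64_t m;      // total filter size in bits
};

// Same rule as the patch: k = 4 * bits / 10, clamped to at least 1.
inline unsigned int pick_k(uint64_t bits)
{
  unsigned int k = static_cast<unsigned int>(4 * bits / 10);
  return k < 1 ? 1 : k;
}

// nucleotides: total length of the amplicons in light swarms (must be > 0).
// budget_bytes: memory left for the filter; 0 means "no ceiling" (assumption).
inline bloom_params size_bloom(uint64_t bits,
                               uint64_t nucleotides,
                               uint64_t budget_bytes)
{
  assert(nucleotides > 0);

  if (budget_bytes > 0)
    {
      // Largest bits-per-entry value that still fits in the budget.
      uint64_t new_bits = 8 * budget_bytes / (7 * nucleotides);
      if (new_bits < bits)
        bits = new_bits;
    }

  bloom_params p;
  p.bits = bits;
  p.k = pick_k(bits);
  p.m = bits * 7 * nucleotides;  // 7 entries per nucleotide, as in the patch
  if (p.m < 64)
    p.m = 64;                    // never smaller than one 64-bit block
  return p;
}
```

With 16 bits per entry and no ceiling, this gives k = 6, which matches the inline comments in the patch. Unlike the patch, the sketch does not abort when fewer than 2 bits per entry remain.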
diff --git a/src/arch.cc b/src/arch.cc
index 6a749373..e7dc73e5 100644
--- a/src/arch.cc
+++ b/src/arch.cc
@@ -1,7 +1,7 @@
 /*
     SWARM
 
-    Copyright (C) 2012-2017 Torbjorn Rognes and Frederic Mahe
+    Copyright (C) 2012-2019 Torbjorn Rognes and Frederic Mahe
 
     This program is free software: you can redistribute it and/or modify
     it under the terms of the GNU Affero General Public License as
@@ -23,46 +23,100 @@
 
 #include "swarm.h"
 
-unsigned long arch_get_memused()
+uint64_t arch_get_memused()
 {
+#ifdef _WIN32
+
+  PROCESS_MEMORY_COUNTERS pmc;
+  GetProcessMemoryInfo(GetCurrentProcess(),
+                       &pmc,
+                       sizeof(PROCESS_MEMORY_COUNTERS));
+  return pmc.PeakWorkingSetSize;
+
+#else
+
   struct rusage r_usage;
   getrusage(RUSAGE_SELF, & r_usage);
-  
-#if defined __APPLE__
+
+# ifdef __APPLE__
   /* Mac: ru_maxrss gives the size in bytes */
-  return r_usage.ru_maxrss;
-#else
+  return static_cast<uint64_t>(r_usage.ru_maxrss);
+# else
   /* Linux: ru_maxrss gives the size in kilobytes  */
-  return r_usage.ru_maxrss * 1024;
+  return static_cast<uint64_t>(r_usage.ru_maxrss) * 1024;
+# endif
+
 #endif
 }
 
-unsigned long arch_get_memtotal()
+uint64_t arch_get_memtotal()
 {
-#if defined(_SC_PHYS_PAGES) && defined(_SC_PAGESIZE)
-
-  long phys_pages = sysconf(_SC_PHYS_PAGES);
-  long pagesize = sysconf(_SC_PAGESIZE);
+#ifdef _WIN32
 
-  if ((phys_pages == -1) || (pagesize == -1))
-    fatal("Cannot determine amount of memory");
-  return pagesize * phys_pages;
+  MEMORYSTATUSEX ms;
+  ms.dwLength = sizeof(MEMORYSTATUSEX);
+  GlobalMemoryStatusEx(&ms);
+  return ms.ullTotalPhys;
 
 #elif defined(__APPLE__)
 
   int mib [] = { CTL_HW, HW_MEMSIZE };
   int64_t ram = 0;
   size_t length = sizeof(ram);
-  if(sysctl(mib, 2, &ram, &length, NULL, 0) == -1)
-    fatal("Cannot determine amount of memory");
-  return ram;
+  if(sysctl(mib, 2, &ram, &length, nullptr, 0) == -1)
+    fatal("Cannot determine amount of RAM");
+  return static_cast<uint64_t>(ram);
+
+#elif defined(_SC_PHYS_PAGES) && defined(_SC_PAGESIZE)
+
+  int64_t phys_pages = sysconf(_SC_PHYS_PAGES);
+  int64_t pagesize = sysconf(_SC_PAGESIZE);
+  if ((phys_pages == -1) || (pagesize == -1))
+    fatal("Cannot determine amount of RAM");
+  return static_cast<uint64_t>(pagesize * phys_pages);
 
 #else
 
   struct sysinfo si;
   if (sysinfo(&si))
-    fatal("Cannot determine amount of memory");
+    fatal("Cannot determine amount of RAM");
   return si.totalram * si.mem_unit;
 
 #endif
 }
+
+void arch_srandom(unsigned int seed)
+{
+  /* initialize pseudo-random number generator */
+  if (seed == 0)
+    {
+#ifdef _WIN32
+      srand(GetTickCount());
+#else
+      int fd = open("/dev/urandom", O_RDONLY);
+      if (fd < 0)
+        fatal("Unable to open /dev/urandom");
+      if (read(fd, & seed, sizeof(seed)) < 0)
+        fatal("Unable to read from /dev/urandom");
+      close(fd);
+      srandom(seed);
+#endif
+    }
+  else
+    {
+#ifdef _WIN32
+      srand(seed);
+#else
+      srandom(seed);
+#endif
+    }
+}
+
+uint64_t arch_random()
+{
+#ifdef _WIN32
+  return static_cast<uint64_t>(rand());
+#else
+  return static_cast<uint64_t>(random());
+#endif
+}
diff --git a/src/bitmap.h b/src/bitmap.h
deleted file mode 100644
index ee6184af..00000000
--- a/src/bitmap.h
+++ /dev/null
@@ -1,76 +0,0 @@
-/*
-    SWARM
-
-    Copyright (C) 2012-2017 Torbjorn Rognes and Frederic Mahe
-
-    This program is free software: you can redistribute it and/or modify
-    it under the terms of the GNU Affero General Public License as
-    published by the Free Software Foundation, either version 3 of the
-    License, or (at your option) any later version.
-
-    This program is distributed in the hope that it will be useful,
-    but WITHOUT ANY WARRANTY; without even the implied warranty of
-    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
-    GNU Affero General Public License for more details.
-
-    You should have received a copy of the GNU Affero General Public License
-    along with this program.  If not, see <http://www.gnu.org/licenses/>.
-
-    Contact: Torbjorn Rognes <torognes@ifi.uio.no>,
-    Department of Informatics, University of Oslo,
-    PO Box 1080 Blindern, NO-0316 Oslo, Norway
-*/
-
-class Bitmap
-{
- private:
-  size_t size;      /* size in bits */
-  unsigned char * data;    /* the actual bitmap */
-  
- public:
-
-  explicit Bitmap(size_t _size)
-  {
-    size = _size;
-    data = (unsigned char *) xmalloc((size+7)/8);
-  }
-  
-  ~Bitmap()
-  {
-    if (data)
-      free(data);
-  }
-  
-  bool get(size_t x)
-  {
-    return (data[x >> 3] >> (x & 7)) & 1;
-  }
-
-  void reset_all()
-  {
-    memset(data, 0, (size+7)/8);
-  }
-  
-  void set_all()
-  {
-    memset(data, 255, (size+7)/8);
-  }
-  
-  void reset(size_t x)
-  {
-    //    data[x >> 3] &= ~ (1 << (x & 7));
-    __sync_fetch_and_and(data + (x >> 3), ~(1 << (x & 7)));
-  }
-  
-  void set(size_t x)
-  {
-    //    data[x >> 3] |= 1 << (x & 7);
-    __sync_fetch_and_or(data + (x >> 3), 1 << (x & 7));
-  }
-  
-  void flip(size_t x)
-  {
-    //    data[x >> 3] ^= 1 << (x & 7);
-    __sync_fetch_and_xor(data + (x >> 3), 1 << (x & 7));
-  }
-};
diff --git a/src/bloom.h b/src/bloom.h
deleted file mode 100644
index 12a2a367..00000000
--- a/src/bloom.h
+++ /dev/null
@@ -1,68 +0,0 @@
-/*
-    SWARM
-
-    Copyright (C) 2012-2017 Torbjorn Rognes and Frederic Mahe
-
-    This program is free software: you can redistribute it and/or modify
-    it under the terms of the GNU Affero General Public License as
-    published by the Free Software Foundation, either version 3 of the
-    License, or (at your option) any later version.
-
-    This program is distributed in the hope that it will be useful,
-    but WITHOUT ANY WARRANTY; without even the implied warranty of
-    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
-    GNU Affero General Public License for more details.
-
-    You should have received a copy of the GNU Affero General Public License
-    along with this program.  If not, see <http://www.gnu.org/licenses/>.
-
-    Contact: Torbjorn Rognes <torognes@ifi.uio.no>,
-    Department of Informatics, University of Oslo,
-    PO Box 1080 Blindern, NO-0316 Oslo, Norway
-*/
-
-class BloomFilter
-{
-private:
-  
-  Bitmap bitmap;
-  size_t m; /* total number of bits in bitmap */
-  int k; /* number of hash functions */
-  
-public:
-  
-  BloomFilter(unsigned long _m, int _k) : bitmap(_m)
-  {
-    bitmap.reset_all();
-    m = _m;
-    k = _k;
-  }
-  
-  bool get(const char * buf, size_t len)
-  {
-    uint128 hash = CityHash128(buf, len);
-    uint64 h0 = Uint128Low64(hash);
-    uint64 h1 = Uint128High64(hash);
-
-    for(int i=0; i<k; i++)
-      {
-        uint64 h = h0 ^ (i * h1);
-        if (! bitmap.get(h % m))
-          return false;
-      }
-    return true;
-  }
-
-  void set(const char * buf, size_t len)
-  {
-    uint128 hash = CityHash128(buf, len);
-    uint64 h0 = Uint128Low64(hash);
-    uint64 h1 = Uint128High64(hash);
-
-    for(int i=0; i<k; i++)
-      {
-        uint64 h = h0 ^ (i * h1);
-        bitmap.set(h % m);
-      }
-  }
-};
diff --git a/src/bloomflex.cc b/src/bloomflex.cc
new file mode 100644
index 00000000..1ae6f035
--- /dev/null
+++ b/src/bloomflex.cc
@@ -0,0 +1,86 @@
+/*
+    SWARM
+
+    Copyright (C) 2012-2019 Torbjorn Rognes and Frederic Mahe
+
+    This program is free software: you can redistribute it and/or modify
+    it under the terms of the GNU Affero General Public License as
+    published by the Free Software Foundation, either version 3 of the
+    License, or (at your option) any later version.
+
+    This program is distributed in the hope that it will be useful,
+    but WITHOUT ANY WARRANTY; without even the implied warranty of
+    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+    GNU Affero General Public License for more details.
+
+    You should have received a copy of the GNU Affero General Public License
+    along with this program.  If not, see <http://www.gnu.org/licenses/>.
+
+    Contact: Torbjorn Rognes <torognes@ifi.uio.no>,
+    Department of Informatics, University of Oslo,
+    PO Box 1080 Blindern, NO-0316 Oslo, Norway
+*/
+
+/*
+  Blocked bloom filter with precomputed bit patterns
+  as described in
+
+  Putze F, Sanders P, Singler J (2009)
+  Cache-, Hash- and Space-Efficient Bloom Filters
+  Journal of Experimental Algorithmics, 14, 4
+  https://doi.org/10.1145/1498698.1594230
+*/
+
+#include "swarm.h"
+
+void bloomflex_patterns_generate(struct bloomflex_s * b);
+
+void bloomflex_patterns_generate(struct bloomflex_s * b)
+{
+#if 0
+  printf("Generating %" PRIu64 " patterns with %" PRIu64 " bits set.\n",
+         b->pattern_count,
+         b->pattern_k);
+#endif
+  for (unsigned int i = 0; i < b->pattern_count; i++)
+    {
+      uint64_t pattern = 0;
+      for (unsigned int j = 0; j < b->pattern_k; j++)
+        {
+          uint64_t onebit;
+          onebit = 1ULL << (arch_random() & 63);
+          while (pattern & onebit)
+            onebit = 1ULL << (arch_random() & 63);
+          pattern |= onebit;
+        }
+      b->patterns[i] = pattern;
+    }
+}
+
+struct bloomflex_s * bloomflex_init(uint64_t size, unsigned int k)
+{
+  /* Input size is in bytes for full bitmap */
+
+  struct bloomflex_s * b = static_cast<struct bloomflex_s *>(xmalloc(sizeof(struct bloomflex_s)));
+  b->size = size >> 3;
+
+  b->pattern_shift = 16;
+  b->pattern_count = 1 << b->pattern_shift;
+  b->pattern_mask = b->pattern_count - 1;
+  b->pattern_k = k;
+
+  b->patterns = static_cast<uint64_t *>(xmalloc(b->pattern_count * 8));
+  bloomflex_patterns_generate(b);
+
+  b->bitmap = static_cast<uint64_t *>(xmalloc(size));
+  memset(b->bitmap, 0xff, size);
+
+  return b;
+}
+
+void bloomflex_exit(struct bloomflex_s * b)
+{
+  xfree(b->bitmap);
+  xfree(b->patterns);
+  xfree(b);
+}
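
Review note on `bloomflex_patterns_generate`: each precomputed pattern must contain exactly `pattern_k` distinct bits, which is why the loop redraws a bit position whenever it collides with one already chosen. The following is a minimal standalone sketch of that idea, not the code from this patch. It assumes `std::mt19937_64` as a stand-in for `arch_random()`, and the names `make_pattern`, `make_patterns` and `count_bits` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Build one 64-bit pattern with exactly k distinct bits set.
// std::mt19937_64 replaces arch_random() here (an assumption of this sketch).
uint64_t make_pattern(std::mt19937_64 & rng, unsigned int k)
{
  assert(k <= 64);  // more than 64 distinct bits cannot fit in one word
  uint64_t pattern = 0;
  for (unsigned int j = 0; j < k; j++)
    {
      uint64_t onebit = 1ULL << (rng() & 63);
      while (pattern & onebit)  // redraw until an unused bit is found
        onebit = 1ULL << (rng() & 63);
      pattern |= onebit;
    }
  return pattern;
}

// Count set bits without compiler builtins.
unsigned int count_bits(uint64_t x)
{
  unsigned int n = 0;
  for (; x != 0; x &= x - 1)  // clears the lowest set bit each round
    n++;
  return n;
}

// Fill a table of patterns, as the init functions do once at startup.
std::vector<uint64_t> make_patterns(uint64_t seed, std::size_t count, unsigned int k)
{
  std::mt19937_64 rng(seed);
  std::vector<uint64_t> patterns(count);
  for (auto & p : patterns)
    p = make_pattern(rng, k);
  return patterns;
}
```

The redraw loop always terminates for `k <= 64`, but it slows down as `k` approaches 64 because most draws then hit bits that are already set.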
diff --git a/src/bloomflex.h b/src/bloomflex.h
new file mode 100644
index 00000000..54a8e067
--- /dev/null
+++ b/src/bloomflex.h
@@ -0,0 +1,58 @@
+/*
+    SWARM
+
+    Copyright (C) 2012-2019 Torbjorn Rognes and Frederic Mahe
+
+    This program is free software: you can redistribute it and/or modify
+    it under the terms of the GNU Affero General Public License as
+    published by the Free Software Foundation, either version 3 of the
+    License, or (at your option) any later version.
+
+    This program is distributed in the hope that it will be useful,
+    but WITHOUT ANY WARRANTY; without even the implied warranty of
+    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+    GNU Affero General Public License for more details.
+
+    You should have received a copy of the GNU Affero General Public License
+    along with this program.  If not, see <http://www.gnu.org/licenses/>.
+
+    Contact: Torbjorn Rognes <torognes@ifi.uio.no>,
+    Department of Informatics, University of Oslo,
+    PO Box 1080 Blindern, NO-0316 Oslo, Norway
+*/
+
+struct bloomflex_s
+{
+  uint64_t size; /* size in number of 64-bit words (8 bytes each) */
+  uint64_t pattern_shift;
+  uint64_t pattern_count;
+  uint64_t pattern_mask;
+  uint64_t pattern_k;
+  uint64_t * bitmap;
+  uint64_t * patterns;
+};
+
+struct bloomflex_s * bloomflex_init(uint64_t size, unsigned int k);
+
+void bloomflex_exit(struct bloomflex_s * b);
+
+inline uint64_t * bloomflex_adr(struct bloomflex_s * b, uint64_t h)
+{
+  return b->bitmap + ((h >> b->pattern_shift) % b->size);
+}
+
+inline uint64_t bloomflex_pat(struct bloomflex_s * b,
+                                     uint64_t h)
+{
+  return b->patterns[h & b->pattern_mask];
+}
+
+inline void bloomflex_set(struct bloomflex_s * b, uint64_t h)
+{
+  * bloomflex_adr(b, h) &= ~ bloomflex_pat(b, h);
+}
+
+inline bool bloomflex_get(struct bloomflex_s * b, uint64_t h)
+{
+  return ! (* bloomflex_adr(b, h) & bloomflex_pat(b, h));
+}
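
Review note on the inverted bit convention in `bloomflex.h`: the bitmap is filled with `0xff`, `bloomflex_set` clears the bits of a pattern, and `bloomflex_get` reports a hit only when every bit of the pattern is already cleared. The toy filter below sketches the same convention in a self-contained form. The type `toy_bloom` and its members are hypothetical and only illustrate the logic.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Toy blocked Bloom filter with the inverted convention:
// all bits start at 1, inserting clears bits, and a lookup succeeds
// only when every bit of the selected pattern is 0 in the selected block.
struct toy_bloom
{
  std::vector<uint64_t> bitmap;    // one 64-bit block per entry
  std::vector<uint64_t> patterns;  // precomputed patterns, count is 2^shift
  unsigned int shift;              // log2 of the number of patterns

  toy_bloom(std::size_t blocks, std::vector<uint64_t> pats, unsigned int s)
    : bitmap(blocks, ~0ULL), patterns(std::move(pats)), shift(s)
  {
    assert(blocks > 0);
    assert(patterns.size() == (std::size_t{1} << shift));
  }

  // Low bits of the hash pick the pattern, the remaining bits pick the block.
  uint64_t & block(uint64_t h) { return bitmap[(h >> shift) % bitmap.size()]; }
  uint64_t pattern(uint64_t h) const { return patterns[h & (patterns.size() - 1)]; }

  void set(uint64_t h) { block(h) &= ~pattern(h); }
  bool get(uint64_t h) { return (block(h) & pattern(h)) == 0; }
};
```

With four blocks and the patterns `{0x0f, 0xf0, 0x33, 0xc0}`, inserting hash 5 clears the bits of `0xf0` in block 1. A lookup of hash 7 then also succeeds, because its pattern `0xc0` is a subset of the cleared bits in the same block. That is the usual Bloom filter false positive, shown here on a deliberately tiny table.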
diff --git a/src/bloompat.cc b/src/bloompat.cc
new file mode 100644
index 00000000..ad41e930
--- /dev/null
+++ b/src/bloompat.cc
@@ -0,0 +1,86 @@
+/*
+    SWARM
+
+    Copyright (C) 2012-2019 Torbjorn Rognes and Frederic Mahe
+
+    This program is free software: you can redistribute it and/or modify
+    it under the terms of the GNU Affero General Public License as
+    published by the Free Software Foundation, either version 3 of the
+    License, or (at your option) any later version.
+
+    This program is distributed in the hope that it will be useful,
+    but WITHOUT ANY WARRANTY; without even the implied warranty of
+    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+    GNU Affero General Public License for more details.
+
+    You should have received a copy of the GNU Affero General Public License
+    along with this program.  If not, see <http://www.gnu.org/licenses/>.
+
+    Contact: Torbjorn Rognes <torognes@ifi.uio.no>,
+    Department of Informatics, University of Oslo,
+    PO Box 1080 Blindern, NO-0316 Oslo, Norway
+*/
+
+/*
+  Blocked bloom filter with precomputed bit patterns
+  as described in
+
+  Putze F, Sanders P, Singler J (2009)
+  Cache-, Hash- and Space-Efficient Bloom Filters
+  Journal of Experimental Algorithmics, 14, 4
+  https://doi.org/10.1145/1498698.1594230
+*/
+
+#include "swarm.h"
+
+void bloom_patterns_generate(struct bloom_s * b);
+
+void bloom_patterns_generate(struct bloom_s * b)
+{
+  const unsigned int k = 8;
+  for (unsigned int i = 0; i < BLOOM_PATTERN_COUNT; i++)
+    {
+      uint64_t pattern = 0;
+      for (unsigned int j = 0; j < k; j++)
+        {
+          uint64_t onebit;
+          onebit = 1ULL << (arch_random() & 63);
+          while (pattern & onebit)
+            onebit = 1ULL << (arch_random() & 63);
+          pattern |= onebit;
+        }
+      b->patterns[i] = pattern;
+    }
+}
+
+void bloom_zap(struct bloom_s * b)
+{
+  memset(b->bitmap, 0xff, b->size);
+}
+
+struct bloom_s * bloom_init(uint64_t size)
+{
+  // Size is in bytes for the full bitmap; it must be a power of 2
+  // and at least 8
+  size = MAX(size, 8);
+
+  struct bloom_s * b = static_cast<struct bloom_s *>(xmalloc(sizeof(struct bloom_s)));
+
+  b->size = size;
+
+  b->mask = (size >> 3) - 1;
+
+  b->bitmap = static_cast<uint64_t *>(xmalloc(size));
+
+  bloom_zap(b);
+
+  bloom_patterns_generate(b);
+
+  return b;
+}
+
+void bloom_exit(struct bloom_s * b)
+{
+  xfree(b->bitmap);
+  xfree(b);
+}
diff --git a/src/bloompat.h b/src/bloompat.h
new file mode 100644
index 00000000..23700d3c
--- /dev/null
+++ b/src/bloompat.h
@@ -0,0 +1,60 @@
+/*
+    SWARM
+
+    Copyright (C) 2012-2019 Torbjorn Rognes and Frederic Mahe
+
+    This program is free software: you can redistribute it and/or modify
+    it under the terms of the GNU Affero General Public License as
+    published by the Free Software Foundation, either version 3 of the
+    License, or (at your option) any later version.
+
+    This program is distributed in the hope that it will be useful,
+    but WITHOUT ANY WARRANTY; without even the implied warranty of
+    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+    GNU Affero General Public License for more details.
+
+    You should have received a copy of the GNU Affero General Public License
+    along with this program.  If not, see <http://www.gnu.org/licenses/>.
+
+    Contact: Torbjorn Rognes <torognes@ifi.uio.no>,
+    Department of Informatics, University of Oslo,
+    PO Box 1080 Blindern, NO-0316 Oslo, Norway
+*/
+
+#define BLOOM_PATTERN_SHIFT 10
+#define BLOOM_PATTERN_COUNT (1 << BLOOM_PATTERN_SHIFT)
+#define BLOOM_PATTERN_MASK (BLOOM_PATTERN_COUNT - 1)
+
+struct bloom_s
+{
+  uint64_t size;
+  uint64_t mask;
+  uint64_t * bitmap;
+  uint64_t patterns[BLOOM_PATTERN_COUNT];
+};
+
+void bloom_zap(struct bloom_s * b);
+
+struct bloom_s * bloom_init(uint64_t size);
+
+void bloom_exit(struct bloom_s * b);
+
+inline uint64_t * bloom_adr(struct bloom_s * b, uint64_t h)
+{
+  return b->bitmap + ((h >> BLOOM_PATTERN_SHIFT) & b->mask);
+}
+
+inline uint64_t bloom_pat(struct bloom_s * b, uint64_t h)
+{
+  return b->patterns[h & BLOOM_PATTERN_MASK];
+}
+
+inline void bloom_set(struct bloom_s * b, uint64_t h)
+{
+  * bloom_adr(b, h) &= ~ bloom_pat(b, h);
+}
+
+inline bool bloom_get(struct bloom_s * b, uint64_t h)
+{
+  return ! (* bloom_adr(b, h) & bloom_pat(b, h));
+}
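
Review note on `bloom_adr`: unlike `bloomflex_adr`, which uses a modulo, this version selects the word with `& b->mask`, where `mask = (size >> 3) - 1`. That is only equivalent to a modulo when the number of 64-bit words is a power of two. With a size of 24 bytes, for example, the mask would be 2 and word 1 could never be addressed. `bloom_init` documents the requirement but does not check it. A possible guard is sketched below; `is_power_of_two` and `word_index` are hypothetical helpers, not part of this patch.

```cpp
#include <cassert>
#include <cstdint>

// True when x is a power of two (1, 2, 4, ...).
bool is_power_of_two(uint64_t x)
{
  return (x != 0) && ((x & (x - 1)) == 0);
}

// Index of the 64-bit word selected by hash h in a bitmap of size_bytes
// bytes, using the same mask arithmetic as bloom_adr. The parameter
// pattern_shift plays the role of BLOOM_PATTERN_SHIFT.
uint64_t word_index(uint64_t h, uint64_t size_bytes, unsigned int pattern_shift)
{
  assert(size_bytes >= 8);              // at least one 64-bit word
  assert(is_power_of_two(size_bytes));  // otherwise the mask skips words
  const uint64_t mask = (size_bytes >> 3) - 1;
  return (h >> pattern_shift) & mask;
}
```

For a valid size, the masked index equals `(h >> pattern_shift) % (size_bytes / 8)`, so the mask is simply a cheaper form of the modulo used in `bloomflex_adr`.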
diff --git a/src/cityhash/city.cc b/src/city.cc
similarity index 94%
rename from src/cityhash/city.cc
rename to src/city.cc
index 2edaae51..dc1ec15f 100644
--- a/src/cityhash/city.cc
+++ b/src/city.cc
@@ -27,8 +27,9 @@
 // possible hash functions, by using SIMD instructions, or by
 // compromising on hash quality.
 
-#include "config.h"
-#include <city.h>
+/* Minor modifications by TR to adapt to Swarm */
+
+#include "city.h"
 
 #include <algorithm>
 #include <string.h>  // for memcpy and memset
@@ -47,7 +48,7 @@ static uint32 UNALIGNED_LOAD32(const char *p) {
   return result;
 }
 
-#ifdef _MSC_VER
+#ifdef _WIN32
 
 #include <stdlib.h>
 #define bswap_32(x) _byteswap_ulong(x)
@@ -84,7 +85,7 @@ static uint32 UNALIGNED_LOAD32(const char *p) {
 #endif
 
 #if !defined(LIKELY)
-#if HAVE_BUILTIN_EXPECT
+#if defined(HAVE_BUILTIN_EXPECT)
 #define LIKELY(x) (__builtin_expect(!!(x), 1))
 #else
 #define LIKELY(x) (x)
@@ -144,7 +145,7 @@ static uint32 Hash32Len13to24(const char *s, size_t len) {
   uint32 d = Fetch32(s + (len >> 1));
   uint32 e = Fetch32(s);
   uint32 f = Fetch32(s + len - 4);
-  uint32 h = len;
+  uint32 h = static_cast<uint32>(len);
 
   return fmix(Mur(f, Mur(e, Mur(d, Mur(c, Mur(b, Mur(a, h)))))));
 }
@@ -154,14 +155,17 @@ static uint32 Hash32Len0to4(const char *s, size_t len) {
   uint32 c = 9;
   for (uint32 i = 0; i < len; i++) {
     signed char v = s[i];
-    b = b * c1 + v;
+    b = b * c1 + static_cast<unsigned int>(v);
     c ^= b;
   }
-  return fmix(Mur(b, Mur(len, c)));
+  return fmix(Mur(b, Mur(static_cast<uint32>(len), c)));
 }
 
 static uint32 Hash32Len5to12(const char *s, size_t len) {
-  uint32 a = len, b = len * 5, c = 9, d = b;
+  uint32 a = static_cast<uint32>(len),
+    b = static_cast<uint32>(len) * 5,
+    c = 9,
+    d = b;
   a += Fetch32(s);
   b += Fetch32(s + len - 4);
   c += Fetch32(s + ((len >> 1) & 4));
@@ -176,7 +180,9 @@ uint32 CityHash32(const char *s, size_t len) {
   }
 
   // len > 24
-  uint32 h = len, g = c1 * len, f = g;
+  uint32 h = static_cast<uint32>(len),
+    g = c1 * static_cast<uint32>(len),
+    f = g;
   uint32 a0 = Rotate32(Fetch32(s + len - 4) * c1, 17) * c2;
   uint32 a1 = Rotate32(Fetch32(s + len - 8) * c1, 17) * c2;
   uint32 a2 = Rotate32(Fetch32(s + len - 16) * c1, 17) * c2;
@@ -199,28 +205,28 @@ uint32 CityHash32(const char *s, size_t len) {
   f = f * 5 + 0xe6546b64;
   size_t iters = (len - 1) / 20;
   do {
-    uint32 a0 = Rotate32(Fetch32(s) * c1, 17) * c2;
-    uint32 a1 = Fetch32(s + 4);
-    uint32 a2 = Rotate32(Fetch32(s + 8) * c1, 17) * c2;
-    uint32 a3 = Rotate32(Fetch32(s + 12) * c1, 17) * c2;
-    uint32 a4 = Fetch32(s + 16);
-    h ^= a0;
+    uint32 aa0 = Rotate32(Fetch32(s) * c1, 17) * c2;
+    uint32 aa1 = Fetch32(s + 4);
+    uint32 aa2 = Rotate32(Fetch32(s + 8) * c1, 17) * c2;
+    uint32 aa3 = Rotate32(Fetch32(s + 12) * c1, 17) * c2;
+    uint32 aa4 = Fetch32(s + 16);
+    h ^= aa0;
     h = Rotate32(h, 18);
     h = h * 5 + 0xe6546b64;
-    f += a1;
+    f += aa1;
     f = Rotate32(f, 19);
     f = f * c1;
-    g += a2;
+    g += aa2;
     g = Rotate32(g, 18);
     g = g * 5 + 0xe6546b64;
-    h ^= a3 + a1;
+    h ^= aa3 + aa1;
     h = Rotate32(h, 19);
     h = h * 5 + 0xe6546b64;
-    g ^= a4;
+    g ^= aa4;
     g = bswap_32(g) * 5;
-    h += a4 * 5;
+    h += aa4 * 5;
     h = bswap_32(h);
-    f += a0;
+    f += aa0;
     PERMUTE3(f, h, g);
     s += 20;
   } while (--iters != 0);
@@ -277,11 +283,11 @@ static uint64 HashLen0to16(const char *s, size_t len) {
     return HashLen16(len + (a << 3), Fetch32(s + len - 4), mul);
   }
   if (len > 0) {
-    uint8 a = s[0];
-    uint8 b = s[len >> 1];
-    uint8 c = s[len - 1];
+    uint8 a = static_cast<uint8>(s[0]);
+    uint8 b = static_cast<uint8>(s[len >> 1]);
+    uint8 c = static_cast<uint8>(s[len - 1]);
     uint32 y = static_cast<uint32>(a) + (static_cast<uint32>(b) << 8);
-    uint32 z = len + (static_cast<uint32>(c) << 2);
+    uint32 z = static_cast<uint32>(len) + (static_cast<uint32>(c) << 2);
     return ShiftMix(y * k2 ^ z * k0) * k2;
   }
   return k2;
@@ -399,7 +405,7 @@ static uint128 CityMurmur(const char *s, size_t len, uint128 seed) {
   uint64 b = Uint128High64(seed);
   uint64 c = 0;
   uint64 d = 0;
-  signed long l = len - 16;
+  signed long l = static_cast<signed long>(len - 16);
   if (l <= 0) {  // len <= 16
     a = ShiftMix(a * k1) * k1;
     c = b * k1 + HashLen0to16(s, len);
diff --git a/src/cityhash/city.h b/src/city.h
similarity index 100%
rename from src/cityhash/city.h
rename to src/city.h
diff --git a/src/cityhash/citycrc.h b/src/citycrc.h
similarity index 99%
rename from src/cityhash/citycrc.h
rename to src/citycrc.h
index 318e3917..054d188a 100644
--- a/src/cityhash/citycrc.h
+++ b/src/citycrc.h
@@ -28,7 +28,7 @@
 #ifndef CITY_HASH_CRC_H_
 #define CITY_HASH_CRC_H_
 
-#include <city.h>
+#include "city.h"
 
 // Hash function for a byte array.
 uint128 CityHashCrc128(const char *s, size_t len);
diff --git a/src/cityhash/config.h b/src/cityhash/config.h
deleted file mode 100644
index 5cfdc797..00000000
--- a/src/cityhash/config.h
+++ /dev/null
@@ -1 +0,0 @@
-/* config.h for cityhash */
diff --git a/src/db.cc b/src/db.cc
index eb33e552..cbbf5c67 100644
--- a/src/db.cc
+++ b/src/db.cc
@@ -1,7 +1,7 @@
 /*
     SWARM
 
-    Copyright (C) 2012-2017 Torbjorn Rognes and Frederic Mahe
+    Copyright (C) 2012-2019 Torbjorn Rognes and Frederic Mahe
 
     This program is free software: you can redistribute it and/or modify
     it under the terms of the GNU Affero General Public License as
@@ -29,7 +29,7 @@
 #define MEMCHUNK 1048576
 #define LINEALLOC LINE_MAX
 
-char map_nt[256] =
+static signed char map_nt[256] =
   {
     -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
     -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
@@ -49,52 +49,33 @@ char map_nt[256] =
     -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1
   };
 
-char map_hex[256] =
-  {
-    -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
-    -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
-    -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
-     0,  1,  2,  3,  4,  5,  6,  7,  8,  9, -1, -1, -1, -1, -1, -1,
-    -1, 10, 11, 12, 13, 14, 15, -1, -1, -1, -1, -1, -1, -1, -1, -1,
-    -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
-    -1, 10, 11, 12, 13, 14, 15, -1, -1, -1, -1, -1, -1, -1, -1, -1,
-    -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
-    -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
-    -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
-    -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
-    -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
-    -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
-    -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
-    -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
-    -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1
-  };
-
-unsigned long sequences = 0;
-unsigned long nucleotides = 0;
-unsigned long headerchars = 0;
-int longest = 0;
-int longestheader = 0;
-
-seqinfo_t * seqindex = 0;
-char * datap = 0;
-qgramvector_t * qgrams = 0;
-
-#if 0
-
-/* never used */
-
-void showseq(char * seq)
-{
-  char * p = seq;
-  while (char c = *p++)
-    {
-      putchar(sym_nt[(unsigned int)c]);
-    }
-}
-
-#endif
-
-void fprint_id(FILE * stream, unsigned long x)
+static unsigned int sequences = 0;
+static uint64_t nucleotides = 0;
+static uint64_t headerchars = 0;
+static unsigned int longest = 0;
+static uint64_t longestheader = 0;
+static char * datap = nullptr;
+static int missingabundance = 0;
+static uint64_t missingabundance_lineno = 0;
+static char * missingabundance_header = nullptr;
+
+seqinfo_t * seqindex = nullptr;
+qgramvector_t * qgrams = nullptr;
+
+int db_compare_abundance(const void * a, const void * b);
+
+bool find_swarm_abundance(const char * header,
+                          int * start,
+                          int * end,
+                          int64_t * number);
+
+bool find_usearch_abundance(const char * header,
+                            int * start,
+                            int * end,
+                            int64_t * number);
+void find_abundance(struct seqinfo_s * sp, uint64_t lineno);
+
+void fprint_id(FILE * stream, uint64_t x)
 {
   seqinfo_t * sp = seqindex + x;
   char * h = sp->header;
@@ -102,14 +83,14 @@ void fprint_id(FILE * stream, unsigned long x)
 
   if (opt_append_abundance && (sp->abundance_start == sp->abundance_end))
     if (opt_usearch_abundance)
-      fprintf(stream, "%.*s;size=%lu;", hdrlen, h, sp->abundance);
+      fprintf(stream, "%.*s;size=%" PRIu64 ";", hdrlen, h, sp->abundance);
     else
-      fprintf(stream, "%.*s_%lu", hdrlen, h, sp->abundance);
+      fprintf(stream, "%.*s_%" PRIu64, hdrlen, h, sp->abundance);
   else
     fprintf(stream, "%.*s", hdrlen, h);
 }
 
-void fprint_id_noabundance(FILE * stream, unsigned long x)
+void fprint_id_noabundance(FILE * stream, uint64_t x)
 {
   seqinfo_t * sp = seqindex + x;
   char * h = sp->header;
@@ -119,13 +100,13 @@ void fprint_id_noabundance(FILE * stream, unsigned long x)
     {
       /* print start of header */
       fprintf(stream, "%.*s", sp->abundance_start, h);
-      
+
       if (opt_usearch_abundance)
         {
           /* print semicolon if the abundance is not at either end */
           if ((sp->abundance_start > 0) && (sp->abundance_end < hdrlen))
             fprintf(stream, ";");
-          
+
           /* print remaining part */
           fprintf(stream, "%.*s", hdrlen - sp->abundance_end, h + sp->abundance_end);
         }
@@ -135,14 +116,14 @@ void fprint_id_noabundance(FILE * stream, unsigned long x)
 }
 
 void fprint_id_with_new_abundance(FILE * stream,
-                                  unsigned long seqno,
-                                  unsigned long abundance)
+                                  uint64_t seqno,
+                                  uint64_t abundance)
 {
   seqinfo_t * sp = seqindex + seqno;
 
   if (opt_usearch_abundance)
     fprintf(stream,
-            "%.*s%ssize=%lu;%.*s",
+            "%.*s%ssize=%" PRIu64 ";%.*s",
             sp->abundance_start,
             sp->header,
             sp->abundance_start > 0 ? ";" : "",
@@ -151,7 +132,7 @@ void fprint_id_with_new_abundance(FILE * stream,
             sp->header + sp->abundance_end);
   else
     fprintf(stream,
-            "%.*s_%lu",
+            "%.*s_%" PRIu64,
             sp->abundance_start,
             sp->header,
             abundance);
@@ -159,24 +140,207 @@ void fprint_id_with_new_abundance(FILE * stream,
 
 int db_compare_abundance(const void * a, const void * b)
 {
-  seqinfo_t * x = (seqinfo_t *) a;
-  seqinfo_t * y = (seqinfo_t *) b;
-  
+  const seqinfo_t * x = reinterpret_cast<const seqinfo_t *>(a);
+  const seqinfo_t * y = reinterpret_cast<const seqinfo_t *>(b);
+
   if (x->abundance > y->abundance)
     return -1;
   else if (x->abundance < y->abundance)
     return +1;
-  else 
+  else
     return strcmp(x->header, y->header);
 }
 
+bool find_swarm_abundance(const char * header,
+                          int * start,
+                          int * end,
+                          int64_t * number)
+{
+  /*
+    Identify the first occurrence of the pattern (_)([0-9]+)$
+    in the header string.
+  */
+
+  * start = 0;
+  * end = 0;
+  * number = 0;
+
+  const char * digit_chars = "0123456789";
+
+  if (!header)
+    return false;
+
+  if (strlen(header) >= INT_MAX)
+    return false;
+
+  const char * us = strrchr(header, '_');
+
+  if (! us)
+    return false;
+
+  size_t digits = strspn(us + 1, digit_chars);
+
+  if (digits > 20)
+    return false;
+
+  if (us[digits + 1])
+    return false;
+
+  int64_t s = us - header;
+  int64_t e = s + 1 + static_cast<int64_t>(digits);
+
+  * start = static_cast<int>(s);
+  * end = static_cast<int>(e);
+  * number = atol(us + 1);
+
+  return true;
+}
+
+bool find_usearch_abundance(const char * header,
+                            int * start,
+                            int * end,
+                            int64_t * number)
+{
+  /*
+    Identify the first occurrence of the pattern (^|;)size=([0-9]+)(;|$)
+    in the header string.
+  */
+
+  if (! header)
+    return false;
+
+  const char * attribute = "size=";
+  const char * digit_chars = "0123456789";
+
+  uint64_t hlen = strlen(header);
+  uint64_t alen = strlen(attribute);
+  uint64_t i = 0;
+
+  while (i + alen < hlen)
+    {
+      const char * r = strstr(header + i, attribute);
+
+      /* no match */
+      if (r == nullptr)
+        break;
+
+      i = static_cast<uint64_t>(r - header);
+
+      /* check for ';' in front */
+      if ((i > 0) && (header[i-1] != ';'))
+        {
+          i += alen + 1;
+          continue;
+        }
+
+      uint64_t digits = strspn(header + i + alen, digit_chars);
+
+      /* check for at least one digit */
+      if (digits == 0)
+        {
+          i += alen + 1;
+          continue;
+        }
+
+      /* check for ';' after */
+      if ((i + alen + digits < hlen) && (header[i + alen + digits] != ';'))
+        {
+          i += alen + digits + 2;
+          continue;
+        }
+
+      /* ok */
+      if (i > 0)
+        * start = static_cast<int>(i - 1);
+      else
+        * start = 0;
+      * end   = static_cast<int>(MIN(i + alen + digits + 1, hlen));
+      * number = atol(header + i + alen);
+      return true;
+    }
+
+  return false;
+}
+
+void find_abundance(struct seqinfo_s * sp, uint64_t lineno)
+{
+  char * header = sp->header;
+
+  /* read size/abundance annotation */
+  int64_t abundance = 0;
+  int start = 0;
+  int end = 0;
+  int64_t number = 0;
+
+  if (opt_usearch_abundance)
+    {
+      /* (^|;)size=([0-9]+)(;|$) */
+
+      if (find_usearch_abundance(header, & start, & end, & number))
+        {
+          if (number > 0)
+            abundance = number;
+          else
+            {
+              fprintf(stderr,
+                      "\nError: Illegal abundance value on line %" PRIu64 ":\n%s\n"
+                      "Abundance values should be positive integers.\n\n",
+                      lineno,
+                      header);
+              exit(1);
+            }
+        }
+    }
+  else
+    {
+      /* (_)([0-9]+)$ */
+
+      if (find_swarm_abundance(header, & start, & end, & number))
+        {
+          if (number > 0)
+            abundance = number;
+          else
+            {
+              fprintf(stderr,
+                      "\nError: Illegal abundance value on line %" PRIu64 ":\n%s\n"
+                      "Abundance values should be positive integers.\n\n",
+                      lineno,
+                      header);
+              exit(1);
+            }
+        }
+    }
+
+  if (abundance == 0)
+    {
+      start = sp->headerlen;
+      end = start;
+
+      if (opt_append_abundance)
+        abundance = opt_append_abundance;
+      else
+        {
+          missingabundance++;
+          if (missingabundance == 1)
+            {
+              missingabundance_lineno = lineno;
+              missingabundance_header = header;
+            }
+        }
+    }
+
+  sp->abundance = static_cast<uint64_t>(abundance);
+  sp->abundance_start = start;
+  sp->abundance_end = end;
+}
+
 void db_read(const char * filename)
 {
   /* allocate space */
 
-  unsigned long dataalloc = MEMCHUNK;
-  datap = (char *) xmalloc(dataalloc);
-  unsigned long datalen = 0;
+  uint64_t dataalloc = MEMCHUNK;
+  datap = static_cast<char *>(xmalloc(dataalloc));
+  uint64_t datalen = 0;
 
   longest = 0;
   longestheader = 0;
@@ -184,12 +348,15 @@ void db_read(const char * filename)
   nucleotides = 0;
   headerchars = 0;
 
-  FILE * fp = NULL;
+  FILE * fp = nullptr;
   if (filename)
     {
-      fp = fopen(filename, "r");
+      fp = fopen_input(filename);
       if (!fp)
-        fatal("Error: Unable to open input data file (%s).", filename);
+        {
+          fprintf(stderr, "\nError: Unable to open input data file (%s).\n", filename);
+          exit(1);
+        }
     }
   else
     fp = stdin;
@@ -199,9 +366,12 @@ void db_read(const char * filename)
   struct stat fs;
 
   if (fstat(fileno(fp), & fs))
-    fatal("Unable to fstat on input file (%s)", filename);
+    {
+      fprintf(stderr, "\nUnable to fstat on input file (%s)\n", filename);
+      exit(1);
+    }
   bool is_regular = S_ISREG(fs.st_mode);
-  long filesize = is_regular ? fs.st_size : 0;
+  int64_t filesize = is_regular ? fs.st_size : 0;
 
   if (! is_regular)
     fprintf(logfile, "Waiting for data... (Hit Ctrl-C and run swarm -h if you meant to read data from a file.)\n");
@@ -213,33 +383,30 @@ void db_read(const char * filename)
 
   unsigned int lineno = 1;
 
-  progress_init("Reading database: ", filesize);
+  progress_init("Reading sequences:", static_cast<uint64_t>(filesize));
+
   while(line[0])
     {
       /* read header */
-      /* the header ends at a space character, a newline or a nul character */
+      /* the header ends at a space, cr, lf or null character */
 
       if (line[0] != '>')
         fatal("Illegal header line in fasta file.");
-      
-      long headerlen = 0;
-      if (char * stop = strpbrk(line+1, " \r\n"))
-        headerlen = stop - (line+1);
-      else
-        headerlen = strlen(line+1);
-      
+
+      uint64_t headerlen = strcspn(line + 1, " \r\n");
+
       headerchars += headerlen;
-      
+
       if (headerlen > longestheader)
         longestheader = headerlen;
 
 
       /* store the line number */
-      
+
       while (datalen + sizeof(unsigned int) > dataalloc)
         {
           dataalloc += MEMCHUNK;
-          datap = (char *) xrealloc(datap, dataalloc);
+          datap = static_cast<char *>(xrealloc(datap, dataalloc));
         }
       memcpy(datap + datalen, & lineno, sizeof(unsigned int));
       datalen += sizeof(unsigned int);
@@ -250,7 +417,7 @@ void db_read(const char * filename)
       while (datalen + headerlen + 1 > dataalloc)
         {
           dataalloc += MEMCHUNK;
-          datap = (char *) xrealloc(datap, dataalloc);
+          datap = static_cast<char *>(xrealloc(datap, dataalloc));
         }
       memcpy(datap + datalen, line + 1, headerlen);
       *(datap + datalen + headerlen) = 0;
@@ -265,56 +432,79 @@ void db_read(const char * filename)
       lineno++;
 
 
+      /* store a dummy sequence length */
+
+      unsigned int length = 0;
+
+      while (datalen + sizeof(unsigned int) > dataalloc)
+        {
+          dataalloc += MEMCHUNK;
+          datap = static_cast<char *>(xrealloc(datap, dataalloc));
+        }
+      uint64_t datalen_seqlen = datalen;
+      memcpy(datap + datalen, & length, sizeof(unsigned int));
+      datalen += sizeof(unsigned int);
+
+
       /* read and store sequence */
 
-      unsigned long seqbegin = datalen;
+      uint64_t nt_buffer = 0;
+      unsigned int nt_bufferlen = 0;
+      const unsigned int nt_buffersize = 4 * sizeof(nt_buffer);
 
       while (line[0] && (line[0] != '>'))
         {
           unsigned char c;
           char * p = line;
-          while((c = *p++))
+          while((c = static_cast<unsigned char>(*p++)))
 	    {
-	    char m;
-            if ((m = map_nt[(unsigned int)c]) >= 0)
-              {
-                while (datalen >= dataalloc)
-                  {
-                    dataalloc += MEMCHUNK;
-                    datap = (char *) xrealloc(datap, dataalloc);
-                  }
-                
-                *(datap+datalen) = m;
-                datalen++;
-              }
-            else if ((c != 10) && (c != 13))
-              {
-                if ((c >= 32) && (c <= 126))
-                  fprintf(stderr,
-                          "\nError: Illegal character '%c' in sequence on line %u\n",
-                          c,
-                          lineno);
-                else
-                  fprintf(stderr,
-                          "\nError: Illegal character (ascii no %d) in sequence on line %u\n",
-                          c,
-                          lineno);
-                exit(1);
-              }
+              signed char m;
+              if ((m = map_nt[static_cast<unsigned int>(c)]) >= 0)
+                {
+                  nt_buffer |= ((static_cast<uint64_t>(m))-1) << (2 * nt_bufferlen);
+                  length++;
+                  nt_bufferlen++;
+
+                  if (nt_bufferlen == nt_buffersize)
+                    {
+                      while (datalen + sizeof(nt_buffer) > dataalloc)
+                        {
+                          dataalloc += MEMCHUNK;
+                          datap = static_cast<char *>(xrealloc(datap, dataalloc));
+                        }
+
+                      memcpy(datap + datalen, & nt_buffer, sizeof(nt_buffer));
+                      datalen += sizeof(nt_buffer);
+
+                      nt_bufferlen = 0;
+                      nt_buffer = 0;
+                    }
+                }
+              else if ((c != 10) && (c != 13))
+                {
+                  if ((c >= 32) && (c <= 126))
+                    fprintf(stderr,
+                            "\nError: Illegal character '%c' in sequence on line %u\n",
+                            c,
+                            lineno);
+                  else
+                    fprintf(stderr,
+                            "\nError: Illegal character (ascii no %d) in sequence on line %u\n",
+                            c,
+                            lineno);
+                  exit(1);
+                }
 	    }
+
           line[0] = 0;
           if (!fgets(line, LINEALLOC, fp))
             line[0] = 0;
           lineno++;
         }
-      
-      while (datalen >= dataalloc)
-        {
-          dataalloc += MEMCHUNK;
-          datap = (char *) xrealloc(datap, dataalloc);
-        }
-      
-      long length = datalen - seqbegin;
+
+      /* fill in real length */
+
+      memcpy(datap + datalen_seqlen, & length, sizeof(unsigned int));
 
       if (length == 0)
         {
@@ -327,129 +517,95 @@ void db_read(const char * filename)
       if (length > longest)
         longest = length;
 
-      *(datap+datalen) = 0;
-      datalen++;
+
+      /* save remaining padded 64-bit value with nt's, if any */
+
+      if (nt_bufferlen > 0)
+        {
+          while (datalen + sizeof(nt_buffer) > dataalloc)
+            {
+              dataalloc += MEMCHUNK;
+              datap = static_cast<char *>(xrealloc(datap, dataalloc));
+            }
+
+          memcpy(datap + datalen, & nt_buffer, sizeof(nt_buffer));
+          datalen += sizeof(nt_buffer);
+
+          nt_buffer = 0;
+          nt_bufferlen = 0;
+        }
 
       sequences++;
-      
+
       if (is_regular)
-        progress_update(ftell(fp));
+        progress_update(static_cast<uint64_t>(ftell(fp)));
     }
   progress_done();
 
   fclose(fp);
 
+  /* init zobrist hashing */
+
+  zobrist_init(longest + 2);  // add 2 for two insertions
+
   /* set up hash to check for unique headers */
 
-  unsigned long hdrhashsize = 2 * sequences;
+  uint64_t hdrhashsize = 2 * sequences;
 
   seqinfo_t * * hdrhashtable =
-    (seqinfo_t **) xmalloc(hdrhashsize * sizeof(seqinfo_t *));
+    static_cast<seqinfo_t **>(xmalloc(hdrhashsize * sizeof(seqinfo_t *)));
   memset(hdrhashtable, 0, hdrhashsize * sizeof(seqinfo_t *));
 
-  unsigned long duplicatedidentifiers = 0;
+  uint64_t duplicatedidentifiers = 0;
 
   /* set up hash to check for unique sequences */
 
-  unsigned long seqhashsize = 2 * sequences;
+  uint64_t seqhashsize = 2 * sequences;
 
-  seqinfo_t * * seqhashtable = 0;
+  seqinfo_t * * seqhashtable = nullptr;
 
-  if (opt_differences > 0)
+  if (opt_differences > 1)
     {
       seqhashtable =
-        (seqinfo_t **) xmalloc(seqhashsize * sizeof(seqinfo_t *));
+        static_cast<seqinfo_t **>(xmalloc(seqhashsize * sizeof(seqinfo_t *)));
       memset(seqhashtable, 0, seqhashsize * sizeof(seqinfo_t *));
     }
 
   /* create indices */
 
-  seqindex = (seqinfo_t *) xmalloc(sequences * sizeof(seqinfo_t));
+  seqindex = static_cast<seqinfo_t *>(xmalloc(sequences * sizeof(seqinfo_t)));
   seqinfo_t * seqindex_p = seqindex;
 
-  regex_t db_regexp;
-  regmatch_t pmatch[4];
-
-  if (opt_usearch_abundance)
-    {
-      if (regcomp(&db_regexp, "(^|;)size=([0-9]+)(;|$)", REG_EXTENDED))
-        fatal("Regular expression compilation failed");
-    }
-  else
-    {
-      if (regcomp(&db_regexp, "(_)([0-9]+)$", REG_EXTENDED))
-        fatal("Regular expression compilation failed");
-    }
-
-  seqinfo_t * lastseq = 0;
+  seqinfo_t * lastseq = nullptr;
 
   int presorted = 1;
-  int missingabundance = 0;
-  unsigned int missingabundance_lineno = 0;
-  char * missingabundance_header = 0;
 
   char * p = datap;
   progress_init("Indexing database:", sequences);
-  for(unsigned long i=0; i<sequences; i++)
+  for(uint64_t i=0; i<sequences; i++)
     {
       /* get line number */
-      unsigned int line_number = *((unsigned int*)p);
+      unsigned int line_number = *(reinterpret_cast<unsigned int*>(p));
       p += sizeof(unsigned int);
 
       /* get header */
       seqindex_p->header = p;
-      seqindex_p->headerlen = strlen(seqindex_p->header);
+      seqindex_p->headerlen = static_cast<int>(strlen(seqindex_p->header));
       p += seqindex_p->headerlen + 1;
 
       /* and sequence */
+      unsigned int seqlen = *(reinterpret_cast<unsigned int*>(p));
+      seqindex_p->seqlen = seqlen;
+      p += sizeof(unsigned int);
       seqindex_p->seq = p;
-      seqindex_p->seqlen = strlen(p);
-      p += seqindex_p->seqlen + 1;
+      p += nt_bytelength(seqlen);
 
       /* get amplicon abundance */
-      if (!regexec(&db_regexp, seqindex_p->header, 4, pmatch, 0))
-        {
-          seqindex_p->abundance = atol(seqindex_p->header + pmatch[2].rm_so);
-          seqindex_p->abundance_start = pmatch[0].rm_so;
-          seqindex_p->abundance_end = pmatch[0].rm_eo;
-
-          if (seqindex_p->abundance == 0)
-            {
-              fprintf(stderr,
-                      "\nError: Illegal abundance value on line %u:\n%s\n"
-                      "Abundance values should be positive integers.\n\n",
-                      line_number,
-                      seqindex_p->header);
-              exit(1);
-            }
-        }
-      else
-        {
-          seqindex_p->abundance_start = seqindex_p->headerlen;
-          seqindex_p->abundance_end = seqindex_p->headerlen;
-          seqindex_p->abundance = 0;
-        }
-      
-      if (seqindex_p->abundance < 1)
-        {
-          if (opt_append_abundance)
-            {
-              seqindex_p->abundance = opt_append_abundance;
-            }
-          else
-            {
-              missingabundance++;
-              if (missingabundance == 1)
-                {
-                  missingabundance_lineno = line_number;
-                  missingabundance_header = seqindex_p->header;
-                }
-            }
-        }
-
-      if (seqindex_p->abundance_start == 0)
-          fatal("Empty sequence identifier");
+      find_abundance(seqindex_p, line_number);
 
+      if ((seqindex_p->abundance_start == 0) &&
+          (seqindex_p->abundance_end == seqindex_p->headerlen))
+        fatal("Empty sequence identifier");
 
       /* check if the sequences are presorted by abundance and header */
 
@@ -466,21 +622,57 @@ void db_read(const char * filename)
 
       lastseq = seqindex_p;
 
-
       /* check for duplicated identifiers using hash table */
 
-      unsigned long hdrhash = HASH((unsigned char*)seqindex_p->header, seqindex_p->abundance_start);
+      /* find position and length of identifier in header */
+
+      int id_start, id_len;
+
+      if (seqindex_p->abundance_start > 0)
+        {
+          /* id first, then abundance (e.g. >name;size=1 or >name_1) */
+          id_start = 0;
+          id_len = seqindex_p->abundance_start;
+        }
+      else
+        {
+          /* abundance first then id (e.g. >size=1;name) */
+          id_start = seqindex_p->abundance_end;
+          id_len = seqindex_p->headerlen - seqindex_p->abundance_end;
+        }
+
+      uint64_t hdrhash
+        = HASH(reinterpret_cast<unsigned char*>(seqindex_p->header + id_start),
+               static_cast<uint64_t>(id_len));
       seqindex_p->hdrhash = hdrhash;
-      unsigned long hdrhashindex = hdrhash % hdrhashsize;
+      uint64_t hdrhashindex = hdrhash % hdrhashsize;
+
+      seqinfo_t * hdrfound = nullptr;
 
-      seqinfo_t * hdrfound = 0;
-    
       while ((hdrfound = hdrhashtable[hdrhashindex]))
         {
-          if ((hdrfound->hdrhash == hdrhash) &&
-              (hdrfound->abundance_start == seqindex_p->abundance_start) &&
-              (strncmp(hdrfound->header, seqindex_p->header, hdrfound->abundance_start) == 0))
-            break;
+          if (hdrfound->hdrhash == hdrhash)
+            {
+              int hit_id_start, hit_id_len;
+
+              if (hdrfound->abundance_start > 0)
+                {
+                  hit_id_start = 0;
+                  hit_id_len = hdrfound->abundance_start;
+                }
+              else
+                {
+                  hit_id_start = hdrfound->abundance_end;
+                  hit_id_len = hdrfound->headerlen - hdrfound->abundance_end;
+                }
+
+              if ((id_len == hit_id_len) &&
+                  (strncmp(seqindex_p->header + id_start,
+                           hdrfound->header + hit_id_start,
+                           static_cast<uint64_t>(id_len)) == 0))
+                break;
+            }
+
           hdrhashindex = (hdrhashindex + 1) % hdrhashsize;
         }
 
@@ -488,34 +680,42 @@ void db_read(const char * filename)
         {
           duplicatedidentifiers++;
           fprintf(stderr, "\nError: Duplicated sequence identifier: %.*s\n\n",
-                  seqindex_p->abundance_start,
-                  seqindex_p->header);
+                  id_len,
+                  seqindex_p->header + id_start);
           exit(1);
         }
 
       hdrhashtable[hdrhashindex] = seqindex_p;
-    
 
-      if (opt_differences > 0)
+      /* hash sequence */
+      seqindex_p->seqhash = zobrist_hash(reinterpret_cast<unsigned char*>
+                                         (seqindex_p->seq),
+                                         seqindex_p->seqlen);
+
+      if (opt_differences > 1)
         {
-          /* check for duplicated sequences using hash table */
-          unsigned long seqhash = HASH((unsigned char*)seqindex_p->seq,
-                                       seqindex_p->seqlen);
-          seqindex_p->seqhash = seqhash;
-          unsigned long seqhashindex = seqhash % seqhashsize;
-          seqinfo_t * seqfound = 0;
+          /* Check for duplicated sequences using hash table, */
+          /* but only for d>1. Handled internally for d=1.    */
+
+          uint64_t seqhashindex = seqindex_p->seqhash % seqhashsize;
+          seqinfo_t * seqfound = nullptr;
 
           while ((seqfound = seqhashtable[seqhashindex]))
             {
-              if ((seqfound->seqhash == seqhash) &&
+              if ((seqfound->seqhash == seqindex_p->seqhash) &&
                   (seqfound->seqlen == seqindex_p->seqlen) &&
-                  (memcmp(seqfound->seq, seqindex_p->seq, seqfound->seqlen) == 0))
+                  (memcmp(seqfound->seq,
+                          seqindex_p->seq,
+                          nt_bytelength(seqindex_p->seqlen)) == 0))
                 break;
               seqhashindex = (seqhashindex + 1) % seqhashsize;
             }
 
           if (seqfound)
-            duplicates_found++;
+            {
+              duplicates_found++;
+              break;
+            }
           else
             seqhashtable[seqhashindex] = seqindex_p;
         }
@@ -523,12 +723,26 @@ void db_read(const char * filename)
       seqindex_p++;
       progress_update(i);
     }
+
+  if (duplicates_found)
+    {
+      fprintf(logfile,
+              "\n\n"
+              "Error: some fasta entries have identical sequences.\n"
+              "Swarm expects dereplicated fasta files.\n"
+              "Such files can be produced with swarm or vsearch:\n"
+              " swarm -d 0 -w derep.fasta -o /dev/null input.fasta\n"
+              "or\n"
+              " vsearch --derep_fulllength input.fasta --sizein --sizeout --output derep.fasta\n");
+      exit(1);
+    }
+
   progress_done();
 
   if (missingabundance)
     {
       fprintf(stderr,
-              "\nError: Abundance annotations not found for %d sequences, starting on line %u.\n"
+              "\nError: Abundance annotations not found for %d sequences, starting on line %" PRIu64 ".\n"
               ">%s\n"
               "Fasta headers must end with abundance annotations (_INT or ;size=INT).\n"
               "The -z option must be used if the abundance annotation is in the latter format.\n"
@@ -541,14 +755,6 @@ void db_read(const char * filename)
       exit(1);
     }
 
-  if (duplicates_found)
-    {
-      fprintf(logfile,
-              "WARNING: %lu duplicated sequences detected.\n"
-              "Please consider dereplicating your data for optimal results.\n",
-              duplicates_found);
-    }
-
   if (!presorted)
     {
       progress_init("Abundance sorting:", 1);
@@ -556,27 +762,26 @@ void db_read(const char * filename)
       progress_done();
     }
 
-  regfree(&db_regexp);
-
-  free(hdrhashtable);
+  xfree(hdrhashtable);
 
   if (seqhashtable)
     {
-      free(seqhashtable);
-      seqhashtable = 0;
+      xfree(seqhashtable);
+      seqhashtable = nullptr;
     }
 }
 
 void db_qgrams_init()
 {
-  qgrams = (qgramvector_t *) xmalloc(sequences * sizeof(qgramvector_t));
+  qgrams = static_cast<qgramvector_t *>
+    (xmalloc(sequences * sizeof(qgramvector_t)));
 
   seqinfo_t * seqindex_p = seqindex;
   progress_init("Find qgram vects: ", sequences);
   for(unsigned int i=0; i<sequences; i++)
     {
       /* find qgrams */
-      findqgrams((unsigned char*) seqindex_p->seq,
+      findqgrams(reinterpret_cast<unsigned char*>(seqindex_p->seq),
                  seqindex_p->seqlen,
                  qgrams[i]);
       seqindex_p++;
@@ -587,136 +792,126 @@ void db_qgrams_init()
 
 void db_qgrams_done()
 {
-  free(qgrams);
+  xfree(qgrams);
 }
 
-unsigned long db_getsequencecount()
+unsigned int db_getsequencecount()
 {
   return sequences;
 }
 
-unsigned long db_getnucleotidecount()
+uint64_t db_getnucleotidecount()
 {
   return nucleotides;
 }
 
-#if 0
-
-/* never used */
-
-unsigned long db_getlongestheader()
-{
-  return longestheader;
-}
-
-#endif
-
-unsigned long db_getlongestsequence()
+unsigned int db_getlongestsequence()
 {
   return longest;
 }
 
-#if 0
-
-/* never used */
-
-seqinfo_t * db_getseqinfo(unsigned long seqno)
+uint64_t db_gethash(uint64_t seqno)
 {
-  return seqindex+seqno;
+  return seqindex[seqno].seqhash;
 }
 
-#endif
-
-char * db_getsequence(unsigned long seqno)
+char * db_getsequence(uint64_t seqno)
 {
   return seqindex[seqno].seq;
 }
 
-void db_getsequenceandlength(unsigned long seqno,
+void db_getsequenceandlength(uint64_t seqno,
                              char ** address,
-                             long * length)
+                             unsigned int * length)
 {
   *address = seqindex[seqno].seq;
-  *length = (long)(seqindex[seqno].seqlen);
+  *length = seqindex[seqno].seqlen;
 }
 
-unsigned long db_getsequencelen(unsigned long seqno)
+unsigned int db_getsequencelen(uint64_t seqno)
 {
   return seqindex[seqno].seqlen;
 }
 
-#if 0
-
-/* never used */
-
-char * db_getheader(unsigned long seqno)
+char * db_getheader(uint64_t seqno)
 {
   return seqindex[seqno].header;
 }
 
-unsigned long db_getheaderlen(unsigned long seqno)
-{
-  return seqindex[seqno].headerlen;
-}
-
-#endif
-
-unsigned long db_getabundance(unsigned long seqno)
+uint64_t db_getabundance(uint64_t seqno)
 {
   return seqindex[seqno].abundance;
 }
 
-#if 0
-
-/* never used */
-
-void db_putseq(long seqno)
-{
-  char * seq;
-  long len;
-  db_getsequenceandlength(seqno, & seq, & len);
-  for(int i=0; i<len; i++)
-    putchar(sym_nt[(int)(seq[i])]);
-}
-
-#endif
-
 void db_free()
 {
+  zobrist_exit();
+
   if (datap)
-    free(datap);
+    xfree(datap);
   if (seqindex)
-    free(seqindex);
+    xfree(seqindex);
 }
 
-void db_fprintseq(FILE * fp, int a, int width)
+void db_fprintseq(FILE * fp, unsigned int a, unsigned int width)
 {
   char * seq = db_getsequence(a);
-  int len = db_getsequencelen(a);
+  unsigned int len = db_getsequencelen(a);
   char buffer[1025];
   char * buf;
 
   if (len < 1025)
     buf = buffer;
   else
-    buf = (char*) xmalloc(len+1);
+    buf = static_cast<char*>(xmalloc(len+1));
 
-  for(int i=0; i<len; i++)
-    buf[i] = sym_nt[(int)(seq[i])];
+  for(unsigned int i = 0; i < len; i++)
+    buf[i] = sym_nt[1 + nt_extract(seq, i)];
   buf[len] = 0;
 
   if (width < 1)
-    fprintf(fp, "%.*s\n", (int)(len), buf);
+    fprintf(fp, "%.*s\n", static_cast<int>(len), buf);
   else
     {
-      long rest = len;
-      for(int i=0; i<len; i += width)
+      unsigned int rest = len;
+      for(unsigned int i = 0; i < len; i += width)
         {
-          fprintf(fp, "%.*s\n", (int)(MIN(rest,width)), buf+i);
+          fprintf(fp, "%.*s\n", static_cast<int>(MIN(rest, width)), buf+i);
           rest -= width;
         }
     }
 
   if (len >= 1025)
-    free(buf);
+    xfree(buf);
 }
+
+
+#if 0
+
+/* Unused functions */
+
+unsigned int db_getheaderlen(uint64_t seqno)
+{
+  return seqindex[seqno].headerlen;
+}
+
+unsigned int db_getlongestheader()
+{
+  return longestheader;
+}
+
+seqinfo_t * db_getseqinfo(uint64_t seqno)
+{
+  return seqindex+seqno;
+}
+
+void db_putseq(int64_t seqno)
+{
+  char * seq;
+  unsigned int len;
+  db_getsequenceandlength(seqno, & seq, & len);
+  for(unsigned int i=0; i<len; i++)
+    putchar(sym_nt[1+nt_extract(seq, i)]);
+}
+
+#endif
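
Note on the identifier lookup in db_read() above: the duplicate check now hashes and compares only the identifier part of each header, wherever the abundance annotation sits. The following is a minimal sketch of that selection logic, not the code from the patch. The `header_info` struct and `identifier_of` function are hypothetical names introduced for illustration, and the offsets in the usage note assume that the annotation span includes its `;` or `_` separator.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for the header-related fields of seqinfo_t.
struct header_info
{
  std::string header;
  int abundance_start; // offset where the abundance annotation begins
  int abundance_end;   // offset just past the abundance annotation
};

// Return the identifier part of a header, skipping the abundance annotation.
std::string identifier_of(const header_info & h)
{
  const int headerlen = static_cast<int>(h.header.size());
  assert(h.abundance_start >= 0);
  assert(h.abundance_start <= h.abundance_end);
  assert(h.abundance_end <= headerlen);

  int id_start = 0;
  int id_len = 0;

  if (h.abundance_start > 0)
    {
      // identifier first, then abundance (e.g. name;size=1 or name_1)
      id_start = 0;
      id_len = h.abundance_start;
    }
  else
    {
      // abundance first, then identifier (e.g. size=1;name)
      id_start = h.abundance_end;
      id_len = headerlen - h.abundance_end;
    }

  return h.header.substr(static_cast<std::string::size_type>(id_start),
                         static_cast<std::string::size_type>(id_len));
}
```

Under these assumptions, `identifier_of({"name;size=1", 4, 11})` and `identifier_of({"size=1;name", 0, 7})` both select `name`, which is why the two header layouts can be detected as duplicates of each other.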
diff --git a/src/derep.cc b/src/derep.cc
index 5fcae4a9..72aca9d7 100644
--- a/src/derep.cc
+++ b/src/derep.cc
@@ -1,7 +1,7 @@
 /*
     SWARM
 
-    Copyright (C) 2012-2017 Torbjorn Rognes and Frederic Mahe
+    Copyright (C) 2012-2019 Torbjorn Rognes and Frederic Mahe
 
     This program is free software: you can redistribute it and/or modify
     it under the terms of the GNU Affero General Public License as
@@ -23,22 +23,26 @@
 
 #include "swarm.h"
 
+#define HASH hash_cityhash64
+
 //#define REVCOMP
 
 struct bucket
 {
-  unsigned long hash;
+  uint64_t hash;
   unsigned int seqno_first;
   unsigned int seqno_last;
-  unsigned long mass;
+  uint64_t mass;
   unsigned int size;
   unsigned int singletons;
 };
 
+int derep_compare(const void * a, const void * b);
+
 int derep_compare(const void * a, const void * b)
 {
-  struct bucket * x = (struct bucket *) a;
-  struct bucket * y = (struct bucket *) b;
+  const struct bucket * x = static_cast<const struct bucket *>(a);
+  const struct bucket * y = static_cast<const struct bucket *>(b);
 
   /* highest abundance first, otherwise keep order */
 
@@ -60,14 +64,14 @@ int derep_compare(const void * a, const void * b)
 #ifdef REVCOMP
 char map_complement[5] = { 0, 4, 3, 2, 1 };
 
-void reverse_complement(char * rc, char * seq, long len)
+void reverse_complement(char * rc, char * seq, int64_t len)
 {
   /* Write the reverse complementary sequence to rc.
      The memory for rc must be long enough for the rc of the sequence
      (identical to the length of seq + 1). */
 
-  for(long i=0; i<len; i++)
-    rc[i] = map_complement[(int)(seq[len-1-i])];
+  for(int64_t i=0; i<len; i++)
+    rc[i] = map_complement[(int)(1 + nt_extract(seq, len-1-i))];
   rc[len] = 0;
 }
 #endif
@@ -75,34 +79,34 @@ void reverse_complement(char * rc, char * seq, long len)
 void dereplicate()
 {
   /* adjust size of hash table for 2/3 fill rate */
-  long dbsequencecount = db_getsequencecount();
-  long hashtablesize = 1;
-  while (1.0 * dbsequencecount / hashtablesize > 0.7)
+  uint64_t dbsequencecount = db_getsequencecount();
+  uint64_t hashtablesize = 1;
+  while (100 * dbsequencecount > 70 * hashtablesize)
     hashtablesize <<= 1;
-  int hash_mask = hashtablesize - 1;
+  uint64_t derep_hash_mask = hashtablesize - 1;
 
   struct bucket * hashtable =
-    (struct bucket *) xmalloc(sizeof(bucket) * hashtablesize);
+    static_cast<struct bucket *>(xmalloc(sizeof(bucket) * hashtablesize));
 
   memset(hashtable, 0, sizeof(bucket) * hashtablesize);
 
-  long swarmcount = 0;
-  unsigned long maxmass = 0;
+  uint64_t swarmcount = 0;
+  uint64_t maxmass = 0;
   unsigned int maxsize = 0;
 
   /* alloc and init table of links to other sequences in cluster */
-  unsigned int * nextseqtab = (unsigned int *)
-    xmalloc(sizeof(unsigned int) * dbsequencecount);
+  unsigned int * nextseqtab = static_cast<unsigned int *>
+    (xmalloc(sizeof(unsigned int) * dbsequencecount));
   memset(nextseqtab, 0, sizeof(unsigned int) * dbsequencecount);
 
 #ifdef REVCOMP
   /* allocate memory for reverse complementary sequence */
   char * rc_seq = (char*) xmalloc(db_getlongestsequence() + 1);
 #endif
-  
+
   progress_init("Dereplicating:    ", dbsequencecount);
 
-  for(long i=0; i<dbsequencecount; i++)
+  for(unsigned int i=0; i<dbsequencecount; i++)
     {
       unsigned int seqlen = db_getsequencelen(i);
       char * seq = db_getsequence(i);
@@ -115,14 +119,17 @@ void dereplicate()
         collision when the number of sequences is about 5e9.
       */
 
-      unsigned long hash = CityHash64(seq, seqlen);
-      unsigned long j = hash & hash_mask;
+      uint64_t hash = HASH(reinterpret_cast<unsigned char *>(seq),
+                           nt_bytelength(seqlen));
+      uint64_t j = hash & derep_hash_mask;
       struct bucket * bp = hashtable + j;
-      
+
       while ((bp->mass) &&
              ((bp->hash != hash) ||
               (seqlen != db_getsequencelen(bp->seqno_first)) ||
-              (strcmp(seq, db_getsequence(bp->seqno_first)))))
+              (memcmp(seq,
+                      db_getsequence(bp->seqno_first),
+                      nt_bytelength(seqlen)))))
         {
           bp++;
           j++;
@@ -140,14 +147,16 @@ void dereplicate()
           /* check minus strand as well */
 
           reverse_complement(rc_seq, seq, seqlen);
-          unsigned long rc_hash = CityHash64(rc_seq, seqlen);
+          uint64_t rc_hash = HASH((unsigned char*)rc_seq, nt_bytelength(seqlen));
           struct bucket * rc_bp = hashtable + rc_hash % hashtablesize;
-          unsigned long k = rc_hash & hash_mask;
-          
+          uint64_t k = rc_hash & derep_hash_mask;
+
           while ((rc_bp->mass) &&
                  ((rc_bp->hash != rc_hash) ||
                   (seqlen != db_getsequencelen(rc_bp->seqno_first)) ||
-                  (strcmp(rc_seq, db_getsequence(rc_bp->seqno_first)))))
+                  (memcmp(rc_seq,
+                          db_getsequence(rc_bp->seqno_first),
+                          nt_bytelength(seqlen)))))
             {
               rc_bp++;
               k++;
@@ -166,7 +175,7 @@ void dereplicate()
         }
 #endif
 
-      long ab = db_getabundance(i);
+      uint64_t ab = db_getabundance(i);
 
       if (bp->mass)
         {
@@ -201,9 +210,9 @@ void dereplicate()
   progress_done();
 
 #ifdef REVCOMP
-  free(rc_seq);
+  xfree(rc_seq);
 #endif
-  
+
   progress_init("Sorting:          ", 1);
   qsort(hashtable, hashtablesize, sizeof(bucket), derep_compare);
   progress_done();
@@ -214,15 +223,15 @@ void dereplicate()
   progress_init("Writing swarms:   ", swarmcount);
 
   if (opt_mothur)
-    fprintf(outfile, "swarm_%ld\t%ld", opt_differences, swarmcount);
+    fprintf(outfile, "swarm_%" PRId64 "\t%" PRIu64, opt_differences, swarmcount);
 
-  for(int i = 0; i < swarmcount; i++)
+  for(unsigned int i = 0; i < swarmcount; i++)
     {
-      int seed = hashtable[i].seqno_first;
+      unsigned int seed = hashtable[i].seqno_first;
       if (opt_mothur)
         fputc('\t', outfile);
       fprint_id(outfile, seed);
-      int a = nextseqtab[seed];
+      unsigned int a = nextseqtab[seed];
 
       while (a)
         {
@@ -233,7 +242,7 @@ void dereplicate()
           fprint_id(outfile, a);
           a = nextseqtab[a];
         }
-      
+
       if (!opt_mothur)
         fputc('\n', outfile);
 
@@ -242,7 +251,7 @@ void dereplicate()
 
   if (opt_mothur)
     fputc('\n', outfile);
-  
+
   progress_done();
 
 
@@ -251,9 +260,9 @@ void dereplicate()
   if (opt_seeds)
     {
       progress_init("Writing seeds:    ", swarmcount);
-      for(int i=0; i < swarmcount; i++)
+      for(unsigned int i=0; i < swarmcount; i++)
         {
-          int seed = hashtable[i].seqno_first;
+          unsigned int seed = hashtable[i].seqno_first;
           fprintf(fp_seeds, ">");
           fprint_id_with_new_abundance(fp_seeds, seed, hashtable[i].mass);
           fprintf(fp_seeds, "\n");
@@ -269,33 +278,33 @@ void dereplicate()
     {
       progress_init("Writing UCLUST:   ", swarmcount);
 
-      for(unsigned int swarmid = 0; swarmid < swarmcount ; swarmid++)
+      for(unsigned int swarmid = 0; swarmid < swarmcount; swarmid++)
         {
           struct bucket * bp = hashtable + swarmid;
-          
-          int seed = bp->seqno_first;
+
+          unsigned int seed = bp->seqno_first;
 
           fprintf(uclustfile, "C\t%u\t%u\t*\t*\t*\t*\t*\t",
                   swarmid,
                   bp->size);
           fprint_id(uclustfile, seed);
           fprintf(uclustfile, "\t*\n");
-          
-          fprintf(uclustfile, "S\t%u\t%lu\t*\t*\t*\t*\t*\t",
+
+          fprintf(uclustfile, "S\t%u\t%u\t*\t*\t*\t*\t*\t",
                   swarmid,
                   db_getsequencelen(seed));
           fprint_id(uclustfile, seed);
           fprintf(uclustfile, "\t*\n");
-          
-          int a = nextseqtab[seed];
+
+          unsigned int a = nextseqtab[seed];
 
           while (a)
             {
               fprintf(uclustfile,
-                      "H\t%u\t%lu\t%.1f\t+\t0\t0\t%s\t",
+                      "H\t%u\t%u\t%.1f\t+\t0\t0\t%s\t",
                       swarmid,
                       db_getsequencelen(a),
-                      100.0, 
+                      100.0,
                       "=");
               fprint_id(uclustfile, a);
               fprintf(uclustfile, "\t");
@@ -303,7 +312,7 @@ void dereplicate()
               fprintf(uclustfile, "\n");
               a = nextseqtab[a];
             }
-          
+
           progress_update(swarmid+1);
         }
       progress_done();
@@ -314,18 +323,18 @@ void dereplicate()
   if (opt_internal_structure)
     {
       progress_init("Writing structure:", swarmcount);
-      
-      for(long i = 0; i < swarmcount; i++)
+
+      for(uint64_t i = 0; i < swarmcount; i++)
         {
           struct bucket * sp = hashtable + i;
-          long seed = sp->seqno_first;
-          int a = nextseqtab[seed];
+          uint64_t seed = sp->seqno_first;
+          unsigned int a = nextseqtab[seed];
           while (a)
             {
               fprint_id_noabundance(internal_structure_file, seed);
               fprintf(internal_structure_file, "\t");
               fprint_id_noabundance(internal_structure_file, a);
-              fprintf(internal_structure_file, "\t%d\t%ld\t%d\n", 0, i+1, 0);
+              fprintf(internal_structure_file, "\t%d\t%" PRIu64 "\t%d\n", 0, i+1, 0);
               a = nextseqtab[a];
             }
           progress_update(i);
@@ -338,12 +347,12 @@ void dereplicate()
   if (statsfile)
     {
       progress_init("Writing stats:    ", swarmcount);
-      for(long i = 0; i < swarmcount; i++)
+      for(uint64_t i = 0; i < swarmcount; i++)
         {
           struct bucket * sp = hashtable + i;
-          fprintf(statsfile, "%u\t%lu\t", sp->size, sp->mass);
+          fprintf(statsfile, "%u\t%" PRIu64 "\t", sp->size, sp->mass);
           fprint_id_noabundance(statsfile, sp->seqno_first);
-          fprintf(statsfile, "\t%lu\t%u\t%u\t%u\n", 
+          fprintf(statsfile, "\t%" PRIu64 "\t%u\t%u\t%u\n",
                   db_getabundance(sp->seqno_first),
                   sp->singletons, 0U, 0U);
           progress_update(i);
@@ -353,10 +362,10 @@ void dereplicate()
 
 
   fprintf(logfile, "\n");
-  fprintf(logfile, "Number of swarms:  %ld\n", swarmcount);
+  fprintf(logfile, "Number of swarms:  %" PRIu64 "\n", swarmcount);
   fprintf(logfile, "Largest swarm:     %u\n", maxsize);
-  fprintf(logfile, "Heaviest swarm:    %lu\n", maxmass);
+  fprintf(logfile, "Heaviest swarm:    %" PRIu64 "\n", maxmass);
 
-  free(nextseqtab);
-  free(hashtable);
+  xfree(nextseqtab);
+  xfree(hashtable);
 }
diff --git a/src/hashtable.cc b/src/hashtable.cc
new file mode 100644
index 00000000..92b323b4
--- /dev/null
+++ b/src/hashtable.cc
@@ -0,0 +1,62 @@
+/*
+    SWARM
+
+    Copyright (C) 2012-2019 Torbjorn Rognes and Frederic Mahe
+
+    This program is free software: you can redistribute it and/or modify
+    it under the terms of the GNU Affero General Public License as
+    published by the Free Software Foundation, either version 3 of the
+    License, or (at your option) any later version.
+
+    This program is distributed in the hope that it will be useful,
+    but WITHOUT ANY WARRANTY; without even the implied warranty of
+    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+    GNU Affero General Public License for more details.
+
+    You should have received a copy of the GNU Affero General Public License
+    along with this program.  If not, see <http://www.gnu.org/licenses/>.
+
+    Contact: Torbjorn Rognes <torognes@ifi.uio.no>,
+    Department of Informatics, University of Oslo,
+    PO Box 1080 Blindern, NO-0316 Oslo, Norway
+*/
+
+#include "swarm.h"
+
+#define HASHFILLPCT 70
+
+uint64_t hash_mask;
+unsigned char * hash_occupied = nullptr;
+uint64_t * hash_values = nullptr;
+unsigned int * hash_data = nullptr;
+uint64_t hash_tablesize = 0;
+
+void hash_zap()
+{
+  memset(hash_occupied, 0, (hash_tablesize + 63) / 8);
+}
+
+void hash_alloc(uint64_t amplicons)
+{
+  hash_tablesize = 1;
+  while (100 * amplicons > HASHFILLPCT * hash_tablesize)
+    hash_tablesize <<= 1;
+  hash_mask = hash_tablesize - 1;
+
+  hash_occupied =
+    static_cast<unsigned char *>(xmalloc((hash_tablesize + 63) / 8));
+  hash_zap();
+
+  hash_values =
+    static_cast<uint64_t *>(xmalloc(hash_tablesize * sizeof(uint64_t)));
+
+  hash_data = static_cast<unsigned int *>
+    (xmalloc(hash_tablesize * sizeof(unsigned int)));
+}
+
+void hash_free()
+{
+  xfree(hash_occupied);
+  xfree(hash_values);
+  xfree(hash_data);
+}
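
Note on table sizing: hash_alloc() above and dereplicate() in src/derep.cc both grow the table size by doubling until the fill rate is at most 70 percent, using integer arithmetic only. A minimal sketch of that loop in isolation follows. `table_size_for` is a hypothetical helper name, not part of the patch, and the sketch assumes `100 * items` does not overflow 64 bits.

```cpp
#include <cassert>
#include <cstdint>

// Smallest power of two such that items / size <= fill_pct / 100.
// Hypothetical helper; the patch inlines this loop at each call site.
uint64_t table_size_for(uint64_t items, uint64_t fill_pct = 70)
{
  assert(fill_pct > 0); // a zero fill rate would never terminate

  uint64_t size = 1;
  while (100 * items > fill_pct * size)
    size <<= 1;
  return size;
}
```

Because the result is always a power of two, `size - 1` can be used as a bit mask in place of a modulo operation, which is what `hash_mask` and `derep_hash_mask` do.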
diff --git a/src/hashtable.h b/src/hashtable.h
new file mode 100644
index 00000000..342399ef
--- /dev/null
+++ b/src/hashtable.h
@@ -0,0 +1,81 @@
+/*
+    SWARM
+
+    Copyright (C) 2012-2019 Torbjorn Rognes and Frederic Mahe
+
+    This program is free software: you can redistribute it and/or modify
+    it under the terms of the GNU Affero General Public License as
+    published by the Free Software Foundation, either version 3 of the
+    License, or (at your option) any later version.
+
+    This program is distributed in the hope that it will be useful,
+    but WITHOUT ANY WARRANTY; without even the implied warranty of
+    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+    GNU Affero General Public License for more details.
+
+    You should have received a copy of the GNU Affero General Public License
+    along with this program.  If not, see <http://www.gnu.org/licenses/>.
+
+    Contact: Torbjorn Rognes <torognes@ifi.uio.no>,
+    Department of Informatics, University of Oslo,
+    PO Box 1080 Blindern, NO-0316 Oslo, Norway
+*/
+
+extern uint64_t hash_mask;
+extern unsigned char * hash_occupied;
+extern uint64_t * hash_values;
+extern unsigned int * hash_data;
+extern uint64_t hash_tablesize;
+
+inline uint64_t hash_get_tablesize()
+{
+  return hash_tablesize;
+}
+
+inline uint64_t hash_getindex(uint64_t hash)
+{
+  // Shift bits right to get independence from the simple Bloom filter hash
+  hash = hash >> 32;
+  return hash & hash_mask;
+}
+
+inline uint64_t hash_getnextindex(uint64_t j)
+{
+  return (j+1) & hash_mask;
+}
+
+inline void hash_set_occupied(uint64_t j)
+{
+  hash_occupied[j >> 3] |= (1 << (j & 7));
+}
+
+inline bool hash_is_occupied(uint64_t j)
+{
+  return hash_occupied[j >> 3] & (1 << (j & 7));
+}
+
+inline void hash_set_value(uint64_t j, uint64_t hash)
+{
+  hash_values[j] = hash;
+}
+
+inline bool hash_compare_value(uint64_t j, uint64_t hash)
+{
+  return (hash_values[j] == hash);
+}
+
+inline unsigned int hash_get_data(uint64_t j)
+{
+  return hash_data[j];
+}
+
+inline void hash_set_data(uint64_t j, unsigned int x)
+{
+  hash_data[j] = x;
+}
+
+void hash_zap();
+
+void hash_alloc(uint64_t amplicons);
+
+void hash_free();
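
Note on how the inline helpers above fit together: they look like building blocks for an open-addressing table with linear probing, where occupancy is tracked in a bit array (one bit per slot) and the slot index is taken from the upper 32 bits of the hash. The sketch below is one way they might be combined, assuming the table never becomes completely full. The `probe_table` class is hypothetical; the patch itself uses global arrays and leaves the probing loop to the callers.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical self-contained table mirroring the helpers in hashtable.h.
class probe_table
{
public:
  // size_pow2 must be a power of two so that (size_pow2 - 1) is a valid mask.
  explicit probe_table(uint64_t size_pow2)
    : mask(size_pow2 - 1),
      occupied((size_pow2 + 7) / 8, 0),
      values(size_pow2, 0),
      data(size_pow2, 0)
  {}

  // Store a (hash, payload) pair in the first free slot of the probe chain.
  void insert(uint64_t hash, unsigned int payload)
  {
    uint64_t j = index_of(hash);
    while (is_occupied(j))
      j = (j + 1) & mask;
    occupied[j >> 3] |= static_cast<unsigned char>(1U << (j & 7));
    values[j] = hash;
    data[j] = payload;
  }

  // Return true and set payload if a slot holding this hash is found.
  bool find(uint64_t hash, unsigned int & payload) const
  {
    uint64_t j = index_of(hash);
    while (is_occupied(j))
      {
        if (values[j] == hash)
          {
            payload = data[j];
            return true;
          }
        j = (j + 1) & mask;
      }
    return false;
  }

private:
  // Use the upper 32 bits of the hash, as hash_getindex() does.
  uint64_t index_of(uint64_t hash) const { return (hash >> 32) & mask; }

  bool is_occupied(uint64_t j) const
  {
    return (occupied[j >> 3] & (1U << (j & 7))) != 0;
  }

  uint64_t mask;
  std::vector<unsigned char> occupied; // one bit per slot
  std::vector<uint64_t> values;        // full 64-bit hash per slot
  std::vector<unsigned int> data;      // payload, e.g. an amplicon number
};
```

Keeping the occupancy flags in a separate bit array means the table can be cleared by zeroing one bit per slot, as hash_zap() does, without touching the value and data arrays.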
diff --git a/src/matrix.cc b/src/matrix.cc
index a8c699bb..c25a7d62 100644
--- a/src/matrix.cc
+++ b/src/matrix.cc
@@ -1,7 +1,7 @@
 /*
     SWARM
 
-    Copyright (C) 2012-2017 Torbjorn Rognes and Frederic Mahe
+    Copyright (C) 2012-2019 Torbjorn Rognes and Frederic Mahe
 
     This program is free software: you can redistribute it and/or modify
     it under the terms of the GNU Affero General Public License as
@@ -23,16 +23,18 @@
 
 #include "swarm.h"
 
-long SCORELIMIT_7 = 0;
-long SCORELIMIT_8;
-long SCORELIMIT_16;
-long SCORELIMIT_32;
-long SCORELIMIT_63;
+int64_t SCORELIMIT_7 = 0;
+int64_t SCORELIMIT_8;
+int64_t SCORELIMIT_16;
+int64_t SCORELIMIT_32;
+int64_t SCORELIMIT_63;
 char BIAS;
 
-unsigned char * score_matrix_8 = NULL;
-unsigned short * score_matrix_16 = NULL;
-long * score_matrix_63 = NULL;
+unsigned char * score_matrix_8 = nullptr;
+unsigned short * score_matrix_16 = nullptr;
+int64_t * score_matrix_63 = nullptr;
+
+void score_matrix_read();
 
 #if 0
 
@@ -53,7 +55,7 @@ void score_matrix_dump()
     fprintf(logfile, "%2d %c ", i, sym_nt[i]);
     for(int j=0; j<16; j++)
       {
-        fprintf(logfile, "%2ld", score_matrix_63[(i<<5) + j]);
+        fprintf(logfile, "%2" PRId64, score_matrix_63[(i<<5) + j]);
       }
     fprintf(logfile, "\n");
   }
@@ -64,12 +66,12 @@ void score_matrix_dump()
 void score_matrix_read()
 {
   int a, b;
-  long sc, lo, hi; 
-  
-  score_matrix_8 = (unsigned char *) xmalloc(32*32*sizeof(char));
-  score_matrix_16 = (unsigned short *) xmalloc(32*32*sizeof(short));
-  score_matrix_63 = (long *) xmalloc(32*32*sizeof(long));
-  
+  int64_t sc, lo, hi;
+
+  score_matrix_8 = static_cast<unsigned char*>(xmalloc(32*32*sizeof(char)));
+  score_matrix_16 = static_cast<unsigned short*>(xmalloc(32*32*sizeof(short)));
+  score_matrix_63 = static_cast<int64_t *>(xmalloc(32*32*sizeof(int64_t)));
+
   hi = -1000;
   lo = 1000;
 
@@ -88,13 +90,13 @@ void score_matrix_read()
 
   SCORELIMIT_8  = 256 - hi;
   SCORELIMIT_16 = 65536 - hi;
-  
+
   for(a=0;a<32;a++)
     for(b=0;b<32;b++)
     {
       sc = score_matrix_63[(a<<5) + b];
-      score_matrix_8[(a<<5) + b] = (unsigned char) sc;
-      score_matrix_16[(a<<5) + b] = (unsigned short) sc;
+      score_matrix_8[(a<<5) + b] = static_cast<unsigned char>(sc);
+      score_matrix_16[(a<<5) + b] = static_cast<unsigned short>(sc);
     }
 }
 
@@ -106,10 +108,10 @@ void score_matrix_init()
 
 void score_matrix_free()
 {
-  free(score_matrix_8);
-  score_matrix_8 = NULL;
-  free(score_matrix_16);
-  score_matrix_16 = NULL;
-  free(score_matrix_63);
-  score_matrix_63 = NULL;
+  xfree(score_matrix_8);
+  score_matrix_8 = nullptr;
+  xfree(score_matrix_16);
+  score_matrix_16 = nullptr;
+  xfree(score_matrix_63);
+  score_matrix_63 = nullptr;
 }
diff --git a/src/nw.cc b/src/nw.cc
index 60fbebf3..9fc25e3a 100644
--- a/src/nw.cc
+++ b/src/nw.cc
@@ -1,7 +1,7 @@
 /*
     SWARM
 
-    Copyright (C) 2012-2017 Torbjorn Rognes and Frederic Mahe
+    Copyright (C) 2012-2019 Torbjorn Rognes and Frederic Mahe
 
     This program is free software: you can redistribute it and/or modify
     it under the terms of the GNU Affero General Public License as
@@ -23,6 +23,9 @@
 
 #include "swarm.h"
 
+void pushop(char newop, char ** cigarendp, char * op, int * count);
+void finishop(char ** cigarendp, char * op, int * count);
+
 void pushop(char newop, char ** cigarendp, char * op, int * count)
 {
   if (newop == *op)
@@ -34,8 +37,9 @@ void pushop(char newop, char ** cigarendp, char * op, int * count)
     {
       char buf[25];
       int len = snprintf(buf, 25, "%d", *count);
+      assert(len >= 0);
       *cigarendp -= len;
-      memcpy(*cigarendp, buf, len);
+      memcpy(*cigarendp, buf, static_cast<size_t>(len));
     }
     *op = newop;
     *count = 1;
@@ -51,8 +55,9 @@ void finishop(char ** cigarendp, char * op, int * count)
     {
       char buf[25];
       int len = snprintf(buf, 25, "%d", *count);
+      assert(len >= 0);
       *cigarendp -= len;
-      memcpy(*cigarendp, buf, len);
+      memcpy(*cigarendp, buf, static_cast<size_t>(len));
     }
     *op = 0;
     *count = 0;
@@ -76,7 +81,7 @@ const unsigned char maskextleft = 8;
   1. left/insert/e (gap in query sequence (qseq))
   2. align/diag/h (match/mismatch)
   3. up/delete/f (gap in database sequence (dseq))
-  
+
   qseq: the reference/query/upper/vertical/from sequence
   dseq: the sample/database/lower/horisontal/to sequence
 
@@ -106,86 +111,87 @@ const unsigned char maskextleft = 8;
 */
 
 void nw(char * dseq,
-        char * dend,
+        int64_t dlen,
         char * qseq,
-        char * qend,
-        long * score_matrix,
-        unsigned long gapopen,
-        unsigned long gapextend,
-        unsigned long * nwscore,
-        unsigned long * nwdiff,
-        unsigned long * nwalignmentlength,
+        int64_t qlen,
+        int64_t * score_matrix,
+        int64_t gapopen,
+        int64_t gapextend,
+        int64_t * nwscore,
+        int64_t * nwdiff,
+        int64_t * nwalignmentlength,
         char ** nwalignment,
         unsigned char * dir,
-        unsigned long * hearray,
-        unsigned long queryno,
-        unsigned long dbseqno)
+        int64_t * hearray,
+        uint64_t queryno,
+        uint64_t dbseqno)
 {
   /* dir must point to at least qlen*dlen bytes of allocated memory
-     hearray must point to at least 2*qlen longs of allocated memory (8*qlen bytes) */
-
-  long n, e;
+     hearray must point to at least 2*qlen int64_t values of allocated
+     memory (16*qlen bytes) */
 
-  long qlen = qend - qseq;
-  long dlen = dend - dseq;
+  int64_t n, e;
 
-  memset(dir, 0, qlen*dlen);
+  memset(dir, 0, static_cast<size_t>(qlen * dlen));
 
-  long i, j;
+  int64_t i, j;
 
   for(i=0; i<qlen; i++)
-  {
-    hearray[2*i]   = 1 * gapopen + (i+1) * gapextend; // H (N)
-    hearray[2*i+1] = 2 * gapopen + (i+2) * gapextend; // E
-  }
+    {
+      hearray[2*i]   = 1 * gapopen + (i+1) * gapextend; // H (N)
+      hearray[2*i+1] = 2 * gapopen + (i+2) * gapextend; // E
+    }
 
   for(j=0; j<dlen; j++)
-  {
-    long unsigned *hep;
-    hep = hearray;
-    long f = 2 * gapopen + (j+2) * gapextend;
-    long h = (j == 0) ? 0 : (gapopen + j * gapextend);
-    
-    for(i=0; i<qlen; i++)
     {
-      long index = qlen*j+i;
-      
-      n = *hep;
-      e = *(hep+1);
-      h += score_matrix[(dseq[j]<<5) + qseq[i]];
-      
-      dir[index] |= (f < h ? maskup : 0);
-      h = MIN(h, f);
-      h = MIN(h, e);
-      dir[index] |= (e == h ? maskleft : 0);
-
-      *hep = h;
-      
-      h += gapopen + gapextend;
-      e += gapextend;
-      f += gapextend;
-      
-      dir[index] |= (f < h ? maskextup : 0);
-      dir[index] |= (e < h ? maskextleft : 0);
-      f = MIN(h,f);
-      e = MIN(h,e);
-      
-      *(hep+1) = e;
-      h = n;
-      hep += 2;
+      int64_t *hep;
+      hep = hearray;
+      int64_t f = 2 * gapopen + (j+2) * gapextend;
+      int64_t h = (j == 0) ? 0 : (gapopen + j * gapextend);
+
+      for(i=0; i<qlen; i++)
+        {
+          int64_t index = qlen*j+i;
+
+          n = *hep;
+          e = *(hep+1);
+          h += score_matrix
+            [((nt_extract(dseq, static_cast<uint64_t>(j)) + 1) << 5)
+             +(nt_extract(qseq, static_cast<uint64_t>(i)) + 1)];
+
+          dir[index] |= (f < h ? maskup : 0);
+          h = MIN(h, f);
+          h = MIN(h, e);
+          dir[index] |= (e == h ? maskleft : 0);
+
+          *hep = h;
+
+          h += gapopen + gapextend;
+          e += gapextend;
+          f += gapextend;
+
+          dir[index] |= (f < h ? maskextup : 0);
+          dir[index] |= (e < h ? maskextleft : 0);
+          f = MIN(h,f);
+          e = MIN(h,e);
+
+          *(hep+1) = e;
+          h = n;
+          hep += 2;
+        }
     }
-  }
-  
-  long dist = hearray[2*qlen-2];
-  
+
+  int64_t dist = hearray[2*qlen-2];
+
   /* backtrack: count differences and save alignment in cigar string */
 
-  long score = 0;
-  long alength = 0;
-  long matches = 0;
+  int64_t score = 0;
+  int64_t alength = 0;
+  int64_t matches = 0;
 
-  char * cigar = (char *) xmalloc(qlen + dlen + 1);
-  char * cigarend = cigar+qlen+dlen+1;
+  char * cigar = static_cast<char *>(xmalloc
+                                     (static_cast<size_t>(qlen + dlen + 1)));
+  char * cigarend = cigar + qlen + dlen + 1;
 
   char op = 0;
   int count = 0;
@@ -195,87 +201,98 @@ void nw(char * dseq,
   j = dlen;
 
   while ((i>0) && (j>0))
-  {
-    int d = dir[qlen*(j-1)+(i-1)];
-
-    alength++;
-
-    if ((op == 'I') && (d & maskextleft))
     {
-      score += gapextend;
-      j--;
-      pushop('I', &cigarend, &op, &count);
+      int d = dir[qlen*(j-1)+(i-1)];
+
+      alength++;
+
+      if ((op == 'I') && (d & maskextleft))
+        {
+          score += gapextend;
+          j--;
+          pushop('I', &cigarend, &op, &count);
+        }
+      else if ((op == 'D') && (d & maskextup))
+        {
+          score += gapextend;
+          i--;
+          pushop('D', &cigarend, &op, &count);
+        }
+      else if (d & maskleft)
+        {
+          score += gapextend;
+          if (op != 'I')
+            score += gapopen;
+          j--;
+          pushop('I', &cigarend, &op, &count);
+        }
+      else if (d & maskup)
+        {
+          score += gapextend;
+          if (op != 'D')
+            score += gapopen;
+          i--;
+          pushop('D', &cigarend, &op, &count);
+        }
+      else
+        {
+          score += score_matrix
+            [((nt_extract(dseq, static_cast<uint64_t>(j - 1)) + 1) << 5)
+             +(nt_extract(qseq, static_cast<uint64_t>(i - 1)) + 1)];
+
+          if (nt_extract(qseq, static_cast<uint64_t>(i - 1)) ==
+              nt_extract(dseq, static_cast<uint64_t>(j - 1)))
+            matches++;
+          i--;
+          j--;
+          pushop('M', &cigarend, &op, &count);
+        }
     }
-    else if ((op == 'D') && (d & maskextup))
+
+  while(i>0)
     {
+      alength++;
       score += gapextend;
+      if (op != 'D')
+        score += gapopen;
       i--;
       pushop('D', &cigarend, &op, &count);
     }
-    else if (d & maskleft)
+
+  while(j>0)
     {
+      alength++;
       score += gapextend;
       if (op != 'I')
         score += gapopen;
       j--;
       pushop('I', &cigarend, &op, &count);
     }
-    else if (d & maskup)
-    {
-      score += gapextend;
-      if (op != 'D')
-        score +=gapopen;
-      i--;
-      pushop('D', &cigarend, &op, &count);
-    }
-    else
-    {
-      score += score_matrix[(dseq[j-1] << 5) + qseq[i-1]];
-      if (qseq[i-1] == dseq[j-1])
-        matches++;
-      i--;
-      j--;
-      pushop('M', &cigarend, &op, &count);
-    }
-  }
-  
-  while(i>0)
-  {
-    alength++;
-    score += gapextend;
-    if (op != 'D')
-      score += gapopen;
-    i--;
-    pushop('D', &cigarend, &op, &count);
-  }
-  
-  while(j>0)
-  {
-    alength++;
-    score += gapextend;
-    if (op != 'I')
-      score += gapopen;
-    j--;
-    pushop('I', &cigarend, &op, &count);
-  }
 
   finishop(&cigarend, &op, &count);
 
   /* move and reallocate cigar */
 
-  long cigarlength = cigar+qlen+dlen-cigarend;
-  memmove(cigar, cigarend, cigarlength+1);
-  cigar = (char*) xrealloc(cigar, cigarlength+1);
+  size_t cigaralloc = static_cast<size_t>(cigar + qlen + dlen - cigarend + 1);
+  memmove(cigar, cigarend, cigaralloc);
+  cigar = static_cast<char*>(xrealloc(cigar, cigaralloc));
 
   * nwscore = dist;
   * nwdiff = alength - matches;
   * nwalignmentlength = alength;
   * nwalignment = cigar;
 
+  assert(score == dist);
+
+#if 0
   if (score != dist)
   {
-    fprintf(stderr, "WARNING: Error with query no %lu and db sequence no %lu:\n", queryno, dbseqno);
-    fprintf(stderr, "Initial and recomputed alignment score disagreement: %ld %ld\n", dist, score);
+    fprintf(stderr, "WARNING: Error with query no %" PRIu64 " and db sequence no %" PRIu64 ":\n", queryno, dbseqno);
+    fprintf(stderr, "Initial and recomputed alignment score disagreement: %" PRId64 " %" PRId64 "\n", dist, score);
     fprintf(stderr, "Alignment: %s\n", cigar);
   }
+#else
+  (void) queryno;
+  (void) dbseqno;
+#endif
 }
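Note on the nw.cc hunks above: pushop() and finishop() build the CIGAR string from the end of the buffer towards its start, because backtracking visits the alignment from its last column to its first. The following is a minimal standalone sketch of that idea, not the code from this patch. The function name cigar_from_backtrack and its std::string interface are hypothetical and exist only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical helper (not part of swarm): takes alignment operations in
// backtracking order (last column first) and returns a run-length encoded
// CIGAR string in forward order.
std::string cigar_from_backtrack(const std::string& ops_reversed)
{
  // A run of c >= 2 identical operations needs 1 + digits(c) <= c bytes,
  // so one byte per operation plus one terminator is always enough.
  std::vector<char> buf(ops_reversed.size() + 1);
  char* end = buf.data() + buf.size();
  char op = 0;
  int count = 0;

  // Write the pending run in front of what is already stored.
  // The very first call stores op == 0, which acts as the terminator.
  auto flush = [&]() {
    *--end = op;
    if (count > 1)
      {
        char num[25];
        int len = std::snprintf(num, sizeof num, "%d", count);
        assert(len >= 0);
        end -= len;
        std::memcpy(end, num, static_cast<std::size_t>(len));
      }
  };

  for (char newop : ops_reversed)
    {
      if (newop == op)
        ++count;
      else
        {
          flush();
          op = newop;
          count = 1;
        }
    }
  flush();

  return std::string(end);
}
```

Because each run is written in front of the previous one, no final reversal is needed; only the unused front of the buffer has to be dropped, which is what the memmove/xrealloc step in nw() does.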
diff --git a/src/qgram.cc b/src/qgram.cc
index 04fcf217..4d640a4b 100644
--- a/src/qgram.cc
+++ b/src/qgram.cc
@@ -1,7 +1,7 @@
 /*
     SWARM
 
-    Copyright (C) 2012-2017 Torbjorn Rognes and Frederic Mahe
+    Copyright (C) 2012-2019 Torbjorn Rognes and Frederic Mahe
 
     This program is free software: you can redistribute it and/or modify
     it under the terms of the GNU Affero General Public License as
@@ -31,13 +31,13 @@ static struct thread_info_s
   pthread_t pthread;
   pthread_mutex_t workmutex;
   pthread_cond_t workcond;
-  int work;
+  int64_t work;
 
   /* specialized thread info */
-  unsigned long seed;
-  unsigned long listlen;
-  unsigned long * amplist;
-  unsigned long * difflist;
+  uint64_t seed;
+  uint64_t listlen;
+  uint64_t * amplist;
+  uint64_t * difflist;
 } * ti;
 
 #if 0
@@ -58,46 +58,84 @@ void printqgrams(unsigned char * qgramvector)
 
 #endif
 
-void findqgrams(unsigned char * seq, unsigned long seqlen, 
+void findqgrams(unsigned char * seq, uint64_t seqlen,
                 unsigned char * qgramvector)
 {
   /* set qgram bit vector by xoring occurrences of qgrams in sequence */
 
   memset(qgramvector, 0, QGRAMVECTORBYTES);
-  
-  unsigned long qgram = 0;
-  unsigned long i = 0;
+
+  uint64_t qgram = 0;
+  unsigned int i = 0;
 
   while((i < QGRAMLENGTH-1) && (i<seqlen))
   {
-    qgram = (qgram << 2) | (seq[i] - 1);
+    qgram = (qgram << 2) | nt_extract(reinterpret_cast<char *>(seq), i);
     i++;
   }
 
   while(i < seqlen)
   {
-    qgram = (qgram << 2) | (seq[i] - 1);
-    qgramvector[(qgram >> 3) & (QGRAMVECTORBYTES-1)] ^= (1 << (qgram & 7)); 
+    qgram = (qgram << 2) | nt_extract(reinterpret_cast<char *>(seq), i);
+    qgramvector[(qgram >> 3) & (QGRAMVECTORBYTES-1)] ^= (1 << (qgram & 7));
     i++;
   }
 }
 
-/* 
-   Unable to get the Mac gcc compiler v 4.2.1 produce the real 
+void qgram_work_diff(thread_info_s * tip);
+void * qgram_worker(void * vp);
+uint64_t compareqgramvectors(unsigned char * a, unsigned char * b);
+
+#ifdef __aarch64__
+
+uint64_t compareqgramvectors(unsigned char * a, unsigned char * b)
+{
+  uint8x16_t * ap = (uint8x16_t *) a;
+  uint8x16_t * bp = (uint8x16_t *) b;
+  uint64_t count = 0;
+
+  while ((unsigned char*)ap < a + QGRAMVECTORBYTES)
+    count += vaddvq_u8(vcntq_u8(veorq_u8(*ap++, *bp++)));
+
+  return count;
+}
+
+#elif defined __PPC__
+
+uint64_t compareqgramvectors(unsigned char * a, unsigned char * b)
+{
+  vector unsigned char * ap = (vector unsigned char *) a;
+  vector unsigned char * bp = (vector unsigned char *) b;
+  vector unsigned long long count_vector = { 0, 0 };
+
+  while ((unsigned char *)ap < a + QGRAMVECTORBYTES)
+    count_vector += vec_vpopcnt((vector unsigned long long)(vec_xor(*ap++, *bp++)));
+
+  return count_vector[0] + count_vector[1];
+}
+
+#elif defined __x86_64__
+
+/*
+   Unable to get the Mac gcc compiler v 4.2.1 to produce the real
    popcnt instruction. Therefore resorting to assembly code.
 */
 
 #define popcnt_asm(x,y)                                         \
-  __asm__ __volatile__ ("popcnt %1,%0" : "=r"(y) : "r"(x));
+  __asm__ __volatile__ ("popcnt %1,%0" : "=r"(y) : "r"(x))
 
-inline unsigned long popcount(unsigned long x)
+inline uint64_t popcount(uint64_t x)
 {
-  unsigned long y;
+  uint64_t y;
   popcnt_asm(x,y);
   return y;
 }
 
-unsigned long popcount_128(__m128i x)
+uint64_t popcount_128(__m128i x);
+uint64_t compareqgramvectors_128(unsigned char * a, unsigned char * b);
+uint64_t compareqgramvectors_64(unsigned char * a, unsigned char * b);
+
+uint64_t popcount_128(__m128i x)
 {
   __m128i mask1 = _mm_set_epi8(0x55, 0x55, 0x55, 0x55, 0x55, 0x55, 0x55, 0x55,
                                0x55, 0x55, 0x55, 0x55, 0x55, 0x55, 0x55, 0x55);
@@ -141,43 +179,45 @@ unsigned long popcount_128(__m128i x)
 
   /* return low 64 bits: return value is always in range 0 to 128 */
 
-  unsigned long o = (unsigned long) _mm_movepi64_pi64(n);
+  uint64_t o = reinterpret_cast<uint64_t>(_mm_movepi64_pi64(n));
 
   return o;
 }
 
-unsigned long compareqgramvectors_128(unsigned char * a, unsigned char * b)
+uint64_t compareqgramvectors_128(unsigned char * a, unsigned char * b)
 {
   /* Count number of different bits */
   /* Uses SSE2 but not POPCNT instruction */
   /* input MUST be 16-byte aligned */
 
-  __m128i * ap = (__m128i *) a;
-  __m128i * bp = (__m128i *) b;
-  unsigned long count = 0;
+  __m128i * ap = reinterpret_cast<__m128i *>(a);
+  __m128i * bp = reinterpret_cast<__m128i *>(b);
+  uint64_t count = 0;
 
-  while ((unsigned char*)ap < a + QGRAMVECTORBYTES)
+  while (reinterpret_cast<unsigned char*>(ap) < a + QGRAMVECTORBYTES)
     count += popcount_128(_mm_xor_si128(*ap++, *bp++));
-  
+
   return count;
 }
 
-unsigned long compareqgramvectors_64(unsigned char * a, unsigned char * b)
+
+uint64_t compareqgramvectors_64(unsigned char * a, unsigned char * b)
 {
   /* Count number of different bits */
   /* Uses the POPCNT instruction, requires CPU with this feature */
 
-  unsigned long *ap = (unsigned long*)a;
-  unsigned long *bp = (unsigned long*)b;
-  unsigned long count = 0;
+  uint64_t *ap = reinterpret_cast<uint64_t*>(a);
+  uint64_t *bp = reinterpret_cast<uint64_t*>(b);
+  uint64_t count = 0;
 
-  while ((unsigned char*) ap < a + QGRAMVECTORBYTES)
+  while (reinterpret_cast<unsigned char*>(ap) < a + QGRAMVECTORBYTES)
     count += popcount(*ap++ ^ *bp++);
-  
+
   return count;
 }
 
-unsigned long compareqgramvectors(unsigned char * a, unsigned char * b)
+
+uint64_t compareqgramvectors(unsigned char * a, unsigned char * b)
 {
   if (popcnt_present)
     return compareqgramvectors_64(a,b);
@@ -185,29 +225,34 @@ unsigned long compareqgramvectors(unsigned char * a, unsigned char * b)
     return compareqgramvectors_128(a,b);
 }
 
+#else
+
+#error Unknown architecture
+
+#endif
 
-inline unsigned long qgram_diff(unsigned long a, unsigned long b)
+inline uint64_t qgram_diff(uint64_t a, uint64_t b)
 {
-  unsigned long diffqgrams = compareqgramvectors(db_getqgramvector(a),
-                                                 db_getqgramvector(b));
-  unsigned long mindiff = (diffqgrams + 2*QGRAMLENGTH - 1)/(2*QGRAMLENGTH);
+  uint64_t diffqgrams = compareqgramvectors(db_getqgramvector(a),
+                                            db_getqgramvector(b));
+  uint64_t mindiff = (diffqgrams + 2*QGRAMLENGTH - 1)/(2*QGRAMLENGTH);
   return mindiff;
 }
 
 void qgram_work_diff(thread_info_s * tip)
 {
-  unsigned long seed = tip->seed;
-  unsigned long listlen = tip->listlen;
-  unsigned long * amplist = tip->amplist;
-  unsigned long * difflist = tip->difflist;
+  uint64_t seed = tip->seed;
+  uint64_t listlen = tip->listlen;
+  uint64_t * amplist = tip->amplist;
+  uint64_t * difflist = tip->difflist;
 
-  for(unsigned long i=0; i<listlen; i++)
+  for(uint64_t i=0; i<listlen; i++)
     difflist[i] = qgram_diff(seed, amplist[i]);
 }
 
 void * qgram_worker(void * vp)
 {
-  long t = (long) vp;
+  int64_t t = reinterpret_cast<int64_t>(vp);
   struct thread_info_s * tip = ti + t;
 
   pthread_mutex_lock(&tip->workmutex);
@@ -226,26 +271,27 @@ void * qgram_worker(void * vp)
         }
     }
   pthread_mutex_unlock(&tip->workmutex);
-  return 0;
+  return nullptr;
 }
 
 void qgram_diff_init()
 {
   pthread_attr_init(&attr);
   pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_JOINABLE);
-  
+
   /* allocate memory for thread info */
-  ti = (struct thread_info_s *) xmalloc(opt_threads * 
-                                        sizeof(struct thread_info_s));
-  
+  ti = static_cast<struct thread_info_s *>
+    (xmalloc(static_cast<uint64_t>(opt_threads) *
+             sizeof(struct thread_info_s)));
+
   /* init and create worker threads */
-  for(long t=0; t<opt_threads; t++)
+  for(int64_t t=0; t<opt_threads; t++)
     {
       struct thread_info_s * tip = ti + t;
       tip->work = 0;
-      pthread_mutex_init(&tip->workmutex, NULL);
-      pthread_cond_init(&tip->workcond, NULL);
-      if (pthread_create(&tip->pthread, &attr, qgram_worker, (void*)(long)t))
+      pthread_mutex_init(&tip->workmutex, nullptr);
+      pthread_cond_init(&tip->workcond, nullptr);
+      if (pthread_create(&tip->pthread, &attr, qgram_worker, reinterpret_cast<void*>(t)))
         fatal("Cannot create thread");
     }
 }
@@ -253,10 +299,10 @@ void qgram_diff_init()
 void qgram_diff_done()
 {
   /* finish and clean up worker threads */
-  for(long t=0; t<opt_threads; t++)
+  for(int64_t t=0; t<opt_threads; t++)
     {
       struct thread_info_s * tip = ti + t;
-      
+
       /* tell worker to quit */
       pthread_mutex_lock(&tip->workmutex);
       tip->work = -1;
@@ -264,39 +310,39 @@ void qgram_diff_done()
       pthread_mutex_unlock(&tip->workmutex);
 
       /* wait for worker to quit */
-      if (pthread_join(tip->pthread, NULL))
+      if (pthread_join(tip->pthread, nullptr))
         fatal("Cannot join thread");
 
       pthread_cond_destroy(&tip->workcond);
       pthread_mutex_destroy(&tip->workmutex);
     }
 
-  free(ti);
+  xfree(ti);
   pthread_attr_destroy(&attr);
 }
 
-void qgram_diff_fast(unsigned long seed,
-                     unsigned long listlen,
-                     unsigned long * amplist,
-                     unsigned long * difflist)
+void qgram_diff_fast(uint64_t seed,
+                     uint64_t listlen,
+                     uint64_t * amplist,
+                     uint64_t * difflist)
 {
-  long thr = opt_threads;
-  
-  const unsigned long m = 150;
-
-  if (listlen < m*thr)
-    thr = (listlen+m-1)/m;
-  
-  unsigned long * next_amplist = amplist;
-  unsigned long * next_difflist = difflist;
-  unsigned long listrest = listlen;
-  unsigned long thrrest = thr;
-  
+  uint64_t thr = static_cast<uint64_t>(opt_threads);
+
+  const uint64_t m = 150;
+
+  if (listlen < m * thr)
+    thr = (listlen + m - 1) / m;
+
+  uint64_t * next_amplist = amplist;
+  uint64_t * next_difflist = difflist;
+  uint64_t listrest = listlen;
+  uint64_t thrrest = thr;
+
   /* distribute work */
-  for(long t=0; t<thr; t++)
+  for(uint64_t t=0; t<thr; t++)
     {
       thread_info_s * tip = ti + t;
-      unsigned long chunk = (listrest + thrrest - 1) / thrrest;
+      uint64_t chunk = (listrest + thrrest - 1) / thrrest;
 
       tip->seed = seed;
       tip->amplist = next_amplist;
@@ -316,7 +362,7 @@ void qgram_diff_fast(unsigned long seed,
   else
     {
       /* wake up threads */
-      for(long t=0; t<thr; t++)
+      for(uint64_t t=0; t<thr; t++)
         {
           struct thread_info_s * tip = ti + t;
           pthread_mutex_lock(&tip->workmutex);
@@ -324,9 +370,9 @@ void qgram_diff_fast(unsigned long seed,
           pthread_cond_signal(&tip->workcond);
           pthread_mutex_unlock(&tip->workmutex);
         }
-      
+
       /* wait for threads to finish their work */
-      for(int t=0; t<thr; t++)
+      for(uint64_t t=0; t<thr; t++)
         {
           struct thread_info_s * tip = ti + t;
           pthread_mutex_lock(&tip->workmutex);
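Note on the qgram.cc hunks above: every architecture-specific compareqgramvectors() counts the bits that differ between two q-gram bit vectors, and qgram_diff() divides that count by 2*QGRAMLENGTH, rounding up. The rationale, as far as it can be read from the code, is that a single substitution can change at most QGRAMLENGTH q-grams, each of which removes one q-gram and adds another. Below is a portable sketch without intrinsics. The constants kVectorBytes = 128 and kQgramLength = 5 are assumptions for illustration, since the real QGRAMVECTORBYTES and QGRAMLENGTH are defined outside this diff, and all function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Assumed values for illustration only; swarm defines the real
// QGRAMVECTORBYTES and QGRAMLENGTH elsewhere.
constexpr std::size_t kVectorBytes = 128;
constexpr std::uint64_t kQgramLength = 5;

static_assert(kVectorBytes % sizeof(std::uint64_t) == 0,
              "vector must consist of whole 64-bit words");

// Portable bit count: clears the lowest set bit until none remain.
inline std::uint64_t popcount64_portable(std::uint64_t x)
{
  std::uint64_t c = 0;
  while (x != 0)
    {
      x &= x - 1;
      ++c;
    }
  return c;
}

// Number of bit positions in which the two vectors differ.
std::uint64_t compare_vectors_portable(const unsigned char* a,
                                       const unsigned char* b)
{
  assert(a != nullptr && b != nullptr);
  std::uint64_t count = 0;
  for (std::size_t off = 0; off < kVectorBytes; off += sizeof(std::uint64_t))
    {
      std::uint64_t wa;
      std::uint64_t wb;
      // memcpy avoids alignment and aliasing problems
      std::memcpy(&wa, a + off, sizeof wa);
      std::memcpy(&wb, b + off, sizeof wb);
      count += popcount64_portable(wa ^ wb);
    }
  return count;
}

// Rounded-up division used for the lower bound on differences.
std::uint64_t min_diff(std::uint64_t diffqgrams)
{
  return (diffqgrams + 2 * kQgramLength - 1) / (2 * kQgramLength);
}
```

A fallback along these lines could in principle stand in for the `#error Unknown architecture` branch, at the cost of speed; whether that is desirable is a project decision and not something this patch does.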
diff --git a/src/scan.cc b/src/scan.cc
index 9e027e6e..a6f17fd0 100644
--- a/src/scan.cc
+++ b/src/scan.cc
@@ -1,7 +1,7 @@
 /*
     SWARM
 
-    Copyright (C) 2012-2017 Torbjorn Rognes and Frederic Mahe
+    Copyright (C) 2012-2019 Torbjorn Rognes and Frederic Mahe
 
     This program is free software: you can redistribute it and/or modify
     it under the terms of the GNU Affero General Public License as
@@ -31,19 +31,18 @@ static struct thread_info_s
   pthread_t pthread;
   pthread_mutex_t workmutex;
   pthread_cond_t workcond;
-  int work;
+  int64_t work;
 
   /* specialized thread info */
-  unsigned long seed;
-  unsigned long listlen;
-  unsigned long * amplist;
-  unsigned long * difflist;
+  uint64_t seed;
+  uint64_t listlen;
+  uint64_t * amplist;
+  uint64_t * difflist;
 } * ti;
 
 
-pthread_mutex_t workmutex = PTHREAD_MUTEX_INITIALIZER;
+static pthread_mutex_t workmutex = PTHREAD_MUTEX_INITIALIZER;
 
-queryinfo_t query;
 
 struct search_data
 {
@@ -55,37 +54,49 @@ struct search_data
 
   BYTE * hearray;
 
-  unsigned long * dir_array;
+  uint64_t * dir_array;
 
-  unsigned long target_count;
-  unsigned long target_index;
+  uint64_t target_count;
+  uint64_t target_index;
 };
 
-struct search_data * sd;
-
-unsigned long master_next;
-unsigned long master_length;
-
-unsigned long remainingchunks;
+static struct search_data * sd;
+static uint64_t master_next;
+static uint64_t master_length;
+static uint64_t remainingchunks;
+static uint64_t * master_targets;
+static uint64_t * master_scores;
+static uint64_t * master_diffs;
+static uint64_t * master_alignlengths;
+static int master_bits;
+static uint64_t dirbufferbytes;
 
-unsigned long * master_targets;
-unsigned long * master_scores;
-unsigned long * master_diffs;
-unsigned long * master_alignlengths;
-int master_bits;
+queryinfo_t query;
+uint64_t longestdbsequence;
 
-unsigned long longestdbsequence;
-unsigned long dirbufferbytes;
+void search_alloc(struct search_data * sdp);
+void search_free(struct search_data * sdp);
+void search_init(struct search_data * sdp);
+void search_chunk(struct search_data * sdp, int64_t bits);
+int search_getwork(uint64_t * countref, uint64_t * firstref);
+void search_worker_core(uint64_t t);
+void * search_worker(void * vp);
 
 void search_alloc(struct search_data * sdp)
 {
   dirbufferbytes = 8 * longestdbsequence * ((longestdbsequence+3)/4) * 4;
-  sdp->qtable = (BYTE**) xmalloc(longestdbsequence * sizeof(BYTE*));
-  sdp->qtable_w = (WORD**) xmalloc(longestdbsequence * sizeof(WORD*));
-  sdp->dprofile = (BYTE*) xmalloc(4*16*32);
-  sdp->dprofile_w = (WORD*) xmalloc(4*2*8*32);
-  sdp->hearray = (BYTE*) xmalloc(longestdbsequence * 32);
-  sdp->dir_array = (unsigned long *) xmalloc(dirbufferbytes);
+  sdp->qtable = static_cast<BYTE**>
+    (xmalloc(longestdbsequence * sizeof(BYTE*)));
+  sdp->qtable_w = static_cast<WORD**>
+    (xmalloc(longestdbsequence * sizeof(WORD*)));
+  sdp->dprofile = static_cast<BYTE*>
+    (xmalloc(4*16*32));
+  sdp->dprofile_w = static_cast<WORD*>
+    (xmalloc(4*2*8*32));
+  sdp->hearray = static_cast<BYTE*>
+    (xmalloc(longestdbsequence * 32));
+  sdp->dir_array = static_cast<uint64_t *>
+    (xmalloc(dirbufferbytes));
 
   memset(sdp->hearray, 0, longestdbsequence*32);
   memset(sdp->dir_array, 0, dirbufferbytes);
@@ -93,41 +104,41 @@ void search_alloc(struct search_data * sdp)
 
 void search_free(struct search_data * sdp)
 {
-  free(sdp->qtable);
-  free(sdp->qtable_w);
-  free(sdp->dprofile);
-  free(sdp->dprofile_w);
-  free(sdp->hearray);
-  free(sdp->dir_array);
+  xfree(sdp->qtable);
+  xfree(sdp->qtable_w);
+  xfree(sdp->dprofile);
+  xfree(sdp->dprofile_w);
+  xfree(sdp->hearray);
+  xfree(sdp->dir_array);
 }
 
 void search_init(struct search_data * sdp)
 {
-  for (long i = 0; i < query.len; i++ )
+  for (unsigned int i = 0; i < query.len; i++ )
   {
-    sdp->qtable[i] = sdp->dprofile + 64*query.seq[i];
-    sdp->qtable_w[i] = sdp->dprofile_w + 32*query.seq[i];
+    sdp->qtable[i] = sdp->dprofile + 64 * (nt_extract(query.seq, i) + 1);
+    sdp->qtable_w[i] = sdp->dprofile_w + 32 * (nt_extract(query.seq, i) + 1);
   }
 }
 
-void search_chunk(struct search_data * sdp, long bits)
+void search_chunk(struct search_data * sdp, int64_t bits)
 {
   if (sdp->target_count == 0)
     return;
 
 #if 0
 
-  for(unsigned long i=0; i<sdp->target_count; i++)
+  for(uint64_t i=0; i<sdp->target_count; i++)
     {
     char * dseq;
-    long dlen;
+    int64_t dlen;
     char * nwalignment;
 
-    unsigned long seqno = master_targets[sdp->target_index + i];
+    uint64_t seqno = master_targets[sdp->target_index + i];
     db_getsequenceandlength(seqno, & dseq, & dlen);
 
-    nw(dseq, dseq + dlen,
-       query.seq, query.seq + query.len,
+    nw(dseq, dlen,
+       query.seq, query.len,
        score_matrix_63,
        penalty_gapopen, penalty_gapextend,
        master_scores + sdp->target_index + i,
@@ -135,40 +146,46 @@ void search_chunk(struct search_data * sdp, long bits)
        master_alignlengths + sdp->target_index + i,
        & nwalignment,
        (unsigned char *) sdp->dir_array,
-       (unsigned long int *) sdp->hearray,
+       (int64_t *) sdp->hearray,
        query.qno, seqno);
 
 #if 0
     printf("\nAlignment: %s\n", nwalignment);
 #endif
 
-    free(nwalignment);
+    xfree(nwalignment);
   }
 
   return;
 
 #endif
 
+#ifdef __aarch64__
+  /* always use 16-bit version on aarch64 because it is faster */
+  (void) bits;
+  if (1)
+#else
   if (bits == 16)
+#endif
     search16(sdp->qtable_w,
-             penalty_gapopen,
-             penalty_gapextend,
-             (WORD*) score_matrix_16,
+             static_cast<WORD>(penalty_gapopen),
+             static_cast<WORD>(penalty_gapextend),
+             static_cast<WORD*>(score_matrix_16),
              sdp->dprofile_w,
-             (WORD*) sdp->hearray,
+             reinterpret_cast<WORD*>(sdp->hearray),
              sdp->target_count,
              master_targets + sdp->target_index,
              master_scores + sdp->target_index,
              master_diffs + sdp->target_index,
              master_alignlengths + sdp->target_index,
-             query.len,
+             static_cast<uint64_t>(query.len),
              dirbufferbytes/8,
              sdp->dir_array);
   else
     search8(sdp->qtable,
-            penalty_gapopen,
-            penalty_gapextend,
-            (BYTE*) score_matrix_8,
+            static_cast<BYTE>(penalty_gapopen),
+            static_cast<BYTE>(penalty_gapextend),
+            static_cast<BYTE*>(score_matrix_8),
             sdp->dprofile,
             sdp->hearray,
             sdp->target_count,
@@ -176,38 +193,38 @@ void search_chunk(struct search_data * sdp, long bits)
             master_scores + sdp->target_index,
             master_diffs + sdp->target_index,
             master_alignlengths + sdp->target_index,
-            query.len,
+            static_cast<uint64_t>(query.len),
             dirbufferbytes/8,
             sdp->dir_array);
 }
- 
-int search_getwork(unsigned long * countref, unsigned long * firstref)
+
+int search_getwork(uint64_t * countref, uint64_t * firstref)
 {
   // * countref = how many sequences to search
   // * firstref = index into master_targets/scores/diffs where thread should start
-  
-  unsigned long status = 0;
-  
+
+  int status = 0;
+
   pthread_mutex_lock(&workmutex);
-  
+
   if (master_next < master_length)
     {
-      unsigned long chunksize = 
+      uint64_t chunksize =
         ((master_length - master_next + remainingchunks - 1) / remainingchunks);
-      
+
       * countref = chunksize;
       * firstref = master_next;
-      
+
       master_next += chunksize;
       remainingchunks--;
       status = 1;
     }
-  
+
   pthread_mutex_unlock(&workmutex);
-  
+
   return status;
 }
-  
+
 #if 0
 
 /* never used */
@@ -216,16 +233,16 @@ void master_dump()
 {
   printf("master_dump\n");
   printf("   i    t    s    d\n");
-  for(unsigned long i=0; i< 1403; i++)
+  for(uint64_t i=0; i< 1403; i++)
     {
-      printf("%4lu %4lu %4lu %4lu\n", i, master_targets[i],
-             master_scores[i], master_diffs[i]);
+      printf("%4" PRIu64 " %4" PRIu64 " %4" PRIu64 " %4" PRIu64 "\n",
+             i, master_targets[i], master_scores[i], master_diffs[i]);
     }
 }
-  
+
 #endif
 
-void search_worker_core(int t)
+void search_worker_core(uint64_t t)
 {
   search_init(sd+t);
   while(search_getwork(& sd[t].target_count, & sd[t].target_index))
@@ -234,7 +251,7 @@ void search_worker_core(int t)
 
 void * search_worker(void * vp)
 {
-  long t = (long) vp;
+  uint64_t t = reinterpret_cast<uint64_t>(vp);
   struct thread_info_s * tip = ti + t;
 
   pthread_mutex_lock(&tip->workmutex);
@@ -253,20 +270,22 @@ void * search_worker(void * vp)
         }
     }
   pthread_mutex_unlock(&tip->workmutex);
-  return 0;
+  return nullptr;
 }
 
-void search_do(unsigned long query_no, 
-               unsigned long listlength,
-               unsigned long * targets,
-               unsigned long * scores,
-               unsigned long * diffs,
-               unsigned long * alignlengths,
-               long bits)
+void search_do(uint64_t query_no,
+               uint64_t listlength,
+               uint64_t * targets,
+               uint64_t * scores,
+               uint64_t * diffs,
+               uint64_t * alignlengths,
+               int bits)
 {
   query.qno = query_no;
-  db_getsequenceandlength(query_no, &query.seq, &query.len);
-  
+  unsigned int query_len = 0;
+  db_getsequenceandlength(query_no, &query.seq, &query_len);
+  query.len = query_len;
+
   master_next = 0;
   master_length = listlength;
   master_targets = targets;
@@ -275,21 +294,21 @@ void search_do(unsigned long query_no,
   master_alignlengths = alignlengths;
   master_bits = bits;
 
-  unsigned long thr = opt_threads;
+  uint64_t thr = static_cast<uint64_t>(opt_threads);
 
   if (bits == 8)
     {
-      if (master_length <= (unsigned long)(15 * thr) )
+      if (master_length <= 15 * thr)
         thr = (master_length + 15) / 16;
     }
   else
     {
-      if (master_length <= (unsigned long)(7 * thr) )
+      if (master_length <= 7 * thr)
         thr = (master_length + 7) / 8;
     }
 
   remainingchunks = thr;
-  
+
   if (thr == 1)
     {
       search_worker_core(0);
@@ -297,7 +316,7 @@ void search_do(unsigned long query_no,
   else
     {
       /* wake up threads */
-      for(unsigned long t=0; t<thr; t++)
+      for(uint64_t t=0; t<thr; t++)
         {
           struct thread_info_s * tip = ti + t;
           pthread_mutex_lock(&tip->workmutex);
@@ -305,9 +324,9 @@ void search_do(unsigned long query_no,
           pthread_cond_signal(&tip->workcond);
           pthread_mutex_unlock(&tip->workmutex);
         }
-      
+
       /* wait for threads to finish their work */
-      for(unsigned long t=0; t<thr; t++)
+      for(uint64_t t=0; t<thr; t++)
         {
           struct thread_info_s * tip = ti + t;
           pthread_mutex_lock(&tip->workmutex);
@@ -321,42 +340,45 @@ void search_do(unsigned long query_no,
 void search_begin()
 {
   longestdbsequence = db_getlongestsequence();
-  
-  sd = (struct search_data *) xmalloc(sizeof(search_data) * opt_threads);
 
-  for(long t=0; t<opt_threads; t++)
+  sd = static_cast<struct search_data *>
+    (xmalloc(sizeof(search_data) * static_cast<uint64_t>(opt_threads)));
+
+  for(int64_t t=0; t<opt_threads; t++)
     search_alloc(sd+t);
 
   /* start threads */
 
   pthread_attr_init(&attr);
   pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_JOINABLE);
-  
+
   /* allocate memory for thread info */
-  ti = (struct thread_info_s *) xmalloc(opt_threads *
-                                        sizeof(struct thread_info_s));
-  
+  ti = static_cast<struct thread_info_s *>
+    (xmalloc(static_cast<uint64_t>(opt_threads)*sizeof(struct thread_info_s)));
+
   /* init and create worker threads */
-  for(long t=0; t<opt_threads; t++)
+  for(int64_t t=0; t<opt_threads; t++)
     {
       struct thread_info_s * tip = ti + t;
       tip->work = 0;
-      pthread_mutex_init(&tip->workmutex, NULL);
-      pthread_cond_init(&tip->workcond, NULL);
-      if (pthread_create(&tip->pthread, &attr, search_worker, (void*)(long)t))
+      pthread_mutex_init(&tip->workmutex, nullptr);
+      pthread_cond_init(&tip->workcond, nullptr);
+      if (pthread_create(&tip->pthread,
+                         &attr,
+                         search_worker,
+                         reinterpret_cast<void*>(t)))
         fatal("Cannot create thread");
     }
-
 }
 
 void search_end()
 {
   /* finish and clean up worker threads */
 
-  for(long t=0; t<opt_threads; t++)
+  for(int64_t t=0; t<opt_threads; t++)
     {
       struct thread_info_s * tip = ti + t;
-      
+
       /* tell worker to quit */
       pthread_mutex_lock(&tip->workmutex);
       tip->work = -1;
@@ -364,17 +386,17 @@ void search_end()
       pthread_mutex_unlock(&tip->workmutex);
 
       /* wait for worker to quit */
-      if (pthread_join(tip->pthread, NULL))
+      if (pthread_join(tip->pthread, nullptr))
         fatal("Cannot join thread");
 
       pthread_cond_destroy(&tip->workcond);
       pthread_mutex_destroy(&tip->workmutex);
     }
 
-  free(ti);
+  xfree(ti);
   pthread_attr_destroy(&attr);
 
-  for(long t=0; t<opt_threads; t++)
+  for(int64_t t=0; t<opt_threads; t++)
     search_free(sd+t);
-  free(sd);
+  xfree(sd);
 }
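
Note on the hunks above: search_begin() and search_end() bracket the lifetime of a pool of joinable worker threads, each with its own mutex, condition variable, and `work` flag (0 at start, -1 to request shutdown). The following is a minimal sketch of that lifecycle, not the code from this patch. The names `worker_info`, `worker_main`, `run_workers`, and the `jobs_done` counter are hypothetical, the meaning of a positive `work` value is an assumption, and the sketch passes a pointer to the per-thread record where the patch passes the thread index.

```cpp
#include <pthread.h>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <vector>

// Hypothetical per-thread record, loosely modelled on thread_info_s.
struct worker_info
{
  pthread_t thread;
  pthread_mutex_t mutex;
  pthread_cond_t cond;
  int work;           // 0: idle, 1: job pending (assumed), -1: quit
  int64_t jobs_done;  // illustrative counter, not in the original
};

static void * worker_main(void * arg)
{
  worker_info * w = static_cast<worker_info *>(arg);
  pthread_mutex_lock(&w->mutex);
  while (w->work >= 0)
    {
      if (w->work == 0)
        {
          // sleep until the controlling thread changes the flag
          pthread_cond_wait(&w->cond, &w->mutex);
        }
      else
        {
          w->jobs_done++;                 // placeholder for real work
          w->work = 0;
          pthread_cond_signal(&w->cond);  // report completion
        }
    }
  pthread_mutex_unlock(&w->mutex);
  return nullptr;
}

// Start nthreads workers, hand each one jobs_per_thread jobs,
// shut them down, and return the total number of jobs processed.
int64_t run_workers(int64_t nthreads, int jobs_per_thread)
{
  std::vector<worker_info> pool(static_cast<std::size_t>(nthreads));
  pthread_attr_t attr;
  pthread_attr_init(&attr);
  pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_JOINABLE);

  for (auto & w : pool)
    {
      w.work = 0;
      w.jobs_done = 0;
      pthread_mutex_init(&w.mutex, nullptr);
      pthread_cond_init(&w.cond, nullptr);
      if (pthread_create(&w.thread, &attr, worker_main, &w))
        std::abort();  // the patch calls fatal() here
    }

  for (int j = 0; j < jobs_per_thread; j++)
    {
      for (auto & w : pool)  // hand out one job per worker
        {
          pthread_mutex_lock(&w.mutex);
          w.work = 1;
          pthread_cond_signal(&w.cond);
          pthread_mutex_unlock(&w.mutex);
        }
      for (auto & w : pool)  // wait until every worker is idle again
        {
          pthread_mutex_lock(&w.mutex);
          while (w.work > 0)
            pthread_cond_wait(&w.cond, &w.mutex);
          pthread_mutex_unlock(&w.mutex);
        }
    }

  int64_t total = 0;
  for (auto & w : pool)
    {
      pthread_mutex_lock(&w.mutex);  // tell worker to quit
      w.work = -1;
      pthread_cond_signal(&w.cond);
      pthread_mutex_unlock(&w.mutex);

      if (pthread_join(w.thread, nullptr))  // wait for worker to quit
        std::abort();
      total += w.jobs_done;

      pthread_cond_destroy(&w.cond);
      pthread_mutex_destroy(&w.mutex);
    }
  pthread_attr_destroy(&attr);
  return total;
}
```

Calling `run_workers(4, 3)` starts four workers, dispatches three rounds of jobs, and joins every thread before returning. Destroying the condition variable and mutex only after the join follows the same order as search_end().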
diff --git a/src/search16.cc b/src/search16.cc
index fe87631e..57168d57 100644
--- a/src/search16.cc
+++ b/src/search16.cc
@@ -1,7 +1,7 @@
 /*
     SWARM
 
-    Copyright (C) 2012-2017 Torbjorn Rognes and Frederic Mahe
+    Copyright (C) 2012-2019 Torbjorn Rognes and Frederic Mahe
 
     This program is free software: you can redistribute it and/or modify
     it under the terms of the GNU Affero General Public License as
@@ -26,446 +26,473 @@
 #define CHANNELS 8
 #define CDEPTH 4
 
-#define SHUFFLE 1
+#ifdef __aarch64__
+
+typedef int16x8_t VECTORTYPE;
+
+#define CAST_VECTOR_p(x) reinterpret_cast<VECTORTYPE *>(x)
+
+const uint16x8_t neon_mask =
+  {0x0003, 0x000c, 0x0030, 0x00c0, 0x0300, 0x0c00, 0x3000, 0xc000};
+
+#define v_load(a) vld1q_s16((const int16_t *)(a))
+#define v_store(a, b) vst1q_s16((int16_t *)(a), (b))
+#define v_merge_lo_16(a, b) vzip1q_s16((a),(b))
+#define v_merge_hi_16(a, b) vzip2q_s16((a),(b))
+#define v_merge_lo_32(a, b) vreinterpretq_s16_s32(vzip1q_s32 \
+          (vreinterpretq_s32_s16(a), vreinterpretq_s32_s16(b)))
+#define v_merge_hi_32(a, b) vreinterpretq_s16_s32(vzip2q_s32 \
+          (vreinterpretq_s32_s16(a), vreinterpretq_s32_s16(b)))
+#define v_merge_lo_64(a, b) vreinterpretq_s16_s64(vcombine_s64 \
+          (vget_low_s64(vreinterpretq_s64_s16(a)), \
+           vget_low_s64(vreinterpretq_s64_s16(b))))
+#define v_merge_hi_64(a, b) vreinterpretq_s16_s64(vcombine_s64 \
+          (vget_high_s64(vreinterpretq_s64_s16(a)), \
+           vget_high_s64(vreinterpretq_s64_s16(b))))
+#define v_min(a, b) vminq_s16((a), (b))
+#define v_add(a, b) vqaddq_u16((a), (b))
+#define v_sub(a, b) vqsubq_u16((a), (b))
+#define v_dup(a) vdupq_n_s16(a)
+#define v_zero v_dup(0)
+#define v_and(a, b) vandq_u16((a), (b))
+#define v_xor(a, b) veorq_u16((a), (b))
+#define v_shift_left(a) vextq_u16((v_zero), (a), 7)
+#define v_mask_gt(a, b) vaddvq_u16(vandq_u16((vcgtq_s16((a), (b))), neon_mask))
+#define v_mask_eq(a, b) vaddvq_u16(vandq_u16((vceqq_s16((a), (b))), neon_mask))
+
+#elif defined __x86_64__
+
+typedef __m128i VECTORTYPE;
+
+#define CAST_VECTOR_p(x) reinterpret_cast<VECTORTYPE *>(x)
+
+#define v_load(a) _mm_load_si128(CAST_VECTOR_p(a))
+#define v_store(a, b) _mm_store_si128(CAST_VECTOR_p(a), (b))
+#define v_merge_lo_16(a, b) _mm_unpacklo_epi16((a),(b))
+#define v_merge_hi_16(a, b) _mm_unpackhi_epi16((a),(b))
+#define v_merge_lo_32(a, b) _mm_unpacklo_epi32((a),(b))
+#define v_merge_hi_32(a, b) _mm_unpackhi_epi32((a),(b))
+#define v_merge_lo_64(a, b) _mm_unpacklo_epi64((a),(b))
+#define v_merge_hi_64(a, b) _mm_unpackhi_epi64((a),(b))
+#define v_min(a, b) _mm_min_epi16((a), (b))
+#define v_add(a, b) _mm_adds_epu16((a), (b))
+#define v_sub(a, b) _mm_subs_epu16((a), (b))
+#define v_dup(a) _mm_set1_epi16(a)
+#define v_zero v_dup(0)
+#define v_and(a, b) _mm_and_si128((a), (b))
+#define v_xor(a, b) _mm_xor_si128((a), (b))
+#define v_shift_left(a) _mm_slli_si128((a), 2)
+#define v_mask_gt(a, b) static_cast<unsigned short>(_mm_movemask_epi8(_mm_cmpgt_epi16((a), (b))))
+#define v_mask_eq(a, b) static_cast<unsigned short>(_mm_movemask_epi8(_mm_cmpeq_epi16((a), (b))))
+
+#elif defined __PPC__
+
+typedef vector unsigned short VECTORTYPE;
+
+#define CAST_VECTOR_p(x) reinterpret_cast<VECTORTYPE *>(x)
+
+const vector unsigned char perm_merge_long_low =
+  {0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07,
+   0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17};
+
+const vector unsigned char perm_merge_long_high =
+  {0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f,
+   0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+
+const vector unsigned char perm_bits =
+  { 0x78, 0x70, 0x68, 0x60, 0x58, 0x50, 0x48, 0x40,
+    0x38, 0x30, 0x28, 0x20, 0x18, 0x10, 0x08, 0x00 };
+
+#define v_load(a) *CAST_VECTOR_p(a)
+#define v_store(a, b) vec_st((VECTORTYPE)(b), 0, CAST_VECTOR_p(a))
+#define v_merge_lo_16(a, b) vec_mergeh((VECTORTYPE)(a), (VECTORTYPE)(b))
+#define v_merge_hi_16(a, b) vec_mergel((VECTORTYPE)(a), (VECTORTYPE)(b))
+#define v_merge_lo_32(a, b) (VECTORTYPE) vec_mergeh((vector int)(a),    \
+                                                    (vector int)(b))
+#define v_merge_hi_32(a, b) (VECTORTYPE) vec_mergel((vector int)(a),    \
+                                                    (vector int)(b))
+#define v_merge_lo_64(a, b) (VECTORTYPE) vec_perm((vector long long)(a), \
+                                                  (vector long long)(b), \
+                                                  perm_merge_long_low)
+#define v_merge_hi_64(a, b) (VECTORTYPE) vec_perm((vector long long)(a), \
+                                                  (vector long long)(b), \
+                                                  perm_merge_long_high)
+#define v_min(a, b) (VECTORTYPE) vec_min((vector signed short) (a),     \
+                                         (vector signed short) (b))
+#define v_add(a, b) vec_adds((a), (b))
+#define v_sub(a, b) vec_subs((a), (b))
+#define v_dup(a) vec_splats((unsigned short)(a))
+#define v_zero vec_splat_u16(0)
+#define v_and(a, b) vec_and((a), (b))
+#define v_xor(a, b) vec_xor((a), (b))
+#define v_shift_left(a) vec_sld((a), v_zero, 2)
+#define v_mask_gt(a, b) ((vector unsigned short) \
+  vec_vbpermq((vector unsigned char) vec_cmpgt((a), (b)), perm_bits))[4]
+#define v_mask_eq(a, b) ((vector unsigned short) \
+  vec_vbpermq((vector unsigned char) vec_cmpeq((a), (b)), perm_bits))[4]
+
+#else
+
+#error Unknown Architecture
 
-#if 0
-
-/* never used */
+#endif
 
+#if 0
 void dprofile_dump16(WORD * dprofile)
 {
-  char * s = sym_nt;
   printf("\ndprofile:\n");
   for(int i=0; i<32; i++)
-  {
-    printf("%c: ",s[i]);
-    for(int k=0; k<CDEPTH; k++)
     {
-      printf("[");
-      for(int j=0; j<CHANNELS; j++)
-        printf(" %3d", (short) dprofile[CHANNELS*CDEPTH*i + CHANNELS*k + j]);
-      printf("]");
+      printf("%c: ", sym_nt[i]);
+      for(int k=0; k<CDEPTH; k++)
+        {
+          printf("[");
+          for(int j=0; j<CHANNELS; j++)
+            printf(" %3d", (short) dprofile[CHANNELS*CDEPTH*i + CHANNELS*k + j]);
+          printf("]");
+        }
+      printf("\n");
     }
-    printf("\n");
-  }
   exit(1);
 }
-
 #endif
 
+void align_cells_regular_16(VECTORTYPE * Sm,
+                            VECTORTYPE * hep,
+                            VECTORTYPE ** qp,
+                            VECTORTYPE * Qm,
+                            VECTORTYPE * Rm,
+                            uint64_t ql,
+                            VECTORTYPE * F0,
+                            uint64_t * dir_long,
+                            VECTORTYPE * H0);
+
+void align_cells_masked_16(VECTORTYPE * Sm,
+                           VECTORTYPE * hep,
+                           VECTORTYPE ** qp,
+                           VECTORTYPE * Qm,
+                           VECTORTYPE * Rm,
+                           uint64_t ql,
+                           VECTORTYPE * F0,
+                           uint64_t * dir_long,
+                           VECTORTYPE * H0,
+                           VECTORTYPE * Mm,
+                           VECTORTYPE * MQ,
+                           VECTORTYPE * MR,
+                           VECTORTYPE * MQ0);
+
+uint64_t backtrack_16(char * qseq,
+                      char * dseq,
+                      uint64_t qlen,
+                      uint64_t dlen,
+                      uint64_t * dirbuffer,
+                      uint64_t offset,
+                      uint64_t dirbuffersize,
+                      uint64_t channel,
+                      uint64_t * alignmentlengthp);
+
 inline void dprofile_fill16(WORD * dprofile_word,
                             WORD * score_matrix_word,
                             BYTE * dseq)
 {
-  __m128i xmm0,  xmm1,  xmm2,  xmm3,  xmm4,  xmm5,  xmm6,  xmm7;
-  __m128i xmm8,  xmm9,  xmm10, xmm11, xmm12, xmm13, xmm14, xmm15;
-  __m128i xmm16, xmm17, xmm18, xmm19, xmm20, xmm21, xmm22, xmm23;
-  __m128i xmm24, xmm25, xmm26, xmm27, xmm28, xmm29, xmm30, xmm31;
-
-  // clocks? 4*(8+3*(8+4)+8) = 52*4 = 208
-  
-  for (int j=0; j<CDEPTH; j++)
-  {
-    int d[CHANNELS];
-    for(int z=0; z<CHANNELS; z++)
-      d[z] = dseq[j*CHANNELS+z] << 5;
-      
-    // for(int i=0; i<24; i += 8)
-    // for(int i=0; i<32; i += 8)
-    for(int i=0; i<8; i += 8)
+  VECTORTYPE reg0,  reg1,  reg2,  reg3,  reg4,  reg5,  reg6,  reg7;
+  VECTORTYPE reg8,  reg9,  reg10, reg11, reg12, reg13, reg14, reg15;
+  VECTORTYPE reg16, reg17, reg18, reg19, reg20, reg21, reg22, reg23;
+  VECTORTYPE reg24, reg25, reg26, reg27, reg28, reg29, reg30, reg31;
+
+  for (unsigned int j=0; j<CDEPTH; j++)
     {
-      xmm0  = _mm_load_si128((__m128i*)(score_matrix_word + d[0] + i));
-      xmm1  = _mm_load_si128((__m128i*)(score_matrix_word + d[1] + i));
-      xmm2  = _mm_load_si128((__m128i*)(score_matrix_word + d[2] + i));
-      xmm3  = _mm_load_si128((__m128i*)(score_matrix_word + d[3] + i));
-      xmm4  = _mm_load_si128((__m128i*)(score_matrix_word + d[4] + i));
-      xmm5  = _mm_load_si128((__m128i*)(score_matrix_word + d[5] + i));
-      xmm6  = _mm_load_si128((__m128i*)(score_matrix_word + d[6] + i));
-      xmm7  = _mm_load_si128((__m128i*)(score_matrix_word + d[7] + i));
-      
-      xmm8  = _mm_unpacklo_epi16(xmm0,  xmm1);
-      xmm9  = _mm_unpackhi_epi16(xmm0,  xmm1);
-      xmm10 = _mm_unpacklo_epi16(xmm2,  xmm3);
-      xmm11 = _mm_unpackhi_epi16(xmm2,  xmm3);
-      xmm12 = _mm_unpacklo_epi16(xmm4,  xmm5);
-      xmm13 = _mm_unpackhi_epi16(xmm4,  xmm5);
-      xmm14 = _mm_unpacklo_epi16(xmm6,  xmm7);
-      xmm15 = _mm_unpackhi_epi16(xmm6,  xmm7);
-      
-      xmm16 = _mm_unpacklo_epi32(xmm8,  xmm10);
-      xmm17 = _mm_unpackhi_epi32(xmm8,  xmm10);
-      xmm18 = _mm_unpacklo_epi32(xmm12, xmm14);
-      xmm19 = _mm_unpackhi_epi32(xmm12, xmm14);
-      xmm20 = _mm_unpacklo_epi32(xmm9,  xmm11);
-      xmm21 = _mm_unpackhi_epi32(xmm9,  xmm11);
-      xmm22 = _mm_unpacklo_epi32(xmm13, xmm15);
-      xmm23 = _mm_unpackhi_epi32(xmm13, xmm15);
-      
-      xmm24 = _mm_unpacklo_epi64(xmm16, xmm18);
-      xmm25 = _mm_unpackhi_epi64(xmm16, xmm18);
-      xmm26 = _mm_unpacklo_epi64(xmm17, xmm19);
-      xmm27 = _mm_unpackhi_epi64(xmm17, xmm19);
-      xmm28 = _mm_unpacklo_epi64(xmm20, xmm22);
-      xmm29 = _mm_unpackhi_epi64(xmm20, xmm22);
-      xmm30 = _mm_unpacklo_epi64(xmm21, xmm23);
-      xmm31 = _mm_unpackhi_epi64(xmm21, xmm23);
-      
-      _mm_store_si128((__m128i*)(dprofile_word + CDEPTH*CHANNELS*(i+0) + CHANNELS*j), xmm24);
-      _mm_store_si128((__m128i*)(dprofile_word + CDEPTH*CHANNELS*(i+1) + CHANNELS*j), xmm25);
-      _mm_store_si128((__m128i*)(dprofile_word + CDEPTH*CHANNELS*(i+2) + CHANNELS*j), xmm26);
-      _mm_store_si128((__m128i*)(dprofile_word + CDEPTH*CHANNELS*(i+3) + CHANNELS*j), xmm27);
-      _mm_store_si128((__m128i*)(dprofile_word + CDEPTH*CHANNELS*(i+4) + CHANNELS*j), xmm28);
-      _mm_store_si128((__m128i*)(dprofile_word + CDEPTH*CHANNELS*(i+5) + CHANNELS*j), xmm29);
-      _mm_store_si128((__m128i*)(dprofile_word + CDEPTH*CHANNELS*(i+6) + CHANNELS*j), xmm30);
-      _mm_store_si128((__m128i*)(dprofile_word + CDEPTH*CHANNELS*(i+7) + CHANNELS*j), xmm31);
+      unsigned int d[CHANNELS];
+      for(unsigned int z=0; z<CHANNELS; z++)
+        d[z] = (static_cast<unsigned int>(dseq[j*CHANNELS+z])) << 5;
+
+      for(int i=0; i<8; i += 8)
+        {
+          reg0  = v_load(score_matrix_word + d[0] + i);
+          reg1  = v_load(score_matrix_word + d[1] + i);
+          reg2  = v_load(score_matrix_word + d[2] + i);
+          reg3  = v_load(score_matrix_word + d[3] + i);
+          reg4  = v_load(score_matrix_word + d[4] + i);
+          reg5  = v_load(score_matrix_word + d[5] + i);
+          reg6  = v_load(score_matrix_word + d[6] + i);
+          reg7  = v_load(score_matrix_word + d[7] + i);
+
+          reg8  = v_merge_lo_16(reg0,  reg1);
+          reg9  = v_merge_hi_16(reg0,  reg1);
+          reg10 = v_merge_lo_16(reg2,  reg3);
+          reg11 = v_merge_hi_16(reg2,  reg3);
+          reg12 = v_merge_lo_16(reg4,  reg5);
+          reg13 = v_merge_hi_16(reg4,  reg5);
+          reg14 = v_merge_lo_16(reg6,  reg7);
+          reg15 = v_merge_hi_16(reg6,  reg7);
+
+          reg16 = v_merge_lo_32(reg8,  reg10);
+          reg17 = v_merge_hi_32(reg8,  reg10);
+          reg18 = v_merge_lo_32(reg12, reg14);
+          reg19 = v_merge_hi_32(reg12, reg14);
+          reg20 = v_merge_lo_32(reg9,  reg11);
+          reg21 = v_merge_hi_32(reg9,  reg11);
+          reg22 = v_merge_lo_32(reg13, reg15);
+          reg23 = v_merge_hi_32(reg13, reg15);
+
+          reg24 = v_merge_lo_64(reg16, reg18);
+          reg25 = v_merge_hi_64(reg16, reg18);
+          reg26 = v_merge_lo_64(reg17, reg19);
+          reg27 = v_merge_hi_64(reg17, reg19);
+          reg28 = v_merge_lo_64(reg20, reg22);
+          reg29 = v_merge_hi_64(reg20, reg22);
+          reg30 = v_merge_lo_64(reg21, reg23);
+          reg31 = v_merge_hi_64(reg21, reg23);
+
+          v_store(dprofile_word + CDEPTH*CHANNELS*(i+0) + CHANNELS*j, reg24);
+          v_store(dprofile_word + CDEPTH*CHANNELS*(i+1) + CHANNELS*j, reg25);
+          v_store(dprofile_word + CDEPTH*CHANNELS*(i+2) + CHANNELS*j, reg26);
+          v_store(dprofile_word + CDEPTH*CHANNELS*(i+3) + CHANNELS*j, reg27);
+          v_store(dprofile_word + CDEPTH*CHANNELS*(i+4) + CHANNELS*j, reg28);
+          v_store(dprofile_word + CDEPTH*CHANNELS*(i+5) + CHANNELS*j, reg29);
+          v_store(dprofile_word + CDEPTH*CHANNELS*(i+6) + CHANNELS*j, reg30);
+          v_store(dprofile_word + CDEPTH*CHANNELS*(i+7) + CHANNELS*j, reg31);
+        }
     }
-  }
 #if 0
   dprofile_dump16(dprofile_word);
 #endif
 }
 
-/* 
-   Sorry for the assembler code below. This code was originally written
-   several years ago when compilers were not that good at compiling
-   intrinsics to optimal code.
-   Similar code using intrinsics instead of assembler is available in
-   the vsearch codebase.
-*/
+inline void onestep_16(VECTORTYPE & H,
+                       VECTORTYPE & N,
+                       VECTORTYPE & F,
+                       VECTORTYPE V,
+                       unsigned short * DIR,
+                       VECTORTYPE & E,
+                       VECTORTYPE QR,
+                       VECTORTYPE R)
+{
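+  H = v_add(H, V);              /* add the substitution cost V to H */
+  *(DIR+0) = v_mask_gt(H, F);   /* mask of lanes where H > F */
+  H = v_min(H, F);
+  H = v_min(H, E);              /* H is now the minimum of H, F and E */
+  *(DIR+1) = v_mask_eq(H, E);   /* mask of lanes where E is the minimum */
+  N = H;                        /* output the final cell value */
+  H = v_add(H, QR);             /* add the QR term to H */
+  F = v_add(F, R);              /* add R to F */
+  E = v_add(E, R);              /* add R to E */
+  *(DIR+2) = v_mask_gt(H, F);   /* mask of lanes where H + QR > F + R */
+  *(DIR+3) = v_mask_gt(H, E);   /* mask of lanes where H + QR > E + R */
+  F = v_min(H, F);
+  E = v_min(H, E);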
+}
 
-// Due to the use of pminsw instead of pminuw (which is sse4) below,
-// the code works only with 15-bit values
-
-#define INITIALIZE                                      \
-  "        movq      %3, %%rax               \n"        \
-  "        movdqa    (%%rax), %%xmm14        \n"        \
-  "        movq      %4, %%rax               \n"        \
-  "        movdqa    (%%rax), %%xmm15        \n"        \
-  "        movq      %9, %%rax               \n"        \
-  "        movdqa    (%%rax), %%xmm0         \n"        \
-  "        movdqa    (%7), %%xmm7            \n"        \
-  "        movdqa    %%xmm7, %%xmm3          \n"        \
-  "        psubusw   %%xmm14, %%xmm3         \n"        \
-  "        movdqa    %%xmm3, %%xmm1          \n"        \
-  "        paddusw   %%xmm15, %%xmm3         \n"        \
-  "        movdqa    %%xmm3, %%xmm2          \n"        \
-  "        paddusw   %%xmm15, %%xmm3         \n"        \
-  "        movdqa    %%xmm7, %%xmm4          \n"        \
-  "        paddusw   %%xmm15, %%xmm7         \n"        \
-  "        movdqa    %%xmm7, %%xmm5          \n"        \
-  "        paddusw   %%xmm15, %%xmm7         \n"        \
-  "        movdqa    %%xmm7, %%xmm6          \n"        \
-  "        paddusw   %%xmm15, %%xmm7         \n"        \
-  "        movq      %5, %%r12               \n"        \
-  "        shlq      $3, %%r12               \n"        \
-  "        movq      %%r12, %%r10            \n"        \
-  "        andq      $-16, %%r10             \n"        \
-  "        xorq      %%r11, %%r11            \n" 
-
-#define ONESTEP(H, N, F, V, DIR)                        \
-  "        paddusw   " V ", " H "            \n"        \
-  "        movdqa    " H ", %%xmm13          \n"        \
-  "        pcmpgtw   " F ", %%xmm13          \n"        \
-  "        pmovmskb  %%xmm13, %%edx          \n"        \
-  "        movw      %%dx, 0+" DIR "         \n"        \
-  "        pminsw    " F ", " H "            \n"        \
-  "        pminsw    %%xmm12, " H "          \n"        \
-  "        movdqa    " H ", %%xmm13          \n"        \
-  "        pcmpeqw   %%xmm12, %%xmm13        \n"        \
-  "        pmovmskb  %%xmm13, %%edx          \n"        \
-  "        movw      %%dx, 2+" DIR "         \n"        \
-  "        movdqa    " H ", " N "            \n"        \
-  "        paddusw   %%xmm14, " H "          \n"        \
-  "        paddusw   %%xmm15, " F "          \n"        \
-  "        paddusw   %%xmm15, %%xmm12        \n"        \
-  "        movdqa    " H ", %%xmm13          \n"        \
-  "        pcmpgtw   " F ", %%xmm13          \n"        \
-  "        pmovmskb  %%xmm13, %%edx          \n"        \
-  "        movw      %%dx, 4+" DIR "         \n"        \
-  "        movdqa    " H ", %%xmm13          \n"        \
-  "        pcmpgtw   %%xmm12, %%xmm13        \n"        \
-  "        pmovmskb  %%xmm13, %%edx          \n"        \
-  "        movw      %%dx, 6+" DIR "         \n"        \
-  "        pminsw    " H ", %%xmm12          \n"        \
-  "        pminsw    " H ", " F "            \n"
-
-
-inline void donormal16(__m128i * Sm,
-                       __m128i * hep,
-                       __m128i ** qp,
-                       __m128i * Qm,
-                       __m128i * Rm,
-                       long ql,
-                       __m128i * Zm,
-                       __m128i * F0,
-                       unsigned long * dir,
-                       __m128i * H0
-                      )
+void align_cells_regular_16(VECTORTYPE * Sm,
+                            VECTORTYPE * hep,
+                            VECTORTYPE ** qp,
+                            VECTORTYPE * Qm,
+                            VECTORTYPE * Rm,
+                            uint64_t ql,
+                            VECTORTYPE * F0,
+                            uint64_t * dir_long,
+                            VECTORTYPE * H0)
 {
-  __asm__
-    __volatile__
-    ( 
-     INITIALIZE
-     
-     "        jmp       2f                  \n"
-     
-     "1:      movq      0(%2,%%r11,1), %%rax    \n" // load x from qp[qi]
-     "        movdqa    0(%1,%%r11,4), %%xmm8   \n" // load N0
-     "        movdqa    16(%1,%%r11,4), %%xmm12 \n" // load E
-     
-     ONESTEP("%%xmm0", "%%xmm9",        "%%xmm4", " 0(%%rax)", " 0(%8,%%r11,4)")
-     ONESTEP("%%xmm1", "%%xmm10",       "%%xmm5", "16(%%rax)", " 8(%8,%%r11,4)")
-     ONESTEP("%%xmm2", "%%xmm11",       "%%xmm6", "32(%%rax)", "16(%8,%%r11,4)")
-     ONESTEP("%%xmm3", "0(%1,%%r11,4)", "%%xmm7", "48(%%rax)", "24(%8,%%r11,4)")
-     
-     "        movdqa    %%xmm12, 16(%1,%%r11,4) \n" // save E
-     "        movq      8(%2,%%r11,1), %%rax    \n" // load x from qp[qi+1]
-     "        movdqa    32(%1,%%r11,4), %%xmm0  \n" // load H0
-     "        movdqa    48(%1,%%r11,4), %%xmm12 \n" // load E
-     
-     ONESTEP("%%xmm8",  "%%xmm1",           "%%xmm4", "0(%%rax)" , "32(%8,%%r11,4)")
-     ONESTEP("%%xmm9",  "%%xmm2",           "%%xmm5", "16(%%rax)", "40(%8,%%r11,4)")
-     ONESTEP("%%xmm10", "%%xmm3",           "%%xmm6", "32(%%rax)", "48(%8,%%r11,4)")
-     ONESTEP("%%xmm11", "32(%1,%%r11,4)",   "%%xmm7", "48(%%rax)", "56(%8,%%r11,4)")
-     
-     "        movdqa    %%xmm12, 48(%1,%%r11,4) \n" // save E
-     "        addq      $16, %%r11              \n" // qi++
-     "2:      cmpq      %%r11, %%r10            \n" // qi = ql4 ?
-     "        jne       1b                      \n" // loop
-     
-     "4:      cmpq      %%r11, %%r12            \n" 
-     "        je        3f                      \n"
-     "        movq      0(%2,%%r11,1), %%rax    \n" // load x from qp[qi]
-     "        movdqa    16(%1,%%r11,4), %%xmm12 \n" // load E
-     
-     ONESTEP("%%xmm0",  "%%xmm9",          "%%xmm4", "0(%%rax)" , " 0(%8,%%r11,4)")
-     ONESTEP("%%xmm1",  "%%xmm10",         "%%xmm5", "16(%%rax)", " 8(%8,%%r11,4)")
-     ONESTEP("%%xmm2",  "%%xmm11",         "%%xmm6", "32(%%rax)", "16(%8,%%r11,4)")
-     ONESTEP("%%xmm3",  "0(%1,%%r11,4)",   "%%xmm7", "48(%%rax)", "24(%8,%%r11,4)")
-     
-     "        movdqa    %%xmm12, 16(%1,%%r11,4) \n" // save E
-     
-     "        movdqa    %%xmm9, %%xmm1          \n"
-     "        movdqa    %%xmm10, %%xmm2         \n"
-     "        movdqa    %%xmm11, %%xmm3         \n"
-     "        movdqa    0(%1,%%r11,4), %%xmm4   \n"
-     "        jmp       5f                      \n"
-     
-     "3:      movdqa    -32(%1,%%r11,4), %%xmm4 \n"
-     
-     "5:      movq      %0, %%rax               \n" // save final Hs
-     "        movdqa    %%xmm1, (%%rax)         \n"
-     "        addq      $16, %%rax              \n"
-     "        movdqa    %%xmm2, (%%rax)         \n"
-     "        addq      $16, %%rax              \n"
-     "        movdqa    %%xmm3, (%%rax)         \n"
-     "        addq      $16, %%rax              \n"
-     "        movdqa    %%xmm4, (%%rax)         \n"
-     
-     : 
-     : "m"(Sm), "r"(hep),  "r"(qp), "m"(Qm), 
-       "m"(Rm), "r"(ql),   "m"(Zm), "r"(F0),
-       "r"(dir),"m"(H0)
-       
-     : "xmm0",  "xmm1",  "xmm2",  "xmm3",
-       "xmm4",  "xmm5",  "xmm6",  "xmm7",
-       "xmm8",  "xmm9",  "xmm10", "xmm11", 
-       "xmm12", "xmm13", "xmm14", "xmm15",
-       "rax",   "r10",   "r11",   "r12",
-       "rdx",   "cc"
-      );
+  VECTORTYPE Q, R, E;
+  VECTORTYPE h0, h1, h2, h3, h4, h5, h6, h7, h8;
+  VECTORTYPE f0, f1, f2, f3;
+
+  unsigned short * dir = reinterpret_cast<unsigned short *>(dir_long);
+
+  Q = *Qm;
+  R = *Rm;
+
+  f0 = *F0;
+  f1 = v_add(f0, R);
+  f2 = v_add(f1, R);
+  f3 = v_add(f2, R);
+
+  h0 = *H0;
+  h1 = v_sub(f0, Q);
+  h2 = v_add(h1, R);
+  h3 = v_add(h2, R);
+  h4 = v_zero;
+  h5 = v_zero;
+  h6 = v_zero;
+  h7 = v_zero;
+  h8 = v_zero;
+
+  for(uint64_t i = 0; i < ql; i++)
+    {
+      VECTORTYPE * x;
+
+      x = qp[i + 0];
+      h4 = hep[2*i + 0];
+      E  = hep[2*i + 1];
+      onestep_16(h0, h5, f0, x[0], dir + 16*i +  0, E, Q, R);
+      onestep_16(h1, h6, f1, x[1], dir + 16*i +  4, E, Q, R);
+      onestep_16(h2, h7, f2, x[2], dir + 16*i +  8, E, Q, R);
+      onestep_16(h3, h8, f3, x[3], dir + 16*i + 12, E, Q, R);
+      hep[2*i + 0] = h8;
+      hep[2*i + 1] = E;
+      h0 = h4;
+      h1 = h5;
+      h2 = h6;
+      h3 = h7;
+    }
+
+  Sm[0] = h5;
+  Sm[1] = h6;
+  Sm[2] = h7;
+  Sm[3] = h8;
 }
 
-inline void domasked16(__m128i * Sm,
-                       __m128i * hep,
-                       __m128i ** qp,
-                       __m128i * Qm, 
-                       __m128i * Rm, 
-                       long ql,      
-                       __m128i * Zm,
-                       __m128i * F0,
-                       unsigned long * dir,
-                       __m128i * H0,
-                       __m128i * Mm,
-                       __m128i * MQ,
-                       __m128i * MR,
-                       __m128i * MQ0)
+void align_cells_masked_16(VECTORTYPE * Sm,
+                           VECTORTYPE * hep,
+                           VECTORTYPE ** qp,
+                           VECTORTYPE * Qm,
+                           VECTORTYPE * Rm,
+                           uint64_t ql,
+                           VECTORTYPE * F0,
+                           uint64_t * dir_long,
+                           VECTORTYPE * H0,
+                           VECTORTYPE * Mm,
+                           VECTORTYPE * MQ,
+                           VECTORTYPE * MR,
+                           VECTORTYPE * MQ0)
 {
-  
-  __asm__
-    __volatile__
-    (
-     INITIALIZE
-
-     "        jmp       2f                       \n"
-     
-     "1:      movq      0(%2,%%r11,1), %%rax     \n" // load x from qp[qi]
-     "        movdqa    0(%1,%%r11,4), %%xmm8    \n" // load N0
-     "        movdqa    16(%1,%%r11,4), %%xmm12  \n" // load E
-     "        movdqa    (%11), %%xmm13           \n" 
-     "        psubusw   (%10), %%xmm8            \n" // mask N0
-     "        psubusw   (%10), %%xmm12           \n" // mask E
-     "        paddusw   %%xmm13, %%xmm8          \n" // init N0
-     "        paddusw   %%xmm13, %%xmm12         \n" // init E
-     "        paddusw   (%13), %%xmm12           \n" // fix E
-     "        paddusw   (%12), %%xmm13           \n" // update
-     "        movdqa    %%xmm13, (%11)           \n"
-     
-     ONESTEP("%%xmm0",  "%%xmm9",          "%%xmm4", "0(%%rax)" , " 0(%8,%%r11,4)")
-     ONESTEP("%%xmm1",  "%%xmm10",         "%%xmm5", "16(%%rax)", " 8(%8,%%r11,4)")
-     ONESTEP("%%xmm2",  "%%xmm11",         "%%xmm6", "32(%%rax)", "16(%8,%%r11,4)")
-     ONESTEP("%%xmm3",  "0(%1,%%r11,4)",   "%%xmm7", "48(%%rax)", "24(%8,%%r11,4)")
-     
-     "        movdqa    %%xmm12, 16(%1,%%r11,4)  \n" // save E
-
-     "        movq      8(%2,%%r11,1), %%rax     \n" // load x from qp[qi+1]
-     "        movdqa    32(%1,%%r11,4), %%xmm0   \n" // load H0
-     "        movdqa    48(%1,%%r11,4), %%xmm12  \n" // load E
-     "        movdqa    (%11), %%xmm13           \n"
-     "        psubusw   (%10), %%xmm0            \n" // mask H0
-     "        psubusw   (%10), %%xmm12           \n" // mask E
-     "        paddusw   %%xmm13, %%xmm0          \n"
-     "        paddusw   %%xmm13, %%xmm12         \n"
-     "        paddusw   (%13), %%xmm12           \n" // fix E
-     "        paddusw   (%12), %%xmm13           \n"
-     "        movdqa    %%xmm13, (%11)           \n"
-     
-     ONESTEP("%%xmm8",  "%%xmm1",           "%%xmm4", "0(%%rax)" , "32(%8,%%r11,4)")
-     ONESTEP("%%xmm9",  "%%xmm2",           "%%xmm5", "16(%%rax)", "40(%8,%%r11,4)")
-     ONESTEP("%%xmm10", "%%xmm3",           "%%xmm6", "32(%%rax)", "48(%8,%%r11,4)")
-     ONESTEP("%%xmm11", "32(%1,%%r11,4)",   "%%xmm7", "48(%%rax)", "56(%8,%%r11,4)")
-     
-     "        movdqa    %%xmm12, 48(%1,%%r11,4)  \n" // save E
-     "        addq      $16, %%r11               \n" // qi++
-     "2:      cmpq      %%r11, %%r10             \n" // qi = ql4 ?
-     "        jne       1b                       \n" // loop
-     
-     "        cmpq      %%r11, %%r12             \n" 
-     "        je        3f                       \n"
-     "        movq      0(%2,%%r11,1), %%rax     \n" // load x from qp[qi]
-     "        movdqa    16(%1,%%r11,4), %%xmm12  \n" // load E
-     "        movdqa    (%11), %%xmm13           \n"
-     "        psubusw   (%10), %%xmm12           \n" // mask E
-     "        paddusw   %%xmm13, %%xmm12         \n"
-     "        paddusw   (%13), %%xmm12           \n" // fix E
-     "        paddusw   (%12), %%xmm13           \n"
-     "        movdqa    %%xmm13, (%11)           \n"
-     
-     ONESTEP("%%xmm0",  "%%xmm9",          "%%xmm4", "0(%%rax)" , " 0(%8,%%r11,4)")
-     ONESTEP("%%xmm1",  "%%xmm10",         "%%xmm5", "16(%%rax)", " 8(%8,%%r11,4)")
-     ONESTEP("%%xmm2",  "%%xmm11",         "%%xmm6", "32(%%rax)", "16(%8,%%r11,4)")
-     ONESTEP("%%xmm3",  "0(%1,%%r11,4)",   "%%xmm7", "48(%%rax)", "24(%8,%%r11,4)")
-     
-     "        movdqa    %%xmm12, 16(%1,%%r11,4)  \n" // save E
-     
-     "        movdqa    %%xmm9, %%xmm1           \n"
-     "        movdqa    %%xmm10, %%xmm2          \n"
-     "        movdqa    %%xmm11, %%xmm3          \n"
-     "        movdqa    0(%1,%%r11,4), %%xmm4    \n"
-     "        jmp       5f                       \n"
-     
-     "3:      movdqa    -32(%1,%%r11,4), %%xmm4  \n"
-     
-     "5:      movq      %0, %%rax                \n" // save final Hs
-     "        movdqa    %%xmm1, (%%rax)          \n"
-     "        addq      $16, %%rax               \n"
-     "        movdqa    %%xmm2, (%%rax)          \n"
-     "        addq      $16, %%rax               \n"
-     "        movdqa    %%xmm3, (%%rax)          \n"
-     "        addq      $16, %%rax               \n"
-     "        movdqa    %%xmm4, (%%rax)          \n"
-     
-     : 
-     
-     : "m"(Sm), "r"(hep),"r"(qp), "m"(Qm), 
-       "m"(Rm), "r"(ql), "m"(Zm), "r"(F0),
-       "r"(dir),
-       "m"(H0), "r"(Mm), "r"(MQ), "r"(MR),
-       "r"(MQ0)
-       
-     : "xmm0",  "xmm1",  "xmm2",  "xmm3",
-       "xmm4",  "xmm5",  "xmm6",  "xmm7",
-       "xmm8",  "xmm9",  "xmm10", "xmm11", 
-       "xmm12", "xmm13", "xmm14", "xmm15",
-       "rax",   "r10",   "r11",   "r12",
-       "rdx",   "cc"
-     );
+  VECTORTYPE Q, R, E;
+  VECTORTYPE h0, h1, h2, h3, h4, h5, h6, h7, h8;
+  VECTORTYPE f0, f1, f2, f3;
+
+  unsigned short * dir = reinterpret_cast<unsigned short *>(dir_long);
+
+  Q = *Qm;
+  R = *Rm;
+
+  f0 = *F0;
+  f1 = v_add(f0, R);
+  f2 = v_add(f1, R);
+  f3 = v_add(f2, R);
+
+  h0 = *H0;
+  h1 = v_sub(f0, Q);
+  h2 = v_add(h1, R);
+  h3 = v_add(h2, R);
+  h4 = v_zero;
+  h5 = v_zero;
+  h6 = v_zero;
+  h7 = v_zero;
+  h8 = v_zero;
+
+  for(uint64_t i = 0; i < ql; i++)
+    {
+      VECTORTYPE * x;
+
+      h4 = hep[2*i + 0];
+      E  = hep[2*i + 1];
+      x = qp[i + 0];
+
+      /* mask h4 and E */
+      h4 = v_sub(h4, *Mm);
+      E  = v_sub(E,  *Mm);
+
+      /* init h4 and E */
+      h4 = v_add(h4, *MQ);
+      E  = v_add(E,  *MQ);
+      E  = v_add(E,  *MQ0);
+
+      /* update MQ */
+      *MQ = v_add(*MQ,  *MR);
+
+      onestep_16(h0, h5, f0, x[0], dir + 16*i +  0, E, Q, R);
+      onestep_16(h1, h6, f1, x[1], dir + 16*i +  4, E, Q, R);
+      onestep_16(h2, h7, f2, x[2], dir + 16*i +  8, E, Q, R);
+      onestep_16(h3, h8, f3, x[3], dir + 16*i + 12, E, Q, R);
+      hep[2*i + 0] = h8;
+      hep[2*i + 1] = E;
+
+      h0 = h4;
+      h1 = h5;
+      h2 = h6;
+      h3 = h7;
+    }
+
+  Sm[0] = h5;
+  Sm[1] = h6;
+  Sm[2] = h7;
+  Sm[3] = h8;
 }
 
-unsigned long backtrack16(char * qseq,
-                          char * dseq,
-                          unsigned long qlen,
-                          unsigned long dlen,
-                          unsigned long * dirbuffer,
-                          unsigned long offset,
-                          unsigned long dirbuffersize,
-                          unsigned long channel,
-                          unsigned long * alignmentlengthp)
+uint64_t backtrack_16(char * qseq,
+                      char * dseq,
+                      uint64_t qlen,
+                      uint64_t dlen,
+                      uint64_t * dirbuffer,
+                      uint64_t offset,
+                      uint64_t dirbuffersize,
+                      uint64_t channel,
+                      uint64_t * alignmentlengthp)
 {
-  unsigned long maskup      = 3UL << (2*channel+ 0);
-  unsigned long maskleft    = 3UL << (2*channel+16);
-  unsigned long maskextup   = 3UL << (2*channel+32);
-  unsigned long maskextleft = 3UL << (2*channel+48);
+  uint64_t maskup      = 3ULL << (2*channel+ 0);
+  uint64_t maskleft    = 3ULL << (2*channel+16);
+  uint64_t maskextup   = 3ULL << (2*channel+32);
+  uint64_t maskextleft = 3ULL << (2*channel+48);
 
 #if 0
 
   printf("Dumping backtracking array\n");
 
-  for(unsigned long i=0; i<qlen; i++)
-  {
-    for(unsigned long j=0; j<dlen; j++)
+  for(uint64_t i=0; i<qlen; i++)
     {
-      unsigned long d = dirbuffer[(offset + longestdbsequence*4*(j/4)
-                                   + 4*i + (j&3)) % dirbuffersize];
-      if (d & maskleft)
-      {
-        printf("<");
-      }
-      else if (d & maskup)
-      {
-        printf("^");
-      }
-      else
-      {
-        printf("\\");
-      }
+      for(uint64_t j=0; j<dlen; j++)
+        {
+          uint64_t d = dirbuffer[(offset + longestdbsequence * 4 * (j / 4)
+                                  + 4 * i + (j & 3)) % dirbuffersize];
+          if (d & maskleft)
+            {
+              printf("<");
+            }
+          else if (d & maskup)
+            {
+              printf("^");
+            }
+          else
+            {
+              printf("\\");
+            }
+        }
+      printf("\n");
     }
-    printf("\n");
-  }
 
   printf("Dumping gap extension array\n");
 
-  for(unsigned long i=0; i<qlen; i++)
-  {
-    for(unsigned long j=0; j<dlen; j++)
+  for(uint64_t i=0; i<qlen; i++)
     {
-      unsigned long d = dirbuffer[(offset + longestdbsequence*4*(j/4)
-                                   + 4*i + (j&3)) % dirbuffersize];
-      if (d & maskextup)
-      {
-        if (d & maskextleft)
-          printf("+");
-        else
-          printf("^");
-      }
-      else if (d & maskextleft)
-      {
-        printf("<");
-      }
-      else
-      {
-        printf("\\");
-      }
+      for(uint64_t j=0; j<dlen; j++)
+        {
+          uint64_t d = dirbuffer[(offset + longestdbsequence * 4 * (j / 4)
+                                  + 4 * i + (j & 3)) % dirbuffersize];
+          if (d & maskextup)
+            {
+              if (d & maskextleft)
+                printf("+");
+              else
+                printf("^");
+            }
+          else if (d & maskextleft)
+            {
+              printf("<");
+            }
+          else
+            {
+              printf("\\");
+            }
+        }
+      printf("\n");
     }
-    printf("\n");
-  }
 
 #endif
 
-  long i = qlen - 1;
-  long j = dlen - 1;
-  unsigned long aligned = 0;
-  unsigned long matches = 0;
+  int64_t i = static_cast<int64_t>(qlen) - 1;
+  int64_t j = static_cast<int64_t>(dlen) - 1;
+  uint64_t aligned = 0;
+  uint64_t matches = 0;
   char op = 0;
 
 #undef SHOWALIGNMENT
@@ -473,46 +500,50 @@ unsigned long backtrack16(char * qseq,
   printf("alignment, reversed: ");
 #endif
 
-  while ((i>=0) && (j>=0))
-  {
-    aligned++;
+  while ((i >= 0) && (j >= 0))
+    {
+      aligned++;
 
-    unsigned long d = 
-      dirbuffer[(offset + longestdbsequence*4*(j/4) + 4*i + (j&3)) % dirbuffersize];
+      uint64_t d
+        = dirbuffer[(offset
+                     + longestdbsequence * 4 * static_cast<uint64_t>(j / 4)
+                     + static_cast<uint64_t>(4 * i + (j & 3)))
+                    % dirbuffersize];
 
-    if ((op == 'I') && (d & maskextleft))
-    {
-      j--;
-    }
-    else if ((op == 'D') && (d & maskextup))
-    {
-      i--;
-    }
-    else if (d & maskleft)
-    {
-      j--;
-      op = 'I';
-    }
-    else if (d & maskup)
-    {
-      i--;
-      op = 'D';
-    }
-    else
-    {
-      if (qseq[i] == dseq[j])
-        matches++;
-      i--;
-      j--;
-      op = 'M';
-    }
+      if ((op == 'I') && (d & maskextleft))
+        {
+          j--;
+        }
+      else if ((op == 'D') && (d & maskextup))
+        {
+          i--;
+        }
+      else if (d & maskleft)
+        {
+          j--;
+          op = 'I';
+        }
+      else if (d & maskup)
+        {
+          i--;
+          op = 'D';
+        }
+      else
+        {
+          if (nt_extract(qseq, static_cast<uint64_t>(i)) ==
+              nt_extract(dseq, static_cast<uint64_t>(j)))
+            matches++;
+          i--;
+          j--;
+          op = 'M';
+        }
 
 #ifdef SHOWALIGNMENT
-    printf("%c", op);
+      printf("%c", op);
 #endif
-  }
+    }
 
-  while (i>=0)
+  while (i >= 0)
     {
       aligned++;
       i--;
@@ -521,7 +552,7 @@ unsigned long backtrack16(char * qseq,
 #endif
     }
 
-  while (j>=0)
+  while (j >= 0)
     {
       aligned++;
       j--;
@@ -544,233 +575,239 @@ void search16(WORD * * q_start,
               WORD * score_matrix,
               WORD * dprofile,
               WORD * hearray,
-              unsigned long sequences,
-              unsigned long * seqnos,
-              unsigned long * scores,
-              unsigned long * diffs,
-              unsigned long * alignmentlengths,
-              unsigned long qlen,
-              unsigned long dirbuffersize,
-              unsigned long * dirbuffer)
+              uint64_t sequences,
+              uint64_t * seqnos,
+              uint64_t * scores,
+              uint64_t * diffs,
+              uint64_t * alignmentlengths,
+              uint64_t qlen,
+              uint64_t dirbuffersize,
+              uint64_t * dirbuffer)
 {
-  __m128i Q, R, T, M, T0, MQ, MR, MQ0;
-  __m128i *hep, **qp;
-
-  BYTE * d_begin[CHANNELS];
-  BYTE * d_end[CHANNELS];
-  unsigned long d_offset[CHANNELS];
-  BYTE * d_address[CHANNELS];
-  unsigned long d_length[CHANNELS];
-  
-  __m128i dseqalloc[CDEPTH];
-  
-  __m128i H0;
-  __m128i F0;
-  __m128i S[4];
-
-  BYTE * dseq = (BYTE*) & dseqalloc;
-  BYTE zero;
-
-  long seq_id[CHANNELS];
-  unsigned long next_id = 0;
-  unsigned long done;
-  
-  T0 = _mm_set_epi16(0, 0, 0, 0, 0, 0, 0, -1);
-  Q  = _mm_set1_epi16(gap_open_penalty+gap_extend_penalty);
-  R  = _mm_set1_epi16(gap_extend_penalty);
-
-  zero = 0;
+  VECTORTYPE Q, R, T, M, T0, MQ, MR, MQ0;
+  VECTORTYPE *hep, **qp;
+
+  uint64_t d_pos[CHANNELS];
+  uint64_t d_offset[CHANNELS];
+  char * d_address[CHANNELS];
+  uint64_t d_length[CHANNELS];
+
+  VECTORTYPE dseqalloc[CDEPTH];
+
+  VECTORTYPE H0;
+  VECTORTYPE F0;
+  VECTORTYPE S[4];
+
+  BYTE * dseq = reinterpret_cast<BYTE*>(& dseqalloc);
+
+  int64_t seq_id[CHANNELS];
+  uint64_t next_id = 0;
+  uint64_t done;
+
+#ifdef __aarch64__
+  const VECTORTYPE T0_init = { -1, 0, 0, 0, 0, 0, 0, 0 };
+#elif defined __x86_64__
+  const VECTORTYPE T0_init = _mm_set_epi16(0, 0, 0, 0, 0, 0, 0, -1);
+#elif defined __PPC__
+  const VECTORTYPE T0_init = { (unsigned short)(-1), 0, 0, 0, 0, 0, 0, 0 };
+#endif
+
+  T0 = T0_init;
+
+  Q  = v_dup(static_cast<short>(gap_open_penalty + gap_extend_penalty));
+  R  = v_dup(static_cast<short>(gap_extend_penalty));
+
   done = 0;
 
-  hep = (__m128i*) hearray;
-  qp = (__m128i**) q_start;
+  hep = CAST_VECTOR_p(hearray);
+  qp = reinterpret_cast<VECTORTYPE**>(q_start);
 
   for (int c=0; c<CHANNELS; c++)
-  {
-    d_begin[c] = &zero;
-    d_end[c] = d_begin[c];
-    seq_id[c] = -1;
-  }
-  
-  F0 = _mm_setzero_si128();
-  H0 = _mm_setzero_si128();
-  
+    {
+      d_address[c] = nullptr;
+      d_pos[c] = 0;
+      d_length[c] = 0;
+      seq_id[c] = -1;
+    }
+
+  F0 = v_zero;
+  H0 = v_zero;
+
   int easy = 0;
 
-  unsigned long * dir = dirbuffer;
+  uint64_t * dir = dirbuffer;
 
   while(1)
-  {
-
-    if (easy)
     {
-      // fill all channels
 
-      for(int c=0; c<CHANNELS; c++)
-      {
-        for(int j=0; j<CDEPTH; j++)
+      if (easy)
         {
-          if (d_begin[c] < d_end[c])
-            dseq[CHANNELS*j+c] = *(d_begin[c]++);
-          else
-            dseq[CHANNELS*j+c] = 0;
-        }
-        if (d_begin[c] == d_end[c])
-          easy = 0;
-      }
+          // fill all channels
 
-      if (ssse3_present)
-        dprofile_shuffle16(dprofile, score_matrix, dseq);
-      else
-        dprofile_fill16(dprofile, score_matrix, dseq);
-      
-      donormal16(S, hep, qp, &Q, &R, qlen, 0, &F0, dir, &H0);
-    }
-    else
-    {
-      // One or more sequences ended in the previous block 
-      // We have to switch over to a new sequence
+          for(int c=0; c<CHANNELS; c++)
+            {
+              for(int j=0; j<CDEPTH; j++)
+                {
+                  if (d_pos[c] < d_length[c])
+                    dseq[CHANNELS*j+c]
+                      = 1 + nt_extract(d_address[c], d_pos[c]++);
+                  else
+                    dseq[CHANNELS*j+c] = 0;
+                }
+              if (d_pos[c] == d_length[c])
+                easy = 0;
+            }
 
-      easy = 1;
+#ifdef __x86_64__
+          if (ssse3_present)
+            dprofile_shuffle16(dprofile, score_matrix, dseq);
+          else
+#endif
+            dprofile_fill16(dprofile, score_matrix, dseq);
 
-      M = _mm_setzero_si128();
-      T = T0;
-      for (int c=0; c<CHANNELS; c++)
-      {
-        if (d_begin[c] < d_end[c])
-        {
-          // this channel has more sequence
-
-          for(int j=0; j<CDEPTH; j++)
-          {
-            if (d_begin[c] < d_end[c])
-              dseq[CHANNELS*j+c] = *(d_begin[c]++);
-            else
-              dseq[CHANNELS*j+c] = 0;
-          }
-          if (d_begin[c] == d_end[c])
-            easy = 0;
+          align_cells_regular_16(S, hep, qp, &Q, &R, qlen, &F0, dir, &H0);
         }
-        else
+      else
         {
-          // sequence in channel c ended
-          // change of sequence
-
-          M = _mm_xor_si128(M, T);
-
-          long cand_id = seq_id[c];
-          
-          if (cand_id >= 0)
-          {
-            // printf("Completed channel %d, sequence %ld\n", c, cand_id);
-            // save score
-
-            char * dbseq = (char*) d_address[c];
-            long dbseqlen = d_length[c];
-            long z = (dbseqlen+3) % 4;
-            long score = ((WORD*)S)[z*CHANNELS+c];
-            scores[cand_id] = score;
-            
-            unsigned long diff;
-
-            if (score < 65535)
-              {
-                long offset = d_offset[c];
-                diff = backtrack16(query.seq, dbseq, qlen, dbseqlen,
-                                   dirbuffer,
-                                   offset,
-                                   dirbuffersize, c,
-                                   alignmentlengths + cand_id);
-              }
-            else
-              {
-                diff = MIN((65535 / penalty_mismatch),
-                           (65535 - penalty_gapopen) / penalty_gapextend);
-              }
-            
-            diffs[cand_id] = diff;
-
-            done++;
-          }
-
-          if (next_id < sequences)
-          {
-            // get next sequence
-            seq_id[c] = next_id;
-            long seqno = seqnos[next_id];
-            char* address;
-            long length;
-
-            db_getsequenceandlength(seqno, & address, & length);
-            
-            // printf("Seqno: %ld Address: %p\n", seqno, address);
-            d_address[c] = (BYTE*) address;
-            d_length[c] = length;
-
-            d_begin[c] = (unsigned char*) address;
-            d_end[c] = (unsigned char*) address + length;
-            d_offset[c] = dir - dirbuffer;
-            next_id++;
-            
-            ((WORD*)&H0)[c] = 0;
-            ((WORD*)&F0)[c] = 2 * gap_open_penalty + 2 * gap_extend_penalty;
-            
-            
-            // fill channel
-            for(int j=0; j<CDEPTH; j++)
+          // One or more sequences ended in the previous block
+          // We have to switch over to a new sequence
+
+          easy = 1;
+
+          M = v_zero;
+          T = T0;
+          for (unsigned int c = 0; c < CHANNELS; c++)
             {
-              if (d_begin[c] < d_end[c])
-                dseq[CHANNELS*j+c] = *(d_begin[c]++);
+              if (d_pos[c] < d_length[c])
+                {
+                  // this channel has more sequence
+
+                  for(unsigned int j = 0; j < CDEPTH; j++)
+                    {
+                      if (d_pos[c] < d_length[c])
+                        dseq[CHANNELS * j + c]
+                          = 1 + nt_extract(d_address[c], d_pos[c]++);
+                      else
+                        dseq[CHANNELS*j+c] = 0;
+                    }
+                  if (d_pos[c] == d_length[c])
+                    easy = 0;
+                }
               else
-                dseq[CHANNELS*j+c] = 0;
+                {
+                  // sequence in channel c ended
+                  // change of sequence
+
+                  M = v_xor(M, T);
+
+                  int64_t cand_id = seq_id[c];
+
+                  if (cand_id >= 0)
+                    {
+                      // save score
+
+                      char * dbseq = d_address[c];
+                      uint64_t dbseqlen = d_length[c];
+                      uint64_t z = (dbseqlen+3) % 4;
+                      uint64_t score
+                        = (reinterpret_cast<WORD*>(S))[z * CHANNELS + c];
+                      scores[cand_id] = score;
+
+                      uint64_t diff;
+
+                      if (score < 65535)
+                        {
+                          uint64_t offset = d_offset[c];
+                          diff = backtrack_16(query.seq, dbseq, qlen, dbseqlen,
+                                              dirbuffer,
+                                              offset,
+                                              dirbuffersize, c,
+                                              alignmentlengths + cand_id);
+                        }
+                      else
+                        {
+                          diff = static_cast<uint64_t>
+                            (MIN((65535 / penalty_mismatch),
+                                 (65535 - penalty_gapopen)
+                                 / penalty_gapextend));
+                        }
+
+                      diffs[cand_id] = diff;
+
+                      done++;
+                    }
+
+                  if (next_id < sequences)
+                    {
+                      // get next sequence
+                      seq_id[c] = static_cast<int64_t>(next_id);
+                      uint64_t seqno = seqnos[next_id];
+                      char* address;
+                      unsigned int length;
+
+                      db_getsequenceandlength(seqno, & address, & length);
+
+                      d_address[c] = address;
+                      d_length[c] = length;
+
+                      d_pos[c] = 0;
+                      d_offset[c] = static_cast<uint64_t>(dir - dirbuffer);
+                      next_id++;
+
+                      (reinterpret_cast<WORD*>(&H0))[c] = 0;
+                      (reinterpret_cast<WORD*>(&F0))[c] = 2 * gap_open_penalty + 2 * gap_extend_penalty;
+
+                      // fill channel
+                      for(unsigned int j = 0; j < CDEPTH; j++)
+                        {
+                          if (d_pos[c] < d_length[c])
+                            dseq[CHANNELS*j+c] = 1 + nt_extract(d_address[c], d_pos[c]++);
+                          else
+                            dseq[CHANNELS*j+c] = 0;
+                        }
+                      if (d_pos[c] == d_length[c])
+                        easy = 0;
+                    }
+                  else
+                    {
+                      // no more sequences, empty channel
+                      seq_id[c] = -1;
+                      d_address[c] = nullptr;
+                      d_pos[c] = 0;
+                      d_length[c] = 0;
+                      for (unsigned int j=0; j<CDEPTH; j++)
+                        dseq[CHANNELS*j+c] = 0;
+                    }
+                }
+
+              T = v_shift_left(T);
             }
-            if (d_begin[c] == d_end[c])
-              easy = 0;
-          }
-          else
-          {
-            // no more sequences, empty channel
-            seq_id[c] = -1;
-            d_begin[c] = &zero;
-            d_end[c] = d_begin[c];
-            for (int j=0; j<CDEPTH; j++)
-              dseq[CHANNELS*j+c] = 0;
-          }
 
+          if (done == sequences)
+            break;
 
-        }
+#ifdef __x86_64__
+          if (ssse3_present)
+            dprofile_shuffle16(dprofile, score_matrix, dseq);
+          else
+#endif
+            dprofile_fill16(dprofile, score_matrix, dseq);
 
-        T = _mm_slli_si128(T, 2);
-      }
+          MQ = v_and(M, Q);
+          MR = v_and(M, R);
+          MQ0 = MQ;
 
-      if (done == sequences)
-        break;
-          
-      if (ssse3_present)
-        dprofile_shuffle16(dprofile, score_matrix, dseq);
-      else
-        dprofile_fill16(dprofile, score_matrix, dseq);
-          
-      MQ = _mm_and_si128(M, Q);
-      MR = _mm_and_si128(M, R);
-      MQ0 = MQ;
-      
-      domasked16(S, hep, qp, &Q, &R, qlen, 0, &F0, dir, &H0, &M, &MQ, &MR,
-                 &MQ0);
-    }
-    
-    F0 = _mm_adds_epu16(F0, R);
-    F0 = _mm_adds_epu16(F0, R);
-    F0 = _mm_adds_epu16(F0, R);
-    H0 = _mm_subs_epu16(F0, Q);
-    F0 = _mm_adds_epu16(F0, R);
+          align_cells_masked_16(S, hep, qp, &Q, &R, qlen, &F0, dir, &H0, &M, &MQ, &MR, &MQ0);
+        }
 
+      F0 = v_add(F0, R);
+      F0 = v_add(F0, R);
+      F0 = v_add(F0, R);
+      H0 = v_sub(F0, Q);
+      F0 = v_add(F0, R);
 
-    dir += 4*longestdbsequence;
-    
-    if (dir >= dirbuffer + dirbuffersize)
-    {
-      dir -= dirbuffersize;
+      dir += 4*longestdbsequence;
+      if (dir >= dirbuffer + dirbuffersize)
+        dir -= dirbuffersize;
     }
-  }
 }
diff --git a/src/search8.cc b/src/search8.cc
index a12d4803..7d1c5dad 100644
--- a/src/search8.cc
+++ b/src/search8.cc
@@ -1,7 +1,7 @@
 /*
     SWARM
 
-    Copyright (C) 2012-2017 Torbjorn Rognes and Frederic Mahe
+    Copyright (C) 2012-2019 Torbjorn Rognes and Frederic Mahe
 
     This program is free software: you can redistribute it and/or modify
     it under the terms of the GNU Affero General Public License as
@@ -26,730 +26,710 @@
 #define CHANNELS 16
 #define CDEPTH 4
 
-#define MATRIXWIDTH 16
+/* uses 16 unsigned 8-bit values */
+
+#ifdef __aarch64__
+
+typedef int8x16_t VECTORTYPE;
+
+#define CAST_VECTOR_p(x) reinterpret_cast<VECTORTYPE *>(x)
+
+const uint8x16_t neon_mask =
+  { 0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80,
+    0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80 };
+
+const uint16x8_t neon_shift = { 0, 0, 0, 0, 8, 8, 8, 8 };
+
+#define v_load_64(a) vld1q_dup_u64((const uint64_t *)(a))
+#define v_store(a, b) vst1q_s8((int8_t *)(a), (b))
+#define v_merge_lo_8(a, b) vzip1q_s8((a),(b))
+#define v_merge_lo_16(a, b) vzip1q_s16((a),(b))
+#define v_merge_hi_16(a, b) vzip2q_s16((a),(b))
+#define v_merge_lo_32(a, b) vreinterpretq_s16_s32(vzip1q_s32 \
+          (vreinterpretq_s32_s16(a), vreinterpretq_s32_s16(b)))
+#define v_merge_hi_32(a, b) vreinterpretq_s16_s32(vzip2q_s32 \
+          (vreinterpretq_s32_s16(a), vreinterpretq_s32_s16(b)))
+#define v_merge_lo_64(a, b) vreinterpretq_s16_s64(vcombine_s64 \
+          (vget_low_s64(vreinterpretq_s64_s16(a)), \
+           vget_low_s64(vreinterpretq_s64_s16(b))))
+#define v_merge_hi_64(a, b) vreinterpretq_s16_s64(vcombine_s64 \
+          (vget_high_s64(vreinterpretq_s64_s16(a)), \
+           vget_high_s64(vreinterpretq_s64_s16(b))))
+#define v_min(a, b) vminq_u8((a), (b))
+#define v_add(a, b) vqaddq_u8((a), (b))
+#define v_sub(a, b) vqsubq_u8((a), (b))
+#define v_dup(a) vdupq_n_u8(a)
+#define v_zero v_dup(0)
+#define v_and(a, b) vandq_u8((a), (b))
+#define v_xor(a, b) veorq_u8((a), (b))
+#define v_shift_left(a) vextq_u8((v_zero), (a), 15)
+#define v_mask_eq(a, b) vaddvq_u16(vshlq_u16(vpaddlq_u8(vandq_u8 \
+          ((vceqq_s8((a), (b))), neon_mask)), neon_shift))
+
+#elif defined __x86_64__
+
+typedef __m128i VECTORTYPE;
+
+#define CAST_VECTOR_p(x) reinterpret_cast<VECTORTYPE *>(x)
+
+#define v_load_64(a) _mm_loadl_epi64(CAST_VECTOR_p(a))
+#define v_store(a, b) _mm_store_si128(CAST_VECTOR_p(a), (b))
+#define v_merge_lo_8(a, b) _mm_unpacklo_epi8((a),(b))
+#define v_merge_lo_16(a, b) _mm_unpacklo_epi16((a),(b))
+#define v_merge_hi_16(a, b) _mm_unpackhi_epi16((a),(b))
+#define v_merge_lo_32(a, b) _mm_unpacklo_epi32((a),(b))
+#define v_merge_hi_32(a, b) _mm_unpackhi_epi32((a),(b))
+#define v_merge_lo_64(a, b) _mm_unpacklo_epi64((a),(b))
+#define v_merge_hi_64(a, b) _mm_unpackhi_epi64((a),(b))
+#define v_min(a, b) _mm_min_epu8((a), (b))
+#define v_add(a, b) _mm_adds_epu8((a), (b))
+#define v_sub(a, b) _mm_subs_epu8((a), (b))
+#define v_dup(a) _mm_set1_epi8(a)
+#define v_zero v_dup(0)
+#define v_and(a, b) _mm_and_si128((a), (b))
+#define v_xor(a, b) _mm_xor_si128((a), (b))
+#define v_shift_left(a) _mm_slli_si128((a), 1)
+#define v_mask_eq(a, b) static_cast<unsigned short>(_mm_movemask_epi8(_mm_cmpeq_epi8((a), (b))))
+
+#elif defined __PPC__
+
+typedef vector unsigned char VECTORTYPE;
+
+#define CAST_VECTOR_p(x) reinterpret_cast<VECTORTYPE *>(x)
+
+const vector unsigned char perm_merge_long_low =
+  {0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07,
+   0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17};
+
+const vector unsigned char perm_merge_long_high =
+  {0x08, 0x09, 0x0a, 0x0b, 0x0c, 0x0d, 0x0e, 0x0f,
+   0x18, 0x19, 0x1a, 0x1b, 0x1c, 0x1d, 0x1e, 0x1f};
+
+const vector unsigned char perm_bits =
+  { 0x78, 0x70, 0x68, 0x60, 0x58, 0x50, 0x48, 0x40,
+    0x38, 0x30, 0x28, 0x20, 0x18, 0x10, 0x08, 0x00 };
+
+#define v_load_64(a) (VECTORTYPE)vec_splats(*((unsigned long long *)(a)))
+#define v_store(a, b) vec_st((VECTORTYPE)(b), 0, (VECTORTYPE*)(a))
+#define v_merge_lo_8(a, b) vec_mergeh((a), (b))
+#define v_merge_lo_16(a, b) (VECTORTYPE)vec_mergeh((vector short)(a),\
+                                                   (vector short)(b))
+#define v_merge_hi_16(a, b) (VECTORTYPE)vec_mergel((vector short)(a),\
+                                                   (vector short)(b))
+#define v_merge_lo_32(a, b) (VECTORTYPE)vec_mergeh((vector int)(a), \
+                                                   (vector int)(b))
+#define v_merge_hi_32(a, b) (VECTORTYPE)vec_mergel((vector int)(a), \
+                                                   (vector int)(b))
+#define v_merge_lo_64(a, b) (VECTORTYPE)vec_perm((vector long long)(a), \
+                                                 (vector long long)(b), \
+                                                 perm_merge_long_low)
+#define v_merge_hi_64(a, b) (VECTORTYPE)vec_perm((vector long long)(a), \
+                                                 (vector long long)(b), \
+                                                 perm_merge_long_high)
+#define v_min(a, b) vec_min((a), (b))
+#define v_add(a, b) vec_adds((a), (b))
+#define v_sub(a, b) vec_subs((a), (b))
+#define v_dup(a) vec_splats((unsigned char)(a))
+#define v_zero vec_splat_u8(0)
+#define v_and(a, b) vec_and((a), (b))
+#define v_xor(a, b) vec_xor((a), (b))
+#define v_shift_left(a) vec_sld((a), v_zero, 1)
+#define v_mask_eq(a, b) ((vector unsigned short) \
+                         vec_vbpermq((vector unsigned char)             \
+                                     vec_cmpeq((a), (b)), perm_bits))[4]
+
+#else
+
+#error Unknown Architecture
 
-#if 0
+#endif
 
-/* never used */
+void align_cells_regular_8(VECTORTYPE * Sm,
+                           VECTORTYPE * hep,
+                           VECTORTYPE ** qp,
+                           VECTORTYPE * Qm,
+                           VECTORTYPE * Rm,
+                           uint64_t ql,
+                           VECTORTYPE * F0,
+                           uint64_t * dir_long,
+                           VECTORTYPE * H0);
+
+void align_cells_masked_8(VECTORTYPE * Sm,
+                          VECTORTYPE * hep,
+                          VECTORTYPE ** qp,
+                          VECTORTYPE * Qm,
+                          VECTORTYPE * Rm,
+                          uint64_t ql,
+                          VECTORTYPE * F0,
+                          uint64_t * dir_long,
+                          VECTORTYPE * H0,
+                          VECTORTYPE * Mm,
+                          VECTORTYPE * MQ,
+                          VECTORTYPE * MR,
+                          VECTORTYPE * MQ0);
 
+#if 0
 void dprofile_dump8(BYTE * dprofile)
 {
-  char * ss = sym_nt;
-
   printf("\ndprofile:\n");
   for(int k=0; k<4; k++)
-  {
-    printf("k=%d 0 1 2 3 4 5 6 7 8 9 a b c d e f\n", k);
-    for(int i=0; i<16; i++)
     {
-      printf("%c: ",ss[i]);
-      for(int j=0; j<16; j++)
-        printf("%2d", (char) dprofile[i*64+16*k+j]);
-      printf("\n");
+      printf("k=%d 0 1 2 3 4 5 6 7 8 9 a b c d e f\n", k);
+      for(int i=0; i<16; i++)
+        {
+          printf("%c: ", sym_nt[i]);
+          for(int j = 0; j < 16; j++)
+            printf("%2d", (char) dprofile[i*64+16*k + j]);
+          printf("\n");
+        }
     }
-  }
   printf("\n");
   exit(1);
 }
+#endif
 
-int dumpcounter = 0;
-char lines[4*16*1000];
-
-void dseq_dump8(BYTE * dseq)
+inline void dprofile_fill8(BYTE * dprofile,
+                           BYTE * score_matrix,
+                           BYTE * dseq)
 {
-  char * s = sym_nt;
+  VECTORTYPE reg0,  reg1, reg2,  reg3,  reg4,  reg5,  reg6,  reg7;
+  VECTORTYPE reg8,  reg9, reg10, reg11, reg12, reg13, reg14, reg15;
 
-  if (dumpcounter < 21)
-  {
-    for(int i=0; i<CHANNELS; i++)
+  for(unsigned int j=0; j<CDEPTH; j++)
     {
-      for(int j=0; j<CDEPTH; j++)
-      {
-        lines[4000*i+4*dumpcounter+j] = s[dseq[j*CHANNELS+i]];
-      }
+      unsigned int d[CHANNELS];
+      for(unsigned int i = 0; i < CHANNELS; i++)
+        d[i] = (static_cast<unsigned int>(dseq[j * CHANNELS + i])) << 5;
+
+      reg0  = v_load_64(score_matrix + d[ 0]);
+      reg2  = v_load_64(score_matrix + d[ 2]);
+      reg4  = v_load_64(score_matrix + d[ 4]);
+      reg6  = v_load_64(score_matrix + d[ 6]);
+      reg8  = v_load_64(score_matrix + d[ 8]);
+      reg10 = v_load_64(score_matrix + d[10]);
+      reg12 = v_load_64(score_matrix + d[12]);
+      reg14 = v_load_64(score_matrix + d[14]);
+
+      reg0  = v_merge_lo_8(reg0,  *CAST_VECTOR_p(score_matrix + d[ 1]));
+      reg2  = v_merge_lo_8(reg2,  *CAST_VECTOR_p(score_matrix + d[ 3]));
+      reg4  = v_merge_lo_8(reg4,  *CAST_VECTOR_p(score_matrix + d[ 5]));
+      reg6  = v_merge_lo_8(reg6,  *CAST_VECTOR_p(score_matrix + d[ 7]));
+      reg8  = v_merge_lo_8(reg8,  *CAST_VECTOR_p(score_matrix + d[ 9]));
+      reg10 = v_merge_lo_8(reg10, *CAST_VECTOR_p(score_matrix + d[11]));
+      reg12 = v_merge_lo_8(reg12, *CAST_VECTOR_p(score_matrix + d[13]));
+      reg14 = v_merge_lo_8(reg14, *CAST_VECTOR_p(score_matrix + d[15]));
+
+      reg1 = reg0;
+      reg0 = v_merge_lo_16(reg0, reg2);
+      reg1 = v_merge_hi_16(reg1, reg2);
+      reg5 = reg4;
+      reg4 = v_merge_lo_16(reg4, reg6);
+      reg5 = v_merge_hi_16(reg5, reg6);
+      reg9 = reg8;
+      reg8 = v_merge_lo_16(reg8, reg10);
+      reg9 = v_merge_hi_16(reg9, reg10);
+      reg13 = reg12;
+      reg12 = v_merge_lo_16(reg12, reg14);
+      reg13 = v_merge_hi_16(reg13, reg14);
+
+      reg2  = reg0;
+      reg0  = v_merge_lo_32(reg0, reg4);
+      reg2  = v_merge_hi_32(reg2, reg4);
+      reg6  = reg1;
+      reg1  = v_merge_lo_32(reg1, reg5);
+      reg6  = v_merge_hi_32(reg6, reg5);
+      reg10 = reg8;
+      reg8  = v_merge_lo_32(reg8, reg12);
+      reg10 = v_merge_hi_32(reg10, reg12);
+      reg14 = reg9;
+      reg9  = v_merge_lo_32(reg9, reg13);
+      reg14 = v_merge_hi_32(reg14, reg13);
+
+      reg3  = reg0;
+      reg0  = v_merge_lo_64(reg0, reg8);
+      reg3  = v_merge_hi_64(reg3, reg8);
+      reg7  = reg2;
+      reg2  = v_merge_lo_64(reg2, reg10);
+      reg7  = v_merge_hi_64(reg7, reg10);
+      reg11 = reg1;
+      reg1  = v_merge_lo_64(reg1, reg9);
+      reg11 = v_merge_hi_64(reg11, reg9);
+      reg15 = reg6;
+      reg6  = v_merge_lo_64(reg6, reg14);
+      reg15 = v_merge_hi_64(reg15, reg14);
+
+      v_store(dprofile+16*j+  0, reg0);
+      v_store(dprofile+16*j+ 64, reg3);
+      v_store(dprofile+16*j+128, reg2);
+      v_store(dprofile+16*j+192, reg7);
+      v_store(dprofile+16*j+256, reg1);
+      v_store(dprofile+16*j+320, reg11);
+      v_store(dprofile+16*j+384, reg6);
+      v_store(dprofile+16*j+448, reg15);
+
+
+      // loads not aligned on a 16-byte boundary: cannot load and unpack in one instruction
+
+      reg0  = v_load_64(score_matrix + 8 + d[0 ]);
+      reg1  = v_load_64(score_matrix + 8 + d[1 ]);
+      reg2  = v_load_64(score_matrix + 8 + d[2 ]);
+      reg3  = v_load_64(score_matrix + 8 + d[3 ]);
+      reg4  = v_load_64(score_matrix + 8 + d[4 ]);
+      reg5  = v_load_64(score_matrix + 8 + d[5 ]);
+      reg6  = v_load_64(score_matrix + 8 + d[6 ]);
+      reg7  = v_load_64(score_matrix + 8 + d[7 ]);
+      reg8  = v_load_64(score_matrix + 8 + d[8 ]);
+      reg9  = v_load_64(score_matrix + 8 + d[9 ]);
+      reg10 = v_load_64(score_matrix + 8 + d[10]);
+      reg11 = v_load_64(score_matrix + 8 + d[11]);
+      reg12 = v_load_64(score_matrix + 8 + d[12]);
+      reg13 = v_load_64(score_matrix + 8 + d[13]);
+      reg14 = v_load_64(score_matrix + 8 + d[14]);
+      reg15 = v_load_64(score_matrix + 8 + d[15]);
+
+      reg0  = v_merge_lo_8(reg0,  reg1);
+      reg2  = v_merge_lo_8(reg2,  reg3);
+      reg4  = v_merge_lo_8(reg4,  reg5);
+      reg6  = v_merge_lo_8(reg6,  reg7);
+      reg8  = v_merge_lo_8(reg8,  reg9);
+      reg10 = v_merge_lo_8(reg10, reg11);
+      reg12 = v_merge_lo_8(reg12, reg13);
+      reg14 = v_merge_lo_8(reg14, reg15);
+
+      reg1 = reg0;
+      reg0 = v_merge_lo_16(reg0, reg2);
+      reg1 = v_merge_hi_16(reg1, reg2);
+      reg5 = reg4;
+      reg4 = v_merge_lo_16(reg4, reg6);
+      reg5 = v_merge_hi_16(reg5, reg6);
+      reg9 = reg8;
+      reg8 = v_merge_lo_16(reg8, reg10);
+      reg9 = v_merge_hi_16(reg9, reg10);
+      reg13 = reg12;
+      reg12 = v_merge_lo_16(reg12, reg14);
+      reg13 = v_merge_hi_16(reg13, reg14);
+
+      reg2  = reg0;
+      reg0  = v_merge_lo_32(reg0, reg4);
+      reg2  = v_merge_hi_32(reg2, reg4);
+      reg6  = reg1;
+      reg1  = v_merge_lo_32(reg1, reg5);
+      reg6  = v_merge_hi_32(reg6, reg5);
+      reg10 = reg8;
+      reg8  = v_merge_lo_32(reg8, reg12);
+      reg10 = v_merge_hi_32(reg10, reg12);
+      reg14 = reg9;
+      reg9  = v_merge_lo_32(reg9, reg13);
+      reg14 = v_merge_hi_32(reg14, reg13);
+
+      reg3  = reg0;
+      reg0  = v_merge_lo_64(reg0, reg8);
+      reg3  = v_merge_hi_64(reg3, reg8);
+      reg7  = reg2;
+      reg2  = v_merge_lo_64(reg2, reg10);
+      reg7  = v_merge_hi_64(reg7, reg10);
+      reg11 = reg1;
+      reg1  = v_merge_lo_64(reg1, reg9);
+      reg11 = v_merge_hi_64(reg11, reg9);
+      reg15 = reg6;
+      reg6  = v_merge_lo_64(reg6, reg14);
+      reg15 = v_merge_hi_64(reg15, reg14);
+
+      v_store(dprofile+16*j+512+  0, reg0);
+      v_store(dprofile+16*j+512+ 64, reg3);
+      v_store(dprofile+16*j+512+128, reg2);
+      v_store(dprofile+16*j+512+192, reg7);
+      v_store(dprofile+16*j+512+256, reg1);
+      v_store(dprofile+16*j+512+320, reg11);
+      v_store(dprofile+16*j+512+384, reg6);
+      v_store(dprofile+16*j+512+448, reg15);
+
+
+      reg0  = v_load_64(score_matrix + 16 + d[0 ]);
+      reg2  = v_load_64(score_matrix + 16 + d[2 ]);
+      reg4  = v_load_64(score_matrix + 16 + d[4 ]);
+      reg6  = v_load_64(score_matrix + 16 + d[6 ]);
+      reg8  = v_load_64(score_matrix + 16 + d[8 ]);
+      reg10 = v_load_64(score_matrix + 16 + d[10]);
+      reg12 = v_load_64(score_matrix + 16 + d[12]);
+      reg14 = v_load_64(score_matrix + 16 + d[14]);
+
+      reg0  = v_merge_lo_8(reg0,  *CAST_VECTOR_p(score_matrix + 16 + d[ 1]));
+      reg2  = v_merge_lo_8(reg2,  *CAST_VECTOR_p(score_matrix + 16 + d[ 3]));
+      reg4  = v_merge_lo_8(reg4,  *CAST_VECTOR_p(score_matrix + 16 + d[ 5]));
+      reg6  = v_merge_lo_8(reg6,  *CAST_VECTOR_p(score_matrix + 16 + d[ 7]));
+      reg8  = v_merge_lo_8(reg8,  *CAST_VECTOR_p(score_matrix + 16 + d[ 9]));
+      reg10 = v_merge_lo_8(reg10, *CAST_VECTOR_p(score_matrix + 16 + d[11]));
+      reg12 = v_merge_lo_8(reg12, *CAST_VECTOR_p(score_matrix + 16 + d[13]));
+      reg14 = v_merge_lo_8(reg14, *CAST_VECTOR_p(score_matrix + 16 + d[15]));
+
+      reg1 = reg0;
+      reg0 = v_merge_lo_16(reg0, reg2);
+      reg1 = v_merge_hi_16(reg1, reg2);
+      reg5 = reg4;
+      reg4 = v_merge_lo_16(reg4, reg6);
+      reg5 = v_merge_hi_16(reg5, reg6);
+      reg9 = reg8;
+      reg8 = v_merge_lo_16(reg8, reg10);
+      reg9 = v_merge_hi_16(reg9, reg10);
+      reg13 = reg12;
+      reg12 = v_merge_lo_16(reg12, reg14);
+      reg13 = v_merge_hi_16(reg13, reg14);
+
+      reg2  = reg0;
+      reg0  = v_merge_lo_32(reg0, reg4);
+      reg2  = v_merge_hi_32(reg2, reg4);
+      reg6  = reg1;
+      reg1  = v_merge_lo_32(reg1, reg5);
+      reg6  = v_merge_hi_32(reg6, reg5);
+      reg10 = reg8;
+      reg8  = v_merge_lo_32(reg8, reg12);
+      reg10 = v_merge_hi_32(reg10, reg12);
+      reg14 = reg9;
+      reg9  = v_merge_lo_32(reg9, reg13);
+      reg14 = v_merge_hi_32(reg14, reg13);
+
+      reg3  = reg0;
+      reg0  = v_merge_lo_64(reg0, reg8);
+      reg3  = v_merge_hi_64(reg3, reg8);
+      reg7  = reg2;
+      reg2  = v_merge_lo_64(reg2, reg10);
+      reg7  = v_merge_hi_64(reg7, reg10);
+      reg11 = reg1;
+      reg1  = v_merge_lo_64(reg1, reg9);
+      reg11 = v_merge_hi_64(reg11, reg9);
+      reg15 = reg6;
+      reg6  = v_merge_lo_64(reg6, reg14);
+      reg15 = v_merge_hi_64(reg15, reg14);
+
+      v_store(dprofile+16*j+1024+  0, reg0);
+      v_store(dprofile+16*j+1024+ 64, reg3);
+      v_store(dprofile+16*j+1024+128, reg2);
+      v_store(dprofile+16*j+1024+192, reg7);
+      v_store(dprofile+16*j+1024+256, reg1);
+      v_store(dprofile+16*j+1024+320, reg11);
+      v_store(dprofile+16*j+1024+384, reg6);
+      v_store(dprofile+16*j+1024+448, reg15);
+
+
+      // These loads are not aligned on a 16-byte boundary, so load and unpack cannot be combined in one instruction.
+
+      reg0  = v_load_64(score_matrix + 24 + d[ 0]);
+      reg1  = v_load_64(score_matrix + 24 + d[ 1]);
+      reg2  = v_load_64(score_matrix + 24 + d[ 2]);
+      reg3  = v_load_64(score_matrix + 24 + d[ 3]);
+      reg4  = v_load_64(score_matrix + 24 + d[ 4]);
+      reg5  = v_load_64(score_matrix + 24 + d[ 5]);
+      reg6  = v_load_64(score_matrix + 24 + d[ 6]);
+      reg7  = v_load_64(score_matrix + 24 + d[ 7]);
+      reg8  = v_load_64(score_matrix + 24 + d[ 8]);
+      reg9  = v_load_64(score_matrix + 24 + d[ 9]);
+      reg10 = v_load_64(score_matrix + 24 + d[10]);
+      reg11 = v_load_64(score_matrix + 24 + d[11]);
+      reg12 = v_load_64(score_matrix + 24 + d[12]);
+      reg13 = v_load_64(score_matrix + 24 + d[13]);
+      reg14 = v_load_64(score_matrix + 24 + d[14]);
+      reg15 = v_load_64(score_matrix + 24 + d[15]);
+
+      reg0  = v_merge_lo_8(reg0,  reg1);
+      reg2  = v_merge_lo_8(reg2,  reg3);
+      reg4  = v_merge_lo_8(reg4,  reg5);
+      reg6  = v_merge_lo_8(reg6,  reg7);
+      reg8  = v_merge_lo_8(reg8,  reg9);
+      reg10 = v_merge_lo_8(reg10, reg11);
+      reg12 = v_merge_lo_8(reg12, reg13);
+      reg14 = v_merge_lo_8(reg14, reg15);
+
+      reg1 = reg0;
+      reg0 = v_merge_lo_16(reg0, reg2);
+      reg1 = v_merge_hi_16(reg1, reg2);
+      reg5 = reg4;
+      reg4 = v_merge_lo_16(reg4, reg6);
+      reg5 = v_merge_hi_16(reg5, reg6);
+      reg9 = reg8;
+      reg8 = v_merge_lo_16(reg8, reg10);
+      reg9 = v_merge_hi_16(reg9, reg10);
+      reg13 = reg12;
+      reg12 = v_merge_lo_16(reg12, reg14);
+      reg13 = v_merge_hi_16(reg13, reg14);
+
+      reg2  = reg0;
+      reg0  = v_merge_lo_32(reg0, reg4);
+      reg2  = v_merge_hi_32(reg2, reg4);
+      reg6  = reg1;
+      reg1  = v_merge_lo_32(reg1, reg5);
+      reg6  = v_merge_hi_32(reg6, reg5);
+      reg10 = reg8;
+      reg8  = v_merge_lo_32(reg8, reg12);
+      reg10 = v_merge_hi_32(reg10, reg12);
+      reg14 = reg9;
+      reg9  = v_merge_lo_32(reg9, reg13);
+      reg14 = v_merge_hi_32(reg14, reg13);
+
+      reg3  = reg0;
+      reg0  = v_merge_lo_64(reg0, reg8);
+      reg3  = v_merge_hi_64(reg3, reg8);
+      reg7  = reg2;
+      reg2  = v_merge_lo_64(reg2, reg10);
+      reg7  = v_merge_hi_64(reg7, reg10);
+      reg11 = reg1;
+      reg1  = v_merge_lo_64(reg1, reg9);
+      reg11 = v_merge_hi_64(reg11, reg9);
+      reg15 = reg6;
+      reg6  = v_merge_lo_64(reg6, reg14);
+      reg15 = v_merge_hi_64(reg15, reg14);
+
+      v_store(dprofile+16*j+1536+  0, reg0);
+      v_store(dprofile+16*j+1536+ 64, reg3);
+      v_store(dprofile+16*j+1536+128, reg2);
+      v_store(dprofile+16*j+1536+192, reg7);
+      v_store(dprofile+16*j+1536+256, reg1);
+      v_store(dprofile+16*j+1536+320, reg11);
+      v_store(dprofile+16*j+1536+384, reg6);
+      v_store(dprofile+16*j+1536+448, reg15);
     }
-    dumpcounter++;
-  }
-  else
-  {
-    for(int i=0; i<16; i++)
-    {
-      printf("%.1000s\n", lines+4000*i);
-    }
-    exit(1);
-  }
-}
-
+#if 0
+  dprofile_dump8(dprofile);
 #endif
+}
 
-inline void dprofile_fill8(BYTE * dprofile,
-                           BYTE * score_matrix,
-                           BYTE * dseq)
+inline void onestep_8(VECTORTYPE & H,
+                      VECTORTYPE & N,
+                      VECTORTYPE & F,
+                      VECTORTYPE V,
+                      unsigned short * DIR,
+                      VECTORTYPE & E,
+                      VECTORTYPE QR,
+                      VECTORTYPE R)
 {
-  __m128i xmm0,  xmm1, xmm2,  xmm3,  xmm4,  xmm5,  xmm6,  xmm7;
-  __m128i xmm8,  xmm9, xmm10, xmm11, xmm12, xmm13, xmm14, xmm15;
-  
-  // 4 x 16 db symbols
-  // ca (60x2+68x2)x4 = 976 instructions
-
-  for(int j=0; j<CDEPTH; j++)
-  {
-    unsigned d[CHANNELS];
-    for(int i=0; i<CHANNELS; i++)
-      d[i] = dseq[j*CHANNELS+i] << 5;
-      
-    xmm0  = _mm_loadl_epi64((__m128i*)(score_matrix + d[0] ));
-    xmm2  = _mm_loadl_epi64((__m128i*)(score_matrix + d[2] ));
-    xmm4  = _mm_loadl_epi64((__m128i*)(score_matrix + d[4] ));
-    xmm6  = _mm_loadl_epi64((__m128i*)(score_matrix + d[6] ));
-    xmm8  = _mm_loadl_epi64((__m128i*)(score_matrix + d[8] ));
-    xmm10 = _mm_loadl_epi64((__m128i*)(score_matrix + d[10]));
-    xmm12 = _mm_loadl_epi64((__m128i*)(score_matrix + d[12]));
-    xmm14 = _mm_loadl_epi64((__m128i*)(score_matrix + d[14]));
-
-    xmm0  = _mm_unpacklo_epi8(xmm0,  *(__m128i*)(score_matrix + d[1] ));
-    xmm2  = _mm_unpacklo_epi8(xmm2,  *(__m128i*)(score_matrix + d[3] ));
-    xmm4  = _mm_unpacklo_epi8(xmm4,  *(__m128i*)(score_matrix + d[5] ));
-    xmm6  = _mm_unpacklo_epi8(xmm6,  *(__m128i*)(score_matrix + d[7] ));
-    xmm8  = _mm_unpacklo_epi8(xmm8,  *(__m128i*)(score_matrix + d[9] ));
-    xmm10 = _mm_unpacklo_epi8(xmm10, *(__m128i*)(score_matrix + d[11]));
-    xmm12 = _mm_unpacklo_epi8(xmm12, *(__m128i*)(score_matrix + d[13]));
-    xmm14 = _mm_unpacklo_epi8(xmm14, *(__m128i*)(score_matrix + d[15]));
-      
-    xmm1 = xmm0;
-    xmm0 = _mm_unpacklo_epi16(xmm0, xmm2);
-    xmm1 = _mm_unpackhi_epi16(xmm1, xmm2);
-    xmm5 = xmm4;
-    xmm4 = _mm_unpacklo_epi16(xmm4, xmm6);
-    xmm5 = _mm_unpackhi_epi16(xmm5, xmm6);
-    xmm9 = xmm8;
-    xmm8 = _mm_unpacklo_epi16(xmm8, xmm10);
-    xmm9 = _mm_unpackhi_epi16(xmm9, xmm10);
-    xmm13 = xmm12;
-    xmm12 = _mm_unpacklo_epi16(xmm12, xmm14);
-    xmm13 = _mm_unpackhi_epi16(xmm13, xmm14);
-
-    xmm2  = xmm0;
-    xmm0  = _mm_unpacklo_epi32(xmm0, xmm4);
-    xmm2  = _mm_unpackhi_epi32(xmm2, xmm4);
-    xmm6  = xmm1;
-    xmm1  = _mm_unpacklo_epi32(xmm1, xmm5);
-    xmm6  = _mm_unpackhi_epi32(xmm6, xmm5);
-    xmm10 = xmm8;
-    xmm8  = _mm_unpacklo_epi32(xmm8, xmm12);
-    xmm10 = _mm_unpackhi_epi32(xmm10, xmm12);
-    xmm14 = xmm9;
-    xmm9  = _mm_unpacklo_epi32(xmm9, xmm13);
-    xmm14 = _mm_unpackhi_epi32(xmm14, xmm13);
-      
-    xmm3  = xmm0;
-    xmm0  = _mm_unpacklo_epi64(xmm0, xmm8);
-    xmm3  = _mm_unpackhi_epi64(xmm3, xmm8);
-    xmm7  = xmm2;
-    xmm2  = _mm_unpacklo_epi64(xmm2, xmm10);
-    xmm7  = _mm_unpackhi_epi64(xmm7, xmm10);
-    xmm11 = xmm1;
-    xmm1  = _mm_unpacklo_epi64(xmm1, xmm9);
-    xmm11 = _mm_unpackhi_epi64(xmm11, xmm9);
-    xmm15 = xmm6;
-    xmm6  = _mm_unpacklo_epi64(xmm6, xmm14);
-    xmm15 = _mm_unpackhi_epi64(xmm15, xmm14);
-
-    _mm_store_si128((__m128i*)(dprofile+16*j+  0), xmm0);
-    _mm_store_si128((__m128i*)(dprofile+16*j+ 64), xmm3);
-    _mm_store_si128((__m128i*)(dprofile+16*j+128), xmm2);
-    _mm_store_si128((__m128i*)(dprofile+16*j+192), xmm7);
-    _mm_store_si128((__m128i*)(dprofile+16*j+256), xmm1);
-    _mm_store_si128((__m128i*)(dprofile+16*j+320), xmm11);
-    _mm_store_si128((__m128i*)(dprofile+16*j+384), xmm6);
-    _mm_store_si128((__m128i*)(dprofile+16*j+448), xmm15);
-
-
-    // loads not aligned on 16 byte boundary, cannot load and unpack in one instr.
-
-    xmm0  = _mm_loadl_epi64((__m128i*)(score_matrix + 8 + d[0 ]));
-    xmm1  = _mm_loadl_epi64((__m128i*)(score_matrix + 8 + d[1 ]));
-    xmm2  = _mm_loadl_epi64((__m128i*)(score_matrix + 8 + d[2 ]));
-    xmm3  = _mm_loadl_epi64((__m128i*)(score_matrix + 8 + d[3 ]));
-    xmm4  = _mm_loadl_epi64((__m128i*)(score_matrix + 8 + d[4 ]));
-    xmm5  = _mm_loadl_epi64((__m128i*)(score_matrix + 8 + d[5 ]));
-    xmm6  = _mm_loadl_epi64((__m128i*)(score_matrix + 8 + d[6 ]));
-    xmm7  = _mm_loadl_epi64((__m128i*)(score_matrix + 8 + d[7 ]));
-    xmm8  = _mm_loadl_epi64((__m128i*)(score_matrix + 8 + d[8 ]));
-    xmm9  = _mm_loadl_epi64((__m128i*)(score_matrix + 8 + d[9 ]));
-    xmm10 = _mm_loadl_epi64((__m128i*)(score_matrix + 8 + d[10]));
-    xmm11 = _mm_loadl_epi64((__m128i*)(score_matrix + 8 + d[11]));
-    xmm12 = _mm_loadl_epi64((__m128i*)(score_matrix + 8 + d[12]));
-    xmm13 = _mm_loadl_epi64((__m128i*)(score_matrix + 8 + d[13]));
-    xmm14 = _mm_loadl_epi64((__m128i*)(score_matrix + 8 + d[14]));
-    xmm15 = _mm_loadl_epi64((__m128i*)(score_matrix + 8 + d[15]));
-
-    xmm0  = _mm_unpacklo_epi8(xmm0,  xmm1);
-    xmm2  = _mm_unpacklo_epi8(xmm2,  xmm3);
-    xmm4  = _mm_unpacklo_epi8(xmm4,  xmm5);
-    xmm6  = _mm_unpacklo_epi8(xmm6,  xmm7);
-    xmm8  = _mm_unpacklo_epi8(xmm8,  xmm9);
-    xmm10 = _mm_unpacklo_epi8(xmm10, xmm11);
-    xmm12 = _mm_unpacklo_epi8(xmm12, xmm13);
-    xmm14 = _mm_unpacklo_epi8(xmm14, xmm15);
-      
-    xmm1 = xmm0;
-    xmm0 = _mm_unpacklo_epi16(xmm0, xmm2);
-    xmm1 = _mm_unpackhi_epi16(xmm1, xmm2);
-    xmm5 = xmm4;
-    xmm4 = _mm_unpacklo_epi16(xmm4, xmm6);
-    xmm5 = _mm_unpackhi_epi16(xmm5, xmm6);
-    xmm9 = xmm8;
-    xmm8 = _mm_unpacklo_epi16(xmm8, xmm10);
-    xmm9 = _mm_unpackhi_epi16(xmm9, xmm10);
-    xmm13 = xmm12;
-    xmm12 = _mm_unpacklo_epi16(xmm12, xmm14);
-    xmm13 = _mm_unpackhi_epi16(xmm13, xmm14);
-
-    xmm2  = xmm0;
-    xmm0  = _mm_unpacklo_epi32(xmm0, xmm4);
-    xmm2  = _mm_unpackhi_epi32(xmm2, xmm4);
-    xmm6  = xmm1;
-    xmm1  = _mm_unpacklo_epi32(xmm1, xmm5);
-    xmm6  = _mm_unpackhi_epi32(xmm6, xmm5);
-    xmm10 = xmm8;
-    xmm8  = _mm_unpacklo_epi32(xmm8, xmm12);
-    xmm10 = _mm_unpackhi_epi32(xmm10, xmm12);
-    xmm14 = xmm9;
-    xmm9  = _mm_unpacklo_epi32(xmm9, xmm13);
-    xmm14 = _mm_unpackhi_epi32(xmm14, xmm13);
-      
-    xmm3  = xmm0;
-    xmm0  = _mm_unpacklo_epi64(xmm0, xmm8);
-    xmm3  = _mm_unpackhi_epi64(xmm3, xmm8);
-    xmm7  = xmm2;
-    xmm2  = _mm_unpacklo_epi64(xmm2, xmm10);
-    xmm7  = _mm_unpackhi_epi64(xmm7, xmm10);
-    xmm11 = xmm1;
-    xmm1  = _mm_unpacklo_epi64(xmm1, xmm9);
-    xmm11 = _mm_unpackhi_epi64(xmm11, xmm9);
-    xmm15 = xmm6;
-    xmm6  = _mm_unpacklo_epi64(xmm6, xmm14);
-    xmm15 = _mm_unpackhi_epi64(xmm15, xmm14);
-
-    _mm_store_si128((__m128i*)(dprofile+16*j+512+  0), xmm0);
-    _mm_store_si128((__m128i*)(dprofile+16*j+512+ 64), xmm3);
-    _mm_store_si128((__m128i*)(dprofile+16*j+512+128), xmm2);
-    _mm_store_si128((__m128i*)(dprofile+16*j+512+192), xmm7);
-    _mm_store_si128((__m128i*)(dprofile+16*j+512+256), xmm1);
-    _mm_store_si128((__m128i*)(dprofile+16*j+512+320), xmm11);
-    _mm_store_si128((__m128i*)(dprofile+16*j+512+384), xmm6);
-    _mm_store_si128((__m128i*)(dprofile+16*j+512+448), xmm15);
-
-
-    xmm0  = _mm_loadl_epi64((__m128i*)(score_matrix + 16 + d[0 ]));
-    xmm2  = _mm_loadl_epi64((__m128i*)(score_matrix + 16 + d[2 ]));
-    xmm4  = _mm_loadl_epi64((__m128i*)(score_matrix + 16 + d[4 ]));
-    xmm6  = _mm_loadl_epi64((__m128i*)(score_matrix + 16 + d[6 ]));
-    xmm8  = _mm_loadl_epi64((__m128i*)(score_matrix + 16 + d[8 ]));
-    xmm10 = _mm_loadl_epi64((__m128i*)(score_matrix + 16 + d[10]));
-    xmm12 = _mm_loadl_epi64((__m128i*)(score_matrix + 16 + d[12]));
-    xmm14 = _mm_loadl_epi64((__m128i*)(score_matrix + 16 + d[14]));
-
-    xmm0  = _mm_unpacklo_epi8(xmm0,  *(__m128i*)(score_matrix + 16 + d[1 ]));
-    xmm2  = _mm_unpacklo_epi8(xmm2,  *(__m128i*)(score_matrix + 16 + d[3 ]));
-    xmm4  = _mm_unpacklo_epi8(xmm4,  *(__m128i*)(score_matrix + 16 + d[5 ]));
-    xmm6  = _mm_unpacklo_epi8(xmm6,  *(__m128i*)(score_matrix + 16 + d[7 ]));
-    xmm8  = _mm_unpacklo_epi8(xmm8,  *(__m128i*)(score_matrix + 16 + d[9 ]));
-    xmm10 = _mm_unpacklo_epi8(xmm10, *(__m128i*)(score_matrix + 16 + d[11 ]));
-    xmm12 = _mm_unpacklo_epi8(xmm12, *(__m128i*)(score_matrix + 16 + d[13 ]));
-    xmm14 = _mm_unpacklo_epi8(xmm14, *(__m128i*)(score_matrix + 16 + d[15 ]));
-      
-    xmm1 = xmm0;
-    xmm0 = _mm_unpacklo_epi16(xmm0, xmm2);
-    xmm1 = _mm_unpackhi_epi16(xmm1, xmm2);
-    xmm5 = xmm4;
-    xmm4 = _mm_unpacklo_epi16(xmm4, xmm6);
-    xmm5 = _mm_unpackhi_epi16(xmm5, xmm6);
-    xmm9 = xmm8;
-    xmm8 = _mm_unpacklo_epi16(xmm8, xmm10);
-    xmm9 = _mm_unpackhi_epi16(xmm9, xmm10);
-    xmm13 = xmm12;
-    xmm12 = _mm_unpacklo_epi16(xmm12, xmm14);
-    xmm13 = _mm_unpackhi_epi16(xmm13, xmm14);
-
-    xmm2  = xmm0;
-    xmm0  = _mm_unpacklo_epi32(xmm0, xmm4);
-    xmm2  = _mm_unpackhi_epi32(xmm2, xmm4);
-    xmm6  = xmm1;
-    xmm1  = _mm_unpacklo_epi32(xmm1, xmm5);
-    xmm6  = _mm_unpackhi_epi32(xmm6, xmm5);
-    xmm10 = xmm8;
-    xmm8  = _mm_unpacklo_epi32(xmm8, xmm12);
-    xmm10 = _mm_unpackhi_epi32(xmm10, xmm12);
-    xmm14 = xmm9;
-    xmm9  = _mm_unpacklo_epi32(xmm9, xmm13);
-    xmm14 = _mm_unpackhi_epi32(xmm14, xmm13);
-      
-    xmm3  = xmm0;
-    xmm0  = _mm_unpacklo_epi64(xmm0, xmm8);
-    xmm3  = _mm_unpackhi_epi64(xmm3, xmm8);
-    xmm7  = xmm2;
-    xmm2  = _mm_unpacklo_epi64(xmm2, xmm10);
-    xmm7  = _mm_unpackhi_epi64(xmm7, xmm10);
-    xmm11 = xmm1;
-    xmm1  = _mm_unpacklo_epi64(xmm1, xmm9);
-    xmm11 = _mm_unpackhi_epi64(xmm11, xmm9);
-    xmm15 = xmm6;
-    xmm6  = _mm_unpacklo_epi64(xmm6, xmm14);
-    xmm15 = _mm_unpackhi_epi64(xmm15, xmm14);
-
-    _mm_store_si128((__m128i*)(dprofile+16*j+1024+  0), xmm0);
-    _mm_store_si128((__m128i*)(dprofile+16*j+1024+ 64), xmm3);
-    _mm_store_si128((__m128i*)(dprofile+16*j+1024+128), xmm2);
-    _mm_store_si128((__m128i*)(dprofile+16*j+1024+192), xmm7);
-    _mm_store_si128((__m128i*)(dprofile+16*j+1024+256), xmm1);
-    _mm_store_si128((__m128i*)(dprofile+16*j+1024+320), xmm11);
-    _mm_store_si128((__m128i*)(dprofile+16*j+1024+384), xmm6);
-    _mm_store_si128((__m128i*)(dprofile+16*j+1024+448), xmm15);
-
-
-    // loads not aligned on 16 byte boundary, cannot load and unpack in one instr.
-
-    xmm0  = _mm_loadl_epi64((__m128i*)(score_matrix + 24 + d[0 ]));
-    xmm1  = _mm_loadl_epi64((__m128i*)(score_matrix + 24 + d[1 ]));
-    xmm2  = _mm_loadl_epi64((__m128i*)(score_matrix + 24 + d[2 ]));
-    xmm3  = _mm_loadl_epi64((__m128i*)(score_matrix + 24 + d[3 ]));
-    xmm4  = _mm_loadl_epi64((__m128i*)(score_matrix + 24 + d[4 ]));
-    xmm5  = _mm_loadl_epi64((__m128i*)(score_matrix + 24 + d[5 ]));
-    xmm6  = _mm_loadl_epi64((__m128i*)(score_matrix + 24 + d[6 ]));
-    xmm7  = _mm_loadl_epi64((__m128i*)(score_matrix + 24 + d[7 ]));
-    xmm8  = _mm_loadl_epi64((__m128i*)(score_matrix + 24 + d[8 ]));
-    xmm9  = _mm_loadl_epi64((__m128i*)(score_matrix + 24 + d[9 ]));
-    xmm10 = _mm_loadl_epi64((__m128i*)(score_matrix + 24 + d[10]));
-    xmm11 = _mm_loadl_epi64((__m128i*)(score_matrix + 24 + d[11]));
-    xmm12 = _mm_loadl_epi64((__m128i*)(score_matrix + 24 + d[12]));
-    xmm13 = _mm_loadl_epi64((__m128i*)(score_matrix + 24 + d[13]));
-    xmm14 = _mm_loadl_epi64((__m128i*)(score_matrix + 24 + d[14]));
-    xmm15 = _mm_loadl_epi64((__m128i*)(score_matrix + 24 + d[15]));
-
-    xmm0  = _mm_unpacklo_epi8(xmm0,  xmm1);
-    xmm2  = _mm_unpacklo_epi8(xmm2,  xmm3);
-    xmm4  = _mm_unpacklo_epi8(xmm4,  xmm5);
-    xmm6  = _mm_unpacklo_epi8(xmm6,  xmm7);
-    xmm8  = _mm_unpacklo_epi8(xmm8,  xmm9);
-    xmm10 = _mm_unpacklo_epi8(xmm10, xmm11);
-    xmm12 = _mm_unpacklo_epi8(xmm12, xmm13);
-    xmm14 = _mm_unpacklo_epi8(xmm14, xmm15);
-      
-    xmm1 = xmm0;
-    xmm0 = _mm_unpacklo_epi16(xmm0, xmm2);
-    xmm1 = _mm_unpackhi_epi16(xmm1, xmm2);
-    xmm5 = xmm4;
-    xmm4 = _mm_unpacklo_epi16(xmm4, xmm6);
-    xmm5 = _mm_unpackhi_epi16(xmm5, xmm6);
-    xmm9 = xmm8;
-    xmm8 = _mm_unpacklo_epi16(xmm8, xmm10);
-    xmm9 = _mm_unpackhi_epi16(xmm9, xmm10);
-    xmm13 = xmm12;
-    xmm12 = _mm_unpacklo_epi16(xmm12, xmm14);
-    xmm13 = _mm_unpackhi_epi16(xmm13, xmm14);
-
-    xmm2  = xmm0;
-    xmm0  = _mm_unpacklo_epi32(xmm0, xmm4);
-    xmm2  = _mm_unpackhi_epi32(xmm2, xmm4);
-    xmm6  = xmm1;
-    xmm1  = _mm_unpacklo_epi32(xmm1, xmm5);
-    xmm6  = _mm_unpackhi_epi32(xmm6, xmm5);
-    xmm10 = xmm8;
-    xmm8  = _mm_unpacklo_epi32(xmm8, xmm12);
-    xmm10 = _mm_unpackhi_epi32(xmm10, xmm12);
-    xmm14 = xmm9;
-    xmm9  = _mm_unpacklo_epi32(xmm9, xmm13);
-    xmm14 = _mm_unpackhi_epi32(xmm14, xmm13);
-      
-    xmm3  = xmm0;
-    xmm0  = _mm_unpacklo_epi64(xmm0, xmm8);
-    xmm3  = _mm_unpackhi_epi64(xmm3, xmm8);
-    xmm7  = xmm2;
-    xmm2  = _mm_unpacklo_epi64(xmm2, xmm10);
-    xmm7  = _mm_unpackhi_epi64(xmm7, xmm10);
-    xmm11 = xmm1;
-    xmm1  = _mm_unpacklo_epi64(xmm1, xmm9);
-    xmm11 = _mm_unpackhi_epi64(xmm11, xmm9);
-    xmm15 = xmm6;
-    xmm6  = _mm_unpacklo_epi64(xmm6, xmm14);
-    xmm15 = _mm_unpackhi_epi64(xmm15, xmm14);
-
-    _mm_store_si128((__m128i*)(dprofile+16*j+1536+  0), xmm0);
-    _mm_store_si128((__m128i*)(dprofile+16*j+1536+ 64), xmm3);
-    _mm_store_si128((__m128i*)(dprofile+16*j+1536+128), xmm2);
-    _mm_store_si128((__m128i*)(dprofile+16*j+1536+192), xmm7);
-    _mm_store_si128((__m128i*)(dprofile+16*j+1536+256), xmm1);
-    _mm_store_si128((__m128i*)(dprofile+16*j+1536+320), xmm11);
-    _mm_store_si128((__m128i*)(dprofile+16*j+1536+384), xmm6);
-    _mm_store_si128((__m128i*)(dprofile+16*j+1536+448), xmm15);
-  }
-
-  //  dprofile_dump8(dprofile);
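+  VECTORTYPE W;
+
+  H = v_add(H, V);                  // diagonal score plus substitution score
+  W = H;                            // keep the diagonal candidate for the comparison
+  H = v_min(H, F);                  // best of diagonal and F
+  *((DIR) + 0) = v_mask_eq(W, H);   // bit set where the diagonal is not beaten by F
+  H = v_min(H, E);                  // best of diagonal, F and E
+  *((DIR) + 1) = v_mask_eq(H, E);   // bit set where E equals the overall minimum
+  N = H;                            // cell score, returned through N
+  H = v_add(H, QR);                 // cost of opening a new gap from this cell
+  F = v_add(F, R);                  // cost of extending the F gap
+  E = v_add(E, R);                  // cost of extending the E gap
+  F = v_min(H, F);                  // cheaper of opening and extending (F)
+  *((DIR) + 2) = v_mask_eq(H, F);   // bit set where opening is at least as cheap (F)
+  E = v_min(H, E);                  // cheaper of opening and extending (E)
+  *((DIR) + 3) = v_mask_eq(H, E);   // bit set where opening is at least as cheap (E)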
 }
 
-
-// Register usage
-// xmm0:  H0
-// xmm1:  H1
-// xmm2:  H2
-// xmm3:  H3
-// xmm4:  F0
-// xmm5:  F1
-// xmm6:  F2
-// xmm7:  F3
-// xmm8:  N0
-// xmm9:  N1
-// xmm10: N2
-// xmm11: N3
-// xmm12: E
-// xmm13: temporary
-// xmm14: Q 
-// xmm15: R
-
-
-/* 
-   initialize Q, R, H0-3, F0-3, 
-   loop index (r11), 
-   query length (loop end double) (r10), loop end single (r12) 
-*/
-
-/* 
-   Sorry for the assembler code below. This code was originally written
-   several years ago when compilers were not that good at compiling
-   intrinsics to optimal code.
-   Similar code using intrinsics instead of assembler is available in
-   the vsearch codebase.
-*/
-
-#define INITIALIZE                                      \
-  "        movq      %3, %%rax               \n"        \
-  "        movdqa    (%%rax), %%xmm14        \n"        \
-  "        movq      %4, %%rax               \n"        \
-  "        movdqa    (%%rax), %%xmm15        \n"        \
-  "        movq      %9, %%rax               \n"        \
-  "        movdqa    (%%rax), %%xmm0         \n"        \
-  "        movdqa    (%7), %%xmm7            \n"        \
-  "        movdqa    %%xmm7, %%xmm3          \n"        \
-  "        psubusb   %%xmm14, %%xmm3         \n"        \
-  "        movdqa    %%xmm3, %%xmm1          \n"        \
-  "        paddusb   %%xmm15, %%xmm3         \n"        \
-  "        movdqa    %%xmm3, %%xmm2          \n"        \
-  "        paddusb   %%xmm15, %%xmm3         \n"        \
-  "        movdqa    %%xmm7, %%xmm4          \n"        \
-  "        paddusb   %%xmm15, %%xmm7         \n"        \
-  "        movdqa    %%xmm7, %%xmm5          \n"        \
-  "        paddusb   %%xmm15, %%xmm7         \n"        \
-  "        movdqa    %%xmm7, %%xmm6          \n"        \
-  "        paddusb   %%xmm15, %%xmm7         \n"        \
-  "        movq      %5, %%r12               \n"        \
-  "        shlq      $3, %%r12               \n"        \
-  "        movq      %%r12, %%r10            \n"        \
-  "        andq      $-16, %%r10             \n"        \
-  "        xorq      %%r11, %%r11            \n" 
-
-#define ONESTEP(H, N, F, V, DIR)                \
-  "        paddusb   " V ", " H "          \n"  \
-  "        movdqa    " H ", %%xmm13        \n"  \
-  "        pminub    " F ", " H "          \n"  \
-  "        pcmpeqb   " H ", %%xmm13        \n"  \
-  "        pmovmskb  %%xmm13, %%edx        \n"  \
-  "        movw      %%dx, 0+" DIR "       \n"  \
-  "        pminub    %%xmm12, " H "        \n"  \
-  "        movdqa    " H ", %%xmm13        \n"  \
-  "        pcmpeqb   %%xmm12, %%xmm13      \n"  \
-  "        pmovmskb  %%xmm13, %%edx        \n"  \
-  "        movw      %%dx, 2+" DIR "       \n"  \
-  "        movdqa    " H ", " N "          \n"  \
-  "        paddusb   %%xmm14, " H "        \n"  \
-  "        paddusb   %%xmm15, " F "        \n"  \
-  "        paddusb   %%xmm15, %%xmm12      \n"  \
-  "        movdqa    " H ", %%xmm13        \n"  \
-  "        pminub    " H ", " F "          \n"  \
-  "        pcmpeqb   " F ", %%xmm13        \n"  \
-  "        pmovmskb  %%xmm13, %%edx        \n"  \
-  "        movw      %%dx, 4+" DIR "       \n"  \
-  "        movdqa    " H ", %%xmm13        \n"  \
-  "        pminub    " H ", %%xmm12        \n"  \
-  "        pcmpeqb   %%xmm12, %%xmm13      \n"  \
-  "        pmovmskb  %%xmm13, %%edx        \n"  \
-  "        movw      %%dx, 6+" DIR "       \n"
-
-
-inline void donormal8(__m128i * Sm,
-                      __m128i * hep,
-                      __m128i ** qp,
-                      __m128i * Qm,
-                      __m128i * Rm,
-                      long ql,
-                      __m128i * Zm,
-                      __m128i * F0,
-                      unsigned long * dir,
-                      __m128i * H0
-                      )
+void align_cells_regular_8(VECTORTYPE * Sm,
+                           VECTORTYPE * hep,
+                           VECTORTYPE ** qp,
+                           VECTORTYPE * Qm,
+                           VECTORTYPE * Rm,
+                           uint64_t ql,
+                           VECTORTYPE * F0,
+                           uint64_t * dir_long,
+                           VECTORTYPE * H0)
 {
-  __asm__
-    __volatile__
-    ( 
-     INITIALIZE
-     
-     "        jmp       2f                  \n"
-     
-     "1:      movq      0(%2,%%r11,1), %%rax    \n" // load x from qp[qi]
-     "        movdqa    0(%1,%%r11,4), %%xmm8   \n" // load N0
-     "        movdqa    16(%1,%%r11,4), %%xmm12 \n" // load E
-     
-     ONESTEP("%%xmm0",  "%%xmm9",         "%%xmm4", " 0(%%rax)", " 0(%8,%%r11,4)")
-     ONESTEP("%%xmm1",  "%%xmm10",        "%%xmm5", "16(%%rax)", " 8(%8,%%r11,4)")
-     ONESTEP("%%xmm2",  "%%xmm11",        "%%xmm6", "32(%%rax)", "16(%8,%%r11,4)")
-     ONESTEP("%%xmm3",  "0(%1,%%r11,4)",  "%%xmm7", "48(%%rax)", "24(%8,%%r11,4)")
-     
-     "        movdqa    %%xmm12, 16(%1,%%r11,4) \n" // save E
-     "        movq      8(%2,%%r11,1), %%rax    \n" // load x from qp[qi+1]
-     "        movdqa    32(%1,%%r11,4), %%xmm0  \n" // load H0
-     "        movdqa    48(%1,%%r11,4), %%xmm12 \n" // load E
-     
-     ONESTEP("%%xmm8",  "%%xmm1",         "%%xmm4", " 0(%%rax)", "32(%8,%%r11,4)")
-     ONESTEP("%%xmm9",  "%%xmm2",         "%%xmm5", "16(%%rax)", "40(%8,%%r11,4)")
-     ONESTEP("%%xmm10", "%%xmm3",         "%%xmm6", "32(%%rax)", "48(%8,%%r11,4)")
-     ONESTEP("%%xmm11", "32(%1,%%r11,4)", "%%xmm7", "48(%%rax)", "56(%8,%%r11,4)")
-     
-     "        movdqa    %%xmm12, 48(%1,%%r11,4) \n" // save E
-     "        addq      $16, %%r11              \n" // qi++
-     "2:      cmpq      %%r11, %%r10            \n" // qi = ql4 ?
-     "        jne       1b                      \n" // loop
-     
-     "4:      cmpq      %%r11, %%r12            \n" 
-     "        je        3f                      \n"
-     "        movq      0(%2,%%r11,1), %%rax    \n" // load x from qp[qi]
-     "        movdqa    16(%1,%%r11,4), %%xmm12 \n" // load E
-     
-     ONESTEP("%%xmm0",  "%%xmm9",         "%%xmm4", " 0(%%rax)", " 0(%8,%%r11,4)")
-     ONESTEP("%%xmm1",  "%%xmm10",        "%%xmm5", "16(%%rax)", " 8(%8,%%r11,4)")
-     ONESTEP("%%xmm2",  "%%xmm11",        "%%xmm6", "32(%%rax)", "16(%8,%%r11,4)")
-     ONESTEP("%%xmm3",  "0(%1,%%r11,4)",  "%%xmm7", "48(%%rax)", "24(%8,%%r11,4)")
-     
-     "        movdqa    %%xmm12, 16(%1,%%r11,4) \n" // save E
-     
-     "        movdqa    %%xmm9, %%xmm1          \n"
-     "        movdqa    %%xmm10, %%xmm2         \n"
-     "        movdqa    %%xmm11, %%xmm3         \n"
-     "        movdqa    0(%1,%%r11,4), %%xmm4   \n"
-     "        jmp       5f                      \n"
-     
-     "3:      movdqa    -32(%1,%%r11,4), %%xmm4 \n"
-     
-     "5:      movq      %0, %%rax               \n" // save final Hs
-     "        movdqa    %%xmm1, (%%rax)         \n"
-     "        addq      $16, %%rax              \n"
-     "        movdqa    %%xmm2, (%%rax)         \n"
-     "        addq      $16, %%rax              \n"
-     "        movdqa    %%xmm3, (%%rax)         \n"
-     "        addq      $16, %%rax              \n"
-     "        movdqa    %%xmm4, (%%rax)         \n"
-     
-     : 
-     : "m"(Sm), "r"(hep),  "r"(qp), "m"(Qm), 
-       "m"(Rm), "r"(ql),   "m"(Zm), "r"(F0),
-       "r"(dir),"m"(H0)
-       
-     : "xmm0",  "xmm1",  "xmm2",  "xmm3",
-       "xmm4",  "xmm5",  "xmm6",  "xmm7",
-       "xmm8",  "xmm9",  "xmm10", "xmm11", 
-       "xmm12", "xmm13", "xmm14", "xmm15",
-       "rax",   "r10",   "r11",   "r12",
-       "rdx",   "cc"
-      );
+  VECTORTYPE Q, R, E;
+  VECTORTYPE h0, h1, h2, h3, h4, h5, h6, h7, h8;
+  VECTORTYPE f0, f1, f2, f3;
+
+  unsigned short * dir = reinterpret_cast<unsigned short *>(dir_long);
+
+  Q = *Qm;
+  R = *Rm;
+
+  f0 = *F0;
+  f1 = v_add(f0, R);
+  f2 = v_add(f1, R);
+  f3 = v_add(f2, R);
+
+  h0 = *H0;
+  h1 = v_sub(f0, Q);
+  h2 = v_add(h1, R);
+  h3 = v_add(h2, R);
+  h4 = v_zero;
+  h5 = v_zero;
+  h6 = v_zero;
+  h7 = v_zero;
+  h8 = v_zero;
+
+  for(uint64_t i = 0; i < ql; i++)
+    {
+      VECTORTYPE * x;
+
+      x = qp[i + 0];
+      h4 = hep[2*i + 0];
+      E  = hep[2*i + 1];
+      onestep_8(h0, h5, f0, x[0], dir + 16*i +  0, E, Q, R);
+      onestep_8(h1, h6, f1, x[1], dir + 16*i +  4, E, Q, R);
+      onestep_8(h2, h7, f2, x[2], dir + 16*i +  8, E, Q, R);
+      onestep_8(h3, h8, f3, x[3], dir + 16*i + 12, E, Q, R);
+      hep[2*i + 0] = h8;
+      hep[2*i + 1] = E;
+      h0 = h4;
+      h1 = h5;
+      h2 = h6;
+      h3 = h7;
+    }
+
+  Sm[0] = h5;
+  Sm[1] = h6;
+  Sm[2] = h7;
+  Sm[3] = h8;
 }
 
-inline void domasked8(__m128i * Sm,
-                      __m128i * hep,
-                      __m128i ** qp,
-                      __m128i * Qm, 
-                      __m128i * Rm, 
-                      long ql,      
-                      __m128i * Zm,
-                      __m128i * F0,
-                      unsigned long * dir,
-                      __m128i * H0,
-                      __m128i * Mm,
-                      __m128i * MQ,
-                      __m128i * MR,
-                      __m128i * MQ0)
+void align_cells_masked_8(VECTORTYPE * Sm,
+                          VECTORTYPE * hep,
+                          VECTORTYPE ** qp,
+                          VECTORTYPE * Qm,
+                          VECTORTYPE * Rm,
+                          uint64_t ql,
+                          VECTORTYPE * F0,
+                          uint64_t * dir_long,
+                          VECTORTYPE * H0,
+                          VECTORTYPE * Mm,
+                          VECTORTYPE * MQ,
+                          VECTORTYPE * MR,
+                          VECTORTYPE * MQ0)
 {
-  __asm__
-    __volatile__
-    (
-     INITIALIZE
-     
-     "        jmp       2f                       \n"
-     
-     "1:      movq      0(%2,%%r11,1), %%rax     \n" // load x from qp[qi]
-     "        movdqa    0(%1,%%r11,4), %%xmm8    \n" // load N0
-     "        movdqa    16(%1,%%r11,4), %%xmm12  \n" // load E
-     "        movdqa    (%11), %%xmm13           \n" 
-     "        psubusb   (%10), %%xmm8            \n" // mask N0
-     "        psubusb   (%10), %%xmm12           \n" // mask E
-     "        paddusb   %%xmm13, %%xmm8          \n" // init N0
-     "        paddusb   %%xmm13, %%xmm12         \n" // init E
-     "        paddusb   (%13), %%xmm12           \n" // fix E
-     "        paddusb   (%12), %%xmm13           \n" // update
-     "        movdqa    %%xmm13, (%11)           \n"
-     
-     ONESTEP("%%xmm0",  "%%xmm9",         "%%xmm4", " 0(%%rax)", " 0(%8,%%r11,4)")
-     ONESTEP("%%xmm1",  "%%xmm10",        "%%xmm5", "16(%%rax)", " 8(%8,%%r11,4)")
-     ONESTEP("%%xmm2",  "%%xmm11",        "%%xmm6", "32(%%rax)", "16(%8,%%r11,4)")
-     ONESTEP("%%xmm3",  "0(%1,%%r11,4)",  "%%xmm7", "48(%%rax)", "24(%8,%%r11,4)")
-     
-     "        movdqa    %%xmm12, 16(%1,%%r11,4)  \n" // save E
-
-     "        movq      8(%2,%%r11,1), %%rax     \n" // load x from qp[qi+1]
-     "        movdqa    32(%1,%%r11,4), %%xmm0   \n" // load H0
-     "        movdqa    48(%1,%%r11,4), %%xmm12  \n" // load E
-     "        movdqa    (%11), %%xmm13           \n"
-     "        psubusb   (%10), %%xmm0            \n" // mask H0
-     "        psubusb   (%10), %%xmm12           \n" // mask E
-     "        paddusb   %%xmm13, %%xmm0          \n"
-     "        paddusb   %%xmm13, %%xmm12         \n"
-     "        paddusb   (%13), %%xmm12           \n" // fix E
-     "        paddusb   (%12), %%xmm13           \n"
-     "        movdqa    %%xmm13, (%11)           \n"
-     
-     ONESTEP("%%xmm8",  "%%xmm1",         "%%xmm4", " 0(%%rax)", "32(%8,%%r11,4)")
-     ONESTEP("%%xmm9",  "%%xmm2",         "%%xmm5", "16(%%rax)", "40(%8,%%r11,4)")
-     ONESTEP("%%xmm10", "%%xmm3",         "%%xmm6", "32(%%rax)", "48(%8,%%r11,4)")
-     ONESTEP("%%xmm11", "32(%1,%%r11,4)", "%%xmm7", "48(%%rax)", "56(%8,%%r11,4)")
-     
-     "        movdqa    %%xmm12, 48(%1,%%r11,4)  \n" // save E
-     "        addq      $16, %%r11               \n" // qi++
-     "2:      cmpq      %%r11, %%r10             \n" // qi = ql4 ?
-     "        jne       1b                       \n" // loop
-     
-     "        cmpq      %%r11, %%r12             \n" 
-     "        je        3f                       \n"
-     "        movq      0(%2,%%r11,1), %%rax     \n" // load x from qp[qi]
-     "        movdqa    16(%1,%%r11,4), %%xmm12  \n" // load E
-     "        movdqa    (%11), %%xmm13           \n"
-     "        psubusb   (%10), %%xmm12           \n" // mask E
-     "        paddusb   %%xmm13, %%xmm12         \n"
-     "        paddusb   (%13), %%xmm12           \n" // fix E
-     "        paddusb   (%12), %%xmm13           \n"
-     "        movdqa    %%xmm13, (%11)           \n"
-     
-     ONESTEP("%%xmm0",  "%%xmm9",          "%%xmm4", "0(%%rax)" , " 0(%8,%%r11,4)")
-     ONESTEP("%%xmm1",  "%%xmm10",         "%%xmm5", "16(%%rax)", " 8(%8,%%r11,4)")
-     ONESTEP("%%xmm2",  "%%xmm11",         "%%xmm6", "32(%%rax)", "16(%8,%%r11,4)")
-     ONESTEP("%%xmm3",  "0(%1,%%r11,4)",   "%%xmm7", "48(%%rax)", "24(%8,%%r11,4)")
-     
-     "        movdqa    %%xmm12, 16(%1,%%r11,4)  \n" // save E
-     
-     "        movdqa    %%xmm9, %%xmm1           \n"
-     "        movdqa    %%xmm10, %%xmm2          \n"
-     "        movdqa    %%xmm11, %%xmm3          \n"
-     "        movdqa    0(%1,%%r11,4), %%xmm4    \n"
-     "        jmp       5f                       \n"
-     
-     "3:      movdqa    -32(%1,%%r11,4), %%xmm4  \n"
-     
-     "5:      movq      %0, %%rax                \n" // save final Hs
-     "        movdqa    %%xmm1, (%%rax)          \n"
-     "        addq      $16, %%rax               \n"
-     "        movdqa    %%xmm2, (%%rax)          \n"
-     "        addq      $16, %%rax               \n"
-     "        movdqa    %%xmm3, (%%rax)          \n"
-     "        addq      $16, %%rax               \n"
-     "        movdqa    %%xmm4, (%%rax)          \n"
-     
-     : 
-     
-     : "m"(Sm), "r"(hep),"r"(qp), "m"(Qm), 
-       "m"(Rm), "r"(ql), "m"(Zm), "r"(F0),
-       "r"(dir),"m"(H0),
-       "r"(Mm), "r"(MQ), "r"(MR), "r"(MQ0)
-       
-     : "xmm0",  "xmm1",  "xmm2",  "xmm3",
-       "xmm4",  "xmm5",  "xmm6",  "xmm7",
-       "xmm8",  "xmm9",  "xmm10", "xmm11", 
-       "xmm12", "xmm13", "xmm14", "xmm15",
-       "rax",   "r10",   "r11",   "r12",
-       "rdx",   "cc"
-     );
+  VECTORTYPE Q, R, E;
+  VECTORTYPE h0, h1, h2, h3, h4, h5, h6, h7, h8;
+  VECTORTYPE f0, f1, f2, f3;
+
+  unsigned short * dir = reinterpret_cast<unsigned short *>(dir_long);
+
+  Q = *Qm;
+  R = *Rm;
+
+  f0 = *F0;
+  f1 = v_add(f0, R);
+  f2 = v_add(f1, R);
+  f3 = v_add(f2, R);
+
+  h0 = *H0;
+  h1 = v_sub(f0, Q);
+  h2 = v_add(h1, R);
+  h3 = v_add(h2, R);
+  h4 = v_zero;
+  h5 = v_zero;
+  h6 = v_zero;
+  h7 = v_zero;
+  h8 = v_zero;
+
+  for(uint64_t i = 0; i < ql; i++)
+    {
+      VECTORTYPE * x;
+
+      h4 = hep[2*i + 0];
+      E  = hep[2*i + 1];
+      x = qp[i + 0];
+
+      /* mask h4 and E */
+      h4 = v_sub(h4, *Mm);
+      E  = v_sub(E,  *Mm);
+
+      /* init h4 and E */
+      h4 = v_add(h4, *MQ);
+      E  = v_add(E,  *MQ);
+      E  = v_add(E,  *MQ0);
+
+      /* update MQ */
+      *MQ = v_add(*MQ,  *MR);
+
+      onestep_8(h0, h5, f0, x[0], dir + 16*i +  0, E, Q, R);
+      onestep_8(h1, h6, f1, x[1], dir + 16*i +  4, E, Q, R);
+      onestep_8(h2, h7, f2, x[2], dir + 16*i +  8, E, Q, R);
+      onestep_8(h3, h8, f3, x[3], dir + 16*i + 12, E, Q, R);
+      hep[2*i + 0] = h8;
+      hep[2*i + 1] = E;
+
+      h0 = h4;
+      h1 = h5;
+      h2 = h6;
+      h3 = h7;
+    }
+
+  Sm[0] = h5;
+  Sm[1] = h6;
+  Sm[2] = h7;
+  Sm[3] = h8;
 }
 
-unsigned long backtrack(char * qseq,
-                        char * dseq,
-                        unsigned long qlen,
-                        unsigned long dlen,
-                        unsigned long * dirbuffer,
-                        unsigned long offset,
-                        unsigned long dirbuffersize,
-                        unsigned long channel,
-                        unsigned long * alignmentlengthp)
+inline uint64_t backtrack_8(char * qseq,
+                            char * dseq,
+                            uint64_t qlen,
+                            uint64_t dlen,
+                            uint64_t * dirbuffer,
+                            uint64_t offset,
+                            uint64_t dirbuffersize,
+                            uint64_t channel,
+                            uint64_t * alignmentlengthp)
 {
-  unsigned long maskup      = 1UL << (channel+ 0);
-  unsigned long maskleft    = 1UL << (channel+16);
-  unsigned long maskextup   = 1UL << (channel+32);
-  unsigned long maskextleft = 1UL << (channel+48);
+  uint64_t maskup      = 1ULL << (channel+ 0);
+  uint64_t maskleft    = 1ULL << (channel+16);
+  uint64_t maskextup   = 1ULL << (channel+32);
+  uint64_t maskextleft = 1ULL << (channel+48);
 
 #if 0
 
   printf("Dumping backtracking array\n");
 
-  for(unsigned long i=0; i<qlen; i++)
-  {
-    for(unsigned long j=0; j<dlen; j++)
+  for(uint64_t i=0; i<qlen; i++)
     {
-      unsigned long d = dirbuffer[(offset + longestdbsequence*4*(j/4) + 4*i + (j&3)) % dirbuffersize];
-      if (d & maskleft)
-      {
-        printf("<");
-      }
-      else if (!(d & maskup))
-      {
-          printf("^");
-      }
-      else
-      {
-        printf("\\");
-      }
+      for(uint64_t j=0; j<dlen; j++)
+        {
+          uint64_t d = dirbuffer[(offset + longestdbsequence * 4 * (j / 4)
+                                  + 4 * i + (j & 3)) % dirbuffersize];
+          if (d & maskleft)
+            {
+              printf("<");
+            }
+          else if (!(d & maskup))
+            {
+              printf("^");
+            }
+          else
+            {
+              printf("\\");
+            }
+        }
+      printf("\n");
     }
-    printf("\n");
-  }
 
   printf("Dumping gap extension array\n");
 
-  for(unsigned long i=0; i<qlen; i++)
-  {
-    for(unsigned long j=0; j<dlen; j++)
+  for(uint64_t i=0; i<qlen; i++)
     {
-      unsigned long d = dirbuffer[(offset + longestdbsequence*4*(j/4)
-                                   + 4*i + (j&3)) % dirbuffersize];
-      if (!(d & maskextup))
-      {
-        if (!(d & maskextleft))
-          printf("+");
-        else
-          printf("^");
-      }
-      else if (!(d & maskextleft))
-      {
-        printf("<");
-      }
-      else
-      {
-        printf("\\");
-      }
+      for(uint64_t j=0; j<dlen; j++)
+        {
+          uint64_t d = dirbuffer[(offset + longestdbsequence * 4 * (j / 4)
+                                  + 4 * i + (j & 3)) % dirbuffersize];
+          if (!(d & maskextup))
+            {
+              if (!(d & maskextleft))
+                printf("+");
+              else
+                printf("^");
+            }
+          else if (!(d & maskextleft))
+            {
+              printf("<");
+            }
+          else
+            {
+              printf("\\");
+            }
+        }
+      printf("\n");
     }
-    printf("\n");
-  }
 
 #endif
 
-  long i = qlen - 1;
-  long j = dlen - 1;
-  unsigned long aligned = 0;
-  unsigned long matches = 0;
+  int64_t i = static_cast<int64_t>(qlen) - 1;
+  int64_t j = static_cast<int64_t>(dlen) - 1;
+  uint64_t aligned = 0;
+  uint64_t matches = 0;
   char op = 0;
 
 #undef SHOWALIGNMENT
@@ -757,47 +737,50 @@ unsigned long backtrack(char * qseq,
   printf("alignment, reversed: ");
 #endif
 
-  while ((i>=0) && (j>=0))
-  {
-    aligned++;
+  while ((i >= 0) && (j >= 0))
+    {
+      aligned++;
 
-    unsigned long d = dirbuffer[(offset + longestdbsequence*4*(j/4)
-                                 + 4*i + (j&3)) % dirbuffersize];
+      uint64_t d
+        = dirbuffer[(offset
+                     + longestdbsequence * 4 * static_cast<uint64_t>(j / 4)
+                     + static_cast<uint64_t>(4 * i + (j & 3)))
+                    % dirbuffersize];
 
-    if ((op == 'I') && (!(d & maskextleft)))
-    {
-      j--;
-    }
-    else if ((op == 'D') && (!(d & maskextup)))
-    {
-      i--;
-    }
-    else if (d & maskleft)
-    {
-      j--;
-      op = 'I';
-    }
-    else if (!(d & maskup))
-    {
-      i--;
-      op = 'D';
-    }
-    else
-    {
-      if (qseq[i] == dseq[j])
-        matches++;
-      i--;
-      j--;
-      op = 'M';
-    }
+      if ((op == 'I') && (!(d & maskextleft)))
+        {
+          j--;
+        }
+      else if ((op == 'D') && (!(d & maskextup)))
+        {
+          i--;
+        }
+      else if (d & maskleft)
+        {
+          j--;
+          op = 'I';
+        }
+      else if (!(d & maskup))
+        {
+          i--;
+          op = 'D';
+        }
+      else
+        {
+          if (nt_extract(qseq, static_cast<uint64_t>(i)) ==
+              nt_extract(dseq, static_cast<uint64_t>(j)))
+            matches++;
+          i--;
+          j--;
+          op = 'M';
+        }
 
 #ifdef SHOWALIGNMENT
-    printf("%c", op);
+      printf("%c", op);
 #endif
+    }
 
-  }
-
-  while (i>=0)
+  while (i >= 0)
     {
       aligned++;
       i--;
@@ -806,7 +789,7 @@ unsigned long backtrack(char * qseq,
 #endif
     }
 
-  while (j>=0)
+  while (j >= 0)
     {
       aligned++;
       j--;
@@ -829,226 +812,238 @@ void search8(BYTE * * q_start,
              BYTE * score_matrix,
              BYTE * dprofile,
              BYTE * hearray,
-             unsigned long sequences,
-             unsigned long * seqnos,
-             unsigned long * scores,
-             unsigned long * diffs,
-             unsigned long * alignmentlengths,
-             unsigned long qlen,
-             unsigned long dirbuffersize,
-             unsigned long * dirbuffer)
+             uint64_t sequences,
+             uint64_t * seqnos,
+             uint64_t * scores,
+             uint64_t * diffs,
+             uint64_t * alignmentlengths,
+             uint64_t qlen,
+             uint64_t dirbuffersize,
+             uint64_t * dirbuffer)
 {
-  __m128i Q, R, T, M, T0, MQ, MR, MQ0;
-  __m128i *hep, **qp;
-
-  BYTE * d_begin[CHANNELS];
-  BYTE * d_end[CHANNELS];
-  unsigned long d_offset[CHANNELS];
-  BYTE * d_address[CHANNELS];
-  unsigned long d_length[CHANNELS];
-  
-  __m128i dseqalloc[CDEPTH];
-  
-  __m128i H0, F0;
-  __m128i S[4];
-
-  BYTE * dseq = (BYTE*) & dseqalloc;
-  BYTE zero;
-
-  long seq_id[CHANNELS];
-  unsigned long next_id = 0;
-  unsigned long done;
-  
-  T0 = _mm_set_epi8(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -1);
-  Q  = _mm_set1_epi8(gap_open_penalty+gap_extend_penalty);
-  R  = _mm_set1_epi8(gap_extend_penalty);
-
-  zero = 0;
+  VECTORTYPE Q, R, T, M, T0, MQ, MR, MQ0;
+  VECTORTYPE *hep, **qp;
+
+  uint64_t d_pos[CHANNELS];
+  uint64_t d_offset[CHANNELS];
+  char * d_address[CHANNELS];
+  uint64_t d_length[CHANNELS];
+
+  VECTORTYPE dseqalloc[CDEPTH];
+
+  VECTORTYPE H0;
+  VECTORTYPE F0;
+  VECTORTYPE S[4];
+
+  BYTE * dseq = reinterpret_cast<BYTE*>(& dseqalloc);
+
+  int64_t seq_id[CHANNELS];
+  uint64_t next_id = 0;
+  uint64_t done;
+
+#ifdef __aarch64__
+  const VECTORTYPE T0_init = { -1, 0, 0, 0, 0, 0, 0, 0,
+                                0, 0, 0, 0, 0, 0, 0, 0 };
+#elif defined __x86_64__
+  const VECTORTYPE T0_init = _mm_set_epi8(0, 0, 0, 0, 0, 0, 0, 0,
+                                          0, 0, 0, 0, 0, 0, 0, -1);
+#elif defined __PPC__
+  const VECTORTYPE T0_init = { static_cast<unsigned char>(-1), 0, 0, 0, 0, 0, 0, 0,
+                               0, 0, 0, 0, 0, 0, 0, 0 };
+#endif
+
+  T0 = T0_init;
+
+  Q  = v_dup(static_cast<char>(gap_open_penalty + gap_extend_penalty));
+  R  = v_dup(static_cast<char>(gap_extend_penalty));
+
   done = 0;
 
-  hep = (__m128i*) hearray;
-  qp = (__m128i**) q_start;
+  hep = CAST_VECTOR_p(hearray);
+  qp = reinterpret_cast<VECTORTYPE**>(q_start);
 
   for (int c=0; c<CHANNELS; c++)
-  {
-    d_begin[c] = &zero;
-    d_end[c] = d_begin[c];
-    seq_id[c] = -1;
-  }
-  
-  F0 = _mm_setzero_si128();
-  H0 = _mm_setzero_si128();
-  
+    {
+      d_address[c] = nullptr;
+      d_pos[c] = 0;
+      d_length[c] = 0;
+      seq_id[c] = -1;
+    }
+
+  F0 = v_zero;
+  H0 = v_zero;
+
   int easy = 0;
 
-  unsigned long * dir = dirbuffer;
+  uint64_t * dir = dirbuffer;
 
   while(1)
-  {
-
-    if (easy)
     {
-      // fill all channels
-
-      for(int c=0; c<CHANNELS; c++)
-      {
-        for(int j=0; j<CDEPTH; j++)
+      if (easy)
         {
-          if (d_begin[c] < d_end[c])
-            dseq[CHANNELS*j+c] = *(d_begin[c]++);
-          else
-            dseq[CHANNELS*j+c] = 0;
-        }
-        if (d_begin[c] == d_end[c])
-          easy = 0;
-      }
+          // fill all channels
 
-      if (ssse3_present)
-        dprofile_shuffle8(dprofile, score_matrix, dseq);
-      else
-        dprofile_fill8(dprofile, score_matrix, dseq);
-      
-      donormal8(S, hep, qp, &Q, &R, qlen, 0, &F0, dir, &H0);
-    }
-    else
-    {
-      // One or more sequences ended in the previous block 
-      // We have to switch over to a new sequence
+          for(int c=0; c<CHANNELS; c++)
+            {
+              for(int j=0; j<CDEPTH; j++)
+                {
+                  if (d_pos[c] < d_length[c])
+                    dseq[CHANNELS*j+c]
+                      = 1 + nt_extract(d_address[c], d_pos[c]++);
+                  else
+                    dseq[CHANNELS*j+c] = 0;
+                }
+              if (d_pos[c] == d_length[c])
+                easy = 0;
+            }
 
-      easy = 1;
+#ifdef __x86_64__
+          if (ssse3_present)
+            dprofile_shuffle8(dprofile, score_matrix, dseq);
+          else
+#endif
+            dprofile_fill8(dprofile, score_matrix, dseq);
 
-      M = _mm_setzero_si128();
-      T = T0;
-      for (int c=0; c<CHANNELS; c++)
-      {
-        if (d_begin[c] < d_end[c])
-        {
-          // this channel has more sequence
-
-          for(int j=0; j<CDEPTH; j++)
-          {
-            if (d_begin[c] < d_end[c])
-              dseq[CHANNELS*j+c] = *(d_begin[c]++);
-            else
-              dseq[CHANNELS*j+c] = 0;
-          }
-          if (d_begin[c] == d_end[c])
-            easy = 0;
+          align_cells_regular_8(S, hep, qp, &Q, &R, qlen, &F0, dir, &H0);
         }
-        else
+      else
         {
-          // sequence in channel c ended
-          // change of sequence
-
-          M = _mm_xor_si128(M, T);
-
-          long cand_id = seq_id[c];
-          
-          
-          if (cand_id >= 0)
-          {
-            // save score
-
-            char * dbseq = (char*) d_address[c];
-            long dbseqlen = d_length[c];
-            long z = (dbseqlen+3) % 4;
-            long score = ((BYTE*)S)[z*16+c];
-            scores[cand_id] = score;
-            
-            unsigned long diff;
-
-            if (score < 255)
-            {
-              long offset = d_offset[c];
-              diff = backtrack(query.seq, dbseq, qlen, dbseqlen,
-                               dirbuffer,
-                               offset,
-                               dirbuffersize, c,
-                               alignmentlengths + cand_id);
-            }
-            else
-            {
-              diff = 255;
-            }
+          // One or more sequences ended in the previous block
+          // We have to switch over to a new sequence
 
-            diffs[cand_id] = diff;
-
-            done++;
-          }
-
-          if (next_id < sequences)
-          {
-            // get next sequence
-            seq_id[c] = next_id;
-            long seqno = seqnos[next_id];
-            char* address;
-            long length;
-
-            db_getsequenceandlength(seqno, & address, & length);
-                      
-            d_address[c] = (BYTE*) address;
-            d_length[c] = length;
-
-            d_begin[c] = (unsigned char*) address;
-            d_end[c] = (unsigned char*) address + length;
-            d_offset[c] = dir - dirbuffer;
-            next_id++;
-            
-            ((BYTE*)&H0)[c] = 0;
-            ((BYTE*)&F0)[c] = 2 * gap_open_penalty + 2 * gap_extend_penalty;
-            
-            // fill channel
-            for(int j=0; j<CDEPTH; j++)
+          easy = 1;
+
+          M = v_zero;
+          T = T0;
+          for (unsigned int c = 0; c < CHANNELS; c++)
             {
-              if (d_begin[c] < d_end[c])
-                dseq[CHANNELS*j+c] = *(d_begin[c]++);
+              if (d_pos[c] < d_length[c])
+                {
+                  // this channel has more sequence
+
+                  for(unsigned int j = 0; j < CDEPTH; j++)
+                    {
+                      if (d_pos[c] < d_length[c])
+                        dseq[CHANNELS * j + c]
+                          = 1 + nt_extract(d_address[c], d_pos[c]++);
+                      else
+                        dseq[CHANNELS*j+c] = 0;
+                    }
+                  if (d_pos[c] == d_length[c])
+                    easy = 0;
+                }
               else
-                dseq[CHANNELS*j+c] = 0;
+                {
+                  // sequence in channel c ended
+                  // change of sequence
+
+                  M = v_xor(M, T);
+
+                  int64_t cand_id = seq_id[c];
+
+                  if (cand_id >= 0)
+                    {
+                      // save score
+
+                      char * dbseq = d_address[c];
+                      uint64_t dbseqlen = d_length[c];
+                      uint64_t z = (dbseqlen+3) % 4;
+                      uint64_t score
+                        = (reinterpret_cast<BYTE*>(S))[z * CHANNELS + c];
+                      scores[cand_id] = score;
+
+                      uint64_t diff;
+
+                      if (score < 255)
+                        {
+                          uint64_t offset = d_offset[c];
+                          diff = backtrack_8(query.seq, dbseq, qlen, dbseqlen,
+                                             dirbuffer,
+                                             offset,
+                                             dirbuffersize, c,
+                                             alignmentlengths + cand_id);
+                        }
+                      else
+                        {
+                          diff = 255;
+                        }
+
+                      diffs[cand_id] = diff;
+
+                      done++;
+                    }
+
+                  if (next_id < sequences)
+                    {
+                      // get next sequence
+                      seq_id[c] = static_cast<int64_t>(next_id);
+                      uint64_t seqno = seqnos[next_id];
+                      char* address;
+                      unsigned int length;
+
+                      db_getsequenceandlength(seqno, & address, & length);
+
+                      d_address[c] = address;
+                      d_length[c] = length;
+
+                      d_pos[c] = 0;
+                      d_offset[c] = static_cast<uint64_t>(dir - dirbuffer);
+                      next_id++;
+
+                      (reinterpret_cast<BYTE*>(&H0))[c] = 0;
+                      (reinterpret_cast<BYTE*>(&F0))[c] = static_cast<BYTE>(2 * gap_open_penalty + 2 * gap_extend_penalty);
+
+                      // fill channel
+                      for(unsigned int j = 0; j < CDEPTH; j++)
+                        {
+                          if (d_pos[c] < d_length[c])
+                            dseq[CHANNELS*j+c] = 1 + nt_extract(d_address[c], d_pos[c]++);
+                          else
+                            dseq[CHANNELS*j+c] = 0;
+                        }
+                      if (d_pos[c] == d_length[c])
+                        easy = 0;
+                    }
+                  else
+                    {
+                      // no more sequences, empty channel
+                      seq_id[c] = -1;
+                      d_address[c] = nullptr;
+                      d_pos[c] = 0;
+                      d_length[c] = 0;
+                      for (unsigned int j=0; j<CDEPTH; j++)
+                        dseq[CHANNELS*j+c] = 0;
+                    }
+                }
+
+              T = v_shift_left(T);
             }
-            if (d_begin[c] == d_end[c])
-              easy = 0;
-          }
+
+          if (done == sequences)
+            break;
+
+#ifdef __x86_64__
+          if (ssse3_present)
+            dprofile_shuffle8(dprofile, score_matrix, dseq);
           else
-          {
-            // no more sequences, empty channel
-            seq_id[c] = -1;
-            d_begin[c] = &zero;
-            d_end[c] = d_begin[c];
-            for (int j=0; j<CDEPTH; j++)
-              dseq[CHANNELS*j+c] = 0;
-          }
+#endif
+            dprofile_fill8(dprofile, score_matrix, dseq);
 
+          MQ = v_and(M, Q);
+          MR = v_and(M, R);
+          MQ0 = MQ;
 
+          align_cells_masked_8(S, hep, qp, &Q, &R, qlen, &F0, dir, &H0, &M, &MQ, &MR, &MQ0);
         }
 
-        T = _mm_slli_si128(T, 1);
-      }
+      F0 = v_add(F0, R);
+      F0 = v_add(F0, R);
+      F0 = v_add(F0, R);
+      H0 = v_sub(F0, Q);
+      F0 = v_add(F0, R);
 
-      if (done == sequences)
-        break;
-          
-      if (ssse3_present)
-        dprofile_shuffle8(dprofile, score_matrix, dseq);
-      else
-        dprofile_fill8(dprofile, score_matrix, dseq);
-
-      MQ = _mm_and_si128(M, Q);
-      MR = _mm_and_si128(M, R);
-      MQ0 = MQ;
-      
-      domasked8(S, hep, qp, &Q, &R, qlen, 0, &F0, dir, &H0, &M, &MQ, &MR,
-                &MQ0);
+      dir += 4*longestdbsequence;
+      if (dir >= dirbuffer + dirbuffersize)
+        dir -= dirbuffersize;
     }
-    
-    F0 = _mm_adds_epu8(F0, R);
-    F0 = _mm_adds_epu8(F0, R);
-    F0 = _mm_adds_epu8(F0, R);
-    H0 = _mm_subs_epu8(F0, Q);
-    F0 = _mm_adds_epu8(F0, R);
-
-
-    dir += 4*longestdbsequence;
-    if (dir >= dirbuffer + dirbuffersize)
-      dir -= dirbuffersize;
-  }
 }
diff --git a/src/ssse3.cc b/src/ssse3.cc
index 503f91bf..86bbe1ec 100644
--- a/src/ssse3.cc
+++ b/src/ssse3.cc
@@ -1,7 +1,7 @@
 /*
     SWARM
 
-    Copyright (C) 2012-2017 Torbjorn Rognes and Frederic Mahe
+    Copyright (C) 2012-2019 Torbjorn Rognes and Frederic Mahe
 
     This program is free software: you can redistribute it and/or modify
     it under the terms of the GNU Affero General Public License as
@@ -34,7 +34,7 @@ void dprofile_shuffle8(BYTE * dprofile,
                        BYTE * dseq_byte)
 {
   __m128i m0, m1, m2, m3, t0, t1, t2, t3, t4;
-  __m128i * dseq = (__m128i*) dseq_byte;
+  __m128i * dseq = CAST_m128i_ptr(dseq_byte);
 
   m0 = _mm_load_si128(dseq);
   m1 = _mm_load_si128(dseq+1);
@@ -42,15 +42,15 @@ void dprofile_shuffle8(BYTE * dprofile,
   m3 = _mm_load_si128(dseq+3);
 
 #define profline8(j)                                    \
-  t0 = _mm_load_si128((__m128i*)(score_matrix)+2*j);    \
+  t0 = _mm_load_si128(CAST_m128i_ptr(score_matrix)+2*j);    \
   t1 = _mm_shuffle_epi8(t0, m0);                        \
   t2 = _mm_shuffle_epi8(t0, m1);                        \
   t3 = _mm_shuffle_epi8(t0, m2);                        \
   t4 = _mm_shuffle_epi8(t0, m3);                        \
-  _mm_store_si128((__m128i*)(dprofile)+4*j+0, t1);      \
-  _mm_store_si128((__m128i*)(dprofile)+4*j+1, t2);      \
-  _mm_store_si128((__m128i*)(dprofile)+4*j+2, t3);      \
-  _mm_store_si128((__m128i*)(dprofile)+4*j+3, t4)
+  _mm_store_si128(CAST_m128i_ptr(dprofile)+4*j+0, t1);      \
+  _mm_store_si128(CAST_m128i_ptr(dprofile)+4*j+1, t2);      \
+  _mm_store_si128(CAST_m128i_ptr(dprofile)+4*j+2, t3);      \
+  _mm_store_si128(CAST_m128i_ptr(dprofile)+4*j+3, t4)
 
   profline8(0);
   profline8(1);
@@ -70,7 +70,7 @@ void dprofile_shuffle16(WORD * dprofile,
   __m128i t0, t1, t2, t3, t4, t5;
   __m128i u0, u1, u2, u3, u4;
   __m128i zero, one;
-  __m128i * dseq = (__m128i*) dseq_byte;
+  __m128i * dseq = CAST_m128i_ptr(dseq_byte);
 
   zero = _mm_setzero_si128();
   one  = _mm_set1_epi16(1);
@@ -104,15 +104,15 @@ void dprofile_shuffle16(WORD * dprofile,
   m3 = _mm_or_si128(m3, t5);
 
 #define profline16(j)                                   \
-  u0 = _mm_load_si128((__m128i*)(score_matrix)+4*j);    \
+  u0 = _mm_load_si128(CAST_m128i_ptr(score_matrix)+4*j);    \
   u1 = _mm_shuffle_epi8(u0, m0);                        \
   u2 = _mm_shuffle_epi8(u0, m1);                        \
   u3 = _mm_shuffle_epi8(u0, m2);                        \
   u4 = _mm_shuffle_epi8(u0, m3);                        \
-  _mm_store_si128((__m128i*)(dprofile)+4*j+0, u1);      \
-  _mm_store_si128((__m128i*)(dprofile)+4*j+1, u2);      \
-  _mm_store_si128((__m128i*)(dprofile)+4*j+2, u3);      \
-  _mm_store_si128((__m128i*)(dprofile)+4*j+3, u4)
+  _mm_store_si128(CAST_m128i_ptr(dprofile)+4*j+0, u1);      \
+  _mm_store_si128(CAST_m128i_ptr(dprofile)+4*j+1, u2);      \
+  _mm_store_si128(CAST_m128i_ptr(dprofile)+4*j+2, u3);      \
+  _mm_store_si128(CAST_m128i_ptr(dprofile)+4*j+3, u4)
 
   profline16(0);
   profline16(1);
diff --git a/src/swarm.cc b/src/swarm.cc
index 545dee6d..8994c0ea 100644
--- a/src/swarm.cc
+++ b/src/swarm.cc
@@ -1,7 +1,7 @@
 /*
     SWARM
 
-    Copyright (C) 2012-2017 Torbjorn Rognes and Frederic Mahe
+    Copyright (C) 2012-2019 Torbjorn Rognes and Frederic Mahe
 
     This program is free software: you can redistribute it and/or modify
     it under the terms of the GNU Affero General Public License as
@@ -25,7 +25,7 @@
 
 /* OPTIONS */
 
-char * progname;
+static char * progname;
 char * input_filename;
 
 char * opt_internal_structure;
@@ -35,57 +35,69 @@ char * opt_seeds;
 char * opt_statistics_file;
 char * opt_uclust_file;
 
-long opt_append_abundance;
-long opt_bloom_bits;
-long opt_boundary;
-long opt_ceiling;
-long opt_differences;
-long opt_fastidious;
-long opt_gap_extension_penalty;
-long opt_gap_opening_penalty;
-long opt_help;
-long opt_match_reward;
-long opt_mismatch_penalty;
-long opt_mothur;
-long opt_no_otu_breaking;
-long opt_threads;
-long opt_usearch_abundance;
-long opt_version;
-
-long penalty_factor;
-long penalty_gapextend;
-long penalty_gapopen;
-long penalty_mismatch;
+int64_t opt_append_abundance;
+int64_t opt_bloom_bits;
+int64_t opt_boundary;
+int64_t opt_ceiling;
+int64_t opt_differences;
+int64_t opt_fastidious;
+int64_t opt_gap_extension_penalty;
+int64_t opt_gap_opening_penalty;
+int64_t opt_help;
+int64_t opt_match_reward;
+int64_t opt_mismatch_penalty;
+int64_t opt_mothur;
+int64_t opt_no_otu_breaking;
+int64_t opt_threads;
+int64_t opt_usearch_abundance;
+int64_t opt_version;
+
+int64_t penalty_factor;
+int64_t penalty_gapextend;
+int64_t penalty_gapopen;
+int64_t penalty_mismatch;
 
 /* Other variables */
 
-long mmx_present = 0;
-long sse_present = 0;
-long sse2_present = 0;
-long sse3_present = 0;
-long ssse3_present = 0;
-long sse41_present = 0;
-long sse42_present = 0;
-long popcnt_present = 0;
-long avx_present = 0;
-long avx2_present = 0;
+int64_t mmx_present = 0;
+int64_t sse_present = 0;
+int64_t sse2_present = 0;
+int64_t sse3_present = 0;
+int64_t ssse3_present = 0;
+int64_t sse41_present = 0;
+int64_t sse42_present = 0;
+int64_t popcnt_present = 0;
+int64_t avx_present = 0;
+int64_t avx2_present = 0;
 
-unsigned long dbsequencecount = 0;
+static uint64_t dbsequencecount = 0;
 
-unsigned long duplicates_found = 0;
+uint64_t duplicates_found = 0;
 
-FILE * outfile;
-FILE * statsfile;
-FILE * uclustfile;
-FILE * logfile = stderr;
-FILE * internal_structure_file;
-FILE * fp_seeds = 0;
+FILE * outfile = nullptr;
+FILE * statsfile = nullptr;
+FILE * uclustfile = nullptr;
+FILE * logfile = nullptr;
+FILE * internal_structure_file = nullptr;
+FILE * fp_seeds = nullptr;
 
 char sym_nt[] = "-ACGT                           ";
 
-char * DASH_FILENAME = (char*) "-";
-char * STDIN_NAME = (char*) "/dev/stdin";
-char * STDOUT_NAME = (char*) "/dev/stdout";
+static char dash[] = "-";
+static char * DASH_FILENAME = dash;
+
+#ifdef __x86_64__
+
+void cpuid(unsigned int f1,
+           unsigned int f2,
+           unsigned int & a,
+           unsigned int & b,
+           unsigned int & c,
+           unsigned int & d);
+
+void cpu_features_detect();
+
+void cpu_features_show();
 
 void cpuid(unsigned int f1,
            unsigned int f2,
@@ -153,44 +165,60 @@ void cpu_features_show()
   fprintf(logfile, "\n");
 }
 
-long args_long(char * str, const char * option)
+#endif
+
+int64_t args_long(char * str, const char * option);
+void args_show();
+void args_usage();
+void show_header();
+void args_init(int argc, char **argv);
+void open_files();
+void close_files();
+
+int64_t args_long(char * str, const char * option)
 {
   char * endptr;
-  long temp = strtol(str, & endptr, 10);
+  int64_t temp = strtoll(str, & endptr, 10);
   if (*endptr)
-    fatal("Invalid numeric argument for option %s", option);
+    {
+      fprintf(stderr, "\nInvalid numeric argument for option %s\n", option);
+      exit(1);
+    }
   return temp;
 }
 
 void args_show()
 {
+#ifdef __x86_64__
   cpu_features_show();
+#endif
+
   fprintf(logfile, "Database file:     %s\n", input_filename);
   fprintf(logfile, "Output file:       %s\n", opt_output_file);
   if (opt_statistics_file)
     fprintf(logfile, "Statistics file:   %s\n", opt_statistics_file);
   if (opt_uclust_file)
     fprintf(logfile, "Uclust file:       %s\n", opt_uclust_file);
-  fprintf(logfile, "Resolution (d):    %ld\n", opt_differences);
-  fprintf(logfile, "Threads:           %ld\n", opt_threads);
+  fprintf(logfile, "Resolution (d):    %" PRId64 "\n", opt_differences);
+  fprintf(logfile, "Threads:           %" PRId64 "\n", opt_threads);
 
   if (opt_differences > 1)
     {
       fprintf(logfile,
-              "Scores:            match: %ld, mismatch: %ld\n",
+              "Scores:            match: %" PRId64 ", mismatch: %" PRId64 "\n",
               opt_match_reward, opt_mismatch_penalty);
       fprintf(logfile,
-              "Gap penalties:     opening: %ld, extension: %ld\n",
+              "Gap penalties:     opening: %" PRId64 ", extension: %" PRId64 "\n",
               opt_gap_opening_penalty, opt_gap_extension_penalty);
       fprintf(logfile,
-              "Converted costs:   mismatch: %ld, gap opening: %ld, "
-              "gap extension: %ld\n",
+              "Converted costs:   mismatch: %" PRId64 ", gap opening: %" PRId64 ", "
+              "gap extension: %" PRId64 "\n",
               penalty_mismatch, penalty_gapopen, penalty_gapextend);
     }
   fprintf(logfile, "Break OTUs:        %s\n",
           opt_no_otu_breaking ? "No" : "Yes");
   if (opt_fastidious)
-    fprintf(logfile, "Fastidious:        Yes, with boundary %ld\n",
+    fprintf(logfile, "Fastidious:        Yes, with boundary %" PRId64 "\n",
             opt_boundary);
   else
     fprintf(logfile, "Fastidious:        No\n");
@@ -213,20 +241,20 @@ void args_usage()
   fprintf(stderr, " -n, --no-otu-breaking               never break OTUs (not recommended!)\n");
   fprintf(stderr, "\n");
   fprintf(stderr, "Fastidious options (only when d = 1):\n");
-  fprintf(stderr, " -b, --boundary INTEGER              min mass of large OTU for fastidious (3)\n");
-  fprintf(stderr, " -c, --ceiling INTEGER               max memory in MB used for fastidious\n");
+  fprintf(stderr, " -b, --boundary INTEGER              min mass of large OTUs (3)\n");
+  fprintf(stderr, " -c, --ceiling INTEGER               max memory in MB for Bloom filter (unlim.)\n");
   fprintf(stderr, " -f, --fastidious                    link nearby low-abundance swarms\n");
   fprintf(stderr, " -y, --bloom-bits INTEGER            bits used per Bloom filter entry (16)\n");
   fprintf(stderr, "\n");
   fprintf(stderr, "Input/output options:\n");
   fprintf(stderr, " -a, --append-abundance INTEGER      value to use when abundance is missing\n");
-  fprintf(stderr, " -i, --internal-structure FILENAME   write internal swarm structure to file\n");
+  fprintf(stderr, " -i, --internal-structure FILENAME   write internal OTU structure to file\n");
   fprintf(stderr, " -l, --log FILENAME                  log to file, not to stderr\n");
-  fprintf(stderr, " -o, --output-file FILENAME          output result filename (stdout)\n");
-  fprintf(stderr, " -r, --mothur                        output in mothur list file format\n");
+  fprintf(stderr, " -o, --output-file FILENAME          output result to file (stdout)\n");
+  fprintf(stderr, " -r, --mothur                        output using mothur-like format\n");
   fprintf(stderr, " -s, --statistics-file FILENAME      dump OTU statistics to file\n");
-  fprintf(stderr, " -u, --uclust-file FILENAME          output in UCLUST-like format to file\n");
-  fprintf(stderr, " -w, --seeds FILENAME                write seed seqs with abundances to FASTA\n");
+  fprintf(stderr, " -u, --uclust-file FILENAME          output using UCLUST-like format to file\n");
+  fprintf(stderr, " -w, --seeds FILENAME                write OTU representatives to FASTA file\n");
   fprintf(stderr, " -z, --usearch-abundance             abundance annotation in usearch style\n");
   fprintf(stderr, "\n");
   fprintf(stderr, "Pairwise alignment advanced options (only when d > 1):\n");
@@ -234,8 +262,11 @@ void args_usage()
   fprintf(stderr, " -p, --mismatch-penalty INTEGER      penalty for nucleotide mismatch (4)\n");
   fprintf(stderr, " -g, --gap-opening-penalty INTEGER   gap open penalty (12)\n");
   fprintf(stderr, " -e, --gap-extension-penalty INTEGER gap extension penalty (4)\n");
+
+#ifndef _WIN32
   fprintf(stderr, "\n");
   fprintf(stderr, "See 'man swarm' for more details.\n");
+#endif
 }
 
 void show_header()
@@ -243,8 +274,8 @@ void show_header()
   char title[] = "Swarm " SWARM_VERSION;
   char ref[] = "Copyright (C) 2012-2019 Torbjorn Rognes and Frederic Mahe";
   char url[] = "https://github.com/torognes/swarm";
-  fprintf(logfile, "%s [%s %s]\n%s\n%s\n\n",
-          title, __DATE__, __TIME__, ref, url);
+  fprintf(logfile, "%s\n%s\n%s\n\n",
+          title, ref, url);
   fprintf(logfile, "Mahe F, Rognes T, Quince C, de Vargas C, Dunthorn M (2014)\n");
   fprintf(logfile, "Swarm: robust and fast clustering method for amplicon-based studies\n");
   fprintf(logfile, "PeerJ 2:e593 https://doi.org/10.7717/peerj.593\n");
@@ -271,17 +302,17 @@ void args_init(int argc, char **argv)
   opt_gap_extension_penalty = 4;
   opt_gap_opening_penalty = 12;
   opt_help = 0;
-  opt_internal_structure = 0;
-  opt_log = 0;
+  opt_internal_structure = nullptr;
+  opt_log = nullptr;
   opt_match_reward = 5;
   opt_mismatch_penalty = 4;
   opt_mothur = 0;
   opt_no_otu_breaking = 0;
-  opt_output_file = 0;
-  opt_seeds = 0;
-  opt_statistics_file = 0;
+  opt_output_file = DASH_FILENAME;
+  opt_seeds = nullptr;
+  opt_statistics_file = nullptr;
   opt_threads = 1;
-  opt_uclust_file = 0;
+  opt_uclust_file = nullptr;
   opt_usearch_abundance = 0;
   opt_version = 0;
 
@@ -293,29 +324,29 @@ void args_init(int argc, char **argv)
 
   static struct option long_options[] =
   {
-    {"append-abundance",      required_argument, NULL, 'a' },
-    {"boundary",              required_argument, NULL, 'b' },
-    {"ceiling",               required_argument, NULL, 'c' },
-    {"differences",           required_argument, NULL, 'd' },
-    {"gap-extension-penalty", required_argument, NULL, 'e' },
-    {"fastidious",            no_argument,       NULL, 'f' },
-    {"gap-opening-penalty",   required_argument, NULL, 'g' },
-    {"help",                  no_argument,       NULL, 'h' },
-    {"internal-structure",    required_argument, NULL, 'i' },
-    {"log",                   required_argument, NULL, 'l' },
-    {"match-reward",          required_argument, NULL, 'm' },
-    {"no-otu-breaking",       no_argument,       NULL, 'n' },
-    {"output-file",           required_argument, NULL, 'o' },
-    {"mismatch-penalty",      required_argument, NULL, 'p' },
-    {"mothur",                no_argument,       NULL, 'r' },
-    {"statistics-file",       required_argument, NULL, 's' },
-    {"threads",               required_argument, NULL, 't' },
-    {"uclust-file",           required_argument, NULL, 'u' },
-    {"version",               no_argument,       NULL, 'v' },
-    {"seeds",                 required_argument, NULL, 'w' },
-    {"bloom-bits",            required_argument, NULL, 'y' },
-    {"usearch-abundance",     no_argument,       NULL, 'z' },
-    { 0, 0, 0, 0 }
+    {"append-abundance",      required_argument, nullptr, 'a' },
+    {"boundary",              required_argument, nullptr, 'b' },
+    {"ceiling",               required_argument, nullptr, 'c' },
+    {"differences",           required_argument, nullptr, 'd' },
+    {"gap-extension-penalty", required_argument, nullptr, 'e' },
+    {"fastidious",            no_argument,       nullptr, 'f' },
+    {"gap-opening-penalty",   required_argument, nullptr, 'g' },
+    {"help",                  no_argument,       nullptr, 'h' },
+    {"internal-structure",    required_argument, nullptr, 'i' },
+    {"log",                   required_argument, nullptr, 'l' },
+    {"match-reward",          required_argument, nullptr, 'm' },
+    {"no-otu-breaking",       no_argument,       nullptr, 'n' },
+    {"output-file",           required_argument, nullptr, 'o' },
+    {"mismatch-penalty",      required_argument, nullptr, 'p' },
+    {"mothur",                no_argument,       nullptr, 'r' },
+    {"statistics-file",       required_argument, nullptr, 's' },
+    {"threads",               required_argument, nullptr, 't' },
+    {"uclust-file",           required_argument, nullptr, 'u' },
+    {"version",               no_argument,       nullptr, 'v' },
+    {"seeds",                 required_argument, nullptr, 'w' },
+    {"bloom-bits",            required_argument, nullptr, 'y' },
+    {"usearch-abundance",     no_argument,       nullptr, 'z' },
+    {nullptr,                 0,                 nullptr, 0 }
   };
 
   int used_options[26] = { 0, 0, 0, 0, 0,
@@ -471,7 +502,6 @@ void args_init(int argc, char **argv)
         show_header();
         args_usage();
         exit(1);
-        break;
     }
   }
 
@@ -480,15 +510,18 @@ void args_init(int argc, char **argv)
 
   if ((opt_threads < 1) || (opt_threads > MAX_THREADS))
     {
-      fprintf(stderr, "\nError: Illegal number of threads specified with -t or --threads, must be in the range 1 to %d.\n", MAX_THREADS);
+      fprintf(stderr, "\nError: Illegal number of threads specified with "
+              "-t or --threads, must be in the range 1 to %d.\n", MAX_THREADS);
       exit(1);
     }
 
   if ((opt_differences < 0) || (opt_differences > 255))
-    fatal("Illegal number of differences specified with -d or --differences, must be in the range 0 to 255.");
+    fatal("Illegal number of differences specified with -d or --differences, "
+          "must be in the range 0 to 255.");
 
   if (opt_fastidious && (opt_differences != 1))
-    fatal("Fastidious mode (specified with -f or --fastidious) only works when the resolution (specified with -d or --differences) is 1.");
+    fatal("Fastidious mode (specified with -f or --fastidious) only works "
+          "when the resolution (specified with -d or --differences) is 1.");
 
   if (!opt_fastidious)
     {
@@ -513,55 +546,40 @@ void args_init(int argc, char **argv)
     }
 
   if (opt_gap_opening_penalty < 0)
-    fatal("Illegal gap opening penalty specified with -g or --gap-opening-penalty, must not be negative.");
+    fatal("Illegal gap opening penalty specified with -g or "
+          "--gap-opening-penalty, must not be negative.");
 
   if (opt_gap_extension_penalty < 0)
-    fatal("Illegal gap extension penalty specified with -e or --gap-extension-penalty, must not be negative.");
+    fatal("Illegal gap extension penalty specified with -e or "
+          "--gap-extension-penalty, must not be negative.");
 
   if ((opt_gap_opening_penalty + opt_gap_extension_penalty) < 1)
-    fatal("Illegal gap penalties specified, the sum of the gap open and the gap extension penalty must be at least 1.");
+    fatal("Illegal gap penalties specified, the sum of the gap open and "
+          "the gap extension penalty must be at least 1.");
 
   if (opt_match_reward < 1)
-    fatal("Illegal match reward specified with -m or --match-reward, must be at least 1.");
+    fatal("Illegal match reward specified with -m or --match-reward, "
+          "must be at least 1.");
 
   if (opt_mismatch_penalty < 1)
-    fatal("Illegal mismatch penalty specified with -p or --mismatch-penalty, must be at least 1.");
+    fatal("Illegal mismatch penalty specified with -p or --mismatch-penalty, "
+          "must be at least 1.");
 
   if (opt_boundary < 2)
-    fatal("Illegal boundary specified with -b or --boundary, must be at least 2.");
+    fatal("Illegal boundary specified with -b or --boundary, "
+          "must be at least 2.");
 
   if (used_options[2] && ((opt_ceiling < 8) || (opt_ceiling > 1073741824)))
-    fatal("Illegal memory ceiling specified with -c or --ceiling, must be in the range 8 to 1073741824 MB.");
+    fatal("Illegal memory ceiling specified with -c or --ceiling, "
+          "must be in the range 8 to 1073741824 MB.");
 
   if ((opt_bloom_bits < 2) || (opt_bloom_bits > 64))
-    fatal("Illegal number of Bloom filter bits specified with -y or --bloom-bits, must be in the range 2 to 64.");
+    fatal("Illegal number of Bloom filter bits specified with -y or "
+          "--bloom-bits, must be in the range 2 to 64.");
 
   if (used_options[0] && (opt_append_abundance < 1))
-    fatal("Illegal abundance value specified with -a or --append-abundance, must be at least 1.");
-
-
-  /* replace filename "-" by "/dev/stdin" for input file options */
-
-  if (!strcmp(input_filename, DASH_FILENAME))
-    input_filename = STDIN_NAME;
-
-  /* replace filename "-" by "/dev/stdout" for output file options */
-
-  char * * stdout_options[] =
-    {
-      & opt_internal_structure,
-      & opt_log,
-      & opt_output_file,
-      & opt_statistics_file,
-      & opt_uclust_file,
-      & opt_seeds,
-      0
-    };
-
-  int o = 0;
-  while(char * * stdout_opt = stdout_options[o++])
-    if ((*stdout_opt) && (!strcmp(*stdout_opt, DASH_FILENAME)))
-      *stdout_opt = STDOUT_NAME;
+    fatal("Illegal abundance value specified with -a or --append-abundance, "
+          "must be at least 1.");
 }
 
 void open_files()
@@ -570,62 +588,47 @@ void open_files()
 
   if (opt_log)
     {
-      logfile = fopen(opt_log, "w");
+      logfile = fopen_output(opt_log);
       if (! logfile)
         fatal("Unable to open log file for writing.");
     }
-  else
-    logfile = stderr;
 
-  if (opt_output_file)
-    {
-      outfile = fopen(opt_output_file, "w");
-      if (! outfile)
-        fatal("Unable to open output file for writing.");
-    }
-  else
-    outfile = stdout;
+  outfile = fopen_output(opt_output_file);
+  if (! outfile)
+    fatal("Unable to open output file for writing.");
 
   if (opt_seeds)
     {
-      fp_seeds = fopen(opt_seeds, "w");
+      fp_seeds = fopen_output(opt_seeds);
       if (! fp_seeds)
         fatal("Unable to open seeds file for writing.");
     }
-  else
-    fp_seeds = 0;
 
   if (opt_statistics_file)
     {
-      statsfile = fopen(opt_statistics_file, "w");
+      statsfile = fopen_output(opt_statistics_file);
       if (! statsfile)
         fatal("Unable to open statistics file for writing.");
     }
-  else
-    statsfile = 0;
 
   if (opt_uclust_file)
     {
-      uclustfile = fopen(opt_uclust_file, "w");
+      uclustfile = fopen_output(opt_uclust_file);
       if (! uclustfile)
         fatal("Unable to open uclust file for writing.");
     }
-  else
-    uclustfile = 0;
 
   if (opt_internal_structure)
     {
-      internal_structure_file = fopen(opt_internal_structure, "w");
+      internal_structure_file = fopen_output(opt_internal_structure);
       if (! internal_structure_file)
         fatal("Unable to open internal structure file for writing.");
     }
-  else
-    internal_structure_file = 0;
 }
 
 void close_files()
 {
-  if (opt_internal_structure)
+  if (internal_structure_file)
     fclose(internal_structure_file);
 
   if (uclustfile)
@@ -634,22 +637,28 @@ void close_files()
   if (statsfile)
     fclose(statsfile);
 
-  if (opt_seeds)
+  if (fp_seeds)
     fclose(fp_seeds);
 
-  if (opt_output_file)
+  if (outfile)
     fclose(outfile);
 
-  if (opt_log)
+  if (logfile)
     fclose(logfile);
 }
 
 int main(int argc, char** argv)
 {
+  logfile = stderr;
+
+#ifdef __x86_64__
   cpu_features_detect();
 
   if (!sse2_present)
     fatal("This program requires a processor with SSE2 instructions.\n");
+#endif
+
+  arch_srandom(1);
 
   args_init(argc, argv);
 
@@ -682,16 +691,14 @@ int main(int argc, char** argv)
 
   db_read(input_filename);
 
-  fprintf(logfile, "Database info:     %lu nt", db_getnucleotidecount());
-  fprintf(logfile, " in %lu sequences,", db_getsequencecount());
-  fprintf(logfile, " longest %lu nt\n", db_getlongestsequence());
+  fprintf(logfile, "Database info:     %" PRIu64 " nt", db_getnucleotidecount());
+  fprintf(logfile, " in %u sequences,", db_getsequencecount());
+  fprintf(logfile, " longest %u nt\n", db_getlongestsequence());
 
   dbsequencecount = db_getsequencecount();
 
   score_matrix_init();
 
-  search_begin();
-
   switch (opt_differences)
     {
     case 0:
@@ -707,8 +714,6 @@ int main(int argc, char** argv)
       break;
     }
 
-  search_end();
-
   score_matrix_free();
 
   db_free();
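Note on the output handling in src/swarm.cc above: opt_output_file now defaults to DASH_FILENAME and every output stream goes through fopen_output(), while close_files() closes any non-null stream. The body of fopen_output() is not part of this excerpt, so the following is only a minimal sketch of one way such a helper could behave, assuming a POSIX system and assuming that "-" selects standard output. The name open_output_sketch is hypothetical.

```cpp
#include <cassert>
#include <cstdio>
#include <cstring>
#include <unistd.h>   // dup, close, STDOUT_FILENO (POSIX only)

// Hypothetical helper, NOT the project's fopen_output():
// "-" selects standard output, anything else is opened as a regular file.
FILE * open_output_sketch(const char * filename)
{
  assert(filename != nullptr);

  if (std::strcmp(filename, "-") == 0)
    {
      // Duplicate the descriptor so that fclose() on the returned stream
      // does not close the process-wide stdout.
      int fd = dup(STDOUT_FILENO);
      if (fd < 0)
        return nullptr;
      FILE * fp = fdopen(fd, "w");
      if (! fp)
        close(fd);
      return fp;
    }

  // Regular file: truncate or create, as fopen(..., "w") does.
  return std::fopen(filename, "w");
}
```

With a helper of this shape, a caller can fclose() the returned stream unconditionally, which is consistent with close_files() testing the FILE pointers rather than the option strings.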
diff --git a/src/swarm.h b/src/swarm.h
index b92deddb..77840ba0 100644
--- a/src/swarm.h
+++ b/src/swarm.h
@@ -1,7 +1,7 @@
 /*
     SWARM
 
-    Copyright (C) 2012-2017 Torbjorn Rognes and Frederic Mahe
+    Copyright (C) 2012-2019 Torbjorn Rognes and Frederic Mahe
 
     This program is free software: you can redistribute it and/or modify
     it under the terms of the GNU Affero General Public License as
@@ -21,6 +21,27 @@
     PO Box 1080 Blindern, NO-0316 Oslo, Norway
 */
 
+#include <inttypes.h>
+
+#ifndef PRIu64
+#ifdef _WIN32
+#define PRIu64 "I64u"
+#else
+#define PRIu64 "lu"
+#endif
+#endif
+
+#ifndef PRId64
+#ifdef _WIN32
+#define PRId64 "I64d"
+#else
+#define PRId64 "ld"
+#endif
+#endif
+
+//#define NDEBUG
+
+#include <assert.h>
 #include <stdio.h>
 #include <string.h>
 #include <pthread.h>
@@ -28,18 +49,31 @@
 #include <stdlib.h>
 #include <regex.h>
 #include <limits.h>
-#include <city.h>
 #include <stdarg.h>
+#include <fcntl.h>
+#include <unistd.h>
 #include <sys/types.h>
-#include <sys/resource.h>
 #include <sys/stat.h>
 
+#include "city.h"
+
 #ifdef __APPLE__
+#include <sys/resource.h>
 #include <sys/sysctl.h>
+#elif defined _WIN32
+#include <windows.h>
+#include <psapi.h>
 #else
+#include <sys/resource.h>
 #include <sys/sysinfo.h>
 #endif
 
+#ifdef __aarch64__
+
+#include <arm_neon.h>
+
+#elif defined __x86_64__
+
 #ifdef __SSE2__
 #include <emmintrin.h>
 #endif
@@ -48,13 +82,28 @@
 #include <tmmintrin.h>
 #endif
 
+#define CAST_m128i_ptr(x) (reinterpret_cast<__m128i*>(x))
+
+#elif defined __PPC__
+
+#ifdef __LITTLE_ENDIAN__
+#include <altivec.h>
+#else
+#error Big endian ppc64 CPUs not supported
+#endif
+
+#else
+
+#error Unknown architecture
+#endif
+
 /* constants */
 
 #ifndef LINE_MAX
 #define LINE_MAX 2048
 #endif
 
-#define SWARM_VERSION "2.2.2"
+#define SWARM_VERSION "3.0.0"
 #define WIDTH 32
 #define WIDTH_SHIFT 5
 #define BLOCKWIDTH 32
@@ -71,6 +120,10 @@
 #define MIN(x,y) ((x)<(y)?(x):(y))
 #endif
 
+#ifndef MAX
+#define MAX(x,y) ((x)>(y)?(x):(y))
+#endif
+
 #define QGRAMLENGTH 5
 #define QGRAMVECTORBITS (1<<(2*QGRAMLENGTH))
 #define QGRAMVECTORBYTES (QGRAMVECTORBITS/8)
@@ -88,14 +141,15 @@ struct seqinfo_s
 {
   char * header;
   char * seq;
+  uint64_t abundance;
+  uint64_t hdrhash;
+  uint64_t seqhash;
   int headerlen;
   unsigned int seqlen;
-  unsigned long abundance;
   unsigned int clusterid;
-  unsigned long hdrhash;
-  unsigned long seqhash;
   int abundance_start;
   int abundance_end;
+  int dummy; /* alignment padding only */
 };
 
 typedef struct seqinfo_s seqinfo_t;
@@ -105,8 +159,8 @@ extern qgramvector_t * qgrams;
 
 struct queryinfo
 {
-  unsigned long qno;
-  long len;
+  uint64_t qno;
+  int64_t len;
   char * seq;
 };
 
@@ -120,22 +174,22 @@ extern char * opt_output_file;
 extern char * opt_seeds;
 extern char * opt_statistics_file;
 extern char * opt_uclust_file;
-extern long opt_append_abundance;
-extern long opt_bloom_bits;
-extern long opt_boundary;
-extern long opt_ceiling;
-extern long opt_differences;
-extern long opt_fastidious;
-extern long opt_gap_extension_penalty;
-extern long opt_gap_opening_penalty;
-extern long opt_help;
-extern long opt_match_reward;
-extern long opt_mismatch_penalty;
-extern long opt_mothur;
-extern long opt_no_otu_breaking;
-extern long opt_threads;
-extern long opt_usearch_abundance;
-extern long opt_version;
+extern int64_t opt_append_abundance;
+extern int64_t opt_bloom_bits;
+extern int64_t opt_boundary;
+extern int64_t opt_ceiling;
+extern int64_t opt_differences;
+extern int64_t opt_fastidious;
+extern int64_t opt_gap_extension_penalty;
+extern int64_t opt_gap_opening_penalty;
+extern int64_t opt_help;
+extern int64_t opt_match_reward;
+extern int64_t opt_mismatch_penalty;
+extern int64_t opt_mothur;
+extern int64_t opt_no_otu_breaking;
+extern int64_t opt_threads;
+extern int64_t opt_usearch_abundance;
+extern int64_t opt_version;
 
 extern char * queryname;
 extern char * matrixname;
@@ -147,10 +201,10 @@ extern char map_ncbi_nt16[];
 extern char map_ncbi_aa[];
 extern char map_sound[];
 
-extern long penalty_factor;
-extern long penalty_gapextend;
-extern long penalty_gapopen;
-extern long penalty_mismatch;
+extern int64_t penalty_factor;
+extern int64_t penalty_gapextend;
+extern int64_t penalty_gapopen;
+extern int64_t penalty_mismatch;
 
 extern FILE * outfile;
 extern FILE * statsfile;
@@ -160,61 +214,79 @@ extern FILE * logfile;
 extern FILE * fp_seeds;
 
 
-extern long SCORELIMIT_7;
-extern long SCORELIMIT_8;
-extern long SCORELIMIT_16;
-extern long SCORELIMIT_32;
-extern long SCORELIMIT_63;
+extern int64_t SCORELIMIT_7;
+extern int64_t SCORELIMIT_8;
+extern int64_t SCORELIMIT_16;
+extern int64_t SCORELIMIT_32;
+extern int64_t SCORELIMIT_63;
 extern char BIAS;
 
-extern long mmx_present;
-extern long sse_present;
-extern long sse2_present;
-extern long sse3_present;
-extern long ssse3_present;
-extern long sse41_present;
-extern long sse42_present;
-extern long popcnt_present;
-extern long avx_present;
-extern long avx2_present;
+extern int64_t mmx_present;
+extern int64_t sse_present;
+extern int64_t sse2_present;
+extern int64_t sse3_present;
+extern int64_t ssse3_present;
+extern int64_t sse41_present;
+extern int64_t sse42_present;
+extern int64_t popcnt_present;
+extern int64_t avx_present;
+extern int64_t avx2_present;
 
 extern unsigned char * score_matrix_8;
 extern unsigned short * score_matrix_16;
-extern long * score_matrix_63;
+extern int64_t * score_matrix_63;
 
 extern char sym_nt[];
 
-extern unsigned long longestdbsequence;
+extern uint64_t longestdbsequence;
 
 extern queryinfo_t query;
 
-extern unsigned long duplicates_found;
+extern uint64_t duplicates_found;
+
+/* inline functions */
+
+inline unsigned char nt_extract(char * seq, uint64_t i)
+{
+  // Extract compressed nucleotide in sequence seq at position i
+  return (((reinterpret_cast<uint64_t*>(seq))[i >> 5]) >> ((i & 31) << 1)) & 3;
+}
+
+inline unsigned int nt_bytelength(unsigned int len)
+{
+  // Compute number of bytes used for compressed sequence of length len
+  return ((len+31) >> 5) << 3;
+}
 
 /* functions in util.cc */
 
-long gcd(long a, long b);
-void fatal(const char * msg);
-void fatal(const char * format, const char * message);
+int64_t gcd(int64_t a, int64_t b);
+[[ noreturn ]] void fatal(const char * msg);
 void * xmalloc(size_t size);
 void * xrealloc(void * ptr, size_t size);
-unsigned long hash_fnv_1a_64(unsigned char * s, unsigned long n);
-unsigned int hash_fnv_1a_32(unsigned char * s, unsigned long n);
-unsigned long hash_djb2(unsigned char * s, unsigned long n);
-unsigned long hash_djb2a(unsigned char * s, unsigned long n);
-unsigned long hash_cityhash64(unsigned char * s, unsigned long n);
-void progress_init(const char * prompt, unsigned long size);
-void progress_update(unsigned long progress);
+void xfree(void * ptr);
+uint64_t hash_fnv_1a_64(unsigned char * s, uint64_t n);
+unsigned int hash_fnv_1a_32(unsigned char * s, uint64_t n);
+uint64_t hash_djb2(unsigned char * s, uint64_t n);
+uint64_t hash_djb2a(unsigned char * s, uint64_t n);
+uint64_t hash_cityhash64(unsigned char * s, uint64_t n);
+uint64_t hash_xor64len(unsigned char * s, uint64_t n);
+uint64_t hash64shift(uint64_t key);
+void progress_init(const char * prompt, uint64_t size);
+void progress_update(uint64_t progress);
 void progress_done();
+FILE * fopen_input(const char * filename);
+FILE * fopen_output(const char * filename);
 
 /* functions in qgram.cc */
 
-void findqgrams(unsigned char * seq, unsigned long seqlen,
+void findqgrams(unsigned char * seq, uint64_t seqlen,
                 unsigned char * qgramvector);
-unsigned long qgram_diff(unsigned long a, unsigned long b);
-void qgram_diff_fast(unsigned long seed,
-                     unsigned long listlen,
-                     unsigned long * amplist,
-                     unsigned long * difflist);
+uint64_t qgram_diff(uint64_t a, uint64_t b);
+void qgram_diff_fast(uint64_t seed,
+                     uint64_t listlen,
+                     uint64_t * amplist,
+                     uint64_t * difflist);
 void qgram_diff_init();
 void qgram_diff_done();
 
@@ -222,47 +294,48 @@ void qgram_diff_done();
 
 void db_read(const char * filename);
 
-unsigned long db_getsequencecount();
-unsigned long db_getnucleotidecount();
+unsigned int db_getsequencecount();
+uint64_t db_getnucleotidecount();
+
+unsigned int db_getlongestheader();
+unsigned int db_getlongestsequence();
 
-unsigned long db_getlongestheader();
-unsigned long db_getlongestsequence();
+seqinfo_t * db_getseqinfo(uint64_t seqno);
 
-seqinfo_t * db_getseqinfo(unsigned long seqno);
+char * db_getsequence(uint64_t seqno);
+unsigned int db_getsequencelen(uint64_t seqno);
 
-char * db_getsequence(unsigned long seqno);
-unsigned long db_getsequencelen(unsigned long seqno);
+uint64_t db_gethash(uint64_t seqno);
 
-void db_getsequenceandlength(unsigned long seqno,
+void db_getsequenceandlength(uint64_t seqno,
                              char ** address,
-                             long * length);
+                             unsigned int * length);
 
-char * db_getheader(unsigned long seqno);
-unsigned long db_getheaderlen(unsigned long seqno);
+char * db_getheader(uint64_t seqno);
+unsigned int db_getheaderlen(uint64_t seqno);
 
-unsigned long db_getabundance(unsigned long seqno);
+uint64_t db_getabundance(uint64_t seqno);
 
-void db_showsequence(unsigned long seqno);
 void db_showall();
 void db_free();
 
-void db_putseq(long seqno);
+void db_putseq(int64_t seqno);
 
 void db_qgrams_init();
 void db_qgrams_done();
 
-void db_fprintseq(FILE * fp, int a, int width);
+void db_fprintseq(FILE * fp, unsigned int a, unsigned int width);
 
-inline unsigned char * db_getqgramvector(unsigned long seqno)
+inline unsigned char * db_getqgramvector(uint64_t seqno)
 {
-  return (unsigned char*)(qgrams + seqno);
+  return reinterpret_cast<unsigned char*>(qgrams + seqno);
 }
 
-void fprint_id(FILE * stream, unsigned long x);
-void fprint_id_noabundance(FILE * stream, unsigned long x);
+void fprint_id(FILE * stream, uint64_t x);
+void fprint_id_noabundance(FILE * stream, uint64_t x);
 void fprint_id_with_new_abundance(FILE * stream,
-                                  unsigned long seqno,
-                                  unsigned long abundance);
+                                  uint64_t seqno,
+                                  uint64_t abundance);
 
 
 /* functions in ssse3.cc */
@@ -284,14 +357,14 @@ void search8(BYTE * * q_start,
              BYTE * score_matrix,
              BYTE * dprofile,
              BYTE * hearray,
-             unsigned long sequences,
-             unsigned long * seqnos,
-             unsigned long * scores,
-             unsigned long * diffs,
-             unsigned long * alignmentlengths,
-             unsigned long qlen,
-             unsigned long dirbuffersize,
-             unsigned long * dirbuffer);
+             uint64_t sequences,
+             uint64_t * seqnos,
+             uint64_t * scores,
+             uint64_t * diffs,
+             uint64_t * alignmentlengths,
+             uint64_t qlen,
+             uint64_t dirbuffersize,
+             uint64_t * dirbuffer);
 
 
 /* functions in search16.cc */
@@ -302,33 +375,33 @@ void search16(WORD * * q_start,
               WORD * score_matrix,
               WORD * dprofile,
               WORD * hearray,
-              unsigned long sequences,
-              unsigned long * seqnos,
-              unsigned long * scores,
-              unsigned long * diffs,
-              unsigned long * alignmentlengths,
-              unsigned long qlen,
-              unsigned long dirbuffersize,
-              unsigned long * dirbuffer);
+              uint64_t sequences,
+              uint64_t * seqnos,
+              uint64_t * scores,
+              uint64_t * diffs,
+              uint64_t * alignmentlengths,
+              uint64_t qlen,
+              uint64_t dirbuffersize,
+              uint64_t * dirbuffer);
 
 
 /* functions in nw.cc */
 
 void nw(char * dseq,
-        char * dend,
+        int64_t dlen,
         char * qseq,
-        char * qend,
-        long * score_matrix,
-        unsigned long gapopen,
-        unsigned long gapextend,
-        unsigned long * nwscore,
-        unsigned long * nwdiff,
-        unsigned long * nwalignmentlength,
+        int64_t qlen,
+        int64_t * score_matrix,
+        int64_t gapopen,
+        int64_t gapextend,
+        int64_t * nwscore,
+        int64_t * nwdiff,
+        int64_t * nwalignmentlength,
         char ** nwalignment,
         unsigned char * dir,
-        unsigned long * hearray,
-        unsigned long queryno,
-        unsigned long dbseqno);
+        int64_t * hearray,
+        uint64_t queryno,
+        uint64_t dbseqno);
 
 
 /* functions in matrix.cc */
@@ -339,14 +412,14 @@ void score_matrix_free();
 
 /* functions in scan.cc */
 
-void search_all(unsigned long query_no);
-void search_do(unsigned long query_no, 
-               unsigned long listlength,
-               unsigned long * targets,
-               unsigned long * scores,
-               unsigned long * diffs,
-               unsigned long * alignlengths,
-               long bits);
+void search_all(uint64_t query_no);
+void search_do(uint64_t query_no,
+               uint64_t listlength,
+               uint64_t * targets,
+               uint64_t * scores,
+               uint64_t * diffs,
+               uint64_t * alignlengths,
+               int bits);
 void search_begin();
 void search_end();
 
@@ -364,12 +437,17 @@ void dereplicate();
 
 /* functions in arch.cc */
 
-unsigned long arch_get_memused();
-unsigned long arch_get_memtotal();
+uint64_t arch_get_memused();
+uint64_t arch_get_memtotal();
+void arch_srandom(unsigned int seed);
+uint64_t arch_random();
 
 
 /* new header files */
 
-#include "bitmap.h"
-#include "bloom.h"
 #include "threads.h"
+#include "zobrist.h"
+#include "bloompat.h"
+#include "bloomflex.h"
+#include "variants.h"
+#include "hashtable.h"
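Note on nt_extract() and nt_bytelength() in src/swarm.h above: both imply a layout of 32 nucleotides per 64-bit word, two bits each, lowest bits first. The sketch below illustrates that arithmetic on a typed buffer. The packer nt_pack_sketch and the A=0, C=1, G=2, T=3 encoding are assumptions made for illustration; the project's actual encoding and packing code are not shown in this excerpt.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Assumed 2-bit code (illustrative only): A=0, C=1, G=2, anything else=3.
static unsigned int nt_code_sketch(char c)
{
  switch (c)
    {
    case 'A': return 0;
    case 'C': return 1;
    case 'G': return 2;
    default:  return 3;
    }
}

// Hypothetical packer: 32 nucleotides per 64-bit word, lowest bits first.
std::vector<uint64_t> nt_pack_sketch(const std::string & seq)
{
  std::vector<uint64_t> words((seq.size() + 31) >> 5, 0);
  for (uint64_t i = 0; i < seq.size(); i++)
    words[i >> 5] |=
      static_cast<uint64_t>(nt_code_sketch(seq[i])) << ((i & 31) << 1);
  return words;
}

// Same index arithmetic as nt_extract(), applied to a typed buffer.
unsigned char nt_extract_sketch(const std::vector<uint64_t> & words,
                                uint64_t i)
{
  assert((i >> 5) < words.size());
  return (words[i >> 5] >> ((i & 31) << 1)) & 3;
}

// Same arithmetic as nt_bytelength(): whole 8-byte words, rounded up.
unsigned int nt_bytelength_sketch(unsigned int len)
{
  return ((len + 31) >> 5) << 3;
}
```

A sequence of 1 to 32 nucleotides therefore occupies 8 bytes, and 33 to 64 nucleotides occupy 16 bytes.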
diff --git a/src/threads.h b/src/threads.h
index e6b017a7..c95b43a2 100644
--- a/src/threads.h
+++ b/src/threads.h
@@ -1,7 +1,7 @@
 /*
     SWARM
 
-    Copyright (C) 2012-2017 Torbjorn Rognes and Frederic Mahe
+    Copyright (C) 2012-2019 Torbjorn Rognes and Frederic Mahe
 
     This program is free software: you can redistribute it and/or modify
     it under the terms of the GNU Affero General Public License as
@@ -24,24 +24,24 @@
 class ThreadRunner
 {
 private:
-  
-  long thread_count;
+
+  int64_t thread_count;
 
   pthread_attr_t attr;
 
   struct thread_s
   {
-    long t;
-    void (*fun)(long t);
+    int64_t t;
+    void (*fun)(int64_t t);
     pthread_t pthread;
     pthread_mutex_t workmutex;
     pthread_cond_t workcond;
-    int work; /* 1: work available, 0: wait, -1: quit */
+    int64_t work; /* 1: work available, 0: wait, -1: quit */
   } * thread_array;
 
   static void * worker(void * vp)
   {
-    struct thread_s * tip = (struct thread_s *) vp;
+    struct thread_s * tip = static_cast<struct thread_s *>(vp);
 
     pthread_mutex_lock(&tip->workmutex);
 
@@ -61,35 +61,35 @@ class ThreadRunner
       }
 
     pthread_mutex_unlock(&tip->workmutex);
-    return 0;
+    return nullptr;
   }
 
 public:
 
-  ThreadRunner(int t, void (*f)(long t))
+  ThreadRunner(int t, void (*f)(int64_t t))
   {
     thread_count = t;
 
     pthread_attr_init(&attr);
     pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_JOINABLE);
-        
+
     /* allocate memory for thread data */
-    thread_array = (struct thread_s *) 
-      xmalloc(thread_count * sizeof(struct thread_s));
-  
+    thread_array = static_cast<struct thread_s *>
+      (xmalloc(static_cast<uint64_t>(thread_count) * sizeof(struct thread_s)));
+
     /* init and create worker threads */
-    for(long i=0; i<thread_count; i++)
+    for(int64_t i=0; i<thread_count; i++)
       {
         struct thread_s * tip = thread_array + i;
         tip->t = i;
         tip->work = 0;
         tip->fun = f;
-        pthread_mutex_init(&tip->workmutex, NULL);
-        pthread_cond_init(&tip->workcond, NULL);
+        pthread_mutex_init(&tip->workmutex, nullptr);
+        pthread_cond_init(&tip->workcond, nullptr);
         if (pthread_create(&tip->pthread,
                            &attr,
                            worker,
-                           (void*)(thread_array + i)))
+                           static_cast<void*>(thread_array + i)))
           fatal("Cannot create thread");
       }
   }
@@ -101,10 +101,10 @@ class ThreadRunner
     /* destroy threads */
     /* finish and clean up worker threads */
 
-    for(long i=0; i<thread_count; i++)
+    for(int64_t i=0; i<thread_count; i++)
       {
         struct thread_s * tip = thread_array + i;
-      
+
         /* tell worker to quit */
         pthread_mutex_lock(&tip->workmutex);
         tip->work = -1;
@@ -112,21 +112,21 @@ class ThreadRunner
         pthread_mutex_unlock(&tip->workmutex);
 
         /* wait for worker to quit */
-        if (pthread_join(tip->pthread, NULL))
+        if (pthread_join(tip->pthread, nullptr))
           fatal("Cannot join thread");
 
         pthread_cond_destroy(&tip->workcond);
         pthread_mutex_destroy(&tip->workmutex);
       }
 
-    free(thread_array);
+    xfree(thread_array);
     pthread_attr_destroy(&attr);
   }
 
   void run()
   {
     /* wake up threads */
-    for(long i=0; i<thread_count; i++)
+    for(int64_t i=0; i<thread_count; i++)
       {
         struct thread_s * tip = thread_array + i;
         pthread_mutex_lock(&tip->workmutex);
@@ -136,7 +136,7 @@ class ThreadRunner
       }
 
     /* wait for threads to finish their work */
-    for(long i=0; i<thread_count; i++)
+    for(int64_t i=0; i<thread_count; i++)
       {
         struct thread_s * tip = thread_array + i;
         pthread_mutex_lock(&tip->workmutex);
diff --git a/src/util.cc b/src/util.cc
index b5f28426..012fe243 100644
--- a/src/util.cc
+++ b/src/util.cc
@@ -1,7 +1,7 @@
 /*
     SWARM
 
-    Copyright (C) 2012-2017 Torbjorn Rognes and Frederic Mahe
+    Copyright (C) 2012-2019 Torbjorn Rognes and Frederic Mahe
 
     This program is free software: you can redistribute it and/or modify
     it under the terms of the GNU Affero General Public License as
@@ -24,31 +24,34 @@
 #include "swarm.h"
 
 static const char * progress_prompt;
-static unsigned long progress_next;
-static unsigned long progress_size;
-static unsigned long progress_chunk;
-static const unsigned long progress_granularity = 200;
+static uint64_t progress_next;
+static uint64_t progress_size;
+static uint64_t progress_chunk;
+static const uint64_t progress_granularity = 200;
+const size_t memalignment = 16;
 
-void progress_init(const char * prompt, unsigned long size)
+void progress_init(const char * prompt, uint64_t size)
 {
   progress_prompt = prompt;
   progress_size = size;
-  progress_chunk = size < progress_granularity ? 
+  progress_chunk = size < progress_granularity ?
     1 : size / progress_granularity;
-  progress_next = 0;
+  progress_next = 1;
   if (opt_log)
     fprintf(logfile, "%s", prompt);
   else
     fprintf(logfile, "%s %.0f%%", prompt, 0.0);
 }
 
-void progress_update(unsigned long progress)
+void progress_update(uint64_t progress)
 {
   if ((!opt_log) && (progress >= progress_next))
     {
       fprintf(logfile, "  \r%s %.0f%%", progress_prompt,
-              100.0 * progress / progress_size);
+              100.0 * static_cast<double>(progress)
+              / static_cast<double>(progress_size));
       progress_next = progress + progress_chunk;
+      fflush(logfile);
     }
 }
 
@@ -58,9 +61,10 @@ void progress_done()
     fprintf(logfile, " %.0f%%\n", 100.0);
   else
     fprintf(logfile, "  \r%s %.0f%%\n", progress_prompt, 100.0);
+  fflush(logfile);
 }
 
-long gcd(long a, long b)
+int64_t gcd(int64_t a, int64_t b)
 {
   if (b == 0)
   {
@@ -72,26 +76,23 @@ long gcd(long a, long b)
   }
 }
 
-void fatal(const char * msg)
+[[ noreturn ]] void fatal(const char * msg)
 {
   fprintf(stderr, "\nError: %s\n", msg);
   exit(1);
 }
 
-void fatal(const char * format, const char * message)
-{
-  fprintf(stderr, "\n");
-  fprintf(stderr, format, message);
-  fprintf(stderr, "\n");
-  exit(1);
-}
-
 void * xmalloc(size_t size)
 {
-  const size_t alignment = 16;
-  void * t = NULL;
-  if (posix_memalign(& t, alignment, size))
-    fatal("Unable to allocate enough memory.");
+  if (size == 0)
+    size = 1;
+  void * t = nullptr;
+#ifdef _WIN32
+  t = _aligned_malloc(size, memalignment);
+#else
+  if (posix_memalign(& t, memalignment, size))
+    t = nullptr;
+#endif
   if (!t)
     fatal("Unable to allocate enough memory.");
   return t;
@@ -99,24 +100,44 @@ void * xmalloc(size_t size)
 
 void * xrealloc(void *ptr, size_t size)
 {
+  if (size == 0)
+    size = 1;
+#ifdef _WIN32
+  void * t = _aligned_realloc(ptr, size, memalignment);
+#else
   void * t = realloc(ptr, size);
+#endif
   if (!t)
-    fatal("Unable to allocate enough memory.");
+    fatal("Unable to reallocate enough memory.");
   return t;
 }
 
+void xfree(void * ptr)
+{
+  if (ptr)
+    {
+#ifdef _WIN32
+      _aligned_free(ptr);
+#else
+      free(ptr);
+#endif
+    }
+  else
+    fatal("Trying to free a null pointer");
+}
+
 #if 0
 
 /* never used */
 
-unsigned long hash_fnv_1a_64(unsigned char * s, unsigned long n)
+uint64_t hash_fnv_1a_64(unsigned char * s, uint64_t n)
 {
-  const unsigned long fnv_offset = 14695981039346656037UL;
-  const unsigned long fnv_prime = 1099511628211; /* 2^40 - 435 */
+  const uint64_t fnv_offset = 14695981039346656037ULL;
+  const uint64_t fnv_prime = 1099511628211; /* 2^40 + 435 */
 
-  unsigned long hash = fnv_offset;
+  uint64_t hash = fnv_offset;
 
-  for(unsigned long i = 0; i < n; i++)
+  for(uint64_t i = 0; i < n; i++)
     {
       unsigned char c = *s++;
       hash = (hash ^ c) * fnv_prime;
@@ -125,14 +146,14 @@ unsigned long hash_fnv_1a_64(unsigned char * s, unsigned long n)
   return hash;
 }
 
-unsigned int hash_fnv_1a_32(unsigned char * s, unsigned long n)
+unsigned int hash_fnv_1a_32(unsigned char * s, uint64_t n)
 {
   const unsigned int fnv_offset = 2166136261;
   const unsigned int fnv_prime = 16777619;
 
   unsigned int hash = fnv_offset;
 
-  for(unsigned long i = 0; i < n; i++)
+  for(uint64_t i = 0; i < n; i++)
     {
       unsigned char c = *s++;
       hash = (hash ^ c) * fnv_prime;
@@ -141,13 +162,14 @@ unsigned int hash_fnv_1a_32(unsigned char * s, unsigned long n)
   return hash;
 }
 
-unsigned long hash_djb2(unsigned char * s, unsigned long n)
+
+uint64_t hash_djb2(unsigned char * s, uint64_t n)
 {
-  const unsigned long djb2_offset = 5381;
+  const uint64_t djb2_offset = 5381;
 
-  unsigned long hash = djb2_offset;
+  uint64_t hash = djb2_offset;
 
-  for(unsigned long i = 0; i < n; i++)
+  for(uint64_t i = 0; i < n; i++)
     {
       unsigned char c = *s++;
       hash = ((hash << 5) + hash) + c; /* hash = hash * 33 + c */
@@ -156,13 +178,13 @@ unsigned long hash_djb2(unsigned char * s, unsigned long n)
   return hash;
 }
 
-unsigned long hash_djb2a(unsigned char * s, unsigned long n)
+uint64_t hash_djb2a(unsigned char * s, uint64_t n)
 {
-  const unsigned long djb2_offset = 5381;
+  const uint64_t djb2_offset = 5381;
 
-  unsigned long hash = djb2_offset;
+  uint64_t hash = djb2_offset;
 
-  for(unsigned long i = 0; i < n; i++)
+  for(uint64_t i = 0; i < n; i++)
     {
       unsigned char c = *s++;
       hash = ((hash << 5) + hash) ^ c; /* hash = hash * 33 ^ c */
@@ -171,9 +193,69 @@ unsigned long hash_djb2a(unsigned char * s, unsigned long n)
   return hash;
 }
 
+uint64_t hash64shift(uint64_t key)
+{
+  key = (~key) + (key << 21); // key = (key << 21) - key - 1;
+  key = key ^ (key >> 24);
+  key = (key + (key << 3)) + (key << 8); // key * 265
+  key = key ^ (key >> 14);
+  key = (key + (key << 2)) + (key << 4); // key * 21
+  key = key ^ (key >> 28);
+  key = key + (key << 31);
+  return key;
+}
+
+uint64_t hash_xor64len(unsigned char * s, uint64_t n)
+{
+  uint64_t hash;
+
+  hash = 8 * n;
+  uint64_t * p = (uint64_t*) s;
+  for(uint64_t i = 0; i < n/8; i++)
+    hash ^= *p++;
+
+  // Only the lowest (right-most) bits are used for indexing the hash table.
+  // Make sure these bits are representative of all bits in the hash.
+
+  hash = hash64shift(hash);
+
+  return hash;
+}
+
 #endif
 
-unsigned long hash_cityhash64(unsigned char * s, unsigned long n)
+uint64_t hash_cityhash64(unsigned char * s, uint64_t n)
+{
+  return CityHash64(reinterpret_cast<const char *>(s), n);
+}
+
+
+FILE * fopen_input(const char * filename)
+{
+  /* open the input stream given by filename, but use stdin if name is - */
+  if (strcmp(filename, "-") == 0)
+    {
+      int fd = dup(STDIN_FILENO);
+      if (fd < 0)
+        return nullptr;
+      else
+        return fdopen(fd, "rb");
+    }
+  else
+    return fopen(filename, "rb");
+}
+
+FILE * fopen_output(const char * filename)
 {
-  return CityHash64((const char*)s, n);
+  /* open the output stream given by filename, but use stdout if name is - */
+  if (strcmp(filename, "-") == 0)
+    {
+      int fd = dup(STDOUT_FILENO);
+      if (fd < 0)
+        return nullptr;
+      else
+        return fdopen(fd, "w");
+    }
+  else
+    return fopen(filename, "w");
 }
diff --git a/src/variants.cc b/src/variants.cc
new file mode 100644
index 00000000..1833a219
--- /dev/null
+++ b/src/variants.cc
@@ -0,0 +1,251 @@
+/*
+    SWARM
+
+    Copyright (C) 2012-2019 Torbjorn Rognes and Frederic Mahe
+
+    This program is free software: you can redistribute it and/or modify
+    it under the terms of the GNU Affero General Public License as
+    published by the Free Software Foundation, either version 3 of the
+    License, or (at your option) any later version.
+
+    This program is distributed in the hope that it will be useful,
+    but WITHOUT ANY WARRANTY; without even the implied warranty of
+    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+    GNU Affero General Public License for more details.
+
+    You should have received a copy of the GNU Affero General Public License
+    along with this program.  If not, see <http://www.gnu.org/licenses/>.
+
+    Contact: Torbjorn Rognes <torognes@ifi.uio.no>,
+    Department of Informatics, University of Oslo,
+    PO Box 1080 Blindern, NO-0316 Oslo, Norway
+*/
+
+#include "swarm.h"
+
+inline void nt_set(char * seq, unsigned int pos, unsigned int base)
+{
+  unsigned int whichlong = pos >> 5;
+  uint64_t shift = (pos & 31) << 1;
+  uint64_t mask = 3ULL << shift;
+  uint64_t x = (reinterpret_cast<uint64_t *>(seq))[whichlong];
+  x &= ~ mask;
+  x |= (static_cast<uint64_t>(base)) << shift;
+  (reinterpret_cast<uint64_t *>(seq))[whichlong] = x;
+}
+
+inline void seq_copy(char * a,
+                     unsigned int a_start,
+                     char * b,
+                     unsigned int b_start,
+                     unsigned int length)
+{
+  /* copy part of the compressed sequence b to a */
+  for(unsigned int i = 0; i < length; i++)
+    nt_set(a, a_start + i, nt_extract(b, b_start + i));
+}
+
+inline bool seq_identical(char * a,
+                          unsigned int a_start,
+                          char * b,
+                          unsigned int b_start,
+                          unsigned int length)
+{
+  /* compare parts of two compressed sequences a and b */
+  /* return false if different, true if identical */
+
+  for(unsigned int i = 0; i < length; i++)
+    if (nt_extract(a, a_start + i) != nt_extract(b, b_start + i))
+      return false;
+  return true;
+}
+
+void generate_variant_sequence(char * seed_sequence,
+                               unsigned int seed_seqlen,
+                               struct var_s * var,
+                               char * seq,
+                               unsigned int * seqlen)
+{
+  /* generate the actual sequence of a variant */
+
+  switch (var->type)
+    {
+    case identical:
+      memcpy(seq, seed_sequence, nt_bytelength(seed_seqlen));
+      * seqlen = seed_seqlen;
+      break;
+
+    case substitution:
+      memcpy(seq, seed_sequence, nt_bytelength(seed_seqlen));
+      nt_set(seq, var->pos, var->base);
+      * seqlen = seed_seqlen;
+      break;
+
+    case deletion:
+      seq_copy(seq, 0,
+               seed_sequence, 0,
+               var->pos);
+      seq_copy(seq, var->pos,
+               seed_sequence, var->pos + 1,
+               seed_seqlen - var->pos - 1);
+      * seqlen = seed_seqlen - 1;
+      break;
+
+    case insertion:
+      seq_copy(seq, 0,
+               seed_sequence, 0,
+               var->pos);
+      nt_set(seq, var->pos, var->base);
+      seq_copy(seq, var->pos + 1,
+               seed_sequence, var->pos,
+               seed_seqlen - var->pos);
+      * seqlen = seed_seqlen + 1;
+      break;
+
+    default:
+      fatal("Unknown variant");
+    }
+}
+
+
+bool check_variant(char * seed_sequence,
+                   unsigned int seed_seqlen,
+                   var_s * var,
+                   char * amp_sequence,
+                   unsigned int amp_seqlen)
+{
+  /* make sure seed with given variant is really identical to amp */
+  /* we know the hashes are identical */
+
+  switch (var->type)
+    {
+    case identical:
+      if (seed_seqlen != amp_seqlen)
+        return false;
+      return seq_identical(seed_sequence, 0,
+                           amp_sequence, 0,
+                           seed_seqlen);
+
+    case substitution:
+      if (seed_seqlen != amp_seqlen)
+        return false;
+      if (! seq_identical(seed_sequence, 0,
+                          amp_sequence, 0,
+                          var->pos))
+        return false;
+      if (nt_extract(amp_sequence, var->pos) != var->base)
+        return false;
+      return seq_identical(seed_sequence, var->pos + 1,
+                           amp_sequence,  var->pos + 1,
+                           seed_seqlen - var->pos - 1);
+
+    case deletion:
+      if ((seed_seqlen - 1) != amp_seqlen)
+        return false;
+      if (! seq_identical(seed_sequence, 0,
+                          amp_sequence, 0,
+                          var->pos))
+        return false;
+      return seq_identical(seed_sequence, var->pos + 1,
+                           amp_sequence,  var->pos,
+                           seed_seqlen - var->pos - 1);
+
+    case insertion:
+      if ((seed_seqlen + 1) != amp_seqlen)
+        return false;
+      if (! seq_identical(seed_sequence, 0,
+                          amp_sequence, 0,
+                          var->pos))
+        return false;
+      if (nt_extract(amp_sequence, var->pos) != var->base)
+        return false;
+      return seq_identical(seed_sequence, var->pos,
+                           amp_sequence,  var->pos + 1,
+                           seed_seqlen - var->pos);
+
+    default:
+      fatal("Unknown variant");
+    }
+}
+
+inline void add_variant(uint64_t hash,
+                        unsigned char type,
+                        unsigned int pos,
+                        unsigned char base,
+                        var_s * variant_list,
+                        unsigned int * variant_count)
+{
+#ifdef HASHSTATS
+  tries++;
+#endif
+  var_s * v = variant_list + (*variant_count)++;
+  v->hash = hash;
+  v->type = type;
+  v->pos = pos;
+  v->base = base;
+}
+
+void generate_variants(char * sequence,
+                       unsigned int seqlen,
+                       uint64_t hash,
+                       var_s * variant_list,
+                       unsigned int * variant_count,
+                       bool include_identical)
+{
+  /* identical non-variant */
+
+  if (include_identical)
+    add_variant(hash, identical, 0, 0, variant_list, variant_count);
+
+  /* substitutions */
+
+  for(unsigned int i = 0; i < seqlen; i++)
+    {
+      unsigned char base = nt_extract(sequence, i);
+      uint64_t hash1 = hash ^ zobrist_value(i, base);
+      for (unsigned char v = 0; v < 4; v ++)
+        if (v != base)
+          {
+            uint64_t hash2 = hash1 ^ zobrist_value(i, v);
+            add_variant(hash2, substitution, i, v,
+                        variant_list, variant_count);
+          }
+    }
+
+  /* deletions */
+
+  hash = zobrist_hash_delete_first(reinterpret_cast<unsigned char *>(sequence), seqlen);
+  add_variant(hash, deletion, 0, 0, variant_list, variant_count);
+  unsigned char base_deleted = nt_extract(sequence, 0);
+  for(unsigned int i = 1; i < seqlen; i++)
+    {
+      unsigned char v = nt_extract(sequence, i);
+      if (v != base_deleted)
+        {
+          hash ^= zobrist_value(i - 1, base_deleted) ^ zobrist_value(i - 1, v);
+          add_variant(hash, deletion, i, 0, variant_list, variant_count);
+          base_deleted = v;
+        }
+    }
+
+  /* insertions */
+
+  hash = zobrist_hash_insert_first(reinterpret_cast<unsigned char *>(sequence), seqlen);
+  for (unsigned char v = 0; v < 4; v++)
+    {
+      uint64_t hash1 = hash ^ zobrist_value(0, v);
+      add_variant(hash1, insertion, 0, v, variant_list, variant_count);
+    }
+  for (unsigned int i = 0; i < seqlen; i++)
+    {
+      unsigned char base = nt_extract(sequence, i);
+      hash ^= zobrist_value(i, base) ^ zobrist_value(i+1, base);
+      for (unsigned char v = 0; v < 4; v++)
+        if (v != base)
+          {
+            uint64_t hash1 = hash ^ zobrist_value(i + 1, v);
+            add_variant(hash1, insertion, i + 1, v,
+                        variant_list, variant_count);
+          }
+    }
+}
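Note on the packed representation used by `nt_set()` in src/variants.cc: each base takes 2 bits, so 32 bases fit in one 64-bit word; `pos >> 5` selects the word and `(pos & 31) << 1` the bit offset inside it. The following is a minimal stand-alone sketch of that layout, assuming a `std::vector<uint64_t>` as storage instead of swarm's `char *` buffers. `packed_set` and `packed_get` are hypothetical names; `packed_get` is only an assumed counterpart of `nt_extract()`, whose definition is not part of this diff.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Write the 2-bit code `base` (0..3) at position `pos`.
inline void packed_set(std::vector<uint64_t> & words,
                       unsigned int pos,
                       unsigned int base)
{
  const unsigned int word = pos >> 5;           // 32 bases per 64-bit word
  const unsigned int shift = (pos & 31U) << 1;  // 2 bits per base
  const uint64_t mask = uint64_t{3} << shift;
  // Clear the old 2-bit field, then OR in the new code.
  words[word] = (words[word] & ~mask)
    | (static_cast<uint64_t>(base & 3U) << shift);
}

// Read back the 2-bit code stored at position `pos`.
inline unsigned int packed_get(const std::vector<uint64_t> & words,
                               unsigned int pos)
{
  const unsigned int shift = (pos & 31U) << 1;
  return static_cast<unsigned int>((words[pos >> 5] >> shift) & 3U);
}
```

With this layout, position 33 lands in the second word at bit offset 2, and overwriting a position leaves its neighbours untouched.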
diff --git a/src/variants.h b/src/variants.h
new file mode 100644
index 00000000..70cc0f87
--- /dev/null
+++ b/src/variants.h
@@ -0,0 +1,57 @@
+/*
+    SWARM
+
+    Copyright (C) 2012-2019 Torbjorn Rognes and Frederic Mahe
+
+    This program is free software: you can redistribute it and/or modify
+    it under the terms of the GNU Affero General Public License as
+    published by the Free Software Foundation, either version 3 of the
+    License, or (at your option) any later version.
+
+    This program is distributed in the hope that it will be useful,
+    but WITHOUT ANY WARRANTY; without even the implied warranty of
+    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+    GNU Affero General Public License for more details.
+
+    You should have received a copy of the GNU Affero General Public License
+    along with this program.  If not, see <http://www.gnu.org/licenses/>.
+
+    Contact: Torbjorn Rognes <torognes@ifi.uio.no>,
+    Department of Informatics, University of Oslo,
+    PO Box 1080 Blindern, NO-0316 Oslo, Norway
+*/
+
+/* Variant information */
+
+#define identical 0
+#define substitution 1
+#define deletion 2
+#define insertion 3
+
+struct var_s
+{
+  uint64_t hash;
+  unsigned int pos;
+  unsigned char type;
+  unsigned char base;
+  unsigned short dummy; /* for alignment padding only */
+};
+
+void generate_variant_sequence(char * seed_sequence,
+                               unsigned int seed_seqlen,
+                               struct var_s * var,
+                               char * seq,
+                               unsigned int * seqlen);
+
+bool check_variant(char * seed_sequence,
+                   unsigned int seed_seqlen,
+                   struct var_s * var,
+                   char * amp_sequence,
+                   unsigned int amp_seqlen);
+
+void generate_variants(char * sequence,
+                       unsigned int seqlen,
+                       uint64_t hash,
+                       struct var_s * variant_list,
+                       unsigned int * variant_count,
+                       bool include_identical);
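`generate_variants()` appends to a caller-supplied `variant_list`, so the caller needs an upper bound on the number of entries. Reading the loops in src/variants.cc, a non-empty sequence of length L with r runs of identical adjacent bases appears to yield 3L substitutions, r deletions and 3L + 4 insertions, hence at most 7L + 4 variants (plus one when `include_identical` is set). The sketch below checks that count against a brute-force enumeration on plain ACGT strings. Both function names are hypothetical and not part of swarm.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <string>

// Number of d = 1 variants expected for `seq` (identical variant excluded),
// derived from the loop structure of generate_variants().
std::size_t expected_variant_count(const std::string & seq)
{
  std::size_t runs = seq.empty() ? 0 : 1;
  for (std::size_t i = 1; i < seq.size(); i++)
    if (seq[i] != seq[i - 1])
      runs++;
  const std::size_t substitutions = 3 * seq.size();
  const std::size_t deletions = runs;            // one per run of equal bases
  const std::size_t insertions = 3 * seq.size() + 4;
  return substitutions + deletions + insertions;
}

// Brute force: build every single-edit neighbour as a string and let the
// set remove duplicates. Assumes `seq` contains only A, C, G and T.
std::size_t brute_force_neighbour_count(const std::string & seq)
{
  const std::string alphabet = "ACGT";
  std::set<std::string> neighbours;
  for (std::size_t i = 0; i < seq.size(); i++)
    {
      for (char c : alphabet)
        if (c != seq[i])
          {
            std::string s = seq;
            s[i] = c;                            // substitution
            neighbours.insert(s);
          }
      std::string d = seq;
      d.erase(i, 1);                             // deletion
      neighbours.insert(d);
    }
  for (std::size_t i = 0; i <= seq.size(); i++)
    for (char c : alphabet)
      {
        std::string s = seq;
        s.insert(i, 1, c);                       // insertion
        neighbours.insert(s);
      }
  return neighbours.size();
}
```

If the two counts agree, the list produced for one seed contains no duplicate sequences, and a buffer of 7L + 5 entries is enough under the assumptions above.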
diff --git a/src/zobrist.cc b/src/zobrist.cc
new file mode 100644
index 00000000..001d7c23
--- /dev/null
+++ b/src/zobrist.cc
@@ -0,0 +1,119 @@
+/*
+    SWARM
+
+    Copyright (C) 2012-2019 Torbjorn Rognes and Frederic Mahe
+
+    This program is free software: you can redistribute it and/or modify
+    it under the terms of the GNU Affero General Public License as
+    published by the Free Software Foundation, either version 3 of the
+    License, or (at your option) any later version.
+
+    This program is distributed in the hope that it will be useful,
+    but WITHOUT ANY WARRANTY; without even the implied warranty of
+    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+    GNU Affero General Public License for more details.
+
+    You should have received a copy of the GNU Affero General Public License
+    along with this program.  If not, see <http://www.gnu.org/licenses/>.
+
+    Contact: Torbjorn Rognes <torognes@ifi.uio.no>,
+    Department of Informatics, University of Oslo,
+    PO Box 1080 Blindern, NO-0316 Oslo, Norway
+*/
+
+#include "swarm.h"
+
+uint64_t * zobrist_tab_base = nullptr;
+
+void zobrist_init(unsigned int n)
+{
+  /*
+    Generate 4n random 64-bit numbers, one for each of the four
+    bases at each position (0 to n-1) of a sequence.
+    The hash of a sequence is the XOR of the numbers selected by its bases.
+    The number n should be the length of the longest sequence to be
+    hashed, including potential additional insertions.
+
+    Each number is generated by XOR'ing together four 31-bit random
+    numbers, each shifted 16 bits further left than the next.
+  */
+
+  zobrist_tab_base = static_cast<uint64_t *>
+    (xmalloc(4 * n * sizeof(uint64_t)));
+
+  for (unsigned int i = 0; i < 4 * n; i++)
+    {
+      uint64_t z;
+      z = arch_random();
+      z <<= 16;
+      z ^= arch_random();
+      z <<= 16;
+      z ^= arch_random();
+      z <<= 16;
+      z ^= arch_random();
+      zobrist_tab_base[i] = z;
+    }
+}
+
+void zobrist_exit()
+{
+  xfree(zobrist_tab_base);
+}
+
+uint64_t zobrist_hash(unsigned char * s, unsigned int len)
+{
+  /* compute the Zobrist hash function of sequence s of length len. */
+  /* len is the actual number of bases in the sequence */
+  /* it is encoded in (len+3)/4 bytes */
+
+  uint64_t * q = reinterpret_cast<uint64_t *>(s);
+  uint64_t x = 0;
+  uint64_t z = 0;
+  for(unsigned int p = 0; p < len; p++)
+    {
+      if ((p & 31) == 0)
+        x = q[p / 32];
+      else
+        x >>= 2;
+      z ^= zobrist_value(p, x & 3);
+    }
+  return z;
+}
+
+uint64_t zobrist_hash_delete_first(unsigned char * s, unsigned int len)
+{
+  /* compute the Zobrist hash function of sequence s,
+     but delete the first base */
+
+  uint64_t * q = reinterpret_cast<uint64_t *>(s);
+  uint64_t x = q[0];
+  uint64_t z = 0;
+  for(unsigned int p = 1; p < len; p++)
+    {
+      if ((p & 31) == 0)
+        x = q[p / 32];
+      else
+        x >>= 2;
+      z ^= zobrist_value(p - 1, x & 3);
+    }
+  return z;
+}
+
+uint64_t zobrist_hash_insert_first(unsigned char * s, unsigned int len)
+{
+  /* compute the Zobrist hash function of sequence s,
+     but insert a gap (no value) before the first base */
+
+  uint64_t * q = reinterpret_cast<uint64_t *>(s);
+  uint64_t x = 0;
+  uint64_t z = 0;
+  for(unsigned int p = 0; p < len; p++)
+    {
+      if ((p & 31) == 0)
+        x = q[p / 32];
+      else
+        x >>= 2;
+      z ^= zobrist_value(p + 1, x & 3);
+    }
+  return z;
+}
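The property that makes this hash convenient for variant generation is that XOR is its own inverse: replacing the base at one position changes the hash by exactly two table entries, so the hash of a substitution variant can be derived in constant time from the hash of the seed. Here is a minimal sketch of that idea, assuming unpacked base codes (one value 0..3 per element) and `std::mt19937_64` in place of `arch_random()`. `ZobristTable` and its members are hypothetical names, not swarm's API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical stand-alone table: four random 64-bit values per position.
// swarm fills its table from arch_random(); std::mt19937_64 is a stand-in.
struct ZobristTable
{
  std::vector<uint64_t> tab;

  explicit ZobristTable(std::size_t n, uint64_t seed = 1)
    : tab(4 * n)
  {
    std::mt19937_64 rng(seed);
    for (uint64_t & z : tab)
      z = rng();
  }

  uint64_t value(std::size_t pos, unsigned char base) const
  {
    return tab[4 * pos + base];
  }

  // Full hash: XOR of the entries selected by every (position, base) pair.
  uint64_t hash(const std::vector<unsigned char> & codes) const
  {
    uint64_t z = 0;
    for (std::size_t p = 0; p < codes.size(); p++)
      z ^= value(p, codes[p]);
    return z;
  }

  // Hash after replacing old_base with new_base at pos, in constant time:
  // XOR the old entry out and the new entry in.
  uint64_t substitute(uint64_t h, std::size_t pos,
                      unsigned char old_base, unsigned char new_base) const
  {
    return h ^ value(pos, old_base) ^ value(pos, new_base);
  }
};
```

Applying `substitute` twice with the bases swapped returns the original hash, which is the same trick the substitution loop in `generate_variants()` relies on.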
diff --git a/src/zobrist.h b/src/zobrist.h
new file mode 100644
index 00000000..5bbf414a
--- /dev/null
+++ b/src/zobrist.h
@@ -0,0 +1,37 @@
+/*
+    SWARM
+
+    Copyright (C) 2012-2019 Torbjorn Rognes and Frederic Mahe
+
+    This program is free software: you can redistribute it and/or modify
+    it under the terms of the GNU Affero General Public License as
+    published by the Free Software Foundation, either version 3 of the
+    License, or (at your option) any later version.
+
+    This program is distributed in the hope that it will be useful,
+    but WITHOUT ANY WARRANTY; without even the implied warranty of
+    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+    GNU Affero General Public License for more details.
+
+    You should have received a copy of the GNU Affero General Public License
+    along with this program.  If not, see <http://www.gnu.org/licenses/>.
+
+    Contact: Torbjorn Rognes <torognes@ifi.uio.no>,
+    Department of Informatics, University of Oslo,
+    PO Box 1080 Blindern, NO-0316 Oslo, Norway
+*/
+
+extern uint64_t * zobrist_tab_base;
+
+void zobrist_init(unsigned int longest);
+
+void zobrist_exit();
+
+uint64_t zobrist_hash(unsigned char * s, unsigned int len);
+uint64_t zobrist_hash_delete_first(unsigned char * s, unsigned int len);
+uint64_t zobrist_hash_insert_first(unsigned char * s, unsigned int len);
+
+inline uint64_t zobrist_value(unsigned int pos, unsigned char x)
+{
+  return zobrist_tab_base[4 * pos + x];
+}
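`zobrist_value()` is also what lets the deletion loop in `generate_variants()` walk from one deletion hash to the next. When the deleted position moves from i-1 to i, only slot i-1 of the shortened sequence changes: it held base i and now holds base i-1. The sketch below illustrates that rolling update under stated assumptions: unpacked base codes (0..3), a table filled by `std::mt19937_64` rather than `arch_random()`, and hypothetical names throughout. Unlike swarm, it emits a hash for every position instead of skipping repeated bases.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

using Codes = std::vector<unsigned char>;

// Hypothetical table builder: four random 64-bit values per position.
std::vector<uint64_t> make_table(std::size_t n, uint64_t seed = 42)
{
  std::mt19937_64 rng(seed);
  std::vector<uint64_t> tab(4 * n);
  for (uint64_t & z : tab)
    z = rng();
  return tab;
}

// Plain Zobrist hash of an unpacked sequence.
uint64_t hash_codes(const std::vector<uint64_t> & tab, const Codes & s)
{
  uint64_t z = 0;
  for (std::size_t p = 0; p < s.size(); p++)
    z ^= tab[4 * p + s[p]];
  return z;
}

// Hashes of all single-base deletions of s; entry i is the hash of s
// without base i. Each entry is derived from the previous one.
std::vector<uint64_t> deletion_hashes(const std::vector<uint64_t> & tab,
                                      const Codes & s)
{
  std::vector<uint64_t> out;
  if (s.empty())
    return out;
  // Deleting base 0 shifts every remaining base one position to the left.
  uint64_t h = 0;
  for (std::size_t p = 1; p < s.size(); p++)
    h ^= tab[4 * (p - 1) + s[p]];
  out.push_back(h);
  for (std::size_t i = 1; i < s.size(); i++)
    {
      // Slot i-1 held s[i] while base i-1 was deleted; it now holds s[i-1].
      h ^= tab[4 * (i - 1) + s[i]] ^ tab[4 * (i - 1) + s[i - 1]];
      out.push_back(h);
    }
  return out;
}
```

When two adjacent bases are equal the two XOR terms cancel and the hash does not change, which is consistent with `generate_variants()` emitting only one deletion per run of identical bases.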