Merge branch 'swarm3'
torognes committed Oct 24, 2019
2 parents 02ad79a + 9aa56c5 commit c8f18cf
Showing 39 changed files with 5,168 additions and 4,006 deletions.
15 changes: 15 additions & 0 deletions .travis.yml
@@ -0,0 +1,15 @@
language: c++

os: linux

dist: bionic

compiler: gcc

before_install:
- sudo apt-get install -y valgrind

script:
- make
- export PATH=$PWD/bin:$PATH
- git clone https://github.com/frederic-mahe/swarm-tests.git && cd swarm-tests && bash ./run_all_tests.sh | tee tests.log && ! grep -q FAIL tests.log
58 changes: 31 additions & 27 deletions README.md
@@ -1,3 +1,5 @@
[![Build Status](https://travis-ci.org/torognes/swarm.svg?branch=swarm3)](https://travis-ci.org/torognes/swarm)

# swarm

A robust and fast clustering method for amplicon-based studies.
@@ -16,21 +18,32 @@ To help users, we describe
starting from raw fastq files, clustering with **swarm** and producing
a filtered OTU table.

swarm 2.0 introduces several novelties and improvements over swarm
swarm 3.0 introduces:
* a much faster default algorithm,
* a reduced memory footprint,
* binaries for Windows x86-64, GNU/Linux ARM 64, and GNU/Linux POWER8,
* an updated, hardened, and thoroughly tested code.

Please note that:
* strict dereplication of input sequences is now mandatory,
* \-\-seeds option (\-w) now outputs results sorted by decreasing
abundance, and then by alphabetical order of sequence labels.

swarm 2.0 introduced several novelties and improvements over swarm
1.0:
* built-in breaking phase now performed automatically,
* possibility to output OTU representatives in fasta format (option
`-w`),
* fast algorithm now used by default for *d* = 1 (linear time
complexity),
* a new option called *fastidious* that refines *d* = 1 results and
reduces the number of small OTUs,
reduces the number of small OTUs.

## Common misconceptions

**swarm** is a single-linkage clustering method, with some superficial
similarities with other clustering methods (e.g.,
[Huse et al, 2010](http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2909393/)). **swarm**'s
similarities with other clustering methods (e.g., [Huse et al,
2010](http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2909393/)). **swarm**'s
novelty is its iterative growth process and the use of sequence
abundance values to delineate OTUs. **swarm** properly delineates
large OTUs (high recall), and can distinguish OTUs with as little as
@@ -76,8 +89,8 @@ cgtcgtcgtcgtcgt

where sequence identifiers are unique and end with a value indicating
the number of occurrences of the sequence (e.g., `_1000`). Alternative
format is possible with the option `-z`, please see the
[user manual](https://github.com/torognes/swarm/blob/master/man/swarm_manual.pdf). Swarm
format is possible with the option `-z`, please see the [user
manual](https://github.com/torognes/swarm/blob/master/man/swarm_manual.pdf). Swarm
**requires** each fasta entry to present a number of occurrences to
work properly. That crucial information can be produced during the
[dereplication](#dereplication-mandatory) step.
@@ -87,7 +100,7 @@ Use `swarm -h` to get a short help, or see the
for a complete description of input/output formats and command line
options.

The memory footprint of **swarm** is roughly 1.6 times the size of the
The memory footprint of **swarm** is roughly 0.6 times the size of the
input fasta file. When using the fastidious option, memory footprint
can increase significantly. See options `-c` and `-y` to control and
cap swarm's memory consumption.
@@ -210,15 +223,10 @@ from two different sets have the same hash code, it means that the
sequences they represent are identical.

If for some reason your fasta entries don't have abundance values, and
you still want to run swarm, you can easily add fake abundance values:

```sh
sed '/^>/ s/$/_1/' amplicons.fasta > amplicons_with_abundances.fasta
```

Alternatively, you may specify a default abundance value with
**swarm**'s `--append-abundance` (`-a`) option to be used when
abundance information is missing from a sequence.
you still want to run swarm (not recommended), you can specify a
default abundance value with **swarm**'s `--append-abundance` (`-a`)
option to be used when abundance information is missing from a
sequence.


### Launch swarm ###
@@ -305,15 +313,6 @@ rm "${AMPLICONS}"
```


## Troubleshooting ##

If **swarm** exits with an error message saying `This program
requires a processor with SSE2`, your computer is too old to run
**swarm** (or based on a non x86-64 architecture). **swarm** only runs
on CPUs with the SSE2 instructions, i.e. most Intel and AMD CPUs
released since 2004.


## Citation ##

To cite **swarm**, please refer to:
@@ -333,7 +332,7 @@ You are welcome to:

* submit suggestions and bug-reports at: https://github.com/torognes/swarm/issues
* send a pull request on: https://github.com/torognes/swarm/
* compose a friendly e-mail to: Frédéric Mahé <mahe@rhrk.uni-kl.de> and Torbjørn Rognes <[email protected]>
* compose a friendly e-mail to: Frédéric Mahé <frederic.mahe@cirad.fr> and Torbjørn Rognes <[email protected]>


## Third-party pipelines ##
@@ -356,7 +355,7 @@ You are welcome to:
If you want to try alternative free and open-source clustering
methods, here are some links:

* [VSEARCH](https://github.com/torognes/vsearch)
* [vsearch](https://github.com/torognes/vsearch)
* [Oligotyping](http://merenlab.org/projects/oligotyping/)
* [DNAclust](http://dnaclust.sourceforge.net/)
* [Sumaclust](http://metabarcoding.org/sumatra)
@@ -365,6 +364,11 @@ methods, here are some links:

## Version history ##

### version 3.0 ###

**swarm** 3.0 is much faster when _d_ = 1, and consumes less memory.
Strict dereplication is now mandatory.

### version 2.2.2 ###

**swarm** 2.2.2 fixes a bug causing Swarm to wait forever in very rare
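
Note on the README changes above: strict dereplication is now mandatory, and every fasta header must end with an abundance annotation (`_1000` by default, or `;size=1000;` with `-z`). The following is a minimal Python sketch of what strict dereplication amounts to, assuming the records fit in memory as `(header, sequence)` pairs. The helper names `parse_header` and `dereplicate` are hypothetical and are not part of swarm; keeping the label of the first occurrence is an arbitrary choice made for this illustration.

```python
import re

# Hypothetical helpers for illustration only; they are not part of swarm.
_SIZE = re.compile(r"^(?P<label>.+);size=(?P<n>\d+);?$")  # usearch/vsearch style (-z)
_UNDERSCORE = re.compile(r"^(?P<label>.+)_(?P<n>\d+)$")   # default style, e.g. seq1_1000


def parse_header(header):
    """Split a fasta header (without '>') into (label, abundance)."""
    for pattern in (_SIZE, _UNDERSCORE):
        match = pattern.match(header)
        if match:
            return match.group("label"), int(match.group("n"))
    raise ValueError(f"no abundance annotation in header: {header!r}")


def dereplicate(records):
    """Merge identical sequences and sum their abundances.

    `records` is an iterable of (header, sequence) pairs. Sequences are
    compared case-insensitively. The label of the first occurrence is
    kept (an arbitrary choice made for this sketch).
    """
    merged = {}
    for header, sequence in records:
        label, abundance = parse_header(header)
        key = sequence.upper()
        if key in merged:
            first_label, total = merged[key]
            merged[key] = (first_label, total + abundance)
        else:
            merged[key] = (label, abundance)
    # Emit headers in swarm's default "label_abundance" style.
    return [(f"{label}_{total}", seq) for seq, (label, total) in merged.items()]
```

For real datasets, a dedicated dereplication tool is likely a better fit than this sketch, which only illustrates the invariant swarm 3.0 expects: each distinct sequence appears once, carrying its total abundance.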
58 changes: 41 additions & 17 deletions man/swarm.1
@@ -1,5 +1,5 @@
.\" ============================================================================
.TH swarm 1 "December 12, 2017" "version 2.2.2" "USER COMMANDS"
.TH swarm 1 "October 24, 2019" "version 3.0.0" "USER COMMANDS"
.\" ============================================================================
.SH NAME
swarm \(em find clusters of nearly-identical nucleotide amplicons
@@ -110,8 +110,9 @@ results obtained during the clustering process allows \fBswarm\fR to
avoid most of the amplicon comparisons needed in a naïve approach. To
speed up the remaining amplicon comparisons, \fBswarm\fR implements an
extremely fast Needleman-Wunsch algorithm making use of the Streaming
SIMD Extensions (SSE2) of modern x86-64 CPUs. If SSE2 instructions are
not available, \fBswarm\fR exits with an error message.
SIMD Extensions (SSE2) of modern x86-64 CPUs, or NEON instructions of
ARM-64 CPUs. If SSE2 instructions are not available, \fBswarm\fR exits
with an error message.
.PP
\fBswarm\fR can read nucleotide amplicons in fasta format from a
normal file or from the standard input (using a pipe or a
@@ -138,7 +139,19 @@ defined as a string of [ACGT] or [ACGU] symbols (case insensitive, 'U'
is replaced with 'T' internally), starting after the end of the header
line and ending before the next header line or the file end;
\fBswarm\fR silently removes newline symbols ('\\n' or '\\r') and
exits with an error message if any other symbol is present.
exits with an error message if any other symbol is present. Lastly, if
sequences are not all unique, i.e. were not properly dereplicated,
swarm will exit with an error message.
.PP
Clusters are written to output files (specified with \-i, \-o, \-s and
\-u) by decreasing abundance of their seed sequences, and then by
alphabetical order of seed sequence labels. An exception to that is
the \-w (\-\-seeds) output, which is sorted by decreasing \fIcluster
abundance\fR (sum of abundances of all sequences in the cluster), and
then by alphabetical order of seed sequence labels. This is
particularly useful for post-clustering steps, such as \fIde novo\fR
chimera detection, that require clusters to be sorted by decreasing
abundances.
.\" ----------------------------------------------------------------------------
.SS General options
.TP 9
@@ -286,7 +299,7 @@ in situations where writing to \fIstandard error\fR is problematic
output clustering results to \fIfilename\fR. Results consist of a list
of OTUs, one OTU per line. An OTU is a list of amplicon headers
separated by spaces. That output format can be modified by the option
\-\-mothur (\-r). Default is to write to standard output.
\-\-mothur (\-r). Default is to write to \fIstandard output\fR.
.TP
.B \-r\fP,\fB\ \-\-mothur
output clustering results in a format compatible with Mothur. That
@@ -305,7 +318,7 @@ total abundance of amplicons in the OTU,
.IP \n+[step].
label of the initial seed (header without abundance annotations),
.IP \n+[step].
initial seed abundance,
abundance of the initial seed,
.IP \n+[step].
number of amplicons with an abundance of 1 in the OTU,
.IP \n+[step].
@@ -363,13 +376,15 @@ output OTU representative sequences to \fIfilename\fR in fasta
format. The abundance value of each OTU representative is the sum of
the abundances of all the amplicons in the OTU. Fasta headers are
formated as follows: '>label_\fIinteger\fR',
or '>label;size=\fIinteger\fR;' if the \-z option is used.
or '>label;size=\fIinteger\fR;' if the \-z option is used, and
sequences are uppercased. Sequences are sorted by decreasing
abundance, and then by alphabetical order of sequence labels.
.TP
.B \-z\fP,\fB\ \-\-usearch\-abundance
accept amplicon abundance values in usearch/vsearch's style
(>label;size=\fIinteger\fR[;]). That option influences the abundance
annotation style used in swarm's standard output (\-o), as well as the
ouput of options \-r, \-u and \-w.
annotation style used in swarm's \fIstandard output\fR (\-o), as well
as the output of options \-r, \-u and \-w.
.LP
.\" ----------------------------------------------------------------------------
.SS Pairwise alignment advanced options
@@ -410,7 +425,7 @@ zcat myfile.fasta.gz | \\
\-t 4 \\
\-f \\
\-w myfile.representatives.fasta \\
\-o myfile.swarms
\-o /dev/null
.RE
.EE
.\" ============================================================================
@@ -475,7 +490,7 @@ License along with this program. If not, see
.\" ============================================================================
.SH SEE ALSO
\fBswipe\fR, an extremely fast Smith-Waterman database search tool by
Torbjørn Rognes (available from
Torbjørn Rognes (available at
.UR https://github.com/torognes/swipe
.UE ).
.PP
@@ -492,8 +507,17 @@ New features and important modifications of \fBswarm\fR (short lived
or minor bug releases are not mentioned):
.RS
.TP
.BR v3.0.0\~ "released October 24, 2019"
Version 3.0.0 introduces a faster algorithm for \fId\fR = 1, and a
reduced memory footprint. Swarm has been ported to Windows x86-64,
GNU/Linux ARM 64, and GNU/Linux POWER8. Internal code has been
modernized, hardened, and thoroughly tested. Strict dereplication of
input sequences is now mandatory. The \-\-seeds option (\-w) now
outputs results sorted by decreasing abundance, and then by
alphabetical order of sequence labels.
.TP
.BR v2.2.2\~ "released December 12, 2017"
Version 2.2.2 fixes a bug that would cause Swarm to wait forever in
Version 2.2.2 fixes a bug that would cause swarm to wait forever in
very rare cases when multiple threads were used.
.TP
.BR v2.2.1\~ "released October 27, 2017"
@@ -527,7 +551,7 @@ bug only applies when \fId\fR > 1.
.BR v2.1.10\~ "released December 22, 2016"
Version 2.1.10 fixes two bugs related to gap penalties of alignments.
The first bug may lead to wrong aligments and similarity percentages
reported in UCLUST (.uc) files. The second bug makes Swarm use a
reported in UCLUST (.uc) files. The second bug makes swarm use a
slightly higher gap extension penalty than specified. The default gap
extension penalty used have actually been 4.5 instead of 4.
.TP
@@ -679,10 +703,10 @@ not. Only basic SSE2 instructions are now required to run \fBswarm\fR.
.TP
.BR v1.2.4\~ "released January 30, 2014"
Version 1.2.4 introduces an option \-\-break\-swarms to output all
pairs of amplicons with \fId\fR differences to standard error. That
option is used by the companion script `swarm_breaker.py` to refine
\fBswarm\fR results. The syntax of the inline assembly code is changed
for compatibility with more compilers.
pairs of amplicons with \fId\fR differences to \fIstandard
error\fR. That option is used by the companion script
`swarm_breaker.py` to refine \fBswarm\fR results. The syntax of the
inline assembly code is changed for compatibility with more compilers.
.TP
.BR v1.2\~ "released May 16, 2013"
Version 1.2 greatly improves speed by using alignment-free comparisons
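
Note on the man page changes above: the `-w` (`--seeds`) output is described as sorted by decreasing cluster abundance, then by alphabetical order of seed sequence labels. That two-level ordering can be sketched in Python as follows. The function name `sort_seeds` and the in-memory layout of `clusters` are assumptions made for this illustration; swarm's own implementation may compare labels differently (for instance byte-wise), whereas Python compares strings by code point.

```python
def sort_seeds(clusters):
    """Order clusters as described for the -w (--seeds) output.

    `clusters` is an iterable of (seed_label, member_abundances) pairs,
    a hypothetical layout chosen for this sketch. Returns a list of
    (seed_label, cluster_abundance) pairs.
    """
    # Cluster abundance is the sum of abundances of all its sequences.
    totals = [(label, sum(abundances)) for label, abundances in clusters]
    # Negating the abundance yields a decreasing numeric order while
    # labels stay in increasing alphabetical order among ties.
    return sorted(totals, key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    example = [("b", [5, 5]), ("a", [10]), ("c", [20, 1])]
    for label, total in sort_seeds(example):
        print(f">{label}_{total}")
```

Using a single composite key avoids a separate reversal step, which would also reverse the alphabetical order of tied labels.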
Binary file modified man/swarm_manual.pdf
Binary file not shown.
16 changes: 7 additions & 9 deletions scripts/amplicon_contingency_table.py
@@ -1,15 +1,13 @@
#!/usr/bin/env python
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Read all fasta files and build a sorted amplicon contingency
table. Usage: python amplicon_contingency_table.py samples_*.fas
table. Usage: python3 amplicon_contingency_table.py samples_*.fas
"""

from __future__ import print_function

__author__ = "Frédéric Mahé <[email protected]>"
__date__ = "2016/03/12"
__version__ = "$Revision: 2.1"
__author__ = "Frédéric Mahé <[email protected]>"
__date__ = "2019/09/24"
__version__ = "$Revision: 3.0"

import os
import sys
@@ -35,7 +33,7 @@ def fasta_parse():
sample = os.path.basename(fasta_file)
sample = os.path.splitext(sample)[0]
samples[sample] = samples.get(sample, 0) + 1
with open(fasta_file, "rU") as fasta_file:
with open(fasta_file, "r") as fasta_file:
for line in fasta_file:
if line.startswith(">"):
amplicon, abundance = line.strip(">;\n").split(separator)
@@ -65,7 +63,7 @@ def main():
all_amplicons, amplicons2samples, samples = fasta_parse()

# Sort amplicons by decreasing abundance (and by amplicon name)
sorted_all_amplicons = sorted(all_amplicons.iteritems(),
sorted_all_amplicons = sorted(iter(all_amplicons.items()),
key=operator.itemgetter(1, 0))
sorted_all_amplicons.reverse()

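
Note on the Python 3 port above: `dict.items()` returns a view that `sorted()` accepts directly, so the `iter(...)` wrapper is optional. The sketch below, using made-up amplicon names and abundances, also shows a side effect of sorting in ascending order and then reversing: names end up in decreasing order among amplicons with equal abundance. The composite-key alternative is only a suggestion, not what the script does.

```python
import operator

# Made-up data for illustration: amplicon name -> total abundance.
all_amplicons = {"b": 10, "a": 10, "c": 3}

# As in the script: ascending sort on (abundance, name), then reversed.
# Ties on abundance therefore come out with names in decreasing order.
as_in_script = sorted(all_amplicons.items(), key=operator.itemgetter(1, 0))
as_in_script.reverse()

# Alternative: decreasing abundance, names in increasing order among ties.
names_ascending = sorted(all_amplicons.items(), key=lambda kv: (-kv[1], kv[0]))

print(as_in_script)
print(names_ascending)
```

Which tie-breaking order is preferable depends on what downstream steps expect; the two variants differ only for amplicons with identical abundances.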
29 changes: 12 additions & 17 deletions scripts/graph_plot.py
@@ -1,29 +1,24 @@
#!/usr/bin/env python
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Visualize the internal structure of a swarm (color vertices by
abundance). Requires the module igraph and python 2.7+.
Limitations: amplicons grafted with the fastidious option will be
discarded and will not be visualized.
abundance). Requires the module igraph and python 3.
"""

from __future__ import print_function

__author__ = "Frédéric Mahé <[email protected]>"
__date__ = "2016/11/09"
__version__ = "$Revision: 3.1"
__author__ = "Frédéric Mahé <[email protected]>"
__date__ = "2019/09/24"
__version__ = "$Revision: 4.0"

import sys
import os.path
from igraph import Graph, plot
from optparse import OptionParser

#*****************************************************************************#
# *************************************************************************** #
# #
# Functions #
# #
#*****************************************************************************#
# *************************************************************************** #


def option_parse():
@@ -76,7 +71,7 @@ def parse_files(swarms, internal_structure, OTU, drop):
"""
# List amplicon ids and abundances
amplicons = list()
with open(swarms, "rU") as swarms:
with open(swarms, "r") as swarms:
for i, swarm in enumerate(swarms):
if i == OTU - 1:
# Deal with ";size=" in a rather clumsy way... but it works
@@ -100,7 +95,7 @@ def parse_files(swarms, internal_structure, OTU, drop):

# List pairwise relations
relations = list()
with open(internal_structure, "rU") as internal_structure:
with open(internal_structure, "r") as internal_structure:
print("Parsing amplicon relationships", file=sys.stdout)
for line in internal_structure:
# Get the first four elements of the line
@@ -138,7 +133,7 @@ def build_graph(amplicons, relations):

amplicon_ids = [amplicon[0] for amplicon in amplicons]
abundances = [int(amplicon[1]) for amplicon in amplicons]
minimum, maximum = min(abundances), max(abundances)
maximum = max(abundances)

# Determine canvas size
if len(abundances) < 500:
@@ -214,11 +209,11 @@ def main():
return


#*****************************************************************************#
# *************************************************************************** #
# #
# Body #
# #
#*****************************************************************************#
# *************************************************************************** #

if __name__ == '__main__':
