SANS ambages

Symmetric Alignment-free phylogeNomic Splits
--- phylogenomics with Abundance-filter, Multi-threading and Bootstrapping on Amino-acid or GEnomic Sequences

Reference-free
Alignment-free
Input: assembled genomes / reads, or coding sequences / amino acid sequences
Output: phylogenetic splits or tree
NEW: Abundance-filter
NEW: Bootstrapping
NEW: Multi-threading
NEW: Labeled PDF and Nexus output
NEW: Better performance by handling singleton k-mers separately
NEW: Output core k-mers

Dos and Don'ts

The genomes should not be too diverged. SANS works well on species level.
Be careful with outliers and outgroups (for the reason above).
The sequences should not be too short. Provide whole-genome data or as many coding sequences as possible.
Be careful with viruses (for the reasons above).
Have a look at the network (weakly compatible or 2-tree). It does not make much sense to extract a tree, if the split network is a hairball.
Reconstructed phylogenies are unrooted, even though a Newick file (-N) suggests a root.
In case of problems, contact us (see below).

Publications

Rempel, A., Wittler, R.: SANS serif: alignment-free, whole-genome based phylogenetic reconstruction. Bioinformatics. (2021).

Wittler, R.: Alignment- and reference-free phylogenomics with colored de Bruijn graphs. Algorithms for Molecular Biology. 15: 4 (2020).

Wittler, R.: Alignment- and reference-free phylogenomics with colored de Bruijn graphs. In: Huber, K. and Gusfield, D. (eds.) Proceedings of WABI 2019. LIPIcs. 143, Schloss Dagstuhl--Leibniz-Zentrum fuer Informatik, Dagstuhl, Germany (2019).

Requirements

For the main program, there are no strict dependencies other than C++ version 14.
To read in compressed fasta/fastq files, it could be necessary to install zlib:

sudo apt install libz-dev

In addition, there are some optional features:

To read in a colored de Bruijn graph, SANS uses the API of Bifrost.
To convert the output into NEXUS format, the provided script requires Python 3.
To visualize the splits, we recommend the tool SplitsTree (version 4).

Windows is currently supported with full basic functionality but limited features only. Confer to branch "windows".

Compilation

git clone https://gitlab.ub.uni-bielefeld.de/gi/sans.git
cd sans
make

By default, the installation creates:

a binary (SANS)

You may want to make the binary (SANS) accessible via your PATH variable.

Optional: If Bifrost should be used, change the SANS makefile accordingly (easy to see how). Please note the installation instructions regarding the default maximum k-mer size of Bifrost in its README. If during the compilation, the Bifrost library files are not found, make sure that the corresponding folder is found as include path by the C++ compiler. You may have to add -I/usr/local/include (with the corresponding folder) to the compiler flags in the makefile. We also recommend to have a look at the FAQs of Bifrost.

Usage

Use SANS --help to obtain a detailed list of options.

Input files

Specify your input by -i <list> where <list> is either a file-of-files or in kmtricks format. Each file can be in fasta, multiple fasta or fastq format.

File-of-files:
```
genome_a.fa
genome_b.fa
...
```
Files can be in subfolders and/or compressed:
```
dataset_1/genome_a.fa.gz
dataset_1/genome_b.fa.gz
...
```
One genome can also be composed of several files (the first one will be used as identifier in the output):
```
reads_a_forward.fa reads_a_reverse.fa
genome_b_chr_1.fa genome_b_chr_2.fa
...
```
kmtricks format: In this format, you can specify individual identifiers and, optionally, abundance thresholds (see "read data as input"):
```
genome_A : reads_a_forward.fa ; reads_a_reverse.fa ! 2
genome_B : genome_b_chr_1.fa ; genome_b_chr_2.fa ! 1
...
```

Input paramters

genomes/assemblies as input: just use -i <list>
read data as input: to filter out k-mers of low abundance, either use -q 2 (or higher thresholds) to specify a global threshold for all input files, or use the kmtricks file-of-files format to specify (individual) thresholds.
mix of assemblies and read data as input: use the kmtricks file-of-files format to specify individual thresholds.
coding sequences as input: add -a if input is provided as translated sequences, or add -c if translation is required. See usage information (SANS --help) for further details.

Output

The output file, specified by -o <split-file>, lists all splits line by line, sorted by their weight in a tab-separated format where the first column is the split weight and the remaining entries are the identifiers of genomes split from the others.
For large data sets, the list of splits can become very long. We recommend to restrict the output for n genomes as input to the 10n strongest splits in the output using -t 10n.
We recommend to filter the splits using -f <filter>. Then, the sorted list of splits is greedily filtered, i.e., splits are iterated from strongest to weakest and a split is kept if and only if the filter criterion is met.

If you want a tree, use -f strict. In this case, -N <newick-file> can be used to write the resulting tree into a newick file; instead or additionally to -o <split-file>. If you want a network, use one of the following filters:

weakly: a split is kept if it is weakly compatible to all previously filtered splits (see publication for definition of "weak compatibility").
2-tree: two sets of compatible splits (=trees) are maintained. A split is added to the first if possible (compatible); if not to the second if possible. 3-tree: three sets of compatible splits (=trees) are maintained. A split is added to the first if possible (compatible); if not to the second if possible; if not to the third if possible.

To visualize the splits, we recommend the tool SplitsTree (version 4). Make it accessible via your PATH variabble to enable the following options.

To generate SANS output readable for SplitsTree, use option -X <nexus-file>. In SplitsTree, after opening the <nexus-file>, select "Draw" > "EqualAngle" > "Apply". To produce a PDF using SplitsTree, use option -p <pdf-file>.

To depict the phylogeny on a higher level, taxa can be assigned to groups. Each group is then represented by a color and individual text labels of taxa are replaced by colored circles accordingly. Use option -l <groups.tsv> to provide a mapping of some or all genome identifiers to arbitrary group names. Optionally, provide a second file <colors.tsv> to provide a mapping of group names to custom RGB values, e.g. 255 0 0 for red.

Further parameters

To observe the progress of SANS during computation, use -v to switch to verbose mode.
You may want to try different values for the k-mer length using -k <integer>. On shorter sequences, e.g. virus data, use a smaller k, e.g., -k 11.
If your input contains 'N's or other ambiguous IUPAC characters, affected k-mers are skipped by default. Option -x <small_integer> can be used to replace these with the corresponding DNA or AA bases, considering all possibilities.
By default, all available threads are used for parallel processing. The number of threads can be limited by -T <integer>.

Bootstrapping To assess the robustness of reconstructed splits with respect to noise in the input data, bootstrap replicates can be constructed by randomly varying the observed k-mer content. To compare the originally determined splits to, e.g., 1000 bootstrap replicates, use -b 1000. An additional output file <split-file>.bootstrap containing the bootstrap support values will be created. To include them in the nexus file for visualization, use scripts/sans2conf_nexus.py <split-file> <split-file>.bootstrap <list> > <nexus-file>.

To generate a consensus tree from bootstrapped trees, use -f tree -b 1000 -C. To generate a consensus network from bootstrapped trees, use -f tree -b 1000 -C weakly. See usage information (SANS --help) for further options.

Examples

Split network from assemblies

SANS -i list.txt -o sans.splits - X sans.nexus -t 10n -f weakly

Tree in newick format from assemblies

SANS -i list.txt -N sans.new -f strict

Split network (and tree) from read data

SANS -i list.txt -o sans.splits -X sans.nexus -t 10n -f weakly -q 2
(SANS -i list.txt -s sans.splits -f strict -N sans.new)

Drosophila example data

# go to example directory
cd example_data/drosophila

# download data: whole genome (or coding sequences)
./download_WG.sh
(./download_CDS.sh)

# compute splits
../../SANS -i WG_list.kmt -o WG_weakly.splits -f weakly -v
(../../SANS -i CDS_list.kmt -o CDS_weakly.splits -f weakly -v -c)

# generate PDF (if SplitsTree installed)
../../SANS -i WG_list.kmt -s WG_weakly.splits -p WG_weakly.pdf
(../../SANS -i CDS_list.kmt -s CDS_weakly.splits -p CDS_weakly.pdf)

# generate labeled PDF (if SplitsTree installed)
../../SANS -i WG_list.kmt -s WG_weakly.splits -l groups.tsv -p WG_weakly_groups.pdf
(../../SANS -i CDS_list.kmt -s CDS_weakly.splits -l groups.tsv -p CDS_weakly_groups.pdf)

# filter for tree
../../SANS -i WG_list.kmt -s WG_weakly.splits -N WG.new -f strict
(../../SANS -i CDS_list.kmt -s CDS_weakly.splits -N CDS.new -f strict)

# generate consensus network from bootstrapped trees
../../SANS -i WG_list.kmt -f strict -b 1000 -C weakly -p WG_weakly.pdf -v
(../../SANS -i CDS_list.kmt -f strict -b 1000 -C weakly -p CDS_weakly.pdf -v)

Virus example data

# go to example directory
cd example_data/prasinoviruses

# download data
./download.sh

# compute splits
../../SANS -i list.txt -o weakly.splits -f weakly -k 11 -v 

# compute splits and generate PDF (if SplitsTree installed)
../../SANS -i list.txt -p weakly.pdf -f weakly -k 11 -v

Performance evaluation on predicted open reading frames

SANS-serif can predict phylogenies based on amino acid sequences as input. Processing coding sequences is faster than processing whole genome data. Experiments show that the reconstruction quality is comparable.

If you want to use selected marker genes, the number of genes should be as high as possible to provide sufficient sequence information for extracting phylogenetic signals.

If the genomes at hand are not annotated, you can use a tool to predict open reading frames. The following experiment shows that the reconstruction quality does not suffer from such a simple pre-processing or even improves, while saving total running time, especially because genomes can easily be pre-processed in parallel.

The following tools have been used with the parameters shown in the table.

Tool	Parameters	Reference
SANS	-k (see below) -m geom2 -t 10n -filter strict [-a (if preprocessed)]
Getorf (EMBOSS)	-find (see below) -t 11	Gary Williams, 2000
ORFfinder	-n true -g 11	NCBI
Prodigal	-q	V2.6.3,Doug Hyatt, 2016

For estimating the reconstruction accuracy, the (weighted) F1 score has been determined as follows.

Measure
F1-score	harmonic mean of precision and recall
precision	(number of called splits that are also in reference) / (total number of called splits)
recall	(number of reference splits that are also called) / (total number of reference splits)
weighted precision	(total weight of called splits that are also in reference) / (total weight of all called splits)
weighted recall	(total weight of reference splits that are also called) / (total weight of all reference splits)

Further information on the datasets can be found in the initial publication of SANS (Wittler, 2019), see above.

Salmonella enterica Para C

220 genomes, k=31

Preprocessing	Running time pre-processing	Running time SANS	Running time both (parall. pre-proc., factor 16)	F1-Score (weighted)
none (whole genome)	--	610s	--	0.878 (0.999)
Getorf (-find 0)	247s	459s	706s (474s)	0.881 (0.999)
Getorf (-find 1)	199s	339s	538s (351s)	0.868 (0.999)
ORFfinder	5195s	166s	5361s (491s)	0.858 (0.997)
Prodigal	6292s	134s	6326s (521s)	0.853 (0.998)

Salmonella enterica subspecies enterica

2964 genomes, k=21

Preprocessing	Running time pre-processing	Running time SANS	Running time both (parall. pre-proc., factor 16)	F1-Score (weighted)
none (whole genome)	--	190min	--	0.587 (0.792)
Getorf (-find 0)	55min	220min	274min (223min)	0.624 (0.807)
Getorf (-find 1)	48min	165min	213min (168min)	0.620 (0.799)
ORFfinder	1621min	78min	1698min (179min)	0.594 (0.766)
Prodigal	1322min	50min	1372min (133min)	0.587 (0.792)

Clustering / dereplication of metagenome assembled genomes (MAGs)

For clustering of highly similar sequences, a tree can be constructed which is then chopped into many small subtrees such that the taxa in each subtree correspond to one cluster. This procedure has been successfully applied for dereplication of metagenome assembled genomes (MAGs). Here, the input are MAGs, and the goal is to cluster these such that the MAGs in each cluster belong to the same strain.

The general procedure is:

# reconstruct a tree
SANS --input <list_of_files> --newick <tree_to_cluster.new> --filter strict --kmer 15 (--verbose) --window W --top T (see below)

# determine clusters from tree
scripts/newick2clusters.py <tree_to_cluster.new> > <clusters.tsv>

Due to the usually very high number of input sequences, we recommend the usage of parameters --window (-w) and --top (-t) in order to save time and memory. (The experimental parameter --window is not mentioned in the usage, because it can lower the accuracy of reconstructed phylogenies considerably. But in this case, the reconstructed tree does not need to be an accurate phylogeny and the parameter has only reasonable effect on the clustering.)

Setting	Parameters
quick	--window 25 --top 50n
thoroughly	--window 10 --top 100n

The tree is chopped into clusters as follows:

Re-root tree to maximum degree node
In post order traversal:
- ignore non-branching node (merge edges)
- get clusters from sub-trees (recursively)
- if edge longer than parent edge:
  - remove found clusters from current leaf set
  - remaining leaf set =: new cluster

Dereplication efficiency and accuracy

This dereplication approach has been evaluated on a data set from the CAMI challenge [Meyer et al. Critical Assessment of Metagenome Interpretation - the second round of challenges, bioRxiv, 2021, doi: https://doi.org/10.1101/2021.07.12.451567], a simulated mouse gut metagenome representing 64 metagenome samples. Alexander Sczyrba and Peter Belmann filtered the MAGs, provided them for our clustering and compared the clustering to the gold standard.

Filter	# MAGs	# Strains
MIMAG medium	5,786	686
MIMAG high	2,510	349
no filter	11,602	791

For the comparison, each called cluster is mapped to a gold standard cluster with maximum intersection, i.e., maximum agreement of contained MAGs. Then, the purity and completeness of the clusters are determined by investigating the number of correct (TP), false (FP) and missing (FN) MAGs in each cluster:

purity := TP / (TP + FP)

completeness := TP / (TP + FN)

These per-cluster measures were then averaged (weighted and unweighted). The following table shows the results for the different input and clustering settings.

Input	Setting	Running time	Memory	average purity (weighted)	average completeness (weighted)
MIMAG medium	quick	11h	56G	0.973 (0.956)	0.884 (0.998)
	thoroughly	50h	130G	0.972 (0.952)	0.881 (0.998)
MIMAG high	quick	2h	16G	0.983 (0.978)	0.890 (0.991)
	thoroughly	6h	36G	0.983 (0.979)	0.912 (0.993)
no filter	quick	59h	127G	0.996 (0.983)	0.173 (0.668)
	thoroughly	185h	290G	0.995 (0.979)	0.190 (0.700)

Contact

For any question, feedback, or problem, please feel free to file an issue on this Git repository or write an email and we will get back to you as soon as possible.

[email protected]

SANS is provided as a service of the German Network for Bioinformatics Infrastructure (de.NBI). We would appriciate if you would participate in the evaluation of SANS by completing this very short survey.

License

The sparse-map library is licensed under the MIT license.
The Bifrost library is licensed under the BSD-2 license.
SANS uses gzstream, licensed under the LGPL license
SANS is licensed under the GNU general public license.

Privacy

We use the open source software Matomo for web analysis in order to collect anonymized usage statistics for this repository. Please refer to our Privacy Notice for details.

Name		Name	Last commit message	Last commit date
Latest commit History 668 Commits
config		config
example_data		example_data
scripts		scripts
src		src
.gitignore		.gitignore
LICENSE		LICENSE
PrivacyNotice.pdf		PrivacyNotice.pdf
README.md		README.md
SANS-autoN.BAT		SANS-autoN.BAT
SANS-autoN.sh		SANS-autoN.sh
makefile		makefile

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

SANS ambages

Dos and Don'ts

Publications

Table of Contents

Requirements

Compilation

Usage

Examples

Performance evaluation on predicted open reading frames

Salmonella enterica Para C

Salmonella enterica subspecies enterica

Clustering / dereplication of metagenome assembled genomes (MAGs)

Dereplication efficiency and accuracy

Contact

License

Privacy

About

Releases 1

Contributors 5

Languages

License

gi-bielefeld/sans

Folders and files

Latest commit

History

Repository files navigation

SANS ambages

Dos and Don'ts

Publications

Table of Contents

Requirements

Compilation

Usage

Examples

Performance evaluation on predicted open reading frames

Salmonella enterica Para C

Salmonella enterica subspecies enterica

Clustering / dereplication of metagenome assembled genomes (MAGs)

Dereplication efficiency and accuracy

Contact

License

Privacy

About

Topics

Resources

License

Stars

Watchers

Forks

Releases 1

Contributors 5

Languages