TaxSBP

Implementation of the approximation algorithm for the hierarchically structured bin packing problem [1] adapted for the NCBI Taxonomy database [2] (uses LCA script from [3]).

Installation

conda install -c bioconda -c conda-forge taxsbp

or manual installation without conda

Usage

Input

A tab-separated file:

sequence id <tab> sequence length <tab> taxonomic id [ <tab> specialization]
nodes.dmp and merged.dmp from NCBI Taxonomy (ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz)
specialization can be used to further cluster sequences by groups beyond the taxonomy (e.g. strain name, assembly accession, ...)
if specialization is set but missing in the file, a placeholder will be used "specialization-sequence id"

Output

A tab-separated file:

sequence id <tab> seq.start <tab> seq.end <tab> sequence length <tab> taxonomic id <tab> bin id [ <tab> specialization]

Examples with sample data:

The sample data is comprised of:

- hierarchy with 5 levels (rank1-rank5), where the root node is "1" and leaf nodes are marked with a "*"
- 8 Specializations (S1-S9) and 13 targets (A-M) with equal length (100) 

rank-1                   1 ________________
                        / \           \    \
rank-2                2.1 2.2 ______   \    \
                      / \    \      \   \    \
rank-3             3.1  3.2   3.4    \   \    \
                    /   / \     \     \   \    \
rank-4          *4.1 *4.2 *4.3  *4.4  *4.5 *4.6 \
                 /    /   /     / | \   \   \    \
rank-5          /    /   / *5.1 *5.2 \   \   \    \
               /    /   /     /   |   \   \   \    \
spec.         S1   S2  S3   S4   S5   S6  S7  S8   S9
             /  \   |   |   / \   |   |   |   /|\   |
target      A    B  C   D  E   F  G   H   I  J K L  M

Clusters of size 400

taxsbp.py -i sample_data/seqinfo.tsv -n sample_data/nodes.dmp -l 400

Results

#id	st	end	len	tax	bin	
F	1	100	100	5.1	0
E	1	100	100	5.1	0
H	1	100	100	4.4	0
G	1	100	100	5.2	0
D	1	100	100	4.3	1
C	1	100	100	4.2	1
B	1	100	100	4.1	1
A	1	100	100	4.1	1
L	1	100	100	4.6	2
K	1	100	100	4.6	2
J	1	100	100	4.6	2
M	1	100	100	1	2
I	1	100	100	4.5	3

Pre-clustered by "rank-2"

taxsbp.py -i sample_data/seqinfo.tsv -n sample_data/nodes.dmp -l 400 -p "rank-2"

ABCD (2.1) and EFGHI (2.2) are pre-clustered together

Results

#id	st	end	len	tax	bin	
I	1	100	100	4.5	0
G	1	100	100	5.2	0
E	1	100	100	5.1	0
F	1	100	100	5.1	0
H	1	100	100	4.4	0
C	1	100	100	4.2	1
D	1	100	100	4.3	1
A	1	100	100	4.1	1
B	1	100	100	4.1	1
J	1	100	100	4.6	2
K	1	100	100	4.6	2
L	1	100	100	4.6	2
M	1	100	100	1	2

Bin exclusive clusters by "rank-4"

taxsbp.py -i sample_data/seqinfo.tsv -n sample_data/nodes.dmp -l 400 -e "rank-4"

Clusters are generated for each sub-tree of nodes from rank-4. The taxid of the exclusive rank is printed instead of the original.

Results

#id	st	end	len	tax	bin	
F	1	100	100	4.4	0
E	1	100	100	4.4	0
H	1	100	100	4.4	0
G	1	100	100	4.4	0
L	1	100	100	4.6	1
K	1	100	100	4.6	1
J	1	100	100	4.6	1
B	1	100	100	4.1	2
A	1	100	100	4.1	2
I	1	100	100	4.5	3
D	1	100	100	4.3	4
C	1	100	100	4.2	5
M	1	100	100	1	6

Cluster with specialization

taxsbp.py -i sample_data/seqinfo.tsv -n sample_data/nodes.dmp -l 400 -s MySpec -e MySpec

Clusters are exclusive by specialization

Results

#id	st	end	len	tax	bin	spec	
L	1	100	100	4.6	0	S8
K	1	100	100	4.6	0	S8
J	1	100	100	4.6	0	S8
F	1	100	100	5.1	1	S4
E	1	100	100	5.1	1	S4
B	1	100	100	4.1	2	S1
A	1	100	100	4.1	2	S1
G	1	100	100	5.2	3	S5
D	1	100	100	4.3	4	S3
M	1	100	100	1	5	S9
H	1	100	100	4.4	6	S6
I	1	100	100	4.5	7	S7
C	1	100	100	4.2	8	S2

Cluster with fragmentation

taxsbp.py -i sample_data/seqinfo.tsv -n sample_data/nodes.dmp -l 150 -f 50 -a 5 -e "rank-3"

Clusters of size 150. Fragment inputs in 50 with overlap of 5. Cluster exclusive of "rank-3"

Results

#id	st	end	len	tax	bin	
F	1	55	55	3.4	0
E	1	55	55	3.4	0
B	1	55	55	3.1	1
A	1	55	55	3.1	1
L	1	55	55	4.6	2
K	1	55	55	4.6	2
G	1	55	55	3.4	3
G	51	100	50	3.4	3
H	1	55	55	3.4	4
H	51	100	50	3.4	4
D	1	55	55	3.2	5
D	51	100	50	3.2	5
C	1	55	55	3.2	6
C	51	100	50	3.2	6
I	1	55	55	4.5	7
I	51	100	50	4.5	7
J	1	55	55	4.6	8
L	51	100	50	4.6	8
M	1	55	55	1	9
M	51	100	50	1	9
F	51	100	50	3.4	10
E	51	100	50	3.4	10
B	51	100	50	3.1	11
A	51	100	50	3.1	11
K	51	100	50	4.6	12
J	51	100	50	4.6	12

Examples with real data:

Prepare data:

gzip -d sample_data/*.gz

Clustering:

taxsbp.py -i sample_data/20181219_abfv_refseq_cg.tsv -n sample_data/20181219_abfv_refseq_cg_nodes.dmp -l 10000000 -f 999500 -a 500

Parameters:

$ taxsbp -h

usage: taxsbp [-h] [-i <input_file>] [-o <output_file>] [-n <nodes_file>] [-m <merged_file>] [-l <bin_len>] [-b <bins>]
              [-f <fragment_len>] [-a <overlap_len>] [-p <pre_cluster>] [-e <bin_exclusive>] [-s <specialization>]
              [-u <update_file>] [-w] [-t] [-v]

optional arguments:
  -h, --help            show this help message and exit
  -i <input_file>, --input-file <input_file>
                        Tab-separated with the fields: sequence id <tab> sequence length <tab> taxonomic id [<tab>
                        specialization]
  -o <output_file>, --output-file <output_file>
                        Path to the output tab-separated file. Fields: sequence id <tab> sequence start <tab> sequence
                        end <tab> sequence length <tab> taxonomic id <tab> bin id [<tab> specialization]. Default: STDOUT
  -n <nodes_file>, --nodes-file <nodes_file>
                        nodes.dmp from NCBI Taxonomy
  -m <merged_file>, --merged-file <merged_file>
                        merged.dmp from NCBI Taxonomy
  -l <bin_len>, --bin-len <bin_len>
                        Maximum bin length (in bp). Use this parameter insted of -b to define the number of bins.
                        Default: length of the biggest group [Mutually exclusive -b]
  -b <bins>, --bins <bins>
                        Approximate number of bins (estimated by total length/bin number). [Mutually exclusive -l]
  -f <fragment_len>, --fragment-len <fragment_len>
                        Fragment sequences into pieces
  -a <overlap_len>, --overlap-len <overlap_len>
                        Overlap length between fragments [Only valid with -a]
  -p <pre_cluster>, --pre-cluster <pre_cluster>
                        Pre-cluster sequences into any existing rank, leaves or specialization. Entries will not be
                        divided in bins ['leaves',specialization name,rank name]
  -e <bin_exclusive>, --bin-exclusive <bin_exclusive>
                        Make bins rank, leaves or specialization exclusive. Bins will not have mixed entries. When the
                        chosen rank is not present on a sequence lineage, this sequence will be leaf/specialization
                        exclusive. ['leaves',specialization name,rank name]
  -s <specialization>, --specialization <specialization>
                        Specialization name (e.g. assembly, strain). If given, TaxSBP will cluster entries on a
                        specialized level after the leaf. The specialization identifier should be provided as an extra
                        collumn in the input_file and should respect the taxonomic hiercharchy: One leaf can have
                        multiple specializations but a specialization is present in only one leaf
  -u <update_file>, --update-file <update_file>
                        Previously generated clusters to be updated. Output only new sequences
  -w, --allow-merge     When updating, allow merging of existing bins. Will output the whole set, not only new bins
  -t, --silent          Do not print warning to STDERR
  -v, --version         show program's version number and exit

Manual Installation

Dependencies:

python>=3.5
binpacking==1.4.3
pylca==1.0.0
pandaspandas>=0.22.0 (optional)

Pylca:

git clone https://github.com/pirovc/pylca
cd pylca
python setup.py install

TaxSBP + binpacking:

git clone https://github.com/pirovc/taxsbp.git
cd taxsbp
python setup.py install
taxsbp -h

Testing:

pip install "pandas>=0.22.0"
cd taxsbp
python3 -m unittest discover -s tests/taxsbp/integration/

References:

[1] Codenotti, B., De Marco, G., Leoncini, M., Montangero, M., & Santini, M. (2004). Approximation algorithms for a hierarchically structured bin packing problem. Information Processing Letters, 89(5), 215–221. http://doi.org/10.1016/j.ipl.2003.12.001

[2] Federhen, S. (2012). The NCBI Taxonomy database. Nucleic Acids Research, 40(D1), D136–D143. http://doi.org/10.1093/nar/gkr1178

[3] https://www.ics.uci.edu/~eppstein/ in the package https://github.com/pirovc/pylca

Name		Name	Last commit message	Last commit date
Latest commit History 134 Commits
sample_data		sample_data
taxsbp		taxsbp
tests/taxsbp/integration		tests/taxsbp/integration
.gitignore		.gitignore
.travis.yml		.travis.yml
LICENSE		LICENSE
README.md		README.md
setup.py		setup.py
taxsbp.py		taxsbp.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

TaxSBP

Installation

Usage

Input

Output

Examples with sample data:

Clusters of size 400

Pre-clustered by "rank-2"

Bin exclusive clusters by "rank-4"

Cluster with specialization

Cluster with fragmentation

Examples with real data:

Prepare data:

Clustering:

Parameters:

Manual Installation

Dependencies:

Pylca:

TaxSBP + binpacking:

Testing:

References:

About

Releases 8

Packages

Contributors 2

Languages

License

pirovc/taxsbp

Folders and files

Latest commit

History

Repository files navigation

TaxSBP

Installation

Usage

Input

Output

Examples with sample data:

Clusters of size 400

Pre-clustered by "rank-2"

Bin exclusive clusters by "rank-4"

Cluster with specialization

Cluster with fragmentation

Examples with real data:

Prepare data:

Clustering:

Parameters:

Manual Installation

Dependencies:

Pylca:

TaxSBP + binpacking:

Testing:

References:

About

Resources

License

Stars

Watchers

Forks

Releases 8

Packages 0

Contributors 2

Languages

Packages