Merge pull request #29 from pirovc/dev

Version 1.0.0
pirovc · Aug 5, 2020 · 6edcc30 · 6edcc30
2 parents 6e14819 + 81d792a
commit 6edcc30
Show file tree

Hide file tree

Showing 33 changed files with 1,789 additions and 896 deletions.
diff --git a/.travis.yml b/.travis.yml
@@ -0,0 +1,22 @@
+language: python
+python:
+  - "3.4"
+  - "3.5"
+  - "3.6"
+  - "3.7"
+  - "3.8"
+#  - "3.8-dev"
+
+install:
+  - pip install "pandas>=0.22.0"
+  - pip install binpacking==1.4.3
+  - git clone https://github.com/pirovc/pylca.git
+  - cd pylca
+  - python setup.py install
+  - cd ..
+
+script:
+  - python -m unittest discover -s tests/taxsbp/integration/
+
+notifications:
+  email: false
diff --git a/LICENSE b/LICENSE
@@ -1,7 +1,6 @@
 The MIT License (MIT)
 
-Copyright (c) 2019 Vitor C. Piro - [email protected] - [email protected]
-Robert Koch-Institut, Germany
+Copyright (c) 2020 Vitor C. Piro - [email protected]
 
 All rights reserved.
 

diff --git a/README.md b/README.md
@@ -2,23 +2,17 @@
 
 [![install with bioconda](https://img.shields.io/badge/install%20with-bioconda-brightgreen.svg?style=flat)](http://bioconda.github.io/recipes/taxsbp/README.html)
 
-Vitor C. Piro ([email protected])
+[![Build Status](https://travis-ci.org/pirovc/taxsbp.svg?branch=master)](https://travis-ci.org/pirovc/taxsbp) 
 
-Implementation of the approximation algorithm for the hierarchically structured bin packing problem [1] based on the NCBI Taxonomy database [2] (uses LCA script from [3]).
-
-## Dependencies:
-
-- python>=3.4
-- [binpacking](https://pypi.org/project/binpacking/)==1.4.1
+Implementation of the approximation algorithm for the hierarchically structured bin packing problem [1] adapted for the NCBI Taxonomy database [2] (uses LCA script from [3]).
 
 ## Installation
 
 ```shh
-git clone https://github.com/pirovc/taxsbp.git
-cd taxsbp
-python setup.py install
-taxsbp -h
+conda -c bioconda -c conda-forge taxsbp
 ```
+or [manual installation](#manual-installation) without conda
+
 ## Usage
 
 ### Input
@@ -28,57 +22,266 @@ taxsbp -h
 	`sequence id <tab> sequence length <tab> taxonomic id [ <tab> specialization]`
 
  * nodes.dmp and merged.dmp from NCBI Taxonomy (ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz)
+ * specialization can be used to further cluster sequences by groups beyond the taxonomy (e.g. strain name, assembly accession, ...)
 
 ### Output 
 
  * A tab-separated file:
 
- 	`sequence id [/seq.start:seq.end] <tab> sequence length <tab> taxonomic id [ <tab> specialization] <tab> bin id`
+ 	`sequence id <tab> seq.start <tab> seq.end <tab> sequence length <tab> taxonomic id <tab> bin id [ <tab> specialization] `
+
+## Examples with sample data:
+
+The sample data is comprised of:
+
+	- hierarchy with 5 levels (rank1-rank5), where the root node is "1" and leaf nodes are marked with a "*"
+	- 8 Specializations (S1-S9) and 13 targets (A-M) with equal length (100) 
+
+	rank-1                   1 ________________
+	                        / \           \    \
+	rank-2                2.1 2.2 ______   \    \
+	                      / \    \      \   \    \
+	rank-3             3.1  3.2   3.4    \   \    \
+	                    /   / \     \     \   \    \
+	rank-4          *4.1 *4.2 *4.3  *4.4  *4.5 *4.6 \
+	                 /    /   /     / | \   \   \    \
+	rank-5          /    /   / *5.1 *5.2 \   \   \    \
+	               /    /   /     /   |   \   \   \    \
+	spec.         S1   S2  S3   S4   S5   S6  S7  S8   S9
+	             /  \   |   |   / \   |   |   |   /|\   |
+	target      A    B  C   D  E   F  G   H   I  J K L  M
+
+### Clusters of size 400
+
+	taxsbp.py -i sample_data/seqinfo.tsv -n sample_data/nodes.dmp -l 400
+
+<details>
+  <summary>Results</summary>
+
+	#id	st	end	len	tax	bin	
+	F	1	100	100	5.1	0
+	E	1	100	100	5.1	0
+	H	1	100	100	4.4	0
+	G	1	100	100	5.2	0
+	D	1	100	100	4.3	1
+	C	1	100	100	4.2	1
+	B	1	100	100	4.1	1
+	A	1	100	100	4.1	1
+	L	1	100	100	4.6	2
+	K	1	100	100	4.6	2
+	J	1	100	100	4.6	2
+	M	1	100	100	1	2
+	I	1	100	100	4.5	3
+
+</details>
+
+### Pre-clustered by "rank-2"
+
+	taxsbp.py -i sample_data/seqinfo.tsv -n sample_data/nodes.dmp -l 400 -p "rank-2"
+
+ * ABCD (2.1) and EFGHI (2.2) are pre-clustered together
+
+<details>
+  <summary>Results</summary>
+
+	#id	st	end	len	tax	bin	
+	I	1	100	100	4.5	0
+	G	1	100	100	5.2	0
+	E	1	100	100	5.1	0
+	F	1	100	100	5.1	0
+	H	1	100	100	4.4	0
+	C	1	100	100	4.2	1
+	D	1	100	100	4.3	1
+	A	1	100	100	4.1	1
+	B	1	100	100	4.1	1
+	J	1	100	100	4.6	2
+	K	1	100	100	4.6	2
+	L	1	100	100	4.6	2
+	M	1	100	100	1	2
+
+</details>
+
+### Bin exclusive clusters by "rank-4"
+
+	taxsbp.py -i sample_data/seqinfo.tsv -n sample_data/nodes.dmp -l 400 -e "rank-4"
+
+ * Clusters are generated for each sub-tree of nodes from rank-4. The taxid of the exclusive rank is printed instead of the original.
+
+<details>
+  <summary>Results</summary>
+
+	#id	st	end	len	tax	bin	
+	F	1	100	100	4.4	0
+	E	1	100	100	4.4	0
+	H	1	100	100	4.4	0
+	G	1	100	100	4.4	0
+	L	1	100	100	4.6	1
+	K	1	100	100	4.6	1
+	J	1	100	100	4.6	1
+	B	1	100	100	4.1	2
+	A	1	100	100	4.1	2
+	I	1	100	100	4.5	3
+	D	1	100	100	4.3	4
+	C	1	100	100	4.2	5
+	M	1	100	100	1	6
+
+</details>
+
+### Cluster with specialization
+
+	taxsbp.py -i sample_data/seqinfo.tsv -n sample_data/nodes.dmp -l 400 -s MySpec -e MySpec
+
+ * Clusters are exclusive by specialization
+
+<details>
+  <summary>Results</summary>
+
+	#id	st	end	len	tax	bin	spec	
+	L	1	100	100	4.6	0	S8
+	K	1	100	100	4.6	0	S8
+	J	1	100	100	4.6	0	S8
+	F	1	100	100	5.1	1	S4
+	E	1	100	100	5.1	1	S4
+	B	1	100	100	4.1	2	S1
+	A	1	100	100	4.1	2	S1
+	G	1	100	100	5.2	3	S5
+	D	1	100	100	4.3	4	S3
+	M	1	100	100	1	5	S9
+	H	1	100	100	4.4	6	S6
+	I	1	100	100	4.5	7	S7
+	C	1	100	100	4.2	8	S2
+
+</details>
+
+### Cluster with fragmentation
+
+	taxsbp.py -i sample_data/seqinfo.tsv -n sample_data/nodes.dmp -l 150 -f 50 -a 5 -e "rank-3"
+
+ * Clusters of size 150. Fragment inputs in 50 with overlap of 5. Cluster exclusive of "rank-3"
+
+<details>
+  <summary>Results</summary>
+
+	#id	st	end	len	tax	bin	
+	F	1	55	55	3.4	0
+	E	1	55	55	3.4	0
+	B	1	55	55	3.1	1
+	A	1	55	55	3.1	1
+	L	1	55	55	4.6	2
+	K	1	55	55	4.6	2
+	G	1	55	55	3.4	3
+	G	51	100	50	3.4	3
+	H	1	55	55	3.4	4
+	H	51	100	50	3.4	4
+	D	1	55	55	3.2	5
+	D	51	100	50	3.2	5
+	C	1	55	55	3.2	6
+	C	51	100	50	3.2	6
+	I	1	55	55	4.5	7
+	I	51	100	50	4.5	7
+	J	1	55	55	4.6	8
+	L	51	100	50	4.6	8
+	M	1	55	55	1	9
+	M	51	100	50	1	9
+	F	51	100	50	3.4	10
+	E	51	100	50	3.4	10
+	B	51	100	50	3.1	11
+	A	51	100	50	3.1	11
+	K	51	100	50	4.6	12
+	J	51	100	50	4.6	12
+
+</details>
+
+## Examples with real data:
+
+### Prepare data:
+
+	gzip -d sample_data/*.gz
+
+### Clustering:
+
+	taxsbp.py -i sample_data/20181219_abfv_refseq_cg.tsv -n sample_data/20181219_abfv_refseq_cg_nodes.dmp -l 10000000 -f 999500 -a 500
 
 ## Parameters:
 
 $ taxsbp -h
 
-	usage: TaxSBP [-h] -f <input_file> -n <nodes_file> [-m <merged_file>]
-              [-b <bins>] [-l <bin_len>] [-a <fragment_len>]
-              [-o <overlap_len>] [-p <pre_cluster>] [-r <bin_exclusive>]
-              [-z <specialization>] [-u <update_file>] [-v]
+	usage: taxsbp [-h] [-i <input_file>] [-o <output_file>] [-n <nodes_file>] [-m <merged_file>] [-l <bin_len>] [-b <bins>]
+	              [-f <fragment_len>] [-a <overlap_len>] [-p <pre_cluster>] [-e <bin_exclusive>] [-s <specialization>]
+	              [-u <update_file>] [-w] [-t] [-v]
 
 	optional arguments:
-	  -h, --help           show this help message and exit
-	  -f <input_file>      Tab-separated with the fields: sequence id <tab>
-	                       sequence length <tab> taxonomic id [<tab>
-	                       specialization]
-	  -n <nodes_file>      nodes.dmp from NCBI Taxonomy
-	  -m <merged_file>     merged.dmp from NCBI Taxonomy
-	  -b <bins>            Approximate number of bins (estimated by total
-	                       length/bin number). Default: 50 [Mutually exclusive -l]
-	  -l <bin_len>         Maximum bin length (in bp). Use this parameter insted
-	                       of -b to define the number of bins [Mutually exclusive
-	                       -b]
-	  -a <fragment_len>    Fragment sequences into pieces, output accession will
-	                       be modified with positions: ACCESION/start:end
-	  -o <overlap_len>     Overlap length between fragments [Only valid with -a]
-	  -p <pre_cluster>     Pre-cluster sequences into rank/taxid/specialization,
-	                       so they won't be splitted among bins
-	                       [none,specialization name,taxid,species,genus,...]
-	                       Default: none
-	  -r <bin_exclusive>   Make bins rank/taxid/specialization exclusive, so bins
-	                       won't have mixed sequences. When the chosen rank is not
-	                       present on a sequence lineage, this sequence will be
-	                       taxid/specialization exclusive. [none,specialization
-	                       name,taxid,species,genus,...] Default: none
-	  -z <specialization>  Specialization name (e.g. assembly, strain). If given,
-	                       TaxSBP will cluster entries on a specialized level
-	                       after the taxonomic id. The specialization identifier
-	                       should be provided as an extra collumn in the
-	                       input_file ans should respect the taxonomic hiercharchy
-	                       (one taxid -> multiple specializations / one
-	                       specialization -> one taxid). Default: ''
-	  -u <update_file>     Previously generated files to be updated. Default: ''
-	  -v                   show program's version number and exit
+	  -h, --help            show this help message and exit
+	  -i <input_file>, --input-file <input_file>
+	                        Tab-separated with the fields: sequence id <tab> sequence length <tab> taxonomic id [<tab>
+	                        specialization]
+	  -o <output_file>, --output-file <output_file>
+	                        Path to the output tab-separated file. Fields: sequence id <tab> sequence start <tab> sequence
+	                        end <tab> sequence length <tab> taxonomic id <tab> bin id [<tab> specialization]. Default: STDOUT
+	  -n <nodes_file>, --nodes-file <nodes_file>
+	                        nodes.dmp from NCBI Taxonomy
+	  -m <merged_file>, --merged-file <merged_file>
+	                        merged.dmp from NCBI Taxonomy
+	  -l <bin_len>, --bin-len <bin_len>
+	                        Maximum bin length (in bp). Use this parameter insted of -b to define the number of bins.
+	                        Default: length of the biggest group [Mutually exclusive -b]
+	  -b <bins>, --bins <bins>
+	                        Approximate number of bins (estimated by total length/bin number). [Mutually exclusive -l]
+	  -f <fragment_len>, --fragment-len <fragment_len>
+	                        Fragment sequences into pieces
+	  -a <overlap_len>, --overlap-len <overlap_len>
+	                        Overlap length between fragments [Only valid with -a]
+	  -p <pre_cluster>, --pre-cluster <pre_cluster>
+	                        Pre-cluster sequences into any existing rank, leaves or specialization. Entries will not be
+	                        divided in bins ['leaves',specialization name,rank name]
+	  -e <bin_exclusive>, --bin-exclusive <bin_exclusive>
+	                        Make bins rank, leaves or specialization exclusive. Bins will not have mixed entries. When the
+	                        chosen rank is not present on a sequence lineage, this sequence will be leaf/specialization
+	                        exclusive. ['leaves',specialization name,rank name]
+	  -s <specialization>, --specialization <specialization>
+	                        Specialization name (e.g. assembly, strain). If given, TaxSBP will cluster entries on a
+	                        specialized level after the leaf. The specialization identifier should be provided as an extra
+	                        collumn in the input_file and should respect the taxonomic hiercharchy: One leaf can have
+	                        multiple specializations but a specialization is present in only one leaf
+	  -u <update_file>, --update-file <update_file>
+	                        Previously generated clusters to be updated. Output only new sequences
+	  -w, --allow-merge     When updating, allow merging of existing bins. Will output the whole set, not only new bins
+	  -t, --silent          Do not print warning to STDERR
+	  -v, --version         show program's version number and exit
+
+## Manual Installation
+
+### Dependencies:
+
+- python>=3.4
+- [binpacking](https://pypi.org/project/binpacking/)==1.4.3
+- [pylca](https://github.com/pirovc/pylca)==1.0.0
+- [pandas](https://pypi.org/project/pandas/)pandas>=0.22.0 (tests only)
 
+### Pylca:
 
+```shh
+git clone https://github.com/pirovc/pylca
+cd pylca
+python setup.py install
+```
+
+### TaxSBP + binpacking:
+
+```shh
+git clone https://github.com/pirovc/taxsbp.git
+cd taxsbp
+python setup.py install
+taxsbp -h
+```
+
+### Testing:
+
+```shh
+pip install "pandas>=0.22.0"
+cd taxsbp
+python3 -m unittest discover -s tests/taxsbp/integration/
+```
 
 References:
 -----------
@@ -87,4 +290,4 @@ References:
 
 [2] Federhen, S. (2012). The NCBI Taxonomy database. Nucleic Acids Research, 40(D1), D136–D143. http://doi.org/10.1093/nar/gkr1178
 
-[3] https://www.ics.uci.edu/~eppstein/
+[3] https://www.ics.uci.edu/~eppstein/ in the package https://github.com/pirovc/pylca
diff --git a/sample_data/20181219_abfv_refseq_cg.tsv.gz b/sample_data/20181219_abfv_refseq_cg.tsv.gz
diff --git a/sample_data/20181219_abfv_refseq_cg_nodes.dmp.gz b/sample_data/20181219_abfv_refseq_cg_nodes.dmp.gz
diff --git a/sample_data/nodes.dmp b/sample_data/nodes.dmp
@@ -0,0 +1,14 @@
+1	|	1	|	rank-1	|	
+2.1	|	1	|	rank-2	|	
+2.2	|	1	|	rank-2	|	
+3.1	|	2.1	|	rank-3	|	
+3.2	|	2.1	|	rank-3	|	
+4.1	|	3.1	|	rank-4	|	
+4.2	|	3.2	|	rank-4	|	
+4.3	|	3.2	|	rank-4	|	
+3.4	|	2.2	|	rank-3	|	
+4.4	|	3.4	|	rank-4	|	
+5.1	|	4.4	|	rank-5	|	
+5.2	|	4.4	|	rank-5	|	
+4.5	|	2.2	|	rank-4	|	
+4.6	|	1	|	rank-4	|	
diff --git a/sample_data/seqinfo.tsv b/sample_data/seqinfo.tsv
@@ -0,0 +1,13 @@
+A	100	4.1	S1
+B	100	4.1	S1
+C	100	4.2	S2
+D	100	4.3	S3
+E	100	5.1	S4
+F	100	5.1	S4
+G	100	5.2	S5
+H	100	4.4	S6
+I	100	4.5	S7
+J	100	4.6	S8
+K	100	4.6	S8
+L	100	4.6	S8
+M	100	1	S9