Feature/pre release v1.0.0 (#28)

* readme adjustments, license
* better code approxsbp and comments
* clean pre_cluster function
* final changes, taxid for leaves
* final changes readme
pirovc · Aug 5, 2020 · commit 81d792a · 1 parent 23fc84c

Showing 5 changed files with 88 additions and 110 deletions.
diff --git a/README.md b/README.md
@@ -4,7 +4,7 @@
 
 [![Build Status](https://travis-ci.org/pirovc/taxsbp.svg?branch=master)](https://travis-ci.org/pirovc/taxsbp) 
 
-Implementation of the approximation algorithm for the hierarchically structured bin packing problem [1] based on the NCBI Taxonomy database [2] (uses LCA script from [3]).
+Implementation of the approximation algorithm for the hierarchically structured bin packing problem [1] adapted for the NCBI Taxonomy database [2] (uses LCA script from [3]).
 
 ## Installation
 
@@ -22,6 +22,7 @@ or [manual installation](#manual-installation) without conda
 	`sequence id <tab> sequence length <tab> taxonomic id [ <tab> specialization]`
 
  * nodes.dmp and merged.dmp from NCBI Taxonomy (ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz)
+ * specialization can be used to further cluster sequences by groups beyond the taxonomy (e.g. strain name, assembly accession, ...)
 
 ### Output 
 
@@ -33,19 +34,16 @@ or [manual installation](#manual-installation) without conda
 
 The sample data is comprised of:
 
-	- Root node (1)
-	- hierarchy with 5 levels (rank1-rank5)
-	- Leaf nodes (*)
-	- 8 Specializations (S1-S9)
-	- 13 targets (A-M) with equal length (100) 
+	- hierarchy with 5 levels (rank1-rank5), where the root node is "1" and leaf nodes are marked with a "*"
+	- 8 Specializations (S1-S9) and 13 targets (A-M) with equal length (100) 
 
 	rank-1                   1 ________________
 	                        / \           \    \
 	rank-2                2.1 2.2 ______   \    \
 	                      / \    \      \   \    \
 	rank-3             3.1  3.2   3.4    \   \    \
 	                    /   / \     \     \   \    \
-	rank-4          *4.1 *4.2 *4.3   4.4  *4.5 *4.6 \
+	rank-4          *4.1 *4.2 *4.3  *4.4  *4.5 *4.6 \
 	                 /    /   /     / | \   \   \    \
 	rank-5          /    /   / *5.1 *5.2 \   \   \    \
 	               /    /   /     /   |   \   \   \    \
@@ -57,8 +55,6 @@ The sample data is comprised of:
 
 	taxsbp.py -i sample_data/seqinfo.tsv -n sample_data/nodes.dmp -l 400
 
-Clusters are limited by size 400
-
 <details>
   <summary>Results</summary>
 
@@ -83,7 +79,7 @@ Clusters are limited by size 400
 
 	taxsbp.py -i sample_data/seqinfo.tsv -n sample_data/nodes.dmp -l 400 -p "rank-2"
 
-ABCD (2.1) and EFGHI (2.2) are forced together (even if bigger than 400)
+ * ABCD (2.1) and EFGHI (2.2) are pre-clustered together
 
 <details>
   <summary>Results</summary>
@@ -109,7 +105,7 @@ ABCD (2.1) and EFGHI (2.2) are forced together (even if bigger than 400)
 
 	taxsbp.py -i sample_data/seqinfo.tsv -n sample_data/nodes.dmp -l 400 -e "rank-4"
 
-Clusters are generated for each sub-tree of nodes from rank-4. The used rank is printed instead of the original.
+ * Clusters are generated for each sub-tree of nodes from rank-4. The taxid of the exclusive rank is printed instead of the original.
 
 <details>
   <summary>Results</summary>
@@ -135,7 +131,7 @@ Clusters are generated for each sub-tree of nodes from rank-4. The used rank is
 
 	taxsbp.py -i sample_data/seqinfo.tsv -n sample_data/nodes.dmp -l 400 -s MySpec -e MySpec
 
-Clusters are exclusive by specialization
+ * Clusters are exclusive by specialization
 
 <details>
   <summary>Results</summary>
@@ -161,7 +157,7 @@ Clusters are exclusive by specialization
 
 	taxsbp.py -i sample_data/seqinfo.tsv -n sample_data/nodes.dmp -l 150 -f 50 -a 5 -e "rank-3"
 
-Clusters of size 150. Fragment inputs in 50 with overlap of 5. Cluster exclusive of "rank-3"
+ * Clusters of size 150. Fragment inputs in 50 with overlap of 5. Cluster exclusive of "rank-3"
 
 <details>
   <summary>Results</summary>
@@ -210,47 +206,47 @@ Clusters of size 150. Fragment inputs in 50 with overlap of 5. Cluster exclusive
 
 $ taxsbp -h
 
-	usage: TaxSBP [-h] [-i <input_file>] [-o <output_file>] [-n <nodes_file>] [-m <merged_file>] [-b <bins>] [-l <bin_len>]
+	usage: taxsbp [-h] [-i <input_file>] [-o <output_file>] [-n <nodes_file>] [-m <merged_file>] [-l <bin_len>] [-b <bins>]
 	              [-f <fragment_len>] [-a <overlap_len>] [-p <pre_cluster>] [-e <bin_exclusive>] [-s <specialization>]
-	              [-u <update_file>] [--output-unique-seqid] [-v]
+	              [-u <update_file>] [-w] [-t] [-v]
 
 	optional arguments:
 	  -h, --help            show this help message and exit
 	  -i <input_file>, --input-file <input_file>
 	                        Tab-separated with the fields: sequence id <tab> sequence length <tab> taxonomic id [<tab>
 	                        specialization]
 	  -o <output_file>, --output-file <output_file>
-	                        Path to the output tab-separated file with the fields. Default: STDOUT
+	                        Path to the output tab-separated file. Fields: sequence id <tab> sequence start <tab> sequence
+	                        end <tab> sequence length <tab> taxonomic id <tab> bin id [<tab> specialization]. Default: STDOUT
 	  -n <nodes_file>, --nodes-file <nodes_file>
 	                        nodes.dmp from NCBI Taxonomy
 	  -m <merged_file>, --merged-file <merged_file>
 	                        merged.dmp from NCBI Taxonomy
-	  -b <bins>, --bins <bins>
-	                        Approximate number of bins (estimated by total length/bin number). [Mutually exclusive -l]
 	  -l <bin_len>, --bin-len <bin_len>
 	                        Maximum bin length (in bp). Use this parameter insted of -b to define the number of bins.
 	                        Default: length of the biggest group [Mutually exclusive -b]
+	  -b <bins>, --bins <bins>
+	                        Approximate number of bins (estimated by total length/bin number). [Mutually exclusive -l]
 	  -f <fragment_len>, --fragment-len <fragment_len>
-	                        Fragment sequences into pieces, output accession will be modified with positions:
-	                        ACCESION/start:end
+	                        Fragment sequences into pieces
 	  -a <overlap_len>, --overlap-len <overlap_len>
 	                        Overlap length between fragments [Only valid with -a]
 	  -p <pre_cluster>, --pre-cluster <pre_cluster>
-	                        Pre-cluster sequences into rank/taxid/specialization, so they won't be splitted among bins
-	                        [none,specialization name,taxid,species,genus,...] Default: none
+	                        Pre-cluster sequences into any existing rank, leaves or specialization. Entries will not be
+	                        divided in bins ['leaves',specialization name,rank name]
 	  -e <bin_exclusive>, --bin-exclusive <bin_exclusive>
-	                        Make bins rank/taxid/specialization exclusive, so bins won't have mixed sequences. When the
-	                        chosen rank is not present on a sequence lineage, this sequence will be taxid/specialization
-	                        exclusive. [none,specialization name,taxid,species,genus,...] Default: none
+	                        Make bins rank, leaves or specialization exclusive. Bins will not have mixed entries. When the
+	                        chosen rank is not present on a sequence lineage, this sequence will be leaf/specialization
+	                        exclusive. ['leaves',specialization name,rank name]
 	  -s <specialization>, --specialization <specialization>
 	                        Specialization name (e.g. assembly, strain). If given, TaxSBP will cluster entries on a
-	                        specialized level after the taxonomic id. The specialization identifier should be provided as an
-	                        extra collumn in the input_file ans should respect the taxonomic hiercharchy (one taxid ->
-	                        multiple specializations / one specialization -> one taxid). Default: ''
+	                        specialized level after the leaf. The specialization identifier should be provided as an extra
+	                        collumn in the input_file and should respect the taxonomic hiercharchy: One leaf can have
+	                        multiple specializations but a specialization is present in only one leaf
 	  -u <update_file>, --update-file <update_file>
-	                        Previously generated files to be updated. Default: ''
-	  --output-unique-seqid
-	                        Output unique sequence ids after fragmentation in the format: seq.id/seq.start:seq.end]
+	                        Previously generated clusters to be updated. Output only new sequences
+	  -w, --allow-merge     When updating, allow merging of existing bins. Will output the whole set, not only new bins
+	  -t, --silent          Do not print warning to STDERR
 	  -v, --version         show program's version number and exit
 
 ## Manual Installation
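
Note on the output format documented in the `-o` help text above: the fields are sequence id, sequence start, sequence end, sequence length, taxonomic id, bin id and an optional specialization. A minimal sketch of reading such a file and summing the length per bin could look like the following. The helper name `bin_lengths` and the sample rows are made up for illustration and are not part of TaxSBP or real TaxSBP output.

```python
import csv
import io
from collections import defaultdict


def bin_lengths(handle):
    """Sum sequence lengths per bin id from tab-separated rows.

    Assumes the column order given in the help text:
    seqid, start, end, length, taxid, binid [, specialization].
    """
    totals = defaultdict(int)
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue  # skip empty lines
        # Only the first six columns are required; specialization is optional
        seqid, start, end, length, taxid, binid = row[:6]
        totals[binid] += int(length)
    return dict(totals)


# Hypothetical rows for illustration only
sample = (
    "A\t1\t100\t100\t4.1\t0\n"
    "B\t1\t100\t100\t4.2\t0\n"
    "C\t1\t100\t100\t4.3\t1\tS1\n"
)
totals = bin_lengths(io.StringIO(sample))
```

Bin ids are kept as strings here, so the check against a `-l` limit would compare `totals["0"]` and so on with the chosen maximum bin length.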

diff --git a/setup.py b/setup.py
@@ -20,7 +20,7 @@ def read(filename):
     author="Vitor C. Piro",
     author_email="[email protected]",
 
-    description="Implementation of the approximation algorithm for the hierarchically structured bin packing problem based on the NCBI Taxonomy database.",
+    description="Implementation of the approximation algorithm for the hierarchically structured bin packing problem adapted for the NCBI Taxonomy database.",
     long_description=read("README.md"),
 
     packages=['taxsbp'],

diff --git a/taxsbp/Group.py b/taxsbp/Group.py
@@ -24,12 +24,18 @@ def clear_clusters(self):
 		self.clusters = []
 
 	def get_clusters_to_bpck(self):
+		# Return list of tuples with the clusters in the format necessary for the binpacking
+		# Format [(binid,length,seq1,seq2,...,seqN),...]
+		# Example: [(0,500,A,B,C),(1,300,D),(None,200,E)]
 		ret = []
 		for c in self.clusters:
 			ret.append(c.get_tuples())
 		return ret
 
 	def add_clusters_from_bpck(self, bpck_clusters, leaves: set=None):
+		# Parse binpacking output - list of lists with tuples generated with get_clusters_to_bpck
+		# Example: [[(0,500,A,B,C)],[(1,300,D),(None,200,E)]]
+
 		# For each cluster returned by binpaking
 		for cluster in bpck_clusters:
 			# split clusters in their respective binid assigned (or None)
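
Note on the tuple format described in the new comments: each cluster is passed to the bin packing as `(binid, length, seq1, ..., seqN)`, and the bin packing returns a list of bins, each holding one or more of those tuples. The hunk is cut off before the body of `add_clusters_from_bpck`, so the following is only a rough sketch of the "split by binid" step under that assumed format, not the code from this commit. The function name `split_by_binid` is hypothetical.

```python
def split_by_binid(bpck_clusters):
    """Group sequence ids of every returned bin by their previous binid.

    bpck_clusters: list of bins, each a list of tuples
    (binid, length, seq1, ..., seqN). binid is None for new entries.
    Returns one dict per bin, mapping binid -> list of sequence ids.
    """
    result = []
    for cluster in bpck_clusters:
        by_binid = {}
        for binid, length, *seqids in cluster:
            # length is not needed for the grouping itself
            by_binid.setdefault(binid, []).extend(seqids)
        result.append(by_binid)
    return result


# Example taken from the comment in the diff, with sequence ids as strings
example = [[(0, 500, "A", "B", "C")], [(1, 300, "D"), (None, 200, "E")]]
grouped = split_by_binid(example)
```

With the example from the comment, the second bin mixes an existing bin (`1`) with a new entry (`None`), which is presumably the case the update logic has to resolve.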