Feature/pre release v1.0.0 (#28)
* readme adjustments, license

* better code approxsbp and comments

* clean pre_cluster function

* final changes, taxid for leaves

* final changes readme
pirovc authored Aug 5, 2020
1 parent 23fc84c commit 81d792a
Showing 5 changed files with 88 additions and 110 deletions.
58 changes: 27 additions & 31 deletions README.md
Expand Up @@ -4,7 +4,7 @@

[![Build Status](https://travis-ci.org/pirovc/taxsbp.svg?branch=master)](https://travis-ci.org/pirovc/taxsbp)

Implementation of the approximation algorithm for the hierarchically structured bin packing problem [1] based on the NCBI Taxonomy database [2] (uses LCA script from [3]).
Implementation of the approximation algorithm for the hierarchically structured bin packing problem [1] adapted for the NCBI Taxonomy database [2] (uses LCA script from [3]).

## Installation

Expand All @@ -22,6 +22,7 @@ or [manual installation](#manual-installation) without conda
`sequence id <tab> sequence length <tab> taxonomic id [ <tab> specialization]`

* nodes.dmp and merged.dmp from NCBI Taxonomy (ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz)
* specialization can be used to further cluster sequences by groups beyond the taxonomy (e.g. strain name, assembly accession, ...)
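A minimal sketch of reading the input format described above (three mandatory tab-separated fields plus an optional specialization). The `SeqInfo` type, the `parse_seqinfo` helper and the sample rows are illustrative assumptions, not part of TaxSBP:

```python
import csv
import io
from typing import NamedTuple, Optional


class SeqInfo(NamedTuple):
    """One input row (hypothetical container, not a TaxSBP class)."""
    seqid: str
    length: int
    taxid: str
    specialization: Optional[str]


def parse_seqinfo(handle):
    """Parse rows of: sequence id <tab> sequence length <tab> taxonomic id [<tab> specialization]."""
    rows = []
    for fields in csv.reader(handle, delimiter="\t"):
        if not fields:
            continue  # skip empty lines
        seqid, length, taxid = fields[:3]
        # The fourth column is optional; treat a missing or empty value as "no specialization"
        spec = fields[3] if len(fields) > 3 and fields[3] else None
        rows.append(SeqInfo(seqid, int(length), taxid, spec))
    return rows


# Made-up rows in the documented layout (the id/taxid/specialization pairing is an assumption)
sample = "A\t100\t4.1\tS1\nB\t100\t4.2\n"
records = parse_seqinfo(io.StringIO(sample))
```

Rows with fewer than three fields raise a `ValueError` here, which is one way to surface malformed input early.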

### Output

Expand All @@ -33,19 +34,16 @@ or [manual installation](#manual-installation) without conda

The sample data is comprised of:

- Root node (1)
- hierarchy with 5 levels (rank1-rank5)
- Leaf nodes (*)
- 8 Specializations (S1-S9)
- 13 targets (A-M) with equal length (100)
- hierarchy with 5 levels (rank1-rank5), where the root node is "1" and leaf nodes are marked with a "*"
- 8 Specializations (S1-S9) and 13 targets (A-M) with equal length (100)

rank-1 1 ________________
/ \ \ \
rank-2 2.1 2.2 ______ \ \
/ \ \ \ \ \
rank-3 3.1 3.2 3.4 \ \ \
/ / \ \ \ \ \
rank-4 *4.1 *4.2 *4.3 4.4 *4.5 *4.6 \
rank-4 *4.1 *4.2 *4.3 *4.4 *4.5 *4.6 \
/ / / / | \ \ \ \
rank-5 / / / *5.1 *5.2 \ \ \ \
/ / / / | \ \ \ \
Expand All @@ -57,8 +55,6 @@ The sample data is comprised of:

taxsbp.py -i sample_data/seqinfo.tsv -n sample_data/nodes.dmp -l 400

Clusters are limited by size 400

<details>
<summary>Results</summary>

Expand All @@ -83,7 +79,7 @@ Clusters are limited by size 400

taxsbp.py -i sample_data/seqinfo.tsv -n sample_data/nodes.dmp -l 400 -p "rank-2"

ABCD (2.1) and EFGHI (2.2) are forced together (even if bigger than 400)
* ABCD (2.1) and EFGHI (2.2) are pre-clustered together

<details>
<summary>Results</summary>
Expand All @@ -109,7 +105,7 @@ ABCD (2.1) and EFGHI (2.2) are forced together (even if bigger than 400)

taxsbp.py -i sample_data/seqinfo.tsv -n sample_data/nodes.dmp -l 400 -e "rank-4"

Clusters are generated for each sub-tree of nodes from rank-4. The used rank is printed instead of the original.
* Clusters are generated for each sub-tree of nodes from rank-4. The taxid of the exclusive rank is printed instead of the original.

<details>
<summary>Results</summary>
Expand All @@ -135,7 +131,7 @@ Clusters are generated for each sub-tree of nodes from rank-4. The used rank is

taxsbp.py -i sample_data/seqinfo.tsv -n sample_data/nodes.dmp -l 400 -s MySpec -e MySpec

Clusters are exclusive by specialization
* Clusters are exclusive by specialization

<details>
<summary>Results</summary>
Expand All @@ -161,7 +157,7 @@ Clusters are exclusive by specialization

taxsbp.py -i sample_data/seqinfo.tsv -n sample_data/nodes.dmp -l 150 -f 50 -a 5 -e "rank-3"

Clusters of size 150. Fragment inputs in 50 with overlap of 5. Cluster exclusive of "rank-3"
* Clusters of size 150. Fragment inputs in 50 with overlap of 5. Cluster exclusive of "rank-3"
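
As a rough illustration of fragmenting with overlap, the sketch below cuts a sequence into 1-based, inclusive windows of at most `fragment_len` positions, where consecutive windows share `overlap_len` positions. This is one possible convention and an assumption; the coordinates TaxSBP actually reports may differ, and the `fragment` function is hypothetical:

```python
def fragment(length, fragment_len, overlap_len=0):
    """Return (start, end) windows covering a sequence of the given length.

    Assumed convention (not confirmed by the TaxSBP source): 1-based inclusive
    coordinates, each window at most fragment_len long, consecutive windows
    sharing overlap_len positions.
    """
    if length <= 0 or fragment_len <= 0:
        raise ValueError("length and fragment_len must be positive")
    if not 0 <= overlap_len < fragment_len:
        raise ValueError("overlap_len must be in [0, fragment_len)")

    step = fragment_len - overlap_len
    fragments = []
    start = 1
    while True:
        end = min(start + fragment_len - 1, length)
        fragments.append((start, end))
        if end == length:  # last window reached the end of the sequence
            break
        start += step
    return fragments


# A target of length 100 with -f 50 -a 5 under the assumed convention
fragments = fragment(100, 50, 5)  # [(1, 50), (46, 95), (91, 100)]
```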

<details>
<summary>Results</summary>
Expand Down Expand Up @@ -210,47 +206,47 @@ Clusters of size 150. Fragment inputs in 50 with overlap of 5. Cluster exclusive

$ taxsbp -h

usage: TaxSBP [-h] [-i <input_file>] [-o <output_file>] [-n <nodes_file>] [-m <merged_file>] [-b <bins>] [-l <bin_len>]
usage: taxsbp [-h] [-i <input_file>] [-o <output_file>] [-n <nodes_file>] [-m <merged_file>] [-l <bin_len>] [-b <bins>]
[-f <fragment_len>] [-a <overlap_len>] [-p <pre_cluster>] [-e <bin_exclusive>] [-s <specialization>]
[-u <update_file>] [--output-unique-seqid] [-v]
[-u <update_file>] [-w] [-t] [-v]

optional arguments:
-h, --help show this help message and exit
-i <input_file>, --input-file <input_file>
Tab-separated with the fields: sequence id <tab> sequence length <tab> taxonomic id [<tab>
specialization]
-o <output_file>, --output-file <output_file>
Path to the output tab-separated file with the fields. Default: STDOUT
Path to the output tab-separated file. Fields: sequence id <tab> sequence start <tab> sequence
end <tab> sequence length <tab> taxonomic id <tab> bin id [<tab> specialization]. Default: STDOUT
-n <nodes_file>, --nodes-file <nodes_file>
nodes.dmp from NCBI Taxonomy
-m <merged_file>, --merged-file <merged_file>
merged.dmp from NCBI Taxonomy
-b <bins>, --bins <bins>
Approximate number of bins (estimated by total length/bin number). [Mutually exclusive -l]
-l <bin_len>, --bin-len <bin_len>
Maximum bin length (in bp). Use this parameter instead of -b to define the number of bins.
Default: length of the biggest group [Mutually exclusive -b]
-b <bins>, --bins <bins>
Approximate number of bins (estimated by total length/bin number). [Mutually exclusive -l]
-f <fragment_len>, --fragment-len <fragment_len>
Fragment sequences into pieces, output accession will be modified with positions:
ACCESSION/start:end
Fragment sequences into pieces
-a <overlap_len>, --overlap-len <overlap_len>
Overlap length between fragments [Only valid with -f]
-p <pre_cluster>, --pre-cluster <pre_cluster>
Pre-cluster sequences into rank/taxid/specialization, so they won't be split among bins
[none,specialization name,taxid,species,genus,...] Default: none
Pre-cluster sequences into any existing rank, leaves or specialization. Entries will not be
divided in bins ['leaves',specialization name,rank name]
-e <bin_exclusive>, --bin-exclusive <bin_exclusive>
Make bins rank/taxid/specialization exclusive, so bins won't have mixed sequences. When the
chosen rank is not present on a sequence lineage, this sequence will be taxid/specialization
exclusive. [none,specialization name,taxid,species,genus,...] Default: none
Make bins rank, leaves or specialization exclusive. Bins will not have mixed entries. When the
chosen rank is not present on a sequence lineage, this sequence will be leaf/specialization
exclusive. ['leaves',specialization name,rank name]
-s <specialization>, --specialization <specialization>
Specialization name (e.g. assembly, strain). If given, TaxSBP will cluster entries on a
specialized level after the taxonomic id. The specialization identifier should be provided as an
extra column in the input_file and should respect the taxonomic hierarchy (one taxid ->
multiple specializations / one specialization -> one taxid). Default: ''
specialized level after the leaf. The specialization identifier should be provided as an extra
column in the input_file and should respect the taxonomic hierarchy: One leaf can have
multiple specializations but a specialization is present in only one leaf
-u <update_file>, --update-file <update_file>
Previously generated files to be updated. Default: ''
--output-unique-seqid
Output unique sequence ids after fragmentation in the format: seq.id/seq.start:seq.end]
Previously generated clusters to be updated. Output only new sequences
-w, --allow-merge When updating, allow merging of existing bins. Will output the whole set, not only new bins
-t, --silent Do not print warning to STDERR
-v, --version show program's version number and exit
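
The `-b` help text says the bin length is estimated from the total length divided by the number of bins. A minimal sketch of that arithmetic follows; rounding up is an assumption, and `estimate_bin_len` is a hypothetical helper rather than a TaxSBP function:

```python
import math


def estimate_bin_len(seq_lengths, bins):
    """Estimate a maximum bin length from a requested number of bins.

    Follows the help text ("total length/bin number"); rounding up is an
    assumption made here so that the bins can hold the whole input.
    """
    if bins <= 0:
        raise ValueError("bins must be a positive integer")
    return math.ceil(sum(seq_lengths) / bins)


# Sample data from the README: 13 targets of length 100 each
sample_lengths = [100] * 13
bin_len = estimate_bin_len(sample_lengths, 4)  # 1300 / 4 = 325
```

Because the estimate ignores the taxonomic hierarchy and any pre-clustering, the resulting number of bins is only approximate, which matches the wording of the option.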

## Manual Installation
Expand Down
2 changes: 1 addition & 1 deletion setup.py
Expand Up @@ -20,7 +20,7 @@ def read(filename):
author="Vitor C. Piro",
author_email="[email protected]",

description="Implementation of the approximation algorithm for the hierarchically structured bin packing problem based on the NCBI Taxonomy database.",
description="Implementation of the approximation algorithm for the hierarchically structured bin packing problem adapted for the NCBI Taxonomy database.",
long_description=read("README.md"),

packages=['taxsbp'],
Expand Down
6 changes: 6 additions & 0 deletions taxsbp/Group.py
Expand Up @@ -24,12 +24,18 @@ def clear_clusters(self):
self.clusters = []

def get_clusters_to_bpck(self):
# Return list of tuples with the clusters in the format necessary for the binpacking
# Format [(binid,length,seq1,seq2,...,seqN),...]
# Example: [(0,500,A,B,C),(1,300,D),(None,200,E)]
ret = []
for c in self.clusters:
ret.append(c.get_tuples())
return ret

def add_clusters_from_bpck(self, bpck_clusters, leaves: set=None):
# Parse binpacking output - list of lists with tuples generated with get_clusters_to_bpck
# Example: [[(0,500,A,B,C)],[(1,300,D),(None,200,E)]]

# For each cluster returned by binpacking
for cluster in bpck_clusters:
# split clusters in their respective binid assigned (or None)
Expand Down
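
The comments added in this hunk describe the tuple layout `(binid, length, seq1, ..., seqN)` and a binpacking output that is a list of lists of such tuples. The sketch below shows one way such output could be split by assigned bin id (or `None`). It is an illustration of the data format only; `split_by_binid` is a hypothetical function, not the body of `add_clusters_from_bpck`:

```python
from collections import defaultdict


def split_by_binid(bpck_clusters):
    """Group the tuples of each binpacking cluster by their bin id.

    Input format taken from the comments in Group.py:
    [[(binid, length, seq1, ..., seqN), ...], ...], where binid may be None.
    """
    result = []
    for cluster in bpck_clusters:
        by_bin = defaultdict(list)
        for binid, length, *seqids in cluster:
            # Keep the length together with the sequence ids it refers to
            by_bin[binid].append((length, seqids))
        result.append(dict(by_bin))
    return result


# Example taken from the comment in add_clusters_from_bpck
example = [[(0, 500, "A", "B", "C")], [(1, 300, "D"), (None, 200, "E")]]
split = split_by_binid(example)
```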