Merge pull request #29 from pirovc/dev
Version 1.0.0
pirovc authored Aug 5, 2020
2 parents 6e14819 + 81d792a commit 6edcc30
Showing 33 changed files with 1,789 additions and 896 deletions.
22 changes: 22 additions & 0 deletions .travis.yml
@@ -0,0 +1,22 @@
language: python
python:
- "3.4"
- "3.5"
- "3.6"
- "3.7"
- "3.8"
# - "3.8-dev"

install:
- pip install "pandas>=0.22.0"
- pip install binpacking==1.4.3
- git clone https://github.com/pirovc/pylca.git
- cd pylca
- python setup.py install
- cd ..

script:
- python -m unittest discover -s tests/taxsbp/integration/

notifications:
email: false
3 changes: 1 addition & 2 deletions LICENSE
@@ -1,7 +1,6 @@
The MIT License (MIT)

Copyright (c) 2019 Vitor C. Piro - [email protected] - [email protected]
Robert Koch-Institut, Germany
Copyright (c) 2020 Vitor C. Piro - [email protected]

All rights reserved.

301 changes: 252 additions & 49 deletions README.md
@@ -2,23 +2,17 @@

[![install with bioconda](https://img.shields.io/badge/install%20with-bioconda-brightgreen.svg?style=flat)](http://bioconda.github.io/recipes/taxsbp/README.html)

Vitor C. Piro ([email protected])
[![Build Status](https://travis-ci.org/pirovc/taxsbp.svg?branch=master)](https://travis-ci.org/pirovc/taxsbp)

Implementation of the approximation algorithm for the hierarchically structured bin packing problem [1] based on the NCBI Taxonomy database [2] (uses LCA script from [3]).

## Dependencies:

- python>=3.4
- [binpacking](https://pypi.org/project/binpacking/)==1.4.1
Implementation of the approximation algorithm for the hierarchically structured bin packing problem [1] adapted for the NCBI Taxonomy database [2] (uses LCA script from [3]).

## Installation

```sh
git clone https://github.com/pirovc/taxsbp.git
cd taxsbp
python setup.py install
taxsbp -h
conda install -c bioconda -c conda-forge taxsbp
```
or [manual installation](#manual-installation) without conda

## Usage

### Input
@@ -28,57 +22,266 @@ taxsbp -h
`sequence id <tab> sequence length <tab> taxonomic id [ <tab> specialization]`

* nodes.dmp and merged.dmp from NCBI Taxonomy (ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz)
* specialization can be used to further cluster sequences by groups beyond the taxonomy (e.g. strain name, assembly accession, ...)

### Output

* A tab-separated file:

`sequence id [/seq.start:seq.end] <tab> sequence length <tab> taxonomic id [ <tab> specialization] <tab> bin id`
`sequence id <tab> seq.start <tab> seq.end <tab> sequence length <tab> taxonomic id <tab> bin id [ <tab> specialization] `
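
As a minimal sketch of how this output could be consumed, the snippet below groups rows by bin id. It assumes the column order of the new format listed above and skips `#`-prefixed header rows such as the one shown in the sample results. The helper name `read_bins` is hypothetical and not part of TaxSBP.

```python
import io
from collections import defaultdict


def read_bins(handle):
    """Group output rows by bin id (hypothetical helper, not part of TaxSBP)."""
    bins = defaultdict(list)
    for line in handle:
        line = line.rstrip("\n")
        # Skip empty lines and header/comment rows (assumed to start with '#').
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        seqid, start, end, length, taxid, binid = fields[:6]
        # The specialization column is optional and comes last.
        spec = fields[6] if len(fields) > 6 else None
        bins[binid].append((seqid, int(start), int(end), int(length), taxid, spec))
    return bins


# Three rows in the documented format, used only for illustration.
example = io.StringIO(
    "F\t1\t100\t100\t5.1\t0\n"
    "D\t1\t100\t100\t4.3\t1\n"
    "C\t1\t100\t100\t4.2\t1\n"
)
bins = read_bins(example)
bin_lengths = {b: sum(row[3] for row in rows) for b, rows in bins.items()}
print(bin_lengths)
```

Summing the length column per bin is a quick way to check that no bin exceeds the length requested with `-l`.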

## Examples with sample data:

The sample data consists of:

- hierarchy with 5 levels (rank1-rank5), where the root node is "1" and leaf nodes are marked with a "*"
- 9 specializations (S1-S9) and 13 targets (A-M) with equal length (100)

    1 (rank-1)
    |-- 2.1 (rank-2)
    |   |-- 3.1 (rank-3)
    |   |   `-- *4.1 (rank-4)
    |   |       `-- S1: A, B
    |   `-- 3.2 (rank-3)
    |       |-- *4.2 (rank-4)
    |       |   `-- S2: C
    |       `-- *4.3 (rank-4)
    |           `-- S3: D
    |-- 2.2 (rank-2)
    |   |-- 3.4 (rank-3)
    |   |   `-- *4.4 (rank-4)
    |   |       |-- *5.1 (rank-5)
    |   |       |   `-- S4: E, F
    |   |       |-- *5.2 (rank-5)
    |   |       |   `-- S5: G
    |   |       `-- S6: H
    |   `-- *4.5 (rank-4)
    |       `-- S7: I
    |-- *4.6 (rank-4)
    |   `-- S8: J, K, L
    `-- S9: M

### Clusters of size 400

taxsbp.py -i sample_data/seqinfo.tsv -n sample_data/nodes.dmp -l 400

<details>
<summary>Results</summary>

#id st end len tax bin
F 1 100 100 5.1 0
E 1 100 100 5.1 0
H 1 100 100 4.4 0
G 1 100 100 5.2 0
D 1 100 100 4.3 1
C 1 100 100 4.2 1
B 1 100 100 4.1 1
A 1 100 100 4.1 1
L 1 100 100 4.6 2
K 1 100 100 4.6 2
J 1 100 100 4.6 2
M 1 100 100 1 2
I 1 100 100 4.5 3

</details>

### Pre-clustered by "rank-2"

taxsbp.py -i sample_data/seqinfo.tsv -n sample_data/nodes.dmp -l 400 -p "rank-2"

* ABCD (2.1) and EFGHI (2.2) are pre-clustered together

<details>
<summary>Results</summary>

#id st end len tax bin
I 1 100 100 4.5 0
G 1 100 100 5.2 0
E 1 100 100 5.1 0
F 1 100 100 5.1 0
H 1 100 100 4.4 0
C 1 100 100 4.2 1
D 1 100 100 4.3 1
A 1 100 100 4.1 1
B 1 100 100 4.1 1
J 1 100 100 4.6 2
K 1 100 100 4.6 2
L 1 100 100 4.6 2
M 1 100 100 1 2

</details>

### Bin exclusive clusters by "rank-4"

taxsbp.py -i sample_data/seqinfo.tsv -n sample_data/nodes.dmp -l 400 -e "rank-4"

* Clusters are generated for each sub-tree of nodes from rank-4. The taxid of the exclusive rank is printed instead of the original.

<details>
<summary>Results</summary>

#id st end len tax bin
F 1 100 100 4.4 0
E 1 100 100 4.4 0
H 1 100 100 4.4 0
G 1 100 100 4.4 0
L 1 100 100 4.6 1
K 1 100 100 4.6 1
J 1 100 100 4.6 1
B 1 100 100 4.1 2
A 1 100 100 4.1 2
I 1 100 100 4.5 3
D 1 100 100 4.3 4
C 1 100 100 4.2 5
M 1 100 100 1 6

</details>

### Cluster with specialization

taxsbp.py -i sample_data/seqinfo.tsv -n sample_data/nodes.dmp -l 400 -s MySpec -e MySpec

* Clusters are exclusive by specialization

<details>
<summary>Results</summary>

#id st end len tax bin spec
L 1 100 100 4.6 0 S8
K 1 100 100 4.6 0 S8
J 1 100 100 4.6 0 S8
F 1 100 100 5.1 1 S4
E 1 100 100 5.1 1 S4
B 1 100 100 4.1 2 S1
A 1 100 100 4.1 2 S1
G 1 100 100 5.2 3 S5
D 1 100 100 4.3 4 S3
M 1 100 100 1 5 S9
H 1 100 100 4.4 6 S6
I 1 100 100 4.5 7 S7
C 1 100 100 4.2 8 S2

</details>

### Cluster with fragmentation

taxsbp.py -i sample_data/seqinfo.tsv -n sample_data/nodes.dmp -l 150 -f 50 -a 5 -e "rank-3"

* Clusters of size 150. Input sequences are fragmented into pieces of length 50 with an overlap of 5. Clusters are exclusive by "rank-3".

<details>
<summary>Results</summary>

#id st end len tax bin
F 1 55 55 3.4 0
E 1 55 55 3.4 0
B 1 55 55 3.1 1
A 1 55 55 3.1 1
L 1 55 55 4.6 2
K 1 55 55 4.6 2
G 1 55 55 3.4 3
G 51 100 50 3.4 3
H 1 55 55 3.4 4
H 51 100 50 3.4 4
D 1 55 55 3.2 5
D 51 100 50 3.2 5
C 1 55 55 3.2 6
C 51 100 50 3.2 6
I 1 55 55 4.5 7
I 51 100 50 4.5 7
J 1 55 55 4.6 8
L 51 100 50 4.6 8
M 1 55 55 1 9
M 51 100 50 1 9
F 51 100 50 3.4 10
E 51 100 50 3.4 10
B 51 100 50 3.1 11
A 51 100 50 3.1 11
K 51 100 50 4.6 12
J 51 100 50 4.6 12

</details>
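
The fragment coordinates in the results above can be reproduced with a short sketch. The scheme below is inferred only from the sample output (1-based inclusive coordinates, each fragment extended by the overlap and clipped at the sequence end); it is not taken from the TaxSBP source, and the function name `fragment` is hypothetical. Edge cases such as very short trailing fragments may be handled differently by TaxSBP.

```python
def fragment(length, fragment_len, overlap_len):
    """Return 1-based inclusive (start, end) pairs for each fragment.

    Assumption: a new fragment starts every `fragment_len` positions and
    extends `overlap_len` positions into the next one, clipped at `length`.
    """
    coords = []
    start = 1
    while start <= length:
        end = min(start + fragment_len + overlap_len - 1, length)
        coords.append((start, end))
        start += fragment_len
    return coords


# Sequence of length 100, -f 50, -a 5, as in the example above.
for start, end in fragment(100, 50, 5):
    print(start, end, end - start + 1)
```

Under this scheme the first fragment spans 1 to 55 (length 55) and the second 51 to 100 (length 50), which matches the `st`, `end` and `len` columns in the results.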

## Examples with real data:

### Prepare data:

gzip -d sample_data/*.gz

### Clustering:

taxsbp.py -i sample_data/20181219_abfv_refseq_cg.tsv -n sample_data/20181219_abfv_refseq_cg_nodes.dmp -l 10000000 -f 999500 -a 500

## Parameters:

$ taxsbp -h

usage: TaxSBP [-h] -f <input_file> -n <nodes_file> [-m <merged_file>]
[-b <bins>] [-l <bin_len>] [-a <fragment_len>]
[-o <overlap_len>] [-p <pre_cluster>] [-r <bin_exclusive>]
[-z <specialization>] [-u <update_file>] [-v]
usage: taxsbp [-h] [-i <input_file>] [-o <output_file>] [-n <nodes_file>] [-m <merged_file>] [-l <bin_len>] [-b <bins>]
[-f <fragment_len>] [-a <overlap_len>] [-p <pre_cluster>] [-e <bin_exclusive>] [-s <specialization>]
[-u <update_file>] [-w] [-t] [-v]

optional arguments:
-h, --help show this help message and exit
-f <input_file> Tab-separated with the fields: sequence id <tab>
sequence length <tab> taxonomic id [<tab>
specialization]
-n <nodes_file> nodes.dmp from NCBI Taxonomy
-m <merged_file> merged.dmp from NCBI Taxonomy
-b <bins> Approximate number of bins (estimated by total
length/bin number). Default: 50 [Mutually exclusive -l]
-l <bin_len> Maximum bin length (in bp). Use this parameter instead
of -b to define the number of bins [Mutually exclusive
-b]
-a <fragment_len> Fragment sequences into pieces, output accession will
be modified with positions: ACCESSION/start:end
-o <overlap_len> Overlap length between fragments [Only valid with -a]
-p <pre_cluster> Pre-cluster sequences into rank/taxid/specialization,
so they won't be split among bins
[none,specialization name,taxid,species,genus,...]
Default: none
-r <bin_exclusive> Make bins rank/taxid/specialization exclusive, so bins
won't have mixed sequences. When the chosen rank is not
present on a sequence lineage, this sequence will be
taxid/specialization exclusive. [none,specialization
name,taxid,species,genus,...] Default: none
-z <specialization> Specialization name (e.g. assembly, strain). If given,
TaxSBP will cluster entries on a specialized level
after the taxonomic id. The specialization identifier
should be provided as an extra column in the
input_file and should respect the taxonomic hierarchy
(one taxid -> multiple specializations / one
specialization -> one taxid). Default: ''
-u <update_file> Previously generated files to be updated. Default: ''
-v show program's version number and exit
-h, --help show this help message and exit
-i <input_file>, --input-file <input_file>
Tab-separated with the fields: sequence id <tab> sequence length <tab> taxonomic id [<tab>
specialization]
-o <output_file>, --output-file <output_file>
Path to the output tab-separated file. Fields: sequence id <tab> sequence start <tab> sequence
end <tab> sequence length <tab> taxonomic id <tab> bin id [<tab> specialization]. Default: STDOUT
-n <nodes_file>, --nodes-file <nodes_file>
nodes.dmp from NCBI Taxonomy
-m <merged_file>, --merged-file <merged_file>
merged.dmp from NCBI Taxonomy
-l <bin_len>, --bin-len <bin_len>
Maximum bin length (in bp). Use this parameter instead of -b to define the number of bins.
Default: length of the biggest group [Mutually exclusive -b]
-b <bins>, --bins <bins>
Approximate number of bins (estimated by total length/bin number). [Mutually exclusive -l]
-f <fragment_len>, --fragment-len <fragment_len>
Fragment sequences into pieces
-a <overlap_len>, --overlap-len <overlap_len>
Overlap length between fragments [Only valid with -f]
-p <pre_cluster>, --pre-cluster <pre_cluster>
Pre-cluster sequences into any existing rank, leaves or specialization. Entries will not be
divided in bins ['leaves',specialization name,rank name]
-e <bin_exclusive>, --bin-exclusive <bin_exclusive>
Make bins rank, leaves or specialization exclusive. Bins will not have mixed entries. When the
chosen rank is not present on a sequence lineage, this sequence will be leaf/specialization
exclusive. ['leaves',specialization name,rank name]
-s <specialization>, --specialization <specialization>
Specialization name (e.g. assembly, strain). If given, TaxSBP will cluster entries on a
specialized level after the leaf. The specialization identifier should be provided as an extra
column in the input_file and should respect the taxonomic hierarchy: one leaf can have
multiple specializations but a specialization is present in only one leaf
-u <update_file>, --update-file <update_file>
Previously generated clusters to be updated. Output only new sequences
-w, --allow-merge When updating, allow merging of existing bins. Will output the whole set, not only new bins
-t, --silent Do not print warning to STDERR
-v, --version show program's version number and exit

## Manual Installation

### Dependencies:

- python>=3.4
- [binpacking](https://pypi.org/project/binpacking/)==1.4.3
- [pylca](https://github.com/pirovc/pylca)==1.0.0
- [pandas](https://pypi.org/project/pandas/)>=0.22.0 (tests only)

### Pylca:

```sh
git clone https://github.com/pirovc/pylca
cd pylca
python setup.py install
```

### TaxSBP + binpacking:

```sh
git clone https://github.com/pirovc/taxsbp.git
cd taxsbp
python setup.py install
taxsbp -h
```

### Testing:

```sh
pip install "pandas>=0.22.0"
cd taxsbp
python3 -m unittest discover -s tests/taxsbp/integration/
```

References:
-----------
@@ -87,4 +290,4 @@ References:

[2] Federhen, S. (2012). The NCBI Taxonomy database. Nucleic Acids Research, 40(D1), D136–D143. http://doi.org/10.1093/nar/gkr1178

[3] https://www.ics.uci.edu/~eppstein/
[3] https://www.ics.uci.edu/~eppstein/ in the package https://github.com/pirovc/pylca
Binary file added sample_data/20181219_abfv_refseq_cg.tsv.gz
Binary file not shown.
Binary file added sample_data/20181219_abfv_refseq_cg_nodes.dmp.gz
Binary file not shown.
14 changes: 14 additions & 0 deletions sample_data/nodes.dmp
@@ -0,0 +1,14 @@
1 | 1 | rank-1 |
2.1 | 1 | rank-2 |
2.2 | 1 | rank-2 |
3.1 | 2.1 | rank-3 |
3.2 | 2.1 | rank-3 |
4.1 | 3.1 | rank-4 |
4.2 | 3.2 | rank-4 |
4.3 | 3.2 | rank-4 |
3.4 | 2.2 | rank-3 |
4.4 | 3.4 | rank-4 |
5.1 | 4.4 | rank-5 |
5.2 | 4.4 | rank-5 |
4.5 | 2.2 | rank-4 |
4.6 | 1 | rank-4 |
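
To illustrate how a file in this layout maps to the hierarchy, here is a minimal sketch that reads `|`-separated node lines into parent and rank lookups and walks a lineage up to the root. It assumes the first three fields are taxid, parent taxid and rank, and that the root is its own parent, as in the sample above. The helper names are hypothetical and this is not the pylca or TaxSBP implementation.

```python
def read_nodes(lines):
    """Parse '|'-separated node lines into parent and rank dictionaries."""
    parents, ranks = {}, {}
    for line in lines:
        fields = [f.strip() for f in line.split("|")]
        if len(fields) < 3 or not fields[0]:
            continue
        taxid, parent, rank = fields[:3]
        parents[taxid] = parent
        ranks[taxid] = rank
    return parents, ranks


def lineage(taxid, parents):
    """Return the path from taxid up to the root (root is its own parent)."""
    path = [taxid]
    while parents[path[-1]] != path[-1]:
        path.append(parents[path[-1]])
    return path


def ancestor_at_rank(taxid, rank, parents, ranks):
    """Return the ancestor at the given rank, or taxid itself if none exists."""
    for node in lineage(taxid, parents):
        if ranks[node] == rank:
            return node
    return taxid


# A subset of the sample nodes.dmp shown above.
sample = [
    "1 | 1 | rank-1 |",
    "2.2 | 1 | rank-2 |",
    "3.4 | 2.2 | rank-3 |",
    "4.4 | 3.4 | rank-4 |",
    "5.1 | 4.4 | rank-5 |",
    "4.6 | 1 | rank-4 |",
]
parents, ranks = read_nodes(sample)
print(lineage("5.1", parents))
print(ancestor_at_rank("5.1", "rank-4", parents, ranks))
print(ancestor_at_rank("4.6", "rank-3", parents, ranks))
```

The fallback in `ancestor_at_rank` mirrors the documented behaviour of `-e`: when the chosen rank is not present in a lineage, the sequence stays exclusive at its own node, as with taxid 4.6 in the "rank-3" example of the README.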
13 changes: 13 additions & 0 deletions sample_data/seqinfo.tsv
@@ -0,0 +1,13 @@
A 100 4.1 S1
B 100 4.1 S1
C 100 4.2 S2
D 100 4.3 S3
E 100 5.1 S4
F 100 5.1 S4
G 100 5.2 S5
H 100 4.4 S6
I 100 4.5 S7
J 100 4.6 S8
K 100 4.6 S8
L 100 4.6 S8
M 100 1 S9
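
As a hedged illustration of the input format, the sketch below reads these rows (assumed to be tab-separated, as the README specifies), sums the sequence length per taxonomic id, and computes a simple lower bound on the number of bins for a given maximum bin length. The helper name `read_seqinfo` is hypothetical, and the bound ignores the hierarchical constraints that TaxSBP applies, so the real number of bins can be higher.

```python
import io
import math
from collections import defaultdict


def read_seqinfo(handle):
    """Read rows of: sequence id, length, taxid and optional specialization."""
    rows = []
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue
        spec = fields[3] if len(fields) > 3 else None
        rows.append((fields[0], int(fields[1]), fields[2], spec))
    return rows


# The sample seqinfo.tsv shown above, written with explicit tabs.
SAMPLE = "\n".join([
    "A\t100\t4.1\tS1", "B\t100\t4.1\tS1", "C\t100\t4.2\tS2",
    "D\t100\t4.3\tS3", "E\t100\t5.1\tS4", "F\t100\t5.1\tS4",
    "G\t100\t5.2\tS5", "H\t100\t4.4\tS6", "I\t100\t4.5\tS7",
    "J\t100\t4.6\tS8", "K\t100\t4.6\tS8", "L\t100\t4.6\tS8",
    "M\t100\t1\tS9",
])

rows = read_seqinfo(io.StringIO(SAMPLE))

length_per_taxid = defaultdict(int)
for _seqid, length, taxid, _spec in rows:
    length_per_taxid[taxid] += length

total = sum(length for _seqid, length, _taxid, _spec in rows)
bin_len = 400  # same value as -l 400 in the README examples
min_bins = math.ceil(total / bin_len)
print(total, min_bins, dict(length_per_taxid))
```

For the sample data the total length is 1300, so at least 4 bins of length 400 are needed, which is the number of bins in the first README example.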