Commit

tweak doc
gregdenay committed Jun 14, 2024
1 parent f09fec3 commit 6df91be
Showing 5 changed files with 48 additions and 50 deletions.
2 changes: 1 addition & 1 deletion docs/index.md
@@ -42,7 +42,7 @@ docker pull gregdenay/taxidtools

With the [NCBI's taxdump files](https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/new_taxdump/) installed locally:

```python
``` py
>>> import taxidTools as txd
>>> ncbi = txd.read_taxdump('nodes.dmp', 'rankedlineage.dmp', 'merged.dmp')
>>> ncbi.getName('9606')
49 changes: 24 additions & 25 deletions docs/recipes/verify_blast.md
@@ -7,9 +7,9 @@ are in agreement with the expected composition of the sample to calculate to per
of a method for example.

First things first, let's load the taxdump file in a Taxonomy object:
```python
``` py
import taxidTools
-tax = read_taxdump("nodes.dmp", "rankedlineage.dmp", "merged.dmp")
+tax = taxidTools.read_taxdump("nodes.dmp", "rankedlineage.dmp", "merged.dmp")
```
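
For orientation, `nodes.dmp` is a plain-text table whose fields are separated by `\t|\t` and whose lines end in `\t|`; the first three fields are the taxid, the parent taxid and the rank. The sketch below only illustrates that layout and is not how taxidTools reads the file. `parse_nodes_line` is a hypothetical helper, and the sample line is abbreviated to its first three fields.

```python
def parse_nodes_line(line):
    """Split one nodes.dmp-style line into (taxid, parent taxid, rank)."""
    line = line.rstrip("\n")
    # Drop the trailing "\t|" terminator before splitting on the separator.
    if line.endswith("\t|"):
        line = line[:-2]
    fields = line.split("\t|\t")
    taxid, parent, rank = fields[:3]
    return taxid, parent, rank


# Abbreviated sample line; real files carry more columns after the rank.
sample = "9606\t|\t9605\t|\tspecies\t|\n"
print(parse_nodes_line(sample))
```

Extra columns after the rank are simply ignored by the slice, so the same helper would also accept full-width lines.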

## Getting a taxid for each sequence
@@ -26,7 +26,7 @@ Gallus gallus species 0.9

In order to work with these nodes later we want to create a list of Nodes from this output:

```python
``` py
# This allows you to run the code in your interpreter
# in practice you should parse the sintax output into a list of names
names = ["Bos", "Gallus gallus"]
@@ -42,7 +42,7 @@ one of our sequences. BLAST can typically output taxids directly, otherwise get
names like above. Let's say we parsed our BLAST file into a list of lists of taxids. Each element of
the outer list is a list of hits for a single sequence:

```python
``` py
res = [
[9913, 9913, 72004],
[9031, 9031]
@@ -52,9 +52,9 @@ res = [
Ideally we would like to have a single assignment for each sequence. We can do this by assigning the last common ancestor
of all the hits for this sequence, or use a less stringent approach, like a majority agreement:

```python
``` py
# Here we could also choose to use tax.lca() instead
-nodes = [tax.consensus(ids, 0.51) fir ids in res]
+nodes = [tax.consensus(ids, 0.51) for ids in res]
```
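
To make the idea of a majority agreement concrete, here is a minimal sketch on a toy taxonomy. It is not the taxidTools implementation: `PARENT` and `toy_consensus` are hypothetical, the tree is invented for illustration, and the climbing strategy (move every hit up one level until one node reaches the threshold) is an assumption about how such a consensus can be computed.

```python
from collections import Counter

# Toy child -> parent map (hypothetical, not NCBI data); the root is its own parent.
PARENT = {
    "root": "root",
    "Bovidae": "root",
    "Bos": "Bovidae",
    "Bos taurus": "Bos",
    "Bos mutus": "Bos",
    "Gallus": "root",
    "Gallus gallus": "Gallus",
}


def toy_consensus(ids, min_share):
    """Return the first node whose share of the hits reaches min_share."""
    if not 0 < min_share <= 1:
        raise ValueError("min_share must be in (0, 1]")
    current = list(ids)
    while True:
        node, count = Counter(current).most_common(1)[0]
        if count / len(current) >= min_share:
            return node
        # No node is frequent enough yet: replace every hit by its parent.
        current = [PARENT[i] for i in current]


print(toy_consensus(["Bos taurus", "Bos taurus", "Bos mutus"], 0.51))
print(toy_consensus(["Bos taurus", "Bos mutus"], 0.51))
```

The loop always terminates because every lineage eventually reaches the root, at which point a single node holds all the hits.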

We now have a single Node object for each sequence, neatly organized in a list!
@@ -65,7 +65,7 @@ In order to verify that our results are correct, we want to compare
this list to a list of expected taxids, for example Bos taurus (cattle) and
Gallus gallus (chicken), both at the species level:

```python
``` py
expected = [9913, 9031]
```

@@ -79,9 +79,8 @@ expected components. The smallest distance indicates the corresponding expected c
One has to keep in mind that different branches of the taxonomy can have a wildly different number of nodes,
so it can greatly simplify things to first normalize the taxonomy for such an approach:

```python
-norm = tax.copy()
-norm.filterRanks(inplace=False)
``` py
+norm = tax.filterRanks(inplace=False)

distances = []
for n in nodes:
@@ -95,14 +94,14 @@ index_corr = [d.index(min(d)) for d in distances]
Now that we have a list which links each consensus to the index of its closest match in the list of
expected species, it is straightforward to determine the agreement rank between result and expectation:

```python
-rank = []
``` py
+ranks = []
for i in range(len(nodes)):
-    rank.append(
+    ranks.append(
        tax.lca(
-            nodes[i].taxid,
-            expected[index_corr[i]]
-        )
+            [nodes[i].taxid,
+            expected[index_corr[i]]]
        ).rank
    )
```

@@ -112,8 +111,8 @@ Let's say we want to determine these values at the genus resolution. The advanta
the taxonomy earlier is that we don't need to care about the precise order of ranks in each branch;
we can simply check whether the agreement rank is either 'genus' or 'species':

```python
-[1 if r in ['genus', 'species'] else 0 for r in ranks]
``` py
+[True if r in ['genus', 'species'] else False for r in ranks]
```
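
As a small illustration of how such flags could feed a performance figure, the sketch below turns a list of agreement ranks into booleans and into the share of sequences that reached the requested resolution. The rank values are made up, and `resolution_reached` is a hypothetical helper, not part of taxidTools.

```python
def resolution_reached(ranks, accepted=("genus", "species")):
    """Flag each agreement rank that is at or below the requested resolution."""
    return [rank in accepted for rank in ranks]


# Hypothetical agreement ranks for three sequences.
ranks = ["genus", "species", "family"]
flags = resolution_reached(ranks)

# Booleans sum as 0/1, so this is the fraction of sequences that passed.
share = sum(flags) / len(flags)
print(flags, share)
```

Listing the accepted ranks explicitly only works because the normalized taxonomy guarantees that the same ranks exist in every branch.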

### Unnormalized taxonomy
@@ -122,27 +121,27 @@ Of course it is possible to follow a similar approach without normalizing the ta
significantly more complicated. For example, checking whether the *Bos taurus* (9913) consensus (here genus) is
under the genus level involves determining the corresponding expected node as before with the unnormalized taxonomy.

```python
``` py
distances = [tax.distance(9913, e) for e in expected]
# Getting the index of the minimum distance
index_corr = distances.index(min(distances))
agreement = expected[index_corr]
```
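
The lookup of the closest expected entry can be isolated in a small helper. This is a sketch under assumptions: `closest_index` and `lookup_distance` are hypothetical names, and the distance values are invented rather than computed from a real taxonomy.

```python
def closest_index(query, candidates, distance):
    """Return the index of the candidate with the smallest distance to query."""
    distances = [distance(query, candidate) for candidate in candidates]
    # list.index returns the first match, so ties go to the earliest candidate.
    return distances.index(min(distances))


# Invented distances standing in for a real taxonomy lookup.
TOY_DISTANCES = {("Bos", "Bos taurus"): 1, ("Bos", "Gallus gallus"): 9}


def lookup_distance(a, b):
    return TOY_DISTANCES[(a, b)]


expected_names = ["Bos taurus", "Gallus gallus"]
match = expected_names[closest_index("Bos", expected_names, lookup_distance)]
print(match)
```

Passing the distance function as an argument keeps the helper independent of how distances are actually obtained.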

Now instead of simply checking the rank of the agreement, we will rather determine the ancestor
node of the expected species at the required resolution:

```python
-lin = txd.getLineage(agreement)
-target = lin.filter('genus')[0]
``` py
+lin = tax.getAncestry(agreement)
+lin.filter(['genus'])
+target = lin[0]
```

Now the last common ancestor of our result and the corresponding expected species is either
an ancestor of `target`, in which case the result did not reach the expected resolution,
or its descendant or the target itself, in which case the required resolution is attained:

```python
-not tax.isAncestor(target, tax.lca(agreement, 9913))
``` py
+not tax.isAncestorOf(target.taxid, tax.lca([agreement, 9913]))
```

Note that in the last expression above we added `not` in order to have the results in the same form
22 changes: 11 additions & 11 deletions docs/usage/advanced.md
@@ -22,7 +22,7 @@ for example.
Should you want to keep a copy of the original Taxonomy (and the Nodes), you should
do a copy:

```python
``` py
>>> import copy
>>> backup = tax.copy()
```
@@ -33,7 +33,7 @@ Alternatively you can save the Taxonomy in JSON format for a later use (see next

Determining a consensus node from a set of taxids can be done as easily as:

```python
``` py
>>> tax.lca(['9606', '10090']).name # Mice and men
'Euarchontoglires'
```
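
For readers who want to see what a last common ancestor is in terms of plain data, here is a minimal sketch on a heavily simplified tree. `PARENT`, `ancestry` and `toy_lca` are hypothetical, most intermediate ranks are omitted, and this is not how taxidTools computes `lca()`.

```python
# Simplified child -> parent map (most real ranks omitted); root is its own parent.
PARENT = {
    "root": "root",
    "Mammalia": "root",
    "Euarchontoglires": "Mammalia",
    "Primates": "Euarchontoglires",
    "Rodentia": "Euarchontoglires",
    "Homo sapiens": "Primates",
    "Mus musculus": "Rodentia",
}


def ancestry(node):
    """List a node followed by all of its ancestors up to the root."""
    lineage = [node]
    while PARENT[lineage[-1]] != lineage[-1]:
        lineage.append(PARENT[lineage[-1]])
    return lineage


def toy_lca(nodes):
    """Return the deepest node shared by the ancestries of all inputs."""
    first, *rest = [ancestry(n) for n in nodes]
    shared = set(first).intersection(*map(set, rest))
    # Lineages are ordered leaf to root, so the first shared node is the deepest.
    return next(n for n in first if n in shared)


print(toy_lca(["Homo sapiens", "Mus musculus"]))
```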
@@ -43,7 +43,7 @@ frequencies of a bunch of taxids. You can set a minimal frequency threshold (bet
As soon as a single node meets this threshold, it will be returned as a consensus. If this threshold is
not met with the given input, then the parents of the input will be considered, and so on.

```python
``` py
>>> tax_list = ['9606']*6 + ['314146']*3 + ['4641']*8 # Mice and men and bananas
>>> tax.consensus(tax_list, 0.51).name
'Euarchontoglires'
@@ -55,7 +55,7 @@ not met with the given input, then the parents of the input will be considered,

Distance between two nodes is straightforward to calculate:

```python
``` py
>>> tax.distance('9606', '10090')
18
```
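
One way to think about such a distance is as the number of edges on the path between the two nodes through their last common ancestor. The sketch below assumes that definition; `PARENT`, `ancestry` and `toy_distance` are hypothetical, and the toy tree omits most ranks, so its values are much smaller than the 18 reported for the real taxonomy.

```python
# Simplified child -> parent map (most real ranks omitted); root is its own parent.
PARENT = {
    "root": "root",
    "Mammalia": "root",
    "Euarchontoglires": "Mammalia",
    "Primates": "Euarchontoglires",
    "Rodentia": "Euarchontoglires",
    "Homo sapiens": "Primates",
    "Mus musculus": "Rodentia",
}


def ancestry(node):
    """List a node followed by all of its ancestors up to the root."""
    lineage = [node]
    while PARENT[lineage[-1]] != lineage[-1]:
        lineage.append(PARENT[lineage[-1]])
    return lineage


def toy_distance(a, b):
    """Count the edges between a and b via their deepest shared ancestor."""
    steps_from_b = {node: i for i, node in enumerate(ancestry(b))}
    for steps_from_a, node in enumerate(ancestry(a)):
        if node in steps_from_b:
            return steps_from_a + steps_from_b[node]
    raise ValueError("nodes do not share a root")


print(toy_distance("Homo sapiens", "Mus musculus"))
```

Because the count depends on how many intermediate nodes each branch contains, filtering the taxonomy to a fixed set of ranks makes distances comparable across branches.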
@@ -69,7 +69,7 @@ If you don't care about part of the Taxonomy
you can extract a subtree and/or filter the Taxonomy to keep only specific
ranks.

```python
``` py
>>> tax.prune('40674') # mammals class
>>> tax.filterRanks(['species', 'genus', 'family', 'order', 'class', 'phylum', 'kingdom'])
>>> tax.getAncestry('9606')
@@ -85,7 +85,7 @@ to calculate internode distances or comparing Lineages. When requesting a rank
which nodes are missing, these nodes will be replaced by a DummyNode.
This special kind of node acts as a placeholder for non-existing nodes.

```python
``` py
>>> tax.filterRanks(['species', 'subgenus', 'genus', 'family', 'order', 'class', 'phylum', 'kingdom'])
>>> tax.getAncestry('9606')
Lineage([Node(9606), DummyNode(AAeFFWcs), Node(9605), Node(9604), Node(9443), Node(40674),
@@ -94,7 +94,7 @@ Node(7711), Node(33208), Node(1)])

Note that the above methods **mutate** the nodes:

```python
``` py
>>> tax.getParent('9606')
DummyNode(AAeFFWcs)
>>> tax.getRank('AAeFFWcs')
@@ -104,7 +104,7 @@ DummyNode(AAeFFWcs)
The formatted Linnaean taxonomy ranks can be retrieved from the utility function `linne()`
for use in diverse methods:

```python
``` py
>>> taxidTools.linne()
['species', 'genus', 'family', 'order', 'class', 'phylum', 'kingdom']
>>> tax.filterRanks(taxidTools.linne())
@@ -118,7 +118,7 @@ As you probably already noticed, parsing the Taxonomy definition can
take a couple of minutes. If you plan on regularly using a subset of the Taxonomy,
it can be beneficial to save a filtered version to a JSON file and to reload it later.

```python
``` py
>>> tax.write("my_filtered_taxonomy.json")
>>> new_tax = taxidTools.read_json("my_filtered_taxonomy.json")
```
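
The save-and-reload pattern itself can be tried without the library. The sketch below round-trips a small dictionary through a JSON file using only the standard library; the dictionary layout is an assumption made for illustration and is not the file format written by taxidTools.

```python
import json
import os
import tempfile

# Hypothetical two-node taxonomy; keys are strings so they survive JSON unchanged.
toy_taxonomy = {
    "1": {"name": "root", "rank": "root", "parent": None},
    "2": {"name": "node1", "rank": "rank1", "parent": "1"},
}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_taxonomy.json")
    # Write once after the expensive filtering step ...
    with open(path, "w") as handle:
        json.dump(toy_taxonomy, handle)
    # ... and reload cheaply in later sessions.
    with open(path) as handle:
        reloaded = json.load(handle)

print(reloaded == toy_taxonomy)
```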
@@ -128,7 +128,7 @@ it can be beneficial to save a filtered version to a JSON file and to reload it
Creating a Taxonomy object can also be done without the Taxdump files.
You can either manually create Nodes and build a Taxonomy from them:

```python
``` py
>>> root = taxidTools.Node(taxid = 1, name = 'root', rank = 'root')
>>> node1 = taxidTools.Node(taxid = 2, name = 'node1', rank = 'rank1', parent = root)
>>> tax = taxidTools.Taxonomy.from_list([root, node1])
@@ -145,7 +145,7 @@ to create a parsing function to:
Here is boilerplate code for such a function, assuming that each node
is defined on a single line:

```python
``` py
def custom_parser(file):
# Create two empty dict that will store the node
# information and parent information respectively
18 changes: 9 additions & 9 deletions docs/usage/quickstart.md
@@ -14,13 +14,13 @@ using the Taxdump files is the easiest solution.

Start by importing taxidTools:

```python
``` py
>>> import taxidTools
```

Then load the taxdump files that you saved and unpacked locally:

```python
``` py
>>> tax = taxidTools.read_taxdump(
"path/to/nodes.dmp",
"path/to/rankedlineage.dmp",
@@ -53,7 +53,7 @@ is referred to as root node and represents the top of the taxonomy.

All these properties can be easily accessed, using the taxid number:

```python
``` py
>>> tax.getName('9606')
'Homo sapiens'
>>> tax.getRank('9606')
@@ -67,7 +67,7 @@ Node(9605)
It is also possible to retrieve the taxid number for a name. However, be careful:
this can lead to unexpected results if the names are not unique!

```python
``` py
>>> tax.getTaxid('Homo sapiens')
'9606'
>>> tax.addNode(Node(taxid = 0, name = 'Homo sapiens'))
@@ -81,7 +81,7 @@ Actually the Taxonomy object is just a dictionary of Nodes.
You can access a Node object directly by passing its taxid as a key
to a Taxonomy object and retrieve the Node properties:

```python
``` py
>>> hs = tax.get('9606')
>>> hs.name
'Homo sapiens'
@@ -100,7 +100,7 @@ Node(9605)
It is possible to directly test the relationship between two nodes.
Note that a Node is neither an ancestor nor a descendant of itself.

```python
``` py
>>> tax.isDescendantOf('9606', '9605')
True
>>> tax.isAncestorOf('9606', '9605')
@@ -113,7 +113,7 @@ It is also possible to retrieve the whole ancestry of a given node.
Ancestries are stored in list-like Lineage objects; Node indices follow
the taxonomy order.

```python
``` py
>>> lin = tax.getAncestry('9606')
>>> lin[0]
Node(9606)
@@ -123,7 +123,7 @@ Node(9606)

It is possible to filter a Lineage for specific ranks:

```python
``` py
>>> lin.filter(['genus', 'family'])
>>> lin
Lineage([Node(9605), Node(9604)])
@@ -132,7 +132,7 @@ Lineage([Node(9605), Node(9604)])
This mutates the Lineage object. If you want to keep the object intact,
you should use list comprehensions to filter specific nodes:

```python
``` py
>>> lin = tax.getAncestry('9606')
>>> [node for node in lin if node.rank in ['genus', 'family']]
[Node(9605), Node(9604)]
7 changes: 3 additions & 4 deletions mkdocs.yml
@@ -46,11 +46,10 @@ theme:
- navigation.instant
- navigation.tracking
- navigation.tabs
- navigation.tabs.sticky
- navigation.sections
- navigation.top
- toc.integrate
- content.code.annotate
- navigation.path
- toc.follow
- content.code.copy

plugins:
- search

0 comments on commit 6df91be