Documentation update
Old-Shatterhand committed Apr 19, 2024
1 parent 62d81c6 commit 999e11a
Showing 5 changed files with 351 additions and 13 deletions.
1 change: 1 addition & 0 deletions docs/conf.py
@@ -59,6 +59,7 @@
html_show_sourcelink = True

rst_context = {"DataSAIL": datasail}
mathjax3_config = {'chtml': {'displayAlign': 'left'}}

add_module_names = False
fail_on_warning = True
1 change: 1 addition & 0 deletions docs/index.rst
@@ -64,6 +64,7 @@ arguments are mostly the same.

workflow/input
workflow/clustering
workflow/embeddings
workflow/splits
workflow/solvers
posters
3 changes: 2 additions & 1 deletion docs/workflow/clustering.rst
@@ -144,7 +144,8 @@ Details about the clustering algorithms
#######################################

For all algorithms, the given general commands can be extended by user-defined arguments according to the specification
of the respective tool. Apart from this, DataSAIL offers multiple options to compute similarities or distances between
embeddings of the input data. More information is given :ref:`here <embeddings-label>`.

CD-HIT(-EST)
============
335 changes: 335 additions & 0 deletions docs/workflow/embeddings.rst
@@ -0,0 +1,335 @@
########################
Clustering of Embeddings
########################

.. _embeddings-label:

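DataSAIL offers a range of similarity (Sim) and distance (Dist) measures, implemented in RDKit and SciPy, that can be
used to cluster embeddings. The following table lists the available measures, the data types of the embeddings they
accept, and the library that provides them: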

.. list-table:: Supported similarity and distance measures
    :widths: 30 15 15 15 15 15
    :header-rows: 1

    * - Algorithm
      - Sim or Dist
      - Boolean
      - Integer
      - Float
      - Library
    * - AllBit
      - Sim
      - X
      - \-
      - \-
      - RDKit
    * - Asymmetric
      - Sim
      - X
      - \-
      - \-
      - RDKit
    * - Braun-Blanquet
      - Sim
      - X
      - \-
      - \-
      - RDKit
    * - Canberra
      - Dist
      - X
      - X
      - X
      - SciPy
    * - Dice
      - Sim
      - X
      - X
      - \-
      - RDKit
    * - Hamming
      - Dist
      - X
      - X
      - X
      - SciPy
    * - Jaccard
      - Dist
      - X
      - \-
      - \-
      - SciPy
    * - Kulczynski
      - Sim
      - X
      - \-
      - \-
      - RDKit
    * - Matching
      - Dist
      - X
      - X
      - X
      - SciPy
    * - OnBit
      - Sim
      - X
      - \-
      - \-
      - RDKit
    * - Rogers-Tanimoto
      - Dist
      - X
      - \-
      - \-
      - SciPy
    * - Rogot-Goldberg
      - Sim
      - X
      - \-
      - \-
      - RDKit
    * - Russel
      - Sim
      - X
      - \-
      - \-
      - RDKit
    * - Sokal
      - Sim
      - X
      - \-
      - \-
      - RDKit
    * - Sokal-Michener
      - Dist
      - X
      - \-
      - \-
      - SciPy
    * - Tanimoto
      - Sim
      - X
      - X
      - \-
      - RDKit
    * - Yule
      - Dist
      - X
      - \-
      - \-
      - SciPy

Individual Algorithms
#####################

The following sections describe each measure in more detail, together with the formula that computes it for two
vectors :math:`u` and :math:`v` of length :math:`n`. Depending on the measure, :math:`u` and :math:`v` can be float
vectors or may be restricted to integer or bit vectors.

.. note::
    We will use the `Iverson bracket <https://en.wikipedia.org/wiki/Iverson_bracket>`__ notation :math:`[P]` to
    denote the indicator function that is 1 if the predicate :math:`P` is true and 0 otherwise.
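
As a small illustration of this notation, a sum of Iverson brackets amounts to counting the positions at which a
predicate holds. The sketch below uses arbitrary example vectors, and all variable names are illustrative only.

```python
# Example bit vectors (arbitrary values chosen for illustration).
u = [1, 0, 1, 1]
v = [1, 1, 0, 1]

# sum_i [u[i] = v[i]]: each comparison is True/False, and int() maps it to 1/0.
equal_positions = sum(int(a == b) for a, b in zip(u, v))

# sum_i [u[i]]: the bracket of a bit is the bit itself, so this counts the one-bits of u.
ones_in_u = sum(int(bool(a)) for a in u)

print(equal_positions, ones_in_u)
```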

AllBit
======

The AllBit similarity is the fraction of positions at which the two bit vectors :math:`u` and :math:`v` hold equal
bits.

.. math::

    \text{AllBit}(u, v) = \frac{\sum_{i=1}^{n} [u[i] = v[i]]}{n}

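
The AllBit formula can be sketched in a few lines of plain Python. This is only an illustration of the definition and
not the RDKit implementation; the function name ``allbit_similarity`` is hypothetical.

```python
def allbit_similarity(u, v):
    """Fraction of positions at which two equally long bit vectors agree."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(a == b for a, b in zip(u, v)) / len(u)


u = [1, 0, 1, 1, 0, 0]
v = [1, 1, 0, 1, 0, 0]
print(allbit_similarity(u, v))  # 4 of the 6 positions agree
```
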
Asymmetric
==========

The Asymmetric similarity is the number of one-bits shared by the two bit vectors :math:`u` and :math:`v` divided by
the smaller of the two vectors' one-bit counts. It is 0 if either vector has no bits set. The implementation is given in `RDKit <https://github.com/rdkit/rdkit/blob/722cbba894736bf3adbe792e7158fba26b5f8e6f/Code/DataStructs/BitOps.cpp#L520>`__.

.. math::

    & u_1 = \sum_{i=1}^{n} [u[i]]\\
    & v_1 = \sum_{i=1}^{n} [v[i]]\\
    & \text{Asymmetric}(u, v) = \begin{cases}
        0, &\text{if } \min(u_1, v_1) = 0,\\
        \frac{\sum_{i=1}^{n} [u[i] = v[i] = 1]}{\min(u_1, v_1)}, &\text{otherwise}
    \end{cases}

Braun-Blanquet
==============

The Braun-Blanquet similarity is the number of one-bits shared by the two bit vectors :math:`u` and :math:`v` divided
by the larger of the two vectors' one-bit counts. It is 0 if neither vector has a bit set. The implementation is given in `RDKit <https://github.com/rdkit/rdkit/blob/722cbba894736bf3adbe792e7158fba26b5f8e6f/Code/DataStructs/BitOps.cpp#L409>`__.

.. math::

    & u_1 = \sum_{i=1}^{n} [u[i]]\\
    & v_1 = \sum_{i=1}^{n} [v[i]]\\
    & \text{Braun-Blanquet}(u, v) = \begin{cases}
        0, &\text{if } \max(u_1, v_1) = 0,\\
        \frac{\sum_{i=1}^{n} [u[i] = v[i] = 1]}{\max(u_1, v_1)}, &\text{otherwise}
    \end{cases}

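
The Asymmetric and Braun-Blanquet similarities differ only in whether the minimum or the maximum one-bit count is
used as the denominator. The plain-Python sketch below mirrors the two formulas side by side; it is not the RDKit
code, and all function names are hypothetical.

```python
def on_bits_in_common(u, v):
    """Number of positions at which both bit vectors hold a one."""
    return sum(1 for a, b in zip(u, v) if a == 1 and b == 1)


def asymmetric_similarity(u, v):
    """Shared one-bits divided by the smaller one-bit count (0 if that count is 0)."""
    denom = min(sum(u), sum(v))
    return on_bits_in_common(u, v) / denom if denom else 0.0


def braun_blanquet_similarity(u, v):
    """Shared one-bits divided by the larger one-bit count (0 if that count is 0)."""
    denom = max(sum(u), sum(v))
    return on_bits_in_common(u, v) / denom if denom else 0.0


u = [1, 1, 1, 0, 0]  # three bits set
v = [1, 1, 0, 0, 0]  # two bits set, both shared with u
print(asymmetric_similarity(u, v), braun_blanquet_similarity(u, v))
```

Because the minimum never exceeds the maximum, the Asymmetric similarity is always at least as large as the
Braun-Blanquet similarity for the same pair of vectors.
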
Canberra
========

The Canberra distance sums, over all positions, the absolute difference of the two entries of :math:`u` and :math:`v`
divided by the sum of their absolute values. The implementation is given in `SciPy <https://github.com/scipy/scipy/blob/7dcd8c59933524986923cde8e9126f5fc2e6b30b/scipy/spatial/distance.py#L1131>`__.

.. math::

    \text{Canberra}(u, v) = \sum_{i=1}^{n} \frac{|u[i] - v[i]|}{|u[i]| + |v[i]|}

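
A minimal plain-Python sketch of the Canberra formula is shown below. Positions where both entries are 0 would give
0/0; the sketch lets them contribute 0, which is the convention documented for SciPy's ``canberra``. The function name
is hypothetical.

```python
def canberra_distance(u, v):
    """Sum of |u[i] - v[i]| / (|u[i]| + |v[i]|) over all positions."""
    total = 0.0
    for a, b in zip(u, v):
        denom = abs(a) + abs(b)
        if denom:  # skip 0/0 terms, which are treated as 0
            total += abs(a - b) / denom
    return total


u = [1.0, 2.0, 0.0]
v = [3.0, 2.0, 0.0]
print(canberra_distance(u, v))  # only the first position contributes: 2 / 4
```
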
Dice
====

The Dice similarity is twice the number of one-bits shared by the two bit vectors :math:`u` and :math:`v` divided by
the total number of one-bits in both vectors. The implementation is given in `RDKit <https://github.com/rdkit/rdkit/blob/722cbba894736bf3adbe792e7158fba26b5f8e6f/Code/DataStructs/BitOps.cpp#L333>`__.

.. math::

    \text{Dice}(u, v) = \frac{2 \sum_{i=1}^{n} [u[i] = v[i] = 1]}{\sum_{i=1}^{n} [u[i]] + \sum_{i=1}^{n} [v[i]]}

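
The Dice formula for bit vectors can be sketched as follows. The formula above does not cover the case in which
neither vector has a bit set; returning 0 there is an assumption of this sketch, not a statement about RDKit. The
function name is hypothetical.

```python
def dice_similarity(u, v):
    """Twice the shared one-bits divided by the total number of one-bits."""
    common = sum(1 for a, b in zip(u, v) if a == 1 and b == 1)
    total = sum(u) + sum(v)
    if total == 0:
        return 0.0  # assumption of this sketch for two empty vectors
    return 2 * common / total


u = [1, 1, 0, 1]
v = [1, 0, 0, 1]
print(dice_similarity(u, v))  # 2 * 2 shared bits / (3 + 2) one-bits
```
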
Hamming or Matching
===================

The Hamming distance (a.k.a. Matching distance) counts the positions at which the two vectors :math:`u` and :math:`v`
differ. SciPy reports this count as a fraction of the vector length :math:`n`. The implementation is given in `SciPy <https://github.com/scipy/scipy/blob/7dcd8c59933524986923cde8e9126f5fc2e6b30b/scipy/spatial/distance.py#L697>`__.

.. math::

    \text{Hamming}(u, v) = \frac{1}{n} \sum_{i=1}^{n} [u[i] \neq v[i]]

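
To make the counting concrete, the sketch below determines the differing positions in plain Python. SciPy's
``hamming`` function returns the proportion of differing positions, that is, the count divided by :math:`n`, so the
sketch reports both the raw count and the fraction. The function name is hypothetical.

```python
def hamming(u, v):
    """Return (number of differing positions, fraction of differing positions)."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    differing = sum(1 for a, b in zip(u, v) if a != b)
    return differing, differing / len(u)


u = [1, 0, 1, 1, 0]
v = [1, 1, 1, 0, 0]
count, fraction = hamming(u, v)
print(count, fraction)  # the vectors differ at two of five positions
```
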
Jaccard
=======

The Jaccard distance is the number of bits that differ between the two bit vectors :math:`u` and :math:`v` divided by
the number of shared one-bits plus the number of differing bits, i.e., by the number of positions at which at least
one vector has a bit set. The implementation is given in `SciPy <https://github.com/scipy/scipy/blob/7dcd8c59933524986923cde8e9126f5fc2e6b30b/scipy/spatial/distance.py#L755>`__.

.. math::

    \text{Jaccard}(u, v) = \frac{\sum_{i=1}^{n} [u[i] \neq v[i]]}{\sum_{i=1}^{n} [u[i] = v[i] = 1] + \sum_{i=1}^{n} [u[i] \neq v[i]]}

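
Following the verbal definition above, the Jaccard distance divides the differing bits by the shared one-bits plus the
differing bits. The plain-Python sketch below does exactly that; returning 0 for two all-zero vectors is an assumption
of this sketch, and the function name is hypothetical.

```python
def jaccard_distance(u, v):
    """Differing bits divided by (shared one-bits + differing bits)."""
    differing = sum(1 for a, b in zip(u, v) if a != b)
    common_ones = sum(1 for a, b in zip(u, v) if a == 1 and b == 1)
    denom = common_ones + differing
    if denom == 0:
        return 0.0  # assumption of this sketch: two all-zero vectors are identical
    return differing / denom


u = [1, 1, 0, 0, 1]
v = [1, 0, 1, 0, 1]
print(jaccard_distance(u, v))  # 2 differing bits / (2 shared one-bits + 2 differing bits)
```
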
Kulczynski
==========

The Kulczynski similarity is the number of one-bits shared by the two bit vectors :math:`u` and :math:`v` multiplied
by the total number of one-bits in both vectors, divided by twice the product of the two vectors' one-bit counts. It
is 0 if either vector has no bits set. The implementation is given in `RDKit <https://github.com/rdkit/rdkit/blob/722cbba894736bf3adbe792e7158fba26b5f8e6f/Code/DataStructs/BitOps.cpp#L317>`__.

.. math::

    & u_1 = \sum_{i=1}^{n} [u[i]]\\
    & v_1 = \sum_{i=1}^{n} [v[i]]\\
    & \text{Kulczynski}(u, v) = \begin{cases}
        0, &\text{if } u_1 \cdot v_1 = 0,\\
        \frac{\left(\sum_{i=1}^{n} [u[i] = v[i] = 1]\right) \cdot (u_1 + v_1)}{2 \cdot u_1 \cdot v_1}, &\text{otherwise}
    \end{cases}

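
The case distinction of the Kulczynski formula translates directly into a short function. The following plain-Python
sketch is meant as an illustration only and is not the RDKit code; the function name is hypothetical.

```python
def kulczynski_similarity(u, v):
    """Shared one-bits times (u_1 + v_1), divided by 2 * u_1 * v_1."""
    u1, v1 = sum(u), sum(v)
    if u1 * v1 == 0:
        return 0.0  # at least one vector has no bits set
    common = sum(1 for a, b in zip(u, v) if a == 1 and b == 1)
    return common * (u1 + v1) / (2 * u1 * v1)


u = [1, 1, 1, 0]  # u_1 = 3
v = [1, 1, 0, 0]  # v_1 = 2, both bits shared with u
print(kulczynski_similarity(u, v))  # 2 * (3 + 2) / (2 * 3 * 2)
```

The value equals the mean of the two ratios "shared bits over :math:`u_1`" and "shared bits over :math:`v_1`", which
can be a more intuitive way to read the formula.
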
Matching
========

See the section "Hamming or Matching" above.

OnBit
=====

The OnBit similarity is the number of one-bits shared by the two bit vectors :math:`u` and :math:`v` divided by the
number of positions at which at least one of the two vectors has a one-bit. The similarity is 0 if no bit is set in
either vector. The implementation is given in `RDKit <https://github.com/rdkit/rdkit/blob/722cbba894736bf3adbe792e7158fba26b5f8e6f/Code/DataStructs/BitOps.cpp#L463>`__.

.. math::

    \text{OnBit}(u, v) = \begin{cases}
        0, &\text{if } \sum_{i=1}^{n} [u[i] \lor v[i]] = 0,\\
        \frac{\sum_{i=1}^{n} [u[i] = v[i] = 1]}{\sum_{i=1}^{n} [u[i] \lor v[i]]}, &\text{otherwise}
    \end{cases}

Rogers-Tanimoto
===============

The Rogers-Tanimoto distance is twice the number of bits that differ between the two bit vectors :math:`u` and
:math:`v` divided by the number of equal bits plus twice the number of differing bits. The implementation is given in `SciPy <https://github.com/scipy/scipy/blob/7dcd8c59933524986923cde8e9126f5fc2e6b30b/scipy/spatial/distance.py#L1389>`__.

.. math::

    \text{Rogers-Tanimoto}(u, v) = \frac{2 \cdot \sum_{i=1}^{n} [u[i] \neq v[i]]}{\sum_{i=1}^{n} [u[i] = v[i]] + 2 \cdot \sum_{i=1}^{n} [u[i] \neq v[i]]}

Rogot-Goldberg
==============

The Rogot-Goldberg similarity adds two ratios: the number of one-bits shared by the two bit vectors :math:`u` and
:math:`v` divided by the total number of one-bits in both vectors, and the number of shared zero-bits divided by the
total number of zero-bits in both vectors. It is 1 if both vectors consist only of ones or only of zeros. The implementation is given in `RDKit <https://github.com/rdkit/rdkit/blob/722cbba894736bf3adbe792e7158fba26b5f8e6f/Code/DataStructs/BitOps.cpp#L434>`__.

.. math::

    & x = \sum_{i=1}^{n} [u[i] = v[i] = 1]\\
    & y = \sum_{i=1}^{n} [u[i]]\\
    & z = \sum_{i=1}^{n} [v[i]]\\
    & d = n - y - z + x\\
    & \text{Rogot-Goldberg}(u, v) = \begin{cases}
        1, &\text{if } x = n \lor d = n,\\
        \frac{x}{y + z} + \frac{d}{2 \cdot n - y - z}, &\text{otherwise}
    \end{cases}

Russel
======

The Russel similarity is the number of one-bits shared by the two bit vectors :math:`u` and :math:`v` divided by the
vector length :math:`n`. The implementation is given in `RDKit <https://github.com/rdkit/rdkit/blob/722cbba894736bf3adbe792e7158fba26b5f8e6f/Code/DataStructs/BitOps.cpp#L425>`__.

.. math::

    \text{Russel}(u, v) = \frac{\sum_{i=1}^{n} [u[i] = v[i] = 1]}{n}

Sokal
=====

The Sokal similarity is the number of one-bits shared by the two bit vectors :math:`u` and :math:`v` divided by twice
the total number of one-bits in both vectors minus three times the number of shared one-bits. The implementation is given in `RDKit <https://github.com/rdkit/rdkit/blob/722cbba894736bf3adbe792e7158fba26b5f8e6f/Code/DataStructs/BitOps.cpp#L349>`__.

.. math::

    & x = \sum_{i=1}^{n} [u[i] = v[i] = 1]\\
    & \text{Sokal}(u, v) = \frac{x}{2 \cdot \sum_{i=1}^{n} \left([u[i]] + [v[i]]\right) - 3 \cdot x}

Sokal-Michener
==============

The Sokal-Michener distance is twice the number of bits that differ between the two bit vectors :math:`u` and
:math:`v` divided by the number of equal bits plus twice the number of differing bits. The implementation is given in `SciPy <https://github.com/scipy/scipy/blob/7dcd8c59933524986923cde8e9126f5fc2e6b30b/scipy/spatial/distance.py#L1496>`__.

.. math::

    \text{Sokal-Michener}(u, v) = \frac{2 \cdot \sum_{i=1}^{n} [u[i] \neq v[i]]}{\sum_{i=1}^{n} \left(2 \cdot [u[i] \neq v[i]] + [u[i] = v[i]]\right)}

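
The Sokal-Michener formula weights every differing position twice, in the numerator as well as in the denominator.
The plain-Python sketch below spells this out; it is an illustration of the formula rather than the SciPy code, and
the function name is hypothetical.

```python
def sokal_michener_distance(u, v):
    """2 * differing bits divided by (2 * differing bits + equal bits)."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    differing = sum(1 for a, b in zip(u, v) if a != b)
    equal = len(u) - differing
    denom = 2 * differing + equal
    return 2 * differing / denom if denom else 0.0  # empty input gives 0 in this sketch


u = [1, 0, 1, 1, 0]
v = [1, 1, 1, 0, 0]
print(sokal_michener_distance(u, v))  # 2 * 2 / (2 * 2 + 3)
```
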
Tanimoto
========

The Tanimoto similarity is the number of one-bits shared by the two bit vectors :math:`u` and :math:`v` divided by the
total number of one-bits in both vectors minus the number of shared one-bits. It is 1 if neither vector has a bit set. The implementation is given in `RDKit <https://github.com/rdkit/rdkit/blob/722cbba894736bf3adbe792e7158fba26b5f8e6f/Code/DataStructs/BitOps.cpp#L270>`__.

.. math::

    & t = \sum_{i=1}^{n} \left([u[i]] + [v[i]]\right)\\
    & c = \sum_{i=1}^{n} [u[i] = v[i] = 1]\\
    & \text{Tanimoto}(u, v) = \begin{cases}
        1, &\text{if } t = 0,\\
        \frac{c}{t - c}, &\text{otherwise}
    \end{cases}

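
The Tanimoto formula, including its special case for two vectors without any one-bits, can be sketched in plain Python
as follows. This is an illustration of the definition and not the RDKit implementation; the function name is
hypothetical.

```python
def tanimoto_similarity(u, v):
    """Shared one-bits c divided by (t - c), where t is the total number of one-bits."""
    t = sum(u) + sum(v)
    if t == 0:
        return 1.0  # neither vector has a bit set
    c = sum(1 for a, b in zip(u, v) if a == 1 and b == 1)
    return c / (t - c)


u = [1, 1, 0, 1]
v = [1, 0, 0, 1]
print(tanimoto_similarity(u, v))  # 2 shared bits / (5 one-bits - 2 shared bits)
```

The denominator :math:`t - c` counts every position at which at least one of the two vectors has a bit set, so the
value is the size of the intersection of the on-bit sets divided by the size of their union.
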
Yule
====

The Yule distance is computed from the counts of the four possible bit combinations at each position of the two bit
vectors :math:`u` and :math:`v`. It is twice the product of the two mismatch counts divided by the product of the two
match counts plus the product of the two mismatch counts. The implementation is given in `SciPy <https://github.com/scipy/scipy/blob/7dcd8c59933524986923cde8e9126f5fc2e6b30b/scipy/spatial/distance.py#L1274>`__.

.. math::

    & c_{11} = \sum_{i=1}^{n} [u[i] = v[i] = 1]\\
    & c_{00} = \sum_{i=1}^{n} [u[i] = v[i] = 0]\\
    & c_{10} = \sum_{i=1}^{n} [u[i] = 1 \land v[i] = 0]\\
    & c_{01} = \sum_{i=1}^{n} [u[i] = 0 \land v[i] = 1]\\
    & \text{Yule}(u, v) = \frac{2 \cdot c_{10} \cdot c_{01}}{c_{11} \cdot c_{00} + c_{10} \cdot c_{01}}
