Documentation update

kalininalab · Apr 19, 2024 · 999e11a
1 parent 62d81c6
commit 999e11a

Showing 5 changed files with 351 additions and 13 deletions.
diff --git a/docs/conf.py b/docs/conf.py
@@ -59,6 +59,7 @@
 html_show_sourcelink = True
 
 rst_context = {"DataSAIL": datasail}
+mathjax3_config = {'chtml': {'displayAlign': 'left'}}
 
 add_module_names = False
 fail_on_warning = True
diff --git a/docs/index.rst b/docs/index.rst
@@ -64,6 +64,7 @@ arguments are mostly the same.
 
     workflow/input
     workflow/clustering
+    workflow/embeddings
     workflow/splits
     workflow/solvers
     posters

diff --git a/docs/workflow/clustering.rst b/docs/workflow/clustering.rst
@@ -144,7 +144,8 @@ Details about the clustering algorithms
 #######################################
 
 For all algorithms, the given general commands can be extended by user defined arguments according to the specification
-of the respective tool.
+of the respective tool. Apart from this, DataSAIL offers multiple options to compute similarities or distances between
+embeddings of the input data. More information is given :ref:`here <embeddings-label>`.
 
 CD-HIT(-EST)
 ============

diff --git a/docs/workflow/embeddings.rst b/docs/workflow/embeddings.rst
@@ -0,0 +1,335 @@
+########################
+Clustering of Embeddings
+########################
+
+.. _embeddings-label:
+
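+DataSAIL offers several similarity and distance measures, implemented in RDKit and SciPy, to cluster embeddings of the
+input data. The table below lists each measure, whether it is a similarity (Sim) or a distance (Dist), and which
+element types of the embedding vectors it supports (marked with an X):
+
+.. list-table:: Similarity and distance measures for embeddings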
+    :widths: 30 15 15 15 15 15
+    :header-rows: 1
+
+    * - Algorithm
+      - Sim or Dist
+      - Boolean
+      - Integer
+      - Float
+      - RDKit or SciPy
+    * - AllBit
+      - Sim
+      - X
+      - \-
+      - \-
+      - RDKit
+    * - Asymmetric
+      - Sim
+      - X
+      - \-
+      - \-
+      - RDKit
+    * - Braun-Blanquet
+      - Sim
+      - X
+      - \-
+      - \-
+      - RDKit
+    * - Canberra
+      - Dist
+      - X
+      - X
+      - X
+      - SciPy
+    * - Dice
+      - Sim
+      - X
+      - X
+      - \-
+      - RDKit
+    * - Hamming
+      - Dist
+      - X
+      - X
+      - X
+      - SciPy
+    * - Kulczynski
+      - Sim
+      - X
+      - \-
+      - \-
+      - RDKit
+    * - Jaccard
+      - Dist
+      - X
+      - \-
+      - \-
+      - SciPy
+    * - Matching
+      - Dist
+      - X
+      - X
+      - X
+      - SciPy
+    * - OnBit
+      - Sim
+      - X
+      - \-
+      - \-
+      - RDKit
+    * - Rogers-Tanimoto
+      - Dist
+      - X
+      - \-
+      - \-
+      - SciPy
+    * - Rogot-Goldberg
+      - Sim
+      - X
+      - \-
+      - \-
+      - RDKit
+    * - Russel
+      - Sim
+      - X
+      - \-
+      - \-
+      - RDKit
+    * - Sokal
+      - Sim
+      - X
+      - \-
+      - \-
+      - RDKit
+    * - Sokal-Michener
+      - Dist
+      - X
+      - \-
+      - \-
+      - SciPy
+    * - Tanimoto
+      - Sim
+      - X
+      - X
+      - \-
+      - RDKit
+    * - Yule
+      - Dist
+      - X
+      - \-
+      - \-
+      - SciPy
+
+Individual Algorithms
+#####################
+
+In the following, we will describe the individual algorithms in more detail and with the mathematical formula that
+computes the respective metric between two vectors :math:`u` and :math:`v` of length :math:`n`. Depending on the method
+used, :math:`u` and :math:`v` can be float-vectors but may also be restricted to be int-vectors or bit-vectors.
+
+.. note::
+    We will use the `Iverson bracket <https://en.wikipedia.org/wiki/Iverson_bracket>`__ notation :math:`[P]` to
+    denote the indicator function that is 1 if the predicate :math:`P` is true and 0 otherwise.
+
+AllBit
+======
+
+The AllBit similarity is the fraction of positions at which the two bit vectors :math:`u` and :math:`v` hold the same
+bit.
+
+.. math::
+
+    \text{AllBit}(u, v) = \frac{\sum_{i=1}^{n} [u[i] = v[i]]}{n}
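
As a minimal sketch (not DataSAIL's or RDKit's implementation), the AllBit formula can be evaluated in plain Python. The function name ``allbit_similarity`` is hypothetical; the Iverson bracket maps to a Boolean comparison, which Python sums as 0 or 1.

```python
def allbit_similarity(u, v):
    """Fraction of positions at which two equal-length bit vectors agree."""
    if len(u) != len(v) or len(u) == 0:
        raise ValueError("vectors must be non-empty and of equal length")
    # Booleans are integers in Python, so the sum counts the matching positions.
    return sum(a == b for a, b in zip(u, v)) / len(u)


u = [1, 0, 1, 1, 0]
v = [1, 1, 1, 0, 0]
print(allbit_similarity(u, v))  # 0.6 (positions 1, 3 and 5 agree)
```

Note that shared zero-bits count as agreement here, which is what distinguishes AllBit from the one-bit-based measures below.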
+
+Asymmetric
+==========
+
+The Asymmetric similarity is the number of bits set in both bit vectors :math:`u` and :math:`v` divided by the smaller
+of the two one-bit counts; it is 0 if either vector has no bit set. The implementation is given in `RDKit <https://github.com/rdkit/rdkit/blob/722cbba894736bf3adbe792e7158fba26b5f8e6f/Code/DataStructs/BitOps.cpp#L520>`__.
+
+.. math::
+
+    & u_1 = \sum_{i=1}^{n} [u[i]]\\
+    & v_1 = \sum_{i=1}^{n} [v[i]]\\
+    & \text{Asymmetric}(u, v) = \begin{cases}
+        0, &\text{if} \min(u_1,v_1) = 0,\\
+        \frac{\sum_{i=1}^{n} [u[i] = v[i] = 1]}{\min(u_1, v_1)} &\text{otherwise}
+    \end{cases}
+
+Braun-Blanquet
+==============
+
+The Braun-Blanquet similarity is the number of bits set in both bit vectors :math:`u` and :math:`v` divided by the
+larger of the two one-bit counts; it is 0 if neither vector has a bit set. The implementation is given in `RDKit <https://github.com/rdkit/rdkit/blob/722cbba894736bf3adbe792e7158fba26b5f8e6f/Code/DataStructs/BitOps.cpp#L409>`__.
+
+.. math::
+
+    & u_1 = \sum_{i=1}^{n} [u[i]]\\
+    & v_1 = \sum_{i=1}^{n} [v[i]]\\
+    & \text{Braun-Blanquet}(u, v) = \begin{cases}
+        0, &\text{if} \max(u_1,v_1) = 0,\\
+        \frac{\sum_{i=1}^{n} [u[i] = v[i] = 1]}{\max(u_1, v_1)} &\text{otherwise}
+    \end{cases}
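
The Asymmetric and Braun-Blanquet similarities differ only in whether the common one-bits are normalized by the smaller or the larger one-bit count. The following sketch, with hypothetical helper names and plain Python lists instead of RDKit bit vectors, shows the effect on a pair where one vector is a subset of the other.

```python
def on_bit_counts(u, v):
    """Return (common one-bits, one-bits in u, one-bits in v)."""
    common = sum(1 for a, b in zip(u, v) if a and b)
    return common, sum(1 for a in u if a), sum(1 for b in v if b)


def asymmetric_similarity(u, v):
    common, u1, v1 = on_bit_counts(u, v)
    smaller = min(u1, v1)
    return 0.0 if smaller == 0 else common / smaller


def braun_blanquet_similarity(u, v):
    common, u1, v1 = on_bit_counts(u, v)
    larger = max(u1, v1)
    return 0.0 if larger == 0 else common / larger


u = [1, 1, 1, 1, 0, 0]  # four one-bits
v = [1, 1, 0, 0, 0, 0]  # two one-bits, all of them shared with u
print(asymmetric_similarity(u, v))      # 1.0
print(braun_blanquet_similarity(u, v))  # 0.5
```

The Asymmetric similarity reaches 1 as soon as one vector's one-bits are contained in the other's, whereas Braun-Blanquet penalizes the size difference.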
+
+Canberra
+========
+
+The Canberra distance sums, over all positions, the absolute difference of the entries of :math:`u` and :math:`v`
+divided by the sum of their absolute values. The implementation is given in `SciPy <https://github.com/scipy/scipy/blob/7dcd8c59933524986923cde8e9126f5fc2e6b30b/scipy/spatial/distance.py#L1131>`__.
+
+.. math::
+
+    \text{Canberra}(u, v) = \sum_{i=1}^{n} \frac{|u[i] - v[i]|}{|u[i]| + |v[i]|}
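
A small pure-Python sketch of the Canberra sum for float vectors follows. The function name is hypothetical, and the handling of positions where both entries are 0 (the term is skipped, i.e. 0/0 is treated as 0) follows the convention described in the SciPy documentation rather than anything stated above.

```python
def canberra_distance(u, v):
    """Sum of |u[i] - v[i]| / (|u[i]| + |v[i]|) over all positions."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    total = 0.0
    for a, b in zip(u, v):
        denominator = abs(a) + abs(b)
        if denominator != 0:  # assumption: a 0/0 term contributes 0
            total += abs(a - b) / denominator
    return total


print(canberra_distance([1.0, 0.0, 3.0], [3.0, 0.0, 1.0]))  # 1.0
```

Each term lies between 0 and 1, so the distance is bounded by the number of positions that are non-zero in at least one vector.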
+
+Dice
+====
+
+The Dice similarity is twice the number of bits set in both bit vectors :math:`u` and :math:`v` divided by the sum of
+the one-bit counts of the two vectors. The implementation is given in `RDKit <https://github.com/rdkit/rdkit/blob/722cbba894736bf3adbe792e7158fba26b5f8e6f/Code/DataStructs/BitOps.cpp#L333>`__.
+
+.. math::
+
+    \text{Dice}(u, v) = \frac{2 \sum_{i=1}^{n} [u[i] = v[i] = 1]}{\sum_{i=1}^{n} [u[i]] + \sum_{i=1}^{n} [v[i]]}
+
+Hamming or Matching
+===================
+
+The Hamming distance (a.k.a. Matching distance) is the fraction of positions at which the two vectors :math:`u` and
+:math:`v` differ. The implementation is given in `SciPy <https://github.com/scipy/scipy/blob/7dcd8c59933524986923cde8e9126f5fc2e6b30b/scipy/spatial/distance.py#L697>`__.
+
+.. math::
+
+    \text{Hamming}(u, v) = \frac{1}{n} \sum_{i=1}^{n} [u[i] \neq v[i]]
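
The sketch below separates the raw count of differing positions from the normalized value. It assumes the normalized form (count divided by :math:`n`), which is how SciPy documents its ``hamming`` function; both function names are hypothetical.

```python
def hamming_count(u, v):
    """Number of positions at which u and v differ."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(a != b for a, b in zip(u, v))


def hamming_fraction(u, v):
    """Differing positions divided by the vector length (assumed normalization)."""
    return hamming_count(u, v) / len(u)


u = [1, 0, 1, 1]
v = [1, 1, 0, 1]
print(hamming_count(u, v))     # 2
print(hamming_fraction(u, v))  # 0.5
```

Because only equality is tested, the same code applies to integer and float vectors, not just bit vectors.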
+
+Jaccard
+=======
+
+The Jaccard distance is the number of positions at which the two bit vectors :math:`u` and :math:`v` differ divided by
+the number of positions at which at least one of them is set, i.e., the differing positions plus the common one-bits.
+The implementation is given in `SciPy <https://github.com/scipy/scipy/blob/7dcd8c59933524986923cde8e9126f5fc2e6b30b/scipy/spatial/distance.py#L755>`__.
+
+.. math::
+
+    \text{Jaccard}(u, v) = \frac{\sum_{i=1}^{n} [u[i] \neq v[i]]}{\sum_{i=1}^{n} [u[i] \lor v[i]]}
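
To make the prose definition concrete, here is a hedged sketch that divides the differing positions by the positions where at least one bit is set. The function name is hypothetical, and returning 0 for two all-zero vectors is an assumption of this sketch.

```python
def jaccard_distance(u, v):
    """Differing positions divided by positions where at least one bit is set."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    differing = sum(bool(a) != bool(b) for a, b in zip(u, v))
    union = sum(bool(a) or bool(b) for a, b in zip(u, v))
    # Assumption: two all-zero vectors are treated as identical (distance 0).
    return 0.0 if union == 0 else differing / union


u = [1, 1, 0, 0]
v = [1, 0, 1, 0]
# Two positions differ, and three positions have at least one bit set.
print(jaccard_distance(u, v))
```

Shared zero-bits do not enter the denominator, so appending further positions that are 0 in both vectors leaves the distance unchanged.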
+
+Kulczynski
+==========
+
+The Kulczynski similarity is the number of common one-bits in the two bit vectors :math:`u` and :math:`v` multiplied
+by the total number of one-bits in both vectors and divided by twice the product of the two one-bit counts. This equals
+the mean of the fractions of one-bits of :math:`u` and of :math:`v` that are shared. The implementation is
+given in `RDKit <https://github.com/rdkit/rdkit/blob/722cbba894736bf3adbe792e7158fba26b5f8e6f/Code/DataStructs/BitOps.cpp#L317>`__.
+
+.. math::
+
+    & u_1 = \sum_{i=1}^{n} [u[i]]\\
+    & v_1 = \sum_{i=1}^{n} [v[i]]\\
+    & \text{Kulczynski}(u, v) = \begin{cases}
+        0, &\text{if } u_1 \cdot v_1 = 0,\\
+        \frac{(\sum_{i=1}^{n} [u[i] = v[i] = 1]) \cdot (u_1 + v_1)}{2 \cdot u_1 \cdot v_1} &\text{otherwise}
+    \end{cases}
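
The Kulczynski expression can be read as the average of two fractions: the share of one-bits of :math:`u` that also occur in :math:`v`, and vice versa. The sketch below (hypothetical function name, plain lists instead of RDKit bit vectors) evaluates the formula directly and checks it against that reading.

```python
def kulczynski_similarity(u, v):
    """common * (u1 + v1) / (2 * u1 * v1), or 0 if either vector has no one-bit."""
    common = sum(1 for a, b in zip(u, v) if a and b)
    u1 = sum(1 for a in u if a)
    v1 = sum(1 for b in v if b)
    if u1 * v1 == 0:
        return 0.0
    return common * (u1 + v1) / (2 * u1 * v1)


u = [1, 1, 1, 1, 0, 0]  # 4 one-bits
v = [1, 1, 0, 0, 0, 0]  # 2 one-bits, both shared with u
print(kulczynski_similarity(u, v))  # 0.75
print((2 / 4 + 2 / 2) / 2)          # 0.75, the mean of the two shared fractions
```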
+
+Matching
+========
+
+See `Hamming or Matching`_.
+
+OnBit
+=====
+
+The OnBit similarity is the number of common one-bits in the two bit vectors :math:`u` and :math:`v` divided by the
+number of positions at which at least one of the two vectors is set. The similarity is 0 if no bit is set in either
+vector. The implementation is given in `RDKit <https://github.com/rdkit/rdkit/blob/722cbba894736bf3adbe792e7158fba26b5f8e6f/Code/DataStructs/BitOps.cpp#L463>`__.
+
+.. math::
+
+    \text{OnBit}(u, v) = \begin{cases}
+        0, &\text{if} \sum_{i=1}^{n} [u[i] \lor v[i]] = 0,\\
+        \frac{(\sum_{i=1}^{n} [u[i] = v[i] = 1])}{\sum_{i=1}^{n} [u[i] \lor v[i]]} &\text{otherwise}
+    \end{cases}
+
+Rogers-Tanimoto
+===============
+
+The Rogers-Tanimoto distance is twice the number of bits that differ between the two bit vectors :math:`u` and
+:math:`v` divided by the sum of twice the number of differing bits and the number of equal bits. The implementation is
+given in `SciPy <https://github.com/scipy/scipy/blob/7dcd8c59933524986923cde8e9126f5fc2e6b30b/scipy/spatial/distance.py#L1389>`__.
+
+.. math::
+
+    \text{Rogers-Tanimoto}(u, v) = \frac{2 \cdot \sum_{i=1}^{n} [u[i] \neq v[i]]}{2 \cdot \sum_{i=1}^{n} [u[i] \neq v[i]] + \sum_{i=1}^{n} [u[i] = v[i]]}
+
+Rogot-Goldberg
+==============
+
+The Rogot-Goldberg similarity is the sum of two ratios: the number of common one-bits :math:`x` divided by the total
+number of one-bits in :math:`u` and :math:`v`, and the number of common zero-bits :math:`d` divided by the total number
+of zero-bits in :math:`u` and :math:`v`. The implementation is given in `RDKit <https://github.com/rdkit/rdkit/blob/722cbba894736bf3adbe792e7158fba26b5f8e6f/Code/DataStructs/BitOps.cpp#L434>`__.
+
+.. math::
+
+    & x = \sum_{i=1}^{n} [u[i] = v[i] = 1]\\
+    & y = \sum_{i=1}^{n} [u[i]]\\
+    & z = \sum_{i=1}^{n} [v[i]]\\
+    & d = n - y - z + x\\
+    & \text{Rogot-Goldberg}(u, v) = \begin{cases}
+        1, &\text{if } x = n \lor d = n,\\
+        \frac{x}{y + z} + \frac{d}{2 \cdot n - y - z} &\text{otherwise}
+    \end{cases}
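
The following sketch walks through the intermediate quantities :math:`x`, :math:`y`, :math:`z` and :math:`d` in plain Python. It assumes that :math:`z` counts the one-bits of :math:`v` and that the first ratio is :math:`x / (y + z)`, which is my reading of the linked RDKit function; the function name is hypothetical.

```python
def rogot_goldberg_similarity(u, v):
    """x / (y + z) + d / (2n - y - z), with 1 for two all-one or two all-zero vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    n = len(u)
    x = sum(1 for a, b in zip(u, v) if a and b)  # common one-bits
    y = sum(1 for a in u if a)                   # one-bits in u
    z = sum(1 for b in v if b)                   # one-bits in v (assumption, see lead-in)
    d = n - y - z + x                            # common zero-bits
    if x == n or d == n:
        return 1.0
    return x / (y + z) + d / (2 * n - y - z)


u = [1, 1, 0, 0]
v = [1, 0, 1, 0]
# x = 1, y = 2, z = 2, d = 1, so the result is 1/4 + 1/4.
print(rogot_goldberg_similarity(u, v))  # 0.5
```

The two special cases guard exactly the inputs for which one of the denominators would be 0.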
+
+Russel
+======
+
+The Russel similarity is the number of common one-bits in the two bit vectors :math:`u` and :math:`v` divided by the
+length :math:`n` of the vectors. The implementation is given in `RDKit <https://github.com/rdkit/rdkit/blob/722cbba894736bf3adbe792e7158fba26b5f8e6f/Code/DataStructs/BitOps.cpp#L425>`__.
+
+.. math::
+
+    \text{Russel}(u, v) = \frac{\sum_{i=1}^{n} [u[i] = v[i] = 1]}{n}
+
+Sokal
+=====
+
+The Sokal similarity is the number of common one-bits in the two bit vectors :math:`u` and :math:`v` divided by twice
+the total number of one-bits in :math:`u` and :math:`v` minus three times the number of common one-bits. The
+implementation is given in `RDKit <https://github.com/rdkit/rdkit/blob/722cbba894736bf3adbe792e7158fba26b5f8e6f/Code/DataStructs/BitOps.cpp#L349>`__.
+
+.. math::
+
+    \text{Sokal}(u, v) = \frac{\sum_{i=1}^{n} [u[i] = v[i] = 1]}{2 \cdot \sum_{i=1}^{n} ([u[i]] + [v[i]]) - 3 \cdot \sum_{i=1}^{n} [u[i] = v[i] = 1]}
+
+Sokal-Michener
+==============
+
+The Sokal-Michener distance is twice the number of bits that differ between the two bit vectors :math:`u` and
+:math:`v` divided by the sum of twice the number of differing bits and the number of equal bits. The implementation is
+given in `SciPy <https://github.com/scipy/scipy/blob/7dcd8c59933524986923cde8e9126f5fc2e6b30b/scipy/spatial/distance.py#L1496>`__.
+
+.. math::
+
+    \text{Sokal-Michener}(u, v) = \frac{2 \cdot \sum_{i=1}^{n} [u[i] \neq v[i]]}{\sum_{i=1}^{n} (2 \cdot [u[i] \neq v[i]] + [u[i] = v[i]])}
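
As an illustration of the Sokal-Michener formula, the sketch below counts differing and equal positions and weights the differing ones twice. The function name is hypothetical, and the code is a plain-Python stand-in rather than the SciPy routine.

```python
def sokal_michener_distance(u, v):
    """2 * differing / (2 * differing + equal) for two equal-length bit vectors."""
    if len(u) != len(v) or len(u) == 0:
        raise ValueError("vectors must be non-empty and of equal length")
    differing = sum(bool(a) != bool(b) for a, b in zip(u, v))
    equal = len(u) - differing
    return 2 * differing / (2 * differing + equal)


u = [1, 0, 1, 1]
v = [1, 1, 0, 1]
# Two positions differ and two agree: 4 / (4 + 2).
print(sokal_michener_distance(u, v))
```

Unlike the Jaccard distance, shared zero-bits count as agreement here, so the result depends on the full vector length.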
+
+Tanimoto
+========
+
+The Tanimoto similarity is the number of common one-bits in the two bit vectors :math:`u` and :math:`v` divided by the
+total number of one-bits in :math:`u` and :math:`v` minus the number of common one-bits, i.e., by the number of
+positions at which at least one bit is set. The implementation is given in `RDKit <https://github.com/rdkit/rdkit/blob/722cbba894736bf3adbe792e7158fba26b5f8e6f/Code/DataStructs/BitOps.cpp#L270>`__.
+
+.. math::
+
+    & t = \sum_{i=1}^{n} ([u[i]] + [v[i]])\\
+    & c = \sum_{i=1}^{n} [u[i] = v[i] = 1]\\
+    & \text{Tanimoto}(u, v) = \begin{cases}
+        1, &\text{if } t = 0,\\
+        \frac{c}{t - c} &\text{otherwise}
+    \end{cases}
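
Tanimoto and Dice are built from the same two counts, :math:`c` and :math:`t`, and are related by :math:`\text{Dice} = 2T / (1 + T)` with :math:`T` the Tanimoto similarity. The sketch below uses hypothetical function names; returning 0 for Dice when no bit is set is an assumption of this sketch.

```python
import math


def common_and_total(u, v):
    """Return (c, t): common one-bits and total one-bits of both vectors."""
    c = sum(1 for a, b in zip(u, v) if a and b)
    t = sum(1 for a in u if a) + sum(1 for b in v if b)
    return c, t


def tanimoto_similarity(u, v):
    c, t = common_and_total(u, v)
    return 1.0 if t == 0 else c / (t - c)


def dice_similarity(u, v):
    c, t = common_and_total(u, v)
    return 0.0 if t == 0 else 2 * c / t  # assumption: 0 when no bit is set


u = [1, 1, 0, 0]
v = [1, 0, 1, 0]
tani = tanimoto_similarity(u, v)  # c = 1, t = 4, so 1 / 3
dice = dice_similarity(u, v)      # 2 * 1 / 4
print(math.isclose(dice, 2 * tani / (1 + tani)))  # True
```

Since the mapping between the two is monotonic, both similarities rank pairs of bit vectors identically and differ only in scale.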
+
+Yule
+====
+
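+The Yule distance is computed from the four joint bit counts of the two bit vectors :math:`u` and :math:`v`. It is
+twice the product of the two disagreement counts divided by the sum of that product and the product of the two
+agreement counts; it is 0 if the product of the disagreement counts is 0. The implementation is given in `SciPy <https://github.com/scipy/scipy/blob/7dcd8c59933524986923cde8e9126f5fc2e6b30b/scipy/spatial/distance.py#L1274>`__.
+
+.. math::
+
+    & c_{11} = \sum_{i=1}^{n} [u[i] = v[i] = 1]\\
+    & c_{00} = \sum_{i=1}^{n} [u[i] = v[i] = 0]\\
+    & c_{10} = \sum_{i=1}^{n} [u[i] = 1 \land v[i] = 0]\\
+    & c_{01} = \sum_{i=1}^{n} [u[i] = 0 \land v[i] = 1]\\
+    & \text{Yule}(u, v) = \begin{cases}
+        0, &\text{if } c_{10} \cdot c_{01} = 0,\\
+        \frac{2 \cdot c_{10} \cdot c_{01}}{c_{11} \cdot c_{00} + c_{10} \cdot c_{01}} &\text{otherwise}
+    \end{cases}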
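
The sketch below computes a Yule dissimilarity from the four joint bit counts, following the expression documented for SciPy's ``yule`` function (twice the product of the disagreement counts over the sum of both products). The function name is hypothetical, and returning 0 when the disagreement product vanishes is an assumption taken from recent SciPy behaviour.

```python
def yule_distance(u, v):
    """2 * ntf * nft / (ntt * nff + ntf * nft), built from the joint bit counts."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    ntt = ntf = nft = nff = 0
    for a, b in zip(u, v):
        a, b = bool(a), bool(b)
        if a and b:
            ntt += 1  # set in both vectors
        elif a:
            ntf += 1  # set only in u
        elif b:
            nft += 1  # set only in v
        else:
            nff += 1  # set in neither vector
    half_r = ntf * nft
    if half_r == 0:  # assumption: no two-sided disagreement means distance 0
        return 0.0
    return 2.0 * half_r / (ntt * nff + half_r)


print(yule_distance([1, 1, 0, 0], [1, 0, 1, 0]))  # 1.0 (all four counts equal 1)
print(yule_distance([1, 0], [0, 1]))              # 2.0 (no agreement at all)
```

The value ranges from 0 to 2, so it may need rescaling before it is combined with distances that are bounded by 1.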