From 999e11a1f3e86fe87df0e75b336a22946ec77401 Mon Sep 17 00:00:00 2001 From: Roman Joeres Date: Fri, 19 Apr 2024 12:34:57 +0200 Subject: [PATCH] Documentation update --- docs/conf.py | 1 + docs/index.rst | 1 + docs/workflow/clustering.rst | 3 +- docs/workflow/embeddings.rst | 335 +++++++++++++++++++++++++++++++++++ docs/workflow/input.rst | 24 +-- 5 files changed, 351 insertions(+), 13 deletions(-) create mode 100644 docs/workflow/embeddings.rst diff --git a/docs/conf.py b/docs/conf.py index e535cae..310d2ca 100644 --- a/docs/conf.py +++ b/docs/conf.py @@ -59,6 +59,7 @@ html_show_sourcelink = True rst_context = {"DataSAIL": datasail} +mathjax3_config = {'chtml': {'displayAlign': 'left'}} add_module_names = False fail_on_warning = True diff --git a/docs/index.rst b/docs/index.rst index bb882cb..9eee5f8 100644 --- a/docs/index.rst +++ b/docs/index.rst @@ -64,6 +64,7 @@ arguments are mostly the same. workflow/input workflow/clustering + workflow/embeddings workflow/splits workflow/solvers posters diff --git a/docs/workflow/clustering.rst b/docs/workflow/clustering.rst index 2ebdf2b..6fd9583 100644 --- a/docs/workflow/clustering.rst +++ b/docs/workflow/clustering.rst @@ -144,7 +144,8 @@ Details about the clustering algorithms ####################################### For all algorithms, the given general commands can be extended by user defined arguments according to the specification -of the respective tool. +of the respective tool. Apart from this, DataSAIL offers multiple options to compute similarities or distances between +embeddings of the input data. More information is given :ref:`here `. CD-HIT(-EST) ============ diff --git a/docs/workflow/embeddings.rst b/docs/workflow/embeddings.rst new file mode 100644 index 0000000..551fc32 --- /dev/null +++ b/docs/workflow/embeddings.rst @@ -0,0 +1,335 @@ +######################## +Clustering of Embeddings +######################## + +.. 
_embeddings-label: + +DataSAIL offers several similarity and distance measures implemented in SciPy and RDKit to cluster embeddings. +The available measures and the embedding data types they support are: + +.. list-table:: Supported similarity and distance measures + :widths: 30 15 15 15 15 15 + :header-rows: 1 + + * - Algorithm + - Sim or Dist + - Boolean + - Integer + - Float + - RDKit or SciPy + * - AllBit + - Sim + - X + - \- + - \- + - RDKit + * - Asymmetric + - Sim + - X + - \- + - \- + - RDKit + * - Braun-Blanquet + - Sim + - X + - \- + - \- + - RDKit + * - Canberra + - Dist + - X + - X + - X + - SciPy + * - Dice + - Sim + - X + - X + - \- + - RDKit + * - Hamming + - Dist + - X + - X + - X + - SciPy + * - Kulczynski + - Sim + - X + - \- + - \- + - RDKit + * - Jaccard + - Dist + - X + - \- + - \- + - SciPy + * - Matching + - Dist + - X + - X + - X + - SciPy + * - OnBit + - Sim + - X + - \- + - \- + - RDKit + * - Rogers-Tanimoto + - Dist + - X + - \- + - \- + - SciPy + * - Rogot-Goldberg + - Sim + - X + - \- + - \- + - RDKit + * - Russel + - Sim + - X + - \- + - \- + - RDKit + * - Sokal + - Sim + - X + - \- + - \- + - RDKit + * - Sokal-Michener + - Dist + - X + - \- + - \- + - SciPy + * - Tanimoto + - Sim + - X + - X + - \- + - RDKit + * - Yule + - Dist + - X + - \- + - \- + - SciPy + +Individual Algorithms +##################### + +In the following, we describe the individual measures in more detail, together with the mathematical formula that +computes the respective metric between two vectors :math:`u` and :math:`v` of length :math:`n`. Depending on the measure, +:math:`u` and :math:`v` can be float vectors or may be restricted to integer vectors or bit vectors. + +.. note:: + We will use the `Iverson bracket `__ notation :math:`[P]` to + denote the indicator function that is 1 if the predicate :math:`P` is true and 0 otherwise. + +AllBit +====== + +The AllBit similarity is the fraction of positions at which the two bit vectors :math:`u` and :math:`v` agree. + +.. 
math:: + + \text{AllBit}(u, v) = \frac{\sum_{i=1}^{n} [u[i] = v[i]]}{n} + +Asymmetric +========== + +The Asymmetric similarity is the number of one-bits shared by the two bit vectors :math:`u` and :math:`v` divided by the +smaller of the two vectors' one-bit counts. The implementation is given in `RDKit `__. + +.. math:: + + & u_1 = \sum_{i=1}^{n} [u[i]]\\ + & v_1 = \sum_{i=1}^{n} [v[i]]\\ + & \text{Asymmetric}(u, v) = \begin{cases} + 0, &\text{if } \min(u_1,v_1) = 0,\\ + \frac{\sum_{i=1}^{n} [u[i] = v[i] = 1]}{\min(u_1, v_1)} &\text{otherwise} + \end{cases} + +Braun-Blanquet +============== + +The Braun-Blanquet similarity is the number of one-bits shared by the two bit vectors :math:`u` and :math:`v` divided by the +larger of the two vectors' one-bit counts. The implementation is given in `RDKit `__. + +.. math:: + + & u_1 = \sum_{i=1}^{n} [u[i]]\\ + & v_1 = \sum_{i=1}^{n} [v[i]]\\ + & \text{Braun-Blanquet}(u, v) = \begin{cases} + 0, &\text{if } \max(u_1,v_1) = 0,\\ + \frac{\sum_{i=1}^{n} [u[i] = v[i] = 1]}{\max(u_1, v_1)} &\text{otherwise} + \end{cases} + +Canberra +======== + +The Canberra distance sums, over all positions, the absolute difference of the two vectors :math:`u` and :math:`v` divided by the +sum of their absolute values at that position. The implementation is given in `SciPy `__. + +.. math:: + + \text{Canberra}(u, v) = \sum_{i=1}^{n} \frac{|u[i] - v[i]|}{|u[i]| + |v[i]|} + +Dice +==== + +The Dice similarity is twice the number of one-bits shared by the two bit vectors :math:`u` and :math:`v` divided by the sum of the +one-bit counts of the two vectors. The implementation is given in `RDKit `__. + +.. math:: + + \text{Dice}(u, v) = \frac{2 \sum_{i=1}^{n} [u[i] = v[i] = 1]}{\sum_{i=1}^{n} [u[i]] + \sum_{i=1}^{n} [v[i]]} + +Hamming or Matching +=================== + +The Hamming distance (a.k.a. Matching distance) is the number of bits that are different in the two bit vectors +:math:`u` and :math:`v`. The implementation is given in `SciPy `__. 
+ +.. math:: + + \text{Hamming}(u, v) = \sum_{i=1}^{n} [u[i] \neq v[i]] + +Jaccard +======= + +The Jaccard distance is the number of bits that are different in the two bit vectors :math:`u` and :math:`v` divided by +the number of equal one-bits in the two bit vectors :math:`u` and :math:`v` plus the number of bits that are different +in the two bit vectors :math:`u` and :math:`v`. The implementation is given in `SciPy `__. + +.. math:: + + \text{Jaccard}(u, v) = \frac{\sum_{i=1}^{n} [u[i] \neq v[i]]}{\sum_{i=1}^{n} [u[i] = v[i] = 1] + \sum_{i=1}^{n} [u[i] \neq v[i]]} + +Kulczynski +========== + +The Kulczynski similarity is the number of equal one-bits in the two bit vectors :math:`u` and :math:`v` multiplied +with the sum of the one-bit counts of both vectors, divided by twice the product of these two one-bit counts. The implementation is +given in `RDKit `__. + +.. math:: + + & u_1 = \sum_{i=1}^{n} [u[i]]\\ + & v_1 = \sum_{i=1}^{n} [v[i]]\\ + & \text{Kulczynski}(u, v) = \begin{cases} + 0, &\text{if } u_1 \cdot v_1 = 0,\\ + \frac{(\sum_{i=1}^{n} [u[i] = v[i] = 1]) \cdot (u_1 + v_1)}{2 \cdot u_1 \cdot v_1} &\text{otherwise} + \end{cases} + +Matching +======== + +See Hamming. + +OnBit +===== + +The OnBit similarity is the number of equal one-bits in the two bit vectors :math:`u` and :math:`v` divided by the number +of positions at which at least one of the two vectors has a one-bit. The similarity is 0 if the latter count is 0. The +implementation is given in `RDKit `__. + +.. math:: + + \text{OnBit}(u, v) = \begin{cases} + 0, &\text{if } \sum_{i=1}^{n} [u[i] \lor v[i]] = 0,\\ + \frac{(\sum_{i=1}^{n} [u[i] = v[i] = 1])}{\sum_{i=1}^{n} [u[i] \lor v[i]]} &\text{otherwise} + \end{cases} + +Rogers-Tanimoto +=============== + +The Rogers-Tanimoto distance is twice the number of bits that are different in the two bit vectors :math:`u` and +:math:`v` divided by the sum of twice the number of bits that are different in the two bit vectors :math:`u` and :math:`v` +plus the number of bits that are equal in the vectors. The implementation is given in `SciPy `__. + +.. 
math:: + + \text{Rogers-Tanimoto}(u, v) = \frac{2 \cdot \sum_{i=1}^{n} [u[i] \neq v[i]]}{2 \cdot \sum_{i=1}^{n} [u[i] \neq v[i]] + \sum_{i=1}^{n} [u[i] = v[i]]} + +Rogot-Goldberg +============== + +The Rogot-Goldberg similarity is the sum of two ratios: the number of equal one-bits in the two bit vectors :math:`u` and :math:`v` divided by +the total number of one-bits in both vectors, and the number of equal zero-bits divided by the total number of zero-bits in +both vectors. The implementation is given in `RDKit `__. + +.. math:: + + & x = \sum_{i=1}^{n} [u[i] = v[i] = 1]\\ + & y = \sum_{i=1}^{n} [u[i]]\\ + & z = \sum_{i=1}^{n} [v[i]]\\ + & d = n - y - z + x\\ + & \text{Rogot-Goldberg}(u, v) = \begin{cases} + 1, &\text{if } x = n \lor d = n,\\ + \frac{x}{y + z} + \frac{d}{2 \cdot n - y - z} &\text{otherwise} + \end{cases} + +Russel +====== + +The Russel similarity is the number of equal one-bits in the two bit vectors :math:`u` and :math:`v` divided by the +length :math:`n` of the vectors. The implementation is given in `RDKit `__. + +.. math:: + + \text{Russel}(u, v) = \frac{\sum_{i=1}^{n} [u[i] = v[i] = 1]}{n} + +Sokal +===== + +The Sokal similarity is the number of equal one-bits in the two bit vectors :math:`u` and :math:`v` divided by twice the sum +of the one-bits in the two bit vectors :math:`u` and :math:`v` minus three times the number of equal one-bits in the two bit +vectors :math:`u` and :math:`v`. The implementation is given in `RDKit `__. + +.. math:: + + \text{Sokal}(u, v) = \frac{\sum_{i=1}^{n} [u[i] = v[i] = 1]}{2 \cdot \sum_{i=1}^{n} ([u[i]] + [v[i]]) - 3 \cdot \sum_{i=1}^{n} [u[i] = v[i] = 1]} + +Sokal-Michener +============== + +The Sokal-Michener distance is twice the number of bits that are different in the two bit vectors :math:`u` and +:math:`v` divided by the sum of twice the number of bits that are different in the two bit vectors :math:`u` and :math:`v` +plus the number of bits that are equal in the vectors. 
The implementation is given in `SciPy `__. + +.. math:: + + \text{Sokal-Michener}(u, v) = \frac{2 \cdot \sum_{i=1}^{n} [u[i] \neq v[i]]}{\sum_{i=1}^{n} (2 \cdot [u[i] \neq v[i]] + [u[i] = v[i]])} + +Tanimoto +======== + +The Tanimoto similarity is the ratio of equal one-bits in the two bit vectors :math:`u` and :math:`v` divided by the +sum of the one-bits in the two bit vectors :math:`u` and :math:`v` minus the number of equal one-bits in the two bit +vectors :math:`u` and :math:`v`. The implementation is given in `RDKit `__. + +.. math:: + + & t = \sum_{i=1}^{n} ([u[i]] + [v[i]])\\ + & c = \sum_{i=1}^{n} [u[i] = v[i] = 1]\\ + & \text{Tanimoto}(u, v) = \begin{cases} + 1, &\text{if } t = 0,\\ + \frac{c}{t - c} &\text{otherwise} + \end{cases} + +Yule +==== + +The Yule distance is twice the product of the two counts of differing bits (positions where only :math:`u` is set and positions where only :math:`v` is set) divided +by the sum of that product and the product of the number of equal one-bits and the number of equal zero-bits in the two bit vectors +:math:`u` and :math:`v`. The implementation is given in `SciPy `__. + +.. math:: + + \text{Yule}(u, v) = \frac{2 \cdot c_{10} \cdot c_{01}}{c_{11} \cdot c_{00} + c_{10} \cdot c_{01}}, \quad c_{ab} = \sum_{i=1}^{n} [u[i] = a \land v[i] = b] diff --git a/docs/workflow/input.rst b/docs/workflow/input.rst index 603b6fa..1b5df62 100644 --- a/docs/workflow/input.rst +++ b/docs/workflow/input.rst @@ -15,17 +15,17 @@ The standard way to share data in an effective way are :code:`.csv` and :code:`. are used to, e. g., transport data about molecules, weights of samples, or stratification. From these files, DataSAIL only reads the first two columns. The first column has to contain the names of the samples and the second row the according information (SMILES or FASTA string, weighting, stratification, ...). Also, the first row must be column -names, therefore, DataSAIL ignores the first row. 
Examples are given in :code:`tests/data/pipline/drug.tsv`v(`Link `_) -and :code:`tests/data/pipeline/drugs_weights.tsv` (`Link `_). +names, therefore, DataSAIL ignores the first row. Examples are given in :code:`tests/data/pipline/drug.tsv`v(`Link `__) +and :code:`tests/data/pipeline/drugs_weights.tsv` (`Link `__). -But they are also used to ship similarity and distance matrices. An -example is given in :code:`tests/data/pipeline/drug_sim.csv` (`Link `_) -and :code:`tests/data/pipeline/drug_dist.csv` (`Link `_). +But they are also used to ship similarity and distance matrices. An example is given in +:code:`tests/data/pipeline/drug_sim.csv` (`Link `__) +and :code:`tests/data/pipeline/drug_dist.csv` (`Link `__). Here, the first row and column contain the names of the samples and the rest of the matrix the similarities or distances between the samples. CSV and TSV files can also be used to transport interactions. An example is given in -:code:`tests/data/pipeline/inter.tsv` (`Link `_). +:code:`tests/data/pipeline/inter.tsv` (`Link `__). Again, only the first two columns matter which specify which sample from the e-entity with which sample from the f-entity interacts. @@ -44,13 +44,13 @@ For Protein and Nucleotide Sequences Sequence-based datasets are stored inside a single files. Each sequences must be identified with its name in a line starting with a :code:`>`. All following lines are concatenated to form the sequence until there is an empty line, the end of the file, or a line that starts with :code:`>` starting the next line. An example with protein sequences is -given in :code:`tests/data/pipline/seqs.fasta` (`Link `_). +given in :code:`tests/data/pipline/seqs.fasta` (`Link `__). For whole Genomes ================= Genome input through FASTA files is a bit different to the format above. Here, each file contains all contigs, or reads -of one sample and the dataset is represented by a folder. Examples are given in :code:`tests/data/genomes` (`Link `_). 
+of one sample and the dataset is represented by a folder. Examples are given in :code:`tests/data/genomes` (`Link `__). Pickle Files ############ @@ -59,7 +59,7 @@ Pickle Files From version 1.0.0 on, DataSAIL can also take embeddings as input. Here, the pickle file has to contain a dictionary mapping the sample names to the embeddings. An example storing Morgan fingerprints of the molecules in -:code:`tests/data/pipeline/drugs.tsv` in a pickle file is given in :code:`tests/data/pipeline/drugs.pkl` (`Link `_). +:code:`tests/data/pipeline/drugs.tsv` in a pickle file is given in :code:`tests/data/pipeline/drugs.pkl` (`Link `__). HDF5 Files ########## @@ -69,7 +69,7 @@ HDF5 Files Also, from version 1.0.0 on, DataSAIL supports the :code:`.h5` format. This format is used to store large datasets in runtime and memory efficient way. Similar to Pickle files, the HDF5 file has to contain a dictionary mapping the sample names to the embeddings. An example storing Morgan fingerprints of the molecules in -:code:`tests/data/pipeline/drugs.tsv` in a HDF5 file is given in :code:`tests/data/pipeline/drugs.h5` (`Link `_). +:code:`tests/data/pipeline/drugs.tsv` in a HDF5 file is given in :code:`tests/data/pipeline/drugs.h5` (`Link `__). To open and convert it to a dictionary, the following code can be used: .. code-block:: python @@ -80,7 +80,7 @@ To open and convert it to a dictionary, the following code can be used: with h5py.File('tests/data/pipeline/morgan.h5', 'r') as f: morgan = {k: np.array(v) for k, v in f.items()} -Example code for creation and reading of Pickle and HDF5 files can be found in :code:`tests/data/pipeline/embed.py` (`Link `_). +Example code for creation and reading of Pickle and HDF5 files can be found in :code:`tests/data/pipeline/embed.py` (`Link `__). Molecular Input Files ##################### @@ -96,4 +96,4 @@ except for :code:`.sdf` files, which can contain multiple molecules. The molecul molecules in the same file. 
Example files for :code:`.mol`, :code:`.mrv`, :code:`.pdb`, and :code:`.tpl` are given in -:code:`tests/data/pipeline/mol_formats//` (`Link `_). +:code:`tests/data/pipeline/mol_formats//` (`Link `__).
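Review note on the new `docs/workflow/embeddings.rst`: the bit-vector formulas can be sanity-checked with a small NumPy sketch like the one below. It is an illustration only, written directly from the documented AllBit, Hamming, Russel, Dice, and Tanimoto formulas. The function names are hypothetical and are not part of DataSAIL, RDKit, or SciPy, and returning 0 for Dice on two all-zero vectors is an assumption, since the documented formula does not cover that case.

```python
import numpy as np


def common_on_bits(u: np.ndarray, v: np.ndarray) -> int:
    """Number of positions where both bit vectors are 1."""
    return int(np.sum(u & v))


def all_bit(u: np.ndarray, v: np.ndarray) -> float:
    """Fraction of positions at which the two bit vectors agree."""
    return float(np.mean(u == v))


def hamming_count(u: np.ndarray, v: np.ndarray) -> int:
    """Number of positions at which the two vectors differ."""
    return int(np.sum(u != v))


def russel(u: np.ndarray, v: np.ndarray) -> float:
    """Shared one-bits divided by the vector length n."""
    return common_on_bits(u, v) / u.size


def dice(u: np.ndarray, v: np.ndarray) -> float:
    """Twice the shared one-bits over the sum of both one-bit counts."""
    total = int(u.sum()) + int(v.sum())
    if total == 0:
        return 0.0  # assumption: the documented formula leaves this case open
    return 2 * common_on_bits(u, v) / total


def tanimoto(u: np.ndarray, v: np.ndarray) -> float:
    """c / (t - c), with the documented special case of 1 when t = 0."""
    c = common_on_bits(u, v)
    t = int(u.sum()) + int(v.sum())
    if t == 0:
        return 1.0
    return c / (t - c)


# Two small example bit vectors (hypothetical data, not from the test suite).
u = np.array([1, 1, 0, 0, 1, 0], dtype=bool)
v = np.array([1, 0, 0, 1, 1, 0], dtype=bool)
```

The example vectors share two one-bits and differ in two positions, so each function can be checked by hand against the corresponding formula in the new page.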