Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feat: data clustering workflow #94

Merged
merged 20 commits into from
Apr 29, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
8 changes: 0 additions & 8 deletions .github/workflows/run-tox.yml
Original file line number Diff line number Diff line change
Expand Up @@ -30,11 +30,3 @@ jobs:
- name: Run tox
run: |
tox
- name: Upload to codecov
if: ${{matrix.python-version == '3.10'}}
uses: codecov/codecov-action@v1
with:
fail_ci_if_error: false
files: ./coverage.xml
flags: pytest
name: "PyGenStability-py310"
63 changes: 48 additions & 15 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -18,68 +18,68 @@ We further provide specific analysis tools to process and analyse the results fr

## Documentation

A documentation of all features of the *PyGenStability* is available here: https://barahona-research-group.github.io/PyGenStability/, or in pdf [here](pygenstability_doc.pdf).
A documentation of all features of the *PyGenStability* package is available here: https://barahona-research-group.github.io/PyGenStability/, or in pdf [here](pygenstability_doc.pdf).

## Installation

You can install the package using [pypi](https://pypi.org/project/PyGenStability/):

```
```bash
pip install pygenstability
```

Using a fresh python3 virtual environment, e.g. conda, may be recommended to avoid conflicts with other python packages.

By default, the package uses the Louvain algorithm [4] for optimizing generalized Markov Stability. To use the Leiden algorithm [5], install this package with:
```
```bash
pip install pygenstability[leiden]
```

To plot network partitions using `networkx`, install this package with:
```
```bash
pip install pygenstability[networkx]
```

To use `plotly` for interactive plots in the browser, install this package with:
```
```bash
pip install pygenstability[plotly]
```

To install all dependencies, run:
```
```bash
pip install pygenstability[all]
```

### Installation from GitHub

You can also install the source code of this package from GitHub directly by first cloning this repo with:
```
```bash
git clone --recurse-submodules https://github.com/ImperialCollegeLondon/PyGenStability.git
```

(if the `--recurse-submodules` has not been used, just do `git submodule update --init --recursive` to fetch the submodule with M. Schaub's code).

The wrapper for the submodule uses Pybind11 https://github.com/pybind/pybind11 and, to install the package, simply run (within the `PyGenStability` directory):
```
```bash
pip install .
```
using a fresh python3 virtual environment to avoid conflicts. Similar to above, you can also specify additional dependencies, e.g. to install the package with `networkx` run:
```
```bash
pip install .[networkx]
```

## Using the code

The code is simple to run with the default settings. We can input our graph (of type scipy.csgraph), run a scan in scales with a chosen Markov Stability constructor and plot the results in a summary figure presenting different partition quality measures across scales (values of MS cost function, number of communities, etc.) with an indication of optimal scales.

```
```python
import pygenstability as pgs
results = pgs.run(graph)
pgs.plot_scan(results)
```

Although it is enforced in the code, it is advised to set environment variables
```
```bash
export OPENBLAS_NUM_THREADS=1
export OMP_NUM_THREADS=1
export NUMEXPR_MAX_THREADS=1
Expand All @@ -92,7 +92,7 @@ There are a variety of further choices that users can make that will impact the

While Louvain is defined as the default due to its familiarity within the research community, Leiden is known to produce better partitions and can be used by specifying the run function.

```
```python
results = pgs.run(graph, method="leiden")
```

Expand All @@ -102,7 +102,7 @@ There are also additional post-processing and analysis functions, including:

Optimal scale selection [6] is performed by default with the run function but can be repeated with different parameters if needed, see `pygenstability/optimal_scales.py`. To reduce noise, e.g., one can increase the parameter values for `block_size` and `window_size`. The optimal network partitions can then be plotted given a NetworkX nx_graph.

```
```python
results = pgs.identify_optimal_scales(results, block_size=10, window_size=5)
pgs.plot_optimal_partitions(nx_graph, results)
```
Expand All @@ -126,6 +126,33 @@ For those of you that wish to implement their own constructor, you will need to
- take a scipy.csgraph `graph` and a float `time` as argument
- return a `quality_matrix` (sparse scipy matrix) and a `null_model` (multiples of two, in a numpy array)

## Graph-based data clustering

PyGenStability can also be used to perform multiscale graph-based data clustering on data that comes in the form of a sample-by-feature matrix. This approach was shown to achieve better performance than other popular clustering methods without the need of setting the number of clusters externally [9].

We provide an easy-to-use interface in our `pygenstability.data_clustering.py` module. Given a sample-by-feature matrix `X`, one can apply graph-based data clustering as follows:

```python
clustering = pgs.DataClustering(
graph_method="cknn",
k=5,
constructor="continuous_normalized")

# apply graph-based data clustering to X
results = clustering.fit(X)

# identify optimal scales and plot scan
clustering.scale_selection(kernel_size=0.2)
clustering.plot_scan()
```

We currently support $k$-Nearest Neighbor (kNN) and Continuous $k$-Nearest Neighbor (CkNN) [10] graph constructions (specified by `graph_method`) and `k` refers to the number of neighbours considered in the construction. See documentation for a list of all parameters. All functionalities of PyGenStability including plotting and scale selection are also available for data clustering. For example, given two-dimensional coordinates of the data points one can plot the optimal partitions directly:

```python
# plot robust partitions
clustering.plot_robust_partitions(x_coord=x_coord,y_coord=y_coord)
```

## Contributors

- Alexis Arnaudon, GitHub: `arnaudon <https://github.com/arnaudon>`
Expand Down Expand Up @@ -170,11 +197,11 @@ The original paper for Markov Stability can also be cited as:

In the `example` folder, a demo script with a stochastic block model can be tried with

```
```bash
python simple_example.py
```
or using the click app:
```
```bash
./run_simple_example.sh
```

Expand All @@ -188,6 +215,7 @@ Other examples can be found as jupyter-notebooks in the `examples/` directory, i
- Example 4: Custom constructors
- Example 5: Hypergraphs
- Example 6: Signed networks
- Example 7: Graph-based data clustering

Finally, we provide applications to real-world networks in the `examples/real_examples/` directory, including:

Expand Down Expand Up @@ -224,10 +252,15 @@ If you are interested in trying our other packages, see the below list:

[8] S. Gomez, P. Jensen, and A. Arenas, 'Analysis of community structure in networks of correlated data'. *Physical Review E*, vol. 80, no. 1, p. 016114, Jul. 2009, doi: 10.1103/PhysRevE.80.016114.

[9] Z. Liu and M. Barahona, 'Graph-based data clustering via multiscale community detection', *Applied Network Science*, vol. 5, no. 1, p. 3, Dec. 2020, doi: 10.1007/s41109-019-0248-7.

[10] T. Berry and T. Sauer, 'Consistent manifold representation for topological data analysis', *Foundations of Data Science*, vol. 1, no. 1, p. 1-38, Feb. 2019, doi: 10.3934/fods.2019001.

## Licence

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see http://www.gnu.org/licenses/.

76 changes: 54 additions & 22 deletions docs/index_readme.md
Original file line number Diff line number Diff line change
Expand Up @@ -9,62 +9,62 @@ We further provide specific analysis tools to process and analyse the results fr

You can install the package using [pypi](https://pypi.org/project/PyGenStability/):

```
```bash
pip install pygenstability
```

Using a fresh python3 virtual environment, e.g. conda, may be recommended to avoid conflicts with other python packages.

By default, the package uses the Louvain algorithm [4] for optimizing generalized Markov Stability. To use the Leiden algorithm [5], install this package with:
```
```bash
pip install pygenstability[leiden]
```

To plot network partitions using `networkx`, install this package with:
```
```bash
pip install pygenstability[networkx]
```

To use `plotly` for interactive plots in the browser, install this package with:
```
```bash
pip install pygenstability[plotly]
```

To install all dependencies, run:
```
```bash
pip install pygenstability[all]
```

### Installation from GitHub

You can also install the source code of this package from GitHub directly by first cloning this repo with:
```
```bash
git clone --recurse-submodules https://github.com/ImperialCollegeLondon/PyGenStability.git
```

(if the `--recurse-submodules` has not been used, just do `git submodule update --init --recursive` to fetch the submodule with M. Schaub's code).

The wrapper for the submodule uses Pybind11 https://github.com/pybind/pybind11 and, to install the package, simply run (within the `PyGenStability` directory):
```
```bash
pip install .
```
using a fresh python3 virtual environment to avoid conflicts. Similar to above, you can also specify additional dependencies, e.g. to install the package with `networkx` run:
```
```bash
pip install .[networkx]
```

## Using the code

The code is simple to run with the default settings. We can input our graph (of type scipy.csgraph), run a scan in scales with a chosen Markov Stability constructor and plot the results in a summary figure presenting different partition quality measures across scales (values of MS cost function, number of communities, etc.) with an indication of optimal scales.

```
```python
import pygenstability as pgs
results = pgs.run(graph)
pgs.plot_scan(results)
```

Although it is enforced in the code, it is advised to set environment variables
```
```bash
export OPENBLAS_NUM_THREADS=1
export OMP_NUM_THREADS=1
export NUMEXPR_MAX_THREADS=1
Expand All @@ -77,7 +77,7 @@ There are a variety of further choices that users can make that will impact the

While Louvain is defined as the default due to its familiarity within the research community, Leiden is known to produce better partitions and can be used by specifying the run function.

```
```python
results = pgs.run(graph, method="leiden")
```

Expand All @@ -87,7 +87,7 @@ There are also additional post-processing and analysis functions, including:

Optimal scale selection [6] is performed by default with the run function but can be repeated with different parameters if needed, see `pygenstability/optimal_scales.py`. To reduce noise, e.g., one can increase the parameter values for `block_size` and `window_size`. The optimal network partitions can then be plotted given a NetworkX nx_graph.

```
```python
results = pgs.identify_optimal_scales(results, block_size=10, window_size=5)
pgs.plot_optimal_partitions(nx_graph, results)
```
Expand All @@ -111,6 +111,33 @@ For those of you that wish to implement their own constructor, you will need to
- take a scipy.csgraph `graph` and a float `time` as argument
- return a `quality_matrix` (sparse scipy matrix) and a `null_model` (multiples of two, in a numpy array)

## Graph-based data clustering

PyGenStability can also be used to perform multiscale graph-based data clustering on data that comes in the form of a sample-by-feature matrix. This approach was shown to achieve better performance than other popular clustering methods without the need of setting the number of clusters externally [9].

We provide an easy-to-use interface in our `pygenstability.data_clustering.py` module. Given a sample-by-feature matrix `X`, one can apply graph-based data clustering as follows:

```python
clustering = pgs.DataClustering(
graph_method="cknn",
k=5,
constructor="continuous_normalized")

# apply graph-based data clustering to X
results = clustering.fit(X)

# identify optimal scales and plot scan
clustering.scale_selection(kernel_size=0.2)
clustering.plot_scan()
```

We currently support $k$-Nearest Neighbor (kNN) and Continuous $k$-Nearest Neighbor (CkNN) [10] graph constructions (specified by `graph_method`) and `k` refers to the number of neighbours considered in the construction. See documentation for a list of all parameters. All functionalities of PyGenStability including plotting and scale selection are also available for data clustering. For example, given two-dimensional coordinates of the data points one can plot the optimal partitions directly:

```python
# plot robust partitions
clustering.plot_robust_partitions(x_coord=x_coord,y_coord=y_coord)
```

## Contributors

- Alexis Arnaudon, GitHub: `arnaudon <https://github.com/arnaudon>`
Expand Down Expand Up @@ -155,11 +182,11 @@ The original paper for Markov Stability can also be cited as:

In the `example` folder, a demo script with a stochastic block model can be tried with

```
```bash
python simple_example.py
```
or using the click app:
```
```bash
./run_simple_example.sh
```

Expand All @@ -173,6 +200,7 @@ Other examples can be found as jupyter-notebooks in the `examples/` directory, i
- Example 4: Custom constructors
- Example 5: Hypergraphs
- Example 6: Signed networks
- Example 7: Graph-based data clustering

Finally, we provide applications to real-world networks in the `examples/real_examples/` directory, including:

Expand All @@ -193,21 +221,25 @@ If you are interested in trying our other packages, see the below list:

## References

[1] J.-C. Delvenne, S. N. Yaliraki, and M. Barahona, ‘Stability of graph communities across time scales’, *Proceedings of the National Academy of Sciences*, vol. 107, no. 29, pp. 12755–12760, Jul. 2010, doi: 10.1073/pnas.0903215107.
[1] J.-C. Delvenne, S. N. Yaliraki, and M. Barahona, 'Stability of graph communities across time scales', *Proceedings of the National Academy of Sciences*, vol. 107, no. 29, pp. 12755–12760, Jul. 2010, doi: 10.1073/pnas.0903215107.

[2] R. Lambiotte, J.-C. Delvenne, and M. Barahona, 'Random Walks, Markov Processes and the Multiscale Modular Organization of Complex Networks', *IEEE Trans. Netw. Sci. Eng.*, vol. 1, no. 2, pp. 76–90, Jul. 2014, doi: 10.1109/TNSE.2015.2391998.

[3] M. T. Schaub, J.-C. Delvenne, R. Lambiotte, and M. Barahona, 'Multiscale dynamical embeddings of complex networks', *Phys. Rev. E*, vol. 99, no. 6, Jun. 2019, doi: 10.1103/PhysRevE.99.062308.

[2] R. Lambiotte, J.-C. Delvenne, and M. Barahona, ‘Random Walks, Markov Processes and the Multiscale Modular Organization of Complex Networks’, *IEEE Trans. Netw. Sci. Eng.*, vol. 1, no. 2, pp. 76–90, Jul. 2014, doi: 10.1109/TNSE.2015.2391998.
[4] V. D. Blondel, J.-L. Guillaume, R. Lambiotte, and E. Lefebvre, 'Fast unfolding of communities in large networks', *J. Stat. Mech.*, vol. 2008, no. 10, Oct. 2008, doi: 10.1088/1742-5468/2008/10/p10008.

[3] M. T. Schaub, J.-C. Delvenne, R. Lambiotte, and M. Barahona, ‘Multiscale dynamical embeddings of complex networks’, *Phys. Rev. E*, vol. 99, no. 6, Jun. 2019, doi: 10.1103/PhysRevE.99.062308.
[5] V. A. Traag, L. Waltman, and N. J. van Eck, 'From Louvain to Leiden: guaranteeing well-connected communities', *Sci Rep*, vol. 9, no. 1, p. 5233, Mar. 2019, doi: 10.1038/s41598-019-41695-z.

[4] V. D. Blondel, J.-L. Guillaume, R. Lambiotte, and E. Lefebvre, ‘Fast unfolding of communities in large networks’, *J. Stat. Mech.*, vol. 2008, no. 10, Oct. 2008, doi: 10.1088/1742-5468/2008/10/p10008.
[6] D. J. Schindler, J. Clarke, and M. Barahona, 'Multiscale Mobility Patterns and the Restriction of Human Movement', *Royal Society Open Science*, vol. 10, no. 10, p. 230405, Oct. 2023, doi: 10.1098/rsos.230405.

[5] V. A. Traag, L. Waltman, and N. J. van Eck, ‘From Louvain to Leiden: guaranteeing well-connected communities’, *Sci Rep*, vol. 9, no. 1, p. 5233, Mar. 2019, doi: 10.1038/s41598-019-41695-z.
[7] A. Arnaudon, D. J. Schindler, R. L. Peach, A. Gosztolai, M. Hodges, M. T. Schaub, and M. Barahona, 'PyGenStability: Multiscale community detection with generalized Markov Stability', *arXiv pre-print*, Mar. 2023, doi: 10.48550/arXiv.2303.05385.

[6] D. J. Schindler, J. Clarke, and M. Barahona, ‘Multiscale Mobility Patterns and the Restriction of Human Movement’, *Royal Society Open Science*, vol. 10, no. 10, p. 230405, Oct. 2023, doi: 10.1098/rsos.230405.
[8] S. Gomez, P. Jensen, and A. Arenas, 'Analysis of community structure in networks of correlated data'. *Physical Review E*, vol. 80, no. 1, p. 016114, Jul. 2009, doi: 10.1103/PhysRevE.80.016114.

[7] A. Arnaudon, D. J. Schindler, R. L. Peach, A. Gosztolai, M. Hodges, M. T. Schaub, and M. Barahona, ‘PyGenStability: Multiscale community detection with generalized Markov Stability‘, *arXiv pre-print*, Mar. 2023, doi: 10.48550/arXiv.2303.05385.
[9] Z. Liu and M. Barahona, 'Graph-based data clustering via multiscale community detection', *Applied Network Science*, vol. 5, no. 1, p. 3, Dec. 2020, doi: 10.1007/s41109-019-0248-7.

[8] S. Gómez, P. Jensen, and A. Arenas, ‘Analysis of community structure in networks of correlated data‘. *Physical Review E*, vol. 80, no. 1, p. 016114, Jul. 2009, doi: 10.1103/PhysRevE.80.016114.
[10] T. Berry and T. Sauer, 'Consistent manifold representation for topological data analysis', *Foundations of Data Science*, vol. 1, no. 1, p. 1-38, Feb. 2019, doi: 10.3934/fods.2019001.

## Licence

Expand Down
4 changes: 4 additions & 0 deletions docs/source/dataclustering.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,4 @@
DataClustering module
=====================
.. automodule:: pygenstability.DataClustering
:members:
1 change: 1 addition & 0 deletions docs/source/index.rst
Original file line number Diff line number Diff line change
Expand Up @@ -13,6 +13,7 @@ Documentation of the API of *PyGenStability*.
constructors
app
plotting
dataclustering
optimal_scales
io
examples/example
Loading
Loading