merge main to iss14 to include tests

flatironinstitute · Jul 11, 2024 · b5b4b05
2 parents f489f14 + b11df30

Showing 32 changed files with 1,523 additions and 312 deletions.
diff --git a/.github/workflows/testing.yml b/.github/workflows/testing.yml
@@ -0,0 +1,61 @@
+# GHA workflow for running tests.
+#
+# Largely taken from
+# https://docs.github.com/en/actions/automating-builds-and-tests/building-and-testing-python
+# Please check the link for more detailed instructions
+
+name: Run tests
+
+on: [push]
+
+jobs:
+  build:
+
+    runs-on: ubuntu-latest
+    strategy:
+      matrix:
+        python-version: ["3.8", "3.9", "3.10", "3.11"]
+
+    steps:
+      - uses: actions/checkout@v3
+
+      - name: Set up Python ${{ matrix.python-version }}
+        uses: actions/setup-python@v4
+        with:
+          python-version: ${{ matrix.python-version }}
+
+      - name: Cache dependencies
+        id: cache_deps
+        uses: actions/cache@v3
+        with:
+          path: |
+            ${{ env.pythonLocation }}
+          key: venv-${{ runner.os }}-${{ env.pythonLocation }}-${{ hashFiles('**/pyproject.toml') }}
+
+      - name: Cache test data
+        id: cache_test_data
+        uses: actions/cache@v3
+        with:
+          path: |
+            tests/data
+            data
+          key: venv-${{ runner.os }}-${{ env.pythonLocation }}-${{ hashFiles('**/tests/scripts/fetch_test_data.sh') }}
+
+      - name: Install dependencies
+        if: ${{ steps.cache_deps.outputs.cache-hit != 'true' }}
+        run: |
+          python -m pip install --upgrade pip
+          pip install .
+          pip install pytest omegaconf
+
+      - name: Get test data from OSF
+        if: ${{ steps.cache_test_data.outputs.cache-hit != 'true' }}
+        run: |
+          sh tests/scripts/fetch_test_data.sh
+
+      - name: Test with pytest
+        run: |
+          pytest tests/test_preprocessing.py
+          pytest tests/test_svd.py
+          pytest tests/test_map_to_map.py
+          pytest tests/test_distribution_to_distribution.py
diff --git a/.gitignore b/.gitignore
@@ -158,3 +158,9 @@ cython_debug/
 #  and can be added to the global gitignore or merged into this file.  For a more nuclear
 #  option (not recommended) you can uncomment the following to ignore the entire idea folder.
 #.idea/
+
+# Tutorials folder
+tutorials/*
+
+# Config file templates
+config_files/*
diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml
@@ -10,7 +10,6 @@ repos:
     -   id: trailing-whitespace
     -   id: end-of-file-fixer
     -   id: check-yaml
-    -   id: check-added-large-files
 - repo: https://github.com/astral-sh/ruff-pre-commit
   # Ruff version.
   rev: v0.3.4

diff --git a/README.md b/README.md
@@ -1,28 +1,81 @@
 <h1 align='center'>Cryo-EM Heterogeneity Challenge</h1>
 
-This repository contains the code used to analyse the submissions for the first Cryo-EM Heteorgeneity Challenge.
+<p align="center">
 
+<img alt="Supported Python versions" src="https://img.shields.io/badge/Supported_Python_Versions-3.8_%7C_3.9_%7C_3.10_%7C_3.11-blue">
+<img alt="GitHub Downloads (all assets, all releases)" src="https://img.shields.io/github/downloads/flatironinstitute/Cryo-EM-Heterogeneity-Challenge-1/total">
+<img alt="GitHub branch check runs" src="https://img.shields.io/github/check-runs/flatironinstitute/Cryo-EM-Heterogeneity-Challenge-1/main">
+<img alt="GitHub License" src="https://img.shields.io/github/license/flatironinstitute/Cryo-EM-Heterogeneity-Challenge-1">
 
-## Warning
+</p>
 
+<p align="center">
 
-This is a work in progress, while the code will probably not change, we are still writting better tutorials, documentation, and other ideas for analyzing the data. We are also in the process of making it easier for other people to contribute with their own metrics and methods. We are also in the process of distributiing the code to PyPi
+<img alt="Cryo-EM Heterogeneity Challenge" src="https://simonsfoundation.imgix.net/wp-content/uploads/2023/05/15134456/Screenshot-2023-05-15-at-1.39.07-PM.png?auto=format&q=90">
 
+</p>
 
-## Accesing the data
 
-The data is available in TODO
 
+This repository contains the code used to analyse the submissions for the [Inaugural Flatiron Cryo-EM Heterogeneity Challenge](https://www.simonsfoundation.org/flatiron/center-for-computational-biology/structural-and-molecular-biophysics-collaboration/heterogeneity-in-cryo-electron-microscopy/).
 
-## Installation
+# Scope
+This repository explains how to preprocess and analyze a submission (80 maps and the corresponding probability distribution). Challenge participants can benchmark their submissions locally against the ground truth and against other submissions, which are available in the cloud via the Open Science Framework (OSF) project [The Inaugural Flatiron Institute Cryo-EM Heterogeneity Community Challenge](https://osf.io/8h6fz/).
 
+# Warning
+This is a work in progress. While the code will probably not change, we are still writing better tutorials and documentation and exploring other ideas for analyzing the data. We are also making it easier for other people to contribute their own metrics and methods, and we are in the process of distributing the code on PyPI.
+
+# Accessing the data
+The data is available via the Open Science Framework (OSF) project [The Inaugural Flatiron Institute Cryo-EM Heterogeneity Community Challenge](https://osf.io/8h6fz/). You can download it via a web browser, or programmatically with wget as per [this script](https://github.com/flatironinstitute/Cryo-EM-Heterogeneity-Challenge-1/blob/main/tests/scripts/fetch_test_data.sh).
+
+**_NOTE_**: We recommend downloading the data with the script and wget, as downloads from the web browser might be unstable.
+
+# Installation
+
+## Stable installation
 Installing this repository is simple. We recommend creating a virtual environment (using conda or pyenv), since we have dependencies such as PyTorch or Aspire, which are better dealt with in an isolated environment. After creating your environment, make sure to activate it and run
 
 ```bash
 cd /path/to/Cryo-EM-Heterogeneity-Challenge-1
 pip install .
 ```
 
-You are all set. If you want to run our code, please check the notebooks in the folder called "tutorials".
+## Devel installation
+If you are interested in testing the installed programs, please install the repository in development mode with the following commands:
+
+```bash
+cd /path/to/Cryo-EM-Heterogeneity-Challenge-1
+pip install .[dev]
+```
+
+The tests included in the repo can be run with pytest as shown below:
+
+```bash
+cd /path/to/Cryo-EM-Heterogeneity-Challenge-1
+pytest tests/test_preprocessing.py
+pytest tests/test_svd.py
+pytest tests/test_map_to_map.py
+pytest tests/test_distribution_to_distribution.py
+```
+
+# Running
+If you want to run our code, please check the notebooks in the [tutorials folder](https://github.com/flatironinstitute/Cryo-EM-Heterogeneity-Challenge-1/tree/main/tutorials).
+
+The tutorials explain how to set up the config files and run the following commands:
+```
+cryo_challenge run_preprocessing                      --config config_files/config_preproc.yaml
+cryo_challenge run_svd                                --config config_files/config_svd.yaml
+cryo_challenge run_map2map_pipeline                   --config config_files/config_map_to_map.yaml
+cryo_challenge run_distribution2distribution_pipeline --config config_files/config_distribution_to_distribution.yaml
+```
+
+# Contributing
+If you find a bug or have a suggestion about the code, feel free to open an issue [here](https://github.com/flatironinstitute/Cryo-EM-Heterogeneity-Challenge-1/issues).
+
+We also welcome any help with the development of this repository. If you want to contribute your own suggestions, code, or fixes, we recommend creating a fork of this repository to avoid any incompatibilities with newer versions of the software. Once you are happy with your new code, please make a PR from your fork to this repository.
+
+We are also working on pipelines to simplify extending the code with new metrics or functionality. Stay tuned!
 
-## Acknowledgements
+# Acknowledgements
+* Miro A. Astore, Geoffrey Woollard, David Silva-Sánchez, Wenda Zhao, Khanh Dao Duc, Nikolaus Grigorieff, Pilar Cossio, and Sonya M. Hanson. "The Inaugural Flatiron Institute Cryo-EM Heterogeneity Community Challenge". 9 June 2023. DOI:10.17605/OSF.IO/8H6FZ
+* [David Herreros](https://github.com/DavidHerreros) for testing, CI, and debugging in this repo
diff --git a/config_files/config_distribution_to_distribution.yaml b/config_files/config_distribution_to_distribution.yaml
@@ -12,4 +12,4 @@ cvxpy_solver: ECOS
 optimal_q_kl:
   n_iter: 100000
   break_atol: 0.0001
-output_fname: results/distribution_to_distribution_submission_0.pkl
+output_fname: results/distribution_to_distribution_submission_0.pkl
diff --git a/config_files/config_map_to_map_distance_matrix.yaml b/config_files/config_map_to_map_distance_matrix.yaml
@@ -1,15 +1,15 @@
 data:
   n_pix: 224
-  psize: 2.146 
+  psize: 2.146
   submission:
     fname: data/submission_0.pt
     volume_key: volumes
     metadata_key: populations
     label_key: id
   ground_truth:
-    volumes: data/maps_gt_flat.pt 
-    metadata: data/metadata.csv 
-  mask: 
+    volumes: data/maps_gt_flat.pt
+    metadata: data/metadata.csv
+  mask:
     do: true
     volume: data/mask_dilated_wide_224x224.mrc
 analysis:
@@ -23,4 +23,4 @@ analysis:
   normalize:
     do: true
     method: median_zscore
-output: results/map_to_map_distance_matrix_submission_0.pkl
+output: results/map_to_map_distance_matrix_submission_0.pkl
diff --git a/config_files/config_plotting.yaml b/config_files/config_plotting.yaml
@@ -0,0 +1,13 @@
+gt_metadata: path/to/metadata.csv
+
+map2map_results:
+  - path/to/map2map_results_1.pkl
+  - path/to/map2map_results_2.pkl
+
+dist2dist_results:
+  pkl_fnames:
+    - path/to/dist2dist_results_1.pkl
+    - path/to/dist2dist_results_2.pkl
+  pkl_globs:
+    - string/path/with/wildcards/and/regex/filestem*_field1[0-9].pkl
+    - string/path/with/wildcards/and/regex/filestem*_field1[0-9]_field2[0-9].pkl
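Note on `pkl_globs`: the two entries above read as shell-style wildcard patterns rather than regular expressions: `*` matches any run of characters and `[0-9]` matches exactly one digit. The sketch below assumes the plotting code expands them with glob-style matching (that code is not part of this diff) and uses the standard-library `fnmatch` module on made-up file names to show which names each pattern would select.

```python
from fnmatch import fnmatchcase

# Hypothetical file names, invented for illustration only.
names = [
    "filestem_a_field13.pkl",
    "filestem_b_field1x.pkl",
    "filestem_c_field12_field27.pkl",
]

# Patterns copied from config_plotting.yaml, without the directory part.
pattern_one_field = "filestem*_field1[0-9].pkl"
pattern_two_fields = "filestem*_field1[0-9]_field2[0-9].pkl"


def select(pattern, candidates):
    # Keep only the names that match the glob-style pattern (case-sensitive).
    return [name for name in candidates if fnmatchcase(name, pattern)]


one_field_matches = select(pattern_one_field, names)
two_field_matches = select(pattern_two_fields, names)

print(one_field_matches)
print(two_field_matches)
```

With real files on disk, `glob.glob(pattern)` applies the same matching rules to directory listings.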
diff --git a/pyproject.toml b/pyproject.toml
@@ -38,27 +38,29 @@ classifiers = [
   "Programming Language :: Python :: Implementation :: PyPy",
 ]
 dependencies = [
-  "torch",
-  "numpy",
-  "natsort",
-  "pandas",
-  "dataclasses_json",
-  "mrcfile",
-  "scipy",
-  "cvxpy",
-  "POT",
-  "aspire",
-  "jupyter",
-  "osfclient",
-  "seaborn"
+  "torch<=2.3.1",
+  "numpy<=2.0.0",
+  "natsort<=8.4.0",
+  "pandas<=2.2.2",
+  "dataclasses_json<=0.6.7",
+  "mrcfile<=1.5.0",
+  "scipy<=1.13.1",
+  "cvxpy<=1.5.2",
+  "POT<=0.9.3",
+  "aspire<=0.12.2",
+  "jupyter<=1.0.0",
+  "osfclient<=0.0.5",
+  "seaborn<=0.13.2",
+  "ipyfilechooser<=0.6.0",
 ]
 
 [project.optional-dependencies]
 dev = [
-  "pytest",
+  "pytest<=8.2.2",
   "mypy",
   "pre-commit",
-  "ruff"
+  "ruff",
+  "omegaconf<=2.3.0"
 ]
 
 [project.urls]

diff --git a/src/cryo_challenge/_distribution_to_distribution/distribution_to_distribution.py b/src/cryo_challenge/_distribution_to_distribution/distribution_to_distribution.py
@@ -2,8 +2,6 @@
 import numpy as np
 import pickle
 from scipy.stats import rankdata
-import yaml
-import argparse
 import torch
 import ot
 
@@ -13,6 +11,15 @@
 )
 
 
+def sort_by_transport(cost):
+    m, n = cost.shape
+    _, transport = compute_wasserstein_between_distributions_from_weights_and_cost(
+        np.ones(m) / m, np.ones(n) / n, cost
+    )
+    indices = np.argsort((transport * np.arange(m)[..., None]).sum(0))
+    return cost[:, indices], indices, transport
+
+
 def compute_wasserstein_between_distributions_from_weights_and_cost(
     weights_a, weights_b, cost, numItermax=1000000
 ):
@@ -58,15 +65,14 @@ def make_assignment_matrix(cost_matrix):
 
 
 def run(config):
-
     metadata_df = pd.read_csv(config["gt_metadata_fname"])
     metadata_df.sort_values("pc1", inplace=True)
 
     with open(config["input_fname"], "rb") as f:
         data = pickle.load(f)
 
     # user_submitted_populations = np.ones(80)/80
-    user_submitted_populations = data["user_submitted_populations"]#.numpy()
+    user_submitted_populations = data["user_submitted_populations"]  # .numpy()
     id = torch.load(data["config"]["data"]["submission"]["fname"])["id"]
 
     results_dict = {}
@@ -206,5 +212,5 @@ def optimal_q_kl(n_iter, x_start, A, Window, prob_gt, break_atol):
     DistributionToDistributionResultsValidator.from_dict(results_dict)
     with open(config["output_fname"], "wb") as f:
         pickle.dump(results_dict, f)
-    
+
     return results_dict
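Note on `sort_by_transport`: the new helper reorders the columns of a cost matrix by the mass-weighted row index of a transport plan, so that matched rows and columns end up near the diagonal. The sketch below isolates only that reordering step. It takes a hand-written plan as input instead of calling `compute_wasserstein_between_distributions_from_weights_and_cost`, and the name `order_columns_by_plan` and the toy matrices are assumptions made for illustration.

```python
import numpy as np


def order_columns_by_plan(cost, transport):
    """Reorder columns of `cost` by the mass-weighted row index of `transport`."""
    m, _ = cost.shape
    # For each column j: sum_i i * transport[i, j].
    weighted_row = (transport * np.arange(m)[:, None]).sum(axis=0)
    indices = np.argsort(weighted_row)
    return cost[:, indices], indices


# Toy 3x3 cost matrix (made up).
cost = np.array(
    [
        [2.0, 0.0, 1.0],
        [0.0, 1.0, 2.0],
        [1.0, 2.0, 0.0],
    ]
)

# Hand-written plan with uniform marginals:
# row 0 -> column 1, row 1 -> column 0, row 2 -> column 2, each with mass 1/3.
transport = np.array(
    [
        [0.0, 1.0, 0.0],
        [1.0, 0.0, 0.0],
        [0.0, 0.0, 1.0],
    ]
) / 3.0

sorted_cost, indices = order_columns_by_plan(cost, transport)
print(indices)
print(sorted_cost)
```

In this toy case the plan is a scaled permutation, so after reordering each row's matched column sits on the diagonal.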
diff --git a/src/cryo_challenge/_map_to_map/map_to_map_distance_matrix.py b/src/cryo_challenge/_map_to_map/map_to_map_distance_matrix.py
@@ -42,7 +42,7 @@ def run(config):
     user_submission_label = submission[label_key]
 
     # n_trunc = 10
-    metadata_gt = pd.read_csv(config["data"]["ground_truth"]["metadata"])#[:n_trunc]
+    metadata_gt = pd.read_csv(config["data"]["ground_truth"]["metadata"])  # [:n_trunc]
 
     results_dict = {}
     results_dict["config"] = config

diff --git a/src/cryo_challenge/_ploting/plotting_utils.py b/src/cryo_challenge/_ploting/plotting_utils.py
@@ -0,0 +1,7 @@
+import numpy as np
+
+
+def res_at_fsc_threshold(fscs, threshold=0.5):
+    res_fsc_half = np.argmin(fscs > threshold, axis=-1)
+    fraction_nyquist = 0.5 * res_fsc_half / fscs.shape[-1]
+    return res_fsc_half, fraction_nyquist
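Note on `res_at_fsc_threshold`: `np.argmin` on a boolean array returns the position of the first `False`, so the function reports the first shell at which the FSC is no longer above the threshold. One consequence worth knowing is that a curve that never drops to the threshold yields index 0, not the last shell. The sketch below copies the function as added in this commit and runs it on made-up FSC curves to show both cases.

```python
import numpy as np


def res_at_fsc_threshold(fscs, threshold=0.5):
    # Index of the first shell where the FSC is not above the threshold.
    res_fsc_half = np.argmin(fscs > threshold, axis=-1)
    # Same index expressed as 0.5 * index / number_of_shells.
    fraction_nyquist = 0.5 * res_fsc_half / fscs.shape[-1]
    return res_fsc_half, fraction_nyquist


# Made-up FSC curves, one per row.
fscs = np.array(
    [
        [1.0, 0.9, 0.6, 0.4, 0.2],  # first value <= 0.5 is at index 3
        [1.0, 0.9, 0.8, 0.7, 0.6],  # never reaches the threshold
    ]
)

idx, frac = res_at_fsc_threshold(fscs)
print(idx)   # first row gives 3, second row gives 0
print(frac)
```

Callers that need to tell "never crossed" apart from "crossed at shell 0" could check `(fscs > threshold).all(axis=-1)` separately.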
diff --git a/src/cryo_challenge/data/__init__.py b/src/cryo_challenge/data/__init__.py
@@ -1,6 +1,18 @@
-from ._validation.config_validators import validate_input_config_disttodist as validate_input_config_disttodist
-from ._validation.config_validators import validate_config_dtd_optimal_q_kl as validate_config_dtd_optimal_q_kl
-from cryo_challenge.data._validation.output_validators import DistributionToDistributionResultsValidator as DistributionToDistributionResultsValidator
-from cryo_challenge.data._validation.output_validators import MetricDistToDistValidator as MetricDistToDistValidator
-from cryo_challenge.data._validation.output_validators import ReplicateValidatorEMD as ReplicateValidatorEMD
-from cryo_challenge.data._validation.output_validators import ReplicateValidatorKL as ReplicateValidatorKL
+from ._validation.config_validators import (
+    validate_input_config_disttodist as validate_input_config_disttodist,
+)
+from ._validation.config_validators import (
+    validate_config_dtd_optimal_q_kl as validate_config_dtd_optimal_q_kl,
+)
+from cryo_challenge.data._validation.output_validators import (
+    DistributionToDistributionResultsValidator as DistributionToDistributionResultsValidator,
+)
+from cryo_challenge.data._validation.output_validators import (
+    MetricDistToDistValidator as MetricDistToDistValidator,
+)
+from cryo_challenge.data._validation.output_validators import (
+    ReplicateValidatorEMD as ReplicateValidatorEMD,
+)
+from cryo_challenge.data._validation.output_validators import (
+    ReplicateValidatorKL as ReplicateValidatorKL,
+)
diff --git a/src/cryo_challenge/data/_io/svd_io_utils.py b/src/cryo_challenge/data/_io/svd_io_utils.py
@@ -145,14 +145,16 @@ def load_ref_vols(box_size_ds: int, path_to_volumes: str, dtype=torch.float32):
 
     # Reshape volumes to correct size
     if volumes.dim() == 2:
-        box_size = int(round((float(volumes.shape[-1]) ** (1. / 3.))))
+        box_size = int(round((float(volumes.shape[-1]) ** (1.0 / 3.0))))
         volumes = torch.reshape(volumes, (-1, box_size, box_size, box_size))
     elif volumes.dim() == 4:
         pass
     else:
-        raise ValueError(f"The shape of the volumes stored in {path_to_volumes} have the unexpected shape "
-                         f"{torch.shape}. Please, review the file and regenerate it so that volumes stored hasve the "
-                         f"shape (num_vols, box_size ** 3) or (num_vols, box_size, box_size, box_size).")
+        raise ValueError(
+            f"The volumes stored in {path_to_volumes} have the unexpected shape "
+            f"{volumes.shape}. Please review the file and regenerate it so that the stored volumes have the "
+            f"shape (num_vols, box_size ** 3) or (num_vols, box_size, box_size, box_size)."
+        )
 
     volumes_ds = torch.empty(
         (volumes.shape[0], box_size_ds, box_size_ds, box_size_ds), dtype=dtype