Merge branch 'dev' into resolution_metric

flatironinstitute · Aug 12, 2024 · a10f656
2 parents 3f86081 + fc99c96
commit a10f656
Showing 22 changed files with 162 additions and 126 deletions.
diff --git a/.github/workflows/main_merge_check.yml b/.github/workflows/main_merge_check.yml
@@ -0,0 +1,14 @@
+name: Check merging branch
+
+on:
+  pull_request:
+
+jobs:
+  check_branch:
+    runs-on: ubuntu-latest
+    steps:
+      - name: Check branch
+        if: github.base_ref == 'main' && github.head_ref != 'dev'
+        run: |
+          echo "ERROR: You can only merge to main from dev."
+          exit 1
diff --git a/.github/workflows/testing.yml b/.github/workflows/testing.yml
@@ -15,48 +15,30 @@ jobs:
     strategy:
       matrix:
         python-version: ["3.8", "3.9", "3.10", "3.11"]
+      fail-fast: false
+
 
     steps:
-      - uses: actions/checkout@v3
-
+      - uses: actions/checkout@v4
       - name: Set up Python ${{ matrix.python-version }}
-        uses: actions/setup-python@v4
+        uses: actions/setup-python@v5
         with:
           python-version: ${{ matrix.python-version }}
-
-      - name: Cache dependencies
-        id: cache_deps
-        uses: actions/cache@v3
-        with:
-          path: |
-            ${{ env.pythonLocation }}
-          key: venv-${{ runner.os }}-${{ env.pythonLocation }}-${{ hashFiles('**/pyproject.toml') }}
-
-      - name: Cache test data
-        id: cache_test_data
-        uses: actions/cache@v3
-        with:
-          path: |
-            tests/data
-            data
-          key: venv-${{ runner.os }}-${{ env.pythonLocation }}-${{ hashFiles('**/tests/scripts/fetch_test_data.sh') }}
+          cache: 'pip' # caching pip dependencies
 
       - name: Install dependencies
-        if: ${{ steps.cache_deps.outputs.cache-hit != 'true' }}
         run: |
           python -m pip install --upgrade pip
           pip install .
           pip install pytest omegaconf
-          
+
       - name: Get test data from OSF
-        if: ${{ steps.cache_test_data.outputs.cache-hit != 'true' }}
         run: |
           sh tests/scripts/fetch_test_data.sh
-          
+
       - name: Test with pytest
         run: |
           pytest tests/test_preprocessing.py
           pytest tests/test_svd.py
           pytest tests/test_map_to_map.py
           pytest tests/test_distribution_to_distribution.py
-          
diff --git a/.gitignore b/.gitignore
@@ -1,3 +1,15 @@
+# downloaded data
+data/dataset_2_submissions
+data/dataset_1_submissions
+data/dataset_2_ground_truth
+
+# data for testing and resulting outputs
+tests/data/Ground_truth
+tests/data/dataset_2_submissions/
+tests/data/unprocessed_dataset_2_submissions/submission_x/
+tests/results/
+
+
 # Byte-compiled / optimized / DLL files
 __pycache__/
 *.py[cod]
@@ -158,9 +170,3 @@ cython_debug/
 #  and can be added to the global gitignore or merged into this file.  For a more nuclear
 #  option (not recommended) you can uncomment the following to ignore the entire idea folder.
 #.idea/
-
-# Tutorials folder
-tutorials/*
-
-# Config file templates
-config_files/*
diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml
@@ -10,7 +10,6 @@ repos:
     -   id: trailing-whitespace
     -   id: end-of-file-fixer
     -   id: check-yaml
-    -   id: check-added-large-files
 - repo: https://github.com/astral-sh/ruff-pre-commit
   # Ruff version.
   rev: v0.3.4

diff --git a/README.md b/README.md
@@ -1,7 +1,7 @@
 <h1 align='center'>Cryo-EM Heterogeneity Challenge</h1>
 
 <p align="center">
-        
+
 <img alt="Supported Python versions" src="https://img.shields.io/badge/Supported_Python_Versions-3.8_%7C_3.9_%7C_3.10_%7C_3.11-blue">
 <img alt="GitHub Downloads (all assets, all releases)" src="https://img.shields.io/github/downloads/flatironinstitute/Cryo-EM-Heterogeneity-Challenge-1/total">
 <img alt="GitHub branch check runs" src="https://img.shields.io/github/check-runs/flatironinstitute/Cryo-EM-Heterogeneity-Challenge-1/main">
@@ -10,13 +10,13 @@
 </p>
 
 <p align="center">
-        
+
 <img alt="Cryo-EM Heterogeneity Challenge" src="https://simonsfoundation.imgix.net/wp-content/uploads/2023/05/15134456/Screenshot-2023-05-15-at-1.39.07-PM.png?auto=format&q=90">
 
 </p>
 
 
-        
+
 This repository contains the code used to analyse the submissions for the [Inaugural Flatiron Cryo-EM Heterogeneity Challenge](https://www.simonsfoundation.org/flatiron/center-for-computational-biology/structural-and-molecular-biophysics-collaboration/heterogeneity-in-cryo-electron-microscopy/).
 
 # Scope
@@ -26,13 +26,13 @@ This repository explains how to preprocess a submission (80 maps and correspondi
 This is a work in progress, while the code will probably not change, we are still writting better tutorials, documentation, and other ideas for analyzing the data. We are also in the process of making it easier for other people to contribute with their own metrics and methods. We are also in the process of distributing the code to PyPi.
 
 # Accesing the data
-The data is available via the Open Science Foundation project [The Inaugural Flatiron Institute Cryo-EM Heterogeneity Community Challenge](https://osf.io/8h6fz/). You can download via a web browser, or programatically with wget as per [this script](https://github.com/flatironinstitute/Cryo-EM-Heterogeneity-Challenge-1/blob/main/tests/scripts/fetch_test_data.sh).
+The data is available via the Open Science Foundation project [The Inaugural Flatiron Institute Cryo-EM Heterogeneity Community Challenge](https://osf.io/8h6fz/). You can download via a web browser, or programatically with wget as per [this script](https://github.com/flatironinstitute/Cryo-EM-Heterogeneity-Challenge-1/blob/main/data/fetch_data.sh).
 
 **_NOTE_**: We recommend downloadaing the data with the script and wget as the downloads from the web browser might be unstable.
 
 # Installation
 
-## Stable installation 
+## Stable installation
 Installing this repository is simply. We recommend creating a virtual environment (using conda or pyenv), since we have dependencies such as PyTorch or Aspire, which are better dealt with in an isolated environment. After creating your environment, make sure to activate it and run
 
 ```bash
@@ -63,7 +63,7 @@ pytest tests/test_distribution_to_distribution.py
 If you want to run our code on the full challenge data, or you own local data, please complete the following steps
 
 ### 1. Download the full challenge data from [The Inaugural Flatiron Institute Cryo-EM Heterogeneity Community Challenge](https://osf.io/8h6fz/)
-You can do this through the web browser, or programatically with wget (you can get inspiration from [this script](https://github.com/flatironinstitute/Cryo-EM-Heterogeneity-Challenge-1/blob/main/tests/scripts/fetch_test_data.sh), which is just for the test data, not the full datasets)
+You can do this through the web browser, or programatically with wget (you can use [this script](https://github.com/flatironinstitute/Cryo-EM-Heterogeneity-Challenge-1/blob/main/data/fetch_data.sh), this will download around 220 GB of data)
 
 ### 2. Modify the config files and run the commands on the full challenge data
 Point to the path where the data is locally

diff --git a/config_files/config_distribution_to_distribution.yaml b/config_files/config_distribution_to_distribution.yaml
@@ -4,7 +4,7 @@ metrics:
   - corr
   - bioem
   - fsc
-gt_metadata_fname: data/metadata.csv
+gt_metadata_fname: data/dataset_2_ground_truth/metadata.csv
 n_replicates: 30
 n_pool_microstate: 5
 replicate_fraction: 0.9

diff --git a/config_files/config_map_to_map_distance_matrix.yaml b/config_files/config_map_to_map_distance_matrix.yaml
@@ -2,16 +2,16 @@ data:
   n_pix: 224
   psize: 2.146 
   submission:
-    fname: data/submission_0.pt
+    fname: data/dataset_2_ground_truth/submission_0.pt
     volume_key: volumes
     metadata_key: populations
     label_key: id
   ground_truth:
-    volumes: data/maps_gt_flat.pt 
-    metadata: data/metadata.csv 
+    volumes: data/dataset_2_ground_truth/maps_gt_flat.pt 
+    metadata: data/dataset_2_ground_truth/metadata.csv 
   mask: 
     do: true
-    volume: data/mask_dilated_wide_224x224.mrc
+    volume: data/dataset_2_ground_truth/mask_dilated_wide_224x224.mrc
 analysis:
   metrics:
     - l2

diff --git a/config_files/config_plotting.yaml b/config_files/config_plotting.yaml
@@ -1,4 +1,4 @@
-gt_metadata: path/to/metadata.csv
+gt_metadata: data/dataset_2_ground_truth/metadata.csv
 
 map2map_results:
   - path/to/map2map_results_1.pkl

diff --git a/config_files/config_preproc.yaml b/config_files/config_preproc.yaml
@@ -1,5 +1,4 @@
 submission_config_file: submission_config.json
-seed_flavor_assignment: 0
 thresh_percentile: 93.0
 BOT_box_size: 32
 BOT_loss: wemd

diff --git a/config_files/config_svd.yaml b/config_files/config_svd.yaml
@@ -3,7 +3,7 @@ box_size_ds: 32
 submission_list: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
 experiment_mode: "all_vs_ref" # options are "all_vs_all", "all_vs_ref"
 # optional unless experiment_mode is "all_vs_ref"
-path_to_reference: /path/to/reference
+path_to_reference: /path/to/reference/volumes.pt
 dtype: "float32" # options are "float32", "float64"
 output_options:
   # path will be created if it does not exist

diff --git a/data/fetch_data.sh b/data/fetch_data.sh
@@ -0,0 +1,21 @@
+mkdir -p data/dataset_2_submissions data/dataset_1_submissions data/dataset_2_ground_truth
+
+# dataset 1 submissions
+for i in {0..10}
+do
+    wget https://files.osf.io/v1/resources/8h6fz/providers/dropbox/dataset_1_submissions/submission_${i}.pt?download=true -O data/dataset_1_submissions/submission_${i}.pt
+done
+
+# dataset 2 submissions
+for i in {0..11}
+do
+    wget https://files.osf.io/v1/resources/8h6fz/providers/dropbox/dataset_2_submissions/submission_${i}.pt?download=true -O data/dataset_2_submissions/submission_${i}.pt
+done
+
+# ground truth
+
+wget https://files.osf.io/v1/resources/8h6fz/providers/dropbox/Ground_truth/maps_gt_flat.pt?download=true -O data/dataset_2_ground_truth/maps_gt_flat.pt
+
+wget https://files.osf.io/v1/resources/8h6fz/providers/dropbox/Ground_truth/metadata.csv?download=true -O data/dataset_2_ground_truth/metadata.csv
+
+wget https://files.osf.io/v1/resources/8h6fz/providers/dropbox/Ground_truth/mask_dilated_wide_224x224.mrc?download=true -O data/dataset_2_ground_truth/mask_dilated_wide_224x224.mrc
diff --git a/pyproject.toml b/pyproject.toml
@@ -38,29 +38,29 @@ classifiers = [
   "Programming Language :: Python :: Implementation :: PyPy",
 ]
 dependencies = [
-  "torch<=2.3.1",
-  "numpy<=2.0.0",
-  "natsort<=8.4.0",
-  "pandas<=2.2.2",
-  "dataclasses_json<=0.6.7",
-  "mrcfile<=1.5.0",
-  "scipy<=1.13.1",
-  "cvxpy<=1.5.2",
-  "POT<=0.9.3",
-  "aspire<=0.12.2",
-  "jupyter<=1.0.0",
-  "osfclient<=0.0.5",
-  "seaborn<=0.13.2",
-  "ipyfilechooser<=0.6.0",
+  "torch",
+  "numpy",
+  "natsort",
+  "pandas",
+  "dataclasses_json",
+  "mrcfile",
+  "scipy",
+  "cvxpy",
+  "POT",
+  "aspire",
+  "jupyter",
+  "osfclient",
+  "seaborn",
+  "ipyfilechooser",
+  "omegaconf"
 ]
 
 [project.optional-dependencies]
 dev = [
-  "pytest<=8.2.2",
+  "pytest",
   "mypy",
   "pre-commit",
   "ruff",
-  "omegaconf<=2.3.0"
 ]
 
 [project.urls]

diff --git a/src/cryo_challenge/__init__.py b/src/cryo_challenge/__init__.py
@@ -0,0 +1 @@
+from cryo_challenge.__about__ import __version__
diff --git a/src/cryo_challenge/_preprocessing/dataloader.py b/src/cryo_challenge/_preprocessing/dataloader.py
@@ -25,7 +25,11 @@ class SubmissionPreprocessingDataLoader(Dataset):
 
     def __init__(self, submission_config):
         self.submission_config = submission_config
-        self.submission_paths, self.gt_path = self.extract_submission_paths()
+        self.validate_submission_config()
+
+        self.submission_paths, self.population_files, self.gt_path = (
+            self.extract_submission_paths()
+        )
         self.subs_index = [int(idx) for idx in list(self.submission_config.keys())[1:]]
         path_to_gt_ref = os.path.join(
             self.gt_path, self.submission_config["gt"]["ref_align_fname"]
@@ -53,30 +57,40 @@ def validate_submission_config(self):
                     raise ValueError("Box size not found for ground truth")
                 if "pixel_size" not in value.keys():
                     raise ValueError("Pixel size not found for ground truth")
+                if "ref_align_fname" not in value.keys():
+                    raise ValueError(
+                        "Reference align file name not found for ground truth"
+                    )
                 continue
             else:
                 if "path" not in value.keys():
                     raise ValueError(f"Path not found for submission {key}")
-                if "id" not in value.keys():
-                    raise ValueError(f"ID not found for submission {key}")
+                if "name" not in value.keys():
+                    raise ValueError(f"Name not found for submission {key}")
                 if "box_size" not in value.keys():
                     raise ValueError(f"Box size not found for submission {key}")
                 if "pixel_size" not in value.keys():
                     raise ValueError(f"Pixel size not found for submission {key}")
                 if "align" not in value.keys():
                     raise ValueError(f"Align not found for submission {key}")
-
+                if "populations_file" not in value.keys():
+                    raise ValueError(f"Population file not found for submission {key}")
+                if "flip" not in value.keys():
+                    raise ValueError(f"Flip not found for submission {key}")
+                if "submission_version" not in value.keys():
+                    raise ValueError(
+                        f"Submission version not found for submission {key}"
+                    )
                 if not os.path.exists(value["path"]):
                     raise ValueError(f"Path {value['path']} does not exist")
 
                 if not os.path.isdir(value["path"]):
                     raise ValueError(f"Path {value['path']} is not a directory")
 
-        ids = list(self.submission_config.keys())[1:]
-        if ids != list(range(len(ids))):
-            raise ValueError(
-                "Submission IDs should be integers starting from 0 and increasing by 1"
-            )
+                if not os.path.exists(value["populations_file"]):
+                    raise ValueError(
+                        f"Population file {value['populations_file']} does not exist"
+                    )
 
         return
 
@@ -135,13 +149,16 @@ def help(cls):
 
     def extract_submission_paths(self):
         submission_paths = []
+        population_files = []
         for key, value in self.submission_config.items():
             if key == "gt":
                 gt_path = value["path"]
 
             else:
                 submission_paths.append(value["path"])
-        return submission_paths, gt_path
+                population_files.append(value["populations_file"])
+
+        return submission_paths, population_files, gt_path
 
     def __len__(self):
         return len(self.submission_paths)
@@ -151,13 +168,9 @@ def __getitem__(self, idx):
             glob.glob(os.path.join(self.submission_paths[idx], "*.mrc"))
         )
         vol_paths = [vol_path for vol_path in vol_paths if "mask" not in vol_path]
-
         assert len(vol_paths) > 0, "No volumes found in submission directory"
 
-        populations = np.loadtxt(
-            os.path.join(self.submission_paths[idx], "populations.txt")
-        )
-        populations = torch.from_numpy(populations)
+        populations = torch.from_numpy(np.loadtxt(self.population_files[idx]))
 
         vol0 = mrcfile.open(vol_paths[0], mode="r")
         volumes = torch.zeros(