Merge dev into main #61

Merged
merged 20 commits into from
Aug 5, 2024
1 change: 1 addition & 0 deletions .github/workflows/testing.yml
@@ -15,6 +15,7 @@ jobs:
strategy:
matrix:
python-version: ["3.8", "3.9", "3.10", "3.11"]
fail-fast: false


steps:
18 changes: 12 additions & 6 deletions .gitignore
@@ -1,3 +1,15 @@
# downloaded data
data/dataset_2_submissions
data/dataset_1_submissions
data/dataset_2_ground_truth

# data for testing and resulting outputs
tests/data/Ground_truth
tests/data/dataset_2_submissions/
tests/data/unprocessed_dataset_2_submissions/submission_x/
tests/results/


# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
@@ -158,9 +170,3 @@ cython_debug/
# and can be added to the global gitignore or merged into this file. For a more nuclear
# option (not recommended) you can uncomment the following to ignore the entire idea folder.
#.idea/

# Tutorials folder
tutorials/*

# Config file templates
config_files/*
12 changes: 6 additions & 6 deletions README.md
@@ -1,7 +1,7 @@
<h1 align='center'>Cryo-EM Heterogeneity Challenge</h1>

<p align="center">

<img alt="Supported Python versions" src="https://img.shields.io/badge/Supported_Python_Versions-3.8_%7C_3.9_%7C_3.10_%7C_3.11-blue">
<img alt="GitHub Downloads (all assets, all releases)" src="https://img.shields.io/github/downloads/flatironinstitute/Cryo-EM-Heterogeneity-Challenge-1/total">
<img alt="GitHub branch check runs" src="https://img.shields.io/github/check-runs/flatironinstitute/Cryo-EM-Heterogeneity-Challenge-1/main">
@@ -10,13 +10,13 @@
</p>

<p align="center">

<img alt="Cryo-EM Heterogeneity Challenge" src="https://simonsfoundation.imgix.net/wp-content/uploads/2023/05/15134456/Screenshot-2023-05-15-at-1.39.07-PM.png?auto=format&q=90">

</p>



This repository contains the code used to analyse the submissions for the [Inaugural Flatiron Cryo-EM Heterogeneity Challenge](https://www.simonsfoundation.org/flatiron/center-for-computational-biology/structural-and-molecular-biophysics-collaboration/heterogeneity-in-cryo-electron-microscopy/).

# Scope
@@ -26,13 +26,13 @@ This repository explains how to preprocess a submission (80 maps and correspondi
This is a work in progress. While the code will probably not change, we are still writing better tutorials and documentation and exploring other ideas for analyzing the data. We are also making it easier for others to contribute their own metrics and methods, and we are in the process of distributing the code on PyPI.

# Accessing the data
The data is available via the Open Science Framework project [The Inaugural Flatiron Institute Cryo-EM Heterogeneity Community Challenge](https://osf.io/8h6fz/). You can download it via a web browser, or programmatically with wget as per [this script](https://github.com/flatironinstitute/Cryo-EM-Heterogeneity-Challenge-1/blob/main/tests/scripts/fetch_test_data.sh).
The data is available via the Open Science Framework project [The Inaugural Flatiron Institute Cryo-EM Heterogeneity Community Challenge](https://osf.io/8h6fz/). You can download it via a web browser, or programmatically with wget as per [this script](https://github.com/flatironinstitute/Cryo-EM-Heterogeneity-Challenge-1/blob/main/data/fetch_data.sh).

**_NOTE_**: We recommend downloading the data with the script and wget, as downloads from the web browser might be unstable.

# Installation

## Stable installation
## Stable installation
Installing this repository is simple. We recommend creating a virtual environment (using conda or pyenv), since we have dependencies such as PyTorch or Aspire, which are better dealt with in an isolated environment. After creating your environment, make sure to activate it and run

```bash
@@ -63,7 +63,7 @@ pytest tests/test_distribution_to_distribution.py
If you want to run our code on the full challenge data, or your own local data, please complete the following steps

### 1. Download the full challenge data from [The Inaugural Flatiron Institute Cryo-EM Heterogeneity Community Challenge](https://osf.io/8h6fz/)
You can do this through the web browser, or programmatically with wget (you can get inspiration from [this script](https://github.com/flatironinstitute/Cryo-EM-Heterogeneity-Challenge-1/blob/main/tests/scripts/fetch_test_data.sh), which is just for the test data, not the full datasets)
You can do this through the web browser, or programmatically with wget (you can use [this script](https://github.com/flatironinstitute/Cryo-EM-Heterogeneity-Challenge-1/blob/main/data/fetch_data.sh); this will download around 220 GB of data)

### 2. Modify the config files and run the commands on the full challenge data
Point to the path where the data is stored locally
2 changes: 1 addition & 1 deletion config_files/config_distribution_to_distribution.yaml
@@ -4,7 +4,7 @@ metrics:
- corr
- bioem
- fsc
gt_metadata_fname: data/metadata.csv
gt_metadata_fname: data/dataset_2_ground_truth/metadata.csv
n_replicates: 30
n_pool_microstate: 5
replicate_fraction: 0.9
8 changes: 4 additions & 4 deletions config_files/config_map_to_map_distance_matrix.yaml
@@ -2,16 +2,16 @@ data:
n_pix: 224
psize: 2.146
submission:
fname: data/submission_0.pt
fname: data/dataset_2_ground_truth/submission_0.pt
volume_key: volumes
metadata_key: populations
label_key: id
ground_truth:
volumes: data/maps_gt_flat.pt
metadata: data/metadata.csv
volumes: data/dataset_2_ground_truth/maps_gt_flat.pt
metadata: data/dataset_2_ground_truth/metadata.csv
mask:
do: true
volume: data/mask_dilated_wide_224x224.mrc
volume: data/dataset_2_ground_truth/mask_dilated_wide_224x224.mrc
analysis:
metrics:
- l2
2 changes: 1 addition & 1 deletion config_files/config_plotting.yaml
@@ -1,4 +1,4 @@
gt_metadata: path/to/metadata.csv
gt_metadata: data/dataset_2_ground_truth/metadata.csv

map2map_results:
- path/to/map2map_results_1.pkl
2 changes: 1 addition & 1 deletion config_files/config_svd.yaml
@@ -3,7 +3,7 @@ box_size_ds: 32
submission_list: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
experiment_mode: "all_vs_ref" # options are "all_vs_all", "all_vs_ref"
# optional unless experiment_mode is "all_vs_ref"
path_to_reference: /path/to/reference
path_to_reference: /path/to/reference/volumes.pt
dtype: "float32" # options are "float32", "float64"
output_options:
# path will be created if it does not exist
21 changes: 21 additions & 0 deletions data/fetch_data.sh
@@ -0,0 +1,21 @@
mkdir -p data/dataset_2_submissions data/dataset_1_submissions data/dataset_2_ground_truth

# dataset 1 submissions
for i in {0..10}
do
wget https://files.osf.io/v1/resources/8h6fz/providers/dropbox/dataset_1_submissions/submission_${i}.pt?download=true -O data/dataset_1_submissions/submission_${i}.pt
done

# dataset 2 submissions
for i in {0..11}
do
wget https://files.osf.io/v1/resources/8h6fz/providers/dropbox/dataset_2_submissions/submission_${i}.pt?download=true -O data/dataset_2_submissions/submission_${i}.pt
done

# ground truth

wget https://files.osf.io/v1/resources/8h6fz/providers/dropbox/Ground_truth/maps_gt_flat.pt?download=true -O data/dataset_2_ground_truth/maps_gt_flat.pt

wget https://files.osf.io/v1/resources/8h6fz/providers/dropbox/Ground_truth/metadata.csv?download=true -O data/dataset_2_ground_truth/metadata.csv

wget https://files.osf.io/v1/resources/8h6fz/providers/dropbox/Ground_truth/mask_dilated_wide_224x224.mrc?download=true -O data/dataset_2_ground_truth/mask_dilated_wide_224x224.mrc
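
The download layout used by `data/fetch_data.sh` can also be expressed in Python, which is handy for checking which files are still missing after an interrupted download. The following is a minimal sketch, not part of this PR: the function names `build_download_plan` and `missing_files` are hypothetical, and the URLs and submission counts are copied from the script above. It only builds the list of URLs and destinations and does not download anything.

```python
from pathlib import Path

# Base URL copied from data/fetch_data.sh
OSF_BASE = "https://files.osf.io/v1/resources/8h6fz/providers/dropbox"


def build_download_plan(root="data"):
    """Return (url, destination) pairs mirroring data/fetch_data.sh (hypothetical helper)."""
    root = Path(root)
    plan = []
    # The script loops over submissions 0..10 for dataset 1 and 0..11 for dataset 2
    for dataset, n_submissions in (
        ("dataset_1_submissions", 11),
        ("dataset_2_submissions", 12),
    ):
        for i in range(n_submissions):
            name = f"submission_{i}.pt"
            url = f"{OSF_BASE}/{dataset}/{name}?download=true"
            plan.append((url, root / dataset / name))
    # Ground-truth files live under Ground_truth/ on OSF but are saved
    # locally under dataset_2_ground_truth/
    for name in ("maps_gt_flat.pt", "metadata.csv", "mask_dilated_wide_224x224.mrc"):
        url = f"{OSF_BASE}/Ground_truth/{name}?download=true"
        plan.append((url, root / "dataset_2_ground_truth" / name))
    return plan


def missing_files(plan):
    """List the destinations in the plan that are not on disk yet."""
    return [dest for _, dest in plan if not dest.exists()]


plan = build_download_plan()
```

Calling `missing_files(plan)` after running the shell script gives a quick list of anything that still needs to be fetched.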
32 changes: 16 additions & 16 deletions pyproject.toml
@@ -38,29 +38,29 @@ classifiers = [
"Programming Language :: Python :: Implementation :: PyPy",
]
dependencies = [
"torch<=2.3.1",
"numpy<=2.0.0",
"natsort<=8.4.0",
"pandas<=2.2.2",
"dataclasses_json<=0.6.7",
"mrcfile<=1.5.0",
"scipy<=1.13.1",
"cvxpy<=1.5.2",
"POT<=0.9.3",
"aspire<=0.12.2",
"jupyter<=1.0.0",
"osfclient<=0.0.5",
"seaborn<=0.13.2",
"ipyfilechooser<=0.6.0",
"torch",
"numpy",
"natsort",
"pandas",
"dataclasses_json",
"mrcfile",
"scipy",
"cvxpy",
"POT",
"aspire",
"jupyter",
"osfclient",
"seaborn",
"ipyfilechooser",
"omegaconf"
]

[project.optional-dependencies]
dev = [
"pytest<=8.2.2",
"pytest",
"mypy",
"pre-commit",
"ruff",
"omegaconf<=2.3.0"
]

[project.urls]
2 changes: 2 additions & 0 deletions src/cryo_challenge/_preprocessing/dataloader.py
@@ -65,6 +65,8 @@ def validate_submission_config(self):
                raise ValueError(f"Pixel size not found for submission {key}")
            if "align" not in value.keys():
                raise ValueError(f"Align not found for submission {key}")
            if "flip" not in value.keys():
                raise ValueError(f"Flip not found for submission {key}")

            if not os.path.exists(value["path"]):
                raise ValueError(f"Path {value['path']} does not exist")
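
With `flip` now mandatory, every submission entry must carry it or validation fails. The check can be sketched as a standalone function; this is an illustration only, assuming the required keys are the ones that appear in the test and tutorial configs in this PR (`name`, `align`, `flip`, `box_size`, `pixel_size`, `path`). The names `REQUIRED_KEYS` and `validate_submission_entry` are hypothetical and do not exist in the repository.

```python
import os

# Assumption: keys taken from the tutorial and test configs in this PR
REQUIRED_KEYS = ("name", "align", "flip", "box_size", "pixel_size", "path")


def validate_submission_entry(key, value):
    """Raise ValueError if an entry lacks a required key or its path is missing."""
    for required in REQUIRED_KEYS:
        if required not in value:
            raise ValueError(f"{required!r} not found for submission {key}")
    if not os.path.exists(value["path"]):
        raise ValueError(f"Path {value['path']} does not exist")


# An entry written before this PR (no "flip" key) is now rejected
old_entry = {"name": "s", "align": 1, "box_size": 244, "pixel_size": 2.146, "path": "."}
try:
    validate_submission_entry("0", old_entry)
    error_message = None
except ValueError as err:
    error_message = str(err)
```

Existing submission configs therefore need `"flip": 0` added explicitly to keep their previous behaviour.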
10 changes: 9 additions & 1 deletion src/cryo_challenge/_preprocessing/preprocessing_pipeline.py
@@ -107,6 +107,11 @@ def preprocess_submissions(submission_dataset, config):
        print(" Centering submission")
        volumes = center_submission(volumes, pixel_size=pixel_size_gt)

        # flip handedness
        if submission_dataset.submission_config[str(idx)]["flip"] == 1:
            print(" Flipping handedness of submission")
            volumes = volumes.flip(-1)

        # align to GT
        if submission_dataset.submission_config[str(idx)]["align"] == 1:
            print(" Aligning submission to ground truth")
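
The flip step reverses each volume along its last axis. A mirror reflection cannot be undone by any rotation, which is presumably why it is applied as a separate step before alignment. The sketch below illustrates the operation on a tiny nested-list "volume" in pure Python; `flip_last_axis` is a hypothetical stand-in for `volumes.flip(-1)` and is not code from the repository.

```python
def flip_last_axis(volume):
    """Reverse a 3D nested-list volume along its last axis.

    Pure-Python stand-in for calling .flip(-1) on a tensor.
    """
    return [[row[::-1] for row in plane] for plane in volume]


# A tiny 2x2x3 "volume" with distinct voxel values
volume = [
    [[0, 1, 2], [3, 4, 5]],
    [[6, 7, 8], [9, 10, 11]],
]
mirrored = flip_last_axis(volume)
```

Applying the flip twice returns the original volume, so setting `"flip": 1` on an already-correct submission would invert its handedness rather than leave it unchanged.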
@@ -124,7 +129,10 @@ def preprocess_submissions(submission_dataset, config):
        print(f" submission saved as submission_{idx}.pt")
        print(f"Preprocessing submission {idx} complete")

    with open("hash_table.json", "w") as f:
    hash_table_path = os.path.join(
        config["output_path"], "submission_to_icecream_table.json"
    )
    with open(hash_table_path, "w") as f:
        json.dump(hash_table, f, indent=4)

    return
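
The change above moves the lookup table from the current working directory into the configured output directory. A minimal, self-contained sketch of that pattern follows. The helper name `save_table` and the table contents are made up for illustration, and creating the directory with `os.makedirs` is an assumption added here, not something this hunk shows.

```python
import json
import os
import tempfile


def save_table(table, output_path, fname="submission_to_icecream_table.json"):
    """Write `table` as indented JSON inside `output_path` and return the file path."""
    # Assumption: create the output directory if it does not exist yet
    os.makedirs(output_path, exist_ok=True)
    table_path = os.path.join(output_path, fname)
    with open(table_path, "w") as f:
        json.dump(table, f, indent=4)
    return table_path


# Round-trip check in a temporary directory (table contents are hypothetical)
with tempfile.TemporaryDirectory() as tmp_dir:
    path = save_table({"submission_0": "flavor_a"}, os.path.join(tmp_dir, "out"))
    with open(path) as f:
        round_trip = json.load(f)
```

Writing next to the preprocessed submissions keeps the table with the outputs it describes, regardless of where the pipeline is launched from.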
Original file line number Diff line number Diff line change
@@ -9,4 +9,4 @@ cvxpy_solver: ECOS
optimal_q_kl:
n_iter: 100000
break_atol: 0.0001
output_fname: results/test_distribution_to_distribution_submission_0.pkl
output_fname: tests/results/test_distribution_to_distribution_submission_0.pkl
12 changes: 6 additions & 6 deletions tests/config_files/test_config_map_to_map.yaml
@@ -1,17 +1,17 @@
data:
n_pix: 224
psize: 2.146
psize: 2.146
submission:
fname: tests/data/dataset_2_submissions/test_submission_0_n8.pt
volume_key: volumes
metadata_key: populations
label_key: id
ground_truth:
volumes: tests/data/Ground_truth/test_maps_gt_flat_10.pt
metadata: tests/data/Ground_truth/test_metadata_10.csv
mask:
volumes: tests/data/Ground_truth/test_maps_gt_flat_10.pt
metadata: tests/data/Ground_truth/test_metadata_10.csv
mask:
do: true
volume: data/Ground_truth/mask_dilated_wide_224x224.mrc
volume: tests/data/Ground_truth/mask_dilated_wide_224x224.mrc
analysis:
metrics:
- l2
@@ -20,4 +20,4 @@ analysis:
normalize:
do: true
method: median_zscore
output: tests/results/test_map_to_map_distance_matrix_submission_0.pkl
output: tests/results/test_map_to_map_distance_matrix_submission_0.pkl
Original file line number Diff line number Diff line change
@@ -11,6 +11,7 @@
"align": 1,
"box_size": 244,
"pixel_size": 2.146,
"path": "tests/data/unprocessed_dataset_2_submissions/submission_x"
"path": "tests/data/unprocessed_dataset_2_submissions/submission_x",
"flip": 1
}
}
}
8 changes: 4 additions & 4 deletions tests/scripts/fetch_test_data.sh
@@ -1,11 +1,11 @@
mkdir -p tests/data/dataset_2_submissions data/dataset_2_submissions tests/results tests/data/unprocessed_dataset_2_submissions/submission_x tests/data/Ground_truth/ data/Ground_truth
mkdir -p tests/data/dataset_2_submissions tests/data/dataset_2_submissions tests/results tests/data/unprocessed_dataset_2_submissions/submission_x tests/data/Ground_truth/ tests/data/Ground_truth
wget https://files.osf.io/v1/resources/8h6fz/providers/dropbox/tests/dataset_2_submissions/test_submission_0_n8.pt?download=true -O tests/data/dataset_2_submissions/test_submission_0_n8.pt
ADIR=$(pwd)
ln -s $ADIR/tests/data/dataset_2_submissions/test_submission_0_n8.pt $ADIR/tests/data/dataset_2_submissions/submission_0.pt # symlink for svd which needs submission_0.pt for filename
wget https://files.osf.io/v1/resources/8h6fz/providers/dropbox/tests/Ground_truth/test_maps_gt_flat_10.pt?download=true -O tests/data/Ground_truth/test_maps_gt_flat_10.pt
wget https://files.osf.io/v1/resources/8h6fz/providers/dropbox/tests/Ground_truth/test_metadata_10.csv?download=true -O tests/data/Ground_truth/test_metadata_10.csv
wget https://files.osf.io/v1/resources/8h6fz/providers/dropbox/tests/Ground_truth/1.mrc?download=true -O tests/data/Ground_truth/1.mrc
wget https://files.osf.io/v1/resources/8h6fz/providers/dropbox/Ground_truth/mask_dilated_wide_224x224.mrc?download=true -O data/Ground_truth/mask_dilated_wide_224x224.mrc
wget https://files.osf.io/v1/resources/8h6fz/providers/dropbox/tests/Ground_truth/test_metadata_10.csv?download=true -O tests/data/Ground_truth/test_metadata_10.csv
wget https://files.osf.io/v1/resources/8h6fz/providers/dropbox/tests/Ground_truth/1.mrc?download=true -O tests/data/Ground_truth/1.mrc
wget https://files.osf.io/v1/resources/8h6fz/providers/dropbox/Ground_truth/mask_dilated_wide_224x224.mrc?download=true -O tests/data/Ground_truth/mask_dilated_wide_224x224.mrc
for FILE in 1.mrc 2.mrc 3.mrc 4.mrc populations.txt
do
wget https://files.osf.io/v1/resources/8h6fz/providers/dropbox/tests/unprocessed_dataset_2_submissions/submission_x/${FILE}?download=true -O tests/data/unprocessed_dataset_2_submissions/submission_x/${FILE}
2 changes: 2 additions & 0 deletions tutorials/1_tutorial_preprocessing.ipynb
@@ -136,13 +136,15 @@
" 0: {\n",
" \"name\": \"submission1\",\n",
" \"align\": 0,\n",
" \"flip\": 0,\n",
" \"box_size\": 144,\n",
" \"pixel_size\": 1.073 * 2,\n",
" \"path\": submission1_path.selected_path,\n",
" },\n",
" 1: {\n",
" \"name\": \"submission2\",\n",
" \"align\": 1,\n",
" \"flip\": 1,\n",
" \"box_size\": 288,\n",
" \"pixel_size\": 1.073,\n",
" \"path\": submission2_path.selected_path,\n",