IMage Perturbation Autoencoder (IMPA) is a computer vision model performing style transfer on images of cells undergoing perturbation. IMPA learns a perturbation space and uses it to perform style transfer by conditioning the latent space of an autoencoder. Through this process, the model is used to translate a cell image into what it would look like had it been treated with a certain perturbation.
The perturbation space can be expressed as:
- A prior on the perturbation space (e.g. drug embeddings reflecting compound-specific physiochemical characteristics)
- Trainable embeddings optimized alongside the model
If trained on a meaniningful prior perturbation space, IMPA learns to map unseen drugs in proximity of drugs used for training. When proximity also involves functional similarity, IMPA is able to predict the effect of unseen drugs on control cells. Moreover, distances between the style encodings of different perturbations are correlated with distances in the phenotypic space. As a result, IMPA can be used to fastly inspect active compounds based on the comparison of the style vectors learned for different perturbations.
To run the model, clone this repository and create the environment via:
conda env create -f environment.yml
Navigate to the repository and install the Python package.
pip install -e .
All files related to the model are stored in the IMPA
folder.
utils.py
: contains helper functionssolver.py
: contains theSolver
class implementing the model setup, data loading and training loop.model.py
: implements the neural network modules and initialization function.main.py
: calls theSolver
class and implements training supported byseml
andsacred
.checkpoint.py
: implements the util class for handling saving and loading checkpoints.eval/eval.py
: contains the evaluation script used during training by theSolver
class.data/data_loader.py
: implementstorch
dataset and data loader wrappers around the image data.
We trained the models using the seml framework. Configurations can be found in the training_config
folder. IMPA can be trained both with and without the support of seml
. This is possible via two independent main files:
main.py
: train withseml
on theslurm
scheduling systemmain_not_seml.py
: train withoutseml
on theslurm
scheduling system via sbatch files
Scripts to run the code without seml
can be found in the scripts
folder. In a terminal, enter:
sbatch training_config.yaml
And the script will be submitted automatically. The logs of the run will be saved in the training_config/logs
folder.
For other scheduling systems the user may be required to apply minor modifications to the main.py
file to accept custom configuration files. For training with seml
we redirect the user to the official page of the package.
To train the model with the provided yaml files, adapt the .yaml
files to the experimental setup (i.e. add path strings referencing the used directories).
Datasets are available at:
- BBBC021 https://bbbc.broadinstitute.org/BBBC021
- BBBC025 https://bbbc.broadinstitute.org/BBBC025
- RxRx1 https://www.kaggle.com/c/recursion-cellular-image-classification/overview/resources
Model checkpoints and pre-processed data are made available here.