=== WORK IN PROGRESS ===
- Submodule initialization does not work because DLUP is not yet publicly available
- Current scripts with singularity calls might not work, since DLUP is installed as editable and the entire hissl repo is mounted into the singularity container, which leaves an empty DLUP directory inside the container
- For the time being, it is recommended to remove `--bind $HISSL_ROOT:/hissl` from the singularity options. This means that changes you make to `hissl` are not reflected inside the container, but it allows you to run all the DeepSMILE reproduction scripts.
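A hypothetical sketch of this adjustment, assuming a typical `singularity exec` call; the image name (`hissl.sif`) and the command are placeholders, and only the `--bind` flag itself is taken from this note:

```bash
# Original style of call: mounts the hissl checkout over /hissl inside the
# container, which hides the DLUP installation baked into the image.
singularity exec --nv --bind $HISSL_ROOT:/hissl hissl.sif python -c "import dlup"

# Recommended for now: drop the bind so the DLUP installation inside the
# image is used instead.
singularity exec --nv hissl.sif python -c "import dlup"
```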
=========================
HISSL performs self-supervised pre-training of a feature extractor on histopathology data. It is used for *DeepSMILE: Contrastive self-supervised pre-training benefits MSI and HRD classification directly from H&E whole-slide images in colorectal and breast cancer*, and is meant to
- allow reproduction of the results of DeepSMILE
- be used on other datasets and other tasks by other researchers
Essentially, the repository contains the following:
- DLUP (third_party/dlup): Deep Learning Utilities for Pathology. This repository eases Whole-Slide Image (WSI) preprocessing and provides dataset classes that read tiles directly from WSIs, without requiring a separate preprocessing step.
- A fork of VISSL (third_party/vissl). VISSL is a repository for high-performance self-supervised learning with a large variety of state-of-the-art self-supervised learning methods. The fork contains custom data sources (dataset classes) and saves the extracted features in `h5` format with additional metadata.
- A docker image (docker/README.md) that has the above repositories properly installed, so that no complex set-up is required and the pre-training can be run on any OS.
- Reproduction scripts (tools/reproduce_deepsmile) that can be run using `bash` and perform all the downloading, preprocessing, pretraining, and feature extraction steps for the feature learning part of *DeepSMILE: Contrastive self-supervised pre-training benefits MSI and HRD classification directly from H&E whole-slide images in colorectal and breast cancer*.
- The pre-trained weights of the models trained for DeepSMILE.
- Free software: MIT license
- Please use docker. See `docker/README.md`.

Run `tools/reproduce_deepsmile.sh` on
- a Linux server
- with more than 5 TB of storage space
- with 1-4 CUDA-enabled GPUs
- and with `singularity` installed
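For example, a minimal launch could look as follows (a sketch, assuming the script is run from the repository root and takes no required arguments; the log file name is illustrative):

```bash
# Sanity checks before starting the long-running pipeline.
nvidia-smi              # confirm the CUDA GPUs are visible
singularity --version   # confirm singularity is installed
df -h .                 # confirm enough (>5 TB) storage is available

# The full pipeline runs for a long time, so keep it alive after logout.
nohup bash tools/reproduce_deepsmile.sh > reproduce_deepsmile.log 2>&1 &
tail -f reproduce_deepsmile.log
```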
This script will:
- Check that you have the singularity container
- Download all TCGA-BC WSIs from the TCGA repository
- Precompute masks for TCGA-BC using DLUP
- Download TCGA-CRCk tiles from Zenodo
- Create train-val-test splits for TCGA-BC using a label file from Kather (2019)
- Pretrain (5x) ShufflenetV2x1_0 on TCGA-BC. These models should be similar to those downloaded by `model_zoo/tcga-bc/download_model.sh` (see the example after this list)
- Pretrain (1x) Resnet18 on TCGA-CRCk. This model should be similar to the one downloaded by `model_zoo/tcga-crck/download_model.sh`
- Extract and save all features for TCGA-BC and TCGA-CRCk
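Instead of re-running the pretraining, the reference checkpoints can be fetched with the model zoo scripts mentioned above (a sketch, assuming they are run from the repository root; their output locations are defined by the scripts themselves):

```bash
# Fetch the reference pre-trained weights instead of re-training.
bash model_zoo/tcga-bc/download_model.sh     # ShufflenetV2x1_0 pre-trained on TCGA-BC
bash model_zoo/tcga-crck/download_model.sh   # Resnet18 pre-trained on TCGA-CRCk
```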
Our fork of VISSL contains a number of changes that adapt it for histopathology data. In summary:
- We add a ShuffleNet backbone (third_party/vissl/vissl/models/trunks/hissl_shufflenet.py)
- We add a dataset for TCGA-CRCk (third_party/vissl/vissl/data/kather_msi_dataset.py)
- We add a DLUP dataset to read tiles directly from WSIs and load/compute masks (third_party/vissl/vissl/data/dlup_dataset.py)
- We add saving of features as H5 datasets for both TCGA-CRCk and TCGA-BC (third_party/vissl/vissl/trainer/trainer_main_hissl.py)
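Since the extracted features are stored as HDF5 files, they can be inspected with `h5py`. A minimal sketch, assuming `h5py` is available; the file name `features.h5` is a placeholder, and the actual output paths and dataset names are determined by `trainer_main_hissl.py`:

```bash
python - <<'EOF'
import h5py

# Placeholder path; point this at one of the extracted feature files.
with h5py.File("features.h5", "r") as f:
    # List every group and dataset contained in the file.
    f.visititems(lambda name, obj: print(name, obj))
    # Print any file-level metadata attributes.
    print(dict(f.attrs))
EOF
```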