Multitask-VNN is an interpretable neural network-based model that predicts cell response to a set of drugs with similar function. This framework integrates information across multiple levels of cancer cell biology to understand drug response, and can serve to identify and explain biomarkers for clinical application.
Multitask-VNN characterizes each cell line using its genotype; the feature vector for each cell is a binary vector representing mutational status and copy number variations of the genes used in clinical panels like Foundation Medicine (n=718).
Related publication: Zhao, Singhal, et al. Cancer Mutations Converge on a Collection of Protein Assemblies to Predict Resistance to Replication Stress. Cancer Discov 1 March 2024; 14 (3): 508–523. https://doi.org/10.1158/2159-8290.CD-23-0641
The model training/testing scripts require the following environmental setup:
-
Hardware required for training a new model
- GPU server with CUDA>=11 installed
-
Software
- Python >=3.6
- Anaconda
- Relevant information for installing Anaconda can be found in: https://docs.conda.io/projects/conda/en/latest/user-guide/install/.
- PyTorch
- The current release of this model was trained/tested using PyTorch 1.8.0
- Depending on the specification of your machine, run appropriate command to install PyTorch. The installation command line can be found in https://pytorch.org/.
- For a LINUX-based GPU server with CUDA version 11.1, run the following command line:
conda install pytorch torchvision cudatoolkit=11.1 -c pytorch
-
Set up a virtual environment
- If you are training a new model or test the pre-trained model using a GPU server,
run the following command line:
to set up a virtual environment (cuda11_env).
conda env create -f cuda11_env.yml
- If you are training a new model or test the pre-trained model using a GPU server,
run the following command line:
to set up a virtual environment (cuda11_env).
Required input files:
-
Cell feature files:
- gene2ind.txt: make sure you are using gene2ind.txt file provided in this repository.
- cell2ind.txt: a tab-delimited file where the 1st column is index of cells and the 2nd column is the name of cells (genotypes).
- cell2mutation.txt: a comma-delimited file where each row has 718 binary values indicating each gene is mutated (1) or not (0). The column index of each gene should match with those in gene2ind.txt file. The line number should match with the indices of cells in cell2ind.txt file.
- cell2cndeletion.txt: a comma-delimited file where each row has 718 binary values indicating copy number deletion (1) (0 for not).
- cell2amplification.txt: a comma-delimited file where each row has 718 binary values indicating copy number amplification (1) (0 for not).
-
Test data file: test_data.txt
- A tab-delimited file containing all data points that you want to estimate drug response for. The 1st column contains cell IDs and rest of the columns contain drug response AUC.
To load a pre-trained model used for analyses in our manuscript and make prediction for the cell lines of your interest, execute the following:
-
Make sure you have gene2ind.txt, cell2ind.txt, cell2mutation.txt, cell2cndeletion.txt, cell2amplification.txt, and your file containing test data in proper format (examples are provided in sample folder)
-
To run the model in a GPU server, execute the following:
python predict.py -gene2id gene2ind.txt -cell2id cell2ind.txt -genotype cell2mutation.txt -cn_deletions cell2cndeletion.txt -cn_amplifications cell2amplification.txt -predict test_data.txt -hidden <path_to_directory_to_store_hidden_values> -result <path_to_directory_to_store_prediction_results> -load <path_to_model_file> -cuda <GPU_unit_to_use> -batchsize 2000 (or any other value)
- An example bash script (test.sh) is provided in sample folder.
To train a new model using a custom data set, first make sure that you have a proper virtual environment set up. Also make sure that you have all the required files to run the training scripts:
-
Cell feature files:
- gene2ind.txt: make sure you are using gene2ind.txt file provided in this repository.
- cell2ind.txt: a tab-delimited file where the 1st column is index of cells and the 2nd column is the name of cells (genotypes).
- cell2mutation.txt: a comma-delimited file where each row has 718 binary values indicating each gene is mutated (1) or not (0). The column index of each gene should match with those in gene2ind.txt file. The line number should match with the indices of cells in cell2ind.txt file.
- cell2cndeletion.txt: a comma-delimited file where each row has 718 binary values indicating copy number deletion (1) (0 for not).
- cell2amplification.txt: a comma-delimited file where each row has 718 binary values indicating copy number amplification (1) (0 for not).
-
Training data file: train_data.txt
- A tab-delimited file containing all data points that you want to use to train the model. The 1st column contains cell IDs and rest of the columns contain drug response AUC.
- To help create train data from a pan-drug file, a python script "create_train_data.py" can be executed. To execute it, please refer to "scripts/create_input.sh" bash script.
-
Ontology (hierarchy) file: ontology.txt
-
A tab-delimited file that contains the ontology (hierarchy) that defines the structure of a branch of a VNN that encodes the genotypes. The first column is always a term (assembly), and the second column is a term or a gene. The third column should be set to "default" when the line represents a link between terms, "gene" when the line represents an annotation link between a term and a gene. The following is an example describing a sample hierarchy.
GO:0045834 GO:0045923 default GO:0045834 GO:0043552 default GO:0045923 AKT2 gene GO:0045923 IL1B gene GO:0043552 PIK3R4 gene GO:0043552 SRC gene GO:0043552 FLT1 gene
- Example of the file (ontology.txt) is provided in sample folder.
-
There are several optional parameters that you can provide in addition to the input files:
-
-modeldir: a name of directory where you want to store the trained models. The default is set to "MODEL" in the current working directory.
-
-genotype_hiddens: a number of neurons to assign each subsystem in the hierarchy. The default is set to 12.
-
-epoch: the number of epoch to run during the training phase. The default is set to 300.
-
-batchsize: the size of each batch to process at a time. The deafult is set to 5000. You may increase this number to speed up the training process within the memory capacity of your GPU server.
-
lr: Learning rate. Default is set 0.001.
-
wd: Weight decay. Default is set 0.001.
-
-cuda: the ID of GPU unit that you want to use for the model training. The default setting is to use GPU 0.
- All the parameters are mentioned in the src/train_helper.py file.
Finally, to train a multitask-VNN, execute a command line similar to the example provided in sample/train.sh:
python -u train.py -onto ontology.txt
-gene2id gene2ind.txt
-cell2id cell2ind.txt
-genotype cell2mutation.txt
-cn_deletions cell2cndeletion.txt
-cn_amplifications cell2amplification.txt
-tasks task_list_RS.txt
-train train_data.txt
-modeldir sample/model
-genotype_hiddens 12
-epoch 100
-batchsize 64
-cuda 0
-lr 0.0005
-wd 0.0005