Transcriptional Prediction of Lethality (TrPLet)
If used, please cite: "Assembling a landscape of vulnerabilities across rare kidney cancers" (Li* & Sadagopan* et al.)
This repo provides the scripts and workflow to predict cancer dependency scores from tumor or cell-line RNA-seq data for a subset of highly predictable genes (N=657). Although you can predict dependency scores for all genes, the accuracy will be substantially lower since most genetic dependencies are not predictable from RNA-seq data alone. The most general workflow involves:
1. Generate/download isoform-level* RNA count data (e.g. RNA fastq -> bam -> counts using STAR/RSEM)
2. Merge your data with a large RNA-seq dataset (cell lines: DepMap/CCLE, tumors: TCGA)
3. Batch correct your data if needed (read the batch correction notes below before deciding)
4. Normalize RNA-seq counts
   - Calculate transcripts per million (TPM) for each isoform
   - Calculate gene TPM by summing isoform TPM per gene
   - Convert TPM to log2(TPM+1) so that expression values are approximately normally distributed
   - Z-score each feature (i.e. expression of each gene)
5. Reduce dimensionality (subset the train+test data to the top M features with the highest |Pearson correlation coefficient| to the dependency being predicted in the train data; by default M=5000)
6. Predict dependencies from your sample's normalized RNA-seq data
*Gene-level count data can also be used.
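The normalization steps listed above (counts to isoform TPM, gene-level sums, log2(TPM+1), z-score) can be sketched in Python with NumPy. This is a minimal illustration under stated assumptions, not the repo's own script: the function names, the toy counts, the isoform lengths, and the gene labels are all hypothetical, and the population standard deviation is assumed for z-scoring. In the actual workflow the z-score would be computed across the merged dataset (your samples plus the reference), and RSEM's `effective_length` column is one possible source for the lengths.

```python
import numpy as np

def counts_to_tpm(counts, lengths):
    """Convert an (n_isoforms, n_samples) count matrix to TPM.

    lengths: isoform lengths in bases, shape (n_isoforms,).
    """
    counts = np.asarray(counts, dtype=float)
    lengths = np.asarray(lengths, dtype=float)[:, None]
    rate = counts / lengths                       # reads per base
    return rate / rate.sum(axis=0, keepdims=True) * 1e6

def sum_isoforms_to_genes(tpm, gene_ids):
    """Sum isoform TPM rows that share a gene ID."""
    genes, inverse = np.unique(gene_ids, return_inverse=True)
    gene_tpm = np.zeros((len(genes), tpm.shape[1]))
    np.add.at(gene_tpm, inverse, tpm)             # unbuffered row-wise accumulation
    return genes, gene_tpm

def log2_zscore(gene_tpm):
    """log2(TPM+1), then z-score each gene (row) across samples."""
    logged = np.log2(gene_tpm + 1.0)
    mean = logged.mean(axis=1, keepdims=True)
    sd = logged.std(axis=1, keepdims=True)        # population SD (assumption)
    sd[sd == 0] = 1.0                             # avoid dividing by zero for flat genes
    return (logged - mean) / sd

# Toy data (hypothetical): 3 isoforms x 2 samples, two genes.
counts = [[100, 200], [300, 0], [600, 800]]
lengths = [1000, 1500, 2000]
gene_ids = ["A", "A", "B"]

tpm = counts_to_tpm(counts, lengths)
genes, gene_tpm = sum_isoforms_to_genes(tpm, gene_ids)
z = log2_zscore(gene_tpm)
```

Each column of `tpm` and `gene_tpm` sums to one million by construction, which is a quick sanity check after this step.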
The final model is trained on the entirety of DepMap and then applied to your dataset. To assess model performance, we used 5-fold cross-validation within DepMap.
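The feature selection and cross-validation described above can be sketched as follows. This is only an illustration under assumptions: the regression model is not specified in this section, so a closed-form ridge regression is swapped in as a stand-in, and all function names, the `alpha` parameter, and the synthetic data are hypothetical. The point it shows is that the top-M features are chosen from the training fold only and then applied to both train and test data.

```python
import numpy as np

def top_m_features(X_train, y_train, m=5000):
    """Indices of the m features with the largest |Pearson r| to y (train data only)."""
    Xc = X_train - X_train.mean(axis=0)
    yc = y_train - y_train.mean()
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    with np.errstate(divide="ignore", invalid="ignore"):
        r = (Xc.T @ yc) / denom
    r = np.nan_to_num(r)                          # constant features get r = 0
    m = min(m, X_train.shape[1])
    return np.argsort(-np.abs(r))[:m]

def fit_ridge(X, y, alpha=1.0):
    """Stand-in model (assumption): ridge regression with an unpenalized intercept."""
    Xb = np.column_stack([np.ones(len(X)), X])
    penalty = alpha * np.eye(Xb.shape[1])
    penalty[0, 0] = 0.0
    return np.linalg.solve(Xb.T @ Xb + penalty, Xb.T @ y)

def predict_ridge(w, X):
    return w[0] + X @ w[1:]

def cross_validate(X, y, m=5000, k=5, seed=0):
    """k-fold CV; returns Pearson r between out-of-fold predictions and y."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    preds = np.empty(len(y))
    for test_idx in folds:
        train_idx = np.setdiff1d(np.arange(len(y)), test_idx)
        keep = top_m_features(X[train_idx], y[train_idx], m)   # train fold only
        w = fit_ridge(X[train_idx][:, keep], y[train_idx])
        preds[test_idx] = predict_ridge(w, X[test_idx][:, keep])
    return np.corrcoef(preds, y)[0, 1]

# Synthetic demo (hypothetical): rows are samples, columns are z-scored genes,
# and the "dependency score" depends on features 3 and 7 only.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 50))
y_demo = 2 * X_demo[:, 3] - 2 * X_demo[:, 7] + rng.normal(scale=0.1, size=200)

selected = top_m_features(X_demo, y_demo, m=2)
cv_r = cross_validate(X_demo, y_demo, m=2, k=5)
```

Selecting features inside each fold, rather than once on the full data, keeps the held-out samples from influencing which genes are used.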
The workflow differs slightly depending on input data. There are three major considerations:
- If predicting on TCGA tumor RNA-seq, you can use gene log2(TPM+1), which can be calculated from the count data available here: https://osf.io/gqrz9/files/osfstorage. Z-score the expression of each gene and continue at step 5.
- If predicting on non-TCGA tumor RNA-seq, you should merge your data with TCGA (batch correction is likely necessary). If batch correcting, merge your dataset with the TCGA isoform-level count data (available here: https://osf.io/gqrz9/files/osfstorage) and continue at step 2 of the workflow.
- If predicting on cell line RNA-seq, you can use the gene log2(TPM+1) available here: https://depmap.org/portal/data_page/?tab=currentRelease. Z-score the expression of each gene using the mean/standard deviation from DepMap, and continue at step 5. For this use case, batch correction usually isn't necessary. If it is required, merge with CCLE isoform-level counts (available here: https://osf.io/gqrz9/files/osfstorage) and continue at step 2 of the workflow.
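For the cell line case, z-scoring with the reference mean/standard deviation rather than your own sample's statistics could look like the sketch below. The function name and the toy matrices are hypothetical, both inputs are assumed to be log2(TPM+1) with genes in the same column order, and the population standard deviation is an assumption.

```python
import numpy as np

def zscore_with_reference(sample_log_tpm, ref_log_tpm):
    """Z-score samples using per-gene mean/SD from a reference matrix.

    Both arrays are (n_samples, n_genes) log2(TPM+1) with matching gene columns.
    """
    ref_mean = ref_log_tpm.mean(axis=0)
    ref_sd = ref_log_tpm.std(axis=0)              # population SD (assumption)
    ref_sd = np.where(ref_sd == 0, 1.0, ref_sd)   # guard against flat genes
    return (sample_log_tpm - ref_mean) / ref_sd

# Toy reference (hypothetical): 3 reference lines x 2 genes; gene 2 is constant.
ref = np.array([[1.0, 4.0], [3.0, 4.0], [5.0, 4.0]])
new_sample = np.array([[3.0, 5.0]])

z_new = zscore_with_reference(new_sample, ref)
```

This keeps a single new sample on the same scale as the training data, which would not be possible if it were z-scored on its own.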
In general, we noticed that cell line RNA-seq often does not require batch correction against CCLE, while tumor RNA-seq almost always requires batch correction against TCGA (based on tSNE analysis). We recommend batch correcting with ComBat-seq, using lineage as a covariate. For your sample, choose the TCGA or CCLE lineage that matches it most closely. TCGA lineages are available here: https://gdc.cancer.gov/resources-tcga-users/tcga-code-tables/tcga-study-abbreviations; CCLE lineages are given by the string following the underscore "_" in the cell line name, as listed in the sample_info.csv file (https://depmap.org/portal/data_page/?tab=allData).
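As a small illustration of reading the lineage out of a CCLE-style cell line name, the sketch below takes everything after the first underscore. Splitting at the first underscore is an assumption made here so that multi-word lineages stay intact; the function name is hypothetical and the example names are for illustration only.

```python
def ccle_lineage(ccle_name):
    """Return the text after the first underscore, or None if there is no underscore."""
    _, sep, lineage = ccle_name.partition("_")
    return lineage if sep else None

# Illustrative names; multi-word lineages keep their internal underscores.
examples = ["A549_LUNG", "HEL_HAEMATOPOIETIC_AND_LYMPHOID_TISSUE", "NOLINEAGE"]
lineages = [ccle_lineage(name) for name in examples]
```

The resulting lineage labels can then be supplied as the covariate when running ComBat-seq.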
We recommend using STAR/RSEM for quantification. We have not tested other quantification methods, though they may also work.