diff --git a/docs/model_cards/geneformer.md b/docs/model_cards/geneformer.md
index 8812d102..c1bdb44c 100644
--- a/docs/model_cards/geneformer.md
+++ b/docs/model_cards/geneformer.md
@@ -4,13 +4,13 @@
 
 **Model Name:** Geneformer \
 **Model Version:** 1.0 \
-**Model Description:** Geneformer is a context-aware, attention-based deep learning model pretrained on a large-scale corpus of approximately 30 million single-cell transcriptomes. It is designed to enable context-specific predictions in settings with limited data in network biology. The model performs various tasks such as gene network mapping, disease modeling, and therapeutic target identification. \
+**Model Description:** Geneformer is a context-aware, attention-based deep learning model pretrained on a large-scale corpus of approximately 30 million single-cell transcriptomes. It is designed to enable context-specific predictions in settings with limited data in network biology. The model performs various tasks such as gene network mapping, disease modeling, and therapeutic target identification.
 
 ## Model Developers
 
 **Developed By:** Christina V. Theodoris, Ling Xiao, Anant Chopra, Mark D. Chaffin, Zeina R. Al Sayed, Matthew C. Hill, Helene Mantineo, Elizabeth M. Brydon, Zexian Zeng, X. Shirley Liu, Patrick T. Ellinor \
-**Contact Information:** christina.theodoris@gladstone.ucsf.edu, ellinor@mgh.harvard.edu \
-**License:** Apache-2.0 \
+**Contact Information:** christina.theodoris@gladstone.ucsf.edu \
+**License:** Apache-2.0
 
 ## Model Type
 
@@ -20,7 +20,16 @@
 
 ## Model Purpose
 
-**Intended Use:**
+**Technical usage:**
+- Tokenizing transcriptomes
+- Pre-training
+- Hyperparameter tuning
+- Fine-tuning
+- Extracting and plotting cell embeddings
+- In silico perturbation
+
+
+**Broader research applications:**
 - Research in genomics and network biology
 - Disease modeling with limited patient data
 - Identification of candidate therapeutic targets
@@ -28,7 +37,6 @@
 - Context-specific predictions in gene regulatory networks
 
 **Out-of-Scope Use Cases:**
-- Direct clinical decision making without human oversight
 - Applications outside the scope of gene network and transcriptomic analysis
 
 ## Training Data
@@ -43,23 +51,40 @@
 **Preprocessing:**
 - Exclusion of cells with high mutational burdens
 - Metrics established for scalable filtering to exclude possible doublets and/or damaged cells
-- Rank value encoding of transcriptomes where genes are ranked by expression within each cell
+- Rank value encoding of transcriptomes where genes are ranked by scaled expression within each cell.
 
 ## Model Performance
 
 **Evaluation Metrics:**
 - Area Under the Receiver Operating Characteristic Curve (AUC)
-- Predictive accuracy in distinguishing dosage-sensitive genes, chromatin dynamics, regulatory range of transcription factors, and central vs. peripheral network factors
-
-**Performance Benchmarks:**
-- Gene Dosage Sensitivity: AUC 0.91
-- Chromatin Dynamics: AUC 0.93 for bivalent vs. non-methylated, AUC 0.88 for bivalent vs. H3K4me3-only
-- Regulatory Range of Transcription Factors: AUC 0.74
-- Network Hierarchy Prediction: AUC 0.81
+- Predictive accuracy in distinguishing:
+    - With *fine-tuning*:
+        - Transcription factor dosage sensitivity
+        - Chromatin dynamics (bivalently marked promoters)
+        - Transcription factor regulatory range
+        - Gene network centrality
+        - Transcription factor targets
+        - Cell type annotation
+        - Batch integration
+        - Cell state classification across differentiation
+        - Disease classification
+        - In silico perturbation to determine disease-driving genes
+        - In silico treatment to determine candidate therapeutic targets
+    - With *Zero-shot learning*:
+        - Batch integration
+        - Gene context specificity
+        - In silico reprogramming
+        - In silico differentiation
+        - In silico perturbation to determine impact on cell state
+        - In silico perturbation to determine transcription factor targets
+        - In silico perturbation to determine transcription factor cooperativity
 
 **Testing Data:**
 - Held-out subsets of the training dataset
-- Additional validation using publicly available datasets and experimental validation
+- Additional validation using publicly available datasets
+- Experimental validation for:
+    - Prediction of novel transcription factor in cardiomyocytes with zero-shot learning that had a functional impact on cardiomyocytes' contractile force generation
+    - Prediction of candidate therapeutic targets with in silico treatment analysis that significantly improved contractile force generation of cardiac microtissues in an iPS cell model of cardiomyopathy
 
 ## Ethical Considerations
 
@@ -108,14 +133,14 @@
 print(embeddings.shape)
 ```
 
-## Developers
-
-Christina V. Theodoris, Ling Xiao, Anant Chopra, Mark D. Chaffin, Zeina R. Al Sayed, Matthew C. Hill, Helene Mantineo, Elizabeth M. Brydon, Zexian Zeng, X. Shirley Liu, Patrick T. Ellinor
-
 ## Contact
 
-christina.theodoris@gladstone.ucsf.edu, ellinor@mgh.harvard.edu
+christina.theodoris@gladstone.ucsf.edu
 
 ## Citation
 
-Theodoris, C. V., Xiao, L., Chopra, A., Chaffin, M. D., Al Sayed, Z. R., Hill, M. C., Mantineo, H., Brydon, E. M., Zeng, Z., Liu, X. S., & Ellinor, P. T. (2023). Transfer learning enables predictions in network biology. Nature, 618, 616-624. https://doi.org/10.1038/s41586-023-06139-9
\ No newline at end of file
+Theodoris, C. V., Xiao, L., Chopra, A., Chaffin, M. D., Al Sayed, Z. R., Hill, M. C., Mantineo, H., Brydon, E. M., Zeng, Z., Liu, X. S., & Ellinor, P. T. (2023). Transfer learning enables predictions in network biology. Nature, 618, 616-624. https://doi.org/10.1038/s41586-023-06139-9
+
+## Author contributions
+
+C.V.T. conceived of the work, developed Geneformer, assembled Genecorpus-30M and designed and performed computational analyses. L.X., A.C., Z.R.A.S., M.C.H., H.M. and E.M.B. performed experimental validation in engineered cardiac microtissues. M.D.C. performed preprocessing, cell annotation and differential expression analysis of the cardiomyopathy dataset. Z.Z. provided data from the TISCH database for inclusion in Genecorpus-30M. X.S.L. and P.T.E. designed analyses and supervised the work. C.V.T., X.S.L. and P.T.E.