From d3fdaa607b3f15bbf0d6e439d8292321212cb5dc Mon Sep 17 00:00:00 2001
From: Mariana Quiroga Londono
Date: Mon, 15 Jul 2024 13:29:11 +0200
Subject: [PATCH 1/4] More granularity in 'Data Volume' description

---
 docs/model_cards/scgpt.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/model_cards/scgpt.md b/docs/model_cards/scgpt.md
index fe12318b..f9308e0d 100644
--- a/docs/model_cards/scgpt.md
+++ b/docs/model_cards/scgpt.md
@@ -35,7 +35,7 @@
 - Publicly available single-cell RNA-seq, ATAC-seq, and other omics databases from CELLxGENE and other repositories

 **Data Volume:**
-- Pre-trained on data from over 33 million single-cell samples
+- Pre-trained on data from over 33 million human cells under non-disease conditions. This comprehensive dataset encompasses a wide range of cell types from 51 organs or tissues, and 441 studies.

 **Preprocessing:**
 - Standardized to remove low-quality cells and sequences

From 0db4315c4afbfb9146ab4b0ccb30e4795fcb7108 Mon Sep 17 00:00:00 2001
From: Mariana Quiroga Londono
Date: Mon, 5 Aug 2024 14:58:14 +0200
Subject: [PATCH 2/4] Suggestions by model developer C. Theodoris

---
 docs/model_cards/geneformer.md | 65 +++++++++++++++++++++++-----------
 1 file changed, 45 insertions(+), 20 deletions(-)

diff --git a/docs/model_cards/geneformer.md b/docs/model_cards/geneformer.md
index 8812d102..c1bdb44c 100644
--- a/docs/model_cards/geneformer.md
+++ b/docs/model_cards/geneformer.md
@@ -4,13 +4,13 @@

 **Model Name:** Geneformer \
 **Model Version:** 1.0 \
-**Model Description:** Geneformer is a context-aware, attention-based deep learning model pretrained on a large-scale corpus of approximately 30 million single-cell transcriptomes. It is designed to enable context-specific predictions in settings with limited data in network biology. The model performs various tasks such as gene network mapping, disease modeling, and therapeutic target identification. \
+**Model Description:** Geneformer is a context-aware, attention-based deep learning model pretrained on a large-scale corpus of approximately 30 million single-cell transcriptomes. It is designed to enable context-specific predictions in settings with limited data in network biology. The model performs various tasks such as gene network mapping, disease modeling, and therapeutic target identification.

 ## Model Developers

 **Developed By:** Christina V. Theodoris, Ling Xiao, Anant Chopra, Mark D. Chaffin, Zeina R. Al Sayed, Matthew C. Hill, Helene Mantineo, Elizabeth M. Brydon, Zexian Zeng, X. Shirley Liu, Patrick T. Ellinor \
-**Contact Information:** christina.theodoris@gladstone.ucsf.edu, ellinor@mgh.harvard.edu \
-**License:** Apache-2.0 \
+**Contact Information:** christina.theodoris@gladstone.ucsf.edu \
+**License:** Apache-2.0

 ## Model Type

@@ -20,7 +20,16 @@

 ## Model Purpose

-**Intended Use:**
+**Technical usage:**
+- Tokenizing transcriptomes
+- Pre-training
+- Hyperparameter tuning
+- Fine-tuning
+- Extracting and plotting cell embeddings
+- In silico perturbation
+
+
+**Broader research applications:**
 - Research in genomics and network biology
 - Disease modeling with limited patient data
 - Identification of candidate therapeutic targets
@@ -28,7 +37,6 @@
 - Context-specific predictions in gene regulatory networks

 **Out-of-Scope Use Cases:**
-- Direct clinical decision making without human oversight
 - Applications outside the scope of gene network and transcriptomic analysis

 ## Training Data
@@ -43,23 +51,40 @@
 **Preprocessing:**
 - Exclusion of cells with high mutational burdens
 - Metrics established for scalable filtering to exclude possible doublets and/or damaged cells
-- Rank value encoding of transcriptomes where genes are ranked by expression within each cell
+- Rank value encoding of transcriptomes where genes are ranked by scaled expression within each cell.

 ## Model Performance

 **Evaluation Metrics:**
 - Area Under the Receiver Operating Characteristic Curve (AUC)
-- Predictive accuracy in distinguishing dosage-sensitive genes, chromatin dynamics, regulatory range of transcription factors, and central vs. peripheral network factors
-
-**Performance Benchmarks:**
-- Gene Dosage Sensitivity: AUC 0.91
-- Chromatin Dynamics: AUC 0.93 for bivalent vs. non-methylated, AUC 0.88 for bivalent vs. H3K4me3-only
-- Regulatory Range of Transcription Factors: AUC 0.74
-- Network Hierarchy Prediction: AUC 0.81
+- Predictive accuracy in distinguishing:
+    - With *fine-tuning*:
+        - Transcription factor dosage sensitivity
+        - Chromatin dynamics (bivalently marked promoters)
+        - Transcription factor regulatory range
+        - Gene network centrality
+        - Transcription factor targets
+        - Cell type annotation
+        - Batch integration
+        - Cell state classification across differentiation
+        - Disease classification
+        - In silico perturbation to determine disease-driving genes
+        - In silico treatment to determine candidate therapeutic targets
+    - With *Zero-shot learning*:
+        - Batch integration
+        - Gene context specificity
+        - In silico reprogramming
+        - In silico differentiation
+        - In silico perturbation to determine impact on cell state
+        - In silico perturbation to determine transcription factor targets
+        - In silico perturbation to determine transcription factor cooperativity

 **Testing Data:**
 - Held-out subsets of the training dataset
-- Additional validation using publicly available datasets and experimental validation
+- Additional validation using publicly available datasets
+- Experimental validation for:
+    - Prediction of novel transcription factor in cardiomyocytes with zero-shot learning that had a functional impact on cardiomyocytes' contractile force generation
+    - Prediction of candidate therapeutic targets with in silico treatment analysis that significantly improved contractile force generation of cardiac microtissues in an iPS cell model of cardiomyopathy

 ## Ethical Considerations

@@ -108,14 +133,14 @@
 print(embeddings.shape)
 ```

-## Developers
-
-Christina V. Theodoris, Ling Xiao, Anant Chopra, Mark D. Chaffin, Zeina R. Al Sayed, Matthew C. Hill, Helene Mantineo, Elizabeth M. Brydon, Zexian Zeng, X. Shirley Liu, Patrick T. Ellinor
-
 ## Contact

-christina.theodoris@gladstone.ucsf.edu, ellinor@mgh.harvard.edu
+christina.theodoris@gladstone.ucsf.edu

 ## Citation

-Theodoris, C. V., Xiao, L., Chopra, A., Chaffin, M. D., Al Sayed, Z. R., Hill, M. C., Mantineo, H., Brydon, E. M., Zeng, Z., Liu, X. S., & Ellinor, P. T. (2023). Transfer learning enables predictions in network biology. Nature, 618, 616-624. https://doi.org/10.1038/s41586-023-06139-9
\ No newline at end of file
+Theodoris, C. V., Xiao, L., Chopra, A., Chaffin, M. D., Al Sayed, Z. R., Hill, M. C., Mantineo, H., Brydon, E. M., Zeng, Z., Liu, X. S., & Ellinor, P. T. (2023). Transfer learning enables predictions in network biology. Nature, 618, 616-624. https://doi.org/10.1038/s41586-023-06139-9
+
+## Author contributions
+
+C.V.T. conceived of the work, developed Geneformer, assembled Genecorpus-30M and designed and performed computational analyses. L.X., A.C., Z.R.A.S., M.C.H., H.M. and E.M.B. performed experimental validation in engineered cardiac microtissues. M.D.C. performed preprocessing, cell annotation and differential expression analysis of the cardiomyopathy dataset. Z.Z. provided data from the TISCH database for inclusion in Genecorpus-30M. X.S.L. and P.T.E. designed analyses and supervised the work. C.V.T., X.S.L. and P.T.E.

From ae95959a467920dd8fc9a4ce1e97ade33ce4dd3b Mon Sep 17 00:00:00 2001
From: Mariana Quiroga Londono
Date: Mon, 5 Aug 2024 15:04:07 +0200
Subject: [PATCH 3/4] Model dev info updated based on contributions

---
 docs/model_cards/geneformer.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/model_cards/geneformer.md b/docs/model_cards/geneformer.md
index c1bdb44c..1ff9cab4 100644
--- a/docs/model_cards/geneformer.md
+++ b/docs/model_cards/geneformer.md
@@ -8,7 +8,7 @@

 ## Model Developers

-**Developed By:** Christina V. Theodoris, Ling Xiao, Anant Chopra, Mark D. Chaffin, Zeina R. Al Sayed, Matthew C. Hill, Helene Mantineo, Elizabeth M. Brydon, Zexian Zeng, X. Shirley Liu, Patrick T. Ellinor \
+**Developed by:** Christina V. Theodoris conceived of the work, developed Geneformer, assembled Genecorpus-30M and designed and performed computational analyses. Other [author contributions](#citation). \
 **Contact Information:** christina.theodoris@gladstone.ucsf.edu \
 **License:** Apache-2.0

From d585f3e3107fee34ca50b0202f5676936065851a Mon Sep 17 00:00:00 2001
From: Mariana Quiroga Londono
Date: Mon, 5 Aug 2024 16:17:28 +0200
Subject: [PATCH 4/4] removed out of scope use cases + AUC in context

---
 docs/model_cards/geneformer.md | 4 ----
 1 file changed, 4 deletions(-)

diff --git a/docs/model_cards/geneformer.md b/docs/model_cards/geneformer.md
index 1ff9cab4..a76bad9f 100644
--- a/docs/model_cards/geneformer.md
+++ b/docs/model_cards/geneformer.md
@@ -36,9 +36,6 @@
 - Prediction of gene dosage sensitivity and chromatin dynamics
 - Context-specific predictions in gene regulatory networks

-**Out-of-Scope Use Cases:**
-- Applications outside the scope of gene network and transcriptomic analysis
-
 ## Training Data

 **Data Sources:**
@@ -56,7 +53,6 @@
 ## Model Performance

 **Evaluation Metrics:**
-- Area Under the Receiver Operating Characteristic Curve (AUC)
 - Predictive accuracy in distinguishing:
     - With *fine-tuning*:
         - Transcription factor dosage sensitivity