Commit

Suggestions by model developer C. Theodoris
marianaql committed Aug 5, 2024
1 parent d3fdaa6 commit 0db4315
Showing 1 changed file with 45 additions and 20 deletions.
65 changes: 45 additions & 20 deletions docs/model_cards/geneformer.md
@@ -4,13 +4,13 @@

**Model Name:** Geneformer \
**Model Version:** 1.0 \
**Model Description:** Geneformer is a context-aware, attention-based deep learning model pretrained on a large-scale corpus of approximately 30 million single-cell transcriptomes. It is designed to enable context-specific predictions in settings with limited data in network biology. The model performs various tasks such as gene network mapping, disease modeling, and therapeutic target identification. \
**Model Description:** Geneformer is a context-aware, attention-based deep learning model pretrained on a large-scale corpus of approximately 30 million single-cell transcriptomes. It is designed to enable context-specific predictions in settings with limited data in network biology. The model performs various tasks such as gene network mapping, disease modeling, and therapeutic target identification.

## Model Developers

**Developed By:** Christina V. Theodoris, Ling Xiao, Anant Chopra, Mark D. Chaffin, Zeina R. Al Sayed, Matthew C. Hill, Helene Mantineo, Elizabeth M. Brydon, Zexian Zeng, X. Shirley Liu, Patrick T. Ellinor \
**Contact Information:** [email protected], [email protected] \
**License:** Apache-2.0 \
**Contact Information:** [email protected] \
**License:** Apache-2.0

## Model Type

@@ -20,15 +20,23 @@

## Model Purpose

**Intended Use:**
**Technical usage:**
- Tokenizing transcriptomes
- Pre-training
- Hyperparameter tuning
- Fine-tuning
- Extracting and plotting cell embeddings
- In silico perturbation


**Broader research applications:**
- Research in genomics and network biology
- Disease modeling with limited patient data
- Identification of candidate therapeutic targets
- Prediction of gene dosage sensitivity and chromatin dynamics
- Context-specific predictions in gene regulatory networks

**Out-of-Scope Use Cases:**
- Direct clinical decision making without human oversight
- Applications outside the scope of gene network and transcriptomic analysis

## Training Data
@@ -43,23 +51,40 @@
**Preprocessing:**
- Exclusion of cells with high mutational burdens
- Metrics established for scalable filtering to exclude possible doublets and/or damaged cells
- Rank value encoding of transcriptomes where genes are ranked by expression within each cell
- Rank value encoding of transcriptomes where genes are ranked by scaled expression within each cell.
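The rank value encoding step above can be sketched as follows. This is a minimal illustration, not the Geneformer tokenizer itself: the function name `rank_value_encode`, the per-cell total-count normalization, and the per-gene scaling factors in `gene_scale` are all assumptions made for the example. The idea shown is only that each gene's expression in a cell is scaled, and the cell is then represented by its genes in descending order of that scaled value.

```python
def rank_value_encode(cell_counts, gene_scale):
    """Return gene names ordered by scaled expression, highest first.

    cell_counts: dict mapping gene -> raw count in one cell.
    gene_scale:  dict mapping gene -> positive scaling factor
                 (hypothetical values; an assumption for this sketch).
    """
    total = sum(cell_counts.values())
    if total == 0:
        return []  # no detected genes, nothing to rank

    scaled = {}
    for gene, count in cell_counts.items():
        # Undetected genes and genes without a scaling factor are skipped.
        if count > 0 and gene in gene_scale:
            # Normalize by the cell's total count, then by the gene's factor.
            scaled[gene] = (count / total) / gene_scale[gene]

    # Sort by descending scaled value; ties are broken by gene name.
    return sorted(scaled, key=lambda g: (-scaled[g], g))


# Toy cell with made-up counts and scaling factors.
cell = {"GAPDH": 50, "TTN": 30, "GATA4": 5, "MYH7": 0}
scale = {"GAPDH": 10.0, "TTN": 3.0, "GATA4": 0.2, "MYH7": 1.0}
print(rank_value_encode(cell, scale))
```

In this toy cell, `GATA4` has the lowest raw count of the detected genes but ranks first after scaling, which shows why ranking by scaled rather than raw expression changes the encoding.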

## Model Performance

**Evaluation Metrics:**
- Area Under the Receiver Operating Characteristic Curve (AUC)
- Predictive accuracy in distinguishing dosage-sensitive genes, chromatin dynamics, regulatory range of transcription factors, and central vs. peripheral network factors

**Performance Benchmarks:**
- Gene Dosage Sensitivity: AUC 0.91
- Chromatin Dynamics: AUC 0.93 for bivalent vs. non-methylated, AUC 0.88 for bivalent vs. H3K4me3-only
- Regulatory Range of Transcription Factors: AUC 0.74
- Network Hierarchy Prediction: AUC 0.81
- Predictive accuracy in distinguishing:
  - With *fine-tuning*:
    - Transcription factor dosage sensitivity
    - Chromatin dynamics (bivalently marked promoters)
    - Transcription factor regulatory range
    - Gene network centrality
    - Transcription factor targets
    - Cell type annotation
    - Batch integration
    - Cell state classification across differentiation
    - Disease classification
    - In silico perturbation to determine disease-driving genes
    - In silico treatment to determine candidate therapeutic targets
  - With *Zero-shot learning*:
    - Batch integration
    - Gene context specificity
    - In silico reprogramming
    - In silico differentiation
    - In silico perturbation to determine impact on cell state
    - In silico perturbation to determine transcription factor targets
    - In silico perturbation to determine transcription factor cooperativity

**Testing Data:**
- Held-out subsets of the training dataset
- Additional validation using publicly available datasets and experimental validation
- Additional validation using publicly available datasets
- Experimental validation for:
  - Zero-shot prediction of a novel transcription factor in cardiomyocytes, which had a functional impact on the cardiomyocytes' contractile force generation
  - Prediction of candidate therapeutic targets by in silico treatment analysis, which significantly improved contractile force generation of cardiac microtissues in an iPS cell model of cardiomyopathy

## Ethical Considerations

@@ -108,14 +133,14 @@ print(embeddings.shape)

```

## Developers

Christina V. Theodoris, Ling Xiao, Anant Chopra, Mark D. Chaffin, Zeina R. Al Sayed, Matthew C. Hill, Helene Mantineo, Elizabeth M. Brydon, Zexian Zeng, X. Shirley Liu, Patrick T. Ellinor

## Contact

[email protected], [email protected]
[email protected]

## Citation

Theodoris, C. V., Xiao, L., Chopra, A., Chaffin, M. D., Al Sayed, Z. R., Hill, M. C., Mantineo, H., Brydon, E. M., Zeng, Z., Liu, X. S., & Ellinor, P. T. (2023). Transfer learning enables predictions in network biology. Nature, 618, 616-624. https://doi.org/10.1038/s41586-023-06139-9
Theodoris, C. V., Xiao, L., Chopra, A., Chaffin, M. D., Al Sayed, Z. R., Hill, M. C., Mantineo, H., Brydon, E. M., Zeng, Z., Liu, X. S., & Ellinor, P. T. (2023). Transfer learning enables predictions in network biology. Nature, 618, 616-624. https://doi.org/10.1038/s41586-023-06139-9

## Author contributions

C.V.T. conceived of the work, developed Geneformer, assembled Genecorpus-30M and designed and performed computational analyses. L.X., A.C., Z.R.A.S., M.C.H., H.M. and E.M.B. performed experimental validation in engineered cardiac microtissues. M.D.C. performed preprocessing, cell annotation and differential expression analysis of the cardiomyopathy dataset. Z.Z. provided data from the TISCH database for inclusion in Genecorpus-30M. X.S.L. and P.T.E. designed analyses and supervised the work. C.V.T., X.S.L. and P.T.E.
