Adjusting documentation #61

Closed
wants to merge 4 commits into from
71 changes: 46 additions & 25 deletions docs/model_cards/geneformer.md
@@ -4,13 +4,13 @@

**Model Name:** Geneformer \
**Model Version:** 1.0 \
**Model Description:** Geneformer is a context-aware, attention-based deep learning model pretrained on a large-scale corpus of approximately 30 million single-cell transcriptomes. It is designed to enable context-specific predictions in settings with limited data in network biology. The model performs various tasks such as gene network mapping, disease modeling, and therapeutic target identification. \
**Model Description:** Geneformer is a context-aware, attention-based deep learning model pretrained on a large-scale corpus of approximately 30 million single-cell transcriptomes. It is designed to enable context-specific predictions in settings with limited data in network biology. The model performs various tasks such as gene network mapping, disease modeling, and therapeutic target identification.

## Model Developers

**Developed By:** Christina V. Theodoris, Ling Xiao, Anant Chopra, Mark D. Chaffin, Zeina R. Al Sayed, Matthew C. Hill, Helene Mantineo, Elizabeth M. Brydon, Zexian Zeng, X. Shirley Liu, Patrick T. Ellinor \
**Contact Information:** [email protected], [email protected] \
**License:** Apache-2.0 \
**Developed by:** Christina V. Theodoris conceived of the work, developed Geneformer, assembled Genecorpus-30M and designed and performed computational analyses. See [author contributions](#author-contributions) for the other authors. \
**Contact Information:** [email protected] \
**License:** Apache-2.0

## Model Type

@@ -20,17 +20,22 @@

## Model Purpose

**Intended Use:**
**Technical usage:**
- Tokenizing transcriptomes
- Pre-training
- Hyperparameter tuning
- Fine-tuning
- Extracting and plotting cell embeddings
- In silico perturbation


**Broader research applications:**
- Research in genomics and network biology
- Disease modeling with limited patient data
- Identification of candidate therapeutic targets
- Prediction of gene dosage sensitivity and chromatin dynamics
- Context-specific predictions in gene regulatory networks

**Out-of-Scope Use Cases:**
- Direct clinical decision making without human oversight
- Applications outside the scope of gene network and transcriptomic analysis

## Training Data

**Data Sources:**
@@ -43,23 +48,39 @@
**Preprocessing:**
- Exclusion of cells with high mutational burdens
- Metrics established for scalable filtering to exclude possible doublets and/or damaged cells
- Rank value encoding of transcriptomes where genes are ranked by expression within each cell
- Rank value encoding of transcriptomes, where genes are ranked by scaled expression within each cell
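
A minimal sketch of the rank value encoding described in the last bullet, assuming that "scaled expression" means a cell's depth-normalized counts divided by a per-gene scaling factor. The function name `rank_value_encode`, the `gene_scale` factors, and the `max_len` cutoff are hypothetical and chosen for illustration; they are not taken from the model card or from Geneformer's tokenizer.

```python
import numpy as np

def rank_value_encode(counts, gene_ids, gene_scale, max_len=2048):
    """Return gene IDs ordered by scaled expression within one cell (highest first)."""
    counts = np.asarray(counts, dtype=float)
    scale = np.asarray(gene_scale, dtype=float)
    total = counts.sum()
    if total == 0:
        return []  # a cell with no detected genes has no encoding
    # Normalize by sequencing depth, then by the assumed per-gene scaling factor.
    scaled = counts / total / scale
    # Keep only detected genes and sort them by descending scaled expression.
    expressed = np.flatnonzero(scaled > 0)
    order = expressed[np.argsort(-scaled[expressed], kind="stable")]
    return [gene_ids[i] for i in order[:max_len]]

# Toy cell with four genes; GENE_B is not detected and is dropped.
tokens = rank_value_encode(
    counts=[10, 0, 5, 5],
    gene_ids=["GENE_A", "GENE_B", "GENE_C", "GENE_D"],
    gene_scale=[10.0, 1.0, 1.0, 2.0],
)
print(tokens)
```

In this toy cell, GENE_A has the highest raw count but ranks last after scaling, which shows how the per-gene factor can push a highly expressed gene down the ordering.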

## Model Performance

**Evaluation Metrics:**
- Area Under the Receiver Operating Characteristic Curve (AUC)
- Predictive accuracy in distinguishing dosage-sensitive genes, chromatin dynamics, regulatory range of transcription factors, and central vs. peripheral network factors

**Performance Benchmarks:**
- Gene Dosage Sensitivity: AUC 0.91
- Chromatin Dynamics: AUC 0.93 for bivalent vs. non-methylated, AUC 0.88 for bivalent vs. H3K4me3-only
- Regulatory Range of Transcription Factors: AUC 0.74
- Network Hierarchy Prediction: AUC 0.81
- Predictive accuracy in distinguishing:
  - With *fine-tuning*:
    - Transcription factor dosage sensitivity
    - Chromatin dynamics (bivalently marked promoters)
    - Transcription factor regulatory range
    - Gene network centrality
    - Transcription factor targets
    - Cell type annotation
    - Batch integration
    - Cell state classification across differentiation
    - Disease classification
    - In silico perturbation to determine disease-driving genes
    - In silico treatment to determine candidate therapeutic targets
  - With *zero-shot learning*:
    - Batch integration
    - Gene context specificity
    - In silico reprogramming
    - In silico differentiation
    - In silico perturbation to determine impact on cell state
    - In silico perturbation to determine transcription factor targets
    - In silico perturbation to determine transcription factor cooperativity
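
The AUC metric listed under Evaluation Metrics can be read as the probability that a randomly chosen positive example scores higher than a randomly chosen negative one, with ties counted as one half. The following is a minimal sketch of that pairwise definition; the function name `roc_auc` and the toy labels and scores are hypothetical, not part of the model card or the Geneformer evaluation code.

```python
import numpy as np

def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[labels], scores[~labels]
    if pos.size == 0 or neg.size == 0:
        raise ValueError("AUC needs at least one positive and one negative example")
    # Compare every positive score with every negative score.
    diff = pos[:, None] - neg[None, :]
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return float(wins / (pos.size * neg.size))

# Toy example: three of the four positive/negative pairs are ordered correctly.
auc = roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
print(auc)
```

The pairwise form takes memory proportional to the number of positive/negative pairs, so it suits small illustrations rather than large evaluation sets.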

**Testing Data:**
- Held-out subsets of the training dataset
- Additional validation using publicly available datasets and experimental validation
- Additional validation using publicly available datasets
- Experimental validation for:
  - Zero-shot prediction of a novel transcription factor in cardiomyocytes that had a functional impact on their contractile force generation
  - In silico treatment analysis that predicted candidate therapeutic targets, which significantly improved contractile force generation of cardiac microtissues in an iPS cell model of cardiomyopathy

## Ethical Considerations

@@ -108,14 +129,14 @@ print(embeddings.shape)

```

## Developers

Christina V. Theodoris, Ling Xiao, Anant Chopra, Mark D. Chaffin, Zeina R. Al Sayed, Matthew C. Hill, Helene Mantineo, Elizabeth M. Brydon, Zexian Zeng, X. Shirley Liu, Patrick T. Ellinor

## Contact

[email protected], [email protected]
[email protected]

## Citation

Theodoris, C. V., Xiao, L., Chopra, A., Chaffin, M. D., Al Sayed, Z. R., Hill, M. C., Mantineo, H., Brydon, E. M., Zeng, Z., Liu, X. S., & Ellinor, P. T. (2023). Transfer learning enables predictions in network biology. Nature, 618, 616-624. https://doi.org/10.1038/s41586-023-06139-9
Theodoris, C. V., Xiao, L., Chopra, A., Chaffin, M. D., Al Sayed, Z. R., Hill, M. C., Mantineo, H., Brydon, E. M., Zeng, Z., Liu, X. S., & Ellinor, P. T. (2023). Transfer learning enables predictions in network biology. Nature, 618, 616-624. https://doi.org/10.1038/s41586-023-06139-9

## Author contributions

C.V.T. conceived of the work, developed Geneformer, assembled Genecorpus-30M and designed and performed computational analyses. L.X., A.C., Z.R.A.S., M.C.H., H.M. and E.M.B. performed experimental validation in engineered cardiac microtissues. M.D.C. performed preprocessing, cell annotation and differential expression analysis of the cardiomyopathy dataset. Z.Z. provided data from the TISCH database for inclusion in Genecorpus-30M. X.S.L. and P.T.E. designed analyses and supervised the work. C.V.T., X.S.L. and P.T.E.
2 changes: 1 addition & 1 deletion docs/model_cards/scgpt.md
@@ -35,7 +35,7 @@
- Publicly available single-cell RNA-seq, ATAC-seq, and other omics databases from CELLxGENE and other repositories

**Data Volume:**
- Pre-trained on data from over 33 million single-cell samples
- Pre-trained on data from over 33 million human cells under non-disease conditions. The dataset spans a wide range of cell types from 51 organs or tissues across 441 studies.

**Preprocessing:**
- Standardized to remove low-quality cells and sequences