News

2024.06.22 TDC-2 preprint is released!

TDC-2 Features
- 10+ new modalities: TDC-2 drastically expands the coverage of ML tasks across therapeutic pipelines and adds 10+ new modalities, including single-cell gene expression data, clinical trial data, peptide sequence data, peptidomimetics protein-peptide interaction data for newly discovered ligands derived from AS-MS spectroscopy, novel 3D structural data for proteins, and cell-type-specific protein-protein interaction networks at single-cell resolution.
- Single-cell atlases and foundation model embeddings: TDC-2 introduces over 1,000 multimodal datasets, spanning approximately 85 million cells, and pre-calculated embeddings from 5 state-of-the-art single-cell models via CZ CELLxGENE Census and the TDC Model Hub.
- API-First Multimodal Retrieval API via TDC-2 MVC and Resource: TDC-2 drastically expands the dataset retrieval capabilities available in TDC-1 beyond those of other leading benchmarks. The software architecture of TDC-2 was redesigned using the Model-View-Controller (MVC) design pattern. The MVC pattern supports the integration of multiple data modalities by using data mappings and views. The MVC-enabled multimodal retrieval API is powered by TDC-2's Resource Model and a Domain-Specific Language.
  - TDC-2 Domain-Specific Language: TDC-2 develops an Application-Embedded Domain-Specific Data Definition Programming Language that facilitates the integration of multiple modalities by generating data views from a mapping of various datasets and functions for transformations, integration, and multimodal enhancements while maintaining a high level of abstraction for the Resource framework.
  - TDC-2 Resource Model: The Commons introduces a redesign of TDC-1's dataset layer into a new data model dubbed the TDC-2 resource, developed under the MVC paradigm to integrate multiple modalities into the API-first model of TDC-2. We leverage CZ CELLxGENE to develop a TDC-2 Resource Model for constructing large-scale single-cell datasets that map gene expression profiles of individual cells across tissues and across healthy and disease states.
  - Biomedical Knowledge Graphs and External APIs: We have developed a framework for biomedical knowledge graphs to enhance the multimodality of dataset retrieval via TDC-2's Resource Model. Our system leverages PrimeKG to integrate 20 high-quality resources to describe 17,080 diseases with 4,050,249 relationships. Our framework also extends to external APIs, with data views currently leveraging BioPython for obtaining nucleotide sequence information for a given non-coding RNA ID from NCBI, and The UniProt Consortium's RESTful GET API for obtaining amino acid sequences.
  - TDC-2 Model Hub: In addition, we've developed a framework that allows access to predictive and foundation embedding models under diverse biological contexts via the TDC-2 Model Hub. TDC-2 releases AI-powered endpoints via The Commons' Model Hub, which enhances multimodal retrieval capabilities by providing access to protein embeddings under cell-type-specific biological contexts and model predictions for key biomedical challenges, including but not limited to binary classification on SMILES strings for Ether-a-go-go-related gene blockers, blood-brain-barrier permeability, and CYP3A4 inhibition. These models can be fine-tuned under TDC-2's fine-tuning paradigm and/or used for innovative downstream tasks.
- 7 innovative multimodal ML tasks and benchmarks: TDC-2 introduces 7 novel ML tasks with fine-grained biological contexts, including contextualized drug-target identification, single-cell chemical/genetic perturbation response prediction, protein-peptide binding affinity prediction, and clinical trial outcome prediction, which introduce antigen-processing-pathway-specific, cell-type-specific, peptide-specific, and patient-specific biological contexts. TDC-2 also releases benchmarks evaluating 15+ state-of-the-art models across 5+ new learning tasks, covering diverse biological contexts and sampling approaches. Among these, TDC-2 provides the first benchmark for context-specific learning. TDC-2, to our knowledge, is also the first to introduce a protein-peptide binding interaction benchmark. TDC-2's tasks, frameworks, datasets, and models are tailored to take on some of the most pressing machine learning challenges in biomedicine, including but not limited to cell-type-specific machine learning modeling and evaluation, the inferential gap in precision medicine, negative-sampling challenges in peptidomimetics, and model generalizability across unseen cell lines and perturbations.

For more information on these and additional features, please refer to the bioRxiv preprint.

2024.06.19 TDC-2 preview was presented at MoML2024 hosted by Mila. You can see the full conference here. Our poster can be seen in our tweet as well.

2023.07.10 TDC 0.4.1 is released! TDC has an exciting new task on clinical trial outcome prediction (Thanks to Tianfan)! Check out here for more information.

2023.04.17 TDC 0.4.0 is released! We're excited to announce the release of a new interface, tdc_hf_interface, that allows users to easily access and leverage pre-trained models hosted on HuggingFace for TDC datasets and tasks. In this first batch, we've released nine pre-trained models from DeepPurpose that cover three popular ADMET datasets in the Commons. To load a pre-trained model, simply do the following:

    from tdc import tdc_hf_interface

    # Select a pre-trained model by its name on the TDC-HF space
    tdc_hf = tdc_hf_interface("BBB_Martins-AttentiveFP")
    # Load the DeepPurpose model, using './data' as the local directory
    dp_model = tdc_hf.load_deeppurpose('./data')
    # Run predictions on a list of SMILES strings
    tdc_hf.predict_deeppurpose(dp_model, ['CC(=O)NC1=CC=C(O)C=C1'])

The TDC-HF space is located here. Stay tuned for more exciting pre-trained models, tasks & demos!

2023.01.26 TDC 0.3.9 is released! Here are the changes:
- TDC has 9 new datasets on high-throughput screening (HTS). These assays cover a wide range of protein target classes and are carefully collated through confirmation screens to validate active compounds. See here on how to access them!

Protein Target Class | PubChem AID | Protein Target | Total # of Molecules | # of Active Molecules |
---|---|---|---|---|
GPCR | 435008 | Orexin1 Receptor | 218,158 | 233 |
GPCR | 1798 | M1 Muscarinic Receptor Agonists | 61,833 | 187 |
GPCR | 435034 | M1 Muscarinic Receptor Antagonists | 61,756 | 362 |
Ion Channel | 1843 | Potassium Ion Channel Kir2.1 | 301,493 | 172 |
Ion Channel | 2258 | KCNQ2 Potassium Channel | 302,405 | 213 |
Ion Channel | 463087 | Cav3 T-type Calcium Channels | 100,875 | 703 |
Transporter | 488997 | Choline Transporter | 302,306 | 252 |
Kinase | 2689 | Serine/Threonine Kinase 33 | 319,792 | 172 |
Enzyme | 485290 | Tyrosyl-DNA Phosphodiesterase | 341,365 | 281 |

- TDC has an additional dataset on hERG in the Tox task. See here for more info!
- TDC now follows the black code style!

2022.11.03 TDC 0.3.8 is released! Here are the changes:
- TDC has a new task on structure-based drug design (SBDD) with four datasets, including PDBBind, DUD-E, and scPDB. See here on how to access them!
- To support evaluation of SBDD tasks, we also include two evaluation metrics (RMSD, Kabsch-RMSD) that compare distances between two structures. See here for more info.
- TDC has a new dataset on PAMPA (parallel artificial membrane permeability assay), a commonly employed assay for evaluating drug permeability across the cellular membrane, in the ADME task. See here for more info!
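
As a rough illustration of the RMSD metric mentioned in the 0.3.8 changes, the sketch below computes the root-mean-square deviation between two equally sized sets of 3D coordinates in plain Python. The function name `rmsd` and the toy coordinates are assumptions made for illustration and are not TDC's evaluator API. Kabsch-RMSD would additionally superimpose one structure onto the other (translation and rotation) before measuring the deviation; that alignment step is not shown here.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two lists of (x, y, z) points."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and equally long")
    squared = 0.0
    for a, b in zip(coords_a, coords_b):
        # Squared Euclidean distance between the paired atoms.
        squared += sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.sqrt(squared / len(coords_a))


# Toy example (hypothetical data): the second structure is the first
# one shifted by 1 unit along the x axis.
reference = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 1.5, 0.0)]
shifted = [(x + 1.0, y, z) for x, y, z in reference]
print(rmsd(reference, shifted))
```

Because the two toy structures differ only by a translation, the plain RMSD is non-zero, whereas an RMSD computed after optimal superposition would vanish. This is the practical difference between the two metrics.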

2022.09.06 TDC 0.3.7 is released! Here are the changes:
- TDC has a new evaluation metric on logAUC. See here and the PR.
- TDC now supports graphein protein 3D representation for antibody developability prediction. See tutorial and the PR.
- Datasets in the QM task are now in 3D format. See here.
- TDC has a harmonize function to deal with duplicated experimental entries in DTI. See here.
- TDC now has a dataloader for PrimeKG as an auxiliary resource. See how to access PrimeKG here.
- TDC fixed a static scikit-learn version issue for the gsk3b, jnk3, and drd2 oracles. See here for more info.
- The PPBR dataset in the ADME task now has additional species information. By default it contains only Homo sapiens entries; other species can be retrieved via a TDC function. See here for more info.
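
To make the logAUC metric from the list above more concrete, here is a minimal sketch of one common formulation: the area under the ROC curve with the false positive rate on a log10 scale, restricted to a low-FPR window and normalized so that a perfect ranking scores 1. The window of 0.001 to 0.1, the function names, and the toy labels are assumptions for illustration; TDC's own evaluator may differ in its details.

```python
import math


def roc_points(y_true, y_score):
    """Return ROC curve points (fpr, tpr), one per distinct score threshold."""
    pairs = sorted(zip(y_score, y_true), key=lambda p: -p[0])
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative label")
    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    for i, (score, label) in enumerate(pairs):
        tp += label
        fp += 1 - label
        # Emit a point only after the last item of a tied score group.
        if i + 1 == len(pairs) or pairs[i + 1][0] != score:
            fpr.append(fp / n_neg)
            tpr.append(tp / n_pos)
    return fpr, tpr


def interp(x, xs, ys):
    """Linearly interpolate y at x on a non-decreasing grid xs."""
    for x0, y0, x1, y1 in zip(xs, ys, xs[1:], ys[1:]):
        if x0 <= x <= x1 and x1 > x0:
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)
    return ys[-1]


def log_auc(y_true, y_score, fpr_range=(0.001, 0.1)):
    """Normalized area under the ROC curve over log10(FPR) within fpr_range."""
    lo, hi = fpr_range
    fpr, tpr = roc_points(y_true, y_score)
    inside = [(f, t) for f, t in zip(fpr, tpr) if lo < f < hi]
    xs = [lo] + [f for f, _ in inside] + [hi]
    ys = [interp(lo, fpr, tpr)] + [t for _, t in inside] + [interp(hi, fpr, tpr)]
    log_xs = [math.log10(x) for x in xs]
    # Trapezoidal rule on the log-scaled FPR axis.
    area = sum((x1 - x0) * (y0 + y1) / 2
               for x0, y0, x1, y1 in zip(log_xs, ys, log_xs[1:], ys[1:]))
    return area / (math.log10(hi) - math.log10(lo))


# Toy labels (1 = active, 0 = inactive) with a perfect and an inverted ranking.
labels = [1, 1, 0, 0, 0]
perfect = log_auc(labels, [0.9, 0.8, 0.3, 0.2, 0.1])
inverted = log_auc(labels, [0.1, 0.2, 0.7, 0.8, 0.9])
print(perfect, inverted)
```

Weighting the low-FPR region this way rewards models that rank true actives at the very top of a screening list, which is the regime that matters when only a small fraction of compounds can be tested.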

2022.02.19 TDC 0.3.6 is released! TDC has a new task on TCR-Epitope Binding prediction (Thanks to Anna and Jannis)! Check out here for more information.

2022.01.23 TDC 0.3.5 is released! Here are the changes:
- TDC has an updated ChEMBL library (Version 29) in MolGen! The previous version is still available. Check out here for more information.
- Reaction type information can be included in the split by turning on the include_reaction_type flag for USPTO-50 in RetroSyn! Check out here for more information.
- Fixed a bug in the cold split for higher-order (>2) multi-instance prediction tasks! (Thanks to Zoe!) Check out here for more information.

2021.12.28 TDC 0.3.4 is released! Bug fixes for the docking oracles and the KL divergence measure.

2021.11.25 TDC 0.3.3 is released! Extended support for cold split in multi prediction tasks has been added; see this issue!
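
A cold split keeps every entity of a chosen type (for example, every drug) in exactly one fold, so the test set only contains entities that were never seen during training. The following is a minimal sketch of that idea in plain Python; the function `cold_split`, its parameters, and the toy records are hypothetical and do not reproduce TDC's own splitting code.

```python
import random


def cold_split(records, key, frac=(0.7, 0.1, 0.2), seed=42):
    """Split records so each distinct value of `key` lands in one fold only."""
    entities = sorted({r[key] for r in records})
    random.Random(seed).shuffle(entities)
    n_train = round(len(entities) * frac[0])
    n_valid = round(len(entities) * frac[1])
    # Assign whole entities (not individual rows) to folds.
    fold_of = {}
    for i, entity in enumerate(entities):
        if i < n_train:
            fold_of[entity] = "train"
        elif i < n_train + n_valid:
            fold_of[entity] = "valid"
        else:
            fold_of[entity] = "test"
    split = {"train": [], "valid": [], "test": []}
    for r in records:
        split[fold_of[r[key]]].append(r)
    return split


# Toy drug-target pairs (hypothetical): 10 drugs, each measured on 3 targets.
records = [{"drug": f"D{i}", "target": f"T{j}", "y": i + j}
           for i in range(10) for j in range(3)]
split = cold_split(records, key="drug")
train_drugs = {r["drug"] for r in split["train"]}
test_drugs = {r["drug"] for r in split["test"]}
print(train_drugs & test_drugs)  # empty set: no drug appears in both folds
```

For tasks with more than one entity per row, the same assignment can be repeated for each entity column, keeping only rows whose entities all fall in the same fold; this discards some data, which is the usual price of a stricter split.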

2021.10.17 TDC 0.3.2 is released! We have added support for harmonizing identical DTIs with different affinities (KIBA and DAVIS were updated accordingly; see this issue), support for label name retrieval for TWOSIDES (this issue), and gene symbol info for GDSC (this issue).
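
Harmonizing here means collapsing repeated measurements of the same drug-target pair into a single label. A minimal sketch of the idea, assuming the mean and the maximum as the two aggregation choices; the function `harmonize_pairs` and the toy affinities are hypothetical and are not TDC's implementation.

```python
from collections import defaultdict
from statistics import mean


def harmonize_pairs(rows, mode="mean"):
    """Collapse duplicate (drug, target) rows into one affinity per pair."""
    grouped = defaultdict(list)
    for drug, target, affinity in rows:
        grouped[(drug, target)].append(affinity)
    if mode == "mean":
        aggregate = mean
    elif mode == "max":
        aggregate = max
    else:
        raise ValueError(f"unknown mode: {mode}")
    return {pair: aggregate(values) for pair, values in grouped.items()}


# Toy data (hypothetical): drugA/kinase1 was measured twice.
rows = [
    ("drugA", "kinase1", 10.0),
    ("drugA", "kinase1", 30.0),
    ("drugB", "kinase1", 5.0),
]
by_mean = harmonize_pairs(rows, mode="mean")
by_max = harmonize_pairs(rows, mode="max")
print(by_mean)
print(by_max)
```

Which aggregation is appropriate depends on the assay and on whether larger values mean stronger or weaker binding, so the choice is best made per dataset.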

2021.09.04 TDC 0.3.0 is released! We have greatly restructured the code to be contributor-friendly while keeping most interfaces the same. We have also released the documentation for the TDC package here.

2021.05.30 TDC updates to 0.2.0, major changes:
- TDC has a new molecule generation benchmark on docking scores! Check out here for more information.

2021.03.24 TDC updates to 0.1.9, major changes:
- TDC now supports molecule filters! Check out here for more information.

2021.03.17 TDC updates to 0.1.8, major changes:
- The leaderboard is reformulated and we invite submissions for each individual benchmark! Check out here for more information.

2021.02.26 TDC updates to 0.1.7, major changes:
- Streamlined leaderboard programming framework! Check out here for more information.
- Label log transformation supported. Check out here for more information.
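
Label log transformation is commonly used to compress heavily skewed labels such as binding affinities. A minimal sketch, assuming affinities are given in nM and are converted to the negative log10 of the molar concentration (often written pKd or pIC50); the helper names are hypothetical, and this is not necessarily the exact formula TDC applies.

```python
import math


def nm_to_p(value_nm):
    """Convert a concentration in nM to -log10 of its molar value."""
    if value_nm <= 0:
        raise ValueError("concentration must be positive")
    # -log10(value_nm * 1e-9) == 9 - log10(value_nm)
    return 9.0 - math.log10(value_nm)


def p_to_nm(p_value):
    """Inverse transform: recover the concentration in nM."""
    return 10.0 ** (9.0 - p_value)


# Affinities spanning four orders of magnitude map to a narrow range.
for kd_nm in (1.0, 100.0, 10000.0):
    print(kd_nm, "nM ->", nm_to_p(kd_nm))
```

After the transform, a tenfold change in concentration corresponds to a change of one unit in the label, which tends to make regression losses less dominated by a few very large values.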

2021.02.18 TDC just released the white paper on arXiv! Here is the link to the paper.

2021.02.04 TDC updates to 0.1.6, major changes:
- New Leaderboard! Just released the second leaderboard on drug combination response prediction! Check out here for usage.

2021.01.16 TDC updates to 0.1.5, major changes:
- New Oracles! Added four realistic oracles from docking scores and synthetic accessibility scores! Check out here for usage.

2021.01.09 TDC updates to 0.1.4, major changes:
- New Function! Added a data processing helper to map among ~15 molecular formats in 2 lines of code (for 2D: from SMILES/SELFIES to SELFIES/SMILES, Graph2D, PyG, DGL, ECFP2-6, MACCS, Daylight, RDKit2D, Morgan, PubChem; for 3D: from XYZ and SDF files to Graph3D, Coulomb Matrix). Check out here for usage.
- Quality Check! Canonicalized SMILES on DTI datasets, with Drug and Target IDs added. Check out DTI.

2020.12.30 TDC updates to 0.1.3, major changes:
- New Dataset! Added a new therapeutic task, CRISPR Repair Outcome Prediction! Check out CRISPROutcome.
- New Function! Added a data processing helper to map SMILES strings to popular cheminformatics fingerprints (ECFP2, ECFP4, ECFP6, MACCS, Daylight-type, RDKit2D, Morgan, PubChem)! Check out here for usage.

2020.12.24 TDC updates to 0.1.2, major changes:
- Leaderboard Release! TDC's first leaderboard on ADMET prediction is released. You can find the leaderboard guide here, where we provide a BenchmarkGroup class for rapid model building on leaderboard tasks. The ADMET leaderboard is here.

2020.12.19 TDC updates to 0.1.1, major changes:
- Quality Check and New Datasets! We replaced the VD, Half Life, and Clearance datasets in ADME with ones from new, higher-quality sources. We also added LD50 to Tox.

2020.12.15 TDC updates to 0.1.0, major changes:
- Five New Datasets! Added CYP2C9/2D6/3A4 Substrate for ADME, Carcinogens for Tox, and NCI-60 for DrugSyn.
- Quality Check. We canonicalized all SMILES and removed those that return errors in the ADME, Tox, and HTS datasets.

2020.11.30 TDC updates to 0.0.8, major changes:
- Five New Datasets! Added hERG, DILI (Drug Induced Liver Injury), Skin Reaction, and Ames Mutagenicity for Tox, and PPBR from AstraZeneca for ADME.
- Distribution Learning Metrics Moved to Evaluators. Check out here for the updated usage.
- Meta Oracles. We included a helper function where you can specify your own set of molecules for Rediscovery, Similarity, Medians, and Isomers. Check out an example usage here.
- Tutorials. We have provided various tutorials for you to start using TDC. Click here.