A repo of the most seminal applications of geometric deep learning to toxicity prediction tasks.
The most comprehensive, professionally curated resource on geometric deep learning applied to toxicity prediction tasks, including the best tutorials, videos, books, papers, articles, courses, websites, conferences and open-source libraries. Since predictive toxicology is a niche field, many papers come from neighboring fields such as drug discovery or plain deep learning and are simply applied to toxicology datasets.
The papers are listed in chronological order, noting the advancements along the way.
Disclaimer: All the images are sourced from the resources I linked.
- Tutorials
- Papers
- Articles/Blogs
- Repositories
- Videos
- Tools
- What is a graph?
- Graph Neural Networks
- Graph Convolutions
Paper | Author | Year | Github | Comments | Datasets |
---|---|---|---|---|---|
General interest | |||||
Gated Graph Sequence Neural Networks | Li et al. | 2015 | Github | Gated recurrent (GRU-style) updates on graphs | Graph algorithm tasks |
Semi-Supervised Classification with Graph Convolutional Networks | Kipf and Welling | 2016 | Github | The most influential GCN🔥🔥🔥 | Citation datasets |
Graph Attention Networks | Velickovic et al. | 2017 | Github | Introduced attention to GNNs. Not implemented inductively though.🔥🔥🔥 | Citation and PPI |
Inductive Representation Learning on Large Graphs | Hamilton et al. | 2017 | Github | The inductive variant of Kipf's GCN along with different aggregator functions | Citation and PPI |
Geom-GCN: Geometric Graph Convolutional Networks | Pei et al. | 2020 | Github | Transductive model including geometric info | Citation networks |
Pooling? | |||||
Hierarchical Graph Representation Learning with Differentiable Pooling | Ying et al. | 2018 | Github | Hierarchical pooling performs better than global mean/sum pooling and SortPooling | ENZYMES, PROTEINS, REDDIT, COLLAB |
Molecular property/activity/toxicity prediction | |||||
Convolutional Networks on Graphs for Learning Molecular Fingerprints | Duvenaud et al. | 2015 | Github | Aligned the notion of graph embedding to molecular fingerprints | Solubility, Drug efficacy |
Molecular Graph Convolutions: Moving Beyond Fingerprints | Kearnes et al. | 2016 | DeepChem | Introduced edge features. Used weave convolutions and noticed that complex atom/bond featurizations do not enhance the model | PCBA,MUV,Tox21 |
Convolutional Embedding of Attributed Molecular Graphs for Physical Property Prediction | Coley et al. | 2017 | Github | Atom features in the graph resemble ECFP; bond features are not updated. Not an improvement over the Tox21 Challenge winner | Solubility, Tox21 |
Neural Message Passing for Quantum Chemistry | Gilmer et al. | 2017 | Github | Introduced the concept of Message passing networks. Also, tried to encode spatial info about the graph by distance bins, super nodes and virtual edges and resembled model ensembling by the concept of multiple towers🔥🔥🔥 | QM9 |
Learning Graph-Level Representation for Drug Discovery | Li et al. | 2017 | Github | Introduced a dummy super node connected to all the other nodes to learn global features | Tox21 (0.76 scaffold), ToxCast, HIV, PCBA, MUV |
PotentialNet for Molecular Property Prediction | Feinberg et al. | 2018 | DGL-lifesci | Used another type of split, called agglomerative, based on the pairwise similarity of every ligand-protein pair; unknown split on Tox21 | PDBBind, QM8, Tox21 (0.856) |
Adaptive Graph Convolutional Neural Networks | Li et al. | 2018 | Github | A successful spectral graph model | Tox21, ClinTox, ToxCast |
Graph classification using structural attention | Lee et al. | 2018 | Github | Improved attention | HIV,NCI |
Chemi-Net: A Molecular Graph Convolutional Network for Accurate Drug Property Prediction | Liu et al. | 2019 | None | Predicting ADME properties with a multitask GCN. Surpassed Cubist, a well-known tool, by far | ADME |
Analyzing Learned Molecular Representations for Property Prediction | Yang et al. | 2019 | Chemprop | The directed edges reduce noise. They also added global features computed with RDKit. SOTA results on Tox21, but above all an excellent GitHub resource | Most molecular datasets🔥🔥🔥🔥🔥🔥🔥🔥🔥 |
Strategies for Pre-training Graph Neural Networks | Hu et al. | 2019 | DGL-lifesci | Combining node-wise (context prediction and attribute masking) and graph-wise supervised pretraining does not cause negative transfer🔥🔥🔥 | ChEMBL, ZINC, Tox21, ToxCast etc. |
Pushing the Boundaries of Molecular Representation for Drug Discovery with the Graph Attention Mechanism | Xiong et al. | 2020 | DeepChem | AttentiveFP was a significant improvement over previous models. They added chirality to the atom features and stereochemistry to the bond features. Also, they used a GRU as the readout. Check the aromaticity pretraining task | Most molecular datasets |
Multi-View Graph Neural Networks for Molecular Property Prediction | Ma et al. | 2020 | None | The graph is seen in two ways, edge-central and node-central, and cross-dependent message passing further enhances the model. Check for interpretability (Tox21 scaffold 0.836)🔥🔥🔥 | Most molecular datasets |
Communicative Representation Learning on Attributed Molecular Graphs | Song et al. | 2021 | Github | D-MPNN but with a communicative function to boost the edge messages🔥🔥🔥 | Same as Gilmer DMPNN |
Graph Contrastive learning | |||||
Knowledge graph-enhanced molecular contrastive learning with functional prompt | Y. Fang et al. | 2023 | | KNOWLEDGE GRAPH enhanced pretraining; SOTA but the KG is expensive🔥🔥🔥 | Most molecular datasets (Tox21=0.837) |
MoCL: Data-driven Molecular Fingerprint via Knowledge-aware Contrastive Learning from Molecular Graph | Sun et al. | 2021 | | Local-level contrastive learning by bioisostere substitution, and global-level maximization of the similarity between ECFPs and graph embeddings. NEGATIVE TRANSFER on tox datasets | BACE, BBBP, Tox21, ToxCast |
Molecular contrastive learning of representations via graph neural networks | Wang et al. | 2022 | | Best contrastive approach so far, augmenting by subgraph removal | Most molecular datasets (Tox21=0.8) |
Pretraining-Transfer Learning | |||||
Geometry-enhanced molecular representation learning for property prediction | Fang et al. | 2022 | Github | Pre-training on geometry information improves downstream task performance🔥🔥🔥 | Most molecular datasets |
Pre-training Molecular Graph Representation with 3D Geometry | Liu et al. | 2022 | | Contrast 2D with 3D or generate 3D from 2D; NOT IMPROVED | Most molecular datasets |
Multi-modal | |||||
Dual-view Molecule Pre-training | Zhu et al. | 2021 | | SMILES transformer and GNN node masking as pre-training; dual-view consistency loss. Less than 0.8 on Tox21 | PubChem 10M |
Molecule Property Prediction Based on Spatial Graph Embedding | Wang et al. | 2018 | Github | 1D-convolutions on each atom's features using skip connections + fingerprints | ESOL,lipophilicity,PDBBind |
Data-centric AI | |||||
ASGN: An Active Semi-supervised Graph Neural Network for Molecular Property Prediction | Hao et al. | 2020 | Github | Active learning surpassed InfoGraph, Mean Teacher and self-supervised learning, but is still expensive | QM9, OPV |
Chemical toxicity prediction based on semi-supervised learning and graph convolutional neural network | Chen et al. | 2021 | Github | Mean-Teacher mediocre results | Tox21,QM9,ZINC |
Low Data Drug Discovery with One-Shot Learning | Altae-Tran et al. | 2017 | DeepChem | One-shot learning in graph classification tasks paired with LSTM updates | Tox21(0.827),SIDER,MUV |
Graph Generation/ De novo molecule design | |||||
Junction Tree Variational Autoencoder for Molecular Graph Generation | Jin et al. | 2018 | Github | JTVAE generating molecules by first creating scaffolds | ZINC |
MolGAN: An implicit generative model for small molecular graphs | De Cao and Kipf | 2018 | Github | GANs and RL combined gave valid, novel but not unique molecules | QM9 |
Graph Convolutional Policy Network for Goal-Directed Molecular Graph Generation | You et al. | 2018 | Github | GCPN uses adversarial training and RL; better than JT-VAE but not compared to MolGAN | ZINC |
MoFlow: An Invertible Flow Model for Generating Molecular Graphs | Zang et al. | 2020 | Github | Best results compared to other graph-generation models | QM9, ZINC |
GraphAF: a Flow-based Autoregressive Model for Molecular Graph Generation | Shi et al. | 2019 | Github | One of the first flow-based approaches | ZINC |
Variational Graph Autoencoders | |||||
Variational Graph Auto-Encoders | Kipf and Welling | 2016 | Github | The first GVAE on link prediction tasks | Citation |
Constrained Graph Variational Autoencoders for Molecule Design | Liu et al. | 2018 | Github | Novel unique and valid molecules🔥🔥🔥 | |
Constrained Generation of Semantically Valid Graphs via Regularizing Variational Autoencoders | Ma et al. | 2018 | None | Mediocre results in graph generation | QM9,ZINC |
Graph Unet | |||||
Graph U-Nets | Gao and Li | 2018 | none | com | |
Graph Transformers | |||||
Self-Supervised Graph Transformer on Large-Scale Molecular Data | Rong et al. | 2020 | Github | Dynamic MPNN: the number of hops is random | ChEMBL, ZINC |
Graph Transformer Networks | Yun et al. | 2019 | | An improvement over GAT to learn better node-level representations | IMDB and citation datasets |
Graph Explainability | |||||
GNNExplainer: Generating Explanations for Graph Neural Networks | Ying et al. | 2019 | | A model-agnostic, single-instance, post-hoc explanation method that extracts subgraphs | MUTAG, REDDIT |
Reinforced Causal Explainer for Graph Neural Networks | Wang et al. | 2022 | | Frames the explanation task as a sequential decision process | MUTAG, REDDIT, Genome |
Reviews | |||||
Graph convolutional networks: a comprehensive review | Zhang et al. | 2019 | none | com | |
How Powerful are Graph Neural Networks? | Xu et al. | 2019 | none | com | |
Does GNN Pretraining Help Molecular Representation? | Sun et al. | 2022 | | MUST READ: ablation studies🔥🔥🔥 | |
Graph convolutional networks for computational drug development and discovery | Sun et al. | 2020 | none | com | |
Graph neural networks: A review of methods and applications | Zhou et al. | 2020 | none | com | |
A compact review of molecular property prediction with graph neural networks | Wieder et al. | 2020 | none | 🔥🔥🔥 | |
Could graph neural networks learn better molecular representation for drug discovery? A comparison study of descriptor-based and graph-based models | Jiang et al. | 2021 | none | 🔥🔥 |
- Understanding GNNs 🔥 🔥 🔥
- Introduction to Graph Neural Networks
- Graph Convolutional Networks
- An Introduction to Graph Neural Networks: Models and Applications by Miltos Allamanis (Microsoft Research)🔥🔥🔥
- Intro to graph neural networks (ML Tech Talks) by Petar Velickovic (DeepMind)🔥🔥🔥
- Understanding Graph Neural Networks by DeepFindr
- The AI EPiphany by Gordic Aleksa
- DGL-lifesci: DGL-LifeSci is a python package for applying graph neural networks to various tasks in chemistry and biology.🔥🔥🔥
- Pytorch-Geometric: PyG is a library to easily train Graph Neural Networks (GNNs) for a wide range of applications related to structured data.
- Dive Into Graphs: DIG provides a unified testbed for higher-level, research-oriented graph deep learning tasks, such as graph generation, self-supervised learning, explainability, and 3D graphs.
- RDKit: A cheminformatics library for generating/calculating molecular descriptors and fingerprints and handling molecules
- MoleculeNet: A library for benchmarking ML models across different molecular tasks
- DeepChem: A toolkit which includes a lot of different models and datasets with relevant tutorials for a gentle introduction to molecular ML (see the sketch below).
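To make the tools above concrete, here is a minimal sketch of loading the Tox21 benchmark through DeepChem/MoleculeNet and training a plain graph-convolution baseline. The hyperparameters are illustrative only, and the exact function names may differ slightly between DeepChem versions.

```python
import numpy as np
import deepchem as dc

# Load Tox21 with a graph featurization suitable for graph-convolution models.
tasks, (train, valid, test), transformers = dc.molnet.load_tox21(featurizer="GraphConv")

# A plain multitask graph-convolution classifier; hyperparameters are illustrative.
model = dc.models.GraphConvModel(n_tasks=len(tasks), mode="classification")
model.fit(train, nb_epoch=10)

# Tox21 results are usually reported as mean ROC-AUC over the 12 tasks.
metric = dc.metrics.Metric(dc.metrics.roc_auc_score, np.mean)
print(model.evaluate(valid, [metric], transformers))
```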
Toxicology is the study of the adverse effects of chemicals or physical agents on living organisms.
Toxic effects can be grouped by the scale of the damage they cause: to the cell (cytotoxicity), to an organ (e.g. hepatotoxicity) or to the whole organism (mutagenicity, genotoxicity). Any given chemical has to undergo a rigorous, expensive and time-consuming toxicity assessment. The field of computational toxicology tries to alleviate this burden by building QSAR (quantitative structure-activity relationship) models that associate a structure with a specific toxic effect. For many years that was the task of experienced chemists who knew which fragments of a molecule are potentially toxic and built models based on these so-called structural alerts: if such a substructure was identified as part of a molecule, that molecule had a higher probability of being toxic.

Over the last two decades there has been a lot of progress in representing molecules in a machine-readable way. Among the most popular representations are SMILES strings, molecular descriptors and molecular fingerprints. For the computer, toxicity is just a dataset in which each chemical carries a label of 1 or 0, encoding toxic or non-toxic. Over the last fifteen years, several datasets with such mappings, based on lab experiments, have been developed; MoleculeNet gathered them as a benchmark suite for ML models. Based on these representations and coupled with ML and DL models, we achieved great results. In the last few years, graph neural networks have come to dominate the research interest of the field: molecules can be intuitively described as graphs, and that insight lit the spark for the development of this field and is the reason for this repo.
A graph G(V, E) is a set of nodes (vertices) V and edges E between them. Molecules can be intuitively seen as graphs where the nodes are the atoms and the edges are the bonds between them.
How do we use graphs as input, though?
The graph can be represented essentially by three matrices (a small code sketch follows the list):
- The adjacency matrix, which shows how the nodes (atoms) are connected
- The node features matrix, which encodes information about every node (atom)
- The edge features matrix, which encodes information about every edge (bond)
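As a concrete illustration, here is a minimal sketch (assuming RDKit is installed) of turning a SMILES string into these three matrices. The atom and bond features chosen here are toy examples, not the featurization of any particular paper.

```python
import numpy as np
from rdkit import Chem

def molecule_to_matrices(smiles: str):
    """Convert a SMILES string into adjacency, node-feature and edge-feature matrices."""
    mol = Chem.MolFromSmiles(smiles)
    n = mol.GetNumAtoms()

    # Adjacency matrix: 1 where two atoms share a bond.
    adjacency = np.zeros((n, n), dtype=np.float32)
    # Node features: here just atomic number and degree, as a toy example.
    node_features = np.array(
        [[atom.GetAtomicNum(), atom.GetDegree()] for atom in mol.GetAtoms()],
        dtype=np.float32,
    )
    # Edge features: one row per bond, here just the bond order.
    edge_features = []
    for bond in mol.GetBonds():
        i, j = bond.GetBeginAtomIdx(), bond.GetEndAtomIdx()
        adjacency[i, j] = adjacency[j, i] = 1.0
        edge_features.append([bond.GetBondTypeAsDouble()])
    return adjacency, node_features, np.array(edge_features, dtype=np.float32)

A, X, E = molecule_to_matrices("CCO")  # ethanol: 3 heavy atoms, 2 bonds
print(A.shape, X.shape, E.shape)       # (3, 3) (3, 2) (2, 1)
```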
GNNs are a type of neural network that operates on these graphs.
There are two ways to develop GNNs: spectrally and spatially.
Both try to generalize the mathematical concept of convolution to graphs. The spectral methods stick to the strict mathematical notion and resort to the frequency domain (Laplacian eigenvectors). Being computationally expensive and not applicable to inductive scenarios, they eventually died out. The spatial methods are the ones now known as graph convolutions, and they are the ones we are going to analyse further. If you still want a basic understanding of spectral methods, you can consult the links below.
Oops, I mentioned inductive without even explaining it. The image speaks for itself.
Inductive learning: This type of learning is like the usual supervised learning: the model has not seen the nodes/graphs that it will later classify. This applies to graph-classification tasks, which are our main interest for molecular property prediction.
Transductive learning: In transductive learning, the model has seen the nodes (without their labels and/or some of their features) and gets an understanding of how they are connected within the graph. This is useful mainly for node-classification tasks.
Normal Convolutions
A typical feed-forward network does a forward pass with the following equation:
- Y = σ(W*X + β),
where σ is a non-linear function (ReLU, tanh), W is the weight associated with each feature, X is the feature vector and β is the bias.
In convolutional neural networks, the input is usually an image (i.e. a tensor of shape height × width × channels).
- An RGB image has three channels, whereas a greyscale image has only one.
- In CNNs, W is called a filter or kernel and is usually a small matrix (2x2, 3x3 etc.) that is passed across the image to extract features (patterns) from every part of it.
- That is called weight sharing. It is done because a pattern is interesting wherever it appears in the image (translational invariance).
The question became: how can we generalize convolutions to graphs?
There are some significant differences between images and graphs.
- Images live in a Euclidean space and thus have a notion of locality: pixels that are close to each other are much more strongly related than distant ones. Graphs, on the other hand, do not, as information about the distance between nodes is not encoded.
- Pixels follow an order while graph nodes do not. So, locality in graphs is achieved through neighborhoods. We also adopt weight sharing from normal convolutions.
Invariance
Order invariance is achieved by applying functions that are order invariant. A permutation matrix P is a matrix that only changes the order of the rows of another matrix. So, for every P, the following equation should hold:
- f(PX) = f(X)
Equivariance
But if we want information at the node level, an invariant function does not suffice. Instead, we need a permutation-equivariant function, one that does not mix up the node order and satisfies the following equation:
- f(PX) = Pf(X)
We can think of these functions f as transforming the features x_i of a node into a latent vector h_i:
h_i = f(x_i)
Stacking these results in H = f(X).
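A tiny NumPy check of the two properties, using a sum over nodes as an invariant function and a node-wise transformation as an equivariant one (purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))          # 4 nodes, 3 features each
P = np.eye(4)[[2, 0, 3, 1]]          # a permutation matrix reordering the nodes

# Invariant function: summing over nodes ignores their order, so f(PX) = f(X).
f_inv = lambda X: X.sum(axis=0)
assert np.allclose(f_inv(P @ X), f_inv(X))

# Equivariant function: a node-wise transform commutes with the permutation, f(PX) = P f(X).
W = rng.normal(size=(3, 3))
f_eq = lambda X: np.tanh(X @ W)
assert np.allclose(f_eq(P @ X), P @ f_eq(X))
```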
How can we use these latent vectors?
But hold on...
How do we incorporate the adjacency matrix A into this equation?
A simple update rule:
H^(k+1) = σ(A W H^(k)), where A is the adjacency matrix, k is the number of iterations, and we dropped β for simplicity.
Hopefully the similarities with the classical equation are obvious.
Node-wise, the equation is written:
h_i = Σ_j (W h_j),
where j runs over every neighbor of node i.
Let's see it in practice:
Considering this adjacency matrix, when we update the state of node v1 we take into account the states of its neighbors. That alone, however, would be wrong, as we would entirely drop the previous state of node v1 itself. So, we need to correct the adjacency matrix A by adding the identity matrix, creating the matrix Ã. That adds 1s across the diagonal, making each node a neighbor of itself, i.e. we add self-loops.
Each latent vector of a node is a sum of the vectors of its neighbors. So, if the degree of a node (the degree is the number of neighbors a node has) is very high, the scale of the latent vector will be entirely different and we will face vanishing or exploding gradients.
- So, we should normalize based on the degree of each node. First, we calculate the degree matrix D by summing the adjacency matrix Ã row-wise.
Then we invert it, and the equation takes the form:
H^(k+1) = σ(Ã D^(-1) W^(k) H^(k))
WE DID IT! We now have the first equation upon which we can build our different variants of graph convolutions.
This equation essentially describes a simple averaging of the neighbors' vectors. This update of the node states happens for k steps; on each step, or neighborhood hop, you aggregate the vectors of the neighbors. Once we have the latent vectors for each node after k steps, we can use them for node classification or, in our case, aggregate them to reach a single embedding for every graph. A small code sketch of this update follows.
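Here is a minimal NumPy sketch of this averaging update. Note that the code stores node vectors as rows (H is n × d), so the product is written D^(-1) Ã H W rather than the Ã D^(-1) W H form above; the operation is the same.

```python
import numpy as np

def gcn_mean_update(A: np.ndarray, H: np.ndarray, W: np.ndarray) -> np.ndarray:
    """One mean-aggregation graph-convolution step.

    A: (n, n) adjacency matrix, H: (n, d) node features (one row per node),
    W: (d, d_out) weight matrix.
    """
    A_tilde = A + np.eye(A.shape[0])            # add self-loops
    D_inv = np.diag(1.0 / A_tilde.sum(axis=1))  # inverse degree matrix
    return np.maximum(0.0, D_inv @ A_tilde @ H @ W)  # ReLU non-linearity

# Toy graph: 3 nodes in a path 0-1-2, 2 features per node.
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
H = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
W = np.random.default_rng(0).normal(size=(2, 4))
print(gcn_mean_update(A, H, W).shape)  # (3, 4)
```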
- GCN
The best-known variant of graph convolutions was introduced by Kipf & Welling in 2017. They use a renormalization trick that is more than a mere average of the neighbors: they normalize by 1/√(d_i · d_j).
H^(k+1) = σ(D^(-1/2) Ã D^(-1/2) W^(k) H^(k))
From now on, we'll refer to it as the GCN.
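Since PyTorch Geometric's GCNConv layer implements exactly this symmetric normalization (including the self-loops), a toy usage sketch looks like this; shapes and values are arbitrary examples.

```python
import torch
from torch_geometric.nn import GCNConv

# Same 3-node path graph as above, expressed as a PyG edge_index (both directions).
edge_index = torch.tensor([[0, 1, 1, 2], [1, 0, 2, 1]], dtype=torch.long)
x = torch.tensor([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

conv = GCNConv(in_channels=2, out_channels=4)  # applies D^(-1/2) Ã D^(-1/2) internally
out = conv(x, edge_index)
print(out.shape)  # torch.Size([3, 4])
```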
- GAT(Graph Attention Networks)
Petar Velickovic had another idea: instead of giving an equal weight to every neighbor, the weight of each neighbor is learned explicitly, a concept called attention. So the node-wise equation now becomes:
h_i^(k+1) = σ( Σ_j a_ij W h_j^(k) )
The coefficients a_ij come from applying a softmax over e_ij = a(h_i, h_j), the non-normalized attention coefficients computed across pairs of neighboring nodes.
Influenced by the results of Vaswani et al., they included a multi-head attention mechanism, which is essentially K replicates of the attention computation that are then concatenated or averaged. The following figure from the paper makes it abundantly clear.
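A corresponding toy sketch with PyTorch Geometric's GATConv, using four attention heads whose outputs are concatenated (values are arbitrary examples):

```python
import torch
from torch_geometric.nn import GATConv

edge_index = torch.tensor([[0, 1, 1, 2], [1, 0, 2, 1]], dtype=torch.long)
x = torch.tensor([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

# 4 attention heads whose outputs are concatenated -> 4 * 8 = 32 output features.
conv = GATConv(in_channels=2, out_channels=8, heads=4, concat=True)
out = conv(x, edge_index)
print(out.shape)  # torch.Size([3, 32])
```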
The term message passing arose in 2017 and is a really intuitive way to see graph neural nets. The two main points revolve around the two functions that operate in a GNN:
- The Update function, q
- The Aggregate function, U
From this YouTube video we can sum them up with the figure.
Essentially, we concatenate the previous-step vector of the node in focus with the edge features and its neighbors' vectors. The resulting vectors are passed through the update function q and then aggregated by the function U. Finally, the result is passed through a non-linear function to get the new, updated representation.
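Most GNN libraries expose exactly these hooks. Below is a minimal sketch of a custom layer using PyTorch Geometric's MessagePassing base class (an illustrative toy layer, not a specific published model). Note that PyG's naming differs slightly from the figure: the per-neighbor transform is called message, the aggregation is chosen via aggr, and update is applied after aggregation.

```python
import torch
from torch_geometric.nn import MessagePassing

class SimpleMPNNLayer(MessagePassing):
    """Bare-bones message passing: message = W h_j, aggregation = sum, then a ReLU."""

    def __init__(self, in_channels: int, out_channels: int):
        super().__init__(aggr="add")          # the aggregation (here a sum over neighbors)
        self.lin = torch.nn.Linear(in_channels, out_channels)

    def forward(self, x, edge_index):
        return self.propagate(edge_index, x=x)

    def message(self, x_j):                   # message built from each neighbor j
        return self.lin(x_j)

    def update(self, aggr_out):               # applied after aggregation
        return torch.relu(aggr_out)

edge_index = torch.tensor([[0, 1, 1, 2], [1, 0, 2, 1]], dtype=torch.long)
x = torch.randn(3, 2)
print(SimpleMPNNLayer(2, 4)(x, edge_index).shape)  # torch.Size([3, 4])
```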
The previously described GCN and GAT can be described with a similar formalism, as shown in the following figures.
This article includes an interactive session to play around with graphs and the most essential GNN variants.