GAE-DGL

A Graph Auto-Encoder (GAE) [1] implemented with DGL by Shion Honda.
The official implementation by the authors is here (TensorFlow, Python 2.7).
Unlike other implementations, this repository also supports inductive tasks on molecular graphs (ZINC-250k), which demonstrates the power of graph representation learning with GAE.

Installation

Prerequisites

At minimum you need PyTorch and DGL. The remaining packages are required only for the inductive setting with molecular graphs.

PyTorch
Deep Graph Library (DGL)
RDKit
dill
tqdm

Usage

Transductive tasks ( 🚧 under development 🚧 )

Reproduce the results of the paper [1] with the following command.

$ python train_transductive.py --dataset cora

To switch datasets, pass one of cora, citeseer, or pubmed to the --dataset option.
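
For orientation, the non-variational GAE described in [1] can be written as below. This is a summary of the paper's formulation, not a description of what train_transductive.py necessarily does. The symbols ($X$ for node features, $A$ for the adjacency matrix with self-loops, $D$ for its degree matrix, $W_0$ and $W_1$ for the layer weights) follow the paper rather than this repository.

```latex
\begin{align}
  \tilde{A} &= D^{-1/2} A D^{-1/2} \\
  Z &= \mathrm{GCN}(X, A) = \tilde{A}\,\mathrm{ReLU}\!\left(\tilde{A} X W_0\right) W_1 \\
  \hat{A} &= \sigma\!\left(Z Z^{\top}\right)
\end{align}
```

The encoder is a two-layer GCN that produces node embeddings $Z$, and the inner-product decoder reconstructs the adjacency matrix from them. Training minimizes the reconstruction error between $\hat{A}$ and $A$.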

Inductive tasks

This repository supports learning graph representations of molecules from the ZINC-250k dataset (or any unlabeled SMILES dataset). Run pre-training with the following commands.

$ python prepare_data.py # download and preprocess zinc dataset
$ python train_inductive.py --hidden_dims 32 16 # pre-train GAE

ZINC-250k is a subset of the ZINC dataset and is easy to obtain, for example through Chainer Chemistry.
Interestingly, I found that GAE also works in inductive settings, even though these were not tested in the original paper [1].
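
One way to see why a GAE can be applied inductively: the encoder weights depend only on the feature dimension, not on the number of nodes, so the same weights can embed graphs of any size. The following is a minimal, dependency-free sketch of that idea. It is not the code in train_inductive.py; the function names, toy weights, and toy graphs are all hypothetical, and a real run would use PyTorch and DGL instead of nested lists.

```python
import math


def normalize_adjacency(adj):
    """Return D^{-1/2} (A + I) D^{-1/2} for a dense adjacency matrix without self-loops."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]  # always >= 1 because of the self-loop
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]


def matmul(a, b):
    """Plain dense matrix product of two lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def relu(m):
    return [[max(0.0, v) for v in row] for row in m]


def encode(adj, feats, w0, w1):
    """Two-layer GCN encoder: A_norm * ReLU(A_norm * X * W0) * W1."""
    a_norm = normalize_adjacency(adj)
    hidden = relu(matmul(a_norm, matmul(feats, w0)))
    return matmul(a_norm, matmul(hidden, w1))


def decode(z):
    """Inner-product decoder: sigmoid(Z Z^T), one edge probability per node pair."""
    z_t = [list(col) for col in zip(*z)]
    logits = matmul(z, z_t)
    return [[1.0 / (1.0 + math.exp(-v)) for v in row] for row in logits]


# Toy weights (hypothetical values): 2 input features -> 3 hidden -> 2 latent.
W0 = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]]
W1 = [[0.2, -0.1], [0.4, 0.3], [-0.6, 0.7]]

# Two graphs of different sizes share the same weights.
path3 = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
x3 = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
star4 = [[0, 1, 1, 1], [1, 0, 0, 0], [1, 0, 0, 0], [1, 0, 0, 0]]
x4 = [[0.0, 1.0], [1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]

z3 = encode(path3, x3, W0, W1)  # 3 nodes, 2 latent dims each
z4 = encode(star4, x4, W0, W1)  # 4 nodes, 2 latent dims each
a3 = decode(z3)                 # 3 x 3 reconstructed adjacency
```

Because nothing in `W0` or `W1` refers to the node count, weights trained on one set of molecules can be reused on unseen molecules.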

Potential Application to Chemistry

Are the features learned through pre-training really useful for predicting molecular properties? Let's check with a simple example. Here I use the ESOL (solubility regression) dataset from [2], which can be downloaded here.

| Feature + Model | RMSE | R2 |
| --- | --- | --- |
| GAE + Ridge | 1.813 | 0.585 |
| GAE + MLP | 1.216 | 0.732 |
| GAE + Random Forest | 1.424 | 0.688 |
| ECFP + Ridge | 2.271 | 0.480 |
| ECFP + MLP | 2.034 | 0.549 |
| ECFP + Random Forest | 1.668 | 0.643 |
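
For readers unfamiliar with the two metrics, here is a small sketch of how RMSE and R2 are commonly defined. It assumes R2 means the coefficient of determination; the README does not state which definition or evaluation protocol produced the numbers above, so this may differ from how they were computed. The function names are illustrative.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error: lower is better, same unit as the target."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 is perfect, 0 matches predicting the mean."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot
```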

ECFP is a hash-based binary molecular fingerprint ($D=1024$) and the most common baseline feature.
The GAE feature is a concatenation of the mean, sum, and max aggregations of the hidden vectors $\mathbf{H} \in \mathbb{R}^{N\times 16}$ over the $N$ nodes, so its dimension is $3 \times 16 = 48$.
GAE outperforms ECFP in combination with each of the three regression models: Ridge, multi-layer perceptron (MLP), and Random Forest.
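
The readout described above can be sketched as follows. This is a dependency-free illustration, not the repository's code: the function name `readout` and the list-of-lists input are assumptions, and the concatenation order (mean, sum, max) simply follows the sentence above.

```python
def readout(h):
    """Pool per-node hidden vectors into one fixed-size graph feature.

    h is a list of N node vectors, each of length d.
    Returns a list of length 3 * d: mean, then sum, then max over the nodes.
    """
    n = len(h)
    columns = list(zip(*h))  # one tuple per hidden dimension
    total = [sum(col) for col in columns]
    mean = [s / n for s in total]
    maximum = [max(col) for col in columns]
    return mean + total + maximum


# A molecule with 5 atoms and 16 hidden dimensions gives a 48-dimensional feature.
hidden = [[float(i + j) for j in range(16)] for i in range(5)]
feature = readout(hidden)
```

Since all three aggregations collapse the node axis, molecules with different atom counts map to vectors of the same length, which is what lets them feed a standard regressor.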

References

[1] Thomas N. Kipf and Max Welling. "Variational Graph Auto-Encoders." NIPS Workshop on Bayesian Deep Learning. 2016.

[2] Zhenqin Wu, Bharath Ramsundar, Evan N. Feinberg, Joseph Gomes, Caleb Geniesse, Aneesh S. Pappu, Karl Leswing, and Vijay Pande. "MoleculeNet: A Benchmark for Molecular Machine Learning." Chemical Science. 2018.