Published with permission from Arvinas.
Applying multitask learning to the linkerology data across 8 protein targets.
- Comparing results w/ Random Forest, XGBoost, and single-target neural-net, binary and regression.
- Comparing ECFP from parent PROTAC structures with ECFP from component PROTAC structures. Where component refers to a bit_vector_len*3 matrix with rows: (target_ligand, linker, e3_ligand).
- Comparing multitask vs. single-task vs. single-task w/ pre-loaded parameters from trained multitask model (with shared layer parameters either frozen or not).
80/20 split, where the top 20% newest compounds are placed in the validation set to enchance the robustness of future predictions. Split was performed independently for each target.
Due to a heavy skew towards active compounds, a relatively low 40nM threshold was chosen.
Use of kernel density estimation for "auto-thresholding" was explored, but could not produce useful thresholds for most of the targets.
Trained a XGBoost model and a random forest model for each target independently using grid-search and 3-fold CV. The models were not trained with AR data since it is entirely active.
Contains datastructures and functions for data processing, model architecture, training routines, and visualization. Create a virtual-env and install package from within this repository root directory using pip install -e linkerology_multitask. "linkerology_multitask" would be replaced w/ the path to the package directory if not installing within the repo root. "-e" flag installs in editable mode so that changes can be made within the package and directly noticed by the virtual-env. Importing linkerology_multitask may require installation of additional packages to the virtual-env.
Module within linkerology_multitask for building ML numpy datasets from data in DATA/raw into DATA/processed.
Module within linkerology_multitask for machine learning model architecure using PyTorch. BinaryMultitask using Sigmoid output, MulticlassMultitask uses Softmax output (for binary classification), and RegMultitask uses nn.Linear (node:1) output for regression. BinaryMultitask is used for linkerology_multitask.ml_training.MultitaskTrainerBinary rather and MulticlassMultitask is not used.
Module within linkerology_multitask for training the machine learning models. MultitaskTrainerBinary and MultitaskTrainerReg inherit from _Trainer. Hyperparameters are modifiable here within their constructors (or as properties after instantiation). The trainers will track metrics such as: accuracy, F1, R squared over epochs. They will also save the models best weights by taking the minimum of the loss over epochs.