Notebooks and method as implemented in our manuscript #20

Merged 34 commits on May 8, 2024
Commits
a9c9405
updated gitignore for figure folder, added intel extension for SKlear…
akalikadien Apr 19, 2023
3fa35b3
Added pca to partially_seen_substrate and within_substrate_class
Apr 20, 2023
4c7ae73
Added pca to partially_seen_substrate and within_substrate_class
Apr 20, 2023
fa0ad46
Merge branch 'visualization_notebooks' of https://github.com/EPiCs-gr…
Apr 20, 2023
be10402
Added ligand ohe data file
Apr 21, 2023
e02dd45
updated gitignore to ignore figures, updated environment yaml to allo…
akalikadien Apr 24, 2023
91009ba
Added ohe for ligand closes #12, updated notebooks
Apr 27, 2023
259d317
added new approach to randomly sample subsets within same substrate c…
akalikadien Apr 28, 2023
57d47be
modified dataclass to save train_data, test_data, random_seed and ran…
akalikadien May 2, 2023
eb8979a
Modified notebook 4, added example of model
CValse May 4, 2023
e687e4c
added code for saving res_df to objective 2 notebook, fixed typos in …
akalikadien May 8, 2023
a139a80
added standard deviation for feature importance plots, added datafram…
akalikadien May 19, 2023
6e08fdd
Modified experimental data file, signs added to DDG
CValse Jun 9, 2023
fdffb1e
Merge branch 'main' of https://github.com/EPiCs-group/obelix-ml-pipel…
akalikadien Jun 12, 2023
48ea6c4
added new selection of descriptors for dft_nbd_model to representatio…
akalikadien Jun 13, 2023
6ad0416
fixed NBD position and added dihedral/pi-bond distances for L2, L74, …
akalikadien Jun 14, 2023
0bed524
added experimental response with sm123 after 1h
CValse Jun 20, 2023
cf59fc4
added predictions to train_data and test_data after ML, modified each…
akalikadien Jun 20, 2023
f8424a1
Updated ecfps for ligands
CValse Jun 22, 2023
00973a3
updated load_experimental_response to get 1H data for SM1,2,3 and 16H…
akalikadien Jun 26, 2023
3bd9e15
Merge branch 'visualization_notebooks' of https://github.com/EPiCs-gr…
akalikadien Jun 26, 2023
77f4656
updated NBO charges based on custom extraction function from obelix i…
akalikadien Aug 3, 2023
8b77216
fixed indexing of the NBD in L31 such that dihedral and pi-bond descr…
akalikadien Aug 11, 2023
4f89349
added free_ligand descriptors to dft_nbd_model, modified selection of…
akalikadien Sep 12, 2023
013c58f
updated clean_tud_set to take free_ligand descriptors into account, u…
akalikadien Oct 9, 2023
cbe9850
Update ligands_ecfp.csv
CValse Oct 11, 2023
02d1bd4
updated descriptor selection for DFT NBD model, sorted clean ligand d…
akalikadien Oct 16, 2023
8fdbf24
calculated quadrant/octant at 7A radius, updated preprocessing to cal…
akalikadien Oct 17, 2023
09612de
added processing and descriptors for structures that we have in commo…
akalikadien Oct 23, 2023
e2cfe03
incorporated latest critical bug fix https://github.com/EPiCs-group/ob…
akalikadien Oct 26, 2023
fbd1691
updated readme
akalikadien Mar 11, 2024
a328dd9
removed redundant ligand representations, notebooks and folders, upda…
akalikadien Mar 11, 2024
ba18751
tested building conda environment from scratch, updated environment.y…
akalikadien Mar 15, 2024
0044b25
modified data_classes and obj 1-4 functions to include target_thresho…
akalikadien May 8, 2024
7 changes: 6 additions & 1 deletion .gitignore
@@ -694,4 +694,9 @@ ipython_config.py
# Remove previous ipynb_checkpoints
# git rm -r .ipynb_checkpoints/

# End of https://www.toptal.com/developers/gitignore/api/jupyternotebooks
# End of https://www.toptal.com/developers/gitignore/api/jupyternotebooks

# ignore all figures generated by the notebooks
obelix_ml_pipeline/notebooks/figures/*/*
obelix_ml_pipeline/notebooks/figures/*/*/*
obelix_ml_pipeline/*.pkl
48 changes: 43 additions & 5 deletions README.md
@@ -1,5 +1,11 @@
# obelix-ml-pipeline
Data and code related to the ML pipeline used in the paper '...' introducing our [OBeLiX workflow](https://github.com/EPiCs-group/obelix).
This GitHub repo contains the ML pipeline used in the paper '...'. For the exact 'production' versions used to generate the
figures in the paper, visit ...

First, homogeneous catalyst structures are featurized using
our [OBeLiX workflow](https://github.com/EPiCs-group/obelix). Various other representations are created for these structures
and used as input for machine learning models. The models are trained to predict the enantioselectivity or conversion.


## Installation
First clone the repository:
@@ -25,25 +31,57 @@ Afterwards, install the package:
```bash
pip install -e .
```
Then the notebooks can be run to reproduce the results. In your anaconda prompt, run:
The notebooks can then be used to run examples of the pipeline. In your Anaconda prompt, run:

```bash
cd obelix-ml-pipeline/notebooks
jupyter notebook
```

## Repository structure
In the paper, four prediction tasks are performed: fully out-of-domain, partially out-of-domain, in-domain and
Monte Carlo in-domain. The functions for each task are kept in their own file. Example use cases
are shown in the notebooks. The ligand representations, substrate representations and experimental response are
loaded from the data folder. A more detailed description of the files is given below.
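
As a rough illustration of how the four tasks differ, the sketch below builds each train/test split from a flat list of records. It is not the repository's implementation: the record layout, the function names and the parameters (`seen_fraction`, `test_fraction`, `n_repeats`) are all assumptions made for this example, and a substrate identifier stands in for a substrate class.

```python
import random
from typing import List, Tuple

# Hypothetical record layout: (ligand_id, substrate_id, measured_response)
Record = Tuple[str, str, float]
Split = Tuple[List[Record], List[Record]]


def fully_out_of_domain(records: List[Record], test_substrate: str) -> Split:
    """Train on all other substrates; test on one substrate never seen in training."""
    train = [r for r in records if r[1] != test_substrate]
    test = [r for r in records if r[1] == test_substrate]
    return train, test


def partially_out_of_domain(
    records: List[Record], test_substrate: str, seen_fraction: float, seed: int
) -> Split:
    """As above, but a random fraction of the test substrate's records joins training."""
    train, test = fully_out_of_domain(records, test_substrate)
    shuffled = test[:]
    random.Random(seed).shuffle(shuffled)
    n_seen = int(len(shuffled) * seen_fraction)
    return train + shuffled[:n_seen], shuffled[n_seen:]


def in_domain(
    records: List[Record], substrate: str, test_fraction: float, seed: int
) -> Split:
    """Random train/test split that only uses records of a single substrate."""
    subset = [r for r in records if r[1] == substrate]
    random.Random(seed).shuffle(subset)
    n_test = int(len(subset) * test_fraction)
    return subset[n_test:], subset[:n_test]


def monte_carlo_in_domain(
    records: List[Record], substrate: str, test_fraction: float, n_repeats: int, seed: int
) -> List[Split]:
    """Repeat the in-domain split with a different seed for every repeat."""
    return [
        in_domain(records, substrate, test_fraction, seed + i) for i in range(n_repeats)
    ]
```

Passing the seed explicitly keeps every split reproducible, which matters when results from repeated random subsets are compared afterwards.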

## Data
**filename**: contains ...
* **Experimental response/**
Contains the experimental response for the different substrates and solvents.
* **jnjdata_sm12378_MeOH_16h**: contains the experimental response after 16 hours with one solvent.
* **jnjdata_sm12378_MeOH**: contains the experimental response at 1 hour for SM1/2/3 and at 16 hours for the other substrates, with one solvent.

* **Ligand representations/**
Contains the various representations of the ligands.
* **raw_data_processing**: contains the raw data and scripts needed to create each representation.

* **Substrate representations/**
Contains the various representations of the substrates.
* **raw_data_processing**: contains the raw data and scripts needed to create each representation.

## Code
**representation_variables.py**: contains a selection of features for representations of the ligand or substrate.
**representation_variables.py**: contains a selection of features for representations of the ligand or substrate.
When the file itself is run, it will create correlation plots for the features in the file.

**utilities.py**: contains functions for data loading and general utilities.

**load_representations.py**: contains functions for loading representations of the ligand or substrate.

**machine_learning.py**: contains functions for machine learning, training and testing.

**predictions_\*.py**: contains the 3 main use cases of the ML pipeline.
**data_classes.py:** contains classes for the data returned in the ML pipeline.

**predictions\_on\_unseen\_substrate.py**: contains the functions to perform the fully out-of-domain prediction task.

**predictions\_on\_unseen\_substrate\_filtered.py**: contains the functions to perform the fully out-of-domain prediction task, with one difference:
if a classification task is performed, ligands that fall in the same class across all training substrates are removed from the set. This was done to test
how well the models work when ligands that always perform well are removed.

**predictions\_on\_partially\_unseen\_substrate.py**: contains the functions to perform the partially out-of-domain prediction task.

**predictions\_within\_substrate\_class.py**: contains the functions to perform the in-domain prediction task.

**predictions\_within\_substrate\_class\_for\_random\_subset.py**: contains the functions to perform the Monte Carlo in-domain prediction task.
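
The filtering step described for the filtered out-of-domain task can be pictured with the minimal sketch below. It assumes a hypothetical record layout of `(ligand_id, substrate_id, response)` and a hypothetical `threshold` parameter that turns the response into a binary class; the actual functions in the repository may be organised differently.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

# Hypothetical record layout: (ligand_id, substrate_id, measured_response)
Record = Tuple[str, str, float]


def drop_constant_class_ligands(
    train_records: List[Record], threshold: float
) -> List[Record]:
    """Remove ligands whose binary class is identical on every training substrate.

    A response at or above `threshold` counts as the positive class.
    Note: a ligand measured on only one substrate has a single label
    and is therefore removed as well.
    """
    labels_per_ligand: Dict[str, Set[bool]] = defaultdict(set)
    for ligand, _substrate, response in train_records:
        labels_per_ligand[ligand].add(response >= threshold)

    # Keep only ligands that show both classes somewhere in the training data.
    varying = {lig for lig, seen in labels_per_ligand.items() if len(seen) > 1}
    return [r for r in train_records if r[0] in varying]
```

With this filter, a ligand that is above the threshold on every training substrate no longer contributes to training, so the model cannot score well simply by memorising ligands that always perform well.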



## Notes
35 changes: 18 additions & 17 deletions environment.yml
@@ -4,24 +4,25 @@ channels:
- anaconda
- defaults
- rdkit
- mordred-descriptor
dependencies:
- python=3.10.5
- pandas==1.4.3
- numpy==1.23.2
- morfeus-ml==0.7.1
- rdkit==2022.09.1
- openpyxl
- scikit-learn
- scipy
- matplotlib
- seaborn
- tpot
- xtb-python
- pip
- pandas=1.4.3
- numpy=1.23.2
- morfeus-ml=0.7.1
- rdkit=2022.09.1
- openpyxl=3.1.1
- scikit-learn=1.3.2
- scipy=1.10.1
- matplotlib=3.7.1
- seaborn=0.12.2
- python-kaleido=0.2.1
- pip=23.0.1
- pip:
- libconeangle
- cclib
- typing-extensions
- notebook
- libconeangle==0.1.2
- cclib==1.7.2
- typing-extensions==4.5.0
- ipython==8.12.0
- notebook==6.5.3
- traitlets==5.9.0
- plotly==5.14.0
- scikit-learn-intelex==2023.1.1
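
Since the dependencies are now pinned to exact versions, it can be useful to check that an existing environment actually matches them. The snippet below is a small sketch using the standard library's `importlib.metadata`; the `PINNED` dictionary copies only a few of the pip pins listed above, and `find_mismatches` is a hypothetical helper name, not part of the repository.

```python
from importlib import metadata
from typing import Callable, Dict, Optional, Tuple

# A few of the pip pins from environment.yml (illustrative subset only)
PINNED: Dict[str, str] = {
    "cclib": "1.7.2",
    "plotly": "5.14.0",
    "traitlets": "5.9.0",
}


def find_mismatches(
    pins: Dict[str, str],
    get_version: Callable[[str], str] = metadata.version,
) -> Dict[str, Tuple[str, Optional[str]]]:
    """Return {package: (wanted, found)} for every pin that is not satisfied.

    `found` is None when the package is not installed at all.
    """
    mismatches: Dict[str, Tuple[str, Optional[str]]] = {}
    for name, wanted in pins.items():
        try:
            found: Optional[str] = get_version(name)
        except metadata.PackageNotFoundError:
            found = None
        if found != wanted:
            mismatches[name] = (wanted, found)
    return mismatches


if __name__ == "__main__":
    for name, (wanted, found) in find_mismatches(PINNED).items():
        print(f"{name}: expected {wanted}, found {found}")
```

The version lookup is passed in as an argument so the comparison logic can be exercised without installing the packages.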