EPiCs-group · akalikadien · May 8, 2024 · Apr 19, 2023 · Apr 20, 2023 · Apr 20, 2023
diff --git a/.gitignore b/.gitignore
@@ -694,4 +694,9 @@ ipython_config.py
 # Remove previous ipynb_checkpoints
 #   git rm -r .ipynb_checkpoints/
 
-# End of https://www.toptal.com/developers/gitignore/api/jupyternotebooks
+# End of https://www.toptal.com/developers/gitignore/api/jupyternotebooks
+
+# ignore all figures generated by the notebooks
+obelix_ml_pipeline/notebooks/figures/*/*
+obelix_ml_pipeline/notebooks/figures/*/*/*
+obelix_ml_pipeline/*.pkl
diff --git a/README.md b/README.md
@@ -1,5 +1,11 @@
 # obelix-ml-pipeline
-Data and code related to the ML pipeline used in the paper '...' introducing our [OBeLiX workflow](https://github.com/EPiCs-group/obelix).
+This GitHub repo contains the ML pipeline used in the paper '...'. For the exact 'production' versions used to generate the
+figures in the paper, visit ...  
+
+First, homogeneous catalyst structures are featurized using
+our [OBeLiX workflow](https://github.com/EPiCs-group/obelix). Various other representations are created for these structures
+and used as input for machine learning models. The models are trained to predict the enantioselectivity or conversion. 
+
 
 ## Installation
 First clone the repository:
@@ -25,25 +31,57 @@ Afterwards, install the package:
 ```bash
 pip install -e .
 ```
-Then the notebooks can be run to reproduce the results. In your anaconda prompt, run:
+The notebooks can then be run to try out examples of the pipeline. In your Anaconda prompt, run:
 
 ```bash
 cd obelix-ml-pipeline/notebooks
 jupyter notebook
 ```
 
+## Repository structure
+In the paper, four different prediction tasks are performed: fully out-of-domain, partially out-of-domain, in-domain and 
+Monte Carlo in-domain. The functions for each task are kept in their own file. Example use cases
+are shown in the notebooks. The ligand representations, substrate representations and experimental response are 
+loaded from the data folder. A more detailed description of the files is given below.
+
 ## Data
-**filename**: contains ...
+* **Experimental response/**  
+Contains the experimental response for the different substrates and solvents.  
+    * **jnjdata_sm12378_MeOH_16h**: contains the experimental response for 16 hours with one solvent.
+    * **jnjdata_sm12378_MeOH**: contains the experimental response at 1 hour for SM1/2/3 and at 16 hours for the other substrates, with one solvent.
+
+* **Ligand representations/**  
+Contains the various representations of the ligands.  
+    * **raw_data_processing**: contains the raw data and scripts needed to create each representation.
+
+* **Substrate representations/**  
+Contains the various representations of the substrates.  
+    * **raw_data_processing**: contains the raw data and scripts needed to create each representation.
 
 ## Code
-**representation_variables.py**: contains a selection of features for representations of the ligand or substrate.
+**representation_variables.py**: contains a selection of features for representations of the ligand or substrate. 
+Running the file directly creates correlation plots for the features it defines.
 
 **utilities.py**: contains functions for data loading and general utilities.
 
 **load_representations.py**: contains functions for loading representations of the ligand or substrate.
 
 **machine_learning.py**: contains functions for machine learning, training and testing.
 
-**predictions_\*.py**: contains the 3 main use cases of the ML pipeline.
+**data_classes.py**: contains classes for the data returned by the ML pipeline.
+
+**predictions\_on\_unseen\_substrate.py**: contains the functions to perform the fully out-of-domain prediction task.
+
+**predictions\_on\_unseen\_substrate\_filtered.py**: contains the functions to perform the fully out-of-domain prediction task, except that, 
+when a classification task is performed, ligands that fall in the same class across all training substrates are removed from the set. This was done to test 
+how well the models work if ligands that always perform well are removed.
+
+**predictions\_on\_partially\_unseen\_substrate.py**: contains the functions to perform the partially out-of-domain prediction task.
+
+**predictions\_within\_substrate\_class.py**: contains the functions to perform the in-domain prediction task.
+
+**predictions\_within\_substrate\_class\_for\_random\_subset.py**: contains the functions to perform the Monte Carlo in-domain prediction task.
+
+
 
 ## Notes
diff --git a/environment.yml b/environment.yml
@@ -4,24 +4,25 @@ channels:
      - anaconda
      - defaults
      - rdkit
-     - mordred-descriptor
 dependencies:
 - python=3.10.5
-- pandas==1.4.3
-- numpy==1.23.2
-- morfeus-ml==0.7.1
-- rdkit==2022.09.1
-- openpyxl
-- scikit-learn
-- scipy
-- matplotlib
-- seaborn
-- tpot
-- xtb-python
-- pip
+- pandas=1.4.3
+- numpy=1.23.2
+- morfeus-ml=0.7.1
+- rdkit=2022.09.1
+- openpyxl=3.1.1
+- scikit-learn=1.3.2
+- scipy=1.10.1
+- matplotlib=3.7.1
+- seaborn=0.12.2
+- python-kaleido=0.2.1
+- pip=23.0.1
 - pip:
-  - libconeangle
-  - cclib
-  - typing-extensions
-  - notebook
+  - libconeangle==0.1.2
+  - cclib==1.7.2
+  - typing-extensions==4.5.0
+  - ipython==8.12.0
+  - notebook==6.5.3
+  - traitlets==5.9.0
   - plotly==5.14.0
+  - scikit-learn-intelex==2023.1.1