This repository contains code for predicting the HOMO-LUMO energy gap of organic compounds using molecular descriptors generated from SMILES (Simplified Molecular Input Line Entry System) representations. HOMO (highest occupied molecular orbital) and LUMO (lowest unoccupied molecular orbital) are frontier molecular orbitals that play a significant role in chemical bond formation and various chemical reactions.
This code requires the following libraries and tools:
- RDKit
- pandas
- numpy
- scikit-learn
- matplotlib
- seaborn
- lazypredict
- lightgbm
You can install these dependencies using pip or another package manager.
The descriptor-generation part of the code converts each organic compound to its canonical SMILES representation and computes 200 general molecular descriptors for it.
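As a rough illustration of this step, the sketch below canonicalizes a few SMILES strings with RDKit and evaluates every descriptor listed in `Descriptors.descList`. The helper name `featurize`, the column name `canonical_smiles`, and the example molecules are assumptions made for illustration and may not match the repository's code. The exact number of descriptors depends on the installed RDKit version.

```python
import pandas as pd
from rdkit import Chem
from rdkit.Chem import Descriptors


def featurize(smiles_list):
    """Return a DataFrame with canonical SMILES and RDKit descriptors.

    Hypothetical helper: the repository's own function may differ.
    """
    rows = []
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        if mol is None:
            # RDKit returns None for strings it cannot parse; skip them.
            continue
        # MolToSmiles produces the canonical SMILES by default.
        row = {"canonical_smiles": Chem.MolToSmiles(mol)}
        # descList holds (descriptor_name, function) pairs.
        for name, func in Descriptors.descList:
            row[name] = func(mol)
        rows.append(row)
    return pd.DataFrame(rows)


# Example input (illustrative only): ethanol, benzene, and an invalid string.
df = featurize(["OCC", "c1ccccc1", "not_a_smiles"])
print(df[["canonical_smiles", "MolWt"]])
```

Invalid inputs are dropped silently here; a real pipeline might prefer to log them so that missing rows can be traced back to their source.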
The prediction part of the code estimates the HOMO-LUMO energy gap of organic compounds using machine learning models. It performs the following steps:
- Removal of highly correlated features.
- Splitting the dataset into training and testing sets.
- Training various regression models using LazyRegressor, which automatically evaluates multiple regression models.
- Fine-tuning the top-performing models.
- Predicting the energy gap for the test set.
- Calculating model performance metrics such as Mean Absolute Error (MAE) and R-squared (R^2).
- Plotting the predicted vs. observed energy gap.
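The steps above can be sketched roughly as follows. This is a minimal example under several assumptions, not the repository's implementation: the data is synthetic, the correlation threshold (0.9) and the 80/20 split are arbitrary choices, `StandardScaler` stands in for whichever scaler the code uses, and a single `RandomForestRegressor` replaces the LazyRegressor model sweep and the fine-tuning step. The helper `drop_correlated` is hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def drop_correlated(X, threshold=0.9):
    """Drop the later column of every pair with |Pearson r| above threshold."""
    corr = X.corr().abs()
    # Keep only the strict upper triangle so each pair is checked once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return X.drop(columns=to_drop)


# Synthetic stand-in for the descriptor table and the energy-gap target.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(300, 5)), columns=[f"d{i}" for i in range(5)])
X["d5"] = 2.0 * X["d0"] + rng.normal(scale=0.01, size=300)  # near-copy of d0
y = 3.0 * X["d0"] - 2.0 * X["d1"] + rng.normal(scale=0.1, size=300)

# 1. Remove highly correlated features.
X_reduced = drop_correlated(X)

# 2. Split into training and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X_reduced, y, test_size=0.2, random_state=42
)

# 3. Fit the scaler on the training data only, then train one model.
scaler = StandardScaler().fit(X_train)
model = RandomForestRegressor(n_estimators=200, random_state=42)
model.fit(scaler.transform(X_train), y_train)

# 4. Predict on the test set and compute the metrics.
y_pred = model.predict(scaler.transform(X_test))
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"MAE = {mae:.3f}, R^2 = {r2:.3f}")
```

Fitting the scaler on the training split alone keeps test-set statistics out of the model, which matters when the reported MAE and R^2 are meant to reflect performance on unseen compounds.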
The code also includes functionality for saving and loading trained machine learning models and a scaler for future use.
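One common way to persist a fitted model together with its scaler is `joblib`, which is installed alongside scikit-learn. The sketch below assumes that approach; the repository may use `pickle` or another mechanism instead. The file names, the `Ridge` model, and the synthetic data are illustrative assumptions.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

# Small synthetic dataset standing in for the descriptor matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5])

scaler = StandardScaler().fit(X)
model = Ridge(alpha=1.0).fit(scaler.transform(X), y)

# Save both objects; the scaler must be stored with the model because
# new inputs need the same transformation that was used in training.
out_dir = tempfile.mkdtemp()
model_path = os.path.join(out_dir, "model.joblib")    # hypothetical file name
scaler_path = os.path.join(out_dir, "scaler.joblib")  # hypothetical file name
joblib.dump(model, model_path)
joblib.dump(scaler, scaler_path)

# Reload and confirm the restored pair reproduces the original predictions.
loaded_model = joblib.load(model_path)
loaded_scaler = joblib.load(scaler_path)
original = model.predict(scaler.transform(X))
restored = loaded_model.predict(loaded_scaler.transform(X))
print(np.allclose(original, restored))
```

Files written this way should only be loaded from trusted sources and with a compatible scikit-learn version, since they are pickle-based.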
To use this code, follow these steps:
- Install the required dependencies.
- Execute the code sections in a Python environment.
- Review the output: the code generates predictions for the HOMO-LUMO energy gap of organic compounds and reports model performance metrics.
- Joel Santos