NIH_NCBI

Development of Machine Learning-Based Prediction Models for Chemical Modulators of the Glucocorticoid Receptor Signaling Pathway Using Public-Domain Bioactivity Data

Objectives: Here we present a machine learning model that predicts glucocorticoid receptor (GR) activity from the structure of its small-molecule chemical modulators. GR signaling varies depending on the small molecule that binds to the receptor: a ligand can act as an agonist, act as an antagonist, or have no effect. Given the uncertainty associated with this signaling pathway and the availability of GR quantitative high-throughput screening (qHTS) bioassay data in PubChem, the world’s largest freely accessible chemical database, an algorithm can be trained and validated to create a model of GR behavior. We can also use these data to identify the chemical substructures that play the largest role in determining GR activity. We hope this algorithm will support a better understanding of GR pathway dynamics for use in predictive analytics, intracellular modeling, and drug discovery. We also propose this model’s development pipeline, and its use of open-source PubChem data, as a framework for predicting the behavior of additional receptors.

Solution Concept: We used small-molecule structure to predict GR activity, building models with six different machine learning approaches and testing each on five different machine-readable chemical structure keys, or molecular fingerprints. We conducted statistical analysis of the qHTS data to determine the chemical substructures most significantly associated with activity.
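The README does not say which statistical test was used for the substructure analysis. As a minimal sketch, assuming each fingerprint bit stands for one substructure and that enrichment among actives is the quantity of interest, a one-sided Fisher's exact test per bit could look like the following. The function names `fisher_exact_greater` and `rank_bits` and the toy data are hypothetical and are not taken from the repository notebooks.

```python
from math import comb


def fisher_exact_greater(a, b, c, d):
    """One-sided Fisher's exact p-value for enrichment of a bit among actives.

    2x2 contingency table:
                     active  inactive
        bit present     a        b
        bit absent      c        d
    """
    n_present = a + b
    n_active = a + c
    n_total = a + b + c + d
    denom = comb(n_total, n_active)
    # Sum hypergeometric probabilities for tables at least as extreme as observed.
    numer = 0
    for k in range(a, min(n_present, n_active) + 1):
        numer += comb(n_present, k) * comb(n_total - n_present, n_active - k)
    return numer / denom


def rank_bits(fingerprints, labels):
    """Return (bit_index, p_value) pairs sorted from most to least enriched."""
    results = []
    for bit in range(len(fingerprints[0])):
        a = b = c = d = 0
        for fp, y in zip(fingerprints, labels):
            if fp[bit] and y:
                a += 1
            elif fp[bit]:
                b += 1
            elif y:
                c += 1
            else:
                d += 1
        results.append((bit, fisher_exact_greater(a, b, c, d)))
    return sorted(results, key=lambda r: r[1])


# Toy data (assumption): 8 compounds, 3-bit fingerprints, 1 = Active, 0 = Inactive.
fingerprints = [
    [1, 1, 0], [1, 1, 0], [1, 0, 0], [1, 0, 0],  # actives
    [0, 1, 1], [0, 1, 1], [0, 0, 1], [0, 0, 1],  # inactives
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

ranked = rank_bits(fingerprints, labels)
for bit, p in ranked:
    print(f"bit {bit}: p = {p:.4f}")
```

With real screening data, thousands of bits are tested at once, so a multiple-testing correction would normally be applied before any substructure is called significant.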

Measurements and Main Results: Six machine learning models, Naïve Bayes (NB), Decision Tree (DT), Random Forest (RF), K-Nearest Neighbors (KNN), Support Vector Machine (SVM), and Neural Network (NN), were built using Tox21 qHTS data. Each model took five different molecular fingerprint types as input and predicted GR activity (“Active” or “Inactive”) on a test set and two external datasets. Each set of predictions was evaluated by Area Under the Curve (AUC), Balanced Accuracy (BACC), Sensitivity, and Specificity. While the RF, KNN, SVM, and NN models all had a test set AUC of 0.96, the RF model performed the most consistently across all fingerprint types, with an AUC range of 0.86 to 0.96.
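For readers unfamiliar with these four metrics, the sketch below computes them from scratch on a small invented example. It assumes that AUC refers to the area under the ROC curve, that labels are encoded as 1 = Active and 0 = Inactive, and that a score threshold of 0.5 separates the classes. None of these details is stated in the README, and the helper names are hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fn, tn, fp) for binary labels where 1 = Active, 0 = Inactive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp, fn, tn, fp


def sensitivity(y_true, y_pred):
    """Fraction of true actives predicted Active."""
    tp, fn, _, _ = confusion_counts(y_true, y_pred)
    return tp / (tp + fn)


def specificity(y_true, y_pred):
    """Fraction of true inactives predicted Inactive."""
    _, _, tn, fp = confusion_counts(y_true, y_pred)
    return tn / (tn + fp)


def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity; robust to class imbalance."""
    return (sensitivity(y_true, y_pred) + specificity(y_true, y_pred)) / 2


def roc_auc(y_true, scores):
    """ROC AUC as the probability that a random active outscores a random
    inactive, with ties counted as one half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Invented example: 3 actives and 5 inactives with model scores.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.3, 0.7, 0.2, 0.1, 0.4, 0.05]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

print("Sensitivity:", round(sensitivity(y_true, y_pred), 4))
print("Specificity:", round(specificity(y_true, y_pred), 4))
print("BACC:", round(balanced_accuracy(y_true, y_pred), 4))
print("AUC:", round(roc_auc(y_true, scores), 4))
```

BACC is reported alongside AUC because screening datasets typically contain far more inactive than active compounds, which makes plain accuracy misleading.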

Conclusions: Machine learning models built using open-source PubChem bioassay data are a viable approach to predicting GR behavior. The models need to be trained on a greater number of compounds to improve their general applicability and their performance on external datasets.

File Locations:
All Jupyter Notebook scripts: Github/NCBI repo
All Linux machine scripts, input files, output files, etc.: Shreya Scripts Zipped File.zip
All presentations/docs/papers: Shreya.zip
All Dr. Kim’s scripts and Tox21 folder: All Files.zip

Folders and files (40 commits in history):
.ipynb_checkpoints
AID_588532_datatable_all.csv
AID_588533_datatable_all.csv
AID_720693_datatable_all.csv
AID_720719_datatable_all.csv
AID_720725_datatable_all.csv
AID_combined_prop.tab
Applicability Domain.ipynb
CleanedAgonistData3.csv
CleanedAntagonistData.csv
CleanedAntagonistData2.csv
CombinedDataCleaned.csv
Data Cleaning (old).ipynb
Data Cleaning FINAL-NCGC.ipynb
Data Cleaning FINAL.ipynb
Data Cleaning NEW (deals with multicomponent molecules).ipynb
Dictionaries NEW-updated.ipynb
Dictionaries OLD.ipynb
Downsampled_NCGC_Full
Downsampled_Tox21_Full
Downsampling NCGC.ipynb
Downsampling Tox21.ipynb
FINAL_Final_Merged_Cleaned_CSV_7-19
FINAL_Merged_Cleaned_CSV_7-19
Final_Merged_CSV
Final_Merged_Cleaned_CSV_7-1
Fingerprints & Formatting Tox21.ipynb
Fingerprints Chembl
Fingerprints Chembl.ipynb
Fingerprints NCGC
Fingerprints NCGC.ipynb
Fingerprints Tox21
Fingerprints Tox21 Test
Heatmaps - Applicability Domain.ipynb
Heatmaps .ipynb
Initial Data Exploration.ipynb
ML Models !.ipynb
Molecular Properties NCGC.ipynb
Molecular Properties Tox21.ipynb
Molecular_Properties_CSV
NCGC Data Cleaning OLD.ipynb
NCGC_Molecular_Properties_CSV
NCGC_clean
Poster Shreya Singh August 2019.pdf
README.md
RawAgonistData.csv
Structure Alerts.ipynb
TEST!! Data Cleaning NEW copy.ipynb
TEST_Final_Merged_Cleaned_CSV_7-1
Test_Tox21
Train-Test Split Tox21.ipynb
Train_Tox21
Untitled.ipynb
Untitled1.ipynb
Verifying Data Cleaning.ipynb
activate
auc_chembl_appl.png
auc_ncgc_appl.png
auc_test.png
auc_test_appl.png
bacc_chembl_appl.png
bacc_ncgc_appl.png
bacc_test.png
bacc_test_appl.png
cid_parent.json
cid_parent_ag.json
concise_data_prop.tab
cv0_p100_knn_kr.tif
domain_appl_ecfp.grep
domain_appl_fcfp.grep
domain_appl_maccs.grep
domain_appl_pub.grep
domain_appl_top.grep
ecfp_fps.csv
fcfp_fps.csv
from
full_pub_fp.csv
grep_dt.txt
grep_dt_appl
grep_knn.txt
grep_knn_appl
grep_nb
grep_nb.txt
grep_nb_appl
grep_nn.txt
grep_nn_appl
grep_rf.txt
grep_rf_appl
grep_svm.txt
grep_svm_appl
input_chembl_ecfp.csv
input_chembl_ecfp_appl.csv
input_chembl_fcfp.csv
input_chembl_fcfp_appl.csv
input_chembl_maccs.csv
input_chembl_maccs_appl.csv
input_chembl_pub.csv
input_chembl_pub_appl.csv
input_chembl_top.csv
input_chembl_top_appl.csv

shreyasingh1/NIH_NCBI
