This is a legacy version of PIDGIN, version 2 found here: https://github.com/lhm30/PIDGINv2
Author : Lewis Mervin, [email protected]
Supervisor : Dr. A. Bender
Protein Target Prediction Tool trained on SARs from PubChem (Mined 08/04/14) and ChEMBL18
Molecular Descriptors : 2048bit Morgan Binary Fingerprints (Rdkit) - ECFP4
- Targets: 1,080 Human Targets (activities over 10)
- Bioactivities: 194,563,469
- Distinct compounds: 826,354 (Based on binary fingerprints)
- Distinct Actives: 295,973
- Distinct Inactives: 648,930
- Number of classes requiring sphere exclusion: 480 (Ratio of less than 100:1 Inactive:Active)
- Number of classes requiring under-sampling: 600 (Ratio of less than 100:1 Inactive:Active)
Pathway information from NCBI BioSystems
Dependencies : rdkit, sklearn, numpy
ChemAxon Standardizer was used for structure canonicalization and transformation, JChem 6.0.2.
Dependencies:
Requires Python 2.7, Scikit-learn [1], Numpy [2] and Rdkit [3] to be installed on system.
Follow these steps on Linux:
sudo apt-get install python-numpy
sudo apt-get install python-sklearn
sudo apt-get install python-rdkit librdkit1 rdkit-data
==========================================================================================
Instructions:
IMPORTANT:
- The program currently recognises line separated SMILES in .csv format
- Molecules Should be standardized before running models
- ChemAxon Standardizer should be used for structure canonicalization and is free for academic use at (http://www.chemaxon.com)
- Protocol used to standardise these molecules is provided: StandMoleProt.xml
- Do not modify the 'models' name or directory
- Cytotoxicity_library.csv is included for use as example dataset for test data
Two different python scripts can be used when performing target prediction. Both utilise the Naive Bayes models created using Scikit-learn [1].
-
predict_raw.py filename.csv
This script outputs the raw Naive Bayes score probabilities for the compounds in a matrix.Example of how to run the code:
python predict_raw.py input.csv
-
predict_binary.py theshold filename.csv
This script generates binary predictions for the models after application of Class-Specific Activity Cut-off Thresholds.This script requires an argument for the choice of Class-Specific Cut-off Thresholds.
The options available are Precision (p), Recall (r), F1-score (f), Accuracy (a) and the Default 0.5 global threshold (0.5)
Example of how to run the code:
python predict_binary.py r input.csv
where r would apply Thresholds calculated using the Recall metric
-
predict_binary_heat.py threshold filename.csv
This script generates binary predictions for the models using the thresholds, but does not output predictions for targets with no hits (this reduces the number of classes that one attempts to plot on a heat map)Example of how to run the code:
python predict_binary_heat.py r input.csv
-
predict_single.py filename.csv
This script generates outputs for the raw Naive Bayes score probabilities and the binary output for all the thresholds for a single compound (used to give more in depth analysis for single compounds)Example of how to run the code:
python predict_single.py input.csv
-
predict_ranked.py filename.csv
This script generates ranked targets (based on the raw Naive Bayes score probabilities) for compounds. The final column includes the average ranking position for target classes given the input compounds.Example of how to run the code:
python predict_ranked.py input.csv
-
predict_enriched.py threshold filename.csv
This script enriched targets and pathways for a library of compounds, when compared to a background sample from PubChem. The script will ask for login details for MySQL on Calculon, and the number of background samples to compare the library. The script will live access the NCBI BioSystems repository for up-to-date pathways information.Example of how to run the code:
python predict_enriched.py threshold input.csv
-
predict_enriched_two_libraries.py threshold input_active_library.csv input_inactive_library.csv
This script calculates enriched targets and pathways for a library of phenotypically active compounds compared to phenotypically inactive compounds.Example of how to run the code:
python predict_enriched_two_libraries.py threshold filename_1.csv filename_2.csv
-
predict_fingerprints.py threshold filename.csv
This script calculates target and pathway hits and represents them as binary fingerprints in a matrix.Example of how to run the code:
python predict_fingerprints.py threshold input.csv
-
ad_analysis_all.py number_of_neighbours input.csv
This script takes an input of SMILES and calculates the average nearest-neighbour Tanimoto similarity of the n nearest compounds in the active training set, for each target. From here you can average the distance to a model for your input compounds, or perhaps all the models in PIDGIN etc.Example of how to run the code:
python ad_analysis_all.py 1 input.csv
-
ad_analysis.py number_of_neighbours input.csv
Similar to ad_analysis_all.py, but can be used for more in-depth analysis for near-neighbours. This script takes the list of smiles and retrieves the top n nearest active compound SMILES and Tanimoto's, giving a more detailed view of the distribution of compounds and which compounds are active.Example of how to run the code:
python ad_analysis.py 1 input.csv
-
The predictions produced for this activity-only model reflect the probability that a compound is active for a given target class, when considering the probability of activity for the other activity classes.
-
The model and scripts to use this single model are found in the 'singlemodel' folder.
-
predict_singlemodel.py filename.csv
This script generates raw Naive Bayes score probabilities for compounds (this only considers activity information).Example of how to run the code:
python predict_singlemodel.py input.csv
-
predict_singlemodel_ranked_number.py filename.csv
This script outputs the ranking for targets from compound libraries based on the raw Naive Bayes score probabilities. The final column includes the average ranking position for target classes given the input compounds (this only considers activity information).Example of how to run the code:
python predict_singlemodel_ranked_number.py input.csv
==========================================================================================
[1] http://scikit-learn.org/stable/ [2] http://www.numpy.org [3] http://www.rdkit.org