Recovery of active molecules from larger dataset using shannon entropy as selection metric

spadavec/shannon-entropy

shannon-entropy

Sandbox for an attempt to create an iterative method that recovers active molecules from a dataset, using the Shannon entropy contribution of test molecules to the active molecules already in the training set. This is loosely based on Wang et al. (2009) [1].

Requirements

rdkit

How to Run

python shannon.py --input inputfile.csv 

where inputfile.csv has the format of:

SMILES_1, potency_value1
SMILES_2, potency_value2
...
SMILES_n, potency_valuen
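
As an illustration of the input format above, here is a minimal sketch of how such a file could be parsed. It is not taken from shannon.py: the function name `read_molecules` is hypothetical, and it assumes there is no header row and that every potency value is numeric.

```python
import csv
import io


def read_molecules(handle):
    """Parse 'SMILES, potency' rows into (smiles, potency) tuples.

    Assumes no header row and a numeric potency in the second column.
    """
    molecules = []
    for row in csv.reader(handle, skipinitialspace=True):
        if len(row) < 2:
            continue  # skip blank or incomplete lines
        smiles = row[0].strip()
        potency = float(row[1])
        molecules.append((smiles, potency))
    return molecules


# Usage with an in-memory file; a real run would pass open("inputfile.csv").
example = io.StringIO("CCO, 5.2\nc1ccccc1, 7.9\n")
print(read_molecules(example))  # [('CCO', 5.2), ('c1ccccc1', 7.9)]
```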

Review

The script takes 2 random molecules from the input file and designates the more potent of the two as 'active'. It then examines all remaining molecules in the file and looks for compounds that would minimally increase the Shannon entropy of the dataset if they were included, using the equation:

shannon entropy = sum(-x*log(x) - (1-x)*log(1-x)) 

where x is the frequency of 'on' bits at each index of the 1024-bit fingerprints of the molecule set. The entropy is calculated before and after the addition of a test molecule, and the molecules with the lowest dEntropy are added in batches of 96. Here, the log is base 2, and values of x of 0 or 1 contribute an entropy of 0.
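
The procedure described above can be sketched in plain Python as follows. This is an illustrative sketch, not the code in shannon.py: all function names are hypothetical, fingerprints are represented as plain lists of 0/1 values standing in for the 1024-bit RDKit fingerprints, and a higher potency value is assumed to mean a more potent molecule.

```python
import math
import random


def bit_entropy(x):
    """Base-2 binary entropy of an 'on' frequency x; 0 when x is 0 or 1."""
    if x <= 0.0 or x >= 1.0:
        return 0.0
    return -x * math.log2(x) - (1 - x) * math.log2(1 - x)


def shannon_entropy(fingerprints):
    """Sum of per-bit entropies over a list of equal-length 0/1 fingerprints."""
    n = len(fingerprints)
    n_bits = len(fingerprints[0])
    total = 0.0
    for i in range(n_bits):
        freq = sum(fp[i] for fp in fingerprints) / n
        total += bit_entropy(freq)
    return total


def delta_entropy(active_fps, candidate_fp):
    """Entropy of the active set after adding the candidate, minus before."""
    before = shannon_entropy(active_fps)
    after = shannon_entropy(active_fps + [candidate_fp])
    return after - before


def select_batch(active_fps, candidate_fps, batch_size=96):
    """Indices of the candidates with the lowest dEntropy, lowest first."""
    order = sorted(
        range(len(candidate_fps)),
        key=lambda i: delta_entropy(active_fps, candidate_fps[i]),
    )
    return order[:batch_size]


def pick_seed(molecules, rng=random):
    """Draw 2 random (smiles, potency) pairs and return the more potent one.

    Assumption: a higher potency value means a more potent molecule.
    """
    first, second = rng.sample(molecules, 2)
    return first if first[1] >= second[1] else second


# Toy 4-bit example: one active molecule and three candidates.
active = [[1, 1, 0, 0]]
candidates = [[1, 1, 0, 0], [0, 0, 1, 1], [1, 1, 1, 0]]
print(select_batch(active, candidates, batch_size=2))  # [0, 2]
```

In the toy example, the candidate identical to the active molecule has a dEntropy of 0, the one differing in a single bit has a dEntropy of 1, and the fully complementary one has a dEntropy of 4, so the first and third candidates are selected.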

References

[1] Wang et al., J. Chem. Inf. Model., 2009, 49 (7), pp 1687–1691
