COMPETITION ROUND 2: A Predictive Model for Series 4 #1
Comments
Interesting, our team started reviewing the previous runs, datasets etc. We hope to have some promising models. |
Very interested in participating, and iterating on the last competition. Is there a formal definition for the core of Series 4? Curious to know where we can enumerate and where we can't. |
@spadavec I think it would be best to stick to the triazolopyrazine core with substituents in the northwest and northeast positions (e.g. MMV897698 as a simple example), considering the better potencies that we typically get with those. |
Was there ever a full formal writeup for the first round? I see at #538 that it was delayed due to data embargoes and such, hopefully those have passed. |
Not clear exactly what we should leave in the submitted models folder. By working openly, does this mean I can just place my data etc. in a repository e.g. |
@BenedictIrwin Hi, I have added some details about what will be required for submission to the original post above, but it is more the latter. In short, all entrants will be provided with the molecular identifiers (e.g. SMILES) for a set of existing Series 4 compounds (where we have not revealed the experimental potencies) and you will be required to predict the potencies for these compounds. Yes, working openly means that at any stage, if someone wants to see the progress you've made, they can easily look at your work on an ELN or on GitHub. Feel free to place your data/working in a repository (either this one or your own) and update/provide links as you make progress. |
Hi, I try to get my head around the provided activity data:
Re the data in Sheet 'Ion Regulation Data for OSM Competition': Willem |
@wvanhoorn I'll try to answer these as best as I can.
Let me know if you have any further questions |
Hi, I'm in the process of creating a dataset containing two fields, "SMILES" and "Active/Inactive" status. If I ignore all records where "Smiles" is missing or "Ion Regulation Activity" is neither 0 nor 1, I get 576 "clean" compounds (510 inactive and 66 active). Taking into consideration @edwintse's comments, may I apply the following rule to records where "Ion Regulation Activity" is missing but "Potency vs Parasite (uMol)" is available? Rule:
Nick |
@mmgalushka This rule could only be applied to OSM Series 4 compounds since we know there is correlation between ion regulation activity and in vitro potency. I don't think this could be accurately applied to the other compounds from the Ion Regulation sheet since there are lots of different structural classes of compounds for which we don't know if there is any correlation. |
Will the activity of the test molecules be measured as their activity in the ion regulation assay, the Pfal EC50 assay, or both? Do we have known tolerances or errors for either assay? |
@spadavec The test compounds will have been measured in the Pfal IC50 assay only. I'm actually not sure about the specifics of the assay tolerances/errors, but I know that the Pfal IC50 assay uses Mefloquine as the standard control with an acceptable pIC50 range of 7.5-7 if that helps. |
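For anyone less familiar with the pIC50 scale quoted above: pIC50 is the negative base-10 logarithm of the IC50 expressed in mol/L. Below is a minimal sketch of the conversion to and from micromolar units. The function names are hypothetical and only for illustration; they are not taken from any OSM code.

```python
import math


def pic50_to_ic50_um(pic50: float) -> float:
    """Convert a pIC50 value to an IC50 in micromolar.

    pIC50 = -log10(IC50 in mol/L), and 1 mol/L = 1e6 uM,
    so IC50 (uM) = 10 ** (6 - pIC50).
    """
    return 10 ** (6 - pic50)


def ic50_um_to_pic50(ic50_um: float) -> float:
    """Convert an IC50 in micromolar back to pIC50."""
    return 6 - math.log10(ic50_um)


# The control range of pIC50 7 to 7.5 mentioned in the thread
low_um = pic50_to_ic50_um(7.5)   # about 0.032 uM
high_um = pic50_to_ic50_um(7.0)  # about 0.1 uM
```

On this scale a pIC50 of 6 corresponds to 1 uM, which is the classification cut-off that comes up later in the thread.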
I'm not from a BioChem background and am a little bit lost in the domain-specific terminology. I'm posting the script I'm using to clean the dataset. I exported the "Ion Regulation Data for OSM Competition" file in TSV format and applied the following script:

    # Zero-based column indices in the exported TSV
    Ion_Regulation_Activity = 2
    Smiles = 4

    with open('datasets/Ion Regulation Data for OSM Competition.csv', mode='w') as w:
        with open('datasets/Ion Regulation Data for OSM Competition.tsv') as r:
            for record in r:
                fields = record.split('\t')
                activity_value = fields[Ion_Regulation_Activity].strip()
                smiles_value = fields[Smiles].strip()
                # Keep only rows with a SMILES string and a 0/1 activity flag
                if len(smiles_value) > 0 and activity_value in ['0', '1']:
                    w.write(smiles_value + ',' + activity_value + '\n')

The output is a CSV file with two columns, "SMILES" and "ACTIVITY". I got 851 compounds in total, of which 66 are active. Am I on the right track? Do I need to consider something else? |
I am still confused and it seems that I am not the only one. This runs the risk of becoming a data interpretation/cleaning instead of data modeling competition. Could we therefore settle first on a single file with all relevant data without any irrelevant data (for instance only series 4 compounds if the aim is to only predict series 4 compounds) so that we all depart from the same starting point? And provide a specific description what needs to be modeled. I initially thought the aim was to model 'Potency vs Parasite (uMol)', now it seems it should be 'Ion Regulation Activity' but I am still not sure. |
I 100% agree with @wvanhoorn. It would be beneficial for all teams to have a single file with only the samples relevant to this competition, containing the input feature(s) and a target feature. |
Hi all, Both spreadsheets are supposed to be complementary. The idea behind the two is as follows: Ion regulation spreadsheet Master Chemical List n.b. The models will be evaluated for their ability to predict the potencies of a test set that consists of Series 4 compounds only. With that in mind, you are free to use as much or as little of the provided data as you think will best achieve this goal. I believe that by providing all the data, all aspects can be considered when developing the models. |
I will try to make the following statements specifically about my model. My model takes only one feature (SMILES) as an input. According to your comments, am I right to say that we are trying to predict the potencies, which are defined in the "Ion Regulation Data for OSM Competition" file under the field "Potency vs Parasite (uMol)"? If this is true, our model should predict "real" values. To summarize the above, we need to build a regression model which predicts "Potency vs Parasite (uMol)" from the compound "SMILES". Have I drawn the right conclusion? PS: I understand that some potency values can be sourced from the "Master Chemical List" file, but at this stage, I just want to concentrate on the "Ion Regulation Data for OSM Competition" file. |
@mmgalushka Yes, that's correct. Totally fine to just concentrate on the one file at this stage. |
Thanks a lot @edwintse! I used the following Python script to extract records:

    # Zero-based column indices in the exported TSV
    Potency_vs_Parasite = 1
    Smiles = 4

    with open('datasets/Ion Regulation Data for OSM Competition - Malaria Molecules.csv', mode='w') as w:
        with open('datasets/Ion Regulation Data for OSM Competition - Malaria Molecules.tsv') as r:
            for record in r:
                fields = record.split('\t')
                potency_value = fields[Potency_vs_Parasite].strip()
                smiles_value = fields[Smiles].strip()
                if len(smiles_value) > 0 and len(potency_value) > 0:
                    try:
                        float(potency_value)  # make sure this is a real value
                    except ValueError:
                        continue  # skip headers and non-numeric potencies
                    w.write(smiles_value + ',' + potency_value + '\n')

I got the following file. There are many records whose potency is exactly 10 or 50, whereas the majority of records lie between 0 and 8.0. Are these "10" and "50" values correct? |
Compounds that have potency values of 10/50 or a potency qualifier of '>' can be treated as inactive. It means that the IC50 values were greater than the max concentration that was tested in the assay. |
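The rule above could be sketched in Python roughly as follows. This is only an illustration: the function name, the argument names and the way the qualifier is passed in are assumptions, not part of the OSM spreadsheets or any official tooling.

```python
from typing import Optional

# Potency values (uM) that the thread says indicate the assay's top concentration
CAPPED_POTENCIES_UM = {10.0, 50.0}


def is_censored_inactive(potency_um: float, qualifier: Optional[str] = None) -> bool:
    """Return True when a record should be treated as inactive under the rule.

    A record qualifies if its potency qualifier is '>' or if its potency
    is exactly one of the capped values (10 or 50 uM).
    """
    if qualifier is not None and qualifier.strip() == ">":
        return True
    return potency_um in CAPPED_POTENCIES_UM
```

Records flagged this way are probably better kept as "inactive" labels than fed into a regression as if 10 or 50 were measured IC50 values, since the true value is only known to be above the tested maximum.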
@edwintse Thanks for all of the clarification! Just as a follow up, if you consider only S4 compounds that have enough data to contribute to a regression model (e.g. have potency and SMILES strings) there are only ~130 compounds, which is definitely on the low side for an accurate model (typically this number needs to be closer to ~500 for pIC50 values to have an error rate of ~1, which is getting close to on-par with errors in wet measurements of IC50/EC50 values). If we expand the criterion for acceptance to be over/under 1uM (e.g. just a classification job), the accuracy and results should be much better across the board--has that been considered at all for this? |
@spadavec Are these ~130 S4 compounds from the ion regulation spreadsheet that have both potency vs parasite and ion regulation activity? or just potency vs parasite and not ion regulation activity? There should be close to 350 S4 compounds that have potency vs parasite data (all on the Master Chemical List but you have to filter out intermediate structures). Still, this is lower than the desired number of compounds. It sounds reasonable to me to expand the criterion if that will provide better accuracy/results for the model. |
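As a rough illustration of the over/under 1 uM classification idea discussed above, the sketch below turns potency values into binary labels and counts the classes. The function names are hypothetical, and treating a potency of exactly 1 uM as inactive is an assumption made here, since the thread does not say which side the boundary falls on.

```python
from collections import Counter
from typing import Iterable


def label_active(potency_um: float, threshold_um: float = 1.0) -> int:
    """Return 1 (active) if potency is below the threshold, otherwise 0."""
    return 1 if potency_um < threshold_um else 0


def class_balance(potencies_um: Iterable[float], threshold_um: float = 1.0) -> Counter:
    """Count how many compounds fall into each class at the given threshold."""
    return Counter(label_active(p, threshold_um) for p in potencies_um)


# Made-up potencies purely for illustration, not real OSM data
example_balance = class_balance([0.2, 0.9, 5.0, 10.0])
```

Checking the class balance before training is worthwhile here because the thread suggests actives are a small minority of the data.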
Hi! Not sure if the deadline got postponed. I might be able to submit some results too by the end of tomorrow (30/09), but if there is another deadline extension, that would be great. |
Hi all, based on the requests for extension of the deadline, we will be extending the submission date to 2 weeks from now (i.e. final submissions by end of day 11th Oct 2019). Hopefully this will give everyone enough time to put something together. I also wanted to just see if we will still be expecting submissions from @spadavec, @giribio and @BenedictIrwin since you had some activity on this issue earlier on. Anything would be great! |
@edwintse Sorry, I thought you said we needed to put a binary/executable in the submitted models folder (from my early question), so I stopped work on this as it would have taken too much time for that level of submission. Looking now, it's just submitting a .csv of predictions. Perhaps some clearer direction at the outset next time would have helped me stay on track. I might be able to throw something quick in, but it won't be the best. Thanks for the nudge |
Hi everybody, |
Hi everyone, Just a reminder that the final deadline for the competition is end of day this Friday. If I'm not mistaken, based on the extension we should expect to be getting submissions from @gcincilla, @holeung, @spadavec, @BenedictIrwin, @IamDavyG, @jonjoncardoso, @giribio and @jsilter? In case you've not seen already, a big thank you to @wvanhoorn who has generated a cleaned version of the Master Chemical List - annotated which can be used for developing your models. Please submit your predictions of the test set (#4) as a .csv file to this repository. If you are unable to directly upload to the submission folder, you can upload it as a zip file in a comment and tag me so I don't miss it. Let me know if you encounter any problems with submission. |
Hi! I have opened a Pull Request (#10) with my submission. Many thanks to @wvanhoorn for creating the clean version of the training data, it was really helpful! |
@jonjoncardoso Thanks for your submission! I've merged it into the master. |
I have some submissions at: https://github.com/BenedictIrwin/OSM/tree/master/FinalModels I hadn't used the Master chemical list for the small set. I tried to predict each assay individually because they seem to be under different conditions/ranges and merging them might not be the best strategy. I also predicted the Single shot inhibition and the Ion regulation, hopefully it is looking consistent. There is a predicted value in the original units and then a low and high error bar for each prediction. Some of them are quite wide as you might expect with the sparse data. I can provide similar information for the entire Master chemical sheet if the model turns out to be useful, i.e. a prediction (potentially noisy) for every cell. There might be some optimal strategy in how to combine the different readings. |
Hi everybody,
If a continuous value is needed to rank the molecules or to evaluate the submitted model, the probability that a compound is active (i.e. the column named “P (Pfal class=active)”), ranging from 0 to 1, can be used for that purpose. |
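A minimal sketch of ranking by that probability column is shown below. Only the column name "P (Pfal class=active)" comes from the comment above; the "ID" column, the compound names and the probabilities are invented for illustration and do not come from the actual submission.

```python
def rank_by_activity_probability(rows, column="P (Pfal class=active)"):
    """Sort prediction rows from most to least likely to be active."""
    return sorted(rows, key=lambda row: float(row[column]), reverse=True)


# Made-up rows purely for illustration; not real predictions
example_rows = [
    {"ID": "cmpd-A", "P (Pfal class=active)": "0.12"},
    {"ID": "cmpd-B", "P (Pfal class=active)": "0.87"},
    {"ID": "cmpd-C", "P (Pfal class=active)": "0.45"},
]

ranked_ids = [row["ID"] for row in rank_by_activity_probability(example_rows)]
```

The rows are plain dictionaries, so the same function works on the output of `csv.DictReader` if the predictions are read from a .csv file.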
Thanks @gcincilla! I've just merged your submission with the master. |
Quick question: one of the compounds, OSM-LO-1, can't be parsed. Is there a valid SMILES string for this? |
@spadavec this is a p-carborane containing compound. These compounds are often hard to interpret but this string is the most accurate way to represent the compound so I'm not sure there's an alternative that can be handled by rdkit. |
Hi. I just made a pull request with my submission and description on methodology. I used my own homology model, docking, generated 1-3D features, and then used XGBoost regressor to make my predictions. |
Wonderful, thanks @holeung! It's now been merged |
@spadavec, yeah, the carboranes broke most of my software. If I remember correctly, I think only the Chemaxon software could handle them. Did anyone find a way to handle them? |
@holeung no, i couldn't find a way--I may have been able to figure it out via openbabel or something like that, but I decided to punt that specific prediction for a number of reasons. |
Thanks for your submission @spadavec! |
Hello, I just made my submission as a pull request @edwintse |
@spadavec, for your information OSM-LO-1 can be correctly parsed by CDK. |
Many thanks to @IamDavyG @luiraym and @sladem-tox for your submissions! They have all been received and merged. |
UPDATE: Round 2 has now concluded. Thanks to all who participated! The results announcement can be found here.
OSM will be launching the second round of the predictive modelling competition on August 1st. This will build upon the first round which was run in 2016 (results here). All relevant background can be found in the previous two links and on the Wiki (tab above). Submissions will be allowed up to the end of the day on September 11th.
The aim of the competition is to develop a computational model that predicts new, potent molecules in OSM Series 4.
The target of these molecules is strongly suspected to be PfATP4, since there has so far been essentially a perfect correlation between activity of molecules in this series vs the parasite and in an assay that measured ion regulation, used as a proxy for activity vs PfATP4. PfATP4 is an important target for the development of new drugs for malaria.
We are providing a dataset of actives and inactives. The challenge is to use the data to develop a model that allows us to (better) design compounds in Series 4 that will be active against that target. This competition is part of Open Source Malaria, meaning that everything needs to adhere to the Six Laws.
This round of the competition is funded by the AI3SD+ network. Details of the submitted proposal can be found here (#2). The funding allows us to actually make the molecules that are proposed to be active.
Competition Timeline
The Competition
OSM will provide:
PfATP4-PNAS2014.pdb.txt
Submission Rules:
Open Electronic Notebooks (ELN) such as Labtrove or LabArchives can be useful places to post data and work collaboratively. For example, Ho Leung Ng's ELN can be viewed and commented on here. Please note that LabTrove authors are not alerted when a comment is added to an entry so GitHub is a useful place to tag others.
How will entries be assessed?
There is a relatively high confidence level that PfATP4 is the molecular target for Series 4 (i.e. compounds that are potent in vitro show disruption of ion regulation in the PfATP4 assay). Therefore, for this round of the competition, we will be focussing on the prediction of active Series 4 compounds (rather than the prediction of any active compounds vs PfATP4) since the two should correlate.
What's the prize?
Two prizes will be awarded, one for a private sector entry and one for a public sector entry.
...also the opportunity to contribute to our understanding of a new class of antimalarials
...and authorship on a resulting peer-reviewed publication arising from the OSM consortium
*A 'valid' entry is one that stands up to the rigour expected from published in silico models. Judges are entitled to use discretion in the case of unconventional entrants, for example those from people with no formal training such as high school students.
Comments and questions can go below. The above rules/guidance will be periodically updated.