COMPETITION ROUND 2: A Predictive Model for Series 4 #1

Open
edwintse opened this issue Jul 22, 2019 · 77 comments

Comments

@edwintse
Collaborator

edwintse commented Jul 22, 2019

UPDATE: Round 2 has now concluded. Thanks to all who participated! The results announcement can be found here.

OSM will be launching the second round of the predictive modelling competition on August 1st. This will build upon the first round which was run in 2016 (results here). All relevant background can be found in the previous two links and on the Wiki (tab above). Submissions will be allowed up to the end of the day on September 11th.

The aim of the competition is to develop a computational model that predicts new, potent molecules in OSM Series 4.

The target of these molecules is strongly suspected to be PfATP4, since there has so far been essentially a perfect correlation between activity of molecules in this series vs the parasite and in an assay that measured ion regulation, used as a proxy for activity vs PfATP4. PfATP4 is an important target for the development of new drugs for malaria.

We are providing a dataset of actives and inactives. The challenge is to use the data to develop a model that allows us to (better) design compounds in Series 4 that will be active against that target. This competition is part of Open Source Malaria, meaning that everything needs to adhere to the Six Laws.

This round of the competition is funded by the AI3SD+ network. Details of the submitted proposal can be found here (#2). The funding allows us to actually make the molecules that are proposed to be active.

Competition Timeline

  • Competition launch: The competition will run from 01/08/19 to 11/09/19.
  • Paper write-up: This will happen as the competition is being run and will be submitted to the forthcoming special issue of the Beilstein Journal of Organic Chemistry.
  • Judging and results: A panel (to be announced) will evaluate the models against an undisclosed test set to determine the model(s) best able to predict activity of knowns.
  • Synthesis of top compounds: With the best performing model(s) as judged above, the relevant submitters will be asked to suggest new potent Series 4 compounds. These will be synthesised and biologically evaluated to determine the predictive capabilities of the models.

The Competition
OSM will provide:

  • A dataset containing active and inactive compounds against PfATP4 along with their in vitro potencies (here). This list has been updated to include the more recent Pathogen Box results from the Kirk lab that were used as the test set in the last competition.
  • The Master Chemical List which contains activity data for all OSM compounds from Series 1-4.
  • Jeremy Horst's Homology Model built from crystal structures of the closest mammalian homolog (SERCA)
    PfATP4-PNAS2014.pdb.txt
  • Details of the relevant mutations known to be associated with resistance.

Submission Rules:

  • Entries may either be submitted directly to GitHub (uploaded to the Submitted Models folder in the Code tab above) or be uploaded onto an ELN with a link posted in this repository.
  • Entrants can work individually or in teams (no limit to team size).
  • Entrants must work openly during the competition. This doesn't necessarily mean that inputs have to be logged in real time (although that is strongly encouraged), but entries that have not openly deposited working data on a regular basis prior to the deadline will not be accepted.
    Open Electronic Lab Notebooks (ELNs) such as LabTrove or LabArchives can be useful places to post data and work collaboratively. For example, Ho Leung Ng's ELN can be viewed and commented on here. Please note that LabTrove authors are not alerted when a comment is added to an entry, so GitHub is a useful place to tag others.
  • Entrants must agree to their work's incorporation into a future OSM journal publication(s).
  • Competition winner(s) will be authors on any relevant future paper(s).
  • Any valid* entries will at least be acknowledged on any relevant future paper(s) and if the contribution is significant may lead to authorship.

How will entries be assessed?
There is a relatively high confidence level that PfATP4 is the molecular target for Series 4 (i.e. compounds that are potent in vitro show disruption of ion regulation in the PfATP4 assay). Therefore, for this round of the competition, we will be focussing on the prediction of active Series 4 compounds (rather than the prediction of any active compounds vs PfATP4) since the two should correlate.

  • For the final submission, entrants will predict the potencies of an undisclosed set of Series 4 compounds (to be provided at a later date)
  • A judging panel (to be announced) will evaluate these predictions in comparison with experimental data to determine the winner(s)
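The post does not say which metric the panel will use to compare predictions with experimental data. Purely as an illustration of what such a comparison could look like, the sketch below computes a root-mean-square error on log10-transformed potencies. The choice of RMSE, the log scale, and the function name `rmse_log_potency` are all assumptions, not the judges' method.

```python
import math


def rmse_log_potency(predicted_um, measured_um):
    """RMSE between predicted and measured potencies, in log10 units.

    Hypothetical scoring helper: the competition post does not name a metric.
    Both inputs are sequences of potencies in uM, in the same compound order.
    """
    if len(predicted_um) != len(measured_um) or not predicted_um:
        raise ValueError("need two non-empty sequences of equal length")
    squared = [
        (math.log10(p) - math.log10(m)) ** 2
        for p, m in zip(predicted_um, measured_um)
    ]
    return math.sqrt(sum(squared) / len(squared))
```

Working on a log scale means a ten-fold miss counts the same whether the compound is nanomolar or micromolar, which is one reason it is a common choice for potency data.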

What's the prize?
Two prizes will be awarded, one for a private sector entry and one for a public sector entry.
...also the opportunity to contribute to our understanding of a new class of antimalarials
...and authorship on a resulting peer-reviewed publication arising from the OSM consortium

*A 'valid' entry is one that stands up to the rigour expected from published in silico models. Judges are entitled to use discretion in the case of unconventional entrants, for example those from people with no formal training such as high school students.

Comments and questions can go below. The above rules/guidance will be periodically updated.

@giribio

giribio commented Aug 4, 2019

Interesting, our team started reviewing the previous runs, datasets etc. We hope to have some promising models.

@spadavec
Contributor

spadavec commented Aug 4, 2019

Very interested in participating, and iterating on the last competition. Is there a formal definition for the core of Series 4? Curious to know where we can enumerate and where we can't.

@edwintse
Collaborator Author

edwintse commented Aug 5, 2019

@spadavec I think it would be best to stick to the triazolopyrazine core with substituents in the northwest and northeast positions (e.g. MMV897698 as a simple example), considering the better potencies that we typically get with those.

@jsilter

jsilter commented Aug 6, 2019

Was there ever a full formal writeup for the first round? I see at #538 that it was delayed due to data embargoes and such, hopefully those have passed.

@edwintse
Collaborator Author

edwintse commented Aug 7, 2019

@jsilter There hasn't been yet, but I am in the process of writing it up on the wiki in this repo so check back there soon. At the same time I am also drafting up this info for the paper (I'll create a new issue about this shortly).

@BenedictIrwin

BenedictIrwin commented Aug 8, 2019

Not clear exactly what we should leave in the submitted models folder.
Would a prediction for missing values of each compound already in the sheet suffice?
Or does it have to be a binary capable of taking a new compound SMILES and outputting the predicted activity?

By working openly, does this mean I can just place my data etc. in a repository e.g.
https://github.com/BenedictIrwin/OSM
and update that as I make progress?

@edwintse
Collaborator Author

edwintse commented Aug 8, 2019

@BenedictIrwin Hi, I have added some details about what will be required for submission to the original post above, but it is more the latter. In short, all entrants will be provided with the molecular identifiers (e.g. SMILES) for a set of existing Series 4 compounds (where we have not revealed the experimental potencies) and you will be required to predict the potencies for these compounds.

Yes, working openly means that at any stage, if someone wants to see the progress you've made, they can easily look at your work on an ELN or on GitHub. Feel free to place your data/working in a repository (either this one or your own) and update/provide links as you make progress.

@wvanhoorn
Contributor

Hi, I am trying to get my head around the provided activity data:

  1. All data is in Google Sheet 'Ion Regulation Data for OSM Competition' (http://tinyurl.com/OSM-Series4CompData)? If this is the case what is the relevance of the Master Chemical List (https://docs.google.com/spreadsheets/d/1Rvy6OiM291d1GN_cyT6eSw_C3lSuJ1jaR7AJa8hgGsc/edit#gid=510297618)?

Re the data in Sheet 'Ion Regulation Data for OSM Competition':
2. The red/brown highlights indicate missing data and/or structures, i.e. entries that can be ignored?
3. What is the relevance of the column 'Ion Regulation Activity'? If relevant, what to do with missing data?
4. Rows 608-835 and 960-1278 do not contain activity data, should these be ignored, treated as prediction set, other?
5. Is the data in row 836-959 any different from the data in row 2-607? Why is it separate since the first block is sorted by activity?
6. Do you have a cut-off when to classify a compound as 'active', something like Potency vs Parasite (uMol) <= 1 uM?
7. There is no test set provided as yet? If we generate the models that we can't share since they run on a proprietary platform (which will most likely be our case) how is model performance compared between entries?

Willem

@edwintse
Collaborator Author

@wvanhoorn I'll try to answer these as best as I can.

  1. The compounds in the "Ion Regulation Data for OSM Competition" sheet have associated PfATP4 data (i.e. do they have ion regulation activity or not). However, this list contains non-OSM compounds as well. The "Master Chemical List" is the complete list of OSM compounds from Series 1, 3 and 4 with in vitro potencies (n.b. any compound from Series 1 is also known to be inactive against PfATP4). Round 1 of the competition was more focussed on the prediction of active compounds against PfATP4 (not limited to Series 4). For this round, we are looking for predictions for the activities of Series 4 compounds specifically so you can use the Master Chemical List to train your models.
  2. Yes, those entries can be ignored.
  3. Ion regulation activity indicates whether or not it is active in the PfATP4 assay (1 means the compound shows ion regulation activity, 0 means it doesn't). In the case of Series 4, we see correlation between PfATP4 activity and in vitro potency, so any OSM compound in the list should be relatively potent. Any OSM compound without a number in this column can be found in the Master List and can be used for training the predictions.
  4. The compounds in these rows are from the MMV Malaria Box and Pathogen Box and haven't been evaluated against the parasite. Considering that these compounds are all structurally different from Series 4 compounds, I'm not sure how helpful they will be for developing a model to predict the activities of Series 4 compounds specifically, so perhaps it's better to ignore them?
  5. No difference. The data in rows 836-959 were just added more recently and haven't been sorted.
  6. Generally, our compounds are classified as active if they are <1 uM, weakly active between 1-2.5 uM, and inactive >2.5 uM.
  7. Yes, the final test set will be provided at a later date. It's understandable that the model itself won't be able to be shared. We are focused less on the actual method than on the accuracy of the prediction. Each submission will need to provide the predicted potencies for this test set. By comparing these predictions with the experimental data for the test set, we can determine which models perform the best. The submitters of the best model(s) will then be asked to generate new active compounds that will then be synthesised and tested.

Let me know if you have any further questions
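The activity bands given in answer 6 above can be written down as a small helper. This is a minimal sketch: the function name `classify_potency` is hypothetical, and the handling of the exact boundary values (1 uM and 2.5 uM fall in the "weakly active" band) is one reading of the quoted ranges.

```python
def classify_potency(ic50_um: float) -> str:
    """Map a potency in uM to the three activity bands quoted in answer 6.

    Boundary handling is an assumption: exactly 1 uM and exactly 2.5 uM
    are both treated as "weakly active".
    """
    if ic50_um < 1.0:
        return "active"
    if ic50_um <= 2.5:
        return "weakly active"
    return "inactive"


for value in (0.3, 1.0, 2.5, 10.0):
    print(value, classify_potency(value))
```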

@mmgalushka

Hi,

I'm in the process of creating a dataset containing two fields, "SMILES" and "Active/Inactive" status. If I ignore all records where "Smiles" is missing or "Ion Regulation Activity" is neither 0 nor 1, I get 576 "clean" compounds (510 inactive and 66 active).

Taking into consideration @edwintse comments, may I apply the following rule to records where "Ion Regulation Activity" is missing but "Potency vs Parasite (uMol)" is available?

Rule:

if "Potency vs Parasite (uMol)" < 1:
    "Ion Regulation Activity" = 1
else:
    "Ion Regulation Activity" = 0

Nick

@edwintse
Collaborator Author

@mmgalushka This rule could only be applied to OSM Series 4 compounds since we know there is correlation between ion regulation activity and in vitro potency. I don't think this could be accurately applied to the other compounds from the Ion Regulation sheet since there are lots of different structural classes of compounds for which we don't know if there is any correlation.
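The restriction described above, where the potency-based rule is applied only to Series 4 compounds, can be sketched as follows. The function name and its three arguments are hypothetical: the spreadsheets would first need a field that identifies the series of each compound, and here it is assumed to be available as the string "4".

```python
from typing import Optional


def label_ion_regulation(series: str,
                         potency_um: Optional[float],
                         measured: Optional[int]) -> Optional[int]:
    """Return a 0/1 ion regulation label, or None when none can be justified.

    Assumes `series` is "4" for OSM Series 4 compounds (hypothetical field).
    """
    if measured in (0, 1):
        return measured  # an experimental result is kept as-is
    if series != "4" or potency_um is None:
        return None      # no known correlation outside Series 4
    return 1 if potency_um < 1.0 else 0
```

Returning `None` rather than 0 for everything else keeps "not inferable" separate from "inactive", so those rows can be dropped or handled explicitly later.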

@spadavec
Contributor

Will the activity of the test molecules be measured as their activity in the ion regulation assay, the Pfal EC50 assay, or both? Do we have known tolerances or errors for either assay?

@edwintse
Collaborator Author

@spadavec The test compounds will have been measured in the Pfal IC50 assay only. I'm actually not sure about the specifics of the assay tolerances/errors, but I know that the Pfal IC50 assay uses Mefloquine as the standard control with an acceptable pIC50 range of 7.5-7 if that helps.
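For readers less used to pIC50, the conversion to and from potencies in uM is sketched below using the standard definition pIC50 = -log10(IC50 in mol/L). The function names are hypothetical. Under this definition, the quoted mefloquine range of pIC50 7 to 7.5 corresponds to roughly 0.1 uM down to about 0.03 uM.

```python
import math


def ic50_um_to_pic50(ic50_um: float) -> float:
    """pIC50 = -log10(IC50 in mol/L) = 6 - log10(IC50 in uM)."""
    return 6.0 - math.log10(ic50_um)


def pic50_to_ic50_um(pic50: float) -> float:
    """Inverse of ic50_um_to_pic50: returns the IC50 in uM."""
    return 10.0 ** (6.0 - pic50)


for pic50 in (7.0, 7.5):
    print(pic50, round(pic50_to_ic50_um(pic50), 4))
```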

@mmgalushka

I'm not from a biochemistry background and am a little lost in the domain-specific terminology. I'm posting the script which I'm using to clean the dataset.

I exported "Ion Regulation Data for OSM Competition" file in TSV format and applied the following script:

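import csv

# Zero-based column positions in the exported TSV
ION_REGULATION_ACTIVITY = 2
SMILES = 4

with open('datasets/Ion Regulation Data for OSM Competition.tsv', newline='') as r, \
        open('datasets/Ion Regulation Data for OSM Competition.csv', mode='w', newline='') as w:
    writer = csv.writer(w)
    for fields in csv.reader(r, delimiter='\t'):
        if len(fields) <= SMILES:
            continue  # skip blank or truncated rows instead of raising IndexError

        activity_value = fields[ION_REGULATION_ACTIVITY].strip()
        smiles_value = fields[SMILES].strip()

        # Keep rows with a SMILES string and a 0/1 activity flag (this also drops the header row)
        if smiles_value and activity_value in ('0', '1'):
            writer.writerow([smiles_value, activity_value])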

The output is a CSV file with two columns, "SMILES" and "ACTIVITY". I got 851 compounds in total, of which 66 are active.

Am I on the right track? Do I need to consider something else?

@wvanhoorn
Contributor

I am still confused and it seems that I am not the only one. This runs the risk of becoming a data interpretation/cleaning competition instead of a data modeling competition. Could we therefore settle first on a single file with all relevant data and without any irrelevant data (for instance only Series 4 compounds if the aim is to predict only Series 4 compounds) so that we all depart from the same starting point? And provide a specific description of what needs to be modeled. I initially thought the aim was to model 'Potency vs Parasite (uMol)', now it seems it should be 'Ion Regulation Activity', but I am still not sure.

@mmgalushka

I 100% agree with @wvanhoorn. It would be beneficial for all teams to have a single file with only the samples relevant to this competition, containing the input feature(s) and a target feature.

@edwintse
Collaborator Author

edwintse commented Aug 16, 2019

Hi all,
Apologies if there has been any confusion. To clarify, the aim of this competition is to predict the Pfal IC50 potencies of Series 4 compounds that are active against PfATP4. This is slightly different to the aim of Round 1 where the aim was more broadly to predict any active compounds against PfATP4.

Both spreadsheets are supposed to be complementary. The idea behind the two is as follows:

Ion regulation spreadsheet
We highly suspect PfATP4 to be the target for the Series 4 compounds (potent Series 4 compounds show activity in the ion regulation assay; this is indicated by a 1 in the ion regulation activity column) but the structure of the target protein has not been solved. This means that we don't know what key interactions our compounds are making with the target. The ion regulation spreadsheet contains all known compounds (from many different chemotypes) that have been experimentally evaluated against PfATP4. All of this structural information (along with the provided homology model and relevant mutations) can be used to aid in discerning any key interactions that might be taking place, and therefore be used to predict new potent Series 4 compounds that exploit these interactions.

Master Chemical List
This list contains all OSM compounds from Series 1-4 with in vitro potencies with the additional knowledge that Series 1 does not target PfATP4 (i.e. 0 for ion regulation activity). As we are specifically looking for predictions on compounds with a triazolopyrazine core, the changes in structural features between Series 4 compounds and their associated in vitro potencies can be used to develop and refine your models.

n.b. The models will be evaluated for their ability to predict the potencies of a test set that consists of Series 4 compounds only.

With that in mind, you are free to use as much or as little of the provided data that you think will best achieve this goal. I believe that by providing all the data, all aspects can be considered when developing the models.

@mmgalushka

mmgalushka commented Aug 16, 2019

I would like to check the following statements regarding my model specifically.

My model takes only one feature (SMILES) as an input. According to your comments, am I right to say that we are trying to predict the potencies, which are defined in the "Ion Regulation Data for OSM Competition" file under the field "Potency vs Parasite (uMol)"? If this is true, our model should predict real-valued outputs.

To summarize the above, we need to build a regression model which predicts "Potency vs Parasite (uMol)" from the compound "SMILES". Is that the right conclusion?

PS: I understand that some potency values can be sourced from "Master Chemical List" file, but at this stage, I just want to concentrate on "Ion Regulation Data for OSM Competition" file.

@edwintse
Collaborator Author

@mmgalushka Yes, that's correct. Totally fine to just concentrate on the one file at this stage.

@mmgalushka

mmgalushka commented Aug 16, 2019

Thanks a lot @edwintse!

I used the following Python script to extract records:

import csv

# Zero-based column positions in the exported TSV
POTENCY_VS_PARASITE = 1
SMILES = 4

with open('datasets/Ion Regulation Data for OSM Competition - Malaria Molecules.tsv', newline='') as r, \
        open('datasets/Ion Regulation Data for OSM Competition - Malaria Molecules.csv', mode='w', newline='') as w:
    writer = csv.writer(w)
    for fields in csv.reader(r, delimiter='\t'):
        if len(fields) <= SMILES:
            continue  # skip blank or truncated rows instead of raising IndexError

        potency_value = fields[POTENCY_VS_PARASITE].strip()
        smiles_value = fields[SMILES].strip()

        if not smiles_value or not potency_value:
            continue
        try:
            float(potency_value)  # keep only rows whose potency parses as a number
        except ValueError:
            continue  # non-numeric potency, including the header row
        writer.writerow([smiles_value, potency_value])

I got the following file:

Many records have a potency of exactly 10 or 50, whereas the majority of records lie between 0 and 8.0. Are these values of 10 and 50 correct?

@edwintse
Collaborator Author

Compounds that have potency values of 10/50 or a potency qualifier of '>' can be treated as inactive. Either case means that the IC50 value was greater than the maximum concentration tested in the assay.
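One way to flag these upper-limit entries before training is sketched below. The function name `is_censored` and the separate `qualifier` argument are hypothetical; the sketch assumes the potency has already been parsed to a number in uM and that the qualifier, where present, is available as a string.

```python
from typing import Optional

# Values that the reply above identifies as assay ceilings, not measurements.
CEILING_VALUES_UM = (10.0, 50.0)


def is_censored(potency_um: float, qualifier: Optional[str] = None) -> bool:
    """True when the potency is an upper-limit placeholder (treat as inactive)."""
    if qualifier is not None and qualifier.strip() == ">":
        return True
    return potency_um in CEILING_VALUES_UM


rows = [(0.4, None), (10.0, None), (3.2, ">"), (50.0, ">")]
print([is_censored(value, qual) for value, qual in rows])
```

For a regression model these rows are better excluded or handled as censored values than used at face value, since the true IC50 is only known to exceed the number recorded.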

@spadavec
Contributor

@edwintse Thanks for all of the clarification! Just as a follow up, if you consider only S4 compounds that have enough data to contribute to a regression model (e.g. have potency and SMILES strings) there are only ~130 compounds, which is definitely on the low side for an accurate model (typically this number needs to be closer to ~500 for pIC50 values to have an error rate of ~1, which is getting close to on-par with errors in wet measurements of IC50/EC50 values). If we expand the criterion for acceptance to be over/under 1uM (e.g. just a classification job), the accuracy and results should be much better across the board--has that been considered at all for this?

@edwintse
Collaborator Author

@spadavec Are these ~130 S4 compounds from the ion regulation spreadsheet that have both potency vs parasite and ion regulation activity? or just potency vs parasite and not ion regulation activity? There should be close to 350 S4 compounds that have potency vs parasite data (all on the Master Chemical List but you have to filter out intermediate structures). Still, this is lower than the desired number of compounds.

It sounds reasonable to me to expand the criterion if that will provide better accuracy/results for the model.

@jonjoncardoso
Member

Hi! Not sure if the deadline got postponed.

I might be able to submit some results too by the end of tomorrow (30/09), but if there is another deadline extension, that would be great.

@edwintse
Collaborator Author

Hi all, based on the requests for extension of the deadline, we will be extending the submission date to 2 weeks from now (i.e. final submissions by end of day 11th Oct 2019). Hopefully this will give everyone enough time to put something together.

I also wanted to just see if we will still be expecting submissions from @spadavec, @giribio and @BenedictIrwin since you had some activity on this issue earlier on. Anything would be great!

@BenedictIrwin

BenedictIrwin commented Sep 30, 2019

@edwintse Sorry, I thought you said we needed to put a binary/executable in the submitted models folder (from my early question), so I stopped work on this as it would have taken too much time for that level of submission. Looking now, it's just submitting a .csv of predictions. Perhaps some clearer direction at the outset next time would have helped me stay on track.

I might be able to throw something quick in, but it won't be the best. Thanks for the nudge

@mattodd
Member

mattodd commented Sep 30, 2019

Hi @spadavec - pinging you quickly here about the 2 week extension that @edwintse mentions above. Hoping you might be OK to submit something with this extra time.

@gcincilla
Contributor

Hi everybody,
This week I could finally find the time to work on this. I hope I can reach a decent model before the final submission deadline on the 11th Oct 2019. I found it extremely useful to read all the comments generated so far, and this helped me to get on track quickly. I would especially like to thank @wvanhoorn for generating a cleaned version of the data (file Master Chemical List - annotated). I think this is a great starting point.

@edwintse
Collaborator Author

edwintse commented Oct 8, 2019

Hi everyone,

Just a reminder that the final deadline for the competition is end of day this Friday. If I'm not mistaken, based on the extension we should expect to be getting submissions from @gcincilla, @holeung, @spadavec, @BenedictIrwin, @IamDavyG, @jonjoncardoso, @giribio and @jsilter?

In case you've not seen already, a big thank you to @wvanhoorn who has generated a cleaned version of the Master Chemical List - annotated which can be used for developing your models.

Please submit your predictions of the test set (#4) as a .csv file to this repository. If you are unable to directly upload to the submission folder, you can upload it as a zip file in a comment and tag me so I don't miss it. Let me know if you encounter any problems with submission.

@jonjoncardoso
Member

Hi! I have opened a Pull Request (#10) with my submission.

Many thanks to @wvanhoorn for creating the clean version of the training data, it was really helpful!

@edwintse
Collaborator Author

edwintse commented Oct 8, 2019

@jonjoncardoso Thanks for your submission! I've merged it into the master.

@BenedictIrwin

I have some submissions at: https://github.com/BenedictIrwin/OSM/tree/master/FinalModels

I hadn't used the Master chemical list for the small set.
For the Master model I did.

I tried to predict each assay individually because they seem to be under different conditions/ranges and merging them might not be the best strategy. I also predicted the Single shot inhibition and the Ion regulation, hopefully it is looking consistent.

There is a predicted value in the original units and then a low and high error bar for each prediction. Some of them are quite wide as you might expect with the sparse data.

I can provide similar information for the entire Master chemical sheet if the model turns out to be useful, i.e. a prediction (potentially noisy) for every cell.

There might be some optimal strategy in how to combine the different readings.

@gcincilla
Contributor

Hi everybody,
I uploaded my contribution as pull request #11
After testing several different compound subsets, sampling methods and descriptor combinations, I have to say that this seems an especially challenging modeling response. This may be due to the intrinsic complexity of the underlying target and/or to noise present in the experimental data. In the end I think we reached a decent model, but its quality is certainly lower than most of the models we are used to working with. As we couldn't reach a reliable regression model predicting the Series 4 compound Pfal potency, we opted to develop and validate a classification model. The original Pfal potencies were split into 2 classes:

  • Pfal potency <= 1 uM: active
  • Pfal potency > 1 uM: inactive

If a continuous value is needed to rank the molecules or to evaluate the submitted model, the probability of compounds to be active (i.e. column named “P (Pfal class=active)”) ranging from 0 to 1, can be used for such a purpose.
More details are given in the description of the pull request.
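The two-class split and the use of the predicted probability as a ranking score can be sketched as below. The function names, compound IDs and probabilities are made up for illustration and do not come from the submission itself.

```python
def pfal_class(potency_um: float) -> str:
    """Two-class split described above: <= 1 uM is active, otherwise inactive."""
    return "active" if potency_um <= 1.0 else "inactive"


def rank_by_probability(predictions: dict) -> list:
    """Sort compound IDs from most to least likely to be active."""
    return sorted(predictions, key=predictions.get, reverse=True)


# Made-up IDs and probabilities, for illustration only.
example = {"cmpd-A": 0.82, "cmpd-B": 0.15, "cmpd-C": 0.47}
print(rank_by_probability(example))
```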

@edwintse
Collaborator Author

Thanks @gcincilla! I've just merged your submission with the master.

@spadavec
Contributor

Quick question: one of the compounds can't be parsed by rdkit as a valid SMILES string:

OSM-LO-1
FC(F)OC(C=C1)=CC=C1C2=NN=C3C=NC=C(OCCC4567[BH]89%10[BH]%11%124[BH]8%13%14[BH]%11%15%16[CH]%13%17%18[BH]%149%19[BH]%105%20[BH]%21%226[BH]%17%15([BH]%22%12%167)[BH]%18%19%20%21)N32

Is there a valid SMILES string for this?

@edwintse
Collaborator Author

@spadavec this is a p-carborane containing compound. These compounds are often hard to interpret but this string is the most accurate way to represent the compound so I'm not sure there's an alternative that can be handled by rdkit.

@holeung
Member

holeung commented Oct 10, 2019

Hi. I just made a pull request with my submission and description on methodology. I used my own homology model, docking, generated 1-3D features, and then used XGBoost regressor to make my predictions.

@edwintse
Collaborator Author

Wonderful, thanks @holeung! It's now been merged

@holeung
Member

holeung commented Oct 10, 2019

@spadavec, yeah, the carboranes broke most of my software. If I remember correctly, I think only the Chemaxon software could handle them. Did anyone find a way to handle them?

@spadavec
Contributor

@holeung no, I couldn't find a way--I may have been able to figure it out via openbabel or something like that, but I decided to punt that specific prediction for a number of reasons.

@edwintse
Collaborator Author

Thanks for your submission @spadavec!

@IamDavyG
Contributor

Hello, I just made my submission as a pull request @edwintse

@luiraym
Contributor

luiraym commented Oct 11, 2019

Hi, I have made my submission as well at this link @edwintse

@gcincilla
Contributor

@spadavec, for your information OSM-LO-1 can be correctly parsed by CDK.
Nevertheless, as I excluded compounds with atom types other than H,C,O,N,S,F,Cl,Br,I (3 were originally present in my modeling set), I skipped the prediction for OSM-LO-1.
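An element filter of the kind described above can be approximated without a cheminformatics toolkit. The sketch below is a crude regex scan over the SMILES text, not a real parser and not the CDK-based workflow used for the submission. The function names are hypothetical, and implicit hydrogens inside bracket atoms are ignored.

```python
import re

ALLOWED = {"H", "C", "O", "N", "S", "F", "Cl", "Br", "I"}

# Bracket atoms first, then two-letter halogens, then one-letter organic-subset atoms.
_ATOM = re.compile(r"\[[^\]]*\]|Cl|Br|[BCNOPSFI]|[bcnops]")
# Element symbol at the start of a bracket atom, after an optional isotope number.
_BRACKET_ELEMENT = re.compile(r"\[\d*([A-Z][a-z]?|[a-z]{1,2})")


def elements_in_smiles(smiles: str) -> set:
    """Collect element symbols with a regex scan (an approximation only)."""
    found = set()
    for token in _ATOM.findall(smiles):
        if token.startswith("["):
            match = _BRACKET_ELEMENT.match(token)
            if match is None:
                continue  # e.g. a wildcard atom such as [*]
            token = match.group(1)
        found.add(token.capitalize())  # aromatic "c" becomes "C"
    return found


def uses_only_allowed_elements(smiles: str) -> bool:
    return elements_in_smiles(smiles) <= ALLOWED


osm_lo_1 = ("FC(F)OC(C=C1)=CC=C1C2=NN=C3C=NC=C(OCCC4567[BH]89%10[BH]%11%124"
            "[BH]8%13%14[BH]%11%15%16[CH]%13%17%18[BH]%149%19[BH]%105%20"
            "[BH]%21%226[BH]%17%15([BH]%22%12%167)[BH]%18%19%20%21)N32")
print(uses_only_allowed_elements(osm_lo_1))  # boron is present, so this is False
```

Because the scan never builds a molecular graph, it also works on strings that a toolkit refuses to parse, which makes it usable as a pre-filter ahead of descriptor calculation.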

@sladem-tox
Contributor

Hi, I have also made a submission to the challenge!
Here is the link. @edwintse

@edwintse
Collaborator Author

Many thanks to @IamDavyG @luiraym and @sladem-tox for your submissions! They have all been received and merged.

@wvanhoorn
Contributor

Forgot to post the summary of the modeling work, most credit to go to Laksh Aithani (@aced125).
