Problems with BioThings BindingDB #201

Open
colleenXu opened this issue May 22, 2024 · 7 comments · May be fixed by biothings/BindingDB#2
Assignees
Labels
bug Something isn't working enhancement New feature or request

Comments

@colleenXu

colleenXu commented May 22, 2024

@newgene @andrewsu @everaldorodrigo @rjawesome

It looks like there are a few problems with the current BioThings BindingDB API. It would be helpful to fix these and maybe update the data.

  1. Andy Crouse (Translator UI) has found that some relation.bindingdb_link URLs no longer work. I wonder if some URLs were updated...and maybe using a recent data release would help.
  2. Some object fields have incorrect, outdated, or otherwise problematic values. Perhaps using a recent data release would help, PLUS adjusting the parser. I see that Rohan started some work on adjusting the parser...
  3. Not broken, but a nice-to-have if possible: adjusting the parser to assign more specific relationships.

Note:

@colleenXu colleenXu added bug Something isn't working enhancement New feature or request labels May 22, 2024
@everaldorodrigo
Contributor

@colleenXu, the latest data was released to the CI environment.

@everaldorodrigo everaldorodrigo linked a pull request Jun 11, 2024 that will close this issue
@colleenXu
Author

I'm looking at the current CI responses now...

I think there's a parsing issue with subject.uniprot.secondary_accession. In this document, the array holds a single string, "B4DYS6 D3DVV8 P19138 P20426 Q14013 Q5U065", which should have been split into one element per value. Compare it to the same document in ncats.io.
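
To illustrate the expected result, here is a minimal sketch of the split. It assumes the values are whitespace-separated, as in the string above; `split_accessions` is a hypothetical helper name, not a function from the actual BindingDB parser.

```python
def split_accessions(value):
    """Return a flat list of accession strings.

    Accepts None, a single string, or a list of strings, where any
    string may hold several whitespace-separated accessions.
    (Hypothetical helper for illustration; not the real parser code.)
    """
    if value is None:
        return []
    items = [value] if isinstance(value, str) else value
    result = []
    for item in items:
        # str.split() with no argument splits on any run of whitespace
        result.extend(item.split())
    return result


raw = ["B4DYS6 D3DVV8 P19138 P20426 Q14013 Q5U065"]
print(split_accessions(raw))
# ['B4DYS6', 'D3DVV8', 'P19138', 'P20426', 'Q14013', 'Q5U065']
```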

@colleenXu
Author

colleenXu commented Jun 13, 2024

Regarding problem 1 (relation.bindingdb_link urls not reaching the actual webpages)...

This seems to be addressed in CI! It looks like enzyme names were updated, which meant the webpage urls also needed to be updated.

@colleenXu
Author

colleenXu commented Jun 13, 2024

Regarding problem 2 (object field values are incorrect/problematic/outdated)...

Some problems were addressed in CI!

  • object.chembl: multiple IDs now seem to be correctly split. I checked all previous examples. Note that I still haven't checked how reliable/accurate these IDs are.
  • object.name: multiple values now seem to be correctly split

One idea is to double-check how reliable the chembl IDs are and, if they're good, to switch the BTE/x-bte annotation to use them rather than inchikey (current)/pubchem_cid (previous).

However, this would decrease our coverage of this resource to <50% (old breakdown's proportions are still roughly correct).


Some problems still exist. We may have to dig deeper into the data/parser to figure these out...

more investigation into the inchikey examples

Example 1: CI has the object.inchikey YQCLAYRIYWYIKH-UHFFFAOYSA-N. But Translator's NodeNorm doesn't recognize this ID and it maps the object chembl IDs to slightly different inchikeys:

Example 2: CI has the object.inchikey ZUXABONWMNSFBN-UHFFFAOYSA-N for clozapine. But Translator's NodeNorm treats this inchikey as a different entity 3-chloro-6-(4-methyl-1-piperazinyl)-5H-benzo[b][1,4]benzodiazepine. Instead, NodeNorm uses a different inchikey for clozapine: QZUDBNBUXVUHMW-UHFFFAOYSA-N
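
One quick way to characterize a mismatch like the one in Example 2 is to compare the first block of the two inchikeys. A standard InChIKey has three hyphen-separated blocks of 14, 10, and 1 characters, and the first block is a hash of the connectivity layers. If the first blocks differ, the two keys describe different molecular skeletons rather than stereo or protonation variants of one structure. The helper names below are hypothetical, and this is only a rough diagnostic sketch, not how NodeNorm resolves identifiers.

```python
def inchikey_blocks(inchikey):
    """Split a standard InChIKey into its three blocks (14, 10, 1 chars)."""
    parts = inchikey.strip().upper().split("-")
    if [len(p) for p in parts] != [14, 10, 1]:
        raise ValueError(f"not a well-formed InChIKey: {inchikey!r}")
    return tuple(parts)


def same_skeleton(key_a, key_b):
    """True if both keys share the first (connectivity) block."""
    return inchikey_blocks(key_a)[0] == inchikey_blocks(key_b)[0]


# Values taken from Example 2 above
ci_key = "ZUXABONWMNSFBN-UHFFFAOYSA-N"        # object.inchikey in CI
nodenorm_key = "QZUDBNBUXVUHMW-UHFFFAOYSA-N"  # NodeNorm's key for clozapine

print(same_skeleton(ci_key, nodenorm_key))  # False
```

Here the first blocks differ, which is consistent with NodeNorm treating the two keys as different entities.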


And a note: problem 3 (optional, more specific relationships) hasn't been addressed yet.

@everaldorodrigo
Contributor

> I'm looking at the current CI responses now...
>
> I think there's a parsing issue with subject.uniprot.secondary_accession. In this document, the array holds a single string, "B4DYS6 D3DVV8 P19138 P20426 Q14013 Q5U065", which should have been split into one element per value. Compare it to the same document in ncats.io.

Hi @colleenXu,

The field subject.uniprot.secondary_accession is now split into one element per value.

It's deployed to the CI environment. Let me know if it is as expected.

@colleenXu
Author

@everaldorodrigo

subject.uniprot.secondary_accession now looks wrong in a different way.

Sometimes the array's last value is itself an array (maybe a duplication is happening somewhere?). Examples:

@newgene
Member

newgene commented Jun 28, 2024

good catch @colleenXu !

I also want to mention that this kind of parsing issue can be caught early if we run the inspect step after the data upload. It should flag any field whose values have mixed data types. @everaldorodrigo
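
For anyone who wants to spot-check a few documents by hand, the idea of a mixed-type check can be sketched in plain Python. This is a standalone illustration, not the inspect step itself: `get_path` and `mixed_type_fields` are hypothetical helpers, and the sample documents are made up to mimic the "last value is an array" case described above.

```python
from collections import defaultdict


def get_path(doc, path):
    """Follow a dotted path through nested dicts; return None if missing."""
    cur = doc
    for key in path.split("."):
        if not isinstance(cur, dict) or key not in cur:
            return None
        cur = cur[key]
    return cur


def mixed_type_fields(docs, paths):
    """Map each path to its sorted element type names, if more than one."""
    seen = defaultdict(set)
    for doc in docs:
        for path in paths:
            value = get_path(doc, path)
            if value is None:
                continue
            # Treat a scalar as a one-element list so both shapes compare
            items = value if isinstance(value, list) else [value]
            for item in items:
                seen[path].add(type(item).__name__)
    return {p: sorted(t) for p, t in seen.items() if len(t) > 1}


# Hypothetical documents: the second one has a nested list as its last value
docs = [
    {"subject": {"uniprot": {"secondary_accession": ["B4DYS6", "D3DVV8"]}}},
    {"subject": {"uniprot": {"secondary_accession": ["P19138", ["P19138", "P20426"]]}}},
]

print(mixed_type_fields(docs, ["subject.uniprot.secondary_accession"]))
# {'subject.uniprot.secondary_accession': ['list', 'str']}
```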
