Modify SuppKG parser to better deal with fake UMLS IDs #220

Open
andrewsu opened this issue Jul 25, 2024 · 2 comments

Comments

@andrewsu
Member

We created an API for SuppKG in #55 and biothings/biothings_explorer#706. We previously noted that SuppKG created UMLS-like identifiers (which have the format "DCXXXXXXX" instead of "CXXXXXXX"). At the time, we decided to treat them as if they were UMLS IDs, but now that is resulting in some confusing results (e.g., NCATSTranslator/Feedback#836), so it's time to adjust this behavior.

Vlado helped map these fake UMLS "DC" IDs to more common identifiers; the results are in supp_kg_chem_nodes.txt. To summarize, there were 56636 IDs for SuppKG nodes, 53707 of which start with "C" (we assume these are valid UMLS IDs). Of the remaining 2928, whose IDs start with "DC", Vlado mapped 841 to CHEBI, CID, UNII, MESH, etc. In our parser script, let's replace the "DC" IDs of these 841 nodes with their mapped IDs in our API. For the remaining 2087 nodes, for which Vlado could not find mappings, let's delete the records that use those IDs from our API.
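
A rough Python sketch of the proposed behavior, not the existing parser code. The function names, the `dc_mappings` dict, the representation of edges as (subject, predicate, object) triples, and the choice to use the first mapped identifier are all assumptions made for illustration; the sample IDs in the usage note are made up.

```python
from typing import Dict, Iterable, Iterator, List, Optional, Tuple

Edge = Tuple[str, str, str]  # (subject ID, predicate, object ID); hypothetical shape


def is_fake_umls(node_id: str) -> bool:
    """True for SuppKG's UMLS-like IDs, which start with "DC" instead of "C"."""
    return node_id.startswith("DC")


def remap_node_id(node_id: str, dc_mappings: Dict[str, List[str]]) -> Optional[str]:
    """Return the ID to use in the API, or None if the node should be dropped."""
    if not is_fake_umls(node_id):
        return node_id  # assumed to be a valid UMLS ID, kept unchanged
    mapped = dc_mappings.get(node_id)
    if mapped:
        return mapped[0]  # assumption: take the first mapped identifier
    return None  # unmapped "DC" ID


def filter_edges(edges: Iterable[Edge], dc_mappings: Dict[str, List[str]]) -> Iterator[Edge]:
    """Remap "DC" IDs on both ends of each edge; skip edges touching unmapped ones."""
    for subj, pred, obj in edges:
        new_subj = remap_node_id(subj, dc_mappings)
        new_obj = remap_node_id(obj, dc_mappings)
        if new_subj is None or new_obj is None:
            continue  # delete records that use an unmapped "DC" ID
        yield (new_subj, pred, new_obj)
```

For example, with `dc_mappings = {"DC0000001": ["CHEBI:12345"]}`, an edge `("C0000005", "TREATS", "DC0000001")` would be rewritten to point at `CHEBI:12345`, while any edge involving an unmapped ID such as `DC0000002` would be skipped. How to choose among multiple mapped identifiers is left open here.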

An analysis of the namespaces used for the 841 (262 are mapped to multiple identifiers):

$ grep '^D' supp_kg_chem_nodes.tsv  | gawkt '$3>0{print $NF}' | tr '|' '\n' | sed 's/:.*//' | sort | uniq -c | sort -k1nr
    626 CHEBI
    298 CID
    181 UNII
     78 MESH
     38 ChEMBL
     19 PHARMGKB.CHEMICAL
      6 CHEMBL.TARGET
      3 HMDB
      2 CAS
      2 DrugBank
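
For anyone who wants to reproduce the tally inside the parser rather than in the shell, here is a rough Python equivalent of the pipeline above. It assumes what the command implies: tab-separated lines, a numeric third column, and a last column holding a "|"-separated list of prefixed identifiers. The function name `count_namespaces` is hypothetical, and the treatment of non-numeric third columns only approximates awk's.

```python
from collections import Counter
from typing import Iterable


def count_namespaces(lines: Iterable[str]) -> Counter:
    """Count identifier prefixes in the last column of "D..." rows."""
    counts: Counter = Counter()
    for line in lines:
        if not line.startswith("D"):  # grep '^D'
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue
        try:
            n_mapped = float(fields[2])  # awk's $3
        except ValueError:
            continue  # skip rows whose third column is not numeric
        if n_mapped <= 0:  # $3 > 0
            continue
        for curie in fields[-1].split("|"):  # tr '|' '\n'
            counts[curie.split(":", 1)[0]] += 1  # sed 's/:.*//'
    return counts
```

Calling `count_namespaces(open("supp_kg_chem_nodes.tsv")).most_common()` would then give the prefixes sorted by descending count, similar to `sort | uniq -c | sort -k1nr`.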
@colleenXu

Can we map the 6 CHEMBL.TARGET entities to a different ID namespace? Or remove them? It's an odd identifier for a chemical and NodeNorm doesn't really support that ID namespace (example automated test issue).

I also wonder about adjusting some ID-prefixes to the Translator format:

  • CID -> PUBCHEM.COMPOUND
  • ChEMBL -> CHEMBL.COMPOUND
  • DrugBank -> DRUGBANK
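
The prefix adjustments listed above could be sketched in Python roughly as follows. The names `PREFIX_MAP`, `UNSUPPORTED_PREFIXES`, and `normalize_curie` are hypothetical, and dropping CHEMBL.TARGET identifiers is only one of the two options raised (mapping them to another namespace is the other).

```python
from typing import Optional

# Source prefix -> Translator-style prefix, taken from the list above.
PREFIX_MAP = {
    "CID": "PUBCHEM.COMPOUND",
    "ChEMBL": "CHEMBL.COMPOUND",
    "DrugBank": "DRUGBANK",
}

# Assumption: identifiers in these namespaces are dropped rather than remapped.
UNSUPPORTED_PREFIXES = {"CHEMBL.TARGET"}


def normalize_curie(curie: str) -> Optional[str]:
    """Rewrite the prefix of a "PREFIX:local_id" string; None means drop it."""
    prefix, sep, local_id = curie.partition(":")
    if not sep:
        return curie  # no prefix present, leave unchanged
    if prefix in UNSUPPORTED_PREFIXES:
        return None
    return f"{PREFIX_MAP.get(prefix, prefix)}:{local_id}"
```

Under these assumptions, `normalize_curie("CID:2244")` gives `"PUBCHEM.COMPOUND:2244"`, prefixes not in the map (such as CHEBI or UNII) pass through unchanged, and a CHEMBL.TARGET identifier yields `None`.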

@andrewsu
Member Author

Great points, thanks @colleenXu. Yes, let's revisit these details when we identify someone to work on this issue.
