This repo will host open knowledge graphs from VaidhyaMegha.

Open Knowledge Graph on Clinical Trials

VaidhyaMegha has built an open knowledge graph on clinical trials.

  • This repository contains the source code along with instructions to generate and use this knowledge graph.
  • More information, including references, is available in the article and also here

Knowledge graph for technical decision making

VaidhyaMegha is building an open knowledge graph on technical decision making.

  • This repository contains the source code along with instructions to use this periodically curated knowledge graph.
  • More information, including references, is available in the article and also here

Getting Started

  • Pre-requisite steps

    • Create a folder 'lib'. Download the algs4.jar file from here and place it in the 'lib' folder.
    • Download the hypergraphql jar file from here and place it in the 'lib' folder.
    • Download the 'vocabulary_1.0.0.ttl' file from here and place it in the 'data/open_knowledge_graph_on_clinical_trials' folder.
    • Download mesh2022.nt.gz from here and unzip it. Place the mesh2022.nt file in the 'data/open_knowledge_graph_on_clinical_trials' folder.
    • Download PheGenI from here and place the file in the 'data/open_knowledge_graph_on_clinical_trials' folder.
    • Download detailed_CoOccurs_2021.txt.gz from here and unzip it. Place the detailed_CoOccurs_2021.txt file in the 'data/open_knowledge_graph_on_clinical_trials' folder.
      • Generate the detailed_CoOccurs_2021_selected_fields.txt and detailed_CoOccurs_2021_selected_fields_sorted.txt files using the following commands, which write both files to the 'data/open_knowledge_graph_on_clinical_trials' folder.
      cut -d '|' -f1,9,15 data/open_knowledge_graph_on_clinical_trials/detailed_CoOccurs_2021.txt > data/open_knowledge_graph_on_clinical_trials/detailed_CoOccurs_2021_selected_fields.txt
      sort -u  data/open_knowledge_graph_on_clinical_trials/detailed_CoOccurs_2021_selected_fields.txt > data/open_knowledge_graph_on_clinical_trials/detailed_CoOccurs_2021_selected_fields_sorted.txt
  • To compile and package

    mvn clean package assembly:single -DskipTests
  • To build RDF

    java -jar -Xms4096M -Xmx8192M target/vaidhyamegha-knowledge-graphs-v0.9-jar-with-dependencies.jar
  • To query using SPARQL

    java -jar -Xms4096M -Xmx8144M target/vaidhyamegha-knowledge-graphs-v0.9-jar-with-dependencies.jar -m cli -q src/main/sparql/1_count_of_records.rq
  • To query using GraphQL (via HyperGraphQL)

    java -cp "target/vaidhyamegha-knowledge-graphs-v0.9-jar-with-dependencies.jar:lib/*" -m server
    • From Postman with an N-Triples response
    • From Postman with a JSON response
    • In a separate terminal, execute a GraphQL query using curl (alternatively, use Postman)
      $ curl --location --request POST 'http://localhost:8080/graphql' --header 'Accept: application/ntriples' --header 'Accept-Language: en-GB,en-US;q=0.9,en;q=0.8,kn;q=0.7' --header 'Content-Type: application/json' --data-raw '{"query":"{\n  trial_GET(limit: 30, offset: 1) {\n    label\n  }\n \n}","variables":{}}'
      <> <> <> .
      <> <> "EUCTR2007-006072-11-SE"^^<> .
      <> <> <> .
      <> <> "NCT02954757"^^<> .
      <> <> <> .
      <> <> "EUCTR2014-005525-13-FI"^^<> .
      <> <> <> .
      <> <> "NCT02721914"^^<> .
      <> <> <> .
      <> <> <> .
      <> <> <> .
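
The curl call above can also be issued from Java. The following is a minimal sketch, assuming Java 11 or later and a server already running at the endpoint used in the curl example. The class name GraphQlClientSketch and the helper buildRequest are illustrative and are not part of this repository.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

class GraphQlClientSketch {
    // Endpoint taken from the curl example; adjust it if the server runs elsewhere.
    static final String ENDPOINT = "http://localhost:8080/graphql";

    // Same query as the curl example: labels of 30 trials, starting at offset 1.
    static final String BODY =
        "{\"query\":\"{ trial_GET(limit: 30, offset: 1) { label } }\",\"variables\":{}}";

    // Builds the POST request without sending it, so it can be inspected offline.
    static HttpRequest buildRequest(String accept) {
        return HttpRequest.newBuilder(URI.create(ENDPOINT))
            .header("Accept", accept)
            .header("Content-Type", "application/json")
            .POST(HttpRequest.BodyPublishers.ofString(BODY))
            .build();
    }

    public static void main(String[] args) throws InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        try {
            // Ask for N-Triples, as in the curl example.
            HttpResponse<String> response = client.send(
                buildRequest("application/ntriples"),
                HttpResponse.BodyHandlers.ofString());
            System.out.println(response.body());
        } catch (IOException e) {
            // Typically means the server from the previous step is not running.
            System.err.println("Could not reach " + ENDPOINT + ": " + e);
        }
    }
}
```

Keeping request construction separate from sending makes it easy to change the Accept header without touching the rest of the client.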

Features as of the current release - 0.9

Summary : Using any trial id from across the globe, find the associated diseases/interventions, research articles and genes. Also discover relationships between various medical topics through co-occurrences in articles. Query the graph using SPARQL from the CLI, or using GraphQL from any API client tool, e.g. Postman or curl.

Feature list :

  • Using the GraphQL API, the knowledge graph can be queried from any API client tool, e.g. curl or Postman.
  • Graph includes trials from across the globe. Data is sourced from WHO's ICTRP and
  • Links from trial to MeSH vocabulary are added for conditions and interventions employed in the trial.
  • Links from trial to PubMed articles are added. PubMed's experts curate this metadata information for each article.
  • Added MRCOC to the graph for the selected articles linked to clinical trials.
  • Added PheGenI links i.e. links from phenotype to genotype as links between MeSH DUI and GeneID.
  • Added SPARQL query execution, a CLI mode, and a count SPARQL query for demo purposes.
  • 5 co-existing bipartite graphs, between trial --> condition, trial --> intervention, trial --> articles, article --> MeSH DUIs, and gene id --> MeSH DUIs, together comprise this knowledge graph.
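
To illustrate how the bipartite graphs listed above combine, here is a minimal in-memory sketch that walks from a trial to its articles, then to MeSH DUIs, and finally to the genes linked to those DUIs. It is not how this repository stores or queries the graph (that is done through RDF and SPARQL/GraphQL). The class name, method names and all node ids are made up for illustration.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class BipartiteTraversalSketch {
    // One adjacency map per bipartite graph: source node -> set of target nodes.
    final Map<String, Set<String>> trialToArticles = new HashMap<>();
    final Map<String, Set<String>> articleToMesh = new HashMap<>();
    final Map<String, Set<String>> geneToMesh = new HashMap<>();

    // Adds a directed edge to one of the bipartite graphs.
    static void link(Map<String, Set<String>> graph, String from, String to) {
        graph.computeIfAbsent(from, k -> new TreeSet<>()).add(to);
    }

    // trial -> articles -> MeSH DUIs, then every gene that shares one of those DUIs.
    Set<String> genesForTrial(String trialId) {
        Set<String> meshDuis = new TreeSet<>();
        for (String article : trialToArticles.getOrDefault(trialId, Set.of())) {
            meshDuis.addAll(articleToMesh.getOrDefault(article, Set.of()));
        }
        Set<String> genes = new TreeSet<>();
        for (Map.Entry<String, Set<String>> e : geneToMesh.entrySet()) {
            if (!Collections.disjoint(e.getValue(), meshDuis)) {
                genes.add(e.getKey());
            }
        }
        return genes;
    }

    public static void main(String[] args) {
        BipartiteTraversalSketch kg = new BipartiteTraversalSketch();
        // Hypothetical links, not real graph data.
        link(kg.trialToArticles, "TRIAL-1", "ARTICLE-1");
        link(kg.articleToMesh, "ARTICLE-1", "DUI-A");
        link(kg.geneToMesh, "GENE-1", "DUI-A");
        link(kg.geneToMesh, "GENE-2", "DUI-B");
        System.out.println(kg.genesForTrial("TRIAL-1")); // prints [GENE-1]
    }
}
```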

Changes in this release : Server mode of execution is added.

Release notes

  • v0.9
      java -cp "target/vaidhyamegha-knowledge-graphs-v0.9-jar-with-dependencies.jar:lib/*" -m server
  • v0.8
    • Enable GraphQL interface to the knowledge graph using HyperGraphQL
    java -Dorg.slf4j.simpleLogger.defaultLogLevel=debug -jar lib/hypergraphql-3.0.1-exe.jar --config src/main/resources/hql-config.json
  • v0.7
    • Enable SPARQL queries
      $ cat src/main/sparql/1_count_of_records.rq 
      SELECT (count(*) as ?count)
      where { ?s ?p ?o}
      $ sparql --data=data/open_knowledge_graph_on_clinical_trials/vaidhyamegha_open_kg_clinical_trials.nt --query=src/main/sparql/1_count_of_records.rq
      | count   |
      | 4766048 |
      $ wc -l data/open_knowledge_graph_on_clinical_trials/vaidhyamegha_open_kg_clinical_trials.nt 
      4766048 data/open_knowledge_graph_on_clinical_trials/vaidhyamegha_open_kg_clinical_trials.nt
  • v0.6.1
    • Externalize the Entrez API invocation threshold probability
    • Patch for the issue below
      $ sparql --data=data/open_knowledge_graph_on_clinical_trials/vaidhyamegha_open_kg_clinical_trials.nt --query=src/main/sparql/example.rq
      04:33:04 ERROR riot            :: [line: 1085476, col: 71] Bad character in IRI (Tab character): <[tab]...>
      Failed to load data
      $ grep "SLCTR/2020/014" data/open_knowledge_graph_on_clinical_trials/vaidhyamegha_open_kg_clinical_trials.nt 
      <	> <TrialId> "SLCTR/2020/014\t" .
  • v0.6
    • Added PheGenI links i.e. links from phenotype to genotype as links between MeSH DUI and GeneID.
    <> <Gene> <> .
    <> <GeneID> "10014" .
    <> <Gene> <> .
    <> <GeneID> "6923" .
    <> <Gene> <> .
    <> <GeneID> "3198" .
  • v0.5
    • Adding MRCOC to the graph for the selected articles linked to clinical trials.
    <> <MeSH_DUI> <> .
    <> <MeSH_DUI> <> .
    <> <MeSH_DUI> <> .
  • v0.4
    • The list of trial ids is incrementally checked against the Entrez API to generate the incremental mappings between trials and PubMed articles
    $ grep "Pubmed_Article" data/open_knowledge_graph_on_clinical_trials/vaidhyamegha_open_kg_clinical_trials.nt 
    <> <Pubmed_Article> "25153486" .
    <> <Pubmed_Article> "34064657" .
  • v0.3
    • Adding links between trials and interventions in addition to trials and conditions.
    • Conditions and interventions are fetched from the database (instead of files). Corresponding edges between trials and conditions, and between trials and interventions, are added to the RDF. For example :
      <> <Condition> <> .
      <> <Intervention> <> .
    • All global trials (756,169) are added to the RDF. For example :
    <> <TrialId> "NCT00172328" .
    <> <TrialId> "CTRI/2021/05/033487" .
    • The final RDF now starts from a fresh model. MeSH ids that are not linked to any trial are not considered. This reduces the graph size considerably.
    • Trial records are fetched from ICTRP's weekly + periodic full export and AACT's daily + monthly full snapshot.
    • Trials are written down to a file (will be used later) : vaidhyamegha_clinical_trials.csv
      $ wc -l vaidhyamegha_clinical_trials.csv
      755272 vaidhyamegha_clinical_trials.csv
    • Download the RDF from here.
  • v0.2
    • Clinical trials are linked to the RDF nodes corresponding to the MeSH terms for conditions. For example :
    • Download the enhanced RDF from here.
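
The tab-character failure noted under v0.6.1 arises when a raw trial id with trailing whitespace is used to build an IRI. The sketch below shows one way to guard against it by stripping whitespace before the triple is written. It is an illustration only and not necessarily the patch applied in this repository. The base IRI and the class and method names are hypothetical, and Java 11 or later is assumed for String.strip().

```java
class TrialIdSanitizerSketch {
    // Hypothetical base IRI, used only for this illustration.
    static final String BASE = "https://example.org/trial/";

    // Removes leading and trailing whitespace (tabs included) from a raw trial id.
    static String clean(String rawTrialId) {
        return rawTrialId.strip();
    }

    // Emits one N-Triples line linking the trial node to its id literal.
    static String toTriple(String rawTrialId) {
        String id = clean(rawTrialId);
        return "<" + BASE + id + "> <TrialId> \"" + id + "\" .";
    }

    public static void main(String[] args) {
        // The id from the v0.6.1 error report, including its trailing tab.
        System.out.println(toTriple("SLCTR/2020/014\t"));
    }
}
```

This only handles whitespace at the ends of the id. Ids containing other characters that are illegal in IRIs would need additional escaping.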


Prequels to this project

VaidhyaMegha's prior work on

  • clinical trial registries data linking.
  • symptoms to diseases linking.
  • phenotype to genotype linking.
  • trials to research articles linking.

The last three are covered in the "examples" folder here. They were previously covered in separate public repos here.

Next steps

  • Complete article
  • Use the full list of trial ids in combination with the id_information table to generate a final list of unique trials using the weighted quick-union with path compression (WQUPC) algorithm
  • Add secondary trial ids to graph (this may increase graph size considerably). However, it could be of utility.
  • Build a SPARQL + GraphQL version of the API to allow direct querying of the graph. Provide some reasonable examples that are harder in SQL.
  • SNOMED CT, ICD-10.
  • Host the knowledge graph on Neo4j's cloud service, AuraDB.
  • Use Neo4j's GraphQL API from Postman to demonstrate sample queries on clinical trials.
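
The WQUPC step mentioned above can be sketched as follows: every pair of ids known to refer to the same trial is merged, and the number of remaining groups is the number of unique trials. This is a minimal, self-contained sketch keyed by string ids and is not the algs4.jar implementation. It assumes that id pairs come from rows of the id_information table. The class name and the ids in main are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

class TrialIdUnionFind {
    private final Map<String, String> parent = new HashMap<>();
    private final Map<String, Integer> size = new HashMap<>();
    private int components = 0;

    // Registers an id as its own single-element group if it has not been seen yet.
    private void add(String id) {
        if (!parent.containsKey(id)) {
            parent.put(id, id);
            size.put(id, 1);
            components++;
        }
    }

    // Returns the representative id of the group, compressing the path on the way.
    String find(String id) {
        add(id);
        String root = id;
        while (!root.equals(parent.get(root))) {
            root = parent.get(root);
        }
        while (!id.equals(root)) {
            String next = parent.get(id);
            parent.put(id, root); // point every visited node directly at the root
            id = next;
        }
        return root;
    }

    // Merges the groups of two ids, attaching the smaller tree under the larger one.
    void union(String a, String b) {
        String ra = find(a);
        String rb = find(b);
        if (ra.equals(rb)) return;
        if (size.get(ra) < size.get(rb)) {
            String tmp = ra;
            ra = rb;
            rb = tmp;
        }
        parent.put(rb, ra);
        size.put(ra, size.get(ra) + size.get(rb));
        components--;
    }

    // Number of distinct groups, i.e. unique trials seen so far.
    int count() {
        return components;
    }

    public static void main(String[] args) {
        TrialIdUnionFind uf = new TrialIdUnionFind();
        // Hypothetical (primary id, secondary id) pairs.
        uf.union("ID-A", "ID-B");
        uf.union("ID-B", "ID-C");
        uf.find("ID-D"); // an id with no known aliases
        System.out.println(uf.count()); // prints 2
    }
}
```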