Knowledge graph references

General overview

Seminars

Talks

  • Natural Language Search with Knowledge Graphs - Trey Grainger, Lucidworks video
  • Building a Knowledge Graph with Spark and NLP: How We Recommend Novel Drugs to our Scientists. Talk by the AstraZeneca team (one of the Big Pharma companies) at Spark+AI Summit 2019. video

Reviews

  • Shaoxiong Ji, Shirui Pan et al. A Survey on Knowledge Graphs: Representation, Acquisition and Applications (2020) paper
  • Graph Technology Landscape 2020. Great overview of the rising graph technology industry. blog post

Reading list

  • A Reading List of Academic Articles using the Biological Expression Language (BEL) from Charlie Hoyt. It’s divided into the categories of software/visualization tools, algorithms/analytical frameworks, data integration, natural language processing, curation workflows, and downstream applications. bel-papers
  • Generation and Applications of Knowledge Graphs in Systems and Networks Biology. Doctoral thesis of Dr. Charles Tapley Hoyt that was defended on December 3rd, 2019. pdf

Knowledge graphs related to COVID-19

  • COVID-19 Research Knowledge Graph. Knowledge graph built from the CORD-19 dataset by the JPL NASA group. github
  • Covid-19-Community. This project is a community effort to build a Neo4j Knowledge Graph (KG) that links heterogeneous data about COVID-19 to help fight this outbreak! It serves as a sandbox and incubator project and the best ideas will be incorporated into the Covid-19-Net KG. github
  • COVID❋GRAPH. A voluntary initiative of graph enthusiasts and companies with the goal to build a knowledge graph with relevant information about COVID-19 and the SARS-CoV-2 virus. initiative page
  • CoViz. A tool built by AI2 for exploring associations between concepts appearing in the COVID-19 Open Research Dataset. Searching for a term displays a network of top related terms mined from the corpus. website
  • Knowledge Graph of COVID-19 Literature. Knowledge graph built by IBM as a part of its Corpus Processing Service. This knowledge graph integrates COVID-19 data from various sources. Search on graph, data and reports
  • BioGrakn Knowledge Graph. Collection of knowledge graphs of biomedical data. Built as a demonstration by GraknLabs. github blog post BioGrakn COVID github
  • COVID-19 Knowledge Graph: a computable, multi-modal, cause-and-effect knowledge model of COVID-19 pathophysiology. paper github
  • COVID-19 Disease Map. Knowledge repository of molecular mechanisms of COVID-19 as a broad community-driven effort. webpage publication fairdomhub
  • Knowledge Extraction to Assist Scientific Discovery from Corona Virus Literature. The constructed knowledge graph includes 50,752 Gene nodes, 10,781 Disease nodes, 5,738 Chemical nodes, and 535 Organism nodes. These nodes are connected by 133 relation types including Gene–Chemical–Interaction Relationships, Chemical–Disease Associations, Gene–Disease Associations, Chemical–GO Enrichment Associations and Chemical–Pathway Enrichment Associations. webpage
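
The typed nodes and relation types mentioned in the last entry can be pictured as a small typed graph. The following is only a minimal sketch: the `Node` and `Edge` structures, the helper functions, and the placeholder identifiers are assumptions made for illustration and are not taken from any of the projects listed above.

```python
from collections import Counter, namedtuple

# Hypothetical minimal data model: typed nodes and typed (labelled) edges.
Node = namedtuple("Node", ["node_id", "node_type"])
Edge = namedtuple("Edge", ["source", "relation", "target"])

# Placeholder identifiers; only the type names echo the description above.
NODES = [
    Node("gene:1", "Gene"),
    Node("gene:2", "Gene"),
    Node("chemical:1", "Chemical"),
    Node("disease:1", "Disease"),
]
EDGES = [
    Edge("gene:1", "Gene-Disease Association", "disease:1"),
    Edge("chemical:1", "Chemical-Disease Association", "disease:1"),
    Edge("gene:2", "Gene-Chemical Interaction", "chemical:1"),
]


def count_nodes_by_type(nodes):
    """Count how many nodes exist for each node type."""
    return Counter(node.node_type for node in nodes)


def neighbors(node_id, edges):
    """Return (relation, other_node_id) pairs for every edge touching node_id."""
    result = []
    for edge in edges:
        if edge.source == node_id:
            result.append((edge.relation, edge.target))
        elif edge.target == node_id:
            result.append((edge.relation, edge.source))
    return result


type_counts = count_nodes_by_type(NODES)
disease_neighbors = neighbors("disease:1", EDGES)
print(dict(type_counts))
print(disease_neighbors)
```

A graph database such as Neo4j stores the same kind of information as labelled nodes and relationships; the in-memory lists here only show the shape of the data.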

Annotated data related to Covid-19

  • CORD-19. The Semantic Scholar team at the Allen Institute for AI has partnered with leading research groups to provide CORD-19, a free resource of more than 128,000 scholarly articles about the novel coronavirus for use by the global research community. official page CORD-19 explorer Kaggle discussion forum
  • CoronaWhy data lake. Data hub MongoDB service GoogleCloudPlatform
  • COVID-19 Annotated Data by SciBiteLabs. Annotated Data for the COVID-19 Open Research Dataset Challenge. github
  • PubTator collections on COVID-19. PubTator provides automated annotations of biomedical entities in scientific publications. NLM/NCBI BioNLP Research Group presents recent results of applying PubTator on the literature about COVID-19 and other coronaviruses. In particular, they feature results on two specific data collections: LitCovid and CORD-19. PubTator annotations are provided for six entity types (gene/protein, drug/chemical, disease, cell type, species and genomic variants) in two formats (BioC JSON and BioC XML). github site
  • CORD-19-on-FHIR. A Linked Data version of the COVID-19 Open Research Dataset (CORD-19) data. github

Ontologies and knowledge databases

  • Unified Medical Language System. The UMLS integrates and distributes key terminology, classification and coding standards, and associated resources. The UMLS includes 3 knowledge sources: the Metathesaurus (terms and codes from many vocabularies), the Semantic Network (semantic types and their relationships), and the SPECIALIST Lexicon and Lexical Tools (a large syntactic lexicon of biomedical and general English, plus tools for normalizing strings, generating lexical variants, and creating indexes). website
  • STRING. Protein-Protein Interaction Networks. website
  • PharmGKB. A pharmacogenomics knowledge resource that encompasses clinical information including clinical guidelines and drug labels, potentially clinically actionable gene-drug associations and genotype-phenotype relationships. website
  • The Immune Epitope Database. IEDB catalogs experimental data on antibody and T cell epitopes studied in humans, non-human primates, and other animal species in the context of infectious disease, allergy, autoimmunity and transplantation. website
  • Evidence and Conclusion Ontology (ECO). An ontology of evidence types for supporting conclusions in scientific research. page on bioportal
  • Library of ontologies provided by Bioportal catalog

Building knowledge graph, information extraction

Paper search, filter and scoring

  • Covid-19 Semantic Browser: Browse Covid-19 & SARS-CoV-2 Scientific Papers with Transformers. An interactive experimental tool leveraging a state-of-the-art language model to search relevant content inside the COVID-19 Open Research Dataset (CORD-19). github
  • KDCOVID. This tool retrieves papers by measuring similarity between queries and sentences in the full text of papers in the CORD-19 corpus using a similarity metric derived from BioSentVec. web-tool github
  • SciFact. Dataset & baseline model built by AI2 for fact-checking: Given a corpus of scientific articles and a claim about a scientific finding, a fact-checking model must identify abstracts that support or refute the claim. paper github
  • The Semantic Scholar Search Reranker provided by AI2. github
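
The sentence-level retrieval that the KDCOVID entry describes can be sketched as ranking sentences by cosine similarity to a query embedding. This is only an illustrative sketch: the three-dimensional vectors and the `rank_sentences` helper are made up for the example, and a real system would obtain its embeddings from a sentence encoder such as BioSentVec rather than from hand-written numbers.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_sentences(query_vec, sentence_vecs):
    """Return (sentence, score) pairs sorted by descending similarity to the query."""
    scored = [(s, cosine_similarity(query_vec, v)) for s, v in sentence_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy embeddings: hypothetical values, not produced by any real model.
SENTENCES = {
    "ACE2 is the entry receptor for SARS-CoV-2.": [0.9, 0.1, 0.0],
    "Remdesivir was evaluated in clinical trials.": [0.1, 0.9, 0.2],
    "The weather was mild in spring.": [0.0, 0.1, 0.9],
}
QUERY = [0.8, 0.2, 0.1]  # stands in for the embedding of a query about virus receptors

ranking = rank_sentences(QUERY, SENTENCES)
for sentence, score in ranking:
    print(f"{score:.3f}  {sentence}")
```

Papers can then be ordered by the score of their best-matching sentence, which is one simple way to turn sentence similarity into paper retrieval.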

Language models

  • BioBERT. BERT trained on PubMed data by the DMIS-lab team. github paper implementations list on Papers with Code
  • SciBERT. A BERT model for scientific text from AI2. github paper
  • CovidBERT. Model trained by DeepSet on AllenAI's CORD19 Dataset of scientific articles about coronaviruses. Implemented as a part of the Transformers library. github
  • BlueBERT. A BERT model pre-trained on PubMed abstracts and clinical notes (MIMIC-III). Provided by NLM/NCBI BioNLP Research Group github paper

Open information extraction

  • Open IE. System from the University of Washington (UW) and the Indian Institute of Technology, Delhi (IIT Delhi). It is used by the JPL NASA group. github
  • Stanford Open IE. System from Stanford University, part of Stanford CoreNLP. project page
  • Graphene. According to its authors, the system outperforms state-of-the-art Open IE systems in the construction of correct n-ary predicate-argument structures. github paper
  • Another unsupervised approach for the open relation extraction task is self-organizing maps: Elena Manishina et al. Unsupervised relation extraction from scientific texts using a self-organizing maps paper

Named Entity Recognition

  • BERN. BioBERT-based multi-type NER tool that also supports normalization of extracted entities. Built by DMIS-lab. github paper
  • SciSpacy. A full pipeline and models for processing scientific/biomedical documents, including biomedical NER models. website github notebook with NER model
  • Comprehensive Named Entity Recognition (NER) on CORD-19 with Distant or Weak Supervision. blog post

Weak supervision and relation extraction

  • Snorkel. A system for programmatically building and managing training data. It was built by a team from Stanford University and is widely used by many companies (Google, Facebook, etc.). website
  • Short review of weak supervision approaches to the relation extraction task: Alisa Smirnova and Philippe Cudré-Mauroux. 2018. Relation Extraction Using Distant Supervision: A Survey. paper
  • A great example of using weak supervision (Snorkel in particular) for biomedical information extraction (including numerical data): Kuleshov, V., Ding, J., Vo, C. et al. A machine-compiled database of genome-wide association studies, 2019 paper github
  • Another example of using weak supervision, from the BenevolentAI team, a well-funded startup building an AI system for drug discovery: Julien Fauqueur et al. Constructing large scale biomedical knowledge bases from scratch with rapid annotation of interpretable patterns paper
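
The idea of building training data programmatically can be sketched with a few heuristic labeling functions whose votes are combined into one noisy label. The sketch below is plain Python and deliberately does not use the Snorkel API; the function names, the keyword rules, and the majority-vote combiner are assumptions made for illustration (Snorkel itself fits a label model over the votes instead of taking a simple majority).

```python
from collections import Counter

ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1


def lf_contains_treats(sentence):
    """Vote POSITIVE when the sentence uses the verb 'treats'."""
    return POSITIVE if " treats " in sentence.lower() else ABSTAIN


def lf_contains_inhibits(sentence):
    """Vote POSITIVE when the sentence uses the verb 'inhibits'."""
    return POSITIVE if " inhibits " in sentence.lower() else ABSTAIN


def lf_negation(sentence):
    """Vote NEGATIVE when the sentence contains an explicit negation cue."""
    text = sentence.lower()
    return NEGATIVE if " does not " in text or " no effect " in text else ABSTAIN


LABELING_FUNCTIONS = [lf_contains_treats, lf_contains_inhibits, lf_negation]


def majority_vote(sentence):
    """Combine labeling-function votes; abstain when nothing fires or on a tie."""
    votes = [lf(sentence) for lf in LABELING_FUNCTIONS]
    votes = [v for v in votes if v != ABSTAIN]
    if not votes:
        return ABSTAIN
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN
    return ranked[0][0]


# Placeholder sentences; "Drug A" and friends are not real entities.
examples = [
    "Drug A inhibits protein B.",
    "Drug A does not treat disease C.",
    "Protein B is expressed in lung tissue.",
    "Drug A treats X but has no effect on Y.",
]
labels = [majority_vote(s) for s in examples]
print(labels)
```

Sentences that receive a non-abstain label can then serve as noisy training data for a relation extraction model, which is the general pattern the papers above build on.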

Relation extraction

  • SpERT. BERT-based span model for joint entity and relation extraction; its paper reports state-of-the-art results. paper github
  • GraphREL. An end-to-end relation extraction model which uses graph convolutional networks (GCNs) to jointly learn named entities and relations. paper github
  • SemRep. It might be a good option if you want something that works right out of the box. It is the NLM triple extraction tool built on top of MetaMap. It comes with the usual UMLS license shenanigans and is not necessarily the latest and greatest, but works reasonably well in my experience. webpage
  • OpenNRE. An open-source and extensible toolkit that provides a unified framework to implement relation extraction models (including few-shot and document-level models). demosite github paper

Relation descriptions, schema standards and graph processing tools

  • MI2CAST. Minimum Information about a Molecular Interaction CAusal STatement. This checklist defines both the required core information, as well as a comprehensive set of other contextual details valuable to the end user and relevant for reusing and reproducing causal molecular interaction information. paper github
  • BEL. The Biological Expression Language captures causal, correlative, and associative relationships between biological entities along with the experimental/biological context in which they were observed as well as the provenance of the publication from which the relation was reported. language tutorial
  • PyBEL. Python software package that parses BEL documents, validates their semantics, and facilitates data interchange between common formats and database systems like JSON, CSV, Excel, SQL, CX, and Neo4J. github documentation
  • PyBEL-tools. Library of functions for analysis of biological networks. github PyBEL-Notebooks
  • BEL4corona. Code, notebooks, and resources for exploring and analyzing mechanistic knowledge graphs about corona. github

Entity linking, entity normalisation, disambiguation, grounding

  • PyOBO. Tools for biological identifiers, names, synonyms, xrefs, hierarchies, relations, and properties through the perspective of Open Biomedical Ontology (OBO). github blog post
  • Gilda grounding service. Grounding of biomedical named entities with contextual disambiguation. Developed by INDRA labs which is part of the Harvard Program in Therapeutic Science (HiTS). http://grounding.indra.bio github
  • Adeft. Utility for building models to disambiguate acronyms and other abbreviations of biological terms in the scientific literature. Developed by INDRA labs. github paper

Other scientific document processing

  • SciWING. A modern framework from WING-NUS to facilitate Scientific Document Processing. It is built on PyTorch and includes many pre-trained models for fundamental tasks in Scientific Document Processing: Logical Structure Recovery, Header Normalisation, Citation String Parsing, Citation Intent Classification, keyphrase extraction and others. site github paper

Evaluation

  • BLUE. The Biomedical Language Understanding Evaluation benchmark consists of five different biomedicine text-mining tasks (including NER & RE) with ten corpora. It relies on preexisting datasets because they have been widely used by the BioNLP community as shared tasks. paper github

Libraries

  • INDRA (Integrated Network and Dynamical Reasoning Assembler). An automated model assembly system, funded by DARPA, that draws on natural language processing systems and structured databases to collect mechanistic and causal assertions, represents them in a standardized form (INDRA Statements), and assembles them into various modeling formalisms including causal graphs and dynamical models. website COVID19 model github

Graph analysis

Graph embeddings

  • Heterogeneous Graph Transformer. Graph neural network architecture from Microsoft and University of California. HGT can deal with large-scale heterogeneous and dynamic graphs. paper github
  • OpenKE. An open toolkit for knowledge embedding (OpenKE), which provides a unified framework and various fundamental models to embed knowledge graphs into a continuous low-dimensional space. paper github
  • BioNEV. This work aims to systematically evaluate recent advanced graph embedding techniques on biomedical tasks. The authors compile 5 benchmark datasets for 4 biomedical prediction tasks (see paper for details) and use them to evaluate 11 representative graph embedding methods. paper github
  • PyTorch-BigGraph. An embedding system from Facebook that incorporates several modifications to traditional multi-relation embedding systems that allow it to scale to graphs with billions of nodes and trillions of edges. paper github
  • BioKEEN. A package for training and evaluating biological knowledge graph embeddings built on PyKEEN. github (parent package - PyKEEN)
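
To make "embedding a knowledge graph into a continuous low-dimensional space" more concrete, here is a minimal sketch of translation-based scoring in the style of TransE, one well-known embedding model. It is used here only as an illustration: the two-dimensional vectors are hand-picked rather than trained, the entity and relation names are placeholders, and none of the code reflects the actual API of the toolkits listed above.

```python
import math

# Hand-picked toy embeddings (assumptions for illustration, not trained values).
ENTITY_VECS = {
    "drug_a": [0.0, 0.0],
    "protein_b": [1.0, 0.0],
    "disease_c": [0.0, 1.0],
}
RELATION_VECS = {
    "targets": [1.0, 0.0],
}


def transe_score(head, relation, tail):
    """Negative L2 distance between (head + relation) and tail; higher means more plausible."""
    h = ENTITY_VECS[head]
    r = RELATION_VECS[relation]
    t = ENTITY_VECS[tail]
    return -math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))


def rank_tails(head, relation):
    """Rank all other entities as candidate tails for (head, relation, ?)."""
    candidates = [e for e in ENTITY_VECS if e != head]
    return sorted(candidates, key=lambda tail: transe_score(head, relation, tail), reverse=True)


ranked_tails = rank_tails("drug_a", "targets")
print(ranked_tails)
```

Training adjusts the vectors so that observed triples score higher than corrupted ones; link prediction then amounts to ranking candidate tails (or heads) as `rank_tails` does.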

Graph Neural Networks

  • Deep Graph Library (DGL). Python package built for easy implementation of graph neural network model family, on top of existing DL frameworks (e.g. PyTorch, MXNet, Gluon etc.). website github docs