Graphs4Good GraphHack. Community Effort to Build a Knowledge Graph to Fight COVID-19 video
Talks
Natural Language Search with Knowledge Graphs - Trey Grainger, Lucidworks video
Building a Knowledge Graph with Spark and NLP: How We Recommend Novel Drugs to our Scientists. Talk by the AstraZeneca team (one of the big pharma companies) at Spark+AI Summit 2019. video
Reviews
Shaoxiong Ji, Shirui Pan et al. A Survey on Knowledge Graphs: Representation, Acquisition and Applications (2020) paper
Graph Technology Landscape 2020. Great overview of the rising graph technology industry. blog post
Reading list
A Reading List of Academic Articles using the Biological Expression Language (BEL) from Charlie Hoyt. It’s divided into the categories of software/visualization tools, algorithms/analytical frameworks, data integration, natural language processing, curation workflows, and downstream applications. bel-papers
Generation and Applications of Knowledge Graphs in Systems and Networks Biology. Doctoral thesis of Dr. Charles Tapley Hoyt that was defended on December 3rd, 2019. pdf
Knowledge graphs related to COVID-19
COVID-19 Research Knowledge Graph. Knowledge graph built from the CORD-19 dataset by the JPL NASA group. github
Covid-19-Community. This project is a community effort to build a Neo4j Knowledge Graph (KG) that links heterogeneous data about COVID-19 to help fight this outbreak! It serves as a sandbox and incubator project and the best ideas will be incorporated into the Covid-19-Net KG. github
COVID❋GRAPH. A voluntary initiative of graph enthusiasts and companies with the goal to build a knowledge graph with relevant information about COVID-19 and the SARS-CoV-2 virus. initiative page
CoViz. A tool built by AI2 for exploring associations between concepts appearing in the COVID-19 Open Research Dataset. Searching for a term displays a network of top related terms mined from the corpus. website
Knowledge Graph of COVID-19 Literature. Knowledge graph built by IBM as a part of its Corpus Processing Service. This knowledge graph integrates COVID-19 data from various sources. Search on graph, data and reports
BioGrakn Knowledge Graph. Collection of knowledge graphs of biomedical data, built as a demonstration by GraknLabs. github, blog post, BioGrakn COVID github
COVID-19 Knowledge Graph: a computable, multi-modal, cause-and-effect knowledge model of COVID-19 pathophysiology. paper, github
COVID-19 Disease Map. Knowledge repository of molecular mechanisms of COVID-19 as a broad community-driven effort. webpage, publication, fairdomhub
Knowledge Extraction to Assist Scientific Discovery from Corona Virus Literature. The constructed knowledge graph includes 50,752 Gene nodes, 10,781 Disease nodes, 5,738 Chemical nodes, and 535 Organism nodes. These nodes are connected by 133 relation types, including Gene–Chemical–Interaction Relationships, Chemical–Disease Associations, Gene–Disease Associations, Chemical–GO Enrichment Associations and Chemical–Pathway Enrichment Associations. webpage
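A graph like the one in the last entry, with typed nodes (Gene, Disease, Chemical, Organism) joined by typed relations, fits in a very small data structure. The sketch below is illustrative only: the `TypedGraph` class, the example entities and the relation labels are hypothetical and are not taken from any of the projects listed above.

```python
from collections import defaultdict


class TypedGraph:
    """Tiny in-memory knowledge graph with typed nodes and typed edges."""

    def __init__(self):
        self.node_types = {}           # node id -> node type
        self.edges = defaultdict(set)  # (source, relation) -> set of targets

    def add_node(self, node_id, node_type):
        self.node_types[node_id] = node_type

    def add_edge(self, source, relation, target):
        # Refuse edges whose endpoints were never declared.
        for node in (source, target):
            if node not in self.node_types:
                raise KeyError(f"unknown node: {node}")
        self.edges[(source, relation)].add(target)

    def neighbors(self, source, relation):
        """Targets reachable from `source` over `relation`, sorted by id."""
        return sorted(self.edges.get((source, relation), ()))

    def count_by_type(self):
        counts = defaultdict(int)
        for node_type in self.node_types.values():
            counts[node_type] += 1
        return dict(counts)


# Hypothetical example data, chosen only to show the shape of the graph.
kg = TypedGraph()
kg.add_node("ACE2", "Gene")
kg.add_node("COVID-19", "Disease")
kg.add_node("chloroquine", "Chemical")
kg.add_edge("ACE2", "gene_disease_association", "COVID-19")
kg.add_edge("chloroquine", "chemical_disease_association", "COVID-19")

print(kg.neighbors("ACE2", "gene_disease_association"))
print(kg.count_by_type())
```

Production knowledge graphs usually live in a graph database such as Neo4j; the dictionary layout here only shows the node-type and relation-type bookkeeping.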
Annotated data related to Covid-19
CORD-19. The Semantic Scholar team at the Allen Institute for AI has partnered with leading research groups to provide CORD-19, a free resource of more than 128,000 scholarly articles about the novel coronavirus for use by the global research community. official page, CORD-19 explorer, Kaggle, discussion forum
COVID-19 Annotated Data by SciBiteLabs. Annotated Data for the COVID-19 Open Research Dataset Challenge. github
PubTator collections on COVID-19. PubTator provides automated annotations of biomedical entities in scientific publications. NLM/NCBI BioNLP Research Group presents recent results of applying PubTator on the literature about COVID-19 and other coronaviruses. In particular, they feature results on two specific data collections: LitCovid and CORD-19. PubTator annotations are provided for six entity types (gene/protein, drug/chemical, disease, cell type, species and genomic variants) in two formats (BioC JSON and BioC XML). github, site
CORD-19-on-FHIR. A Linked Data version of the COVID-19 Open Research Dataset (CORD-19) data. github
Ontologies and knowledge databases
Unified Medical Language System. The UMLS integrates and distributes key terminology, classification and coding standards, and associated resources. The UMLS includes three knowledge sources: the Metathesaurus (terms and codes from many vocabularies), the Semantic Network (semantic types and their relationships), and the SPECIALIST Lexicon and Lexical Tools (a large syntactic lexicon of biomedical and general English, plus tools for normalizing strings, generating lexical variants, and creating indexes). website
PharmGKB. A pharmacogenomics knowledge resource that encompasses clinical information including clinical guidelines and drug labels, potentially clinically actionable gene-drug associations and genotype-phenotype relationships. website
The Immune Epitope Database. IEDB catalogs experimental data on antibody and T cell epitopes studied in humans, non-human primates, and other animal species in the context of infectious disease, allergy, autoimmunity and transplantation. website
Evidence and Conclusion Ontology (ECO). An ontology of evidence types for supporting conclusions in scientific research. page on bioportal
Library of ontologies provided by BioPortal. catalog
Building knowledge graph, information extraction
Paper search, filter and scoring
Covid-19 Semantic Browser: Browse Covid-19 & SARS-CoV-2 Scientific Papers with Transformers. An interactive experimental tool leveraging a state-of-the-art language model to search relevant content inside the COVID-19 Open Research Dataset (CORD-19). github
KDCOVID. This tool retrieves papers by measuring similarity between queries and sentences in the full text of papers in the CORD-19 corpus, using a similarity metric derived from BioSentVec. web-tool, github
SciFact. Dataset & baseline model built by AI2 for fact-checking: Given a corpus of scientific articles and a claim about a scientific finding, a fact-checking model must identify abstracts that support or refute the claim. paper, github
The Semantic Scholar Search Reranker provided by AI2. github
SciBERT. A BERT model for scientific text from AI2. github, paper
CovidBERT. A model trained by DeepSet on AllenAI's CORD-19 dataset of scientific articles about coronaviruses. Implemented as a part of the Transformers library. github
BlueBERT. A BERT model pre-trained on PubMed abstracts and clinical notes (MIMIC-III). Provided by NLM/NCBI BioNLP Research Group. github, paper
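Tools such as KDCOVID above retrieve papers by comparing a query against individual sentences and scoring each paper by its best match. The sketch below shows that retrieval pattern under a loud assumption: a bag-of-words count vector stands in for the sentence embedding, so it is not BioSentVec and not KDCOVID's code. The function names and the two example papers are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text):
    """Stand-in embedding: a bag-of-words count vector (NOT BioSentVec)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_papers(query, papers):
    """Score each paper by its best-matching sentence, highest score first."""
    query_vec = embed(query)
    scored = []
    for paper_id, sentences in papers.items():
        best = max((cosine(query_vec, embed(s)) for s in sentences), default=0.0)
        scored.append((best, paper_id))
    # Sort by descending score, breaking ties by paper id for determinism.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [paper_id for _, paper_id in scored]


# Hypothetical mini-corpus: paper id -> list of sentences.
papers = {
    "paper_a": ["ACE2 is the entry receptor for the virus.", "We collected samples."],
    "paper_b": ["Graph embeddings scale to large graphs."],
}
print(rank_papers("virus entry receptor", papers))
```

Swapping `embed` for a real sentence encoder (and `cosine` for a dense dot product) keeps the ranking logic unchanged.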
Open information extraction
Open IE. System from the University of Washington (UW) and Indian Institute of Technology, Delhi (IIT Delhi). The system is used by the JPL NASA group. github
Stanford Open IE. System from Stanford University, part of Stanford CoreNLP. project page
Graphene. According to its authors, the system outperforms state-of-the-art Open IE systems in the construction of correct n-ary predicate-argument structures. github, paper
Another unsupervised approach to the open relation extraction task is self-organizing maps: Elena Manishina et al. Unsupervised relation extraction from scientific texts using a self-organizing maps paper
Named Entity Recognition
BERN. BioBERT-based multi-type NER tool that also supports normalization of extracted entities. Built by DMIS-lab. github, paper
SciSpacy. A full pipeline and models for processing scientific/biomedical documents, including biomedical NER models. website, github, notebook with NER model
Comprehensive Named Entity Recognition (NER) on CORD-19 with Distant or Weak Supervision. blog post
Weak supervision and relation extraction
Snorkel. A system for programmatically building and managing training data. It was built by a team from Stanford University and is widely used in industry (Google, Facebook, etc.). website
Short review of weak supervision approaches to the relation extraction task: Alisa Smirnova and Philippe Cudré-Mauroux. 2018. Relation Extraction Using Distant Supervision: A Survey. paper
A great example of using weak supervision (Snorkel in particular) for biomedical information extraction (including numerical data): Kuleshov, V., Ding, J., Vo, C. et al. A machine-compiled database of genome-wide association studies, 2019. paper, github
Another example of using weak supervision, from the BenevolentAI team (a well-funded startup building AI systems for drug discovery): Julien Fauqueur et al. Constructing large scale biomedical knowledge bases from scratch with rapid annotation of interpretable patterns. paper
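The core idea behind the weak supervision entries above is to replace hand labeling with many small, noisy heuristics ("labeling functions") whose votes are combined into a training label. The sketch below shows that idea in plain Python. It does not use the Snorkel API: Snorkel combines votes with a learned label model, whereas a simple majority vote stands in here. The three heuristics and the label names are hypothetical.

```python
from collections import Counter

ABSTAIN, NEGATIVE, POSITIVE = -1, 0, 1


def lf_treats_keyword(sentence):
    """Vote POSITIVE when the sentence mentions 'treats'."""
    return POSITIVE if "treats" in sentence.lower() else ABSTAIN


def lf_inhibits_keyword(sentence):
    """Vote POSITIVE when the sentence mentions 'inhibits'."""
    return POSITIVE if "inhibits" in sentence.lower() else ABSTAIN


def lf_negation(sentence):
    """Vote NEGATIVE when the standalone word 'not' appears."""
    return NEGATIVE if " not " in f" {sentence.lower()} " else ABSTAIN


LABELING_FUNCTIONS = [lf_treats_keyword, lf_inhibits_keyword, lf_negation]


def weak_label(sentence):
    """Majority vote over non-abstaining labeling functions; ties abstain."""
    votes = [lf(sentence) for lf in LABELING_FUNCTIONS]
    votes = [vote for vote in votes if vote != ABSTAIN]
    if not votes:
        return ABSTAIN
    counts = Counter(votes)
    top_label, top_count = counts.most_common(1)[0]
    # If another label received the same number of votes, do not guess.
    if list(counts.values()).count(top_count) > 1:
        return ABSTAIN
    return top_label


for text in [
    "Drug A treats and inhibits protein B",
    "Drug A does not help",
    "Drug A treats nothing",
]:
    print(text, "->", weak_label(text))
```

Sentences on which every function abstains are simply left unlabeled, which is the usual trade of coverage for labeling speed.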
Relation extraction
SpERT. BERT-based model with SOTA performance. paper, github
GraphREL. An end-to-end relation extraction model which uses graph convolutional networks (GCNs) to jointly learn named entities and relations. paper, github
SemRep. It might be a good option if you want something that works right out of the box. It is the NLM triple extraction tool built on top of MetaMap. It comes with the usual UMLS license shenanigans and is not necessarily the latest and greatest, but in my experience it works reasonably well. webpage
OpenNRE. An open-source and extensible toolkit that provides a unified framework to implement relation extraction models (including few-shot and document-level models). demosite, github, paper
Relation descriptions, schema standards and graph processing tools
MI2CAST. Minimum Information about a Molecular Interaction CAusal STatement. This checklist defines both the required core information and a comprehensive set of other contextual details valuable to the end user and relevant for reusing and reproducing causal molecular interaction information. paper, github
BEL. The Biological Expression Language captures causal, correlative, and associative relationships between biological entities along with the experimental/biological context in which they were observed as well as the provenance of the publication from which the relation was reported. language tutorial
PyBEL. Python software package that parses BEL documents, validates their semantics, and facilitates data interchange between common formats and database systems like JSON, CSV, Excel, SQL, CX, and Neo4J. github, documentation
PyBEL-tools. Library of functions for analysis of biological networks. github, PyBEL-Notebooks
BEL4corona. Code, notebooks, and resources for exploring and analyzing mechanistic knowledge graphs about corona. github
PyOBO. Tools for biological identifiers, names, synonyms, xrefs, hierarchies, relations, and properties through the perspective of Open Biomedical Ontology (OBO). github, blog post
Gilda grounding service. Grounding of biomedical named entities with contextual disambiguation. Developed by INDRA labs, which is part of the Harvard Program in Therapeutic Science (HiTS). http://grounding.indra.bio, github
Adeft. Utility for building models to disambiguate acronyms and other abbreviations of biological terms in the scientific literature. Developed by INDRA labs. github, paper
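As the BEL entry in this section notes, a statement pairs a relation between two biological entities with its experimental context and its provenance. The sketch below models that triple-plus-metadata shape and splits a simple one-line statement on the relation keyword. It is not PyBEL and does not validate BEL syntax. The `Statement` class and `parse_simple` function are hypothetical, the relation set is only a small subset of BEL's relations, and the example statement, annotation and citation are placeholders rather than curated facts.

```python
from dataclasses import dataclass, field

# A small subset of BEL relation keywords; the full language defines more.
RELATIONS = {
    "increases",
    "decreases",
    "directlyIncreases",
    "directlyDecreases",
    "association",
}


@dataclass
class Statement:
    subject: str                                  # e.g. a protein term
    relation: str                                 # one of RELATIONS
    obj: str                                      # the target term
    context: dict = field(default_factory=dict)   # experimental annotations
    citation: str = ""                            # provenance of the claim


def parse_simple(line, context=None, citation=""):
    """Split 'term relation term' at the first known relation keyword.

    Handles only flat statements; nested statements are out of scope.
    """
    tokens = line.split()
    for i, token in enumerate(tokens):
        if token in RELATIONS and 0 < i < len(tokens) - 1:
            return Statement(
                subject=" ".join(tokens[:i]),
                relation=token,
                obj=" ".join(tokens[i + 1:]),
                context=dict(context or {}),
                citation=citation,
            )
    raise ValueError(f"no known relation found in: {line!r}")


# Placeholder example; the annotation and citation values are made up.
stmt = parse_simple(
    "p(HGNC:AKT1) decreases p(HGNC:GSK3B)",
    context={"Species": "9606"},
    citation="PMID:placeholder",
)
print(stmt)
```

Keeping context and citation on every statement is what lets downstream tools filter a graph by species, tissue or source publication.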
Other scientific document processing
SciWING. A modern framework from WING-NUS to facilitate Scientific Document Processing. It is built on PyTorch and includes many pre-trained models for fundamental tasks in Scientific Document Processing: Logical Structure Recovery, Header Normalisation, Citation String Parsing, Citation Intent Classification, keyphrase extraction and others. site, github, paper
Evaluation
BLUE. The Biomedical Language Understanding Evaluation benchmark consists of five different biomedicine text-mining tasks (including NER & RE) with ten corpora. It relies on preexisting datasets because they have been widely used by the BioNLP community as shared tasks. paper, github
Libraries
INDRA (Integrated Network and Dynamical Reasoning Assembler). An automated model assembly system, funded by DARPA, that draws on natural language processing systems and structured databases to collect mechanistic and causal assertions, represents them in a standardized form (INDRA Statements), and assembles them into various modeling formalisms including causal graphs and dynamical models. website, COVID19 model, github
Heterogeneous Graph Transformer. Graph neural network architecture from Microsoft and the University of California, Los Angeles. HGT can deal with large-scale heterogeneous and dynamic graphs. paper, github
OpenKE. An open toolkit for knowledge embedding (OpenKE), which provides a unified framework and various fundamental models to embed knowledge graphs into a continuous low-dimensional space. paper, github
BioNEV. This work aims to systematically evaluate recent advanced graph embedding techniques on biomedical tasks. The authors compile 5 benchmark datasets for 4 biomedical prediction tasks (see paper for details) and use them to evaluate 11 representative graph embedding methods. paper, github
PyTorch-BigGraph. An embedding system from Facebook that incorporates several modifications to traditional multi-relation embedding systems that allow it to scale to graphs with billions of nodes and trillions of edges. paper, github
BioKEEN. A package for training and evaluating biological knowledge graph embeddings built on PyKEEN. github (parent package - PyKEEN)
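The embedding toolkits listed above map entities and relations to vectors so that plausible triples score higher than implausible ones. The sketch below illustrates one classic scoring rule, the translation-based TransE score, where a triple is plausible when head + relation lands close to tail. It is not the API of OpenKE, PyKEEN or PyTorch-BigGraph. The vectors are hand-picked rather than trained, and the entity and relation names are hypothetical.

```python
import math


def transe_score(head, relation, tail):
    """Negative L2 distance between (head + relation) and tail.

    Higher (closer to zero) means the triple is considered more plausible.
    """
    return -math.sqrt(sum((h + r - t) ** 2 for h, r, t in zip(head, relation, tail)))


# Hand-picked 2-d vectors for illustration; real embeddings are learned.
entities = {
    "drug_x": [0.0, 0.0],
    "disease_y": [1.0, 1.0],
    "disease_z": [3.0, -2.0],
}
relations = {"treats": [1.0, 1.0]}


def rank_tails(head_id, relation_id):
    """Rank every other entity as a candidate tail, best first."""
    head = entities[head_id]
    relation = relations[relation_id]
    candidates = [
        (transe_score(head, relation, vector), entity_id)
        for entity_id, vector in entities.items()
        if entity_id != head_id
    ]
    candidates.sort(key=lambda item: (-item[0], item[1]))
    return [entity_id for _, entity_id in candidates]


print(rank_tails("drug_x", "treats"))
```

Ranking candidate tails this way is the link-prediction setup typically used to evaluate knowledge graph embeddings.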
Graph Neural Networks
Deep Graph Library (DGL). Python package built for easy implementation of the graph neural network model family, on top of existing DL frameworks (e.g. PyTorch, MXNet, Gluon etc.). website, github, docs