LM-ontology-concept-placement

This is the official repository for "A Language Model based Framework for New Concept Placement in Ontologies", accepted at ESWC 2024. Our presentation slides include updated results on GPT-4 and Llama-2-13B (slides 16-17).

The study provides a language model based framework (covering both pre-trained and large language models) for new concept placement in ontologies. The input is a mention in a text corpus together with an ontology, and the output is the set of predicted edges in the ontology where the mention should be placed.

The method combines LMs with ontology structure, and includes three steps:

edge search (and concept search),
edge formation and enrichment, and
edge selection.
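The three steps above can be sketched on a toy ontology as follows. This is only an illustration under loud assumptions, not the implementation in this repository: edges are represented as hypothetical (parent, child) label pairs, a simple token-overlap score stands in for the LM-based search and selection models, and "enrichment" is assumed here to mean forming extra candidate edges one hop up the hierarchy. All function names are made up for this sketch.

```python
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str]  # (parent label, child label); hypothetical representation


def tokens(text: str) -> Set[str]:
    """Lowercased whitespace tokens of a label or mention."""
    return set(text.lower().split())


def overlap(mention: str, edge: Edge) -> int:
    """Toy relevance score: tokens shared by the mention and the edge labels."""
    parent, child = edge
    return len(tokens(mention) & (tokens(parent) | tokens(child)))


def edge_search(mention: str, edges: List[Edge], k: int) -> List[Edge]:
    """Step 1: retrieve the k edges that best match the mention."""
    return sorted(edges, key=lambda e: overlap(mention, e), reverse=True)[:k]


def edge_enrichment(candidates: List[Edge], edges: List[Edge]) -> List[Edge]:
    """Step 2 (assumed form): add edges from each candidate's grandparent to its child."""
    parents_of: Dict[str, List[str]] = {}
    for parent, child in edges:
        parents_of.setdefault(child, []).append(parent)
    enriched = list(candidates)
    for parent, child in candidates:
        for grandparent in parents_of.get(parent, []):
            new_edge = (grandparent, child)
            if new_edge not in enriched:
                enriched.append(new_edge)
    return enriched


def edge_selection(mention: str, candidates: List[Edge], top_n: int) -> List[Edge]:
    """Step 3: re-rank the enriched candidates and keep the top_n edges."""
    return sorted(candidates, key=lambda e: overlap(mention, e), reverse=True)[:top_n]


# Toy ontology and mention, invented for this example.
toy_edges = [
    ("disease", "heart disease"),
    ("heart disease", "ischemic heart disease"),
    ("disease", "lung disease"),
]
mention = "chronic heart disease"

found = edge_search(mention, toy_edges, k=2)
enriched = edge_enrichment(found, toy_edges)
selected = edge_selection(mention, enriched, top_n=1)
print(selected)
```

In the actual framework each step is driven by language models (for example the Edge-Bi-encoder and Edge-Cross-encoder mentioned below), so the scoring function above should be read as a placeholder only.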

This repository provides the implementation of the methods above, the scripts to run them, and the dataset creation scripts, to support reproducibility of the research.

A minimal running example on Google Colab: LM-ontology-concept-placement.ipynb.

Requirements

The repository is based on Python 3.8.

See requirements.txt for running Edge-Bi-encoder, Edge-Cross-encoder, Inverted Index, and Fixed Embedding.

See requirements-LLM.txt for running instruction tuning of LLMs.

Note: GitHub has raised some Dependabot alerts related to the older library versions pinned in requirements.txt and requirements-LLM.txt (Transformers, PyTorch, NLTK, Flair, and tqdm), but we have limited bandwidth to resolve them for this research-based project. Please be aware of this when using the project.

Example of installing the packages using conda (optional):

conda create -n onto38 -y python=3.8
conda activate onto38
pip install -r requirements.txt
conda deactivate

conda create -n ontollm38 -y python=3.8
conda activate ontollm38
pip install -r requirements-LLM.txt
conda deactivate

Model Training and Inference

See Edge-Bi-enc+prompt-generation.sh for the steps of running Edge-Bi-encoder, edge enrichment, and prompt generation, with running examples in Edge-Bi-enc+prompt-gen-run-example.sh.

See Edge-Bi-enc+Cross-enc.sh for the steps of running Edge-Bi-encoder, edge enrichment, and Edge-Cross-encoder, with running examples in Edge-Bi-enc+Cross-enc-run-example.sh.

See run_tune_LLAMA_2_from_data_creation.sh for a running example of data generation, instruction tuning, and prompting of Llama-2.

See blink/prompting/run_search_snomed_disease-5to10.sh and similar files for examples of running the Inverted Index and Fixed Embedding based approaches.

See other files in blink/prompting for the prompting of GPT-3.5-turbo, FLAN-T5, and Llama-2.

For all Edge-Bi-enc and Edge-Cross-enc scripts above:

Set train_bi (train Bi-encoder), rep_ents (pre-calculate edge embeddings), eval_biencoder (inference with Bi-encoder and get data for Cross-encoder), train_cross (train Cross-encoder), and inference (whole inference) to true to perform the corresponding step, or otherwise to skip it.
Set eval_set to a comma-separated list of data splits, e.g. train,valid,valid-NIL,test-NIL, so that the eval_biencoder step generates data for each split.
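To make the meaning of these settings concrete, here is a small sketch of how the step flags and eval_set could be interpreted. The flag names come from the scripts described above, but the helper functions and the dictionary form are hypothetical; how the shell scripts actually read these variables is not shown here, so treat this purely as an illustration of the semantics.

```python
from typing import Dict, List

# Step flags in the order described above.
STEP_ORDER = ["train_bi", "rep_ents", "eval_biencoder", "train_cross", "inference"]


def enabled_steps(flags: Dict[str, str]) -> List[str]:
    """Return the steps whose flag is set to "true" (assumed convention)."""
    return [s for s in STEP_ORDER if flags.get(s, "false").lower() == "true"]


def parse_eval_set(eval_set: str) -> List[str]:
    """Split a comma-separated eval_set value into individual data splits."""
    return [part.strip() for part in eval_set.split(",") if part.strip()]


# Example settings: only embed edges and run Bi-encoder inference.
flags = {
    "train_bi": "false",
    "rep_ents": "true",
    "eval_biencoder": "true",
    "train_cross": "false",
    "inference": "false",
}
steps = enabled_steps(flags)
splits = parse_eval_set("train,valid,valid-NIL,test-NIL")
print(steps, splits)
```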

For Edge-Bi-enc:

Set use_cand_analysis (evaluate Bi-encoder results and generate initial instructions and prompts for LLMs) to true to perform this step.

Datasets

Our work uses the datasets available at Zenodo; their JSON keys are described in the dataset folder.

Data and processing sources

Before data creation, the sources below need to be downloaded.

SNOMED CT https://www.nlm.nih.gov/healthit/snomedct/archive.html (and use snomed-owl-toolkit to form .owl files)
UMLS https://www.nlm.nih.gov/research/umls/licensedcontent/umlsarchives04.html (and mainly use MRCONSO for mapping UMLS to SNOMED CT)
MedMentions https://github.com/chanzuckerberg/MedMentions (source of entity linking)
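As a rough sketch of how MRCONSO can be used to map UMLS concepts to SNOMED CT, the snippet below reads pipe-delimited MRCONSO rows and collects, for each UMLS CUI, the codes whose source abbreviation (SAB) is SNOMED CT. The column positions follow the standard MRCONSO.RRF layout (CUI at index 0, SAB at index 11, CODE at index 13). The function name, the default SAB value "SNOMEDCT_US" (which depends on the edition used), and the sample rows with made-up identifiers are assumptions for illustration; the data creation scripts in this repository may do this differently.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set

# Column indices in the standard MRCONSO.RRF layout.
CUI_COL, SAB_COL, CODE_COL = 0, 11, 13


def cui_to_snomed(lines: Iterable[str], sab: str = "SNOMEDCT_US") -> Dict[str, Set[str]]:
    """Map each UMLS CUI to the set of SNOMED CT codes listed for it."""
    mapping: Dict[str, Set[str]] = defaultdict(set)
    for line in lines:
        fields = line.rstrip("\n").split("|")
        if len(fields) <= CODE_COL:
            continue  # skip malformed or empty rows
        if fields[SAB_COL] == sab:
            mapping[fields[CUI_COL]].add(fields[CODE_COL])
    return dict(mapping)


# Two made-up rows in MRCONSO format: one from SNOMED CT, one from another source.
sample_rows = [
    "C0000001|ENG|P|L0000001|PF|S0000001|Y|A0000001||1001||SNOMEDCT_US|PT|1001|Example disorder|9|N||",
    "C0000001|ENG|S|L0000002|PF|S0000002|Y|A0000002||||MSH|MH|D000001|Example disorder|0|N||",
]
mapping = cui_to_snomed(sample_rows)
print(mapping)
```

With a real UMLS release, the same function could be applied to an open file handle, e.g. `cui_to_snomed(open("MRCONSO.RRF", encoding="utf-8"))`, where the file path is wherever the licensed download was extracted.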

The tools and libraries below are used.

Protege http://protegeproject.github.io/protege/
snomed-owl-toolkit https://github.com/IHTSDO/snomed-owl-toolkit
DeepOnto https://github.com/KRR-Oxford/DeepOnto (based on OWLAPI https://owlapi.sourceforge.net/) for ontology processing and complex concept verbalisation

Data creation scripts

Based on the OET repository, the data creation scripts are available in the data-construction folder, where run_preprocess_ents_and_data+new.sh is the overall shell script that calls the other .py files.

Acknowledgement

Our dataset is based on OET and the datasets at Zenodo.
The baseline implementations are based on the BLINKout paper and the BLINK repository, under the MIT license.
We also acknowledge all data and processing sources listed above.
