Pheno-LLM

In the rapidly evolving field of biomedical text mining, extracting phenotypic entities from unstructured text remains a pivotal challenge. This project introduces a method that uses Large Language Models (LLMs) to tackle this challenge, focusing on extracting phenotypic entities from freely available texts such as Wikipedia. Unlike traditional Named Entity Recognition (NER) techniques, our approach uses both local and cloud-based LLMs to achieve superior performance.

Approach

Methodology: Our methodology involves a comprehensive comparison with state-of-the-art tools to demonstrate the efficacy of our approach. By leveraging LLMs, we surpass the limitations of conventional methods and enhance the identification of relevant phenotypic entities.

Key Findings: Our study confirms the significant advantages of LLMs in identifying relevant phenotypic entities, thereby improving researchers' and clinicians' ability to understand and respond to disease dynamics more effectively.

Implications: This work underscores the potential of next-generation LLMs to redefine the standards for the extraction of phenotypic entities in biomedical research, paving the way for advancements in disease understanding and treatment.

Keywords

  • Natural Language Processing (NLP)
  • Information Extraction
  • Large Language Models (LLMs)
  • Disease Understanding

The data used for this project can be found here, the evaluation methodology is explained here, and the software requirements are listed here.

Data

The benchmark used to evaluate the models can be found at DOI

Evaluation

LLMs

| Model Name | Size | Purpose | Description |
| --- | --- | --- | --- |
| meta-llama/Llama-2-7b-chat-hf | 7B | Chat | General-purpose conversational model, optimized for engaging and coherent dialogues. |
| BioMistral/BioMistral-7B | 7B | Bio/Healthcare | Specialized in biomedical and healthcare-related text processing and knowledge extraction. |
| mistralai/Mistral-7B-Instruct-v0.2 | 7B | Instructive tasks | Designed for following specific instructions and generating informative content across various domains. |
| OpenAI/GPT-3.5-Turbo | Various | Multipurpose | A more efficient and faster version of GPT-3, designed for real-time applications with improved performance and lower latency. |
| OpenAI/GPT-4 | Various | Multipurpose | The next iteration of generative pre-trained transformers, featuring significant improvements in understanding and generating human-like text, capable of more nuanced comprehension and output. |
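
As a rough illustration of how any of the models above might be asked for phenotypic entities, the sketch below builds a prompt and parses a JSON-list answer. The template wording and the helper names `build_prompt` and `parse_entities` are assumptions made for this example and are not taken from the project's code. The model call itself is left out, so the sketch does not depend on a particular backend.

```python
import json

# Hypothetical prompt template, written for illustration only;
# it is not the prompt used in this project.
PROMPT_TEMPLATE = (
    "Extract every phenotypic entity (signs, symptoms, clinical findings) "
    "mentioned in the text below. Answer with a JSON list of strings only.\n\n"
    "Text: {text}\n"
    "Entities:"
)


def build_prompt(text: str) -> str:
    """Fill the template with the (whitespace-trimmed) input text."""
    return PROMPT_TEMPLATE.format(text=text.strip())


def parse_entities(response: str) -> list:
    """Extract a JSON list of strings from a model response.

    Returns an empty list when no parseable list is found. Entities are
    lowercased, trimmed, and de-duplicated while keeping first-seen order.
    """
    start = response.find("[")
    end = response.rfind("]")
    if start == -1 or end <= start:
        return []
    try:
        items = json.loads(response[start:end + 1])
    except json.JSONDecodeError:
        return []
    if not isinstance(items, list):
        return []

    seen = set()
    entities = []
    for item in items:
        if not isinstance(item, str):
            continue
        entity = item.strip().lower()
        if entity and entity not in seen:
            seen.add(entity)
            entities.append(entity)
    return entities
```

The string returned by `build_prompt` would be sent to a local or cloud-based model, and the raw reply passed to `parse_entities`. Asking for a JSON list keeps the parsing step simple, although real model output may need more tolerant handling.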

An example of a prompt can be found in:

Approach

Baselines

The baselines script can be found in link.

| Tool | Model Name |
| --- | --- |
| Falcon 2.0 | BioFalcon |
| MetaMap | MetaMap |
| scispaCy | en_core_sci_sm |
| scispaCy | en_core_sci_md |
| scispaCy | en_core_sci_lg |
| scispaCy | en_core_sci_scibert |
| GliNER | gliner_base |
| GliNER | gliner_medium |
| GliNER | gliner_large |
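
When comparing the output of an LLM or a baseline against a benchmark, one common choice is entity-level precision, recall, and F1 over the sets of extracted and annotated entities. The sketch below assumes exact matching after lowercasing and trimming. This is an assumption made for illustration, and the project's actual evaluation methodology is the one described in the linked document. The function name `prf1` is hypothetical.

```python
def prf1(predicted, gold):
    """Return (precision, recall, F1) for one document.

    Assumes exact string match after lowercasing and trimming;
    duplicates within either list are counted once.
    """
    pred = {e.strip().lower() for e in predicted}
    ref = {e.strip().lower() for e in gold}

    true_positives = len(pred & ref)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(ref) if ref else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Usage with made-up entities (not taken from the benchmark):
p, r, f = prf1(["Fever", "cough", "rash"],
               ["fever", "cough", "headache", "nausea"])
```

Exact matching is strict: a prediction such as "high fever" would not count against a gold entity "fever", so a partial-match or ontology-based criterion may be preferable in practice.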

Requirements

There are two environments to run the code: one for the baselines (requirements) and another for the LLMs (requirements).

To install the dependencies for an environment, run:


pip install -r requirements.txt

Acknowledgments

  • If you find this work useful, you can cite it as:
@inproceedings{10600830,
  author={García-Barragán, Álvaro and Calatayud, Alberto González and Prieto-Santamaría, Lucía and Robles, Víctor and Menasalvas, Ernestina and Rodríguez, Alejandro},
  booktitle={2024 IEEE 37th International Symposium on Computer-Based Medical Systems (CBMS)}, 
  title={Step-forward structuring disease phenotypic entities with LLMs for disease understanding}, 
  year={2024},
  volume={},
  number={},
  pages={213-218},
  keywords={Text mining;Large language models;Computational modeling;Named entity recognition;Encyclopedias;Internet;Online services;Natural Language Processing (NLP);Information Extraction;Large Language Models (LLMs);Disease Understanding},
  doi={10.1109/CBMS61543.2024.00043}}