This is the repository for the paper Retrieval-Based Diagnostic Decision Support.
- Download PubMed Abstracts.
cd CliniqIR_model
rsync -Pav ftp.ncbi.nlm.nih.gov::pubmed/baseline/\*.xml.gz Pubmed/
- Download MIMIC-III datasets.
- Download DC3 datasets.
- Install QuickUMLS
- jdk-17.0.2
- requirements.txt
Each model requires its input data in a specific format. Sample files are provided in the Datasets and CliniqIR_model folders. For MIMIC-III data pre-processing, see or run Data_Pre_processing_MIMIC_III.py.
The index has four fields: the PMID, the UMLS concepts of an abstract, the abstract title, and the abstract text. Only the latter two are searchable. The Java source files are also provided to allow for custom use.
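To make the field layout concrete, here is a minimal Python sketch of one index entry and of a search restricted to the two searchable fields. It is an illustration only: the actual index is built by the provided Java/Lucene code, and the class name `AbstractRecord`, the function `search`, the field names, and the sample records (including the concept IDs) are hypothetical placeholders, not taken from the repository.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AbstractRecord:
    """Hypothetical in-memory mirror of one index entry (names are illustrative)."""
    pmid: str
    concepts: List[str]  # UMLS concepts of the abstract: stored, not searched
    title: str           # searchable
    text: str            # searchable


# Only these two fields are consulted when matching a query.
SEARCHABLE_FIELDS = ("title", "text")


def search(records, query):
    """Return PMIDs of records whose title or text contains every query term.

    This is plain substring matching, a stand-in for a real Lucene query.
    """
    terms = query.lower().split()
    hits = []
    for record in records:
        haystack = " ".join(getattr(record, name) for name in SEARCHABLE_FIELDS).lower()
        if all(term in haystack for term in terms):
            hits.append(record.pmid)
    return hits


# Made-up sample records for illustration.
records = [
    AbstractRecord("1001", ["C0011849"], "Diabetes mellitus management",
                   "Insulin therapy in adults."),
    AbstractRecord("1002", ["C0020538"], "Hypertension overview",
                   "Blood pressure control and diabetes risk."),
]

print(search(records, "diabetes"))   # matches via title or text
print(search(records, "C0011849"))   # concepts are not searched, so nothing matches
```

The second query returns no hits because the concept field is stored alongside each abstract but is not part of the searchable text.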
- Download PubMed Abstracts. Abstracts should be in the directory "CliniqIR_model/Pubmed".
- Extract UMLS Concepts from PubMed Abstracts.
cd Data_Preprocessing
python Extract_Pubmed_Concepts.py
- Build the index
cd CliniqIR_model
java -jar Build_Pubmed_Index.jar -cp LuceneJARFiles2
- Filter text queries by running QuickUMLS_FIltering.py in the Data_Preprocessing directory.
- Save the filtered queries to the file "CliniqIR_model/Queries.txt".
- Search the index
cd CliniqIR_model
java -jar Search_Pubmed_Index.jar -cp LuceneJARFiles2
- Calculate the PubMed collection frequency of each disease class label by running PubMed_Frequency.py in the Data_Preprocessing directory.
- Get Clinical BERT's ranks by running Clinical_BERT.py, which can be found in the Bert_models directory.
- Obtain CliniqIR's query results by searching the PubMed index.
- Obtain CliniqIR's ranks and get ensemble results by running Evaluate_Mimic-III.py, which can be found in the CliniqIR_model directory.
- Run Clinical_BERT.py to use Clinical BERT.
- Run Zero_shot_baselines.py to use the zero-shot baselines.
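
Two of the evaluation steps above, computing a label's collection frequency and combining two rankings into an ensemble, can be sketched in a few lines of Python. This is a sketch under stated assumptions, not the repository's implementation: collection frequency is taken here in its standard information-retrieval sense (total occurrences of a term across the collection), and reciprocal rank fusion is used only as a common stand-in for rank combination. PubMed_Frequency.py and Evaluate_Mimic-III.py may compute these differently. The function names, the constant `k=60`, and the sample data are all hypothetical.

```python
import re
from collections import Counter, defaultdict


def collection_frequency(labels, abstracts):
    """Count total whole-phrase occurrences of each label across all abstracts.

    Matching is case-insensitive. This is an assumed definition, not
    necessarily what PubMed_Frequency.py does.
    """
    counts = Counter()
    lowered = [abstract.lower() for abstract in abstracts]
    for label in labels:
        pattern = re.compile(r"\b" + re.escape(label.lower()) + r"\b")
        counts[label] = sum(len(pattern.findall(text)) for text in lowered)
    return counts


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked label lists with score(label) = sum of 1 / (k + rank).

    Used here as an assumed stand-in for the ensemble step.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, label in enumerate(ranking, start=1):
            scores[label] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken alphabetically.
    return sorted(scores, key=lambda label: (-scores[label], label))


# Made-up data for illustration.
abstracts = [
    "Sepsis is common. Severe sepsis needs fluids.",
    "Pneumonia may lead to sepsis.",
    "Asthma exacerbation.",
]
labels = ["sepsis", "pneumonia", "asthma"]
frequencies = collection_frequency(labels, abstracts)
print(dict(frequencies))

bert_ranking = ["pneumonia", "sepsis", "asthma"]  # e.g. from a classifier
ir_ranking = ["sepsis", "asthma", "pneumonia"]    # e.g. from index search
fused = reciprocal_rank_fusion([bert_ranking, ir_ranking])
print(fused)
```

In this toy example "sepsis" is ranked first after fusion because it sits near the top of both input rankings, while each of the other labels is ranked last by one of the two systems.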
Some of the structure in this repo was adopted from https://github.com/ziy/medline-indexer
Tassallah Amina Abdullahi
- Luca Soldaini and Nazli Goharian. "QuickUMLS: a fast, unsupervised approach for medical concept extraction." MedIR Workshop, SIGIR 2016.
- Eickhoff, Carsten, et al. "DC3--A Diagnostic Case Challenge Collection for Clinical Decision Support." Proceedings of the 2019 ACM SIGIR International Conference on Theory of Information Retrieval. 2019.