Download the code for retrieving figures from PubMed: https://github.com/yfpeng/PMCFigureX
Download the code for subfigure separation: https://github.com/hrlblab/ImageSeperation
- Go to https://pubmed.ncbi.nlm.nih.gov/
- Search for a disease, for example Atelectasis[All Fields]. Note: PubMed automatically expands the query with synonyms of atelectasis, e.g., "pulmonary atelectasis"[MeSH Terms] OR ("pulmonary"[All Fields] AND "atelectasis"[All Fields]) OR "pulmonary atelectasis"[All Fields] OR "atelectasis"[All Fields]
- On the left, click "Free full text"
- Click "Save", choose the "CSV" format, and save the file, for example as /path/to/Atelectasis.export.csv
Convert the exported CSV file to TSV:
$ python figurex_db/convert_pubmed_search_output.py \
    -s /path/to/Atelectasis.export.csv \
    -d /path/to/Atelectasis.export.tsv
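As a rough illustration of what a CSV-to-TSV conversion involves, here is a minimal sketch using only the Python standard library. It is not the code of convert_pubmed_search_output.py, which may filter or rename columns; the function name `csv_to_tsv`, the column names, and the sample row are all made up for illustration.

```python
import csv
import io


def csv_to_tsv(src, dst):
    """Copy every row from a comma-separated stream to a tab-separated stream.

    Returns the number of rows written (including the header row).
    """
    reader = csv.reader(src)
    writer = csv.writer(dst, delimiter="\t", lineterminator="\n")
    count = 0
    for row in reader:
        writer.writerow(row)
        count += 1
    return count


# Hypothetical in-memory sample; a real export would be opened with
# open(path, newline="", encoding="utf-8") instead.
sample_csv = 'PMID,Title,PMCID\n"123","A title, with a comma","PMC0000001"\n'
out = io.StringIO()
rows_written = csv_to_tsv(io.StringIO(sample_csv), out)
tsv_text = out.getvalue()
print(tsv_text)
```

Using the `csv` module rather than a plain string replace keeps commas inside quoted titles intact.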
Change the paths in run_keys_db.sh:
disease='Atelectasis'
source_dir=$HOME'/path/to/PMCFigureX'
venv_dir=$HOME'/path/to/venv'
top_dir=$HOME'/path/to/Atelectasis.export.tsv'
##Run the bash file. The five steps are: create the database, get PMC IDs from PubMed, get BioC files, get figures, and download local figures
bash run_keys_db.sh step1 step2 step3 step4 step5
##Generate the dataset in COCO format for image segmentation (check the paths in the script)
python generate_coordinate.py
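For orientation, the COCO detection format is a single JSON object with `images`, `annotations`, and `categories` lists, where each `bbox` is `[x, y, width, height]` in pixels measured from the top-left corner. The sketch below builds such an object. It is not the code of generate_coordinate.py; the function name `build_coco`, the category name "subfigure", and the sample values are assumptions for illustration.

```python
import json


def build_coco(images, boxes, category_name="subfigure"):
    """Build a minimal COCO-style dictionary.

    images: list of (file_name, width, height); image ids start at 1.
    boxes:  list of (image_id, x, y, w, h) in pixels, top-left origin.
    """
    coco = {
        "images": [],
        "annotations": [],
        "categories": [{"id": 1, "name": category_name}],
    }
    for image_id, (file_name, width, height) in enumerate(images, start=1):
        coco["images"].append(
            {"id": image_id, "file_name": file_name, "width": width, "height": height}
        )
    for ann_id, (image_id, x, y, w, h) in enumerate(boxes, start=1):
        coco["annotations"].append(
            {
                "id": ann_id,
                "image_id": image_id,
                "category_id": 1,
                "bbox": [x, y, w, h],
                "area": w * h,
                "iscrowd": 0,
            }
        )
    return coco


# Hypothetical example: one figure containing two side-by-side subfigures.
coco = build_coco(
    images=[("figure_1.jpg", 800, 400)],
    boxes=[(1, 0, 0, 400, 400), (1, 400, 0, 400, 400)],
)
coco_json = json.dumps(coco, indent=2)
```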
##Get the image sizes (check the paths in the script)
python generate_coordinate.py
##Subfigure separation
python detect.py \
    --weights /prj0129/mil4012/glaucoma/Figure_segmentation/runs/train/exp6/weights/best.pt \
    --source /prj0129/mil4012/glaucoma/Figure_segmentation/Pneumonia/images/test \
    --hide-labels --hide-conf --save-txt --save-conf
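The `--save-txt --save-conf` flags match the YOLOv5 detect.py interface, which writes one text file per image with lines of the form `class x_center y_center width height confidence`, all box values normalized to the range 0 to 1. Assuming that convention applies here, the following sketch converts one such line to pixel coordinates for cropping a subfigure. The function name `yolo_line_to_pixels` and the sample line are hypothetical.

```python
def yolo_line_to_pixels(line, img_w, img_h):
    """Convert one YOLO-style label line to a pixel-space box.

    Returns (class_id, (left, top, right, bottom), confidence).
    confidence is None when the line has no sixth field.
    """
    parts = line.split()
    class_id = int(parts[0])
    x_center, y_center, width, height = (float(v) for v in parts[1:5])
    confidence = float(parts[5]) if len(parts) > 5 else None
    left = (x_center - width / 2) * img_w
    top = (y_center - height / 2) * img_h
    right = (x_center + width / 2) * img_w
    bottom = (y_center + height / 2) * img_h
    return class_id, (left, top, right, bottom), confidence


# Hypothetical detection: a box centered in a 200x100 image covering
# half of each dimension.
class_id, box, confidence = yolo_line_to_pixels("0 0.5 0.5 0.5 0.5 0.9", 200, 100)
print(class_id, box, confidence)
```

The image width and height needed for this conversion are the sizes collected in the earlier image-size step.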
##Get the JSON file
python save_jsonnew.py
##Get local figures/subfigures, classify subfigures, and get text
bash run_keys_db.sh step7 step8 step9
##Get the second classification result for each figure
python classifier_second.py
##Produce a CSV file for RadText, which will be used to verify CXR pathology
python create_csv.py
##CXR pathology verification
Follow Process.doc to generate the results for CXR pathology verification.
The code is in radtext document.
We will maintain and update the RadText software (https://github.com/bionlplab/radtext).
In this study, we created PMC-CXR based on the following three criteria: (1) the caption contains a positive mention of the disease (CXR pathology verification), (2) the figure/subfigure is a chest X-ray (CXR), as identified by two classifiers, and (3) the subfigure has a width-to-height or height-to-width ratio greater than 0.5.
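The three criteria can be sketched as a simple filter. This is an illustration, not the repository's code. It makes two assumptions: criterion (3) is read as "the shorter side divided by the longer side is greater than 0.5", and criterion (2) is read as requiring both classifiers to agree. The function and parameter names are hypothetical.

```python
def passes_aspect_ratio(width, height, threshold=0.5):
    """Return True when shorter side / longer side exceeds the threshold.

    Assumed reading of criterion (3): very elongated crops are rejected.
    """
    if width <= 0 or height <= 0:
        return False
    return min(width, height) / max(width, height) > threshold


def keep_for_pmc_cxr(caption_positive, classifier_a_is_cxr, classifier_b_is_cxr,
                     width, height):
    """Combine the three criteria; all must hold for a subfigure to be kept."""
    return bool(
        caption_positive                      # (1) positive mention in caption
        and classifier_a_is_cxr               # (2) first classifier says CXR
        and classifier_b_is_cxr               # (2) second classifier says CXR
        and passes_aspect_ratio(width, height)  # (3) aspect-ratio check
    )


# Hypothetical subfigures: a near-square CXR and a long, thin strip.
kept = keep_for_pmc_cxr(True, True, True, 512, 480)
dropped = keep_for_pmc_cxr(True, True, True, 1000, 300)
```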
python train_add_sample_total.py