This repository collects publications on multi-modal learning in medical and general imaging.
- Survey
- Visual Question Answering
- Image Report Generation
- Visual Grounding
- Multimodal with LLM
- Medical Report Generation
- Medical Visual Question Answering
- Medical Vision-Language Model
- Federated Learning Application in Medical Multimodal (FLMM)
- [arXiv 2022] Visual Attention Methods in Deep Learning: An In-Depth Survey [pdf]
- [arXiv 2022] Vision+X: A Survey on Multimodal Learning in the Light of Data [pdf]
- [arXiv 2023] Vision Language Models for Vision Tasks: A Survey [pdf] [code]
- [Artif Intell Med 2023] Medical Visual Question Answering: A Survey [pdf]
- [arXiv Nov 3 2023] Large Language Models Illuminate a Progressive Pathway to Artificial Healthcare Assistant: A Review [pdf]
- [arXiv Oct 14 2023] Multimodal Federated Learning in Healthcare: A Review [pdf]
- [Sensors 2023] Multimodal Federated Learning: A Survey [pdf]
- [Sensors 2023] A Comprehensive Overview of Large Language Models [pdf]
- [IEEE Transactions on Pattern Analysis and Machine Intelligence 2023] Counterfactual Samples Synthesizing and Training for Robust Visual Question Answering [link]
- [IEEE/CVF Winter Conference on Applications of Computer Vision 2023] VLC-BERT: Visual Question Answering With Contextualized Commonsense Knowledge [link]
- [Expert Systems with Applications 2023] Image captioning for effective use of language models in knowledge-based visual question answering [link]
- [Bioengineering 2023] Vision–Language Model for Visual Question Answering in Medical Imagery [link]
- [IEEE Transactions on Pattern Analysis and Machine Intelligence 2023] Cross-Modal Causal Relational Reasoning for Event-Level Visual Question Answering [link]
- [CVPR 2023] Generative Bias for Robust Visual Question Answering [link]
- [CVPR 2023] Prompting Large Language Models With Answer Heuristics for Knowledge-Based Visual Question Answering [link]
- [IEEE Transactions on Geoscience and Remote Sensing 2023] A Spatial Hierarchical Reasoning Network for Remote Sensing Visual Question Answering [link]
- [Applied Intelligence 2023] Sparse co-attention visual question answering networks based on thresholds [link]
- [arXiv 2023] VLSP2022-EVJVQA Challenge: Multilingual Visual Question Answering [link]
- [IEEE Transactions on Knowledge and Data Engineering 2023] Event-Oriented Visual Question Answering: The E-VQA Dataset and Benchmark [link]
- [Joint Urban Remote Sensing Event (JURSE) 2023] Multilingual Augmentation for Robust Visual Question Answering in Remote Sensing Images [link]
- [European Conference on Information Retrieval 2023] Multimodal Inverse Cloze Task for Knowledge-Based Visual Question Answering [link]
- [International Journal of Computational Intelligence Systems 2023] Multiscale Feature Extraction and Fusion of Image and Text in VQA [link]
- [IEEE Transactions on Image Processing 2023] Reducing Vision-Answer Biases for Multiple-Choice VQA [link]
- [CVPR 2023] Q: How To Specialize Large Vision-Language Models to Data-Scarce VQA Tasks? A: Self-Train on Unlabeled Images! [link]
### 2023
- [CVPR 2023] Language Adaptive Weight Generation for Multi-Task Visual Grounding [link]
- [CVPR 2023] EDA: Explicit Text-Decoupling and Dense Alignment for 3D Visual Grounding [link]
- [arXiv 2023] BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs [link]
- [CVPR 2023] Improving Visual Grounding by Encouraging Consistent Gradient-Based Explanations [link]
- [IEEE Transactions on Pattern Analysis and Machine Intelligence 2023] TransVG++: End-to-End Visual Grounding with Language Conditioned Vision Transformer [link]
- [arXiv 2023] Parallel Vertex Diffusion for Unified Visual Grounding [link]
- [arXiv 2023] Distilling Coarse-to-Fine Semantic Matching Knowledge for Weakly Supervised 3D Visual Grounding [link]
- [arXiv 2023] ViewRefer: Grasp the Multi-view Knowledge for 3D Visual Grounding with GPT and Prototype Guidance [link]
- [ACM Transactions on Multimedia Computing, Communications and Applications 2023] Transformer-Based Visual Grounding with Cross-Modality Interaction [link]
- [arXiv 2023] TreePrompt: Learning to Compose Tree Prompts for Explainable Visual Grounding [link]
- [2022 IEEE Spoken Language Technology Workshop (SLT) 2023] YFACC: A Yorùbá Speech–Image Dataset for Cross-Lingual Keyword Localisation Through Visual Grounding [link]
- [ACM 2023] Language-guided Residual Graph Attention Network and Data Augmentation for Visual Grounding [link]
- [arXiv 2023] CLIP-VG: Self-paced Curriculum Adapting of CLIP for Visual Grounding [link]
- [Language-led Visual Grounding for Human Computer Interaction 2023] Language-led Visual Grounding for Human Computer Interaction [link]
- [AAAI 2023] MNER-QG: An End-to-End MRC Framework for Multimodal Named Entity Recognition with Query Grounding [link]
- [CVPR 2023] NS3D: Neuro-Symbolic Grounding of 3D Objects and Relations [link]
- [IEEE International Conference on Robotics and Automation (ICRA) 2023] Grounding Language with Visual Affordances over Unstructured Data [link]
- [Proceedings of the AAAI Conference on Artificial Intelligence 2022] Deconfounded Visual Grounding [link]
- [ICCV 2021] TransVG: End-to-End Visual Grounding With Transformers [link]
- [ICCV 2019] A Fast and Accurate One-Stage Approach to Visual Grounding [link]
- [CVPR 2018] Visual Grounding via Accumulated Attention [[link]](https://openaccess.thecvf.com/content_cvpr_2018/html/Deng_Visual_Grounding_via_CVPR_2018_paper.ht)
- [arXiv 2023] LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day [link]
- [open-source 2023] Visual Med-Alpaca: A Parameter-Efficient Biomedical LLM with Visual Capabilities [link]
- [ACL 2018] On the Automatic Generation of Medical Imaging Reports [pdf] [code]
- [NeurIPS 2018] Hybrid Retrieval-Generation Reinforced Agent for Medical Image Report Generation [[pdf]](https://proceedings.neurips.cc/paper/2018/file/e07413354875be01a996dc560274708e-Paper.pdf)
- [EMNLP 2018] Automated Generation of Accurate & Fluent Medical X-ray Reports [pdf] [code]
- [MICCAI 2019] Automatic Radiology Report Generation based on Multi-view Image Fusion and Medical Concept Enrichment [pdf]
- [AAAI 2019] Knowledge-Driven Encode, Retrieve, Paraphrase for Medical Image Report Generation [pdf]
- [ICDM 2019] Automatic Generation of Medical Imaging Diagnostic Report with Hierarchical Recurrent Neural Network [pdf]
- [AAAI 2020] When Radiology Report Generation Meets Knowledge Graph [pdf]
- [EMNLP 2020] Generating Radiology Reports via Memory-driven Transformer [pdf] [code]
- [ACCV 2020] Hierarchical X-Ray Report Generation via Pathology tags and Multi Head Attention [pdf] [code]
- [NeurIPS 2021 Datasets and Benchmarks Track (Round 2)] FFA-IR: Towards an Explainable and Reliable Medical Report Generation Benchmark [pdf] [code]
- [ACL 2021] Competence-based Multimodal Curriculum Learning for Medical Report Generation [pdf]
- [CVPR 2021] Exploring and Distilling Posterior and Prior Knowledge for Radiology Report Generation [pdf]
- [MICCAI 2021] AlignTransformer: Hierarchical Alignment of Visual Regions and Disease Tags for Medical Report Generation [pdf]
- [NAACL-HLT 2021] Improving Factual Completeness and Consistency of Image-to-Text Radiology Report Generation [pdf] [code]
- [MICCAI 2021] RATCHET: Medical Transformer for Chest X-ray Diagnosis and Reporting [pdf] [code]
- [MICCAI 2021] Trust It or Not: Confidence-Guided Automatic Radiology Report Generation [pdf]
- [MICCAI 2021] Surgical Instruction Generation with Transformers [pdf]
- [MICCAI 2021] Class-Incremental Domain Adaptation with Smoothing and Calibration for Surgical Report Generation [pdf] [code]
- [ACL-IJCNLP 2021] Cross-modal Memory Networks for Radiology Report Generation [pdf] [code]
- [CVPR 2022] Cross-modal Clinical Graph Transformer for Ophthalmic Report Generation [pdf]
- [Nature Machine Intelligence 2022] Generalized Radiograph Representation Learning via Cross-supervision between Images and Free-text Radiology Reports [pdf] [code]
- [MICCAI 2022] A Self-Guided Framework for Radiology Report Generation [pdf]
- [MICCAI 2022] A Medical Semantic-Assisted Transformer for Radiographic Report Generation [pdf]
- [MIDL 2022] Representative Image Feature Extraction via Contrastive Learning Pretraining for Chest X-ray Report Generation [pdf]
- [MICCAI 2022] RepsNet: Combining Vision with Language for Automated Medical Reports [pdf] [code]
- [PMLR 2022] Improving Radiology Report Generation Systems by Removing Hallucinated References to Non-existent Priors [pdf]
- [TNNLS 2022] Hybrid Reinforced Medical Report Generation with M-Linear Attention and Repetition Penalty [pdf]
- [MedIA 2022] CAMANet: Class Activation Map Guided Attention Network for Radiology Report Generation [pdf]
- [MedIA 2022] Knowledge matters: Chest radiology report generation with general and specific knowledge [pdf] [code]
- [MICCAI 2022] Lesion Guided Explainable Few Weak-shot Medical Report Generation [pdf] [code]
- [arXiv 2022] Self adaptive global-local feature enhancement for radiology report generation [pdf]
- [BMVC 2022] On the Importance of Image Encoding in Automated Chest X-Ray Report Generation [pdf] [code]
- [arXiv 2022] RoentGen: Vision-Language Foundation Model for Chest X-ray Generation [pdf]
- [COLING 2022] DeltaNet: Conditional Medical Report Generation for COVID-19 Diagnosis [pdf] [code]
- [ECCV 2022] Cross-modal Prototype Driven Network for Radiology Report Generation [pdf] [code]
- [TMI 2023] Attributed Abnormality Graph Embedding for Clinically Accurate X-Ray Report Generation [pdf]
- [arXiv 2023] Unified Chest X-ray and Radiology Report Generation Model with Multi-view Chest X-rays [pdf] [code]
- [WWW 2023] Auxiliary signal-guided knowledge encoder-decoder for medical report generation [pdf]
- [CVPR 2023] Dynamic Graph Enhanced Contrastive Learning for Chest X-ray Report Generation [pdf] [code]
- [CVPR 2023] KiUT: Knowledge-Injected U-Transformer for Radiology Report Generation [pdf]
- [CVPR 2023] Interactive and Explainable Region-guided Radiology Report Generation [pdf] [code]
- [MIDL 2023] Multimodal Image-Text Matching Improves Retrieval-based Chest X-Ray Report Generation [pdf] [code]
- [arXiv 2023] Visual-Linguistic Causal Intervention for Radiology Report Generation [pdf] [code]
- [MIDL 2023] Vision-Language Modelling For Radiological Imaging and Reports In The Low Data Regime [pdf]
- [ICASSP 2023] MvCo-DoT: Multi-View Contrastive Domain Transfer Network for Medical Report Generation [pdf]
- [CHIL 2023] Token Imbalance Adaptation for Radiology Report Generation [pdf] [code]
- [arXiv 2023] Boosting Radiology Report Generation by Infusing Comparison Prior [pdf]
- [AAAI 2023] "Nothing Abnormal": Disambiguating Medical Reports via Contrastive Knowledge Infusion [pdf] [code]
- [arXiv 2023] Automatic Radiology Report Generation by Learning with Increasingly Hard Negatives [pdf]
- [arXiv 2023] S4M: Generating Radiology Reports by A Single Model for Multiple Body Parts [pdf] [code]
- [arXiv 2023] XrayGPT: Chest Radiographs Summarization using Medical Vision-Language Models [pdf] [code]
- [ACL W 2023] shs-nlp at RadSum23: Domain-Adaptive Pre-training of Instruction-tuned LLMs for Radiology Report Impression Generation [pdf]
- [arXiv 2023] Customizing General-Purpose Foundation Models for Medical Report Generation [pdf]
- [arXiv 2023] Utilizing Longitudinal Chest X-Rays and Reports to Pre-Fill Radiology Reports [pdf]
- [ACL 2023] Replace and Report: NLP Assisted Radiology Report Generation [pdf]
- [ICCV 2023] PRIOR: Prototype Representation Joint Learning from Medical Images and Reports [pdf] [code]
- [ICML W 2023] Rethinking Medical Report Generation: Disease Revealing Enhancement with Knowledge Graph [pdf] [code]
- [MICCAI 2023] Rad-ReStruct: A Novel VQA Benchmark and Method for Structured Radiology Reporting [pdf] [code]
- [arXiv 2023] IIHT: Medical Report Generation with Image-to-Indicator Hierarchical Transformer [pdf]
- [arXiv 2023] Can Prompt Learning Benefit Radiology Report Generation? [pdf]
- [arXiv 2023] Finding-Aware Anatomical Tokens for Chest X-Ray Automated Reporting [pdf]
- [arXiv 2023] PromptMRG: Diagnosis-Driven Prompts for Medical Report Generation [pdf]
- [arXiv 2023] R2GenGPT: Radiology Report Generation with Frozen LLMs [pdf] [code]
- [TMI 2020] A Question-Centric Model for Visual Question Answering in Medical Imaging [pdf] [code]
- [CLEF 2020 Working Notes] HCP-MIC at VQA-Med 2020: Effective visual representation for medical visual question answering [pdf] [code]
- [CLEF 2021 Working Notes] TeamS at VQA-Med 2021: BBN-Orchestra for long-tailed medical visual question answering [pdf] [code]
- [arXiv 2021] MuVAM: A Multi-View Attention-based Model for Medical Visual Question Answering [pdf]
- [Nature Scientific Reports 2021] MedFuseNet: An attention-based multimodal deep learning model for visual question answering in the medical domain [pdf]
- [MICCAI 2022] Consistency-preserving Visual Question Answering in Medical Imaging [pdf] [code]
- [MICCAI 2022] Surgical-VQA: Visual Question Answering in Surgical Scenes using Transformer [pdf] [code]
- [ECCV 2022] Distilled Dual-Encoder Model for Vision-Language Understanding [pdf] [code]
- [arXiv 2022] A Dual-Attention Learning Network with Word and Sentence Embedding for Medical Visual Question Answering [pdf] [code]
- [arXiv 2022] MF2-MVQA: A Multi-stage Feature Fusion method for Medical Visual Question Answering [pdf]
- [arXiv 2022] UnICLAM: Contrastive Representation Learning with Adversarial Masking for Unified and Interpretable Medical Vision Question Answering [pdf]
- [ISBI 2023] Self-supervised vision-language pretraining for Medical visual question answering [pdf] [code]
- [Findings of the Association for Computational Linguistics: EACL 2023] PubMedCLIP: How Much Does CLIP Benefit Visual Question Answering in the Medical Domain? [link]
- [Information Processing & Management 2023] Medical knowledge-based network for Patient-oriented Visual Question Answering [link]
- [Expert Systems with Applications 2023] Question-guided feature pyramid network for medical visual question answering [link]
- [arXiv 2023] Interpretable Medical Image Visual Question Answering via Multi-Modal Relationship Graph Learning [pdf]
- [arXiv 2023] Medical visual question answering using joint self-supervised learning [pdf]
- [ACM MM 2023] RAMM: Retrieval-augmented Biomedical Visual Question Answering with Multi-modal Pre-training [pdf] [code]
- [IPMI 2023] Q2ATransformer: Improving Medical VQA via an Answer Querying Decoder [pdf]
- [arXiv 2023] Open-Ended Medical Visual Question Answering Through Prefix Tuning of Language Models [pdf]
- [arXiv 2023] PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering [pdf]
- [MICCAI 2023] Masked Vision and Language Pre-training with Unimodal and Multimodal Contrastive Losses for Medical Visual Question Answering [pdf] [code]
- [MICCAI 2023] Localized Questions in Medical Visual Question Answering [pdf] [code]
- [arXiv 2023] Multimodal Prompt Retrieval for Generative Visual Question Answering [pdf] [code]
- [KDD 2023] Expert Knowledge-Aware Image Difference Graph Representation Learning for Difference-Aware Medical Visual Question Answering [pdf] [code]
- [MICCAI 2023] Revisiting Distillation for Continual Learning on Visual Question Localized-Answering in Robotic Surgery [pdf] [code]
- [MICCAI 2023] CAT-ViL: Co-Attention Gated Vision-Language Embedding for Visual Question Localized-Answering in Robotic Surgery [pdf] [code]
- [CLEF 2023] UIT-Saviors at MEDVQA-GI 2023: Improving Multimodal Learning with Image Enhancement for Gastrointestinal Visual Question Answering [pdf]
- [EMNLP 2022] MedCLIP: Contrastive Learning from Unpaired Medical Images and Text [pdf] [code]
- [NeurIPS W 2022] Adapting Pretrained Vision-Language Foundational Models to Medical Imaging Domains [pdf]
- [ACL 2022] ViLMedic: a framework for research at the intersection of vision and language in medical AI [pdf] [code]
- [MICCAI 2022] Multi-modal Masked Autoencoders for Medical Vision-and-Language Pre-training [pdf] [code]
- [JBHI 2022] Multi-Modal Understanding and Generation for Medical Images and Text via Vision-Language Pre-Training [pdf] [code]
- [AAAI 2022] Clinical-BERT: Vision-Language Pre-training for Radiograph Diagnosis and Reports Generation [pdf]
- [JBHI 2022] Vision-language transformer for interpretable pathology visual question answering [link]
- [arXiv 2022] RoentGen: Vision-Language Foundation Model for Chest X-ray Generation [pdf]
- [ECCV 2022] Making the most of text semantics to improve biomedical vision–language processing [pdf]
- [MICCAI 2022] BERTHop: An Effective Vision-and-Language Model for Chest X-ray Disease Diagnosis [pdf]
- [MICCAI 2022] RepsNet: Combining Vision with Language for Automated Medical Reports [pdf] [code]
- [NeurIPS 2022] Multi-Granularity Cross-modal Alignment for Generalized Medical Visual Representation Learning [pdf] [code]
- [ICLR 2023] Medical Image Understanding with Pretrained Vision Language Models: A Comprehensive Study [pdf] [code]
- [ICCV 2023] CLIP-Driven Universal Model for Organ Segmentation and Tumor Detection [pdf] [code]
- [arXiv 2023] Towards General Purpose Medical AI: Continual Learning Medical Foundation Model [pdf]
- [TMI 2023] LViT: Language meets Vision Transformer in Medical Image Segmentation [pdf] [code]
- [arXiv 2023] Towards Unifying Medical Vision-and-Language Pre-training via Soft Prompts [pdf] [code]
- [arXiv 2023] Large-Scale Domain-Specific Pretraining for Biomedical Vision-Language Processing [pdf] [code]
- [ICLR 2023] Advancing Radiograph Representation Learning with Masked Record Modeling [pdf] [code]
- [arXiv 2023] ConTEXTual Net: A Multimodal Vision-Language Model for Segmentation of Pneumothorax [pdf]
- [MICCAI 2023] PMC-CLIP: Contrastive Language-Image Pre-training using Biomedical Documents [pdf]
- [arXiv 2023] Open-Ended Medical Visual Question Answering Through Prefix Tuning of Language Models [pdf]
- [arXiv 2023] ChatCAD: Interactive Computer-Aided Diagnosis on Medical Image using Large Language Models [pdf] [code]
- [ICCV 2023] MedKLIP: Medical Knowledge Enhanced Language-Image Pre-Training [pdf] [project]
- [CVPR 2023] Learning to Exploit Temporal Structure for Biomedical Vision-Language Processing [pdf]
- [CVPR W 2023] One-shot and Partially-Supervised Cell Image Segmentation Using Small Visual Prompt [pdf]
- [arXiv 2023] CLIP-Lung: Textual Knowledge-Guided Lung Nodule Malignancy Prediction [pdf]
- [MICCAI 2023] UniSeg: A Prompt-driven Universal Segmentation Model as well as A Strong Representation Learner [pdf] [code]
- [ICCV 2023] UniverSeg: Universal Medical Image Segmentation [pdf] [project website]
- [arXiv 2023] Bi-VLGM: Bi-Level Class-Severity-Aware Vision-Language Graph Matching for Text Guided Medical Image Segmentation [pdf]
- [arXiv 2023] Prompt-based Tuning of Transformer Models for Multi-Center Medical Image Segmentation [pdf]
- [arXiv 2023] FoPro-KD: Fourier Prompted Effective Knowledge Distillation for Long-Tailed Medical Image Recognition [pdf]
- [arXiv 2023] ChatCAD+: Towards a Universal and Reliable Interactive CAD using LLMs [pdf] [code]
- [arXiv 2023] XrayGPT: Chest Radiographs Summarization using Medical Vision-Language Models [pdf] [code]
- [CHIL 2023] Multi-modal Pre-training for Medical Vision-language Understanding and Generation: An Empirical Study with A New Benchmark [pdf] [code]
- [arXiv 2023] Med-UniC: Unifying Cross-Lingual Medical Vision-Language Pre-Training by Diminishing Bias [pdf]
- [arXiv 2023] OphGLM: Training an Ophthalmology Large Language-and-Vision Assistant based on Instructions and Dialogue [pdf] [code]
- [ICML W 2023] A ChatGPT Aided Explainable Framework for Zero-Shot Medical Image Diagnosis [pdf]
- [MICCAI 2023] M-FLAG: Medical Vision-Language Pre-training with Frozen Language Models and Latent Space Geometry Optimization [pdf] [code]
- [MICCAI 2023] Knowledge Boosting: Rethinking Medical Contrastive Vision-Language Pre-Training [pdf] [code]
- [MICCAI 2023] Unified Medical Image-Text-Label Contrastive Learning With Continuous Prompt [pdf]
- [arXiv 2023] Few-shot medical image classification with simple shape and texture text descriptors using vision-language models [pdf] [code]
- [arXiv 2023] Med-Flamingo: a Multimodal Medical Few-shot Learner [pdf] [code]
- [MICCAI 2023] Ariadne's Thread: Using Text Prompts to Improve Segmentation of Infected Areas from Chest X-ray images [pdf] [code]
- [arXiv 2023] A Foundation LAnguage-Image model of the Retina (FLAIR): Encoding expert knowledge in text supervision [pdf] [code]
- [arXiv 2023] Exploring Transfer Learning in Medical Image Segmentation using Vision-Language Models [pdf]
- [ACM MM 2023] FedVQA: Personalized Federated Visual Question Answering over Heterogeneous Scenes [pdf]
- [ACM UbiComp/ISWC 2023] Inclusive Data Representation in Federated Learning: A Novel Approach Integrating Textual and Visual Prompt [pdf]
- [ECCV 2022] FedVLN: Privacy-Preserving Federated Vision-and-Language Navigation [pdf]
- [AAAI 2020] Federated Learning for Vision-and-Language Grounding Problems [pdf]
- [arXiv 2023] Prot2Text: Multimodal Protein’s Function Generation with GNNs and Transformers [pdf]