# Ultimate-Awesome-Transformer-Attention

This repo contains a comprehensive paper list of Vision Transformer & Attention, including papers, code, and related websites.
This list is maintained by Min-Hung Chen. (Actively kept up to date)

If you find any missing papers, feel free to create pull requests, open issues, or email me.
Contributions in any form that make this list more comprehensive are welcome.

If you find this repository useful, please consider citing and ★STARing this list.
Feel free to share this list with others!

[Update: June, 2023] Added all the related papers from ICML 2023!
[Update: June, 2023] Added all the related papers from CVPR 2023!
[Update: February, 2023] Added all the related papers from ICLR 2023!
[Update: December, 2022] Added attention-free papers from Networks Beyond Attention (GitHub) made by Jianwei Yang
[Update: November, 2022] Added all the related papers from NeurIPS 2022!
[Update: October, 2022] Split the 2nd half of the paper list to README_2.md
[Update: October, 2022] Added all the related papers from ECCV 2022!
[Update: September, 2022] Added the Transformer tutorial slides made by Lucas Beyer!
[Update: June, 2022] Added all the related papers from CVPR 2022!


## Overview

------ (The following papers are moved to README_2.md) ------


## Citation

If you find this repository useful, please consider citing this list:

@misc{chen2022transformerpaperlist,
    title = {Ultimate awesome paper list: transformer and attention},
    author = {Chen, Min-Hung},
    journal = {GitHub repository},
    url = {https://github.com/cmhungsteve/Awesome-Transformer-Attention},
    year = {2022},
}

## Survey

  • "Vision + Language Applications: A Survey", CVPRW, 2023 (Ritsumeikan University, Japan). [Paper][GitHub]
  • "Multimodal Learning With Transformers: A Survey", TPAMI, 2023 (Tsinghua & Oxford). [Paper]
  • "A Survey of Visual Transformers", TNNLS, 2023 (CAS). [Paper]
  • "RenAIssance: A Survey into AI Text-to-Image Generation in the Era of Large Model", arXiv, 2023 (University of Sydney). [Paper]
  • "A survey on efficient vision transformers: algorithms, techniques, and performance benchmarking", arXiv, 2023 (The University of Sydney). [Paper]
  • "From CNN to Transformer: A Review of Medical Image Segmentation Models", arXiv, 2023 (UESTC). [Paper]
  • "Foundational Models Defining a New Era in Vision: A Survey and Outlook", arXiv, 2023 (MBZUAI). [Paper][GitHub]
  • "A Systematic Survey of Prompt Engineering on Vision-Language Foundation Models", arXiv, 2023 (Oxford). [Paper]
  • "Robust Visual Question Answering: Datasets, Methods, and Future Challenges", arXiv, 2023 (Xi'an Jiaotong University). [Paper]
  • "A Survey on Open-Vocabulary Detection and Segmentation: Past, Present, and Future", arXiv, 2023 (HKUST). [Paper]
  • "Transformers in Reinforcement Learning: A Survey", arXiv, 2023 (Mila). [Paper]
  • "Vision Language Transformers: A Survey", arXiv, 2023 (Boise State University, Idaho). [Paper]
  • "Towards Open Vocabulary Learning: A Survey", arXiv, 2023 (Peking). [Paper][GitHub]
  • "Large Multimodal Models: Notes on CVPR 2023 Tutorial", arXiv, 2023 (Microsoft). [Paper]
  • "A Survey on Multimodal Large Language Models", arXiv, 2023 (USTC). [Paper][GitHub]
  • "2D Object Detection with Transformers: A Review", arXiv, 2023 (German Research Center for Artificial Intelligence, Germany). [Paper]
  • "Visual Question Answering: A Survey on Techniques and Common Trends in Recent Literature", arXiv, 2023 (Eldorado’s Institute of Technology, Brazil). [Paper]
  • "Vision-Language Models in Remote Sensing: Current Progress and Future Trends", arXiv, 2023 (NYU). [Paper]
  • "Visual Tuning", arXiv, 2023 (The Hong Kong Polytechnic University). [Paper]
  • "Self-supervised Learning for Pre-Training 3D Point Clouds: A Survey", arXiv, 2023 (Fudan University). [Paper]
  • "Semantic Segmentation using Vision Transformers: A survey", arXiv, 2023 (University of Peradeniya, Sri Lanka). [Paper]
  • "A Review of Deep Learning for Video Captioning", arXiv, 2023 (Deakin University, Australia). [Paper]
  • "Transformer-Based Visual Segmentation: A Survey", arXiv, 2023 (NTU, Singapore). [Paper][GitHub]
  • "Vision-Language Models for Vision Tasks: A Survey", arXiv, 2023 (?). [Paper][GitHub (in construction)]
  • "Text-to-image Diffusion Model in Generative AI: A Survey", arXiv, 2023 (KAIST). [Paper]
  • "Foundation Models for Decision Making: Problems, Methods, and Opportunities", arXiv, 2023 (Berkeley + Google). [Paper]
  • "Advances in Medical Image Analysis with Vision Transformers: A Comprehensive Review", arXiv, 2023 (RWTH Aachen University, Germany). [Paper][GitHub]
  • "Efficiency 360: Efficient Vision Transformers", arXiv, 2023 (IBM). [Paper][GitHub]
  • "Transformer-based Generative Adversarial Networks in Computer Vision: A Comprehensive Survey", arXiv, 2023 (Indian Institute of Information Technology). [Paper]
  • "Large-scale Multi-Modal Pre-trained Models: A Comprehensive Survey", arXiv, 2023 (Pengcheng Laboratory). [Paper][GitHub]
  • "A Survey on Visual Transformer", TPAMI, 2022 (Huawei). [Paper]
  • "A Comprehensive Study of Vision Transformers on Dense Prediction Tasks", VISAP, 2022 (NavInfo Europe, Netherlands). [Paper]
  • "Vision-and-Language Pretrained Models: A Survey", IJCAI, 2022 (The University of Sydney). [Paper]
  • "Vision Transformers in Medical Imaging: A Review", arXiv, 2022 (Covenant University, Nigeria). [Paper]
  • "A Comprehensive Survey of Transformers for Computer Vision", arXiv, 2022 (Sejong University). [Paper]
  • "Vision-Language Pre-training: Basics, Recent Advances, and Future Trends", arXiv, 2022 (Microsoft). [Paper]
  • "Vision+X: A Survey on Multimodal Learning in the Light of Data", arXiv, 2022 (Illinois Institute of Technology, Chicago). [Paper]
  • "Vision Transformers for Action Recognition: A Survey", arXiv, 2022 (Charles Sturt University, Australia). [Paper]
  • "VLP: A Survey on Vision-Language Pre-training", arXiv, 2022 (CAS). [Paper]
  • "Transformers in Remote Sensing: A Survey", arXiv, 2022 (MBZUAI). [Paper][Github]
  • "Medical image analysis based on transformer: A Review", arXiv, 2022 (NUS, Singapore). [Paper]
  • "3D Vision with Transformers: A Survey", arXiv, 2022 (MBZUAI). [Paper][GitHub]
  • "Vision Transformers: State of the Art and Research Challenges", arXiv, 2022 (NYCU). [Paper]
  • "Transformers in Medical Imaging: A Survey", arXiv, 2022 (MBZUAI). [Paper][GitHub]
  • "Multimodal Learning with Transformers: A Survey", arXiv, 2022 (Oxford). [Paper]
  • "Transforming medical imaging with Transformers? A comparative review of key properties, current progresses, and future perspectives", arXiv, 2022 (CAS). [Paper]
  • "Transformers in 3D Point Clouds: A Survey", arXiv, 2022 (University of Waterloo). [Paper]
  • "A survey on attention mechanisms for medical applications: are we moving towards better algorithms?", arXiv, 2022 (INESC TEC and University of Porto, Portugal). [Paper]
  • "Efficient Transformers: A Survey", arXiv, 2022 (Google). [Paper]
  • "Are we ready for a new paradigm shift? A Survey on Visual Deep MLP", arXiv, 2022 (Tsinghua). [Paper]
  • "Vision Transformers in Medical Computer Vision - A Contemplative Retrospection", arXiv, 2022 (National University of Sciences and Technology (NUST), Pakistan). [Paper]
  • "Video Transformers: A Survey", arXiv, 2022 (Universitat de Barcelona, Spain). [Paper]
  • "Transformers in Medical Image Analysis: A Review", arXiv, 2022 (Nanjing University). [Paper]
  • "Recent Advances in Vision Transformer: A Survey and Outlook of Recent Work", arXiv, 2022 (?). [Paper]
  • "Transformers Meet Visual Learning Understanding: A Comprehensive Review", arXiv, 2022 (Xidian University). [Paper]
  • "Image Captioning In the Transformer Age", arXiv, 2022 (Alibaba). [Paper][GitHub]
  • "Visual Attention Methods in Deep Learning: An In-Depth Survey", arXiv, 2022 (Fayoum University, Egypt). [Paper]
  • "Transformers in Vision: A Survey", ACM Computing Surveys, 2021 (MBZUAI). [Paper]
  • "Survey: Transformer based Video-Language Pre-training", arXiv, 2021 (Renmin University of China). [Paper]
  • "A Survey of Transformers", arXiv, 2021 (Fudan). [Paper]
  • "Attention mechanisms and deep learning for machine vision: A survey of the state of the art", arXiv, 2021 (University of Kashmir, India). [Paper]

[Back to Overview]

## Image Classification / Backbone

### Replace Conv w/ Attention

#### Pure Attention

#### Conv-stem + Attention

  • GSA-Net: "Global Self-Attention Networks for Image Recognition", arXiv, 2020 (Google). [Paper][PyTorch (lucidrains)]
  • HaloNet: "Scaling Local Self-Attention For Parameter Efficient Visual Backbones", CVPR, 2021 (Google). [Paper][PyTorch (lucidrains)]
  • CoTNet: "Contextual Transformer Networks for Visual Recognition", CVPRW, 2021 (JD). [Paper][PyTorch]
  • HAT-Net: "Vision Transformers with Hierarchical Attention", arXiv, 2022 (ETHZ). [Paper][PyTorch (in construction)]
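
To make the conv-stem recipe above concrete, here is a minimal PyTorch sketch (all module names and hyper-parameters are illustrative assumptions, not code from any listed paper): a stack of strided 3x3 convolutions tokenizes the image, and a plain Transformer encoder processes the resulting tokens.

```python
# Minimal conv-stem + attention sketch (illustrative only; see the linked repos for real implementations).
import torch
import torch.nn as nn

class ConvStemAttention(nn.Module):
    """A small convolutional stem that tokenizes the image,
    followed by a standard Transformer encoder over the tokens."""
    def __init__(self, dim=192, depth=4, heads=3, num_classes=1000):
        super().__init__()
        # Conv stem: four stride-2 3x3 convs map a 224x224 input to a 14x14 token grid.
        self.stem = nn.Sequential(
            nn.Conv2d(3, dim // 2, 3, stride=2, padding=1), nn.GELU(),
            nn.Conv2d(dim // 2, dim, 3, stride=2, padding=1), nn.GELU(),
            nn.Conv2d(dim, dim, 3, stride=2, padding=1), nn.GELU(),
            nn.Conv2d(dim, dim, 3, stride=2, padding=1),
        )
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):                                 # x: (B, 3, 224, 224)
        tokens = self.stem(x).flatten(2).transpose(1, 2)  # (B, 196, dim)
        tokens = self.encoder(tokens)
        return self.head(tokens.mean(dim=1))              # average-pooled classification head

print(ConvStemAttention()(torch.randn(1, 3, 224, 224)).shape)  # torch.Size([1, 1000])
```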

#### Conv + Attention

[Back to Overview]

### Vision Transformer

#### General Vision Transformer

  • ViT: "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale", ICLR, 2021 (Google). [Paper][Tensorflow][PyTorch (lucidrains)][JAX (conceptofmind)] (a minimal sketch follows this list)
  • Perceiver: "Perceiver: General Perception with Iterative Attention", ICML, 2021 (DeepMind). [Paper][PyTorch (lucidrains)]
  • PiT: "Rethinking Spatial Dimensions of Vision Transformers", ICCV, 2021 (NAVER). [Paper][PyTorch]
  • VT: "Visual Transformers: Where Do Transformers Really Belong in Vision Models?", ICCV, 2021 (Facebook). [Paper][PyTorch (tahmid0007)]
  • PVT: "Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions", ICCV, 2021 (Nanjing University). [Paper][PyTorch]
  • iRPE: "Rethinking and Improving Relative Position Encoding for Vision Transformer", ICCV, 2021 (Microsoft). [Paper][PyTorch]
  • CaiT: "Going deeper with Image Transformers", ICCV, 2021 (Facebook). [Paper][PyTorch]
  • Swin-Transformer: "Swin Transformer: Hierarchical Vision Transformer using Shifted Windows", ICCV, 2021 (Microsoft). [Paper][PyTorch][PyTorch (berniwal)]
  • T2T-ViT: "Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet", ICCV, 2021 (Yitu). [Paper][PyTorch]
  • FFNBN: "Leveraging Batch Normalization for Vision Transformers", ICCVW, 2021 (Microsoft). [Paper]
  • DPT: "DPT: Deformable Patch-based Transformer for Visual Recognition", ACMMM, 2021 (CAS). [Paper][PyTorch]
  • Focal: "Focal Attention for Long-Range Interactions in Vision Transformers", NeurIPS, 2021 (Microsoft). [Paper][PyTorch]
  • XCiT: "XCiT: Cross-Covariance Image Transformers", NeurIPS, 2021 (Facebook). [Paper]
  • Twins: "Twins: Revisiting Spatial Attention Design in Vision Transformers", NeurIPS, 2021 (Meituan). [Paper][PyTorch]
  • ARM: "Blending Anti-Aliasing into Vision Transformer", NeurIPS, 2021 (Amazon). [Paper][GitHub (in construction)]
  • DVT: "Not All Images are Worth 16x16 Words: Dynamic Vision Transformers with Adaptive Sequence Length", NeurIPS, 2021 (Tsinghua). [Paper][PyTorch]
  • Aug-S: "Augmented Shortcuts for Vision Transformers", NeurIPS, 2021 (Huawei). [Paper]
  • TNT: "Transformer in Transformer", NeurIPS, 2021 (Huawei). [Paper][PyTorch][PyTorch (lucidrains)]
  • ViTAE: "ViTAE: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias", NeurIPS, 2021 (The University of Sydney). [Paper][PyTorch]
  • DeepViT: "DeepViT: Towards Deeper Vision Transformer", arXiv, 2021 (NUS + ByteDance). [Paper][Code]
  • So-ViT: "So-ViT: Mind Visual Tokens for Vision Transformer", arXiv, 2021 (Dalian University of Technology). [Paper][PyTorch]
  • LV-ViT: "All Tokens Matter: Token Labeling for Training Better Vision Transformers", NeurIPS, 2021 (ByteDance). [Paper][PyTorch]
  • NesT: "Aggregating Nested Transformers", arXiv, 2021 (Google). [Paper][Tensorflow]
  • KVT: "KVT: k-NN Attention for Boosting Vision Transformers", arXiv, 2021 (Alibaba). [Paper]
  • Refined-ViT: "Refiner: Refining Self-attention for Vision Transformers", arXiv, 2021 (NUS, Singapore). [Paper][PyTorch]
  • Shuffle-Transformer: "Shuffle Transformer: Rethinking Spatial Shuffle for Vision Transformer", arXiv, 2021 (Tencent). [Paper]
  • CAT: "CAT: Cross Attention in Vision Transformer", arXiv, 2021 (KuaiShou). [Paper][PyTorch]
  • V-MoE: "Scaling Vision with Sparse Mixture of Experts", arXiv, 2021 (Google). [Paper]
  • P2T: "P2T: Pyramid Pooling Transformer for Scene Understanding", arXiv, 2021 (Nankai University). [Paper]
  • PVTv2: "PVTv2: Improved Baselines with Pyramid Vision Transformer", arXiv, 2021 (Nanjing University). [Paper][PyTorch]
  • LG-Transformer: "Local-to-Global Self-Attention in Vision Transformers", arXiv, 2021 (IIAI, UAE). [Paper]
  • ViP: "Visual Parser: Representing Part-whole Hierarchies with Transformers", arXiv, 2021 (Oxford). [Paper]
  • Scaled-ReLU: "Scaled ReLU Matters for Training Vision Transformers", AAAI, 2022 (Alibaba). [Paper]
  • LIT: "Less is More: Pay Less Attention in Vision Transformers", AAAI, 2022 (Monash University). [Paper][PyTorch]
  • DTN: "Dynamic Token Normalization Improves Vision Transformer", ICLR, 2022 (Tencent). [Paper][PyTorch (in construction)]
  • RegionViT: "RegionViT: Regional-to-Local Attention for Vision Transformers", ICLR, 2022 (MIT-IBM Watson). [Paper][PyTorch]
  • CrossFormer: "CrossFormer: A Versatile Vision Transformer Based on Cross-scale Attention", ICLR, 2022 (Zhejiang University). [Paper][PyTorch]
  • ?: "Scaling the Depth of Vision Transformers via the Fourier Domain Analysis", ICLR, 2022 (UT Austin). [Paper]
  • ViT-G: "Scaling Vision Transformers", CVPR, 2022 (Google). [Paper]
  • CSWin: "CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows", CVPR, 2022 (Microsoft). [Paper][PyTorch]
  • MPViT: "MPViT: Multi-Path Vision Transformer for Dense Prediction", CVPR, 2022 (KAIST). [Paper][PyTorch]
  • Diverse-ViT: "The Principle of Diversity: Training Stronger Vision Transformers Calls for Reducing All Levels of Redundancy", CVPR, 2022 (UT Austin). [Paper][PyTorch]
  • DW-ViT: "Beyond Fixation: Dynamic Window Visual Transformer", CVPR, 2022 (Dark Matter AI, China). [Paper][PyTorch (in construction)]
  • MixFormer: "MixFormer: Mixing Features across Windows and Dimensions", CVPR, 2022 (Baidu). [Paper][Paddle]
  • DAT: "Vision Transformer with Deformable Attention", CVPR, 2022 (Tsinghua). [Paper][PyTorch]
  • Swin-Transformer-V2: "Swin Transformer V2: Scaling Up Capacity and Resolution", CVPR, 2022 (Microsoft). [Paper][PyTorch]
  • MSG-Transformer: "MSG-Transformer: Exchanging Local Spatial Information by Manipulating Messenger Tokens", CVPR, 2022 (Huazhong University of Science & Technology). [Paper][PyTorch]
  • NomMer: "NomMer: Nominate Synergistic Context in Vision Transformer for Visual Recognition", CVPR, 2022 (Tencent). [Paper][PyTorch]
  • Shunted: "Shunted Self-Attention via Multi-Scale Token Aggregation", CVPR, 2022 (NUS). [Paper][PyTorch]
  • PyramidTNT: "PyramidTNT: Improved Transformer-in-Transformer Baselines with Pyramid Architecture", CVPRW, 2022 (Huawei). [Paper][PyTorch]
  • X-ViT: "X-ViT: High Performance Linear Vision Transformer without Softmax", CVPRW, 2022 (Kakao). [Paper]
  • ReMixer: "ReMixer: Object-aware Mixing Layer for Vision Transformers", CVPRW, 2022 (KAIST). [Paper][PyTorch]
  • UN: "Unified Normalization for Accelerating and Stabilizing Transformers", ACMMM, 2022 (Hikvision). [Paper][Code (in construction)]
  • Wave-ViT: "Wave-ViT: Unifying Wavelet and Transformers for Visual Representation Learning", ECCV, 2022 (JD). [Paper][PyTorch]
  • DaViT: "DaViT: Dual Attention Vision Transformers", ECCV, 2022 (Microsoft). [Paper][PyTorch]
  • ScalableViT: "ScalableViT: Rethinking the Context-oriented Generalization of Vision Transformer", ECCV, 2022 (ByteDance). [Paper]
  • MaxViT: "MaxViT: Multi-Axis Vision Transformer", ECCV, 2022 (Google). [Paper][Tensorflow]
  • VSA: "VSA: Learning Varied-Size Window Attention in Vision Transformers", ECCV, 2022 (The University of Sydney). [Paper][PyTorch]
  • ?: "Expediting Large-Scale Vision Transformer for Dense Prediction without Fine-tuning", NeurIPS, 2022 (Microsoft). [Paper]
  • Ortho: "Orthogonal Transformer: An Efficient Vision Transformer Backbone with Token Orthogonalization", NeurIPS, 2022 (CAS). [Paper]
  • PerViT: "Peripheral Vision Transformer", NeurIPS, 2022 (POSTECH). [Paper]
  • LITv2: "Fast Vision Transformers with HiLo Attention", NeurIPS, 2022 (Monash University). [Paper][PyTorch]
  • BViT: "BViT: Broad Attention based Vision Transformer", arXiv, 2022 (CAS). [Paper]
  • O-ViT: "O-ViT: Orthogonal Vision Transformer", arXiv, 2022 (East China Normal University). [Paper]
  • MOA-Transformer: "Aggregating Global Features into Local Vision Transformer", arXiv, 2022 (University of Kansas). [Paper][PyTorch]
  • BOAT: "BOAT: Bilateral Local Attention Vision Transformer", arXiv, 2022 (Baidu + HKU). [Paper]
  • ViTAEv2: "ViTAEv2: Vision Transformer Advanced by Exploring Inductive Bias for Image Recognition and Beyond", arXiv, 2022 (The University of Sydney). [Paper]
  • HiP: "Hierarchical Perceiver", arXiv, 2022 (DeepMind). [Paper]
  • PatchMerger: "Learning to Merge Tokens in Vision Transformers", arXiv, 2022 (Google). [Paper]
  • DGT: "Dynamic Group Transformer: A General Vision Transformer Backbone with Dynamic Group Attention", arXiv, 2022 (Baidu). [Paper]
  • NAT: "Neighborhood Attention Transformer", arXiv, 2022 (Oregon). [Paper][PyTorch]
  • ASF-former: "Adaptive Split-Fusion Transformer", arXiv, 2022 (Fudan). [Paper][PyTorch (in construction)]
  • SP-ViT: "SP-ViT: Learning 2D Spatial Priors for Vision Transformers", arXiv, 2022 (Alibaba). [Paper]
  • EATFormer: "EATFormer: Improving Vision Transformer Inspired by Evolutionary Algorithm", arXiv, 2022 (Zhejiang University). [Paper]
  • LinGlo: "Rethinking Query-Key Pairwise Interactions in Vision Transformers", arXiv, 2022 (TCL Research Wuhan). [Paper]
  • Dual-ViT: "Dual Vision Transformer", arXiv, 2022 (JD). [Paper][PyTorch]
  • MMA: "Multi-manifold Attention for Vision Transformers", arXiv, 2022 (Centre for Research and Technology Hellas, Greece). [Paper]
  • MAFormer: "MAFormer: A Transformer Network with Multi-scale Attention Fusion for Visual Recognition", arXiv, 2022 (Baidu). [Paper]
  • AEWin: "Axially Expanded Windows for Local-Global Interaction in Vision Transformers", arXiv, 2022 (Southwest Jiaotong University). [Paper]
  • GrafT: "Grafting Vision Transformers", arXiv, 2022 (Stony Brook). [Paper]
  • ?: "Rethinking Hierarchicies in Pre-trained Plain Vision Transformer", arXiv, 2022 (The University of Sydney). [Paper]
  • LTH-ViT: "The Lottery Ticket Hypothesis for Vision Transformers", arXiv, 2022 (Northeastern University, China). [Paper]
  • TT: "Token Transformer: Can class token help window-based transformer build better long-range interactions?", arXiv, 2022 (Hangzhou Dianzi University). [Paper]
  • CabViT: "CabViT: Cross Attention among Blocks for Vision Transformer", arXiv, 2022 (Intellifusion, China). [Paper][PyTorch (in construction)]
  • INTERN: "INTERN: A New Learning Paradigm Towards General Vision", arXiv, 2022 (Shanghai AI Lab). [Paper][Website]
  • GGeM: "Group Generalized Mean Pooling for Vision Transformer", arXiv, 2022 (NAVER). [Paper]
  • GPViT: "GPViT: A High Resolution Non-Hierarchical Vision Transformer with Group Propagation", ICLR, 2023 (University of Edinburgh, Scotland + UCSD). [Paper][PyTorch]
  • CPVT: "Conditional Positional Encodings for Vision Transformers", ICLR, 2023 (Meituan). [Paper][Code (in construction)]
  • LipsFormer: "LipsFormer: Introducing Lipschitz Continuity to Vision Transformers", ICLR, 2023 (IDEA, China). [Paper][Code (in construction)]
  • BiFormer: "BiFormer: Vision Transformer with Bi-Level Routing Attention", CVPR, 2023 (CUHK). [Paper][PyTorch]
  • AbSViT: "Top-Down Visual Attention from Analysis by Synthesis", CVPR, 2023 (Berkeley). [Paper][PyTorch][Website]
  • DependencyViT: "Visual Dependency Transformers: Dependency Tree Emerges From Reversed Attention", CVPR, 2023 (MIT). [Paper][Code (in construction)]
  • ResFormer: "ResFormer: Scaling ViTs with Multi-Resolution Training", CVPR, 2023 (Fudan). [Paper][PyTorch (in construction)]
  • SViT: "Vision Transformer with Super Token Sampling", CVPR, 2023 (CAS). [Paper]
  • PaCa-ViT: "PaCa-ViT: Learning Patch-to-Cluster Attention in Vision Transformers", CVPR, 2023 (NC State). [Paper][PyTorch]
  • GC-ViT: "Global Context Vision Transformers", ICML, 2023 (NVIDIA). [Paper][PyTorch]
  • MAGNETO: "MAGNETO: A Foundation Transformer", ICML, 2023 (Microsoft). [Paper]
  • SMT: "Scale-Aware Modulation Meet Transformer", ICCV, 2023 (Alibaba). [Paper][PyTorch]
  • FLatten-Transformer: "FLatten Transformer: Vision Transformer using Focused Linear Attention", ICCV, 2023 (Tsinghua). [Paper][PyTorch]
  • Path-Ensemble: "Revisiting Vision Transformer from the View of Path Ensemble", ICCV, 2023 (Alibaba). [Paper]
  • SG-Former: "SG-Former: Self-guided Transformer with Evolving Token Reallocation", ICCV, 2023 (NUS). [Paper]
  • CrossFormer++: "CrossFormer++: A Versatile Vision Transformer Hinging on Cross-scale Attention", arXiv, 2023 (Zhejiang University). [Paper][PyTorch]
  • QFormer: "Vision Transformer with Quadrangle Attention", arXiv, 2023 (The University of Sydney). [Paper][Code (in construction)]
  • ViT-Calibrator: "ViT-Calibrator: Decision Stream Calibration for Vision Transformer", arXiv, 2023 (Zhejiang University). [Paper]
  • SpectFormer: "SpectFormer: Frequency and Attention is what you need in a Vision Transformer", arXiv, 2023 (Microsoft). [Paper][PyTorch][Website]
  • UniNeXt: "UniNeXt: Exploring A Unified Architecture for Vision Recognition", arXiv, 2023 (Alibaba). [Paper]
  • CageViT: "CageViT: Convolutional Activation Guided Efficient Vision Transformer", arXiv, 2023 (Southern University of Science and Technology). [Paper]
  • ?: "Making Vision Transformers Truly Shift-Equivariant", arXiv, 2023 (UIUC). [Paper]
  • 2-D-SSM: "2-D SSM: A General Spatial Layer for Visual Transformers", arXiv, 2023 (Tel Aviv). [Paper][PyTorch]
  • NaViT: "Patch n' Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution", arXiv, 2023 (DeepMind). [Paper]
  • DAT++: "DAT++: Spatially Dynamic Vision Transformer with Deformable Attention", arXiv, 2023 (Tsinghua). [Paper][PyTorch]
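
For orientation, the recipe shared by most entries above traces back to ViT: split the image into fixed-size patches, linearly embed them, prepend a learnable [CLS] token, add positional embeddings, and run a standard Transformer encoder. A minimal sketch assuming illustrative sizes follows; this is not the official code, and the linked ViT repositories are the reference implementations.

```python
# Minimal ViT-style classifier (a sketch of the 16x16-patch recipe, not the official code).
import torch
import torch.nn as nn

class MiniViT(nn.Module):
    def __init__(self, image_size=224, patch=16, dim=384, depth=6, heads=6, num_classes=1000):
        super().__init__()
        n = (image_size // patch) ** 2                       # 196 patches for 224/16
        self.patch_embed = nn.Conv2d(3, dim, patch, stride=patch)  # linear patch projection
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))      # learnable [CLS] token
        self.pos = nn.Parameter(torch.zeros(1, n + 1, dim))  # learned positional embedding
        layer = nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):                                    # x: (B, 3, 224, 224)
        t = self.patch_embed(x).flatten(2).transpose(1, 2)   # (B, 196, dim)
        t = torch.cat([self.cls.expand(len(t), -1, -1), t], dim=1) + self.pos
        t = self.encoder(t)
        return self.head(t[:, 0])                            # classify from the [CLS] token

print(MiniViT()(torch.randn(2, 3, 224, 224)).shape)          # torch.Size([2, 1000])
```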

#### Efficient Vision Transformer

  • DeiT: "Training data-efficient image transformers & distillation through attention", ICML, 2021 (Facebook). [Paper][PyTorch]
  • ConViT: "ConViT: Improving Vision Transformers with Soft Convolutional Inductive Biases", ICML, 2021 (Facebook). [Paper][Code]
  • ?: "Improving the Efficiency of Transformers for Resource-Constrained Devices", DSD, 2021 (NavInfo Europe, Netherlands). [Paper]
  • PS-ViT: "Vision Transformer with Progressive Sampling", ICCV, 2021 (CPII). [Paper]
  • HVT: "Scalable Visual Transformers with Hierarchical Pooling", ICCV, 2021 (Monash University). [Paper][PyTorch]
  • CrossViT: "CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification", ICCV, 2021 (MIT-IBM). [Paper][PyTorch]
  • ViL: "Multi-Scale Vision Longformer: A New Vision Transformer for High-Resolution Image Encoding", ICCV, 2021 (Microsoft). [Paper][PyTorch]
  • Visformer: "Visformer: The Vision-friendly Transformer", ICCV, 2021 (Beihang University). [Paper][PyTorch]
  • MultiExitViT: "Multi-Exit Vision Transformer for Dynamic Inference", BMVC, 2021 (Aarhus University, Denmark). [Paper][Tensorflow]
  • SViTE: "Chasing Sparsity in Vision Transformers: An End-to-End Exploration", NeurIPS, 2021 (UT Austin). [Paper][PyTorch]
  • DGE: "Dynamic Grained Encoder for Vision Transformers", NeurIPS, 2021 (Megvii). [Paper][PyTorch]
  • GG-Transformer: "Glance-and-Gaze Vision Transformer", NeurIPS, 2021 (JHU). [Paper][Code (in construction)]
  • DynamicViT: "DynamicViT: Efficient Vision Transformers with Dynamic Token Sparsification", NeurIPS, 2021 (Tsinghua). [Paper][PyTorch][Website]
  • ResT: "ResT: An Efficient Transformer for Visual Recognition", NeurIPS, 2021 (Nanjing University). [Paper][PyTorch]
  • Adder-Transformer: "Adder Attention for Vision Transformer", NeurIPS, 2021 (Huawei). [Paper]
  • SOFT: "SOFT: Softmax-free Transformer with Linear Complexity", NeurIPS, 2021 (Fudan). [Paper][PyTorch][Website]
  • IA-RED2: "IA-RED2: Interpretability-Aware Redundancy Reduction for Vision Transformers", NeurIPS, 2021 (MIT-IBM). [Paper][Website]
  • LocalViT: "LocalViT: Bringing Locality to Vision Transformers", arXiv, 2021 (ETHZ). [Paper][PyTorch]
  • CCT: "Escaping the Big Data Paradigm with Compact Transformers", arXiv, 2021 (University of Oregon). [Paper][PyTorch]
  • DiversePatch: "Vision Transformers with Patch Diversification", arXiv, 2021 (UT Austin + Facebook). [Paper][PyTorch]
  • SL-ViT: "Single-Layer Vision Transformers for More Accurate Early Exits with Less Overhead", arXiv, 2021 (Aarhus University). [Paper]
  • ?: "Multi-Exit Vision Transformer for Dynamic Inference", arXiv, 2021 (Aarhus University, Denmark). [Paper]
  • ViX: "Vision Xformers: Efficient Attention for Image Classification", arXiv, 2021 (Indian Institute of Technology Bombay). [Paper]
  • Transformer-LS: "Long-Short Transformer: Efficient Transformers for Language and Vision", NeurIPS, 2021 (NVIDIA). [Paper][PyTorch]
  • WideNet: "Go Wider Instead of Deeper", arXiv, 2021 (NUS). [Paper]
  • Armour: "Armour: Generalizable Compact Self-Attention for Vision Transformers", arXiv, 2021 (Arm). [Paper]
  • IPE: "Exploring and Improving Mobile Level Vision Transformers", arXiv, 2021 (CUHK). [Paper]
  • DS-Net++: "DS-Net++: Dynamic Weight Slicing for Efficient Inference in CNNs and Transformers", arXiv, 2021 (Monash University). [Paper][PyTorch]
  • UFO-ViT: "UFO-ViT: High Performance Linear Vision Transformer without Softmax", arXiv, 2021 (Kakao). [Paper]
  • Evo-ViT: "Evo-ViT: Slow-Fast Token Evolution for Dynamic Vision Transformer", AAAI, 2022 (Tencent). [Paper][PyTorch]
  • PS-Attention: "Pale Transformer: A General Vision Transformer Backbone with Pale-Shaped Attention", AAAI, 2022 (Baidu). [Paper][Paddle]
  • ShiftViT: "When Shift Operation Meets Vision Transformer: An Extremely Simple Alternative to Attention Mechanism", AAAI, 2022 (Microsoft). [Paper][PyTorch]
  • EViT: "Not All Patches are What You Need: Expediting Vision Transformers via Token Reorganizations", ICLR, 2022 (Tencent). [Paper][PyTorch] (a pruning sketch follows this list)
  • QuadTree: "QuadTree Attention for Vision Transformers", ICLR, 2022 (Simon Fraser + Alibaba). [Paper][PyTorch]
  • Anti-Oversmoothing: "Anti-Oversmoothing in Deep Vision Transformers via the Fourier Domain Analysis: From Theory to Practice", ICLR, 2022 (UT Austin). [Paper][PyTorch]
  • QnA: "Learned Queries for Efficient Local Attention", CVPR, 2022 (Tel-Aviv). [Paper][JAX]
  • LVT: "Lite Vision Transformer with Enhanced Self-Attention", CVPR, 2022 (Adobe). [Paper][PyTorch]
  • A-ViT: "A-ViT: Adaptive Tokens for Efficient Vision Transformer", CVPR, 2022 (NVIDIA). [Paper][Website]
  • PS-ViT: "Patch Slimming for Efficient Vision Transformers", CVPR, 2022 (Huawei). [Paper]
  • Rev-MViT: "Reversible Vision Transformers", CVPR, 2022 (Meta). [Paper][PyTorch-1][PyTorch-2]
  • AdaViT: "AdaViT: Adaptive Vision Transformers for Efficient Image Recognition", CVPR, 2022 (Fudan). [Paper]
  • DQS: "Dynamic Query Selection for Fast Visual Perceiver", CVPRW, 2022 (Sorbonne Université, France). [Paper]
  • ATS: "Adaptive Token Sampling For Efficient Vision Transformers", ECCV, 2022 (Microsoft). [Paper][Website]
  • EdgeViT: "EdgeViTs: Competing Light-weight CNNs on Mobile Devices with Vision Transformers", ECCV, 2022 (Samsung). [Paper][PyTorch]
  • SReT: "Sliced Recursive Transformer", ECCV, 2022 (CMU + MBZUAI). [Paper][PyTorch]
  • SiT: "Self-slimmed Vision Transformer", ECCV, 2022 (SenseTime). [Paper][PyTorch]
  • DFvT: "Doubly-Fused ViT: Fuse Information from Vision Transformer Doubly with Local Representation", ECCV, 2022 (Alibaba). [Paper]
  • M3ViT: "M3ViT: Mixture-of-Experts Vision Transformer for Efficient Multi-task Learning with Model-Accelerator Co-design", NeurIPS, 2022 (UT Austin). [Paper][PyTorch]
  • ResT-V2: "ResT V2: Simpler, Faster and Stronger", NeurIPS, 2022 (Nanjing University). [Paper][PyTorch]
  • DeiT-Manifold: "Learning Efficient Vision Transformers via Fine-Grained Manifold Distillation", NeurIPS, 2022 (Huawei). [Paper]
  • EfficientFormer: "EfficientFormer: Vision Transformers at MobileNet Speed", NeurIPS, 2022 (Snap). [Paper][PyTorch]
  • GhostNetV2: "GhostNetV2: Enhance Cheap Operation with Long-Range Attention", NeurIPS, 2022 (Huawei). [Paper][PyTorch]
  • ?: "Training a Vision Transformer from scratch in less than 24 hours with 1 GPU", NeurIPSW, 2022 (Borealis AI, Canada). [Paper]
  • TerViT: "TerViT: An Efficient Ternary Vision Transformer", arXiv, 2022 (Beihang University). [Paper]
  • MT-ViT: "Multi-Tailed Vision Transformer for Efficient Inference", arXiv, 2022 (Wuhan University). [Paper]
  • ViT-P: "ViT-P: Rethinking Data-efficient Vision Transformers from Locality", arXiv, 2022 (Chongqing University of Technology). [Paper]
  • CF-ViT: "Coarse-to-Fine Vision Transformer", arXiv, 2022 (Xiamen University + Tencent). [Paper][PyTorch]
  • EIT: "EIT: Efficiently Lead Inductive Biases to ViT", arXiv, 2022 (Academy of Military Sciences, China). [Paper]
  • SepViT: "SepViT: Separable Vision Transformer", arXiv, 2022 (University of Electronic Science and Technology of China). [Paper]
  • TRT-ViT: "TRT-ViT: TensorRT-oriented Vision Transformer", arXiv, 2022 (ByteDance). [Paper]
  • SuperViT: "Super Vision Transformer", arXiv, 2022 (Xiamen University). [Paper][PyTorch]
  • Tutel: "Tutel: Adaptive Mixture-of-Experts at Scale", arXiv, 2022 (Microsoft). [Paper][PyTorch]
  • SimA: "SimA: Simple Softmax-free Attention for Vision Transformers", arXiv, 2022 (Maryland + UC Davis). [Paper][PyTorch]
  • EdgeNeXt: "EdgeNeXt: Efficiently Amalgamated CNN-Transformer Architecture for Mobile Vision Applications", arXiv, 2022 (MBZUAI). [Paper][PyTorch]
  • VVT: "Vicinity Vision Transformer", arXiv, 2022 (Australian National University). [Paper][Code (in construction)]
  • SOFT: "Softmax-free Linear Transformers", arXiv, 2022 (Fudan). [Paper][PyTorch]
  • MaiT: "MaiT: Leverage Attention Masks for More Efficient Image Transformers", arXiv, 2022 (Samsung). [Paper]
  • LightViT: "LightViT: Towards Light-Weight Convolution-Free Vision Transformers", arXiv, 2022 (SenseTime). [Paper][Code (in construction)]
  • Next-ViT: "Next-ViT: Next Generation Vision Transformer for Efficient Deployment in Realistic Industrial Scenarios", arXiv, 2022 (ByteDance). [Paper]
  • XFormer: "Lightweight Vision Transformer with Cross Feature Attention", arXiv, 2022 (Samsung). [Paper]
  • PatchDropout: "PatchDropout: Economizing Vision Transformers Using Patch Dropout", arXiv, 2022 (KTH, Sweden). [Paper]
  • ClusTR: "ClusTR: Exploring Efficient Self-attention via Clustering for Vision Transformers", arXiv, 2022 (The University of Adelaide, Australia). [Paper]
  • DiNAT: "Dilated Neighborhood Attention Transformer", arXiv, 2022 (University of Oregon). [Paper][PyTorch]
  • MobileViTv3: "MobileViTv3: Mobile-Friendly Vision Transformer with Simple and Effective Fusion of Local, Global and Input Features", arXiv, 2022 (Micron). [Paper][PyTorch]
  • ViT-LSLA: "ViT-LSLA: Vision Transformer with Light Self-Limited-Attention", arXiv, 2022 (Southwest University). [Paper]
  • Token-Pooling: "Token Pooling in Vision Transformers for Image Classification", WACV, 2023 (Apple). [Paper]
  • Tri-Level: "Peeling the Onion: Hierarchical Reduction of Data Redundancy for Efficient Vision Transformer Training", AAAI, 2023 (Northeastern University). [Paper][Code (in construction)]
  • ViTCoD: "ViTCoD: Vision Transformer Acceleration via Dedicated Algorithm and Accelerator Co-Design", HPCA, 2023 (Georgia Tech). [Paper]
  • ViTALiTy: "ViTALiTy: Unifying Low-rank and Sparse Approximation for Vision Transformer Acceleration with a Linear Taylor Attention", HPCA, 2023 (Rice University). [Paper]
  • HeatViT: "HeatViT: Hardware-Efficient Adaptive Token Pruning for Vision Transformers", HPCA, 2023 (Northeastern University). [Paper]
  • ToMe: "Token Merging: Your ViT But Faster", ICLR, 2023 (Meta). [Paper][PyTorch]
  • HiViT: "HiViT: A Simpler and More Efficient Design of Hierarchical Vision Transformer", ICLR, 2023 (CAS). [Paper][PyTorch]
  • STViT: "Making Vision Transformers Efficient from A Token Sparsification View", CVPR, 2023 (Alibaba). [Paper][PyTorch]
  • SparseViT: "SparseViT: Revisiting Activation Sparsity for Efficient High-Resolution Vision Transformer", CVPR, 2023 (MIT). [Paper][Website]
  • Slide-Transformer: "Slide-Transformer: Hierarchical Vision Transformer with Local Self-Attention", CVPR, 2023 (Tsinghua University). [Paper][Code (in construction)]
  • RIFormer: "RIFormer: Keep Your Vision Backbone Effective While Removing Token Mixer", CVPR, 2023 (Shanghai AI Lab). [Paper][PyTorch][Website]
  • EfficientViT: "EfficientViT: Memory Efficient Vision Transformer with Cascaded Group Attention", CVPR, 2023 (Microsoft). [Paper][PyTorch]
  • Castling-ViT: "Castling-ViT: Compressing Self-Attention via Switching Towards Linear-Angular Attention During Vision Transformer Inference", CVPR, 2023 (Meta). [Paper]
  • ViT-Ti: "RGB no more: Minimally-decoded JPEG Vision Transformers", CVPR, 2023 (UMich). [Paper]
  • Sparsifiner: "Sparsifiner: Learning Sparse Instance-Dependent Attention for Efficient Vision Transformers", CVPR, 2023 (University of Toronto). [Paper]
  • ?: "Beyond Attentive Tokens: Incorporating Token Importance and Diversity for Efficient Vision Transformers", CVPR, 2023 (Baidu). [Paper]
  • LTMP: "Learned Thresholds Token Merging and Pruning for Vision Transformers", ICMLW, 2023 (Ghent University, Belgium). [Paper][PyTorch][Website]
  • ReViT: "Make A Long Image Short: Adaptive Token Length for Vision Transformers", ECML PKDD, 2023 (Midea Group, China). [Paper]
  • EfficientViT: "EfficientViT: Enhanced Linear Attention for High-Resolution Low-Computation Visual Recognition", ICCV, 2023 (MIT). [Paper][PyTorch]
  • TokenReduction: "Which Tokens to Use? Investigating Token Reduction in Vision Transformers", ICCVW, 2023 (Aalborg University, Denmark). [Paper][PyTorch][Website]
  • LGViT: "LGViT: Dynamic Early Exiting for Accelerating Vision Transformer", ACMMM, 2023 (Beijing Institute of Technology). [Paper]
  • ElasticViT: "ElasticViT: Conflict-aware Supernet Training for Deploying Fast Vision Transformer on Diverse Mobile Devices", arXiv, 2023 (Microsoft). [Paper]
  • SeiT: "SeiT: Storage-Efficient Vision Training with Tokens Using 1% of Pixel Storage", arXiv, 2023 (NAVER). [Paper][Code (in construction)]
  • FastViT: "FastViT: A Fast Hybrid Vision Transformer using Structural Reparameterization", arXiv, 2023 (Apple). [Paper]
  • CloFormer: "Rethinking Local Perception in Lightweight Vision Transformer", arXiv, 2023 (CAS). [Paper]
  • Quadformer: "Vision Transformers with Mixed-Resolution Tokenization", arXiv, 2023 (Tel Aviv). [Paper][Code (in construction)]
  • SparseFormer: "SparseFormer: Sparse Visual Recognition via Limited Latent Tokens", arXiv, 2023 (NUS). [Paper][Code (in construction)]
  • EMO: "Rethinking Mobile Block for Efficient Attention-based Models", arXiv, 2023 (Tencent). [Paper][PyTorch]
  • SoViT: "Getting ViT in Shape: Scaling Laws for Compute-Optimal Model Design", arXiv, 2023 (DeepMind). [Paper]
  • FAT: "Lightweight Vision Transformer with Bidirectional Interaction", arXiv, 2023 (CAS). [Paper][PyTorch]
  • ByteFormer: "Bytes Are All You Need: Transformers Operating Directly On File Bytes", arXiv, 2023 (Apple). [Paper]
  • ?: "Muti-Scale And Token Mergence: Make Your ViT More Efficient", arXiv, 2023 (Jilin University). [Paper]
  • FasterViT: "FasterViT: Fast Vision Transformers with Hierarchical Attention", arXiv, 2023 (NVIDIA). [Paper]
  • NextViT: "Vision Transformer with Attention Map Hallucination and FFN Compaction", arXiv, 2023 (Baidu). [Paper]
  • SkipAt: "Skip-Attention: Improving Vision Transformers by Paying Less Attention", arXiv, 2023 (Qualcomm). [Paper]
  • MSViT: "MSViT: Dynamic Mixed-Scale Tokenization for Vision Transformers", arXiv, 2023 (Qualcomm). [Paper]
  • DiT: "DiT: Efficient Vision Transformers with Dynamic Token Routing", arXiv, 2023 (Meituan). [Paper][Code (in construction)]
  • ?: "Which Transformer to Favor: A Comparative Analysis of Efficiency in Vision Transformers", arXiv, 2023 (German Research Center for Artificial Intelligence (DFKI)). [Paper][PyTorch]
  • Mobile-V-MoEs: "Mobile V-MoEs: Scaling Down Vision Transformers via Sparse Mixture-of-Experts", arXiv, 2023 (Apple). [Paper]
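
Many of the token-reduction methods above (e.g., EViT, ATS, ToMe) share one primitive: score the patch tokens between blocks and keep only a subset. Below is a hedged, EViT-flavoured sketch of top-k pruning by [CLS] attention; the function name, shapes, and keep ratio are illustrative assumptions rather than any paper's exact procedure.

```python
# EViT-style token pruning sketch: keep the patch tokens the [CLS] token attends to most.
import torch

def prune_tokens(tokens: torch.Tensor, cls_attn: torch.Tensor, keep_ratio: float = 0.5) -> torch.Tensor:
    """tokens: (B, 1+N, D) with the [CLS] token first.
    cls_attn: (B, N) attention weights from [CLS] to the N patch tokens.
    Returns the [CLS] token plus the top keep_ratio fraction of patch tokens."""
    B, n1, D = tokens.shape
    k = max(1, int((n1 - 1) * keep_ratio))
    idx = cls_attn.topk(k, dim=1).indices                     # (B, k) most-attended patches
    patches = tokens[:, 1:].gather(1, idx.unsqueeze(-1).expand(-1, -1, D))
    return torch.cat([tokens[:, :1], patches], dim=1)         # (B, 1+k, D)

tokens = torch.randn(2, 197, 384)                             # ViT-S-like token sequence
cls_attn = torch.rand(2, 196).softmax(dim=1)                  # stand-in for real attention weights
print(prune_tokens(tokens, cls_attn).shape)                   # torch.Size([2, 99, 384])
```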

#### Conv + Transformer

  • LeViT: "LeViT: a Vision Transformer in ConvNet's Clothing for Faster Inference", ICCV, 2021 (Facebook). [Paper][PyTorch]
  • CeiT: "Incorporating Convolution Designs into Visual Transformers", ICCV, 2021 (SenseTime). [Paper][PyTorch (rishikksh20)]
  • Conformer: "Conformer: Local Features Coupling Global Representations for Visual Recognition", ICCV, 2021 (CAS). [Paper][PyTorch]
  • CoaT: "Co-Scale Conv-Attentional Image Transformers", ICCV, 2021 (UCSD). [Paper][PyTorch]
  • CvT: "CvT: Introducing Convolutions to Vision Transformers", ICCV, 2021 (Microsoft). [Paper][Code]
  • ViTc: "Early Convolutions Help Transformers See Better", NeurIPS, 2021 (Facebook). [Paper]
  • ConTNet: "ConTNet: Why not use convolution and transformer at the same time?", arXiv, 2021 (ByteDance). [Paper][PyTorch]
  • SPACH: "A Battle of Network Structures: An Empirical Study of CNN, Transformer, and MLP", arXiv, 2021 (Microsoft). [Paper]
  • MobileViT: "MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer", ICLR, 2022 (Apple). [Paper][PyTorch]
  • CMT: "CMT: Convolutional Neural Networks Meet Vision Transformers", CVPR, 2022 (Huawei). [Paper]
  • Mobile-Former: "Mobile-Former: Bridging MobileNet and Transformer", CVPR, 2022 (Microsoft). [Paper][PyTorch (in construction)]
  • TinyViT: "TinyViT: Fast Pretraining Distillation for Small Vision Transformers", ECCV, 2022 (Microsoft). [Paper][PyTorch]
  • CETNet: "Convolutional Embedding Makes Hierarchical Vision Transformer Stronger", ECCV, 2022 (OPPO). [Paper]
  • ParC-Net: "ParC-Net: Position Aware Circular Convolution with Merits from ConvNets and Transformer", ECCV, 2022 (Intellifusion, China). [Paper][PyTorch]
  • ?: "How to Train Vision Transformer on Small-scale Datasets?", BMVC, 2022 (MBZUAI). [Paper][PyTorch]
  • DHVT: "Bridging the Gap Between Vision Transformers and Convolutional Neural Networks on Small Datasets", NeurIPS, 2022 (USTC). [Paper][Code (in construction)]
  • iFormer: "Inception Transformer", NeurIPS, 2022 (Sea AI Lab). [Paper][PyTorch]
  • DenseDCT: "Explicitly Increasing Input Information Density for Vision Transformers on Small Datasets", NeurIPSW, 2022 (University of Kansas). [Paper]
  • CXV: "Convolutional Xformers for Vision", arXiv, 2022 (IIT Bombay). [Paper][PyTorch]
  • ConvMixer: "Patches Are All You Need?", arXiv, 2022 (CMU). [Paper][PyTorch]
  • MobileViTv2: "Separable Self-attention for Mobile Vision Transformers", arXiv, 2022 (Apple). [Paper][PyTorch]
  • UniFormer: "UniFormer: Unifying Convolution and Self-attention for Visual Recognition", arXiv, 2022 (SenseTime). [Paper][PyTorch]
  • EdgeFormer: "EdgeFormer: Improving Light-weight ConvNets by Learning from Vision Transformers", arXiv, 2022 (?). [Paper]
  • MoCoViT: "MoCoViT: Mobile Convolutional Vision Transformer", arXiv, 2022 (ByteDance). [Paper]
  • DynamicViT: "Dynamic Spatial Sparsification for Efficient Vision Transformers and Convolutional Neural Networks", arXiv, 2022 (Tsinghua University). [Paper][PyTorch]
  • ConvFormer: "ConvFormer: Closing the Gap Between CNN and Vision Transformers", arXiv, 2022 (National University of Defense Technology, China). [Paper]
  • Fast-ParC: "Fast-ParC: Position Aware Global Kernel for ConvNets and ViTs", arXiv, 2022 (Intellifusion, China). [Paper]
  • MetaFormer: "MetaFormer Baselines for Vision", arXiv, 2022 (Sea AI Lab). [Paper][PyTorch]
  • STM: "Demystify Transformers & Convolutions in Modern Image Deep Networks", arXiv, 2022 (Tsinghua University). [Paper][Code (in construction)]
  • ParCNetV2: "ParCNetV2: Oversized Kernel with Enhanced Attention", arXiv, 2022 (Intellifusion, China). [Paper]
  • VAN: "Visual Attention Network", arXiv, 2022 (Tsinghua). [Paper][PyTorch]
  • SD-MAE: "Masked autoencoders is an effective solution to transformer data-hungry", arXiv, 2022 (Hangzhou Dianzi University). [Paper][PyTorch (in construction)]
  • SATA: "Accumulated Trivial Attention Matters in Vision Transformers on Small Datasets", WACV, 2023 (University of Kansas). [Paper][PyTorch (in construction)]
  • SparK: "Sparse and Hierarchical Masked Modeling for Convolutional Representation Learning", ICLR, 2023 (Bytedance). [Paper][PyTorch]
  • MOAT: "MOAT: Alternating Mobile Convolution and Attention Brings Strong Vision Models", ICLR, 2023 (Google). [Paper][Tensorflow]
  • InternImage: "InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions", CVPR, 2023 (Shanghai AI Laboratory). [Paper][PyTorch]
  • SCSC: "SCSC: Spatial Cross-scale Convolution Module to Strengthen both CNNs and Transformers", ICCVW, 2023 (Megvii). [Paper]
  • PSLT: "PSLT: A Light-weight Vision Transformer with Ladder Self-Attention and Progressive Shift", TPAMI, 2023 (Sun Yat-sen University). [Paper][Website]
  • SwiftFormer: "SwiftFormer: Efficient Additive Attention for Transformer-based Real-time Mobile Vision Applications", arXiv, 2023 (MBZUAI). [Paper][PyTorch]
  • RepViT: "RepViT: Revisiting Mobile CNN From ViT Perspective", arXiv, 2023 (Tsinghua). [Paper][PyTorch]

#### Training + Transformer

  • iGPT: "Generative Pretraining From Pixels", ICML, 2020 (OpenAI). [Paper][Tensorflow]
  • CLIP: "Learning Transferable Visual Models From Natural Language Supervision", ICML, 2021 (OpenAI). [Paper][PyTorch]
  • MoCo-V3: "An Empirical Study of Training Self-Supervised Vision Transformers", ICCV, 2021 (Facebook). [Paper]
  • DINO: "Emerging Properties in Self-Supervised Vision Transformers", ICCV, 2021 (Facebook). [Paper][PyTorch]
  • drloc: "Efficient Training of Visual Transformers with Small Datasets", NeurIPS, 2021 (University of Trento). [Paper][PyTorch]
  • CARE: "Revitalizing CNN Attentions via Transformers in Self-Supervised Visual Representation Learning", NeurIPS, 2021 (Tencent). [Paper][PyTorch]
  • MST: "MST: Masked Self-Supervised Transformer for Visual Representation", NeurIPS, 2021 (SenseTime). [Paper]
  • SiT: "SiT: Self-supervised Vision Transformer", arXiv, 2021 (University of Surrey). [Paper][PyTorch]
  • MoBY: "Self-Supervised Learning with Swin Transformers", arXiv, 2021 (Microsoft). [Paper][PyTorch]
  • ?: "Investigating Transfer Learning Capabilities of Vision Transformers and CNNs by Fine-Tuning a Single Trainable Block", arXiv, 2021 (Pune Institute of Computer Technology, India). [Paper]
  • Annotations-1.3B: "Billion-Scale Pretraining with Vision Transformers for Multi-Task Visual Representations", WACV, 2022 (Pinterest). [Paper]
  • BEiT: "BEiT: BERT Pre-Training of Image Transformers", ICLR, 2022 (Microsoft). [Paper][PyTorch]
  • EsViT: "Efficient Self-supervised Vision Transformers for Representation Learning", ICLR, 2022 (Microsoft). [Paper]
  • iBOT: "Image BERT Pre-training with Online Tokenizer", ICLR, 2022 (ByteDance). [Paper][PyTorch]
  • MaskFeat: "Masked Feature Prediction for Self-Supervised Visual Pre-Training", CVPR, 2022 (Facebook). [Paper]
  • AutoProg: "Automated Progressive Learning for Efficient Training of Vision Transformers", CVPR, 2022 (Monash University, Australia). [Paper][Code (in construction)]
  • MAE: "Masked Autoencoders Are Scalable Vision Learners", CVPR, 2022 (Facebook). [Paper][PyTorch][PyTorch (pengzhiliang)] (a masking sketch follows this list)
  • SimMIM: "SimMIM: A Simple Framework for Masked Image Modeling", CVPR, 2022 (Microsoft). [Paper][PyTorch]
  • SelfPatch: "Patch-Level Representation Learning for Self-Supervised Vision Transformers", CVPR, 2022 (KAIST). [Paper][PyTorch]
  • Bootstrapping-ViTs: "Bootstrapping ViTs: Towards Liberating Vision Transformers from Pre-training", CVPR, 2022 (Zhejiang University). [Paper][PyTorch]
  • TransMix: "TransMix: Attend to Mix for Vision Transformers", CVPR, 2022 (JHU). [Paper][PyTorch]
  • PatchRot: "PatchRot: A Self-Supervised Technique for Training Vision Transformers", CVPRW, 2022 (Arizona State). [Paper]
  • SplitMask: "Are Large-scale Datasets Necessary for Self-Supervised Pre-training?", CVPRW, 2022 (Meta). [Paper]
  • MC-SSL: "MC-SSL: Towards Multi-Concept Self-Supervised Learning", CVPRW, 2022 (University of Surrey, UK). [Paper]
  • RelViT: "Where are my Neighbors? Exploiting Patches Relations in Self-Supervised Vision Transformer", CVPRW, 2022 (University of Padova, Italy). [Paper]
  • data2vec: "data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language", ICML, 2022 (Meta). [Paper][PyTorch]
  • SSTA: "Self-supervised Models are Good Teaching Assistants for Vision Transformers", ICML, 2022 (Tencent). [Paper][Code (in construction)]
  • MP3: "Position Prediction as an Effective Pretraining Strategy", ICML, 2022 (Apple). [Paper][PyTorch]
  • CutMixSL: "Visual Transformer Meets CutMix for Improved Accuracy, Communication Efficiency, and Data Privacy in Split Learning", IJCAI, 2022 (Yonsei University, Korea). [Paper]
  • BootMAE: "Bootstrapped Masked Autoencoders for Vision BERT Pretraining", ECCV, 2022 (Microsoft). [Paper][PyTorch]
  • TokenMix: "TokenMix: Rethinking Image Mixing for Data Augmentation in Vision Transformers", ECCV, 2022 (CUHK). [Paper][PyTorch]
  • ?: "Locality Guidance for Improving Vision Transformers on Tiny Datasets", ECCV, 2022 (Peking University). [Paper][PyTorch]
  • HAT: "Improving Vision Transformers by Revisiting High-frequency Components", ECCV, 2022 (Tsinghua). [Paper][PyTorch]
  • IDMM: "Training Vision Transformers with Only 2040 Images", ECCV, 2022 (Nanjing University). [Paper]
  • AttMask: "What to Hide from Your Students: Attention-Guided Masked Image Modeling", ECCV, 2022 (National Technical University of Athens). [Paper][PyTorch]
  • SLIP: "SLIP: Self-supervision meets Language-Image Pre-training", ECCV, 2022 (Berkeley + Meta). [Paper][Pytorch]
  • mc-BEiT: "mc-BEiT: Multi-Choice Discretization for Image BERT Pre-training", ECCV, 2022 (Peking University). [Paper]
  • SL2O: "Scalable Learning to Optimize: A Learned Optimizer Can Train Big Models", ECCV, 2022 (UT Austin). [Paper][PyTorch]
  • TokenMixup: "TokenMixup: Efficient Attention-guided Token-level Data Augmentation for Transformers", NeurIPS, 2022 (Korea University). [Paper][PyTorch]
  • PatchRot: "PatchRot: A Self-Supervised Technique for Training Vision Transformers", NeurIPSW, 2022 (Arizona State University). [Paper]
  • GreenMIM: "Green Hierarchical Vision Transformer for Masked Image Modeling", NeurIPS, 2022 (The University of Tokyo). [Paper][PyTorch]
  • DP-CutMix: "Differentially Private CutMix for Split Learning with Vision Transformer", NeurIPSW, 2022 (Yonsei University). [Paper]
  • ?: "How to train your ViT? Data, Augmentation, and Regularization in Vision Transformers", Transactions on Machine Learning Research (TMLR), 2022 (Google). [Paper][Tensorflow][PyTorch (rwightman)]
  • PeCo: "PeCo: Perceptual Codebook for BERT Pre-training of Vision Transformers", arXiv, 2022 (Microsoft). [Paper]
  • RePre: "RePre: Improving Self-Supervised Vision Transformer with Reconstructive Pre-training", arXiv, 2022 (Beijing University of Posts and Telecommunications). [Paper]
  • Beyond-Masking: "Beyond Masking: Demystifying Token-Based Pre-Training for Vision Transformers", arXiv, 2022 (CAS). [Paper][Code (in construction)]
  • Kronecker-Adaptation: "Parameter-efficient Fine-tuning for Vision Transformers", arXiv, 2022 (Microsoft). [Paper]
  • DILEMMA: "DILEMMA: Self-Supervised Shape and Texture Learning with Transformers", arXiv, 2022 (University of Bern, Switzerland). [Paper]
  • DeiT-III: "DeiT III: Revenge of the ViT", arXiv, 2022 (Meta). [Paper]
  • ?: "Better plain ViT baselines for ImageNet-1k", arXiv, 2022 (Google). [Paper][Tensorflow]
  • ConvMAE: "ConvMAE: Masked Convolution Meets Masked Autoencoders", arXiv, 2022 (Shanghai AI Laboratory). [Paper][PyTorch (in construction)]
  • UM-MAE: "Uniform Masking: Enabling MAE Pre-training for Pyramid-based Vision Transformers with Locality", arXiv, 2022 (Nanjing University of Science and Technology). [Paper][PyTorch]
  • GMML: "GMML is All you Need", arXiv, 2022 (University of Surrey, UK). [Paper][PyTorch]
  • SIM: "Siamese Image Modeling for Self-Supervised Vision Representation Learning", arXiv, 2022 (SenseTime). [Paper]
  • SupMAE: "SupMAE: Supervised Masked Autoencoders Are Efficient Vision Learners", arXiv, 2022 (UT Austin). [Paper][PyTorch]
  • LoMaR: "Efficient Self-supervised Vision Pretraining with Local Masked Reconstruction", arXiv, 2022 (KAUST). [Paper]
  • SAR: "Spatial Entropy Regularization for Vision Transformers", arXiv, 2022 (University of Trento, Italy). [Paper]
  • ExtreMA: "Extreme Masking for Learning Instance and Distributed Visual Representations", arXiv, 2022 (Microsoft). [Paper]
  • ?: "Exploring Feature Self-relation for Self-supervised Transformer", arXiv, 2022 (Nankai University). [Paper]
  • ?: "Position Labels for Self-Supervised Vision Transformer", arXiv, 2022 (Southwest Jiaotong University). [Paper]
  • Jigsaw-ViT: "Jigsaw-ViT: Learning Jigsaw Puzzles in Vision Transformer", arXiv, 2022 (KU Leuven, Belgium). [Paper][PyTorch][Website]
  • BEiT-v2: "BEiT v2: Masked Image Modeling with Vector-Quantized Visual Tokenizers", arXiv, 2022 (Microsoft). [Paper][PyTorch]
  • MILAN: "MILAN: Masked Image Pretraining on Language Assisted Representation", arXiv, 2022 (Princeton). [Paper][PyTorch (in construction)]
  • PSS: "Accelerating Vision Transformer Training via a Patch Sampling Schedule", arXiv, 2022 (Franklin and Marshall College, Pennsylvania). [Paper][PyTorch]
  • dBOT: "Exploring Target Representations for Masked Autoencoders", arXiv, 2022 (ByteDance). [Paper]
  • PatchErasing: "Effective Vision Transformer Training: A Data-Centric Perspective", arXiv, 2022 (Alibaba). [Paper]
  • Self-Distillation: "Self-Distillation for Further Pre-training of Transformers", arXiv, 2022 (KAIST). [Paper]
  • TL-Align: "Token-Label Alignment for Vision Transformers", arXiv, 2022 (Tsinghua University). [Paper][PyTorch]
  • AutoView: "Learning Self-Regularized Adversarial Views for Self-Supervised Vision Transformers", arXiv, 2022 (Sun Yat-sen University). [Paper][Code (in construction)]
  • CLIPpy: "Perceptual Grouping in Vision-Language Models", arXiv, 2022 (Apple). [Paper]
  • LOCA: "Location-Aware Self-Supervised Transformers", arXiv, 2022 (Google). [Paper]
  • FT-CLIP: "CLIP Itself is a Strong Fine-tuner: Achieving 85.7% and 88.0% Top-1 Accuracy with ViT-B and ViT-L on ImageNet", arXiv, 2022 (Microsoft). [Paper][Code (in construction)]
  • MixPro: "MixPro: Data Augmentation with MaskMix and Progressive Attention Labeling for Vision Transformer", ICLR, 2023 (Beijing University of Chemical Technology). [Paper][PyTorch (in construction)]
  • ConMIM: "Masked Image Modeling with Denoising Contrast", ICLR, 2023 (Tencent). [Paper][Pytorch]
  • ccMIM: "Contextual Image Masking Modeling via Synergized Contrasting without View Augmentation for Faster and Better Visual Pretraining", ICLR, 2023 (Shanghai Jiao Tong). [Paper]
  • CIM: "Corrupted Image Modeling for Self-Supervised Visual Pre-Training", ICLR, 2023 (Microsoft). [Paper]
  • MFM: "Masked Frequency Modeling for Self-Supervised Visual Pre-Training", ICLR, 2023 (NTU, Singapore). [Paper][Website]
  • Mask3D: "Mask3D: Pre-training 2D Vision Transformers by Learning Masked 3D Priors", CVPR, 2023 (Meta). [Paper]
  • VisualAtom: "Visual Atoms: Pre-training Vision Transformers with Sinusoidal Waves", CVPR, 2023 (National Institute of Advanced Industrial Science and Technology (AIST), Japan). [Paper][PyTorch][Website]
  • MixedAE: "Mixed Autoencoder for Self-supervised Visual Representation Learning", CVPR, 2023 (Huawei). [Paper]
  • TBM: "Token Boosting for Robust Self-Supervised Visual Transformer Pre-training", CVPR, 2023 (Singapore University of Technology and Design). [Paper]
  • LGSimCLR: "Learning Visual Representations via Language-Guided Sampling", CVPR, 2023 (UMich). [Paper][PyTorch]
  • DisCo-CLIP: "DisCo-CLIP: A Distributed Contrastive Loss for Memory Efficient CLIP Training", CVPR, 2023 (IDEA). [Paper][PyTorch (in construction)]
  • MaskCLIP: "MaskCLIP: Masked Self-Distillation Advances Contrastive Language-Image Pretraining", CVPR, 2023 (Microsoft). [Paper][Code (in construction)]
  • MAGE: "MAGE: MAsked Generative Encoder to Unify Representation Learning and Image Synthesis", CVPR, 2023 (Google). [Paper][PyTorch]
  • MixMIM: "MixMIM: Mixed and Masked Image Modeling for Efficient Visual Representation Learning", CVPR, 2023 (SenseTime). [Paper][PyTorch]
  • iTPN: "Integrally Pre-Trained Transformer Pyramid Networks", CVPR, 2023 (CAS). [Paper][PyTorch]
  • DropKey: "DropKey for Vision Transformer", CVPR, 2023 (Meitu). [Paper]
  • FlexiViT: "FlexiViT: One Model for All Patch Sizes", CVPR, 2023 (Google). [Paper][Tensorflow]
  • RA-CLIP: "RA-CLIP: Retrieval Augmented Contrastive Language-Image Pre-Training", CVPR, 2023 (Alibaba). [Paper]
  • CLIPPO: "CLIPPO: Image-and-Language Understanding from Pixels Only", CVPR, 2023 (Google). [Paper][JAX]
  • DMAE: "Masked Autoencoders Enable Efficient Knowledge Distillers", CVPR, 2023 (JHU + UC Santa Cruz). [Paper][PyTorch]
  • HPM: "Hard Patches Mining for Masked Image Modeling", CVPR, 2023 (CAS). [Paper][PyTorch]
  • LocalMIM: "Masked Image Modeling with Local Multi-Scale Reconstruction", CVPR, 2023 (Peking University). [Paper]
  • MaskAlign: "Stare at What You See: Masked Image Modeling without Reconstruction", CVPR, 2023 (Shanghai AI Lab). [Paper][PyTorch]
  • RILS: "RILS: Masked Visual Reconstruction in Language Semantic Space", CVPR, 2023 (Tencent). [Paper][Code (in construction)]
  • RelaxMIM: "Understanding Masked Image Modeling via Learning Occlusion Invariant Feature", CVPR, 2023 (Megvii). [Paper]
  • FDT: "Revisiting Multimodal Representation in Contrastive Learning: From Patch and Token Embeddings to Finite Discrete Tokens", CVPR, 2023 (ByteDance). [Paper][Code (in construction)]
  • ?: "Prefix Conditioning Unifies Language and Label Supervision", CVPR, 2023 (Google). [Paper]
  • OpenCLIP: "Reproducible scaling laws for contrastive language-image learning", CVPR, 2023 (LAION). [Paper][PyTorch]
  • DiHT: "Filtering, Distillation, and Hard Negatives for Vision-Language Pre-Training", CVPR, 2023 (Meta). [Paper][PyTorch]
  • M3I-Pretraining: "Towards All-in-one Pre-training via Maximizing Multi-modal Mutual Information", CVPR, 2023 (Shanghai AI Lab). [Paper][Code (in construction)]
  • SN-Net: "Stitchable Neural Networks", CVPR, 2023 (Monash University). [Paper][PyTorch]
  • MAE-Lite: "A Closer Look at Self-supervised Lightweight Vision Transformers", ICML, 2023 (Megvii). [Paper][PyTorch]
  • ViT-22B: "Scaling Vision Transformers to 22 Billion Parameters", ICML, 2023 (Google). [Paper]
  • GHN-3: "Can We Scale Transformers to Predict Parameters of Diverse ImageNet Models?", ICML, 2023 (Samsung). [Paper][PyTorch]
  • A2MIM: "Architecture-Agnostic Masked Image Modeling - From ViT back to CNN", ICML, 2023 (Westlake University, China). [Paper][PyTorch]
  • PQCL: "Patch-level Contrastive Learning via Positional Query for Visual Pre-training", ICML, 2023 (Alibaba). [Paper][PyTorch]
  • DreamTeacher: "DreamTeacher: Pretraining Image Backbones with Deep Generative Models", ICCV, 2023 (NVIDIA). [Paper][Website]
  • OFDB: "Pre-training Vision Transformers with Very Limited Synthesized Images", ICCV, 2023 (National Institute of Advanced Industrial Science and Technology (AIST), Japan). [Paper][PyTorch]
  • MFF: "Improving Pixel-based MIM by Reducing Wasted Modeling Capability", ICCV, 2023 (Shanghai AI Lab). [Paper][PyTorch]
  • CountBench: "Teaching CLIP to Count to Ten", arXiv, 2023 (Google). [Paper]
  • CCViT: "Centroid-centered Modeling for Efficient Vision Transformer Pre-training", arXiv, 2023 (Wuhan University). [Paper]
  • SoftCLIP: "SoftCLIP: Softer Cross-modal Alignment Makes CLIP Stronger", arXiv, 2023 (Tencent). [Paper]
  • MAE-WSP: "The effectiveness of MAE pre-pretraining for billion-scale pretraining", arXiv, 2023 (Meta). [Paper]
  • DiffMAE: "Diffusion Models as Masked Autoencoders", arXiv, 2023 (Meta). [Paper][Website]
  • RECLIP: "RECLIP: Resource-efficient CLIP by Training with Small Images", arXiv, 2023 (Google). [Paper]
  • DINOv2: "DINOv2: Learning Robust Visual Features without Supervision", arXiv, 2023 (Meta). [Paper]
  • ?: "Stable and low-precision training for large-scale vision-language models", arXiv, 2023 (UW). [Paper]
  • ?: "Objectives Matter: Understanding the Impact of Self-Supervised Objectives on Vision Transformer Representations", arXiv, 2023 (Meta). [Paper]
  • Filter: "Less is More: Removing Text-regions Improves CLIP Training Efficiency and Robustness", arXiv, 2023 (Apple). [Paper]
  • CLIPA: "An Inverse Scaling Law for CLIP Training", arXiv, 2023 (UC Santa Cruz). [Paper][PyTorch]
  • ?: "Improved baselines for vision-language pre-training", arXiv, 2023 (Meta). [Paper]
  • 3T: "Three Towers: Flexible Contrastive Learning with Pretrained Image Models", arXiv, 2023 (Google). [Paper]
  • LaCLIP: "Improving CLIP Training with Language Rewrites", arXiv, 2023 (Google). [Paper][PyTorch]
  • StableRep: "StableRep: Synthetic Images from Text-to-Image Models Make Strong Visual Representation Learners", arXiv, 2023 (Google). [Paper]
  • ADDP: "ADDP: Learning General Representations for Image Recognition and Generation with Alternating Denoising Diffusion Process", arXiv, 2023 (CUHK + Tsinghua). [Paper]
  • MOFI: "MOFI: Learning Image Representations from Noisy Entity Annotated Images", arXiv, 2023 (Apple). [Paper]
  • CapPa: "Image Captioners Are Scalable Vision Learners Too", arXiv, 2023 (DeepMind). [Paper]
  • MaPeT: "Learning to Mask and Permute Visual Tokens for Vision Transformer Pre-Training", arXiv, 2023 (UniMoRE, Italy). [Paper][PyTorch]
  • RECO: "Retrieval-Enhanced Contrastive Vision-Text Models", arXiv, 2023 (Google). [Paper]
  • DesCo: "DesCo: Learning Object Recognition with Rich Language Descriptions", arXiv, 2023 (UCLA). [Paper]
  • CLIPA-v2: "CLIPA-v2: Scaling CLIP Training with 81.1% Zero-shot ImageNet Accuracy within a $10,000 Budget; An Extra $4,000 Unlocks 81.8% Accuracy", arXiv, 2023 (UC Santa Cruz). [Paper][PyTorch]
  • PatchMixing: "Hardwiring ViT Patch Selectivity into CNNs using Patch Mixing", arXiv, 2023 (Boston). [Paper][Website]
  • SN-Netv2: "Stitched ViTs are Flexible Vision Backbones", arXiv, 2023 (Monash University). [Paper][PyTorch (in construction)]
  • ?: "Improving Multimodal Datasets with Image Captioning", arXiv, 2023 (UW). [Paper]
  • CLIP-GPT: "Enhancing CLIP with GPT-4: Harnessing Visual Descriptions as Prompts", arXiv, 2023 (Dublin City University, Ireland). [Paper]
  • FlexPredict: "Predicting masked tokens in stochastic locations improves masked image modeling", arXiv, 2023 (Meta). [Paper]
  • Soft-MoE: "From Sparse to Soft Mixtures of Experts", arXiv, 2023 (DeepMind). [Paper]
  • RevColV2: "RevColV2: Exploring Disentangled Representations in Masked Image Modeling", arXiv, 2023 (Megvii). [Paper][PyTorch]
  • DropPos: "DropPos: Pre-Training Vision Transformers by Reconstructing Dropped Positions", arXiv, 2023 (CAS). [Paper]
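
Many of the pre-training entries above share one masked-image-modeling recipe: randomly mask most patches, encode only the visible ones, and reconstruct the masked content from mask tokens. Below is a minimal PyTorch sketch of that pattern; the module name, shapes, and the simplification of regressing patch embeddings instead of raw pixels are illustrative assumptions, not any single paper's method.

```python
import torch
import torch.nn as nn

class TinyMAE(nn.Module):
    """Minimal MAE-style masked image modeling (illustrative, not official)."""
    def __init__(self, num_patches=196, dim=192, mask_ratio=0.75):
        super().__init__()
        self.mask_ratio = mask_ratio
        enc_layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        dec_layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.decoder = nn.TransformerEncoder(dec_layer, num_layers=1)
        self.head = nn.Linear(dim, dim)  # predict the masked patch embeddings

    def forward(self, patches):                      # patches: (B, N, dim)
        B, N, D = patches.shape
        keep = int(N * (1 - self.mask_ratio))
        idx = torch.rand(B, N, device=patches.device).argsort(dim=1)
        visible = torch.gather(patches, 1, idx[:, :keep, None].expand(-1, -1, D))
        latent = self.encoder(visible)               # encode visible patches only
        pad = self.mask_token.expand(B, N - keep, D) # fill the masked slots
        full = torch.cat([latent, pad], dim=1)
        unshuffle = idx.argsort(dim=1)               # restore original patch order
        full = torch.gather(full, 1, unshuffle[:, :, None].expand(-1, -1, D))
        pred = self.head(self.decoder(full))
        mask = torch.zeros(B, N, device=patches.device)
        mask.scatter_(1, idx[:, keep:], 1.0)         # 1 = masked position
        loss = ((pred - patches) ** 2).mean(-1)      # per-patch MSE
        return (loss * mask).sum() / mask.sum()      # loss on masked patches only
```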

Robustness + Transformer

  • ViT-Robustness: "Understanding Robustness of Transformers for Image Classification", ICCV, 2021 (Google). [Paper]
  • SAGA: "On the Robustness of Vision Transformers to Adversarial Examples", ICCV, 2021 (University of Connecticut). [Paper]
  • ?: "Adversarial Robustness Comparison of Vision Transformer and MLP-Mixer to CNNs", BMVC, 2021 (KAIST). [Paper][PyTorch]
  • ViTs-vs-CNNs: "Are Transformers More Robust Than CNNs?", NeurIPS, 2021 (JHU + UC Santa Cruz). [Paper][PyTorch]
  • T-CNN: "Transformed CNNs: recasting pre-trained convolutional layers with self-attention", arXiv, 2021 (Facebook). [Paper]
  • Transformer-Attack: "On the Adversarial Robustness of Visual Transformers", arXiv, 2021 (Xi'an Jiaotong). [Paper]
  • ?: "Reveal of Vision Transformers Robustness against Adversarial Attacks", arXiv, 2021 (University of Rennes). [Paper]
  • ?: "On Improving Adversarial Transferability of Vision Transformers", arXiv, 2021 (ANU). [Paper][PyTorch]
  • ?: "Exploring Corruption Robustness: Inductive Biases in Vision Transformers and MLP-Mixers", arXiv, 2021 (University of Pittsburgh). [Paper]
  • Token-Attack: "Adversarial Token Attacks on Vision Transformers", arXiv, 2021 (New York University). [Paper]
  • ?: "Discrete Representations Strengthen Vision Transformer Robustness", arXiv, 2021 (Google). [Paper]
  • ?: "Vision Transformers are Robust Learners", AAAI, 2022 (PyImageSearch + IBM). [Paper][Tensorflow]
  • PNA: "Towards Transferable Adversarial Attacks on Vision Transformers", AAAI, 2022 (Fudan + Maryland). [Paper][PyTorch]
  • MIA-Former: "MIA-Former: Efficient and Robust Vision Transformers via Multi-grained Input-Adaptation", AAAI, 2022 (Rice University). [Paper]
  • Patch-Fool: "Patch-Fool: Are Vision Transformers Always Robust Against Adversarial Perturbations?", ICLR, 2022 (Rice University). [Paper][PyTorch]
  • Generalization-Enhanced-ViT: "Delving Deep into the Generalization of Vision Transformers under Distribution Shifts", CVPR, 2022 (Beihang University + NTU, Singapore). [Paper]
  • ECViT: "Towards Practical Certifiable Patch Defense with Vision Transformer", CVPR, 2022 (Tencent). [Paper]
  • Attention-Fool: "Give Me Your Attention: Dot-Product Attention Considered Harmful for Adversarial Patch Robustness", CVPR, 2022 (Bosch). [Paper]
  • Memory-Token: "Fine-tuning Image Transformers using Learnable Memory", CVPR, 2022 (Google). [Paper]
  • APRIL: "APRIL: Finding the Achilles' Heel on Privacy for Vision Transformers", CVPR, 2022 (CAS). [Paper]
  • Smooth-ViT: "Certified Patch Robustness via Smoothed Vision Transformers", CVPR, 2022 (MIT). [Paper][PyTorch]
  • RVT: "Towards Robust Vision Transformer", CVPR, 2022 (Alibaba). [Paper][PyTorch]
  • Pyramid: "Pyramid Adversarial Training Improves ViT Performance", CVPR, 2022 (Google). [Paper]
  • VARS: "Visual Attention Emerges from Recurrent Sparse Reconstruction", ICML, 2022 (Berkeley + Microsoft). [Paper][PyTorch]
  • FAN: "Understanding The Robustness in Vision Transformers", ICML, 2022 (NVIDIA). [Paper][PyTorch]
  • CFA: "Robustifying Vision Transformer without Retraining from Scratch by Test-Time Class-Conditional Feature Alignment", IJCAI, 2022 (The University of Tokyo). [Paper][PyTorch]
  • ?: "Understanding Adversarial Robustness of Vision Transformers via Cauchy Problem", ECML-PKDD, 2022 (University of Exeter, UK). [Paper][PyTorch]
  • ?: "An Impartial Take to the CNN vs Transformer Robustness Contest", ECCV, 2022 (Oxford). [Paper]
  • AGAT: "Towards Efficient Adversarial Training on Vision Transformers", ECCV, 2022 (Zhejiang University). [Paper]
  • ?: "Are Vision Transformers Robust to Patch Perturbations?", ECCV, 2022 (TUM). [Paper]
  • ViP: "ViP: Unified Certified Detection and Recovery for Patch Attack with Vision Transformers", ECCV, 2022 (UC Santa Cruz). [Paper][PyTorch]
  • ?: "When Adversarial Training Meets Vision Transformers: Recipes from Training to Architecture", NeurIPS, 2022 (Peking University). [Paper][PyTorch]
  • PAR: "Decision-based Black-box Attack Against Vision Transformers via Patch-wise Adversarial Removal", NeurIPS, 2022 (Tianjin University). [Paper]
  • RobustViT: "Optimizing Relevance Maps of Vision Transformers Improves Robustness", NeurIPS, 2022 (Tel-Aviv). [Paper][PyTorch]
  • ?: "Understanding and Improving Robustness of Vision Transformers through Patch-based Negative Augmentation", NeurIPS, 2022 (Google). [Paper]
  • NVD: "Finding Differences Between Transformers and ConvNets Using Counterfactual Simulation Testing", NeurIPS, 2022 (Boston). [Paper]
  • ?: "Are Vision Transformers Robust to Spurious Correlations?", arXiv, 2022 (UW-Madison). [Paper]
  • MA: "Boosting Adversarial Transferability of MLP-Mixer", arXiv, 2022 (Beijing Institute of Technology). [Paper]
  • ?: "Deeper Insights into ViTs Robustness towards Common Corruptions", arXiv, 2022 (Fudan + Microsoft). [Paper]
  • ?: "Privacy-Preserving Image Classification Using Vision Transformer", arXiv, 2022 (Tokyo Metropolitan University). [Paper]
  • FedWAvg: "Federated Adversarial Training with Transformers", arXiv, 2022 (Institute of Electronics and Digital Technologies (IETR), France). [Paper]
  • Backdoor-Transformer: "Backdoor Attacks on Vision Transformers", arXiv, 2022 (Maryland + UC Davis). [Paper][Code (in construction)]
  • ?: "Defending Backdoor Attacks on Vision Transformer via Patch Processing", arXiv, 2022 (Baidu). [Paper]
  • ?: "Image and Model Transformation with Secret Key for Vision Transformer", arXiv, 2022 (Tokyo Metropolitan University). [Paper]
  • ?: "Analyzing Adversarial Robustness of Vision Transformers against Spatial and Spectral Attacks", arXiv, 2022 (Yonsei University). [Paper]
  • CLIPping Privacy: "CLIPping Privacy: Identity Inference Attacks on Multi-Modal Machine Learning Models", arXiv, 2022 (TUM). [Paper]
  • ?: "A Light Recipe to Train Robust Vision Transformers", arXiv, 2022 (EPFL). [Paper]
  • ?: "Attacking Compressed Vision Transformers", arXiv, 2022 (NYU). [Paper]
  • C-AVP: "Visual Prompting for Adversarial Robustness", arXiv, 2022 (Michigan State). [Paper]
  • ?: "Curved Representation Space of Vision Transformers", arXiv, 2022 (Yonsei University). [Paper]
  • RKDE: "Robustify Transformers with Robust Kernel Density Estimation", arXiv, 2022 (UT Austin). [Paper]
  • MRAP: "Pretrained Transformers Do not Always Improve Robustness", arXiv, 2022 (Arizona State University). [Paper]
  • model-soup: "Revisiting adapters with adversarial training", ICLR, 2023 (DeepMind). [Paper]
  • ?: "Budgeted Training for Vision Transformer", ICLR, 2023 (Tsinghua). [Paper]
  • RobustCNN: "Can CNNs Be More Robust Than Transformers?", ICLR, 2023 (UC Santa Cruz + JHU). [Paper][PyTorch]
  • DMAE: "Denoising Masked AutoEncoders are Certifiable Robust Vision Learners", ICLR, 2023 (Peking). [Paper][PyTorch]
  • TGR: "Transferable Adversarial Attacks on Vision Transformers with Token Gradient Regularization", CVPR, 2023 (CUHK). [Paper][PyTorch]
  • TrojViT: "TrojViT: Trojan Insertion in Vision Transformers", CVPR, 2023 (Indiana University Bloomington). [Paper]
  • RSPC: "Improving Robustness of Vision Transformers by Reducing Sensitivity to Patch Corruptions", CVPR, 2023 (MPI). [Paper]
  • TORA-ViT: "Trade-off between Robustness and Accuracy of Vision Transformers", CVPR, 2023 (The University of Sydney). [Paper]
  • BadViT: "You Are Catching My Attention: Are Vision Transformers Bad Learners Under Backdoor Attacks?", CVPR, 2023 (Huazhong University of Science and Technology). [Paper]
  • ?: "Understanding and Defending Patched-based Adversarial Attacks for Vision Transformer", ICML, 2023 (University of Pittsburgh). [Paper]
  • RobustMAE: "Improving Adversarial Robustness of Masked Autoencoders via Test-time Frequency-domain Prompting", ICCV, 2023 (USTC). [Paper][Code (in construction)]
  • PreLayerNorm: "Improved Robustness of Vision Transformer via PreLayerNorm in Patch Embedding", PR, 2023 (POSTECH). [Paper]
  • CertViT: "CertViT: Certified Robustness of Pre-Trained Vision Transformers", arXiv, 2023 (INRIA). [Paper][PyTorch]
  • CleanCLIP: "CleanCLIP: Mitigating Data Poisoning Attacks in Multimodal Contrastive Learning", arXiv, 2023 (UCLA). [Paper]
  • RoCLIP: "Robust Contrastive Language-Image Pretraining against Adversarial Attacks", arXiv, 2023 (UCLA). [Paper]
  • DeepMIM: "DeepMIM: Deep Supervision for Masked Image Modeling", arXiv, 2023 (Microsoft). [Paper][Code (in construction)]
  • TAP-ADL: "Robustifying Token Attention for Vision Transformers", arXiv, 2023 (MPI). [Paper]
  • EWA: "Experts Weights Averaging: A New General Training Scheme for Vision Transformers", arXiv, 2023 (Fudan). [Paper]
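
A recurring baseline behind the attack and defense papers above is the L-inf PGD adversary: take repeated signed-gradient steps on the loss and project back into a small ball around the clean image. Here is a minimal, generic sketch; the model and data names are placeholders, and this is the textbook attack rather than any listed paper's specific method.

```python
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=4/255, alpha=1/255, steps=10):
    """L-inf PGD: ascend the loss within an eps-ball around x (x in [0, 1])."""
    x_adv = x + torch.empty_like(x).uniform_(-eps, eps)   # random start
    for _ in range(steps):
        x_adv = x_adv.detach().requires_grad_(True)
        loss = F.cross_entropy(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        x_adv = x_adv + alpha * grad.sign()                # signed gradient step
        x_adv = x + (x_adv - x).clamp(-eps, eps)           # project into the ball
        x_adv = x_adv.clamp(0, 1)                          # keep valid pixel range
    return x_adv.detach()

# usage (assuming `vit` is any classifier and (x, y) an image batch in [0, 1]):
# robust_acc = (vit(pgd_attack(vit, x, y)).argmax(1) == y).float().mean()
```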

Model Compression + Transformer

  • ViT-quant: "Post-Training Quantization for Vision Transformer", NeurIPS, 2021 (Huawei). [Paper]
  • VTP: "Visual Transformer Pruning", arXiv, 2021 (Huawei). [Paper]
  • MD-ViT: "Multi-Dimensional Model Compression of Vision Transformer", arXiv, 2021 (Princeton). [Paper]
  • FQ-ViT: "FQ-ViT: Fully Quantized Vision Transformer without Retraining", arXiv, 2021 (Megvii). [Paper][PyTorch]
  • UVC: "Unified Visual Transformer Compression", ICLR, 2022 (UT Austin). [Paper][PyTorch]
  • MiniViT: "MiniViT: Compressing Vision Transformers with Weight Multiplexing", CVPR, 2022 (Microsoft). [Paper][PyTorch]
  • Auto-ViT-Acc: "Auto-ViT-Acc: An FPGA-Aware Automatic Acceleration Framework for Vision Transformer with Mixed-Scheme Quantization", International Conference on Field Programmable Logic and Applications (FPL), 2022 (Northeastern University). [Paper]
  • APQ-ViT: "Towards Accurate Post-Training Quantization for Vision Transformer", ACMMM, 2022 (Beihang University). [Paper]
  • SPViT: "SPViT: Enabling Faster Vision Transformers via Soft Token Pruning", ECCV, 2022 (Northeastern University). [Paper][PyTorch]
  • PSAQ-ViT: "Patch Similarity Aware Data-Free Quantization for Vision Transformers", ECCV, 2022 (CAS). [Paper][PyTorch]
  • PTQ4ViT: "PTQ4ViT: Post-Training Quantization Framework for Vision Transformers", ECCV, 2022 (Peking University). [Paper]
  • EAPruning: "EAPruning: Evolutionary Pruning for Vision Transformers and CNNs", BMVC, 2022 (Meituan). [Paper]
  • Q-ViT: "Q-ViT: Accurate and Fully Quantized Low-bit Vision Transformer", NeurIPS, 2022 (Beihang University). [Paper][PyTorch]
  • SAViT: "SAViT: Structure-Aware Vision Transformer Pruning via Collaborative Optimization", NeurIPS, 2022 (Hikvision). [Paper]
  • VTC-LFC: "VTC-LFC: Vision Transformer Compression with Low-Frequency Components", NeurIPS, 2022 (Alibaba). [Paper][PyTorch]
  • Q-ViT: "Q-ViT: Fully Differentiable Quantization for Vision Transformer", arXiv, 2022 (Megvii). [Paper]
  • VAQF: "VAQF: Fully Automatic Software-Hardware Co-Design Framework for Low-Bit Vision Transformer", arXiv, 2022 (Northeastern University). [Paper]
  • VTP: "Vision Transformer Compression with Structured Pruning and Low Rank Approximation", arXiv, 2022 (UCLA). [Paper]
  • SiDT: "Searching Intrinsic Dimensions of Vision Transformers", arXiv, 2022 (UC Irvine). [Paper]
  • I-ViT: "I-ViT: Integer-only Quantization for Efficient Vision Transformer Inference", arXiv, 2022 (CAS). [Paper]
  • PSAQ-ViT-V2: "PSAQ-ViT V2: Towards Accurate and General Data-Free Quantization for Vision Transformers", arXiv, 2022 (CAS). [Paper][PyTorch]
  • AS: "Adaptive Sparse ViT: Towards Learnable Adaptive Token Pruning by Fully Exploiting Self-Attention", arXiv, 2022 (Baidu). [Paper]
  • SaiT: "SaiT: Sparse Vision Transformers through Adaptive Token Pruning", arXiv, 2022 (Samsung). [Paper]
  • oViT: "oViT: An Accurate Second-Order Pruning Framework for Vision Transformers", arXiv, 2022 (IST Austria). [Paper]
  • BiViT: "BiViT: Extremely Compressed Binary Vision Transformer", arXiv, 2022 (Zhejiang University). [Paper]
  • CPT-V: "CPT-V: A Contrastive Approach to Post-Training Quantization of Vision Transformers", arXiv, 2022 (UT Austin). [Paper]
  • TPS: "Joint Token Pruning and Squeezing Towards More Aggressive Compression of Vision Transformers", CVPR, 2023 (Megvii). [Paper][PyTorch]
  • GPUSQ-ViT: "Boost Vision Transformer with GPU-Friendly Sparsity and Quantization", CVPR, 2023 (Fudan). [Paper]
  • X-Pruner: "X-Pruner: eXplainable Pruning for Vision Transformers", CVPR, 2023 (James Cook University, Australia). [Paper][PyTorch (in construction)]
  • NoisyQuant: "NoisyQuant: Noisy Bias-Enhanced Post-Training Activation Quantization for Vision Transformers", CVPR, 2023 (Nanjing University). [Paper]
  • NViT: "Global Vision Transformer Pruning with Hessian-Aware Saliency", CVPR, 2023 (NVIDIA). [Paper]
  • BinaryViT: "BinaryViT: Pushing Binary Vision Transformers Towards Convolutional Models", CVPRW, 2023 (Huawei). [Paper][PyTorch]
  • OFQ: "Oscillation-free Quantization for Low-bit Vision Transformers", ICML, 2023 (HKUST). [Paper][PyTorch]
  • UPop: "UPop: Unified and Progressive Pruning for Compressing Vision-Language Transformers", ICML, 2023 (Shanghai AI Lab). [Paper][PyTorch]
  • COMCAT: "COMCAT: Towards Efficient Compression and Customization of Attention-Based Vision Models", ICML, 2023 (Rutgers). [Paper][PyTorch]
  • Evol-Q: "Jumping through Local Minima: Quantization in the Loss Landscape of Vision Transformers", ICCV, 2023 (UT Austin). [Paper]
  • Q-HyViT: "Q-HyViT: Post-Training Quantization for Hybrid Vision Transformer with Bridge Block Reconstruction", arXiv, 2023 (Electronics and Telecommunications Research Institute (ETRI), Korea). [Paper]
  • Bi-ViT: "Bi-ViT: Pushing the Limit of Vision Transformer Quantization", arXiv, 2023 (Beihang University). [Paper]
  • BinaryViT: "BinaryViT: Towards Efficient and Accurate Binary Vision Transformers", arXiv, 2023 (CAS). [Paper]
  • Zero-TP: "Zero-TPrune: Zero-Shot Token Pruning through Leveraging of the Attention Graph in Pre-Trained Transformers", arXiv, 2023 (Princeton). [Paper]
  • ?: "Quantizable Transformers: Removing Outliers by Helping Attention Heads Do Nothing", arXiv, 2023 (Qualcomm). [Paper]
  • VVTQ: "Variation-aware Vision Transformer Quantization", arXiv, 2023 (HKUST). [Paper][PyTorch]
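
The post-training quantization (PTQ) papers above all start from the same primitive: map float weights or activations onto a small integer grid with a calibrated scale. A minimal sketch of symmetric uniform per-tensor quantization follows; the function names and the max-abs calibration are illustrative, and real PTQ methods refine the scale per layer or per channel.

```python
import torch

def quantize_symmetric(w: torch.Tensor, n_bits: int = 8):
    """Round float weights to signed n-bit integers with one per-tensor scale."""
    qmax = 2 ** (n_bits - 1) - 1              # e.g. 127 for 8-bit
    scale = w.abs().max() / qmax              # max-abs calibration
    w_int = torch.clamp(torch.round(w / scale), -qmax - 1, qmax)
    return w_int.to(torch.int8), scale

def dequantize(w_int: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return w_int.float() * scale              # approximate original weights

w = torch.randn(768, 768)                     # a ViT-sized weight matrix
w_int, s = quantize_symmetric(w)
print((w - dequantize(w_int, s)).abs().mean())  # quantization error to monitor
```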

[Back to Overview]

Attention-Free

MLP-Series

  • RepMLP: "RepMLP: Re-parameterizing Convolutions into Fully-connected Layers for Image Recognition", arXiv, 2021 (Megvii). [Paper][PyTorch]
  • EAMLP: "Beyond Self-attention: External Attention using Two Linear Layers for Visual Tasks", arXiv, 2021 (Tsinghua University). [Paper]
  • Forward-Only: "Do You Even Need Attention? A Stack of Feed-Forward Layers Does Surprisingly Well on ImageNet", arXiv, 2021 (Oxford). [Paper][PyTorch]
  • ResMLP: "ResMLP: Feedforward networks for image classification with data-efficient training", arXiv, 2021 (Facebook). [Paper]
  • ?: "Can Attention Enable MLPs To Catch Up With CNNs?", arXiv, 2021 (Tsinghua). [Paper]
  • ViP: "Vision Permutator: A Permutable MLP-Like Architecture for Visual Recognition", arXiv, 2021 (NUS, Singapore). [Paper][PyTorch]
  • CCS: "Rethinking Token-Mixing MLP for MLP-based Vision Backbone", arXiv, 2021 (Baidu). [Paper]
  • S2-MLPv2: "S2-MLPv2: Improved Spatial-Shift MLP Architecture for Vision", arXiv, 2021 (Baidu). [Paper]
  • RaftMLP: "RaftMLP: Do MLP-based Models Dream of Winning Over Computer Vision?", arXiv, 2021 (Rikkyo University, Japan). [Paper][PyTorch]
  • Hire-MLP: "Hire-MLP: Vision MLP via Hierarchical Rearrangement", arXiv, 2021 (Huawei). [Paper]
  • Sparse-MLP: "Sparse-MLP: A Fully-MLP Architecture with Conditional Computation", arXiv, 2021 (NUS). [Paper]
  • ConvMLP: "ConvMLP: Hierarchical Convolutional MLPs for Vision", arXiv, 2021 (University of Oregon). [Paper][PyTorch]
  • sMLP: "Sparse MLP for Image Recognition: Is Self-Attention Really Necessary?", arXiv, 2021 (Microsoft). [Paper]
  • MLP-Mixer: "MLP-Mixer: An all-MLP Architecture for Vision", NeurIPS, 2021 (Google). [Paper][Tensorflow][PyTorch-1 (lucidrains)][PyTorch-2 (rishikksh20)]
  • gMLP: "Pay Attention to MLPs", NeurIPS, 2021 (Google). [Paper][PyTorch (antonyvigouret)]
  • S2-MLP: "S2-MLP: Spatial-Shift MLP Architecture for Vision", WACV, 2022 (Baidu). [Paper]
  • CycleMLP: "CycleMLP: A MLP-like Architecture for Dense Prediction", ICLR, 2022 (HKU). [Paper][PyTorch]
  • AS-MLP: "AS-MLP: An Axial Shifted MLP Architecture for Vision", ICLR, 2022 (ShanghaiTech University). [Paper][PyTorch]
  • Wave-MLP: "An Image Patch is a Wave: Quantum Inspired Vision MLP", CVPR, 2022 (Huawei). [Paper][PyTorch]
  • DynaMixer: "DynaMixer: A Vision MLP Architecture with Dynamic Mixing", ICML, 2022 (Tencent). [Paper][PyTorch]
  • STD: "Spatial-Channel Token Distillation for Vision MLPs", ICML, 2022 (Huawei). [Paper]
  • AMixer: "AMixer: Adaptive Weight Mixing for Self-Attention Free Vision Transformers", ECCV, 2022 (Tsinghua University). [Paper]
  • MS-MLP: "Mixing and Shifting: Exploiting Global and Local Dependencies in Vision MLPs", arXiv, 2022 (Microsoft). [Paper]
  • ActiveMLP: "ActiveMLP: An MLP-like Architecture with Active Token Mixer", arXiv, 2022 (Microsoft). [Paper]
  • MDMLP: "MDMLP: Image Classification from Scratch on Small Datasets with MLP", arXiv, 2022 (Jiangsu University). [Paper][PyTorch]
  • PosMLP: "Parameterization of Cross-Token Relations with Relative Positional Encoding for Vision MLP", arXiv, 2022 (University of Science and Technology of China). [Paper][PyTorch]
  • SplitMixer: "SplitMixer: Fat Trimmed From MLP-like Models", arXiv, 2022 (Quintic AI, California). [Paper][PyTorch]
  • gSwin: "gSwin: Gated MLP Vision Model with Hierarchical Structure of Shifted Window", arXiv, 2022 (PKSHA Technology, Japan). [Paper]
  • ?: "Analysis of Quantization on MLP-based Vision Models", arXiv, 2022 (Berkeley). [Paper]
  • AFFNet: "Adaptive Frequency Filters As Efficient Global Token Mixers", ICCV, 2023 (Microsoft). [Paper]
  • Strip-MLP: "Strip-MLP: Efficient Token Interaction for Vision MLP", arXiv, 2023 (Southern University of Science and Technology). [Paper][Code (in construction)]
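
The MLP models above differ mainly in how they mix information across tokens; the canonical block (from MLP-Mixer) alternates a token-mixing MLP over the patch axis with a channel-mixing MLP over the feature axis. A minimal sketch with illustrative dimensions:

```python
import torch
import torch.nn as nn

class MixerBlock(nn.Module):
    """One MLP-Mixer block: token-mixing MLP, then channel-mixing MLP."""
    def __init__(self, num_tokens=196, dim=512, token_hidden=256, chan_hidden=2048):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.token_mlp = nn.Sequential(          # mixes across the token axis
            nn.Linear(num_tokens, token_hidden), nn.GELU(),
            nn.Linear(token_hidden, num_tokens))
        self.norm2 = nn.LayerNorm(dim)
        self.chan_mlp = nn.Sequential(           # mixes across the channel axis
            nn.Linear(dim, chan_hidden), nn.GELU(),
            nn.Linear(chan_hidden, dim))

    def forward(self, x):                        # x: (B, tokens, dim)
        x = x + self.token_mlp(self.norm1(x).transpose(1, 2)).transpose(1, 2)
        x = x + self.chan_mlp(self.norm2(x))
        return x

print(MixerBlock()(torch.randn(2, 196, 512)).shape)  # torch.Size([2, 196, 512])
```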

Other Attention-Free

  • DWNet: "On the Connection between Local Attention and Dynamic Depth-wise Convolution", ICLR, 2022 (Nankai University). [Paper][PyTorch]
  • PoolFormer: "MetaFormer is Actually What You Need for Vision", CVPR, 2022 (Sea AI Lab). [Paper][PyTorch]
  • ConvNext: "A ConvNet for the 2020s", CVPR, 2022 (Facebook). [Paper][PyTorch]
  • RepLKNet: "Scaling Up Your Kernels to 31x31: Revisiting Large Kernel Design in CNNs", CVPR, 2022 (Megvii). [Paper][MegEngine][PyTorch]
  • FocalNet: "Focal Modulation Networks", NeurIPS, 2022 (Microsoft). [Paper][PyTorch]
  • HorNet: "HorNet: Efficient High-Order Spatial Interactions with Recursive Gated Convolutions", NeurIPS, 2022 (Tsinghua). [Paper][PyTorch][Website]
  • Sequencer: "Sequencer: Deep LSTM for Image Classification", arXiv, 2022 (Rikkyo University, Japan). [Paper]
  • MogaNet: "Efficient Multi-order Gated Aggregation Network", arXiv, 2022 (Westlake University, China). [Paper]
  • Conv2Former: "Conv2Former: A Simple Transformer-Style ConvNet for Visual Recognition", arXiv, 2022 (ByteDance). [Paper]
  • CoC: "Image as Set of Points", ICLR, 2023 (Northeastern). [Paper][PyTorch]
  • SLaK: "More ConvNets in the 2020s: Scaling up Kernels Beyond 51x51 using Sparsity", ICLR, 2023 (UT Austin). [Paper][PyTorch]
  • ConvNeXt-V2: "ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders", CVPR, 2023 (Meta). [Paper][PyTorch]
  • SPANet: "SPANet: Frequency-balancing Token Mixer using Spectral Pooling Aggregation Modulation", ICCV, 2023 (Korea Institute of Science and Technology). [Paper][Website]
  • DFFormer: "FFT-based Dynamic Token Mixer for Vision", arXiv, 2023 (Rikkyo University, Japan). [Paper][Code (in construction)]
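
PoolFormer's "MetaFormer" argument is that the transformer macro design (norm, token mixer, residual, channel MLP) matters more than attention itself: even plain average pooling works as the mixer. A minimal sketch of such a block on a (B, C, H, W) feature map, with illustrative dimensions:

```python
import torch
import torch.nn as nn

class PoolMixerBlock(nn.Module):
    """MetaFormer block with average pooling as the attention-free token mixer."""
    def __init__(self, dim=64, pool_size=3):
        super().__init__()
        self.norm1 = nn.GroupNorm(1, dim)        # layer norm over channels
        self.pool = nn.AvgPool2d(pool_size, stride=1, padding=pool_size // 2)
        self.norm2 = nn.GroupNorm(1, dim)
        self.mlp = nn.Sequential(nn.Conv2d(dim, 4 * dim, 1), nn.GELU(),
                                 nn.Conv2d(4 * dim, dim, 1))

    def forward(self, x):                        # x: (B, C, H, W)
        y = self.norm1(x)
        x = x + (self.pool(y) - y)               # pooling mixes neighboring tokens
        x = x + self.mlp(self.norm2(x))          # channel MLP as 1x1 convolutions
        return x

print(PoolMixerBlock()(torch.randn(2, 64, 14, 14)).shape)
```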

[Back to Overview]

Analysis for Transformer

  • Attention-CNN: "On the Relationship between Self-Attention and Convolutional Layers", ICLR, 2020 (EPFL). [Paper][PyTorch][Website]
  • Transformer-Explainability: "Transformer Interpretability Beyond Attention Visualization", CVPR, 2021 (Tel Aviv). [Paper][PyTorch]
  • ?: "Are Convolutional Neural Networks or Transformers more like human vision?", CogSci, 2021 (Princeton). [Paper]
  • ?: "ConvNets vs. Transformers: Whose Visual Representations are More Transferable?", ICCVW, 2021 (HKU). [Paper]
  • ?: "Do Vision Transformers See Like Convolutional Neural Networks?", NeurIPS, 2021 (Google). [Paper]
  • ?: "Intriguing Properties of Vision Transformers", NeurIPS, 2021 (MBZUAI). [Paper][PyTorch]
  • FoveaTer: "FoveaTer: Foveated Transformer for Image Classification", arXiv, 2021 (UCSB). [Paper]
  • ?: "Demystifying Local Vision Transformer: Sparse Connectivity, Weight Sharing, and Dynamic Weight", arXiv, 2021 (Microsoft). [Paper]
  • ?: "Revisiting the Calibration of Modern Neural Networks", arXiv, 2021 (Google). [Paper]
  • ?: "What Makes for Hierarchical Vision Transformer?", arXiv, 2021 (Horizon Robotic). [Paper]
  • ?: "Visualizing Paired Image Similarity in Transformer Networks", WACV, 2022 (Temple University). [Paper][PyTorch]
  • FDSL: "Can Vision Transformers Learn without Natural Images?", AAAI, 2022 (AIST). [Paper][PyTorch][Website]
  • AlterNet: "How Do Vision Transformers Work?", ICLR, 2022 (Yonsei University). [Paper][PyTorch]
  • ?: "When Vision Transformers Outperform ResNets without Pretraining or Strong Data Augmentations", ICLR, 2022 (Google). [Paper][Tensorflow]
  • ?: "Unraveling Attention via Convex Duality: Analysis and Interpretations of Vision Transformers", ICML, 2022 (Stanford). [Paper]
  • ?: "Three things everyone should know about Vision Transformers", ECCV, 2022 (Meta). [Paper]
  • ?: "Vision Transformers provably learn spatial structure", NeurIPS, 2022 (Princeton). [Paper]
  • AWD-ViT: "Visualizing and Understanding Patch Interactions in Vision Transformer", arXiv, 2022 (JD). [Paper]
  • ?: "CNNs and Transformers Perceive Hybrid Images Similar to Humans", arXiv, 2022 (Quintic AI, CA). [Paper][Code]
  • MJP: "Masked Jigsaw Puzzle: A Versatile Position Embedding for Vision Transformers", CVPR, 2023 (Tencent). [Paper][PyTorch]
  • ?: "A Unified and Biologically-Plausible Relational Graph Representation of Vision Transformers", arXiv, 2022 (University of Electronic Science and Technology of China). [Paper]
  • ?: "How Well Do Vision Transformers (VTs) Transfer To The Non-Natural Image Domain? An Empirical Study Involving Art Classification", arXiv, 2022 (University of Groningen, The Netherlands). [Paper]
  • ?: "Transformer Vs. MLP-Mixer Exponential Expressive Gap For NLP Problems", arXiv, 2022 (Technion - Israel Institute of Technology). [Paper]
  • ProtoPFormer: "ProtoPFormer: Concentrating on Prototypical Parts in Vision Transformers for Interpretable Image Recognition", arXiv, 2022 (Zhejiang University). [Paper][PyTorch]
  • ICLIP: "Exploring Visual Interpretability for Contrastive Language-Image Pre-training", arXiv, 2022 (HKUST). [Paper][Code (in construction)]
  • ?: "Large Models are Parsimonious Learners: Activation Sparsity in Trained Transformers", arXiv, 2022 (Google). [Paper]
  • ?: "Vision Transformer Visualization: What Neurons Tell and How Neurons Behave?", arXiv, 2022 (Monash University). [Paper][PyTorch]
  • ViT-CX: "ViT-CX: Causal Explanation of Vision Transformers", arXiv, 2022 (HKUST). [Paper]
  • ?: "Demystify Self-Attention in Vision Transformers from a Semantic Perspective: Analysis and Application", arXiv, 2022 (The Hong Kong Polytechnic University). [Paper]
  • IAV: "Explanation on Pretraining Bias of Finetuned Vision Transformer", arXiv, 2022 (KAIST). [Paper]
  • ViT-Shapley: "Learning to Estimate Shapley Values with Vision Transformers", ICLR, 2023 (UW). [Paper][PyTorch]
  • ImageNet-X: "ImageNet-X: Understanding Model Mistakes with Factor of Variation Annotations", ICLR, 2023 (Meta). [Paper]
  • ?: "A Theoretical Understanding of Vision Transformers: Learning, Generalization, and Sample Complexity", ICLR, 2023 (Rensselaer Polytechnic Institute, NY). [Paper]
  • ?: "What Do Self-Supervised Vision Transformers Learn?", ICLR, 2023 (NAVER). [Paper][PyTorch (in construction)]
  • ?: "When and why Vision-Language Models behave like Bags-of-Words, and what to do about it?", ICLR, 2023 (Stanford). [Paper]
  • CLIP-Dissect: "CLIP-Dissect: Automatic Description of Neuron Representations in Deep Vision Networks", ICLR, 2023 (UCSD). [Paper]
  • ?: "Understanding Masked Autoencoders via Hierarchical Latent Variable Models", CVPR, 2023 (CMU). [Paper]
  • ?: "Teaching Matters: Investigating the Role of Supervision in Vision Transformers", CVPR, 2023 (Maryland). [Paper][PyTorch][Website]
  • ?: "Masked Autoencoding Does Not Help Natural Language Supervision at Scale", CVPR, 2023 (Apple). [Paper]
  • ?: "On Data Scaling in Masked Image Modeling", CVPR, 2023 (Microsoft). [Paper][PyTorch]
  • ?: "Revealing the Dark Secrets of Masked Image Modeling", CVPR, 2023 (Microsoft). [Paper]
  • Vision-DiffMask: "VISION DIFFMASK: Faithful Interpretation of Vision Transformers with Differentiable Patch Masking", CVPRW, 2023 (University of Amsterdam). [Paper][PyTorch]
  • AttentionViz: "AttentionViz: A Global View of Transformer Attention", arXiv, 2023 (Harvard). [Paper][Website]
  • ?: "Understanding Gaussian Attention Bias of Vision Transformers Using Effective Receptive Fields", arXiv, 2023 (POSTECH). [Paper]
  • ?: "Reviving Shift Equivariance in Vision Transformers", arXiv, 2023 (Maryland). [Paper]
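
A tool that several of the analysis and interpretability papers above build on is attention rollout (Abnar & Zuidema, 2020): compose head-averaged, residual-adjusted attention maps across layers to estimate how much each patch ultimately feeds the [CLS] token. A minimal sketch, assuming you have already hooked out per-layer attention tensors and that token 0 is [CLS]:

```python
import torch

def attention_rollout(attentions):
    """attentions: list of (B, heads, N, N) attention maps, one per layer."""
    B, _, N, _ = attentions[0].shape
    rollout = torch.eye(N).expand(B, N, N).clone()
    for attn in attentions:                       # compose from input to output
        a = attn.mean(dim=1)                      # average over heads
        a = a + torch.eye(N, device=a.device)     # add the residual connection
        a = a / a.sum(dim=-1, keepdim=True)       # re-normalize rows
        rollout = a @ rollout
    return rollout[:, 0, 1:]                      # [CLS] attention to each patch
```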

[Back to Overview]

Detection

Object Detection

  • General:
    • detrex: "detrex: Benchmarking Detection Transformers", arXiv, 2023 (IDEA). [Paper][PyTorch]
  • CNN-based backbone:
    • DETR: "End-to-End Object Detection with Transformers", ECCV, 2020 (Facebook). [Paper][PyTorch]
    • Deformable DETR: "Deformable DETR: Deformable Transformers for End-to-End Object Detection", ICLR, 2021 (SenseTime). [Paper][PyTorch]
    • UP-DETR: "UP-DETR: Unsupervised Pre-training for Object Detection with Transformers", CVPR, 2021 (Tencent). [Paper][PyTorch]
    • SMCA: "Fast Convergence of DETR with Spatially Modulated Co-Attention", ICCV, 2021 (CUHK). [Paper][PyTorch]
    • Conditional-DETR: "Conditional DETR for Fast Training Convergence", ICCV, 2021 (Microsoft). [Paper]
    • PnP-DETR: "PnP-DETR: Towards Efficient Visual Analysis with Transformers", ICCV, 2021 (Yitu). [Paper][Code (in construction)]
    • TSP: "Rethinking Transformer-based Set Prediction for Object Detection", ICCV, 2021 (CMU). [Paper]
    • Dynamic-DETR: "Dynamic DETR: End-to-End Object Detection With Dynamic Attention", ICCV, 2021 (Microsoft). [Paper]
    • ViT-YOLO: "ViT-YOLO:Transformer-Based YOLO for Object Detection", ICCVW, 2021 (Xidian University). [Paper]
    • ACT: "End-to-End Object Detection with Adaptive Clustering Transformer", BMVC, 2021 (Peking + CUHK). [Paper][PyTorch]
    • DIL-ViT: "Paying Attention to Varying Receptive Fields: Object Detection with Atrous Filters and Vision Transformers", BMVC, 2021 (Monash University Malaysia). [Paper]
    • Efficient-DETR: "Efficient DETR: Improving End-to-End Object Detector with Dense Prior", arXiv, 2021 (Megvii). [Paper]
    • CA-FPN: "Content-Augmented Feature Pyramid Network with Light Linear Transformers", arXiv, 2021 (CAS). [Paper]
    • DETReg: "DETReg: Unsupervised Pretraining with Region Priors for Object Detection", arXiv, 2021 (Tel-Aviv + Berkeley). [Paper][Website]
    • GQPos: "Guiding Query Position and Performing Similar Attention for Transformer-Based Detection Heads", arXiv, 2021 (Megvii). [Paper]
    • Anchor-DETR: "Anchor DETR: Query Design for Transformer-Based Detector", AAAI, 2022 (Megvii). [Paper][PyTorch]
    • Sparse-DETR: "Sparse DETR: Efficient End-to-End Object Detection with Learnable Sparsity", ICLR, 2022 (Kakao). [Paper][PyTorch]
    • DAB-DETR: "DAB-DETR: Dynamic Anchor Boxes are Better Queries for DETR", ICLR, 2022 (IDEA, China). [Paper][PyTorch]
    • DN-DETR: "DN-DETR: Accelerate DETR Training by Introducing Query DeNoising", CVPR, 2022 (International Digital Economy Academy (IDEA), China). [Paper][PyTorch]
    • SAM-DETR: "Accelerating DETR Convergence via Semantic-Aligned Matching", CVPR, 2022 (NTU, Singapore). [Paper][PyTorch]
    • AdaMixer: "AdaMixer: A Fast-Converging Query-Based Object Detector", CVPR, 2022 (Nanjing University). [Paper][Code (in construction)]
    • DESTR: "DESTR: Object Detection With Split Transformer", CVPR, 2022 (Oregon State). [Paper]
    • REGO: "Recurrent Glimpse-based Decoder for Detection with Transformer", CVPR, 2022 (The University of Sydney). [Paper][PyTorch]
    • ?: "Training Object Detectors From Scratch: An Empirical Study in the Era of Vision Transformer", CVPR, 2022 (Ant Group). [Paper]
    • DE-DETR: "Towards Data-Efficient Detection Transformers", ECCV, 2022 (JD). [Paper][PyTorch]
    • DFFT: "Efficient Decoder-free Object Detection with Transformers", ECCV, 2022 (Tencent). [Paper]
    • Cornerformer: "Cornerformer: Purifying Instances for Corner-Based Detectors", ECCV, 2022 (Huawei). [Paper]
    • ?: "A Simple Approach and Benchmark for 21,000-Category Object Detection", ECCV, 2022 (Microsoft). [Paper][Code (in construction)]
    • Obj2Seq: "Obj2Seq: Formatting Objects as Sequences with Class Prompt for Visual Tasks", NeurIPS, 2022 (CAS). [Paper][PyTorch]
    • KA: "Knowledge Amalgamation for Object Detection with Transformers", arXiv, 2022 (Zhejiang University). [Paper]
    • MIMDet: "Unleashing Vanilla Vision Transformer with Masked Image Modeling for Object Detection", arXiv, 2022 (Tencent). [Paper][PyTorch]
    • imTED: "Integral Migrating Pre-trained Transformer Encoder-decoders for Visual Object Detection", arXiv, 2022 (CAS). [Paper]
    • TCC: "Transformer-based Context Condensation for Boosting Feature Pyramids in Object Detection", arXiv, 2022 (The University of Sydney). [Paper]
    • Conditional-DETR-V2: "Conditional DETR V2: Efficient Detection Transformer with Box Queries", arXiv, 2022 (Peking University). [Paper]
    • Group-DETR: "Group DETR: Fast Training Convergence with Decoupled One-to-Many Label Assignment", arXiv, 2022 (Baidu). [Paper]
    • SAM-DETR++: "Semantic-Aligned Matching for Enhanced DETR Convergence and Multi-Scale Feature Fusion", arXiv, 2022 (NTU, Singapore). [Paper][PyTorch]
    • ComplETR: "ComplETR: Reducing the cost of annotations for object detection in dense scenes with vision transformers", arXiv, 2022 (Amazon). [Paper]
    • Pair-DETR: "Pair DETR: Contrastive Learning Speeds Up DETR Training", arXiv, 2022 (Amazon). [Paper]
    • Group-DETR-v2: "Group DETR v2: Strong Object Detector with Encoder-Decoder Pretraining", arXiv, 2022 (Baidu). [Paper]
    • KD-DETR: "Knowledge Distillation for Detection Transformer with Consistent Distillation Points Sampling", arXiv, 2022 (Baidu). [Paper]
    • D3ETR: "D3ETR: Decoder Distillation for Detection Transformer", arXiv, 2022 (Peking University). [Paper]
    • DETRDistill: "DETRDistill: A Universal Knowledge Distillation Framework for DETR-families", arXiv, 2022 (USTC). [Paper]
    • Teach-DETR: "Teach-DETR: Better Training DETR with Teachers", arXiv, 2022 (CUHK). [Paper][Code (in construction)]
    • Co-DETR: "DETRs with Collaborative Hybrid Assignments Training", arXiv, 2022 (SenseTime). [Paper][Code (in construction)]
    • DETA: "NMS Strikes Back", arXiv, 2022 (UT Austin). [Paper][PyTorch]
    • ViT-Adapter: "ViT-Adapter: Exploring Plain Vision Transformer for Accurate Dense Predictions", ICLR, 2023 (Shanghai AI Lab). [Paper][PyTorch]
    • DDQ: "Dense Distinct Query for End-to-End Object Detection", CVPR, 2023 (Shanghai AI Lab). [Paper][PyTorch]
    • SiameseDETR: "Siamese DETR", CVPR, 2023 (SenseTime). [Paper][PyTorch]
    • SAP-DETR: "SAP-DETR: Bridging the Gap Between Salient Points and Queries-Based Transformer Detector for Fast Model Convergency", CVPR, 2023 (CAS). [Paper]
    • Q-DETR: "Q-DETR: An Efficient Low-Bit Quantized Detection Transformer", CVPR, 2023 (Beihang University). [Paper][Code (in construction)]
    • Lite-DETR: "Lite DETR: An Interleaved Multi-Scale Encoder for Efficient DETR", CVPR, 2023 (IDEA). [Paper][PyTorch]
    • H-DETR: "DETRs with Hybrid Matching", CVPR, 2023 (Microsoft). [Paper][PyTorch]
    • MaskDINO: "Mask DINO: Towards A Unified Transformer-based Framework for Object Detection and Segmentation", CVPR, 2023 (IDEA, China). [Paper][PyTorch]
    • IMFA: "Towards Efficient Use of Multi-Scale Features in Transformer-Based Object Detectors", CVPR, 2023 (NTU, Singapore). [Paper][Code (in construction)]
    • SQR: "Enhanced Training of Query-Based Object Detection via Selective Query Recollection", CVPR, 2023 (CMU). [Paper][PyTorch]
    • DQ-Det: "Learning Dynamic Query Combinations for Transformer-based Object Detection and Segmentation", ICML, 2023 (ByteDance). [Paper]
    • SpeedDETR: "SpeedDETR: Speed-aware Transformers for End-to-end Object Detection", ICML, 2023 (Northeastern University). [Paper]
    • AlignDet: "AlignDet: Aligning Pre-training and Fine-tuning in Object Detection", ICCV, 2023 (ByteDance). [Paper][PyTorch][Website]
    • Focus-DETR: "Less is More: Focus Attention for Efficient DETR", ICCV, 2023 (Huawei). [Paper][PyTorch][MindSpore]
    • Plain-DETR: "DETR Doesn't Need Multi-Scale or Locality Design", ICCV, 2023 (Microsoft). [Paper][Code (in construction)]
    • ASAG: "ASAG: Building Strong One-Decoder-Layer Sparse Detectors via Adaptive Sparse Anchor Generation", ICCV, 2023 (Sun Yat-sen University). [Paper][PyTorch]
    • KS-DETR: "KS-DETR: Knowledge Sharing in Attention Learning for Detection Transformer", arXiv, 2023 (Toyota Technological Institute). [Paper][PyTorch]
    • FeatAug-DETR: "FeatAug-DETR: Enriching One-to-Many Matching for DETRs with Feature Augmentation", arXiv, 2023 (CUHK). [Paper][Code (in construction)]
    • Stable-DINO: "Detection Transformer with Stable Matching", arXiv, 2023 (IDEA). [Paper][Code (in construction)]
    • RT-DETR: "DETRs Beat YOLOs on Real-time Object Detection", arXiv, 2023 (Baidu). [Paper]
    • Align-DETR: "Align-DETR: Improving DETR with Simple IoU-aware BCE loss", arXiv, 2023 (Megvii). [Paper][PyTorch]
    • Box-DETR: "Box-DETR: Understanding and Boxing Conditional Spatial Queries", arXiv, 2023 (Huazhong University of Science and Technology). [Paper][PyTorch (in construction)]
    • RefineBox: "Enhancing Your Trained DETRs with Box Refinement", arXiv, 2023 (CAS). [Paper][Code (in construction)]
    • ?: "Revisiting DETR Pre-training for Object Detection", arXiv, 2023 (Toronto). [Paper]
  • Transformer-based backbone:
    • ViT-FRCNN: "Toward Transformer-Based Object Detection", arXiv, 2020 (Pinterest). [Paper]
    • WB-DETR: "WB-DETR: Transformer-Based Detector Without Backbone", ICCV, 2021 (CAS). [Paper]
    • YOLOS: "You Only Look at One Sequence: Rethinking Transformer in Vision through Object Detection", NeurIPS, 2021 (Horizon Robotics). [Paper][PyTorch]
    • ?: "Benchmarking Detection Transfer Learning with Vision Transformers", arXiv, 2021 (Facebook). [Paper]
    • ViDT: "ViDT: An Efficient and Effective Fully Transformer-based Object Detector", ICLR, 2022 (NAVER). [Paper][PyTorch]
    • FP-DETR: "FP-DETR: Detection Transformer Advanced by Fully Pre-training", ICLR, 2022 (USTC). [Paper]
    • DETR++: "DETR++: Taming Your Multi-Scale Detection Transformer", CVPRW, 2022 (Google). [Paper]
    • ViTDet: "Exploring Plain Vision Transformer Backbones for Object Detection", ECCV, 2022 (Meta). [Paper]
    • UViT: "A Simple Single-Scale Vision Transformer for Object Detection and Instance Segmentation", ECCV, 2022 (Google). [Paper]
    • CFDT: "A Transformer-Based Object Detector with Coarse-Fine Crossing Representations", NeurIPS, 2022 (Huawei). [Paper]
    • D2ETR: "D2ETR: Decoder-Only DETR with Computationally Efficient Cross-Scale Attention", arXiv, 2022 (Alibaba). [Paper]
    • DINO: "DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection", ICLR, 2023 (IDEA, China). [Paper][PyTorch]
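
What unifies the DETR family above is set prediction: a fixed set of object queries is matched one-to-one to ground-truth boxes with the Hungarian algorithm before any loss is computed. A minimal sketch of the matching step follows; the cost weights are simplified (real DETR also adds a generalized-IoU term) and the tensor names are illustrative.

```python
import torch
from scipy.optimize import linear_sum_assignment

def hungarian_match(pred_logits, pred_boxes, gt_labels, gt_boxes):
    """pred_logits: (Q, C), pred_boxes: (Q, 4); gt_labels: (G,) long, gt_boxes: (G, 4)."""
    prob = pred_logits.softmax(-1)                      # (Q, C)
    cost_cls = -prob[:, gt_labels]                      # (Q, G) classification cost
    cost_box = torch.cdist(pred_boxes, gt_boxes, p=1)   # (Q, G) L1 box cost
    cost = cost_cls + 5.0 * cost_box                    # weighted total cost
    q_idx, g_idx = linear_sum_assignment(cost.detach().cpu().numpy())
    return torch.as_tensor(q_idx), torch.as_tensor(g_idx)

# training then applies class + box losses on the matched pairs and a
# "no-object" class loss on every unmatched query.
```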

[Back to Overview]

3D Object Detection

  • AST-GRU: "LiDAR-based Online 3D Video Object Detection with Graph-based Message Passing and Spatiotemporal Transformer Attention", CVPR, 2020 (Baidu). [Paper][Code (in construction)]
  • Pointformer: "3D Object Detection with Pointformer", arXiv, 2020 (Tsinghua). [Paper]
  • CT3D: "Improving 3D Object Detection with Channel-wise Transformer", ICCV, 2021 (Alibaba). [Paper][Code (in construction)]
  • Group-Free-3D: "Group-Free 3D Object Detection via Transformers", ICCV, 2021 (Microsoft). [Paper][PyTorch]
  • VoTr: "Voxel Transformer for 3D Object Detection", ICCV, 2021 (CUHK + NUS). [Paper]
  • 3DETR: "An End-to-End Transformer Model for 3D Object Detection", ICCV, 2021 (Facebook). [Paper][PyTorch][Website]
  • DETR3D: "DETR3D: 3D Object Detection from Multi-view Images via 3D-to-2D Queries", CoRL, 2021 (MIT). [Paper]
  • M3DETR: "M3DeTR: Multi-representation, Multi-scale, Mutual-relation 3D Object Detection with Transformers", WACV, 2022 (University of Maryland). [Paper][PyTorch]
  • MonoDTR: "MonoDTR: Monocular 3D Object Detection with Depth-Aware Transformer", CVPR, 2022 (NTU). [Paper][Code (in construction)]
  • VoxSeT: "Voxel Set Transformer: A Set-to-Set Approach to 3D Object Detection from Point Clouds", CVPR, 2022 (The Hong Kong Polytechnic University). [Paper][PyTorch]
  • TransFusion: "TransFusion: Robust LiDAR-Camera Fusion for 3D Object Detection with Transformers", CVPR, 2022 (HKUST). [Paper][PyTorch]
  • CAT-Det: "CAT-Det: Contrastively Augmented Transformer for Multi-modal 3D Object Detection", CVPR, 2022 (Beihang University). [Paper]
  • TokenFusion: "Multimodal Token Fusion for Vision Transformers", CVPR, 2022 (Tsinghua). [Paper]
  • SST: "Embracing Single Stride 3D Object Detector with Sparse Transformer", CVPR, 2022 (CAS). [Paper][PyTorch]
  • LIFT: "LIFT: Learning 4D LiDAR Image Fusion Transformer for 3D Object Detection", CVPR, 2022 (Shanghai Jiao Tong University). [Paper]
  • BoxeR: "BoxeR: Box-Attention for 2D and 3D Transformers", CVPR, 2022 (University of Amsterdam). [Paper][PyTorch]
  • BrT: "Bridged Transformer for Vision and Point Cloud 3D Object Detection", CVPR, 2022 (Tsinghua). [Paper]
  • VISTA: "VISTA: Boosting 3D Object Detection via Dual Cross-VIew SpaTial Attention", CVPR, 2022 (South China University of Technology). [Paper][PyTorch]
  • STRL: "Towards Self-Supervised Pre-Training of 3DETR for Label-Efficient 3D Object Detection", CVPRW, 2022 (Bosch). [Paper]
  • MTrans: "Multimodal Transformer for Automatic 3D Annotation and Object Detection", ECCV, 2022 (HKU). [Paper][PyTorch]
  • CenterFormer: "CenterFormer: Center-based Transformer for 3D Object Detection", ECCV, 2022 (TuSimple). [Paper][Code (in construction)]
  • BUTD-DETR: "Bottom Up Top Down Detection Transformers for Language Grounding in Images and Point Clouds", ECCV, 2022 (CMU). [Paper][PyTorch][Website]
  • SpatialDETR: "SpatialDETR: Robust Scalable Transformer-Based 3D Object Detection from Multi-View Camera Images with Global Cross-Sensor Attention", ECCV, 2022 (Mercedes-Benz). [Paper][PyTorch]
  • CramNet: "CramNet: Camera-Radar Fusion with Ray-Constrained Cross-Attention for Robust 3D Object Detection", ECCV, 2022 (Waymo). [Paper]
  • SWFormer: "SWFormer: Sparse Window Transformer for 3D Object Detection in Point Clouds", ECCV, 2022 (Waymo). [Paper]
  • EMMF-Det: "Enhancing Multi-modal Features Using Local Self-Attention for 3D Object Detection", ECCV, 2022 (Hikvision). [Paper]
  • UVTR: "Unifying Voxel-based Representation with Transformer for 3D Object Detection", NeurIPS, 2022 (CUHK). [Paper][PyTorch]
  • MsSVT: "MsSVT: Mixed-scale Sparse Voxel Transformer for 3D Object Detection on Point Clouds", NeurIPS, 2022 (Beijing Institute of Technology). [Paper]
  • DeepInteraction: "DeepInteraction: 3D Object Detection via Modality Interaction", NeurIPS, 2022 (Fudan). [Paper][PyTorch]
  • PETR: "PETR: Position Embedding Transformation for Multi-View 3D Object Detection", arXiv, 2022 (Megvii). [Paper]
  • MonoDETR: "MonoDETR: Depth-aware Transformer for Monocular 3D Object Detection", arXiv, 2022 (Shanghai AI Laboratory). [Paper][Code (in construction)]
  • Graph-DETR3D: "Graph-DETR3D: Rethinking Overlapping Regions for Multi-View 3D Object Detection", arXiv, 2022 (University of Science and Technology of China). [Paper]
  • PETRv2: "PETRv2: A Unified Framework for 3D Perception from Multi-Camera Images", arXiv, 2022 (Megvii). [Paper]
  • PolarFormer: "PolarFormer: Multi-camera 3D Object Detection with Polar Transformer", arXiv, 2022 (Fudan University). [Paper][Code (in construction)]
  • AST-GRU: "Graph Neural Network and Spatiotemporal Transformer Attention for 3D Video Object Detection from Point Clouds", arXiv, 2022 (Beijing Institute of Technology). [Paper]
  • SEFormer: "SEFormer: Structure Embedding Transformer for 3D Object Detection", arXiv, 2022 (Tsinghua University). [Paper]
  • CRAFT: "CRAFT: Camera-Radar 3D Object Detection with Spatio-Contextual Fusion Transformer", arXiv, 2022 (KAIST). [Paper]
  • CrossDTR: "CrossDTR: Cross-view and Depth-guided Transformers for 3D Object Detection", arXiv, 2022 (NTU). [Paper][Code (in construction)]
  • ?: "3D Point Positional Encoding for Multi-Camera 3D Object Detection Transformers", arXiv, 2022 (Houmo AI, China). [Paper]
  • Focal-PETR: "Focal-PETR: Embracing Foreground for Efficient Multi-Camera 3D Object Detection", arXiv, 2022 (Beijing Institute of Technology). [Paper]
  • Li3DeTr: "Li3DeTr: A LiDAR based 3D Detection Transformer", WACV, 2023 (University of Coimbra, Portugal). [Paper]
  • PiMAE: "PiMAE: Point Cloud and Image Interactive Masked Autoencoders for 3D Object Detection", CVPR, 2023 (Peking University). [Paper][PyTorch]
  • OcTr: "OcTr: Octree-based Transformer for 3D Object Detection", CVPR, 2023 (Beihang University). [Paper]
  • MonoATT: "MonoATT: Online Monocular 3D Object Detection with Adaptive Token Transformer", CVPR, 2023 (Shanghai Jiao Tong). [Paper]
  • PVT-SSD: "PVT-SSD: Single-Stage 3D Object Detector with Point-Voxel Transformer", CVPR, 2023 (Shanghai AI Lab). [Paper][Code (in construction)]
  • ConQueR: "ConQueR: Query Contrast Voxel-DETR for 3D Object Detection", CVPR, 2023 (CUHK). [Paper][PyTorch][Website]
  • FrustumFormer: "FrustumFormer: Adaptive Instance-aware Resampling for Multi-view 3D Detection", CVPR, 2023 (CAS). [Paper][PyTorch (in construction)]
  • DSVT: "DSVT: Dynamic Sparse Voxel Transformer with Rotated Sets", CVPR, 2023 (Peking University). [Paper][PyTorch]
  • AShapeFormer: "AShapeFormer: Semantics-Guided Object-Level Active Shape Encoding for 3D Object Detection via Transformers", CVPR, 2023 (Hunan University). [Paper][Code (in construction)]
  • MV-JAR: "MV-JAR: Masked Voxel Jigsaw and Reconstruction for LiDAR-Based Self-Supervised Pre-Training", CVPR, 2023 (Shanghai AI Lab). [Paper][Code (in construction)]
  • FocalFormer3D: "FocalFormer3D : Focusing on Hard Instance for 3D Object Detection", ICCV, 2023 (NVIDIA). [Paper]
  • ?: "An Empirical Analysis of Range for 3D Object Detection", ICCVW, 2023 (CMU). [Paper]
  • DTH: "Efficient Transformer-based 3D Object Detection with Dynamic Token Halting", arXiv, 2023 (Cruise). [Paper]
  • STEMD: "Spatial-Temporal Enhanced Transformer Towards Multi-Frame 3D Object Detection", arXiv, 2023 (CUHK). [Paper][Code (in construction)](https://github.com/Eaphan/STEMD)
  • V-DETR: "V-DETR: DETR with Vertex Relative Position Encoding for 3D Object Detection", arXiv, 2023 (Microsoft). [Paper][Code (in construction)]

[Back to Overview]

Multi-Modal Detection

  • OVR-CNN: "Open-Vocabulary Object Detection Using Captions", CVPR, 2021 (Snap). [Paper][PyTorch]
  • MDETR: "MDETR - Modulated Detection for End-to-End Multi-Modal Understanding", ICCV, 2021 (NYU). [Paper][PyTorch][Website]
  • FETNet: "FETNet: Feature Exchange Transformer Network for RGB-D Object Detection", BMVC, 2021 (Tsinghua). [Paper]
  • MEDUSA: "Exploiting Scene Depth for Object Detection with Multimodal Transformers", BMVC, 2021 (Google). [Paper][PyTorch]
  • StrucTexT: "StrucTexT: Structured Text Understanding with Multi-Modal Transformers", arXiv, 2021 (Baidu). [Paper]
  • MAVL: "Class-agnostic Object Detection with Multi-modal Transformer", ECCV, 2022 (MBZUAI). [Paper][PyTorch]
  • OWL-ViT: "Simple Open-Vocabulary Object Detection with Vision Transformers", ECCV, 2022 (Google). [Paper][JAX][Hugging Face]
  • X-DETR: "X-DETR: A Versatile Architecture for Instance-wise Vision-Language Tasks", ECCV, 2022 (Amazon). [Paper]
  • simCrossTrans: "simCrossTrans: A Simple Cross-Modality Transfer Learning for Object Detection with ConvNets or Vision Transformers", arXiv, 2022 (The City University of New York). [Paper][PyTorch]
  • ?: "DALL-E for Detection: Language-driven Context Image Synthesis for Object Detection", arXiv, 2022 (USC). [Paper]
  • YONOD: "You Only Need One Detector: Unified Object Detector for Different Modalities based on Vision Transformers", arXiv, 2022 (CUNY). [Paper][PyTorch]
  • OmDet: "OmDet: Language-Aware Object Detection with Large-scale Vision-Language Multi-dataset Pre-training", arXiv, 2022 (Binjiang Institute of Zhejiang University). [Paper]
  • ContFormer: "Video Referring Expression Comprehension via Transformer with Content-aware Query", arXiv, 2022 (Peking University). [Paper]
  • DQ-DETR: "DQ-DETR: Dual Query Detection Transformer for Phrase Extraction and Grounding", AAAI, 2023 (International Digital Economy Academy (IDEA)). [Paper][Code (in construction)]
  • F-VLM: "F-VLM: Open-Vocabulary Object Detection upon Frozen Vision and Language Models", ICLR, 2023 (Google). [Paper][Website]
  • OV-3DET: "Open-Vocabulary Point-Cloud Object Detection without 3D Annotation", CVPR, 2023 (Peking University). [Paper][PyTorch]
  • Detection-Hub: "Detection Hub: Unifying Object Detection Datasets via Query Adaptation on Language Embedding", CVPR, 2023 (Fudan + Microsoft). [Paper]
  • MM-OVOD: "Multi-Modal Classifiers for Open-Vocabulary Object Detection", ICML, 2023 (Oxford). [Paper][Code (in construction)][Website]
  • OmniLabel: "OmniLabel: A Challenging Benchmark for Language-Based Object Detection", arXiv, 2023 (NEC). [Paper][GitHub][Website]
  • ContextDET: "Contextual Object Detection with Multimodal Large Language Models", arXiv, 2023 (NTU, Singapore). [Paper][Code (in construction)][Website]
  • OWL-ST: "Scaling Open-Vocabulary Object Detection", arXiv, 2023 (DeepMind). [Paper]
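
Most open-vocabulary detectors above share one head: region features are scored against text embeddings of arbitrary class names from a contrastive vision-language model. A minimal sketch using the public OpenAI `clip` package; the random region features stand in for a detector's RoI features, which in a real system come from a region proposal or query branch.

```python
import torch
import clip  # pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

class_names = ["a photo of a cat", "a photo of a dog", "a photo of a truck"]
with torch.no_grad():
    text_emb = model.encode_text(clip.tokenize(class_names).to(device))  # (C, D)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

region_feats = torch.randn(10, text_emb.shape[-1], device=device)  # (R, D) stand-in
region_feats = region_feats / region_feats.norm(dim=-1, keepdim=True)
scores = (region_feats.float() @ text_emb.float().t()).softmax(-1)  # (R, C)
print(scores.argmax(-1))  # open-vocabulary class prediction per region
```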

[Back to Overview]

HOI Detection

  • HOI-Transformer: "End-to-End Human Object Interaction Detection with HOI Transformer", CVPR, 2021 (Megvii). [Paper][PyTorch]
  • HOTR: "HOTR: End-to-End Human-Object Interaction Detection with Transformers", CVPR, 2021 (Kakao + Korea University). [Paper][PyTorch]
  • MSTR: "MSTR: Multi-Scale Transformer for End-to-End Human-Object Interaction Detection", CVPR, 2022 (Kakao). [Paper]
  • SSRT: "What to look at and where: Semantic and Spatial Refined Transformer for detecting human-object interactions", CVPR, 2022 (Amazon). [Paper]
  • CPC: "Consistency Learning via Decoding Path Augmentation for Transformers in Human Object Interaction Detection", CVPR, 2022 (Korea University). [Paper][PyTorch (in construction)]
  • DisTR: "Human-Object Interaction Detection via Disentangled Transformer", CVPR, 2022 (Baidu). [Paper]
  • STIP: "Exploring Structure-Aware Transformer Over Interaction Proposals for Human-Object Interaction Detection", CVPR, 2022 (JD). [Paper][PyTorch]
  • DOQ: "Distillation Using Oracle Queries for Transformer-Based Human-Object Interaction Detection", CVPR, 2022 (South China University of Technology). [Paper]
  • UPT: "Efficient Two-Stage Detection of Human-Object Interactions with a Novel Unary-Pairwise Transformer", CVPR, 2022 (Australian Centre for Robotic Vision). [Paper][PyTorch][Website]
  • CATN: "Category-Aware Transformer Network for Better Human-Object Interaction Detection", CVPR, 2022 (Huazhong University of Science and Technology). [Paper]
  • GEN-VLKT: "GEN-VLKT: Simplify Association and Enhance Interaction Understanding for HOI Detection", CVPR, 2022 (Alibaba). [Paper][PyTorch]
  • HQM: "Towards Hard-Positive Query Mining for DETR-based Human-Object Interaction Detection", ECCV, 2022 (South China University of Technology). [Paper][PyTorch]
  • Iwin: "Iwin: Human-Object Interaction Detection via Transformer with Irregular Windows", ECCV, 2022 (Shanghai Jiao Tong). [Paper]
  • RLIP: "RLIP: Relational Language-Image Pre-training for Human-Object Interaction Detection", NeurIPS, 2022 (Alibaba). [Paper][PyTorch]
  • TUTOR: "Video-based Human-Object Interaction Detection from Tubelet Tokens", NeurIPS, 2022 (Shanghai Jiao Tong). [Paper]
  • ?: "Understanding Embodied Reference with Touch-Line Transformer", arXiv, 2022 (Tsinghua University). [Paper][PyTorch]
  • ?: "Weakly-supervised HOI Detection via Prior-guided Bi-level Representation Learning", ICLR, 2023 (KU Leuven). [Paper]
  • HOICLIP: "HOICLIP: Efficient Knowledge Transfer for HOI Detection with Vision-Language Models", CVPR, 2023 (ShanghaiTech). [Paper][Code (in construction)]
  • ViPLO: "ViPLO: Vision Transformer based Pose-Conditioned Self-Loop Graph for Human-Object Interaction Detection", CVPR, 2023 (mAy-I, Korea). [Paper][PyTorch]
  • OpenCat: "Open-Category Human-Object Interaction Pre-Training via Language Modeling Framework", CVPR, 2023 (Renmin University of China). [Paper]
  • CQL: "Category Query Learning for Human-Object Interaction Classification", CVPR, 2023 (Megvii). [Paper][Code (in construction)]
  • RmLR: "Re-mine, Learn and Reason: Exploring the Cross-modal Semantic Correlations for Language-guided HOI detection", ICCV, 2023 (Southeast University, China). [Paper]
  • PViC: "Exploring Predicate Visual Context in Detecting of Human-Object Interactions", ICCV, 2023 (Microsoft). [Paper][PyTorch]
  • AGER: "Agglomerative Transformer for Human-Object Interaction Detection", ICCV, 2023 (Shanghai Jiao Tong). [Paper][Code (in construction)]
  • RLIPv2: "RLIPv2: Fast Scaling of Relational Language-Image Pre-training", ICCV, 2023 (Alibaba). [Paper][PyTorch]
  • EgoPCA: "EgoPCA: A New Framework for Egocentric Hand-Object Interaction Understanding", ICCV, 2023 (Shanghai Jiao Tong). [Paper][Website]

[Back to Overview]

Salient Object Detection

  • VST: "Visual Saliency Transformer", ICCV, 2021 (Northwestern Polytechnical University). [Paper]
  • ?: "Learning Generative Vision Transformer with Energy-Based Latent Space for Saliency Prediction", NeurIPS, 2021 (Baidu). [Paper]
  • SwinNet: "SwinNet: Swin Transformer drives edge-aware RGB-D and RGB-T salient object detection", TCSVT, 2021 (Anhui University). [Paper][Code]
  • SOD-Transformer: "Transformer Transforms Salient Object Detection and Camouflaged Object Detection", arXiv, 2021 (Northwestern Polytechnical University). [Paper]
  • GLSTR: "Unifying Global-Local Representations in Salient Object Detection with Transformer", arXiv, 2021 (South China University of Technology). [Paper]
  • TriTransNet: "TriTransNet: RGB-D Salient Object Detection with a Triplet Transformer Embedding Network", arXiv, 2021 (Anhui University). [Paper]
  • AbiU-Net: "Boosting Salient Object Detection with Transformer-based Asymmetric Bilateral U-Net", arXiv, 2021 (Nankai University). [Paper]
  • TranSalNet: "TranSalNet: Visual saliency prediction using transformers", arXiv, 2021 (Cardiff University, UK). [Paper]
  • DFTR: "DFTR: Depth-supervised Hierarchical Feature Fusion Transformer for Salient Object Detection", arXiv, 2022 (Tencent). [Paper]
  • GroupTransNet: "GroupTransNet: Group Transformer Network for RGB-D Salient Object Detection", arXiv, 2022 (Nankai University). [Paper]
  • SelfReformer: "SelfReformer: Self-Refined Network with Transformer for Salient Object Detection", arXiv, 2022 (NTU, Singapore). [Paper]
  • DTMINet: "Dual Swin-Transformer based Mutual Interactive Network for RGB-D Salient Object Detection", arXiv, 2022 (CUHK). [Paper]
  • MCNet: "Mirror Complementary Transformer Network for RGB-thermal Salient Object Detection", arXiv, 2022 (Beijing University of Posts and Telecommunications). [Paper][PyTorch]
  • SiaTrans: "SiaTrans: Siamese Transformer Network for RGB-D Salient Object Detection with Depth Image Classification", arXiv, 2022 (Shandong University of Science and Technology). [Paper]
  • PSFormer: "PSFormer: Point Transformer for 3D Salient Object Detection", arXiv, 2022 (Nanjing University of Aeronautics and Astronautics). [Paper]
  • RMFormer: "Recurrent Multi-scale Transformer for High-Resolution Salient Object Detection", ACMMM, 2023 (Dalian University of Technology). [Paper]

[Back to Overview]

Other Detection Tasks

  • X-supervised:
    • LOST: "Localizing Objects with Self-Supervised Transformers and no Labels", BMVC, 2021 (Valeo.ai). [Paper][PyTorch]
    • Omni-DETR: "Omni-DETR: Omni-Supervised Object Detection with Transformers", CVPR, 2022 (Amazon). [Paper][PyTorch]
    • TokenCut: "Self-Supervised Transformers for Unsupervised Object Discovery using Normalized Cut", CVPR, 2022 (Univ. Grenoble Alpes, France). [Paper][PyTorch][Website]
    • WS-DETR: "Scaling Novel Object Detection with Weakly Supervised Detection Transformers", CVPRW, 2022 (Microsoft). [Paper]
    • TRT: "Re-Attention Transformer for Weakly Supervised Object Localization", arXiv, 2022 (Zhejiang University). [Paper][PyTorch]
    • TokenCut: "TokenCut: Segmenting Objects in Images and Videos with Self-supervised Transformer and Normalized Cut", arXiv, 2022 (Univ. Grenoble Alpes, France). [Paper][PyTorch][Website]
    • Semi-DETR: "Semi-DETR: Semi-Supervised Object Detection With Detection Transformers", CVPR, 2023 (Baidu). [Paper][Paddle (in construction)][PyTorch (JCZ404)]
    • MoTok: "Object Discovery from Motion-Guided Tokens", CVPR, 2023 (Toyota). [Paper][PyTorch][Website]
    • CutLER: "Cut and Learn for Unsupervised Object Detection and Instance Segmentation", CVPR, 2023 (Meta). [Paper][PyTorch][Website]
    • ISA-TS: "Invariant Slot Attention: Object Discovery with Slot-Centric Reference Frames", ICML, 2023 (Google). [Paper]
    • SeqCo-DETR: "SeqCo-DETR: Sequence Consistency Training for Self-Supervised Object Detection with Transformers", arXiv, 2023 (SenseTime). [Paper]
    • MOST: "MOST: Multiple Object localization with Self-supervised Transformers for object discovery", arXiv, 2023 (Meta). [Paper]
    • R-MAE: "R-MAE: Regions Meet Masked Autoencoders", arXiv, 2023 (Meta). [Paper]
    • SimDETR: "SimDETR: Simplifying self-supervised pretraining for DETR", arXiv, 2023 (Samsung). [Paper]
  • X-Shot Object Detection:
    • AIT: "Adaptive Image Transformer for One-Shot Object Detection", CVPR, 2021 (Academia Sinica). [Paper]
    • Meta-DETR: "Meta-DETR: Few-Shot Object Detection via Unified Image-Level Meta-Learning", arXiv, 2021 (NTU Singapore). [Paper][PyTorch]
    • CAT: "CAT: Cross-Attention Transformer for One-Shot Object Detection", arXiv, 2021 (Northwestern Polytechnical University). [Paper]
    • FCT: "Few-Shot Object Detection with Fully Cross-Transformer", CVPR, 2022 (Columbia). [Paper]
    • SaFT: "Semantic-aligned Fusion Transformer for One-shot Object Detection", CVPR, 2022 (Microsoft). [Paper]
    • TENET: "Time-rEversed diffusioN tEnsor Transformer: A New TENET of Few-Shot Object Detection", ECCV, 2022 (ANU). [Paper][PyTorch]
    • Meta-DETR: "Meta-DETR: Image-Level Few-Shot Detection with Inter-Class Correlation Exploitation", TPAMI, 2022 (NTU, Singapore). [Paper]
    • Incremental-DETR: "Incremental-DETR: Incremental Few-Shot Object Detection via Self-Supervised Learning", arXiv, 2022 (NUS). [Paper]
    • FS-DETR: "FS-DETR: Few-Shot DEtection TRansformer with prompting and without re-training", arXiv, 2022 (Samsung). [Paper]
    • Meta-ZSDETR: "Meta-ZSDETR: Zero-shot DETR with Meta-learning", ICCV, 2023 (Fudan). [Paper]
  • Open-World/Vocabulary:
    • OW-DETR: "OW-DETR: Open-world Detection Transformer", CVPR, 2022 (IIAI). [Paper][PyTorch]
    • DetPro: "Learning to Prompt for Open-Vocabulary Object Detection with Vision-Language Model", CVPR, 2022 (Tsinghua University). [Paper][PyTorch]
    • RegionCLIP: "RegionCLIP: Region-based Language-Image Pretraining", CVPR, 2022 (Microsoft). [Paper][PyTorch]
    • PromptDet: "PromptDet: Towards Open-vocabulary Detection using Uncurated Images", ECCV, 2022 (Meituan). [Paper][PyTorch][Website]
    • OV-DETR: "Open-Vocabulary DETR with Conditional Matching", ECCV, 2022 (NTU, Singapore). [Paper]
    • VL-PLM: "Exploiting Unlabeled Data with Vision and Language Models for Object Detection", ECCV, 2022 (Rutgers University). [Paper][PyTorch][Website]
    • DetCLIP: "DetCLIP: Dictionary-Enriched Visual-Concept Paralleled Pre-training for Open-world Detection", NeurIPS, 2022 (HKUST). [Paper]
    • WWbL: "What is Where by Looking: Weakly-Supervised Open-World Phrase-Grounding without Text Inputs", NeurIPS, 2022 (Tel-Aviv). [Paper][PyTorch][Demo]
    • P3OVD: "P3OVD: Fine-grained Visual-Text Prompt-Driven Self-Training for Open-Vocabulary Object Detection", arXiv, 2022 (Sun Yat-sen University). [Paper]
    • Open-World-DETR: "Open World DETR: Transformer based Open World Object Detection", arXiv, 2022 (NUS). [Paper]
    • BARON: "Aligning Bag of Regions for Open-Vocabulary Object Detection", CVPR, 2023 (NTU, Singapore). [Paper][PyTorch]
    • CapDet: "CapDet: Unifying Dense Captioning and Open-World Detection Pretraining", CVPR, 2023 (Sun Yat-sen University). [Paper]
    • CORA: "CORA: Adapting CLIP for Open-Vocabulary Detection with Region Prompting and Anchor Pre-Matching", CVPR, 2023 (CUHK). [Paper][PyTorch]
    • UniDetector: "Detecting Everything in the Open World: Towards Universal Object Detection", CVPR, 2023 (Tsinghua University). [Paper][PyTorch]
    • DetCLIPv2: "DetCLIPv2: Scalable Open-Vocabulary Object Detection Pre-training via Word-Region Alignment", CVPR, 2023 (Huawei). [Paper]
    • RO-ViT: "Region-Aware Pretraining for Open-Vocabulary Object Detection with Vision Transformers", CVPR, 2023 (Google). [Paper]
    • CAT: "CAT: LoCalization and IdentificAtion Cascade Detection Transformer for Open-World Object Detection", CVPR, 2023 (Northeast University, China). [Paper][PyTorch]
    • CondHead: "Learning to Detect and Segment for Open Vocabulary Object Detection", CVPR, 2023 (Sichuan University). [Paper]
    • OADP: "Object-Aware Distillation Pyramid for Open-Vocabulary Object Detection", CVPR, 2023 (Beihang University). [Paper][PyTorch]
    • OVAD: "Open-vocabulary Attribute Detection", CVPR, 2023 (University of Freiburg, Germany). [Paper][Website]
    • OvarNet: "OvarNet: Towards Open-vocabulary Object Attribute Recognition", CVPR, 2023 (Xiaohongshu). [Paper][Website][PyTorch]
    • ALLOW: "Annealing-Based Label-Transfer Learning for Open World Object Detection", CVPR, 2023 (Beihang University). [Paper][PyTorch]
    • PROB: "PROB: Probabilistic Objectness for Open World Object Detection", CVPR, 2023 (Stanford). [Paper][PyTorch][Website]
    • RandBox: "Random Boxes Are Open-world Object Detectors", ICCV, 2023 (NTU, Singapore). [Paper][Code (in construction)]
    • Cascade-DETR: "Cascade-DETR: Delving into High-Quality Universal Object Detection", ICCV, 2023 (ETHZ + HKUST). [Paper][Code (in construction)]
    • EdaDet: "EdaDet: Open-Vocabulary Object Detection Using Early Dense Alignment", ICCV, 2023 (ShanghaiTech). [Paper][Website]
    • Grounding-DINO: "Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection", arXiv, 2023 (IDEA). [Paper]
    • GridCLIP: "GridCLIP: One-Stage Object Detection by Grid-Level CLIP Representation Learning", arXiv, 2023 (Queen Mary University of London). [Paper]
    • ?: "Three ways to improve feature alignment for open vocabulary detection", arXiv, 2023 (DeepMind). [Paper]
    • PCL: "Open-Vocabulary Object Detection using Pseudo Caption Labels", arXiv, 2023 (Kakao). [Paper]
    • Prompt-OVD: "Prompt-Guided Transformers for End-to-End Open-Vocabulary Object Detection", arXiv, 2023 (NAVER). [Paper]
    • LOWA: "LOWA: Localize Objects in the Wild with Attributes", arXiv, 2023 (Mineral, California). [Paper]
    • SGDN: "Open-Vocabulary Object Detection via Scene Graph Discovery", arXiv, 2023 (Monash University). [Paper]
    • ASM: "The All-Seeing Project: Towards Panoptic Visual Recognition and Understanding of the Open World", arXiv, 2023 (Shanghai AI Lab). [Paper][Code (in construction)][Demo]
    • SAS-Det: "Improving Pseudo Labels for Open-Vocabulary Object Detection", arXiv, 2023 (NEC). [Paper]
  • Pedestrian Detection:
    • PED: "DETR for Crowd Pedestrian Detection", arXiv, 2020 (Tsinghua). [Paper][PyTorch]
    • ?: "Effectiveness of Vision Transformer for Fast and Accurate Single-Stage Pedestrian Detection", NeurIPS, 2022 (ICL). [Paper]
    • Pedestron: "Pedestrian Detection: Domain Generalization, CNNs, Transformers and Beyond", arXiv, 2022 (IIAI). [Paper][PyTorch]
    • VLPD: "VLPD: Context-Aware Pedestrian Detection via Vision-Language Semantic Self-Supervision", CVPR, 2023 (University of Science and Technology Beijing). [Paper][PyTorch]
  • Lane Detection:
    • LSTR: "End-to-end Lane Shape Prediction with Transformers", WACV, 2021 (Xi'an Jiaotong). [Paper][PyTorch]
    • LETR: "Line Segment Detection Using Transformers without Edges", CVPR, 2021 (UCSD). [Paper][PyTorch]
    • Laneformer: "Laneformer: Object-aware Row-Column Transformers for Lane Detection", AAAI, 2022 (Huawei). [Paper]
    • TLC: "Transformer Based Line Segment Classifier With Image Context for Real-Time Vanishing Point Detection in Manhattan World", CVPR, 2022 (Peking University). [Paper]
    • PersFormer: "PersFormer: 3D Lane Detection via Perspective Transformer and the OpenLane Benchmark", ECCV, 2022 (Shanghai AI Laboratory). [Paper][PyTorch]
    • MHVA: "Lane Detection Transformer Based on Multi-Frame Horizontal and Vertical Attention and Visual Transformer Module", ECCV, 2022 (Beihang University). [Paper]
    • PriorLane: "PriorLane: A Prior Knowledge Enhanced Lane Detection Approach Based on Transformer", arXiv, 2022 (Zhejiang Lab). [Paper][PyTorch]
    • CurveFormer: "CurveFormer: 3D Lane Detection by Curve Propagation with Curve Queries and Attention", arXiv, 2022 (NullMax, China). [Paper]
    • LATR: "LATR: 3D Lane Detection from Monocular Images with Transformer", ICCV, 2023 (CUHK). [Paper]
    • O2SFormer: "End to End Lane detection with One-to-Several Transformer", arXiv, 2023 (Southeast University, China). [Paper][PyTorch]
  • Object Localization:
    • TS-CAM: "TS-CAM: Token Semantic Coupled Attention Map for Weakly Supervised Object Localization", arXiv, 2021 (CAS). [Paper]
    • LCTR: "LCTR: On Awakening the Local Continuity of Transformer for Weakly Supervised Object Localization", AAAI, 2022 (Xiamen University). [Paper]
    • ViTOL: "ViTOL: Vision Transformer for Weakly Supervised Object Localization", CVPRW, 2022 (Mercedes-Benz). [Paper][PyTorch]
    • SCM: "Weakly Supervised Object Localization via Transformer with Implicit Spatial Calibration", ECCV, 2022 (CUHK). [Paper][PyTorch]
    • CaFT: "CaFT: Clustering and Filter on Tokens of Transformer for Weakly Supervised Object Localization", arXiv, 2022 (Zhejiang University). [Paper]
    • CoW: "CoWs on Pasture: Baselines and Benchmarks for Language-Driven Zero-Shot Object Navigation", CVPR, 2023 (Columbia). [Paper][PyTorch][Website]
    • ESC: "ESC: Exploration with Soft Commonsense Constraints for Zero-shot Object Navigation", ICML, 2023 (UCSC). [Paper]
  • Relation Detection:
    • PST: "Visual Relationship Detection Using Part-and-Sum Transformers with Composite Queries", ICCV, 2021 (Amazon). [Paper]
    • PST: "Visual Composite Set Detection Using Part-and-Sum Transformers", arXiv, 2021 (Amazon). [Paper]
    • TROI: "Transformed ROIs for Capturing Visual Transformations in Videos", arXiv, 2021 (NUS, Singapore). [Paper]
    • RelTransformer: "RelTransformer: A Transformer-Based Long-Tail Visual Relationship Recognition", CVPR, 2022 (KAUST). [Paper][PyTorch]
    • VReBERT: "VReBERT: A Simple and Flexible Transformer for Visual Relationship Detection", ICPR, 2022 (ANU). [Paper]
    • UniVRD: "Unified Visual Relationship Detection with Vision and Language Models", arXiv, 2023 (Google). [Paper]
    • RECODE: "Zero-shot Visual Relation Detection via Composite Visual Cues from Large Language Models", arXiv, 2023 (Zhejiang University). [Paper]
  • Anomaly Detection:
    • VT-ADL: "VT-ADL: A Vision Transformer Network for Image Anomaly Detection and Localization", ISIE, 2021 (University of Udine, Italy). [Paper]
    • InTra: "Inpainting Transformer for Anomaly Detection", arXiv, 2021 (Fujitsu). [Paper]
    • AnoViT: "AnoViT: Unsupervised Anomaly Detection and Localization with Vision Transformer-based Encoder-Decoder", arXiv, 2022 (Korea University). [Paper]
    • WinCLIP: "WinCLIP: Zero-/Few-Shot Anomaly Classification and Segmentation", CVPR, 2023 (Amazon). [Paper]
    • M3DM: "Multimodal Industrial Anomaly Detection via Hybrid Fusion", CVPR, 2023 (Tencent). [Paper][PyTorch]
  • Cross-Domain:
    • SSTN: "SSTN: Self-Supervised Domain Adaptation Thermal Object Detection for Autonomous Driving", arXiv, 2021 (Gwangju Institute of Science and Technology). [Paper]
    • MTTrans: "MTTrans: Cross-Domain Object Detection with Mean-Teacher Transformer", ECCV, 2022 (Beihang University). [Paper]
    • OAA-OTA: "Improving Transferability for Domain Adaptive Detection Transformers", arXiv, 2022 (Beijing Institute of Technology). [Paper]
    • SSTA: "Cross-domain Detection Transformer based on Spatial-aware and Semantic-aware Token Alignment", arXiv, 2022 (University of Electronic Science and Technology of China). [Paper]
    • DETR-GA: "DETR with Additional Global Aggregation for Cross-domain Weakly Supervised Object Detection", CVPR, 2023 (Beihang University). [Paper]
    • DA-DETR: "DA-DETR: Domain Adaptive Detection Transformer with Information Fusion", CVPR, 2023 (NTU, Singapore). [Paper]
    • ?: "CLIP the Gap: A Single Domain Generalization Approach for Object Detection", CVPR, 2023 (EPFL). [Paper][PyTorch]
    • PM-DETR: "PM-DETR: Domain Adaptive Prompt Memory for Object Detection with Transformers", arXiv, 2023 (Peking). [Paper]
  • Co-Salient Object Detection:
    • CoSformer: "CoSformer: Detecting Co-Salient Object with Transformers", arXiv, 2021 (Nanjing University). [Paper]
  • Oriented Object Detection:
    • O2DETR: "Oriented Object Detection with Transformer", arXiv, 2021 (Baidu). [Paper]
    • AO2-DETR: "AO2-DETR: Arbitrary-Oriented Object Detection Transformer", arXiv, 2022 (Peking University). [Paper]
    • ARS-DETR: "ARS-DETR: Aspect Ratio Sensitive Oriented Object Detection with Transformer", arXiv, 2023 (Harbin Institute of Technology). [Paper][PyTorch]
    • RHINO: "RHINO: Rotated DETR with Dynamic Denoising via Hungarian Matching for Oriented Object Detection", arXiv, 2023 (SI Analytics). [Paper]
  • Multiview Detection:
    • MVDeTr: "Multiview Detection with Shadow Transformer (and View-Coherent Data Augmentation)", ACMMM, 2021 (ANU). [Paper]
  • Polygon Detection:
    • ?: "Investigating transformers in the decomposition of polygonal shapes as point collections", ICCVW, 2021 (Delft University of Technology, Netherlands). [Paper]
  • Drone-view:
    • TPH: "TPH-YOLOv5: Improved YOLOv5 Based on Transformer Prediction Head for Object Detection on Drone-captured Scenarios", ICCVW, 2021 (Beihang University). [Paper]
    • TransVisDrone: "TransVisDrone: Spatio-Temporal Transformer for Vision-based Drone-to-Drone Detection in Aerial Videos", arXiv, 2022 (UCF). [Paper][Code (in construction)]
  • Infrared:
    • ?: "Infrared Small-Dim Target Detection with Transformer under Complex Backgrounds", arXiv, 2021 (Chongqing University of Posts and Telecommunications). [Paper]
  • Text Detection:
    • SwinTextSpotter: "SwinTextSpotter: Scene Text Spotting via Better Synergy between Text Detection and Text Recognition", CVPR, 2022 (South China University of Technology). [Paper][PyTorch]
    • TESTR: "Text Spotting Transformers", CVPR, 2022 (UCSD). [Paper][PyTorch]
    • TTS: "Towards Weakly-Supervised Text Spotting using a Multi-Task Transformer", CVPR, 2022 (Amazon). [Paper]
    • oCLIP: "Language Matters: A Weakly Supervised Vision-Language Pre-training Approach for Scene Text Detection and Spotting", ECCV, 2022 (ByteDance). [Paper]
    • TransDETR: "End-to-End Video Text Spotting with Transformer", arXiv, 2022 (Zhejiang University). [Paper][PyTorch]
    • ?: "Arbitrary Shape Text Detection using Transformers", arXiv, 2022 (University of Waterloo, Canada). [Paper]
    • ?: "Arbitrary Shape Text Detection via Boundary Transformer", arXiv, 2022 (University of Science and Technology Beijing). [Paper][Code (in construction)]
    • DPTNet: "DPTNet: A Dual-Path Transformer Architecture for Scene Text Detection", arXiv, 2022 (Xiamen University). [Paper]
    • ATTR: "Aggregated Text Transformer for Scene Text Detection", arXiv, 2022 (Fudan). [Paper]
    • DPText-DETR: "DPText-DETR: Towards Better Scene Text Detection with Dynamic Points in Transformer", AAAI, 2023 (JD). [Paper][PyTorch]
    • TCM: "Turning a CLIP Model into a Scene Text Detector", CVPR, 2023 (Huazhong University of Science and Technology). [Paper][PyTorch]
    • DeepSolo: "DeepSolo: Let Transformer Decoder with Explicit Points Solo for Text Spotting", CVPR, 2023 (JD). [Paper][PyTorch]
    • ESTextSpotter: "ESTextSpotter: Towards Better Scene Text Spotting with Explicit Synergy in Transformer", ICCV, 2023 (South China University of Technology). [Paper][PyTorch]
    • PBFormer: "PBFormer: Capturing Complex Scene Text Shape with Polynomial Band Transformer", ACMMM, 2023 (Huawei). [Paper]
    • DeepSolo++: "DeepSolo++: Let Transformer Decoder with Explicit Points Solo for Text Spotting", arXiv, 2023 (JD). [Paper][PyTorch]
    • FastTCM: "Turning a CLIP Model into a Scene Text Spotter", arXiv, 2023 (Huazhong University of Science and Technology). [Paper][PyTorch]
    • SRFormer: "SRFormer: Empowering Regression-Based Text Detection Transformer with Segmentation", arXiv, 2023 (Shanghai Jiao Tong). [Paper]
  • Change Detection:
    • ChangeFormer: "A Transformer-Based Siamese Network for Change Detection", arXiv, 2022 (JHU). [Paper][PyTorch]
    • IDET: "IDET: Iterative Difference-Enhanced Transformers for High-Quality Change Detection", arXiv, 2022 (Civil Aviation University of China). [Paper]
  • Edge Detection:
    • EDTER: "EDTER: Edge Detection with Transformer", CVPR, 2022 (Beijing Jiaotong University). [Paper][Code (in construction)]
    • HEAT: "HEAT: Holistic Edge Attention Transformer for Structured Reconstruction", CVPR, 2022 (Simon Fraser). [Paper][PyTorch][Website]
  • Person Search:
    • COAT: "Cascade Transformers for End-to-End Person Search", CVPR, 2022 (Kitware). [Paper][PyTorch]
    • PSTR: "PSTR: End-to-End One-Step Person Search With Transformers", CVPR, 2022 (Tianjin University). [Paper][PyTorch]
  • Manipulation Detection:
    • ObjectFormer: "ObjectFormer for Image Manipulation Detection and Localization", CVPR, 2022 (Fudan University). [Paper]
  • Mirror Detection:
    • SATNet: "Symmetry-Aware Transformer-based Mirror Detection", arXiv, 2022 (Harbin Institute of Technology). [Paper][PyTorch]
  • Shadow Detection:
    • SCOTCH-SODA: "SCOTCH and SODA: A Transformer Video Shadow Detection Framework", CVPR, 2023 (University of Cambridge). [Paper]
  • Keypoint Detection:
    • SalViT: "From Saliency to DINO: Saliency-guided Vision Transformer for Few-shot Keypoint Detection", arXiv, 2023 (ANU). [Paper]
  • Continual Learning:
    • CL-DETR: "Continual Detection Transformer for Incremental Object Detection", CVPR, 2023 (MPI). [Paper]
  • Visual Query Detection/Localization:
    • CocoFormer: "Where is my Wallet? Modeling Object Proposal Sets for Egocentric Visual Query Localization", CVPR, 2023 (Meta). [Paper][PyTorch]
    • VQLoC: "Single-Stage Visual Query Localization in Egocentric Videos", CVPRW, 2023 (UT Austin). [Paper][Code (in construction)][Website]
  • Task-Driven Object Detection:
    • CoTDet: "CoTDet: Affordance Knowledge Prompting for Task Driven Object Detection", ICCV, 2023 (ShanghaiTech). [Paper]
  • Diffusion:
    • DiffusionEngine: "DiffusionEngine: Diffusion Model is Scalable Data Engine for Object Detection", arXiv, 2023 (ByteDance). [Paper][PyTorch][Website]
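
Most entries above point to paper-specific repos, but the common open-vocabulary interface is the same: feed an image plus free-form class names to a vision-language detector and post-process the matched boxes. Below is a minimal sketch of that workflow using OWL-ViT from Hugging Face `transformers` as a stand-in (the checkpoint name, image path, text queries, and score threshold are illustrative assumptions, not any specific method from this list):

```python
# Hedged sketch: open-vocabulary detection with OWL-ViT via Hugging Face.
# Assumes `torch`, `transformers`, and Pillow are installed; "street.jpg" is any RGB image.
import torch
from PIL import Image
from transformers import OwlViTProcessor, OwlViTForObjectDetection

processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")

image = Image.open("street.jpg")
queries = [["a pedestrian", "a traffic light", "a bicycle"]]  # free-form text classes
inputs = processor(text=queries, images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Rescale boxes to the original image size and keep confident matches.
target_sizes = torch.tensor([image.size[::-1]])  # (height, width)
detections = processor.post_process_object_detection(
    outputs=outputs, threshold=0.1, target_sizes=target_sizes)[0]
for box, score, label in zip(detections["boxes"], detections["scores"], detections["labels"]):
    print(queries[0][int(label)], f"{float(score):.2f}", [round(v, 1) for v in box.tolist()])
```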

[Back to Overview]

Segmentation

Semantic Segmentation

  • SETR: "Rethinking Semantic Segmentation from a Sequence-to-Sequence Perspective with Transformers", CVPR, 2021 (Tencent). [Paper][PyTorch][Website]
  • TrSeg: "TrSeg: Transformer for semantic segmentation", PRL, 2021 (Korea University). [Paper][PyTorch]
  • CWT: "Simpler is Better: Few-shot Semantic Segmentation with Classifier Weight Transformer", ICCV, 2021 (University of Surrey, UK). [Paper][PyTorch]
  • Segmenter: "Segmenter: Transformer for Semantic Segmentation", ICCV, 2021 (INRIA). [Paper][PyTorch]
  • UN-EPT: "A Unified Efficient Pyramid Transformer for Semantic Segmentation", ICCVW, 2021 (Amazon). [Paper][PyTorch]
  • FTN: "Fully Transformer Networks for Semantic Image Segmentation", arXiv, 2021 (Baidu). [Paper]
  • SegFormer: "SegFormer: Simple and Efficient Design for Semantic Segmentation with Transformers", NeurIPS, 2021 (NVIDIA). [Paper][PyTorch]
  • MaskFormer: "Per-Pixel Classification is Not All You Need for Semantic Segmentation", NeurIPS, 2021 (UIUC + Facebook). [Paper][Website]
  • OffRoadTranSeg: "OffRoadTranSeg: Semi-Supervised Segmentation using Transformers on OffRoad environments", arXiv, 2021 (IISER, India). [Paper]
  • TRFS: "Boosting Few-shot Semantic Segmentation with Transformers", arXiv, 2021 (ETHZ). [Paper]
  • Flying-Guide-Dog: "Flying Guide Dog: Walkable Path Discovery for the Visually Impaired Utilizing Drones and Transformer-based Semantic Segmentation", arXiv, 2021 (KIT, Germany). [Paper][Code (in construction)]
  • VSPW: "Semantic Segmentation on VSPW Dataset through Aggregation of Transformer Models", arXiv, 2021 (Xiaomi). [Paper]
  • SDTP: "SDTP: Semantic-aware Decoupled Transformer Pyramid for Dense Image Prediction", arXiv, 2021 (?). [Paper]
  • TopFormer: "TopFormer: Token Pyramid Transformer for Mobile Semantic Segmentation", CVPR, 2022 (Tencent). [Paper][PyTorch]
  • HRViT: "Multi-Scale High-Resolution Vision Transformer for Semantic Segmentation", CVPR, 2022 (Meta). [Paper][PyTorch]
  • GReaT: "Graph Reasoning Transformer for Image Parsing", ACMMM, 2022 (HKUST). [Paper]
  • SegDeformer: "A Transformer-Based Decoder for Semantic Segmentation with Multi-level Context Mining", ECCV, 2022 (Shanghai Jiao Tong + Huawei). [Paper][PyTorch]
  • PAUMER: "PAUMER: Patch Pausing Transformer for Semantic Segmentation", BMVC, 2022 (Idiap, Switzerland). [Paper]
  • SegViT: "SegViT: Semantic Segmentation with Plain Vision Transformers", NeurIPS, 2022 (The University of Adelaide, Australia). [Paper][PyTorch]
  • RTFormer: "RTFormer: Efficient Design for Real-Time Semantic Segmentation with Transformer", NeurIPS, 2022 (Baidu). [Paper][Paddle]
  • SegNeXt: "SegNeXt: Rethinking Convolutional Attention Design for Semantic Segmentation", NeurIPS, 2022 (Tsinghua University). [Paper]
  • Lawin: "Lawin Transformer: Improving Semantic Segmentation Transformer with Multi-Scale Representations via Large Window Attention", arXiv, 2022 (Beijing University of Posts and Telecommunications). [Paper][PyTorch]
  • PFT: "Pyramid Fusion Transformer for Semantic Segmentation", arXiv, 2022 (CUHK + SenseTime). [Paper]
  • DFlatFormer: "Dual-Flattening Transformers through Decomposed Row and Column Queries for Semantic Segmentation", arXiv, 2022 (OPPO). [Paper]
  • FeSeFormer: "Feature Selective Transformer for Semantic Image Segmentation", arXiv, 2022 (Baidu). [Paper]
  • StructToken: "StructToken: Rethinking Semantic Segmentation with Structural Prior", arXiv, 2022 (Shanghai AI Lab). [Paper]
  • HILA: "Improving Semantic Segmentation in Transformers using Hierarchical Inter-Level Attention", arXiv, 2022 (University of Toronto). [Paper][Website][PyTorch]
  • HLG: "Visual Representation Learning with Transformer: A Sequence-to-Sequence Perspective", arXiv, 2022 (Fudan University). [Paper][PyTorch]
  • SSformer: "SSformer: A Lightweight Transformer for Semantic Segmentation", arXiv, 2022 (Nanjing University of Aeronautics and Astronautics). [Paper][PyTorch]
  • NamedMask: "NamedMask: Distilling Segmenters from Complementary Foundation Models", arXiv, 2022 (Oxford). [Paper][PyTorch][Website]
  • IncepFormer: "IncepFormer: Efficient Inception Transformer with Pyramid Pooling for Semantic Segmentation", arXiv, 2022 (Nanjing University of Aeronautics and Astronautics). [Paper][PyTorch]
  • SeaFormer: "SeaFormer: Squeeze-enhanced Axial Transformer for Mobile Semantic Segmentation", ICLR, 2023 (Tencent). [Paper]
  • PPL: "Probabilistic Prompt Learning for Dense Prediction", CVPR, 2023 (Yonsei). [Paper]
  • AFF: "AutoFocusFormer: Image Segmentation off the Grid", CVPR, 2023 (Apple). [Paper]
  • CTS: "Content-aware Token Sharing for Efficient Semantic Segmentation with Vision Transformers", CVPR, 2023 (Eindhoven University of Technology, Netherlands). [Paper][PyTorch][Website]
  • TSG: "Transformer Scale Gate for Semantic Segmentation", CVPR, 2023 (Monash University, Australia). [Paper]
  • FASeg: "Dynamic Focus-aware Positional Queries for Semantic Segmentation", CVPR, 2023 (Monash University, Australia). [Paper][PyTorch]
  • HFD-BSD: "A Good Student is Cooperative and Reliable: CNN-Transformer Collaborative Learning for Semantic Segmentation", ICCV, 2023 (HKUST). [Paper]
  • SegViTv2: "SegViTv2: Exploring Efficient and Continual Semantic Segmentation with Plain Vision Transformers", arXiv, 2023 (The University of Adelaide, Australia). [Paper][PyTorch]
  • DToP: "Dynamic Token Pruning in Plain Vision Transformers for Semantic Segmentation", arXiv, 2023 (South China University of Technology + The University of Adelaide). [Paper]
  • DoViT: "Dynamic Token-Pass Transformers for Semantic Segmentation", arXiv, 2023 (Alibaba). [Paper]
  • CFT: "Category Feature Transformer for Semantic Segmentation", arXiv, 2023 (Huawei). [Paper]
  • ICPC: "ICPC: Instance-Conditioned Prompting with Contrastive Learning for Semantic Segmentation", arXiv, 2023 (Alibaba). [Paper]
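
Several of the plain-ViT segmenters above (e.g., SegFormer) have widely mirrored checkpoints, which makes a quick sanity check easy. A minimal inference sketch via Hugging Face `transformers` follows; the ADE20K checkpoint name and image path are assumptions for illustration, not the authors' official pipeline:

```python
# Hedged sketch: SegFormer semantic segmentation via Hugging Face.
# Assumes `torch`, `transformers`, and Pillow are installed; "scene.jpg" is any RGB image.
import torch
from PIL import Image
from transformers import SegformerImageProcessor, SegformerForSemanticSegmentation

ckpt = "nvidia/segformer-b0-finetuned-ade-512-512"
processor = SegformerImageProcessor.from_pretrained(ckpt)
model = SegformerForSemanticSegmentation.from_pretrained(ckpt).eval()

image = Image.open("scene.jpg")
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # (1, num_classes, H/4, W/4)

# SegFormer predicts at 1/4 resolution: upsample, then take the per-pixel argmax.
upsampled = torch.nn.functional.interpolate(
    logits, size=image.size[::-1], mode="bilinear", align_corners=False)
seg_map = upsampled.argmax(dim=1)[0]  # (H, W) integer class ids
print(seg_map.shape, seg_map.unique())
```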

[Back to Overview]

Depth Estimation

  • DPT: "Vision Transformers for Dense Prediction", ICCV, 2021 (Intel). [Paper][PyTorch]
  • TransDepth: "Transformer-Based Attention Networks for Continuous Pixel-Wise Prediction", ICCV, 2021 (Harbin Institute of Technology + University of Trento). [Paper][PyTorch]
  • ASTransformer: "Transformer-based Monocular Depth Estimation with Attention Supervision", BMVC, 2021 (USTC). [Paper][PyTorch]
  • MT-SfMLearner: "Transformers in Self-Supervised Monocular Depth Estimation with Unknown Camera Intrinsics", VISAPP, 2022 (NavInfo Europe, Netherlands). [Paper]
  • DepthFormer: "Multi-Frame Self-Supervised Depth with Transformers", CVPR, 2022 (Toyota). [Paper]
  • GuideFormer: "GuideFormer: Transformers for Image Guided Depth Completion", CVPR, 2022 (Agency for Defense Development, Korea). [Paper]
  • SparseFormer: "SparseFormer: Attention-based Depth Completion Network", CVPRW, 2022 (Meta). [Paper]
  • DEST: "Depth Estimation with Simplified Transformer", CVPRW, 2022 (NVIDIA). [Paper]
  • MonoViT: "MonoViT: Self-Supervised Monocular Depth Estimation with a Vision Transformer", 3DV, 2022 (University of Bologna, Italy). [Paper][PyTorch]
  • Spike-Transformer: "Spike Transformer: Monocular Depth Estimation for Spiking Camera", ECCV, 2022 (Peking University). [Paper][PyTorch]
  • ?: "Hybrid Transformer Based Feature Fusion for Self-Supervised Monocular Depth Estimation", ECCVW, 2022 (IIT Madras). [Paper]
  • GLPanoDepth: "GLPanoDepth: Global-to-Local Panoramic Depth Estimation", arXiv, 2022 (Nanjing University). [Paper]
  • DepthFormer: "DepthFormer: Exploiting Long-Range Correlation and Local Information for Accurate Monocular Depth Estimation", arXiv, 2022 (Harbin Institute of Technology). [Paper][PyTorch]
  • BinsFormer: "BinsFormer: Revisiting Adaptive Bins for Monocular Depth Estimation", arXiv, 2022 (Harbin Institute of Technology). [Paper][PyTorch]
  • SideRT: "SideRT: A Real-time Pure Transformer Architecture for Single Image Depth Estimation", arXiv, 2022 (Meituan). [Paper]
  • MonoFormer: "MonoFormer: Towards Generalization of self-supervised monocular depth estimation with Transformers", arXiv, 2022 (DGIST, Korea). [Paper]
  • Depthformer: "Depthformer: Multiscale Vision Transformer For Monocular Depth Estimation With Local Global Information Fusion", arXiv, 2022 (Indian Institute of Technology Delhi). [Paper]
  • TODE-Trans: "TODE-Trans: Transparent Object Depth Estimation with Transformer", arXiv, 2022 (USTC). [Paper][Code (in construction)]
  • ObjCAViT: "ObjCAViT: Improving Monocular Depth Estimation Using Natural Language Models And Image-Object Cross-Attention", arXiv, 2022 (ICL). [Paper]
  • ROIFormer: "ROIFormer: Semantic-Aware Region of Interest Transformer for Efficient Self-Supervised Monocular Depth Estimation", AAAI, 2023 (OPPO). [Paper]
  • TST: "Lightweight Monocular Depth Estimation via Token-Sharing Transformer", ICRA, 2023 (KAIST). [Paper]
  • CompletionFormer: "CompletionFormer: Depth Completion with Convolutions and Vision Transformers", CVPR, 2023 (University of Bologna, Italy). [Paper][PyTorch][Website]
  • Lite-Mono: "Lite-Mono: A Lightweight CNN and Transformer Architecture for Self-Supervised Monocular Depth Estimation", CVPR, 2023 (University of Twente, Netherlands). [Paper][PyTorch]
  • EGformer: "EGformer: Equirectangular Geometry-biased Transformer for 360 Depth Estimation", arXiv, 2023 (SNU). [Paper]
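
Of the monocular models above, DPT is the simplest to try because Intel hosts ready-to-use checkpoints. A minimal depth-inference sketch via Hugging Face `transformers` (the checkpoint and image names are assumptions; note DPT outputs relative, not metric, depth):

```python
# Hedged sketch: DPT monocular depth estimation via Hugging Face.
# Assumes `torch`, `transformers`, and Pillow are installed; "room.jpg" is any RGB image.
import torch
from PIL import Image
from transformers import DPTImageProcessor, DPTForDepthEstimation

processor = DPTImageProcessor.from_pretrained("Intel/dpt-large")
model = DPTForDepthEstimation.from_pretrained("Intel/dpt-large").eval()

image = Image.open("room.jpg")
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    depth = model(**inputs).predicted_depth  # (1, H', W') relative depth

# Resize the prediction back to the input resolution for visualization.
depth = torch.nn.functional.interpolate(
    depth.unsqueeze(1), size=image.size[::-1], mode="bicubic", align_corners=False
).squeeze()
print(depth.shape, float(depth.min()), float(depth.max()))
```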

[Back to Overview]

Object Segmentation

  • SOTR: "SOTR: Segmenting Objects with Transformers", ICCV, 2021 (China Agricultural University). [Paper][PyTorch]
  • Trans4Trans: "Trans4Trans: Efficient Transformer for Transparent Object Segmentation to Help Visually Impaired People Navigate in the Real World", ICCVW, 2021 (Karlsruhe Institute of Technology, Germany). [Paper][Code (in construction)]
  • Trans2Seg: "Segmenting Transparent Object in the Wild with Transformer", arXiv, 2021 (HKU + SenseTime). [Paper][PyTorch]
  • SOIT: "SOIT: Segmenting Objects with Instance-Aware Transformers", AAAI, 2022 (Hikvision). [Paper][PyTorch]
  • CAST: "Concurrent Recognition and Segmentation with Adaptive Segment Tokens", arXiv, 2022 (Berkeley). [Paper]
  • ?: "Learning Explicit Object-Centric Representations with Vision Transformers", arXiv, 2022 (Aalto University, Finland). [Paper]
  • MSMFormer: "Mean Shift Mask Transformer for Unseen Object Instance Segmentation", arXiv, 2022 (UT Dallas). [Paper][PyTorch]

[Back to Overview]

Other Segmentation Tasks

  • Any-X/Every-X:
    • SAM: "Segment Anything", arXiv, 2023 (Meta). [Paper][Website]
    • SEEM: "Segment Everything Everywhere All at Once", arXiv, 2023 (Microsoft). [Paper][Code (in construction)]
    • ?: "An Empirical Study on the Robustness of the Segment Anything Model (SAM)", arXiv, 2023 (UCSB). [Paper]
    • ?: "A Comprehensive Survey on Segment Anything Model for Vision and Beyond", arXiv, 2023 (HKUST). [Paper]
    • SAD: "SAD: Segment Any RGBD", arXiv, 2023 (NTU, Singapore). [Paper][PyTorch]
    • HQ-SAM: "Segment Anything in High Quality", arXiv, 2023 (ETHZ). [Paper][Code (in construction)]
    • ?: "A Survey on Segment Anything Model (SAM): Vision Foundation Model Meets Prompt Engineering", arXiv, 2023 (Kyung Hee University, Korea). [Paper]
    • ?: "Robustness of SAM: Segment Anything Under Corruptions and Beyond", arXiv, 2023 (Kyung Hee University). [Paper]
    • FastSAM: "Fast Segment Anything", arXiv, 2023 (CAS). [Paper][PyTorch]
    • MobileSAM: "Faster Segment Anything: Towards Lightweight SAM for Mobile Applications", arXiv, 2023 (Kyung Hee University). [Paper][PyTorch]
    • Semantic-SAM: "Semantic-SAM: Segment and Recognize Anything at Any Granularity", arXiv, 2023 (Microsoft). [Paper][Code (in construction)]
    • Follow-Anything: "Follow Anything: Open-set detection, tracking, and following in real-time", arXiv, 2023 (MIT). [Paper]
  • Vision-Language:
    • LSeg: "Language-driven Semantic Segmentation", ICLR, 2022 (Cornell). [Paper][PyTorch]
    • ZegFormer: "Decoupling Zero-Shot Semantic Segmentation", CVPR, 2022 (Wuhan University). [Paper][PyTorch]
    • CLIPSeg: "Image Segmentation Using Text and Image Prompts", CVPR, 2022 (University of Göttingen, Germany). [Paper][PyTorch]
    • DenseCLIP: "DenseCLIP: Language-Guided Dense Prediction with Context-Aware Prompting", CVPR, 2022 (Tsinghua University). [Paper][PyTorch][Website]
    • GroupViT: "GroupViT: Semantic Segmentation Emerges from Text Supervision", CVPR, 2022 (NVIDIA). [Paper][Website][PyTorch]
    • MaskCLIP: "Extract Free Dense Labels from CLIP", ECCV, 2022 (NTU, Singapore). [Paper][PyTorch][Website]
    • ZegCLIP: "ZegCLIP: Towards Adapting CLIP for Zero-shot Semantic Segmentation", arXiv, 2022 (The University of Adelaide, Australia). [Paper][PyTorch (in construction)]
    • ViewCo: "ViewCo: Discovering Text-Supervised Segmentation Masks via Multi-View Semantic Consistency", ICLR, 2023 (Sun Yat-sen University). [Paper][Code (in construction)]
    • LMSeg: "LMSeg: Language-guided Multi-dataset Segmentation", ICLR, 2023 (Alibaba). [Paper]
    • VL-Fields: "VL-Fields: Towards Language-Grounded Neural Implicit Spatial Representations", ICRA, 2023 (University of Edinburgh, UK). [Paper][Website]
    • X-Decoder: "Generalized Decoding for Pixel, Image, and Language", CVPR, 2023 (Microsoft). [Paper][PyTorch][Website]
    • IFSeg: "IFSeg: Image-free Semantic Segmentation via Vision-Language Model", CVPR, 2023 (KAIST). [Paper][PyTorch]
    • SAZS: "Delving into Shape-aware Zero-shot Semantic Segmentation", CVPR, 2023 (Tsinghua). [Paper][PyTorch]
    • CLIP-S4: "CLIP-S4: Language-Guided Self-Supervised Semantic Segmentation", CVPR, 2023 (Bosch). [Paper]
    • D2Zero: "Semantic-Promoted Debiasing and Background Disambiguation for Zero-Shot Instance Segmentation", CVPR, 2023 (Zhejiang University). [Paper][Code (in construction)][Website]
    • PADing: "Primitive Generation and Semantic-related Alignment for Universal Zero-Shot Segmentation", CVPR, 2023 (Zhejiang University). [Paper][PyTorch][Website]
    • ZegOT: "ZegOT: Zero-shot Segmentation Through Optimal Transport of Text Prompts", arXiv, 2023 (KAIST). [Paper]
    • SimCon: "SimCon Loss with Multiple Views for Text Supervised Semantic Segmentation", arXiv, 2023 (Amazon). [Paper]
    • DiffusionSeg: "DiffusionSeg: Adapting Diffusion Towards Unsupervised Object Discovery", arXiv, 2023 (Shanghai Jiao Tong). [Paper]
    • DeOP: "Zero-Shot Semantic Segmentation with Decoupled One-Pass Network", arXiv, 2023 (Meituan). [Paper][Code (in construction)]
    • ASCG: "Associating Spatially-Consistent Grouping with Text-supervised Semantic Segmentation", arXiv, 2023 (ByteDance). [Paper]
    • ClsCLIP: "[CLS] Token is All You Need for Zero-Shot Semantic Segmentation", arXiv, 2023 (Eastern Institute for Advanced Study, China). [Paper]
    • MESS: "What a MESS: Multi-Domain Evaluation of Zero-Shot Semantic Segmentation", arXiv, 2023 (Karlsruhe Institute of Technology, Germany). [Paper][PyTorch][Website]
    • LISA: "LISA: Reasoning Segmentation via Large Language Model", arXiv, 2023 (CUHK). [Paper][Code (in construction)]
    • MixReorg: "MixReorg: Cross-Modal Mixed Patch Reorganization is a Good Mask Learner for Open-World Semantic Segmentation", arXiv, 2023 (Sun Yat-sen University). [Paper]
  • Open-World/Vocabulary:
    • ViL-Seg: "Open-world Semantic Segmentation via Contrasting and Clustering Vision-Language Embedding", ECCV, 2022 (CUHK). [Paper]
    • OVSS: "A Simple Baseline for Open Vocabulary Semantic Segmentation with Pre-trained Vision-language Model", ECCV, 2022 (Microsoft). [Paper][PyTorch]
    • OpenSeg: "Scaling Open-Vocabulary Image Segmentation with Image-Level Labels", ECCV, 2022 (Google). [Paper]
    • Fusioner: "Open-vocabulary Semantic Segmentation with Frozen Vision-Language Models", BMVC, 2022 (Shanghai Jiao Tong University). [Paper][Website]
    • OVSeg: "Open-Vocabulary Semantic Segmentation with Mask-adapted CLIP", CVPR, 2023 (Meta). [Paper][PyTorch][Website]
    • ZegCLIP: "ZegCLIP: Towards Adapting CLIP for Zero-shot Semantic Segmentation", CVPR, 2023 (The University of Adelaide, Australia). [Paper][PyTorch]
    • TCL: "Learning to Generate Text-grounded Mask for Open-world Semantic Segmentation from Only Image-Text Pairs", CVPR, 2023 (Kakao). [Paper][PyTorch]
    • ODISE: "Open-Vocabulary Panoptic Segmentation with Text-to-Image Diffusion Models", CVPR, 2023 (NVIDIA). [Paper][PyTorch][Website]
    • Mask-free-OVIS: "Mask-free OVIS: Open-Vocabulary Instance Segmentation without Manual Mask Annotations", CVPR, 2023 (Salesforce). [Paper][PyTorch (in construction)]
    • FreeSeg: "FreeSeg: Unified, Universal and Open-Vocabulary Image Segmentation", CVPR, 2023 (ByteDance). [Paper]
    • SAN: "Side Adapter Network for Open-Vocabulary Semantic Segmentation", CVPR, 2023 (Microsoft). [Paper][PyTorch]
    • OVSegmentor: "Learning Open-vocabulary Semantic Segmentation Models From Natural Language Supervision", CVPR, 2023 (Fudan University). [Paper][PyTorch][Website]
    • PACL: "Open Vocabulary Semantic Segmentation with Patch Aligned Contrastive Learning", CVPR, 2023 (Meta). [Paper]
    • MaskCLIP: "Open-Vocabulary Universal Image Segmentation with MaskCLIP", ICML, 2023 (UCSD). [Paper][Website]
    • SegCLIP: "SegCLIP: Patch Aggregation with Learnable Centers for Open-Vocabulary Semantic Segmentation", ICML, 2023 (JD). [Paper][PyTorch]
    • SWORD: "Exploring Transformers for Open-world Instance Segmentation", ICCV, 2023 (HKU). [Paper]
    • Grounded-Diffusion: "Open-vocabulary Object Segmentation with Diffusion Models", ICCV, 2023 (Shanghai Jiao Tong). [Paper][PyTorch][Website]
    • SegPrompt: "SegPrompt: Boosting Open-world Segmentation via Category-level Prompt Learning", ICCV, 2023 (Zhejiang). [Paper][PyTorch]
    • CGG: "Betrayed by Captions: Joint Caption Grounding and Generation for Open Vocabulary Instance Segmentation", ICCV, 2023 (SenseTime). [Paper][PyTorch][Website]
    • WLSegNet: "A Language-Guided Benchmark for Weakly Supervised Open Vocabulary Semantic Segmentation", arXiv, 2023 (IIT, New Delhi). [Paper]
    • OpenSeeD: "A Simple Framework for Open-Vocabulary Segmentation and Detection", arXiv, 2023 (IDEA). [Paper][Code (in construction)]
    • GKC: "Global Knowledge Calibration for Fast Open-Vocabulary Segmentation", arXiv, 2023 (ByteDance). [Paper]
    • OPSNet: "Open-vocabulary Panoptic Segmentation with Embedding Modulation", arXiv, 2023 (HKU). [Paper]
    • CAT-Seg: "CAT-Seg: Cost Aggregation for Open-Vocabulary Semantic Segmentation", arXiv, 2023 (Korea University). [Paper][PyTorch][Website]
    • MVP-SEG: "MVP-SEG: Multi-View Prompt Learning for Open-Vocabulary Semantic Segmentation", arXiv, 2023 (Xiaohongshu, China). [Paper]
    • TagCLIP: "TagCLIP: Improving Discrimination Ability of Open-Vocabulary Semantic Segmentation", arXiv, 2023 (CUHK). [Paper]
    • VLPart: "Going Denser with Open-Vocabulary Part Segmentation", arXiv, 2023 (HKU). [Paper][PyTorch]
    • ZeroSeg: "Exploring Open-Vocabulary Semantic Segmentation without Human Labels", arXiv, 2023 (Meta). [Paper]
    • OVDiff: "Diffusion Models for Zero-Shot Open-Vocabulary Segmentation", arXiv, 2023 (Oxford). [Paper][Website]
    • HIPIE: "Hierarchical Open-vocabulary Universal Image Segmentation", arXiv, 2023 (Berkeley). [Paper][PyTorch][Website]
    • UOVN: "Unified Open-Vocabulary Dense Visual Prediction", arXiv, 2023 (Monash University). [Paper]
    • FC-CLIP: "Convolutions Die Hard: Open-Vocabulary Segmentation with Single Frozen Convolutional CLIP", arXiv, 2023 (ByteDance). [Paper]
    • ?: "Open-Vocabulary Semantic Segmentation via Attribute Decomposition-Aggregation", arXiv, 2023 (Shanghai Jiao Tong). [Paper]
  • Universal Segmentation:
    • K-Net: "K-Net: Towards Unified Image Segmentation", NeurIPS, 2021 (NTU, Singapore). [Paper][PyTorch]
    • Mask2Former: "Masked-attention Mask Transformer for Universal Image Segmentation", CVPR, 2022 (Meta). [Paper][PyTorch][Website]
    • MP-Former: "MP-Former: Mask-Piloted Transformer for Image Segmentation", CVPR, 2023 (IDEA). [Paper][Code (in construction)]
    • OneFormer: "OneFormer: One Transformer to Rule Universal Image Segmentation", CVPR, 2023 (Oregon). [Paper][PyTorch][Website]
    • UNINEXT: "Universal Instance Perception as Object Discovery and Retrieval", CVPR, 2023 (ByteDance). [Paper][PyTorch]
    • ClustSeg: "CLUSTSEG: Clustering for Universal Segmentation", ICML, 2023 (Rochester Institute of Technology). [Paper]
    • DaTaSeg: "DaTaSeg: Taming a Universal Multi-Dataset Multi-Task Segmentation Model", arXiv, 2023 (Google). [Paper]
    • DFormer: "DFormer: Diffusion-guided Transformer for Universal Image Segmentation", arXiv, 2023 (Tianjin University). [Paper][Code (in construction)]
    • ?: "A Critical Look at the Current Usage of Foundation Model for Dense Recognition Task", arXiv, 2023 (OMRON SINIC X, Japan). [Paper]
  • Multi-Modal:
    • UCTNet: "UCTNet: Uncertainty-Aware Cross-Modal Transformer Network for Indoor RGB-D Semantic Segmentation", ECCV, 2022 (Lehigh University, Pennsylvania). [Paper]
    • CMX: "CMX: Cross-Modal Fusion for RGB-X Semantic Segmentation with Transformers", arXiv, 2022 (Karlsruhe Institute of Technology, Germany). [Paper][PyTorch]
    • DeLiVER: "Delivering Arbitrary-Modal Semantic Segmentation", CVPR, 2023 (Karlsruhe Institute of Technology, Germany). [Paper][PyTorch][Website]
  • Panoptic Segmentation:
    • MaX-DeepLab: "MaX-DeepLab: End-to-End Panoptic Segmentation with Mask Transformers", CVPR, 2021 (Google). [Paper][PyTorch (conradry)]
    • SIAin: "An End-to-End Trainable Video Panoptic Segmentation Method using Transformers", arXiv, 2021 (SI Analytics, South Korea). [Paper]
    • VPS-Transformer: "Time-Space Transformers for Video Panoptic Segmentation", WACV, 2022 (Technical University of Cluj-Napoca, Romania). [Paper]
    • CMT-DeepLab: "CMT-DeepLab: Clustering Mask Transformers for Panoptic Segmentation", CVPR, 2022 (Google). [Paper]
    • Panoptic-SegFormer: "Panoptic SegFormer", CVPR, 2022 (Nanjing University). [Paper][PyTorch]
    • kMaX-DeepLab: "k-means Mask Transformer", ECCV, 2022 (Google). [Paper][Tensorflow]
    • Panoptic-PartFormer: "Panoptic-PartFormer: Learning a Unified Model for Panoptic Part Segmentation", ECCV, 2022 (Peking). [Paper][PyTorch]
    • CoMFormer: "CoMFormer: Continual Learning in Semantic and Panoptic Segmentation", CVPR, 2023 (Sorbonne Université, France). [Paper]
    • YOSO: "You Only Segment Once: Towards Real-Time Panoptic Segmentation", CVPR, 2023 (Xiamen University). [Paper][PyTorch]
    • PanopticPartFormer++: "PanopticPartFormer++: A Unified and Decoupled View for Panoptic Part Segmentation", arXiv, 2023 (Peking). [Paper][PyTorch]
  • Instance Segmentation:
    • ISTR: "ISTR: End-to-End Instance Segmentation with Transformers", arXiv, 2021 (Xiamen University). [Paper][PyTorch]
    • Mask-Transfiner: "Mask Transfiner for High-Quality Instance Segmentation", CVPR, 2022 (ETHZ). [Paper][PyTorch][Website]
    • BoundaryFormer: "Instance Segmentation With Mask-Supervised Polygonal Boundary Transformers", CVPR, 2022 (UCSD). [Paper]
    • PPT: "Parallel Pre-trained Transformers (PPT) for Synthetic Data-based Instance Segmentation", CVPRW, 2022 (ByteDance). [Paper]
    • TOIST: "TOIST: Task Oriented Instance Segmentation Transformer with Noun-Pronoun Distillation", NeurIPS, 2022 (Tsinghua University). [Paper][PyTorch]
    • MAL: "Vision Transformers Are Good Mask Auto-Labelers", CVPR, 2023 (NVIDIA). [Paper][PyTorch]
    • FastInst: "FastInst: A Simple Query-Based Model for Real-Time Instance Segmentation", CVPR, 2023 (Alibaba). [Paper][PyTorch]
    • SP: "Boosting Low-Data Instance Segmentation by Unsupervised Pre-training with Saliency Prompt", CVPR, 2023 (Northwestern Polytechnical University, China). [Paper]
    • X-Paste: "X-Paste: Revisiting Scalable Copy-Paste for Instance Segmentation using CLIP and StableDiffusion", ICML, 2023 (USTC). [Paper][PyTorch]
    • DynaMITe: "DynaMITe: Dynamic Query Bootstrapping for Multi-object Interactive Segmentation Transformer", arXiv, 2023 (RWTH Aachen University, Germany). [Paper][Code (in construction)][Website]
    • Mask-Frozen-DETR: "Mask Frozen-DETR: High Quality Instance Segmentation with One GPU", arXiv, 2023 (Microsoft). [Paper]
  • Optical Flow:
    • CRAFT: "CRAFT: Cross-Attentional Flow Transformer for Robust Optical Flow", CVPR, 2022 (A*STAR, Singapore). [Paper][PyTorch]
    • KPA-Flow: "Learning Optical Flow With Kernel Patch Attention", CVPR, 2022 (Megvii). [Paper][PyTorch (in construction)]
    • GMFlowNet: "Global Matching with Overlapping Attention for Optical Flow Estimation", CVPR, 2022 (Rutgers). [Paper][PyTorch]
    • FlowFormer: "FlowFormer: A Transformer Architecture for Optical Flow", ECCV, 2022 (CUHK). [Paper][Website]
    • TransFlow: "TransFlow: Transformer as Flow Learner", CVPR, 2023 (Rochester Institute of Technology). [Paper]
    • FlowFormer++: "FlowFormer++: Masked Cost Volume Autoencoding for Pretraining Optical Flow Estimation", CVPR, 2023 (CUHK). [Paper]
    • FlowFormer: "FlowFormer: A Transformer Architecture and Its Masked Cost Volume Autoencoding for Optical Flow", arXiv, 2023 (CUHK). [Paper]
  • Panoramic Semantic Segmentation:
    • Trans4PASS: "Bending Reality: Distortion-aware Transformers for Adapting to Panoramic Semantic Segmentation", CVPR, 2022 (Karlsruhe Institute of Technology, Germany). [Paper][PyTorch]
    • SGAT4PASS: "SGAT4PASS: Spherical Geometry-Aware Transformer for PAnoramic Semantic Segmentation", IJCAI, 2023 (Tencent). [Paper][Code (in construction)]
  • X-Shot:
    • CyCTR: "Few-Shot Segmentation via Cycle-Consistent Transformer", NeurIPS, 2021 (University of Technology Sydney). [Paper]
    • CATrans: "CATrans: Context and Affinity Transformer for Few-Shot Segmentation", IJCAI, 2022 (Baidu). [Paper]
    • VAT: "Cost Aggregation with 4D Convolutional Swin Transformer for Few-Shot Segmentation", ECCV, 2022 (Korea University). [Paper][PyTorch][Website]
    • DCAMA: "Dense Cross-Query-and-Support Attention Weighted Mask Aggregation for Few-Shot Segmentation", ECCV, 2022 (Tencent). [Paper]
    • AAFormer: "Adaptive Agent Transformer for Few-Shot Segmentation", ECCV, 2022 (USTC). [Paper]
    • IPMT: "Intermediate Prototype Mining Transformer for Few-Shot Semantic Segmentation", NeurIPS, 2022 (Northwestern Polytechnical University). [Paper][PyTorch]
    • TAFT: "Task-Adaptive Feature Transformer with Semantic Enrichment for Few-Shot Segmentation", arXiv, 2022 (KAIST). [Paper]
    • MSANet: "MSANet: Multi-Similarity and Attention Guidance for Boosting Few-Shot Segmentation", arXiv, 2022 (AiV Research Group, Korea). [Paper][PyTorch]
    • MuHS: "Suppressing the Heterogeneity: A Strong Feature Extractor for Few-shot Segmentation", ICLR, 2023 (Zhejiang University). [Paper]
    • VTM: "Universal Few-shot Learning of Dense Prediction Tasks with Visual Token Matching", ICLR, 2023 (KAIST). [Paper][PyTorch]
    • SegGPT: "SegGPT: Segmenting Everything In Context", ICCV, 2023 (BAAI). [Paper][PyTorch]
    • RefT: "Reference Twice: A Simple and Unified Baseline for Few-Shot Instance Segmentation", arXiv, 2023 (Tencent). [Paper][Code (in construction)]
    • ?: "Multi-Modal Prototypes for Open-Set Semantic Segmentation", arXiv, 2023 (Shanghai Jiao Tong). [Paper]
  • X-Supervised:
    • MCTformer: "Multi-class Token Transformer for Weakly Supervised Semantic Segmentation", CVPR, 2022 (The University of Western Australia). [Paper][Code (in construction)]
    • AFA: "Learning Affinity from Attention: End-to-End Weakly-Supervised Semantic Segmentation with Transformers", CVPR, 2022 (Wuhan University). [Paper][PyTorch]
    • HSG: "Unsupervised Hierarchical Semantic Segmentation with Multiview Cosegmentation and Clustering Transformers", CVPR, 2022 (Berkeley). [Paper][PyTorch]
    • CLIMS: "Cross Language Image Matching for Weakly Supervised Semantic Segmentation", CVPR, 2022 (Shenzhen University). [Paper][PyTorch]
    • ?: "Self-Supervised Pre-training of Vision Transformers for Dense Prediction Tasks", CVPRW, 2022 (Université Paris-Saclay, France). [Paper]
    • SegSwap: "Learning Co-segmentation by Segment Swapping for Retrieval and Discovery", CVPRW, 2022 (École des Ponts ParisTech). [Paper][PyTorch][Website]
    • ViT-PCM: "Max Pooling with Vision Transformers Reconciles Class and Shape in Weakly Supervised Semantic Segmentation", ECCV, 2022 (Sapienza University, Italy). [Paper][Tensorflow]
    • TransFGU: "TransFGU: A Top-down Approach to Fine-Grained Unsupervised Semantic Segmentation", ECCV, 2022 (Alibaba). [Paper][PyTorch]
    • TransCAM: "TransCAM: Transformer Attention-based CAM Refinement for Weakly Supervised Semantic Segmentation", arXiv, 2022 (University of Toronto). [Paper][PyTorch]
    • WegFormer: "WegFormer: Transformers for Weakly Supervised Semantic Segmentation", arXiv, 2022 (Tongji University, China). [Paper]
    • MaskDistill: "Discovering Object Masks with Transformers for Unsupervised Semantic Segmentation", arXiv, 2022 (KU Leuven). [Paper][PyTorch]
    • eX-ViT: "eX-ViT: A Novel eXplainable Vision Transformer for Weakly Supervised Semantic Segmentation", arXiv, 2022 (La Trobe University, Australia). [Paper]
    • TCC: "Transformer-CNN Cohort: Semi-supervised Semantic Segmentation by the Best of Both Students", arXiv, 2022 (Alibaba). [Paper]
    • SemFormer: "SemFormer: Semantic Guided Activation Transformer for Weakly Supervised Semantic Segmentation", arXiv, 2022 (Shenzhen University). [Paper][PyTorch]
    • CLIP-ES: "CLIP is Also an Efficient Segmenter: A Text-Driven Approach for Weakly Supervised Semantic Segmentation", CVPR, 2023 (Zhejiang University). [Paper][PyTorch]
    • ToCo: "Token Contrast for Weakly-Supervised Semantic Segmentation", CVPR, 2023 (JD). [Paper][PyTorch]
    • DPF: "DPF: Learning Dense Prediction Fields with Weak Supervision", CVPR, 2023 (Tsinghua). [Paper][PyTorch]
    • SemiCVT: "SemiCVT: Semi-Supervised Convolutional Vision Transformer for Semantic Segmentation", CVPR, 2023 (Zhejiang University). [Paper]
    • AttentionShift: "AttentionShift: Iteratively Estimated Part-Based Attention Map for Pointly Supervised Instance Segmentation", CVPR, 2023 (CAS). [Paper]
    • MMCST: "Learning Multi-Modal Class-Specific Tokens for Weakly Supervised Dense Object Localization", CVPR, 2023 (The University of Western Australia). [Paper]
    • SimSeg: "A Simple Framework for Text-Supervised Semantic Segmentation", CVPR, 2023 (ByteDance). [Paper][Code (in construction)]
    • SIM: "SIM: Semantic-aware Instance Mask Generation for Box-Supervised Instance Segmentation", CVPR, 2023 (The Hong Kong Polytechnic University). [Paper][PyTorch (in construction)]
    • Point2Mask: "Point2Mask: Point-supervised Panoptic Segmentation via Optimal Transport", ICCV, 2023 (Zhejiang). [Paper][PyTorch]
    • VLOSS: "Towards Universal Vision-language Omni-supervised Segmentation", arXiv, 2023 (Harbin Institute of Technology). [Paper]
    • MECPformer: "MECPformer: Multi-estimations Complementary Patch with CNN-Transformers for Weakly Supervised Semantic Segmentation", arXiv, 2023 (Tongji University). [Paper][Code (in construction)]
    • BoxSnake: "BoxSnake: Polygonal Instance Segmentation with Box Supervision", arXiv, 2023 (Tencent). [Paper]
    • WeakTr: "WeakTr: Exploring Plain Vision Transformer for Weakly-supervised Semantic Segmentation", arXiv, 2023 (Huazhong University of Science and Technology). [Paper][PyTorch]
    • SAM-WSSS: "An Alternative to WSSS? An Empirical Study of the Segment Anything Model (SAM) on Weakly-Supervised Semantic Segmentation Problems", arXiv, 2023 (ANU). [Paper]
    • ?: "Segment Anything is A Good Pseudo-label Generator for Weakly Supervised Semantic Segmentation", arXiv, 2023 (Zhejiang University + Nankai University). [Paper]
    • AReAM: "Mitigating Undisciplined Over-Smoothing in Transformer for Weakly Supervised Semantic Segmentation", arXiv, 2023 (Zhejiang University). [Paper]
    • SEPL: "Segment Anything Model (SAM) Enhanced Pseudo Labels for Weakly Supervised Semantic Segmentation", arXiv, 2023 (OSU). [Paper][Code (in construction)]
    • PaintSeg: "PaintSeg: Training-free Segmentation via Painting", arXiv, 2023 (Microsoft). [Paper]
    • MIMIC: "MIMIC: Masked Image Modeling with Image Correspondences", arXiv, 2023 (UW). [Paper][PyTorch]
    • POLE: "Prompting classes: Exploring the Power of Prompt Class Learning in Weakly Supervised Semantic Segmentation", arXiv, 2023 (ETS Montreal, Canada). [Paper][PyTorch]
    • GD: "Guided Distillation for Semi-Supervised Instance Segmentation", arXiv, 2023 (Meta). [Paper]
    • MCTformer+: "MCTformer+: Multi-Class Token Transformer for Weakly Supervised Semantic Segmentation", arXiv, 2023 (The University of Western Australia). [Paper][PyTorch]
    • MMC: "Masked Momentum Contrastive Learning for Zero-shot Semantic Understanding", arXiv, 2023 (University of Surrey, UK). [Paper]
    • CRATE: "Emergence of Segmentation with Minimalistic White-Box Transformers", arXiv, 2023 (Berkeley). [Paper][PyTorch]
  • Cross-Domain:
    • DAFormer: "DAFormer: Improving Network Architectures and Training Strategies for Domain-Adaptive Semantic Segmentation", CVPR, 2022 (ETHZ). [Paper][PyTorch]
    • ?: "Exploring Consistency in Cross-Domain Transformer for Domain Adaptive Semantic Segmentation", arXiv, 2022 (Boston). [Paper]
    • HGFormer: "HGFormer: Hierarchical Grouping Transformer for Domain Generalized Semantic Segmentation", CVPR, 2023 (Wuhan University). [Paper][Code (in construction)]
    • UniDAformer: "UniDAformer: Unified Domain Adaptive Panoptic Segmentation Transformer via Hierarchical Mask Calibration", CVPR, 2023 (NTU, Singapore). [Paper]
    • MIC: "MIC: Masked Image Consistency for Context-Enhanced Domain Adaptation", CVPR, 2023 (ETHZ). [Paper][PyTorch]
    • PTDiffSeg: "Prompting Diffusion Representations for Cross-Domain Semantic Segmentation", arXiv, 2023 (ETHZ). [Paper][Code (in construction)]
  • Continual Learning:
    • TISS: "Delving into Transformer for Incremental Semantic Segmentation", arXiv, 2022 (Tencent). [Paper]
    • Incrementer: "Incrementer: Transformer for Class-Incremental Semantic Segmentation With Knowledge Distillation Focusing on Old Class", CVPR, 2023 (University of Electronic Science and Technology of China). [Paper]
  • Crack Detection:
    • CrackFormer: "CrackFormer: Transformer Network for Fine-Grained Crack Detection", ICCV, 2021 (Nanjing University of Science and Technology). [Paper]
  • Camouflaged/Concealed Object:
    • UGTR: "Uncertainty-Guided Transformer Reasoning for Camouflaged Object Detection", ICCV, 2021 (Group42, Abu Dhabi). [Paper][PyTorch]
    • COD: "Boosting Camouflaged Object Detection with Dual-Task Interactive Transformer", ICPR, 2022 (Anhui University, China). [Paper][Code (in construction)]
    • OSFormer: "OSFormer: One-Stage Camouflaged Instance Segmentation with Transformers", ECCV, 2022 (Huazhong University of Science and Technology). [Paper][PyTorch]
    • FSPNet: "Feature Shrinkage Pyramid for Camouflaged Object Detection with Transformers", CVPR, 2023 (Sichuan Changhong Electric, China). [Paper][PyTorch][Website]
    • MFG: "Weakly-Supervised Concealed Object Segmentation with SAM-based Pseudo Labeling and Multi-scale Feature Grouping", arXiv, 2023 (Tsinghua). [Paper]
  • Background Separation:
    • TransBlast: "TransBlast: Self-Supervised Learning Using Augmented Subspace With Transformer for Background/Foreground Separation", ICCVW, 2021 (University of British Columbia). [Paper]
  • Scene Understanding:
    • BANet: "Transformer Meets Convolution: A Bilateral Awareness Network for Semantic Segmentation of Very Fine Resolution Urban Scene Images", arXiv, 2021 (Wuhan University). [Paper]
    • Cerberus-Transformer: "Cerberus Transformer: Joint Semantic, Affordance and Attribute Parsing", CVPR, 2022 (Tsinghua University). [Paper][PyTorch]
    • IRISformer: "IRISformer: Dense Vision Transformers for Single-Image Inverse Rendering in Indoor Scenes", CVPR, 2022 (UCSD). [Paper][Code (in construction)]
  • 3D Segmentation:
    • Stratified-Transformer: "Stratified Transformer for 3D Point Cloud Segmentation", CVPR, 2022 (CUHK). [Paper][PyTorch]
    • CodedVTR: "CodedVTR: Codebook-based Sparse Voxel Transformer with Geometric Guidance", CVPR, 2022 (Tsinghua). [Paper]
    • M2F3D: "M2F3D: Mask2Former for 3D Instance Segmentation", CVPRW, 2022 (RWTH Aachen University, Germany). [Paper][Website]
    • 3DSeg: "3D Segmenter: 3D Transformer based Semantic Segmentation via 2D Panoramic Distillation", ICLR, 2023 (The University of Tokyo). [Paper]
    • Analogical-Network: "Analogical Networks for Memory-Modulated 3D Parsing", ICLR, 2023 (CMU). [Paper]
    • VoxFormer: "VoxFormer: Sparse Voxel Transformer for Camera-based 3D Semantic Scene Completion", CVPR, 2023 (NVIDIA). [Paper][PyTorch]
    • GrowSP: "GrowSP: Unsupervised Semantic Segmentation of 3D Point Clouds", CVPR, 2023 (The Hong Kong Polytechnic University). [Paper][PyTorch]
    • RangeViT: "RangeViT: Towards Vision Transformers for 3D Semantic Segmentation in Autonomous Driving", CVPR, 2023 (Valeo.ai, France). [Paper][Code (in construction)]
    • MeshFormer: "Heat Diffusion based Multi-scale and Geometric Structure-aware Transformer for Mesh Segmentation", CVPR, 2023 (University of Macau). [Paper]
    • MSeg3D: "MSeg3D: Multi-modal 3D Semantic Segmentation for Autonomous Driving", CVPR, 2023 (Zhejiang University). [Paper][PyTorch]
    • SGVF-SVFE: "See More and Know More: Zero-shot Point Cloud Segmentation via Multi-modal Visual Data", ICCV, 2023 (ShanghaiTech). [Paper]
    • SVQNet: "SVQNet: Sparse Voxel-Adjacent Query Network for 4D Spatio-Temporal LiDAR Semantic Segmentation", ICCV, 2023 (Tsinghua). [Paper]
    • MAF-Transformer: "Mask-Attention-Free Transformer for 3D Instance Segmentation", ICCV, 2023 (CUHK). [Paper][PyTorch]
    • P3Former: "Position-Guided Point Cloud Panoptic Segmentation Transformer", arXiv, 2023 (Shanghai AI Lab). [Paper][Code (in construction)]
    • UnScene3D: "UnScene3D: Unsupervised 3D Instance Segmentation for Indoor Scenes", arXiv, 2023 (TUM). [Paper][Website]
    • CNS: "Towards Label-free Scene Understanding by Vision Foundation Models", arXiv, 2023 (HKU). [Paper][Code (in construction)]
    • Contrastive-Lift: "Contrastive Lift: 3D Object Instance Segmentation by Slow-Fast Contrastive Fusion", arXiv, 2023 (Oxford). [Paper]
    • DCTNet: "Dynamic Clustering Transformer Network for Point Cloud Segmentation", arXiv, 2023 (University of Waterloo, Canada). [Paper]
    • SPT: "Efficient 3D Semantic Segmentation with Superpoint Transformer", arXiv, 2023 (Univ Gustave Eiffel, France). [Paper][PyTorch]
    • Symphonies: "Symphonize 3D Semantic Scene Completion with Contextual Instance Queries", arXiv, 2023 (Horizon Robotics). [Paper][PyTorch]
    • CVSformer: "CVSformer: Cross-View Synthesis Transformer for Semantic Scene Completion", arXiv, 2023 (Tianjin University). [Paper]
    • TFS3D: "Less is More: Towards Efficient Few-shot 3D Semantic Segmentation via Training-free Networks", arXiv, 2023 (CUHK). [Paper][PyTorch]
    • CIP-WPIS: "When 3D Bounding-Box Meets SAM: Point Cloud Instance Segmentation with Weak-and-Noisy Supervision", arXiv, 2023 (Australian National University). [Paper]
  • Multi-Task:
    • InvPT: "Inverted Pyramid Multi-task Transformer for Dense Scene Understanding", ECCV, 2022 (HKUST). [Paper][PyTorch]
    • MTFormer: "MTFormer: Multi-task Learning via Transformer and Cross-Task Reasoning", ECCV, 2022 (CUHK). [Paper]
    • MQTransformer: "Multi-Task Learning with Multi-Query Transformer for Dense Prediction", arXiv, 2022 (Wuhan University). [Paper]
    • DeMT: "DeMT: Deformable Mixer Transformer for Multi-Task Learning of Dense Prediction", AAAI, 2023 (Wuhan University). [Paper][PyTorch]
    • TaskPrompter: "TaskPrompter: Spatial-Channel Multi-Task Prompting for Dense Scene Understanding", ICLR, 2023 (HKUST). [Paper][PyTorch (in construction)]
    • InvPT++: "InvPT++: Inverted Pyramid Multi-Task Transformer for Visual Scene Understanding", arXiv, 2023 (HKUST). [Paper]
    • DeMTG: "Deformable Mixer Transformer with Gating for Multi-Task Learning of Dense Prediction", arXiv, 2023 (Wuhan University). [Paper][PyTorch]
  • Forecasting:
    • DiffAttn: "Joint Forecasting of Panoptic Segmentations with Difference Attention", CVPR, 2022 (UIUC). [Paper][Code (in construction)]
  • LiDAR:
    • HelixNet: "Online Segmentation of LiDAR Sequences: Dataset and Algorithm", CVPRW, 2022 (CNRS, France). [Paper][Website][PyTorch]
    • Gaussian-Radar-Transformer: "Gaussian Radar Transformer for Semantic Segmentation in Noisy Radar Data", RA-L, 2022 (University of Bonn, Germany). [Paper]
  • Co-Segmentation:
    • ReCo: "ReCo: Retrieve and Co-segment for Zero-shot Transfer", NeurIPS, 2022 (Oxford). [Paper][PyTorch][Website]
    • DINO-ViT-feature: "Deep ViT Features as Dense Visual Descriptors", arXiv, 2022 (Weizmann Institute of Science, Israel). [Paper][PyTorch][Website]
    • LCCo: "LCCo: Lending CLIP to Co-Segmentation", arXiv, 2023 (Beijing Institute of Technology). [Paper]
  • Top-Down Semantic Segmentation:
    • Trans4Map: "Trans4Map: Revisiting Holistic Top-down Mapping from Egocentric Images to Allocentric Semantics with Vision Transformers", arXiv, 2022 (Karlsruhe Institute of Technology, Germany). [Paper]
  • Surface Normal:
    • Normal-Transformer: "Normal Transformer: Extracting Surface Geometry from LiDAR Points Enhanced by Visual Semantics", arXiv, 2022 (University of Technology Sydney). [Paper]
  • Applications:
    • FloodTransformer: "Transformer-based Flood Scene Segmentation for Developing Countries", NeurIPSW, 2022 (BITS Pilani, India). [Paper]
  • Diffusion:
    • VPD: "Unleashing Text-to-Image Diffusion Models for Visual Perception", ICCV, 2023 (Tsinghua University). [Paper][PyTorch][Website]
    • DiffSeg: "Diffuse, Attend, and Segment: Unsupervised Zero-Shot Segmentation using Stable Diffusion", arXiv, 2023 (Georgia Tech). [Paper]
    • DiffSegmenter: "Diffusion Model is Secretly a Training-free Open Vocabulary Semantic Segmenter", arXiv, 2023 (Beihang University). [Paper]
    • ?: "From Text to Mask: Localizing Entities Using the Attention of Text-to-Image Diffusion Models", arXiv, 2023 (Tsinghua). [Paper]
  • Low-Level Structure Segmentation:
    • EVP: "Explicit Visual Prompting for Low-Level Structure Segmentations", CVPR, 2023 (Tencent). [Paper][PyTorch]
    • EVP: "Explicit Visual Prompting for Universal Foreground Segmentations", arXiv, 2023 (Tencent). [Paper][PyTorch]
  • Zero-Guidance Segmentation:
    • zero-guide-seg: "Zero-guidance Segmentation Using Zero Segment Labels", arXiv, 2023 (VISTEC, Thailand). [Paper][Website]
  • Part Segmentation:
    • OPS: "Towards Open-World Segmentation of Parts", CVPR, 2023 (Adobe). [Paper][PyTorch]
    • PartDistillation: "PartDistillation: Learning Parts from Instance Segmentation", CVPR, 2023 (Meta). [Paper]
  • Entity Segmentation:
    • AIMS: "AIMS: All-Inclusive Multi-Level Segmentation", arXiv, 2023 (UC Merced). [Paper][PyTorch]
  • Evaluation:
    • ?: "Robustness Analysis on Foundational Segmentation Models", arXiv, 2023 (UCF). [Paper][PyTorch]
  • Interactive Segmentation:
    • iCMFormer: "Interactive Image Segmentation with Cross-Modality Vision Transformers", arXiv, 2023 (University of Twente, Netherlands). [Paper][Code (in construction)]
  • Amodal Segmentation:
    • AISFormer: "AISFormer: Amodal Instance Segmentation with Transformer", BMVC, 2022 (University of Arkansas, Arkansas). [Paper][PyTorch]
    • C2F-Seg: "Coarse-to-Fine Amodal Segmentation with Shape Prior", ICCV, 2023 (Fudan). [Paper][Code (in construction)][Website]
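
Several of the query-based segmenters above (e.g., the Mask2Former-based M2F3D) share one decoding pattern: a set of learned object queries cross-attends to pixel features, and each query then emits a class score plus a mask obtained as a dot product with the per-pixel embeddings. The following PyTorch fragment is a minimal sketch of that pattern only, under assumed sizes (256-d features, 100 queries, a single attention layer); it is not any listed paper's implementation.

```python
import torch
import torch.nn as nn

class MaskQueryHead(nn.Module):
    """Bare-bones mask-classification decoding (illustrative assumptions only):
    learned queries cross-attend to pixel features; each query predicts a class
    and a mask via a dot product with the per-pixel embeddings."""
    def __init__(self, dim=256, num_queries=100, num_classes=80):
        super().__init__()
        self.queries = nn.Embedding(num_queries, dim)          # learned object queries
        self.cross_attn = nn.MultiheadAttention(dim, 8, batch_first=True)
        self.cls = nn.Linear(dim, num_classes + 1)             # +1 for "no object"

    def forward(self, pixel_feats):                            # (B, H*W, dim)
        B = pixel_feats.size(0)
        q = self.queries.weight.unsqueeze(0).repeat(B, 1, 1)   # (B, Q, dim)
        q, _ = self.cross_attn(q, pixel_feats, pixel_feats)    # queries read the pixels
        class_logits = self.cls(q)                             # (B, Q, num_classes+1)
        mask_logits = q @ pixel_feats.transpose(1, 2)          # (B, Q, H*W)
        return class_logits, mask_logits

head = MaskQueryHead()
cls_logits, mask_logits = head(torch.randn(2, 64 * 64, 256))
print(cls_logits.shape, mask_logits.shape)  # (2, 100, 81) (2, 100, 4096)
```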

[Back to Overview]

Video (High-level)

Action Recognition

  • RGB mainly
    • Action Transformer: "Video Action Transformer Network", CVPR, 2019 (DeepMind). [Paper][Code (ppriyank)]
    • ViViT-Ensemble: "Towards Training Stronger Video Vision Transformers for EPIC-KITCHENS-100 Action Recognition", CVPRW, 2021 (Alibaba). [Paper]
    • TimeSformer: "Is Space-Time Attention All You Need for Video Understanding?", ICML, 2021 (Facebook). [Paper][PyTorch (lucidrains)] (see the divided space-time attention sketch at the end of this section)
    • MViT: "Multiscale Vision Transformers", ICCV, 2021 (Facebook). [Paper][PyTorch]
    • VidTr: "VidTr: Video Transformer Without Convolutions", ICCV, 2021 (Amazon). [Paper][PyTorch]
    • ViViT: "ViViT: A Video Vision Transformer", ICCV, 2021 (Google). [Paper][PyTorch (rishikksh20)]
    • VTN: "Video Transformer Network", ICCVW, 2021 (Theator). [Paper][PyTorch]
    • TokShift: "Token Shift Transformer for Video Classification", ACMMM, 2021 (CUHK). [Paper][PyTorch]
    • Motionformer: "Keeping Your Eye on the Ball: Trajectory Attention in Video Transformers", NeurIPS, 2021 (Facebook). [Paper][PyTorch][Website]
    • X-ViT: "Space-time Mixing Attention for Video Transformer", NeurIPS, 2021 (Samsung). [Paper][PyTorch]
    • SCT: "Shifted Chunk Transformer for Spatio-Temporal Representational Learning", NeurIPS, 2021 (Kuaishou). [Paper]
    • RSANet: "Relational Self-Attention: What's Missing in Attention for Video Understanding", NeurIPS, 2021 (POSTECH). [Paper][PyTorch][Website]
    • STAM: "An Image is Worth 16x16 Words, What is a Video Worth?", arXiv, 2021 (Alibaba). [Paper][Code]
    • GAT: "Enhancing Transformer for Video Understanding Using Gated Multi-Level Attention and Temporal Adversarial Training", arXiv, 2021 (Samsung). [Paper]
    • TokenLearner: "TokenLearner: What Can 8 Learned Tokens Do for Images and Videos?", arXiv, 2021 (Google). [Paper]
    • VLF: "VideoLightFormer: Lightweight Action Recognition using Transformers", arXiv, 2021 (The University of Sheffield). [Paper]
    • UniFormer: "UniFormer: Unified Transformer for Efficient Spatiotemporal Representation Learning", ICLR, 2022 (CAS + SenseTime). [Paper][PyTorch]
    • Video-Swin: "Video Swin Transformer", CVPR, 2022 (Microsoft). [Paper][PyTorch]
    • DirecFormer: "DirecFormer: A Directed Attention in Transformer Approach to Robust Action Recognition", CVPR, 2022 (University of Arkansas). [Paper][Code (in construction)]
    • DVT: "Deformable Video Transformer", CVPR, 2022 (Meta). [Paper]
    • MeMViT: "MeMViT: Memory-Augmented Multiscale Vision Transformer for Efficient Long-Term Video Recognition", CVPR, 2022 (Meta). [Paper]
    • MLP-3D: "MLP-3D: A MLP-like 3D Architecture with Grouped Time Mixing", CVPR, 2022 (JD). [Paper][PyTorch (in construction)]
    • RViT: "Recurring the Transformer for Video Action Recognition", CVPR, 2022 (TCL Corporate Research, HK). [Paper]
    • SIFA: "Stand-Alone Inter-Frame Attention in Video Models", CVPR, 2022 (JD). [Paper][PyTorch]
    • MViTv2: "MViTv2: Improved Multiscale Vision Transformers for Classification and Detection", CVPR, 2022 (Meta). [Paper][PyTorch]
    • MTV: "Multiview Transformers for Video Recognition", CVPR, 2022 (Google). [Paper][Tensorflow]
    • ORViT: "Object-Region Video Transformers", CVPR, 2022 (Tel Aviv). [Paper][Website]
    • TIME: "Time Is MattEr: Temporal Self-supervision for Video Transformers", ICML, 2022 (KAIST). [Paper][PyTorch]
    • TPS: "Spatiotemporal Self-attention Modeling with Temporal Patch Shift for Action Recognition", ECCV, 2022 (Alibaba). [Paper][PyTorch]
    • DualFormer: "DualFormer: Local-Global Stratified Transformer for Efficient Video Recognition", ECCV, 2022 (Sea AI Lab). [Paper][PyTorch]
    • STTS: "Efficient Video Transformers with Spatial-Temporal Token Selection", ECCV, 2022 (Fudan University). [Paper][PyTorch]
    • Turbo: "Turbo Training with Token Dropout", BMVC, 2022 (Oxford). [Paper]
    • MultiTrain: "Multi-dataset Training of Transformers for Robust Action Recognition", NeurIPS, 2022 (Tencent). [Paper][Code (in construction)]
    • SViT: "Bringing Image Scene Structure to Video via Frame-Clip Consistency of Object Tokens", NeurIPS, 2022 (Tel Aviv). [Paper][Website]
    • ST-Adapter: "ST-Adapter: Parameter-Efficient Image-to-Video Transfer Learning", NeurIPS, 2022 (CUHK). [Paper][Code (in construction)]
    • ATA: "Alignment-guided Temporal Attention for Video Action Recognition", NeurIPS, 2022 (Microsoft). [Paper]
    • AIA: "Attention in Attention: Modeling Context Correlation for Efficient Video Classification", TCSVT, 2022 (University of Science and Technology of China). [Paper][PyTorch]
    • MSCA: "Vision Transformer with Cross-attention by Temporal Shift for Efficient Action Recognition", arXiv, 2022 (Nagoya Institute of Technology). [Paper]
    • VAST: "Efficient Attention-free Video Shift Transformers", arXiv, 2022 (Samsung). [Paper]
    • Video-MobileFormer: "Video Mobile-Former: Video Recognition with Efficient Global Spatial-temporal Modeling", arXiv, 2022 (Microsoft). [Paper]
    • MAM2: "It Takes Two: Masked Appearance-Motion Modeling for Self-supervised Video Transformer Pre-training", arXiv, 2022 (Baidu). [Paper]
    • ?: "Linear Video Transformer with Feature Fixation", arXiv, 2022 (SenseTime). [Paper]
    • STAN: "Two-Stream Transformer Architecture for Long Video Understanding", arXiv, 2022 (The University of Surrey, UK). [Paper]
    • UniFormerV2: "UniFormerV2: Spatiotemporal Learning by Arming Image ViTs with Video UniFormer", arXiv, 2022 (CAS). [Paper][PyTorch]
    • PatchBlender: "PatchBlender: A Motion Prior for Video Transformers", arXiv, 2022 (Mila). [Paper]
    • DualPath: "Dual-path Adaptation from Image to Video Transformers", CVPR, 2023 (Yonsei University). [Paper][PyTorch (in construction)]
    • S-ViT: "Streaming Video Model", CVPR, 2023 (Microsoft). [Paper][Code (in construction)]
    • TubeViT: "Rethinking Video ViTs: Sparse Video Tubes for Joint Image and Video Learning", CVPR, 2023 (Google). [Paper]
    • AdaMAE: "AdaMAE: Adaptive Masking for Efficient Spatiotemporal Learning with Masked Autoencoders", CVPR, 2023 (JHU). [Paper][PyTorch]
    • ObjectViViT: "How can objects help action recognition?", CVPR, 2023 (Google). [Paper]
    • Hiera: "Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles", ICML, 2023 (Meta). [Paper][PyTorch]
    • Video-FocalNet: "Video-FocalNets: Spatio-Temporal Focal Modulation for Video Action Recognition", ICCV, 2023 (MBZUAI). [Paper][PyTorch][Website]
    • ATM: "What Can Simple Arithmetic Operations Do for Temporal Modeling?", ICCV, 2023 (Baidu). [Paper][Code (in construction)]
    • STA: "Prune Spatio-temporal Tokens by Semantic-aware Temporal Accumulation", ICCV, 2023 (Huawei). [Paper]
    • Helping-Hands: "Helping Hands: An Object-Aware Ego-Centric Video Recognition Model", ICCV, 2023 (Oxford). [Paper][PyTorch]
    • SUM-L: "Learning from Semantic Alignment between Unpaired Multiviews for Egocentric Video Recognition", ICCV, 2023 (University of Delaware, Delaware). [Paper][Code (in construction)]
    • BEAR: "A Large-scale Study of Spatiotemporal Representation Learning with a New Benchmark on Action Recognition", arXiv, 2023 (UCF). [Paper][GitHub]
    • SVT: "SVT: Supertoken Video Transformer for Efficient Video Understanding", arXiv, 2023 (Meta). [Paper]
    • PLAR: "Prompt Learning for Action Recognition", arXiv, 2023 (Maryland). [Paper]
    • SFA-ViViT: "Optimizing ViViT Training: Time and Memory Reduction for Action Recognition", arXiv, 2023 (Google). [Paper]
    • TAdaConv: "Temporally-Adaptive Models for Efficient Video Understanding", arXiv, 2023 (NUS). [Paper][PyTorch]
  • Depth:
    • Trear: "Trear: Transformer-based RGB-D Egocentric Action Recognition", IEEE Transactions on Cognitive and Developmental Systems, 2021 (Tianjin University). [Paper]
  • Pose/Skeleton:
    • ST-TR: "Spatial Temporal Transformer Network for Skeleton-based Action Recognition", ICPRW, 2020 (Polytechnic University of Milan). [Paper]
    • AcT: "Action Transformer: A Self-Attention Model for Short-Time Human Action Recognition", arXiv, 2021 (Politecnico di Torino, Italy). [Paper][Code (in construction)]
    • STAR: "STAR: Sparse Transformer-based Action Recognition", arXiv, 2021 (UCLA). [Paper]
    • GCsT: "GCsT: Graph Convolutional Skeleton Transformer for Action Recognition", arXiv, 2021 (CAS). [Paper]
    • GL-Transformer: "Global-local Motion Transformer for Unsupervised Skeleton-based Action Learning", ECCV, 2022 (Seoul National University). [Paper][PyTorch]
    • ?: "Pose Uncertainty Aware Movement Synchrony Estimation via Spatial-Temporal Graph Transformer", International Conference on Multimodal Interaction (ICMI), 2022 (University of Delaware). [Paper]
    • FG-STFormer: "Focal and Global Spatial-Temporal Transformer for Skeleton-based Action Recognition", ACCV, 2022 (Zhengzhou University). [Paper]
    • STTFormer: "Spatio-Temporal Tuples Transformer for Skeleton-Based Action Recognition", arXiv, 2022 (Xidian University). [Paper][Code (in construction)]
    • ProFormer: "ProFormer: Learning Data-efficient Representations of Body Movement with Prototype-based Feature Augmentation and Visual Transformers", arXiv, 2022 (Karlsruhe Institute of Technology, Germany). [Paper][PyTorch]
    • ?: "Spatial Transformer Network with Transfer Learning for Small-scale Fine-grained Skeleton-based Tai Chi Action Recognition", arXiv, 2022 (Harbin Institute of Technology). [Paper]
    • HyperSA: "Hypergraph Transformer for Skeleton-based Action Recognition", arXiv, 2022 (University of Mannheim, Germany). [Paper]
    • STAR-Transformer: "STAR-Transformer: A Spatio-temporal Cross Attention Transformer for Human Action Recognition", WACV, 2023 (Keimyung University, Korea). [Paper]
    • STMT: "STMT: A Spatial-Temporal Mesh Transformer for MoCap-Based Action Recognition", CVPR, 2023 (CMU). [Paper][Code (in construction)]
    • SkeletonMAE: "SkeletonMAE: Graph-based Masked Autoencoder for Skeleton Sequence Pre-training", ICCV, 2023 (Sun Yat-sen University). [Paper][Code (in construction)]
    • MAMP: "Masked Motion Predictors are Strong 3D Action Representation Learners", ICCV, 2023 (USTC). [Paper][Code (in construction)]
    • LAC: "LAC - Latent Action Composition for Skeleton-based Action Segmentation", ICCV, 2023 (INRIA). [Paper][Website]
    • PCM3: "Prompted Contrast with Masked Motion Modeling: Towards Versatile 3D Action Representation Learning", ACMMM, 2023 (Peking). [Paper][Website]
    • PoseAwareVT: "Seeing the Pose in the Pixels: Learning Pose-Aware Representations in Vision Transformers", arXiv, 2023 (Amazon). [Paper][PyTorch]
  • Multi-modal:
    • MBT: "Attention Bottlenecks for Multimodal Fusion", NeurIPS, 2021 (Google). [Paper]
    • MM-ViT: "MM-ViT: Multi-Modal Video Transformer for Compressed Video Action Recognition", WACV, 2022 (OPPO). [Paper]
    • MMT-NCRC: "Multimodal Transformer for Nursing Activity Recognition", CVPRW, 2022 (UCF). [Paper][Code (in construction)]
    • M&M: "M&M Mix: A Multimodal Multiview Transformer Ensemble", CVPRW, 2022 (Google). [Paper]
    • VT-CE: "Combined CNN Transformer Encoder for Enhanced Fine-grained Human Action Recognition", CVPRW, 2022 (A*STAR). [Paper]
    • Hi-TRS: "Hierarchically Self-Supervised Transformer for Human Skeleton Representation Learning", ECCV, 2022 (Rutgers). [Paper][PyTorch]
    • MVFT: "Multi-View Fusion Transformer for Sensor-Based Human Activity Recognition", arXiv, 2022 (Alibaba). [Paper]
    • MOV: "Multimodal Open-Vocabulary Video Classification via Pre-Trained Vision and Language Models", arXiv, 2022 (Google). [Paper]
    • MotionBERT: "MotionBERT: Unified Pretraining for Human Motion Analysis", arXiv, 2022 (Peking University). [Paper][Code (in construction)][Website]
    • 3Mformer: "3Mformer: Multi-order Multi-mode Transformer for Skeletal Action Recognition", CVPR, 2023 (ANU). [Paper]
    • UMT: "On Uni-Modal Feature Learning in Supervised Multi-Modal Learning", ICML, 2023 (Tsinghua). [Paper]
    • ?: "Multimodal Distillation for Egocentric Action Recognition", ICCV, 2023 (KU Leuven). [Paper]
  • Group Activity:
    • GroupFormer: "GroupFormer: Group Activity Recognition with Clustered Spatial-Temporal Transformer", ICCV, 2021 (SenseTime). [Paper]
    • ?: "Hunting Group Clues with Transformers for Social Group Activity Recognition", ECCV, 2022 (Hitachi). [Paper]
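
A recurring design among the RGB-based models above is factorized ("divided") space-time attention, as in TimeSformer and ViViT: tokens first attend across time at each spatial location, then across space within each frame, which keeps cost far below full space-time attention. The block below is a minimal sketch under assumed sizes (768-d tokens, 12 heads, pre-norm residuals); real models add CLS-token handling, MLP sublayers, and positional embeddings.

```python
import torch
import torch.nn as nn

class DividedSpaceTimeBlock(nn.Module):
    """Illustrative divided space-time attention: temporal attention over frames,
    then spatial attention within each frame. Hyper-parameters are assumptions."""
    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, x):  # x: (B, T, N, D) patch tokens for T frames, N patches
        B, T, N, D = x.shape
        # Temporal attention: each spatial location attends across time.
        xt = x.permute(0, 2, 1, 3).reshape(B * N, T, D)
        h = self.norm1(xt)
        xt = xt + self.temporal_attn(h, h, h)[0]
        x = xt.reshape(B, N, T, D).permute(0, 2, 1, 3)
        # Spatial attention: each frame's patches attend to one another.
        xs = x.reshape(B * T, N, D)
        h = self.norm2(xs)
        xs = xs + self.spatial_attn(h, h, h)[0]
        return xs.reshape(B, T, N, D)

tokens = torch.randn(2, 8, 196, 768)  # 8 frames of 14x14 patch tokens
print(DividedSpaceTimeBlock()(tokens).shape)  # torch.Size([2, 8, 196, 768])
```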

[Back to Overview]

Action Detection/Localization

  • OadTR: "OadTR: Online Action Detection with Transformers", ICCV, 2021 (Huazhong University of Science and Technology). [Paper][PyTorch]
  • RTD-Net: "Relaxed Transformer Decoders for Direct Action Proposal Generation", ICCV, 2021 (Nanjing University). [Paper][PyTorch]
  • FS-TAL: "Few-Shot Temporal Action Localization with Query Adaptive Transformer", BMVC, 2021 (University of Surrey, UK). [Paper][PyTorch]
  • LSTR: "Long Short-Term Transformer for Online Action Detection", NeurIPS, 2021 (Amazon). [Paper][PyTorch][Website]
  • ATAG: "Augmented Transformer with Adaptive Graph for Temporal Action Proposal Generation", arXiv, 2021 (Alibaba). [Paper]
  • TAPG-Transformer: "Temporal Action Proposal Generation with Transformers", arXiv, 2021 (Harbin Institute of Technology). [Paper]
  • TadTR: "End-to-end Temporal Action Detection with Transformer", arXiv, 2021 (Alibaba). [Paper][Code (in construction)]
  • Vidpress-Soccer: "Feature Combination Meets Attention: Baidu Soccer Embeddings and Transformer based Temporal Detection", arXiv, 2021 (Baidu). [Paper][GitHub]
  • MS-TCT: "MS-TCT: Multi-Scale Temporal ConvTransformer for Action Detection", CVPR, 2022 (INRIA). [Paper][PyTorch]
  • UGPT: "Uncertainty-Guided Probabilistic Transformer for Complex Action Recognition", CVPR, 2022 (Rensselaer Polytechnic Institute, NY). [Paper]
  • TubeR: "TubeR: Tube-Transformer for Action Detection", CVPR, 2022 (Amazon). [Paper]
  • DDM-Net: "Progressive Attention on Multi-Level Dense Difference Maps for Generic Event Boundary Detection", CVPR, 2022 (Nanjing University). [Paper][PyTorch]
  • ?: "Dual-Stream Transformer for Generic Event Boundary Captioning", CVPRW, 2022 (ByteDance). [Paper][PyTorch]
  • ?: "Exploring Anchor-based Detection for Ego4D Natural Language Query", arXiv, 2022 (Renmin University of China). [Paper]
  • EAMAT: "Entity-aware and Motion-aware Transformers for Language-driven Action Localization in Videos", IJCAI, 2022 (Beijing Institute of Technology). [Paper][Code (in construction)]
  • STPT: "An Efficient Spatio-Temporal Pyramid Transformer for Action Detection", ECCV, 2022 (Monash University, Australia). [Paper]
  • TeSTra: "Real-time Online Video Detection with Temporal Smoothing Transformers", ECCV, 2022 (UT Austin). [Paper][PyTorch]
  • TALLFormer: "TALLFormer: Temporal Action Localization with Long-memory Transformer", ECCV, 2022 (UNC). [Paper][PyTorch]
  • ?: "Uncertainty-Based Spatial-Temporal Attention for Online Action Detection", ECCV, 2022 (Rensselaer Polytechnic Institute, NY). [Paper]
  • ActionFormer: "ActionFormer: Localizing Moments of Actions with Transformers", ECCV, 2022 (UW-Madison). [Paper][PyTorch] (see the localization-head sketch at the end of this section)
  • ActionFormer: "Where a Strong Backbone Meets Strong Features -- ActionFormer for Ego4D Moment Queries Challenge", ECCVW, 2022 (UW-Madison). [Paper][PyTorch]
  • CoOadTR: "Continual Transformers: Redundancy-Free Attention for Online Inference", arXiv, 2022 (Aarhus University, Denmark). [Paper][PyTorch]
  • Temporal-Perceiver: "Temporal Perceiver: A General Architecture for Arbitrary Boundary Detection", arXiv, 2022 (Nanjing University). [Paper]
  • LocATe: "LocATe: End-to-end Localization of Actions in 3D with Transformers", arXiv, 2022 (Stanford). [Paper]
  • HTNet: "HTNet: Anchor-free Temporal Action Localization with Hierarchical Transformers", arXiv, 2022 (Korea University). [Paper]
  • AdaPerFormer: "Adaptive Perception Transformer for Temporal Action Localization", arXiv, 2022 (Tianjin University). [Paper]
  • CWC-Trans: "A Circular Window-based Cascade Transformer for Online Action Detection", arXiv, 2022 (Meituan). [Paper]
  • HIT: "Holistic Interaction Transformer Network for Action Detection", WACV, 2023 (NTHU). [Paper][PyTorch]
  • LART: "On the Benefits of 3D Pose and Tracking for Human Action Recognition", CVPR, 2023 (Meta). [Paper][Website]
  • TranS4mer: "Efficient Movie Scene Detection using State-Space Transformers", CVPR, 2023 (Comcast). [Paper]
  • TTM: "Token Turing Machines", CVPR, 2023 (Google). [Paper][JAX]
  • ?: "Decomposed Cross-modal Distillation for RGB-based Temporal Action Detection", CVPR, 2023 (NAVER). [Paper]
  • Self-DETR: "Self-Feedback DETR for Temporal Action Detection", ICCV, 2023 (Sungkyunkwan University). [Paper]
  • UnLoc: "UnLoc: A Unified Framework for Video Localization Tasks", ICCV, 2023 (Google). [Paper][JAX]
  • MS-DETR: "MS-DETR: Natural Language Video Localization with Sampling Moment-Moment Interaction", ACL, 2023 (NTU, Singapore). [Paper][PyTorch]
  • EVAD: "Efficient Video Action Detection with Token Dropout and Context Refinement", arXiv, 2023 (Nanjing University). [Paper]
  • STAR: "End-to-End Spatio-Temporal Action Localisation with Video Transformers", arXiv, 2023 (Google). [Paper]
  • DiffTAD: "DiffTAD: Temporal Action Detection with Proposal Denoising Diffusion", arXiv, 2023 (University of Surrey, UK). [Paper][PyTorch (in construction)]
  • MNA-ZBD: "No-frills Temporal Video Grounding: Multi-Scale Neighboring Attention and Zoom-in Boundary Detection", arXiv, 2023 (Renmin University of China). [Paper]
  • PAT: "PAT: Position-Aware Transformer for Dense Multi-Label Action Detection", arXiv, 2023 (University of Surrey, UK). [Paper]
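
Several detectors above (ActionFormer, TadTR, TALLFormer, among others) cast temporal action localization as anchor-free dense prediction: a temporal transformer encodes per-snippet features, and every timestep then classifies the ongoing action and regresses its distances to the segment's start and end. The sketch below shows only that head, with assumed dimensions and a vanilla encoder; the papers add multi-scale feature pyramids, local attention, and NMS-style decoding.

```python
import torch
import torch.nn as nn

class TemporalLocalizationHead(nn.Module):
    """Rough sketch of an anchor-free temporal localization head: a transformer
    encoder over per-snippet features, then per-timestep classification and
    boundary-offset regression. Sizes and layer counts are illustrative."""
    def __init__(self, dim=512, num_classes=20, layers=4, heads=8):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=layers)
        self.cls_head = nn.Linear(dim, num_classes)   # action scores per timestep
        self.reg_head = nn.Linear(dim, 2)             # distances to start / end

    def forward(self, feats):                         # feats: (B, T, dim)
        h = self.encoder(feats)
        return self.cls_head(h), self.reg_head(h).relu()  # non-negative offsets

head = TemporalLocalizationHead()
cls, reg = head(torch.randn(2, 128, 512))
print(cls.shape, reg.shape)  # (2, 128, 20) (2, 128, 2)
```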

[Back to Overview]

Action Prediction/Anticipation

  • AVT: "Anticipative Video Transformer", ICCV, 2021 (Facebook). [Paper][PyTorch][Website] (see the causal-attention sketch at the end of this section)
  • TTPP: "TTPP: Temporal Transformer with Progressive Prediction for Efficient Action Anticipation", Neurocomputing, 2021 (CAS). [Paper]
  • HORST: "Higher Order Recurrent Space-Time Transformer", arXiv, 2021 (NVIDIA). [Paper][PyTorch]
  • ?: "Action Forecasting with Feature-wise Self-Attention", arXiv, 2021 (A*STAR). [Paper]
  • FUTR: "Future Transformer for Long-term Action Anticipation", CVPR, 2022 (POSTECH). [Paper]
  • VPTR: "VPTR: Efficient Transformers for Video Prediction", ICPR, 2022 (Polytechnique Montreal, Canada). [Paper][PyTorch]
  • Earthformer: "Earthformer: Exploring Space-Time Transformers for Earth System Forecasting", NeurIPS, 2022 (Amazon). [Paper]
  • InAViT: "Interaction Visual Transformer for Egocentric Action Anticipation", arXiv, 2022 (A*STAR). [Paper]
  • VPTR: "Video Prediction by Efficient Transformers", IVC, 2022 (Polytechnique Montreal, Canada). [Paper][PyTorch]
  • AFFT: "Anticipative Feature Fusion Transformer for Multi-Modal Action Anticipation", WACV, 2023 (Karlsruhe Institute of Technology, Germany). [Paper][Code (in construction)]
  • GliTr: "GliTr: Glimpse Transformers with Spatiotemporal Consistency for Online Action Prediction", WACV, 2023 (McGill University, Canada). [Paper]
  • RAFTformer: "Latency Matters: Real-Time Action Forecasting Transformer", CVPR, 2023 (Honda). [Paper]
  • AdamsFormer: "AdamsFormer for Spatial Action Localization in the Future", CVPR, 2023 (Honda). [Paper]
  • TemPr: "The Wisdom of Crowds: Temporal Progressive Attention for Early Action Prediction", CVPR, 2023 (University of Bristol). [Paper][PyTorch][Website]
  • MAT: "Memory-and-Anticipation Transformer for Online Action Understanding", ICCV, 2023 (Nanjing University). [Paper][PyTorch]
  • SwinLSTM: "SwinLSTM: Improving Spatiotemporal Prediction Accuracy using Swin Transformer and LSTM", ICCV, 2023 (Hainan University). [Paper][PyTorch]
  • MVP: "Multiscale Video Pretraining for Long-Term Activity Forecasting", arXiv, 2023 (Boston). [Paper]
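
Anticipation models such as AVT hinge on causally masked attention: each timestep may attend only to the past, so the final token's representation can be decoded into a prediction of the not-yet-observed action. A toy sketch under assumed sizes (512-d features, 100 action classes):

```python
import torch
import torch.nn as nn

class CausalAnticipationDecoder(nn.Module):
    """Toy sketch of anticipative modeling: a causally masked transformer in
    which each timestep only attends to the past. Dimensions are assumptions."""
    def __init__(self, dim=512, heads=8, layers=2, num_actions=100):
        super().__init__()
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.decoder = nn.TransformerEncoder(layer, num_layers=layers)
        self.next_action = nn.Linear(dim, num_actions)

    def forward(self, frame_feats):          # (B, T, dim), time-ordered
        T = frame_feats.size(1)
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        h = self.decoder(frame_feats, mask=causal)   # no peeking at the future
        return self.next_action(h[:, -1])            # anticipate the next action

logits = CausalAnticipationDecoder()(torch.randn(2, 16, 512))
print(logits.shape)  # torch.Size([2, 100])
```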

[Back to Overview]

Video Object Segmentation

  • GC: "Fast Video Object Segmentation using the Global Context Module", ECCV, 2020 (Tencent). [Paper]
  • SSTVOS: "SSTVOS: Sparse Spatiotemporal Transformers for Video Object Segmentation", CVPR, 2021 (Modiface). [Paper][Code (in construction)]
  • JOINT: "Joint Inductive and Transductive Learning for Video Object Segmentation", ICCV, 2021 (University of Science and Technology of China). [Paper][PyTorch]
  • AOT: "Associating Objects with Transformers for Video Object Segmentation", NeurIPS, 2021 (Zhejiang University). [Paper][PyTorch (yoxu515)][Code (in construction)] (see the memory-reading sketch at the end of this section)
  • TransVOS: "TransVOS: Video Object Segmentation with Transformers", arXiv, 2021 (Zhejiang University). [Paper]
  • SITVOS: "Siamese Network with Interactive Transformer for Video Object Segmentation", AAAI, 2022 (JD). [Paper]
  • HODOR: "Differentiable Soft-Masked Attention", CVPRW, 2022 (RWTH Aachen University, Germany). [Paper]
  • BATMAN: "BATMAN: Bilateral Attention Transformer in Motion-Appearance Neighboring Space for Video Object Segmentation", ECCV, 2022 (Microsoft). [Paper]
  • DeAOT: "Decoupling Features in Hierarchical Propagation for Video Object Segmentation", NeurIPS, 2022 (Zhejiang University). [Paper][PyTorch]
  • AOT: "Associating Objects with Scalable Transformers for Video Object Segmentation", arXiv, 2022 (Zhejiang University). [Paper][PyTorch]
  • MED-VT: "MED-VT: Multiscale Encoder-Decoder Video Transformer with Application to Object Segmentation", CVPR, 2023 (York University). [Paper][Website]
  • ?: "Boosting Video Object Segmentation via Space-time Correspondence Learning", CVPR, 2023 (Shanghai Jiao Tong University (SJTU)). [Paper]
  • Isomer: "Isomer: Isomerous Transformer for Zero-shot Video Object Segmentation", ICCV, 2023 (Dalian University of Technology). [Paper][PyTorch]
  • SimVOS: "Scalable Video Object Segmentation with Simplified Framework", ICCV, 2023 (CUHK). [Paper]
  • MITS: "Integrating Boxes and Masks: A Multi-Object Framework for Unified Visual Tracking and Segmentation", ICCV, 2023 (Zhejiang University). [Paper][Code (in construction)]
  • MUTR: "Referred by Multi-Modality: A Unified Temporal Transformer for Video Object Segmentation", arXiv, 2023 (Shanghai AI Lab). [Paper][PyTorch]
  • JointFormer: "Joint Modeling of Feature, Correspondence, and a Compressed Memory for Video Object Segmentation", arXiv, 2023 (Nanjing University). [Paper]
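
Most transformer-based VOS methods above (AOT, TransVOS, BATMAN, and others) revolve around a cross-frame memory read: each pixel of the query frame attends to all pixels of stored past frames and gathers their mask-conditioned features. Below is a minimal sketch of that read; the shapes and the function name are illustrative assumptions rather than any paper's exact interface.

```python
import torch
import torch.nn.functional as F

def memory_read(query_feats, mem_keys, mem_values):
    """Minimal cross-frame attention into a memory of past frames.
      query_feats: (B, Nq, C)  flattened query-frame features
      mem_keys:    (B, Nm, C)  flattened features of past frames
      mem_values:  (B, Nm, C)  mask-conditioned features of past frames
    """
    affinity = query_feats @ mem_keys.transpose(1, 2)      # (B, Nq, Nm)
    affinity = affinity / query_feats.size(-1) ** 0.5      # scaled dot product
    return F.softmax(affinity, dim=-1) @ mem_values        # (B, Nq, C)

q = torch.randn(1, 30 * 54, 256)       # e.g., stride-16 features of one frame
k = torch.randn(1, 3 * 30 * 54, 256)   # three memory frames
print(memory_read(q, k, k).shape)      # torch.Size([1, 1620, 256])
```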

[Back to Overview]

Video Instance Segmentation

  • VisTR: "End-to-End Video Instance Segmentation with Transformers", CVPR, 2021 (Meituan). [Paper][PyTorch]
  • IFC: "Video Instance Segmentation using Inter-Frame Communication Transformers", NeurIPS, 2021 (Yonsei University). [Paper][PyTorch]
  • Deformable-VisTR: "Deformable VisTR: Spatio temporal deformable attention for video instance segmentation", ICASSP, 2022 (University at Buffalo). [Paper][Code (in construction)]
  • TeViT: "Temporally Efficient Vision Transformer for Video Instance Segmentation", CVPR, 2022 (Tencent). [Paper][PyTorch]
  • GMP-VIS: "A Graph Matching Perspective With Transformers on Video Instance Segmentation", CVPR, 2022 (Shandong University). [Paper]
  • VMT: "Video Mask Transfiner for High-Quality Video Instance Segmentation", ECCV, 2022 (ETHZ). [Paper][GitHub][Website]
  • SeqFormer: "SeqFormer: Sequential Transformer for Video Instance Segmentation", ECCV, 2022 (ByteDance). [Paper][PyTorch]
  • MS-STS: "Video Instance Segmentation via Multi-scale Spatio-temporal Split Attention Transformer", ECCV, 2022 (MBZUAI). [Paper][PyTorch]
  • MinVIS: "MinVIS: A Minimal Video Instance Segmentation Framework without Video-based Training", NeurIPS, 2022 (NVIDIA). [Paper][PyTorch] (see the query-matching sketch at the end of this section)
  • VITA: "VITA: Video Instance Segmentation via Object Token Association", NeurIPS, 2022 (Yonsei University). [Paper][PyTorch]
  • IFR: "Consistent Video Instance Segmentation with Inter-Frame Recurrent Attention", arXiv, 2022 (Microsoft). [Paper]
  • DeVIS: "DeVIS: Making Deformable Transformers Work for Video Instance Segmentation", arXiv, 2022 (TUM). [Paper][PyTorch]
  • InstanceFormer: "InstanceFormer: An Online Video Instance Segmentation Framework", arXiv, 2022 (Ludwig Maximilian University of Munich). [Paper][Code (in construction)]
  • MaskFreeVIS: "Mask-Free Video Instance Segmentation", CVPR, 2023 (ETHZ). [Paper][PyTorch]
  • MDQE: "MDQE: Mining Discriminative Query Embeddings to Segment Occluded Instances on Challenging Videos", CVPR, 2023 (Hong Kong Polytechnic University). [Paper][PyTorch]
  • GenVIS: "A Generalized Framework for Video Instance Segmentation", CVPR, 2023 (Yonsei). [Paper][PyTorch]
  • CTVIS: "CTVIS: Consistent Training for Online Video Instance Segmentation", ICCV, 2023 (Zhejiang University). [Paper][Code (in construction)]
  • BoxVIS: "BoxVIS: Video Instance Segmentation with Box Annotations", arXiv, 2023 (Hong Kong Polytechnic University). [Paper][Code (in construction)]
  • OW-VISFormer: "Video Instance Segmentation in an Open-World", arXiv, 2023 (MBZUAI). [Paper][Code (in construction)]
  • GRAtt-VIS: "GRAtt-VIS: Gated Residual Attention for Auto Rectifying Video Instance Segmentation", arXiv, 2023 (LMU Munich). [Paper][Code (in construction)]
  • DVIS: "DVIS: Decoupled Video Instance Segmentation Framework", arXiv, 2023 (Wuhan University). [Paper][PyTorch]
  • RefineVIS: "RefineVIS: Video Instance Segmentation with Temporal Attention Refinement", arXiv, 2023 (Microsoft). [Paper]
  • VideoCutLER: "VideoCutLER: Surprisingly Simple Unsupervised Video Instance Segmentation", arXiv, 2023 (Meta). [Paper][PyTorch]
  • NOVIS: "NOVIS: A Case for End-to-End Near-Online Video Instance Segmentation", arXiv, 2023 (TUM). [Paper]
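
A simple online recipe among the entries above, made explicit by MinVIS, is to run an image-level query-based segmenter independently per frame and then link instances across frames by bipartite matching on the similarity of their decoder query embeddings, with no video-specific training. A loose sketch (the function name and shapes are hypothetical; requires SciPy):

```python
import torch
import torch.nn.functional as F
from scipy.optimize import linear_sum_assignment

def track_by_query_matching(prev_queries, cur_queries):
    """Link per-frame instance queries across frames by bipartite matching on
    cosine similarity (names and shapes are illustrative assumptions).
      prev_queries, cur_queries: (N, D) decoder query embeddings of two frames
    """
    prev = F.normalize(prev_queries, dim=-1)
    cur = F.normalize(cur_queries, dim=-1)
    cost = -(prev @ cur.t())                        # negative cosine similarity
    rows, cols = linear_sum_assignment(cost.numpy())
    return list(zip(rows.tolist(), cols.tolist()))  # prev index -> current index

matches = track_by_query_matching(torch.randn(10, 256), torch.randn(10, 256))
print(matches[:3])  # e.g., [(0, 7), (1, 2), (2, 9)]
```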

[Back to Overview]

Other Video Tasks

  • Action Segmentation
    • ASFormer: "ASFormer: Transformer for Action Segmentation", BMVC, 2021 (Peking University). [Paper][PyTorch]
    • Bridge-Prompt: "Bridge-Prompt: Towards Ordinal Action Understanding in Instructional Videos", CVPR, 2022 (Tsinghua University). [Paper][PyTorch]
    • SC-Transformer++: "SC-Transformer++: Structured Context Transformer for Generic Event Boundary Detection", CVPRW, 2022 (CAS). [Paper][Code (in construction)]
    • UVAST: "Unified Fully and Timestamp Supervised Temporal Action Segmentation via Sequence to Sequence Translation", ECCV, 2022 (Bosch). [Paper][PyTorch]
    • ?: "Transformers in Action: Weakly Supervised Action Segmentation", arXiv, 2022 (TUM). [Paper]
    • CETNet: "Cross-Enhancement Transformer for Action Segmentation", arXiv, 2022 (Shijiazhuang Tiedao University). [Paper]
    • EUT: "Efficient U-Transformer with Boundary-Aware Loss for Action Segmentation", arXiv, 2022 (CAS). [Paper]
    • SC-Transformer: "Structured Context Transformer for Generic Event Boundary Detection", arXiv, 2022 (CAS). [Paper]
    • DXFormer: "Enhancing Transformer Backbone for Egocentric Video Action Segmentation", CVPRW, 2023 (Northeastern University). [Paper][Website (in construction)]
    • LTContext: "How Much Temporal Long-Term Context is Needed for Action Segmentation?", ICCV, 2023 (University of Bonn). [Paper][PyTorch]
    • TST: "Temporal Segment Transformer for Action Segmentation", arXiv, 2023 (ShanghaiTech). [Paper]
  • Video X Segmentation:
    • STT: "Video Semantic Segmentation via Sparse Temporal Transformer", ACMMM, 2021 (Shanghai Jiao Tong). [Paper]
    • CFFM: "Coarse-to-Fine Feature Mining for Video Semantic Segmentation", CVPR, 2022 (ETH Zurich). [Paper][PyTorch]
    • TF-DL: "TubeFormer-DeepLab: Video Mask Transformer", CVPR, 2022 (Google). [Paper]
    • Video-K-Net: "Video K-Net: A Simple, Strong, and Unified Baseline for Video Segmentation", CVPR, 2022 (Peking University). [Paper][PyTorch]
    • MRCFA: "Mining Relations among Cross-Frame Affinities for Video Semantic Segmentation", ECCV, 2022 (ETH Zurich). [Paper][PyTorch]
    • PolyphonicFormer: "PolyphonicFormer: Unified Query Learning for Depth-aware Video Panoptic Segmentation", ECCV, 2022 (Wuhan University). [Paper][Code (in construction)]
    • ?: "Time-Space Transformers for Video Panoptic Segmentation", arXiv, 2022 (Technical University of Cluj-Napoca, Romania). [Paper]
    • CAROQ: "Context-Aware Relative Object Queries To Unify Video Instance and Panoptic Segmentation", CVPR, 2023 (UIUC). [Paper][PyTorch][Website]
    • MEGA: "MEGA: Multimodal Alignment Aggregation and Distillation For Cinematic Video Segmentation", ICCV, 2023 (Amazon). [Paper]
    • DEVA: "Tracking Anything with Decoupled Video Segmentation", ICCV, 2023 (UIUC). [Paper][PyTorch][Website]
    • Video-kMaX: "Video-kMaX: A Simple Unified Approach for Online and Near-Online Video Panoptic Segmentation", arXiv, 2023 (Google). [Paper]
    • Tube-Link: "Tube-Link: A Flexible Cross Tube Baseline for Universal Video Segmentation", ICCV, 2023 (NTU, Singapore). [Paper][PyTorch]
    • SAM-PT: "Segment Anything Meets Point Tracking", arXiv, 2023 (ETHZ). [Paper][Code (in construction)]
    • TTT-MAE: "Test-Time Training on Video Streams", arXiv, 2023 (Berkeley). [Paper][Website]
  • Video Object Detection:
    • TransVOD: "End-to-End Video Object Detection with Spatial-Temporal Transformers", arXiv, 2021 (Shanghai Jiao Tong + SenseTime). [Paper][Code (in construction)]
    • MODETR: "MODETR: Moving Object Detection with Transformers", arXiv, 2021 (Valeo, Egypt). [Paper]
    • ST-MTL: "Spatio-Temporal Multi-Task Learning Transformer for Joint Moving Object Detection and Segmentation", arXiv, 2021 (Valeo, Egypt). [Paper]
    • ST-DETR: "ST-DETR: Spatio-Temporal Object Traces Attention Detection Transformer", arXiv, 2021 (Valeo, Egypt). [Paper]
    • PTSEFormer: "PTSEFormer: Progressive Temporal-Spatial Enhanced TransFormer Towards Video Object Detection", ECCV, 2022 (Shanghai Jiao Tong University). [Paper][PyTorch]
    • TransVOD: "TransVOD: End-to-end Video Object Detection with Spatial-Temporal Transformers", arXiv, 2022 (Shanghai Jiao Tong + SenseTime). [Paper]
    • ?: "Learning Future Object Prediction with a Spatiotemporal Detection Transformer", arXiv, 2022 (Zenseact, Sweden). [Paper]
    • ClipVID: "Identity-Consistent Aggregation for Video Object Detection", ICCV, 2023 (University of Adelaide, Australia). [Paper][Code (in construction)]
  • Dense Video Tasks (Detection + Segmentation):
    • TDViT: "TDViT: Temporal Dilated Video Transformer for Dense Video Tasks", ECCV, 2022 (Queen's University Belfast, UK). [Paper][Code (in construction)]
    • FAQ: "Feature Aggregated Queries for Transformer-Based Video Object Detectors", CVPR, 2023 (UCF). [Paper][PyTorch]
    • Video-OWL-ViT: "Video OWL-ViT: Temporally-consistent open-world localization in video", ICCV, 2023 (DeepMind). [Paper]
  • Video Retrieval:
    • SVRTN: "Self-supervised Video Retrieval Transformer Network", arXiv, 2021 (Alibaba). [Paper]
  • Video Hashing:
    • BTH: "Self-Supervised Video Hashing via Bidirectional Transformers", CVPR, 2021 (Tsinghua). [Paper][PyTorch]
  • Video-Language:
    • ActionCLIP: "ActionCLIP: A New Paradigm for Video Action Recognition", arXiv, 2022 (Zhejiang University). [Paper][PyTorch] (see the CLIP-to-video sketch at the end of this section)
    • ?: "Prompting Visual-Language Models for Efficient Video Understanding", ECCV, 2022 (Shanghai Jiao Tong + Oxford). [Paper][PyTorch][Website]
    • X-CLIP: "Expanding Language-Image Pretrained Models for General Video Recognition", ECCV, 2022 (Microsoft). [Paper][PyTorch]
    • EVL: "Frozen CLIP Models are Efficient Video Learners", ECCV, 2022 (CUHK). [Paper][PyTorch (in construction)]
    • STALE: "Zero-Shot Temporal Action Detection via Vision-Language Prompting", ECCV, 2022 (University of Surrey, UK). [Paper][Code (in construction)]
    • ?: "Knowledge Prompting for Few-shot Action Recognition", arXiv, 2022 (Beijing Laboratory of Intelligent Information Technology). [Paper]
    • VLG: "VLG: General Video Recognition with Web Textual Knowledge", arXiv, 2022 (Nanjing University). [Paper]
    • InternVideo: "InternVideo: General Video Foundation Models via Generative and Discriminative Learning", arXiv, 2022 (Shanghai AI Lab). [Paper][Code (in construction)][Website]
    • PromptonomyViT: "PromptonomyViT: Multi-Task Prompt Learning Improves Video Transformers using Synthetic Scene Data", arXiv, 2022 (Tel Aviv + IBM). [Paper]
    • MUPPET: "Multi-Modal Few-Shot Temporal Action Detection via Vision-Language Meta-Adaptation", arXiv, 2022 (Meta). [Paper][Code (in construction)]
    • MovieCLIP: "MovieCLIP: Visual Scene Recognition in Movies", WACV, 2023 (USC). [Paper][Website]
    • TranZAD: "Semantics Guided Contrastive Learning of Transformers for Zero-Shot Temporal Activity Detection", WACV, 2023 (UC Riverside). [Paper]
    • Text4Vis: "Revisiting Classifier: Transferring Vision-Language Models for Video Recognition", AAAI, 2023 (Baidu). [Paper][PyTorch]
    • AIM: "AIM: Adapting Image Models for Efficient Video Action Recognition", ICLR, 2023 (Amazon). [Paper][PyTorch][Website]
    • ViFi-CLIP: "Fine-tuned CLIP Models are Efficient Video Learners", CVPR, 2023 (MBZUAI). [Paper][PyTorch]
    • LaViLa: "Learning Video Representations from Large Language Models", CVPR, 2023 (Meta). [Paper][PyTorch][Website]
    • TVP: "Text-Visual Prompting for Efficient 2D Temporal Video Grounding", CVPR, 2023 (Intel). [Paper]
    • Vita-CLIP: "Vita-CLIP: Video and text adaptive CLIP via Multimodal Prompting", CVPR, 2023 (MBZUAI). [Paper][PyTorch]
    • STAN: "Revisiting Temporal Modeling for CLIP-based Image-to-Video Knowledge Transferring", CVPR, 2023 (Peking University). [Paper][PyTorch]
    • CBP-VLP: "Distilling Vision-Language Pre-training to Collaborate with Weakly-Supervised Temporal Action Localization", CVPR, 2023 (Shanghai Jiao Tong). [Paper]
    • BIKE: "Bidirectional Cross-Modal Knowledge Exploration for Video Recognition with Pre-trained Vision-Language Models", CVPR, 2023 (The University of Sydney). [Paper][PyTorch]
    • HierVL: "HierVL: Learning Hierarchical Video-Language Embeddings", CVPR, 2023 (Meta). [Paper][PyTorch]
    • ?: "Test of Time: Instilling Video-Language Models with a Sense of Time", CVPR, 2023 (University of Amsterdam). [Paper][PyTorch][Website]
    • Open-VCLIP: "Open-VCLIP: Transforming CLIP to an Open-vocabulary Video Model via Interpolated Weight Optimization", ICML, 2023 (Fudan). [Paper][PyTorch]
    • ILA: "Implicit Temporal Modeling with Learnable Alignment for Video Recognition", ICCV, 2023 (Fudan). [Paper][PyTorch]
    • MindVLT: "Towards Open-Vocabulary Video Instance Segmentation", ICCV, 2023 (University of Amsterdam). [Paper]
    • MAP: "Seeing in Flowing: Adapting CLIP for Action Recognition with Motion Prompts Learning", ACMMM, 2023 (Tencent). [Paper]
    • OTI: "Orthogonal Temporal Interpolation for Zero-Shot Video Recognition", ACMMM, 2023 (CAS). [Paper][Code (in construction)]
    • CLIP-FSAR: "CLIP-guided Prototype Modulating for Few-shot Action Recognition", arXiv, 2023 (Alibaba). [Paper][PyTorch]
    • MAXI: "MAtch, eXpand and Improve: Unsupervised Finetuning for Zero-Shot Action Recognition with Language Knowledge", arXiv, 2023 (Graz University of Technology, Austria). [Paper][Code (in construction)]
    • ?: "Multi-modal Prompting for Low-Shot Temporal Action Localization", arXiv, 2023 (Shanghai Jiao Tong). [Paper]
    • VicTR: "VicTR: Video-conditioned Text Representations for Activity Recognition", arXiv, 2023 (Google). [Paper]
    • OpenVIS: "OpenVIS: Open-vocabulary Video Instance Segmentation", arXiv, 2023 (Fudan). [Paper]
    • ALGO: "Discovering Novel Actions in an Open World with Object-Grounded Visual Commonsense Reasoning", arXiv, 2023 (Oklahoma State University). [Paper]
    • ?: "Open-Vocabulary Temporal Action Detection with Off-the-Shelf Image-Text Features", arXiv, 2023 (Google). [Paper]
    • MSQNet: "MSQNet: Actor-agnostic Action Recognition with Multi-modal Query", arXiv, 2023 (University of Surrey, UK). [Paper][Code (in construction)]
    • OAP-AOP: "Opening the Vocabulary of Egocentric Actions", arXiv, 2023 (NUS). [Paper][Website (in construction)]
  • X-supervised Learning:
    • LSTCL: "Long-Short Temporal Contrastive Learning of Video Transformers", CVPR, 2022 (Facebook). [Paper]
    • SVT: "Self-supervised Video Transformer", CVPR, 2022 (Stony Brook). [Paper][PyTorch][Website]
    • BEVT: "BEVT: BERT Pretraining of Video Transformers", CVPR, 2022 (Microsoft). [Paper][PyTorch]
    • SCVRL: "SCVRL: Shuffled Contrastive Video Representation Learning", CVPRW, 2022 (Amazon). [Paper]
    • VIMPAC: "VIMPAC: Video Pre-Training via Masked Token Prediction and Contrastive Learning", CVPRW, 2022 (UNC). [Paper][PyTorch]
    • ?: "Static and Dynamic Concepts for Self-supervised Video Representation Learning", ECCV, 2022 (CUHK). [Paper]
    • VideoMAE: "VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training", NeurIPS, 2022 (Tencent). [Paper][PyTorch]
    • MAE-ST: "Masked Autoencoders As Spatiotemporal Learners", NeurIPS, 2022 (Meta). [Paper][PyTorch]
    • ?: "On the Surprising Effectiveness of Transformers in Low-Labeled Video Recognition", arXiv, 2022 (Georgia Tech). [Paper]
    • MaskViT: "MaskViT: Masked Visual Pre-Training for Video Prediction", ICLR, 2023 (Stanford). [Paper][Code (in construction)][Website]
    • WeakSVR: "Weakly Supervised Video Representation Learning with Unaligned Text for Sequential Videos", CVPR, 2023 (ShanghaiTech). [Paper][PyTorch]
    • VideoMAE-V2: "VideoMAE V2: Scaling Video Masked Autoencoders with Dual Masking", CVPR, 2023 (Shanghai AI Lab). [Paper][PyTorch]
    • SVFormer: "SVFormer: Semi-supervised Video Transformer for Action Recognition", CVPR, 2023 (Fudan University). [Paper][PyTorch]
    • OmniMAE: "OmniMAE: Single Model Masked Pretraining on Images and Videos", CVPR, 2023 (Meta). [Paper][PyTorch]
    • MVD: "Masked Video Distillation: Rethinking Masked Feature Modeling for Self-supervised Video Representation Learning", CVPR, 2023 (Fudan University). [Paper][PyTorch]
    • MME: "Masked Motion Encoding for Self-Supervised Video Representation Learning", CVPR, 2023 (South China University of Technology). [Paper][PyTorch]
    • MGMAE: "MGMAE: Motion Guided Masking for Video Masked Autoencoding", ICCV, 2023 (Shanghai AI Lab). [Paper]
    • MGM: "Motion-Guided Masking for Spatiotemporal Representation Learning", ICCV, 2023 (Amazon). [Paper]
    • ViC-MAE: "Visual Representation Learning from Unlabeled Video using Contrastive Masked Autoencoders", arXiv, 2023 (Rice University). [Paper]
    • SiamMAE: "Siamese Masked Autoencoders", arXiv, 2023 (Stanford). [Paper][Website]
    • LSS: "Language-based Action Concept Spaces Improve Video Self-Supervised Learning", arXiv, 2023 (Stony Brook). [Paper]
    • TimeT: "Time Does Tell: Self-Supervised Time-Tuning of Dense Image Representations", arXiv, 2023 (UvA). [Paper][PyTorch]
  • X-shot:
    • ResT: "Cross-modal Representation Learning for Zero-shot Action Recognition", CVPR, 2022 (Microsoft). [Paper]
    • ViSET: "Zero-Shot Action Recognition with Transformer-based Video Semantic Embedding", arXiv, 2022 (University of South Florida). [Paper]
    • REST: "REST: REtrieve & Self-Train for generative action recognition", arXiv, 2022 (Samsung). [Paper]
    • MoLo: "MoLo: Motion-augmented Long-short Contrastive Learning for Few-shot Action Recognition", CVPR, 2023 (Alibaba). [Paper][Code (in construction)]
    • MA-CLIP: "Multimodal Adaptation of CLIP for Few-Shot Action Recognition", arXiv, 2023 (Zhejiang). [Paper]
    • SA-CT: "On the Importance of Spatial Relations for Few-shot Action Recognition", arXiv, 2023 (Fudan). [Paper]
  • Anomaly Detection:
    • CT-D2GAN: "Convolutional Transformer based Dual Discriminator Generative Adversarial Networks for Video Anomaly Detection", ACMMM, 2021 (NEC). [Paper]
    • ADTR: "ADTR: Anomaly Detection Transformer with Feature Reconstruction", International Conference on Neural Information Processing (ICONIP), 2022 (Shanghai Jiao Tong University). [Paper]
    • SSMCTB: "Self-Supervised Masked Convolutional Transformer Block for Anomaly Detection", arXiv, 2022 (UCF). [Paper][Code (in construction)]
    • ?: "Multi-Contextual Predictions with Vision Transformer for Video Anomaly Detection", arXiv, 2022 (Korea University). [Paper]
    • CLIP-TSA: "CLIP-TSA: CLIP-Assisted Temporal Self-Attention for Weakly-Supervised Video Anomaly Detection", arXiv, 2022 (University of Arkansas). [Paper]
    • ?: "Prompt-Guided Zero-Shot Anomaly Action Recognition using Pretrained Deep Skeleton Features", CVPR, 2023 (Konica Minolta, Japan). [Paper]
  • Relation Detection:
    • VidVRD: "Video Relation Detection via Tracklet based Visual Transformer", ACMMMW, 2021 (Zhejiang University). [Paper][PyTorch]
    • VRDFormer: "VRDFormer: End-to-End Video Visual Relation Detection With Transformers", CVPR, 2022 (Renmin University of China). [Paper][Code (in construction)]
    • VidSGG-BIG: "Classification-Then-Grounding: Reformulating Video Scene Graphs as Temporal Bipartite Graphs", CVPR, 2022 (Zhejiang University). [Paper][PyTorch]
    • RePro: "Compositional Prompt Tuning with Motion Cues for Open-vocabulary Video Relation Detection", ICLR, 2023 (Zhejiang University). [Paper][PyTorch (in construction)]
  • Saliency Prediction:
    • STSANet: "Spatio-Temporal Self-Attention Network for Video Saliency Prediction", arXiv, 2021 (Shanghai University). [Paper]
    • UFO: "A Unified Transformer Framework for Group-based Segmentation: Co-Segmentation, Co-Saliency Detection and Video Salient Object Detection", arXiv, 2022 (South China University of Technology). [Paper][PyTorch]
    • DMT: "Discriminative Co-Saliency and Background Mining Transformer for Co-Salient Object Detection", CVPR, 2023 (Northwestern Polytechnical University). [Paper][PyTorch]
    • CASP-Net: "CASP-Net: Rethinking Video Saliency Prediction from an Audio-Visual Consistency Perceptual Perspective", CVPR, 2023 (Northwestern Polytechnical University). [Paper]
  • Video Inpainting Detection:
    • FAST: "Frequency-Aware Spatiotemporal Transformers for Video Inpainting Detection", ICCV, 2021 (Tsinghua University). [Paper]
  • Driver Activity:
    • TransDARC: "TransDARC: Transformer-based Driver Activity Recognition with Latent Space Feature Calibration", arXiv, 2022 (Karlsruhe Institute of Technology, Germany). [Paper]
    • ?: "Applying Spatiotemporal Attention to Identify Distracted and Drowsy Driving with Vision Transformers", arXiv, 2022 (Jericho High School, NY). [Paper]
    • ViT-DD: "Multi-Task Vision Transformer for Semi-Supervised Driver Distraction Detection", arXiv, 2022 (Purdue). [Paper][PyTorch (in construction)]
  • Video Alignment:
    • DGWT: "Dynamic Graph Warping Transformer for Video Alignment", BMVC, 2021 (University of New South Wales, Australia). [Paper]
  • Sport-related:
    • Skating-Mixer: "Skating-Mixer: Multimodal MLP for Scoring Figure Skating", arXiv, 2022 (Southern University of Science and Technology). [Paper]
  • Action Counting:
    • TransRAC: "TransRAC: Encoding Multi-scale Temporal Correlation with Transformers for Repetitive Action Counting", CVPR, 2022 (ShanghaiTech). [Paper][PyTorch][Website]
    • PoseRAC: "PoseRAC: Pose Saliency Transformer for Repetitive Action Counting", arXiv, 2023 (Peking University). [Paper][PyTorch]
  • Action Quality Assessment:
    • ?: "Action Quality Assessment with Temporal Parsing Transformer", ECCV, 2022 (Baidu). [Paper]
    • ?: "Action Quality Assessment using Transformers", arXiv, 2022 (USC). [Paper]
  • Human Interaction:
    • IGFormer: "IGFormer: Interaction Graph Transformer for Skeleton-based Human Interaction Recognition", ECCV, 2022 (The University of Melbourne). [Paper]
  • Cross-Domain:
    • UDAVT: "Unsupervised Domain Adaptation for Video Transformers in Action Recognition", ICPR, 2022 (University of Trento). [Paper][Code (in construction)]
    • AutoLabel: "AutoLabel: CLIP-based framework for Open-set Video Domain Adaptation", CVPR, 2023 (University of Trento). [Paper][PyTorch]
    • DALL-V: "The Unreasonable Effectiveness of Large Language-Vision Models for Source-free Video Domain Adaptation", ICCV, 2023 (University of Trento). [Paper][Code (in construction)]
  • Multi-Camera Editing:
    • TC-Transformer: "Temporal and Contextual Transformer for Multi-Camera Editing of TV Shows", ECCVW, 2022 (CUHK). [Paper]
  • Instructional Video:
    • ProcedureVRL: "Learning Procedure-aware Video Representation from Instructional Videos and Their Narrations", CVPR, 2023 (Meta). [Paper]
    • Paprika: "Procedure-Aware Pretraining for Instructional Video Understanding", CVPR, 2023 (Salesforce). [Paper][PyTorch]
    • StepFormer: "StepFormer: Self-supervised Step Discovery and Localization in Instructional Videos", CVPR, 2023 (Samsung). [Paper]
    • E3P: "Event-Guided Procedure Planning from Instructional Videos with Text Supervision", ICCV, 2023 (Sun Yat-sen University). [Paper]
    • VLaMP: "Pretrained Language Models as Visual Planners for Human Assistance", arXiv, 2023 (Meta). [Paper]
    • VINA: "Learning to Ground Instructional Articles in Videos through Narrations", arXiv, 2023 (Meta). [Paper][Website]
  • Continual Learning:
    • PIVOT: "PIVOT: Prompting for Video Continual Learning", CVPR, 2023 (KAUST). [Paper]
  • 3D:
    • EPIC-Fields: "EPIC Fields: Marrying 3D Geometry and Video Understanding", arXiv, 2023 (Oxford + Bristol). [Paper][Website]
  • Audio-Video:
    • AVGN: "Audio-Visual Glance Network for Efficient Video Recognition", ICCV, 2023 (KAIST). [Paper]
  • Event Camera:
    • EventTransAct: "EventTransAct: A video transformer-based framework for Event-camera based action recognition", IROS, 2023 (UCF). [Paper][PyTorch][Website]
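
Many of the video-language entries above (ActionCLIP, X-CLIP, ViFi-CLIP, Text4Vis, and others) build on one zero-shot recipe: embed sampled frames with a frozen CLIP image encoder, pool over time, and score the pooled video embedding against text embeddings of class prompts such as "a video of a person {class}". The sketch below shows only that scoring step, with random tensors standing in for CLIP features; mean pooling and the fixed logit scale of 100 are simplifying assumptions, on top of which the papers add temporal modules and prompt learning.

```python
import torch
import torch.nn.functional as F

def zero_shot_video_scores(frame_embs, class_text_embs):
    """Score videos against class prompts by pooling frame embeddings.
      frame_embs:      (B, T, D) frame embeddings from a frozen image encoder
      class_text_embs: (K, D)    text embeddings of K class prompts
    """
    video = F.normalize(frame_embs.mean(dim=1), dim=-1)    # (B, D) temporal pool
    text = F.normalize(class_text_embs, dim=-1)            # (K, D)
    return 100.0 * video @ text.t()                        # (B, K) logits

logits = zero_shot_video_scores(torch.randn(4, 8, 512), torch.randn(400, 512))
print(logits.argmax(dim=-1).shape)  # predicted class per video: torch.Size([4])
```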

[Back to Overview]

Multi-Modality

Visual Captioning

  • General:
    • SAT: "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention", ICML, 2015. [Paper]
    • ETA-Transformer: "Entangled Transformer for Image Captioning", ICCV, 2019 (UTS). [Paper]
    • M2-Transformer: "Meshed-Memory Transformer for Image Captioning", CVPR, 2020 (UniMoRE). [Paper][PyTorch]
    • MCCFormers: "Describing and Localizing Multiple Changes with Transformers", ICCV, 2021 (AIST). [Paper][Website]
    • SATIC: "Semi-Autoregressive Transformer for Image Captioning", ICCVW, 2021 (Hefei University of Technology). [Paper][PyTorch]
    • DGCN: "Dual Graph Convolutional Networks with Transformer and Curriculum Learning for Image Captioning", ACMMM, 2021 (Wuhan University). [Paper]
    • CPTR: "CPTR: Full Transformer Network for Image Captioning", arXiv, 2021 (CAS). [Paper]
    • ReFormer: "ReFormer: The Relational Transformer for Image Captioning", arXiv, 2021 (Stony Brook University). [Paper]
    • LAViTeR: "LAViTeR: Learning Aligned Visual and Textual Representations Assisted by Image and Caption Generation", arXiv, 2021 (University at Buffalo). [Paper]
    • LATGeO: "Label-Attention Transformer with Geometrically Coherent Objects for Image Captioning", arXiv, 2021 (Gwangju Institute of Science and Technology). [Paper]
    • GEVST: "Geometry-Entangled Visual Semantic Transformer for Image Captioning", arXiv, 2021 (NTU, Singapore). [Paper]
    • GAT: "Geometry Attention Transformer with Position-aware LSTMs for Image Captioning", arXiv, 2021 (University of Electronic Science and Technology of China). [Paper]
    • PureT: "End-to-End Transformer Based Model for Image Captioning", AAAI, 2022 (CAS). [Paper]
    • VisualGPT: "VisualGPT: Data-efficient Adaptation of Pretrained Language Models for Image Captioning", CVPR, 2022 (KAUST). [Paper][PyTorch]
    • ViTCAP: "Injecting Semantic Concepts into End-to-End Image Captioning", CVPR, 2022 (Microsoft). [Paper]
    • CLIP-Event: "CLIP-Event: Connecting Text and Images with Event Structures", CVPR, 2022 (Microsoft). [Paper][PyTorch]
    • ?: "Beyond a Pre-Trained Object Detector: Cross-Modal Textual and Visual Context for Image Captioning", CVPR, 2022 (Georgia Tech). [Paper][PyTorch]
    • CLIP4IDC: "CLIP4IDC: CLIP for Image Difference Captioning", CVPRW, 2022 (Aalto University, Finland). [Paper][Code (in construction)]
    • ?: "A Dual-Attentive Approach to Style-Based Image Captioning Using a CNN-Transformer Model", CVPRW, 2022 (The University of the West Indies, Jamaica). [Paper]
    • SpaCap3D: "Spatiality-guided Transformer for 3D Dense Captioning on Point Clouds", IJCAI, 2022 (University of Sydney). [Paper][Code (in construction)][Website]
    • RA-Transformer: "Retrieval-Augmented Transformer for Image Captioning", International Conference on Content-Based Multimedia Indexing (CBMI), 2022 (University of Modena and Reggio Emilia, Italy). [Paper]
    • GRIT: "GRIT: Faster and Better Image captioning Transformer Using Dual Visual Features", ECCV, 2022 (Tohoku University + RIKEN AIP). [Paper][PyTorch]
    • ?: "Object-Centric Unsupervised Image Captioning", ECCV, 2022 (Meta). [Paper][PyTorch]
    • UEDVC: "Unifying Event Detection and Captioning as Sequence Generation via Pre-Training", ECCV, 2022 (Renmin University of China). [Paper][PyTorch]
    • TIger: "Explicit Image Caption Editing", ECCV, 2022 (Zhejiang University). [Paper][Code]
    • DML: "Learning Distinct and Representative Modes for Image Captioning", NeurIPS, 2022 (University of Adelaide, Australia). [Paper]
    • P2C: "Paraphrasing Is All You Need for Novel Object Captioning", NeurIPS, 2022 (NTU + CMU). [Paper]
    • BEST: "Visual Clues: Bridging Vision and Language Foundations for Image Paragraph Captioning", NeurIPS, 2022 (Microsoft). [Paper]
    • CapDec: "Text-Only Training for Image Captioning using Noise-Injected CLIP", EMNLP, 2022 (Tel Aviv). [Paper][PyTorch]
    • ?: "Focus! Relevant and Sufficient Context Selection for News Image Captioning", EMNLP Findings, 2022 (UC Davis). [Paper]
    • CVLNM: "Learning to Collocate Visual-Linguistic Neural Modules for Image Captioning", IJCV, 2022 (Southeast University, China). [Paper][PyTorch]
    • ViNTER: "ViNTER: Image Narrative Generation with Emotion-Arc-Aware Transformer", arXiv, 2022 (The University of Tokyo). [Paper]
    • VaT: "Variational Transformer: A Framework Beyond the Trade-off between Accuracy and Diversity for Image Captioning", arXiv, 2022 (Tongji University). [Paper]
    • SCST-GEG: "Distinctive Image Captioning via CLIP Guided Group Optimization", arXiv, 2022 (McGill University). [Paper]
    • ?: "Vision Transformer Based Model for Describing a Set of Images as a Story", arXiv, 2022 (The University of Western Australia). [Paper]
    • CLM: "Zero-shot Image Captioning by Anchor-augmented Vision-Language Space Alignment", arXiv, 2022 (CAS). [Paper]
    • PromptCap: "PromptCap: Prompt-Guided Task-Aware Image Captioning", arXiv, 2022 (UW). [Paper]
    • PTSN: "Progressive Tree-Structured Prototype Network for End-to-End Image Captioning", arXiv, 2022 (University of Electronic Science and Technology of China (UESTC)). [Paper][PyTorch (in construction)]
    • DDCap: "Exploring Discrete Diffusion Models for Image Captioning", arXiv, 2022 (Microsoft). [Paper][PyTorch]
    • ARIC: "Aesthetically Relevant Image Captioning", AAAI, 2023 (Shenzhen University). [Paper][Code (in construction)]
    • UAIC: "Uncertainty-Aware Image Captioning", AAAI, 2023 (Meituan). [Paper]
    • LiMBeR: "Linearly Mapping from Image to Text Space", ICLR, 2023 (Brown University). [Paper]
    • DiscriTune: "Cross-Domain Image Captioning with Discriminative Finetuning", CVPR, 2023 (Universitat Pompeu Fabra (UPF), Spain). [Paper]
    • LIBRA: "Model-Agnostic Gender Debiased Image Captioning", CVPR, 2023 (Osaka University). [Paper]
    • A-CAP: "A-CAP: Anticipation Captioning with Commonsense Knowledge", CVPR, 2023 (The University of Tokyo). [Paper]
    • HAAV: "HAAV: Hierarchical Aggregation of Augmented Views for Image Captioning", CVPR, 2023 (Georgia Tech). [Paper][Website]
    • PAC-S: "Positive-Augmented Contrastive Learning for Image and Video Captioning Evaluation", CVPR, 2023 (UniMoRE, Italy). [Paper][PyTorch]
    • SCD-Net: "Semantic-Conditional Diffusion Networks for Image Captioning", CVPR, 2023 (JD). [Paper][PyTorch]
    • ConZIC: "ConZIC: Controllable Zero-shot Image Captioning by Sampling-Based Polishing", CVPR, 2023 (Xidian University). [Paper][PyTorch]
    • SmallCap: "SmallCap: Lightweight Image Captioning Prompted with Retrieval Augmentation", CVPR, 2023 (University of Lisbon, Portugal). [Paper][PyTorch]
    • LSML: "Crossing the Gap: Domain Generalization for Image Captioning", CVPR, 2023 (USTC). [Paper]
    • MuE: "You Need Multiple Exiting: Dynamic Early Exiting for Accelerating Unified Vision Language Model", CVPR, 2023 (NC State). [Paper]
    • OxfordTVG-HIC: "OxfordTVG-HIC: Can Machine Make Humorous Captions from Images?", ICCV, 2023 (Oxford). [Paper][Website]
    • ?: "Guiding Image Captioning Models Toward More Specific Captions", ICCV, 2023 (Google). [Paper]
    • ViECap: "Transferable Decoding with Visual Entities for Zero-Shot Image Captioning", ICCV, 2023 (Southern University of Science and Technology). [Paper][Code (in construction)]
    • PMA-Net: "With a Little Help from your own Past: Prototypical Memory Networks for Image Captioning", ICCV, 2023 (University of Modena and Reggio Emilia (UniMoRE), Italy). [Paper][Code (in construction)]
    • TSG: "Transforming Visual Scene Graphs to Image Captions", ACL, 2023 (Southeast University, China). [Paper][PyTorch]
    • InfoMetIC: "InfoMetIC: An Informative Metric for Reference-free Image Caption Evaluation", ACL, 2023 (Renmin University of China). [Paper][Code (in construction)]
    • MultiCapCLIP: "MultiCapCLIP: Auto-Encoding Prompts for Zero-Shot Multilingual Visual Captioning", ACL, 2023 (Peking). [Paper][PyTorch (in construction)]
    • Cur-VL: "Learning from Children: Improving Image-Caption Pretraining via Curriculum", ACL Findings, 2023 (Columbia). [Paper][Code (in construction)]
    • ?: "Text-Only Training for Visual Storytelling", ACMMM, 2023 (USTC). [Paper]
    • CgT-GAN: "CgT-GAN: CLIP-guided Text GAN for Image Captioning", ACMMM, 2023 (USTC). [Paper][PyTorch]
    • Re-ViLM: "Re-ViLM: Retrieval-Augmented Visual Language Model for Zero and Few-Shot Image Captioning", arXiv, 2023 (NVIDIA). [Paper]
    • Knight: "From Association to Generation: Text-only Captioning by Unsupervised Cross-modal Mapping", arXiv, 2023 (Alibaba). [Paper][PyTorch]
    • VTT: "Visual Transformation Telling", arXiv, 2023 (CAS). [Paper]
    • Caption-Anything: "Caption Anything: Interactive Image Description with Diverse Multimodal Controls", arXiv, 2023 (Southern University of Science and Technology). [Paper][PyTorch]
    • COLA: "COLA: How to adapt vision-language models to Compose Objects Localized with Attributes?", arXiv, 2023 (Boston). [Paper]
    • ?: "Data Curation for Image Captioning with Text-to-Image Generative Models", arXiv, 2023 (University of Copenhagen, Denmark). [Paper]
    • TLC: "Simple Token-Level Confidence Improves Caption Correctness", arXiv, 2023 (Meta). [Paper]
    • VIVID: "Album Storytelling with Iterative Story-aware Captioning and Large Language Models", arXiv, 2023 (Peking). [Paper]
    • MCDG: "Text-Only Image Captioning with Multi-Context Data Generation", arXiv, 2023 (USTC). [Paper]
    • FuseCap: "FuseCap: Leveraging Large Language Models to Fuse Visual Data into Enriched Image Captions", arXiv, 2023 (Israel Institute of Technology). [Paper]
    • StoryGen: "Intelligent Grimm - Open-ended Visual Storytelling via Latent Diffusion Models", arXiv, 2023 (Shanghai AI Lab). [Paper][PyTorch (in construction)][Website]
    • ?: "Improving Image Captioning Descriptiveness by Ranking and LLM-based Fusion", arXiv, 2023 (University of Milano-Bicocca, Italy). [Paper]
    • SITTA: "SITTA: A Semantic Image-Text Alignment for Image Captioning", arXiv, 2023 (Johannes Kepler University, Austria). [Paper][PyTorch]
    • MMNS: "Multimodal Neurons in Pretrained Text-Only Transformers", arXiv, 2023 (MIT). [Paper]
    • RegionBLIP: "RegionBLIP: A Unified Multi-modal Pre-training Framework for Holistic and Regional Comprehension", arXiv, 2023 (Alibaba). [Paper][PyTorch]
    • ?: "Visually-Aware Context Modeling for News Image Captioning", arXiv, 2023 (KU Leuven). [Paper]
  • Video:
    • Masked Transformers: "End-to-End Dense Video Captioning with Masked Transformer", CVPR, 2018 (UMich + Salesforce). [Paper][PyTorch]
    • BMT: "A Better Use of Audio-Visual Cues: Dense Video Captioning with Bi-modal Transformer", BMVC, 2020 (Tampere University, Finland). [Paper][PyTorch][Website]
    • ?: "Optimizing Latency for Online Video Captioning Using Audio-Visual Transformers", Interspeech, 2021 (MERL). [Paper]
    • PDVC: "End-to-End Dense Video Captioning with Parallel Decoding", ICCV, 2021 (HKU + Southern University of Science and Technology). [Paper][PyTorch]
    • MV-GPT: "End-to-end Generative Pretraining for Multimodal Video Captioning", CVPR, 2022 (Google). [Paper]
    • VGCL: "Video-Guided Curriculum Learning for Spoken Video Grounding", ACMMM, 2022 (Zhejiang University). [Paper][PyTorch]
    • UVC-VI: "Aligning Source Visual and Target Language Domains for Unpaired Video Captioning", TPAMI, 2022 (Peking University). [Paper]
    • D2: "Dual-Level Decoupled Transformer for Video Captioning", arXiv, 2022 (Northwestern Polytechnical University, China). [Paper]
    • VASTA: "Diverse Video Captioning by Adaptive Spatio-temporal Attention", arXiv, 2022 (University of Tübingen, Germany). [Paper]
    • VCRN: "Visual Commonsense-aware Representation Network for Video Captioning", arXiv, 2022 (University of Electronic Science and Technology of China (UESTC)). [Paper][PyTorch (in construction)]
    • RSFD: "Refined Semantic Enhancement towards Frequency Diffusion for Video Captioning", arXiv, 2022 (Wuhan University of Technology). [Paper][Code (in construction)]
    • VLTinT: "VLTinT: Visual-Linguistic Transformer-in-Transformer for Coherent Video Paragraph Captioning", AAAI, 2023 (University of Arkansas). [Paper]
    • Vid2Seq: "Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning", CVPR, 2023 (Google). [Paper][Website]
    • TextKG: "Text with Knowledge Graph Augmented Transformer for Video Captioning", CVPR, 2023 (ByteDance). [Paper]
    • G2L: "G2L: Semantically Aligned and Uniform Video Grounding via Geodesic and Game Theory", ICCV, 2023 (Peking). [Paper]
    • Movie101: "Movie101: A New Movie Understanding Benchmark", ACL, 2023 (Renmin University of China). [Paper][Code (in construction)]
    • ?: "Implicit and Explicit Commonsense for Multi-sentence Video Captioning", arXiv, 2023 (UBC). [Paper]
    • Video-Verbalization: "A Video Is Worth 4096 Tokens: Verbalize Story Videos To Understand Them In Zero Shot", arXiv, 2023 (Adobe). [Paper]
    • Dense-VOC: "Dense Video Object Captioning from Disjoint Supervision", arXiv, 2023 (Google). [Paper]
    • ?: "Exploring the Role of Audio in Video Captioning", arXiv, 2023 (ByteDance). [Paper]
    • ZeroTA: "Zero-Shot Dense Video Captioning by Jointly Optimizing Text and Moment", arXiv, 2023 (KAIST). [Paper]
  • 3D:
    • Vote2Cap-DETR: "End-to-End 3D Dense Captioning with Vote2Cap-DETR", CVPR, 2023 (Fudan). [Paper][PyTorch]
    • Cap3D: "Scalable 3D Captioning with Pretrained Models", arXiv, 2023 (UMich). [Paper][Dataset]
    • Vote2Cap-DETR++: "Vote2Cap-DETR++: Decoupling Localization and Describing for End-to-End 3D Dense Captioning", arXiv, 2023 (Fudan). [Paper][PyTorch]
  • Others:

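Most of the captioning entries above, whatever their pre-training recipe, share the same inference-time backbone: an autoregressive transformer decoder conditioned on visual features. As a rough point of reference, below is a minimal PyTorch sketch of greedy caption decoding; every dimension, vocabulary id, and module name is an illustrative assumption rather than code from any paper listed here.

```python
import torch
import torch.nn as nn

# Minimal sketch of the decoding loop shared by most captioning models above.
# All sizes and special-token ids are illustrative assumptions.
class TinyCaptioner(nn.Module):
    def __init__(self, vocab_size=1000, d_model=256, bos_id=1, eos_id=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.lm_head = nn.Linear(d_model, vocab_size)
        self.bos_id, self.eos_id = bos_id, eos_id

    @torch.no_grad()
    def greedy_decode(self, visual_tokens, max_len=20):
        # visual_tokens: (B, N, d_model) features from any visual backbone
        tokens = torch.full((visual_tokens.size(0), 1), self.bos_id, dtype=torch.long)
        for _ in range(max_len):
            h = self.decoder(self.embed(tokens), visual_tokens)
            next_tok = self.lm_head(h[:, -1]).argmax(-1, keepdim=True)
            tokens = torch.cat([tokens, next_tok], dim=1)
            if (next_tok == self.eos_id).all():
                break
        return tokens

model = TinyCaptioner()
print(model.greedy_decode(torch.randn(2, 49, 256)).shape)  # e.g. (2, 21)
```

Training replaces this loop with teacher forcing under a causal mask; at inference the loop is causal by construction, since future tokens do not exist yet.
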
[Back to Overview]

Visual Question Answering

  • General:
    • MCAN: "Deep Modular Co-Attention Networks for Visual Question Answering", CVPR, 2019 (Hangzhou Dianzi University). [Paper][PyTorch]
    • M4C: "Iterative Answer Prediction with Pointer-Augmented Multimodal Transformers for TextVQA", CVPR, 2020 (Facebook). [Paper]
    • SA-M4C: "Spatially Aware Multimodal Transformers for TextVQA", ECCV, 2020 (Georgia Tech). [Paper][PyTorch][Website]
    • ConClaT: "Contrast and Classify: Training Robust VQA Models", ICCV, 2021 (Georgia Tech). [Paper]
    • TRAR: "TRAR: Routing the Attention Spans in Transformer for Visual Question Answering", ICCV, 2021 (Xiamen University). [Paper]
    • UniQer: "Unified Questioner Transformer for Descriptive Question Generation in Goal-Oriented Visual Dialogue", ICCV, 2021 (Keio). [Paper]
    • TxT: "TxT: Crossmodal End-to-End Learning with Transformers", GCPR, 2021 (TU Darmstadt). [Paper]
    • ProTo: "ProTo: Program-Guided Transformer for Program-Guided Tasks", NeurIPS, 2021 (Georgia Tech). [Paper]
    • VisQA: "VisQA: X-raying Vision and Language Reasoning in Transformers", arXiv, 2021 (INSA-Lyon). [Paper][PyTorch]
    • Block-Skim: "Block-Skim: Efficient Question Answering for Transformer", AAAI, 2022 (Shanghai Jiao Tong University). [Paper]
    • RelViT: "RelViT: Concept-guided Vision Transformer for Visual Relational Reasoning", ICLR, 2022 (NVIDIA). [Paper][PyTorch]
    • Hypergraph-Transformer: "Hypergraph Transformer: Weakly-supervised Multi-hop Reasoning for Knowledge-based Visual Question Answering", ACL, 2022 (SNU). [Paper][Code (in construction)]
    • X-Trans2Cap: "X-Trans2Cap: Cross-Modal Knowledge Transfer using Transformer for 3D Dense Captioning", CVPR, 2022 (CUHK). [Paper]
    • UTC: "UTC: A Unified Transformer with Inter-Task Contrastive Learning for Visual Dialog", CVPR, 2022 (Fudan). [Paper]
    • LaTr: "LaTr: Layout-Aware Transformer for Scene-Text VQA", CVPR, 2022 (Amazon). [Paper]
    • QAA: "Query and Attention Augmentation for Knowledge-Based Explainable Reasoning", CVPR, 2022 (University of Minnesota). [Paper][PyTorch]
    • WebQA: "WebQA: Multihop and Multimodal QA", CVPR, 2022 (CMU + Microsoft). [Paper][PyTorch][Website]
    • ?: "Efficient Adaptive Image-Language Learning for Visual Question Answering", CVPRW, 2022 (Google). [Paper]
    • cViL: "cViL: Cross-Lingual Training of Vision-Language Models using Knowledge Distillation", ICPR, 2022 (IIIT, Hyderabad). [Paper]
    • Distinguishing-VQA: "Overcoming Language Priors in Visual Question Answering via Distinguishing Superficially Similar Instances", COLING, 2022 (Nankai University). [Paper][Code (in construction)]
    • ?: "Weakly Supervised Grounding for VQA in Vision-Language Transformers", ECCV, 2022 (UCF). [Paper][PyTorch (in construction)]
    • MUST-VQA: "MUST-VQA: MUltilingual Scene-text VQA", ECCVW, 2022 (UAB, Spain). [Paper]
    • ?: "Training Vision-Language Models with Less Bimodal Supervision", Automated Knowledge Base Construction (AKBC), 2022 (Tel Aviv). [Paper]
    • REVIVE: "REVIVE: Regional Visual Representation Matters in Knowledge-Based Visual Question Answering", NeurIPS, 2022 (Microsoft). [Paper]
    • ScienceQA: "Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering", NeurIPS, 2022 (AI2). [Paper][PyTorch][Website]
    • FrozenBiLM: "Zero-Shot Video Question Answering via Frozen Bidirectional Language Models", NeurIPS, 2022 (INRIA). [Paper][PyTorch]
    • MuRAG: "MuRAG: Multimodal Retrieval-Augmented Generator for Open Question Answering over Images and Text", EMNLP, 2022 (Google). [Paper]
    • MMBS: "Towards Robust Visual Question Answering: Making the Most of Biased Samples via Contrastive Learning", EMNLP, 2022 (CAS). [Paper][PyTorch]
    • EnFoRe: "Entity-Focused Dense Passage Retrieval for Outside-Knowledge Visual Question Answering", EMNLP, 2022 (UT Austin). [Paper]
    • CRIPP-VQA: "CRIPP-VQA: Counterfactual Reasoning about Implicit Physical Properties via Video Question Answering", EMNLP, 2022 (Arizona State University). [Paper][PyTorch][Website]
    • PnP-VQA: "Plug-and-Play VQA: Zero-shot VQA by Conjoining Large Pretrained Models with Zero Training", EMNLP Findings, 2022 (Salesforce). [Paper]
    • TMN: "Transformer Module Networks for Systematic Generalization in Visual Question Answering", arXiv, 2022 (Fujitsu). [Paper]
    • ?: "On the Efficacy of Co-Attention Transformer Layers in Visual Question Answering", arXiv, 2022 (Birla Institute of Technology Mesra, India). [Paper]
    • DST: "Towards Efficient and Elastic Visual Question Answering with Doubly Slimmable Transformer", arXiv, 2022 (Hangzhou Dianzi University). [Paper]
    • PAVCR: "Attention Mechanism based Cognition-level Scene Understanding", arXiv, 2022 (Leibniz University of Hannover, Germany). [Paper]
    • TAG: "TAG: Boosting Text-VQA via Text-aware Visual Question-answer Generation", arXiv, 2022 (Maryland + Salesforce). [Paper][PyTorch]
    • UniCon: "UniCon: Unidirectional Split Learning with Contrastive Loss for Visual Question Answering", arXiv, 2022 (University of Tokyo). [Paper]
    • CLOVE: "Symbolic Replay: Scene Graph as Prompt for Continual Learning on VQA Task", arXiv, 2022 (NUS). [Paper][Code (in construction)]
    • mVQA: "Towards Multi-Lingual Visual Question Answering", arXiv, 2022 (Google). [Paper]
    • CIB: "Finetuning Pretrained Vision-Language Models with Correlation Information Bottleneck for Robust Visual Question Answering", arXiv, 2022 (Xi'an Jiaotong University). [Paper]
    • ?: "Compressing And Debiasing Vision-Language Pre-Trained Models for Visual Question Answering", arXiv, 2022 (CAS). [Paper]
    • VLR: "Visually Grounded VQA by Lattice-based Retrieval", arXiv, 2022 (University of Bremen, Germany). [Paper]
    • CMCL: "Cross-Modal Contrastive Learning for Robust Reasoning in VQA", arXiv, 2022 (University of Sydney). [Paper][PyTorch]
    • CL-CrossVQA: "CL-CrossVQA: A Continual Learning Benchmark for Cross-Domain Visual Question Answering", arXiv, 2022 (LMU Munich). [Paper]
    • OFA-X: "Harnessing the Power of Multi-Task Pretraining for Ground-Truth Level Natural Language Explanations", arXiv, 2022 (University of Hamburg, Germany). [Paper][Code (in construction)]
    • VLC-BERT: "VLC-BERT: Visual Question Answering with Contextualized Commonsense Knowledge", WACV, 2023 (UBC, Canada). [Paper][PyTorch]
    • LTG: "Locate Then Generate: Bridging Vision and Language with Bounding Box for Scene-Text VQA", AAAI, 2023 (USTC). [Paper]
    • SelTDA: "Q: How to Specialize Large Vision-Language Models to Data-Scarce VQA Tasks? A: Self-Train on Unlabeled Images!", CVPR, 2023 (NEC). [Paper][PyTorch]
    • Prophet: "Prompting Large Language Models with Answer Heuristics for Knowledge-based Visual Question Answering", CVPR, 2023 (Hangzhou Dianzi University). [Paper][PyTorch]
    • GenB: "Generative Bias for Robust Visual Question Answering", CVPR, 2023 (KAIST). [Paper]
    • MixPHM: "MixPHM: Redundancy-Aware Parameter-Efficient Tuning for Low-Resource Visual Question Answering", CVPR, 2023 (Xi'an Jiaotong University). [Paper]
    • POEM: "Divide and Conquer: Answering Questions with Object Factorization and Compositional Reasoning", CVPR, 2023 (University of Minnesota (UMN)). [Paper][PyTorch]
    • LYP: "Improving Selective Visual Question Answering by Learning From Your Peers", CVPR, 2023 (Meta). [Paper]
    • VQACL: "VQACL: A Novel Visual Question Answering Continual Learning Setting", CVPR, 2023 (CAS). [Paper][PyTorch]
    • Img2LLM: "From Images to Textual Prompts: Zero-shot VQA with Frozen Large Language Models", CVPR, 2023 (Salesforce). [Paper][PyTorch]
    • Imp-VQA: "Logical Implications for Visual Question Answering Consistency", CVPR, 2023 (University of Bern, Switzerland). [Paper][PyTorch][Website]
    • RMLVQA: "RMLVQA: A Margin Loss Approach For Visual Question Answering with Language Biases", CVPR, 2023 (Indian Institute of Science). [Paper][PyTorch]
    • S3C: "S3C: Semi-Supervised VQA Natural Language Explanation via Self-Critical Learning", CVPR, 2023 (Northwestern Polytechnical University, China). [Paper]
    • ?: "Diversifying Joint Vision-Language Tokenization Learning", CVPRW, 2023 (DeepMind). [Paper]
    • VQAAnswerTherapy: "VQA Therapy: Exploring Answer Differences by Visually Grounding Answers", ICCV, 2023 (UT Austin). [Paper][Website]
    • TwO: "Combo of Thinking and Observing for Outside-Knowledge VQA", ACL, 2023 (ByteDance). [Paper][Code (in construction)]
    • Mod-Zero-VQA: "Modularized Zero-shot VQA with Pre-trained Models", ACL Findings, 2023 (Singapore Management University). [Paper]
    • SaL: "Separate and Locate: Rethink the Text in Text-based Visual Question Answering", ACMMM, 2023 (CAS). [Paper][Code (in construction)]
    • InfoSeek: "Can Pre-trained Vision and Language Models Answer Visual Information-Seeking Questions?", arXiv, 2023 (Google). [Paper][Website]
    • CoVGT: "Contrastive Video Question Answering via Video Graph Transformer", arXiv, 2023 (NUS). [Paper]
    • RVQA: "Toward Unsupervised Realistic Visual Question Answering", arXiv, 2023 (UCSD). [Paper]
    • WHOOPS: "Breaking Common Sense: WHOOPS! A Vision-and-Language Benchmark of Synthetic and Compositional Images", arXiv, 2023 (Ben Gurion University of the Negev, Israel). [Paper][Website]
    • IVLT: "Causality-aware Visual Scene Discovery for Cross-Modal Question Reasoning", arXiv, 2023 (Sun Yat-sen University). [Paper]
    • MGT: "Multimodal Graph Transformer for Multimodal Question Answering", arXiv, 2023 (UC Santa Cruz). [Paper]
    • VCSR: "Visual Causal Scene Refinement for Video Question Answering", arXiv, 2023 (Sun Yat-sen University). [Paper]
    • SeeTRUE: "What You See is What You Read? Improving Text-Image Alignment Evaluation", arXiv, 2023 (Google). [Paper][PyTorch][Website]
    • JADE: "Enhancing Vision-Language Pre-Training with Jointly Learned Questioner and Dense Captioner", arXiv, 2023 (CAS). [Paper]
    • NuScenes-QA: "NuScenes-QA: A Multi-modal Visual Question Answering Benchmark for Autonomous Driving Scenario", arXiv, 2023 (Fudan). [Paper][Code (in construction)]
    • LAMOC: "Zero-shot Visual Question Answering with Language Model Feedback", arXiv, 2023 (Renmin University of China). [Paper][PyTorch]
    • PW-VQA: "Unveiling Cross Modality Bias in Visual Question Answering: A Causal View with Possible Worlds VQA", arXiv, 2023 (University of Rochester). [Paper]
    • Encyclopedic-VQA: "Encyclopedic VQA: Visual questions about detailed properties of fine-grained categories", arXiv, 2023 (Google). [Paper]
    • ?: "Investigating Prompting Techniques for Zero- and Few-Shot Visual Question Answering", arXiv, 2023 (Mila). [Paper]
    • R2A: "Retrieving-to-Answer: Zero-Shot Video Question Answering with Frozen Large Language Models", arXiv, 2023 (CUHK). [Paper]
    • WikiTiLo: "Can Vision-Language Models be a Good Guesser? Exploring VLMs for Times and Location Reasoning", arXiv, 2023 (LMU Munich). [Paper]
    • GenVQA: "Generative Visual Question Answering", arXiv, 2023 (UW). [Paper]
    • Context-VQA: "Context-VQA: Towards Context-Aware and Purposeful Visual Question Answering", arXiv, 2023 (Stanford). [Paper]
    • BLIVA: "BLIVA: A Simple Multimodal LLM for Better Handling of Text-Rich Visual Questions", arXiv, 2023 (UCSD). [Paper]
    • NExT-GQA: "Can I Trust Your Answer? Visually Grounded Video Question Answering", arXiv, 2023 (NUS). [Paper]
    • CURE: "Measuring and Improving Chain-of-Thought Reasoning in Vision-Language Models", arXiv, 2023 (SRI). [Paper][Code (in construction)]
  • Video:
    • ?: "Mounting Video Metadata on Transformer-based Language Model for Open-ended Video Question Answering", arXiv, 2021 (Seoul National University). [Paper]
    • TPT: "Temporal Pyramid Transformer with Multimodal Interaction for Video Question Answering", arXiv, 2021 (CAS). [Paper]
    • SwinBERT: "SwinBERT: End-to-End Transformers with Sparse Attention for Video Captioning", CVPR, 2022 (Microsoft). [Paper][PyTorch]
    • WildQA: "WildQA: In-the-Wild Video Question Answering", COLING, 2022 (UMich). [Paper][Website]
    • VGT: "Video Graph Transformer for Video Question Answering", ECCV, 2022 (Sea AI Lab). [Paper][PyTorch]
    • ?: "Video Question Answering with Iterative Video-Text Co-Tokenization", ECCV, 2022 (Google). [Paper][Website (in construction)]
    • DeST: "Learning Fine-Grained Visual Understanding for Video Question Answering via Decoupling Spatial-Temporal Modeling", BMVC, 2022 (NTU). [Paper][PyTorch]
    • ViteVQA: "Towards Video Text Visual Question Answering: Benchmark and Baseline", NeurIPS, 2022 (ByteDance). [Paper][GitHub]
    • WSQG: "Frame-Subtitle Self-Supervision for Multi-Modal Video Question Answering", arXiv, 2022 (Zhejiang University). [Paper]
    • LocAns: "Locate before Answering: Answer Guided Question Localization for Video Question Answering", arXiv, 2022 (Fudan University). [Paper]
    • NewsVideoQA: "Watching the News: Towards VideoQA Models that can Read", arXiv, 2022 (IIIT Hyderabad, India). [Paper]
    • SHG-VQA: "Learning Situation Hyper-Graphs for Video Question Answering", CVPR, 2023 (UCF). [Paper][PyTorch]
    • ANetQA: "ANetQA: A Large-scale Benchmark for Fine-grained Compositional Reasoning over Untrimmed Videos", CVPR, 2023 (Hangzhou Dianzi University). [Paper][Website]
    • MCR: "Discovering the Real Association: Multimodal Causal Reasoning in Video Question Answering", CVPR, 2023 (Beijing Institute of Technology). [Paper][Code (in construction)]
    • MIST: "MIST: Multi-modal Iterative Spatial-Temporal Transformer for Long-form Video Question Answering", CVPR, 2023 (NUS). [Paper][PyTorch]
    • CaKE-LM: "Language Models are Causal Knowledge Extractors for Zero-shot Video Question Answering", CVPRW, 2023 (NTU + Columbia). [Paper]
    • TransSTR: "Discovering Spatio-Temporal Rationales for Video Question Answering", ICCV, 2023 (NUS). [Paper]
    • Tem-adapter: "Tem-adapter: Adapting Image-Text Pretraining for Video Question Answer", ICCV, 2023 (CMU). [Paper][Code (in construction)]
    • OVQA: "Open-vocabulary Video Question Answering: A New Benchmark for Evaluating the Generalizability of Video Question Answering Models", ICCV, 2023 (Korea University). [Paper]
    • RaFormer: "Redundancy-aware Transformer for Video Question Answering", ACMMM, 2023 (NUS). [Paper]
    • SeViLA: "Self-Chained Image-Language Model for Video Localization and Question Answering", arXiv, 2023 (UNC). [Paper][PyTorch]
    • FunQA: "FunQA: Towards Surprising Video Comprehension", arXiv, 2023 (Beijing University of Posts and Telecommunications). [Paper][Code (in construction)][Website]
  • 3D:
    • 3D-VQA: "CLIP-Guided Vision-Language Pre-training for Question Answering in 3D Scenes", CVPRW, 2023 (ETHZ). [Paper][Code (in construction)]
    • Multi-CLIP: "Multi-CLIP: Contrastive Vision-Language Pre-training for Question Answering tasks in 3D Scenes", arXiv, 2023 (ETHZ). [Paper]
  • Audio-Visual:
    • PSTP-Net: "Progressive Spatio-temporal Perception for Audio-Visual Question Answering", ACMMM, 2023 (Renmin University of China). [Paper][PyTorch]

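Architecturally, many of the VQA entries above (MCAN and its descendants in particular) reduce to self-attention over question tokens plus cross-attention from the question to image regions, followed by answer classification. Below is a minimal illustrative PyTorch sketch; all shapes and the answer-vocabulary size are assumptions, not any listed model's actual configuration.

```python
import torch
import torch.nn as nn

# Minimal sketch of co-attention-style fusion for VQA.
# Dimensions and the answer vocabulary are illustrative assumptions.
class TinyCoAttnVQA(nn.Module):
    def __init__(self, d_model=256, num_answers=100):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, 4, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, 4, batch_first=True)
        self.classifier = nn.Linear(d_model, num_answers)

    def forward(self, q_tokens, img_tokens):
        # q_tokens: (B, Lq, d); img_tokens: (B, Nv, d)
        q, _ = self.self_attn(q_tokens, q_tokens, q_tokens)    # intra-question
        fused, _ = self.cross_attn(q, img_tokens, img_tokens)  # question attends to regions
        return self.classifier(fused.mean(dim=1))              # pooled -> answer logits

model = TinyCoAttnVQA()
print(model(torch.randn(2, 14, 256), torch.randn(2, 36, 256)).shape)  # (2, 100)
```

Real systems stack several such blocks with feed-forward layers and normalization, and generative approaches swap the classifier for a text decoder, but the fusion pattern is the same.
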
[Back to Overview]

Visual Grounding

  • General:
    • TransRefer3D: "TransRefer3D: Entity-and-Relation Aware Transformer for Fine-Grained 3D Visual Grounding", ACMMM, 2021 (Beihang University). [Paper]
    • ?: "Vision-and-Language or Vision-for-Language? On Cross-Modal Influence in Multimodal Transformers", EMNLP, 2021 (University of Trento). [Paper]
    • MITVG: "Multimodal Incremental Transformer with Visual Grounding for Visual Dialogue Generation", ACL Findings, 2021 (Tencent). [Paper]
    • TransVG: "TransVG: End-to-End Visual Grounding with Transformers", ICCV, 2021 (USTC). [Paper]
    • GSRTR: "Grounded Situation Recognition with Transformers", BMVC, 2021 (POSTECH). [Paper][PyTorch]
    • Referring-Transformer: "Referring Transformer: A One-step Approach to Multi-task Visual Grounding", NeurIPS, 2021 (UBC). [Paper]
    • VGTR: "Visual Grounding with Transformers", arXiv, 2021 (Beihang University). [Paper]
    • UNICORN: "Crossing the Format Boundary of Text and Boxes: Towards Unified Vision-Language Modeling", arXiv, 2021 (Microsoft). [Paper]
    • Word2Pix: "Word2Pix: Word to Pixel Cross Attention Transformer in Visual Grounding", arXiv, 2021 (A*STAR). [Paper]
    • CoFormer: "Collaborative Transformers for Grounded Situation Recognition", CVPR, 2022 (POSTECH). [Paper][PyTorch]
    • MVT: "Multi-View Transformer for 3D Visual Grounding", CVPR, 2022 (CUHK). [Paper][PyTorch]
    • GLIP: "Grounded Language-Image Pre-training", CVPR, 2022 (Microsoft). [Paper][PyTorch]
    • M-DGT: "Multi-Modal Dynamic Graph Transformer for Visual Grounding", CVPR, 2022 (University of Toronto). [Paper][PyTorch]
    • QRNet: "Shifting More Attention to Visual Backbone: Query-modulated Refinement Networks for End-to-End Visual Grounding", CVPR, 2022 (East China Normal University). [Paper][PyTorch]
    • SiRi: "SiRi: A Simple Selective Retraining Mechanism for Transformer-based Visual Grounding", ECCV, 2022 (JD). [Paper][PyTorch]
    • UniTAB: "UniTAB: Unifying Text and Box Outputs for Grounded Vision-Language Modeling", ECCV, 2022 (Microsoft). [Paper]
    • TAP: "Improving Closed and Open-Vocabulary Attribute Prediction Using Transformers", ECCV, 2022 (Adobe). [Paper][GitHub][Website]
    • YORO: "YORO - Lightweight End to End Visual Grounding", ECCVW, 2022 (Amazon). [Paper]
    • GLIPv2: "GLIPv2: Unifying Localization and Vision-Language Understanding", NeurIPS, 2022 (Microsoft). [Paper][PyTorch]
    • ?: "Do Vision-and-Language Transformers Learn Grounded Predicate-Noun Dependencies?", EMNLP, 2022 (Aix-Marseille University, France). [Paper]
    • SeqTR: "SeqTR: A Simple yet Universal Network for Visual Grounding", arXiv, 2022 (Xiamen University). [Paper][Code (in construction)]
    • TransVG++: "TransVG++: End-to-End Visual Grounding with Language Conditioned Vision Transformer", arXiv, 2022 (USTC). [Paper]
    • HLGT: "Hierarchical Local-Global Transformer for Temporal Sentence Grounding", arXiv, 2022 (Huazhong University of Science and Technology). [Paper]
    • Dynamic-MDETR: "Dynamic MDETR: A Dynamic Multimodal Transformer Decoder for Visual Grounding", arXiv, 2022 (Nanjing University). [Paper]
    • ClipCrop: "ClipCrop: Conditioned Cropping Driven by Vision-Language Model", arXiv, 2022 (The University of Tokyo). [Paper]
    • VL-MPAG-Net: "Grounding Scene Graphs on Natural Images via Visio-Lingual Message Passing", WACV, 2023 (Indian Institute of Science). [Paper][PyTorch][Website]
    • CLEVER: "Visually Grounded Commonsense Knowledge Acquisition", AAAI, 2023 (Tsinghua University). [Paper][PyTorch]
    • LADS: "Referring Expression Comprehension Using Language Adaptive Inference", AAAI, 2023 (Zhejiang University). [Paper]
    • ?: "Learning to Jointly Share and Prune Weights for Grounding Based Vision and Language Models", ICLR, 2023 (Samsung). [Paper]
    • AMC: "Improving Visual Grounding by Encouraging Consistent Gradient-based Explanations", CVPR, 2023 (Adobe). [Paper][PyTorch][Website]
    • CounTEX: "Grounding Counterfactual Explanation of Image Classifiers to Textual Concept Space", CVPR, 2023 (Amazon). [Paper]
    • SK-VG: "Advancing Visual Grounding with Scene Knowledge: Benchmark and Method", CVPR, 2023 (Sun Yat-sen University). [Paper][Code (in construction)]
    • D-ViTMDETR: "Dynamic Inference with Grounding Based Vision and Language Models", CVPR, 2023 (Amazon). [Paper]
    • ?: "Similarity Maps for Self-Training Weakly-Supervised Phrase Grounding", CVPR, 2023 (Tel Aviv). [Paper][Code (in construction)]
    • RefCLIP: "RefCLIP: A Universal Teacher for Weakly Supervised Referring Expression Comprehension", CVPR, 2023 (Xiamen University). [Paper][PyTorch][Website]
    • FROMAGe: "Grounding Language Models to Images for Multimodal Inputs and Outputs", ICML, 2023 (CMU). [Paper][PyTorch][Website]
    • IR-VG: "Iterative Robust Visual Grounding with Masked Reference based Centerpoint Supervision", ICCV, 2023 (Beihang). [Paper][Code (in construction)]
    • RefEgo: "RefEgo: Referring Expression Comprehension Dataset from First-Person Perception of Ego4D", ICCV, 2023 (RIKEN). [Paper]
    • CLIP-VG: "CLIP-VG: Self-paced Curriculum Adapting of CLIP via Exploiting Pseudo-Language Labels for Visual Grounding", arXiv, 2023 (CAS). [Paper][Code (in construction)]
    • TreePrompt: "TreePrompt: Learning to Compose Tree Prompts for Explainable Visual Grounding", arXiv, 2023 (HKUST). [Paper]
    • OctoBERT: "World-to-Words: Grounded Open Vocabulary Acquisition through Fast Mapping in Vision-Language Models", arXiv, 2023 (UMich). [Paper]
    • BuboGPT: "BuboGPT: Enabling Visual Grounding in Multi-Modal LLMs", arXiv, 2023 (ByteDance). [Paper][PyTorch][Website]
    • LG-DVG: "Language-Guided Diffusion Model for Visual Grounding", arXiv, 2023 (University of Toronto). [Paper]
    • VGDiffZero: "VGDiffZero: Text-to-image Diffusion Models Can Be Zero-shot Visual Grounders", arXiv, 2023 (Westlake University, China). [Paper]
    • GREC: "GREC: Generalized Referring Expression Comprehension", arXiv, 2023 (NTU, Singapore). [Paper][Website]
  • Video:
    • Multi-Stage-Transformer: "Multi-Stage Aggregated Transformer Network for Temporal Language Localization in Videos", CVPR, 2021 (University of Electronic Science and Technology of China). [Paper]
    • GTR: "On Pursuit of Designing Multi-modal Transformer for Video Grounding", EMNLP, 2021 (Peking). [Paper]
    • STVGBert: "STVGBert: A Visual-Linguistic Transformer Based Framework for Spatio-Temporal Video Grounding", ICCV, 2021 (Tencent). [Paper]
    • DRFT: "End-to-end Multi-modal Video Temporal Grounding", NeurIPS, 2021 (UC Merced). [Paper]
    • TubeDETR: "TubeDETR: Spatio-Temporal Video Grounding with Transformers", CVPR, 2022 (INRIA). [Paper][Website]
    • UMT: "UMT: Unified Multi-modal Transformers for Joint Video Moment Retrieval and Highlight Detection", CVPR, 2022 (Tencent). [Paper][Code (in construction)]
    • STVGFormer: "STVGFormer: Spatio-Temporal Video Grounding with Static-Dynamic Cross-Modal Understanding", ACMMMW, 2022 (Sun Yat-sen University). [Paper]
    • STCAT: "Embracing Consistency: A One-Stage Approach for Spatio-Temporal Video Grounding", NeurIPS, 2022 (Peking University). [Paper][PyTorch]
    • VideoWhisperer: "Grounded Video Situation Recognition", NeurIPS, 2022 (IIIT Hyderabad, India). [Paper][Website]
    • VidGTR: "Explore and Match: End-to-End Video Grounding with Transformer", arXiv, 2022 (KAIST). [Paper]
    • ?: "Language-free Training for Zero-shot Video Grounding", WACV, 2023 (Yonsei University). [Paper]
    • VG-LAW: "Language Adaptive Weight Generation for Multi-task Visual Grounding", CVPR, 2023 (Zhejiang University). [Paper]
    • TCSF: "You Can Ground Earlier than See: An Effective and Efficient Pipeline for Temporal Sentence Grounding in Compressed Videos", CVPR, 2023 (Huazhong University of Science and Technology). [Paper]
    • ?: "Weakly Supervised Temporal Sentence Grounding with Uncertainty-Guided Self-training", CVPR, 2023 (The University of Tokyo). [Paper]
    • DeCo: "DeCo: Decomposition and Reconstruction for Compositional Temporal Grounding via Coarse-To-Fine Contrastive Ranking", CVPR, 2023 (Toyota). [Paper]
    • HSCNet: "Hierarchical Semantic Correspondence Networks for Video Paragraph Grounding", CVPR, 2023 (Sun Yat-sen University). [Paper]
    • WINNER: "WINNER: Weakly-Supervised hIerarchical decompositioN and aligNment for Spatio-tEmporal Video gRounding", CVPR, 2023 (Zhejiang University). [Paper]
    • IRON: "Iterative Proposal Refinement for Weakly-Supervised Video Grounding", CVPR, 2023 (Microsoft). [Paper]
    • ?: "Collaborative Static and Dynamic Vision-Language Streams for Spatio-Temporal Video Grounding", CVPR, 2023 (Sun Yat-sen University). [Paper]
    • ProTeGe: "ProTeGe: Untrimmed Pretraining for Video Temporal Grounding by Video Temporal Grounding", CVPR, 2023 (Microsoft). [Paper]
    • VidLN: "Connecting Vision and Language with Video Localized Narratives", CVPR, 2023 (Google). [Paper][Website]
    • VDI: "Towards Generalisable Video Moment Retrieval: Visual-Dynamic Injection to Image-Text Pre-Training", CVPR, 2023 (Queen Mary University of London). [Paper]
    • UniVTG: "UniVTG: Towards Unified Video-Language Temporal Grounding", ICCV, 2023 (NUS). [Paper][PyTorch]
    • EaTR: "Knowing Where to Focus: Event-aware Transformer for Video Grounding", ICCV, 2023 (Yonsei). [Paper][PyTorch]
    • TSGSV: "Temporal Sentence Grounding in Streaming Videos", ACMMM, 2023 (Shandong University). [Paper]
    • ?: "Learning Grounded Vision-Language Representation for Versatile Understanding in Untrimmed Videos", arXiv, 2023 (Southern University of Science and Technology, China). [Paper]
    • MomentDiff: "MomentDiff: Generative Video Moment Retrieval from Random to Real", arXiv, 2023 (Alibaba). [Paper][Code (in construction)]
    • BM-DETR: "Overcoming Weak Visual-Textual Alignment for Video Moment Retrieval", arXiv, 2023 (Seoul National University (SNU)). [Paper][PyTorch (in construction)]
    • ?: "Zero-Shot Video Moment Retrieval from Frozen Vision-Language Models", WACV, 2024 (Queen Mary University of London). [Paper]
  • 3D:
    • ViL3DRel: "Language Conditioned Spatial Relation Reasoning for 3D Object Grounding", NeurIPS, 2022 (INRIA). [Paper][Website]
    • LAR: "Look Around and Refer: 2D Synthetic Semantics Knowledge Distillation for 3D Visual Grounding", NeurIPS, 2022 (KAUST). [Paper][Website]
    • 3D-CG: "3D Concept Grounding on Neural Fields", NeurIPS, 2022 (MIT). [Paper][Website]
    • UniT3D: "UniT3D: A Unified Transformer for 3D Dense Captioning and Visual Grounding", arXiv, 2022 (TUM). [Paper]
    • NS3D: "NS3D: Neuro-Symbolic Grounding of 3D Objects and Relations", CVPR, 2023 (Stanford). [Paper]
    • EDA: "EDA: Explicit Text-Decoupling and Dense Alignment for 3D Visual Grounding", CVPR, 2023 (Peking University). [Paper]
    • ?: "Distilling Coarse-to-Fine Semantic Matching Knowledge for Weakly Supervised 3D Visual Grounding", ICCV, 2023 (Zhejiang University). [Paper]
    • 3DOGSFormer: "Dense Object Grounding in 3D Scenes", ACMMM, 2023 (Peking). [Paper]
    • ViewRefer: "ViewRefer: Grasp the Multi-view Knowledge for 3D Visual Grounding with GPT and Prototype Guidance", arXiv, 2023 (Shanghai AI Lab). [Paper][Code (in construction)]
    • ?: "What, when, and where? -- Self-Supervised Spatio-Temporal Grounding in Untrimmed Multi-Action Videos from Narrated Instructions", arXiv, 2023 (Columbia). [Paper]
    • 3DRP-Net: "3DRP-Net: 3D Relative Position-aware Network for 3D Visual Grounding", arXiv, 2023 (Zhejiang University). [Paper]
    • 3DRefTR: "A Unified Framework for 3D Point Cloud Visual Grounding", arXiv, 2023 (Xiamen University). [Paper][PyTorch]

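As a point of reference for the one-stage grounding line above (TransVG and relatives): the recurring pattern is to fuse a learnable box-query token with visual and text tokens in a transformer encoder and regress normalized box coordinates from that token. The sketch below is a minimal PyTorch illustration under assumed dimensions, not any listed paper's actual code.

```python
import torch
import torch.nn as nn

# Minimal sketch of one-stage box regression for visual grounding.
# All sizes are illustrative assumptions.
class TinyGrounder(nn.Module):
    def __init__(self, d_model=256):
        super().__init__()
        self.reg_token = nn.Parameter(torch.zeros(1, 1, d_model))  # learnable [REG] query
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.box_head = nn.Sequential(
            nn.Linear(d_model, d_model), nn.ReLU(), nn.Linear(d_model, 4))

    def forward(self, vis_tokens, txt_tokens):
        reg = self.reg_token.expand(vis_tokens.size(0), -1, -1)
        x = self.encoder(torch.cat([reg, vis_tokens, txt_tokens], dim=1))
        # Normalized (cx, cy, w, h) regressed from the fused [REG] token
        return self.box_head(x[:, 0]).sigmoid()

model = TinyGrounder()
print(model(torch.randn(2, 49, 256), torch.randn(2, 12, 256)).shape)  # (2, 4)
```

Training would typically supervise the predicted box with an L1 plus generalized-IoU loss against the referred object.
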
[Back to Overview]

Multi-Modal Representation Learning

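A large fraction of the entries below are variants of CLIP-style contrastive pre-training (CyCLIP, UniCLIP, PyramidCLIP, FLIP, ...). For orientation, here is a minimal PyTorch sketch of the symmetric InfoNCE objective they build on; the temperature and embedding sizes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

# Minimal sketch of the symmetric image-text contrastive (InfoNCE) loss.
# The temperature value is an illustrative assumption.
def clip_loss(img_emb, txt_emb, temperature=0.07):
    img_emb = F.normalize(img_emb, dim=-1)         # unit-normalize embeddings
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature   # (B, B) cosine similarities
    targets = torch.arange(img_emb.size(0))        # matched pairs on the diagonal
    # Cross-entropy in both directions: image->text and text->image
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

print(clip_loss(torch.randn(8, 256), torch.randn(8, 256)).item())
```

SigLiT (listed below) replaces the softmax cross-entropy with an elementwise sigmoid loss over the same similarity matrix, which removes the batch-wide normalization.
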
  • General:
    • LXMERT: "LXMERT: Learning Cross-Modality Encoder Representations from Transformers", EMNLP, 2019 (UNC). [Paper][PyTorch]
    • ViLBERT: "ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks", NeurIPS, 2019 (Georgia Tech). [Paper][PyTorch]
    • Unified-VLP: "Unified Vision-Language Pre-Training for Image Captioning and VQA", AAAI, 2020 (UMich + Microsoft). [Paper][PyTorch]
    • UNITER: "UNITER: UNiversal Image-TExt Representation Learning", ECCV, 2020 (Microsoft). [Paper][PyTorch]
    • VinVL: "VinVL: Revisiting Visual Representations in Vision-Language Models", CVPR, 2021 (Microsoft). [Paper][Code]
    • CATT: "Causal Attention for Vision-Language Tasks", CVPR, 2021 (NTU Singapore). [Paper][PyTorch]
    • ViLT: "ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision", ICML, 2021 (Kakao). [Paper][PyTorch]
    • MERLOT: "MERLOT: Multimodal Neural Script Knowledge Models", NeurIPS, 2021 (UW + AI2). [Paper][Tensorflow][Website]
    • SVO-Probes: "Probing Image-Language Transformers for Verb Understanding", arXiv, 2021 (DeepMind). [Paper]
    • CLIP-ViL: "How Much Can CLIP Benefit Vision-and-Language Tasks?", arXiv, 2021 (Berkeley + UCLA). [Paper][PyTorch]
    • Florence: "Florence: A New Foundation Model for Computer Vision", arXiv, 2021 (Microsoft). [Paper]
    • UFO: "UFO: A UniFied TransfOrmer for Vision-Language Representation Learning", arXiv, 2021 (Microsoft). [Paper]
    • SimVLM: "SimVLM: Simple Visual Language Model Pretraining with Weak Supervision", ICLR, 2022 (Google). [Paper]
    • LiT: "LiT: Zero-Shot Transfer with Locked-image text Tuning", CVPR, 2022 (Google). [Paper]
    • UniCL: "Unified Contrastive Learning in Image-Text-Label Space", CVPR, 2022 (Microsoft). [Paper][PyTorch]
    • FLAVA: "FLAVA: A Foundational Language And Vision Alignment Model", CVPR, 2022 (Meta). [Paper][Pretrained Model][Code][Dataset][Website][Demos]
    • LEMON: "Scaling Up Vision-Language Pre-training for Image Captioning", CVPR, 2022 (Microsoft). [Paper]
    • METER: "An Empirical Study of Training End-to-End Vision-and-Language Transformers", CVPR, 2022 (Microsoft). [Paper][PyTorch]
    • Uni-Perceiver: "Uni-Perceiver: Pre-training Unified Architecture for Generic Perception for Zero-shot and Few-shot Tasks", CVPR, 2022 (SenseTime). [Paper][PyTorch]
    • MERLOT-Reserve: "MERLOT Reserve: Neural Script Knowledge through Vision and Language and Sound", CVPR, 2022 (UW + AI2). [Paper][JAX][Website]
    • Omnivore: "Omnivore: A Single Model for Many Visual Modalities", CVPR, 2022 (Meta). [Paper][PyTorch][Website]
    • CM-mix: "Pre-training image-language transformers for open-vocabulary tasks", CVPRW, 2022 (Google). [Paper]
    • VLMixer: "VLMixer: Unpaired Vision-Language Pre-training via Cross-Modal CutMix", ICML, 2022 (Southern University of Science and Technology). [Paper][Code (in construction)]
    • VLUE: "VLUE: A Multi-Task Benchmark for Evaluating Vision-Language Models", ICML, 2022 (ByteDance). [Paper][Website][PyTorch]
    • X-VLM: "Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts", ICML, 2022 (ByteDance). [Paper][PyTorch]
    • BLIP: "BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation", ICML, 2022 (Salesforce). [Paper][PyTorch]
    • OFA: "OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework", ICML, 2022 (Alibaba). [Paper][PyTorch]
    • MS-CLIP: "Learning Visual Representation from Modality-Shared Contrastive Language-Image Pre-training", ECCV, 2022 (Microsoft). [Paper][PyTorch]
    • GRIT-VLP: "GRIT-VLP: Grouped Mini-batch Sampling for Efficient Vision and Language Pre-training", ECCV, 2022 (Microsoft). [Paper][PyTorch]
    • SIMLA: "Single-Stream Multi-Level Alignment for Vision-Language Pretraining", ECCV, 2022 (Northeastern University). [Paper][PyTorch][Website]
    • Switch-BERT: "Switch-BERT: Learning to Model Multimodal Interactions by Switching Attention and Input", ECCV, 2022 (Ant Group). [Paper]
    • OmniVL: "OmniVL: One Foundation Model for Image-Language and Video-Language Tasks", NeurIPS, 2022 (Microsoft). [Paper]
    • UniCLIP: "UniCLIP: Unified Framework for Contrastive Language-Image Pre-training", NeurIPS, 2022 (LG). [Paper]
    • Uni-Perceiver-MoE: "Uni-Perceiver-MoE: Learning Sparse Generalist Models with Conditional MoEs", NeurIPS, 2022 (SenseTime). [Paper][PyTorch]
    • CLOOB: "CLOOB: Modern Hopfield Networks with InfoLOOB Outperform CLIP", NeurIPS, 2022 (Johannes Kepler University, Austria). [Paper][PyTorch]
    • CyCLIP: "CyCLIP: Cyclic Contrastive Language-Image Pretraining", NeurIPS, 2022 (UCLA). [Paper]
    • ?: "Quality Not Quantity: On the Interaction between Dataset Design and Robustness of CLIP", NeurIPS, 2022 (UW). [Paper][Pytorch]
    • PyramidCLIP: "PyramidCLIP: Hierarchical Feature Alignment for Vision-language Model Pretraining", NeurIPS, 2022 (Tencent). [Paper]
    • ?: "Mind the Gap: Understanding the Modality Gap in Multi-modal Contrastive Representation Learning", NeurIPS, 2022 (Stanford). [Paper][Website]
    • LIMoE: "Multimodal Contrastive Learning with LIMoE: the Language-Image Mixture of Experts", NeurIPS, 2022 (Google). [Paper]
    • VLMo: "VLMo: Unified Vision-Language Pre-Training with Mixture-of-Modality-Experts", NeurIPS, 2022 (Microsoft). [Paper][PyTorch (in construction)]
    • Knowledge-CLIP: "Contrastive Language-Image Pre-Training with Knowledge Graphs", NeurIPS, 2022 (Tsinghua). [Paper]
    • Flamingo: "Flamingo: a Visual Language Model for Few-Shot Learning", NeurIPS, 2022 (DeepMind). [Paper]
    • LOUPE: "Fine-Grained Semantically Aligned Vision-Language Pre-Training", NeurIPS, 2022 (Huawei). [Paper][Code (in construction)]
    • FIBER: "Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone", NeurIPS, 2022 (Microsoft). [Paper][PyTorch]
    • UViM: "UViM: A Unified Modeling Approach for Vision with Learned Guiding Codes", NeurIPS, 2022 (Google). [Paper]
    • LAION-5B: "LAION-5B: An open large-scale dataset for training next generation image-text models", NeurIPS (Datasets and Benchmarks), 2022 (LAION). [Paper][Website]
    • Wukong: "Wukong: A 100 Million Large-scale Chinese Cross-modal Pre-training Benchmark", NeurIPS (Datasets and Benchmarks), 2022 (Huawei). [Paper][Website]
    • TaiSu: "TaiSu: A 166M Large-scale High-Quality Dataset for Chinese Vision-Language Pre-training", NeurIPS (Datasets and Benchmarks), 2022 (CAS). [Paper][PyTorch]
    • WinoGAViL: "WinoGAViL: Gamified Association Benchmark to Challenge Vision-and-Language Models", NeurIPS (Datasets and Benchmarks), 2022 (The Hebrew University of Jerusalem, Israel). [Paper][Website]
    • ELEVATER: "ELEVATER: A Benchmark and Toolkit for Evaluating Language-Augmented Visual Models", NeurIPS (Datasets and Benchmarks), 2022 (Microsoft). [Paper][Website]
    • ?: "Robustness Analysis of Video-Language Models Against Visual and Language Perturbations", NeurIPS (Datasets and Benchmarks), 2022 (UCF). [Paper][Website]
    • GIT: "GIT: A Generative Image-to-text Transformer for Vision and Language", TMLR, 2022 (Microsoft). [Paper]
    • CoCa: "CoCa: Contrastive Captioners are Image-Text Foundation Models", TMLR, 2022 (Google). [Paper][PyTorch (lucidrains)]
    • MultiMAE: "MultiMAE: Multi-modal Multi-task Masked Autoencoders", arXiv, 2022 (EPFL). [Paper][PyTorch][Website]
    • VLC: "Training Vision-Language Transformers from Captions Alone", arXiv, 2022 (Microsoft). [Paper][Code (in construction)]
    • CCLM: "Cross-View Language Modeling: Towards Unified Cross-Lingual Cross-Modal Pre-training", arXiv, 2022 (ByteDance). [Paper]
    • VL-BEiT: "VL-BEiT: Generative Vision-Language Pretraining", arXiv, 2022 (Microsoft). [Paper]
    • MetaLM: "Language Models are General-Purpose Interfaces", arXiv, 2022 (Microsoft). [Paper][PyTorch]
    • Bridge-Tower: "Bridge-Tower: Building Bridges Between Encoders in Vision-Language Representation Learning", arXiv, 2022 (Microsoft). [Paper][Code (in construction)]
    • e-CLIP: "e-CLIP: Large-Scale Vision-Language Representation Learning in E-commerce", arXiv, 2022 (NAVER). [Paper]
    • LW-Transformer: "Towards Lightweight Transformer via Group-wise Transformation for Vision-and-Language Tasks", arXiv, 2022 (Xiamen University). [Paper][PyTorch]
    • UCM: "Self-Training Vision Language BERTs with a Unified Conditional Model", arXiv, 2022 (NTU, Singapore). [Paper]
    • Prefix-conditioning: "Prefix Conditioning Unifies Language and Label Supervision", arXiv, 2022 (Google). [Paper]
    • VLMAE: "VLMAE: Vision-Language Masked Autoencoder", arXiv, 2022 (Tencent). [Paper]
    • ViCHA: "Efficient Vision-Language Pretraining with Visual Concepts and Hierarchical Alignment", arXiv, 2022 (Sorbonne University, France). [Paper][Code (in construction)]
    • DetailCLIP: "Injecting Image Details into CLIP's Feature Space", arXiv, 2022 (Megvii). [Paper]
    • ?: "Pre-training image-language transformers for open-vocabulary tasks", arXiv, 2022 (Google). [Paper]
    • ERNIE-ViL-2.0: "ERNIE-ViL 2.0: Multi-view Contrastive Learning for Image-Text Pre-training", arXiv, 2022 (Baidu). [Paper][Paddle]
    • VoLTA: "VoLTA: Vision-Language Transformer with Weakly-Supervised Local-Feature Alignment", arXiv, 2022 (JHU). [Paper]
    • ?: "One does not fit all! On the Complementarity of Vision Encoders for Vision and Language Tasks", arXiv, 2022 (Technical University of Darmstadt, Germany). [Paper]
    • MAPL: "MAPL: Parameter-Efficient Adaptation of Unimodal Pre-Trained Models for Vision-Language Few-Shot Prompting", arXiv, 2022 (Mila). [Paper]
    • EfficientVLM: "EfficientVLM: Fast and Accurate Vision-Language Models via Knowledge Distillation and Modal-adaptive Pruning", arXiv, 2022 (ByteDance). [Paper][PyTorch (in construction)]
    • CN-CLIP: "Chinese CLIP: Contrastive Vision-Language Pretraining in Chinese", arXiv, 2022 (Alibaba). [Paper]
    • CLOSE: "I Can't Believe There's No Images! Learning Visual Tasks Using only Language Data", arXiv, 2022 (AI2). [Paper]
    • X2-VLM: "X2-VLM: All-In-One Pre-trained Model For Vision-Language Tasks", arXiv, 2022 (ByteDance). [Paper][Code (in construction)]
    • SkillNet: "One Model, Multiple Modalities: A Sparsely Activated Approach for Text, Sound, Image, Video and Code", arXiv, 2022 (Tencent). [Paper]
    • Compound-Tokens: "Compound Tokens: Channel Fusion for Vision-Language Representation Learning", arXiv, 2022 (Google). [Paper]
    • WFH: "Learning by Hallucinating: Vision-Language Pre-training with Weak Supervision", WACV, 2023 (Aalto University, Finland). [Paper]
    • Perceiver-VL: "Perceiver-VL: Efficient Vision-and-Language Modeling with Iterative Latent Attention", WACV, 2023 (UNC). [Paper][PyTorch]
    • MixGen: "MixGen: A New Multi-Modal Data Augmentation", WACVW, 2023 (Amazon). [Paper]
    • ?: "Unifying Vision-Language Representation Space with Single-tower Transformer", AAAI, 2023 (NAVER). [Paper]
    • PaLI: "PaLI: A Jointly-Scaled Multilingual Language-Image Model", ICLR, 2023 (Google). [Paper]
    • LilT: "Contrastive Alignment of Vision to Language Through Parameter-Efficient Transfer Learning", ICLR, 2023 (Northeastern University). [Paper][PyTorch]
    • CLIPs: "Is a Caption Worth a Thousand Images? A Controlled Study for Representation Learning", ICLR, 2023 (Stanford). [Paper]
    • HiCLIP: "HiCLIP: Contrastive Language-Image Pretraining with Hierarchy-aware Attention", ICLR, 2023 (Rutgers University). [Paper]
    • DeCap: "DeCap: Decoding CLIP Latents for Zero-Shot Captioning via Text-Only Training", ICLR, 2023 (Zhejiang University). [Paper][PyTorch]
    • MaskVLM: "Masked Vision and Language Modeling for Multi-modal Representation Learning", ICLR, 2023 (Amazon). [Paper]
    • DaVinci: "Write and Paint: Generative Vision-Language Models are Unified Modal Learners", ICLR, 2023 (ByteDance). [Paper][Code (in construction)]
    • EVA: "EVA: Exploring the Limits of Masked Visual Representation Learning at Scale", CVPR, 2023 (Beijing Academy of Artificial Intelligence (BAAI)). [Paper][PyTorch]
    • FLM: "Accelerating Vision-Language Pretraining with Free Language Modeling", CVPR, 2023 (Tencent). [Paper][PyTorch]
    • FDT: "Revisiting Multimodal Representation in Contrastive Learning: From Patch and Token Embeddings to Finite Discrete Tokens", CVPR, 2023 (ByteDance). [Paper][Code (in construction)]
    • VILA: "VILA: Learning Image Aesthetics from User Comments with Vision-Language Pretraining", CVPR, 2023 (Google). [Paper][JAX]
    • BEiT-3: "Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks", CVPR, 2023 (Microsoft). [Paper][PyTorch]
    • ReVeaL: "REVEAL: Retrieval-Augmented Visual-Language Pre-Training with Multi-Source Multimodal Knowledge Memory", CVPR, 2023 (Google). [Paper][Website]
    • SCL: "Seeing What You Miss: Vision-Language Pre-training with Semantic Completion Learning", CVPR, 2023 (Tencent). [Paper]
    • EPIC: "Leveraging per Image-Token Consistency for Vision-Language Pre-training", CVPR, 2023 (ByteDance). [Paper]
    • PTP: "Position-guided Text Prompt for Vision-Language Pre-training", CVPR, 2023 (Sea AI Lab). [Paper][PyTorch]
    • PHASE: "Uncurated Image-Text Datasets: Shedding Light on Demographic Bias", CVPR, 2023 (Osaka University). [Paper][GitHub]
    • Uni-Perceiver-v2: "Uni-Perceiver v2: A Generalist Model for Large-Scale Vision and Vision-Language Tasks", CVPR, 2023 (Shanghai AI Lab). [Paper][PyTorch]
    • ?: "Exploring the Effect of Primitives for Compositional Generalization in Vision-and-Language", CVPR, 2023 (Beijing Institute of Technology). [Paper]
    • GIVL: "GIVL: Improving Geographical Inclusivity of Vision-Language Models with Pre-Training Methods", CVPR, 2023 (Amazon). [Paper]
    • FLIP: "Scaling Language-Image Pre-training via Masking", CVPR, 2023 (Meta). [Paper][PyTorch]
    • MAP: "MAP: Modality-Agnostic Uncertainty-Aware Vision-Language Pre-training Model", CVPR, 2023 (Tsinghua + Waseda). [Paper][PyTorch]
    • DANCE: "Improving Commonsense in Vision-Language Models via Knowledge Graph Riddles", CVPR, 2023 (Microsoft). [Paper][PyTorch (in construction)][Website]
    • xCLIP: "Non-Contrastive Learning Meets Language-Image Pre-Training", CVPR, 2023 (Microsoft). [Paper]
    • SVLC: "Teaching Structured Vision & Language Concepts to Vision&Language Models", CVPR, 2023 (IBM). [Paper]
    • DeAR: "DeAR: Debiasing Vision-Language Models with Additive Residuals", CVPR, 2023 (Adobe). [Paper][GitHub]
    • ?: "Understanding and Constructing Latent Modality Structures in Multi-modal Representation Learning", CVPR, 2023 (Amazon). [Paper]
    • ?: "Joint Adaptive Representations for Image-Language Learning", CVPRW, 2023 (DeepMind). [Paper]
    • BLIP-2: "BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models", ICML, 2023 (Salesforce). [Paper][PyTorch]
    • RLEG: "RLEG: Vision-Language Representation Learning with Diffusion-based Embedding Generation", ICML, 2023 (Alibaba). [Paper]
    • Mod-X: "Continual Vision-Language Representation Learning with Off-Diagonal Information", ICML, 2023 (Huawei). [Paper]
    • ILLUME: "ILLUME: Rationalizing Vision-Language Models through Human Interactions", ICML, 2023 (German Center for Artificial Intelligence (DFKI)). [Paper][PyTorch]
    • Pix2Struct: "Pix2Struct: Screenshot Parsing as Pretraining for Visual Language Understanding", ICML, 2023 (Google). [Paper]
    • MERU: "Hyperbolic Image-Text Representations", ICML, 2023 (Meta). [Paper]
    • ?: "Measuring Progress in Fine-grained Vision-and-Language Understanding", ACL, 2023 (DeepMind). [Paper]
    • RELIT: "Weakly Supervised Vision-and-Language Pre-training with Relative Representations", ACL, 2023 (Tsinghua). [Paper]
    • PuMer: "PuMer: Pruning and Merging Tokens for Efficient Vision Language Models", ACL, 2023 (UW). [Paper]
    • SINC: "SINC: Self-Supervised In-Context Learning for Vision-Language Tasks", ICCV, 2023 (Microsoft). [Paper]
    • ALIP: "ALIP: Adaptive Language-Image Pre-training with Synthetic Caption", ICCV, 2023 (DeepGlint, China). [Paper][PyTorch]
    • SigLiT: "Sigmoid Loss for Language Image Pre-Training", ICCV, 2023 (Google). [Paper]
    • VL-PET: "VL-PET: Vision-and-Language Parameter-Efficient Tuning via Granularity Control", ICCV, 2023 (CUHK). [Paper][PyTorch]
    • GrowCLIP: "GrowCLIP: Data-aware Automatic Model Growing for Large-scale Contrastive Language-Image Pre-training", ICCV, 2023 (Sun Yat-sen University). [Paper]
    • ViLLA: "ViLLA: Fine-Grained Vision-Language Representation Learning from Real-World Data", ICCV, 2023 (Stanford). [Paper][PyTorch]
    • CFM-ViT: "Contrastive Feature Masking Open-Vocabulary Vision Transformer", ICCV, 2023 (DeepMind). [Paper]
    • KOSMOS-1: "Language Is Not All You Need: Aligning Perception with Language Models", arXiv, 2023 (Microsoft). [Paper][Code]
    • Prismer: "Prismer: A Vision-Language Model with An Ensemble of Experts", arXiv, 2023 (NVIDIA). [Paper][PyTorch][Website]
    • RVLM: "Replacement as a Self-supervision for Fine-grained Vision-language Pre-training", arXiv, 2023 (Harbin Institute of Technology). [Paper]
    • MuLTI: "MuLTI: Efficient Video-and-Language Understanding with MultiWay-Sampler and Multiple Choice Modeling", arXiv, 2023 (Alibaba). [Paper]
    • VL-MoE: "Scaling Vision-Language Models with Sparse Mixture of Experts", arXiv, 2023 (Berkeley + Microsoft). [Paper]
    • EVA-02: "EVA-02: A Visual Representation for Neon Genesis", arXiv, 2023 (BAAI). [Paper][PyTorch]
    • CoBIT: "CoBIT: A Contrastive Bi-directional Image-Text Generation Model", arXiv, 2023 (Google). [Paper]
    • EqSim: "Equivariant Similarity for Vision-Language Foundation Models", arXiv, 2023 (Microsoft). [Paper][PyTorch]
    • EVA-CLIP: "EVA-CLIP: Improved Training Techniques for CLIP at Scale", arXiv, 2023 (BAAI). [Paper][PyTorch]
    • MaMMUT: "MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks", arXiv, 2023 (Google). [Paper]
    • CAVL: "CAVL: Learning Contrastive and Adaptive Representations of Vision and Language", arXiv, 2023 (CMU). [Paper]
    • MoMo: "MoMo: A shared encoder Model for text, image and multi-Modal representations", arXiv, 2023 (Amazon). [Paper]
    • REAVL: "Retrieval-based Knowledge Augmented Vision Language Pre-training", arXiv, 2023 (Tencent). [Paper]
    • ALBEF-MI: "Vision Language Pre-training by Contrastive Learning with Cross-Modal Similarity Regulation", arXiv, 2023 (Alibaba). [Paper]
    • Helip: "Boosting Visual-Language Models by Exploiting Hard Samples", arXiv, 2023 (Huawei). [Paper]
    • IMP: "Alternating Gradient Descent and Mixture-of-Experts for Integrated Multimodal Perception", arXiv, 2023 (Google). [Paper]
    • Musketeer: "Musketeer (All for One, and One for All): A Generalist Vision-Language Model with Task Explanation Prompts", arXiv, 2023 (Amazon). [Paper]
    • GVT: "What Makes for Good Visual Tokenizers for Large Language Models?", arXiv, 2023 (Tencent). [Paper][Code (in construction)]
    • S-CLIP: "S-CLIP: Semi-supervised Vision-Language Pre-training using Few Specialist Captions", arXiv, 2023 (KAIST). [Paper]
    • VisorGPT: "VisorGPT: Learning Visual Prior via Generative Pre-Training", arXiv, 2023 (NUS). [Paper][Code (in construction)][Website]
    • IdealGPT: "IdealGPT: Iteratively Decomposing Vision and Language Reasoning via Large Language Models", arXiv, 2023 (Columbia University). [Paper][PyTorch]
    • PaLI-X: "PaLI-X: On Scaling up a Multilingual Vision and Language Model", arXiv, 2023 (Google). [Paper]
    • CrossGET: "CrossGET: Cross-Guided Ensemble of Tokens for Accelerating Vision-Language Transformers", arXiv, 2023 (Tsinghua). [Paper][Code (in construction)]
    • TL;DR: "Too Large; Data Reduction for Vision-Language Pre-Training", arXiv, 2023 (NUS). [Paper][Code (in construction)]
    • DiffusionITM: "Are Diffusion Models Vision-And-Language Reasoners?", arXiv, 2023 (Mila). [Paper]
    • COSA: "COSA: Concatenated Sample Pretrained Vision-Language Foundation Model", arXiv, 2023 (ByteDance). [Paper][PyTorch]
    • Babel-ImageNet: "Babel-ImageNet: Massively Multilingual Evaluation of Vision-and-Language Representations", arXiv, 2023 (University of Würzburg, Germany). [Paper][PyTorch]
    • Kosmos-2: "Kosmos-2: Grounding Multimodal Large Language Models to the World", arXiv, 2023 (Microsoft). [Paper][PyTorch][Demo]
    • LENS: "Towards Language Models That Can See: Computer Vision Through the LENS of Natural Language", arXiv, 2023 (Contextual AI + Stanford). [Paper][PyTorch][Demo]
    • OBELISC: "OBELISC: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents", arXiv, 2023 (Hugging Face). [Paper][GitHub]
    • Emu: "Generative Pretraining in Multimodality", arXiv, 2023 (BAAI). [Paper][PyTorch]
    • mBLIP: "mBLIP: Efficient Bootstrapping of Multilingual Vision-LLMs", arXiv, 2023 (University of Würzburg, Germany). [Paper][PyTorch]
    • P-Former: "Bootstrapping Vision-Language Learning with Decoupled Language Pre-training", arXiv, 2023 (Dartmouth College). [Paper]
    • SEED-OPT: "Planting a SEED of Vision in Large Language Model", arXiv, 2023 (Tencent). [Paper][Code (in construction)]
    • OpenFlamingo: "OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models", arXiv, 2023 (UW). [Paper][PyTorch]
    • Free-ATM: "Free-ATM: Exploring Unsupervised Learning on Diffusion-Generated Images with Free Attention Masks", arXiv, 2023 (ByteDance). [Paper]
    • LCL: "Link-Context Learning for Multimodal LLMs", arXiv, 2023 (SenseTime). [Paper]
    • DLIP: "DLIP: Distilling Language-Image Pre-training", arXiv, 2023 (ByteDance). [Paper]
    • ViLTA: "ViLTA: Enhancing Vision-Language Pre-training through Textual Augmentation", arXiv, 2023 (Tsinghua). [Paper]
    • DAS: "Parameter and Computation Efficient Transfer Learning for Vision-Language Pre-trained Models", arXiv, 2023 (Xiamen University). [Paper]
  • Video:
    • COOT: "COOT: Cooperative Hierarchical Transformer for Video-Text Representation Learning", NeurIPS, 2020 (University of Freiburg). [Paper][PyTorch]
    • Parameter-Reduction: "Parameter Efficient Multimodal Transformers for Video Representation Learning", ICLR, 2021 (Seoul National University). [Paper]
    • ClipBERT: "Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling", CVPR, 2021 (UNC + Microsoft). [Paper][PyTorch]
    • VLM: "VLM: Task-agnostic Video-Language Model Pre-training for Video Understanding", ACL Findings, 2021 (Facebook). [Paper][PyTorch]
    • VideoCLIP: "VideoCLIP: Contrastive Pre-training for Zero-shot Video-Text Understanding", EMNLP, 2021 (Facebook). [Paper][PyTorch]
    • VALUE: "VALUE: A Multi-Task Benchmark for Video-and-Language Understanding Evaluation", NeurIPS (Datasets and Benchmarks), 2021 (Microsoft). [Paper][Website]
    • TAN: "Temporal Alignment Networks for Long-term Video", CVPR, 2022 (Oxford). [Paper][Code (in construction)][Website]
    • HD-VILA: "Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions", CVPR, 2022 (Microsoft). [Paper][GitHub]
    • ATP: "Revisiting the "Video" in Video-Language Understanding", CVPR, 2022 (Stanford). [Paper][Website]
    • ALPRO: "Align and Prompt: Video-and-Language Pre-training with Entity Prompts", CVPR, 2022 (Salesforce). [Paper][PyTorch]
    • CLOP: "CLOP: Video-and-Language Pre-Training with Knowledge Regularizations", ACMMM, 2022 (Baidu). [Paper]
    • LocVTP: "LocVTP: Video-Text Pre-training for Temporal Localization", ECCV, 2022 (Peking University). [Paper][PyTorch]
    • FineCo: "Contrastive Video-Language Learning with Fine-grained Frame Sampling", AACL, 2022 (ICL, UK). [Paper]
    • EMCL: "Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations", NeurIPS, 2022 (Peking University). [Paper][PyTorch]
    • LF-VILA: "Long-Form Video-Language Pre-Training with Multimodal Temporal Contrastive Learning", NeurIPS, 2022 (Microsoft). [Paper][GitHub]
    • VATT-GR-CL: "Scaling Multimodal Pre-Training via Cross-Modality Gradient Harmonization", NeurIPS, 2022 (Google). [Paper]
    • LGDN: "LGDN: Language-Guided Denoising Network for Video-Language Modeling", NeurIPS, 2022 (Renmin University of China). [Paper]
    • EgoVLP: "Egocentric Video-Language Pretraining", NeurIPS, 2022 (NUS). [Paper][PyTorch]
    • LiteVL: "LiteVL: Efficient Video-Language Learning with Enhanced Spatial-Temporal Modeling", EMNLP, 2022 (Peking University). [Paper]
    • Singularity: "Revealing Single Frame Bias for Video-and-Language Learning", arXiv, 2022 (UNC). [Paper]
    • VIOLET: "VIOLET: End-to-End Video-Language Transformers with Masked Visual-token Modeling", arXiv, 2022 (Microsoft). [Paper][PyTorch]
    • SimVTP: "SimVTP: Simple Video Text Pre-training with Masked Autoencoders", arXiv, 2022 (Tencent). [Paper][PyTorch (in construction)]
    • VideoCoCa: "Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners", arXiv, 2022 (Google). [Paper]
    • i-Code: "i-Code: An Integrative and Composable Multimodal Learning Framework", AAAI, 2023 (Microsoft). [Paper][Code (in construction)]
    • TempCLR: "TempCLR: Temporal Alignment Representation with Contrastive Learning", ICLR, 2023 (Columbia). [Paper]
    • MELTR: "MELTR: Meta Loss Transformer for Learning to Fine-tune Video Foundation Models", CVPR, 2023 (Korea University). [Paper][PyTorch]
    • VIOLETv2: "An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling", CVPR, 2023 (Microsoft). [Paper][PyTorch]
    • LAVENDER: "LAVENDER: Unifying Video-Language Understanding as Masked Language Modeling", CVPR, 2023 (Microsoft). [Paper][Code (in construction)]
    • SViTT: "SViTT: Temporal Learning of Sparse Video-Text Transformers", CVPR, 2023 (Intel). [Paper][Website]
    • TVTS: "Learning Transferable Spatiotemporal Representations from Natural Script Knowledge", CVPR, 2023 (Tencent). [Paper][PyTorch]
    • HBI: "Video-Text as Game Players: Hierarchical Banzhaf Interaction for Cross-Modal Representation Learning", CVPR, 2023 (Peking University). [Paper][Code (in construction)][Website]
    • All-in-One: "All in One: Exploring Unified Video-Language Pre-training", CVPR, 2023 (NUS). [Paper][PyTorch]
    • VindLU: "VindLU: A Recipe for Effective Video-and-Language Pretraining", CVPR, 2023 (UNC). [Paper][PyTorch]
    • Clover: "Clover: Towards A Unified Video-Language Alignment and Fusion Model", CVPR, 2023 (ByteDance). [Paper][PyTorch (in construction)]
    • mPLUG-2: "mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video", ICML, 2023 (Alibaba). [Paper][Code (in construction)]
    • BUS: "BUS: Efficient and Effective Vision-language Pre-training with Bottom-Up Patch Summarization", ICCV, 2023 (Alibaba). [Paper]
    • UMT: "Unmasked Teacher: Towards Training-Efficient Video Foundation Models", ICCV, 2023 (Shanghai AI Lab). [Paper][Code (in construction)]
    • ?: "Long-range Multimodal Pretraining for Movie Understanding", ICCV, 2023 (Adobe). [Paper]
    • STOA-VLP: "STOA-VLP: Spatial-Temporal Modeling of Object and Action for Video-Language Pre-training", arXiv, 2023 (Harbin Institute of Technology). [Paper]
    • G-ViLM: "Spatiotemporally Discriminative Video-Language Pre-Training with Text Grounding", arXiv, 2023 (Google). [Paper]
    • VLAB: "VLAB: Enhancing Video Language Pre-training by Feature Adapting and Blending", arXiv, 2023 (ByteDance). [Paper]
    • i-Code-V2: "i-Code V2: An Autoregressive Generation Framework over Vision, Language, and Speech Data", arXiv, 2023 (Microsoft). [Paper][PyTorch (in construction)]
    • TVTSv2: "TVTSv2: Learning Out-of-the-box Spatiotemporal Visual Representations at Scale", arXiv, 2023 (Tencent). [Paper][Code (in construction)]
    • VFC: "Verbs in Action: Improving verb understanding in video-language models", arXiv, 2023 (Google). [Paper]
    • Youku-mPLUG: "Youku-mPLUG: A 10 Million Large-scale Chinese Video-Language Dataset for Pre-training and Benchmarks", arXiv, 2023 (Alibaba). [Paper]
    • VideoGLUE: "VideoGLUE: Video General Understanding Evaluation of Foundation Models", arXiv, 2023 (Google). [Paper]
    • EgoVLPv2: "EgoVLPv2: Egocentric Video-Language Pre-training with Fusion in the Backbone", arXiv, 2023 (Meta). [Paper][Website]
    • InternVid: "InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation", arXiv, 2023 (Shanghai AI Lab). [Paper][Code (in construction)]
    • EVE: "EVE: Efficient Vision-Language Pre-training with Masked Prediction and Modality-Aware MoE", arXiv, 2023 (Sun Yat-sen University). [Paper]
    • Qwen-VL: "Qwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities", arXiv, 2023 (Alibaba). [Paper][PyTorch]
  • 3D:
    • CLIP2: "CLIP2: Contrastive Language-Image-Point Pretraining from Real-World Point Cloud Data", CVPR, 2023 (Huawei). [Paper]
    • 3D-VLP: "Context-aware Alignment and Mutual Masking for 3D-Language Pre-training", CVPR, 2023 (Sichuan University). [Paper][PyTorch]
    • SDFusion: "SDFusion: Multimodal 3D Shape Completion, Reconstruction, and Generation", CVPR, 2023 (Snap). [Paper][PyTorch][Website]
    • 3D-VisTA: "3D-VisTA: Pre-trained Transformer for 3D Vision and Text Alignment", ICCV, 2023 (Beijing Institute for General Artificial Intelligence (BIGAI)). [Paper][PyTorch][Website]
    • RegionPLC: "RegionPLC: Regional Point-Language Contrastive Learning for Open-World 3D Scene Understanding", arXiv, 2023 (HKU). [Paper][Website]
    • 3DVLP: "Vision-Language Pre-training with Object Contrastive Learning for 3D Scene Understanding", arXiv, 2023 (Tsinghua). [Paper]
    • CLIPXPlore: "CLIPXPlore: Coupled CLIP and Shape Spaces for 3D Shape Exploration", arXiv, 2023 (CUHK). [Paper]
  • Vision-Audio-Text:
    • VATT: "VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text", NeurIPS, 2021 (Google). [Paper][Tensorflow]
    • VideoCC: "Learning Audio-Video Modalities from Image Captions", ECCV, 2022 (Google). [Paper][Website]
    • MUGEN: "MUGEN: A Playground for Video-Audio-Text Multimodal Understanding and GENeration", ECCV, 2022 (Meta). [Paper][Website]
    • VATLM: "VATLM: Visual-Audio-Text Pre-Training with Unified Masked Prediction for Speech Representation Learning", arXiv, 2022 (Microsoft). [Paper][PyTorch]
    • CLIP4VLA: "Accommodating Audio Modality in CLIP for Multimodal Processing", AAAI, 2023 (Renmin University of China). [Paper]
    • data2vec-2.0: "Efficient Self-supervised Learning with Contextualized Target Representations for Vision, Speech and Language", ICML, 2023 (Meta). [Paper][PyTorch]
    • VALOR: "VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset", arXiv, 2023 (CAS). [Paper][PyTorch][Website]
    • VAST: "VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset", arXiv, 2023 (CAS). [Paper]
  • More than 3 modalities:
    • Meta-Transformer: "Meta-Transformer: A Unified Framework for Multimodal Learning", arXiv, 2023 (CUHK). [Paper][Code (in construction)][Website]
    • UnIVAL: "Unified Model for Image, Video, Audio and Language Tasks", arXiv, 2023 (Sorbonne University, France). [Paper][PyTorch][Website]
    • ViT-Lens: "ViT-Lens: Towards Omni-modal Representations", arXiv, 2023 (Tencent). [Paper][PyTorch]

[Back to Overview]

Multi-Modal Retrieval

  • General:
    • Fast-and-Slow: "Thinking Fast and Slow: Efficient Text-to-Visual Retrieval with Transformers", CVPR, 2021 (DeepMind). [Paper]
    • HTR: "Revamping Cross-Modal Recipe Retrieval with Hierarchical Transformers and Self-supervised Learning", CVPR, 2021 (Amazon). [Paper][PyTorch]
    • TERN: "Towards Efficient Cross-Modal Visual Textual Retrieval using Transformer-Encoder Deep Features", CBMI, 2021 (National Research Council, Italy). [Paper]
    • VisualSparta: "VisualSparta: Sparse Transformer Fragment-level Matching for Large-scale Text-to-Image Search", arXiv, 2021 (CMU). [Paper]
    • CCR-CCS: "More Than Just Attention: Learning Cross-Modal Attentions with Contrastive Constraints", arXiv, 2021 (Rutgers + Amazon). [Paper]
    • MCProp: "Transformer-Based Multi-modal Proposal and Re-Rank for Wikipedia Image-Caption Matching", ICLRW, 2022 (National Research Council, Italy). [Paper][PyTorch]
    • TASK-former: "A Sketch Is Worth a Thousand Words: Image Retrieval with Text and Sketch", ECCV, 2022 (Georgia Tech). [Paper][Website]
    • CODER: "CODER: Coupled Diversity-Sensitive Momentum Contrastive Learning for Image-Text Retrieval", ECCV, 2022 (Baidu). [Paper]
    • ?: "Most and Least Retrievable Images in Visual-Language Query Systems", ECCV, 2022 (Old Dominion University, Virginia). [Paper]
    • MACK: "MACK: Multimodal Aligned Conceptual Knowledge for Unpaired Image-text Matching", NeurIPS, 2022 (CAS). [Paper]
    • MLA: "Multi-Lingual Acquisition on Multimodal Pre-training for Cross-modal Retrieval", NeurIPS, 2022 (Renmin University of China). [Paper]
    • SpeechCLIP: "SpeechCLIP: Integrating Speech with Pre-Trained Vision and Language Model", IEEE Workshop on Spoken Language Technology (SLT), 2022 (NTU). [Paper]
    • LoopITR: "LoopITR: Combining Dual and Cross Encoder Architectures for Image-Text Retrieval", arXiv, 2022 (UNC). [Paper]
    • TNLBT: "Transformer-based Cross-Modal Recipe Embeddings with Large Batch Training", arXiv, 2022 (The University of Electro-Communications, Japan). [Paper]
    • HiVLP: "HiVLP: Hierarchical Vision-Language Pre-Training for Fast Image-Text Retrieval", arXiv, 2022 (Huawei). [Paper]
    • ?: "Revising Image-Text Retrieval via Multi-Modal Entailment". arXiv, 2022 (Soochow University, China). [Paper]
    • TokenFlow: "TokenFlow: Rethinking Fine-grained Cross-modal Alignment in Vision-Language Retrieval", arXiv, 2022 (Kuaishou). [Paper]
    • VLPCook: "Structured Vision-Language Pretraining for Computational Cooking", arXiv, 2022 (Sorbonne University, France). [Paper]
    • UniVL-DR: "Universal Vision-Language Dense Retrieval: Learning A Unified Representation Space for Multi-Modal Retrieval", ICLR, 2023 (Northeastern University, China). [Paper]
    • HREM: "Learning Semantic Relationship Among Instances for Image-Text Matching", CVPR, 2023 (USTC). [Paper]
    • CHAN: "Fine-Grained Image-Text Matching by Cross-Modal Hard Aligning Network", CVPR, 2023 (Zhejiang University). [Paper][PyTorch]
    • ViLEM: "ViLEM: Visual-Language Error Modeling for Image-Text Retrieval", CVPR, 2023 (CAS). [Paper]
    • SoftMask: "Multi-Modal Representation Learning with Text-Driven Soft Masks", CVPR, 2023 (SNU). [Paper]
    • MetaPer: "Meta-Personalizing Vision-Language Models To Find Named Instances in Video", CVPR, 2023 (Adobe). [Paper][PyTorch][Website]
    • DivE: "Improving Cross-Modal Retrieval with Set of Diverse Embeddings", CVPR, 2023 (POSTECH). [Paper][Website]
    • Pic2Word: "Pic2Word: Mapping Pictures to Words for Zero-shot Composed Image Retrieval", CVPR, 2023 (Google). [Paper][PyTorch]
    • ConaCLIP: "ConaCLIP: Exploring Distillation of Fully-Connected Knowledge Interaction Graph for Lightweight Text-Image Retrieval", ACL Industry Track, 2023 (Alibaba). [Paper][PyTorch]
    • FNE: "Your Negative May not Be True Negative: Boosting Image-Text Matching with False Negative Elimination", ACMMM, 2023 (University of Electronic Science and Technology of China (UESTC)). [Paper][PyTorch]
    • HAT: "Unifying Two-Stream Encoders with Transformers for Cross-Modal Retrieval", ACMMM, 2023 (University of Electronic Science and Technology of China (UESTC)). [Paper][PyTorch]
    • STAIR: "STAIR: Learning Sparse Text and Image Representation in Grounded Tokens", arXiv, 2023 (Apple). [Paper]
    • ChatIR: "Chatting Makes Perfect - Chat-based Image Retrieval", arXiv, 2023 (The Hebrew University of Jerusalem, Israel). [Paper]
    • TransAgg: "Zero-shot Composed Text-Image Retrieval", arXiv, 2023 (Shanghai Jiao Tong). [Paper][PyTorch][Website]
  • Video:
    • MMT: "Multi-modal Transformer for Video Retrieval", ECCV, 2020 (INRIA + Google). [Paper][Website]
    • AYCE: "All You Can Embed: Natural Language based Vehicle Retrieval with Spatio-Temporal Transformers", CVPRW, 2021 (University of Modena and Reggio Emilia). [Paper][PyTorch]
    • HiT: "HiT: Hierarchical Transformer with Momentum Contrast for Video-Text Retrieval", ICCV, 2021 (Kuaishou). [Paper]
    • Frozen: "Frozen in Time: A Joint Video and Image Encoder for End-to-End Retrieval", ICCV, 2021 (Oxford). [Paper][PyTorch][Website][Dataset]
    • CLIP4Clip: "CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval", arXiv, 2021 (Microsoft). [Paper][PyTorch]
    • MMFT: "Everything at Once - Multi-modal Fusion Transformer for Video Retrieval", CVPR, 2022 (Goethe University Frankfurt, Germany). [Paper]
    • X-Pool: "X-Pool: Cross-Modal Language-Video Attention for Text-Video Retrieval", CVPR, 2022 (Layer 6 AI, Toronto). [Paper][PyTorch][Website]
    • MVPt: "It's Time for Artistic Correspondence in Music and Video", CVPR, 2022 (Adobe). [Paper][Website]
    • OA-Trans: "Object-aware Video-language Pre-training for Retrieval", CVPR, 2022 (NUS). [Paper][PyTorch]
    • BridgeFormer: "Bridging Video-text Retrieval with Multiple Choice Questions", CVPR, 2022 (HKU). [Paper][PyTorch][Website]
    • CenterCLIP: "CenterCLIP: Token Clustering for Efficient Text-Video Retrieval", SIGIR, 2022 (Zhejiang University). [Paper]
    • X-CLIP: "X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval", ACMMM, 2022 (Alibaba). [Paper]
    • HiSE: "Boosting Video-Text Retrieval with Explicit High-Level Semantics", ACMMM, 2022 (Baidu). [Paper]
    • TS2-Net: "TS2-Net: Token Shift and Selection Transformer for Text-Video Retrieval", ECCV, 2022 (Tencent). [Paper][PyTorch]
    • LAFF: "Lightweight Attentional Feature Fusion: A New Baseline for Text-to-Video Retrieval", ECCV, 2022 (Renmin University of China). [Paper]
    • ECLIPSE: "ECLIPSE: Efficient Long-range Video Retrieval using Sight and Sound", ECCV, 2022 (UNC). [Paper][PyTorch][Website]
    • MILES: "MILES: Visual BERT Pre-training with Injected Language Semantics for Video-text Retrieval", ECCV, 2022 (HKU). [Paper][PyTorch]
    • VTC: "VTC: Improving Video-Text Retrieval with User Comments", ECCV, 2022 (Unitary, UK). [Paper][PyTorch][Website]
    • LINAS: "Learning Linguistic Association towards Efficient Text-Video Retrieval", ECCV, 2022 (CAS). [Paper][PyTorch]
    • ?: "A Simple Transformer-Based Model for Ego4D Natural Language Queries Challenge", ECCVW, 2022 (UW-Madison). [Paper]
    • ?: "Text-Adaptive Multiple Visual Prototype Matching for Video-Text Retrieval", NeurIPS, 2022 (Sun Yat-sen University). [Paper]
    • ConTra: "ConTra: (Con)text (Tra)nsformer for Cross-Modal Video Retrieval", ACCV, 2022 (University of Bristol, UK). [Paper]
    • RaP: "RaP: Redundancy-aware Video-language Pre-training for Text-Video Retrieval", EMNLP, 2022 (CAS). [Paper][PyTorch]
    • MDMMT-2: "MDMMT-2: Multidomain Multimodal Transformer for Video Retrieval, One More Step Towards Generalization", arXiv, 2022 (Huawei). [Paper]
    • M2HF: "M2HF: Multi-level Multi-modal Hybrid Fusion for Text-Video Retrieval", arXiv, 2022 (Tencent). [Paper]
    • FIRE: "Fighting FIRe with FIRE: Assessing the Validity of Text-to-Video Retrieval Benchmarks", arXiv, 2022 (Meta). [Paper][PyTorch]
    • Cross-Modal-Adapter: "Cross-Modal Adapter for Text-Video Retrieval", arXiv, 2022 (Tsinghua University). [Paper][PyTorch (in construction)]
    • MAC: "Masked Contrastive Pre-Training for Efficient Video-Text Retrieval", arXiv, 2022 (Alibaba). [Paper]
    • CLIP-ViP: "CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment", ICLR, 2023 (Microsoft). [Paper][Code (in construction)]
    • HiREST: "Hierarchical Video-Moment Retrieval and Step-Captioning", CVPR, 2023 (UNC + Meta). [Paper][PyTorch][Website]
    • Cap4Video: "Cap4Video: What Can Auxiliary Captions Do for Text-Video Retrieval?", CVPR, 2023 (The University of Sydney). [Paper][PyTorch]
    • CLIPPING: "CLIPPING: Distilling CLIP-Based Models With a Student Base for Video-Language Retrieval", CVPR, 2023 (Huawei). [Paper]
    • CNVid-3.5M: "CNVid-3.5M: Build, Filter, and Pre-Train the Large-Scale Public Chinese Video-Text Dataset", CVPR, 2023 (Ant Group). [Paper][GitHub (in construction)]
    • CelebV-Text: "CelebV-Text: A Large-Scale Facial Text-Video Dataset", CVPR, 2023 (University of Sydney). [Paper][GitHub][Website]
    • ReST: "Relational Space-Time Query in Long-Form Videos", CVPR, 2023 (Meta). [Paper]
    • NaQ: "NaQ: Leveraging Narrations as Queries to Supervise Episodic Memory", CVPR, 2023 (UT Austin). [Paper][PyTorch][Website]
    • ?: "Towards Fast Adaptation of Pretrained Contrastive Models for Multi-channel Video-Language Retrieval", CVPR, 2023 (Columbia). [Paper][Code (in contruction)]
    • VoP: "VoP: Text-Video Co-operative Prompt Tuning for Cross-Modal Retrieval", CVPR, 2023 (Alibaba). [Paper][Code (in construction)][Website]
    • SpotEM: "SpotEM: Efficient Video Search for Episodic Memory", ICML, 2023 (UT Austin). [Paper][Website]
    • PromptSwitch: "Prompt Switch: Efficient CLIP Adaptation for Text-Video Retrieval", ICCV, 2023 (?). [Paper][PyTorch (in construction)]
    • ?: "Simple Baselines for Interactive Video Retrieval with Questions and Answers", ICCV, 2023 (Princeton). [Paper][Code (in construction)]
    • MeVTR: "Multi-event Video-Text Retrieval", ICCV, 2023 (LMU Munich). [Paper][Code (in construction)]
    • DiffusionRet: "DiffusionRet: Generative Text-Video Retrieval with Diffusion Model", arXiv, 2023 (Peking University). [Paper]
    • TextVR: "A Large Cross-Modal Video Retrieval Dataset with Reading Comprehension", arXiv, 2023 (Zhejiang University). [Paper][PyTorch][Website]
    • MASCOT: "Mask to reconstruct: Cooperative Semantics Completion for Video-text Retrieval", arXiv, 2023 (?). [Paper]
    • CrossTVR: "Fine-grained Text-Video Retrieval with Frozen Image Encoders", arXiv, 2023 (Alibaba). [Paper]
    • TEFAL: "Audio-Enhanced Text-to-Video Retrieval using Text-Conditioned Feature Alignment", arXiv, 2023 (Amazon). [Paper]
    • TeachCLIP: "TeachCLIP: Multi-Grained Teaching for Efficient Text-to-Video Retrieval", arXiv, 2023 (Renmin University of China). [Paper]
    • CoVR: "CoVR: Learning Composed Video Retrieval from Web Video Captions", arXiv, 2023 (Ecole des Ponts ParisTech (ENPC), France). [Paper][PyTorch][Website]
  • Others:
    • IRRA: "Cross-Modal Implicit Relation Reasoning and Aligning for Text-to-Image Person Retrieval", CVPR, 2023 (Wuhan University). [Paper][PyTorch]
    • ZS-SBIR: "CLIP for All Things Zero-Shot Sketch-Based Image Retrieval, Fine-Grained or Not", CVPR, 2023 (University of Surrey, UK). [Paper][PyTorch]
    • ViML: "Language-Guided Music Recommendation for Video via Prompt Analogies", CVPR, 2023 (Adobe). [Paper][Website]

[Back to Overview]

Multi-Modal Generation

  • General:
    • AttnGAN: "AttnGAN: Fine-Grained Text to Image Generation with Attentional Generative Adversarial Networks", CVPR, 2018 (Microsoft). [Paper][PyTorch]
    • ControlGAN: "Controllable Text-to-Image Generation", NeurIPS, 2019 (Oxford). [Paper][PyTorch]
    • DALL-E: "Zero-Shot Text-to-Image Generation", ICML, 2021 (OpenAI). [Paper][PyTorch][PyTorch (lucidrains)]
    • CogView: "CogView: Mastering Text-to-Image Generation via Transformers", NeurIPS, 2021 (Tsinghua). [Paper][PyTorch][Website]
    • Layout-VQGAN: "Text-to-Image Synthesis Based on Object-Guided Joint-Decoding Transformer", CVPR, 2022 (CAS). [Paper]
    • Lafite: "Towards Language-Free Training for Text-to-Image Generation", CVPR, 2022 (Microsoft). [Paper][PyTorch]
    • LDM: "High-Resolution Image Synthesis with Latent Diffusion Models", CVPR, 2022 (LMU Munich). [Paper][PyTorch]
    • AvatarCLIP: "AvatarCLIP: Zero-Shot Text-Driven Generation and Animation of 3D Avatars", SIGGRAPH, 2022 (NTU, Singapore). [Paper][PyTorch][Website]
    • StoryDALL-E: "StoryDALL-E: Adapting Pretrained Text-to-Image Transformers for Story Continuation", ECCV, 2022 (UNC). [Paper][PyTorch]
    • Make-A-Scene: "Make-A-Scene: Scene-Based Text-to-Image Generation with Human Priors", ECCV, 2022 (Meta). [Paper][Video]
    • TCTIG: "Trace Controlled Text to Image Generation", ECCV, 2022 (Beihang University). [Paper]
    • CogView2: "CogView2: Faster and Better Text-to-Image Generation via Hierarchical Transformers", NeurIPS, 2022 (Tsinghua). [Paper][PyTorch]
    • CLIPDraw: "CLIPDraw: Exploring Text-to-Drawing Synthesis through Language-Image Encoders", NeurIPS, 2022 (Cross Compass, Japan). [Paper][PyTorch][Blog]
    • Imagen: "Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding", NeurIPS, 2022 (Google). [Paper][Website]
    • ?: "Human Evaluation of Text-to-Image Models on a Multi-Task Benchmark", NeurIPSW, 2022 (Boston + MIT + Columbia). [Paper]
    • DALL-Eval: "DALL-Eval: Probing the Reasoning Skills and Social Biases of Text-to-Image Generative Transformers", arXiv, 2022 (UNC). [Paper][PyTorch]
    • DALL-E-2: "Hierarchical Text-Conditional Image Generation with CLIP Latents", arXiv, 2022 (OpenAI). [Paper][Website]
    • ?: "A very preliminary analysis of DALL-E 2", arXiv, 2022 (NYU). [Paper]
    • GLIDE: "GLIDE: Towards Photorealistic Image Generation and Editing with Text-Guided Diffusion Models", arXiv, 2022 (OpenAI). [Paper][PyTorch]
    • ?: "Discovering the Hidden Vocabulary of DALLE-2", arXiv, 2022 (UT Austin). [Paper]
    • Parti: "Scaling Autoregressive Models for Content-Rich Text-to-Image Generation", arXiv, 2022 (Google). [Paper][GitHub][Website]
    • Textual-Inversion: "An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion", arXiv, 2022 (NVIDIA). [Paper][Website]
    • VLMGAN: "Vision-Language Matching for Text-to-Image Synthesis via Generative Adversarial Networks", arXiv, 2022 (Fudan University). [Paper]
    • PDM: "Progressive Denoising Model for Fine-Grained Text-to-Image Generation", arXiv, 2022 (Meituan). [Paper]
    • FS-VQG: "Few-Shot Visual Question Generation: A Novel Task and Benchmark Datasets", arXiv, 2022 (IIT Kharagpur). [Paper]
    • Swinv2-Imagen: "Swinv2-Imagen: Hierarchical Vision Transformer Diffusion Models for Text-to-Image Generation", arXiv, 2022 (Auckland University of Technology). [Paper]
    • UniTune: "UniTune: Text-Driven Image Editing by Fine Tuning an Image Generation Model on a Single Image", arXiv, 2022 (Google). [Paper]
    • VSD: "Visual Spatial Description: Controlled Spatial-Oriented Image-to-Text Generation", arXiv, 2022 (Tianjin University). [Paper][Code (in construction)]
    • Lafite2: "Lafite2: Few-shot Text-to-Image Generation", arXiv, 2022 (SUNY, Buffalo). [Paper]
    • eDiffi: "eDiffi: Text-to-Image Diffusion Models with an Ensemble of Expert Denoisers", arXiv, 2022 (NVIDIA). [Paper][Website]
    • SpaText: "SpaText: Spatio-Textual Representation for Controllable Image Generation", arXiv, 2022 (Meta). [Paper][Website]
    • Story-LDM: "Make-A-Story: Visual Memory Conditioned Consistent Story Generation", arXiv, 2022 (UBC + Snap). [Paper]
    • Structure-Diffusion: "Training-Free Structured Diffusion Guidance for Compositional Text-to-Image Synthesis", arXiv, 2022 (UCSB + UC Santa Cruz). [Paper][PyTorch][Website]
    • Re-Imagen: "Re-Imagen: Retrieval-Augmented Text-to-Image Generator", ICLR, 2023 (Google). [Paper]
    • Prompt-to-Prompt: "Prompt-to-Prompt Image Editing with Cross Attention Control", ICLR, 2023 (Google). [Paper][PyTorch][Website]
    • UniD3: "Unified Discrete Diffusion for Simultaneous Vision-Language Generation", ICLR, 2023 (NTU, Singapore). [Paper]
    • T2P: "Zero-Shot Text-to-Parameter Translation for Game Character Auto-Creation", CVPR, 2023 (Fuxi AI Lab). [Paper]
    • GLIGEN: "GLIGEN: Open-Set Grounded Text-to-Image Generation", CVPR, 2023 (Microsoft). [Paper][PyTorch][Website]
    • MAGVLT: "MAGVLT: Masked Generative Vision-and-Language Transformer", CVPR, 2023 (Kakao). [Paper]
    • ReCo: "ReCo: Region-Controlled Text-to-Image Generation", CVPR, 2023 (Microsoft). [Paper][PyTorch]
    • GALIP: "GALIP: Generative Adversarial CLIPs for Text-to-Image Synthesis", CVPR, 2023 (Nanjing University of Posts and Telecommunications). [Paper][PyTorch]
    • DreamBooth: "DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation", CVPR, 2023 (Google). [Paper][GitHub][Website]
    • RIATIG: "RIATIG: Reliable and Imperceptible Adversarial Text-to-Image Generation With Natural Prompts", CVPR, 2023 (Washington University in St. Louis). [Paper]
    • ERNIE-ViLG-2.0: "ERNIE-ViLG 2.0: Improving Text-to-Image Diffusion Model with Knowledge-Enhanced Mixture-of-Denoising-Experts", CVPR, 2023 (Baidu). [Paper][Website]
    • GigaGAN: "Scaling up GANs for Text-to-Image Synthesis", CVPR, 2023 (Adobe). [Paper][PyTorch][Website]
    • Shifted-Diffusion: "Shifted Diffusion for Text-to-image Generation", CVPR, 2023 (ByteDance). [Paper][PyTorch]
    • Specialist-Diffusion: "Specialist Diffusion: Plug-and-Play Sample-Efficient Fine-Tuning of Text-to-Image Diffusion Models To Learn Any Unseen Style", CVPR, 2023 (Picsart). [Paper][Website]
    • ?: "Toward Verifiable and Reproducible Human Evaluation for Text-to-Image Generation", CVPR, 2023 (CyberAgent, Japan). [Paper]
    • Custom-Diffusion: "Multi-Concept Customization of Text-to-Image Diffusion", CVPR, 2023 (Adobe). [Paper]
    • UniDiffuser: "One Transformer Fits All Distributions in Multi-Modal Diffusion at Scale", ICML, 2023 (Tsinghua University). [Paper][PyTorch]
    • Muse: "Muse: Text-To-Image Generation via Masked Generative Transformers", ICML, 2023 (Google). [Paper][Website]
    • RA-CM3: "Retrieval-Augmented Multimodal Language Modeling", ICML, 2023 (Meta). [Paper]
    • StyleGAN-T: "StyleGAN-T: Unlocking the Power of GANs for Fast Large-Scale Text-to-Image Synthesis", ICML, 2023 (NVIDIA). [Paper][PyTorch][Website]
    • VD: "Versatile Diffusion: Text, Images and Variations All in One Diffusion Model", ICCV, 2023 (Oregon). [Paper][PyTorch]
    • E4T: "Designing an Encoder for Fast Personalization of Text-to-Image Models", arXiv, 2023 (NVIDIA). [Paper][Website]
    • ?: "Controlled and Conditional Text to Image Generation with Diffusion Prior", arXiv, 2023 (Adobe). [Paper]
    • Lformer: "Lformer: Text-to-Image Generation with L-shape Block Parallel Decoding", arXiv, 2023 (Zhejiang University). [Paper]
    • UMM-Diffusion: "Unified Multi-Modal Latent Diffusion for Joint Subject and Text Conditional Image Generation", arXiv, 2023 (Peking University). [Paper]
    • TIFA: "TIFA: Accurate and Interpretable Text-to-Image Faithfulness Evaluation with Question Answering", arXiv, 2023 (UW). [Paper][Code (in construction)][Website]
    • ToMESD: "Token Merging for Fast Stable Diffusion", arXiv, 2023 (Georgia Tech). [Paper][PyTorch]
    • layout-guidance: "Training-Free Layout Control with Cross-Attention Guidance", arXiv, 2023 (Oxford). [Paper][PyTorch][Website]
    • HRS-Bench: "HRS-Bench: Holistic, Reliable and Scalable Benchmark for Text-to-Image Models", arXiv, 2023 (KAUST). [Paper][GitHub][Website]
    • SeedSelect: "It is all about where you start: Text-to-image generation with seed selection", arXiv, 2023 (Bar-Ilan University, Israel). [Paper]
    • DisenBooth: "DisenBooth: Disentangled Parameter-Efficient Tuning for Subject-Driven Text-to-Image Generation", arXiv, 2023 (Tsinghua). [Paper]
    • VideoOFA: "VideoOFA: Two-Stage Pre-Training for Video-to-Text Generation", arXiv, 2023 (Meta). [Paper]
    • FastComposer: "FastComposer: Tuning-Free Multi-Subject Image Generation with Localized Attention", arXiv, 2023 (MIT). [Paper][PyTorch][Website]
    • LLMScore: "LLMScore: Unveiling the Power of Large Language Models in Text-to-Image Synthesis Evaluation", arXiv, 2023 (UCSB). [Paper][PyTorch]
    • CoDi: "Any-to-Any Generation via Composable Diffusion", arXiv, 2023 (Microsoft). [Paper][PyTorch][Website]
    • ?: "The CLIP Model is Secretly an Image-to-Prompt Converter", arXiv, 2023 (Xidian University). [Paper]
    • PoS-subspaces: "Parts of Speech-Grounded Subspaces in Vision-Language Models", arXiv, 2023 (Queen Mary University of London). [Paper][PyTorch (in construction)][Website]
    • VPGen: "Visual Programming for Text-to-Image Generation and Evaluation", arXiv, 2023 (UNC). [Paper][PyTorch][Website]
    • BLIP-Diffusion: "BLIP-Diffusion: Pre-trained Subject Representation for Controllable Text-to-Image Generation and Editing", arXiv, 2023 (Salesforce). [Paper][Code (in construction)][Website]
    • SeeCoder: "Prompt-Free Diffusion: Taking "Text" out of Text-to-Image Diffusion Models", arXiv, 2023 (Picsart). [Paper][PyTorch]
    • GILL: "Generating Images with Multimodal Language Models", arXiv, 2023 (CMU). [Paper][Code (in construction)][Website]
    • CAC: "Localized Text-to-Image Generation for Free via Cross Attention Control", arXiv, 2023 (CMU). [Paper]
    • CLIPAG: "CLIPAG: Towards Generator-Free Text-to-Image Generation", arXiv, 2023 (Technion, Israel). [Paper]
    • PACGen: "Generate Anything Anywhere in Any Scene", arXiv, 2023 (UW Madison). [Paper][Code (in construction)][Website]
    • SPAE: "SPAE: Semantic Pyramid AutoEncoder for Multimodal Generation with Frozen LLMs", arXiv, 2023 (Google). [Paper]
    • DA-Score: "Divide, Evaluate, and Refine: Evaluating and Improving Text-to-Image Alignment with Iterative VQA Feedback", arXiv, 2023 (ANU). [Paper][Code (in construction)][Website]
    • HyperDreamBooth: "HyperDreamBooth: HyperNetworks for Fast Personalization of Text-to-Image Models", arXiv, 2023 (Google). [Paper][Website]
    • ?: "Domain-Agnostic Tuning-Encoder for Fast Personalization of Text-To-Image Models", arXiv, 2023 (NVIDIA). [Paper][Website]
    • GORS: "T2I-CompBench: A Comprehensive Benchmark for Open-world Compositional Text-to-image Generation", arXiv, 2023 (HKU). [Paper][Website][PyTorch]
    • IP-Adapter: "IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models", arXiv, 2023 (Tencent). [Paper][Website]
    • ORES: "ORES: Open-vocabulary Responsible Visual Synthesis", arXiv, 2023 (Microsoft). [Paper]
    • CM3Leon: "Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning", arXiv, 2023 (Meta). [Paper]
  • Video:
    • Imagen-Video: "Imagen Video: High Definition Video Generation with Diffusion Models", arXiv, 2022 (Google). [Paper][Website]
    • Phenaki: "Phenaki: Variable Length Video Generation From Open Domain Textual Description", arXiv, 2022 (Google). [Paper][PyTorch (LAION-AI, in construction)][Website]
    • ?: "Towards Real-Time Text2Video via CLIP-Guided, Pixel-Level Optimization", arXiv, 2022 (CMU). [Paper][PyTorch][Website]
    • MagicVideo: "MagicVideo: Efficient Video Generation With Latent Diffusion Models", arXiv, 2022 (ByteDance). [Paper][Website]
    • CogVideo: "CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers", ICLR, 2023 (Tsinghua University). [Paper][GitHub (in construction)]
    • Make-A-Video: "Make-A-Video: Text-to-Video Generation without Text-Video Data", ICLR, 2023 (Meta). [Paper]
    • VideoLDM: "Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models", CVPR, 2023 (NVIDIA). [Paper][Website]
    • MMVG: "Tell Me What Happened: Unifying Text-guided Video Completion via Multimodal Masked Video Generation", CVPR, 2023 (Meta). [Paper]
    • MM-Diffusion: "MM-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and Video Generation", CVPR, 2023 (Microsoft). [Paper][PyTorch]
    • PYoCo: "Preserve Your Own Correlation: A Noise Prior for Video Diffusion Models", ICCV, 2023 (NVIDIA). [Paper][Website]
    • Text2Video-Zero: "Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators", ICCV, 2023 (Picsart). [Paper][Code (in construction)]
    • Text2Performer: "Text2Performer: Text-Driven Human Video Generation", arXiv, 2023 (NTU, Singapore). [Paper][Code (in construction)][Website]
    • VideoFactory: "VideoFactory: Swap Attention in Spatiotemporal Diffusions for Text-to-Video Generation", arXiv, 2023 (Microsoft). [Paper]
    • Video-Adapter: "Probabilistic Adaptation of Text-to-Video Models", arXiv, 2023 (DeepMind). [Paper][Website]
    • SimDA: "SimDA: Simple Diffusion Adapter for Efficient Video Generation", arXiv, 2023 (Fudan). [Paper][Website]
  • 3D:
    • Magic3D: "Magic3D: High-Resolution Text-to-3D Content Creation", CVPR, 2023 (NVIDIA). [Paper][Website]
    • CLIP-Sculptor: "CLIP-Sculptor: Zero-Shot Generation of High-Fidelity and Diverse Shapes from Natural Language", CVPR, 2023 (Autodesk). [Paper][Website]
    • Diffusion-SDF: "Diffusion-SDF: Text-to-Shape via Voxelized Diffusion", CVPR, 2023 (Tsinghua). [Paper][PyTorch][Website]
    • TAPS3D: "TAPS3D: Text-Guided 3D Textured Shape Generation from Pseudo Supervision", CVPR, 2023 (ByteDance). [Paper][PyTorch]
    • Dream3D: "Dream3D: Zero-Shot Text-to-3D Synthesis Using 3D Shape Prior and Text-to-Image Diffusion Models", CVPR, 2023 (Tencent). [Paper][Website]
    • ATT3D: "ATT3D: Amortized Text-To-3D Object Synthesis", arXiv, 2023 (NVIDIA). [Paper][Website]
    • InstructP2P: "InstructP2P: Learning to Edit 3D Point Clouds with Text Instructions", arXiv, 2023 (Tencent). [Paper]
    • SDS-Complete: "Point-Cloud Completion with Pretrained Text-to-image Diffusion Models", arXiv, 2023 (NVIDIA). [Paper][Website]
    • Michelangelo: "Michelangelo: Conditional 3D Shape Generation based on Shape-Image-Text Aligned Latent Representation", arXiv, 2023 (Tencent). [Paper][Code (in construction)][Website]
  • Others:
    • DiffGesture: "Taming Diffusion Models for Audio-Driven Co-Speech Gesture Generation", CVPR, 2023 (HKU). [Paper][PyTorch]
    • CondFoleyGen: "Conditional Generation of Audio from Video via Foley Analogies", CVPR, 2023 (UMich). [Paper][PyTorch (in construction)][Website]
    • Physics-Diffusion: "Physics-Driven Diffusion Models for Impact Sound Synthesis from Videos", CVPR, 2023 (IBM). [Paper][PyTorch][Website]
    • RACER: "Co-Speech Gesture Synthesis by Reinforcement Learning With Contrastive Pre-Trained Rewards", CVPR, 2023 (Dalian University of Technology). [Paper][Code (in construction)]
    • ReVISE: "ReVISE: Self-Supervised Speech Resynthesis with Visual Input for Universal and Generalized Speech Regeneration", CVPR, 2023 (Meta). [Paper][PyTorch][Website]
    • MAV3D: "Text-To-4D Dynamic Scene Generation", ICML, 2023 (Meta). [Paper][Website]
    • LORIS: "Long-Term Rhythmic Video Soundtracker", ICML, 2023 (Shanghai AI Lab). [Paper][PyTorch]

[Back to Overview]

Prompt Learning/Tuning

  • CLIP-Adapter: "CLIP-Adapter: Better Vision-Language Models with Feature Adapters", arXiv, 2021 (Shanghai AI Lab). [Paper][PyTorch]
  • CoCoOp: "Conditional Prompt Learning for Vision-Language Models", CVPR, 2022 (NTU, Singapore). [Paper][PyTorch]
  • ProDA: "Prompt Distribution Learning", CVPR, 2022 (Huawei). [Paper]
  • VPT: "Visual Prompt Tuning", ECCV, 2022 (Cornell). [Paper][PyTorch]
  • PerVL: "'This is my unicorn, Fluffy': Personalizing frozen vision-language representations", ECCV, 2022 (NVIDIA). [Paper][PyTorch]
  • OrdinalCLIP: "OrdinalCLIP: Learning Rank Prompts for Language-Guided Ordinal Regression", NeurIPS, 2022 (Tsinghua University). [Paper][PyTorch]
  • BeamCLIP: "Transferring Pre-trained Multimodal Representations with Cross-modal Similarity Matching", NeurIPS, 2022 (LG). [Paper]
  • TPT: "Test-Time Prompt Tuning for Zero-Shot Generalization in Vision-Language Models", NeurIPS, 2022 (NVIDIA). [Paper][PyTorch][Website]
  • CoOp: "Learning to Prompt for Vision-Language Models", IJCV, 2022 (NTU, Singapore). [Paper][PyTorch]
  • LASP: "Language-Aware Soft Prompting for Vision & Language Foundation Models", CVPR, 2023 (Samsung). [Paper][Website]
  • VPT: "Variational prompt tuning improves generalization of vision-language models", arXiv, 2022 (Samsung). [Paper]
  • CAVPT: "Class-Aware Visual Prompt Tuning for Vision-Language Pre-Trained Model", arXiv, 2022 (Northwestern Polytechnical University, China). [Paper]
  • Visual-Prompting: "Exploring Visual Prompts for Adapting Large-Scale Models", arXiv, 2022 (MIT). [Paper][PyTorch][Website]
  • PGN: "Prompt Generation Networks for Efficient Adaptation of Frozen Vision Transformers", arXiv, 2022 (University of Amsterdam). [Paper][PyTorch]
  • UPT: "Unified Vision and Language Prompt Learning", arXiv, 2022 (NTU, Singapore). [Paper][Code (in construction)]
  • CPL: "CPL: Counterfactual Prompt Learning for Vision and Language Models", arXiv, 2022 (UC Santa Cruz). [Paper]
  • PTP: "Prompting through Prototype: A Prototype-based Prompt Learning on Pretrained Vision-Language Models", arXiv, 2022 (Baidu). [Paper]
  • MVLPT: "Multitask Vision-Language Prompt Tuning", arXiv, 2022 (Berkeley). [Paper][PyTorch]
  • ?: "Task Bias in Vision-Language Models", arXiv, 2022 (Columbia). [Paper]
  • UPL: "Unsupervised Prompt Learning for Vision-Language Models", arXiv, 2022 (Peking). [Paper][PyTorch]
  • DeFo: "Learning to Decompose Visual Features with Latent Textual Prompts", ICLR, 2023 (UIUC). [Paper]
  • PLOT: "Prompt Learning with Optimal Transport for Vision-Language Models", ICLR, 2023 (CMU). [Paper]
  • ?: "Visual Classification via Description from Large Language Models", ICLR, 2023 (Columbia). [Paper]
  • CSP: "Learning to Compose Soft Prompts for Compositional Zero-Shot Learning", ICLR, 2023 (Brown University). [Paper][PyTorch]
  • CaFo: "Prompt, Generate, then Cache: Cascade of Foundation Models makes Strong Few-shot Learners", CVPR, 2023 (Shanghai AI Lab). [Paper][PyTorch]
  • ?: "Multimodal Prompting with Missing Modalities for Visual Recognition", CVPR, 2023 (NYCU). [Paper][PyTorch][Website]
  • DAM-VP: "Diversity-Aware Meta Visual Prompting", CVPR, 2023 (USTC). [Paper][PyTorch]
  • ILM-VP: "Understanding and Improving Visual Prompting: A Label-Mapping Perspective", CVPR, 2023 (Michigan State). [Paper][PyTorch]
  • KgCoOp: "Visual-Language Prompt Tuning with Knowledge-guided Context Optimization", CVPR, 2023 (CAS). [Paper][PyTorch]
  • BlackVIP: "BlackVIP: Black-Box Visual Prompting for Robust Transfer Learning", CVPR, 2023 (University of Seoul). [Paper][PyTorch]
  • EXPRES: "Learning Expressive Prompting With Residuals for Vision Transformers", CVPR, 2023 (Amazon). [Paper]
  • ?: "Learning to Name Classes for Vision and Language Models", CVPR, 2023 (Huawei). [Paper]
  • PMF: "Efficient Multimodal Fusion via Interactive Prompting", CVPR, 2023 (Zhejiang University). [Paper]
  • MaPLe: "MaPLe: Multi-modal Prompt Learning", CVPR, 2023 (MBZUAI). [Paper][PyTorch]
  • HiPro: "Hierarchical Prompt Learning for Multi-Task Learning", CVPR, 2023 (JD). [Paper]
  • DFSP: "Decomposed Soft Prompt Guided Fusion Enhancing for Compositional Zero-Shot Learning", CVPR, 2023 (The Hong Kong Polytechnic University). [Paper][PyTorch]
  • TaI-DP: "Texts as Images in Prompt Tuning for Multi-Label Image Recognition", CVPR, 2023 (Tomorrow Advancing Life (TAL)). [Paper][PyTorch]
  • ESPER: "Fusing Pre-Trained Language Models With Multimodal Prompts Through Reinforcement Learning", CVPR, 2023 (Yonsei). [Paper][PyTorch]
  • APT: "A-La-Carte Prompt Tuning (APT): Combining Distinct Data via Composable Prompting", CVPR, 2023 (Amazon). [Paper]
  • VQT: "Visual Query Tuning: Towards Effective Usage of Intermediate Representations for Parameter and Memory Efficient Transfer Learning", CVPR, 2023 (The Ohio State University (OSU)). [Paper]
  • LaBo: "Language in a Bottle: Language Model Guided Concept Bottlenecks for Interpretable Image Classification", CVPR, 2023 (University of Pennsylvania). [Paper][PyTorch]
  • TaskRes: "Task Residual for Tuning Vision-Language Models", CVPR, 2023 (NUS). [Paper][PyTorch]
  • POUF: "POUF: Prompt-oriented unsupervised fine-tuning for large pre-trained models", ICML, 2023 (UT Austin). [Paper][PyTorch]
  • ?: "Improving Visual Prompt Tuning for Self-supervised Vision Transformers", ICML, 2023 (SNU). [Paper][PyTorch]
  • ZPE: "A Simple Zero-shot Prompt Weighting Technique to Improve Prompt Ensembling in Text-Image Models", ICML, 2023 (Google). [Paper]
  • CMPA: "Deeply Coupled Cross-Modal Prompt Learning", ACL Findings, 2023 (SenseTime). [Paper]
  • PromptSRC: "Self-regulating Prompts: Foundational Model Adaptation without Forgetting", ICCV, 2023 (MBZUAI). [Paper][PyTorch][Website]
  • SHIP: "Improving Zero-Shot Generalization for CLIP with Synthesized Prompts", ICCV, 2023 (CAS). [Paper]
  • PTNL: "Why Is Prompt Tuning for Vision-Language Models Robust to Noisy Labels?", ICCV, 2023 (ByteDance). [Paper]
  • E2VPT: "E2VPT: An Effective and Efficient Approach for Visual Prompt Tuning", ICCV, 2023 (Rochester Institute of Technology, NY). [Paper][PyTorch]
  • R-AMT: "Regularized Mask Tuning: Uncovering Hidden Knowledge in Pre-trained Vision-Language Models", ICCV, 2023 (Zhejiang University). [Paper][Code (in construction)][Website]
  • DiffTPT: "Diverse Data Augmentation with Diffusions for Effective Test-time Prompt Tuning", ICCV, 2023 (A*STAR). [Paper][PyTorch (in construction)]
  • KAPT: "Knowledge-Aware Prompt Tuning for Generalizable Vision-Language Models", ICCV, 2023 (Southern University of Science and Technology (SUSTech)). [Paper]
  • RPO: "Read-only Prompt Optimization for Vision-Language Few-shot Learning", ICCV, 2023 (Korea University). [Paper][PyTorch]
  • LoGoPrompt: "LoGoPrompt: Synthetic Text Images Can Be Good Visual Prompts for Vision-Language Models", ICCV, 2023 (ShanghaiTech). [Paper][Website]
  • DAPT: "Distribution-Aware Prompt Tuning for Vision-Language Models", ICCV, 2023 (Korea University). [Paper][Code (in construction)]
  • GOPro: "GOPro: Generate and Optimize Prompts in CLIP using Self-Supervised Learning", BMVC, 2023 (IIT Bombay). [Paper][Code (in construction)]
  • SeMap: "From Visual Prompt Learning to Zero-Shot Transfer: Mapping Is All You Need", arXiv, 2023 (CISPA, Germany). [Paper]
  • R-Tuning: "R-Tuning: Regularized Prompt Tuning in Open-Set Scenarios", arXiv, 2023 (Shanghai Jiao Tong). [Paper]
  • VPTM: "Rethinking Visual Prompt Learning as Masked Visual Token Modeling", arXiv, 2023 (Shanghai Jiao Tong). [Paper]
  • GRAM: "Gradient-Regulated Meta-Prompt Learning for Generalizable Vision-Language Models", arXiv, 2023 (Huawei). [Paper]
  • PBPrompt: "Patch-Token Aligned Bayesian Prompt Learning for Vision-Language Models", arXiv, 2023 (Xidian University). [Paper]
  • CTP-TFT: "Task-Oriented Multi-Modal Mutual Learning for Vision-Language Models", arXiv, 2023 (Baidu). [Paper]
  • POMP: "Prompt Pre-Training with Twenty-Thousand Classes for Open-Vocabulary Visual Recognition", arXiv, 2023 (Amazon). [Paper][PyTorch]
  • ?: "What does CLIP know about a red circle? Visual prompt engineering for VLMs", arXiv, 2023 (Oxford). [Paper]
  • Robust-ProL: "Towards Robust Prompts on Vision-Language Models", arXiv, 2023 (Google). [Paper]
  • ProVP: "Progressive Visual Prompt Learning with Contrastive Feature Re-formation", arXiv, 2023 (vivo, China). [Paper]
  • ?: "Chain of Thought Prompt Tuning in Vision Language Models", arXiv, 2023 (Peking University). [Paper]
  • Instruction-ViT: "Instruction-ViT: Multi-Modal Prompts for Instruction Learning in ViT", arXiv, 2023 (University of Electronic Science and Technology of China). [Paper]
  • VPGTrans: "Transfer Visual Prompt Generator across LLMs", arXiv, 2023 (NUS). [Paper][PyTorch][Website]
  • DRPT: "DRPT: Disentangled and Recurrent Prompt Tuning for Compositional Zero-Shot Learning", arXiv, 2023 (Hong Kong Polytechnic University). [Paper][Code (in construction)]
  • VCoT: "Visual Chain of Thought: Bridging Logical Gaps with Multimodal Infillings", arXiv, 2023 (UCSB). [Paper]
  • PMPO: "Multi-Prompt with Depth Partitioned Cross-Modal Learning", arXiv, 2023 (CAS). [Paper]
  • Aurora: "Mode Approximation Makes Good Vision-Language Prompts", arXiv, 2023 (Peking). [Paper][PyTorch]
  • DSD: "Discriminative Diffusion Models as Few-shot Vision and Language Learners", arXiv, 2023 (Google). [Paper]
  • PLID: "Prompting Language-Informed Distribution for Compositional Zero-Shot Learning", arXiv, 2023 (Michigan State). [Paper]
  • ConES: "ConES: Concept Embedding Search for Parameter Efficient Tuning Large Vision Language Models", arXiv, 2023 (Sichuan University). [Paper]
  • LaFTer: "LaFTer: Label-Free Tuning of Zero-shot Classifier using Language and Unlabeled Image Collections", arXiv, 2023 (TU Graz, Austria). [Paper]
  • ?: "Enhancing CLIP with CLIP: Exploring Pseudolabeling for Limited-Label Prompt Tuning", arXiv, 2023 (Brown). [Paper][PyTorch]
  • CoPrompt: "Consistency-guided Prompt Learning for Vision-Language Models", arXiv, 2023 (Queen’s University, Canada). [Paper]
  • ProTeCt: "ProTeCt: Prompt Tuning for Hierarchical Consistency", arXiv, 2023 (UCSD). [Paper]
  • FGVP: "Fine-Grained Visual Prompting", arXiv, 2023 (BAAI). [Paper]
  • POP: "POP: Prompt Of Prompts for Continual Learning", arXiv, 2023 (Qualcomm). [Paper]
  • GAVIE: "Aligning Large Multi-Modal Model with Robust Instruction Tuning", arXiv, 2023 (Microsoft). [Paper][PyTorch][Website]
  • NPT: "Bridging the Gap: Neural Collapse Inspired Prompt Tuning for Generalization under Class Imbalance", arXiv, 2023 (Zhejiang University). [Paper]
  • APT: "Approximated Prompt Tuning for Vision-Language Pre-trained Models", arXiv, 2023 (Xiamen University). [Paper]
  • CoPL: "Contextual Prompt Learning for Vision-Language Understanding", arXiv, 2023 (Adobe). [Paper]
  • CiP: "Image Captions are Natural Prompts for Text-to-Image Models", arXiv, 2023 (The University of Sydney). [Paper]
  • UP-DP: "UP-DP: Unsupervised Prompt Learning for Data Pre-Selection with Vision-Language Models", arXiv, 2023 (Bosch). [Paper]
  • DPL: "DPL: Decoupled Prompt Learning for Vision-Language Models", arXiv, 2023 (vivo). [Paper]
  • DuAl-PT: "Context-Aware Prompt Tuning for Vision-Language Model with Dual-Alignment", arXiv, 2023 (?). [Paper]

[Back to Overview]

Visual Document Understanding

  • LayoutLMv2: "LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding", ACL, 2021 (Microsoft). [Paper][PyTorch]
  • DocFormer: "DocFormer: End-to-End Transformer for Document Understanding", ICCV, 2021 (Amazon). [Paper]
  • StrucTexT: "StrucTexT: Structured Text Understanding with Multi-Modal Transformers", ACMMM, 2021 (Baidu). [Paper][Paddle]
  • LayoutXLM: "LayoutXLM: Multimodal Pre-training for Multilingual Visually-rich Document Understanding", arXiv, 2021 (Microsoft). [Paper][PyTorch]
  • TableFormer: "TableFormer: Table Structure Understanding with Transformers", CVPR, 2022 (IBM). [Paper]
  • TSRFormer: "TSRFormer: Table Structure Recognition with Transformers", ACMMM, 2022 (Microsoft). [Paper]
  • ERNIE-mmLayout: "ERNIE-mmLayout: Multi-grained MultiModal Transformer for Document Understanding", ACMMM, 2022 (Baidu). [Paper]
  • Donut: "Donut: Document Understanding Transformer without OCR", ECCV, 2022 (NAVER). [Paper][PyTorch]
  • I2DFormer: "I2DFormer: Learning Image to Document Attention for Zero-Shot Image Classification", NeurIPS, 2022 (ETHZ). [Paper]
  • MGDoc: "MGDoc: Pre-training with Multi-granular Hierarchy for Document Image Understanding", EMNLP, 2022 (Adobe). [Paper]
  • DocEnTr: "DocEnTr: An End-to-End Document Image Enhancement Transformer", arXiv, 2022 (UAB, Spain). [Paper][PyTorch]
  • DocSegTr: "DocSegTr: An Instance-Level End-to-End Document Image Segmentation Transformer", arXiv, 2022 (UAB, Spain). [Paper]
  • DiT: "DiT: Self-supervised Pre-training for Document Image Transformer", arXiv, 2022 (Microsoft). [Paper][Code (in construction)]
  • LayoutLMv3: "LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking", arXiv, 2022 (Microsoft). [Paper][PyTorch]
  • MATrIX: "MATrIX - Modality-Aware Transformer for Information eXtraction", arXiv, 2022 (Amazon). [Paper]
  • VLCDoC: "VLCDoC: Vision-Language Contrastive Pre-Training Model for Cross-Modal Document Classification", arXiv, 2022 (La Rochelle University, France). [Paper]
  • Bi-VLDoc: "Bi-VLDoc: Bidirectional Vision-Language Modeling for Visually-Rich Document Understanding", arXiv, 2022 (Alibaba). [Paper]
  • TRUST: "TRUST: An Accurate and End-to-End Table structure Recognizer Using Splitting-based Transformers", arXiv, 2022 (Baidu). [Paper]
  • Hi-VT5: "Hierarchical multimodal transformers for Multi-Page DocVQA", arXiv, 2022 (UAB, Spain). [Paper]
  • OCR-VQGAN: "OCR-VQGAN: Taming Text-within-Image Generation", WACV, 2023 (UAB, Spain). [Paper]
  • PIXEL: "Language Modelling with Pixels", ICLR, 2023 (University of Copenhagen, Denmark). [Paper]
  • Spotlight: "Spotlight: Mobile UI Understanding using Vision-Language Models with a Focus", ICLR, 2023 (Google). [Paper]
  • MaskDoc: "Masked Visual-Textual Prediction for Document Image Representation Pretraining", ICLR, 2023 (Baidu). [Paper]
  • StrucTexTv2: "StrucTexTv2: Masked Visual-Textual Prediction for Document Image Pre-training", ICLR, 2023 (Baidu). [Paper][Paddle]
  • FlexDM: "Towards Flexible Multi-modal Document Models", CVPR, 2023 (CyberAgent, Japan). [Paper][Tensorflow][Website]
  • MUI: "Mobile User Interface Element Detection Via Adaptively Prompt Tuning", CVPR, 2023 (Ant Group). [Paper][GitHub (in construction)]
  • UDOP: "Unifying Vision, Text, and Layout for Universal Document Processing", CVPR, 2023 (Microsoft). [Paper][PyTorch]
  • M6Doc: "M6Doc: A Large-Scale Multi-Format, Multi-Type, Multi-Layout, Multi-Language, Multi-Annotation Category Dataset for Modern Document Layout Analysis", CVPR, 2023 (South China University of Technology). [Paper][GitHub]
  • VGT: "Vision Grid Transformer for Document Layout Analysis", ICCV, 2023 (Alibaba). [Paper][PyTorch]
  • SeRum: "Attention Where It Matters: Rethinking Visual Document Understanding with Selective Region Concentration", ICCV, 2023 (Tencent). [Paper]
  • FormNetV2: "FormNetV2: Multimodal Graph Contrastive Learning for Form Document Information Extraction", ACL, 2023 (Google). [Paper]
  • mmc4: "Multimodal C4: An Open, Billion-scale Corpus of Images Interleaved With Text", arXiv, 2023 (AI2). [Paper][GitHub (in construction)]
  • DUBLIN: "DUBLIN - Document Understanding By Language-Image Network", arXiv, 2023 (Microsoft). [Paper]
  • DocFormerv2: "DocFormerv2: Local Features for Document Understanding", arXiv, 2023 (Amazon). [Paper]
  • DocumentCLIP: "DocumentCLIP: Linking Figures and Main Body Text in Reflowed Documents", arXiv, 2023 (Adobe). [Paper][PyTorch]
  • DocTr: "DocTr: Document Transformer for Structured Information Extraction in Documents", arXiv, 2023 (Amazon). [Paper]

[Back to Overview]

Other Multi-Modal Tasks

  • Transfer Learning/Adaptation/Distillation:
    • FLYP: "Finetune like you pretrain: Improved finetuning of zero-shot vision models", CVPR, 2023 (CMU). [Paper][PyTorch]
    • Pi-Tuning: "Pi-Tuning: Transferring Multimodal Foundation Models with Optimal Multi-task Interpolation", ICML, 2023 (HKU). [Paper][Code (in construction)]
    • ORCA: "Cross-Modal Fine-Tuning: Align then Refine", ICML, 2023 (CMU + HP). [Paper][PyTorch]
    • TeS: "Improved Visual Fine-tuning with Natural Language Supervision", arXiv, 2023 (Alibaba). [Paper]
    • Paxion: "Paxion: Patching Action Knowledge in Video-Language Foundation Models", arXiv, 2023 (UIUC). [Paper][PyTorch]
    • RLCF: "Test-Time Adaptation with CLIP Reward for Zero-Shot Generalization in Vision-Language Models", arXiv, 2023 (Zhejiang University). [Paper][Code (in construction)]
    • LMAT: "Can Large Pre-trained Models Help Vision Models on Perception Tasks?", arXiv, 2023 (Huawei). [Paper][Website (in construction)]
    • TaCA: "TaCA: Upgrading Your Visual Foundation Model with Task-agnostic Compatible Adapter", arXiv, 2023 (Tencent). [Paper][Code (in construction)]
    • ProbVLM: "ProbVLM: Probabilistic Adapter for Frozen Vision-Language Models", arXiv, 2023 (University of Tübingen, Germany). [Paper]
    • CLIP-KD: "CLIP-KD: An Empirical Study of Distilling CLIP Models", arXiv, 2023 (CAS). [Paper][Code (in construction)]
  • Zero-Shot:
    • CuPL: "What does a platypus look like? Generating customized prompts for zero-shot image classification", arXiv, 2022 (UW). [Paper][PyTorch]
    • SMs: "Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language", ICLR, 2023 (Google). [Paper][GitHub][Website]
    • iCLIP: "iCLIP: Bridging Image Classification and Contrastive Language-Image Pre-Training for Visual Recognition", CVPR, 2023 (Microsoft). [Paper]
    • DiffDis: "DiffDis: Empowering Generative Diffusion Model with Cross-Modal Discrimination Capability", ICCV, 2023 (Huawei). [Paper]
    • V-GLOSS: "Visually-Grounded Descriptions Improve Zero-Shot Image Classification", arXiv, 2023 (University of Alberta, Canada). [Paper]
    • ?: "Challenges of Zero-Shot Recognition with Vision-Language Models: Granularity and Correctness", arXiv, 2023 (Amazon). [Paper]
    • UniFine: "UniFine: A Unified and Fine-grained Approach for Zero-shot Vision-Language Understanding", arXiv, 2023 (Columbia). [Paper][Code (in construction)]
    • Cheetah: "Empowering Vision-Language Models to Follow Interleaved Vision-Language Instructions", arXiv, 2023 (Zhejiang). [Paper]
  • X-Shot:
    • Tip-Adapter: "Tip-Adapter: Training-free Adaption of CLIP for Few-shot Classification", ECCV, 2022 (Shanghai AI Lab). [Paper][PyTorch]
    • VidIL: "Language Models with Image Descriptors are Strong Few-Shot Video-Language Learners", NeurIPS, 2022 (UIUC). [Paper][PyTorch]
    • ComCLIP: "ComCLIP: Training-Free Compositional Image and Text Matching", arXiv, 2022 (UC Santa Cruz). [Paper]
    • TCT: "Efficient Zero-shot Visual Search via Target and Context-aware Transformer", arXiv, 2022 (Baylor College of Medicine, TX). [Paper]
    • ?: "Meta Learning to Bridge Vision and Language Models for Multimodal Few-Shot Learning", ICLR, 2023 (University of Amsterdam). [Paper]
    • ?: "Multimodality Helps Unimodality: Cross-Modal Few-Shot Learning with Multimodal Models", CVPR, 2023 (CMU). [Paper]
    • SADA: "Few-Shot Learning with Visual Distribution Calibration and Cross-Modal Distribution Alignment", CVPR, 2023 (Huawei). [Paper][PyTorch]
    • APE: "Not All Features Matter: Enhancing Few-shot CLIP with Adaptive Prior Refinement", arXiv, 2023 (Shanghai AI Lab). [Paper][PyTorch]
    • LFA: "Black Box Few-Shot Adaptation for Vision-Language models", arXiv, 2023 (Samsung). [Paper]
    • ?: "Making the Most of What You Have: Adapting Pre-trained Visual Language Models in the Low-data Regime", arXiv, 2023 (DeepMind). [Paper]
    • Proto-CLIP: "Proto-CLIP: Vision-Language Prototypical Network for Few-Shot Learning", arXiv, 2023 (UT Dallas). [Paper]
  • Referring Image Segmentation:
    • VLT: "Vision-Language Transformer and Query Generation for Referring Segmentation", ICCV, 2021 (NTU, Singapore). [Paper][Tensorflow]
    • CRIS: "CRIS: CLIP-Driven Referring Image Segmentation", CVPR, 2022 (University of Sydney). [Paper]
    • LAVT: "LAVT: Language-Aware Vision Transformer for Referring Image Segmentation", CVPR, 2022 (Oxford). [Paper]
    • ReSTR: "ReSTR: Convolution-free Referring Image Segmentation Using Transformers", CVPR, 2022 (POSTECH). [Paper][Website]
    • ReCLIP: "ReCLIP: A Strong Zero-Shot Baseline for Referring Expression Comprehension", ACL, 2022 (AI2). [Paper]
    • TSEG: "Weakly-supervised segmentation of referring expressions", arXiv, 2022 (INRIA). [Paper]
    • ZS-RIS: "Zero-shot Referring Image Segmentation with Global-Local Context Features", CVPR, 2023 (Gwangju Institute of Science and Technology (GIST)). [Paper][PyTorch]
    • PolyFormer: "PolyFormer: Referring Image Segmentation as Sequential Polygon Generation", CVPR, 2023 (Amazon). [Paper][Website]
    • MCRES: "Meta Compositional Referring Expression Segmentation", CVPR, 2023 (Singapore University of Technology and Design). [Paper]
    • ReLA: "GRES: Generalized Referring Expression Segmentation", CVPR, 2023 (NTU, Singapore). [Paper][PyTorch][Website]
    • CGFormer: "Contrastive Grouping With Transformer for Referring Image Segmentation", CVPR, 2023 (ShanghaiTech). [Paper][PyTorch]
    • CCTF: "Learning To Segment Every Referring Object Point by Point", CVPR, 2023 (JD). [Paper][Code (in construction)]
    • ETRIS: "Bridging Vision and Language Encoders: Parameter-Efficient Tuning for Referring Image Segmentation", ICCV, 2023 (Sun Yat-sen University). [Paper][PyTorch]
    • DMMI: "Beyond One-to-One: Rethinking the Referring Image Segmentation", ICCV, 2023 (Shanghai AI Lab). [Paper][Code (in construction)]
    • TRIS: "Referring Image Segmentation Using Text Supervision", ICCV, 2023 (CUHK). [Paper][Code (in construction)]
    • SnG: "Shatter and Gather: Learning Referring Image Segmentation with Text Supervision", ICCV, 2023 (POSTECH). [Paper]
    • VLT: "VLT: Vision-Language Transformer and Query Generation for Referring Segmentation", TPAMI, 2023 (NTU, Singapore). [Paper]
    • IREG: "Whether you can locate or not? Interactive Referring Expression Generation", arXiv, 2023 (Beijing University of Posts and Telecommunications). [Paper][Code (in construction)]
    • R-RIS: "Towards Robust Referring Image Segmentation", arXiv, 2023 (Peking). [Paper][Code (in construction)][Website]
    • PVD: "Parallel Vertex Diffusion for Unified Visual Grounding", arXiv, 2023 (Peking University). [Paper]
    • MMNet: "MMNet: Multi-Mask Network for Referring Image Segmentation", arXiv, 2023 (CAS). [Paper]
    • LGFormer: "Linguistic Query-Guided Mask Generation for Referring Image Segmentation", arXiv, 2023 (Alibaba). [Paper]
    • RISCLIP: "RISCLIP: Referring Image Segmentation Framework using CLIP", arXiv, 2023 (POSTECH). [Paper]
    • EAVL: "EAVL: Explicitly Align Vision and Language for Referring Image Segmentation", arXiv, 2023 (CAS). [Paper]
    • Ref-Diff: "Ref-Diff: Zero-shot Referring Image Segmentation with Generative Models", arXiv, 2023 (Harbin Institute of Technology). [Paper][Code (in construction)]
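
    A toy sketch of the cross-modal fusion pattern shared by many entries above: pixel features attend to word features, then a sentence embedding reads out a per-pixel mask. All shapes and modules are illustrative stand-ins, not any specific paper's architecture.

    ```python
    import torch
    import torch.nn as nn

    class TextConditionedMaskHead(nn.Module):
        def __init__(self, dim=256, heads=8):
            super().__init__()
            self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.proj = nn.Linear(dim, dim)

        def forward(self, vis, words, sent):
            # vis: (B, H*W, C) pixels; words: (B, L, C); sent: (B, C) sentence feature
            fused, _ = self.cross_attn(query=vis, key=words, value=words)
            fused = self.proj(fused + vis)                  # residual fusion
            return torch.einsum("bnc,bc->bn", fused, sent)  # per-pixel text score

    B, H, W, L, C = 2, 28, 28, 12, 256
    head = TextConditionedMaskHead(C)
    mask = head(torch.randn(B, H * W, C), torch.randn(B, L, C), torch.randn(B, C))
    print(mask.view(B, H, W).shape)  # upsample these logits for the final mask
    ```
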
  • Referring Video Segmentation: (see the "language as queries" sketch after this list)
    • ReferFormer: "Language as Queries for Referring Video Object Segmentation", CVPR, 2022 (HKU). [Paper][PyTorch]
    • MTTR: "End-to-End Referring Video Object Segmentation with Multimodal Transformers", CVPR, 2022 (Technion - Israel Institute of Technology). [Paper][PyTorch]
    • MANet: "Multi-Attention Network for Compressed Video Referring Object Segmentation", ACMMM, 2022 (CAS). [Paper][PyTorch]
    • R2VOS: "R2VOS: Robust Referring Video Object Segmentation via Relational Multimodal Cycle Consistency", arXiv, 2022 (CMU). [Paper][PyTorch][Website]
    • OnlineRefer: "OnlineRefer: A Simple Online Baseline for Referring Video Object Segmentation", ICCV, 2023 (Megvii). [Paper][PyTorch]
    • SgMg: "Spectrum-guided Multi-granularity Referring Video Object Segmentation", ICCV, 2023 (The University of Western Australia). [Paper][PyTorch]
    • MeViS: "MeViS: A Large-scale Benchmark for Video Segmentation with Motion Expressions", ICCV, 2023 (NTU, Singapore). [Paper][PyTorch][Website]
    • CMA: "Learning Cross-Modal Affinity for Referring Video Object Segmentation Targeting Limited Samples", ICCV, 2023 (SUSTech). [Paper][PyTorch]
    • TempCD: "Temporal Collection and Distribution for Referring Video Object Segmentation", ICCV, 2023 (ShanghaiTech). [Paper][Website]
    • Locater: "Local-Global Context Aware Transformer for Language-Guided Video Segmentation", TPAMI, 2023 (Zhejiang). [Paper][PyTorch]
    • LoSh: "LoSh: Long-Short Text Joint Prediction Network for Referring Video Object Segmentation", arXiv, 2023 (King’s College London). [Paper]
    • SOC: "SOC: Semantic-Assisted Object Cluster for Referring Video Object Segmentation", arXiv, 2023 (Tsinghua). [Paper]
    • RefSAM: "RefSAM: Efficiently Adapting Segmenting Anything Model for Referring Video Object Segmentation", arXiv, 2023 (National University of Defense Technology, China). [Paper][Code (in construction)]
    • IFIRVOS: "Referring Video Object Segmentation with Inter-Frame Interaction and Cross-Modal Correlation", arXiv, 2023 (Wuhan University). [Paper]
    • LGCFS: "Learning Referring Video Object Segmentation from Weak Annotation", arXiv, 2023 (Shanghai AI Lab). [Paper]
    • EPCFormer: "EPCFormer: Expression Prompt Collaboration Transformer for Universal Referring Video Object Segmentation", arXiv, 2023 (Hunan University). [Paper][Code (in construction)]
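
    A toy sketch of the "language as queries" idea (as in ReferFormer): the sentence embedding seeds object queries that a transformer decoder refines against each frame's visual tokens. Shapes and modules are illustrative stand-ins.

    ```python
    import torch
    import torch.nn as nn

    dim, n_queries, T, HW = 256, 5, 4, 196
    decoder = nn.TransformerDecoder(
        nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True), num_layers=2
    )

    sent = torch.randn(1, dim)                           # pooled expression embedding
    queries = sent.unsqueeze(1).repeat(1, n_queries, 1)  # text-conditioned queries
    frames = torch.randn(T, HW, dim)                     # per-frame visual tokens

    masks = []
    for t in range(T):
        q = decoder(queries, frames[t : t + 1])          # refine queries on frame t
        masks.append(torch.einsum("bqc,bnc->bqn", q, frames[t : t + 1]))
    print(torch.stack(masks).shape)  # (T, 1, n_queries, HW) mask logits per frame
    ```
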
  • Referring 3D Segmentation:
    • 3D-STMN: "3D-STMN: Dependency-Driven Superpoint-Text Matching Network for End-to-End 3D Referring Expression Segmentation", arXiv, 2023 (Xiamen University). [Paper][PyTorch]
  • Tracking:
    • ModaMixer: "Divert More Attention to Vision-Language Tracking", NeurIPS, 2022 (Beijing Jiaotong University). [Paper][PyTorch]
    • TransRMOT: "Referring Multi-Object Tracking", CVPR, 2023 (Megvii). [Paper][PyTorch][Website]
    • ModaMixer: "Divert More Attention to Vision-Language Object Tracking", arXiv, 2023 (Beijing Jiaotong University). [Paper][PyTorch]
  • Analysis:
    • MM-Explainability: "Generic Attention-model Explainability for Interpreting Bi-Modal and Encoder-Decoder Transformers", ICCV, 2021 (Tel Aviv). [Paper][PyTorch]
    • ?: "Are Multimodal Transformers Robust to Missing Modality?", CVPR, 2022 (University of Delaware). [Paper]
    • VL-InterpreT: "VL-InterpreT: An Interactive Visualization Tool for Interpreting Vision-Language Transformers", CVPR (demo), 2022 (Intel). [Paper][Website][Video]
    • ?: "Understanding Attention for Vision-and-Language Tasks", International Conference on Computational Linguistics (COLING), 2022 (The University of Sydney). [Paper]
    • VL-CheckList: "VL-CheckList: Evaluating Pre-trained Vision-Language Models with Objects, Attributes and Relations", arXiv, 2022 (Zhejiang University). [Paper][Code (in construction)]
    • ?: "Is BERT Blind? Exploring the Effect of Vision-and-Language Pretraining on Visual Language Understanding", CVPR, 2023 (Tel Aviv). [Paper][PyTorch][Website]
    • Why-Prompt: "Doubly Right Object Recognition: A Why Prompt for Visual Rationales", CVPR, 2023 (Columbia). [Paper]
    • CREPE: "CREPE: Can Vision-Language Foundation Models Reason Compositionally?", CVPR, 2023 (Stanford). [Paper]
    • ZOOM: "Zero-shot Model Diagnosis", CVPR, 2023 (CMU). [Paper]
    • ?: "On the Generalization of Multi-modal Contrastive Learning", ICML, 2023 (Peking). [Paper][PyTorch]
    • ?: "Learning Concise and Descriptive Attributes for Visual Recognition", ICCV, 2023 (UCSD). [Paper]
  • Speaker Localization:
    • ?: "The Right to Talk: An Audio-Visual Transformer Approach", ICCV, 2021 (University of Arkansas). [Paper]
  • Multi-task:
    • UniT: "Transformer is All You Need: Multimodal Multitask Learning with a Unified Transformer", ICCV, 2021 (Facebook). [Paper][PyTorch][Website]
    • Pix2Seq: "A Unified Sequence Interface for Vision Tasks", NeurIPS, 2022 (Google). [Paper]
    • LAVIS: "LAVIS: A Library for Language-Vision Intelligence", arXiv, 2022 (Salesforce). [Paper][PyTorch]
    • Unified-IO: "Unified-IO: A Unified Model for Vision, Language, and Multi-Modal Tasks", ICLR, 2023 (AI2). [Paper][JAX][Website]
    • ImageBind: "ImageBind: One Embedding Space To Bind Them All", CVPR, 2023 (Meta). [Paper][PyTorch][Website] (see the binding-loss sketch after this list)
    • EgoT2: "Egocentric Video Task Translation", CVPR, 2023 (Meta). [Paper][Website]
    • VTAGML: "Vision Transformer Adapters for Generalizable Multitask Learning", ICCV, 2023 (EPFL). [Paper][Website]
    • CoCoCon: "Exposing and Addressing Cross-Task Inconsistency in Unified Vision-Language Models", arXiv, 2023 (AI2). [Paper][PyTorch][Website]
    • VisionLLM: "VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks", arXiv, 2023 (Shanghai AI Lab). [Paper][Code (in construction)]
    • ONE-PEACE: "ONE-PEACE: Exploring One General Representation Model Toward Unlimited Modalities", arXiv, 2023 (Alibaba). [Paper][PyTorch (in construction)]
    • VideoLLM: "VideoLLM: Modeling Video Sequence with Large Language Models", arXiv, 2023 (Shanghai AI Lab). [Paper][Code (in construction)]
    • i-Code-Studio: "i-Code Studio: A Configurable and Composable Framework for Integrative AI", arXiv, 2023 (Microsoft). [Paper][Code (in construction)][Website]
    • Tag2Text: "Tag2Text: Guiding Vision-Language Model via Image Tagging", arXiv, 2023 (OPPO). [Paper][PyTorch][Website]
    • RAM: "Recognize Anything: A Strong Image Tagging Model", arXiv, 2023 (OPPO). [Paper][PyTorch][Website]
    • InstructDiffusion: "InstructDiffusion: A Generalist Modeling Interface for Vision Tasks", arXiv, 2023 (Microsoft). [Paper][PyTorch][Website]
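
    A toy sketch of ImageBind-style binding: each additional modality is aligned to the image embedding space with a pairwise InfoNCE loss, so modalities never paired with each other still land in one space. The encoders here are random stand-ins.

    ```python
    import torch
    import torch.nn.functional as F

    def info_nce(a, b, temperature=0.07):
        # a, b: (B, D) paired embeddings; symmetric contrastive loss over the batch.
        a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
        logits = a @ b.T / temperature
        targets = torch.arange(a.size(0))
        return 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.T, targets))

    B, D = 8, 512
    img = torch.randn(B, D)                        # frozen image embeddings (the bind space)
    audio = torch.randn(B, D, requires_grad=True)  # trainable audio-encoder outputs
    loss = info_nce(audio, img)
    loss.backward()                                # gradients flow only into the audio side
    print(loss.item())
    ```
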
  • Language-based Video Editing:
    • M3L: "Language-based Video Editing via Multi-Modal Multi-Level Transformer", CVPR, 2022 (UCSB). [Paper]
    • Video-P2P: "Video-P2P: Video Editing with Cross-attention Control", arXiv, 2023 (CUHK). [Paper][Website]
    • FateZero: "FateZero: Fusing Attentions for Zero-shot Text-based Video Editing", arXiv, 2023 (Tencent). [Paper][PyTorch][Website]
    • Make-A-Protagonist: "Make-A-Protagonist: Generic Video Editing with An Ensemble of Experts", arXiv, 2023 (Huawei). [Paper][PyTorch][Website]
  • Video Summarization:
    • GPT2MVS: "GPT2MVS: Generative Pre-trained Transformer-2 for Multi-modal Video Summarization", ICMR, 2021 (BBC). [Paper]
    • QVHighlights: "QVHighlights: Detecting Moments and Highlights in Videos via Natural Language Queries", NeurIPS, 2021 (UNC). [Paper][PyTorch]
    • HMT: "Hierarchical Multimodal Transformer to Summarize Videos", arXiv, 2021 (Xidian University). [Paper]
    • ?: "Show Me What I Like: Detecting User-Specific Video Highlights Using Content-Based Multi-Head Attention", ACMMM, 2022 (Adobe). [Paper]
    • IV-Sum: "TL;DW? Summarizing Instructional Videos with Task Relevance & Cross-Modal Saliency", ECCV, 2022 (Google). [Paper][Website]
    • QD-DETR: "Query-Dependent Video Representation for Moment Retrieval and Highlight Detection", CVPR, 2023 (Sungkyunkwan University, Korea). [Paper][PyTorch]
    • A2Summ: "Align and Attend: Multimodal Summarization with Dual Contrastive Losses", CVPR, 2023 (Adobe). [Paper][PyTorch][Website]
    • CLC: "Collaborative Noisy Label Cleaner: Learning Scene-aware Trailers for Multi-modal Highlight Detection in Movies", CVPR, 2023 (Tencent). [Paper][Code (in construction)]
    • VideoXum: "VideoXum: Cross-modal Visual and Textural Summarization of Videos", arXiv, 2023 (OPPO). [Paper][Website]
    • MH-DETR: "MH-DETR: Video Moment and Highlight Detection with Cross-modal Transformer", arXiv, 2023 (Nanjing University). [Paper]
    • VisionaryVid: "Joint Moment Retrieval and Highlight Detection Via Natural Language Queries", arXiv, 2023 (Georgia Tech). [Paper][PyTorch]
  • Robotics:
    • CRT: "Case Relation Transformer: A Crossmodal Language Generation Model for Fetching Instructions", IROS, 2021 (Keio University). [Paper]
    • TraSeTR: "TraSeTR: Track-to-Segment Transformer with Contrastive Query for Instance-level Instrument Segmentation in Robotic Surgery", ICRA, 2022 (CUHK). [Paper]
    • VLMbench: "VLMbench: A Compositional Benchmark for Vision-and-Language Manipulation", NeurIPS (Datasets and Benchmarks), 2022 (UC Santa Cruz). [Paper][PyTorch][Website]
    • Surgical-VQLA: "Surgical-VQLA: Transformer with Gated Vision-Language Embedding for Visual Question Localized-Answering in Robotic Surgery", ICRA, 2023 (CUHK). [Paper][PyTorch]
    • ?: "Distilling Internet-Scale Vision-Language Models into Embodied Agents", ICML, 2023 (DeepMind). [Paper]
    • LIV: "LIV: Language-Image Representations and Rewards for Robotic Control", ICML, 2023 (UPenn). [Paper][PyTorch][Website]
    • PaLM-E: "PaLM-E: An Embodied Multimodal Language Model", ICML, 2023 (Google). [Paper][Website]
    • VIMA: "VIMA: General Robot Manipulation with Multimodal Prompts", ICML, 2023 (NVIDIA). [Paper][PyTorch][Website]
    • GVCCI: "GVCCI: Lifelong Learning of Visual Grounding for Language-Guided Robotic Manipulation", IROS, 2023 (SNU, Korea). [Paper]
    • LACO: "Language-Conditioned Path Planning", CoRL, 2023 (Berkeley). [Paper][Code (in construction)][Website]
    • Grounded-Decoding: "Grounded Decoding: Guiding Text Generation with Grounded Models for Robot Control", arXiv, 2023 (Google). [Paper][Website]
    • MOO: "Open-World Object Manipulation using Pre-trained Vision-Language Models", arXiv, 2023 (Google). [Paper][Website]
    • ?: "Vision-Language Models as Success Detectors", arXiv, 2023 (DeepMind). [Paper]
    • VC-1: "Where are we in the search for an Artificial Visual Cortex for Embodied Intelligence?", arXiv, 2023 (Meta). [Paper][Website]
    • HomeRobot: "HomeRobot: Open-Vocabulary Mobile Manipulation", arXiv, 2023 (Georgia Tech + Meta). [Paper][PyTorch][Website]
    • TaPA: "Embodied Task Planning with Large Language Models", arXiv, 2023 (Beijing University of Posts and Telecommunications). [Paper][PyTorch][Website]
    • VoxPoser: "VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models", arXiv, 2023 (Stanford). [Paper][Website]
    • RT-2: "RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control", arXiv, 2023 (DeepMind). [Paper][Website]
  • Multi-modal Fusion:
    • MICA: "Attention Is Not Enough: Mitigating the Distribution Discrepancy in Asynchronous Multimodal Sequence Fusion", ICCV, 2021 (Southwest Jiaotong University). [Paper]
    • IFT: "Image Fusion Transformer", arXiv, 2021 (Johns Hopkins). [Paper][PyTorch]
    • PPT: "PPT Fusion: Pyramid Patch Transformer for a Case Study in Image Fusion", arXiv, 2021 (?). [Paper]
    • TransFuse: "TransFuse: A Unified Transformer-based Image Fusion Framework using Self-supervised Learning", arXiv, 2022 (Fudan University). [Paper]
    • SwinFuse: "SwinFuse: A Residual Swin Transformer Fusion Network for Infrared and Visible Images", arXiv, 2022 (Taiyuan University of Science and Technology). [Paper]
    • ?: "Array Camera Image Fusion using Physics-Aware Transformers", arXiv, 2022 (University of Arizona). [Paper]
    • CDDFuse: "CDDFuse: Correlation-Driven Dual-Branch Feature Decomposition for Multi-Modality Image Fusion", CVPR, 2023 (ETHZ). [Paper][PyTorch]
  • Human Interaction:
    • Dyadformer: "Dyadformer: A Multi-modal Transformer for Long-Range Modeling of Dyadic Interactions", ICCVW, 2021 (Universitat de Barcelona). [Paper]
  • 3D:
    • 3DRefTransformer: "3DRefTransformer: Fine-Grained Object Identification in Real-World Scenes Using Natural Language", WACV, 2022 (KAUST). [Paper][Website]
    • EDA: "EDA: Explicit Text-Decoupling and Dense Alignment for 3D Visual and Language Learning", arXiv, 2022 (Peking University). [Paper]
    • PLA: "Language-driven Open-Vocabulary 3D Scene Understanding", CVPR, 2023 (ByteDance). [Paper][PyTorch][Website]
    • VL-SAT: "VL-SAT: Visual-Linguistic Semantics Assisted Training for 3D Semantic Scene Graph Prediction in Point Cloud", CVPR, 2023 (Beihang University). [Paper][PyTorch]
    • ConceptFusion: "ConceptFusion: Open-set Multimodal 3D Mapping", arXiv, 2023 (MIT). [Paper][Website]
    • LERF: "LERF: Language Embedded Radiance Fields", arXiv, 2023 (Berkeley). [Paper][Website]
    • CG3D: "CLIP goes 3D: Leveraging Prompt Tuning for Language Grounded 3D Recognition", arXiv, 2023 (JHU). [Paper][PyTorch][Website]
    • DiffCLIP: "DiffCLIP: Leveraging Stable Diffusion for Language Grounded 3D Classification", arXiv, 2023 (Beijing Institute of Technology). [Paper]
  • 3D Segmentation:
    • OpenScene: "OpenScene: 3D Scene Understanding with Open Vocabularies", CVPR, 2023 (Google). [Paper][PyTorch][Website]
    • PartSLIP: "PartSLIP: Low-Shot Part Segmentation for 3D Point Clouds via Pretrained Image-Language Models", CVPR, 2023 (Qualcomm). [Paper]
    • CLIP2Scene: "CLIP2Scene: Towards Label-efficient 3D Scene Understanding by CLIP", CVPR, 2023 (Shanghai AI Lab). [Paper][PyTorch]
    • PLA: "Language-driven Open-Vocabulary 3D Scene Understanding", CVPR, 2023 (ByteDance). [Paper][PyTorch][Website]
    • 3D-Highlighter: "3D Highlighter: Localizing Regions on 3D Shapes via Text Descriptions", CVPR, 2023 (University of Chicago). [Paper][PyTorch][Website]
    • CLIP-FO3D: "CLIP-FO3D: Learning Free Open-world 3D Scene Representations from 2D Dense CLIP", arXiv, 2023 (Tsinghua University). [Paper]
    • 3D-OVS: "3D Open-vocabulary Segmentation with Foundation Models", arXiv, 2023 (NTU, Singapore). [Paper][Code (in construction)]
    • OVO: "OVO: Open-Vocabulary Occupancy", arXiv, 2023 (Fudan). [Paper]
    • SAM3D: "SAM3D: Segment Anything in 3D Scenes", arXiv, 2023 (Shanghai AI Lab). [Paper][PyTorch]
    • Seal: "Segment Any Point Cloud Sequences by Distilling Vision Foundation Models", arXiv, 2023 (NTU, Singapore). [Paper][PyTorch (in construction)]
    • OpenMask3D: "OpenMask3D: Open-Vocabulary 3D Instance Segmentation", arXiv, 2023 (ETHZ). [Paper][Website (in construction)]
    • Lowis3D: "Lowis3D: Language-Driven Open-World Instance-Level 3D Scene Understanding", arXiv, 2023 (HKU). [Paper]
    • OpenIns3D: "OpenIns3D: Snap and Lookup for 3D Open-vocabulary Instance Segmentation", arXiv, 2023 (Cambridge). [Paper][Website]
  • Speech Recognition:
    • AV-HuBERT: "Robust Self-Supervised Audio-Visual Speech Recognition", arXiv, 2022 (Meta). [Paper][PyTorch]
    • ?: "Transformer-Based Video Front-Ends for Audio-Visual Speech Recognition", arXiv, 2022 (Google). [Paper]
    • AVFormer: "AVFormer: Injecting Vision into Frozen Speech Models for Zero-Shot AV-ASR", CVPR, 2023 (Google). [Paper]
    • AV-RelScore: "Watch or Listen: Robust Audio-Visual Speech Recognition with Visual Corruption Modeling and Reliability Scoring", CVPR, 2023 (KAIST). [Paper][PyTorch]
    • SynthVSR: "SynthVSR: Scaling Up Visual Speech Recognition With Synthetic Supervision", CVPR, 2023 (Meta). [Paper]
  • Emotion Recognition:
    • ?: "A Pre-trained Audio-Visual Transformer for Emotion Recognition", ICASSP, 2022 (USC). [Paper]
    • MDAN: "MDAN: Multi-level Dependent Attention Network for Visual Emotion Analysis", CVPR, 2022 (Tencent). [Paper]
    • DMD: "Decoupled Multimodal Distilling for Emotion Recognition", CVPR, 2023 (Nanjing University of Science and Technology). [Paper][PyTorch]
  • Sound Separation:
    • VoViT: "VoViT: Low Latency Graph-based Audio-Visual Voice Separation Transformer", ECCV, 2022 (Universitat Pompeu Fabra, Spain). [Paper][PyTorch][Website]
    • iQuery: "iQuery: Instruments as Queries for Audio-Visual Sound Separation", CVPR, 2023 (UCSD). [Paper][Code (in construction)]
    • VAST: "Language-Guided Audio-Visual Source Separation via Trimodal Consistency", CVPR, 2023 (Boston University). [Paper][Website]
  • Audio-Visual:
    • AV-HuBERT: "Learning Audio-Visual Speech Representation by Masked Multimodal Cluster Prediction", ICLR, 2022 (Meta). [Paper][PyTorch]
    • AVCA: "Audio-visual Generalised Zero-shot Learning with Cross-modal Attention and Language", CVPR, 2022 (University of Tubingen, Germany). [Paper][PyTorch]
    • TCaF: "Temporal and cross-modal attention for audio-visual zero-shot learning", ECCV, 2022 (University of Tubingen, Germany). [Paper][PyTorch]
    • AVA-Memory: "Audio-Visual Mismatch-Aware Video Retrieval via Association and Adjustment", ECCV, 2022 (KAIST). [Paper]
    • TVLT: "TVLT: Textless Vision-Language Transformer", NeurIPS, 2022 (UNC). [Paper][PyTorch]
    • ANGIE: "Audio-Driven Co-Speech Gesture Video Generation", NeurIPS, 2022 (CUHK). [Paper][Website]
    • MGN: "Multi-modal Grouping Network for Weakly-Supervised Audio-Visual Video Parsing", NeurIPS, 2022 (CMU + UT Austin). [Paper][PyTorch]
    • FS-RIR: "Few-Shot Audio-Visual Learning of Environment Acoustics", NeurIPS, 2022 (UT Austin). [Paper][Website]
    • u-HuBERT: "u-HuBERT: Unified Mixed-Modal Speech Pretraining And Zero-Shot Transfer to Unlabeled Modality", NeurIPS, 2022 (Meta). [Paper]
    • PC-VAE: "Multimodal Transformer for Parallel Concatenated Variational Autoencoders", NeurIPSW, 2022 (USC). [Paper]
    • AV-CAT: "Masked Lip-Sync Prediction by Audio-Visual Contextual Exploitation in Transformers", SIGGRAPH Asia, 2022 (Tokyo Institute of Technology + Baidu). [Paper][Website]
    • Audiovisual-MAE: "Audiovisual Masked Autoencoders", arXiv, 2022 (Google). [Paper]
    • MTD: "Multimodal Transformer Distillation for Audio-Visual Synchronization", arXiv, 2022 (NTU). [Paper]
    • AVE-CLIP: "AVE-CLIP: AudioCLIP-based Multi-window Temporal Transformer for Audio Visual Event Localization", WACV, 2023 (UT Austin). [Paper]
    • CLIPSep: "CLIPSep: Learning Text-queried Sound Separation with Noisy Unlabeled Videos", ICLR, 2023 (Sony). [Paper]
    • CAV-MAE: "Contrastive Audio-Visual Masked Autoencoder", ICLR, 2023 (MIT + IBM). [Paper]
    • UnAV: "Dense-Localizing Audio-Visual Events in Untrimmed Videos: A Large-Scale Benchmark and Baseline", CVPR, 2023 (Southern University of Science and Technology). [Paper][PyTorch][Website]
    • LAVISH: "Vision Transformers are Parameter-Efficient Audio-Visual Learners", CVPR, 2023 (UNC). [Paper][PyTorch][Website]
    • OneAVM: "A Unified Audio-Visual Learning Framework for Localization, Separation, and Recognition", ICML, 2023 (CMU + UW Madison). [Paper][Code (in construction)]
    • AdVerb: "AdVerb: Visually Guided Audio Dereverberation", ICCV, 2023 (Maryland). [Paper][Website]
    • GestureDiffuCLIP: "GestureDiffuCLIP: Gesture Diffusion Model with CLIP Latents", arXiv, 2023 (Peking University). [Paper]
    • MMViT: "MMViT: Multiscale Multiview Vision Transformers", arXiv, 2023 (Meta). [Paper]
    • ?: "Learning Spatial Features from Audio-Visual Correspondence in Egocentric Videos", arXiv, 2023 (Meta). [Paper]
  • Audio-Visual Localization/Segmentation:
    • AVSBench: "Audio-Visual Segmentation", ECCV, 2022 (SenseTime). [Paper][PyTorch][Website]
    • AV-SAM: "AV-SAM: Segment Anything Model Meets Audio-Visual Localization and Segmentation", arXiv, 2023 (CMU + UT Dallas). [Paper]
    • AUSS: "Hear to Segment: Unmixing the Audio to Guide the Semantic Segmentation", arXiv, 2023 (Fudan). [Paper]
    • AuTR: "Annotation-free Audio-Visual Segmentation", arXiv, 2023 (Shanghai Jiao Tong). [Paper]
    • AVSegFormer: "AVSegFormer: Audio-Visual Segmentation with Transformer", arXiv, 2023 (Nanjing University). [Paper][PyTorch]
  • Sound Localization:
    • TURN: "Towards Effective Multi-Modal Interchanges in Zero-Resource Sounding Object Localization", NeurIPS, 2022 (Zhejiang University). [Paper][PyTorch (in construction)]
    • AVGN: "Audio-Visual Grouping Network for Sound Localization from Mixtures", CVPR, 2023 (CMU). [Paper][PyTorch]
    • AVIN: "Induction Network: Audio-Visual Modality Gap-Bridging for Self-Supervised Sound Source Localization", ACMMM, 2023 (Northwestern Polytechnical University). [Paper][Code (in construction)]
  • Sentiment Analysis:
    • CubeMLP: "CubeMLP: A MLP-based Model for Multimodal Sentiment Analysis and Depression Estimation", ACMMM, 2022 (Zhejiang University). [Paper]
    • MCMulT: "Multi-scale Cooperative Multimodal Transformers for Multimodal Sentiment Analysis in Videos", arXiv, 2022 (Tencent). [Paper]
  • Named Entity Recognition:
    • FMIT: "Flat Multi-modal Interaction Transformer for Named Entity Recognition", International Conference on Computational Linguistics (COLING), 2022 (South China University of Technology). [Paper]
  • Localization via Embodied Dialog:
    • LED-Bert: "Transformer-based Localization from Embodied Dialog with Large-scale Pre-training", arXiv, 2022 (Georgia Tech). [Paper]
  • Object Captioning:
    • GRiT: "GRiT: A Generative Region-to-text Transformer for Object Understanding", arXiv, 2022 (Microsoft). [Paper][PyTorch]
  • Conversation: (see the visual-projector sketch after this list)
    • VisProg: "Visual Programming: Compositional visual reasoning without training", CVPR, 2023 (AI2). [Paper][PyTorch][Website]
    • Visual-ChatGPT: "Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models", arXiv, 2023 (Microsoft). [Paper]
    • MM-REACT: "MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action", arXiv, 2023 (Microsoft). [Paper][Code][Website]
    • Video-ChatCaptioner: "Video ChatCaptioner: Towards the Enriched Spatiotemporal Descriptions", arXiv, 2023 (KAUST). [Paper][PyTorch]
    • Chameleon: "Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models", arXiv, 2023 (UCLA + Microsoft). [Paper][PyTorch][Website]
    • MiniGPT-4: "MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models", arXiv, 2023 (KAUST). [Paper][PyTorch][Website]
    • ChatVideo: "ChatVideo: A Tracklet-centric Multimodal and Versatile Video Understanding System", arXiv, 2023 (Fudan). [Paper][Website]
    • LLaMA-Adapter: "LLaMA-Adapter: Efficient Fine-tuning of Language Models with Zero-init Attention", arXiv, 2023 (Shanghai AI Lab). [Paper][PyTorch]
    • LLaMA-Adapter-V2: "LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model", arXiv, 2023 (Shanghai AI Lab). [Paper][PyTorch]
    • Otter: "Otter: A Multi-Modal Model with In-Context Instruction Tuning", arXiv, 2023 (NTU, Singapore). [Paper][PyTorch]
    • LMEye: "LMEye: An Interactive Perception Network for Large Language Models", arXiv, 2023 (Meituan). [Paper]
    • MultiModal-GPT: "MultiModal-GPT: A Vision and Language Model for Dialogue with Humans", arXiv, 2023 (Shanghai AI Lab). [Paper][PyTorch]
    • InternGPT: "InternGPT: Solving Vision-Centric Tasks by Interacting with ChatGPT Beyond Language", arXiv, 2023 (Shanghai AI Lab). [Paper][PyTorch]
    • VideoChat: "VideoChat: Chat-Centric Video Understanding", arXiv, 2023 (Shanghai AI Lab). [Paper][PyTorch]
    • InstructBLIP: "InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning", arXiv, 2023 (Salesforce). [Paper][PyTorch]
    • ArtGPT-4: "ArtGPT-4: Artistic Vision-Language Understanding with Adapter-enhanced MiniGPT-4", arXiv, 2023 (Anhui Polytechnic University). [Paper][PyTorch]
    • EmbodiedGPT: "EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought", arXiv, 2023 (HKU). [Paper][PyTorch (in construction)][Website]
    • LaVIN: "Cheap and Quick: Efficient Vision-Language Instruction Tuning for Large Language Models", arXiv, 2023 (Xiamen University). [Paper][PyTorch][Website]
    • PandaGPT: "PandaGPT: One Model To Instruction-Follow Them All", arXiv, 2023 (Tencent). [Paper][PyTorch][Website]
    • Video-LLaMA: "Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding", arXiv, 2023 (Alibaba). [Paper][PyTorch]
    • MIMIC-IT: "MIMIC-IT: Multi-Modal In-Context Instruction Tuning", arXiv, 2023 (NTU, Singapore). [Paper][PyTorch][Website]
    • Video-ChatGPT: "Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models", arXiv, 2023 (MBZUAI). [Paper][PyTorch]
    • LAMM: "LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark", arXiv, 2023 (Shanghai AI Lab). [Paper]
    • ?: "Towards AGI in Computer Vision: Lessons Learned from GPT and Large Language Models", arXiv, 2023 (Huawei). [Paper]
    • AssistGPT: "AssistGPT: A General Multi-modal Assistant that can Plan, Execute, Inspect, and Learn", arXiv, 2023 (NUS). [Paper][Code (in construction)][Website]
    • Macaw-LLM: "Macaw-LLM: Multi-Modal Language Modeling with Image, Audio, Video, and Text Integration", arXiv, 2023 (Tencent). [Paper][PyTorch]
    • Shikra: "Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic", arXiv, 2023 (SenseTime). [Paper][Code (in construction)]
    • LLaVAR: "LLaVAR: Enhanced Visual Instruction Tuning for Text-Rich Image Understanding", arXiv, 2023 (Stanford). [Paper][PyTorch][Website]
    • Polite-Flamingo: "Visual Instruction Tuning with Polite Flamingo", arXiv, 2023 (Xiaobing.AI). [Paper]
    • Lynx: "What Matters in Training a GPT4-Style Language Model with Multimodal Inputs?", arXiv, 2023 (ByteDance). [Paper][Website]
    • GPT4RoI: "GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest", arXiv, 2023 (Shanghai AI Lab). [Paper][PyTorch]
    • SVIT: "SVIT: Scaling up Visual Instruction Tuning", arXiv, 2023 (BAAI). [Paper]
    • AmadeusGPT: "AmadeusGPT: a natural language interface for interactive animal behavioral analysis", arXiv, 2023 (EPFL). [Paper][Code (in construction)]
    • ChatSpot: "ChatSpot: Bootstrapping Multimodal LLMs via Precise Referring Instruction Tuning", arXiv, 2023 (Megvii). [Paper][Demo]
    • 3D-LLM: "3D-LLM: Injecting the 3D World into Large Language Models", arXiv, 2023 (UCLA). [Paper][PyTorch (in construction)][Website]
    • ?: "How Good is Google Bard's Visual Understanding? An Empirical Study on Open Challenges", arXiv, 2023 (ETHZ). [Paper][GitHub (in construction)]
    • MovieChat: "MovieChat: From Dense Token to Sparse Memory for Long Video Understanding", arXiv, 2023 (Zhejiang University). [Paper][PyTorch][Website]
    • AntGPT: "AntGPT: Can Large Language Models Help Long-term Action Anticipation from Videos?", arXiv, 2023 (Brown). [Paper][Website]
    • ?: "Tool Documentation Enables Zero-Shot Tool-Usage with Large Language Models", arXiv, 2023 (Google). [Paper]
    • MM-Vet: "MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities", arXiv, 2023 (Microsoft). [Paper][Code]
    • Chat-3D: "Chat-3D: Data-efficiently Tuning Large Language Model for Universal Dialogue of 3D Scenes", arXiv, 2023 (Zhejiang University). [Paper][PyTorch][Website]
    • LLaVA: "Visual Instruction Tuning", arXiv, 2023 (UW-Madison). [Paper][PyTorch][Website]
    • StableLLaVA: "StableLLaVA: Enhanced Visual Instruction Tuning with Synthesized Image-Dialogue Data", arXiv, 2023 (Tencent). [Paper][Code (in construction)]
    • PVIT: "Position-Enhanced Visual Instruction Tuning for Multimodal Large Language Models", arXiv, 2023 (Tsinghua). [Paper]
    • PointLLM: "PointLLM: Empowering Large Language Models to Understand Point Clouds", arXiv, 2023 (Shanghai AI Lab). [Paper][Code (in construction)][Website]
    • Point-Bind: "Point-Bind & Point-LLM: Aligning Point Cloud with Multi-modality for 3D Understanding, Generation, and Instruction Following", arXiv, 2023 (CUHK). [Paper][PyTorch]
    • ImageBind-LLM: "ImageBind-LLM: Multi-modality Instruction Tuning", arXiv, 2023 (Shanghai AI Lab). [Paper]
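
    A toy sketch of the visual-projector pattern shared by LLaVA / MiniGPT-4-style models: frozen vision features are mapped into the LLM's embedding space and prepended to the text tokens. All shapes and modules are illustrative stand-ins.

    ```python
    import torch
    import torch.nn as nn

    llm_dim, vis_dim, n_patches, n_text = 4096, 1024, 256, 32

    projector = nn.Linear(vis_dim, llm_dim)         # often the only module trained in stage 1
    vis_feats = torch.randn(1, n_patches, vis_dim)  # frozen ViT patch features (e.g., CLIP)
    text_embs = torch.randn(1, n_text, llm_dim)     # LLM token embeddings of the prompt

    vis_tokens = projector(vis_feats)               # (1, 256, 4096) "visual tokens"
    inputs_embeds = torch.cat([vis_tokens, text_embs], dim=1)
    # inputs_embeds would be fed to the (frozen or LoRA-tuned) LLM for next-token prediction.
    print(inputs_embeds.shape)  # torch.Size([1, 288, 4096])
    ```
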
  • Visual Reasoning:
    • BDC-Adapter: "BDC-Adapter: Brownian Distance Covariance for Better Vision-Language Reasoning", BMVC, 2023 (SUSTech). [Paper]
    • RPT: "Fine-Grained Regional Prompt Tuning for Visual Abductive Reasoning", arXiv, 2023 (A*STAR). [Paper]
    • LRR: "Look, Remember and Reason: Visual Reasoning with Grounded Rationales", arXiv, 2023 (Qualcomm). [Paper]
    • SDS-CLIP: "Augmenting CLIP with Improved Visio-Linguistic Reasoning", arXiv, 2023 (Maryland). [Paper]
    • ?: "Towards Grounded Visual Spatial Reasoning in Multi-Modal Vision Language Models", arXiv, 2023 (George Mason University). [Paper]
  • Tracking:
    • JointNLT: "Joint Visual Grounding and Tracking with Natural Language Specification", CVPR, 2023 (Harbin Institute of Technology). [Paper][PyTorch]
    • MMTrack: "Towards Unified Token Learning for Vision-Language Tracking", arXiv, 2023 (Guangxi Normal University). [Paper]
  • Scene Graph:
    • CaCao: "Visually-Prompted Language Model for Fine-Grained Scene Graph Generation in an Open World", arXiv, 2023 (Zhejiang University). [Paper]
  • Egocentric Video:
    • MMG-Ego4D: "MMG-Ego4D: Multi-Modal Generalization in Egocentric Action Recognition", CVPR, 2023 (Meta). [Paper]
    • EgoTV: "EgoTV: Egocentric Task Verification from Natural Language Task Descriptions", arXiv, 2023 (Meta). [Paper]
  • Conceptual Understanding:
    • ?: "Text-To-Concept (and Back) via Cross-Model Alignment", ICML, 2023 (Maryland). [Paper]
    • ?: "Probing Conceptual Understanding of Large Visual-Language Models", arXiv, 2023 (UCF + SRI). [Paper]
    • EAC: "Explain Any Concept: Segment Anything Meets Concept-Based Explanation", arXiv, 2023 (HKUST). [Paper]
  • Model Merging:
    • VL-merging: "An Empirical Study of Multimodal Model Merging", arXiv, 2023 (Microsoft). [Paper][PyTorch]
  • Visual Word Sense Disambiguation (VWSD):
    • CADG: "Vision Meets Definitions: Unsupervised Visual Word Sense Disambiguation Incorporating Gloss Information", ACL, 2023 (UMass). [Paper]
  • Object Hallucination:
    • POPE: "Evaluating Object Hallucination in Large Vision-Language Models", arXiv, 2023 (Renmin University of China). [Paper][Code (in construction)]
  • Social Interaction:
    • HIINT: "HIINT: Historical, Intra- and Inter-personal Dynamics Modeling with Cross-person Memory Transformer", arXiv, 2023 (MIT). [Paper]
  • Evaluation:
    • Perception-Test: "Perception Test: A Diagnostic Benchmark for Multimodal Video Models", arXiv, 2023 (DeepMind). [Paper][GitHub]
    • VLM-Probing: "Scalable Performance Analysis for Vision-Language Models", Joint Conference on Lexical and Computational Semantics (*SEM), 2023 (UMich). [Paper][PyTorch]
    • VisualGPTScore: "VisualGPTScore: Visio-Linguistic Reasoning with Multimodal Generative Pre-Training Scores", arXiv, 2023 (CMU). [Paper][Code (in construction)][Website]
    • LVLM-eHub: "LVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models", arXiv, 2023 (Shanghai AI Lab). [Paper][PyTorch (in construction)]
    • VisoGender: "VisoGender: A dataset for benchmarking gender bias in image-text pronoun resolution", arXiv, 2023 (Oxford). [Paper][PyTorch]
    • MME: "MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models", arXiv, 2023 (Tencent). [Paper][Code (in construction)]
    • MMBench: "MMBench: Is Your Multi-modal Model an All-around Player?", arXiv, 2023 (Shanghai AI Lab). [Paper][Website]
    • Tiny-LVLM-eHub: "Tiny LVLM-eHub: Early Multimodal Experiments with Bard", arXiv, 2023 (Shanghai AI Lab). [Paper][PyTorch][Website]
    • VisIT-Bench: "VisIT-Bench: A Benchmark for Vision-Language Instruction Following Inspired by Real-World Use", arXiv, 2023 (UW). [Paper][Website]
    • MODE: "An Examination of the Compositionality of Large Generative Vision-Language Models", arXiv, 2023 (HKUST). [Paper]
    • TouchStone: "TouchStone: Evaluating Vision-Language Models by Language Models", arXiv, 2023 (Alibaba). [Paper]
  • Robustness: (see the PGD probe sketch after this list)
    • Hierarchy-CLIP: "Improving Zero-shot Generalization and Robustness of Multi-modal Models", CVPR, 2023 (Google). [Paper][JAX][Website]
    • ?: "Mitigating Spurious Correlations in Multi-modal Models during Fine-tuning", ICML, 2023 (UCLA). [Paper]
    • SGA: "Set-level Guidance Attack: Boosting Adversarial Transferability of Vision-Language Pre-training Models", ICCV, 2023 (Southern University of Science and Technology). [Paper]
    • AttackVLM: "On Evaluating Adversarial Robustness of Large Vision-Language Models", arXiv, 2023 (Singapore University of Technology and Design (SUTD)). [Paper][PyTorch (in construction)]
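
    A toy PGD sketch in the spirit of the adversarial probes above: perturb an image within an L∞ ball to push its embedding away from the clean one. The encoder is a random stand-in, not any paper's setup.

    ```python
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 128))  # stand-in encoder
    x = torch.rand(1, 3, 32, 32)
    eps, alpha, steps = 8 / 255, 2 / 255, 10

    with torch.no_grad():
        target = F.normalize(encoder(x), dim=-1)    # clean embedding

    x_adv = x.clone()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        emb = F.normalize(encoder(x_adv), dim=-1)
        sim = (emb * target).sum()                  # cosine similarity to clean embedding
        grad = torch.autograd.grad(sim, x_adv)[0]
        with torch.no_grad():
            x_adv = x_adv - alpha * grad.sign()           # step down the similarity
            x_adv = x + (x_adv - x).clamp(-eps, eps)      # project into the L_inf ball
            x_adv = x_adv.clamp(0, 1)
    print(F.cosine_similarity(encoder(x), encoder(x_adv)).item())
    ```
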
  • Compositional Reasoning:
    • DAC: "Dense and Aligned Captions (DAC) Promote Compositional Reasoning in VL Models", arXiv, 2023 (IBM). [Paper]
    • SugarCrepe: "SugarCrepe: Fixing Hackable Benchmarks for Vision-Language Compositionality", arXiv, 2023 (AI2). [Paper][PyTorch]
  • Vocabulary-free Image Classification (VIC):
    • CaSED: "Vocabulary-free Image Classification", arXiv, 2023 (University of Trento, Italy). [Paper][PyTorch]
  • Retrieval-Augmented Methods:
    • ?: "Improving Image Recognition by Retrieving from Web-Scale Image-Text Data", CVPR, 2023 (Google). [Paper]
  • NeRF:
    • NeRDi: "NeRDi: Single-View NeRF Synthesis with Language-Guided Diffusion as General Image Priors", CVPR, 2023 (Waymo). [Paper]
  • Model Selection:
    • LOVM: "LOVM: Language-Only Vision Model Selection", arXiv, 2023 (Stanford). [Paper]
  • Multimodal Interaction:
    • ?: "Learning Unseen Modality Interaction", arXiv, 2023 (University of Amsterdam). [Paper]
  • Multimodal Translation:
    • CLIPTrans: "CLIPTrans: Transferring Visual Knowledge with Pre-trained Models for Multimodal Machine Translation", ICCV, 2023 (Boston College). [Paper][PyTorch]

[Back to Overview]


References