ModernAI: Awesome Modern Artificial Intelligence

🔥 Hot updates in progress ...

Large Model Evolutionary Graph

LLM
MLLM (LLaMA-based)

Survey

  1. Agent AI: Surveying the Horizons of Multimodal Interaction [arXiv 2401] [paper]
  2. MM-LLMs: Recent Advances in MultiModal Large Language Models [arXiv 2401] [paper]

Large Language Model (LLM)

  1. OLMo: Accelerating the Science of Language Models [arXiv 2402] [paper] [code]
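
A minimal text-generation sketch with the Hugging Face `transformers` causal-LM API is shown below. The `allenai/OLMo-7B` checkpoint name is an assumption (early OLMo releases may additionally require the `ai2-olmo` package); any other causal-LM checkpoint loads the same way.

```python
# Minimal causal-LM generation sketch with Hugging Face transformers.
# "allenai/OLMo-7B" is an assumed checkpoint name; swap in any causal-LM checkpoint.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "allenai/OLMo-7B"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto", trust_remote_code=True
)

inputs = tokenizer("Language modeling is", return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=64, do_sample=True, top_p=0.9)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```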

Chinese Large Language Model (CLLM)

  1. https://github.com/LinkSoul-AI/Chinese-Llama-2-7b
  2. https://github.com/ymcui/Chinese-LLaMA-Alpaca-2
  3. https://github.com/LlamaFamily/Llama2-Chinese

Large Vision Backbone

  1. AIM: Scalable Pre-training of Large Autoregressive Image Models [arXiv 2401] [paper] [code]

Large Vision Model (LVM)

  1. Sequential Modeling Enables Scalable Learning for Large Vision Models [arXiv 2312] [paper] [code] (💥Visual GPT Time?)

Large Vision-Language Model (VLM)

  1. UMG-CLIP: A Unified Multi-Granularity Vision Generalist for Open-World Understanding [arXiv 2401] [paper] [code]

Vision Foundation Model (VFM)

  1. SAM: Segment Anything Model [ICCV 2023 Best Paper Honorable Mention] [paper] [code] (see the usage sketch after this list)
  2. SSA: Semantic segment anything [github 2023] [paper] [code]
  3. SEEM: Segment Everything Everywhere All at Once [arXiv 2304] [paper] [code]
  4. RAM: Recognize Anything - A Strong Image Tagging Model [arXiv 2306] [paper] [code]
  5. Semantic-SAM: Segment and Recognize Anything at Any Granularity [arXiv 2307] [paper] [code]
  6. UNINEXT: Universal Instance Perception as Object Discovery and Retrieval [CVPR 2023] [paper] [code]
  7. APE: Aligning and Prompting Everything All at Once for Universal Visual Perception [arXiv 2312] [paper] [code]
  8. GLEE: General Object Foundation Model for Images and Videos at Scale [arXiv 2312] [paper] [code]
  9. OMG-Seg: Is One Model Good Enough For All Segmentation? [arXiv 2401] [paper] [code](https://github.com/lxtGH/OMG-Seg)
  10. Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data [arXiv 2401] [paper] [code](https://github.com/LiheYoung/Depth-Anything)
  11. ClipSAM: CLIP and SAM Collaboration for Zero-Shot Anomaly Segmentation [arXiv 2401] [paper] [code](https://github.com/Lszcoding/ClipSAM)
  12. PA-SAM: Prompt Adapter SAM for High-Quality Image Segmentation [arXiv 2401] [paper] [code](https://github.com/xzz2/pa-sam)
  13. YOLO-World: Real-Time Open-Vocabulary Object Detection [arXiv 2401] [paper] [code](https://github.com/AILab-CVC/YOLO-World)
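
As referenced above for SAM, here is a minimal promptable-segmentation sketch, assuming the official `segment-anything` package and a locally downloaded ViT-H checkpoint:

```python
# Minimal sketch of promptable segmentation with SAM, assuming the official
# `segment-anything` package and a downloaded ViT-H checkpoint.
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")  # assumed local path
predictor = SamPredictor(sam)

image = np.zeros((480, 640, 3), dtype=np.uint8)  # replace with a real RGB image (H, W, 3)
predictor.set_image(image)

# One positive point prompt at pixel (x=320, y=240); label 1 = foreground.
masks, scores, _ = predictor.predict(
    point_coords=np.array([[320, 240]]),
    point_labels=np.array([1]),
    multimask_output=True,  # return 3 candidate masks ranked by predicted IoU
)
print(masks.shape, scores)  # (3, H, W) boolean masks and their quality scores
```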

Multimodal Large Language Model (MLLM) / Large Multimodal Model (LMM)

| Model | Vision Encoder | Projector | LLM | OKVQA | GQA | VSR | IconVQA | VizWiz | HM | VQAv2 | SQA-I | VQA-T | POPE | MME-P | MME-C | MMB | MMB-CN | SEED-I | LLaVA-W | MM-Vet | QBench |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MiniGPT-v2 | EVA-Clip-g | Linear | LLaMA-2-7B | 56.9² | 60.3 | 60.6² | 47.7² | 32.9 | 58.2² |
| MiniGPT-v2-Chat | EVA-Clip-g | Linear | LLaMA-2-7B | 57.8¹ | 60.1 | 62.9¹ | 51.5¹ | 53.6 | 58.8¹ |
| Qwen-VL-Chat |  |  | Qwen-7B |  | 57.5 |  |  | 38.9 |  | 78.2 | 68.2 | 61.5 |  | 1487.5 | 360.7² | 60.6 | 56.7 | 58.2 |  |  |  |
| LLaVA-1.5 |  |  | Vicuna-1.5-7B |  | 62.0 |  |  | 50.0 |  | 78.5 | 66.8 | 58.2 | 85.9¹ | 1510.7 | 316.1+ | 64.3 | 58.3 | 58.6 | 63.4 | 30.5 | 58.7 |
| LLaVA-1.5 +ShareGPT4V |  |  | Vicuna-1.5-7B |  |  |  |  | 57.2 |  | 80.6² | 68.4 |  |  | 1567.4² | 376.4¹ | 68.8 | 62.2 | 69.7¹ | 72.6 | 37.6 | 63.4¹ |
| LLaVA-1.5 |  |  | Vicuna-1.5-13B |  | 63.3¹ |  |  | 53.6 |  | 80.0 | 71.6 | 61.3 | 85.9¹ | 1531.3 | 295.4+ | 67.7 | 63.6 | 61.6 | 70.7 | 35.4 | 62.1² |
| VILA-7B |  |  | LLaMA-2-7B |  | 62.3 |  |  | 57.8 |  | 79.9 | 68.2 | 64.4 | 85.5² | 1533.0 |  | 68.9 | 61.7 | 61.1 | 69.7 | 34.9 |  |
| VILA-13B |  |  | LLaMA-2-13B |  | 63.3¹ |  |  | 60.6² |  | 80.8¹ | 73.7¹ | 66.6¹ | 84.2 | 1570.1¹ |  | 70.3² | 64.3² | 62.8² | 73.0² | 38.8² |  |
| VILA-13B +ShareGPT4V |  |  | LLaMA-2-13B |  | 63.2² |  |  | 62.4¹ |  | 80.6² | 73.1² | 65.3² | 84.8 | 1556.5 |  | 70.8¹ | 65.4¹ | 61.4 | 78.4¹ | 45.7¹ |  |
| SPHINX |
| SPHINX-Plus |
| SPHINX-Plus-2K |
| SPHINX-MoE |
| InternVL |
| LLaVA-1.6 |

Superscripts ¹ and ² mark the best and second-best result in each column.
+ indicates ShareGPT4V's (Chen et al., 2023e) re-implemented test results.
∗ indicates that the training images of the datasets are observed during training.
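
The table above compares instruction-tuned MLLMs on common benchmarks. As a concrete starting point, here is a minimal inference sketch for one of the listed models (LLaVA-1.5), assuming the community `llava-hf/llava-1.5-7b-hf` checkpoint and a recent `transformers` release; the other entries ship their own loaders and prompt formats.

```python
# Minimal LLaVA-1.5 inference sketch via Hugging Face transformers
# (assumes the community "llava-hf/llava-1.5-7b-hf" checkpoint).
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

image = Image.new("RGB", (336, 336))  # replace with a real image
prompt = "USER: <image>\nDescribe this image briefly. ASSISTANT:"
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device, torch.float16)

output_ids = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```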

Paradigm Comparison

  1. LAVIS: A Library for Language-Vision Intelligence [ACL 2023] [paper] [code]
  2. BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models [ICML 2023] [paper] [code]
  3. InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning [arXiv 2305] [paper] [code]
  4. MiniGPT-4: Enhancing Vision-language Understanding with Advanced Large Language Models [arXiv 2304] [paper] [code]
  5. MiniGPT-v2: Large Language Model as a Unified Interface for Vision-Language Multi-task Learning [github 2310] [paper] [code]
  6. VisualGLM-6B: Chinese and English multimodal conversational language model [ACL 2022] [paper] [code]
  7. Kosmos-2: Grounding Multimodal Large Language Models to the World [arXiv 2306] [paper] [code]
  8. NExT-GPT: Any-to-Any Multimodal LLM [arXiv 2309] [paper] [code]
  9. LLaVA / LLaVA-1.5: Large Language and Vision Assistant [NeurIPS 2023] [paper] [arXiv 2310] [paper] [code]
  10. 🦉mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality [arXiv 2304] [paper] [code]
  11. 🦉mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration [arXiv 2311] [paper] [code]
  12. VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks [arXiv 2305] [paper] [code]
  13. 🦅Shikra: Unleashing Multimodal LLM’s Referential Dialogue Magic [arXiv 2306] [paper] [code]
  14. Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond [arXiv 2308] [paper] [code]
  15. LaVIT: Unified Language-Vision Pretraining in LLM with Dynamic Discrete Visual Tokenization [arXiv 2309] [paper] [code]
  16. AnyMAL: An Efficient and Scalable Any-Modality Augmented Language Model [arXiv 2309] [paper] [code]
  17. InternLM-XComposer: A Vision-Language Large Model for Advanced Text-image Comprehension and Composition [arXiv 2309] [paper] [code]
  18. MiniGPT-5: Interleaved Vision-and-Language Generation via Generative Vokens [arXiv 2310] [paper] [code]
  19. CogVLM: Visual Expert for Large Language Models [github 2310] [paper] [code]
  20. 🐦Woodpecker: Hallucination Correction for Multimodal Large Language Models [arXiv 2310] [paper] [code]
  21. SoM: Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V [arXiv 2310] [paper] [code]
  22. Ferret: Refer and Ground Anything Any-Where at Any Granularity [arXiv 2310] [paper] [code]
  23. 🦦OtterHD: A High-Resolution Multi-modality Model [arXiv 2311] [paper] [code]
  24. NExT-Chat: An LMM for Chat, Detection and Segmentation [arXiv 2311] [paper] [project]
  25. Q-Instruct: Improving Low-level Visual Abilities for Multi-modality Foundation Models [arXiv 2311] [paper] [code]
  26. InfMLLM: A Unified Framework for Visual-Language Tasks [arXiv 2311] [paper] [code]
  27. Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks (FLD-5B) [arXiv 2311] [paper] [code] [dataset]
  28. 🦁LION: Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge [arXiv 2311] [paper] [code]
  29. 🐵Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models [arXiv 2311] [paper] [code]
  30. CG-VLM: Contrastive Vision-Language Alignment Makes Efficient Instruction Learner [arXiv 2311] [paper] [code]
  31. 🐲PixelLM: Pixel Reasoning with Large Multimodal Model [arXiv 2312] [paper] [code]
  32. 🐝Honeybee: Locality-enhanced Projector for Multimodal LLM [arXiv 2312] [paper] [code]
  33. VILA: On Pre-training for Visual Language Models [arXiv 2312] [paper] [code]
  34. CogAgent: A Visual Language Model for GUI Agents [arXiv 2312] [paper] [code] (support 1120×1120 resolution)
  35. PixelLLM: Pixel Aligned Language Models [arXiv 2312] [paper] [code]
  36. 🦅Osprey: Pixel Understanding with Visual Instruction Tuning [arXiv 2312] [paper] [code]
  37. Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision, Language, Audio, and Action [arXiv 2312] [paper] [code]
  38. VistaLLM: Jack of All Tasks, Master of Many: Designing General-purpose Coarse-to-Fine Vision-Language Model [arXiv 2312] [paper] [code]
  39. Emu2: Generative Multimodal Models are In-Context Learners [arXiv 2312] [paper] [code]
  40. V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs [arXiv 2312] [paper] [code]
  41. BakLLaVA-1: BakLLaVA 1 is a Mistral 7B base augmented with the LLaVA 1.5 architecture [github 2310] [paper] [code]
  42. LEGO: Language Enhanced Multi-modal Grounding Model [arXiv 2401] [paper] [code]
  43. MMVP: Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs [arXiv 2401] [paper] [code]
  44. ModaVerse: Efficiently Transforming Modalities with LLMs [arXiv 2401] [paper] [code]
  45. MoE-LLaVA: Mixture of Experts for Large Vision-Language Models [arXiv 2401] [paper] [code]
  46. LLaVA-MoLE: Sparse Mixture of LoRA Experts for Mitigating Data Conflicts in Instruction Finetuning MLLMs [arXiv 2401] [paper] [code]
  47. 🎓InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Models [arXiv 2401] [paper] [code]
  48. MouSi: Poly-Visual-Expert Vision-Language Models [arXiv 2401] [paper] [code]
  49. Yi Vision Language Model [HF 2401]

Multimodal Small Language Model (MSLM) / Small Multimodal Model (SMM)

  1. Vary-toy: Small Language Model Meets with Reinforced Vision Vocabulary [arXiv 2401] [paper] [code]

Image Generation with MLLM

  1. Generating Images with Multimodal Language Models [NeurIPS 2023] [paper] [code]
  2. DreamLLM: Synergistic Multimodal Comprehension and Creation [arXiv 2309] [paper] [code]
  3. Guiding Instruction-based Image Editing via Multimodal Large Language Models [arXiv 2309] [paper] [code]
  4. KOSMOS-G: Generating Images in Context with Multimodal Large Language Models [arXiv 2310] [paper] [code]
  5. LLMGA: Multimodal Large Language Model based Generation Assistant [arXiv 2311] [paper] [code]

Modern Autonomous Driving (MAD)

End-to-End Solution

  1. UniAD: Planning-oriented Autonomous Driving [CVPR 2023] [paper] [code]
  2. Scene as Occupancy [arXiv 2306] [paper] [code]
  3. FusionAD: Multi-modality Fusion for Prediction and Planning Tasks of Autonomous Driving [arXiv 2308] [paper] [code]
  4. BEVGPT: Generative Pre-trained Large Model for Autonomous Driving Prediction, Decision-Making, and Planning [arXiv 2310] [paper] [code]
  5. UniVision: A Unified Framework for Vision-Centric 3D Perception [arXiv 2401] [paper] [code]

with Large Language Model

  1. Drive Like a Human: Rethinking Autonomous Driving with Large Language Models [arXiv 2307] [paper] [code]
  2. LINGO-1: Exploring Natural Language for Autonomous Driving (Vision-Language-Action Models, VLAMs) [Wayve 2309] [blog]
  3. DriveGPT4: Interpretable End-to-end Autonomous Driving via Large Language Model [arXiv 2310] [paper] [code]

Embodied AI (EAI) and Robo Agent

  1. VIMA: General Robot Manipulation with Multimodal Prompts [arXiv 2210] [paper] [code]
  2. PaLM-E: An Embodied Multimodal Language Model [arXiv 2303] [paper] [code]
  3. VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models [arXiv 2307] [CoRL 2023] [paper] [code]
  4. RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control [arXiv 2307] [paper] [project]
  5. RoboAgent: Generalization and Efficiency in Robot Manipulation via Semantic Augmentations and Action Chunking [arXiv 2309] [paper] [code]
  6. MLLM-Tool: A Multimodal Large Language Model For Tool Agent Learning [arXiv 2401] [paper] [code]

Neural Radiance Fields (NeRF)

  1. EmerNeRF: Emergent Spatial-Temporal Scene Decomposition via Self-Supervision [arXiv 2311] [paper] [code]
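
For context, a NumPy sketch of the volume-rendering quadrature that EmerNeRF and other NeRF variants build on, C ≈ Σᵢ Tᵢ (1 − exp(−σᵢ δᵢ)) cᵢ along each ray:

```python
# Sketch of NeRF-style volume rendering along one ray: composite per-sample
# densities and colors into a single pixel color.
import numpy as np

def render_ray(sigmas, colors, deltas):
    """sigmas: (S,) densities, colors: (S, 3) RGB, deltas: (S,) sample spacings."""
    alphas = 1.0 - np.exp(-sigmas * deltas)                          # per-sample opacity
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1]]))   # accumulated transmittance T_i
    weights = trans * alphas
    return (weights[:, None] * colors).sum(axis=0)                   # composited ray color

rgb = render_ray(np.random.rand(64), np.random.rand(64, 3), np.full(64, 0.05))
print(rgb)
```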

Diffusion Model

  1. ZeroNVS: Zero-Shot 360-Degree View Synthesis from a Single Real Image [arXiv 2310] [paper] [code]
  2. Vlogger: Make Your Dream A Vlog [arXiv 2401] [paper] [code]
  3. BootPIG: Bootstrapping Zero-shot Personalized Image Generation Capabilities in Pretrained Diffusion Models [arXiv 2401] [paper] [code]
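
The entries above build on pretrained text-to-image diffusion backbones. Below is a generic `diffusers` sketch of such a backbone, not the papers' own code; the checkpoint name is an assumption.

```python
# Generic text-to-image sketch with the diffusers library.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16  # assumed checkpoint
).to("cuda")

image = pipe(
    "a photo of a corgi wearing sunglasses",
    num_inference_steps=30,   # denoising steps
    guidance_scale=7.5,       # classifier-free guidance strength
).images[0]
image.save("corgi.png")
```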

World Model

  1. CWM: Unifying (Machine) Vision via Counterfactual World Modeling [arXiv 2306] [paper] [code]
  2. MILE: Model-Based Imitation Learning for Urban Driving [Wayve 2210] [NeurIPS 2022] [paper] [code] [blog]
  3. GAIA-1: A Generative World Model for Autonomous Driving [Wayve 2310] [arXiv 2309] [paper] [code]
  4. ADriver-I: A General World Model for Autonomous Driving [arXiv 2311] [paper] [code]
  5. OccWorld: Learning a 3D Occupancy World Model for Autonomous Driving [arXiv 2311] [paper] [code]
  6. LWM: World Model on Million-Length Video and Language with RingAttention [arXiv 2402] [paper] [code]

Artificial Intelligence Generated Content (AIGC)

Text-to-Image

Text-to-Video

  1. Sora: Video generation models as world simulators [OpenAI 2402] [technical report] (💥Visual GPT Time?)

Text-to-3D

Image-to-3D

Artificial General Intelligence (AGI)

New Method

  1. [Instruction Tuning] FLAN: Finetuned Language Models are Zero-Shot Learners [ICLR 2022] [paper] [code]
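
A sketch of the core instruction-tuning recipe behind FLAN-style methods: each task example is flattened into a prompt/target pair and the model is fine-tuned with a supervised loss on the target. The field names and template below are illustrative, not FLAN's exact templates.

```python
# Sketch of flattening an instruction-tuning example into a prompt/target pair
# before supervised fine-tuning.
def format_example(instruction: str, input_text: str, output: str) -> dict:
    prompt = instruction if not input_text else f"{instruction}\n\n{input_text}"
    return {"prompt": prompt, "target": output}

example = format_example(
    instruction="Translate the sentence to French.",
    input_text="The weather is nice today.",
    output="Il fait beau aujourd'hui.",
)
# The model is then trained with cross-entropy loss on `target`, conditioned on
# `prompt` (the loss is usually masked on the prompt tokens).
print(example)
```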

New Dataset

  1. DriveLM: Drive on Language [paper] [project]
  2. MagicDrive: Street View Generation with Diverse 3D Geometry Control [arXiv 2310] [paper] [code]
  3. Open X-Embodiment: Robotic Learning Datasets and RT-X Models [paper] [project] [blog]
  4. To See is to Believe: Prompting GPT-4V for Better Visual Instruction Tuning (LVIS-Instruct4V) [arXiv 2311] [paper] [code] [dataset]
  5. Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks (FLD-5B) [arXiv 2311] [paper] [code] [dataset]
  6. ShareGPT4V: Improving Large Multi-Modal Models with Better Captions [paper] [code] [dataset]

New Vision Backbone

  1. Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model [arXiv 2401] [paper] [code]
  2. VMamba: Visual State Space Model [arXiv 2401] [paper] [code]
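
Both backbones above replace attention with a state-space scan over token/patch sequences. A minimal NumPy sketch of the underlying discretized recurrence (real implementations make A, B, C input-dependent, i.e. selective, and use a hardware-efficient parallel scan):

```python
# Sequential form of the discretized state-space recurrence:
#   h_t = A_bar @ h_{t-1} + B_bar @ x_t,   y_t = C @ h_t
import numpy as np

def ssm_scan(x, A_bar, B_bar, C):
    """x: (T, D_in), A_bar: (N, N), B_bar: (N, D_in), C: (D_out, N)."""
    h = np.zeros(A_bar.shape[0])
    ys = []
    for x_t in x:                       # sequential scan over the sequence
        h = A_bar @ h + B_bar @ x_t     # state update
        ys.append(C @ h)                # readout
    return np.stack(ys)                 # (T, D_out)

y = ssm_scan(np.random.randn(16, 4), np.eye(8) * 0.9, np.random.randn(8, 4), np.random.randn(2, 8))
print(y.shape)  # (16, 2)
```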

Benchmark

  1. Mementos: A Comprehensive Benchmark for Multimodal Large Language Model Reasoning over Image Sequences [arXiv 2401] [paper] [code]

Platform and API

  1. SenseNova: SenseTime's open platform (商汤日日新开放平台) [url]

SOTA Downstream Task

Zero-shot Object Detection, covering Visual Grounding, Open-set, Open-vocabulary, and Open-world Detection
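
For orientation, a minimal open-vocabulary detection sketch using OWL-ViT through `transformers`; this is one representative zero-shot detector chosen for illustration, not a specific entry from this list.

```python
# Zero-shot / open-vocabulary detection sketch with OWL-ViT via transformers.
import torch
from PIL import Image
from transformers import OwlViTProcessor, OwlViTForObjectDetection

processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")

image = Image.new("RGB", (640, 480))     # replace with a real image
texts = [["a cat", "a traffic light"]]   # free-form category prompts
inputs = processor(text=texts, images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Convert raw logits/boxes into thresholded detections in pixel coordinates.
target_sizes = torch.tensor([image.size[::-1]])
results = processor.post_process_object_detection(outputs, threshold=0.1, target_sizes=target_sizes)
for score, label, box in zip(results[0]["scores"], results[0]["labels"], results[0]["boxes"]):
    print(texts[0][label], round(score.item(), 3), box.tolist())
```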