| Model | Year | Architecture | Training Data | Parameters | Vision Encoder/Tokenizer | Pretrained Backbone Model |
|---|---|---|---|---|---|---|
| GPT-5.4 / GPT-5.4 Thinking (OpenAI) | 03/06/2026 | Decoder-only | Undisclosed | Undisclosed | Undisclosed | Undisclosed |
| Phi-4-Reasoning-Vision-15B (Microsoft) | 03/04/2026 | Decoder-only | Curated synthetic + filtered data | 15B | High-res dynamic-resolution ViT | Phi-4 |
| Gemini 3.0 (Google) | 03/2026 | Unified Model | Undisclosed | Undisclosed | Undisclosed | Undisclosed |
| Qwen3.5 (Alibaba) | 02/16/2026 | Unified VL (early fusion) | Trillions of multimodal tokens | 0.8B–397B (MoE, 17B active) | ViT (native) | Qwen3.5 |
| Claude Opus 4.6 (Anthropic) | 02/2026 | Decoder-only | Undisclosed | Undisclosed | Undisclosed | Undisclosed |
| ERNIE 5.0 (Baidu) | 02/05/2026 | Unified Model (Visual, Text, Audio) | Unified Modality Dataset | - | CNN–ViT (Understanding)/Next-Frame-and-Scale Prediction (Generation) | Unified Autoregressive Transformer |
| Molmo2 (Allen AI) | 01/15/2026 | Decoder-only | 7 new video + 2 multi-image datasets (9.19M videos) | 4B / 7B / 8B | Bi-directional attention ViT | Qwen 3 / OLMo |
| Gemini 3 | 11/18/2025 | Unified Model | Undisclosed | - | - | - |
| Emu3.5 | 10/30/2025 | Decoder-only | Unified Modality Dataset | - | SigLIP | Qwen3 |
| DeepSeek-OCR | 10/20/2025 | Encoder-Decoder | 70% OCR, 20% general vision, 10% text-only | 3B | DeepEncoder | DeepSeek-3B |
| Qwen3-VL | 10/11/2025 | Decoder-only | - | 8B/4B | ViT | Qwen3 |
| Qwen3-VL-MoE | 09/25/2025 | Decoder-only | - | 235B-A22B | ViT | Qwen3 |
| Qwen3-Omni (Visual/Audio/Text) | 09/21/2025 | - | Video/Audio/Image | 30B | ViT | Qwen3-Omni-MoE-Thinker |
| LLaVA-Onevision-1.5 | 09/15/2025 | - | Mid-Training-85M & SFT | 8B | Qwen2VLImageProcessor | Qwen3 |
| InternVL3.5 | 08/25/2025 | Decoder-only | Multimodal & text-only | 30B/38B/241B | InternViT-300M/6B | Qwen3 / GPT-OSS |
| Skywork-UniPic-1.5B | 07/29/2025 | - | Image/Video | - | - | - |
| Grok 4 | 07/09/2025 | - | Image/Video | 1-2 Trillion | - | - |
| Kwai Keye-VL (Kuaishou) | 07/02/2025 | Decoder-only | Image/Video | 8B | ViT | Qwen3-8B |
| OmniGen2 | 06/23/2025 | Decoder-only & VAE | LLaVA-OneVision, SAM-LLaVA, etc. | - | ViT | Qwen2.5-VL |
| Gemini-2.5-Pro | 06/17/2025 | - | - | - | - | - |
| GPT-o3/o4-mini | 06/10/2025 | Decoder-only | Undisclosed | Undisclosed | Undisclosed | Undisclosed |
| MiMo-VL (Xiaomi) | 06/04/2025 | Decoder-only | 24 Trillion MLLM tokens | 7B | Qwen2.5-ViT | MiMo-7B-Base |
| BAGEL (ByteDance) | 05/20/2025 | Unified Model | Video/Image/Text | 7B | [SigLIP2-so400m/14](https://arxiv.org/abs/2502.14786) | Qwen2.5 |
| BLIP3-o | 05/14/2025 | Decoder-only | GPT-4o-generated image generation data (BLIP3-o 60K) | 4B/8B | ViT | Qwen2.5-VL |
| InternVL3 | 04/14/2025 | Decoder-only | 200 Billion Tokens | 1/2/8/9/14/38/78B | InternViT-300M/6B | InternLM2.5/Qwen2.5 |
| LLaMA4-Scout/Maverick | 04/04/2025 | Decoder-only | 40/20 Trillion Tokens | 17B (active) | MetaCLIP | LLaMA4 |
| Qwen2.5-Omni | 03/26/2025 | Decoder-only | Video/Audio/Image/Text | 7B | Qwen2-Audio/Qwen2.5-VL ViT | End-to-End Mini-Omni |
| Qwen2.5-VL | 01/28/2025 | Decoder-only | Image caption, VQA, grounding agent, long video | 3B/7B/72B | Redesigned ViT | Qwen2.5 |
| Ola | 2025 | Decoder-only | Image/Video/Audio/Text | 7B | OryxViT | Qwen-2.5-7B, SigLIP-400M, Whisper-V3-Large, BEATs-AS2M(cpt2) |
| Ocean-OCR | 2025 | Decoder-only | Pure Text, Caption, Interleaved, OCR | 3B | NaViT | Pretrained from scratch |
| SmolVLM | 2025 | Decoder-only | SmolVLM-Instruct | 250M & 500M | SigLIP | SmolLM |
| DeepSeek-Janus-Pro | 2025 | Decoder-only | Undisclosed | 7B | SigLIP | DeepSeek-Janus-Pro |
| Inst-IT | 2024 | Decoder-only | Inst-IT Dataset, LLaVA-NeXT-Data | 7B | CLIP/Vicuna, SigLIP/Qwen2 | LLaVA-NeXT |
| DeepSeek-VL2 | 2024 | Decoder-only | WiT, WikiHow | 4.5B x 74 | SigLIP/SAM-B | DeepSeekMoE |
| xGen-MM (BLIP-3) | 2024 | Decoder-only | MINT-1T, OBELICS, Caption | 4B | ViT + Perceiver Resampler | Phi-3-mini |
| TransFusion | 2024 | Encoder-decoder | Undisclosed | 7B | VAE Encoder | Pretrained from scratch on transformer architecture |
| Baichuan Ocean Mini | 2024 | Decoder-only | Image/Video/Audio/Text | 7B | CLIP ViT-L/14 | Baichuan |
| LLaMA 3.2-vision | 2024 | Decoder-only | Undisclosed | 11B-90B | CLIP | LLaMA-3.1 |
| Pixtral | 2024 | Decoder-only | Undisclosed | 12B | CLIP ViT-L/14 | Mistral Large 2 |
| Qwen2-VL | 2024 | Decoder-only | Undisclosed | 7B-14B | EVA-CLIP ViT-L | Qwen-2 |
| NVLM | 2024 | Encoder-decoder | LAION-115M | 8B-24B | Custom ViT | Qwen-2-Instruct |
| Emu3 | 2024 | Decoder-only | Aquila | 7B | MoVQGAN | LLaMA-2 |
| Claude 3 | 2024 | Decoder-only | Undisclosed | Undisclosed | Undisclosed | Undisclosed |
| InternVL | 2023 | Encoder-decoder | LAION-en, LAION-multi | 7B/20B | EVA-CLIP ViT-g | QLLaMA |
| InstructBLIP | 2023 | Encoder-decoder | COCO, VQAv2 | 13B | ViT | Flan-T5, Vicuna |
| CogVLM | 2023 | Encoder-decoder | LAION-2B, COYO-700M | 18B | CLIP ViT-L/14 | Vicuna |
| PaLM-E | 2023 | Decoder-only | All robots, WebLI | 562B | ViT | PaLM |
| LLaVA-1.5 | 2023 | Decoder-only | COCO | 13B | CLIP ViT-L/14 | Vicuna |
| Gemini | 2023 | Decoder-only | Undisclosed | Undisclosed | Undisclosed | Undisclosed |
| GPT-4V | 2023 | Decoder-only | Undisclosed | Undisclosed | Undisclosed | Undisclosed |
| BLIP-2 | 2023 | Encoder-decoder | COCO, Visual Genome | 7B-13B | ViT-g | Open Pretrained Transformer (OPT) |
| Flamingo | 2022 | Decoder-only | M3W, ALIGN | 80B | Custom | Chinchilla |
| BLIP | 2022 | Encoder-decoder | COCO, Visual Genome | 223M-400M | ViT-B/L/g | Pretrained from scratch |
| CLIP | 2021 | Encoder-decoder | 400M image-text pairs | 63M-355M | ViT/ResNet | Pretrained from scratch |
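The earliest entry above, CLIP, scores images against captions with a contrastive dual-tower objective: both embeddings are L2-normalized, and a temperature-scaled softmax over cosine similarities ranks the captions for each image. A minimal NumPy sketch of that scoring step (the embedding dimension, temperature value, and random inputs are illustrative):

```python
import numpy as np

def clip_style_scores(img_emb, txt_emb, temperature=0.07):
    """Softmax over temperature-scaled cosine similarities between
    image and text embeddings (CLIP-style matching, sketch only)."""
    img = img_emb / np.linalg.norm(img_emb, axis=-1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=-1, keepdims=True)
    logits = img @ txt.T / temperature            # (n_images, n_texts)
    # Softmax over captions: P(caption j | image i).
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
probs = clip_style_scores(rng.normal(size=(2, 8)), rng.normal(size=(3, 8)))
```

Training then minimizes cross-entropy of these probabilities against the matched image–caption pairs, symmetrically in both directions.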
| Dataset | Task | Size |
|---|---|---|
| OmniScience(02/2026) | Scientific Image Understanding | 1.5M figure-caption-context triplets |
| MaD-Mix(02/2026) | Multi-modal Data Mixture Optimization | Framework (0.5B–7B scale) |
| OVID(2026) | Open Video Pre-training | 10M hours, 300M frame-caption pairs |
| Molmo2 Video Datasets(01/2026) | Video Captions, QA, Tracking, Pointing | 9.19M videos (7 video + 2 multi-image datasets) |
| MMFineReason(01/30/2026) | Reasoning | 1.8M |
| FineVision(09/04/2025) | Mixed Domain | 24.3M / 4.48 TB |
| Dataset | Task | Eval Protocol | Annotators | Size (K) | Code / Site |
|---|---|---|---|---|---|
| MathVision | Visual Math | MC / Answer Match | Human | 3.04 | Repo |
| MathVista | Visual Math | MC / Answer Match | Human | 6 | Repo |
| MathVerse | Visual Math | MC | Human | 4.6 | Repo |
| VisNumBench | Visual Number Reasoning | MC | Python Program generated/Web Collection/Real life photos | 1.91 | Repo |
| Dataset | Task | Eval Protocol | Annotators | Size (K) | Code / Site |
|---|---|---|---|---|---|
| ROVER | Reciprocal Cross-Modal Reasoning | Visual Gen + Verbal Gen Eval | Human | 1.3 (1,876 images) | Paper |
| RealUnify | Math, World knowledge, Image Gen | Direct & Stepwise Eval (Sec 3.3) | Script & Human verification | 1.0 | |
| Uni-MMMU | Science, Code, Image Gen | DreamSim (Image Gen Eval) & String Matching (Understanding Eval) | - | 1.0 | Repo |
| Dataset | Task | Eval Protocol | Annotators | Size (K) | Code / Site |
|---|---|---|---|---|---|
| MMOU | Omni-modal Long Video Understanding | MC | Human | 15 (9,038 videos) | Paper |
| Video-MMMU | Knowledge Acquisition from Professional Videos | MC + Knowledge Gain | Expert | 0.9 (300 videos) | Paper |
| MMVU | Expert-Level Multi-Discipline Video Understanding | MC | Expert | 3 (27 subjects) | Paper |
| VideoHallu | Video Understanding | LLM Eval | Human | 3.2 | |
| Video SimpleQA | Video Understanding | LLM Eval | Human | 2.03 | Repo |
| MovieChat | Video Understanding | LLM Eval | Human | 1 | Repo |
| Perception‑Test | Video Understanding | MC | Crowd | 11.6 | Repo |
| VideoMME | Video Understanding | MC | Experts | 2.7 | Site |
| EgoSchema | Video Understanding | MC | Synth / Human | 5 | Site |
| Inst‑IT‑Bench | Fine‑grained Image & Video | MC & LLM | Human / Synth | 2 | Repo |
| Dataset | Task | Eval Protocol | Annotators | Size (K) | Code / Site |
|---|---|---|---|---|---|
| VisionArena | Multimodal Conversation | Pairwise Pref | Human | 23 | Repo |
| Dataset | Task | Eval Protocol | Annotators | Size (K) | Code / Site |
|---|---|---|---|---|---|
| OmniEarth | Geospatial / Remote Sensing VLM Eval | MC + Open VQA | Human (verified) | 44.2 (9,275 images, 28 tasks) | Paper |
| MultiHaystack | Multimodal Retrieval & Reasoning | Retrieval + QA | Human | 0.75 (46K+ candidates) | |
| DatBench | Discriminative, Faithful VLM Eval | MC (format-aware) | Synth | - | |
| MMLU | General MM | MC | Human | 15.9 | |
| MMStar | General MM | MC | Human | 1.5 | Site |
| NaturalBench | General MM | Yes/No, MC | Human | 10 | HF |
| PHYSBENCH | Visual Math Reasoning | MC | Grad STEM | 0.10 | Repo |
| Dataset | Task | Eval Protocol | Annotators | Size (K) | Code / Site |
|---|---|---|---|---|---|
| EMMA | Visual Reasoning | MC | Human + Synth | 2.8 | Repo |
| MMTBENCH | Visual Reasoning & QA | MC | AI Experts | 30.1 | Repo |
| MM‑Vet | OCR / Visual Reasoning | LLM Eval | Human | 0.2 | Repo |
| MM‑En/CN | Multilingual MM Understanding | MC | Human | 3.2 | Repo |
| GQA | Visual Reasoning & QA | Answer Match | Seed + Synth | 22 | Site |
| VCR | Visual Reasoning & QA | MC | MTurks | 290 | Site |
| VQAv2 | Visual Reasoning & QA | Yes/No, Ans Match | MTurks | 1100 | Repo |
| MMMU | Visual Reasoning & QA | Ans Match, MC | College | 11.5 | Site |
| MMMU-Pro | Visual Reasoning & QA | Ans Match, MC | College | 5.19 | Site |
| R1‑Onevision | Visual Reasoning & QA | MC | Human | 155 | Repo |
| VLM²‑Bench | Visual Reasoning & QA | Ans Match, MC | Human | 3 | Site |
| VisualWebInstruct | Visual Reasoning & QA | LLM Eval | Web | 0.9 | Site |
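Most benchmarks in the tables above use "MC" or "Answer Match" protocols, which reduce to normalizing a free-form prediction and comparing it against the gold answer. A minimal sketch of such a scorer (the normalization rules are illustrative; each benchmark defines its own):

```python
import re

def normalize(ans: str) -> str:
    """Lowercase, strip punctuation and articles -- a common
    answer-match normalization (illustrative, not any one benchmark's)."""
    ans = ans.lower().strip()
    ans = re.sub(r"[^\w\s]", "", ans)             # drop punctuation
    ans = re.sub(r"\b(a|an|the)\b", " ", ans)     # drop articles
    return " ".join(ans.split())

def answer_match_accuracy(preds, golds):
    """Fraction of predictions that exactly match gold after normalization."""
    hits = sum(normalize(p) == normalize(g) for p, g in zip(preds, golds))
    return hits / len(golds)

acc = answer_match_accuracy(["The cat.", "4"], ["cat", "5"])  # -> 0.5
```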
| Dataset | Task | Eval Protocol | Annotators | Size (K) | Code / Site |
|---|---|---|---|---|---|
| MSCOCO‑30K | Text‑to‑Image | BLEU, ROUGE, Sim | MTurks | 30 | Site |
| GenAI‑Bench | Text‑to‑Image | Human Rating | Human | 80 | HF |
| Dataset | Task | Eval Protocol | Annotators | Size (K) | Code / Site |
|---|---|---|---|---|---|
| HallusionBench | Hallucination | Yes/No | Human | 1.13 | Repo |
| POPE | Hallucination | Yes/No | Human | 9 | Repo |
| CHAIR | Hallucination | Yes/No | Human | 124 | Repo |
| MHalDetect | Hallucination | Ans Match | Human | 4 | Repo |
| Hallu‑Pi | Hallucination | Ans Match | Human | 1.26 | Repo |
| HallE‑Control | Hallucination | Yes/No | Human | 108 | Repo |
| AutoHallusion | Hallucination | Ans Match | Synth | 3.129 | Repo |
| BEAF | Hallucination | Yes/No | Human | 26 | Site |
| GAIVE | Hallucination | Ans Match | Synth | 320 | Repo |
| Hal-Eval | Hallucination | Yes/No | Crowd / Synth | 2 | Repo |
| AMBER | Hallucination | Ans Match | Human | 15.22 | Repo |
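CHAIR (listed above) quantifies object hallucination as the fraction of objects mentioned in a generated caption that do not appear in the ground-truth object set. A toy sketch of the per-instance metric, assuming object mentions have already been extracted from the caption upstream (the example objects are made up):

```python
def chair_i(caption_objects, gt_objects):
    """CHAIR_i: fraction of mentioned objects absent from the
    ground-truth set (sketch of the metric's definition only)."""
    mentioned = set(caption_objects)
    hallucinated = mentioned - set(gt_objects)
    return len(hallucinated) / max(len(mentioned), 1)

# "car" is not in the ground truth, so 1 of 3 mentions is hallucinated.
score = chair_i(["dog", "frisbee", "car"], ["dog", "frisbee", "person"])
```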
| Benchmark | Domain | Type | Project |
|---|---|---|---|
| Drive-Bench | Embodied AI | Autonomous Driving | Website |
| Habitat, Habitat 2.0, Habitat 3.0 | Robotics (Navigation) | Simulator + Dataset | Website |
| Gibson | Robotics (Navigation) | Simulator + Dataset | Website, Github Repo |
| iGibson1.0, iGibson2.0 | Robotics (Navigation) | Simulator + Dataset | Website, Document |
| Isaac Gym | Robotics (Navigation) | Simulator | Website, Github Repo |
| Isaac Lab | Robotics (Navigation) | Simulator | Website, Github Repo |
| AI2THOR | Robotics (Navigation) | Simulator | Website, Github Repo |
| ProcTHOR | Robotics (Navigation) | Simulator + Dataset | Website, Github Repo |
| VirtualHome | Robotics (Navigation) | Simulator | Website, Github Repo |
| ThreeDWorld | Robotics (Navigation) | Simulator | Website, Github Repo |
| VIMA-Bench | Robotics (Manipulation) | Simulator | Website, Github Repo |
| VLMbench | Robotics (Manipulation) | Simulator | Github Repo |
| CALVIN | Robotics (Manipulation) | Simulator | Website, Github Repo |
| GemBench | Robotics (Manipulation) | Simulator | Website, Github Repo |
| WebArena | Web Agent | Simulator | Website, Github Repo |
| UniSim | Robotics (Manipulation) | Generative Model, World Model | Website |
| GAIA-1 | Robotics (Autonomous Driving) | Generative Model, World Model | Website |
| LWM | Embodied AI | Generative Model, World Model | Website, Github Repo |
| Genesis | Embodied AI | Generative Model, World Model | Github Repo |
| EMMOE | Embodied AI | Generative Model, World Model | Paper |
| RoboGen | Embodied AI | Generative Model, World Model | Website |
| UnrealZoo | Embodied AI (Tracking, Navigation, Multi Agent) | Simulator | Website |
| Title | Year | Paper | RL | Code |
|---|---|---|---|---|
| wDPO: Winsorized Direct Preference Optimization for Robust Alignment | 03/2026 | Paper | wDPO | - |
| f-GRPO and Beyond: Divergence-Based RL for General LLM Alignment | 02/2026 | Paper | f-GRPO / f-HAL | |
| From Sight to Insight: Improving Visual Reasoning of MLLMs via Reinforcement Learning | 01/2026 | Paper | GRPO (6 reward functions) | |
| SaFeR-VLM: Safety-Aware Reinforcement Learning for Multimodal Reasoning | 2026 (ICLR) | Paper | GRPO + safety reward | |
| SAIL-RL: Guiding MLLMs in When and How to Think via Dual-Reward RL Tuning | 11/2025 | Paper | Dual-Reward (Thinking + Judging) | |
| GIFT: Group-relative Implicit Fine Tuning Integrates GRPO with DPO and UNA | 10/2025 | Paper | GIFT (convex MSE loss) | |
| Game-RL: Synthesizing Multimodal Verifiable Game Data to Boost VLMs' General Reasoning | 10/12/2025 | Paper | GRPO | |
| Vision-Zero: Scalable VLM Self-Improvement via Strategic Gamified Self-Play | 09/29/2025 | Paper | GRPO | - |
| Vision-SR1: Self-rewarding vision-language model via reasoning decomposition | 08/26/2025 | Paper | GRPO | - |
| Group Sequence Policy Optimization | 06/24/2025 | Paper | GSPO | - |
| Visionary-R1: Mitigating Shortcuts in Visual Reasoning with Reinforcement Learning | 05/20/2025 | Paper | GRPO | - |
| VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning | 04/10/2025 | Paper | GRPO | Code |
| OpenVLThinker: An Early Exploration to Complex Vision-Language Reasoning via Iterative Self-Improvement | 03/21/2025 | Paper | GRPO | Code |
| Boosting the Generalization and Reasoning of Vision Language Models with Curriculum Reinforcement Learning | 03/10/2025 | Paper | GRPO | Code |
| OmniAlign-V: Towards Enhanced Alignment of MLLMs with Human Preference | 2025 | Paper | DPO | Code |
| Multimodal Open R1/R1-Multimodal-Journey | 2025 | - | GRPO | Code |
| R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization | 2025 | Paper | GRPO | Code |
| Agent-R1: Training Powerful LLM Agents with End-to-End Reinforcement Learning | 2025 | - | PPO/REINFORCE++/GRPO | Code |
| MM-Eureka: Exploring Visual Aha Moment with Rule-based Large-scale Reinforcement Learning | 2025 | Paper | REINFORCE Leave-One-Out (RLOO) | Code |
| MM-RLHF: The Next Step Forward in Multimodal LLM Alignment | 2025 | Paper | DPO | Code |
| LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through Two-Stage Rule-Based RL | 2025 | Paper | PPO | Code |
| Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models | 2025 | Paper | GRPO | Code |
| Unified Reward Model for Multimodal Understanding and Generation | 2025 | Paper | DPO | Code |
| Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step | 2025 | Paper | DPO | Code |
| All Roads Lead to Likelihood: The Value of Reinforcement Learning in Fine-Tuning | 2025 | Paper | Online RL | - |
| Video-R1: Reinforcing Video Reasoning in MLLMs | 2025 | Paper | GRPO | Code |
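GRPO, the dominant algorithm in the table above, drops PPO's learned value critic: several responses are sampled for the same prompt, each is scored by a reward function, and the advantage of each response is its reward's z-score within the group. A minimal sketch of that advantage computation (the toy rewards are illustrative):

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: z-score each reward within its group
    (one group = several sampled responses to the same prompt)."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Two of four sampled responses earned reward 1, two earned 0.
adv = group_relative_advantages([1.0, 0.0, 1.0, 0.0])  # -> ~[1, -1, 1, -1]
```

These advantages then weight a clipped policy-gradient objective, as in PPO, but without training a separate value network.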
| Title | Year | Paper | Website | Code |
|---|---|---|---|---|
| MERGETUNE: Continued Fine-Tuning of Vision-Language Models | 2026/01 (ICLR 2026) | Paper | - | - |
| Mask Fine-Tuning (MFT): Unlocking Hidden Capabilities in Vision-Language Models | 2025/12 | Paper | - | |
| Image-LoRA: Towards Minimal Fine-Tuning of VLMs | 2025/12 | Paper | - | |
| Reassessing the Role of Supervised Fine-Tuning: An Empirical Study in VLM Reasoning | 2025/12 | Paper | - | |
| Eagle 2.5: Boosting Long-Context Post-Training for Frontier Vision-Language Models | 2025/04/21 | Paper | Website | |
| OMNICAPTIONER: One Captioner to Rule Them All | 2025/04/09 | Paper | Website | Code |
| Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning | 2024 | Paper | Website | Code |
| LLaVolta: Efficient Multi-modal Models via Stage-wise Visual Context Compression | 2024 | Paper | Website | Code |
| ViTamin: Designing Scalable Vision Models in the Vision-Language Era | 2024 | Paper | Website | Code |
| Espresso: High Compression For Rich Extraction From Videos for Your Vision-Language Model | 2024 | Paper | - | - |
| Should VLMs be Pre-trained with Image Data? | 2025 | Paper | - | - |
| VisionArena: 230K Real World User-VLM Conversations with Preference Labels | 2024 | Paper | - | Code |
| Title | Year | Paper | Website | Code |
|---|---|---|---|---|
| EvoPrompt: Evolving Prompt Adaptation for Vision-Language Models | 2026/03 | Paper | - | - |
| MMLoP: Multi-Modal Low-Rank Prompting for Efficient Vision-Language Adaptation | 2026/02 | Paper | - | |
| Multimodal Prompt Optimizer (MPO): Joint Optimization of Multimodal Prompts | 2025/10 | Paper | - | |
| Evolutionary Prompt Optimization Discovers Emergent Multimodal Reasoning Strategies | 2025/03 | Paper | - | |
| In-Context Edit: Enabling Instructional Image Editing with In-Context Generation in Large Scale Diffusion Transformer | 2025/04/30 | Paper | Website | - |
| Title | Year | Paper Link |
|---|---|---|
| Aligning Cyber Space with Physical World: A Comprehensive Survey on Embodied AI | 2024 | Paper |
| ScreenAI: A Vision-Language Model for UI and Infographics Understanding | 2024 | Paper |
| ChartLlama: A Multimodal LLM for Chart Understanding and Generation | 2023 | Paper |
| SciDoc2Diagrammer-MAF: Towards Generation of Scientific Diagrams from Documents guided by Multi-Aspect Feedback Refinement | 2024 | Paper |
| Training a Vision Language Model as Smartphone Assistant | 2024 | Paper |
| ScreenAgent: A Vision-Language Model-Driven Computer Control Agent | 2024 | Paper |
| Embodied Vision-Language Programmer from Environmental Feedback | 2024 | Paper |
| VLMs Play StarCraft II: A Benchmark and Multimodal Decision Method | 2025 | Paper |
| MP-GUI: Modality Perception with MLLMs for GUI Understanding | 2025 | Paper |
| Title | Year | Paper | Website | Code |
|---|---|---|---|---|
| GPT4Motion: Scripting Physical Motions in Text-to-Video Generation via Blender-Oriented GPT Planning | 2023 | 📄 Paper | 🌍 Website | 💾 Code |
| Spurious Correlation in Multimodal LLMs | 2025 | 📄 Paper | - | - |
| WeGen: A Unified Model for Interactive Multimodal Generation as We Chat | 2025 | 📄 Paper | - | 💾 Code |
| VideoMind: A Chain-of-LoRA Agent for Long Video Reasoning | 2025 | 📄 Paper | 🌍 Website | 💾 Code |
| Title | Year | Paper | Website | Code |
|---|---|---|---|---|
| AHA: A Vision-Language-Model for Detecting and Reasoning Over Failures in Robotic Manipulation | 2024 | 📄 Paper | 🌍 Website | - |
| SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities | 2024 | 📄 Paper | 🌍 Website | - |
| Vision-language model-driven scene understanding and robotic object manipulation | 2024 | 📄 Paper | - | - |
| Guiding Long-Horizon Task and Motion Planning with Vision Language Models | 2024 | 📄 Paper | 🌍 Website | - |
| AutoTAMP: Autoregressive Task and Motion Planning with LLMs as Translators and Checkers | 2023 | 📄 Paper | 🌍 Website | - |
| VLM See, Robot Do: Human Demo Video to Robot Action Plan via Vision Language Model | 2024 | 📄 Paper | - | - |
| Scalable Multi-Robot Collaboration with Large Language Models: Centralized or Decentralized Systems? | 2023 | 📄 Paper | 🌍 Website | - |
| DART-LLM: Dependency-Aware Multi-Robot Task Decomposition and Execution using Large Language Models | 2024 | 📄 Paper | 🌍 Website | - |
| MotionGPT: Human Motion as a Foreign Language | 2023 | 📄 Paper | - | 💾 Code |
| Learning Reward for Robot Skills Using Large Language Models via Self-Alignment | 2024 | 📄 Paper | - | - |
| Language to Rewards for Robotic Skill Synthesis | 2023 | 📄 Paper | 🌍 Website | - |
| Eureka: Human-Level Reward Design via Coding Large Language Models | 2023 | 📄 Paper | 🌍 Website | - |
| Integrated Task and Motion Planning | 2020 | 📄 Paper | - | - |
| Jailbreaking LLM-Controlled Robots | 2024 | 📄 Paper | 🌍 Website | - |
| Robots Enact Malignant Stereotypes | 2022 | 📄 Paper | 🌍 Website | - |
| LLM-Driven Robots Risk Enacting Discrimination, Violence, and Unlawful Actions | 2024 | 📄 Paper | - | - |
| Highlighting the Safety Concerns of Deploying LLMs/VLMs in Robotics | 2024 | 📄 Paper | 🌍 Website | - |
| EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents | 2025 | 📄 Paper | 🌍 Website | 💾 Code & Dataset |
| Gemini Robotics: Bringing AI into the Physical World | 2025 | 📄 Technical Report | 🌍 Website | - |
| GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation | 2024 | 📄 Paper | 🌍 Website | - |
| Magma: A Foundation Model for Multimodal AI Agents | 2025 | 📄 Paper | 🌍 Website | 💾 Code |
| DayDreamer: World Models for Physical Robot Learning | 2022 | 📄 Paper | 🌍 Website | 💾 Code |
| Hi Robot: Open-Ended Instruction Following with Hierarchical Vision-Language-Action Models | 2025 | 📄 Paper | - | - |
| RL-VLM-F: Reinforcement Learning from Vision Language Foundation Model Feedback | 2024 | 📄 Paper | 🌍 Website | 💾 Code |
| KALIE: Fine-Tuning Vision-Language Models for Open-World Manipulation without Robot Data | 2024 | 📄 Paper | 🌍 Website | 💾 Code |
| Unified Video Action Model | 2025 | 📄 Paper | 🌍 Website | 💾 Code |
| HybridVLA: Collaborative Diffusion and Autoregression in a Unified Vision-Language-Action Model | 2025 | 📄 Paper | 🌍 Website | 💾 Code |
| DAM-VLA: A Dynamic Action Model-Based Vision-Language-Action Framework for Robot Manipulation | 03/2026 | 📄 Paper | - | |
| NS-VLA: Towards Neuro-Symbolic Vision-Language-Action Models | 03/2026 | 📄 Paper | - | |
| Steerable Vision-Language-Action Policies for Embodied Reasoning and Hierarchical Control | 02/2026 | 📄 Paper | - | |
| ST4VLA: Spatial Guided Training for Vision-Language-Action Models | 02/2026 | 📄 Paper | - |
| Title | Year | Paper | Website | Code |
|---|---|---|---|---|
| VIMA: General Robot Manipulation with Multimodal Prompts | 2022 | 📄 Paper | 🌍 Website | |
| Instruct2Act: Mapping Multi-Modality Instructions to Robotic Actions with Large Language Model | 2023 | 📄 Paper | - | - |
| Creative Robot Tool Use with Large Language Models | 2023 | 📄 Paper | 🌍 Website | - |
| RoboVQA: Multimodal Long-Horizon Reasoning for Robotics | 2024 | 📄 Paper | - | - |
| RT-1: Robotics Transformer for Real-World Control at Scale | 2022 | 📄 Paper | 🌍 Website | - |
| RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control | 2023 | 📄 Paper | 🌍 Website | - |
| Open X-Embodiment: Robotic Learning Datasets and RT-X Models | 2023 | 📄 Paper | 🌍 Website | - |
| ExploRLLM: Guiding Exploration in Reinforcement Learning with Large Language Models | 2024 | 📄 Paper | 🌍 Website | - |
| AnyTouch: Learning Unified Static-Dynamic Representation across Multiple Visuo-tactile Sensors | 2025 | 📄 Paper | 🌍 Website | 💾 Code |
| Masked World Models for Visual Control | 2022 | 📄 Paper | 🌍 Website | 💾 Code |
| Multi-View Masked World Models for Visual Robotic Manipulation | 2023 | 📄 Paper | 🌍 Website | 💾 Code |
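RT-2-style vision-language-action models (listed above) emit robot actions as text tokens by uniformly binning each continuous action dimension into a small integer vocabulary. A sketch of that discretization and its inverse (the bin count and action ranges here are illustrative, not any paper's exact scheme):

```python
import numpy as np

def discretize_action(action, low, high, n_bins=256):
    """Map continuous action dims to integer tokens by uniform binning,
    in the spirit of RT-2-style action tokenization (sketch only)."""
    a = np.clip(np.asarray(action, dtype=float), low, high)
    bins = np.floor((a - low) / (high - low) * (n_bins - 1) + 0.5)
    return bins.astype(int)

def undiscretize_action(tokens, low, high, n_bins=256):
    """Invert the binning: token index back to a continuous value."""
    return low + np.asarray(tokens, dtype=float) / (n_bins - 1) * (high - low)

tok = discretize_action([0.0, 0.5, 1.0], low=0.0, high=1.0)  # -> [0, 128, 255]
```

The quantization error is bounded by half a bin width, so with 256 bins over a unit range the round-trip is accurate to within about 0.002.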
| Title | Year | Paper | Website | Code |
|---|---|---|---|---|
| ZSON: Zero-Shot Object-Goal Navigation using Multimodal Goal Embeddings | 2022 | 📄 Paper | - | - |
| LOC-ZSON: Language-driven Object-Centric Zero-Shot Object Retrieval and Navigation | 2024 | 📄 Paper | - | - |
| LM-Nav: Robotic Navigation with Large Pre-Trained Models of Language, Vision, and Action | 2022 | 📄 Paper | 🌍 Website | - |
| NaVILA: Legged Robot Vision-Language-Action Model for Navigation | 2024 | 📄 Paper | 🌍 Website | - |
| VLFM: Vision-Language Frontier Maps for Zero-Shot Semantic Navigation | 2024 | 📄 Paper | - | - |
| Navigation with Large Language Models: Semantic Guesswork as a Heuristic for Planning | 2023 | 📄 Paper | 🌍 Website | - |
| Vi-LAD: Vision-Language Attention Distillation for Socially-Aware Robot Navigation in Dynamic Environments | 2025 | 📄 Paper | - | - |
| Navigation World Models | 2024 | 📄 Paper | 🌍 Website | - |
| Title | Year | Paper | Website | Code |
|---|---|---|---|---|
| MUTEX: Learning Unified Policies from Multimodal Task Specifications | 2023 | 📄 Paper | 🌍 Website | - |
| LaMI: Large Language Models for Multi-Modal Human-Robot Interaction | 2024 | 📄 Paper | 🌍 Website | - |
| VLM-Social-Nav: Socially Aware Robot Navigation through Scoring using Vision-Language Models | 2024 | 📄 Paper | - | - |
| Title | Year | Paper | Website | Code |
|---|---|---|---|---|
| AutoMoT: A Unified Vision-Language-Action Model with Asynchronous Mixture-of-Transformers for End-to-End Autonomous Driving | 03/2026 | 📄 Paper | - | - |
| DriveVLM-RL: Neuroscience-Inspired Reinforcement Learning with Vision-Language Models for Safe Autonomous Driving | 03/2026 | 📄 Paper | - | |
| HiST-VLA: A Hierarchical Spatio-Temporal Vision-Language-Action Model for End-to-End Autonomous Driving | 02/2026 | 📄 Paper | - | |
| OpenDriveVLA: Towards End-to-end Autonomous Driving with Large Vision Language Action Model | 03/2025 | 📄 Paper | - | |
| Are VLMs Ready for Autonomous Driving? An Empirical Study from the Reliability, Data, and Metric Perspectives | 01/07/2025 | 📄 Paper | 🌍 Website | |
| DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models | 2024 | 📄 Paper | 🌍 Website | - |
| GPT-Driver: Learning to Drive with GPT | 2023 | 📄 Paper | - | - |
| LanguageMPC: Large Language Models as Decision Makers for Autonomous Driving | 2023 | 📄 Paper | 🌍 Website | - |
| Driving with LLMs: Fusing Object-Level Vector Modality for Explainable Autonomous Driving | 2023 | 📄 Paper | - | - |
| Referring Multi-Object Tracking | 2023 | 📄 Paper | - | 💾 Code |
| VLPD: Context-Aware Pedestrian Detection via Vision-Language Semantic Self-Supervision | 2023 | 📄 Paper | - | 💾 Code |
| MotionLM: Multi-Agent Motion Forecasting as Language Modeling | 2023 | 📄 Paper | - | - |
| DiLu: A Knowledge-Driven Approach to Autonomous Driving with Large Language Models | 2023 | 📄 Paper | 🌍 Website | - |
| VLP: Vision Language Planning for Autonomous Driving | 2024 | 📄 Paper | - | - |
| DriveGPT4: Interpretable End-to-end Autonomous Driving via Large Language Model | 2023 | 📄 Paper | - | - |
| Title | Year | Paper | Website | Code |
|---|---|---|---|---|
| DLF: Disentangled-Language-Focused Multimodal Sentiment Analysis | 2024 | 📄 Paper | - | 💾 Code |
| LIT: Large Language Model Driven Intention Tracking for Proactive Human-Robot Collaboration – A Robot Sous-Chef Application | 2024 | 📄 Paper | - | - |
| Pretrained Language Models as Visual Planners for Human Assistance | 2023 | 📄 Paper | - | - |
| Promoting AI Equity in Science: Generalized Domain Prompt Learning for Accessible VLM Research | 2024 | 📄 Paper | - | - |
| Image and Data Mining in Reticular Chemistry Using GPT-4V | 2023 | 📄 Paper | - | - |
| Title | Year | Paper | Website | Code |
|---|---|---|---|---|
| A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis | 2023 | 📄 Paper | - | - |
| CogAgent: A Visual Language Model for GUI Agents | 2023 | 📄 Paper | - | 💾 Code |
| WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models | 2024 | 📄 Paper | - | 💾 Code |
| ShowUI: One Vision-Language-Action Model for GUI Visual Agent | 2024 | 📄 Paper | - | 💾 Code |
| ScreenAgent: A Vision Language Model-driven Computer Control Agent | 2024 | 📄 Paper | - | 💾 Code |
| Web Agents with World Models: Learning and Leveraging Environment Dynamics in Web Navigation | 2024 | 📄 Paper | - | 💾 Code |
| MolmoWeb: An Open Agent for Automating Web Tasks | 03/2026 | 📄 Blog | 🌍 Website | - |
| Title | Year | Paper | Website | Code |
|---|---|---|---|---|
| CARE: Towards Clinical Accountability in Multi-Modal Medical Reasoning with an Evidence-Grounded Agentic Framework | 03/2026 | 📄 Paper | - | - |
| MedMO: Grounding and Understanding Multimodal Large Language Model for Medical Images | 02/2026 | 📄 Paper | - | |
| Colon-X: Advancing Intelligent Colonoscopy from Multimodal Understanding to Clinical Reasoning | 12/2025 | 📄 Paper | - | |
| Frontiers in Intelligent Colonoscopy | 02/2025 | 📄 Paper | - | 💾 Code |
| VisionUnite: A Vision-Language Foundation Model for Ophthalmology Enhanced with Clinical Knowledge | 2024 | 📄 Paper | - | 💾 Code |
| Multimodal Healthcare AI: Identifying and Designing Clinically Relevant Vision-Language Applications for Radiology | 2024 | 📄 Paper | - | - |
| M-FLAG: Medical Vision-Language Pre-training with Frozen Language Models and Latent Space Geometry Optimization | 2023 | 📄 Paper | - | - |
| MedCLIP: Contrastive Learning from Unpaired Medical Images and Text | 2022 | 📄 Paper | - | 💾 Code |
| Med-Flamingo: A Multimodal Medical Few-Shot Learner | 2023 | 📄 Paper | - | 💾 Code |
| Title | Year | Paper | Website | Code |
|---|---|---|---|---|
| Analyzing K-12 AI Education: A Large Language Model Study of Classroom Instruction on Learning Theories, Pedagogy, Tools, and AI Literacy | 2024 | 📄 Paper | - | - |
| Students Rather Than Experts: A New AI for Education Pipeline to Model More Human-Like and Personalized Early Adolescence | 2024 | 📄 Paper | - | - |
| Harnessing Large Vision and Language Models in Agriculture: A Review | 2024 | 📄 Paper | - | - |
| A Vision-Language Model for Predicting Potential Distribution Land of Soybean Double Cropping | 2024 | 📄 Paper | - | - |
| Vision-Language Model is NOT All You Need: Augmentation Strategies for Molecule Language Models | 2024 | 📄 Paper | - | 💾 Code |
| DrawEduMath: Evaluating Vision Language Models with Expert-Annotated Students’ Hand-Drawn Math Images | 2024 | 📄 Paper | - | - |
| MultiMath: Bridging Visual and Mathematical Reasoning for Large Language Models | 2024 | 📄 Paper | - | 💾 Code |
| Vision-Language Models Meet Meteorology: Developing Models for Extreme Weather Events Detection with Heatmaps | 2024 | 📄 Paper | - | 💾 Code |
| He is Very Intelligent, She is Very Beautiful? On Mitigating Social Biases in Language Modeling and Generation | 2021 | 📄 Paper | - | - |
| UrbanVLP: Multi-Granularity Vision-Language Pretraining for Urban Region Profiling | 2024 | 📄 Paper | - | - |
| Title | Year | Paper | Website | Code |
|---|---|---|---|---|
| HALP: Detecting Hallucinations in Vision-Language Models without Generating a Single Token | 03/2026 | 📄 Paper | 🌍 ACL | - |
| Tone Matters: The Impact of Linguistic Tone on Hallucination in VLMs | 01/2026 | 📄 Paper | - | |
| Object Hallucination in Image Captioning | 2018 | 📄 Paper | - | |
| Evaluating Object Hallucination in Large Vision-Language Models | 2023 | 📄 Paper | - | 💾 Code |
| Detecting and Preventing Hallucinations in Large Vision Language Models | 2023 | 📄 Paper | - | - |
| HallE-Control: Controlling Object Hallucination in Large Multimodal Models | 2023 | 📄 Paper | - | 💾 Code |
| Hallu-PI: Evaluating Hallucination in Multi-modal Large Language Models within Perturbed Inputs | 2024 | 📄 Paper | - | 💾 Code |
| BEAF: Observing BEfore-AFter Changes to Evaluate Hallucination in Vision-Language Models | 2024 | 📄 Paper | 🌍 Website | - |
| HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models | 2023 | 📄 Paper | - | 💾 Code |
| AUTOHALLUSION: Automatic Generation of Hallucination Benchmarks for Vision-Language Models | 2024 | 📄 Paper | 🌍 Website | - |
| Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning | 2023 | 📄 Paper | - | 💾 Code |
| Hal-Eval: A Universal and Fine-grained Hallucination Evaluation Framework for Large Vision Language Models | 2024 | 📄 Paper | - | 💾 Code |
| AMBER: An LLM-free Multi-dimensional Benchmark for MLLMs Hallucination Evaluation | 2023 | 📄 Paper | - | 💾 Code |
| Title | Year | Paper | Website | Code |
|---|---|---|---|---|
| SaFeR-VLM: Safety into Multimodal Reasoning via Reinforcement Learning | 2026 (ICLR) | 📄 Paper | - | - |
| HoliSafe: Holistic Safety Evaluation for Vision-Language Models | 2026 (ICLR) | 📄 Paper | - | - |
| JailbreakZoo: Survey, Landscapes, and Horizons in Jailbreaking Large Language and Vision-Language Models | 2024 | 📄 Paper | 🌍 Website | - |
| Safe-VLN: Collision Avoidance for Vision-and-Language Navigation of Autonomous Robots Operating in Continuous Environments | 2023 | 📄 Paper | - | - |
| SafeBench: A Safety Evaluation Framework for Multimodal Large Language Models | 2024 | 📄 Paper | - | - |
| JailBreakV: A Benchmark for Assessing the Robustness of MultiModal Large Language Models against Jailbreak Attacks | 2024 | 📄 Paper | - | - |
| SHIELD: An Evaluation Benchmark for Face Spoofing and Forgery Detection with Multimodal Large Language Models | 2024 | 📄 Paper | - | 💾 Code |
| Images are Achilles' Heel of Alignment: Exploiting Visual Vulnerabilities for Jailbreaking Multimodal Large Language Models | 2024 | 📄 Paper | - | - |
| Jailbreaking Attack against Multimodal Large Language Model | 2024 | 📄 Paper | - | - |
| Embodied Red Teaming for Auditing Robotic Foundation Models | 2025 | 📄 Paper | 🌍 Website | - |
| Safety Guardrails for LLM-Enabled Robots | 2025 | 📄 Paper | - | - |
| Title | Year | Paper | Website | Code |
|---|---|---|---|---|
| Hallucination of Multimodal Large Language Models: A Survey | 2024 | 📄 Paper | - | - |
| Bias and Fairness in Large Language Models: A Survey | 2023 | 📄 Paper | - | - |
| Fairness and Bias in Multimodal AI: A Survey | 2024 | 📄 Paper | - | - |
| Multi-Modal Bias: Introducing a Framework for Stereotypical Bias Assessment beyond Gender and Race in Vision–Language Models | 2023 | 📄 Paper | - | - |
| FMBench: Benchmarking Fairness in Multimodal Large Language Models on Medical Tasks | 2024 | 📄 Paper | - | - |
| FairCLIP: Harnessing Fairness in Vision-Language Learning | 2024 | 📄 Paper | - | - |
| FairMedFM: Fairness Benchmarking for Medical Imaging Foundation Models | 2024 | 📄 Paper | - | - |
| Benchmarking Vision Language Models for Cultural Understanding | 2024 | 📄 Paper | - | - |
| Title | Year | Paper | Website | Code |
|---|---|---|---|---|
| Mitigating Hallucinations in Large Vision-Language Models with Instruction Contrastive Decoding | 2024 | 📄 Paper | - | - |
| Enhancing Visual-Language Modality Alignment in Large Vision Language Models via Self-Improvement | 2024 | 📄 Paper | - | - |
| Assessing and Learning Alignment of Unimodal Vision and Language Models | 2024 | 📄 Paper | 🌍 Website | - |
| Extending Multi-modal Contrastive Representations | 2023 | 📄 Paper | - | 💾 Code |
| OneLLM: One Framework to Align All Modalities with Language | 2023 | 📄 Paper | - | 💾 Code |
| What You See is What You Read? Improving Text-Image Alignment Evaluation | 2023 | 📄 Paper | 🌍 Website | 💾 Code |
| Critic-V: VLM Critics Help Catch VLM Errors in Multimodal Reasoning | 2024 | 📄 Paper | 🌍 Website | 💾 Code |
| Title | Year | Paper | Website | Code |
|---|---|---|---|---|
| VBench: Comprehensive Benchmark Suite for Video Generative Models | 2023 | 📄 Paper | 🌍 Website | 💾 Code |
| VBench++: Comprehensive and Versatile Benchmark Suite for Video Generative Models | 2024 | 📄 Paper | 🌍 Website | 💾 Code |
| PhysBench: Benchmarking and Enhancing VLMs for Physical World Understanding | 2025 | 📄 Paper | 🌍 Website | 💾 Code |
| VideoPhy: Evaluating Physical Commonsense for Video Generation | 2024 | 📄 Paper | 🌍 Website | 💾 Code |
| WorldSimBench: Towards Video Generation Models as World Simulators | 2024 | 📄 Paper | 🌍 Website | - |
| WorldModelBench: Judging Video Generation Models As World Models | 2025 | 📄 Paper | 🌍 Website | 💾 Code |
| VideoScore: Building Automatic Metrics to Simulate Fine-grained Human Feedback for Video Generation | 2024 | 📄 Paper | 🌍 Website | 💾 Code |
| WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation | 2025 | 📄 Paper | - | 💾 Code |
| Content-Rich AIGC Video Quality Assessment via Intricate Text Alignment and Motion-Aware Consistency | 2025 | 📄 Paper | - | 💾 Code |
| Your Large Vision-Language Model Only Needs A Few Attention Heads For Visual Grounding | 2025 | 📄 Paper | - | - |
| SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities | 2024 | 📄 Paper | 🌍 Website | 💾 Code |
| Do generative video models understand physical principles? | 2025 | 📄 Paper | 🌍 Website | 💾 Code |
| PhysGen: Rigid-Body Physics-Grounded Image-to-Video Generation | 2024 | 📄 Paper | 🌍 Website | 💾 Code |
| How Far is Video Generation from World Model: A Physical Law Perspective | 2024 | 📄 Paper | 🌍 Website | 💾 Code |
| Imagine while Reasoning in Space: Multimodal Visualization-of-Thought | 2025 | 📄 Paper | - | - |
| VBench-2.0: Advancing Video Generation Benchmark Suite for Intrinsic Faithfulness | 2025 | 📄 Paper | 🌍 Website | 💾 Code |
| Title | Year | Paper | Website | Code |
|---|---|---|---|---|
| LoRA-Squeeze: Simple and Effective Post-Tuning and In-Tuning Compression of LoRA Modules | 02/2026 | 📄 Paper | - | - |
| GRACE: Gated Relational Alignment via Confidence-based Distillation for Efficient VLMs | 01/2026 | 📄 Paper | - | - |
| VLMQ: Post-Training Quantization for Large Vision-Language Models | 2026 (ICLR) | 📄 Paper | - | - |
| VILA: On Pre-training for Visual Language Models | 2023 | 📄 Paper | - | - |
| SimVLM: Simple Visual Language Model Pretraining with Weak Supervision | 2021 | 📄 Paper | - | - |
| LoRA: Low-Rank Adaptation of Large Language Models | 2021 | 📄 Paper | - | 💾 Code |
| QLoRA: Efficient Finetuning of Quantized LLMs | 2023 | 📄 Paper | - | - |
| Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback | 2022 | 📄 Paper | - | 💾 Code |
| RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback | 2023 | 📄 Paper | - | - |
| Title | Year | Paper | Website | Code |
|---|---|---|---|---|
| A Survey on Bridging VLMs and Synthetic Data | 2025 | 📄 Paper | - | 💾 Code |
| Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning | 2024 | 📄 Paper | 🌍 Website | 💾 Code |
| SLIP: Self-supervision meets Language-Image Pre-training | 2021 | 📄 Paper | - | 💾 Code |
| Synthetic Vision: Training Vision-Language Models to Understand Physics | 2024 | 📄 Paper | - | - |
| Synth2: Boosting Visual-Language Models with Synthetic Captions and Image Embeddings | 2024 | 📄 Paper | - | - |
| KALIE: Fine-Tuning Vision-Language Models for Open-World Manipulation without Robot Data | 2024 | 📄 Paper | - | - |
| Web Agents with World Models: Learning and Leveraging Environment Dynamics in Web Navigation | 2024 | 📄 Paper | - | - |