CVPR 2025 Workshop · Updated Mar 2026 · Open Source

Vision-Language Models
Survey & Overview

A curated collection of state-of-the-art VLMs, benchmarks, RL alignment methods, applications, and open challenges in the multimodal AI landscape.

54 Models
81 Benchmarks
50 RL & SFT
115 Applications
ModelYearArchitectureTraining DataParametersVision Encoder/TokenizerPretrained Backbone Model
GPT-5.4 / GPT-5.4 Thinking (OpenAI)03/06/2026Decoder-onlyUndisclosedUndisclosedUndisclosedUndisclosed
Phi-4-Reasoning-Vision-15B (Microsoft)03/04/2026Decoder-onlyCurated synthetic + filtered data15BHigh-res dynamic-resolution ViTPhi-4
Gemini 3.0 (Google)03/2026Unified ModelUndisclosedUndisclosedUndisclosedUndisclosed
Qwen3.5 (Alibaba)02/16/2026Unified VL (early fusion)Trillions of multimodal tokens0.8B–397B (MoE, 17B active)ViT (native)Qwen3.5
Claude Opus 4.6 (Anthropic)02/2026Decoder-onlyUndisclosedUndisclosedUndisclosedUndisclosed
ERNIE 5.0 (Baidu)02/05/2026Unified Model (Visual, Text, Audio)Unified Modality Dataset-CNN–ViT (Understanding)/Next-Frame-and-Scale Prediction (Generation)Unified Autoregressive Transformer
Molmo2 (Allen AI)01/15/2026Decoder-only7 new video + 2 multi-image datasets (9.19M videos)4B / 7B / 8BBi-directional attention ViTQwen 3 / OLMo
Gemini 311/18/2025Unified ModelUndisclosed---
Emu3.510/30/2025Decoder-onlyUnified Modality Dataset-SigLIPQwen3
DeepSeek-OCR10/20/2025Encoder-Decoder70% OCR, 20% general vision, 10% text-only3BDeepEncoderDeepSeek-3B
Qwen3-VL10/11/2025Decoder-only-8B/4BViTQwen3
Qwen3-VL-MoE09/25/2025Decoder-only-235B-A22BViTQwen3
Qwen3-Omni (Visual/Audio/Text)09/21/2025-Video/Audio/Image30BViTQwen3-Omni-MoE-Thinker
LLaVA-OneVision-1.509/15/2025-Mid-Training-85M & SFT8BQwen2VLImageProcessorQwen3
InternVL3.508/25/2025Decoder-onlyMultimodal & text-only30B/38B/241BInternViT-300M/6BQwen3 / GPT-OSS
SkyWork-Unipic-1.5B07/29/2025-image/video..---
Grok 407/09/2025-image/video..1-2 Trillion--
Kwai Keye-VL (Kuaishou)07/02/2025Decoder-onlyimage/video..8BViTQwen3-8B
OmniGen206/23/2025Decoder-only & VAELLaVA-OneVision / SAM-LLaVA..-ViTQwen2.5-VL
Gemini-2.5-Pro06/17/2025-----
OpenAI o3/o4-mini06/10/2025Decoder-onlyUndisclosedUndisclosedUndisclosedUndisclosed
MiMo-VL (Xiaomi)06/04/2025Decoder-only24 Trillion MLLM tokens7BQwen2.5-ViTMiMo-7B-Base
BAGEL (ByteDance)05/20/2025Unified ModelVideo/Image/Text7BSigLIP2-so400m/14Qwen2.5
BLIP3-o05/14/2025Decoder-only(BLIP3-o 60K) GPT-4o Generated Image Generation Data4/8BViTQwen2.5-VL
InternVL-304/14/2025Decoder-only200 Billion Tokens1/2/8/9/14/38/78BViT-300M/6BInternLM2.5/Qwen2.5
LLaMA4-Scout/Maverick04/04/2025Decoder-only40/20 Trillion Tokens17BMetaCLIPLLaMA4
Qwen2.5-Omni03/26/2025Decoder-onlyVideo/Audio/Image/Text7BQwen2-Audio/Qwen2.5-VL ViTEnd-to-End Mini-Omni
Qwen2.5-VL01/28/2025Decoder-onlyImage caption, VQA, grounding agent, long video3B/7B/72BRedesigned ViTQwen2.5
Ola2025Decoder-onlyImage/Video/Audio/Text7BOryxViTQwen-2.5-7B, SigLIP-400M, Whisper-V3-Large, BEATs-AS2M(cpt2)
Ocean-OCR2025Decoder-onlyPure Text, Caption, Interleaved, OCR3BNaViTPretrained from scratch
SmolVLM2025Decoder-onlySmolVLM-Instruct250M & 500MSigLIPSmolLM
DeepSeek-Janus-Pro2025Decoder-onlyUndisclosed7BSigLIPDeepSeek-Janus-Pro
Inst-IT2024Decoder-onlyInst-IT Dataset, LLaVA-NeXT-Data7BCLIP/Vicuna, SigLIP/Qwen2LLaVA-NeXT
DeepSeek-VL22024Decoder-onlyWiT, WikiHow4.5B x 74SigLIP/SAM-BDeepSeekMoE
xGen-MM (BLIP-3)2024Decoder-onlyMINT-1T, OBELICS, Caption4BViT + Perceiver ResamplerPhi-3-mini
TransFusion2024Encoder-decoderUndisclosed7BVAE EncoderPretrained from scratch on transformer architecture
Baichuan Ocean Mini2024Decoder-onlyImage/Video/Audio/Text7BCLIP ViT-L/14Baichuan
LLaMA 3.2-vision2024Decoder-onlyUndisclosed11B-90BCLIPLLaMA-3.1
Pixtral2024Decoder-onlyUndisclosed12BCLIP ViT-L/14Mistral Large 2
Qwen2-VL2024Decoder-onlyUndisclosed7B-14BEVA-CLIP ViT-LQwen-2
NVLM2024Encoder-decoderLAION-115M 8B-24BCustom ViTQwen-2-Instruct
Emu32024Decoder-onlyAquila7BMoVQGANLLaMA-2
Claude 32024Decoder-onlyUndisclosedUndisclosedUndisclosedUndisclosed
InternVL2023Encoder-decoderLAION-en, LAION-multi7B/20BEva CLIP ViT-gQLLaMA
InstructBLIP2023Encoder-decoderCOCO, VQAv213BViTFlan-T5, Vicuna
CogVLM2023Encoder-decoderLAION-2B, COYO-700M18BCLIP ViT-L/14Vicuna
PaLM-E2023Decoder-onlyAll robots, WebLI562BViTPaLM
LLaVA-1.52023Decoder-onlyCOCO13BCLIP ViT-L/14Vicuna
Gemini2023Decoder-onlyUndisclosedUndisclosedUndisclosedUndisclosed
GPT-4V2023Decoder-onlyUndisclosedUndisclosedUndisclosedUndisclosed
BLIP-22023Encoder-decoderCOCO, Visual Genome7B-13BViT-gOpen Pretrained Transformer (OPT)
Flamingo2022Decoder-onlyM3W, ALIGN80BCustomChinchilla
BLIP2022Encoder-decoderCOCO, Visual Genome223M-400MViT-B/L/gPretrained from scratch
CLIP2021Dual encoder (contrastive)400M image-text pairs63M-355MViT/ResNetPretrained from scratch
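The table closes with CLIP, whose dual-encoder contrastive pretraining underlies many of the vision encoders listed above (SigLIP, EVA-CLIP, MetaCLIP). Below is a minimal sketch of a symmetric image-text contrastive (InfoNCE-style) loss in PyTorch; the embedding dimension, temperature, and random inputs are illustrative placeholders, not CLIP's actual architecture or hyperparameters.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings.

    image_emb, text_emb: (batch, dim) outputs of the two encoders.
    The matching pair for row i is column i; all other columns are negatives.
    """
    # L2-normalize so the dot product is a cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (batch, batch) similarity matrix, scaled by the temperature.
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions (image-to-text and text-to-image).
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2

# Toy usage with random embeddings standing in for encoder outputs.
if __name__ == "__main__":
    img = torch.randn(8, 512)
    txt = torch.randn(8, 512)
    print(clip_contrastive_loss(img, txt).item())
```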
DatasetTaskSize
OmniScience(02/2026)Scientific Image Understanding1.5M figure-caption-context triplets
MaD-Mix(02/2026)Multi-modal Data Mixture OptimizationFramework (0.5B–7B scale)
OVID(2026)Open Video Pre-training10M hours, 300M frame-caption pairs
Molmo2 Video Datasets(01/2026)Video Captions, QA, Tracking, Pointing9.19M videos (7 video + 2 multi-image datasets)
MMFineReason(01/30/2026)Reasoning1.8M
FineVision(09/04/2025)Mixed Domain24.3M / 4.48TB
DatasetTaskEval ProtocolAnnotatorsSize (K)Code / Site
MathVisionVisual MathMC / Answer MatchHuman3.04Repo
MathVistaVisual MathMC / Answer MatchHuman6Repo
MathVerseVisual MathMCHuman4.6Repo
VisNumBenchVisual Number ReasoningMCPython-program generated / Web collection / Real-life photos1.91Repo
DatasetTaskEval ProtocolAnnotatorsSize (K)Code / Site
ROVERReciprocal Cross-Modal ReasoningVisual Gen + Verbal Gen EvalHuman1.3 (1,876 images)Paper
RealUnifyMath, World knowledge, Image GenDirect & StepWise Eval (Sec 3.3)Script & Human verification1.0
Uni-MMMUScience, Code, Image GenDreamSim (Image Gen Eval) & String Matching (Understanding Eval)-1.0Repo
DatasetTaskEval ProtocolAnnotatorsSize (K)Code / Site
MMOUOmni-modal Long Video UnderstandingMCHuman15 (9,038 videos)Paper
Video-MMMUKnowledge Acquisition from Professional VideosMC + Knowledge GainExpert0.9 (300 videos)Paper
MMVUExpert-Level Multi-Discipline Video UnderstandingMCExpert3 (27 subjects)Paper
VideoHalluVideo UnderstandingLLM EvalHuman3.2
Video SimpleQAVideo UnderstandingLLM EvalHuman2.03Repo
MovieChatVideo UnderstandingLLM EvalHuman1Repo
Perception‑TestVideo UnderstandingMCCrowd11.6Repo
VideoMMEVideo UnderstandingMCExperts2.7Site
EgoSchemaVideo UnderstandingMCSynth / Human5Site
Inst‑IT‑BenchFine‑grained Image & VideoMC & LLMHuman / Synth2Repo
DatasetTaskEval ProtocolAnnotatorsSize (K)Code / Site
VisionArenaMultimodal ConversationPairwise PrefHuman23Repo
DatasetTaskEval ProtocolAnnotatorsSize (K)Code / Site
OmniEarthGeospatial / Remote Sensing VLM EvalMC + Open VQAHuman (verified)44.2 (9,275 images, 28 tasks)Paper
MultiHaystackMultimodal Retrieval & ReasoningRetrieval + QAHuman0.75 (46K+ candidates)
DatBenchDiscriminative, Faithful VLM EvalMC (format-aware)Synth-
MMLUGeneral MMMCHuman15.9
MMStarGeneral MMMCHuman1.5Site
NaturalBenchGeneral MMYes/No, MCHuman10HF
PhysBenchPhysical World UnderstandingMCGrad STEM0.10Repo
DatasetTaskEval ProtocolAnnotatorsSize (K)Code / Site
EMMAVisual ReasoningMCHuman + Synth2.8Repo
MMT-BenchVisual Reasoning & QAMCAI Experts30.1Repo
MM‑VetOCR / Visual ReasoningLLM EvalHuman0.2Repo
MM‑En/CNMultilingual MM UnderstandingMCHuman3.2Repo
GQAVisual Reasoning & QAAnswer MatchSeed + Synth22Site
VCRVisual Reasoning & QAMCMTurks290Site
VQAv2Visual Reasoning & QAYes/No, Ans MatchMTurks1100Repo
MMMUVisual Reasoning & QAAns Match, MCCollege11.5Site
MMMU-ProVisual Reasoning & QAAns Match, MCCollege5.19Site
R1‑OnevisionVisual Reasoning & QAMCHuman155Repo
VLM²‑BenchVisual Reasoning & QAAns Match, MCHuman3Site
VisualWebInstructVisual Reasoning & QALLM EvalWeb0.9Site
DatasetTaskEval ProtocolAnnotatorsSize (K)Code / Site
TextVQAVisual Text UnderstandingAns MatchExpert28.6Repo
DocVQADocument VQAAns MatchCrowd50Site
ChartQAChart Graphic UnderstandingAns MatchCrowd / Synth32.7Repo
DatasetTaskEval ProtocolAnnotatorsSize (K)Code / Site
MSCOCO‑30KText‑to‑ImageBLEU, ROUGE, SimMTurks30Site
GenAI‑BenchText‑to‑ImageHuman RatingHuman80HF
DatasetTaskEval ProtocolAnnotatorsSize (K)Code / Site
HallusionBenchHallucinationYes/NoHuman1.13Repo
POPEHallucinationYes/NoHuman9Repo
CHAIRHallucinationYes/NoHuman124Repo
MHalDetectHallucinationAns MatchHuman4Repo
Hallu‑PiHallucinationAns MatchHuman1.26Repo
HallE‑ControlHallucinationYes/NoHuman108Repo
AutoHallusionHallucinationAns MatchSynth3.129Repo
BEAFHallucinationYes/NoHuman26Site
GAIVEHallucinationAns MatchSynth320Repo
HalEvalHallucinationYes/NoCrowd / Synth2Repo
AMBERHallucinationAns MatchHuman15.22Repo
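Several of the hallucination benchmarks above (POPE, BEAF, HallE-Control, HalEval) score binary yes/no probes about object presence against ground truth. The sketch below is a generic scorer in that spirit, assuming a naive answer parser and the usual accuracy/precision/recall/F1/yes-ratio metrics; it is not any single benchmark's official evaluation code.

```python
from typing import List, Tuple

def parse_yes_no(answer: str) -> bool:
    """Naive normalizer: treat an answer as 'yes' if it starts with 'yes'."""
    return answer.strip().lower().startswith("yes")

def score_yes_no(pairs: List[Tuple[str, bool]]) -> dict:
    """pairs: (model_answer, ground_truth_is_yes). Returns accuracy/P/R/F1 and yes-ratio."""
    preds = [parse_yes_no(ans) for ans, _ in pairs]
    golds = [gold for _, gold in pairs]

    tp = sum(p and g for p, g in zip(preds, golds))
    fp = sum(p and not g for p, g in zip(preds, golds))
    fn = sum((not p) and g for p, g in zip(preds, golds))
    tn = sum((not p) and (not g) for p, g in zip(preds, golds))

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "yes_ratio": sum(preds) / len(pairs),
    }

if __name__ == "__main__":
    demo = [("Yes, there is a dog.", True), ("No.", False), ("Yes.", False)]
    print(score_yes_no(demo))
```

A high yes-ratio on probes whose ground truth is "no" is the usual signal of object hallucination in this style of evaluation.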
BenchmarkDomainTypeProject
Drive-BenchEmbodied AIAutonomous DrivingWebsite
Habitat, Habitat 2.0, Habitat 3.0Robotics (Navigation)Simulator + DatasetWebsite
GibsonRobotics (Navigation)Simulator + DatasetWebsite, Github Repo
iGibson1.0, iGibson2.0Robotics (Navigation)Simulator + DatasetWebsite, Document
Isaac GymRobotics (Navigation)SimulatorWebsite, Github Repo
Isaac LabRobotics (Navigation)SimulatorWebsite, Github Repo
AI2THORRobotics (Navigation)SimulatorWebsite, Github Repo
ProcTHORRobotics (Navigation)Simulator + DatasetWebsite, Github Repo
VirtualHomeRobotics (Navigation)SimulatorWebsite, Github Repo
ThreeDWorldRobotics (Navigation)SimulatorWebsite, Github Repo
VIMA-BenchRobotics (Manipulation)SimulatorWebsite, Github Repo
VLMbenchRobotics (Manipulation)SimulatorGithub Repo
CALVINRobotics (Manipulation)SimulatorWebsite, Github Repo
GemBenchRobotics (Manipulation)SimulatorWebsite, Github Repo
WebArenaWeb AgentSimulatorWebsite, Github Repo
UniSimRobotics (Manipulation)Generative Model, World ModelWebsite
GAIA-1Robotics (Autonomous Driving)Generative Model, World ModelWebsite
LWMEmbodied AIGenerative Model, World ModelWebsite, Github Repo
GenesisEmbodied AIGenerative Model, World ModelGithub Repo
EMMOEEmbodied AIGenerative Model, World ModelPaper
RoboGenEmbodied AIGenerative Model, World ModelWebsite
UnrealZooEmbodied AI (Tracking, Navigation, Multi Agent)SimulatorWebsite
TitleYearPaperRLCode
wDPO: Winsorized Direct Preference Optimization for Robust Alignment03/2026PaperwDPO-
f-GRPO and Beyond: Divergence-Based RL for General LLM Alignment02/2026Paperf-GRPO / f-HAL
From Sight to Insight: Improving Visual Reasoning of MLLMs via Reinforcement Learning01/2026PaperGRPO (6 reward functions)
SaFeR-VLM: Safety-Aware Reinforcement Learning for Multimodal Reasoning2026 (ICLR)PaperGRPO + safety reward
SAIL-RL: Guiding MLLMs in When and How to Think via Dual-Reward RL Tuning11/2025PaperDual-Reward (Thinking + Judging)
GIFT: Group-relative Implicit Fine Tuning Integrates GRPO with DPO and UNA10/2025PaperGIFT (convex MSE loss)
Game-RL: Synthesizing Multimodal Verifiable Game Data to Boost VLMs' General Reasoning10/12/2025PaperGRPO
Vision-Zero: Scalable VLM Self-Improvement via Strategic Gamified Self-Play09/29/2025PaperGRPO-
Vision-SR1: Self-rewarding vision-language model via reasoning decomposition08/26/2025PaperGRPO-
Group Sequence Policy Optimization06/24/2025PaperGSPO-
Visionary-R1: Mitigating Shortcuts in Visual Reasoning with Reinforcement Learning05/20/2025PaperGRPO-
VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning04/10/2025PaperGRPOCode
OpenVLThinker: An Early Exploration to Complex Vision-Language Reasoning via Iterative Self-Improvement03/21/2025PaperGRPOCode
Boosting the Generalization and Reasoning of Vision Language Models with Curriculum Reinforcement Learning03/10/2025PaperGRPOCode
OmniAlign-V: Towards Enhanced Alignment of MLLMs with Human Preference2025PaperDPOCode
Multimodal Open R1/R1-Multimodal-Journey2025-GRPOCode
R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization2025PaperGRPOCode
Agent-R1: Training Powerful LLM Agents with End-to-End Reinforcement Learning2025-PPO/REINFORCE++/GRPOCode
MM-Eureka: Exploring Visual Aha Moment with Rule-based Large-scale Reinforcement Learning2025PaperREINFORCE Leave-One-Out (RLOO)Code
MM-RLHF: The Next Step Forward in Multimodal LLM Alignment2025PaperDPOCode
LMM-R1: Empowering 3B LMMs with Strong Reasoning Abilities Through Two-Stage Rule-Based RL2025PaperPPOCode
Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models2025PaperGRPOCode
Unified Reward Model for Multimodal Understanding and Generation2025PaperDPOCode
Can We Generate Images with CoT? Let's Verify and Reinforce Image Generation Step by Step2025PaperDPOCode
All Roads Lead to Likelihood: The Value of Reinforcement Learning in Fine-Tuning2025PaperOnline RL-
Video-R1: Reinforcing Video Reasoning in MLLMs2025PaperGRPOCode
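GRPO appears in most of the RL entries above. Its core move is to drop the learned value critic: several responses are sampled per prompt, each response's advantage is its reward standardized within that group, and the advantages feed a PPO-style clipped objective. The sketch below shows only those two steps under simplifying assumptions (sequence-level log-probabilities, no KL penalty, illustrative clip range); it is not a reproduction of any specific paper's implementation.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (num_prompts, group_size) scalar rewards for sampled responses.

    Each response's advantage is its reward standardized within its own group,
    so no learned value function (critic) is needed.
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

def clipped_policy_loss(logp_new: torch.Tensor,
                        logp_old: torch.Tensor,
                        advantages: torch.Tensor,
                        clip_eps: float = 0.2) -> torch.Tensor:
    """PPO-style clipped surrogate, applied here at the sequence level."""
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()

if __name__ == "__main__":
    # Two prompts, four sampled responses each, with rule-based 0/1 rewards.
    rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0], [0.0, 0.0, 1.0, 0.0]])
    adv = group_relative_advantages(rewards)
    logp_old = torch.zeros_like(rewards)
    logp_new = torch.zeros_like(rewards) + 0.05
    print(clipped_policy_loss(logp_new, logp_old, adv))
```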
TitleYearPaperWebsiteCode
MERGETUNE: Continued Fine-Tuning of Vision-Language Models2026/01 (ICLR 2026)Paper--
Mask Fine-Tuning (MFT): Unlocking Hidden Capabilities in Vision-Language Models2025/12Paper-
Image-LoRA: Towards Minimal Fine-Tuning of VLMs2025/12Paper-
Reassessing the Role of Supervised Fine-Tuning: An Empirical Study in VLM Reasoning2025/12Paper-
Eagle 2.5: Boosting Long-Context Post-Training for Frontier Vision-Language Models2025/04/21PaperWebsite
OMNICAPTIONER: One Captioner to Rule Them All2025/04/09PaperWebsiteCode
Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning2024PaperWebsiteCode
LLaVolta: Efficient Multi-modal Models via Stage-wise Visual Context Compression2024PaperWebsiteCode
ViTamin: Designing Scalable Vision Models in the Vision-Language Era2024PaperWebsiteCode
Espresso: High Compression For Rich Extraction From Videos for Your Vision-Language Model2024Paper--
Should VLMs be Pre-trained with Image Data?2025Paper--
VisionArena: 230K Real World User-VLM Conversations with Preference Labels2024Paper-Code
ProjectRepository Link
Verl🔗 GitHub
EasyR1🔗 GitHub
OpenR1🔗 GitHub
LLaMAFactory🔗 GitHub
MM-Eureka-Zero🔗 GitHub
MM-RLHF🔗 GitHub
LMM-R1🔗 GitHub
TitleYearPaperWebsiteCode
EvoPrompt: Evolving Prompt Adaptation for Vision-Language Models2026/03Paper--
MMLoP: Multi-Modal Low-Rank Prompting for Efficient Vision-Language Adaptation2026/02Paper-
Multimodal Prompt Optimizer (MPO): Joint Optimization of Multimodal Prompts2025/10Paper-
Evolutionary Prompt Optimization Discovers Emergent Multimodal Reasoning Strategies2025/03Paper-
In-Context Edit: Enabling Instructional Image Editing with In-Context Generation in Large Scale Diffusion Transformer2025/04/30PaperWebsite
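The prompt-optimization methods above (EvoPrompt, MPO, evolutionary prompt optimization) share a propose-score-select loop over candidate prompts. The sketch below is a generic evolutionary loop of that kind; `mutate` and `score` are hypothetical placeholders standing in for an LLM-based prompt rewriter and a validation-set metric, and nothing here reproduces a specific paper's algorithm.

```python
import random
from typing import Callable, List

def evolve_prompt(seed_prompts: List[str],
                  mutate: Callable[[str], str],
                  score: Callable[[str], float],
                  generations: int = 10,
                  population: int = 8) -> str:
    """Simple (mu + lambda)-style evolutionary search over prompt strings."""
    pool = list(seed_prompts)
    for _ in range(generations):
        # Propose: mutate random parents to create children.
        children = [mutate(random.choice(pool)) for _ in range(population)]
        # Select: keep the best-scoring prompts for the next generation.
        pool = sorted(pool + children, key=score, reverse=True)[:population]
    return max(pool, key=score)

if __name__ == "__main__":
    # Toy placeholders: a real setup would call a VLM to rewrite and to evaluate prompts.
    def mutate(p: str) -> str:
        return p + random.choice([" Be concise.", " Think step by step.", " Cite the image region."])

    def score(p: str) -> float:
        return len(set(p.split()))  # stand-in for validation accuracy

    print(evolve_prompt(["Describe the chart."], mutate, score))
```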
TitleYearPaper Link
Aligning Cyber Space with Physical World: A Comprehensive Survey on Embodied AI2024Paper
ScreenAI: A Vision-Language Model for UI and Infographics Understanding2024Paper
ChartLlama: A Multimodal LLM for Chart Understanding and Generation2023Paper
SciDoc2Diagrammer-MAF: Towards Generation of Scientific Diagrams from Documents guided by Multi-Aspect Feedback Refinement2024Paper
Training a Vision Language Model as Smartphone Assistant2024Paper
ScreenAgent: A Vision-Language Model-Driven Computer Control Agent2024Paper
Embodied Vision-Language Programmer from Environmental Feedback2024Paper
VLMs Play StarCraft II: A Benchmark and Multimodal Decision Method2025Paper
MP-GUI: Modality Perception with MLLMs for GUI Understanding2025Paper
TitleYearPaperWebsiteCode
GPT4Motion: Scripting Physical Motions in Text-to-Video Generation via Blender-Oriented GPT Planning2023📄 Paper🌍 Website💾 Code
Spurious Correlation in Multimodal LLMs2025📄 Paper--
WeGen: A Unified Model for Interactive Multimodal Generation as We Chat2025📄 Paper-💾 Code
VideoMind: A Chain-of-LoRA Agent for Long Video Reasoning2025📄 Paper🌍 Website💾 Code
TitleYearPaperWebsiteCode
AHA: A Vision-Language-Model for Detecting and Reasoning Over Failures in Robotic Manipulation2024📄 Paper🌍 Website-
SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities2024📄 Paper🌍 Website-
Vision-language model-driven scene understanding and robotic object manipulation2024📄 Paper--
Guiding Long-Horizon Task and Motion Planning with Vision Language Models2024📄 Paper🌍 Website-
AutoTAMP: Autoregressive Task and Motion Planning with LLMs as Translators and Checkers2023📄 Paper🌍 Website-
VLM See, Robot Do: Human Demo Video to Robot Action Plan via Vision Language Model2024📄 Paper--
Scalable Multi-Robot Collaboration with Large Language Models: Centralized or Decentralized Systems?2023📄 Paper🌍 Website-
DART-LLM: Dependency-Aware Multi-Robot Task Decomposition and Execution using Large Language Models2024📄 Paper🌍 Website-
MotionGPT: Human Motion as a Foreign Language2023📄 Paper-💾 Code
Learning Reward for Robot Skills Using Large Language Models via Self-Alignment2024📄 Paper--
Language to Rewards for Robotic Skill Synthesis2023📄 Paper🌍 Website-
Eureka: Human-Level Reward Design via Coding Large Language Models2023📄 Paper🌍 Website-
Integrated Task and Motion Planning2020📄 Paper--
Jailbreaking LLM-Controlled Robots2024📄 Paper🌍 Website-
Robots Enact Malignant Stereotypes2022📄 Paper🌍 Website-
LLM-Driven Robots Risk Enacting Discrimination, Violence, and Unlawful Actions2024📄 Paper--
Highlighting the Safety Concerns of Deploying LLMs/VLMs in Robotics2024📄 Paper🌍 Website-
EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents2025📄 Paper🌍 Website💾 Code & Dataset
Gemini Robotics: Bringing AI into the Physical World2025📄 Technical Report🌍 Website-
GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation2024📄 Paper🌍 Website-
Magma: A Foundation Model for Multimodal AI Agents2025📄 Paper🌍 Website💾 Code
DayDreamer: World Models for Physical Robot Learning2022📄 Paper🌍 Website💾 Code
Hi Robot: Open-Ended Instruction Following with Hierarchical Vision-Language-Action Models2025📄 Paper--
RL-VLM-F: Reinforcement Learning from Vision Language Foundation Model Feedback2024📄 Paper🌍 Website💾 Code
KALIE: Fine-Tuning Vision-Language Models for Open-World Manipulation without Robot Data2024📄 Paper🌍 Website💾 Code
Unified Video Action Model2025📄 Paper🌍 Website💾 Code
HybridVLA: Collaborative Diffusion and Autoregression in a Unified Vision-Language-Action Model2025📄 Paper🌍 Website💾 Code
DAM-VLA: A Dynamic Action Model-Based Vision-Language-Action Framework for Robot Manipulation03/2026📄 Paper-
NS-VLA: Towards Neuro-Symbolic Vision-Language-Action Models03/2026📄 Paper-
Steerable Vision-Language-Action Policies for Embodied Reasoning and Hierarchical Control02/2026📄 Paper-
ST4VLA: Spatial Guided Training for Vision-Language-Action Models02/2026📄 Paper-
TitleYearPaperWebsiteCode
VIMA: General Robot Manipulation with Multimodal Prompts2022📄 Paper🌍 Website
Instruct2Act: Mapping Multi-Modality Instructions to Robotic Actions with Large Language Model2023📄 Paper--
Creative Robot Tool Use with Large Language Models2023📄 Paper🌍 Website-
RoboVQA: Multimodal Long-Horizon Reasoning for Robotics2024📄 Paper--
RT-1: Robotics Transformer for Real-World Control at Scale2022📄 Paper🌍 Website-
RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control2023📄 Paper🌍 Website-
Open X-Embodiment: Robotic Learning Datasets and RT-X Models2023📄 Paper🌍 Website-
ExploRLLM: Guiding Exploration in Reinforcement Learning with Large Language Models2024📄 Paper🌍 Website-
AnyTouch: Learning Unified Static-Dynamic Representation across Multiple Visuo-tactile Sensors2025📄 Paper🌍 Website💾 Code
Masked World Models for Visual Control2022📄 Paper🌍 Website💾 Code
Multi-View Masked World Models for Visual Robotic Manipulation2023📄 Paper🌍 Website💾 Code
TitleYearPaperWebsiteCode
ZSON: Zero-Shot Object-Goal Navigation using Multimodal Goal Embeddings2022📄 Paper--
LOC-ZSON: Language-driven Object-Centric Zero-Shot Object Retrieval and Navigation2024📄 Paper--
LM-Nav: Robotic Navigation with Large Pre-Trained Models of Language, Vision, and Action2022📄 Paper🌍 Website-
NaVILA: Legged Robot Vision-Language-Action Model for Navigation2024📄 Paper🌍 Website-
VLFM: Vision-Language Frontier Maps for Zero-Shot Semantic Navigation2024📄 Paper--
Navigation with Large Language Models: Semantic Guesswork as a Heuristic for Planning2023📄 Paper🌍 Website-
Vi-LAD: Vision-Language Attention Distillation for Socially-Aware Robot Navigation in Dynamic Environments2025📄 Paper--
Navigation World Models2024📄 Paper🌍 Website-
TitleYearPaperWebsiteCode
MUTEX: Learning Unified Policies from Multimodal Task Specifications2023📄 Paper🌍 Website-
LaMI: Large Language Models for Multi-Modal Human-Robot Interaction2024📄 Paper🌍 Website-
VLM-Social-Nav: Socially Aware Robot Navigation through Scoring using Vision-Language Models2024📄 Paper--
TitleYearPaperWebsiteCode
AutoMoT: A Unified Vision-Language-Action Model with Asynchronous Mixture-of-Transformers for End-to-End Autonomous Driving03/2026📄 Paper--
DriveVLM-RL: Neuroscience-Inspired Reinforcement Learning with Vision-Language Models for Safe Autonomous Driving03/2026📄 Paper-
HiST-VLA: A Hierarchical Spatio-Temporal Vision-Language-Action Model for End-to-End Autonomous Driving02/2026📄 Paper-
OpenDriveVLA: Towards End-to-end Autonomous Driving with Large Vision Language Action Model03/2025📄 Paper-
Are VLMs Ready for Autonomous Driving? An Empirical Study from the Reliability, Data, and Metric Perspectives01/07/2025📄 Paper🌍 Website
DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models2024📄 Paper🌍 Website-
GPT-Driver: Learning to Drive with GPT2023📄 Paper--
LanguageMPC: Large Language Models as Decision Makers for Autonomous Driving2023📄 Paper🌍 Website-
Driving with LLMs: Fusing Object-Level Vector Modality for Explainable Autonomous Driving2023📄 Paper--
Referring Multi-Object Tracking2023📄 Paper-💾 Code
VLPD: Context-Aware Pedestrian Detection via Vision-Language Semantic Self-Supervision2023📄 Paper-💾 Code
MotionLM: Multi-Agent Motion Forecasting as Language Modeling2023📄 Paper--
DiLu: A Knowledge-Driven Approach to Autonomous Driving with Large Language Models2023📄 Paper🌍 Website-
VLP: Vision Language Planning for Autonomous Driving2024📄 Paper--
DriveGPT4: Interpretable End-to-end Autonomous Driving via Large Language Model2023📄 Paper--
TitleYearPaperWebsiteCode
DLF: Disentangled-Language-Focused Multimodal Sentiment Analysis2024📄 Paper-💾 Code
LIT: Large Language Model Driven Intention Tracking for Proactive Human-Robot Collaboration – A Robot Sous-Chef Application2024📄 Paper--
Pretrained Language Models as Visual Planners for Human Assistance2023📄 Paper--
Promoting AI Equity in Science: Generalized Domain Prompt Learning for Accessible VLM Research2024📄 Paper--
Image and Data Mining in Reticular Chemistry Using GPT-4V2023📄 Paper--
TitleYearPaperWebsiteCode
A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis2023📄 Paper--
CogAgent: A Visual Language Model for GUI Agents2023📄 Paper-💾 Code
WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models2024📄 Paper-💾 Code
ShowUI: One Vision-Language-Action Model for GUI Visual Agent2024📄 Paper-💾 Code
ScreenAgent: A Vision Language Model-driven Computer Control Agent2024📄 Paper-💾 Code
Web Agents with World Models: Learning and Leveraging Environment Dynamics in Web Navigation2024📄 Paper-💾 Code
MolmoWeb: An Open Agent for Automating Web Tasks03/2026📄 Blog🌍 Website
TitleYearPaperWebsiteCode
X-World: Accessibility, Vision, and Autonomy Meet2021📄 Paper--
Context-Aware Image Descriptions for Web Accessibility2024📄 Paper--
Improving VR Accessibility Through Automatic 360 Scene Description Using Multimodal Large Language Models2024📄 Paper--
TitleYearPaperWebsiteCode
CARE: Towards Clinical Accountability in Multi-Modal Medical Reasoning with an Evidence-Grounded Agentic Framework03/2026📄 Paper--
MedMO: Grounding and Understanding Multimodal Large Language Model for Medical Images02/2026📄 Paper-
Colon-X: Advancing Intelligent Colonoscopy from Multimodal Understanding to Clinical Reasoning12/2025📄 Paper-
Frontiers in Intelligent Colonoscopy02/2025📄 Paper-💾 Code
VisionUnite: A Vision-Language Foundation Model for Ophthalmology Enhanced with Clinical Knowledge2024📄 Paper-💾 Code
Multimodal Healthcare AI: Identifying and Designing Clinically Relevant Vision-Language Applications for Radiology2024📄 Paper--
M-FLAG: Medical Vision-Language Pre-training with Frozen Language Models and Latent Space Geometry Optimization2023📄 Paper--
MedCLIP: Contrastive Learning from Unpaired Medical Images and Text2022📄 Paper-💾 Code
Med-Flamingo: A Multimodal Medical Few-Shot Learner2023📄 Paper-💾 Code
TitleYearPaperWebsiteCode
Analyzing K-12 AI Education: A Large Language Model Study of Classroom Instruction on Learning Theories, Pedagogy, Tools, and AI Literacy2024📄 Paper--
Students Rather Than Experts: A New AI for Education Pipeline to Model More Human-Like and Personalized Early Adolescence2024📄 Paper--
Harnessing Large Vision and Language Models in Agriculture: A Review2024📄 Paper--
A Vision-Language Model for Predicting Potential Distribution Land of Soybean Double Cropping2024📄 Paper--
Vision-Language Model is NOT All You Need: Augmentation Strategies for Molecule Language Models2024📄 Paper-💾 Code
DrawEduMath: Evaluating Vision Language Models with Expert-Annotated Students’ Hand-Drawn Math Images2024📄 Paper--
MultiMath: Bridging Visual and Mathematical Reasoning for Large Language Models2024📄 Paper-💾 Code
Vision-Language Models Meet Meteorology: Developing Models for Extreme Weather Events Detection with Heatmaps2024📄 Paper-💾 Code
He is Very Intelligent, She is Very Beautiful? On Mitigating Social Biases in Language Modeling and Generation2021📄 Paper--
UrbanVLP: Multi-Granularity Vision-Language Pretraining for Urban Region Profiling2024📄 Paper--
TitleYearPaperWebsiteCode
HALP: Detecting Hallucinations in Vision-Language Models without Generating a Single Token03/2026📄 Paper🌍 ACL-
Tone Matters: The Impact of Linguistic Tone on Hallucination in VLMs01/2026📄 Paper-
Object Hallucination in Image Captioning2018📄 Paper-
Evaluating Object Hallucination in Large Vision-Language Models2023📄 Paper-💾 Code
Detecting and Preventing Hallucinations in Large Vision Language Models2023📄 Paper--
HallE-Control: Controlling Object Hallucination in Large Multimodal Models2023📄 Paper-💾 Code
Hallu-PI: Evaluating Hallucination in Multi-modal Large Language Models within Perturbed Inputs2024📄 Paper-💾 Code
BEAF: Observing BEfore-AFter Changes to Evaluate Hallucination in Vision-Language Models2024📄 Paper🌍 Website-
HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models2023📄 Paper-💾 Code
AUTOHALLUSION: Automatic Generation of Hallucination Benchmarks for Vision-Language Models2024📄 Paper🌍 Website-
Mitigating Hallucination in Large Multi-Modal Models via Robust Instruction Tuning2023📄 Paper-💾 Code
Hal-Eval: A Universal and Fine-grained Hallucination Evaluation Framework for Large Vision Language Models2024📄 Paper-💾 Code
AMBER: An LLM-free Multi-dimensional Benchmark for MLLMs Hallucination Evaluation2023📄 Paper-💾 Code
TitleYearPaperWebsiteCode
SaFeR-VLM: Safety into Multimodal Reasoning via Reinforcement Learning2026 (ICLR)📄 Paper--
HoliSafe: Holistic Safety Evaluation for Vision-Language Models2026 (ICLR)📄 Paper-
JailbreakZoo: Survey, Landscapes, and Horizons in Jailbreaking Large Language and Vision-Language Models2024📄 Paper🌍 Website
Safe-VLN: Collision Avoidance for Vision-and-Language Navigation of Autonomous Robots Operating in Continuous Environments2023📄 Paper--
SafeBench: A Safety Evaluation Framework for Multimodal Large Language Models2024📄 Paper--
JailBreakV: A Benchmark for Assessing the Robustness of MultiModal Large Language Models against Jailbreak Attacks2024📄 Paper--
SHIELD: An Evaluation Benchmark for Face Spoofing and Forgery Detection with Multimodal Large Language Models2024📄 Paper-💾 Code
Images are Achilles' Heel of Alignment: Exploiting Visual Vulnerabilities for Jailbreaking Multimodal Large Language Models2024📄 Paper--
Jailbreaking Attack against Multimodal Large Language Model2024📄 Paper--
Embodied Red Teaming for Auditing Robotic Foundation Models2025📄 Paper🌍 Website
Safety Guardrails for LLM-Enabled Robots2025📄 Paper--
TitleYearPaperWebsiteCode
Hallucination of Multimodal Large Language Models: A Survey2024📄 Paper--
Bias and Fairness in Large Language Models: A Survey2023📄 Paper--
Fairness and Bias in Multimodal AI: A Survey2024📄 Paper--
Multi-Modal Bias: Introducing a Framework for Stereotypical Bias Assessment beyond Gender and Race in Vision–Language Models2023📄 Paper--
FMBench: Benchmarking Fairness in Multimodal Large Language Models on Medical Tasks2024📄 Paper--
FairCLIP: Harnessing Fairness in Vision-Language Learning2024📄 Paper--
FairMedFM: Fairness Benchmarking for Medical Imaging Foundation Models2024📄 Paper--
Benchmarking Vision Language Models for Cultural Understanding2024📄 Paper--
TitleYearPaperWebsiteCode
Mitigating Hallucinations in Large Vision-Language Models with Instruction Contrastive Decoding2024📄 Paper--
Enhancing Visual-Language Modality Alignment in Large Vision Language Models via Self-Improvement2024📄 Paper--
Assessing and Learning Alignment of Unimodal Vision and Language Models2024📄 Paper🌍 Website-
Extending Multi-modal Contrastive Representations2023📄 Paper-💾 Code
OneLLM: One Framework to Align All Modalities with Language2023📄 Paper-💾 Code
What You See is What You Read? Improving Text-Image Alignment Evaluation2023📄 Paper🌍 Website💾 Code
Critic-V: VLM Critics Help Catch VLM Errors in Multimodal Reasoning2024📄 Paper🌍 Website💾 Code
TitleYearPaperWebsiteCode
VBench: Comprehensive Benchmark Suite for Video Generative Models2023📄 Paper🌍 Website💾 Code
VBench++: Comprehensive and Versatile Benchmark Suite for Video Generative Models2024📄 Paper🌍 Website💾 Code
PhysBench: Benchmarking and Enhancing VLMs for Physical World Understanding2025📄 Paper🌍 Website💾 Code
VideoPhy: Evaluating Physical Commonsense for Video Generation2024📄 Paper🌍 Website💾 Code
WorldSimBench: Towards Video Generation Models as World Simulators2024📄 Paper🌍 Website-
WorldModelBench: Judging Video Generation Models As World Models2025📄 Paper🌍 Website💾 Code
VideoScore: Building Automatic Metrics to Simulate Fine-grained Human Feedback for Video Generation2024📄 Paper🌍 Website💾 Code
WISE: A World Knowledge-Informed Semantic Evaluation for Text-to-Image Generation2025📄 Paper-💾 Code
Content-Rich AIGC Video Quality Assessment via Intricate Text Alignment and Motion-Aware Consistency2025📄 Paper-💾 Code
Your Large Vision-Language Model Only Needs A Few Attention Heads For Visual Grounding2025📄 Paper--
SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities2024📄 Paper🌍 Website💾 Code
Do generative video models understand physical principles?2025📄 Paper🌍 Website💾 Code
PhysGen: Rigid-Body Physics-Grounded Image-to-Video Generation2024📄 Paper🌍 Website💾 Code
How Far is Video Generation from World Model: A Physical Law Perspective2024📄 Paper🌍 Website💾 Code
Imagine while Reasoning in Space: Multimodal Visualization-of-Thought2025📄 Paper--
VBench-2.0: Advancing Video Generation Benchmark Suite for Intrinsic Faithfulness2025📄 Paper🌍 Website💾 Code
TitleYearPaperWebsiteCode
LoRA-Squeeze: Simple and Effective Post-Tuning and In-Tuning Compression of LoRA Modules02/2026📄 Paper--
GRACE: Gated Relational Alignment via Confidence-based Distillation for Efficient VLMs01/2026📄 Paper-
VLMQ: Post-Training Quantization for Large Vision-Language Models2026 (ICLR)📄 Paper-
VILA: On Pre-training for Visual Language Models2023📄 Paper-
SimVLM: Simple Visual Language Model Pretraining with Weak Supervision2021📄 Paper--
LoRA: Low-Rank Adaptation of Large Language Models2021📄 Paper-💾 Code
QLoRA: Efficient Finetuning of Quantized LLMs2023📄 Paper--
Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback2022📄 Paper-💾 Code
RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback2023📄 Paper--
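LoRA and QLoRA in the table above adapt a frozen backbone by learning a low-rank update W + (alpha/r)·BA instead of a full weight delta (QLoRA additionally quantizes the frozen weights to 4-bit). Below is a minimal PyTorch sketch of a LoRA-wrapped linear layer; the rank, scaling, and initialization are common defaults used for illustration, not the exact recipe of either paper.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen nn.Linear plus a trainable low-rank update: y = Wx + (alpha/r) * B(Ax)."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # the pretrained weight stays frozen

        self.lora_a = nn.Linear(base.in_features, r, bias=False)
        self.lora_b = nn.Linear(r, base.out_features, bias=False)
        nn.init.normal_(self.lora_a.weight, std=0.02)
        nn.init.zeros_(self.lora_b.weight)  # update starts at zero, so initial output equals the base output
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scaling * self.lora_b(self.lora_a(x))

if __name__ == "__main__":
    layer = LoRALinear(nn.Linear(768, 768))
    out = layer(torch.randn(2, 768))
    trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
    print(out.shape, trainable)  # only the low-rank A/B matrices are trainable
```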
TitleYearPaperWebsiteCode
A Survey on Bridging VLMs and Synthetic Data2025📄 Paper-💾 Code
Inst-IT: Boosting Multimodal Instance Understanding via Explicit Visual Prompt Instruction Tuning2024📄 Paper🌍 Website💾 Code
SLIP: Self-supervision meets Language-Image Pre-training2021📄 Paper-💾 Code
Synthetic Vision: Training Vision-Language Models to Understand Physics2024📄 Paper--
Synth2: Boosting Visual-Language Models with Synthetic Captions and Image Embeddings2024📄 Paper--
KALIE: Fine-Tuning Vision-Language Models for Open-World Manipulation without Robot Data2024📄 Paper--
Web Agents with World Models: Learning and Leveraging Environment Dynamics in Web Navigation2024📄 Paper--