MASS: Motion-Aware Spatial-Temporal Grounding for Physics Reasoning and Comprehension in Vision-Language Models
Published in CVPR 2026
MASS injects motion-aware spatial-temporal signals into VLMs, improving physics reasoning on a new benchmark of 4,350 real-world and AI-generated videos.
Recommended citation: Xiyang Wu, Zongxia Li, Jihui Jin, Guangyao Shi, Gouthaman KV, Vishnu Raj, Nilotpal Sinha, Jingxi Chen, Fan Du, Dinesh Manocha. (2026). "MASS: Motion-Aware Spatial-Temporal Grounding for Physics Reasoning and Comprehension in Vision-Language Models." CVPR.
