VisPlay: Self-Evolving Vision-Language Models from Images

Published in CVPR 2026

VisPlay is a self-evolving reinforcement learning (RL) framework that enables vision-language models (VLMs) to autonomously improve their reasoning abilities from large amounts of unlabeled image data.

arXiv | Project Page

Recommended citation: Yicheng He*, Chengsong Huang*, Zongxia Li*, Jiaxin Huang, Yonghui Yang. (2026). "VisPlay: Self-Evolving Vision-Language Models from Images." CVPR.
Download Paper