Semantically-Aware Rewards for Open-Ended R1 Training in Free-Form Generation

Preprint, 2025

We propose PrefBERT, a scoring model for evaluating open-ended long-form generation in GRPO, and show that it provides better semantic reward feedback than ROUGE-L and BERTScore.

arXiv

Code

Recommended citation: Zongxia Li, Yapei Chang, Yuhang Zhou, Xiyang Wu, Zichao Liang, Yoo Yeon Sung, Jordan Lee Boyd-Graber. (2025). "Semantically-Aware Rewards for Open-Ended R1 Training in Free-Form Generation." Preprint.
Download Paper
