Semantically-Aware Rewards for Open-Ended R1 Training in Free-Form Generation
Published in Preprint, 2025
We propose PrefBERT, a scoring model for evaluating open-ended long-form generation in GRPO training. It provides better semantic reward feedback than ROUGE-L and BERTScore.
| arXiv | Code |
Recommended citation: Zongxia Li, Yapei Chang, Yuhang Zhou, Xiyang Wu, Zichao Liang, Yoo Yeon Sung, Jordan Lee Boyd-Graber. (2025). "Semantically-Aware Rewards for Open-Ended R1 Training in Free-Form Generation." Preprint.
