publications
publications by categories in reversed chronological order. generated by jekyll-scholar.
2026
- [WWW’26] AgentPRM: Process Reward Models for LLM Agents via Step-Wise Promise and Progress. In Proceedings of the ACM Web Conference (WWW), Jun 2026. TL;DR: Process reward models that score each agent step by its "promise" and "progress", improving long-horizon LLM-agent decision-making. (35 citations)
- [arXiv’26] MM-Doc-R1: Training Agents for Long Document Visual Question Answering through Multi-turn Reinforcement Learning. arXiv preprint arXiv:2604.13579, Apr 2026. TL;DR: Multi-turn RL trains agents to answer long-document visual questions by iteratively gathering and reasoning over evidence. (2 citations)
- [ACL’26] Outcome Accuracy is Not Enough: Aligning the Reasoning Process of Reward Models. In Annual Meeting of the Association for Computational Linguistics (ACL), Feb 2026. TL;DR: Reward models can be right for the wrong reasons; aligning their reasoning process, not just outcomes, escapes this deceptive alignment. (8 citations)
2025
- [ICLR’25] RMB: Comprehensively Benchmarking Reward Models in LLM Alignment. In International Conference on Learning Representations (ICLR), Apr 2025. TL;DR: A comprehensive 49-scenario reward-model benchmark that correlates with downstream alignment and exposes generalization gaps in state-of-the-art RMs. (57 citations)
2024
- [EMNLP’24] Reward Modeling Requires Automatic Adjustment Based on Data Quality. In Findings of the Association for Computational Linguistics: EMNLP 2024, Nov 2024. TL;DR: The score gap a reward model assigns reveals preference-data quality, enabling automatic reweighting that stabilizes RM training under noisy labels. (15 citations)
- [EMNLP’24] Improving Discriminative Capability of Reward Models in RLHF Using Contrastive Learning. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP), Nov 2024. TL;DR: Contrastive learning sharpens a reward model’s ability to discriminate between subtly different responses, improving RLHF. (5 citations)