publications | Binghai Wang

2026

Tech Report
The Verification Horizon: No Silver Bullet for Coding Agent Rewards

Binghai Wang, Chenlong Zhang, Dayiheng Liu, and 9 more authors

Jun 2026

Qwen Team Technical Report

TL;DR: Verification, not generation, has become the real bottleneck for coding agents, so reward systems must co-evolve with the policy they supervise.

A classical intuition holds that verifying a solution is easier than producing one. For today’s coding agents, this intuition is being inverted: generating complex candidate solutions is no longer difficult—reliably verifying them has become the harder problem. Every verifier we can build is only a proxy for human intent, never the intent itself. We characterize the quality of verification signals along three dimensions—scalability, faithfulness, and robustness—and study four reward constructions: a test verifier for general coding tasks, a rubric verifier for frontend tasks, the user as verifier for real-world agent tasks, and an automated agent verifier for long-horizon tasks. No fixed reward function can remain effective as policy capability continues to grow; verification must co-evolve with the generator.

@techreport{wang2026verification,
  title = {The Verification Horizon: No Silver Bullet for Coding Agent Rewards},
  author = {Wang, Binghai and Zhang, Chenlong and Liu, Dayiheng and Zhang, Jiajun and Chen, Jiawei and Chen, Mouxiang and Fang, Rongyao and Zhang, Siyuan and Wang, Xuwu and Jing, Yuheng and Ma, Zeyao and Cui, Zeyu},
  journal = {arXiv preprint arXiv:2606.26300},
  year = {2026},
  month = jun,
  doi = {10.48550/arXiv.2606.26300},
  note = {Qwen Team Technical Report},
}

WWW’26
AgentPRM: Process Reward Models for LLM Agents via Step-Wise Promise and Progress

Zhiheng Xi, Chenyang Liao, Guanyu Li, and 11 more authors

In Proceedings of the ACM Web Conference (WWW), Jun 2026

TL;DR: Process reward models that score each agent step by its "promise" and "progress", improving long-horizon LLM-agent decision-making.

We introduce AgentPRM, a process reward model that supervises LLM agents at the step level. Rather than only rewarding final outcomes, AgentPRM scores intermediate steps by their promise (expected future usefulness) and progress (concrete advancement toward the goal), yielding denser and more informative training signals for long-horizon agentic tasks.

@inproceedings{xi2026agentprm,
  title = {AgentPRM: Process Reward Models for LLM Agents via Step-Wise Promise and Progress},
  author = {Xi, Zhiheng and Liao, Chenyang and Li, Guanyu and Zhang, Zhihao and Chen, Wenxiang and Wang, Binghai and Jin, Senjie and Zhou, Yuhao and Guan, Jian and Wu, Wei and Ji, Tao and Gui, Tao and Zhang, Qi and Huang, Xuanjing},
  booktitle = {Proceedings of the ACM Web Conference (WWW)},
  year = {2026},
  month = jun,
  doi = {10.1145/3774904.3792551},
}

35 citations
arXiv’26
EVPO: Explained Variance Policy Optimization for Adaptive Critic Utilization in LLM Post-Training

Chengjun Pan, Shichun Liu, Jiahang Lin, and 10 more authors

arXiv preprint arXiv:2604.19485, Apr 2026

TL;DR: Uses explained variance to adaptively decide how much to trust the critic, stabilizing and accelerating LLM post-training.

EVPO introduces explained-variance policy optimization, which adaptively modulates critic utilization during LLM post-training. By measuring how much variance the critic actually explains, EVPO down-weights an unreliable critic and leans on it when trustworthy, improving training stability and sample efficiency.

@article{pan2026evpo,
  title = {EVPO: Explained Variance Policy Optimization for Adaptive Critic Utilization in LLM Post-Training},
  author = {Pan, Chengjun and Liu, Shichun and Lin, Jiahang and Zhu, Dingwei and Zhang, Jiazheng and Dou, Shihan and Gao, Songyang and Han, Zhenhua and Wang, Binghai and Zheng, Rui and Huang, Xuanjing and Gui, Tao and Feng, Yansong},
  journal = {arXiv preprint arXiv:2604.19485},
  year = {2026},
  month = apr,
  doi = {10.48550/arXiv.2604.19485},
}

arXiv’26
MM-Doc-R1: Training Agents for Long Document Visual Question Answering through Multi-turn Reinforcement Learning

Jiahang Lin, Kai Hu, Binghai Wang, and 12 more authors

arXiv preprint arXiv:2604.13579, Apr 2026

TL;DR: Multi-turn RL trains agents to answer long-document visual questions by iteratively gathering and reasoning over evidence.

MM-Doc-R1 trains multimodal agents for visual question answering over long documents. Through multi-turn reinforcement learning, the agent learns to navigate documents, gather evidence across pages, and reason iteratively, outperforming single-pass baselines on long-document VQA.

@article{lin2026mmdocr1,
  title = {MM-Doc-R1: Training Agents for Long Document Visual Question Answering through Multi-turn Reinforcement Learning},
  author = {Lin, Jiahang and Hu, Kai and Wang, Binghai and Zhou, Yuhao and Xi, Zhiheng and Guo, Honglin and Liu, Shichun and Wang, Junzhe and Dou, Shihan and Zhou, Enyu and Yan, Hang and Han, Zhenhua and Gui, Tao and Zhang, Qi and Huang, Xuanjing},
  journal = {arXiv preprint arXiv:2604.13579},
  year = {2026},
  month = apr,
  doi = {10.48550/arXiv.2604.13579},
}

2 citations
arXiv’26
HopChain: Multi-Hop Data Synthesis for Generalizable Vision-Language Reasoning

Shenzhi Wang, Shixuan Liu, Jingren Zhou, and 8 more authors

arXiv preprint arXiv:2603.17024, Mar 2026

TL;DR: Synthesizing multi-hop training data yields vision-language models that generalize to harder compositional reasoning.

HopChain proposes a multi-hop data synthesis pipeline for vision-language reasoning. By constructing chains of intermediate reasoning hops, it produces training data that improves the generalization of vision-language models on compositional, multi-step reasoning tasks.

@article{wang2026hopchain,
  title = {HopChain: Multi-Hop Data Synthesis for Generalizable Vision-Language Reasoning},
  author = {Wang, Shenzhi and Liu, Shixuan and Zhou, Jingren and Gao, Chang and Chen, Xiong-Hui and Wang, Binghai and Yang, An and Song, Shiji and Yu, Bowen and Huang, Gao and Lin, Junyang},
  journal = {arXiv preprint arXiv:2603.17024},
  year = {2026},
  month = mar,
  doi = {10.48550/arXiv.2603.17024},
}

1 citation
ACL’26
Outcome Accuracy is Not Enough: Aligning the Reasoning Process of Reward Models

Binghai Wang, Yantao Liu, Yuxuan Liu, and 13 more authors

In Annual Meeting of the Association for Computational Linguistics (ACL), Feb 2026

TL;DR: Reward models can be right for the wrong reasons; aligning their reasoning process, not just their outcomes, escapes this deceptive alignment.

Generative Reward Models (GenRMs) and LLM-as-a-Judge exhibit deceptive alignment by producing correct judgments for incorrect reasons. We introduce Rationale Consistency, a fine-grained metric that quantifies the alignment between the model’s reasoning process and human judgment, and a hybrid training signal that combines rationale consistency with outcome accuracy. Our method achieves state-of-the-art performance on RM-Bench (87.1%) and JudgeBench (82%), and improves downstream RLHF, escaping the deceptive alignment trap.

@inproceedings{wang2026outcome,
  title = {Outcome Accuracy is Not Enough: Aligning the Reasoning Process of Reward Models},
  author = {Wang, Binghai and Liu, Yantao and Liu, Yuxuan and Tang, Tianyi and Wang, Shenzhi and Gao, Chang and Zheng, Chujie and Zhang, Yichang and Yu, Le and Liu, Shixuan and Gui, Tao and Zhang, Qi and Huang, Xuanjing and Yu, Bowen and Huang, Fei and Lin, Junyang},
  booktitle = {Annual Meeting of the Association for Computational Linguistics (ACL)},
  year = {2026},
  month = feb,
  doi = {10.48550/arXiv.2602.04649},
}

8 citations

2025

arXiv’25
WorldPM: Scaling Human Preference Modeling

Binghai Wang, Runji Lin, Keming Lu, and 17 more authors

arXiv preprint arXiv:2505.10527, May 2025

TL;DR: Human preference modeling follows scaling laws: objective preferences scale with data and model size, while subjective ones do not. Adopted in the post-training of Qwen3.

Motivated by scaling laws in language modeling, we find that similar laws exist in preference modeling. We propose World Preference Modeling (WorldPM), training on 15M-scale data across models from 1.5B to 72B parameters. Adversarial and objective metrics scale with data and model size, while subjective metrics do not. WorldPM broadly improves generalization across preference datasets and, integrated into RLHF, yields notable gains on in-house and public evaluations.

@article{wang2025worldpm,
  title = {WorldPM: Scaling Human Preference Modeling},
  author = {Wang, Binghai and Lin, Runji and Lu, Keming and Yu, Le and Zhang, Zhenru and Huang, Fei and Zheng, Chujie and Dang, Kai and Fan, Yang and Ren, Xingzhang and Yang, An and Hui, Binyuan and Liu, Dayiheng and Gui, Tao and Zhang, Qi and Huang, Xuanjing and Jiang, Yu-Gang and Yu, Bowen and Zhou, Jingren and Lin, Junyang},
  journal = {arXiv preprint arXiv:2505.10527},
  year = {2025},
  month = may,
  doi = {10.48550/arXiv.2505.10527},
}

11 citations
Tech Report
Qwen3 Technical Report

Qwen Team

May 2025

TL;DR: The Qwen3 family of open foundation models, unifying thinking and non-thinking modes with strong reasoning, multilingual, and agentic capabilities.

Qwen3 is a series of open-weight large language models spanning a wide range of sizes, including both dense and Mixture-of-Experts architectures. A key feature is the integration of thinking (for complex multi-step reasoning) and non-thinking (for fast, context-driven responses) modes within a single model, together with a thinking-budget mechanism. Qwen3 delivers state-of-the-art results among open models across reasoning, code, math, multilingual, and agentic benchmarks.

@techreport{qwen2025qwen3,
  title = {Qwen3 Technical Report},
  author = {Team, Qwen},
  journal = {arXiv preprint arXiv:2505.09388},
  year = {2025},
  month = may,
  doi = {10.48550/arXiv.2505.09388},
}

ICLR’25
RMB: Comprehensively Benchmarking Reward Models in LLM Alignment

Enyu Zhou, Guodong Zheng, Binghai Wang, and 11 more authors

In International Conference on Learning Representations (ICLR), Apr 2025

TL;DR: A comprehensive 49-scenario reward-model benchmark that correlates with downstream alignment and exposes generalization gaps in state-of-the-art RMs.

Reward models (RMs) guide the alignment of large language models. We propose RMB, a comprehensive RM benchmark covering over 49 real-world scenarios with both pairwise and Best-of-N (BoN) evaluations, and demonstrate a positive correlation with downstream alignment performance. Our analysis reveals generalization defects of state-of-the-art RMs and highlights the potential of generative RMs.

@inproceedings{zhou2025rmb,
  title = {RMB: Comprehensively Benchmarking Reward Models in LLM Alignment},
  author = {Zhou, Enyu and Zheng, Guodong and Wang, Binghai and Xi, Zhiheng and Dou, Shihan and Bao, Rong and Shen, Wei and Xiong, Limao and Fan, Jessica and Mou, Yurong and Zheng, Rui and Gui, Tao and Zhang, Qi and Huang, Xuanjing},
  booktitle = {International Conference on Learning Representations (ICLR)},
  year = {2025},
  month = apr,
  doi = {10.48550/arXiv.2410.09893},
}

57 citations

2024

EMNLP’24
Reward Modeling Requires Automatic Adjustment Based on Data Quality

Binghai Wang, Rui Zheng, Lu Chen, and 7 more authors

In Findings of the Association for Computational Linguistics: EMNLP 2024, Nov 2024

TL;DR: The score gap a reward model assigns reveals preference-data quality, enabling automatic reweighting that stabilizes RM training under noisy labels.

In Reinforcement Learning from Human Feedback (RLHF), the reward model plays a crucial role in aligning language model outputs with human values. The human preference data used to train the reward model consists of a prompt and a response pair, with humans annotating which response better aligns with human value preferences. Due to the complexity and subjectivity of the annotation task, multiple organizations including OpenAI and Anthropic report significant noise in the human preference datasets, leading to instability and deviation in reward model training from human values. We discover that the difference in scores assigned to response pairs by the reward model effectively indicates the quality of data, and data of varying qualities show significant distinctions in reward model training. We introduce a method that automatically adjusts reward modeling based on data quality, reducing the impact of noise and making full use of the dataset. Experiments on multiple human preference datasets demonstrate that our method stabilizes reward model training and significantly enhances the alignment performance of RLHF.

@inproceedings{wang2024reward,
  title = {Reward Modeling Requires Automatic Adjustment Based on Data Quality},
  author = {Wang, Binghai and Zheng, Rui and Chen, Lu and Xi, Zhiheng and Shen, Wei and Zhou, Yuhao and Yan, Dong and Gui, Tao and Zhang, Qi and Huang, Xuanjing},
  booktitle = {Findings of the Association for Computational Linguistics: EMNLP 2024},
  year = {2024},
  month = nov,
  pages = {4041--4064},
  publisher = {Association for Computational Linguistics},
  address = {Miami, Florida, USA},
  doi = {10.18653/v1/2024.findings-emnlp.234},
}

15 citations
EMNLP’24
Improving Discriminative Capability of Reward Models in RLHF Using Contrastive Learning

Lu Chen, Rui Zheng, Binghai Wang, and 9 more authors

In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP), Nov 2024

TL;DR: Contrastive learning sharpens a reward model’s ability to discriminate between subtly different responses, improving RLHF.

Reward models in RLHF must reliably distinguish between responses of differing quality. We introduce a contrastive learning objective that improves the discriminative capability of reward models, helping them separate subtly different responses and yielding better downstream alignment performance.

@inproceedings{chen2024improving,
  title = {Improving Discriminative Capability of Reward Models in RLHF Using Contrastive Learning},
  author = {Chen, Lu and Zheng, Rui and Wang, Binghai and Jin, Senjie and Huang, Caishuang and Ye, Junjie and Zhang, Zhihao and Zhou, Yuhao and Xi, Zhiheng and Gui, Tao and Zhang, Qi and Huang, Xuanjing},
  booktitle = {Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
  year = {2024},
  month = nov,
  pages = {15270--15283},
  publisher = {Association for Computational Linguistics},
  address = {Miami, Florida, USA},
  doi = {10.18653/v1/2024.emnlp-main.852},
}

5 citations
Tech Report
Secrets of RLHF in Large Language Models Part II: Reward Modeling

Binghai Wang, Rui Zheng, Lu Chen, and 24 more authors

Jan 2024

TL;DR: A deep dive into reward modeling for RLHF: measuring preference strength and using contrastive/meta-learning for robust, iterative RLHF.

Reward models are trained as proxies for human preferences to drive RLHF optimization. We address two challenges: (1) from a data perspective, we measure preference strength via a multi-RM voting mechanism and mitigate incorrect/ambiguous preferences; (2) from an algorithmic standpoint, we introduce contrastive learning and meta-learning to improve reward model generalization and support iterative RLHF.

@techreport{wang2024secrets,
  title = {Secrets of RLHF in Large Language Models Part II: Reward Modeling},
  author = {Wang, Binghai and Zheng, Rui and Chen, Lu and Liu, Yan and Dou, Shihan and Huang, Caishuang and Shen, Wei and Jin, Senjie and Zhou, Enyu and Shi, Chenyu and Gao, Songyang and Xu, Nuo and Zhou, Yuhao and Fan, Xiaoran and Xi, Zhiheng and Zhao, Jun and Wang, Xiao and Ji, Tao and Yan, Hang and Shen, Lixing and Chen, Zhan and Gui, Tao and Zhang, Qi and Qiu, Xipeng and Huang, Xuanjing and Wu, Zuxuan and Jiang, Yu-Gang},
  journal = {arXiv preprint arXiv:2401.06080},
  year = {2024},
  month = jan,
  doi = {10.48550/arXiv.2401.06080},
}

176 citations

2023

Tech Report
Secrets of RLHF in Large Language Models Part I: PPO

Rui Zheng, Shihan Dou, Songyang Gao, and 24 more authors

Jul 2023

TL;DR: A systematic dissection of PPO for RLHF, introducing PPO-max for markedly more stable LLM policy training.

We dissect the framework of RLHF and re-evaluate the inner workings of PPO, identifying policy constraints as the key factor for effective PPO. We propose PPO-max, an advanced variant that substantially improves the training stability of the policy model, and release technical reports, reward models, and PPO code to support open research on LLM alignment.

@techreport{zheng2023secrets,
  title = {Secrets of RLHF in Large Language Models Part I: PPO},
  author = {Zheng, Rui and Dou, Shihan and Gao, Songyang and Hua, Yuan and Shen, Wei and Wang, Binghai and Liu, Yan and Jin, Senjie and Liu, Qin and Zhou, Yuhao and Xiong, Limao and Chen, Lu and Xi, Zhiheng and Xu, Nuo and Lai, Wenbin and Zhu, Minghao and Chang, Cheng and Yin, Zhangyue and Weng, Rongxiang and Cheng, Wensen and Huang, Haoran and Sun, Tianxiang and Yan, Hang and Gui, Tao and Zhang, Qi and Qiu, Xipeng and Huang, Xuanjing},
  journal = {arXiv preprint arXiv:2307.04964},
  year = {2023},
  month = jul,
  doi = {10.48550/arXiv.2307.04964},
}

276 citations