Binghai Wang

Fudan University NLP Lab

Shanghai, China

Hi, I’m Binghai Wang (汪冰海) — my given name 冰海 literally means “sea of ice,” which is why I go by Dismeer, a German word meaning frozen sea (冰封之海). I’m a direct PhD student on the combined Master–PhD track (硕博连读) at the Fudan University NLP Lab, co-advised by Prof. Tao Gui, Prof. Qi Zhang, and Prof. Xuanjing Huang, and I expect to graduate in 2028. I received my B.Eng. in Computer Science and Technology from Tongji University. My research centers on reward modeling, RLHF, and scalable oversight for large language models.

The question I care about most is scalable oversight: how do we ensure that the supervision signals we train models with stay faithful, robust, and scalable as model capability keeps growing? My work approaches this through reward modeling and RLHF — from Secrets of RLHF (Part I & II) and scaling human preference modeling, to reward systems for coding agents and aligning the reasoning process (not just the outcomes) of reward models.

I’ve been fortunate to work on these problems both in academia and industry — most recently on post-training (RLHF) at the Qwen team and on reliable visual generation at Seed. See my publications for details.

Research Interests

RLHF

Reinforcement learning from human feedback — aligning model behavior with human values and preferences.

Scalable Oversight

Keeping supervision signals faithful, robust, and scalable as model capability keeps growing.

Reward Modeling

Building reward and verification signals that reflect true intent and resist reward hacking.

Education

2023 — 2028
(expected)

Fudan University · Shanghai, China

Combined Master–PhD (硕博连读), NLP Lab

M.S. 2023 – 2025 · Ph.D. 2025 – 2028 (expected)
Co-advised by Prof. Tao Gui, Prof. Qi Zhang, and Prof. Xuanjing Huang
Research: scalable oversight, RLHF, reward modeling

2019 — 2023

Tongji University · Shanghai, China

B.Eng. in Computer Science and Technology

GPA 4.85 / 5.0 · Rank 6 / 120

Experience

2026.06 — present

Seed, ByteDance · Research Intern

Multimodal & World Model Group

Reliable visual generation.

2024.10 — 2026.05

Qwen, Alibaba · Research Intern

Post-Training Group

18-month internship focused on RLHF.
Reward modeling, preference modeling, and post-training for LLMs.

2022.10 — 2023.03

Douyin, ByteDance · Intern

Data Platform (数据中台)

Recommendation system development.

Awards

Silver · ICPC Asia Changchun Regional Contest
Gold · Shanghai Collegiate Programming Contest
Honor · Outstanding Graduate, Tongji University

Selected Publications

Tech Report
The Verification Horizon: No Silver Bullet for Coding Agent Rewards

Binghai Wang, Chenlong Zhang, Dayiheng Liu, and 9 more authors

Jun 2026

Qwen Team Technical Report

TL;DR: Verification, not generation, has become the real bottleneck for coding agents, so reward systems must co-evolve with the policy they supervise.


A classical intuition holds that verifying a solution is easier than producing one. For today’s coding agents, this intuition is being inverted: generating complex candidate solutions is no longer difficult—reliably verifying them has become the harder problem. Every verifier we can build is only a proxy for human intent, never the intent itself. We characterize the quality of verification signals along three dimensions—scalability, faithfulness, and robustness—and study four reward constructions: a test verifier for general coding tasks, a rubric verifier for frontend tasks, the user as verifier for real-world agent tasks, and an automated agent verifier for long-horizon tasks. No fixed reward function can remain effective as policy capability continues to grow; verification must co-evolve with the generator.
@techreport{wang2026verification,
  title = {The Verification Horizon: No Silver Bullet for Coding Agent Rewards},
  author = {Wang, Binghai and Zhang, Chenlong and Liu, Dayiheng and Zhang, Jiajun and Chen, Jiawei and Chen, Mouxiang and Fang, Rongyao and Zhang, Siyuan and Wang, Xuwu and Jing, Yuheng and Ma, Zeyao and Cui, Zeyu},
  journal = {arXiv preprint arXiv:2606.26300},
  year = {2026},
  month = jun,
  doi = {10.48550/arXiv.2606.26300},
  note = {Qwen Team Technical Report},
}
ACL’26
Outcome Accuracy is Not Enough: Aligning the Reasoning Process of Reward Models

Binghai Wang, Yantao Liu, Yuxuan Liu, and 13 more authors

In Annual Meeting of the Association for Computational Linguistics (ACL), Feb 2026

TL;DR: Reward models can be right for the wrong reasons; aligning their reasoning process, not just outcomes, escapes this deceptive alignment.


Generative Reward Models (GenRMs) and LLM-as-a-Judge exhibit deceptive alignment by producing correct judgments for incorrect reasons. We introduce Rationale Consistency, a fine-grained metric that quantifies the alignment between the model’s reasoning process and human judgment, and a hybrid training signal that combines rationale consistency with outcome accuracy. Our method achieves state-of-the-art performance on RM-Bench (87.1%) and JudgeBench (82%), and improves downstream RLHF, escaping the deceptive alignment trap.
@inproceedings{wang2026outcome,
  title = {Outcome Accuracy is Not Enough: Aligning the Reasoning Process of Reward Models},
  author = {Wang, Binghai and Liu, Yantao and Liu, Yuxuan and Tang, Tianyi and Wang, Shenzhi and Gao, Chang and Zheng, Chujie and Zhang, Yichang and Yu, Le and Liu, Shixuan and Gui, Tao and Zhang, Qi and Huang, Xuanjing and Yu, Bowen and Huang, Fei and Lin, Junyang},
  booktitle = {Annual Meeting of the Association for Computational Linguistics (ACL)},
  year = {2026},
  month = feb,
  doi = {10.48550/arXiv.2602.04649},
}
8 citations
arXiv’25
WorldPM: Scaling Human Preference Modeling

Binghai Wang, Runji Lin, Keming Lu, and 17 more authors

arXiv preprint arXiv:2505.10527, May 2025

TL;DR: Human preference modeling follows scaling laws: objective preferences scale with data and model size, while subjective ones do not. Adopted in the post-training of Qwen3.


Motivated by scaling laws in language modeling, we find that similar laws exist in preference modeling. We propose World Preference Modeling (WorldPM), training on 15M-scale data across models from 1.5B to 72B parameters. Adversarial and objective metrics scale with data and model size, while subjective metrics do not. WorldPM broadly improves generalization across preference datasets and, integrated into RLHF, yields notable gains on in-house and public evaluations.
@article{wang2025worldpm,
  title = {WorldPM: Scaling Human Preference Modeling},
  author = {Wang, Binghai and Lin, Runji and Lu, Keming and Yu, Le and Zhang, Zhenru and Huang, Fei and Zheng, Chujie and Dang, Kai and Fan, Yang and Ren, Xingzhang and Yang, An and Hui, Binyuan and Liu, Dayiheng and Gui, Tao and Zhang, Qi and Huang, Xuanjing and Jiang, Yu-Gang and Yu, Bowen and Zhou, Jingren and Lin, Junyang},
  journal = {arXiv preprint arXiv:2505.10527},
  year = {2025},
  month = may,
  doi = {10.48550/arXiv.2505.10527},
}
11 citations
Tech Report
Secrets of RLHF in Large Language Models Part II: Reward Modeling

Binghai Wang, Rui Zheng, Lu Chen, and 24 more authors

Jan 2024

TL;DR: A deep dive into reward modeling for RLHF: measuring preference strength and using contrastive/meta-learning for robust, iterative RLHF.


Reward models are trained as proxies for human preferences to drive RLHF optimization. We address two challenges: (1) from a data perspective, we measure preference strength via a multi-RM voting mechanism and mitigate incorrect/ambiguous preferences; (2) from an algorithmic standpoint, we introduce contrastive learning and meta-learning to improve reward model generalization and support iterative RLHF.
@techreport{wang2024secrets,
  title = {Secrets of RLHF in Large Language Models Part II: Reward Modeling},
  author = {Wang, Binghai and Zheng, Rui and Chen, Lu and Liu, Yan and Dou, Shihan and Huang, Caishuang and Shen, Wei and Jin, Senjie and Zhou, Enyu and Shi, Chenyu and Gao, Songyang and Xu, Nuo and Zhou, Yuhao and Fan, Xiaoran and Xi, Zhiheng and Zhao, Jun and Wang, Xiao and Ji, Tao and Yan, Hang and Shen, Lixing and Chen, Zhan and Gui, Tao and Zhang, Qi and Qiu, Xipeng and Huang, Xuanjing and Wu, Zuxuan and Jiang, Yu-Gang},
  journal = {arXiv preprint arXiv:2401.06080},
  year = {2024},
  month = jan,
  doi = {10.48550/arXiv.2401.06080},
}
176 citations