Binghai Wang
PhD Student at Fudan University NLP Lab
Hi, I’m Binghai Wang (汪冰海) — my given name 冰海 literally means “sea of ice,” which is why I go by Dismeer, a German word meaning frozen sea (冰封之海). I’m a direct PhD student on the combined Master–PhD track (硕博连读) at the Fudan University NLP Lab, co-advised by Prof. Tao Gui, Prof. Qi Zhang, and Prof. Xuanjing Huang, and I expect to graduate in 2028. I received my B.Eng. in Computer Science and Technology from Tongji University. My research centers on reward modeling, RLHF, and scalable oversight for large language models.
The question I care about most is scalable oversight: how do we ensure that the supervision signals we train models with stay faithful, robust, and scalable as model capability keeps growing? My work approaches this through reward modeling and RLHF — from Secrets of RLHF (Part I & II) and scaling human preference modeling, to reward systems for coding agents and aligning the reasoning process (not just the outcomes) of reward models.
I’ve been fortunate to work on these problems in both academia and industry, most recently on post-training (RLHF) at the Qwen team and on reliable visual generation at Seed. See my publications for details.
Research Interests
RLHF
Reinforcement learning from human feedback — aligning model behavior with human values and preferences.
Scalable Oversight
Keeping supervision signals faithful, robust, and scalable as model capability keeps growing.
Reward Modeling
Building reward and verification signals that reflect true intent and resist reward hacking.
Education
Fudan University · Shanghai, China
- M.S. 2023 – 2025 · Ph.D. 2025 – 2028 (expected)
- Co-advised by Prof. Tao Gui, Prof. Qi Zhang, and Prof. Xuanjing Huang
- Research: scalable oversight, RLHF, reward modeling
Tongji University · Shanghai, China
- GPA 4.85 / 5.0 · Rank 6 / 120
Experience
Seed, ByteDance · Research Intern
- Reliable visual generation.
Qwen, Alibaba · Research Intern
- 18-month internship focused on RLHF.
- Reward modeling, preference modeling, and post-training for LLMs.
Douyin · ByteDance Intern
- Recommendation system development.
Awards
- Silver · ICPC Asia Changchun Regional Contest
- Gold · Shanghai Collegiate Programming Contest
- Honor · Outstanding Graduate, Tongji University
Selected Publications
- ACL’26 · Outcome Accuracy is Not Enough: Aligning the Reasoning Process of Reward Models. In Annual Meeting of the Association for Computational Linguistics (ACL), Feb 2026. TL;DR: Reward models can be right for the wrong reasons; aligning their reasoning process, not just their outcomes, escapes this deceptive alignment. 8 citations.