Binghai Wang

PhD Student at Fudan University NLP Lab

Fudan University NLP Lab

Shanghai, China

wangbh25 [at] m.fudan.edu.cn

Hi, I’m Binghai Wang (汪冰海). My given name 冰海 literally means “sea of ice,” which is why I go by Dismeer, a German word meaning frozen sea (冰封之海). I’m a direct PhD student on the combined Master–PhD track (硕博连读) at the Fudan University NLP Lab, co-advised by Prof. Tao Gui, Prof. Qi Zhang, and Prof. Xuanjing Huang, and I expect to graduate in 2028. I received my B.Eng. in Computer Science and Technology from Tongji University. My research centers on reward modeling, RLHF, and scalable oversight for large language models.

The question I care about most is scalable oversight: how do we ensure that the supervision signals we train models with stay faithful, robust, and scalable as model capability keeps growing? My work approaches this through reward modeling and RLHF — from Secrets of RLHF (Part I & II) and scaling human preference modeling, to reward systems for coding agents and aligning the reasoning process (not just the outcomes) of reward models.

I’ve been fortunate to work on these problems in both academia and industry, most recently on post-training (RLHF) at the Qwen team and on reliable visual generation at Seed. See my publications for details.

Research Interests

RLHF

Reinforcement learning from human feedback — aligning model behavior with human values and preferences.

Scalable Oversight

Keeping supervision signals faithful, robust, and scalable as model capability keeps growing.

Reward Modeling

Building reward and verification signals that reflect true intent and resist reward hacking.

Education

2023 — 2028
(expected)

Fudan University · Shanghai, China

Combined Master–PhD (硕博连读), NLP Lab
  • M.S. 2023 – 2025 · Ph.D. 2025 – 2028 (expected)
  • Co-advised by Prof. Tao Gui, Prof. Qi Zhang, and Prof. Xuanjing Huang
  • Research: scalable oversight, RLHF, reward modeling
2019 — 2023

Tongji University · Shanghai, China

B.Eng. in Computer Science and Technology
  • GPA 4.85 / 5.0 · Rank 6 / 120

Experience

2026.06 — present

Seed, ByteDance · Research Intern

Multimodal & World Model Group
  • Reliable visual generation.
2024.10 — 2026.05

Qwen, Alibaba · Research Intern

Post-Training Group
  • 18-month internship focused on RLHF.
  • Reward modeling, preference modeling, and post-training for LLMs.
2022.10 — 2023.03

Douyin, ByteDance · Intern

Data Platform (数据中台)
  • Recommendation system development.

Awards

  • Silver: ICPC Asia Changchun Regional Contest
  • Gold: Shanghai Collegiate Programming Contest
  • Honor: Outstanding Graduate, Tongji University

Selected Publications

  1. Tech Report
    The Verification Horizon: No Silver Bullet for Coding Agent Rewards
    Binghai Wang, Chenlong Zhang, Dayiheng Liu, and 9 more authors
    Jun 2026
    Qwen Team Technical Report
    TL;DR: Verification, not generation, has become the real bottleneck for coding agents, so reward systems must co-evolve with the policy they supervise.
  2. ACL’26
    Outcome Accuracy is Not Enough: Aligning the Reasoning Process of Reward Models
    Binghai Wang, Yantao Liu, Yuxuan Liu, and 13 more authors
    In Annual Meeting of the Association for Computational Linguistics (ACL), Feb 2026
    TL;DR: Reward models can be right for the wrong reasons; aligning their reasoning process, not just their outcomes, escapes this deceptive alignment.
    8 citations
  3. arXiv’25
    WorldPM: Scaling Human Preference Modeling
    Binghai Wang, Runji Lin, Keming Lu, and 17 more authors
    arXiv preprint arXiv:2505.10527, May 2025
    TL;DR: Human preference modeling follows scaling laws: objective preferences scale with data and model size, while subjective ones do not. Adopted in the post-training of Qwen3.
    11 citations
  4. Tech Report
    Secrets of RLHF in Large Language Models Part II: Reward Modeling
    Binghai Wang, Rui Zheng, Lu Chen, and 24 more authors
    Jan 2024
    TL;DR: A deep dive into reward modeling for RLHF: measuring preference strength and using contrastive/meta-learning for robust, iterative RLHF.
    176 citations