publications

Publications by category in reverse chronological order. Generated by jekyll-scholar.

2026

  1. Tech Report
    The Verification Horizon: No Silver Bullet for Coding Agent Rewards
    Binghai Wang, Chenlong Zhang, Dayiheng Liu, and 9 more authors
    Jun 2026
    Qwen Team Technical Report
    TL;DR: Verification, not generation, has become the real bottleneck for coding agents, so reward systems must co-evolve with the policy they supervise.
  2. WWW’26
    AgentPRM: Process Reward Models for LLM Agents via Step-Wise Promise and Progress
    Zhiheng Xi, Chenyang Liao, Guanyu Li, and 11 more authors
    In Proceedings of the ACM Web Conference (WWW), Jun 2026
    TL;DR: Process reward models that score each agent step by its "promise" and "progress", improving long-horizon LLM-agent decision-making.
    35 citations
  3. arXiv’26
    EVPO: Explained Variance Policy Optimization for Adaptive Critic Utilization in LLM Post-Training
    Chengjun Pan, Shichun Liu, Jiahang Lin, and 10 more authors
    arXiv preprint arXiv:2604.19485, Apr 2026
    TL;DR: Uses explained variance to adaptively decide how much to trust the critic, stabilizing and accelerating LLM post-training.
  4. arXiv’26
    MM-Doc-R1: Training Agents for Long Document Visual Question Answering through Multi-turn Reinforcement Learning
    Jiahang Lin, Kai Hu, Binghai Wang, and 12 more authors
    arXiv preprint arXiv:2604.13579, Apr 2026
    TL;DR: Multi-turn RL trains agents to answer long-document visual questions by iteratively gathering and reasoning over evidence.
    2 citations
  5. arXiv’26
    HopChain: Multi-Hop Data Synthesis for Generalizable Vision-Language Reasoning
    Shenzhi Wang, Shixuan Liu, Jingren Zhou, and 8 more authors
    arXiv preprint arXiv:2603.17024, Mar 2026
    TL;DR: Synthesizing multi-hop training data yields vision-language models that generalize to harder compositional reasoning.
    1 citation
  6. ACL’26
    Outcome Accuracy is Not Enough: Aligning the Reasoning Process of Reward Models
    Binghai Wang, Yantao Liu, Yuxuan Liu, and 13 more authors
    In Annual Meeting of the Association for Computational Linguistics (ACL), Feb 2026
    TL;DR: Reward models can be right for the wrong reasons; aligning their reasoning process, not just outcomes, escapes this deceptive alignment.
    8 citations

2025

  1. arXiv’25
    WorldPM: Scaling Human Preference Modeling
    Binghai Wang, Runji Lin, Keming Lu, and 17 more authors
    arXiv preprint arXiv:2505.10527, May 2025
    TL;DR: Human preference modeling follows scaling laws: objective preferences scale with data and model size, while subjective ones do not. Adopted in the post-training of Qwen3.
    11 citations
  2. Tech Report
    Qwen3 Technical Report
    Qwen Team
    May 2025
    TL;DR: The Qwen3 family of open foundation models, unifying thinking and non-thinking modes with strong reasoning, multilingual, and agentic capabilities.
  3. ICLR’25
    RMB: Comprehensively Benchmarking Reward Models in LLM Alignment
    Enyu Zhou, Guodong Zheng, Binghai Wang, and 11 more authors
    In International Conference on Learning Representations (ICLR), Apr 2025
    TL;DR: A comprehensive 49-scenario reward-model benchmark that correlates with downstream alignment and exposes generalization gaps in state-of-the-art RMs.
    57 citations

2024

  1. EMNLP’24
    Reward Modeling Requires Automatic Adjustment Based on Data Quality
    Binghai Wang, Rui Zheng, Lu Chen, and 7 more authors
    In Findings of the Association for Computational Linguistics: EMNLP 2024, Nov 2024
    TL;DR: The score gap a reward model assigns reveals preference-data quality, enabling automatic reweighting that stabilizes RM training under noisy labels.
    15 citations
  2. EMNLP’24
    Improving Discriminative Capability of Reward Models in RLHF Using Contrastive Learning
    Lu Chen, Rui Zheng, Binghai Wang, and 9 more authors
    In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP), Nov 2024
    TL;DR: Contrastive learning sharpens a reward model’s ability to discriminate between subtly different responses, improving RLHF.
    5 citations
  3. Tech Report
    Secrets of RLHF in Large Language Models Part II: Reward Modeling
    Binghai Wang, Rui Zheng, Lu Chen, and 24 more authors
    Jan 2024
    TL;DR: A deep dive into reward modeling for RLHF: measuring preference strength and using contrastive/meta-learning for robust, iterative RLHF.
    176 citations

2023

  1. Tech Report
    Secrets of RLHF in Large Language Models Part I: PPO
    Rui Zheng, Shihan Dou, Songyang Gao, and 24 more authors
    Jul 2023
    TL;DR: A systematic dissection of PPO for RLHF, introducing PPO-max for markedly more stable LLM policy training.
    276 citations