EvoLM: Self-Evolving Language Models through Co-Evolved Discriminative Rubrics
Shuyue Stella Li, Rui Xin, Teng Xiao, Yike Wang, Rulin Shao, Zoey Hao
- Published: May 5, 2026, 15:31 UTC
- Summary length: 448 words
- Relevance score: 85%
Problem
This paper addresses the limitations of current post-training methods for language models, which typically rely on external supervision such as human annotations, proprietary models, or scalar reward models. Because the model can improve only as far as its external evaluator allows, these methods impose an inherent ceiling on performance. The authors instead propose to leverage a model's own evaluative capacity to generate self-improvement signals, a largely untapped resource in the literature. This work is presented as a preprint and has not yet undergone peer review.
Method
The core technical contribution is EvoLM, a self-evolving post-training method built on co-evolved discriminative rubrics. EvoLM trains two interdependent components within a single language model: (1) a rubric generator that writes instance-specific evaluation criteria, optimized for discriminative utility, i.e., for how much they improve a frozen judge's ability to separate preferred from dispreferred responses; and (2) a policy trained with the rubric-based judge scores as rewards. The rubrics are generated from the model's own outputs through temporal contrast with earlier checkpoints, eliminating the need for human annotation or external supervision. The authors implement EvoLM on Qwen3-8B.
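To make the training loop concrete, here is a minimal Python sketch of the co-evolution described above. It is a rendering of the summary, not the authors' code: every helper (`generate`, `propose_rubric`, `judge_score`) and the margin-based rubric reward are illustrative assumptions, and the actual policy and rubric update rules are left as commented placeholders.

```python
"""Minimal sketch of the EvoLM co-evolution loop described above.

Every helper below is a hypothetical stand-in for illustration only;
none of these names or signatures come from the paper.
"""
from dataclasses import dataclass
import random


@dataclass
class Rubric:
    criteria: list[str]  # instance-specific evaluation criteria


def generate(checkpoint: str, prompt: str) -> str:
    """Hypothetical: sample a response from the named checkpoint."""
    return f"<{checkpoint} response to: {prompt!r}>"


def propose_rubric(checkpoint: str, prompt: str) -> Rubric:
    """Hypothetical: the rubric generator (the same LM) writes
    instance-specific criteria for judging responses to this prompt."""
    return Rubric(criteria=[f"criterion for {prompt!r}"])


def judge_score(prompt: str, response: str, rubric: Rubric) -> float:
    """Hypothetical: a frozen judge scores the response against the rubric."""
    return random.random()


def evolve(prompts: list[str], rounds: int = 3) -> None:
    current, previous = "policy@t", "policy@t-1"
    for t in range(rounds):
        for prompt in prompts:
            # Temporal contrast: treat the current checkpoint's output as
            # preferred over an earlier checkpoint's output (no human labels).
            chosen = generate(current, prompt)
            rejected = generate(previous, prompt)

            # Rubric step: reward the rubric generator for discriminative
            # utility, i.e. the margin by which the frozen judge separates
            # the preferred from the dispreferred response under the rubric.
            rubric = propose_rubric(current, prompt)
            margin = (judge_score(prompt, chosen, rubric)
                      - judge_score(prompt, rejected, rubric))
            # rubric_update(rubric, reward=margin)  # hypothetical RL update

            # Policy step: use the rubric-based judge score as the reward
            # for the policy (the update rule is unspecified in the summary).
            reward = judge_score(prompt, chosen, rubric)
            # policy_update(prompt, chosen, reward)  # hypothetical RL update

            print(f"round {t}: margin={margin:+.2f}, reward={reward:.2f}")
        previous, current = current, f"policy@t+{t + 1}"


evolve(["Explain photosynthesis.", "Summarize the French Revolution."])
```

The point the sketch captures is that both learning signals come from the same frozen judge: rubrics are rewarded for widening the judge's margin between checkpoints, and the policy is rewarded by the judge's rubric-conditioned score.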
Results
EvoLM shows substantial gains on both reward-modeling and policy benchmarks. The rubric generator outperforms GPT-4.1 on RewardBench-2 by 25.7%. The co-trained policy reaches an average score of 69.3% on the OLMo3-Adapt suite, surpassing policies trained with GPT-4.1-prompted rubrics by 3.9% and policies trained with the state-of-the-art 8B reward model SkyWork-RM by 16%. These results indicate that the self-evolving approach improves the model without any external evaluative signal.
Limitations
The authors acknowledge that EvoLM has so far been demonstrated only on Qwen3-8B and may not generalize to other language model architectures. The reliance on temporal contrast for rubric generation may also introduce biases inherited from the model's earlier outputs. The paper does not address scalability to larger models or other domains, nor does it examine the effect of the self-evolving mechanism on long-term model stability and performance.
Why it matters
The implications of this work are significant for the field of natural language processing and model training methodologies. By demonstrating that a language model can self-generate evaluative rubrics and use them for self-improvement, this research opens avenues for developing more autonomous and efficient training paradigms. It reduces dependency on external supervision, potentially leading to more scalable and adaptable language models. This approach could influence future research on self-supervised learning and the design of robust evaluation frameworks for AI systems.
Authors: Shuyue Stella Li, Rui Xin, Teng Xiao, Yike Wang, Rulin Shao, Zoey Hao, Melanie Sclar, Sewoong Oh et al.
Source: arXiv:2605.03871
URL: https://arxiv.org/abs/2605.03871v1