Notable training methods

DelTA: Discriminative Token Credit Assignment for Reinforcement Learning from Verifiable Rewards

Kaiyi Zhang, Wei Wu, Yankai Lin

Published
May 20, 2026 — 17:53 UTC

Problem
This paper addresses a gap in the understanding of how response-level rewards in Reinforcement Learning from Verifiable Rewards (RLVR) translate into token-level probability adjustments. Despite the effectiveness of RLVR in enhancing the reasoning capabilities of large language models, the mechanisms by which these updates influence individual token probabilities remain inadequately explored. The authors present this work as a preprint, indicating that it has not yet undergone peer review.

Method
The core contribution of this work is the introduction of DelTA (Discriminative Token Credit Assignment), which provides a novel approach to token credit assignment in RLVR. The authors propose a discriminator framework that interprets the policy-gradient update direction as a linear discriminator over token-gradient vectors. This framework constructs centroids from positive and negative token-gradient vectors using advantage-weighted averaging. However, the authors identify that traditional centroid construction can be skewed by high-frequency shared patterns, such as formatting tokens, which can obscure the identification of discriminative token directions. DelTA addresses this by estimating token coefficients that amplify specific token-gradient directions associated with high-reward responses while downweighting shared or less informative gradients. This reweighting modifies the self-normalized RLVR surrogate, enhancing the contrastiveness of the centroids and refining the RLVR update direction.

Results
DelTA was evaluated on seven mathematical benchmarks, where it demonstrated significant improvements over strong same-scale baselines. Specifically, it achieved an average performance increase of 3.26 points on the Qwen3-8B-Base model and 2.62 points on the Qwen3-14B-Base model. The results indicate that DelTA not only enhances performance on these benchmarks but also exhibits robust generalization capabilities, as evidenced by additional evaluations on code generation tasks and across different model architectures.

Limitations
The authors acknowledge that while DelTA improves the discriminative power of token credit assignment, it may still be susceptible to the influence of high-frequency patterns in the data. Additionally, the reliance on advantage-weighted averaging may introduce biases if the reward structure is not well-aligned with the underlying task objectives. The paper does not extensively discuss the computational overhead introduced by the additional token coefficient estimation, which could impact scalability in larger models or datasets.

Why it matters
The implications of this work are significant for the field of reinforcement learning and natural language processing. By providing a clearer understanding of token-level updates in RLVR, DelTA enhances the interpretability and effectiveness of reward-based learning in language models. This advancement could lead to more efficient training methodologies and improved model performance across various tasks, particularly those requiring nuanced reasoning and decision-making. Furthermore, the insights gained from this research may inform future developments in token credit assignment strategies, potentially influencing a broader range of applications in AI.

Authors: Kaiyi Zhang, Wei Wu, Yankai Lin
Source: arXiv:2605.21467
URL: https://arxiv.org/abs/2605.21467v1

Turing Wire

By Turing Wire editorial staff · May 20, 2026 · Editorial standards →

Source: arXiv cs.LG