ExpRL: Exploratory RL for LLM Mid-Training
Violet Xiang, Amrith Setlur, Chase Blagden, Nick Haber, Aviral Kumar
- Published
- Jun 15, 2026 — 17:50 UTC
Problem
The paper addresses the limitations of existing sparse reward reinforcement learning (RL) techniques for large language models (LLMs), particularly in the context of mid-training. Current methods rely on manually curated reasoning traces to teach primitive skills, which may not adequately prepare models for complex problem-solving that requires the integration of these skills. The authors highlight the need for a more automated approach to enhance the coverage and effectiveness of RL in LLM training, particularly for challenging tasks. This work is presented as a preprint and has not undergone peer review.
Method
The core contribution of the paper is the introduction of ExpRL, a novel RL-based mid-training framework that utilizes large corpora of human-written question-answer data. Instead of treating reference solutions as direct imitation targets, ExpRL employs them as reward scaffolds. The model samples from the original problem prompt, while an LLM judge evaluates the sampled reasoning traces against the reference solutions, assigning dense rewards based on outcome-level and process-level evaluations. This approach allows for the reinforcement of partial progress and productive reasoning behaviors, which are often overlooked by traditional sparse final-answer rewards. The authors do not disclose specific architectural details or training compute requirements, focusing instead on the methodology’s conceptual framework.
Results
ExpRL demonstrates superior performance on challenging math reasoning tasks compared to several baselines, including supervised fine-tuning (SFT), sparse-reward Generalized Policy Optimization (GRPO), and self-distillation. The paper reports that ExpRL provides a better initialization for subsequent sparse-reward RL, indicating its effectiveness in enhancing the RL priming process. While specific numerical results are not detailed in the summary, the authors assert that the improvements are statistically significant and suggest that ExpRL can generalize beyond the math domain to mixed-domain tasks.
Limitations
The authors acknowledge that the reliance on human-written data may introduce biases inherent in the training set, potentially affecting the generalizability of the model. Additionally, the paper does not explore the scalability of the approach to larger datasets or more complex reasoning tasks beyond those tested. The lack of peer review may also raise questions about the robustness of the findings.
Why it matters
The implications of this work are significant for the field of reinforcement learning and LLM training. By automating the mid-training process and utilizing reward scaffolds, ExpRL could streamline the development of more capable LLMs that can tackle complex reasoning tasks without extensive manual intervention. This approach may pave the way for future research on automated RL techniques and their applications in various domains, enhancing the overall performance of LLMs. As published in arXiv cs.LG, this work contributes to the ongoing discourse on improving LLM capabilities through innovative training methodologies.
By Turing Wire editorial staff · Jun 15, 2026 · Editorial standards →
Source: arXiv cs.LG