Scaling LLM Reasoning from Minimal Labels: A Semi-Supervised Framework with a Lightweight Verifier
Keizo Kato, Chenhui Chu, Yugo Murawaki, Sadao Kurohashi
Published Jun 15, 2026, 14:55 UTC
Problem
The paper tackles the problem of scaling reasoning in Large Language Models (LLMs) with minimal supervision, where prior work often relies on extensive labeled datasets to assess reasoning. Current methods for generating pseudo-intermediate reasoning typically require large amounts of correctly annotated data, which is often impractical to collect. The authors propose a semi-supervised approach that turns reasoning verification into a data generation mechanism, reducing the dependence on labeled samples. The authors note that this is a preprint and has not undergone peer review.
Method
The framework trains a lightweight reasoning-correctness classifier on a small number of labeled samples. The classifier judges whether intermediate reasoning traces generated by an LLM are valid. An entropy-based confidence threshold filters out traces on which the classifier is uncertain, so that only high-confidence traces are retained for fine-tuning the model. Existing LLMs generate the traces, while the classifier is kept computationally cheap so that the pipeline scales. The design shifts effort away from collecting labels and toward the quality of reasoning verification.
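The filtering step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the names `ScoredTrace`, `binary_entropy`, and `select_pseudo_labels`, the use of binary Shannon entropy in bits, the 0.5-bit threshold, and the rule of keeping only traces judged correct are all assumptions made for illustration. The sketch assumes the verifier outputs a probability that each trace is correct.

```python
import math
from dataclasses import dataclass


@dataclass
class ScoredTrace:
    trace: str        # reasoning trace text generated by the LLM
    p_correct: float  # verifier's probability that the trace is correct


def binary_entropy(p: float) -> float:
    """Shannon entropy, in bits, of a Bernoulli(p) prediction."""
    if p <= 0.0 or p >= 1.0:
        return 0.0  # a fully certain prediction carries no uncertainty
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def select_pseudo_labels(scored, max_entropy=0.5):
    """Keep traces the verifier is confident about and judges correct.

    max_entropy is a hypothetical threshold; the paper's value is not
    given in this summary.
    """
    kept = []
    for item in scored:
        confident = binary_entropy(item.p_correct) <= max_entropy
        judged_correct = item.p_correct > 0.5  # assumed selection rule
        if confident and judged_correct:
            kept.append(item)
    return kept


if __name__ == "__main__":
    samples = [
        ScoredTrace("trace A", 0.95),  # confident and judged correct
        ScoredTrace("trace B", 0.60),  # near-uniform prediction, uncertain
        ScoredTrace("trace C", 0.03),  # confident but judged incorrect
    ]
    for item in select_pseudo_labels(samples):
        print(item.trace, round(binary_entropy(item.p_correct), 3))
```

In this sketch, lowering `max_entropy` trades the quantity of retained traces for a cleaner fine-tuning set. Where the right balance lies would depend on how well calibrated the verifier is.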
Results
The framework was evaluated on two benchmarks: Verifiable Math Problems (specifically the Orca-Math subset) and Question Answering on Image Scene Graphs (GQA) with Visual Programming. The proposed method reaches accuracy comparable to models trained with 10 to 15 times more labeled data. In other words, the semi-supervised approach substantially cuts the amount of labeled data required while maintaining performance, which supports the effectiveness of the lightweight verifier and entropy filtering.
Limitations
The authors acknowledge that the reliance on a small number of labeled samples may introduce biases if the samples are not representative of the broader data distribution. Additionally, the performance of the lightweight classifier may vary depending on the complexity of the reasoning tasks. The paper does not address potential scalability issues when applied to more complex reasoning scenarios or other domains outside the tested benchmarks. Furthermore, the method’s effectiveness in real-world applications remains to be validated.
Why it matters
This work bears on the development of reasoning systems that can learn from minimal human input, potentially reducing the cost and effort of data annotation. By showing that effective reasoning can be achieved with limited supervision, the framework opens avenues for future research in semi-supervised learning for LLMs. The approach could also ease the construction of large-scale reasoning resources, making it easier to deploy LLMs in applications where labeled data is scarce. The paper was posted to arXiv cs.CL.
By Turing Wire editorial staff · Jun 15, 2026
Source: arXiv cs.CL