FLARE: Fine-Grained Diagnostic Feedback for LLM Code Refinement
Yinsheng Yao, Hongxiang Zhang, Weixi Tong, Tianyi Zhang
Published: Jun 2, 2026, 16:29 UTC
Problem
Existing methods for refining code generated by large language models (LLMs) primarily depend on coarse-grained feedback signals, such as test failures and self-critiques. These signals lack the granularity needed for effective bug localization, which is critical for improving the reliability of generated code. This paper addresses this gap by proposing FLARE, a novel framework that provides line-level suspiciousness signals for more precise bug localization and code refinement. The work is presented as a preprint and has not undergone peer review.
Method
FLARE uses a lightweight diagnostic model that predicts a suspiciousness score for each line of generated code, so that refinement can target the lines most likely to contain a bug. The framework operates iteratively, searching over the top-k suspicious regions identified by the diagnostic model. The best candidate refinement is selected based on execution outcomes, which helps mitigate the uncertainty inherent in diagnostic predictions. The authors evaluated the approach on two benchmarks, LiveCodeBench and BigCodeBench, with five different base LLMs. Training compute details are not disclosed.
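The search-then-select step described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`top_k_lines`, `refine_once`, `toy_fix`), the hand-written scores standing in for the diagnostic model, and the one-line string replacement standing in for an LLM repair call are all assumptions made for illustration. The sketch also covers a single round, whereas the framework is described as iterative.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical types: a test takes program source and returns True on pass;
# a fixer takes program source and a suspicious line index and returns a candidate.
Test = Callable[[str], bool]
Fixer = Callable[[str, int], str]


def top_k_lines(scores: Sequence[float], k: int) -> List[int]:
    """Return the indices of the k highest-scoring lines, most suspicious first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


def count_passed(program: str, tests: Sequence[Test]) -> int:
    """Count how many tests the program passes."""
    return sum(1 for test in tests if test(program))


def refine_once(program: str, scores: Sequence[float], propose_fix: Fixer,
                tests: Sequence[Test], k: int) -> Tuple[str, int]:
    """Try one candidate per top-k suspicious line; keep the one passing most tests."""
    best_program = program
    best_passed = count_passed(program, tests)
    for line_idx in top_k_lines(scores, k):
        candidate = propose_fix(program, line_idx)
        passed = count_passed(candidate, tests)
        # Execution outcomes, not the diagnostic score, decide which candidate wins.
        if passed > best_passed:
            best_program, best_passed = candidate, passed
    return best_program, best_passed


# Toy usage: every value below is made up for illustration.
buggy = "def add(a, b):\n    return a - b\n"
scores = [0.1, 0.9]  # stand-in for per-line diagnostic model output


def toy_fix(program: str, line_idx: int) -> str:
    """Stand-in for an LLM repair call: swap '-' for '+' on the flagged line."""
    lines = program.splitlines()
    lines[line_idx] = lines[line_idx].replace("-", "+")
    return "\n".join(lines) + "\n"


def add_test(program: str) -> bool:
    """Execute the candidate and check add(2, 3) == 5; any error counts as a failure."""
    env: dict = {}
    try:
        exec(program, env)
        return env["add"](2, 3) == 5
    except Exception:
        return False


fixed, passed = refine_once(buggy, scores, toy_fix, [add_test], k=1)
```

In this sketch, raising `k` only adds candidates to compare. Because the winner is chosen by test results, a mis-ranked line from the diagnostic scores costs extra attempts rather than forcing a wrong edit.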
Results
FLARE improves on existing baselines across the reported settings. Without candidate search (k=1), it outperforms the strongest baseline by an absolute margin of 1.72% to 7.42%. When the candidate search is expanded to 10 candidates, the average improvement rises to 8.50%. Additionally, the lightweight diagnostic model outperforms recent fault localization methods when evaluated in isolation, indicating that it provides reliable guidance for code refinement.
Limitations
The authors acknowledge that the performance of FLARE may be influenced by the quality of the underlying LLMs used for code generation. Additionally, the framework’s reliance on execution outcomes for candidate selection may introduce variability based on the test cases employed. The paper does not address potential scalability issues when applied to larger codebases or more complex programming tasks, which could affect the generalizability of the results.
Why it matters
FLARE bears directly on automated code generation and debugging. By providing fine-grained diagnostic feedback, it helps LLMs produce more reliable code, which could reduce the manual effort required for debugging. This could make LLMs more dependable in software development, particularly in environments where code correctness is critical. The paper, posted to arXiv cs.AI, contributes to ongoing research on improving LLM code generation and refinement.
By Turing Wire editorial staff · Jun 2, 2026
Source: arXiv cs.AI