The Coverage Illusion: From Pre-retrieval Routing Failure to Post-retrieval Cascades in a Production RAG System
Zafar Hussain, Kristoffer Nielbo
- Published
- May 26, 2026 — 16:08 UTC
Problem
This preprint addresses a significant gap in the understanding of query augmentation methods in Retrieval-Augmented Generation (RAG) systems, particularly in the context of real-world production traffic. While existing literature emphasizes the necessity of query augmentation techniques like HyDE and query expansion, the empirical justification for their extensive application—resulting in high inference costs and latency—remains largely unexplored. The authors identify a discrepancy between synthetic query evaluations and actual user queries, coining this phenomenon the “Coverage Illusion.”
Method
The authors conducted a case study using the Danish National Encyclopedia, analyzing five retrieval workflows across 20,000 query-workflow pairs derived from both production traffic and synthetic conditions. They evaluated the effectiveness of pre-retrieval routing and post-retrieval cascades in addressing the Coverage Illusion. The proposed post-retrieval cascade operates in a cheapest-first order, escalating to LLM augmentation only when prior steps yield no documents. This method does not require additional training overhead or secondary serving infrastructure, making it efficient for production environments. The evaluation involved comparing the cascade’s performance against the Always-HyDE baseline.
Results
The post-retrieval cascade demonstrated a significant improvement in quality, achieving a +0.140 increase in Composite Overall points compared to the Always-HyDE method. Additionally, it reduced latency by 31.8% and successfully served 72.2% of real user queries without necessitating LLM augmentation. These results highlight the effectiveness of the proposed method in real-world applications, contrasting sharply with the synthetic query findings that suggested a need for LLM augmentation in over 90% of cases.
Limitations
The authors acknowledge that their findings are based on a specific dataset from the Danish National Encyclopedia, which may limit the generalizability of the results to other domains or datasets. They also note that the structural mismatch between synthetic and real query distributions could vary across different applications, potentially affecting the applicability of their proposed cascade method. Furthermore, the study does not explore the implications of varying user intent or the impact of different retrieval models on the observed coverage gap.
Why it matters
This work has significant implications for the design and optimization of RAG systems in production environments. By elucidating the Coverage Illusion, the authors provide a framework for understanding when and how to apply LLM augmentation effectively, potentially leading to more efficient resource utilization and reduced operational costs. The proposed post-retrieval cascade offers a practical solution for improving retrieval quality while minimizing latency, which is crucial for enhancing user experience in real-time applications. This research encourages further exploration into the discrepancies between synthetic and real-world query distributions, paving the way for more robust and adaptive retrieval systems.
Authors: Zafar Hussain, Kristoffer Nielbo
Source: arXiv:2605.27220
URL: https://arxiv.org/abs/2605.27220v1
By Turing Wire editorial staff · May 26, 2026 · Editorial standards →
Source: arXiv cs.CL