Notable evaluation benchmarks

Separating Semantic Competition from Context Length in RAG Reading

Vyzantinos Repantis, Ameya Gawde, Harshvardhan Singh, Rohit Alekar, Cien Zhang, Svetlana Karslioglu

Published
May 26, 2026 — 17:06 UTC

Problem
This preprint addresses a critical gap in understanding the performance limitations of retrieval-augmented generation (RAG) systems, specifically the interplay between context length and semantic competition among retrieved passages. While RAG systems can retrieve relevant passages, they often fail to identify the correct answer due to the presence of competing passages that may distract the model. The authors propose a matched-control protocol to disentangle the effects of context length from semantic competition, providing insights into the underlying causes of performance degradation in RAG reading tasks.

Method
The authors introduce a matched-control protocol for evaluating RAG reading performance. This involves maintaining a fixed number and length of retrieved passages while substituting hard competitors with less competitive passages. The study employs two compact open models, Phi-2 and Qwen2.5-1.5B, and evaluates their performance on the SQuAD benchmark. The evaluation metrics include Exact Match (EM), answer inclusion, and F1 score. The authors also analyze retention curves to assess how performance varies as the number of competitors increases, summarizing these curves with a right-censored half-life metric when applicable. The training compute details are not disclosed, but the methodology emphasizes isolating the competition effect from context length.

Results
The results indicate that the matched-control protocol effectively mitigates the negative impact of semantic competition on RAG reading performance. For the Phi-2 model, the authors report an improvement of +6.0 EM points, +7.0 answer-inclusion points, and +0.057 F1 score. The Qwen2.5-1.5B model shows similar enhancements, with +4.5 EM points, +9.0 answer-inclusion points, and +0.068 F1 score. The retention curves reveal that the competition effect is more pronounced for F1 and answer inclusion metrics compared to exact match, and the effect varies with the length of the snippets provided.

Limitations
The authors acknowledge that their approach primarily focuses on the competition effect and does not fully account for other potential factors influencing RAG performance, such as the quality of the retrieved passages or the inherent limitations of the models themselves. Additionally, the study is constrained to the SQuAD benchmark, which may not generalize to other datasets or tasks. The right-censored half-life metric, while informative, may not capture all nuances of performance degradation as competitors accumulate.

Why it matters
This work has significant implications for the design and evaluation of RAG systems. By isolating the effects of semantic competition from context length, the findings can inform future model architectures and training strategies aimed at improving reading comprehension in RAG frameworks. Understanding these dynamics can lead to more robust retrieval mechanisms and enhance the overall efficacy of AI systems in natural language understanding tasks. This research lays the groundwork for further exploration into optimizing RAG systems and addressing their limitations in real-world applications.

Authors: Vyzantinos Repantis, Ameya Gawde, Harshvardhan Singh, Rohit Alekar, Cien Zhang, Svetlana Karslioglu, Akash Vishwakarma
Source: arXiv:2605.27294
URL: https://arxiv.org/abs/2605.27294v1

Turing Wire

By Turing Wire editorial staff · May 26, 2026 · Editorial standards →

Source: arXiv cs.CL