Re-Ranking Through an Attribution Lens for Citation Quality in Legal QA

Mohamed Hesham Elganayni, Selim Saleh

Published: Jun 2, 2026 — 14:48 UTC

Problem
The paper targets a weakness in retrieval-augmented generation systems for legal question answering (QA): the assumption that the passages ranked highest by semantic similarity are also the most useful ones to cite. The authors report, based on their findings on the AQuAECHR benchmark, that semantic similarity does not correlate with citation quality. The work is a preprint and has not undergone peer review, so its findings still need independent validation.
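To make the baseline assumption concrete, the following is a minimal sketch of ranking passages by semantic similarity, here taken to mean cosine similarity between embedding vectors. The two-dimensional vectors and the names `cosine` and `rank_by_similarity` are illustrative assumptions; the article does not say which embedding model or similarity measure the paper's baseline uses.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_by_similarity(query_vec: Sequence[float],
                       passage_vecs: Sequence[Sequence[float]]) -> List[int]:
    """Return passage indices ordered from most to least similar to the query."""
    sims = [cosine(query_vec, p) for p in passage_vecs]
    return sorted(range(len(passage_vecs)), key=lambda i: sims[i], reverse=True)


# Toy embeddings (hypothetical values, not real model output).
query = [1.0, 0.0]
passage_vecs = [[0.0, 1.0], [1.0, 0.1], [0.5, 0.5]]
similarity_order = rank_by_similarity(query, passage_vecs)
```

A ranking of this kind reflects only closeness in embedding space. Whether the top passage actually helps the language model produce a well-cited answer is a separate question, which is the gap the paper examines.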

Method
The authors propose a lightweight cross-encoder trained on continuous, perturbation-based attribution scores computed with C-LIME. The cross-encoder re-ranks candidate passages before they are passed to a language model for answer generation. Training uses five-fold cross-validation on the AQuAECHR benchmark with two different language models. The core technical contribution is the use of attribution scores as the training signal for the re-ranker, with the aim of surfacing passages the generator actually relies on and thereby improving citation faithfulness. The re-ranker is intended to filter out noise specific to any one language model, yielding a more general relevance signal that can be shared across models.
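The general idea of a perturbation-based attribution score can be sketched as follows. This is not the authors' C-LIME procedure, whose details the article does not give. It is a simplified stand-in that drops and keeps passages in every combination and measures how much each passage changes a scalar quality score on average. The function `toy_answer_quality` and the values in `TOY_USEFULNESS` are hypothetical placeholders for "generate an answer from these passages and score it".

```python
from itertools import product
from statistics import mean
from typing import Callable, List, Sequence


def perturbation_attributions(
    n_passages: int,
    score_fn: Callable[[Sequence[int]], float],
) -> List[float]:
    """Attribution of passage i = mean score with i kept minus mean score with i dropped."""
    masks = list(product([0, 1], repeat=n_passages))  # every keep/drop pattern
    scores = [score_fn(mask) for mask in masks]
    attributions = []
    for i in range(n_passages):
        kept = [s for m, s in zip(masks, scores) if m[i] == 1]
        dropped = [s for m, s in zip(masks, scores) if m[i] == 0]
        attributions.append(mean(kept) - mean(dropped))
    return attributions


def rerank(passages: Sequence[str], relevance: Sequence[float]) -> List[str]:
    """Order passages from highest to lowest relevance score."""
    order = sorted(range(len(passages)), key=lambda i: relevance[i], reverse=True)
    return [passages[i] for i in order]


# Hypothetical per-passage usefulness; a real system would call a language model here.
TOY_USEFULNESS = [0.1, 0.7, 0.2]


def toy_answer_quality(mask: Sequence[int]) -> float:
    """Toy score: sum of the usefulness of the passages that are kept."""
    return sum(w for w, keep in zip(TOY_USEFULNESS, mask) if keep)


passages = ["passage A", "passage B", "passage C"]
targets = perturbation_attributions(len(passages), toy_answer_quality)
reranked = rerank(passages, targets)
```

In the setup the article describes, continuous scores of this kind would serve as regression targets for the cross-encoder, which then scores unseen query and passage pairs. The sketch skips that training step and sorts by the targets directly. Enumerating every mask is only feasible for a handful of passages; sampling masks would be the usual alternative.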

Results
According to the authors, the proposed re-ranker outperforms the baselines, including semantic-similarity ranking and random selection, on citation quality. They report closer alignment with gold expert answers and improved citation faithfulness. The two re-rankers, trained independently on different language models, converge in performance, which the authors take as evidence that the cross-encoder captures a shared relevance signal. The abstract gives no specific numbers, so the size of the improvement cannot be judged from it alone.

Limitations
The authors acknowledge that, although the cross-encoder improves citation quality, same-model re-ranking remains more effective than cross-model re-ranking. Model-specific characteristics therefore still affect how well the re-ranker works. In addition, perturbation-based attribution methods may introduce biases of their own, which the paper does not fully explore. The lack of extensive evaluation across diverse legal domains may also limit how far the findings generalize.

Why it matters
The work bears on the development of more reliable legal QA systems, in particular on the quality of the citations that language models produce. By showing that perturbation-based attribution can serve as a model-agnostic training signal for citation-aware retrieval, the authors open the way for similar methods in other domains. The findings add to ongoing work on retrieval mechanisms in NLP applications. The paper is posted on arXiv under cs.CL.

By Callan Zhang · Jun 2, 2026

Summarised from the primary source with AI assistance under human editorial oversight. Turing Wire is not a primary source — read the original for the authoritative account.

Source: arXiv cs.CL