Notable: efficient inference

PRISM: Pareto-Efficient Retrieval over Intent-Aware Structured Memory for Long-Horizon Agents

Jingyi Peng, Zhongwei Wan, Weiting Liu, Qiuzhuang Sun

Published: May 12, 2026, 15:28 UTC
Summary length: 459 words
Relevance score: 80%

Problem
This paper addresses the limitations of existing memory management techniques for long-horizon language agents, particularly as conversation history accumulates. Current methods either expand context windows without optimizing retrieval, incur high token costs through fact extraction, or rely on heuristic graph traversal that compromises both accuracy and efficiency. The authors propose PRISM, a framework that treats long-horizon memory as a joint retrieval-and-compression problem over a graph-structured memory. This work is presented as a preprint and has not yet undergone peer review.
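To make the graph-structured-memory framing concrete, the sketch below shows one minimal way such a store could look: memory entries as nodes and typed relations as edges. The class names, fields, and relation labels are illustrative assumptions for this summary, not the schema used in the paper.

```python
from dataclasses import dataclass, field

# Illustrative sketch of a graph-structured conversational memory.
# Names and relation types are assumptions, not PRISM's actual schema.

@dataclass
class MemoryNode:
    node_id: str
    text: str    # raw utterance or distilled fact
    turn: int    # conversation turn it came from

@dataclass
class MemoryGraph:
    nodes: dict = field(default_factory=dict)  # node_id -> MemoryNode
    edges: dict = field(default_factory=dict)  # node_id -> list of (relation, node_id)

    def add_node(self, node: MemoryNode) -> None:
        self.nodes[node.node_id] = node
        self.edges.setdefault(node.node_id, [])

    def add_relation(self, src: str, relation: str, dst: str) -> None:
        # Typed edge, e.g. "mentions", "temporal_next", "same_entity".
        self.edges[src].append((relation, dst))

# Example: two turns about the same entity, linked by a typed relation.
g = MemoryGraph()
g.add_node(MemoryNode("n1", "Alice said she moved to Lyon in March.", turn=12))
g.add_node(MemoryNode("n2", "Alice mentioned her new job at a bakery.", turn=47))
g.add_relation("n1", "same_entity", "n2")
```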

Method
PRISM introduces a training-free retrieval framework that integrates four key inference-time components:

  1. Hierarchical Bundle Search: This component navigates through typed relation paths to identify relevant information efficiently.
  2. Query-Sensitive Edge Costing: This component assigns traversal costs that reflect the detected intent of the query, steering graph traversal toward intent-relevant paths.
  3. Evidence Compression: This step compresses the retrieved candidate bundle into a compact context suitable for answering, thereby reducing the context size while maintaining relevance.
  4. Adaptive Intent Routing: This mechanism routes the majority of queries through zero-LLM (Large Language Model) tiers, minimizing computational overhead.
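As a rough illustration of the adaptive routing idea (component 4), the sketch below dispatches queries across tiers, sending simple intents through a zero-LLM path and reserving a model call for harder cases. The intent labels, patterns, and tier names are invented for this example; the paper's actual router and tier definitions may differ.

```python
import re
from typing import Callable

# Hypothetical tiered router: most queries take a cheap, zero-LLM path.
# Intent labels and regex patterns are assumptions for illustration only.

ZERO_LLM_PATTERNS = {
    "when":  re.compile(r"\bwhen\b|\bwhat (date|time)\b", re.I),
    "who":   re.compile(r"\bwho\b", re.I),
    "where": re.compile(r"\bwhere\b", re.I),
}

def detect_intent(query: str) -> str:
    for intent, pattern in ZERO_LLM_PATTERNS.items():
        if pattern.search(query):
            return intent
    return "complex"  # multi-hop or reasoning-heavy queries

def route(query: str,
          zero_llm_retrieve: Callable[[str, str], str],
          llm_retrieve: Callable[[str], str]) -> str:
    intent = detect_intent(query)
    if intent != "complex":
        # Zero-LLM tier: pure graph traversal and ranking, no model call.
        return zero_llm_retrieve(query, intent)
    # Fallback tier: only here do we pay for an LLM call.
    return llm_retrieve(query)
```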

The retrieval process is formulated as a min-cost selection problem over typed path templates, which is then paired with a compression step on the LLM side. This approach allows PRISM to efficiently surface relevant evidence while adhering to strict context budget constraints, without necessitating any fine-tuning or alterations to the upstream ingestion pipeline.
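One way to picture this formulation is a shortest-path-style traversal where edge costs depend on the detected intent, followed by trimming the selected evidence to a token budget. The cost table, budget, and scoring below are illustrative stand-ins, not the paper's actual path templates or compression step.

```python
import heapq

# Illustrative min-cost traversal over a typed memory graph.
# edges: node -> list of (relation, neighbor); costs are intent-dependent.
# All numbers and relation names here are made up for the sketch.

def intent_edge_cost(relation: str, intent: str) -> float:
    # Query-sensitive costing: relations aligned with the intent are cheap.
    preferred = {"when": {"temporal_next"}, "who": {"same_entity"}}
    return 0.2 if relation in preferred.get(intent, set()) else 1.0

def min_cost_bundle(edges, seeds, intent, max_cost=3.0):
    """Dijkstra-style expansion from seed nodes; returns reachable evidence
    nodes whose accumulated traversal cost stays within max_cost."""
    best = {s: 0.0 for s in seeds}
    frontier = [(0.0, s) for s in seeds]
    heapq.heapify(frontier)
    while frontier:
        cost, node = heapq.heappop(frontier)
        if cost > best.get(node, float("inf")):
            continue
        for relation, nbr in edges.get(node, []):
            new_cost = cost + intent_edge_cost(relation, intent)
            if new_cost <= max_cost and new_cost < best.get(nbr, float("inf")):
                best[nbr] = new_cost
                heapq.heappush(frontier, (new_cost, nbr))
    return best  # node -> cheapest cost of reaching it

def compress_to_budget(bundle, texts, token_budget=200):
    """Naive evidence compression: keep the cheapest nodes until the budget runs out."""
    kept, used = [], 0
    for node, _cost in sorted(bundle.items(), key=lambda kv: kv[1]):
        n_tokens = len(texts[node].split())  # crude token estimate
        if used + n_tokens > token_budget:
            break
        kept.append(texts[node])
        used += n_tokens
    return "\n".join(kept)
```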

Results
Experiments conducted on the LoCoMo benchmark demonstrate that PRISM outperforms all same-protocol baselines in LLM-judge accuracy while operating under an order-of-magnitude smaller context budget. This result indicates that PRISM occupies a previously unexploited region of the accuracy-context-cost trade-off, balancing answer quality against retrieval cost more favorably than prior approaches.

Limitations
The authors acknowledge that PRISM’s performance is contingent on the quality of the underlying graph-structured memory and the effectiveness of the query intent detection mechanism. Additionally, the framework’s reliance on a zero-LLM tier for most queries may limit its applicability in scenarios requiring more complex reasoning or nuanced understanding. The paper does not address potential scalability issues when applied to larger datasets or more complex conversational contexts.

Why it matters
The implications of PRISM are significant for the development of long-horizon language agents, particularly in applications requiring efficient memory management and high accuracy in response generation. By providing a framework that balances retrieval efficiency with answer quality, PRISM opens avenues for future research in memory-augmented language models and could influence the design of more sophisticated conversational agents. This work may also inspire further exploration into graph-based memory architectures and their integration with LLMs.

Authors: Jingyi Peng, Zhongwei Wan, Weiting Liu, Qiuzhuang Sun
Source: arXiv:2605.12260
URL: https://arxiv.org/abs/2605.12260v1

Turing Wire
Author: Turing Wire editorial staff