MementoGUI: Learning Agentic Multimodal Memory Control for Long-Horizon GUI Agents
Ziyun Zeng, Hang Hua, Bocheng Zou, Mu Cai, Rogerio Feris, Jiebo Luo
Published: May 18, 2026, 16:57 UTC
Problem
This paper targets a weakness of existing GUI agents on long-horizon tasks: they rely either on raw history replay or on text-only memory. According to the authors, replaying raw history overwhelms the model with excessive data, while text-only memory fails to retain the localized visual evidence needed for decisions that span multiple interface transitions. The work is a preprint and has not yet undergone peer review.
Method
The core technical contribution is MementoGUI, a framework that integrates a learned memory controller, MementoCore, into GUI agents built on multimodal large language models (MLLMs). MementoCore operates as an online memory-control system and splits memory management into four specialized operators: step processing, memory compression, episodic writing, and episodic selection. The controller selectively retains task-relevant interface events, each represented by a textual summary and region-of-interest (ROI) visual evidence. The authors also introduce a scalable data curation pipeline that transforms computer-use trajectories into training data for the memory controller. Finally, they develop MementoGUI-Bench, a benchmark for long-horizon decision-making in GUI agents with metrics for semantic action matching, task progress, and memory consistency.
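To make the four-operator split concrete, here is a minimal Python sketch of how such a controller could be organized. It is an illustration under stated assumptions, not the authors' implementation: every class, method, and parameter name (ToyMemoryController, working_limit, and so on) is hypothetical, and the operators are hand-written heuristics, whereas MementoCore's operators are learned. The abstract also does not say how the operators interact, so the control flow here (compression triggered by a size limit, keyword-overlap retrieval) is an assumption.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Hypothetical type: an ROI stored as pixel coordinates (left, top, right, bottom).
BBox = Tuple[int, int, int, int]


@dataclass
class MemoryEntry:
    step: int                    # index of the first step this entry covers
    summary: str                 # textual summary of the interface event
    roi: Optional[BBox] = None   # region-of-interest evidence, if any


@dataclass
class ToyMemoryController:
    """Heuristic stand-in for a learned memory controller (names are assumptions)."""
    working_limit: int = 3
    working: List[MemoryEntry] = field(default_factory=list)
    episodic: List[MemoryEntry] = field(default_factory=list)

    def process_step(self, step: int, summary: str,
                     roi: Optional[BBox] = None) -> MemoryEntry:
        # Step processing: record the new step in short-term (working) memory.
        entry = MemoryEntry(step, summary, roi)
        self.working.append(entry)
        if len(self.working) > self.working_limit:
            self.compress()
        return entry

    def compress(self) -> None:
        # Memory compression: evict the oldest working entries beyond the limit
        # and hand them to the episodic writer.
        overflow = self.working[:-self.working_limit]
        self.working = self.working[-self.working_limit:]
        self.write_episode(overflow)

    def write_episode(self, entries: List[MemoryEntry]) -> None:
        # Episodic writing: merge evicted entries into one long-term entry,
        # keeping the most recent ROI among them.
        if not entries:
            return
        summary = "; ".join(e.summary for e in entries)
        roi = next((e.roi for e in reversed(entries) if e.roi is not None), None)
        self.episodic.append(MemoryEntry(entries[0].step, summary, roi))

    def select(self, query: str, k: int = 1) -> List[MemoryEntry]:
        # Episodic selection: rank long-term entries by word overlap with the query.
        words = set(query.lower().split())
        ranked = sorted(
            self.episodic,
            key=lambda e: len(words & set(e.summary.lower().split())),
            reverse=True,
        )
        return ranked[:k]


controller = ToyMemoryController(working_limit=2)
controller.process_step(0, "open settings app")
controller.process_step(1, "copy wifi password", roi=(10, 20, 110, 60))
controller.process_step(2, "open notes app")
controller.process_step(3, "paste text")
recalled = controller.select("wifi password")
```

In this toy run, the entry about the wifi password has been moved out of working memory by the time the agent reaches the notes app, yet selection can still recall both its summary and its ROI. That is the kind of cross-transition evidence the article says text-only memory fails to retain.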
Results
Experiments on GUI-Odyssey, MM-Mind2Web, and MementoGUI-Bench show that MementoGUI outperforms no-history, history-replay, and text-only memory baselines. Larger MementoCore backbones yield further improvements in memory-augmented GUI control. The abstract does not disclose effect sizes or specific metric values, so the magnitude of the gains cannot be judged from it. The improvement is reported as consistent across all three benchmarks.
Limitations
The authors acknowledge that the framework’s performance is contingent on the quality of the training data generated by the curation pipeline, which may introduce biases or limitations based on the source data. Additionally, the modular nature of MementoCore, while advantageous for flexibility, may also complicate integration with existing GUI agent architectures. The paper does not address potential scalability issues related to the memory management overhead in highly dynamic environments or the computational costs associated with training larger MementoCore models.
Why it matters
MementoGUI targets GUI agents that must maintain contextual awareness over extended interactions. By replacing raw history replay and text-only memory with learned, selective memory control, the work lays groundwork for more capable agent behavior in complex environments. Potential applications include automated customer support, software testing, and interactive AI systems, where long-horizon decision-making is critical. Future research can build on the framework to explore richer memory architectures and to refine the balance between memory retention and computational efficiency.
Authors: Ziyun Zeng, Hang Hua, Bocheng Zou, Mu Cai, Rogerio Feris, Jiebo Luo
Source: arXiv:2605.18652
URL: https://arxiv.org/abs/2605.18652v1
By Turing Wire editorial staff · May 18, 2026
Source: arXiv cs.CV