Major evaluation benchmarks

Plan, Watch, Recover: A Benchmark and Architectures for Proactive Procedural Assistance

Kaustav Kundu, Ritvik Shrivastava, Maxim Arap, Nanshu Wang, Xianhui Zhu, Quintin Fettes

Published: Jun 3, 2026 — 14:52 UTC

Problem
The paper addresses the lack of large-scale, cross-domain benchmarks for proactive procedural assistance systems, particularly in scenarios where users deviate from expected task sequences. Existing literature does not adequately cover the complexities of real-world procedural tasks, especially the need for systems that can autonomously decide when to intervene and how to guide users effectively. This gap is critical for developing robust multi-modal assistants capable of providing real-time support. The work is presented as a preprint, indicating it has not yet undergone peer review.

Method
The authors introduce EgoProactive, a comprehensive dataset featuring wearable-egocentric data with explicit Out-of-Plan (OOP) annotations and recovery steps. They augment five established benchmarks—Ego4D, EPIC-KITCHENS, EgoExo4D, HoloAssist, and HowTo100M—into a unified framework called Pro²Bench, designed for proactive guidance. The core technical contribution is a decoupled planner-interaction architecture that integrates procedural state information, visual cues, and recovery strategies. Additionally, they propose a post-training recipe that facilitates knowledge transfer across different model families, validated through cross-backbone replication experiments using Llama-4 and Qwen-3.6-VL.

Results
The proposed Llama-4 system demonstrates significant improvements in intervention quality compared to both proprietary baselines (Claude Opus 4.6, Gemini 3.1 Pro, GPT 5.2) and open-weight models (Qwen3 VL 235B) across all six datasets in the Pro²Bench framework. The authors report substantial effect sizes, with the duplex model achieving high-quality guidance and notable gains in OOP recovery during oracle-plan experiments, indicating that controlled plan quality leads to enhanced performance.

Limitations
The authors acknowledge that the dataset may not encompass all possible procedural variations and user behaviors, which could limit the generalizability of the trained models. They also note that while the architecture shows promise, further validation in diverse real-world scenarios is necessary to assess its robustness. Additionally, the reliance on specific model families for the post-training recipe may restrict applicability to other architectures not included in their experiments.

Why it matters
This work has significant implications for the development of proactive multi-modal assistants, as it provides a foundational dataset and a novel architecture that can adapt to user deviations in procedural tasks. The introduction of Pro²Bench as a benchmark for evaluating proactive guidance systems sets a new standard for future research in this domain. The findings underscore the potential for improved user assistance in various applications, from healthcare to industrial settings, as published in arXiv.

By Callan Zhang · Jun 3, 2026 · Editorial standards →

Summarised from the primary source with AI assistance under human editorial oversight. Turing Wire is not a primary source — read the original for the authoritative account.

Source: arXiv cs.AI