Notable multimodal

Chain-of-Procedure: Hierarchical Visual-Language Reasoning for Procedural QA

Guanhua Chen, Yutong Yao, Shenghe Sun, Ci-Jun Gao, Shudong Liu, Lidia S. Chao, Feng Wan, Derek F. Wong

Published: May 14, 2026 — 15:03 UTC

Problem
This paper targets a capability gap in visual-language models (VLMs): visual procedure question answering (VP-QA), a task the authors describe as underexplored. They identify two main weaknesses in existing VLMs: (1) inadequate cross-modal retrieval of structured procedural information from visual inputs, and (2) a mismatch between the granularity of image sequences and the textual decomposition of procedural steps. The work is a preprint and has not yet undergone peer review.

Method
The authors propose Chain-of-Procedure (CoP), a hierarchical reasoning framework with three stages:

Instruction Retrieval: CoP first retrieves relevant procedural instructions by analyzing visual cues from the input images.
Step Refinement: It then refines the retrieved instructions through semantic decomposition, ensuring that the steps align with the visual context.
Next-Step Generation: Finally, CoP generates the next procedural step based on the refined instructions.

The framework uses existing VLMs as its backbone. The abstract does not disclose the model architecture, loss functions, or training compute.

Results
CoP was evaluated across six VLMs on the newly introduced ProcedureVQA benchmark. The framework achieved up to a 13% absolute improvement over standard baselines, although the abstract does not specify the baseline models or the metrics used for comparison. The gain suggests that CoP mitigates the identified limitations of current VLMs on VP-QA tasks.
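An "absolute" improvement is a difference in percentage points, not a ratio. The snippet below shows the distinction using made-up scores: the 50.0 and 63.0 values are purely hypothetical, since the abstract reports neither the baseline scores nor the metric.

```python
def absolute_gain(baseline: float, improved: float) -> float:
    """Difference in percentage points between two scores on a 0-100 scale."""
    return improved - baseline


def relative_gain(baseline: float, improved: float) -> float:
    """Improvement expressed as a percentage of the baseline score."""
    return (improved - baseline) * 100 / baseline


# Hypothetical scores chosen only to illustrate a 13-point absolute gain.
baseline_score, cop_score = 50.0, 63.0
print(absolute_gain(baseline_score, cop_score))  # 13.0 (percentage points)
print(relative_gain(baseline_score, cop_score))  # 26.0 (percent of baseline)
```

The same 13-point absolute gain corresponds to a larger relative gain when the baseline is low and a smaller one when it is high, which is why the missing baseline figures matter when reading the headline number.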

Limitations
The authors acknowledge that their approach may still face challenges in generalizing across diverse procedural contexts and varying image qualities. They do not discuss potential issues related to the scalability of the framework or the computational efficiency of the hierarchical reasoning process. Additionally, the reliance on existing VLMs may limit the framework’s performance if those models are not optimized for VP-QA tasks.

Why it matters
The ProcedureVQA benchmark and the CoP framework give future work on visual-language reasoning a concrete task and a reference method for procedural settings. If the reported gains hold up, the approach could support applications in robotics, automated assistance, and interactive learning systems, and more broadly in tasks that require joint understanding of visual and textual information.

Authors: Guanhua Chen, Yutong Yao, Shenghe Sun, Ci-Jun Gao, Shudong Liu, Lidia S. Chao, Feng Wan, Derek F. Wong
Source: arXiv:2605.14928
URL: https://arxiv.org/abs/2605.14928v1

By Callan Zhang · May 14, 2026

Summarised from the primary source with AI assistance under human editorial oversight. Turing Wire is not a primary source — read the original for the authoritative account.

Source: arXiv cs.CL