Major foundation models

AFUN: Towards an Affordance Foundation Model for Functionality Understanding

Zhaoning Wang, Yi Zhong, Jiawei Fu, Henrik I. Christensen, Jun Gao

Published: Jun 1, 2026 — 17:50 UTC

Problem
The paper addresses the gap in affordance understanding for robot manipulation, which is crucial for bridging visual perception and physical action in unstructured environments. Existing methods either localize task-relevant regions without specifying executable motions or predict motions with limited scalability. The authors highlight that building a comprehensive affordance foundation model that generalizes across various environments, objects, and tasks remains a significant challenge. This work is presented as a preprint and has not undergone peer review.

Method
The authors propose a model, referred to as AFUN, which predicts a task-conditional functional mask (indicating where to interact) and a 3D post-contact motion curve (indicating how to interact) from a single RGB-D observation and a language task description. To facilitate open-world generalization, they developed a large-scale standardized data pipeline that integrates heterogeneous data sources, including robot, human, simulation, and real-world scans, into a unified affordance schema. This schema encompasses language descriptions, functional masks, and object-centric 3D motion labels. The architecture details, loss functions, and specific training compute resources are not disclosed in the paper.

Results
AFUN demonstrates superior performance across multiple benchmarks. For affordance segmentation, it achieves a mean gIoU improvement of +23.9 and a mean cIoU improvement of +26.3 compared to all baselines across eight test sets from four benchmarks. In contact-point prediction, AFUN shows a hit-rate gain of 12.7% to 61.3% over the best baseline. Additionally, it outperforms all competitors in 3D motion prediction across three test sets, indicating its robustness and effectiveness in real-world applications.

Limitations
The authors acknowledge that while AFUN shows promising results, it may still face challenges in extreme edge cases or highly dynamic environments where the model’s assumptions about object interactions may not hold. They do not discuss potential limitations related to the scalability of the data pipeline or the computational efficiency of the model during deployment. Furthermore, the lack of detailed architecture and training specifics may hinder reproducibility.

Why it matters
The development of AFUN represents a significant advancement in affordance understanding, enabling robots to perform manipulation tasks in diverse and unstructured environments without the need for fine-tuning or task-specific heuristics. This capability is crucial for the deployment of autonomous systems in real-world scenarios, enhancing their adaptability and functionality. The implications of this work extend to various applications in robotics and AI, paving the way for more sophisticated interaction models. For further details, see the full paper available on arXiv.

By Callan Zhang · Jun 1, 2026 · Editorial standards →

Summarised from the primary source with AI assistance under human editorial oversight. Turing Wire is not a primary source — read the original for the authoritative account.

Source: arXiv cs.CV