Notable other

FreeOcc: Training-Free Embodied Open-Vocabulary Occupancy Prediction

Zeyu Jiang, Changqing Zhou, Xingxing Zuo, Changhao Chen

Original source

arXiv cs.CV

Problem
This paper addresses the limitations of existing learning-based occupancy prediction methods, which typically require extensive 3D annotations and struggle to generalize across diverse environments. The authors propose FreeOcc, a training-free framework for open-vocabulary occupancy prediction that operates without the need for voxel-level supervision, ground-truth camera poses, or any learning phase. This work is presented as a preprint and has not yet undergone peer review.

Method
FreeOcc employs a four-layer pipeline to incrementally construct a globally consistent occupancy map. The components are as follows:

SLAM Backbone: This module estimates camera poses and sparse geometry from monocular or RGB-D sequences.
Gaussian Update: A geometrically consistent update mechanism constructs dense 3D Gaussian maps based on the sparse geometry.
Open-Vocabulary Semantics: The framework leverages off-the-shelf vision-language models to associate open-vocabulary semantics with the Gaussian primitives generated in the previous step.
Gaussian-to-Occupancy Projection: This probabilistic step converts the dense Gaussian representations into voxel occupancy maps.

The architecture is designed to be entirely training-free and pose-agnostic, allowing it to operate effectively without prior training on labeled datasets.

Results
FreeOcc demonstrates significant performance improvements over existing self-supervised methods, achieving over 2× enhancements in Intersection over Union (IoU) and mean IoU (mIoU) metrics on the EmbodiedOcc-ScanNet benchmark. Additionally, the authors introduce ReplicaOcc, a new benchmark for indoor open-vocabulary occupancy prediction, where FreeOcc exhibits strong zero-shot transfer capabilities to novel environments. The results indicate that FreeOcc outperforms both supervised and self-supervised baselines, showcasing its robustness and adaptability.

Limitations
The authors acknowledge that while FreeOcc is effective in various environments, its performance may still be influenced by the quality of the input sequences and the underlying SLAM accuracy. They do not address potential limitations related to the scalability of the Gaussian update process or the computational efficiency of the framework in real-time applications. Furthermore, the reliance on off-the-shelf vision-language models may introduce biases based on the training data of those models, which could affect the semantic associations made during occupancy prediction.

Why it matters
The implications of FreeOcc are significant for the fields of robotics and computer vision, particularly in applications requiring real-time occupancy mapping in dynamic environments. By eliminating the need for extensive training and annotations, FreeOcc opens avenues for deploying occupancy prediction systems in scenarios where labeled data is scarce or unavailable. This work also sets a precedent for future research on training-free methods and the integration of vision-language models in spatial reasoning tasks, potentially influencing the design of more adaptable and efficient AI systems.

Authors: Zeyu Jiang, Changqing Zhou, Xingxing Zuo, Changhao Chen
Source: arXiv:2604.28115
URL: https://arxiv.org/abs/2604.28115v1