Binary Tracking for Spatial QA and Navigation with Open Vision-Language Models
Dongbin Na, Chanwoo Kim, Soonbin Rho, Giyun Choi, Gangbok Lee, Dooyoung Hong
- Published
- Jun 15, 2026 — 16:10 UTC
Problem
This work addresses the gap in open-source spatial question answering (SQA) systems for service robots navigating long egocentric routes. Existing methods predominantly rely on closed-source models like GPT-4o, which are unsuitable for real-world applications due to issues such as network instability, communication latency, and high deployment costs. The authors highlight the limited research on open-source alternatives capable of running onboard robots, making this a critical area for exploration, especially in practical deployment scenarios.
Method
The authors propose BinTrack, a spatial-localization agent that utilizes a binary search algorithm over trajectory segments defined by two anchor landmarks identified from user queries. This method leverages the temporal ordering of the robot’s trajectory to enhance localization accuracy. The architecture is designed to be fully open-source, allowing for deployment on service robots without reliance on external models. The training process and specific datasets used for model training are not disclosed, but the authors emphasize the efficiency of their inference strategy, which achieves over a 1.5x speedup compared to previous open-source implementations.
Results
BinTrack demonstrates significant performance improvements, achieving up to a 22.8% increase in accuracy over other open-source SQA implementations. Notably, it matches the performance of closed-source models on the global category of the SpaceLocQA benchmark, which is recognized as a challenging setting requiring advanced reasoning capabilities. The results indicate that BinTrack not only enhances accuracy but also maintains competitive performance against state-of-the-art closed-source solutions.
Limitations
The authors acknowledge that while BinTrack performs well in the SpaceLocQA benchmark, its effectiveness in more diverse or complex environments remains to be evaluated. Additionally, the reliance on specific trajectory data may limit generalizability to other types of navigation tasks. The paper does not discuss potential biases in the dataset or the implications of deploying the model in varied real-world conditions, which could affect performance.
Why it matters
The introduction of BinTrack represents a significant advancement in the development of open-source solutions for spatial question answering in robotics, addressing critical limitations of existing closed-source models. By providing a practical and efficient method for spatial localization, this work paves the way for more robust and accessible robotic navigation systems. The release of the GangnamLoop benchmark, which includes multi-trip outdoor data collected from real-world conditions, further contributes to the field by enabling future research and development in SQA. This work is particularly relevant for researchers and engineers focusing on autonomous navigation and localization in robotics, as published in arXiv cs.AI.
By Turing Wire editorial staff · Jun 15, 2026 · Editorial standards →
Source: arXiv cs.AI