EventDrive: Event Cameras for Vision-Language Driving Intelligence
Dongyue Lu, Rong Li, Ao Liang, Lingdong Kong, Wei Yin, Lai Xing Ng
- Published
- Jun 16, 2026 — 17:58 UTC
Problem — Existing event-aware vision-language models primarily focus on generic perception tasks and do not adequately explore the role of event sensing in reasoning and decision-making within the autonomous driving loop. This paper addresses this gap by presenting a comprehensive framework that integrates event streams with RGB data and language supervision, which is currently underexplored in the literature.
Method — The authors propose EventDrive, a large-scale benchmark and model suite that encompasses four core dimensions: Perception, Understanding, Prediction, and Planning. The framework includes tasks such as captioning, structured question answering, grounding, motion-state recognition, trajectory forecasting, and planning. The key technical contribution is the EventDrive-VLM model, which employs a multi-horizon event pyramid and a temporal-horizon mixture-of-experts module. This architecture adaptively encodes and fuses asynchronous event data with frame-based information, enhancing the model’s ability to reason over time. The training process leverages a diverse dataset that combines event streams and RGB frames, although specific training compute details are not disclosed.
Results — The evaluation demonstrates that EventDrive-VLM significantly outperforms baseline models on various tasks. For instance, it achieves a 15% improvement in temporal precision and a 20% increase in motion awareness compared to traditional RGB-based models on the trajectory forecasting task. Additionally, the model shows enhanced robustness in challenging conditions, such as rapid motion and glare, indicating its superior capability in real-world driving scenarios.
Limitations — The authors acknowledge that the current framework may not fully exploit the potential of event cameras in all driving scenarios, particularly in complex urban environments. They also note that the reliance on large-scale datasets may introduce biases that could affect generalization. Furthermore, the paper does not address the computational efficiency of the proposed model, which could be a concern for real-time applications.
Why it matters — The integration of event cameras into vision-language models represents a significant advancement in autonomous driving technology, particularly in enhancing temporal precision and robustness under challenging conditions. This work lays the groundwork for future research in multimodal perception and decision-making systems, potentially influencing the design of next-generation autonomous vehicles. The findings and methodologies presented in this paper are crucial for researchers and engineers aiming to leverage event-based sensing in practical applications, as published in arXiv cs.CV.
By Callan Zhang · Jun 16, 2026 · Editorial standards →
Summarised from the primary source with AI assistance under human editorial oversight. Turing Wire is not a primary source — read the original for the authoritative account.
Source: arXiv cs.CV