Gamma-World: Generative Multi-Agent World Modeling Beyond Two Players
Fangfu Liu, Kai He, Tianchang Shen, Tianshi Cao, Sanja Fidler, Yueqi Duan
- Published
- May 27, 2026 — 17:59 UTC
Problem
This paper addresses the gap in generative world modeling for interactive video generation, specifically in multi-agent settings. Prior work has predominantly focused on single-agent environments, limiting the applicability of these models in scenarios requiring simultaneous interactions among multiple agents. The authors highlight the need for a principled design that allows for independent control of agents, permutation symmetry, and efficient inference while ensuring temporal and spatial consistency. This work is presented as a preprint and has not yet undergone peer review.
Method
The authors introduce a generative multi-agent world model that employs two key innovations: Simplex Rotary Agent Encoding and Sparse Hub Attention. Simplex Rotary Agent Encoding is a parameter-free extension of 3D Rotary Positional Encoding (RoPE), representing agents as vertices of a regular simplex in rotary angle space. This representation allows for distinct phases for each agent while maintaining permutation equivalence, thus enabling scalable agent identity management without the need for learned identities or fixed orderings.
To address the computational inefficiency of dense all-to-all attention mechanisms, the authors propose Sparse Hub Attention. This method utilizes learnable hub tokens to mediate interactions among agents, effectively reducing the attention complexity from quadratic to linear with respect to the number of agents. For real-time video generation, the model distills a full-context diffusion teacher into a causal student, which generates temporal blocks sequentially with key-value (KV) caching, achieving action-responsive generation at 24 frames per second (FPS).
Results
The proposed model demonstrates significant improvements over existing baselines, including slot-based and dense-attention models, in multiplayer virtual environments. Key performance metrics include enhanced video fidelity, improved action controllability, and better inter-agent consistency. Notably, the model generalizes effectively from two to four players without requiring additional training, showcasing its scalability and robustness. Specific quantitative results are not disclosed in the abstract, but the improvements are emphasized in qualitative assessments and comparative analyses against named baselines.
Limitations
The authors acknowledge that their model’s performance may be constrained by the complexity of the environments used for evaluation, which may not fully represent all possible multi-agent scenarios. Additionally, the reliance on a distilled causal student may limit the expressiveness of the generated outputs compared to the full-context teacher. The paper does not address potential challenges related to the scalability of the model in environments with a significantly larger number of agents or the implications of agent interactions in highly dynamic settings.
Why it matters
This work has significant implications for the development of interactive AI systems, particularly in gaming, robotics, and simulation environments where multiple agents must operate concurrently. By providing a scalable and efficient framework for multi-agent world modeling, the authors pave the way for more complex and realistic simulations that can enhance training and evaluation of AI agents. The innovations presented could also inspire further research into efficient attention mechanisms and agent representations in multi-agent systems.
Authors: Fangfu Liu, Kai He, Tianchang Shen, Tianshi Cao, Sanja Fidler, Yueqi Duan, Jun Gao, Igor Gilitschenski et al.
Source: arXiv:2605.28816
URL: https://arxiv.org/abs/2605.28816v1
By Turing Wire editorial staff · May 27, 2026 · Editorial standards →
Source: arXiv cs.CV