Notable other

From Zero to Hero: Training-Free Custom Concept Spawning in World Models

Kiymet Akdemir, Pinar Yanardag

Published
Jun 1, 2026 — 17:59 UTC

Problem
The paper addresses a significant gap in the capability of autoregressive world models for interactive video generation, specifically the lack of controllable scene composition. Current models generate environments based on a single reference frame or text prompt, but they fail to allow users to specify unseen regions when navigating beyond the visible frame. This limitation hinders applications in gaming, interactive storytelling, and simulation, where the ability to introduce user-defined visual concepts is crucial. The authors propose a solution to this problem, termed “concept spawning,” which is not covered in existing literature. This work is presented as a preprint and has not undergone peer review.

Method
The authors introduce SPAWN (Swapping Pinned Anchor with Windowed iNjection), a novel, training-free approach for concept spawning in autoregressive world models. SPAWN leverages a structural property of image-to-video backbones, where the first slot of the context memory is pinned to the reference frame, serving as a foundational anchor for generated content. The method involves swapping this anchor with an external concept latent over a short injection window, allowing the original anchor to return afterward. This mechanism enables the user-specified concept to propagate through the model’s memory during the rollout process. SPAWN can accept either a concept image or a text description, facilitating the integration of both fine-grained entities (e.g., characters, props) and large-scale elements (e.g., buildings, landmarks) into the generated environment.

Results
SPAWN demonstrates effective concept integration, achieving consistent lighting, scale, and perspective while maintaining identity and temporal coherence. The authors report qualitative results showcasing the method’s ability to spawn concepts seamlessly within the generated environments. While specific quantitative metrics are not disclosed, the qualitative assessments indicate a significant improvement over baseline autoregressive models that lack concept spawning capabilities.

Limitations
The authors acknowledge that SPAWN’s effectiveness may be constrained by the quality of the reference frame and the external concept provided. Additionally, the method’s reliance on the structural properties of existing models may limit its applicability to other architectures that do not share these characteristics. The paper does not address potential computational overhead introduced by the concept injection process or the scalability of the method to more complex environments.

Why it matters
The introduction of SPAWN has significant implications for enhancing user interactivity in video generation, particularly in gaming and simulation contexts. By enabling users to specify visual concepts dynamically, SPAWN opens avenues for more personalized and engaging experiences. This work contributes to the growing field of controllable generative models and sets the stage for future research on integrating user-defined elements into autoregressive frameworks, as published in arXiv cs.CV.

Turing Wire

By Turing Wire editorial staff · Jun 1, 2026 · Editorial standards →

Source: arXiv cs.CV