Notable other

OcclusionFormer: Arranging Z-Order for Layout-Grounded Image Generation

Ziye Li, Henghui Ding

Published: May 20, 2026 — 16:10 UTC

Problem
This paper addresses the limitations of existing layout-to-image generation models in handling inter-object occlusion, particularly in scenarios where bounding boxes overlap. Current methods often lack explicit occlusion information, leading to ambiguities in the generated images, such as entangled textures and inconsistent layering in overlapping regions. The authors highlight that this gap in capability significantly hampers the spatial controllability of generated images. The work is presented as a preprint and has not yet undergone peer review.

Method
The authors introduce OcclusionFormer, a novel framework that leverages a Diffusion Transformer architecture to model occlusion explicitly. Central to this approach is the construction of the SA-Z dataset, which includes large-scale, pixel-level annotations for occlusion ordering. OcclusionFormer decouples instances and employs volume rendering to composite them based on Z-order priority. Additionally, the framework incorporates a queried alignment loss that supervises individual instances, enhancing semantic consistency and ensuring fine-grained spatial precision. The training process and compute requirements are not explicitly detailed in the paper.

Results
OcclusionFormer demonstrates significant improvements over baseline models on various benchmarks, achieving a reduction in ambiguity in overlapping regions and enforcing correct occlusion dependencies. The authors report substantial accuracy gains, although specific numerical results and effect sizes are not provided in the abstract. The performance is evaluated across diverse scenes, indicating the model’s robustness in handling complex occlusion scenarios.

Limitations
The authors acknowledge that while OcclusionFormer improves upon existing methods, it may still struggle with extreme occlusion cases or highly complex scenes that exceed the training data’s diversity. Additionally, the reliance on a large-scale dataset for training may limit the model’s applicability in scenarios with limited labeled data. The paper does not discuss potential computational overhead introduced by the volume rendering process or the scalability of the approach in real-time applications.

Why it matters
The introduction of OcclusionFormer and the SA-Z dataset represents a significant advancement in layout-to-image generation, particularly in enhancing spatial controllability and occlusion handling. This work lays the groundwork for future research in occlusion-aware generative models, potentially influencing applications in computer graphics, virtual reality, and augmented reality, where accurate spatial relationships are crucial. By addressing the ambiguity in overlapping regions, this research opens avenues for more realistic image synthesis and improved user interactions in layout-based design tools.

Authors: Ziye Li, Henghui Ding
Source: arXiv:2605.21343
URL: https://arxiv.org/abs/2605.21343v1

By Callan Zhang · May 20, 2026 · Editorial standards →

Summarised from the primary source with AI assistance under human editorial oversight. Turing Wire is not a primary source — read the original for the authoritative account.

Source: arXiv cs.CV