Edit the Bits, Diff the Codes: Bitwise Residual Editing for Visual Autoregressive Models
Shengqiang Zhang, Ruotong Liao, Volker Tresp, Barbara Plank, Hinrich Schütze
- Published
- Jun 11, 2026 — 16:41 UTC
Problem
The paper addresses the limitations of existing text-guided image editing methods using visual autoregressive (VAR) generators, which primarily manipulate token streams or flat logits. These methods do not leverage the inherent structures of bitwise-residual VAR models, specifically the per-bit Bernoulli prediction head and the additive multi-scale residual code field. This gap in capability is particularly relevant for enhancing the precision of image edits while maintaining background integrity. The work is presented as a preprint, indicating it has not yet undergone peer review.
Method
The authors propose BitResEdit, a training-free editor designed for bitwise-residual VAR generators, exemplified by the Infinity model. The method consists of two main components: BitEdit and ResEdit. BitEdit employs source-negative guidance by adjusting the post-CFG per-bit log-odds based on a source-target contrast derived from a shared edited prefix. This adjustment is projected into a closed-form Bernoulli-KL trust region around the clean CFG sampler. ResEdit transforms the sampled bits into per-scale continuous-code residuals, which are gated using a localization mask and reintegrated through the generator’s native sum-of-scales. This dual approach allows for precise decision-time bit guidance and effective code composition, ensuring that masked latent features are preserved while localized edits are applied.
Results
Evaluating BitResEdit on the PIE-Bench benchmark with the Infinity-2B model, the authors report a significant improvement in text alignment, achieving a +1.07 increase in CLIP scores for the edited regions compared to the strongest prior editor. This enhancement is coupled with competitive background preservation, indicating that the method effectively balances targeted edits with the integrity of the original image context. The ablation studies further demonstrate that BitEdit and ResEdit complement each other, with distinct contributions to target alignment and background preservation.
Limitations
The authors acknowledge that the method’s training-free nature may limit its adaptability to diverse editing tasks compared to fully trained models. Additionally, the reliance on the specific architecture of the Infinity model may restrict the generalizability of the approach to other VAR generators. The paper does not address potential computational overhead introduced by the additional processing steps involved in the bitwise editing process.
Why it matters
The introduction of BitResEdit represents a significant advancement in the field of text-guided image editing, particularly for applications requiring high fidelity in both targeted modifications and background preservation. By effectively utilizing the native structures of VAR models, this work opens avenues for further research into more sophisticated editing techniques that leverage bitwise operations. The implications of this research extend to various domains, including digital art, content creation, and interactive design, as published in arXiv.
By Turing Wire editorial staff · Jun 11, 2026 · Editorial standards →
Source: arXiv cs.CL