Notable efficiency inference

Triplet-Block Diffusion RWKV

Ke Lin, Yiyang Luo, Zhaolong Su, Yunya Song, Anyi Rao

Published
May 25, 2026 — 15:44 UTC

Problem
This paper addresses the limitations of causal Transformer language models, specifically their strictly sequential decoding and quadratic per-step attention cost. While linear-time causal models and discrete diffusion models have been proposed to mitigate these issues, their integration has been inconsistent due to the inherent requirement of bidirectional attention in diffusion models versus the unidirectional nature of causal models. This work presents a novel approach to unify these architectures, which is particularly relevant given the increasing demand for efficient and scalable language models. The paper is a preprint and has not yet undergone peer review.

Method
The authors introduce the $B^3D-RWKV$ model, a variant of the RWKV architecture that incorporates a triplet-block layout to achieve efficient inference. This architecture allows for $O(L)$ inference efficiency while enabling parallel, bidirectional discrete diffusion. The triplet-block layout facilitates the integration of the strengths of both causal and diffusion models, allowing for improved decoding throughput without sacrificing accuracy. The model is trained on a diverse 8-task suite, although specific details regarding the training compute and dataset characteristics are not disclosed.

Results
The $B^3D-RWKV-7.2B$ model demonstrates competitive performance across an 8-task suite, achieving accuracy levels comparable to existing state-of-the-art models. Notably, it significantly outperforms baseline models in terms of decoding throughput, achieving an average speedup of $\mathbf{1.6\times}$. The paper does not provide specific numerical accuracy metrics or detailed comparisons against named baselines, which would be beneficial for a comprehensive evaluation of the model’s performance.

Limitations
The authors acknowledge that the integration of bidirectional attention with causal modeling presents inherent challenges, which may affect the model’s performance in certain contexts. Additionally, the lack of detailed training compute specifications and dataset descriptions limits the reproducibility of the results. The paper does not address potential issues related to the scalability of the triplet-block layout or its performance on larger datasets beyond the 8-task suite.

Why it matters
The development of the $B^3D-RWKV$ model has significant implications for the future of language modeling, particularly in applications requiring high throughput and efficiency. By successfully merging the advantages of causal and diffusion models, this work paves the way for more versatile architectures that can handle a wider range of tasks while maintaining performance. The findings could influence subsequent research in model design, particularly in exploring further optimizations in decoding efficiency and task adaptability.

Authors: Ke Lin, Yiyang Luo, Zhaolong Su, Yunya Song, Anyi Rao
Source: arXiv:2605.25969
URL: https://arxiv.org/abs/2605.25969v1

Turing Wire

By Turing Wire editorial staff · May 25, 2026 · Editorial standards →

Source: arXiv cs.CL