Notable efficiency inference

DashAttention: Differentiable and Adaptive Sparse Hierarchical Attention

Yuxiang Huang, Nuno M. T. Gonçalves, Federico Alvetreti, Lei Li, Xu Han, Edoardo M. Ponti

Published
May 18, 2026 — 17:59 UTC

Problem
This paper addresses the limitations of existing hierarchical attention mechanisms, specifically NSA and InfLLMv2, which rely on a fixed top-k selection of key-value (KV) blocks based on coarse attention scores. This approach restricts the gradient flow between sparse and dense attention stages and assumes a constant number of relevant tokens for any query. The authors propose DashAttention, a differentiable and adaptive sparse hierarchical attention mechanism that allows for a variable number of selected blocks, thus filling a gap in the literature regarding flexible and efficient long-context modeling. This work is presented as a preprint and has not yet undergone peer review.

Method
DashAttention introduces an adaptive sparse attention mechanism using the $α$-entmax transformation, which enables the selection of a variable number of KV blocks based on the specific query. This first stage of attention is designed to maintain differentiability, allowing gradients to flow seamlessly into the second stage, where a fine-grained softmax attention is applied. The architecture is non-dispersive, which enhances its ability to model long contexts effectively. The authors also provide a GPU-optimized implementation of DashAttention in Triton, which is designed to outperform existing methods in terms of computational efficiency. The training compute details are not explicitly disclosed, but the implementation is noted for its speed improvements over FlashAttention-3 during inference.

Results
DashAttention demonstrates competitive performance against full attention mechanisms while maintaining 75% sparsity. The authors report that it achieves comparable accuracy to full attention models on large language model (LLM) benchmarks, outperforming NSA and InfLLMv2, particularly in high-sparsity scenarios. The results indicate a better Pareto frontier, suggesting that DashAttention provides a more favorable trade-off between accuracy and computational efficiency. Specific numerical results and comparisons to baseline models are not detailed in the abstract but are expected to be elaborated upon in the full paper.

Limitations
The authors acknowledge that while DashAttention improves upon existing hierarchical attention methods, it may still face challenges in scenarios with extremely high sparsity or in specific applications where the assumptions of the $α$-entmax transformation do not hold. Additionally, the implementation’s performance may vary based on hardware configurations, and the lack of peer review means that the findings should be interpreted with caution until validated by the community.

Why it matters
DashAttention has significant implications for the development of more efficient long-context modeling techniques in natural language processing. By enabling adaptive sparsity and maintaining differentiability, it opens avenues for further research into scalable attention mechanisms that can handle larger contexts without incurring prohibitive computational costs. This work could influence future architectures in LLMs and other applications requiring efficient attention mechanisms.

Authors: Yuxiang Huang, Nuno M. T. Gonçalves, Federico Alvetreti, Lei Li, Xu Han, Edoardo M. Ponti, André F. T. Martins, Marcos V. Treviso
Source: arXiv:2605.18753
URL: https://arxiv.org/abs/2605.18753v1

Turing Wire

By Turing Wire editorial staff · May 18, 2026 · Editorial standards →

Source: arXiv cs.CL