Real-time LLM Inference on Standard GPUs: 3k tokens/s per request
- Published
- May 29, 2026 — 09:47 UTC
Problem
This preprint addresses the gap in real-time inference capabilities for large language models (LLMs) on standard GPUs, specifically focusing on single-request decoding speed. Existing inference software stacks are not optimized for maximizing decoding speed, which is critical for applications involving autonomous agents that require rapid iterative interactions. The authors argue that current benchmarks conflate various metrics, obscuring the importance of single-request decode speed, which is essential for enhancing user experience and productivity in AI-driven workflows.
Method
The authors propose a co-design approach that integrates model architecture, runtime, and low-level GPU code to optimize for single-request latency. They emphasize that memory bandwidth, rather than FLOPS, is the primary bottleneck for fast token generation. The proposed method leverages the high memory bandwidth of modern GPUs, such as NVIDIA’s H200 and AMD’s MI300X, to achieve high decoding speeds. The authors provide a theoretical framework for token generation speed, defined as:
[ \text{tokens/s} \leq \frac{\text{effective memory bandwidth}}{\beta \times (\text{active weight bytes} + \text{KV cache})} ]
where (\beta) accounts for inefficiencies in cache reuse. The authors demonstrate that optimizing for memory bandwidth utilization (MBU) is crucial, as it allows for higher token generation rates at batch size 1, which is the focus of their implementation.
Results
The authors report achieving a decoding speed of 3,000 tokens per second on a 2B-parameter model using standard datacenter GPUs. This performance is positioned against traditional inference stacks, which typically yield significantly lower speeds. The authors highlight that their approach can deliver speeds comparable to dedicated inference hardware, thus demonstrating a substantial improvement in single-request decoding efficiency. The implications of this speed are significant for applications requiring rapid iterative processing, such as autonomous software agents.
Limitations
The authors acknowledge that their current implementation focuses on a smaller model (2B parameters) and does not explore the scalability of their approach to larger models or more complex architectures. They also note that while their method optimizes for single-request latency, it may not fully address the needs of batch processing, which remains important for overall system throughput. Additionally, the paper does not provide extensive empirical validation across diverse benchmarks, limiting the generalizability of the results.
Why it matters
This work has significant implications for the development of AI agents and applications that rely on real-time interaction and iterative processing. By demonstrating that high decoding speeds can be achieved on standard GPUs, the authors open pathways for more efficient AI systems without the need for specialized hardware. This could democratize access to high-performance inference capabilities, enabling a broader range of applications in software engineering and beyond. The focus on memory bandwidth as a limiting factor also shifts the paradigm for future research in optimizing LLM inference, suggesting new avenues for exploration in both hardware and software design.
Authors: unknown
Source: arXiv:
By Turing Wire editorial staff · May 29, 2026 · Editorial standards →
Source: Hacker News (AI filtered)