Real-time LLM Inference on Standard GPUs: 3k tokens/s per request
Problem: This preprint addresses the gap in real-time inference for large language models (LLMs) on standard GPUs, focusing specifically on single-request decoding speed. Existing inference software stacks are not...