Notable efficiency inference Hugging Face

Holding the FP8 Quality Ceiling at 8-Bit Weights and Activations: INT8 and GGUF Post-Training Quantization of Ideogram 4.0 for Consumer GPUs

Deep Gandhi, Ali Asaria, Tony Salomone

Published: Jun 10, 2026 — 16:19 UTC

Problem
This work addresses the gap in empirical evaluations of hardware-specific trade-offs in post-training quantization for large text-to-image diffusion transformers, particularly for consumer-grade GPUs that lack FP8 tensor cores. The authors focus on the Ideogram 4.0 model, a 9.3B parameter flow-matching diffusion transformer, and present their findings in a preprint format, indicating that the work has not yet undergone peer review.

Method
The authors propose an INT8 W8A8 quantization strategy that employs per-channel weights and per-token dynamic activations, utilizing SmoothQuant and mixed-precision protection for a subset of high-fragility layers. This method is specifically tailored for the Ampere RTX 3090 GPU architecture. The quantization process aims to maintain the quality ceiling established by FP8 while optimizing for 8-bit weights and activations. Additionally, they introduce GGUF Q4_K quantization, which is designed to outperform NF4 quantization at equivalent on-disk sizes, thereby achieving a balance between quality and memory efficiency.

Results
The INT8 quantization method demonstrates significant performance improvements over NF4, achieving a +1.9 increase in CLIP scores (95% CI [1.21, 2.64]), with confidence intervals excluding zero, indicating statistical significance. The authors also report that the INT8-FP8 comparison on a 200-prompt benchmark includes zero in the paired same-seed bootstrap confidence intervals for both Pick and CLIP metrics, suggesting that INT8 maintains FP8 quality levels. Furthermore, the per-category OCR analysis reveals that text legibility is preserved across quantized outputs. The GGUF Q4_K quantization method is identified as the Pareto optimal solution on the quality-memory frontier, with Q8_0 showing quality neutrality.

Limitations
The authors acknowledge that their findings are specific to the RTX 3090 architecture and may not generalize to other hardware configurations. They also note that while their INT8 weights match the footprint of FP8, this does not lead to a reduction in memory usage, indicating potential limitations in speed gains. Additionally, the study does not explore the implications of quantization on other model architectures or tasks beyond text-to-image generation.

Why it matters
This research provides critical insights into the practical application of post-training quantization techniques for large-scale diffusion models on consumer hardware, which is essential for democratizing access to advanced AI capabilities. The findings suggest that high-quality outputs can be achieved without the need for specialized hardware, thus broadening the scope for deployment in resource-constrained environments. This work lays the groundwork for future explorations in quantization strategies and their impact on model performance, as discussed in related literature on quantization in deep learning, such as in arXiv cs.LG.

By Callan Zhang · Jun 10, 2026 · Editorial standards →

Summarised from the primary source with AI assistance under human editorial oversight. Turing Wire is not a primary source — read the original for the authoritative account.

Source: arXiv cs.LG