Efficient Pre-Training with Token Superposition
Bowen Peng, Théo Gigant, Jeffrey Quesnelle
- Published: May 7, 2026, 16:41 UTC
Problem
This paper addresses the inefficiency of pre-training Large Language Models (LLMs), which typically demands substantial compute and time. The authors note that existing efficiency methods often require complex, invasive modifications to the training pipeline to reach high data throughput, which can hinder scalability. This work presents Token-Superposition Training (TST), an approach to improve pre-training efficiency without altering the model architecture, optimizer, tokenizer, or data. Notably, this is a preprint and has not undergone peer review.
Method
The core technical contribution is the Token-Superposition Training (TST) method, which proceeds in two phases. In the superposition phase, contiguous tokens are combined into a single bag and trained against a multi-hot cross-entropy (MCE) objective; this phase is designed to maximize data throughput per FLOP during pre-training. In the subsequent recovery phase, training reverts to the standard single-token objective. The authors evaluate TST on models with 270M and 600M parameters, as well as on mixture-of-experts (MoE) models with 3B and 10B total parameters. Exact training-compute details are not disclosed, but the authors report significant efficiency gains.
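The abstract describes MCE only at a high level; the following is a minimal sketch of what a multi-hot cross-entropy objective over bags of contiguous tokens could look like in PyTorch. The bag size, uniform target weighting, and function names are our assumptions for illustration, not the paper's exact formulation.

```python
# Illustrative sketch (assumptions, not the paper's implementation):
# each bag of `bag_size` contiguous tokens becomes one training target,
# scored with a multi-hot cross-entropy against the model's logits.
import torch
import torch.nn.functional as F

def multi_hot_cross_entropy(logits: torch.Tensor, token_bags: torch.Tensor) -> torch.Tensor:
    """logits: (num_bags, vocab_size); token_bags: (num_bags, bag_size) token ids."""
    # Spread each bag's probability mass uniformly over the tokens it contains.
    weights = torch.full(token_bags.shape, 1.0 / token_bags.size(1), dtype=logits.dtype)
    targets = torch.zeros_like(logits).scatter_add_(1, token_bags, weights)
    log_probs = F.log_softmax(logits, dim=-1)
    return -(targets * log_probs).sum(dim=-1).mean()

# Toy usage: bags of 2 contiguous tokens halve the number of target positions,
# which is where the throughput-per-FLOP gain would come from.
vocab_size, bag_size, seq_len = 50_000, 2, 128
tokens = torch.randint(0, vocab_size, (seq_len,))        # original token stream
bags = tokens.view(-1, bag_size)                         # (64, 2) contiguous bags
logits = torch.randn(bags.size(0), vocab_size)           # one prediction per bag
loss = multi_hot_cross_entropy(logits, bags)
```

In the full method, the input side would presumably also superpose each bag's token embeddings into a single position, but the abstract does not spell out that detail.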
Results
TST shows consistent improvements over the baseline across settings. Specifically, at the 10B-A1B scale (10B total parameters, 1B active), TST reduces total pre-training time by up to 2.5x at equal loss. The authors report that TST outperforms the baseline on both training loss and downstream evaluation tasks, although specific numbers against named baselines are not given in the abstract. The gains hold across different model sizes and configurations, indicating the method's robustness and versatility.
Limitations
The authors acknowledge that while TST improves efficiency, it may not transfer to every model architecture or training regime. They do not discuss potential effects on model generalization or on performance in low-resource settings, which could matter for some applications. In addition, the reliance on multi-hot targets may add memory and computational overhead, particularly for very large vocabularies.
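As a rough illustration of that overhead concern (our own note, not something the paper discusses): a dense multi-hot target matrix scales with vocabulary size, but an equivalent loss can be computed by gathering log-probabilities at the bag's token indices, keeping the target side at bag size rather than vocabulary size.

```python
# Our own illustrative variant (not from the paper): the same MCE loss as the
# sketch above, computed without materializing a dense (num_bags, vocab_size)
# multi-hot target tensor.
import torch
import torch.nn.functional as F

def mce_loss_gathered(logits: torch.Tensor, token_bags: torch.Tensor) -> torch.Tensor:
    """logits: (num_bags, vocab_size); token_bags: (num_bags, bag_size) token ids."""
    log_probs = F.log_softmax(logits, dim=-1)      # softmax over the full vocabulary
    picked = log_probs.gather(1, token_bags)       # (num_bags, bag_size) log-probs
    return -picked.mean(dim=-1).mean()             # uniform 1/bag_size weight per token
```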
Why it matters
The implications of TST are significant for the field of LLMs, particularly in contexts where computational resources are constrained. By improving data throughput without necessitating invasive changes to existing training paradigms, TST could facilitate the pre-training of larger models or enable more extensive experimentation with limited resources. This work may pave the way for future research focused on optimizing training efficiency and scalability in LLMs, potentially influencing the design of next-generation architectures and training methodologies.
Authors: Bowen Peng, Théo Gigant, Jeffrey Quesnelle
Source: arXiv:2605.06546
URL: https://arxiv.org/abs/2605.06546v1