Notable training methods OpenAI

FOAM: Frequency and Operator Error-Based Adaptive Damping Method for Reducing Staleness-Oriented Error for Shampoo

Kyunghun Nam, Sumyeong Ahn

Published
Jun 1, 2026 — 15:13 UTC

Problem
Shampoo has emerged as a powerful optimization algorithm for large-scale machine learning tasks, yet it suffers from significant computational overhead due to matrix inversion. This paper addresses the gap in the literature regarding the trade-off between computational efficiency and optimization fidelity when using stale preconditioner updates. The authors conduct a theoretical study of staleness, highlighting its detrimental effects on convergence and stability. This work is a preprint and has not undergone peer review.

Method
The core contribution of this paper is the FOAM algorithm, which introduces an adaptive damping mechanism to stabilize training by dynamically adjusting the damping factor and the frequency of eigendecomposition. The authors derive a theoretical framework that quantifies staleness-oriented error and its impact on optimization performance. FOAM utilizes an approximation of this error to inform its adaptive damping strategy, effectively acting as a numerical stabilizer. The authors do not disclose specific details regarding the architecture or the exact computational resources used for training, focusing instead on the algorithmic innovations.

Results
Experimental evaluations demonstrate that FOAM significantly reduces wall-clock time compared to the standard Shampoo algorithm while preserving robust convergence properties. The authors report that FOAM achieves a reduction in training time by approximately 20% on large-scale benchmarks, such as CIFAR-10 and ImageNet, compared to baseline methods. These results indicate that FOAM not only enhances computational efficiency but also maintains the optimization fidelity necessary for effective training.

Limitations
The authors acknowledge that while FOAM effectively mitigates staleness-oriented errors, it may introduce additional complexity in tuning the adaptive parameters. They also note that the theoretical analysis is based on specific assumptions that may not hold in all practical scenarios. An obvious limitation not discussed is the potential impact of FOAM’s adaptive damping on convergence rates in highly non-convex landscapes, which could vary depending on the specific characteristics of the dataset and model architecture.

Why it matters
The implications of this work are significant for practitioners in the field of large-scale optimization, as it provides a method to balance computational efficiency with convergence stability. By addressing the staleness problem in preconditioned optimization, FOAM could facilitate faster training times without sacrificing performance, making it a valuable tool for deep learning applications. This research contributes to the ongoing discourse on optimization techniques in machine learning, as published in arXiv cs.LG.

Turing Wire

By Turing Wire editorial staff · Jun 1, 2026 · Editorial standards →

Source: arXiv cs.LG