Notable efficiency inference

Compute Efficiency and Serial Runtime Tradeoffs for Stochastic Momentum Methods

Depen Morwani, Alexandru Meterez, Pranav Nair, Sham Kakade

Published
Jun 17, 2026 — 15:19 UTC

Problem
The paper addresses a gap in understanding the trade-offs between compute efficiency (CE) and serial runtime in stochastic momentum methods, specifically heavy ball (HB) and accelerated stochastic gradient descent (ASGD). While these methods are prevalent in training neural networks, their stochastic benefits are not well characterized in terms of batch size effects on CE and serial runtime. The authors present theoretical results and empirical validation for consistent linear regression with Gaussian covariates, highlighting that existing literature lacks a comprehensive analysis of these trade-offs, particularly in the context of varying spectral properties of the data.

Method
The authors derive finite-dimensional, discrete-time lower bounds on the batch-size trade-offs for HB and ASGD. They demonstrate that HB does not enhance the CE frontier over standard stochastic gradient descent (SGD) for arbitrary spectra; instead, it maintains SGD-level CE across a broader batch-size range, allowing for reduced serial runtime until it reaches a deterministic accelerated scale. This range can be significantly larger than the critical batch size for SGD, specifically a factor of (\sqrt{\kappa}). For ASGD, the performance is more dependent on the spectral characteristics of the data: it shows improved small-batch CE over HB/SGD for rapidly decaying power-law spectra, but as batch size increases, it sacrifices this CE advantage for enhanced serial runtime. The authors validate their theoretical findings through synthetic linear regression experiments, confirming the predicted CE-serial trade-offs across different spectral regimes.

Results
The paper reports that HB maintains a CE comparable to SGD while allowing for larger batch sizes, effectively increasing the batch-size window by a factor of (\sqrt{\kappa}). In contrast, ASGD demonstrates improved CE for small batches but transitions to a regime where serial runtime becomes more favorable as batch sizes grow. The experiments corroborate these findings, showing near-overlap in performance between ASGD and HB for slowly decaying spectra, and a clear delineation of the CE-serial trade-off for rapidly decaying spectra. Specific numerical results are not disclosed in the abstract, but the qualitative behavior aligns with theoretical predictions.

Limitations
The authors acknowledge that their results are primarily theoretical and based on specific assumptions about the spectral properties of the data. They do not explore the implications of these findings in more complex, high-dimensional settings or with non-Gaussian covariates. Additionally, the empirical validation is limited to synthetic datasets, which may not fully capture the complexities of real-world applications.

Why it matters
This work provides critical insights into the efficiency of stochastic momentum methods, which are widely used in deep learning. By clarifying the trade-offs between compute efficiency and serial runtime, it informs practitioners on optimal batch size selection based on data characteristics, potentially leading to more efficient training protocols. The findings have implications for the design of future optimization algorithms and are relevant for ongoing research in the field, as published in arXiv cs.AI.

Turing Wire

By Callan Zhang · Jun 17, 2026 · Editorial standards →

Summarised from the primary source with AI assistance under human editorial oversight. Turing Wire is not a primary source — read the original for the authoritative account.

Source: arXiv cs.AI