Can Adaptive Gradient Methods Converge under Heavy-Tailed Noise? A Case Study of AdaGrad
Zijian Liu
- Published
- May 18, 2026 — 17:30 UTC
Problem
This paper addresses the gap in understanding the convergence properties of adaptive gradient methods, specifically $\mathtt{AdaGrad}$, under heavy-tailed noise conditions. While existing literature has explored various mechanisms to ensure convergence in the presence of such noise, there is limited theoretical insight into whether adaptive gradient methods can achieve convergence without algorithmic modifications. This work is presented as a preprint and has not yet undergone peer review.
Method
The authors investigate the convergence of $\mathtt{AdaGrad}$ in non-convex optimization settings when subjected to heavy-tailed noise characterized by a tail index $p$ in the range $4/3 < p \leq 2$. They derive the first provable convergence rate for $\mathtt{AdaGrad}$ under these conditions, demonstrating that the method can adaptively handle the tail index without prior knowledge of its value. Additionally, the authors establish an algorithm-dependent lower bound, indicating that the minimax rate for heavy-tailed optimization is not achievable by $\mathtt{AdaGrad}$. They also analyze $\mathtt{AdaGrad}\text{-}\mathtt{Norm}$, a variant of $\mathtt{AdaGrad}$, and present an improved convergence rate applicable for any $1 < p \leq 2$, contingent on a mild assumption regarding the noise.
Results
The paper reports that $\mathtt{AdaGrad}$ achieves a convergence rate that is provably effective for the specified range of the tail index. The authors do not provide explicit numerical results against named baselines, but they emphasize that their findings represent a significant theoretical advancement in understanding the behavior of adaptive gradient methods under heavy-tailed noise. The improved rate for $\mathtt{AdaGrad}\text{-}\mathtt{Norm}$ is also highlighted, although specific effect sizes or comparative metrics against existing optimizers are not detailed.
Limitations
The authors acknowledge that their results are limited to the specific case of $\mathtt{AdaGrad}$ and its variant, and they do not extend their findings to other adaptive gradient methods like $\mathtt{Adam}$ or $\mathtt{AdamW}$. Furthermore, the requirement of a mild assumption for the improved rate of $\mathtt{AdaGrad}\text{-}\mathtt{Norm}$ may restrict its applicability. The paper does not explore the practical implications of these theoretical results in real-world scenarios, nor does it provide empirical validation of the convergence rates.
Why it matters
This work has significant implications for the optimization community, particularly in the context of training deep learning models where heavy-tailed noise is prevalent. By establishing a theoretical foundation for the convergence of $\mathtt{AdaGrad}$ under challenging noise conditions, the findings may encourage further exploration of adaptive gradient methods in similar settings. Additionally, the results could inform the design of new optimizers that leverage the strengths of adaptive methods while ensuring robustness against heavy-tailed noise.
Authors: Zijian Liu
Source: arXiv:2605.18694
URL: https://arxiv.org/abs/2605.18694v1
By Turing Wire editorial staff · May 18, 2026 · Editorial standards →
Source: arXiv cs.LG