Notable theory

Exploding and vanishing gradients in deep neural networks: the effect of residual connections

Vivek S Borkar

Published: Jun 15, 2026 — 17:46 UTC

Problem
The paper addresses the well-documented issues of exploding and vanishing gradients in deep neural networks, which hinder effective training, particularly in deep architectures. Despite extensive literature on these phenomena, there remains a lack of rigorous mathematical analysis that connects these issues to the structural properties of neural networks. This work leverages multiplicative ergodic theory to provide a formal characterization of the effects of residual connections on gradient behavior. As a preprint, it contributes to ongoing discussions in the field without peer review.

Method
The author employs multiplicative ergodic theory, specifically utilizing the concepts of Liapunov exponents as established by Furstenberg and Kifer. The analysis focuses on the Liapunov spectrum, which quantifies the growth rates of perturbations in the network’s dynamics. By introducing residual connections, the paper demonstrates how these connections alter the Liapunov spectrum, thereby influencing the stability of gradient propagation. The theoretical framework is supported by mathematical derivations that elucidate the relationship between network architecture and gradient behavior, although specific training compute and data details are not disclosed.

Results
The paper presents a theoretical framework rather than empirical results, focusing on the implications of residual connections on the Liapunov spectrum. It posits that the introduction of residual connections can mitigate the effects of vanishing gradients by altering the growth rates of the network’s dynamics. While no specific numerical results or comparisons to baseline architectures are provided, the theoretical insights suggest that networks with residual connections exhibit improved stability in gradient propagation, which is critical for training deep models effectively.

Limitations
The author acknowledges that the theoretical nature of the work does not provide empirical validation of the claims made regarding the effects of residual connections. Additionally, the analysis is limited to specific types of architectures and may not generalize to all neural network configurations. The paper does not explore the practical implications of these findings in terms of training efficiency or performance on standard benchmarks, which could provide a more comprehensive understanding of the impact of residual connections.

Why it matters
This work has significant implications for the design and training of deep neural networks, particularly in understanding how architectural choices can influence gradient behavior. By providing a rigorous mathematical foundation for the effects of residual connections, it opens avenues for further research into network architectures that can better mitigate gradient-related issues. This is particularly relevant for practitioners aiming to develop deeper networks without succumbing to the pitfalls of exploding or vanishing gradients, as discussed in the broader literature on deep learning. For further details, see the full paper available on arXiv.

By Callan Zhang · Jun 15, 2026 · Editorial standards →

Summarised from the primary source with AI assistance under human editorial oversight. Turing Wire is not a primary source — read the original for the authoritative account.

Source: arXiv cs.LG