Length Generalization with Log-Depth Recurrent Units
Charles Pert, Dalal Alrajeh, Alessandra Russo
- Published
- May 25, 2026 — 17:02 UTC
Problem
This paper addresses the challenge of length generalization in neural networks, particularly in recurrent models that exhibit positional biases and transformers that are limited by fixed computational depth. The authors highlight that existing literature has not sufficiently tackled the issue of generalizing to longer sequences, especially in the context of regular languages, which serve as a robust testbed for evaluating such capabilities. This work is presented as a preprint and has not yet undergone peer review.
Method
The core technical contribution is the introduction of the MLP-LDRU (Multi-Layer Perceptron Log-Depth Recurrent Unit), which is designed to approximate recurrence through a class of associativity-biased operators. The architecture leverages parallel reduction techniques to enhance the model’s ability to generalize across varying sequence lengths. The authors conduct experiments on 21 regular-language tasks, including both standard benchmarks and newly introduced prefix languages. The training process involves increasing the maximum training length to assess the model’s performance on out-of-distribution data. Specific details regarding the loss function, training compute, and hyperparameters are not disclosed in the abstract.
Results
MLP-LDRU achieves remarkable performance, attaining 100% out-of-distribution accuracy on 18 out of 21 tasks and at least 99.9% accuracy on the remaining 3 tasks when the maximum training length is increased. This performance significantly surpasses that of comparable recurrent and attention-based models, demonstrating a clear advantage in length generalization. Additionally, the model is evaluated on ListOps and various NLP classification benchmarks, where it maintains competitive performance, although specific metrics and comparisons to named baselines are not detailed in the abstract.
Limitations
The authors acknowledge that while MLP-LDRU shows strong performance on regular languages, the generalizability of the model to more complex language structures or tasks beyond the evaluated benchmarks remains uncertain. They do not discuss potential limitations related to the computational efficiency of the MLP-LDRU architecture or the scalability of the model to larger datasets. Furthermore, the lack of peer review may indicate that the findings are preliminary and subject to change upon further validation.
Why it matters
This work has significant implications for advancing the understanding of length generalization in neural networks, particularly in recurrent architectures. By introducing MLP-LDRU, the authors provide a novel approach that could inspire further research into associativity-biased operators and their applications in various sequence-based tasks. The success of MLP-LDRU on regular languages suggests potential pathways for improving model robustness in real-world applications where sequence length variability is prevalent, such as natural language processing and time-series analysis.
Authors: Charles Pert, Dalal Alrajeh, Alessandra Russo
Source: arXiv:2605.26035
URL: https://arxiv.org/abs/2605.26035v1
By Turing Wire editorial staff · May 25, 2026 · Editorial standards →
Source: arXiv cs.LG