Notable theory Hugging Face

Length Generalization with Log-Depth Recurrent Units

Charles Pert, Dalal Alrajeh, Alessandra Russo

Published: May 25, 2026 — 17:02 UTC

Problem
This paper addresses the challenge of length generalization in neural networks, particularly in recurrent models that exhibit positional biases and transformers that are limited by fixed computational depth. The authors highlight that existing literature has not sufficiently tackled the issue of generalizing to longer sequences, especially in the context of regular languages, which serve as a robust testbed for evaluating such capabilities. This work is presented as a preprint and has not yet undergone peer review.

Method
The core technical contribution is the introduction of the MLP-LDRU (Multi-Layer Perceptron Log-Depth Recurrent Unit), which is designed to approximate recurrence through a class of associativity-biased operators. The architecture leverages parallel reduction techniques to enhance the model’s ability to generalize across varying sequence lengths. The authors conduct experiments on 21 regular-language tasks, including both standard benchmarks and newly introduced prefix languages. The training process involves increasing the maximum training length to assess the model’s performance on out-of-distribution data. Specific details regarding the loss function, training compute, and hyperparameters are not disclosed in the abstract.

Results
MLP-LDRU achieves remarkable performance, attaining 100% out-of-distribution accuracy on 18 out of 21 tasks and at least 99.9% accuracy on the remaining 3 tasks when the maximum training length is increased. This performance significantly surpasses that of comparable recurrent and attention-based models, demonstrating a clear advantage in length generalization. Additionally, the model is evaluated on ListOps and various NLP classification benchmarks, where it maintains competitive performance, although specific metrics and comparisons to named baselines are not detailed in the abstract.

Limitations
The authors acknowledge that while MLP-LDRU shows strong performance on regular languages, the generalizability of the model to more complex language structures or tasks beyond the evaluated benchmarks remains uncertain. They do not discuss potential limitations related to the computational efficiency of the MLP-LDRU architecture or the scalability of the model to larger datasets. Furthermore, the lack of peer review may indicate that the findings are preliminary and subject to change upon further validation.

Why it matters
This work has significant implications for advancing the understanding of length generalization in neural networks, particularly in recurrent architectures. By introducing MLP-LDRU, the authors provide a novel approach that could inspire further research into associativity-biased operators and their applications in various sequence-based tasks. The success of MLP-LDRU on regular languages suggests potential pathways for improving model robustness in real-world applications where sequence length variability is prevalent, such as natural language processing and time-series analysis.

Authors: Charles Pert, Dalal Alrajeh, Alessandra Russo
Source: arXiv:2605.26035
URL: https://arxiv.org/abs/2605.26035v1

By Callan Zhang · May 25, 2026 · Editorial standards →

Summarised from the primary source with AI assistance under human editorial oversight. Turing Wire is not a primary source — read the original for the authoritative account.

Source: arXiv cs.LG