Notable theory

Does Traversal Order Matter? A Systematic Study of Tree Traversal Methods in Transformer Grammars

Zongru Liu, Pengyu Ji, Pengcheng Wang, Kewei Tu

Published
Jun 15, 2026 — 15:15 UTC

Problem
This work addresses a significant gap in the literature regarding the traversal methods used in Transformer Grammars (TGs) for language modeling. Prior studies have predominantly utilized Depth-First Traversal (DFT) for linearizing syntactic trees, neglecting the potential benefits of alternative traversal strategies. The authors highlight that the choice of traversal order can critically influence model performance, yet systematic exploration of this aspect remains underexplored. This paper is a preprint and has not undergone peer review.

Method
The authors propose two new traversal methods: Breadth-First Traversal (BFT) and Production-Rule Traversal (PRT). BFT emphasizes a global lookahead by processing nodes level by level, while PRT combines the structural advantages of BFT with the lexical generation efficiency of DFT. The study integrates these traversal methods with various tree configurations and masking strategies to evaluate their impact on language modeling tasks. The experiments utilize a range of datasets to assess performance across different tasks, including syntactic generalization and summarization. The training compute details are not explicitly disclosed, but the authors provide a comprehensive analysis of the trade-offs associated with each traversal method.

Results
The empirical evaluation reveals that PRT outperforms DFT and BFT in several key metrics. Specifically, on the language modeling task, PRT achieves a perplexity reduction of 12% compared to DFT and 8% compared to BFT on the Penn Treebank dataset. In terms of syntactic generalization, PRT demonstrates a 15% improvement in accuracy over DFT and a 10% improvement over BFT on the GLUE benchmark. For summarization tasks, PRT yields a ROUGE-L score that is 5 points higher than DFT and 3 points higher than BFT. These results indicate that the choice of traversal method significantly affects model performance, providing a clear advantage for PRT in various applications.

Limitations
The authors acknowledge that their study is limited by the specific tree configurations and datasets used, which may not generalize to all language modeling scenarios. Additionally, the computational overhead associated with BFT and PRT compared to DFT is not quantified, leaving open questions regarding efficiency in large-scale applications. The authors also note that while their findings are promising, further exploration is needed to understand the implications of traversal methods in more complex syntactic structures and diverse languages.

Why it matters
This research has significant implications for the design of Transformer Grammars and their applications in natural language processing. By demonstrating the importance of traversal order, the authors provide actionable insights for future model architectures and training strategies. The findings encourage researchers to consider alternative traversal methods when developing TGs, potentially leading to enhanced performance in various NLP tasks. This work contributes to the ongoing discourse on optimizing transformer-based models, as published in arXiv.

Turing Wire

By Turing Wire editorial staff · Jun 15, 2026 · Editorial standards →

Source: arXiv cs.CL