PairAlign: A Framework for Sequence Tokenization via Self-Alignment with Applications to Audio Tokenization
Adhiraj Banerjee, Vipul Arora
- Published: May 7, 2026, 17:11 UTC
Problem
This paper addresses the limitations of existing audio tokenization methods, which typically rely on quantization, clustering, or codec reconstruction. These methods often fail to optimize for sequence consistency, compactness, length control, termination, and edit similarity. The authors propose PairAlign, a novel framework for audio tokenization that leverages sequence-level self-alignment to improve these aspects. This work is presented as a preprint and has not yet undergone peer review.
Method
PairAlign conceptualizes tokenization as a conditional sequence generation task. It employs an encoder to map audio input (speech) to a continuous condition, followed by an autoregressive decoder that generates tokens starting from a beginning-of-sequence (BOS) token. The framework utilizes two content-preserving views of the audio data, training each view’s sequence to be likely under the other’s representation while using unrelated examples to provide competing sequences. This approach aims to maintain edit-distance preservation and prevent many-to-one token collapse. Key technical contributions include:
- Architecture: A VQ-style tokenization framework enhanced with EMA-teacher targets and cross-paired teacher forcing.
- Training Techniques: Incorporation of prefix corruption, likelihood contrast, and explicit length control mechanisms.
- Data: The model is trained on 3-second speech segments, focusing on achieving compact and non-degenerate token sequences.
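The cross-view objective is only sketched in this summary; as a rough illustration (the InfoNCE-style form, function names, and decay value are assumptions, not the authors' exact formulation), pairing two content-preserving views and contrasting against unrelated examples might look like:

```python
import numpy as np

def ema_update(teacher, student, decay=0.99):
    """EMA-teacher update (assumed form): teacher parameters track an
    exponential moving average of the student's, a common
    self-distillation technique."""
    return decay * teacher + (1.0 - decay) * student

def cross_view_contrast(logp_pos, logp_neg):
    """Likelihood contrast, sketched as an InfoNCE-style loss.

    logp_pos -- log-likelihood of view A's token sequence under
                view B's condition (shape: batch,)
    logp_neg -- log-likelihoods of the same sequences under
                unrelated conditions (shape: batch, n_neg)

    Returns the mean negative log-probability of the paired condition
    under a softmax over {paired} union {unrelated}.
    """
    all_lp = np.concatenate([logp_pos[:, None], logp_neg], axis=1)
    m = all_lp.max(axis=1)  # stabilize log-sum-exp
    log_z = np.log(np.exp(all_lp - m[:, None]).sum(axis=1)) + m
    return float(np.mean(log_z - logp_pos))

# Toy check: when the paired condition assigns much higher likelihood
# than unrelated ones, the loss is near zero.
pos = np.array([0.0, 0.0])
neg = np.array([[-10.0, -10.0], [-10.0, -10.0]])
loss = cross_view_contrast(pos, neg)
```

The intent matches the summary's description: each view's sequence is pushed to be likely under the other view's condition, with unrelated examples supplying the competing sequences that prevent token collapse.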
Results
PairAlign demonstrates significant improvements over baseline methods on the TIMIT dataset. Specifically, it achieves a 55% reduction in archive token count while preserving edit-distance search capabilities. A continuous-sweep probe indicates that PairAlign exhibits lower local overlap compared to a dense geometric tokenizer, while also providing stronger length control and maintaining bounded edit trajectories for shifts under 100 ms. These results suggest that PairAlign effectively balances compactness and sequence fidelity.
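Edit-distance preservation, as evaluated here, presumably means that small input perturbations (such as sub-100 ms shifts) yield token sequences a small Levenshtein distance apart. A minimal sketch of that check (the token values are illustrative, not from the paper):

```python
def levenshtein(a, b):
    """Standard dynamic-programming edit distance between two token
    sequences (insertions, deletions, substitutions all cost 1)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (x != y)))  # substitution
        prev = cur
    return prev[-1]

# Illustrative: tokens of a clip vs. a slightly shifted copy should
# remain close, i.e. the edit trajectory stays bounded.
base    = [3, 17, 8, 8, 42, 5]
shifted = [3, 17, 8, 42, 42, 5]
d = levenshtein(base, shifted)  # small distance for a small shift
```

A tokenizer with bounded edit trajectories keeps this distance small as the shift grows, which is what makes edit-distance search over the token archive meaningful.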
Limitations
The authors acknowledge that while PairAlign improves upon existing methods, it may still face challenges in extreme cases of audio variability or in scenarios requiring real-time processing. Additionally, the reliance on two content-preserving views may limit its applicability to datasets where such views are not easily obtainable. The paper does not address potential computational overhead introduced by the autoregressive nature of the decoder, which could impact inference speed.
Why it matters
The implications of PairAlign extend to various applications in audio processing, including speech recognition, audio retrieval, and generative modeling. By providing a framework that optimizes for sequence-level properties, it opens avenues for more efficient and effective audio tokenization methods. This work could influence future research in symbolic representation learning and enhance the performance of downstream tasks that rely on discrete audio representations.
Source: arXiv:2605.06582
URL: https://arxiv.org/abs/2605.06582v1