Recovering the Zipfian Distribution in Unsupervised Term Discovery
Danel Slabbert, Simon Malan, Herman Kamper
- Published
- Jun 9, 2026 — 12:33 UTC
Problem
The paper addresses the limitations of existing unsupervised term discovery methods, particularly the reliance on centre-based clustering techniques like K-means, which yield uniform distributions rather than the expected Zipfian distributions in lexicons. This work is particularly relevant as it challenges the prevailing methodologies in the field, highlighting a gap in the literature regarding effective clustering strategies for segmenting unlabelled speech into meaningful units. The authors note that this is a preprint and has not yet undergone peer review.
Method
The authors propose a graph-based clustering approach that utilizes segment embeddings connected by pairwise similarity. The clustering is performed using the Leiden algorithm, which is known for its efficiency in community detection within graphs. The study compares this method against traditional centre-based approaches, including K-means, Gaussian Mixture Models (GMM), and BIRCH, as well as agglomerative clustering with average linkage. The experiments are conducted on datasets from three different languages, although specific details regarding the datasets and the training compute resources are not disclosed.
Results
The graph-based clustering method significantly outperforms the centre-based approaches in terms of producing lexicons that exhibit more Zipf-like distributions. The paper reports that the graph clustering method achieves a notable improvement in the quality of term discovery, although specific quantitative metrics (e.g., F1 scores, clustering purity) are not provided in the abstract. In contrast, agglomerative clustering also shows competitive performance but is flagged as less computationally efficient and offers less control over the resulting distribution.
Limitations
The authors acknowledge that while graph-based clustering is effective, it may require more computational resources than simpler methods like K-means. They also note that agglomerative clustering, while performing well, is less scalable due to its computational inefficiency. Additionally, the paper does not explore the impact of varying hyperparameters on the clustering performance, nor does it address the potential limitations of the Leiden algorithm in different contexts.
Why it matters
This work has significant implications for the field of unsupervised learning and natural language processing, particularly in the context of speech segmentation and lexicon discovery. By demonstrating that graph-based clustering can yield more accurate representations of language distributions, the authors encourage a reevaluation of clustering methodologies in term discovery tasks. This could lead to improved performance in downstream applications such as speech recognition and language modeling, as highlighted in similar discussions in the literature (as published in arXiv cs.CL). The findings advocate for a shift towards more sophisticated clustering techniques that align better with linguistic properties.
By Turing Wire editorial staff · Jun 9, 2026 · Editorial standards →
Source: arXiv cs.CL