Corpus Augmentation for Sign Language Translation via LLM-Guided Video Stitching

Zsolt Robotka, Ádám Rák, Jalal Al-Afandi, András Horváth, György Cserey

Published
Jun 10, 2026 — 10:56 UTC

Problem
The paper addresses the scarcity of high-quality parallel sign video-text pairs for fine-tuning sign language translation (SLT) models, which hampers generalization on long-tail vocabulary and unseen constructions. Despite advancements in weakly-aligned datasets and gloss-free methods, the reliance on expert annotation remains a bottleneck. This work is a preprint and has not undergone peer review.

Method
The authors propose a corpus augmentation technique that leverages existing gloss-annotated training data and a large language model (LLM) for sentence generation. The method involves several steps:

  1. CTC Forced-Alignment: Per-gloss clips are extracted from training videos using Connectionist Temporal Classification (CTC) forced-alignment.
  2. LLM Sentence Generation: Novel gloss-sentence pairs are generated by a corpus-anchored LLM, which utilizes the existing gloss annotations to create contextually relevant sentences.
  3. Synthetic Sequence Assembly: Synthetic RGB video-text pairs are constructed by randomly sampling generated sentences and assigning a corresponding video clip to each gloss.

The approach is architecture-agnostic: the synthetic data can be used directly by RGB-based SLT models or converted into pose or feature representations for other model types.
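The assembly step can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that forced alignment has already produced per-gloss frame spans and that the LLM has already produced gloss-sentence pairs. The function names (`build_clip_bank`, `assemble`, `make_synthetic_pairs`), the span format `(gloss, start, end)`, and the choice to skip sentences containing a gloss with no available clip are all hypothetical choices made for illustration.

```python
import random
from collections import defaultdict

import numpy as np


def build_clip_bank(videos, alignments):
    """Cut per-gloss clips out of training videos.

    videos:     dict video_id -> array of shape (T, H, W, 3)
    alignments: dict video_id -> list of (gloss, start, end) frame spans,
                with `end` exclusive (assumed output of forced alignment).
    """
    bank = defaultdict(list)
    for video_id, spans in alignments.items():
        frames = videos[video_id]
        for gloss, start, end in spans:
            if end > start:  # ignore empty spans
                bank[gloss].append(frames[start:end])
    return dict(bank)


def assemble(gloss_seq, bank, rng):
    """Concatenate one randomly chosen clip per gloss along the time axis.

    Returns None if any gloss has no clip in the bank.
    """
    clips = []
    for gloss in gloss_seq:
        if gloss not in bank:
            return None
        clips.append(rng.choice(bank[gloss]))
    return np.concatenate(clips, axis=0)


def make_synthetic_pairs(generated, bank, n, seed=0):
    """Draw up to n synthetic (video, sentence) pairs.

    generated: list of (gloss_sequence, sentence) pairs (assumed LLM output).
    Draws whose gloss sequence cannot be covered by the bank are skipped.
    """
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        gloss_seq, sentence = rng.choice(generated)
        video = assemble(gloss_seq, bank, rng)
        if video is not None:
            pairs.append((video, sentence))
    return pairs
```

Clips are joined with hard cuts and no transition smoothing, which keeps the sketch short. The resulting arrays could be passed to an RGB model as-is or run through a pose or feature extractor first.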

Results
Applied to the GFSLT-VLP baseline, the augmentation improves BLEU-4 by +2.92 points, roughly three times the largest verified gain of +0.98 BLEU-4 reported by Sincan et al. under identical conditions. The gain is obtained without altering the underlying architecture or training protocol.

Limitations
The authors acknowledge that while synthetic data improves SLT objectives, it can negatively impact vision-language pretraining. Additionally, they note that optimizing clip transitions for visual smoothness may be counterproductive when evaluated under L2-based criteria, suggesting that abrupt transitions might serve as a form of implicit regularization. However, the paper does not explore the long-term effects of synthetic data on model robustness or generalization beyond the immediate performance metrics.

Why it matters
The work could make SLT systems more accessible by reducing the need for extensive human annotation. Because the method changes only the training data, it can be integrated into existing SLT frameworks, potentially improving performance on diverse datasets. The findings also encourage further exploration of synthetic data generation in multimodal learning.

Turing Wire

By Turing Wire editorial staff · Jun 10, 2026

Source: arXiv cs.LG