Transcoda: End-to-End Zero-Shot Optical Music Recognition via Data-Centric Synthetic Training

Daniel Dratschuk, Paul Swoboda

Published: May 11, 2026 — 16:50 UTC

Problem
Optical Music Recognition (OMR) is held back by the scarcity of large-scale annotated datasets derived from real scans. Existing models often depend on few-shot transfer learning or on simplistic synthetic training data that does not capture the complexity of music notation. A second obstacle is that the Humdrum **kern format is not a unique encoding: several different text encodings can correspond to the same rendered page of sheet music, so the training target for a given image is ambiguous. The preprint proposes to address both issues.
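To make the non-uniqueness concrete, the following minimal sketch uses a simplified, kern-like token syntax. It assumes, purely for illustration, that the duration, pitch, and accidental inside one note token may be written in any order without changing the rendered score. The regular expression, the function names, and the chosen canonical order are hypothetical and are not taken from the paper or from the full **kern specification.

```python
import re

# Toy illustration only: this is NOT the paper's normalization procedure.
# Assumption: a note token consists of duration digits (with optional dots),
# pitch letters, and accidental characters, in any order.
TOKEN_PARTS = re.compile(r"(?P<dur>\d+\.*)|(?P<pitch>[a-gA-G]+)|(?P<acc>[#\-n]+)")


def canonical_token(token: str) -> str:
    """Reorder a toy note token as duration, pitch, accidental.

    Characters outside the three toy classes are dropped.
    """
    parts = {"dur": "", "pitch": "", "acc": ""}
    for match in TOKEN_PARTS.finditer(token):
        parts[match.lastgroup] += match.group()
    return parts["dur"] + parts["pitch"] + parts["acc"]


def canonical_line(line: str) -> str:
    """Canonicalize every tab-separated token on one line."""
    return "\t".join(canonical_token(tok) for tok in line.split("\t"))


# Two different strings that, under the assumption above, describe the same notes.
variant_a = "4c#\t8.ee-"
variant_b = "c#4\tee-8."
same_target = canonical_line(variant_a) == canonical_line(variant_b)
```

Both variants collapse to the same string here, so a model trained on canonical targets would see one label per image rather than several competing ones.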

Method
The authors introduce Transcoda, an end-to-end OMR system built on three components: (1) a synthetic data generation pipeline that produces diverse training samples, (2) a normalization step that maps each **kern encoding to a unique normal form, reducing ambiguity in the training targets, and (3) grammar-based decoding that guarantees syntactically correct output. The model is a compact 59M-parameter neural network trained in 6 hours on a single GPU. It is trained on the pipeline's synthetic data, which compensates for the lack of real annotated scans.
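The general idea behind grammar-based decoding can be sketched as follows: at each step, tokens that the grammar forbids in the current state are masked out before the best-scoring token is chosen. The vocabulary, the three-state grammar, and the scores below are all made up for illustration; the summary does not describe the grammar or decoding algorithm that Transcoda actually uses.

```python
import math

# Hypothetical toy grammar, not the one used by Transcoda:
# every line is DURATION PITCH NEWLINE, and decoding may stop at a line start.
VOCAB = ["4", "8", "c", "d", "\n", "<eos>"]
ALLOWED = {
    "start": {"4", "8", "<eos>"},
    "duration": {"c", "d"},
    "pitch": {"\n"},
}
NEXT_STATE = {"4": "duration", "8": "duration", "c": "pitch", "d": "pitch", "\n": "start"}


def constrained_greedy_decode(step_scores):
    """Greedy decoding restricted to tokens the grammar allows in each state."""
    state, output = "start", []
    for scores in step_scores:
        # Forbidden tokens get a score of -inf so they can never be selected.
        masked = [
            score if token in ALLOWED[state] else -math.inf
            for token, score in zip(VOCAB, scores)
        ]
        token = VOCAB[masked.index(max(masked))]
        if token == "<eos>":
            break
        output.append(token)
        state = NEXT_STATE[token]
    return output


# Made-up scores standing in for a model's per-step logits.
step_scores = [
    [0.1, 0.2, 0.9, 0.0, 0.0, 0.0],  # unconstrained argmax "c" is illegal at a line start
    [0.8, 0.1, 0.3, 0.5, 0.0, 0.0],  # unconstrained argmax "4" is illegal after a duration
    [0.0, 0.0, 0.0, 0.0, 0.2, 0.9],  # "<eos>" is illegal before the line is closed
    [0.0, 0.0, 0.0, 0.0, 0.0, 1.0],  # "<eos>" is legal at a line start
]
decoded = constrained_greedy_decode(step_scores)
```

In this toy run the unconstrained argmax would have produced an invalid sequence, while the masked version yields a well-formed line. The same masking idea extends to beam search and to richer grammars.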

Results
On a newly curated benchmark of synthetically rendered scores, Transcoda attains an Optical Music Recognition Normalized Edit Distance (OMR-NED) of 18.46%, where lower is better. The next-best system, Legato, scores 43.91% OMR-NED. On historical Polish scans, Transcoda reduces the error rate to 63.97% OMR-NED, down from the 80.16% of the previous state-of-the-art method, SMT++. The gain on real scans is therefore smaller than on the synthetic benchmark, and the absolute error there remains high.
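For readers unfamiliar with edit-distance metrics, the sketch below shows a generic normalized edit distance and the relative error reductions implied by the reported figures. It assumes plain Levenshtein distance divided by the length of the longer sequence. OMR-NED itself may be defined differently, and its exact definition is not given in this summary. The function names are illustrative.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # delete x
                curr[j - 1] + 1,         # insert y
                prev[j - 1] + (x != y),  # substitute, free if the items match
            ))
        prev = curr
    return prev[-1]


def normalized_edit_distance(pred, ref):
    """Edit distance divided by the longer length; 0.0 means a perfect match."""
    longest = max(len(pred), len(ref))
    return 0.0 if longest == 0 else edit_distance(pred, ref) / longest


def relative_reduction(baseline, new):
    """Fraction of the baseline error removed by the new system."""
    return (baseline - new) / baseline


# Figures reported in the summary, in percent OMR-NED (lower is better).
synthetic_gain = relative_reduction(43.91, 18.46)  # Legato vs. Transcoda
polish_gain = relative_reduction(80.16, 63.97)     # SMT++ vs. Transcoda
```

With the reported numbers, this works out to roughly a 58% relative error reduction on the synthetic benchmark and roughly 20% on the historical Polish scans.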

Limitations
The authors acknowledge that synthetic data may not capture the full complexity of real-world music notation, which could limit generalization to unseen data. Normalizing the **kern encoding improves consistency but may exclude valid variations in musical representation. The paper does not address how the synthetic data generation pipeline scales or what larger datasets would cost computationally. The evaluation also focuses primarily on synthetically generated scores, which may not reflect performance on diverse real-world datasets.

Why it matters
The work matters for OMR and music information retrieval because it shows that a compact model can outperform larger baselines through data-centric design, which points toward OMR systems that can be trained with little annotated data. Encoding normalization and grammar-based decoding may also transfer to other sequence-to-sequence tasks, particularly in domains where annotated data is scarce.

Source: arXiv:2605.10835
URL: https://arxiv.org/abs/2605.10835v1

By Callan Zhang · May 11, 2026

Summarised from the primary source with AI assistance under human editorial oversight. Turing Wire is not a primary source — read the original for the authoritative account.

Source: arXiv cs.LG