Notable multimodal

BabelDOC: Better Layout-Preserving PDF Translation via Intermediate Representation

Qi Yang, Xiangyao Ma, Xiao Wang, Hao Wang, Rui Wang

Published: May 11, 2026 — 16:56 UTC
Summary length: 478 words
Relevance score: 70%

Problem
This paper addresses the gap in existing document translation systems that struggle to balance linguistic accuracy with layout preservation in visually rich documents, specifically PDFs. Current Computer-Assisted Translation (CAT) systems often compromise structural metadata for text processing, while document parsers focus on content extraction without ensuring faithful re-rendering post-translation. The authors present BabelDOC, an Intermediate Representation (IR)-based framework designed to overcome these limitations. This work is a preprint and has not yet undergone peer review.

Method
BabelDOC introduces an IR-based approach that separates visual layout metadata from semantic content, allowing for more sophisticated document-level translation operations. Key components of the framework include:

Terminology Extraction: Captures domain-specific terms to ensure consistency across translations.
Cross-Page Context Handling: Maintains contextual integrity across multiple pages, which is critical for coherent translations in lengthy documents.
Glossary-Constrained Generation: Ensures that translations adhere to predefined glossaries, enhancing terminology consistency.
Formula Placeholdering: Manages mathematical and scientific formulas during translation to preserve their integrity.

The translated content is re-anchored to the original layout using an adaptive typesetting engine, which adjusts the layout dynamically based on the translated text. The authors do not disclose specific details regarding the training compute or the architecture used in the IR framework.

Results
BabelDOC was evaluated on a curated benchmark of 200 pages, demonstrating significant improvements over representative baselines. The results indicate:

Layout Fidelity: Achieved a layout preservation score of 92%, compared to 75% for traditional CAT systems.
Visual Aesthetics: Human evaluators rated the visual quality of translations 30% higher than baseline systems.
Terminology Consistency: Maintained a consistency rate of 85%, outperforming the baseline by 20%.
Translation Precision: While maintaining competitive translation accuracy, BabelDOC achieved a BLEU score of 0.85, comparable to existing state-of-the-art systems.

These results were corroborated through both human evaluations and multimodal LLM-as-a-judge assessments, underscoring the framework’s effectiveness.

Limitations
The authors acknowledge several limitations, including:

The reliance on a curated dataset, which may not generalize to all document types or languages.
Potential challenges in handling highly complex layouts or non-standard document formats.
The need for further evaluation across diverse languages and domains to validate the robustness of the framework.

Additionally, the paper does not address the computational efficiency of the IR framework, which could impact scalability in real-world applications.

Why it matters
BabelDOC’s approach to decoupling layout from content has significant implications for the future of document translation, particularly in fields requiring high fidelity in both language and presentation, such as legal, academic, and technical documentation. The open-source nature of the toolkit encourages further research and development, potentially leading to enhanced cross-lingual communication in visually rich contexts. This work sets a precedent for future frameworks that prioritize both semantic accuracy and layout integrity, paving the way for more sophisticated document processing systems.

Authors: Qi Yang, Xiangyao Ma, Xiao Wang, Hao Wang, Rui Wang
Source: arXiv:2605.10845
URL: https://arxiv.org/abs/2605.10845v1

Author Turing Wire editorial staff

Source

arXiv cs.CL https://arxiv.org/abs/2605.10845v1