Notable other

Analyzing and Encoding the Al-Mawrid Arabic-English Dictionary with the ISO Language Markup Framework and TEI Lex-0

Diaa Fayed, Laurent Romary

Published: Jun 16, 2026 — 17:35 UTC

Problem
This work addresses the lack of standardized computational lexicons for Arabic, specifically focusing on the Al-Mawrid Arabic-English dictionary, a significant legacy print resource. The authors highlight the challenges posed by structural ambiguities and punctuation inconsistencies prevalent in 20th-century bilingual dictionaries. The paper is a preprint and has not undergone peer review, indicating that the findings should be interpreted with caution.

Method
The authors propose a dual-standard methodology that integrates the ISO Lexical Markup Framework (LMF) with the Text Encoding Initiative (TEI) Lex-0 guidelines. They conduct an empirical analysis of the dictionary’s lexical knowledge density, focusing on a representative sample from the letter Ayn, which constitutes 4.6% of the total dictionary volume. The encoding process achieves a structural parsing accuracy of 91%. The information extraction rules are quantitatively evaluated, yielding 85% precision and 98% recall for synonyms, and 88% precision for other morpho-semantic features. The study also introduces a scalable prefix-based referencing system to facilitate the integration of the resource into the semantic web, promoting interoperability and machine tractability.

Results
The methodology demonstrates a structural parsing accuracy of 91% on the sample analyzed. The precision and recall metrics for information extraction are notable: 85% precision and 98% recall for synonyms, and 88% precision for other morpho-semantic features. These results indicate a significant improvement over existing Arabic lexical resources, showcasing the potential of the proposed encoding framework.

Limitations
The authors acknowledge limitations in the TEI Lex-0 framework, particularly its inadequacy in modeling certain Arabic linguistic phenomena, such as implicit “open set” semantic relations and the representation of scattered morphological cues. Additionally, the study’s reliance on a single letter sample may not fully capture the complexities of the entire dictionary. The authors do not address potential scalability issues when applying this methodology to other sections of the dictionary or to different bilingual lexicons.

Why it matters
This research contributes to the advancement of Arabic NLP and Digital Humanities by providing a reproducible workflow for the retro-digitization of complex legacy bilingual lexicons. The integration of the Al-Mawrid dictionary into a standardized framework enhances its usability for computational applications, facilitating better access to Arabic lexical resources. The implications extend to the development of more robust NLP tools and resources, as well as the potential for enriching the semantic web with high-quality linguistic data, as published in arXiv.

By Callan Zhang · Jun 16, 2026 · Editorial standards →

Summarised from the primary source with AI assistance under human editorial oversight. Turing Wire is not a primary source — read the original for the authoritative account.

Source: arXiv cs.CL