An End-to-End Hybrid Framework for Rumour Detection in Low-Resources Algerian Dialect
Dihia Lanasri, Fatima Benbarek
- Published
- Jun 11, 2026 — 14:40 UTC
Problem
The proliferation of social media has exacerbated the dissemination of rumours, particularly in the Algerian context where informal dialects and code-switching complicate detection efforts. The authors identify a significant gap in the literature regarding effective rumour detection methodologies tailored for low-resource dialects, particularly given the scarcity of annotated datasets and the ineffectiveness of standard Arabic NLP tools on dialectal text. This work is a preprint and has not undergone peer review.
Method
The authors propose an end-to-end hybrid framework for rumour detection that integrates multiple approaches, including classical machine learning, deep learning, and transformer models. They construct a domain-specific annotated dataset by merging real social media posts, synthetic data, and the FASSILA corpus, employing a similarity-based annotation process for automatic labeling. A transliteration pipeline is introduced to create parallel datasets in both Arabic script and Arabizi, facilitating model training. The hybrid model combines transformer embeddings with a classical classifier, optimizing for performance in the unique context of Algerian dialects. The training process emphasizes domain-specific pre-training over model size, with models trained on social media data yielding superior results compared to larger models trained on formal Arabic corpora.
Results
The hybrid framework achieves an F1-score of 0.84, outperforming various baseline models across multiple evaluation metrics. Notably, the authors demonstrate that their approach surpasses traditional methods, including both classical machine learning and standalone transformer models, indicating a significant improvement in rumour detection capabilities within the low-resource Algerian dialect context.
Limitations
The authors acknowledge several limitations, including the reliance on a relatively small dataset, which may affect generalizability. Additionally, the transliteration process may introduce noise, potentially impacting model performance. They also note that while their hybrid approach shows promise, further exploration into more sophisticated models and larger datasets is necessary for broader applicability. The study does not address the potential biases introduced by the synthetic data generation process.
Why it matters
This research has significant implications for the field of NLP, particularly in low-resource settings where traditional methods fall short. By demonstrating the effectiveness of a hybrid approach tailored to the nuances of Algerian dialects, the authors pave the way for future work in rumour detection and other NLP tasks in similar contexts. The findings underscore the importance of domain-specific data and pre-training, which could inform subsequent research and applications in dialectal NLP, as published in arXiv.
By Turing Wire editorial staff · Jun 11, 2026 · Editorial standards →
Source: arXiv cs.CL