BamiBERT: A New BERT-based Language Model for Vietnamese
Dat Quoc Nguyen, Thinh Pham, Chi Tran, Linh The Nguyen
Published: Jul 2, 2026, 14:46 UTC
Problem — This work addresses the limitations of PhoBERT, the prevailing Vietnamese text encoder, by introducing BamiBERT, a new pre-trained language model specifically designed for Vietnamese. The paper is a preprint and has not undergone peer review.
Method — BamiBERT is trained from scratch for 20 epochs on a 129GB corpus of general-domain Vietnamese text. It supports a context length of up to 2048 tokens and takes raw text as input, so no external word segmentation step is needed before encoding. The longer context and the simpler input pipeline are intended to make the model more broadly applicable and more efficient on Vietnamese text.
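To make the preprocessing difference concrete, here is a minimal, dependency-free sketch. It assumes the convention PhoBERT documents for its input, where text is word-segmented beforehand and the syllables of a multi-syllable word are joined with underscores. The function names and the hand-written segmentation are hypothetical illustrations, not part of either model's API, and the sketch does not load any actual checkpoint.

```python
# Illustrative sketch only: helper names and the sample segmentation are
# assumptions made for this example, not code from the BamiBERT paper.

MAX_CONTEXT_TOKENS = 2048  # context length reported for BamiBERT in the summary


def join_segmented_words(words):
    """Format pre-segmented words for a segmentation-dependent encoder.

    Each item in `words` is one Vietnamese word whose syllables are
    separated by spaces; the syllables are joined with underscores.
    """
    return " ".join(word.replace(" ", "_") for word in words)


def truncate_to_context(token_ids, max_tokens=MAX_CONTEXT_TOKENS):
    """Keep at most `max_tokens` token ids, dropping the tail."""
    return token_ids[:max_tokens]


raw_text = "Chúng tôi là những nghiên cứu viên"

# In a segmentation-dependent pipeline this list would come from an
# external word segmenter; it is written by hand here.
segmented_words = ["Chúng tôi", "là", "những", "nghiên cứu viên"]

segmented_input = join_segmented_words(segmented_words)
print(segmented_input)  # Chúng_tôi là những nghiên_cứu_viên

# A model that accepts raw input would receive `raw_text` unchanged,
# with no segmentation step in front of the tokenizer.
print(raw_text)
```

The first print shows the extra formatting step a segmentation-dependent encoder needs. The second shows that a raw-input model, as BamiBERT is described here, skips that step entirely.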
Results — BamiBERT achieves the best score on 11 out of 15 metrics and ranks second on 3 others across 8 Vietnamese benchmarks, establishing a new state of the art among
By Callan Zhang · Jul 2, 2026
Summarised from the primary source with AI assistance under human editorial oversight. Turing Wire is not a primary source — read the original for the authoritative account.
Source: arXiv cs.CL