Major multimodal

ML-Embed: Inclusive and Efficient Embeddings for a Multilingual World

Ziyin Zhang, Zihan Liao, Hang Yu, Peng Di, Rui Wang

Published
May 14, 2026 — 17:05 UTC

Problem
This preprint addresses significant gaps in the development of high-quality text embeddings, particularly the exclusionary trends driven by high computational costs, a narrow focus on dominant languages, and a lack of transparency in model architectures. The authors highlight that existing models often neglect low-resource languages and are not openly accessible, which limits research and application in a multilingual context.

Method
The core technical contribution is the introduction of ML-Embed, a suite of models based on the 3-Dimensional Matryoshka Learning (3D-ML) framework. This framework comprises three key components:

  1. Matryoshka Representation Learning (MRL): This component enhances storage efficiency by structuring embeddings in a nested manner, allowing for reduced memory usage.
  2. Matryoshka Layer Learning (MLL): This allows for flexible inference-time depth, enabling models to adapt their complexity based on the task requirements.
  3. Matryoshka Embedding Learning (MEL): This focuses on parameter efficiency, optimizing the number of parameters needed for effective embeddings.

The authors trained a range of models from 140M to 8B parameters on a newly curated multilingual dataset, which encompasses a wide variety of languages, including low-resource ones. The training process emphasizes efficiency across the model lifecycle, although specific compute resources used for training are not disclosed.

Results
ML-Embed models achieved state-of-the-art performance on 9 out of 17 evaluated benchmarks from the Multilingual Text Embedding Benchmark (MTEB), demonstrating significant improvements particularly in low-resource languages. The authors report that their models outperform existing baselines, although specific numerical results and effect sizes are not detailed in the abstract. The comprehensive evaluation across 430 tasks indicates a robust performance across diverse linguistic contexts, reinforcing the models’ applicability in real-world scenarios.

Limitations
The authors acknowledge that while ML-Embed addresses several barriers, it may still face challenges related to the scalability of the models for extremely low-resource languages and the potential for bias in the curated dataset. Additionally, the computational efficiency claims, while promising, require further validation in practical deployment scenarios. The lack of detailed training compute specifications also limits reproducibility assessments.

Why it matters
The implications of this work are significant for the field of multilingual NLP. By providing a transparent and efficient framework for embedding generation, ML-Embed sets a precedent for future research aimed at inclusivity in AI systems. The release of models, data, and code fosters an open research environment, encouraging further exploration and development of equitable AI technologies. This work could catalyze advancements in low-resource language processing, ultimately contributing to a more inclusive digital landscape.

Authors: Ziyin Zhang, Zihan Liao, Hang Yu, Peng Di, Rui Wang
Source: arXiv:2605.15081
URL: https://arxiv.org/abs/2605.15081v1

Turing Wire

By Turing Wire editorial staff · May 14, 2026 · Editorial standards →

Source: arXiv cs.AI