Brief Review — Multilingual E5 Text Embeddings: A Technical Report

Multilingual E5 Extends the English E5 Model

Sik-Ho Tsang
3 min read · Sep 4, 2024

Multilingual E5 Text Embeddings: A Technical Report
Multilingual E5, by Microsoft Corporation
2024 arXiv v1, Over 40 Citations (Sik-Ho Tsang @ Medium)

Sentence Embedding / Dense Text Retrieval
2017 [InferSent] 2019 [Sentence-BERT (SBERT)] 2020 [Retrieval-Augmented Generation (RAG)] [Dense Passage Retriever (DPR)] 2021 [Fusion-in-Decoder] [Augmented SBERT (AugSBERT)]
==== My Other Paper Readings Are Also Over Here ====

  • This technical report presents the open-source multilingual E5 text embedding models, released in mid-2023.
  • The training procedure adheres to the English E5 model recipe, involving contrastive pre-training on 1 billion multilingual text pairs, followed by fine-tuning on a combination of labeled datasets.
  • A new instruction-tuned embedding model is introduced, whose performance is on par with state-of-the-art English-only models of similar sizes.

Outline

  1. Multilingual E5
  2. Results

1. Multilingual E5

  • The multilingual E5 text embedding models (mE5-{small / base / large}) extend the English E5 models in Wang et al. (2022).
  • An instruction-tuned embedding model, mE5-large-instruct, is also released, which utilizes the synthetic data from Wang et al. (2023).

1.1. Weakly-supervised Contrastive Pre-training

Data mixture for contrastive pre-training.
  • In the first stage, the model is continually pre-trained on a diverse mixture of multilingual text pairs obtained from the various sources listed above.
  • The models are trained with a large batch size of 32k for a total of 30k steps, covering approximately 1 billion text pairs.
  • InfoNCE contrastive loss with only in-batch negatives is used.
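
A minimal PyTorch sketch of the InfoNCE objective with in-batch negatives is given below; the temperature value, embedding dimension, and toy tensors are illustrative assumptions rather than details from the report.

```python
# Minimal sketch of InfoNCE with in-batch negatives (PyTorch).
# Temperature and embedding sizes are illustrative assumptions.
import torch
import torch.nn.functional as F

def info_nce_in_batch(query_emb, passage_emb, temperature=0.05):
    """query_emb, passage_emb: (batch, dim) L2-normalized embeddings.
    The i-th passage is the positive for the i-th query; every other
    passage in the batch acts as a negative."""
    # Scaled cosine-similarity matrix, shape (batch, batch).
    logits = query_emb @ passage_emb.T / temperature
    # Positives lie on the diagonal.
    labels = torch.arange(query_emb.size(0), device=query_emb.device)
    return F.cross_entropy(logits, labels)

# Toy usage with random, normalized embeddings.
q = F.normalize(torch.randn(8, 768), dim=-1)
p = F.normalize(torch.randn(8, 768), dim=-1)
loss = info_nce_in_batch(q, p)
```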

1.2. Supervised Fine-tuning

Data mixture for supervised fine-tuning.
  • In the second stage, the models from the previous stage are fine-tuned on a combination of high-quality labeled datasets.
  • Mined hard negatives and knowledge distillation from a cross-encoder model are incorporated to further enhance the embedding quality.
  • For the mE5-large-instruct model, the data mixture from Wang et al. (2023), which includes an additional 500k synthetic examples generated by GPT-3.5/GPT-4, is used.
  • This new mixture encompasses 150k unique instructions and covers 93 languages. The instruction templates from Wang et al. (2023) are reused for both the training and evaluation of this instruction-tuned model.
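
As a usage-level illustration, the sketch below queries the released instruction-tuned checkpoint with an "Instruct: {task} / Query: {query}" style prompt. The model identifier intfloat/multilingual-e5-large-instruct and the template follow the public release; the task description and example texts are assumptions for illustration only.

```python
# Hedged usage sketch for the instruction-tuned mE5 model via sentence-transformers.
# The task description and example texts are made up for illustration.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/multilingual-e5-large-instruct")

task = "Given a web search query, retrieve relevant passages that answer the query"
queries = [f"Instruct: {task}\nQuery: how do multilingual text embeddings work"]
passages = [
    "Multilingual E5 maps text from many languages into a shared embedding space.",
    "Der Eiffelturm steht in Paris.",
]

# Normalized embeddings, so the dot product equals cosine similarity.
q_emb = model.encode(queries, normalize_embeddings=True)
p_emb = model.encode(passages, normalize_embeddings=True)
print(q_emb @ p_emb.T)
```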

2. Results

2.1. English Text Embedding Benchmark

On the English portion of the MTEB benchmark, the best mE5 model surpasses the previous state-of-the-art multilingual model, Cohere multilingual-v3, by 0.4 points and outperforms a strong English-only model, BGE-large-en-v1.5, by 0.2 points.

2.2. Multilingual Retrieval

Multilingual retrieval

As shown in Table 4, mE5 models significantly outperform mDPR, which has been fine-tuned on the MIRACL training set, in both nDCG@10 and recall metrics.

2.3. Bitext Mining

Bitext mining results.
  • Bitext mining is a cross-lingual similarity search task that requires matching translated sentence pairs across languages with little lexical overlap (a minimal matching sketch is shown at the end of this section).

mE5 models exhibit competitive performance across a broad range of languages, both high-resource and low-resource.

  • The mE5-large-instruct model surpasses LaBSE, a model specifically designed for bitext mining, thanks to the expanded language coverage afforded by the synthetic data.
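
To make the task concrete, here is a minimal bitext-mining sketch that embeds sentences from two languages with a multilingual E5 checkpoint and pairs them by nearest-neighbour cosine similarity. The model identifier, the "query: " prefix convention, and the simple argmax matching rule are illustrative assumptions, not the paper's evaluation protocol.

```python
# Minimal bitext-mining sketch: pair sentences across two languages by
# nearest-neighbour cosine similarity. Model name, "query: " prefixes,
# and the argmax matching rule are illustrative assumptions.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("intfloat/multilingual-e5-large")

src = ["query: The cat is sitting on the mat.", "query: I like green tea."]
tgt = ["query: Me gusta el té verde.", "query: El gato está sentado en la alfombra."]

src_emb = model.encode(src, normalize_embeddings=True)
tgt_emb = model.encode(tgt, normalize_embeddings=True)

sim = src_emb @ tgt_emb.T          # cosine similarity matrix
matches = np.argmax(sim, axis=1)   # nearest target for each source sentence
for i, j in enumerate(matches):
    print(src[i], "<->", tgt[j])
```

In practice, mutual nearest-neighbour or margin-based filtering is usually added on top of raw cosine similarity to reduce false matches.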
