Brief Review — Multilingual E5 Text Embeddings: A Technical Report
Multilingual E5, Extends English E5 Model
Multilingual E5 Text Embeddings: A Technical Report
Multilingual E5, by Microsoft Corporation
2024 arXiv v1, Over 40 Citations (Sik-Ho Tsang @ Medium)
Sentence Embedding / Dense Text Retrieval
2017 [InferSent] 2019 [Sentence-BERT (SBERT)] 2020 [Retrieval-Augmented Generation (RAG)] [Dense Passage Retriever (DPR)] 2021 [Fusion-in-Decoder] [Augmented SBERT (AugSBERT)]
==== My Other Paper Readings Are Also Over Here ====
- This technical report presents the open-source multilingual E5 text embedding models, released in mid-2023.
- The training procedure adheres to the English E5 model recipe, involving contrastive pre-training on 1 billion multilingual text pairs, followed by fine-tuning on a combination of labeled datasets.
- A new instruction-tuned embedding model is introduced, whose performance is on par with state-of-the-art, English-only models of similar sizes.
Outline
- Multilingual E5
- Results
1. Multilingual E5
- The multilingual E5 text embedding models (mE5-{small / base / large}) extend the English E5 models in Wang et al. (2022). (A minimal usage sketch of the released checkpoints is given after this list.)
- An instruction-tuned embedding model, mE5-large-instruct, is also released, which utilizes the synthetic data from Wang et al. (2023).
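As a quick illustration of how these embedding models are used, here is a minimal sketch, assuming the publicly released Hugging Face checkpoint intfloat/multilingual-e5-large and the "query: "/"passage: " prefix convention inherited from the English E5 recipe; the mean-pooling and prefix details follow the public model card rather than the report itself.

```python
# Minimal usage sketch (not from the paper). Assumptions: the public
# intfloat/multilingual-e5-large checkpoint, "query: "/"passage: " prefixes,
# and masked mean pooling as in the English E5 recipe.
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

model_name = "intfloat/multilingual-e5-large"  # assumed released checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

def embed(texts):
    """Mean-pool the last hidden states over non-padding tokens, then L2-normalize."""
    batch = tokenizer(texts, padding=True, truncation=True, max_length=512,
                      return_tensors="pt")
    with torch.no_grad():
        out = model(**batch).last_hidden_state            # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1).float()  # (B, T, 1)
    emb = (out * mask).sum(dim=1) / mask.sum(dim=1)       # masked mean pooling
    return F.normalize(emb, p=2, dim=1)

queries = ["query: how do neural text embeddings work?"]
passages = ["passage: Text embeddings map sentences to dense vectors."]
scores = embed(queries) @ embed(passages).T  # cosine similarity
print(scores)
```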
1.1. Weakly-supervised Contrastive Pre-training
- In the first stage, the model is continually pre-trained on a diverse mixture of multilingual text pairs collected from a variety of sources described in the paper.
- The models are trained with a large batch size of 32k for a total of 30k steps, covering approximately 1 billion text pairs.
- The InfoNCE contrastive loss with only in-batch negatives is used, as sketched below.
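This is my own minimal sketch of InfoNCE with in-batch negatives; the embedding dimension, batch size, and temperature value are illustrative assumptions, not values taken from the report.

```python
# Minimal sketch of the InfoNCE loss with in-batch negatives used in the
# contrastive pre-training stage. Shapes and temperature are illustrative.
import torch
import torch.nn.functional as F

def info_nce(query_emb, passage_emb, temperature=0.01):
    """query_emb, passage_emb: (B, H) L2-normalized embeddings of paired texts.
    Each query's positive is its own row; all other in-batch passages serve
    as negatives."""
    logits = query_emb @ passage_emb.T / temperature   # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, targets)

# Example with random (already normalized) embeddings:
q = F.normalize(torch.randn(8, 1024), dim=1)
p = F.normalize(torch.randn(8, 1024), dim=1)
print(info_nce(q, p))
```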
1.2. Supervised Fine-tuning
- In the second stage, the models from the previous stage are fine-tuned on a combination of high-quality labeled datasets.
- Mined hard negatives and knowledge distillation from a cross-encoder teacher model are incorporated to further enhance the embedding quality (a loss sketch follows this list).
- For the mE5-large-instruct model, the data mixture from Wang et al. (2023), which includes an additional 500k synthetic examples generated by GPT-3.5/GPT-4, is used.
- This new mixture encompasses 150k unique instructions and covers 93 languages. The instruction templates from Wang et al. (2023) are reused for both the training and evaluation of this instruction-tuned model.
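The following is a minimal sketch, under my own assumptions, of what such a second-stage objective can look like: InfoNCE over the positive, mined hard negatives, and in-batch negatives, plus a KL term distilling the soft scores of a cross-encoder teacher. The temperature tau, the mixing weight alpha, and the tensor shapes are illustrative, not taken from the report.

```python
# Sketch of a fine-tuning objective with mined hard negatives and
# cross-encoder knowledge distillation (illustrative, not the paper's exact loss).
import torch
import torch.nn.functional as F

def fine_tune_loss(q, pos, hard_negs, teacher_scores, tau=0.02, alpha=1.0):
    """q: (B, H), pos: (B, H), hard_negs: (B, K, H) L2-normalized embeddings.
    teacher_scores: (B, 1 + K) cross-encoder scores for [positive, hard negatives]."""
    B = q.size(0)
    # Candidate scores per query: its positive, its K mined hard negatives,
    # and the other in-batch positives as extra negatives.
    pos_sim = (q * pos).sum(-1, keepdim=True)              # (B, 1)
    hard_sim = torch.einsum("bh,bkh->bk", q, hard_negs)    # (B, K)
    in_batch = q @ pos.T                                   # (B, B)
    diag = torch.eye(B, dtype=torch.bool, device=q.device)
    in_batch = in_batch.masked_fill(diag, float("-inf"))   # drop self-similarity
    logits = torch.cat([pos_sim, hard_sim, in_batch], dim=1) / tau

    targets = torch.zeros(B, dtype=torch.long, device=q.device)  # positive at index 0
    contrastive = F.cross_entropy(logits, targets)

    # Distillation: match the student distribution over [positive, hard negatives]
    # to the teacher's softened score distribution.
    student_logp = F.log_softmax(logits[:, : 1 + hard_negs.size(1)], dim=1)
    teacher_p = F.softmax(teacher_scores, dim=1)
    distill = F.kl_div(student_logp, teacher_p, reduction="batchmean")
    return contrastive + alpha * distill

# Example shapes: batch of 4 queries, 3 hard negatives each, hidden size 1024.
q = F.normalize(torch.randn(4, 1024), dim=1)
pos = F.normalize(torch.randn(4, 1024), dim=1)
negs = F.normalize(torch.randn(4, 3, 1024), dim=2)
teacher = torch.randn(4, 4)  # cross-encoder scores for [positive, 3 hard negatives]
print(fine_tune_loss(q, pos, negs, teacher))
```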
2. Results
2.1. English Text Embedding Benchmark
On the English portion of the MTEB benchmark, the best mE5 model surpasses the previous state-of-the-art multilingual model, Cohere-multilingual-v3, by 0.4 points and outperforms a strong English-only model, BGE-large-en-v1.5, by 0.2 points.
2.2. Multilingual Retrieval
As shown in Table 4, mE5 models significantly outperform mDPR, which has been fine-tuned on the MIRACL training set, in both nDCG@10 and recall metrics.
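For reference, nDCG@10 rewards placing relevant documents near the top of the ranking; below is a minimal sketch assuming binary relevance labels (a simplification on my part).

```python
# Minimal sketch of nDCG@10, the main retrieval metric reported on MIRACL,
# assuming binary relevance labels for simplicity.
import math

def ndcg_at_10(ranked_rel, total_relevant):
    """ranked_rel: list of 0/1 relevance labels for the top-ranked documents.
    total_relevant: number of relevant documents for the query."""
    dcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ranked_rel[:10]))
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(total_relevant, 10)))
    return dcg / ideal if ideal > 0 else 0.0

print(ndcg_at_10([1, 0, 1, 0, 0, 0, 0, 1, 0, 0], total_relevant=3))
```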
2.3. Bitext Mining
- Bitext Mining is a cross-lingual similarity search task that requires matching two sentences with little lexical overlap (a minimal matching sketch follows this list).
mE5 models exhibit competitive performance across a broad range of languages, both high-resource and low-resource.
- The mE5-large-instruct model surpasses the performance of LaBSE, a model specifically designed for bitext mining, due to the expanded language coverage afforded by the synthetic data.
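Below is a minimal sketch of bitext mining as cross-lingual nearest-neighbor search over sentence embeddings (for example, those produced by an embed() helper like the one sketched in Section 1); the margin-based scoring typically used in practice is omitted here for brevity.

```python
# Sketch of bitext mining as nearest-neighbor search over L2-normalized
# sentence embeddings from two languages (margin-based scoring omitted).
import torch
import torch.nn.functional as F

def mine_bitext(src_emb, tgt_emb):
    """src_emb: (N, H), tgt_emb: (M, H) L2-normalized sentence embeddings.
    Returns, for each source sentence, the index and cosine similarity of its
    nearest target-language candidate."""
    sim = src_emb @ tgt_emb.T          # (N, M) cosine similarities
    scores, indices = sim.max(dim=1)
    return indices, scores

# Example with random (already normalized) embeddings standing in for two languages:
src = F.normalize(torch.randn(5, 1024), dim=1)
tgt = F.normalize(torch.randn(7, 1024), dim=1)
print(mine_bitext(src, tgt))
```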