Brief Review — Improving Text Embeddings with Large Language Models
E5 Mistral 7B, Outperforms E5 and Multilingual E5
Improving Text Embeddings with Large Language Models
E5 Mistral 7B, by Microsoft Corporation
2024 ACL, Over 170 Citations (Sik-Ho Tsang @ Medium)
Sentence Embedding / Dense Text Retrieval
2017 … 2022 [E5] 2024 [Multilingual E5]
==== My Other Paper Readings Are Also Over Here ====
- Proprietary LLMs are leveraged to generate diverse synthetic data for hundreds of thousands of text embedding tasks across 93 languages.
- Then, open-source decoder-only LLMs are fine-tuned on the synthetic data using standard contrastive loss. Thus, the proposed method does not require building complex training pipelines or relying on manually collected datasets.
Outline
- Synthetic Data Generation
- E5 Mistral 7B
- Results
1. Synthetic Data Generation
1.1. Asymmetric Tasks
- This category comprises tasks where the query and document are semantically related but are not paraphrases of each other.
- Depending on the length of the query and document, asymmetric tasks are further divided into four subgroups: short-long match, long-short match, short-short match, and long-long match. (For instance, short-long match tasks involve a short query and a long document.)
- For each subgroup, a two-step prompt template is designed: the first prompt asks LLMs to brainstorm a list of tasks, and the second generates a concrete example conditioned on a task definition (a minimal sketch of this two-step flow is given below).
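Below is a minimal Python sketch of this two-step flow for asymmetric tasks. The `call_llm` helper is hypothetical (standing in for an Azure OpenAI chat-completion call), and the prompt wording only paraphrases the paper's templates rather than quoting them.

```python
# Minimal sketch of the two-step generation flow for asymmetric tasks.
# `call_llm` is a hypothetical helper standing in for a GPT-4 / GPT-3.5-Turbo call;
# the prompt wording paraphrases the paper's templates and is not verbatim.
import json
import random

def call_llm(prompt: str) -> str:
    """Placeholder for an Azure OpenAI chat-completion call."""
    raise NotImplementedError

def generate_asymmetric_example(subgroup: str = "short-long match") -> dict:
    # Step 1: brainstorm a list of candidate retrieval task definitions.
    brainstorm_prompt = (
        f"Brainstorm a list of potentially useful '{subgroup}' text retrieval tasks. "
        "Return a JSON list of one-sentence task definitions."
    )
    task_definitions = json.loads(call_llm(brainstorm_prompt))

    # Step 2: generate one concrete example conditioned on a sampled task definition.
    task_definition = random.choice(task_definitions)
    example_prompt = (
        f"You have been assigned a retrieval task: {task_definition}\n"
        "Generate one JSON object with the fields 'user_query', "
        "'positive_document' and 'hard_negative_document'."
    )
    return json.loads(call_llm(example_prompt))
```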
1.2. Symmetric Tasks
- Symmetric tasks involve queries and documents that have similar semantic meanings but different surface forms.
- Two application scenarios are involved: monolingual semantic textual similarity (STS) and bitext retrieval.
- Two distinct prompt templates are designed for each scenario, tailored to their specific objectives. Since the task definition is straightforward, the brainstorming step is omitted for symmetric tasks.
- As shown above, the value of “{query_length}” is sampled from the set “{less than 5 words, 5–10 words, at least 10 words}”.
- To generate multilingual data, the value of “{language}” is sampled from the language list of XLM-R (a small sampling sketch is given below).
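A hedged sketch of how these placeholders could be sampled when instantiating the prompt templates; the language list below is only a tiny illustrative subset of XLM-R's languages, not the full list.

```python
# Hedged sketch of placeholder sampling for the prompt templates.
# The query-length options follow the text above; the language list is a tiny
# illustrative subset of XLM-R's languages, not the full list used in the paper.
import random

QUERY_LENGTHS = ["less than 5 words", "5-10 words", "at least 10 words"]
XLM_R_LANGUAGES = ["English", "French", "Hindi", "Swahili", "Urdu", "Vietnamese"]

def sample_placeholders() -> dict:
    """Fill the '{query_length}' and '{language}' slots of a prompt template."""
    return {
        "query_length": random.choice(QUERY_LENGTHS),
        "language": random.choice(XLM_R_LANGUAGES),
    }

# Example: sample_placeholders() -> {'query_length': '5-10 words', 'language': 'Hindi'}
```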
1.3. Statistics of the Synthetic Data
- 500k examples with 150k unique instructions are generated using Azure OpenAI; 25% are generated by GPT-3.5-Turbo and the rest by GPT-4.
- The total token consumption is about 180M.
- The predominant language is English, with coverage extending to a total of 93 languages. For the bottom 75 low-resource languages, there are about 1k examples per language on average.
- For the training data, both the generated synthetic data and a collection of 13 public datasets are utilized, yielding approximately 1.8M examples after sampling.
2. E5 Mistral 7B
2.1. Training Method
- Given a relevant query-document pair (q+, d+), the following instruction template is applied to the original query q+ to generate a new one q+_inst:

q+_inst = Instruct: {task_definition} \n Query: {q+}
- where “{task_definition}” is a placeholder for a one-sentence description of the embedding task.
- Given a pretrained LLM, an [EOS] token is appended to the end of the query and document, and they are fed into the LLM to obtain the query and document embeddings by taking the last layer [EOS] vector.
- The standard InfoNCE loss L is adopted over the in-batch negatives and hard negatives:

L = −log [ φ(q+_inst, d+) / ( φ(q+_inst, d+) + Σ_{n_i ∈ N} φ(q+_inst, n_i) ) ]

- where N denotes the set of all negatives, and the temperature-scaled cosine similarity function is adopted:

φ(q, d) = exp( cos(h_q, h_d) / τ )

- Here, h_q and h_d are the last-layer [EOS] embeddings of the query and document, and τ is the temperature. (A minimal PyTorch sketch of the pooling and loss is given after this list.)
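The following is a minimal PyTorch sketch, not the authors' released code, of last-token [EOS] pooling and the InfoNCE loss over in-batch plus hard negatives; the temperature value and the way negatives are batched are assumptions for illustration.

```python
# Minimal sketch (not the authors' released code): [EOS] last-token pooling and
# InfoNCE over in-batch negatives plus hard negatives. The temperature value and
# the batching of negatives are assumptions for illustration.
import torch
import torch.nn.functional as F

def last_token_pool(last_hidden_state: torch.Tensor,
                    attention_mask: torch.Tensor) -> torch.Tensor:
    """Return the last-layer hidden state of the final non-padding ([EOS]) token.
    Assumes right-padded inputs."""
    last_idx = attention_mask.sum(dim=1) - 1                    # (batch,)
    batch_idx = torch.arange(last_hidden_state.size(0), device=last_hidden_state.device)
    return last_hidden_state[batch_idx, last_idx]               # (batch, hidden)

def info_nce_loss(q_emb: torch.Tensor,     # (B, H) instructed-query embeddings
                  pos_emb: torch.Tensor,   # (B, H) positive document embeddings
                  neg_emb: torch.Tensor,   # (M, H) hard-negative document embeddings
                  tau: float = 0.02) -> torch.Tensor:           # tau value is an assumption
    """InfoNCE with temperature-scaled cosine similarity: phi(q, d) = exp(cos(h_q, h_d) / tau)."""
    q = F.normalize(q_emb, dim=-1)
    docs = F.normalize(torch.cat([pos_emb, neg_emb], dim=0), dim=-1)  # (B + M, H)
    logits = q @ docs.t() / tau            # cosine similarities scaled by 1 / tau
    labels = torch.arange(q.size(0), device=q.device)  # i-th query matches i-th positive
    return F.cross_entropy(logits, labels)
```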
2.2. Model Finetuning
- The pretrained Mistral 7B checkpoint is fine-tuned for 1 epoch following the training recipe from RankLLaMA; LoRA with rank 16 is utilized (a configuration sketch is given below).
- Gradient checkpointing, mixed precision training, and DeepSpeed ZeRO-3 are also applied.
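A hedged sketch of such a LoRA setup using the Hugging Face peft library: only the rank (16) comes from the paper, while the alpha, dropout, and target modules are assumptions, and the mixed-precision / DeepSpeed ZeRO-3 pieces are left to the trainer configuration.

```python
# Hedged LoRA fine-tuning setup along the lines described above. Only rank 16 is
# stated in the paper; lora_alpha, lora_dropout and target_modules are assumptions.
# Mixed precision and DeepSpeed ZeRO-3 would be configured in the trainer, not here.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base_model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")
base_model.gradient_checkpointing_enable()      # gradient checkpointing, as in the paper

lora_config = LoraConfig(
    r=16,                                       # rank 16, as stated in the paper
    lora_alpha=32,                              # assumption
    lora_dropout=0.1,                           # assumption
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumption
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()              # only the LoRA adapters are trainable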
3. Results
3.1. MTEB Benchmark
The proposed model “E5_mistral-7b + full data” attains the highest average score on the MTEB benchmark, outperforming the previous state-of-the-art model by 2.4 points.
- In the “w/ synthetic data only” setting, no labeled data is used for training, and yet the performance remains quite competitive.
3.2. MTEB Leaderboard
The proposed model outperforms the current commercial models by a significant margin.
3.3. Multilingual Retrieval
On the MIRACL multilingual retrieval benchmark, the model surpasses mE5_large on high-resource languages, notably English. Nevertheless, for low-resource languages, the proposed model remains suboptimal compared to mE5_base.
3.4. Bitext Mining
Similar to the MIRACL retrieval, E5_mistral-7b excels in bitext mining for high-resource languages only.