Brief Review — Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation
Teacher-Student Approach, Multilingual Sentence BERT (SBERT)
Making Monolingual Sentence Embeddings Multilingual using Knowledge Distillation
Multilingual Sentence-BERT, by Technische Universität Darmstadt
2020 EMNLP, Over 920 Citations (Sik-Ho Tsang @ Medium)
Sentence Embedding / Dense Text Retrieval
2019 [Sentence-BERT (SBERT)] 2020 [Retrieval-Augmented Generation (RAG)] [Dense Passage Retriever (DPR)] 2021 [Fusion-in-Decoder] [Augmented SBERT (AugSBERT)]
==== My Other Paper Readings Are Also Over Here ====
- The original (monolingual) model is used to generate sentence embeddings for the source language and then a new system is trained on translated sentences to mimic the original model.
- This approach has several advantages: It is easy to extend existing models with relatively few samples to new languages, it is easier to ensure desired properties for the vector space, and the hardware requirements for training are lower.
- (As the citation uses the name reimers-2020-multilingual-sentence-bert, I also name it Multilingual Sentence-BERT.)
Outline
- Multilingual Sentence-BERT
- Results
1. Multilingual Sentence-BERT
1.1. Teacher and Student Approach
- First, a teacher model M is required that maps sentences in one or more source languages s to a dense vector space.
- Further, parallel (translated) sentences ((s1, t1), …, (sn, tn)) are needed with si a sentence in one of the source languages and ti a sentence in one of the target languages.
After that, a student model M̂ is trained such that M̂(si) ≈ M(si) and M̂(ti) ≈ M(si).
For a given mini-batch B, the mean-squared loss is minimized:
1/|B| Σ_{j∈B} [(M(sj) − M̂(sj))² + (M(sj) − M̂(tj))²]
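A minimal sketch of one such training step, assuming `teacher` and `student` are torch.nn.Module sentence encoders that map a list of strings to a (batch, dim) tensor of embeddings and that the data loader yields parallel (source, target) batches; all names are illustrative, not the authors' code:

```python
# Minimal sketch of the multilingual knowledge-distillation objective.
# `teacher` (M) is frozen; `student` (M_hat) is updated with MSE against
# the teacher's source-language embeddings. Names are illustrative.
import torch
import torch.nn.functional as F

def train_step(teacher, student, optimizer, src_sentences, tgt_sentences):
    teacher.eval()
    with torch.no_grad():                  # teacher M stays fixed
        t_src = teacher(src_sentences)     # M(s_j)
    s_src = student(src_sentences)         # M_hat(s_j)
    s_tgt = student(tgt_sentences)         # M_hat(t_j)
    # Pull both the source sentence and its translation towards the
    # teacher's source embedding, aligning the vector spaces.
    loss = F.mse_loss(s_src, t_src) + F.mse_loss(s_tgt, t_src)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```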
1.2. Model Architecture
- The student model can have the same or a different architecture from the teacher model. The student model M̂ learns the representation of the teacher model M.
- In this paper, an English SBERT model is mainly used as the teacher model M, and XLM-RoBERTa (XLM-R) is used as the student model M̂.
- The English BERT models have a wordpiece vocabulary size of 30k mainly consisting of English tokens.
- In contrast, XLM-R uses SentencePiece, which avoids language specific pre-processing. Further, it uses a vocabulary with 250k entries from 100 different languages. This makes XLM-R much more suitable for the initialization of the multilingual student model.
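As a rough illustration of such a student encoder, the sketch below assembles XLM-R with mean pooling using Hugging Face transformers; the class name and details are assumptions for illustration, not the authors' implementation.

```python
# Sketch of a student encoder built on XLM-R with mean pooling over
# non-padding tokens (the SBERT-style layout); details are illustrative.
import torch
from transformers import AutoTokenizer, AutoModel

class XLMRSentenceEncoder(torch.nn.Module):
    def __init__(self, name="xlm-roberta-base"):
        super().__init__()
        self.tokenizer = AutoTokenizer.from_pretrained(name)
        self.encoder = AutoModel.from_pretrained(name)

    def forward(self, sentences):
        batch = self.tokenizer(sentences, padding=True, truncation=True,
                               return_tensors="pt")
        out = self.encoder(**batch).last_hidden_state        # (B, T, H)
        mask = batch["attention_mask"].unsqueeze(-1).float() # (B, T, 1)
        # Mean pooling over non-padding tokens -> one vector per sentence.
        return (out * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)
```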
1.3. Training Data
- The OPUS website provides parallel data for hundreds of language pairs. There are also other datasets mentioned: GlobalVoices, TED2020, NewsCommentary, WikiMatrix, Tatoeba, Europarl, JW300, OpenSubtitles2018, UNPC.
- Authors also experiment with bilingual dictionaries: MUSE, Wikititles.
2. Results
- 3 tasks are evaluated: Multi- and cross-lingual semantic textual similarity (STS), bitext retrieval, and cross-lingual similarity search.
- STS assigns a score for a pair of sentences, while bitext retrieval identifies parallel (translated) sentences from two large monolingual corpora.
2.1. Multilingual Semantic Textual Similarity
- For the generated sentence embeddings, cosine similarity is computed and, as recommended in (Reimers et al., 2016), Spearman’s rank correlation ρ between the computed score and the gold score is reported (see the sketch at the end of this subsection).
- mBERT (multilingual BERT) / XLM-R without fine-tuning yields rather poor performance.
While in the monolingual setup (Table 1) the performance is quite competitive, there is a significant drop for the cross-lingual setup (Table 2). This indicates that the vector spaces are not well aligned across languages.
Using the proposed multilingual knowledge distillation approach, state-of-the-art performance is achieved in both the monolingual and the cross-lingual setup, significantly outperforming other state-of-the-art models (LASER, mUSE, LaBSE).
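A possible sketch of this STS evaluation protocol, assuming `encode` is any sentence encoder that returns a 2-D NumPy array of embeddings (the name is a placeholder):

```python
# Sketch of the STS evaluation: cosine similarity of the two sentence
# embeddings is compared with the gold scores via Spearman's rho.
import numpy as np
from scipy.stats import spearmanr

def sts_spearman(encode, sentences1, sentences2, gold_scores):
    emb1, emb2 = encode(sentences1), encode(sentences2)
    cos = np.sum(emb1 * emb2, axis=1) / (
        np.linalg.norm(emb1, axis=1) * np.linalg.norm(emb2, axis=1))
    rho, _ = spearmanr(cos, gold_scores)
    return rho
```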
2.2. BUCC: Bitext Retrieval
- Using mean pooling directly on mBERT / XLM-R produces low scores. While training on English NLI and STS data improves the performance for XLM-R (XLMR-nli-stsb), it reduces the performance for mBERT.
- Using the proposed multilingual knowledge distillation method, the performance is significantly improved compared to the mBERT / XLM-R model trained only on English data.
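To illustrate the bitext-retrieval task itself, here is a simplified mining sketch based on plain cosine similarity with a threshold; note that the paper's BUCC evaluation uses the stronger margin-based scoring of Artetxe and Schwenk, so this is only an illustration of the task, not the evaluated method.

```python
# Simplified bitext mining: for each source sentence, retrieve the nearest
# target sentence by cosine similarity and keep pairs above a threshold.
# (Illustration only; the paper uses margin-based scoring for BUCC.)
import numpy as np

def mine_bitext(src_emb, tgt_emb, threshold=0.9):
    src = src_emb / np.linalg.norm(src_emb, axis=1, keepdims=True)
    tgt = tgt_emb / np.linalg.norm(tgt_emb, axis=1, keepdims=True)
    sims = src @ tgt.T                     # (n_src, n_tgt) cosine scores
    best = sims.argmax(axis=1)
    return [(i, j, sims[i, j]) for i, j in enumerate(best)
            if sims[i, j] >= threshold]
```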
2.3. Tatoeba: Similarity Search
- Tatoeba test setup from LASER (Artetxe and Schwenk, 2019b) is used. The dataset consists of up to 1,000 English-aligned sentence pairs for various languages.
- Evaluation is done by finding, for each sentence, the most similar sentence in the other language using cosine similarity. Accuracy is computed in both directions (see the sketch at the end of this section).
- 4 low resource languages with rather small parallel datasets are evaluated: Georgian (KA, 296k parallel sentence pairs), Swahili (SW, 173k), Tagalog (TL, 36k), and Tatar (TT, 119k).
A significant accuracy improvement is observed compared to LASER, indicating much better aligned vector spaces between English and these languages.
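A small sketch of this Tatoeba-style accuracy computation, assuming the two embedding matrices are row-aligned (row i of one is the translation of row i of the other); function and variable names are illustrative:

```python
# Nearest-neighbour retrieval accuracy: the fraction of sentences whose
# most similar sentence (by cosine) in the other language is the gold pair.
import numpy as np

def retrieval_accuracy(emb_a, emb_b):
    a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    sims = a @ b.T
    return float((sims.argmax(axis=1) == np.arange(len(a))).mean())

# Accuracy is computed in both directions, e.g.:
# acc_en_to_xx = retrieval_accuracy(en_emb, xx_emb)
# acc_xx_to_en = retrieval_accuracy(xx_emb, en_emb)
```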