Brief Review — SentencePiece: A Simple and Language-Independent Subword Tokenizer and Detokenizer for Neural Text Processing
SentencePiece, Commonly Used in Many Large Language Models (LLMs)
SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing,
SentencePiece, by Google, Inc.,
2018 NAACL, Over 2200 Citations (Sik-Ho Tsang @ Medium)
- SentencePiece, a language-independent subword tokenizer and detokenizer, is proposed.
- SentencePiece comprises four main components: Normalizer, Trainer, Encoder, and Decoder.
- Normalizer is a module to normalize semantically-equivalent Unicode characters into canonical forms.
- Encoder internally executes Normalizer to normalize the input text and tokenizes it into a subword sequence with the subword model trained by Trainer.
- The roles of Encoder and Decoder correspond to preprocessing (tokenization) and postprocessing (detokenization) respectively.
- Encoder and Decoder manage the vocabulary-to-id mapping and can directly convert text into an id sequence and vice versa.
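The Encoder/Decoder roundtrip and the vocabulary-to-id mapping can be sketched as follows. This is a toy illustration only, not the actual SentencePiece implementation; the vocabulary and segmentation below are hypothetical:

```python
# Toy sketch of the Encoder/Decoder roundtrip described above.
# This is NOT the real SentencePiece model; the tiny vocabulary below is a
# hypothetical example of the vocabulary-to-id mapping the library manages.

# "▁" (U+2581) marks a preceding whitespace, as in SentencePiece's
# lossless tokenization scheme.
vocab = ["▁Hello", "▁world", ".", "▁Wor", "ld"]
piece_to_id = {p: i for i, p in enumerate(vocab)}
id_to_piece = {i: p for p, i in piece_to_id.items()}

def encode(pieces):
    """Map a subword sequence to an id sequence (Encoder side)."""
    return [piece_to_id[p] for p in pieces]

def decode(ids):
    """Map ids back to text (Decoder side): concatenate the pieces and
    turn the whitespace meta symbol back into real spaces."""
    return "".join(id_to_piece[i] for i in ids).replace("▁", " ").lstrip()

pieces = ["▁Hello", "▁world", "."]
ids = encode(pieces)
assert ids == [0, 1, 2]
assert decode(ids) == "Hello world."
```

In the real library, a trained subword model plays the role of this hand-written vocabulary, and the Encoder first runs the Normalizer before segmenting the text.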
1.2. Problems of Existing Approaches
- Yet, many existing approaches need handcrafted or internal rules to restore the id sequence back to text, so the conversion is not reversible. They are also language-dependent and run offline as a separate preprocessing step, which makes them difficult to use.
- This procedure also makes it hard to employ sub-sentence level data augmentation and noise injection.
- Lossless tokenization:
- The basic idea of lossless tokenization is to treat the input text just as a sequence of Unicode characters. Even whitespace is handled as a normal symbol, escaped with the meta symbol "▁" (U+2581).
- It is also self-contained with Python and C++ versions provided.
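The lossless-tokenization idea above can be sketched in a few lines. This is a toy illustration with a hypothetical hand-picked segmentation, not SentencePiece's trained model: because whitespace is escaped into a normal symbol, detokenization needs no language-specific rules, just concatenation and un-escaping:

```python
# Toy sketch of lossless tokenization: treat the text as a plain sequence
# of characters and escape whitespace with the meta symbol "▁" (U+2581),
# so the original text is exactly recoverable from the pieces.

def detokenize(pieces):
    """Detokenization is just concatenation plus un-escaping."""
    return "".join(pieces).replace("▁", " ")

text = "Hello World."
# A hypothetical segmentation into subword pieces:
pieces = ["Hello", "▁Wor", "ld", "."]
assert detokenize(pieces) == text  # roundtrip is lossless

# Naive whitespace tokenization, by contrast, discards the information
# needed to restore the exact original spacing:
naive = text.split()               # ["Hello", "World."]
assert " ".join(naive) == text     # happens to work here, but inputs like
                                   # "Hello  World." could not be restored
```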
- Subword segmentation with SentencePiece consistently improves the BLEU scores compared to the word model.
- The segmentation speed of SentencePiece is around 21k and 74k sentences/sec. in English and Japanese, respectively, which is fast enough to be executed on-the-fly.