Brief Review — SentencePiece: A Simple and Language independent subword tokenizer and detokenizer for Neural Text Processing

SentencePiece, Commonly Used in Many Large Language Models (LLM)

2 min read · Apr 15, 2023

SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing,
SentencePiece, by Google, Inc.,
2018 EMNLP, Over 2200 Citations (Sik-Ho Tsang @ Medium)
Machine Translation

SentencePiece, a language-independent subword tokenizer and detokenizer, is proposed.

Outline

1. SentencePiece
2. Results

1. SentencePiece

1.1. Overall

SentencePiece comprises four main components: Normalizer, Trainer, Encoder, and Decoder.
Normalizer is a module that normalizes semantically equivalent Unicode characters into canonical forms.
Trainer trains the subword segmentation model from the normalized corpus.
Encoder internally runs Normalizer on the input text and tokenizes it into a subword sequence using the subword model trained by Trainer.
Decoder converts the subword sequence back into the normalized text.
The roles of Encoder and Decoder correspond to preprocessing (tokenization) and postprocessing (detokenization) respectively.
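To make the Normalizer step concrete, here is a minimal sketch using Python's standard `unicodedata` module. The paper names Unicode NFKC as the default normalization; the `normalize` helper below is a hypothetical stand-in for illustration, not SentencePiece's own normalizer.

```python
import unicodedata


def normalize(text: str) -> str:
    """Map semantically equivalent Unicode characters to one canonical form (NFKC)."""
    return unicodedata.normalize("NFKC", text)


# Full-width Latin letters (U+FF28 ...) collapse to their ASCII counterparts.
fullwidth_hello = "\uff28\uff45\uff4c\uff4c\uff4f"
print(normalize(fullwidth_hello))  # Hello

# The single-character "fi" ligature (U+FB01) expands to two plain letters.
print(normalize("\ufb01ne"))  # fine
```

After this step, different spellings of the same text share one representation, so the subword model sees them as the same input.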

SentencePiece manages the vocabulary-to-id mapping itself, so it can directly convert text into an id sequence and vice versa.
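The mapping between subword pieces and ids can be pictured as a simple two-way lookup. The sketch below is only an illustration: the vocabulary `PIECES`, the choice of id 0 for unknown pieces, and the helper names are assumptions, not the actual SentencePiece model format or API.

```python
# Hypothetical toy vocabulary; a real one is learned by the Trainer.
# "\u2581" is the visible symbol that stands for a whitespace character.
PIECES = ["<unk>", "Hello", "\u2581wor", "ld", "."]
PIECE_TO_ID = {piece: idx for idx, piece in enumerate(PIECES)}
UNK_ID = 0  # assumption: id 0 is used for pieces outside the vocabulary


def pieces_to_ids(pieces):
    """Look up each piece; unknown pieces fall back to UNK_ID."""
    return [PIECE_TO_ID.get(piece, UNK_ID) for piece in pieces]


def ids_to_pieces(ids):
    """Inverse lookup from ids back to pieces."""
    return [PIECES[idx] for idx in ids]


ids = pieces_to_ids(["Hello", "\u2581wor", "ld", "."])
print(ids)                 # [1, 2, 3, 4]
print(ids_to_pieces(ids))  # the original four pieces
```

Because one component owns both directions of the lookup, the downstream model only ever sees id sequences.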

1.2. Problems of Existing Approaches

Yet, many existing approaches need handcrafted, tool-internal rules to restore text from ids, so the conversion is not reversible. They are also language-dependent and run as offline preprocessing, which makes them difficult to use.
Offline preprocessing also makes it hard to employ sub-sentence level data augmentation and noise injection.

1.3. SentencePiece

Lossless tokenization:

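The basic idea of lossless tokenization is to treat the input text just as a sequence of Unicode characters. Even whitespace is handled as a normal symbol: it is escaped with the meta symbol ▁ (U+2581), so the normalized text can be recovered by concatenating the pieces and turning ▁ back into a space.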
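The round trip can be sketched in a few lines of Python. This is a toy under stated assumptions: the vocabulary `VOCAB` is hypothetical, and greedy longest-match segmentation is used only as a stand-in for the trained subword model. The point is the whitespace handling, which makes decoding a plain concatenation.

```python
from typing import List

WS = "\u2581"  # the meta symbol that escapes whitespace

# Hypothetical toy vocabulary; a real one would be learned by the Trainer.
VOCAB = {"Hello", WS + "wor", "ld", "."}


def encode(text: str) -> List[str]:
    """Escape spaces, then split greedily into the longest known pieces."""
    escaped = text.replace(" ", WS)
    pieces = []
    i = 0
    while i < len(escaped):
        # Try the longest candidate first; fall back to a single character.
        for j in range(len(escaped), i, -1):
            if escaped[i:j] in VOCAB or j == i + 1:
                pieces.append(escaped[i:j])
                i = j
                break
    return pieces


def decode(pieces: List[str]) -> str:
    """Concatenate the pieces and turn the meta symbol back into spaces."""
    return "".join(pieces).replace(WS, " ")


pieces = encode("Hello world.")
print(pieces)          # ['Hello', '▁wor', 'ld', '.']
print(decode(pieces))  # Hello world.
```

No language-specific detokenization rule is needed: as long as the input does not already contain the meta symbol, `decode(encode(text))` returns the original text, including repeated spaces.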

SentencePiece is also self-contained: the model file carries the vocabulary together with the normalization rules, and both Python and C++ versions are provided.

2. Results

Subword segmentation with SentencePiece consistently improves the BLEU scores compared to the word model.

The segmentation speed of SentencePiece is around 21k and 74k sentences/sec. in English and Japanese respectively, which is fast enough to be executed on-the-fly.

Written by Sik-Ho Tsang
