Review — Neural Machine Translation of Rare Words with Subword Units

Byte Pair Encoding (BPE), Using Subwords for Rare Words

Sik-Ho Tsang
3 min read · Jul 16, 2022
Subword Units (Figure from freecodecamp)

Neural Machine Translation of Rare Words with Subword Units
BPE, by University of Edinburgh
2016 ACL, Over 5400 Citations (Sik-Ho Tsang @ Medium)
Natural Language Processing, NLP, Neural Machine Translation, NMT

  • A simple and effective approach is introduced where rare and unknown words are encoded as sequences of subword units.


  1. Subword Translation
  2. Experimental Results

1. Subword Translation

1.1. Intuition

Rare and unknown words can be encoded as sequences of subword units.

  • This is based on the intuition that various word classes are translatable via smaller units than words, for instance names, compounds, as well as cognates and loanwords.
  • Word categories whose translation is potentially transparent include named entities, cognates and loanwords, and morphologically complex words.

1.2. Byte Pair Encoding (BPE)

  • Byte Pair Encoding (BPE) (Gage, 1994) is a simple data compression technique that iteratively replaces the most frequent pair of bytes in a sequence with a single, unused byte.
  • The authors adapt this algorithm for word segmentation: instead of merging frequent pairs of bytes, they merge characters or character sequences.
BPE merge operations learned from dictionary {‘low’, ‘lowest’, ‘newer’, ‘wider’}.
  • The main difference from other compression algorithms, such as Huffman encoding, is that BPE symbol sequences are still interpretable as subword units, and that the network can generalize to translate and produce new words (unseen at training time) on the basis of these subword units.
  • In the above example, the Out-Of-Vocabulary (OOV) ‘lower’ would be segmented into ‘low er·’.
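
To make the merge procedure concrete, here is a minimal Python sketch of learning BPE merges and applying them to an unseen word. It is not the authors' released code: the word frequencies, the end-of-word marker `</w>` (standing in for "·"), the function names, and the tie-breaking rule (first pair encountered wins) are all assumptions made for illustration, so the order of the learned merges may differ from the one shown in the paper's figure.

```python
from collections import Counter


def get_pair_counts(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs


def merge_pair(pair, symbols):
    """Replace each occurrence of `pair` in a symbol tuple with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_bpe(word_freqs, num_merges, eow="</w>"):
    """Learn up to `num_merges` merge operations from a word-frequency dictionary."""
    # Each word starts as a sequence of characters plus an end-of-word marker.
    vocab = {tuple(word) + (eow,): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(vocab)
        if not pairs:
            break
        # Assumption: ties go to the pair that was counted first.
        best = max(pairs, key=pairs.get)
        merges.append(best)
        vocab = {merge_pair(best, s): f for s, f in vocab.items()}
    return merges


def segment(word, merges, eow="</w>"):
    """Segment a (possibly unseen) word by replaying the merges in learned order."""
    symbols = tuple(word) + (eow,)
    for pair in merges:
        symbols = merge_pair(pair, symbols)
    return list(symbols)


# Illustrative frequencies (assumed) for the dictionary used in the example.
word_freqs = {"low": 5, "lowest": 2, "newer": 6, "wider": 3}
merges = learn_bpe(word_freqs, num_merges=4)
print(merges)
print(segment("lower", merges))  # ['low', 'er</w>']
```

With these assumed frequencies and four merges, the unseen word 'lower' comes out as 'low' followed by 'er</w>', which matches the 'low er·' segmentation described above. The only hyperparameter is the number of merges, which directly controls the size of the subword vocabulary.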

2. Experimental Results

English→German translation performance (BLEU, CHRF3 and unigram F1) on newstest2015
  • Attention Decoder/RNNSearch is used as the NMT network, trained with different kinds of vocabularies.
  • The proposed BPE-J90k obtains better or competitive BLEU and CHRF3.

The major claim is that translation of rare and unknown words is poor in word-level NMT models, and that subword models improve the translation of these word types.

English→German translation example, “|” marks subword boundaries
  • The examples above show translations produced with subword units.

Byte Pair Encoding (BPE) has since been adopted in much subsequent research.


