Review — MUTE: Multi-Unit Transformers for Neural Machine Translation

MUTE, Transformer With Parallel Paths

Sik-Ho Tsang
4 min read · Feb 11


Multi-Unit Transformers for Neural Machine Translation, by Pattern Recognition Center, WeChat AI, Tencent Inc,
2020 EMNLP, Over 10 Citations (Sik-Ho Tsang @ Medium)
Natural Language Processing, NLP, Neural Machine Translation, NMT, Transformer

  • Multi-Unit TransformErs (MUTE) is proposed, which aims to promote the expressiveness of the Transformer by introducing diverse and complementary units.
  • Specifically, several parallel units are used to introduce diversity. A biased module and a sequential dependency are designed to guide and encourage complementarity among the different units.


  1. Multi-Unit TransformErs (MUTE)
  2. Results

1. Multi-Unit TransformErs (MUTE)

Layer architecture of the Multi-Unit Transformer. White arrows indicate the model changes from Multi-Unit, to Biased Multi-Unit, to Sequentially Biased Multi-Unit.
  • The original Transformer corresponds to (a) with only a single path, i.e., one unit.

1.1. MUTE

  • Given the input X^k of the k-th layer, it is fed into I identical units {F_1, …, F_i, …, F_I}, each with its own learnable parameters.
  • The outputs of all I units are combined by a weighted sum, where i indexes the units:
  • (This is similar to how ResNeXt extends ResNet in computer vision.)
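The weighted combination can be illustrated with a small sketch. This is only a toy under stated assumptions, not the paper's implementation: `make_unit` is a hypothetical stand-in that uses a random linear map where a real unit would be a full Transformer sub-layer, the mixing weights `alphas` are assumed uniform rather than learned, and the residual connection and layer normalization are omitted.

```python
import random

def make_unit(dim, rng):
    """Hypothetical stand-in for one unit F_i: a random linear map."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(dim)]

    def unit(x):
        # x is a list of token vectors; the same map is applied to every token.
        return [[sum(w * v for w, v in zip(row, tok)) for row in weights]
                for tok in x]

    return unit

def multi_unit_layer(x, units, alphas):
    """Feed the same input x to every unit and take a weighted sum of outputs."""
    assert len(units) == len(alphas)
    outputs = [unit(x) for unit in units]
    seq_len, dim = len(x), len(x[0])
    return [
        [sum(a * out[t][d] for a, out in zip(alphas, outputs)) for d in range(dim)]
        for t in range(seq_len)
    ]

rng = random.Random(0)
dim, num_units = 4, 3
units = [make_unit(dim, rng) for _ in range(num_units)]
alphas = [1.0 / num_units] * num_units  # assumed uniform mixing weights
x = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(5)]  # 5 tokens
y = multi_unit_layer(x, units, alphas)  # same shape as x: 5 tokens of size 4
```

With a single unit and a weight of 1.0, the layer reduces to that unit alone, which matches the remark that the original Transformer is the one-path case.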

1.2. Biased MUTE

  • Borrowing the idea of bagging, instead of giving the same input X^k to all units in a layer, the input to each unit is first transformed with a corresponding type of noise (e.g., swap, reorder, mask):
  • where Bias_i denotes the noise function for the i-th unit. The noise operations include:
  1. Swapping, randomly swap two input embeddings within a certain range (i.e., 3).
  2. Disorder, randomly permute a subsequence within a certain length (i.e., 3).
  3. Masking, randomly replace one input embedding with a learnable mask embedding.
  • The identity mapping (i.e., no noise) can be seen as a special case of bias module and is included in the model design.
  • However, Biased MUTE still lacks explicit modeling of complementarity, i.e., a mechanism by which the units mitigate the gaps left by one another.
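The three noise operations and the identity mapping can be sketched as follows. This is a minimal illustration, assuming the noise is applied to a plain Python list: in the model the items would be input embeddings and the mask would be a learnable embedding, whereas here string tokens and a `"<mask>"` placeholder stand in for them. The function names and the `max_range` / `max_len` parameters are hypothetical.

```python
import random

def identity(seq, rng=None):
    """No noise: the special case of the bias module."""
    return list(seq)

def swap_noise(seq, max_range=3, rng=random):
    """Swap two positions that are at most max_range apart."""
    out = list(seq)
    if len(out) < 2:
        return out
    i = rng.randrange(len(out))
    lo, hi = max(0, i - max_range), min(len(out) - 1, i + max_range)
    j = rng.choice([p for p in range(lo, hi + 1) if p != i])
    out[i], out[j] = out[j], out[i]
    return out

def disorder_noise(seq, max_len=3, rng=random):
    """Shuffle one contiguous subsequence of at most max_len items."""
    out = list(seq)
    if len(out) < 2:
        return out
    span = min(max_len, len(out))
    start = rng.randrange(len(out) - span + 1)
    window = out[start:start + span]
    rng.shuffle(window)
    out[start:start + span] = window
    return out

def mask_noise(seq, mask_token="<mask>", rng=random):
    """Replace one randomly chosen item with a mask placeholder."""
    out = list(seq)
    if out:
        out[rng.randrange(len(out))] = mask_token
    return out

rng = random.Random(0)
tokens = ["x1", "x2", "x3", "x4", "x5", "x6"]
bias_fns = [identity, swap_noise, disorder_noise, mask_noise]
# One differently biased view of the same input per unit.
biased_inputs = [fn(tokens, rng=rng) for fn in bias_fns]
```

Each unit then sees a differently perturbed view of the same layer input, which is where the analogy to bagging comes from.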

1.3. Sequentially Biased MUTE

  • Given the outputs of the biased units F̃_i(X_i^k), these outputs are permuted by a certain ordering function p(i):
  • where G̃_i is the i-th permuted output.
  • Similar to ResNet, each permuted output G̃_i learns the residual of the previously accumulated outputs Ĝ_{i-1}:
  • Finally, the accumulated outputs are normalized, which keeps the output norm stable, and fused:
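The permute, accumulate and fuse steps can be sketched on toy vectors. Two things here are assumptions rather than details taken from the paper: the accumulation is written as a plain running sum, and the final normalization is a simple average over the accumulated outputs. The function name `sequential_fuse` is hypothetical.

```python
def sequential_fuse(unit_outputs, order):
    """Permute unit outputs, accumulate them as running sums, then average."""
    # The i-th permuted output is the output of unit order[i], i.e. p(i).
    permuted = [unit_outputs[p] for p in order]
    accumulated = []
    running = [0.0] * len(permuted[0])
    for g in permuted:
        # Each output is added on top of what has been accumulated so far.
        running = [r + v for r, v in zip(running, g)]
        accumulated.append(running)
    n = len(accumulated)
    # Assumed normalization: average over all accumulated outputs.
    return [sum(col) / n for col in zip(*accumulated)]

outputs = [[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]  # one toy vector per unit
fused_a = sequential_fuse(outputs, order=[2, 0, 1])
fused_b = sequential_fuse(outputs, order=[0, 1, 2])
```

Under these assumptions the result depends on the ordering, because units placed earlier contribute to more of the accumulated terms. That dependence is why choosing the ordering function p(i) matters.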

1.4. AutoShuffle

  • A permutation matrix contains only discrete values, which cannot be optimized by gradient descent. Therefore, a continuous matrix M is used instead. Specifically, M is treated as a learnable matrix that multiplies the outputs of the units:
  • To ensure M remains an approximation of a permutation matrix during training, M is normalized after each optimization step:
  • Then, a Lipschitz continuous non-convex penalty, proposed by Lyu et al. (2019), is introduced to guarantee that M converges to a permutation matrix:
  • The penalty is added to the cross-entropy loss to form the final objective:
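A small sketch may help make the normalization and the penalty concrete. It assumes the penalty takes the L1 minus L2 form over rows and columns, which is my reading of Lyu et al. (2019), and it assumes the normalization clips negative entries and then alternates row and column normalization. The names `normalize`, `l1_l2_penalty`, `lam` and `ce_loss` are hypothetical, and the loss values are made up for illustration.

```python
import math

def normalize(matrix, iters=50):
    """Clip to non-negative values, then alternate row/column normalization.

    Assumes every row and column keeps at least one positive entry.
    """
    m = [[max(v, 0.0) for v in row] for row in matrix]
    for _ in range(iters):
        m = [[v / sum(row) for v in row] for row in m]             # rows sum to 1
        col_sums = [sum(col) for col in zip(*m)]
        m = [[v / s for v, s in zip(row, col_sums)] for row in m]  # columns sum to 1
    return m

def l1_l2_penalty(matrix):
    """Sum over all rows and columns of (L1 norm - L2 norm)."""
    def gap(vec):
        return sum(abs(v) for v in vec) - math.sqrt(sum(v * v for v in vec))
    return sum(gap(row) for row in matrix) + sum(gap(col) for col in zip(*matrix))

permutation = [[0, 1, 0], [0, 0, 1], [1, 0, 0]]
uniform = [[1.0 / 3.0] * 3 for _ in range(3)]
penalty_perm = l1_l2_penalty(permutation)  # zero for an exact permutation
penalty_unif = l1_l2_penalty(uniform)      # positive for a "blurry" matrix

M = normalize([[0.9, 0.2, 0.1], [0.3, 0.8, 0.2], [0.1, 0.3, 0.7]])
ce_loss, lam = 2.0, 0.1                    # made-up values for illustration
total_loss = ce_loss + lam * l1_l2_penalty(M)
```

The intuition is that a non-negative vector summing to 1 has equal L1 and L2 norms only when it is one-hot, so pushing the penalty toward zero pushes every row and column of M toward a single 1.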

2. Results

2.1. NIST Chinese-English

Case-insensitive BLEU scores (%) of NIST Chinese-English (Zh-En) task.
  • All of the proposed approaches substantially outperform the baselines, with improvement ranging from 1.12 to 1.52 BLEU points.

The basic MUTE model has already surpassed existing strong NMT systems and the baselines.

2.2. WMT’14 English-German

Case-sensitive BLEU scores (%) of WMT’14 English-German (En-De) task.

The proposed methods perform consistently across languages and are still useful in large scale datasets.

2.3. WMT’18 Chinese-English

Case-sensitive BLEU scores (%) of WMT’18 Chinese-English (Zh-En) task.

The proposed MUTE model still strongly outperforms the baselines “Transformer (Base)” and “Transformer+Relative (Base)” by +1.1 and +0.7 BLEU points, respectively.

2.4. Ablation Study

Ablation study for BLEU scores (%) over the NIST Zh-En validation set.

Each noise type and module is useful.

2.5. Inference Speed

The effect of the number of units

Increasing the number of units from 1 to 6 yields a consistent BLEU improvement (from 46.5 to 47.5) with only a mild decrease in inference speed (from 890 tokens/sec to 830 tokens/sec, about a 7% slowdown).


