Review — MUTE: Multi-Unit Transformers for Neural Machine Translation

MUTE, Transformer With Parallel Paths

4 min read · Feb 11, 2023

Multi-Unit Transformers for Neural Machine Translation
MUTE, by Pattern Recognition Center, WeChat AI, Tencent Inc,
2020 EMNLP, Over 10 Citations (Sik-Ho Tsang @ Medium)
Natural Language Processing, NLP, Neural Machine Translation, NMT, Transformer

Multi-Unit TransformErs (MUTE) is proposed, which aims to promote the expressiveness of the Transformer by introducing diverse and complementary units.
Specifically, several parallel units are used to introduce diversity. A bias module and a sequential dependency are designed to guide and encourage complementarity among the different units.

Outline

Multi-Unit TransformErs (MUTE)
Results

1. Multi-Unit TransformErs (MUTE)

**Layer architecture for Multi-Unit Transformer. White arrow indicates model change from Multi-Unit to Biased Multi-Unit, to Sequentially Biased Multi-Unit**

The original Transformer corresponds to (a) with only a single path, i.e., one unit.

1.1. MUTE

Given the input X_k of the k-th layer, it is fed into I units {F_1, …, F_i, …, F_I} that share the same structure but have different learnable parameters.

The outputs F_i(X_k) of all I units are then combined by a weighted sum over the unit index i.

(This is similar to how ResNeXt extends ResNet with parallel paths in computer vision.)
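The weighted combination of parallel units can be sketched in a few lines of plain Python. This is only a minimal illustration: the function name `multi_unit_combine`, the toy units, and the hand-picked weights are assumptions made for the example, and the sketch leaves out whatever else the layer applies around the sum.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Unit = Callable[[Vector], Vector]


def multi_unit_combine(x: Vector, units: Sequence[Unit], weights: Sequence[float]) -> Vector:
    """Feed the same input to every unit and return the weighted sum of their outputs."""
    if len(units) != len(weights):
        raise ValueError("need exactly one weight per unit")
    outputs = [unit(x) for unit in units]
    return [
        sum(w * out[d] for w, out in zip(weights, outputs))
        for d in range(len(x))
    ]


# Three toy "units" standing in for Transformer units with different parameters.
toy_units = [
    lambda v: [2.0 * a for a in v],
    lambda v: [a + 1.0 for a in v],
    lambda v: [-a for a in v],
]
toy_weights = [0.5, 0.25, 0.25]  # fixed by hand here, purely for illustration

combined = multi_unit_combine([1.0, 2.0], toy_units, toy_weights)
```

With a single unit and a weight of 1.0 the function reduces to applying that unit alone, which matches the single-path case of the original Transformer.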

1.2. Biased MUTE

Borrowing the idea of bagging, each layer no longer gives the same input X_k to all units. Instead, the input of the i-th unit is transformed with its own type of noise (e.g., swap, disorder, mask), X_k^i = Bias_i(X_k), where Bias_i denotes the noise function for the i-th unit. The noise operations include:

Swapping, randomly swap two input embeddings up to a certain range (i.e., 3).
Disorder, randomly permute a subsequence within a certain length (i.e., 3).
Masking, randomly replace one input embedding with a learnable mask embedding.

The identity mapping (i.e., no noise) can be seen as a special case of the bias module and is included in the model design.
However, Biased MUTE still lacks explicit modeling of complementarity, i.e., a mechanism that mitigates the gaps between the differently biased units.
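The noise operations listed above can be sketched as small functions over a sequence of input embeddings. This is a rough illustration only: the function names, the `rng` parameter, and details such as how the two swap positions or the subsequence start are sampled are assumptions, since the text gives only the ranges.

```python
import random
from typing import List, Optional, TypeVar

T = TypeVar("T")  # stands in for one input embedding


def swap_noise(seq: List[T], max_range: int = 3, rng: Optional[random.Random] = None) -> List[T]:
    """Swap two positions that are at most `max_range` apart."""
    rng = rng if rng is not None else random.Random()
    out = list(seq)
    if len(out) < 2:
        return out
    i = rng.randrange(len(out) - 1)
    j = min(len(out) - 1, i + rng.randint(1, max_range))
    out[i], out[j] = out[j], out[i]
    return out


def disorder_noise(seq: List[T], max_len: int = 3, rng: Optional[random.Random] = None) -> List[T]:
    """Shuffle one contiguous subsequence of length at most `max_len`."""
    rng = rng if rng is not None else random.Random()
    out = list(seq)
    if len(out) < 2:
        return out
    span = min(max_len, len(out))
    start = rng.randrange(len(out) - span + 1)
    window = out[start:start + span]
    rng.shuffle(window)
    out[start:start + span] = window
    return out


def mask_noise(seq: List[T], mask_embedding: T, rng: Optional[random.Random] = None) -> List[T]:
    """Replace one randomly chosen position with the mask embedding."""
    rng = rng if rng is not None else random.Random()
    out = list(seq)
    if not out:
        return out
    out[rng.randrange(len(out))] = mask_embedding
    return out


def identity_noise(seq: List[T]) -> List[T]:
    """No noise: the special case that leaves the input unchanged."""
    return list(seq)
```

Each function returns a new list, so one layer input can be passed to several noise functions to build a differently perturbed input per unit. In the model the mask embedding is learnable; here it is just a value passed in by the caller.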

1.3. Sequentially Biased MUTE

Given the outputs of the biased units F̃_i(X_k^i), these outputs are permuted by an ordering function p(i), where G̃_i denotes the i-th permuted output.
Similar to ResNet, each permuted output G̃_i learns the residual of the previously accumulated output Ĝ_{i-1}.

Finally, the accumulated sequence is normalized to keep the output norm stable, and all accumulated outputs are fused.
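One way to read the sequential dependency is as a running sum over the permuted unit outputs, followed by a fusion step. The sketch below follows that reading and should be taken as an assumption rather than the paper's exact equations: the residual step is written as adding each permuted output to the previous accumulation, the fusion is a plain average, and the names `sequentially_accumulate`, `fuse`, and `order` are made up for the example.

```python
from typing import List, Sequence

Vector = List[float]


def sequentially_accumulate(unit_outputs: Sequence[Vector], order: Sequence[int]) -> List[Vector]:
    """Permute the unit outputs by `order`, then build running sums over them."""
    permuted = [unit_outputs[p] for p in order]
    accumulated: List[Vector] = []
    running = [0.0] * len(permuted[0])
    for g in permuted:
        # Each permuted output is added on top of what was accumulated so far.
        running = [r + v for r, v in zip(running, g)]
        accumulated.append(running)
    return accumulated


def fuse(accumulated: Sequence[Vector]) -> Vector:
    """Average the accumulated outputs (assumed normalization) to keep the norm stable."""
    n = len(accumulated)
    return [sum(column) / n for column in zip(*accumulated)]


outputs = [[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]  # toy outputs of three biased units
order = [2, 0, 1]                               # toy ordering p(i)
accumulated = sequentially_accumulate(outputs, order)
fused = fuse(accumulated)
```

Because later positions in the ordering build on earlier ones, the ordering matters, which is why the next subsection learns it instead of fixing it by hand.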

1.4. AutoShuffle

A permutation matrix contains only discrete values, which cannot be optimized by gradient descent. Therefore, a continuous matrix M is used instead. In particular, M is treated as a learnable matrix that multiplies the outputs of the units.

To ensure that M remains an approximation of a permutation matrix during training, M is normalized after each optimization step.

Then, a Lipschitz continuous nonconvex penalty, as proposed in Lyu et al. (2019), is introduced to guarantee that M converges to a permutation matrix.

The penalty is added to the cross-entropy loss to form the final objective.
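The pieces of AutoShuffle can be sketched on small matrices in plain Python. Several details here are assumptions: the normalization is shown as clipping negatives and rescaling rows, the penalty is written in the row-and-column ℓ1 minus ℓ2 form associated with Lyu et al. (2019), and the weight `lam` and all function names are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]


def normalize_rows(M: Matrix) -> Matrix:
    """Clip negative entries to zero and rescale each row to sum to one (assumed scheme)."""
    out = []
    for row in M:
        clipped = [max(v, 0.0) for v in row]
        total = sum(clipped)
        out.append([v / total for v in clipped] if total > 0 else [1.0 / len(row)] * len(row))
    return out


def l1_minus_l2_penalty(M: Matrix) -> float:
    """Sum of (L1 norm - L2 norm) over all rows and columns; zero for a permutation matrix."""
    lines = [list(row) for row in M] + [list(col) for col in zip(*M)]
    return sum(
        sum(abs(v) for v in line) - math.sqrt(sum(v * v for v in line))
        for line in lines
    )


def soft_permute(M: Matrix, outputs: Matrix) -> Matrix:
    """Row i of the result is sum_j M[i][j] * outputs[j], a soft reordering of the unit outputs."""
    dim = len(outputs[0])
    return [
        [sum(M[i][j] * outputs[j][d] for j in range(len(outputs))) for d in range(dim)]
        for i in range(len(M))
    ]


def total_loss(cross_entropy: float, M: Matrix, lam: float) -> float:
    """Final objective: cross-entropy plus the weighted penalty (lam is a hypothetical weight)."""
    return cross_entropy + lam * l1_minus_l2_penalty(M)
```

For an exact permutation matrix such as [[0, 1], [1, 0]] every row and column has equal ℓ1 and ℓ2 norms, so the penalty vanishes and `soft_permute` simply reorders the unit outputs. A uniform matrix mixes all outputs equally and is penalized.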

2. Results

2.1. NIST Chinese-English

**Case-insensitive BLEU scores (%) of NIST Chinese-English (Zh-En) task.**

All of the proposed approaches substantially outperform the baselines, with improvement ranging from 1.12 to 1.52 BLEU points.

The basic MUTE model has already surpassed existing strong NMT systems and the baselines.

2.2. WMT’14 English-German

**Case-sensitive BLEU scores (%) of WMT’14 English-German (En-De) task.**

The proposed methods perform consistently across language pairs and remain useful on large-scale datasets.

2.3. WMT’18 Chinese-English

**Case-sensitive BLEU scores (%) of WMT’18 Chinese-English (Zh-En) task.**

The proposed MUTE model still strongly outperforms the baselines “Transformer (Base)” and “Transformer+Relative (Base)”, by +1.1 and +0.7 BLEU points respectively.

2.4. Ablation Study

**Ablation study for BLEU scores (%) over the NIST Zh-En validation set.**

Each noise type and each module contributes to the final BLEU score.

2.5. Inference Speed

Increasing the number of units from 1 to 6 yields a consistent BLEU improvement (from 46.5 to 47.5, i.e., +1.0) with only a mild decrease in inference speed (from 890 tokens/sec to 830 tokens/sec, roughly 7% slower).

Reference

[2020 EMNLP] [MUTE]
Multi-Unit Transformers for Neural Machine Translation

4.2. Machine Translation

2013 … 2020 [Batch Augment, BA] [GPT-3] [T5] [Pre-LN Transformer] [OpenNMT] [DeFINE] [MUTE] 2021 [ResMLP] [GPKD] [Roformer] [DeLighT]

Written by Sik-Ho Tsang
