# Review — MUTE: Multi-Unit Transformers for Neural Machine Translation

## MUTE, Transformer With Parallel Paths


Multi-Unit Transformers for Neural Machine Translation, by Pattern Recognition Center, WeChat AI, Tencent Inc.

MUTE, 2020 EMNLP, Over 10 Citations (Sik-Ho Tsang @ Medium)

Natural Language Processing, NLP, Neural Machine Translation, NMT, Transformer

**M**ulti-**U**nit **T**ransform**E**rs (**MUTE**) is proposed, which aims to promote the expressiveness of the Transformer by **introducing diverse and complementary units**.

- Specifically, **several parallel units** are used to **introduce diversity**. A **biased module** and **sequential dependency** are designed to guide and **encourage complementarity** among different units.

# Outline

1. **Multi-Unit TransformErs (MUTE)**
2. **Results**

# 1. **M**ulti-**U**nit **T**ransform**E**rs (**MUTE**)

- The original Transformer corresponds to the architecture in (a), but with only one path.

## 1.1. MUTE

- Given the input *Xk* of the *k*-th layer, it is **fed into** *I* identical units {*F*1, …, *Fi*, …, *FI*} with **different learnable parameters**.

- The outputs of all *I* units are combined by a **weighted sum**, where *i* denotes the *i*-th unit:
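In notation approximated from this description (the combination weights, written *αi* here, are an assumption of this sketch; the paper may parameterize them differently), the weighted-sum fusion reads:

```latex
X^{k+1} = \sum_{i=1}^{I} \alpha_i \, F_i\!\left(X^{k}\right), \qquad \sum_{i=1}^{I} \alpha_i = 1
```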

## 1.2. Biased MUTE

- **Borrowing from the bagging idea**, for each layer, instead of feeding the same input *Xk* to all units, each input is **transformed with a corresponding type of noise (e.g., swap, disorder, mask)**:

- where *Biasi* denotes the **noise function** for the *i*-th unit. The noise operations include:

- **Swapping**: randomly swap two input embeddings up to a certain range (i.e., 3).
- **Disorder**: randomly permute a subsequence within a certain length (i.e., 3).
- **Masking**: randomly replace one input embedding with a learnable mask embedding.
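The three noise operations can be illustrated with a minimal sketch operating on a list of token embeddings (the function names and default window sizes are this sketch's, not the paper's):

```python
import random

def swap_noise(x, max_range=3):
    """Swapping: randomly swap two embeddings up to `max_range` positions apart.
    x: list of per-token embeddings; returns a noised copy."""
    x = list(x)
    n = len(x)
    if n < 2:
        return x
    i = random.randrange(n)
    candidates = [j for j in range(max(0, i - max_range), min(n, i + max_range + 1)) if j != i]
    j = random.choice(candidates)
    x[i], x[j] = x[j], x[i]
    return x

def disorder_noise(x, max_len=3):
    """Disorder: randomly permute a contiguous subsequence of length up to `max_len`."""
    x = list(x)
    n = len(x)
    sub = min(max_len, n)
    start = random.randrange(n - sub + 1)
    segment = x[start:start + sub]
    random.shuffle(segment)
    x[start:start + sub] = segment
    return x

def mask_noise(x, mask_embedding):
    """Masking: replace one randomly chosen embedding with a learnable mask embedding."""
    x = list(x)
    x[random.randrange(len(x))] = mask_embedding
    return x

def identity(x):
    """Identity mapping: no noise, the special case included among the units."""
    return list(x)
```

Each unit would apply its own one of these functions to *Xk* before the unit's forward pass.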

- The **identity mapping (i.e., no noise)** can be seen as a special case of the bias module and is included in the model design.
- However, the model still **lacks explicit complementarity modeling**, i.e., a mechanism for the units to mitigate each other's gaps.

## 1.3. Sequentially Biased MUTE

- Given the outputs from the biased units *Fi*(*Xki*), these **outputs are permuted** by a certain ordering function *p*(*i*):

- where *Gi* is the *i*-th **permuted output**.
- Similar to ResNet, **each permuted output** *Gi* learns the residual of the previously accumulated output ^*Gi*-1:

- Finally, the accumulated sequence is **normalized** to keep the output **norm stable** and to **fuse all accumulated outputs**:
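The sequential-dependency steps above can be summarized as follows (a hedged reconstruction from this description, not verbatim from the paper: ^*G*0 = 0 is assumed, and Norm is assumed to be a layer normalization):

```latex
G_i = F_{p(i)}\!\left(\mathrm{Bias}_{p(i)}\!\left(X^{k}\right)\right), \qquad
\hat{G}_i = \hat{G}_{i-1} + G_i, \qquad
X^{k+1} = \mathrm{Norm}\!\left(\hat{G}_I\right)
```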

## 1.4. AutoShuffle

- A permutation matrix only contains discrete values, which cannot be optimized by gradient descent. Therefore, **a continuous matrix** *M* is used. Particularly, *M* is regarded as a **learnable matrix** and is used to **multiply the outputs of units**:

- To ensure that *M* remains an approximation of a permutation matrix during training, *M* is **normalized after each optimization step**:
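One plausible implementation of this re-normalization is a single Sinkhorn-style pass, clipping to nonnegative values and then rescaling rows and columns (a sketch under that assumption; the paper's exact projection may differ):

```python
import numpy as np

def normalize_permutation_approx(M, eps=1e-8):
    """Keep M close to a doubly stochastic matrix after an optimizer step:
    clip to nonnegative values, then normalize rows and columns in turn."""
    M = np.clip(M, 0.0, None)
    M = M / (M.sum(axis=1, keepdims=True) + eps)  # each row sums to ~1
    M = M / (M.sum(axis=0, keepdims=True) + eps)  # each column sums to ~1
    return M
```

Calling this after every gradient update keeps the entries of *M* nonnegative and its columns summing to one, so *M* stays near the set of permutation matrices.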

- Then, **a Lipschitz continuous nonconvex penalty** is introduced, as proposed in Lyu et al. (2019), to **guarantee** that *M* converges to a permutation matrix:

- The **penalty is added to the cross-entropy loss** as the **final objective**:
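Assuming the penalty is the ℓ1−ℓ2 penalty of Lyu et al. (2019), which vanishes exactly when every row and column of *M* is one-hot, the final objective would take the form (with λ a penalty weight; both the penalty form and λ are this sketch's reading, not quoted from the paper):

```latex
P(M) = \sum_{i}\left(\lVert M_{i,:}\rVert_1 - \lVert M_{i,:}\rVert_2\right)
     + \sum_{j}\left(\lVert M_{:,j}\rVert_1 - \lVert M_{:,j}\rVert_2\right),
\qquad
\mathcal{L} = \mathcal{L}_{\mathrm{CE}} + \lambda\, P(M)
```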

# 2. Results

## 2.1. **NIST Chinese-English**

**All of the proposed approaches substantially outperform the baselines**, with improvements ranging from 1.12 to 1.52 BLEU points.

The basic MUTE model has **already surpassed existing strong NMT systems** and the baselines.

## 2.2. **WMT’14 English-German**

The proposed methods **perform consistently across languages** and are still **useful on large-scale datasets**.

## 2.3. **WMT’18 Chinese-English**

The proposed MUTE model still **strongly outperforms the baselines** "Transformer (Base)" and "Transformer+Relative (Base)", by +1.1 and +0.7 BLEU points, respectively.

## 2.4. Ablation Study

Each noise type and module is useful.

## 2.5. Inference Speed

Increasing the number of units from 1 to 6 yields a **consistent BLEU improvement** (from 46.5 to 47.5) with only a **mild decrease in inference speed** (from 890 tokens/sec to 830 tokens/sec).

## Reference

[2020 EMNLP] [MUTE] Multi-Unit Transformers for Neural Machine Translation

**2013 … 2020 **[Batch Augment, BA] [GPT-3] [T5] [Pre-LN Transformer] [OpenNMT] [DeFINE] [MUTE] **2021 **[ResMLP] [GPKD] [Roformer] [DeLighT]