# Review — MUTE: Multi-Unit Transformers for Neural Machine Translation

## MUTE, Transformer With Parallel Paths


Multi-Unit Transformers for Neural Machine Translation, by Pattern Recognition Center, WeChat AI, Tencent Inc.

MUTE, 2020 EMNLP, Over 10 Citations (Sik-Ho Tsang @ Medium)

Natural Language Processing, NLP, Neural Machine Translation, NMT, Transformer

**M**ulti-**U**nit **T**ransform**E**rs (**MUTE**) is proposed, which aims to promote the expressiveness of the Transformer by **introducing diverse and complementary units**.

- Specifically, **several parallel units** are used to **introduce diversity**. A **biased module** and **sequential dependency** are designed to guide and **encourage complementarity** among different units.

# Outline

1. **Multi-Unit TransformErs (MUTE)**
2. **Results**

# 1. **M**ulti-**U**nit **T**ransform**E**rs (**MUTE**)

- The original Transformer corresponds to the architecture in (a), but with only one path.

## 1.1. MUTE

- Given the input *Xk* of the *k*-th layer, it is **fed into** *I* identical units {*F*1, …, *Fi*, …, *FI*} with **different learnable parameters**.

- The outputs of all *I* units are combined by a **weighted sum**, where *i* denotes the *i*-th unit:
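In notation approximated from this description (the combination weights, written *αi* here, are an assumption of this sketch; the paper may parameterize them differently), the weighted-sum fusion reads:

```latex
X^{k+1} = \sum_{i=1}^{I} \alpha_i \, F_i\!\left(X^{k}\right), \qquad \sum_{i=1}^{I} \alpha_i = 1
```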

## 1.2. Biased MUTE

- **Borrowing from the bagging idea**, for each layer, instead of feeding the same input *Xk* to all units, each input is **transformed with a corresponding type of noise (e.g., swap, disorder, mask)**:

- where *Biasi* denotes the **noise function** for the *i*-th unit. The noise operations include:

- **Swapping**: randomly swap two input embeddings up to a certain range (i.e., 3).
- **Disorder**: randomly permute a subsequence within a certain length (i.e., 3).
- **Masking**: randomly replace one input embedding with a learnable mask embedding.
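The three noise operations can be illustrated with a minimal sketch operating on a list of token embeddings (the function names and default window sizes are this sketch's, not the paper's):

```python
import random

def swap_noise(x, max_range=3):
    """Swapping: randomly swap two embeddings up to `max_range` positions apart.
    x: list of per-token embeddings; returns a noised copy."""
    x = list(x)
    n = len(x)
    if n < 2:
        return x
    i = random.randrange(n)
    candidates = [j for j in range(max(0, i - max_range), min(n, i + max_range + 1)) if j != i]
    j = random.choice(candidates)
    x[i], x[j] = x[j], x[i]
    return x

def disorder_noise(x, max_len=3):
    """Disorder: randomly permute a contiguous subsequence of length up to `max_len`."""
    x = list(x)
    n = len(x)
    sub = min(max_len, n)
    start = random.randrange(n - sub + 1)
    segment = x[start:start + sub]
    random.shuffle(segment)
    x[start:start + sub] = segment
    return x

def mask_noise(x, mask_embedding):
    """Masking: replace one randomly chosen embedding with a learnable mask embedding."""
    x = list(x)
    x[random.randrange(len(x))] = mask_embedding
    return x

def identity(x):
    """Identity mapping: no noise, the special case included among the units."""
    return list(x)
```

Each unit would apply its own one of these functions to *Xk* before the unit's forward pass.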

- The **identity mapping (i.e., no noise)** can be seen as a special case of the bias module and is included in the model design.
- However, the model still **lacks explicit complementarity modeling**, i.e., a mechanism for the units to mitigate each other's gaps.

## 1.3. Sequentially Biased MUTE

- Given the outputs from the biased units *Fi*(*Xki*), these **outputs are permuted** by a certain ordering function *p*(*i*):

- where *Gi* is the *i*-th **permuted output**.
- Similar to ResNet, **each permuted output** *Gi* learns the residual of the previously accumulated output ^*Gi*-1:

- Finally, the accumulated sequence is **normalized** to keep the output **norm stable** and to **fuse all accumulated outputs**:
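The sequential-dependency steps above can be summarized as follows (a hedged reconstruction from this description, not verbatim from the paper: ^*G*0 = 0 is assumed, and Norm is assumed to be a layer normalization):

```latex
G_i = F_{p(i)}\!\left(\mathrm{Bias}_{p(i)}\!\left(X^{k}\right)\right), \qquad
\hat{G}_i = \hat{G}_{i-1} + G_i, \qquad
X^{k+1} = \mathrm{Norm}\!\left(\hat{G}_I\right)
```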

## 1.4. AutoShuffle

- A permutation matrix only contains discrete values, which cannot be optimized by gradient descent. Therefore, **a continuous matrix** *M* is used. Particularly, *M* is regarded as a **learnable matrix** and is used to **multiply the outputs of units**:

- To ensure that *M* remains an approximation of a permutation matrix during training, *M* is **normalized after each optimization step**:
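One plausible implementation of this re-normalization is a single Sinkhorn-style pass, clipping to nonnegative values and then rescaling rows and columns (a sketch under that assumption; the paper's exact projection may differ):

```python
import numpy as np

def normalize_permutation_approx(M, eps=1e-8):
    """Keep M close to a doubly stochastic matrix after an optimizer step:
    clip to nonnegative values, then normalize rows and columns in turn."""
    M = np.clip(M, 0.0, None)
    M = M / (M.sum(axis=1, keepdims=True) + eps)  # each row sums to ~1
    M = M / (M.sum(axis=0, keepdims=True) + eps)  # each column sums to ~1
    return M
```

Calling this after every gradient update keeps the entries of *M* nonnegative and its columns summing to one, so *M* stays near the set of permutation matrices.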

- Then, **a Lipschitz continuous nonconvex penalty** is introduced, as proposed in Lyu et al. (2019), to **guarantee** that *M* converges to a permutation matrix:

- The **penalty is added to the cross-entropy loss** as the **final objective**:
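Assuming the penalty is the ℓ1−ℓ2 penalty of Lyu et al. (2019), which vanishes exactly when every row and column of *M* is one-hot, the final objective would take the form (with λ a penalty weight; both the penalty form and λ are this sketch's reading, not quoted from the paper):

```latex
P(M) = \sum_{i}\left(\lVert M_{i,:}\rVert_1 - \lVert M_{i,:}\rVert_2\right)
     + \sum_{j}\left(\lVert M_{:,j}\rVert_1 - \lVert M_{:,j}\rVert_2\right),
\qquad
\mathcal{L} = \mathcal{L}_{\mathrm{CE}} + \lambda\, P(M)
```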

# 2. Results

## 2.1. **NIST Chinese-English**

**All of the proposed approaches substantially outperform the baselines**, with improvements ranging from 1.12 to 1.52 BLEU points.

The basic MUTE model has **already surpassed existing strong NMT systems** and the baselines.

## 2.2. **WMT’14 English-German**

The proposed methods **perform consistently across languages** and are still **useful on large-scale datasets**.

## 2.3. **WMT’18 Chinese-English**

The proposed MUTE model still **strongly outperforms the baselines** "Transformer (Base)" and "Transformer+Relative (Base)", by +1.1 and +0.7 BLEU points, respectively.

## 2.4. Ablation Study

Each noise type and module is useful.

## 2.5. Inference Speed

Increasing the number of units from 1 to 6 yields a **consistent BLEU improvement** (from 46.5 to 47.5) with only a **mild decrease in inference speed** (from 890 tokens/sec to 830 tokens/sec).

## Reference

[2020 EMNLP] [MUTE] Multi-Unit Transformers for Neural Machine Translation

**2013 … 2020 **[Batch Augment, BA] [GPT-3] [T5] [Pre-LN Transformer] [OpenNMT] [DeFINE] [MUTE] **2021 **[ResMLP] [GPKD] [Roformer] [DeLighT]