Review: Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer (MoE)

Using MoE, Outperforms GNMT and Deep-Att


1. Sparsely-Gated Mixture-of-Experts Layer (MoE)


The goal is to train a trillion-parameter model on a trillion-word corpus.
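To make "sparsely gated" concrete, here is a minimal sketch of the idea in plain Python. It assumes a simple top-k softmax gate: a gating network scores every expert, only the k highest-scoring experts are kept, and the output is their gate-weighted sum. The names `top_k_gates`, `moe_layer`, `gate_weights` and the toy experts are hypothetical and chosen for illustration only. The paper's gate additionally adds tunable Gaussian noise before the top-k step and uses auxiliary load-balancing losses, both omitted here.

```python
import math

def top_k_gates(logits, k):
    """Softmax over the k largest logits; every other gate is exactly 0."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    m = max(logits[i] for i in top)  # subtract the max for numerical stability
    exps = {i: math.exp(logits[i] - m) for i in top}
    total = sum(exps.values())
    return [exps[i] / total if i in exps else 0.0 for i in range(len(logits))]

def moe_layer(x, gate_weights, experts, k=2):
    """Compute y = sum_i G(x)_i * E_i(x), evaluating only the selected experts."""
    # One gating logit per expert (assumption: a plain linear gate, no noise term).
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in gate_weights]
    gates = top_k_gates(logits, k)
    y = [0.0] * len(x)
    for g, expert in zip(gates, experts):
        if g == 0.0:
            continue  # sparsity: experts with a zero gate are never run
        out = expert(x)
        y = [a + g * o for a, o in zip(y, out)]
    return y, gates

# Toy usage: four hypothetical "experts" that just scale their input.
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]
gate_weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, -1.0]]
y, gates = moe_layer([1.0, 2.0], gate_weights, experts, k=2)
```

With k fixed, the compute per input stays roughly constant no matter how many experts exist, which is what allows the parameter count to grow far faster than the computational cost.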

Model comparison on 1-Billion-Word Language-Modeling Benchmark
Summary of high-capacity MoE-augmented models with varying computational budgets, vs. best previously published results
Language modeling on a 100 billion word corpus
Results on WMT’14 En>Fr newstest2014
Results on WMT’14 En>De newstest2014

The proposed approach achieved BLEU scores of 40.56 and 26.03 on the WMT’14 En>Fr and En>De benchmarks, respectively, outperforming GNMT and Deep-Att.

Results on the Google Production En>Fr dataset

On the Google Production dataset, the MoE model achieved a 1.01 higher test BLEU score even after training for only one-sixth of the time.

Multilingual Machine Translation

The MoE model achieves 19% lower perplexity on the dev set than the multilingual GNMT model.


