Brief Review — Adaptive Mixtures of Local Experts
MoE, proposed by Hinton's group in 1991 and extended to NLP in 2017
- A gating network switches different experts on or off depending on the input signal.
- This is a paper by Prof. Hinton’s research group, which later extended the idea to NLP as the sparsely-gated MoE in 2017.
- Prior Art
- Proposed MoE
1. Prior Art
- Prior work before this paper combines the outputs of the local experts linearly:
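Reconstructed in the 1991 paper's notation: $o_i^c$ is expert $i$'s output vector on case $c$, $d^c$ is the desired output, and $p_i^c$ is the mixing proportion assigned to expert $i$:

$$E^c = \left\| d^c - \sum_i p_i^c o_i^c \right\|^2$$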
- A direct linear combination couples the experts: each expert's gradient depends on the residual left by all the other experts, so they tend to cooperate on every case instead of specializing, which works poorly on complex problems.
2. Proposed MoE
- Instead of a linear combination of experts, a gating function $p$ is used:
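Reconstructed from the paper: the gate $p_i^c$ is a softmax over the gating network's outputs $x_i^c$, and the error now weights each expert's own residual by its gate:

$$p_i^c = \frac{e^{x_i^c}}{\sum_j e^{x_j^c}}, \qquad E^c = \sum_i p_i^c \left\| d^c - o_i^c \right\|^2$$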
Depending on the input signal, the gating function switches different experts on or off. Thus, each expert learns to focus on its own particular signal pattern.
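To make the gating concrete, here is a minimal NumPy sketch of one MoE forward pass with the proposed loss. All names, shapes, and the choice of linear experts are illustrative assumptions for this sketch, not details from the paper.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
d_in, d_out, K = 4, 2, 3                       # input dim, output dim, number of experts
x = rng.normal(size=d_in)                      # one input case
d = rng.normal(size=d_out)                     # desired output for that case (dummy target)
W_experts = rng.normal(size=(K, d_out, d_in))  # each expert: a small linear model
W_gate = rng.normal(size=(K, d_in))            # gating network: linear layer + softmax

p = softmax(W_gate @ x)          # gating proportions p_i (sum to 1)
o = W_experts @ x                # expert outputs o_i, shape (K, d_out)

# Proposed loss: each expert is compared to the target on its own,
# weighted by its gate, so experts can specialize instead of co-adapting.
E = np.sum(p * np.sum((d - o) ** 2, axis=-1))
print("gates:", p, "loss:", float(E))
```

Because each expert's squared error enters the loss separately, the gradient raises the gate of whichever expert already fits a given case best, which is the mechanism behind the specialization described above.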
- A variant of the objective is also proposed below, which obtains better performance:
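Reconstructed from the paper: the error is reinterpreted as the negative log probability of the target under a gate-weighted mixture of Gaussian predictors:

$$E^c = -\log \sum_i p_i^c \, e^{-\frac{1}{2}\left\| d^c - o_i^c \right\|^2}$$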
- A multi-speaker vowel recognition task is used for evaluation.
Different experts learn to concentrate on discriminating one pair of vowel classes or another.
The proposed MoE configurations (4 experts and 8 experts) need far fewer training epochs to reach the same accuracy as the baselines.
This MoE concept was further developed into the sparsely-gated MoE in 2017. Inspired by MoE, Vision MoE (V-MoE) for image classification appeared in 2021.