Brief Review — Adaptive Mixtures of Local Experts

MoE, by Hinton's Group in 1991, Later Extended to NLP in 2017

  • By using a gating network, different experts are switched on or off depending on the input signal, as sketched in the example after this list.
  • This is a paper from Prof. Hinton’s research group, which later extended this idea to NLP with the 2017 MoE.
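As a concrete illustration of the gating idea, below is a minimal NumPy sketch (my own example, not code from the paper; the linear experts and the names n_experts, W_exp, and W_gate are assumptions): a softmax gating network produces input-dependent proportions that decide how strongly each expert contributes to the final output.

```python
import numpy as np

# Minimal mixture-of-experts forward pass (illustration only, not the paper's code).
rng = np.random.default_rng(0)

n_experts, d_in, d_out = 4, 8, 2
W_exp = rng.normal(size=(n_experts, d_in, d_out))   # one linear expert per slot (assumed)
W_gate = rng.normal(size=(d_in, n_experts))         # gating network weights (assumed)

def softmax(z):
    z = z - z.max()                                  # for numerical stability
    e = np.exp(z)
    return e / e.sum()

def moe_forward(x):
    """Gating proportions depend on the input, so experts are softly switched on/off."""
    p = softmax(x @ W_gate)                          # gating proportions, sum to 1
    expert_outputs = np.stack([x @ W_exp[i] for i in range(n_experts)])
    y = (p[:, None] * expert_outputs).sum(axis=0)    # gated combination of expert outputs
    return y, p

x = rng.normal(size=d_in)
y, p = moe_forward(x)
print("gating proportions:", np.round(p, 3))
```

In the actual model, training drives these proportions toward being close to one-hot, so that each case is handled by a single specialized expert.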

Outline

  1. Prior Art
  2. Proposed MoE
  3. Results

1. Prior Art

  • Prior work before this paper proposes a linear combination of the local experts, using the error function shown below:
  • Such a linear combination may not be a good solution for a complex problem: because all experts are blended into a single output, each expert is encouraged to cooperate on every case rather than to specialize.
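Written out (notation as in the paper: $d^c$ is the desired output for case $c$, $o_i^c$ the output of expert $i$, and $p_i^c$ the gating proportion for expert $i$), the prior error function is a squared error on the blended output:

$$E^c = \left\| d^c - \sum_i p_i^c\, o_i^c \right\|^2$$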

2. Proposed MoE

Figure: Adaptive Mixtures of Local Experts
  • Instead of a linear combination of experts, a gating function p is used to weight each expert’s individual error, as in the first equation below:
  • A variant of MoE is also proposed (the second equation below), which obtains better performance:
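The two equations referred to above, reconstructed here in the paper's notation, are the per-expert error and its variant, with the gating proportions given by a softmax over the gating network's outputs $x_i^c$:

$$p_i^c = \frac{e^{x_i^c}}{\sum_j e^{x_j^c}}, \qquad E^c = \sum_i p_i^c \left\| d^c - o_i^c \right\|^2$$

$$E^c = -\log \sum_i p_i^c\, e^{-\tfrac{1}{2}\left\| d^c - o_i^c \right\|^2}$$

In the first form each expert is compared to the whole target on its own, so an expert’s gradient no longer depends on the other experts’ outputs, and the gating network learns to allocate each case to one expert. The variant behaves like the negative log-likelihood of a Gaussian mixture and gives the strongest credit to the best-fitting expert, which is why it performs better in practice.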

3. Results

  • A multi-speaker vowel recognition task is used for evaluation.
Figure: Data for the vowel discrimination problem, with the decision lines of the experts and the gating network.
Table: Performance on the vowel discrimination task.

This MoE concept was further developed into the MoE for NLP in 2017. Inspired by MoE, Vision MoE (V-MoE) for image classification appeared in 2021.

Reference

[1991 JNEUCOM] [MoE]
Adaptive Mixtures of Local Experts

4.1. Language Model / Sequence Model

(MoE itself is not an NLP paper, but I group these readings here to keep them in one place.)

1991 [MoE] … 2020 [ALBERT] [GPT-3] [T5] [Pre-LN Transformer] [MobileBERT] [TinyBERT]

