Review — Scaling Vision with Sparse Mixture of Experts

V-MoE, up to 24 MoE Layers, 32 Experts Per Layer, Almost 15B Parameters

Sik-Ho Tsang
6 min read · Oct 5, 2022
V-MoE, Different Experts for Different Contents, Capacity Can be Reduced by Dropping Patches

Scaling Vision with Sparse Mixture of Experts,
V-MoE, by Google Brain
2021 NeurIPS, Over 70 Citations (Sik-Ho Tsang @ Medium)
Image Classification, Mixture of Experts, MoE, ViT, Transformer

  • The idea of Mixture of Experts (MoE) comes from Hinton’s research group, originally for vowel recognition in 1991, and was later applied to NLP in 2017.
  • This paper makes use of MoE in image classification. Particularly, V-MoE replaces a subset of the dense feedforward layers in ViT with sparse MoE layers, where each image patch is “routed” to a subset of “experts” (MLPs). It is scalable and competitive with the largest dense networks.

Outline

  1. MoE
  2. V-MoE
  3. Experimental Results

1. MoE

  • In MoE, a mixture of experts layer with E experts computes: MoE(x) = Σ_{i=1..E} g(x)_i · e_i(x),
  • where x is the input and e_i is the function computed by expert i.
  • g is the “routing” function which prescribes the input-conditioned weights for the experts.
  • Both e_i and g are parameterized by neural networks.
  • If g is sparse, i.e., restricted to assign only k ≪ E non-zero weights, then unused experts need not be computed. This unlocks super-linear scaling of the number of model parameters with respect to inference and training compute (see the sketch after this list).
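The following is a minimal NumPy sketch (not the paper’s implementation) of such a sparse MoE layer for a single token, assuming each expert is given as a callable and the router is a simple linear map:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def moe_layer(x, experts, W, k=2):
    """Sparse MoE layer for a single token x (1-D array of size d).

    experts: list of E callables e_i(x); W: routing matrix of shape (E, d).
    Only the k experts with non-zero gate weights are evaluated, so compute
    scales with k rather than with the total number of experts E.
    """
    gates = softmax(W @ x)                  # one routing weight per expert
    topk = np.argsort(gates)[-k:]           # indices of the k largest weights
    # TOP_k: every other expert gets weight zero and is never evaluated
    return sum(gates[i] * experts[i](x) for i in topk)
```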

2. V-MoE

Overview of V-MoE Architecture

2.1. Architecture

Transformer Encoder with MoE
  • V-MoE is composed of L ViT blocks. In some of them, the dense MLP is replaced with a sparsely activated mixture of MLPs. Each expert MLP is stored on a separate device and processes a fixed number of tokens.
  • The MLPs have two layers and a GELU non-linearity: MLP(x) = W2 · σ_gelu(W1 x).
  • For Vision MoE, a subset of these MLPs is replaced with MoE layers, where each expert is an MLP as above.
  • The experts have the same architecture but different weights θ_i: e_i(x) = MLP_{θ_i}(x) (a sketch of one expert follows below).
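As a rough sketch, one expert could look like the NumPy snippet below, with biases omitted as in the formula above (the GELU approximation and weight shapes are illustrative assumptions):

```python
import numpy as np

def gelu(z):
    # tanh approximation of the GELU non-linearity
    return 0.5 * z * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (z + 0.044715 * z**3)))

def expert_mlp(x, W1, W2):
    """One expert: a two-layer MLP, MLP(x) = W2 · gelu(W1 · x).
    All experts share this architecture but hold their own (W1, W2)."""
    return W2 @ gelu(W1 @ x)
```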

2.2. Routing

  • There are two routing schemes: vanilla routing and Batch Prioritized Routing (BPR).

2.2.1. Vanilla Routing

  • The routing function g is: g(x) = TOP_k(softmax(Wx + ε)),
  • where TOP_k is an operation that sets all elements of the vector to zero except the elements with the k largest values. In practice, k=1 or k=2.
  • ε is a small amount of Gaussian noise added to the routing logits, sampled independently entry-wise, which is found to improve performance.
  • The noise typically altered routing decisions about 15% of the time in earlier layers, and 2–3% in deeper layers.
  • In the context of ViT, x is the representation of an image token at some layer of the network. Therefore, V-MoE routes patch representations, not entire images (see the sketch after this list).
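A NumPy sketch of this router for one token is shown below; the noise scale 1/E is an illustrative assumption rather than the exact value used in the paper:

```python
import numpy as np

def vanilla_router(x, W, k=1, train=True, rng=None):
    """g(x) = TOP_k(softmax(W x + eps)) for a single token x.

    W has shape (E, d). eps is small entry-wise Gaussian noise added only
    during training; the scale 1/E used here is an assumption for illustration.
    """
    rng = rng or np.random.default_rng(0)
    E = W.shape[0]
    logits = W @ x
    if train:
        logits = logits + rng.normal(0.0, 1.0 / E, size=E)
    probs = np.exp(logits - logits.max())
    probs = probs / probs.sum()             # softmax over the E experts
    g = np.zeros(E)
    topk = np.argsort(probs)[-k:]
    g[topk] = probs[topk]                   # keep only the k largest weights
    return g
```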

2.2.2. Batch Prioritized Routing (BPR)

  • BPR computes a priority score s(x) for each token and sorts g(X) accordingly before proceeding with the allocation.
  • One choice is based on the maximum routing weight: s(x) = max_i g(x)_i.
  • Another is the sum of the TOP-k weights: s(x) = Σ_i g(x)_i (after the TOP_k masking).
  • Both work well in practice (see the sketch after this list).
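Below is a small NumPy sketch of how those two priority scores could be computed and used to order tokens, assuming the TOP_k-sparse routing weights for a batch of tokens are already available:

```python
import numpy as np

def bpr_token_order(gates):
    """Order tokens by priority before filling the expert buffers.

    gates: array of shape (num_tokens, E) holding the TOP_k-sparse routing
    weights g(x) for each token. Returns token indices from highest to
    lowest priority, for the two scores discussed above.
    """
    s_max = gates.max(axis=1)   # s(x) = max_i g(x)_i  (maximum routing weight)
    s_sum = gates.sum(axis=1)   # s(x) = sum of the TOP_k routing weights
    return np.argsort(-s_max), np.argsort(-s_sum)
```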

2.3. Expert’s Buffer Capacity

  • The buffer capacity of each expert (i.e. the number of tokens that each expert processes) is fixed, and the model is trained with auxiliary losses that encourage load balancing (detailed in the paper’s Appendix).
  • The buffer capacity of an expert, B_e, is defined from the number of images in the batch (N), the number of tokens per image (P), the number of selected experts per token (k), the total number of experts (E), and the capacity ratio (C): B_e = round(kNPC / E).
  • If the router assigns more than B_e tokens to a given expert, only B_e of them are processed.
  • The capacity ratio C is used to adjust the capacity of the experts.
  • With C < 1, the router is forced to ignore some assignments, discarding the least useful tokens and saving compute during inference (see the sketch after this list).
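A tiny sketch of this capacity computation, with illustrative (not paper-reported) batch numbers:

```python
def expert_buffer_capacity(N, P, k, E, C=1.0):
    """B_e = round(k * N * P * C / E): fixed number of tokens per expert.

    N: images per batch, P: tokens per image, k: experts selected per token,
    E: total number of experts, C: capacity ratio. With C < 1, some token
    assignments overflow the buffer and are dropped.
    """
    return round(k * N * P * C / E)

# Illustrative numbers (assumptions, not from the paper): N=64, P=577, k=2, E=32, C=1.0
# -> each expert processes at most round(2*64*577/32) = 2308 tokens per batch.
```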

2.4. Variants

  • V-MoE Every-2: MoE layers are placed on every other block.
  • V-MoE Last-n: fewer MoE layers are used, by placing them only on the last n even blocks (a toy sketch of both placements follows below).
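As a toy illustration of these two placements, assuming 1-indexed blocks and that “Every-2” means the even-positioned blocks (an assumption for illustration):

```python
def moe_block_positions(L, placement="every-2", n=5):
    """Toy sketch of MoE placement: "every-2" uses the even-positioned blocks
    (assumed here), "last-n" keeps only the last n of those even positions."""
    evens = [b for b in range(1, L + 1) if b % 2 == 0]
    return evens if placement == "every-2" else evens[-n:]

# e.g. L=24 blocks: every-2 -> 12 MoE layers; last-5 -> blocks [16, 18, 20, 22, 24]
```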

3. Experimental Results

JFT-300M Precision@1 and ImageNet 5-shot accuracy.
Main V-MoE & ViT models

3.1. Upstream JFT

  • The models are pretrained on JFT-300M.
  • JFT-300M (Left): The best results are achieved by V-MoE-H/14 Every-2, where H denotes the “Huge” model size and 14 is the patch size.

Expert models provide notable gains across all model sizes, for only a mild increase in FLOPs, establishing a new Pareto frontier (gray lines).

3.2. Downstream ImageNet

  • ImageNet (Right): A linear layer is trained on top of the fixed representation of pretrained models.

The representations learned by V-MoE also outperform those of ViT models when transferred to a new task.

ImageNet Fine-Tuning Accuracy.

In full fine-tuning, V-MoE also performs better than dense counterparts.

3.3. Downstream VTAB

VTAB

In the low-data regime, on the VTAB benchmark, performance is similar for V-MoE-H/14, while experts provide significant gains at the ViT-L/16 level. V-MoEs can still be fine-tuned with small amounts of data and no further tricks.

3.4. Scaling Up V-MoE

A 48-block V-MoE model with every-2 expert placement (32 experts and k = 2) results in a model with 14.7B parameters, denoted V-MoE-15B.

  • It has an impressive accuracy of 82.78% on 5-shot ImageNet and 90.35% when fully fine-tuned.

3.5. Routing

White patches are discarded tokens in the first layer of experts, for different capacities, using Batch Prioritized Routing (BPR) with a V-MoE-H/14.
  • With BPR, reducing the capacity of each expert lets the model discard the least useful tokens.
  • Intuitively, not every patch is equally important for classifying a given image; e.g., most background patches can be dropped so that the model focuses only on the patches containing the relevant entities, as shown above.
Left: Reducing compute with priority routing. Right: Priority routing works where vanilla fails.
  • Left: BPR allows the model to be competitive with the dense one even at quite low capacities.
  • Right: BPR outperforms vanilla routing at low capacity ratio C.

3.6. Model Analysis

Deeper routing decisions correlate with image classes.
  • The above figure shows how many images of a given ImageNet class use each expert.

Experts specialize in discriminating between small sets of classes.

  • In earlier MoE layers, this specialization is not observed. Experts may instead focus on aspects common to all classes (background, basic shapes, colors).

Inspired by MoE for vowel recognition in 1991 and MoE for NLP in 2017, V-MoE is proposed for image classification.
