Review — Scaling Vision with Sparse Mixture of Experts
V-MoE, up to 24 MoE Layers, 32 Experts Per Layer, Almost 15B Parameters
- The idea of Mixture of Experts (MoE) originated in Hinton's research group in 1991, where it was applied to vowel recognition, and was brought to NLP in 2017.
- This paper makes use of MoE in image classification. Particularly, V-MoE replaces a subset of the dense feedforward layers in ViT with sparse MoE layers, where each image patch is “routed” to a subset of “experts” (MLPs). It is scalable and competitive with the largest dense networks.
- In MoE, a mixture of experts layer with E experts computes:
  $$\mathrm{MoE}(x) = \sum_{i=1}^{E} g(x)_i \, e_i(x)$$
- where x is the input and e_i is the function computed by expert i.
- g is the "routing" function, which prescribes the input-conditioned weight for each expert.
- Both e_i and g are neural networks.
- If g is sparse, i.e., restricted to assign only k≪E non-zero weights, then unused experts need not be computed. This unlocks super-linear scaling of the number of model parameters with respect to inference and training compute.
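As a rough illustration of why a sparse g saves compute, here is a minimal pure-Python sketch of the weighted combination of experts. The function name `sparse_moe` and the toy "experts" (simple scalings standing in for MLPs) are assumptions made for this example, not code from the paper.

```python
def sparse_moe(x, experts, gate_weights):
    """Weighted sum of expert outputs, skipping experts whose gate weight is zero."""
    out = [0.0] * len(x)
    evaluated = 0
    for weight, expert in zip(gate_weights, experts):
        if weight == 0.0:
            continue  # sparse gate: this expert is never computed
        y = expert(x)
        evaluated += 1
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out, evaluated

# Four toy experts that just scale the input (hypothetical stand-ins for MLPs).
experts = [lambda v, s=s: [s * vi for vi in v] for s in (1.0, 2.0, 3.0, 4.0)]
gate = [0.0, 0.75, 0.0, 0.25]  # k = 2 non-zero weights out of E = 4

y, n_evaluated = sparse_moe([1.0, -2.0], experts, gate)
print(y, n_evaluated)
```

Only two of the four experts are evaluated here, so adding more experts grows the parameter count without growing the per-input compute, as long as k stays fixed.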
- V-MoE is composed of L ViT blocks. In some, the MLP is replaced with a sparsely activated mixture of MLPs. Each MLP (the expert) is stored on a separate device, and processes a fixed number of tokens.
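- The MLPs in ViT have two layers and a GELU non-linearity:
  $$\mathrm{MLP}(x) = W_2 \, \sigma_{\mathrm{gelu}}(W_1 x)$$
- For Vision MoE, a subset of these MLPs is replaced with MoE layers, where each expert is an MLP of the form above.
- The experts have the same architecture but different weights θ_i:
  $$e_i(x) = \mathrm{MLP}_{\theta_i}(x), \qquad \theta_i = (W_1^i, W_2^i)$$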
- Two routing schemes are considered: vanilla routing and Batch Prioritized Routing (BPR).
2.2.1. Vanilla Routing
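- The routing function g is:
  $$g(x) = \mathrm{TOP}_k\big(\mathrm{softmax}(W x + \epsilon)\big)$$
- where W is the learned routing weight matrix and TOP_k is an operation that sets all elements of the vector to zero except the k largest. In practice, k=1 or k=2.
- ε is noise sampled independently for each entry, which was found to improve performance:
  $$\epsilon \sim \mathcal{N}\left(0, \tfrac{1}{E^2}\right)$$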
- The noise typically altered routing decisions 15% of the time in earlier layers, and 2–3% in deeper layers.
- In the context of the ViT, x is a representation of an image token at some layer of the network. Therefore, V-MoE routes patch representations, not entire images.
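The vanilla router can be sketched in a few lines of pure Python, assuming the form g(x) = TOP_k(softmax(Wx + ε)) with ε drawn from a Gaussian of variance 1/E² as in the paper. The helper names (`softmax`, `top_k`, `route`) and the toy matrix `W` are illustrative assumptions, not the authors' implementation.

```python
import math
import random

def softmax(z):
    m = max(z)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def top_k(weights, k):
    """Zero out every entry except the k largest."""
    keep = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)[:k]
    return [w if i in keep else 0.0 for i, w in enumerate(weights)]

def route(x, W, k, rng=None):
    """Routing weights for one token x; W has one row per expert."""
    E = len(W)
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in W]
    if rng is not None:
        # Noise with standard deviation 1/E, i.e. variance 1/E^2.
        logits = [l + rng.gauss(0.0, 1.0 / E) for l in logits]
    return top_k(softmax(logits), k)

# Toy routing matrix for E = 4 experts and a 3-dimensional token (made-up values).
W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0],
     [1.0, 1.0, 1.0]]
x = [0.5, 2.0, -1.0]

g_clean = route(x, W, k=2)                        # no noise (e.g. at inference)
g_noisy = route(x, W, k=2, rng=random.Random(0))  # with routing noise
print(g_clean)
print(g_noisy)
```

Because the softmax is applied before TOP_k, the surviving weights generally sum to less than 1. In a real model `x` would be a patch representation, so this function would be applied to every token of every image.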
2.2.2. Batch Prioritized Routing (BPR)
- BPR computes a priority score s(x) for each token and sorts g(X) by that score before allocating tokens to experts.
- One option is the maximum routing weight:
  $$s(X)_t = \max_i \, g(X)_{t,i}$$
- Another is the sum of the TOP-k weights:
  $$s(X)_t = \sum_i g(X)_{t,i}$$
- Both work well.
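To make the effect of sorting concrete, here is a small sketch that compares first-come-first-served allocation with priority-sorted allocation when expert buffers are too small. The greedy `allocate` function is a simplification written for this example (with k = 1 and made-up routing weights); it is not the paper's exact allocation algorithm.

```python
def allocate(gates, capacity, priority=None):
    """Return the (token, expert) pairs that fit into the expert buffers.

    gates[t][e] is the routing weight of token t for expert e (0 if not selected).
    Without a priority function, tokens are served in their original order.
    """
    order = list(range(len(gates)))
    if priority is not None:
        # BPR-style: serve the highest-scoring tokens first.
        order.sort(key=lambda t: priority(gates[t]), reverse=True)
    load = [0] * len(gates[0])
    kept = []
    for t in order:
        for e, w in enumerate(gates[t]):
            if w > 0.0 and load[e] < capacity:
                load[e] += 1
                kept.append((t, e))
    return kept

# 4 tokens, 2 experts, k = 1, room for a single token per expert.
gates = [[0.55, 0.0],
         [0.90, 0.0],
         [0.0, 0.60],
         [0.0, 0.80]]

vanilla = allocate(gates, capacity=1)
bpr_max = allocate(gates, capacity=1, priority=max)  # maximum routing weight
bpr_sum = allocate(gates, capacity=1, priority=sum)  # sum of TOP-k weights
print(vanilla)  # tokens 0 and 2 win only because they come first
print(bpr_max)  # tokens 1 and 3 win because the router is most confident about them
```

With k = 1 the two priority scores coincide, so `bpr_max` and `bpr_sum` select the same tokens; they can differ when k > 1.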
2.3. Expert’s Buffer Capacity
- The buffer capacity of each expert is fixed (i.e. the number of tokens that each expert processes), and the model is trained with auxiliary losses that encourage load balancing (in Appendix).
- The buffer capacity of an expert (B_e) is defined as a function of the number of images in the batch (N), the number of tokens per image (P), the number of selected experts per token (k), the total number of experts (E), and the capacity ratio (C):
  $$B_e = \mathrm{round}\left(\frac{k N P C}{E}\right)$$
- If the router assigns more than Be tokens to a given expert, only Be of them are processed.
- The capacity ratio C is used to adjust the capacity of the experts.
- With C<1, the router is forced to ignore some assignments. This can be exploited to discard the least useful tokens and save compute during inference.
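A quick worked example of the capacity arithmetic, assuming the formula B_e = round(kNPC/E) from the paper. The batch size and token count below are made-up values chosen for illustration, and Python's built-in `round` (which rounds halves to even) is used as a stand-in for the paper's rounding.

```python
def expert_buffer_capacity(k, N, P, C, E):
    """Tokens each expert can process: B_e = round(k * N * P * C / E)."""
    return round(k * N * P * C / E)

# Hypothetical setting: 32 images, 196 tokens per image, k = 2, E = 32 experts.
k, N, P, E = 2, 32, 196, 32

full = expert_buffer_capacity(k, N, P, C=1.0, E=E)
half = expert_buffer_capacity(k, N, P, C=0.5, E=E)

total_assignments = k * N * P  # every token picks k experts
print(full, E * full, total_assignments)  # slots match assignments at C = 1
print(half, E * half, total_assignments)  # only half of the assignments fit at C = 0.5
```

At C = 1 there are exactly as many buffer slots as assignments, so tokens are dropped only if the load is unbalanced. At C = 0.5 at least half of the assignments must be ignored no matter how well balanced the routing is.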
- V-MoE Every-2: MoE layers are placed on every other block.
- V-MoE Last-n: fewer MoE layers are used, placed on the last n even blocks.
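The two placement schemes can be written down as a small helper. This is only a sketch: the 1-based block numbering, the choice of even-numbered blocks for Every-2, and the function name `moe_blocks` are assumptions made for illustration.

```python
def moe_blocks(num_blocks, last_n=None):
    """Indices (1-based, assumed) of the blocks whose MLP is replaced by an MoE layer.

    last_n=None gives Every-2 placement; an integer gives Last-n placement.
    """
    even = [b for b in range(1, num_blocks + 1) if b % 2 == 0]
    if last_n is None:
        return even
    return even[max(0, len(even) - last_n):]

every_2 = moe_blocks(12)            # Every-2 on a 12-block model
last_2 = moe_blocks(12, last_n=2)   # Last-2 on the same model
print(every_2)
print(last_2)
print(len(moe_blocks(48)))          # a 48-block Every-2 model
```

Under these assumptions a 48-block Every-2 model ends up with 24 MoE layers, which matches the "up to 24 MoE Layers" figure in the title.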
3. Experimental Results
3.1. Upstream JFT
- The models are pretrained on JFT-300M.
- JFT-300M (Left): The best results are achieved by V-MoE-H/14 Every-2 where 14 is the patch size.
Expert models provide notable gains across all model sizes, for only a mild increase in FLOPs, establishing a new Pareto frontier (gray lines).
3.2. Downstream ImageNet
- ImageNet (Right): A linear layer is trained on top of the fixed representation of pretrained models.
The quality of the representations learned by V-MoE also outperforms ViT models when looking at a new task.
In full fine-tuning, V-MoE also performs better than dense counterparts.
3.3. Downstream VTAB
In the low-data regime, on the VTAB benchmark, performance is similar for V-MoE-H/14, while experts provide significant gains at the ViT-L/16 level. The expert models can still be fine-tuned with small amounts of data and no further tricks.
3.4. Scaling Up V-MoE
A 48-block V-MoE model with Every-2 expert placement (32 experts and k = 2) is trained. It has 14.7B parameters and is denoted V-MoE-15B.
- It has an impressive accuracy of 82.78% on 5-shot ImageNet and 90.35% when fully fine-tuned.
- By using BPR while simultaneously reducing the capacity of each expert, the least useful tokens can be discarded.
- Intuitively, not every patch is equally important for classifying a given image. For example, most background patches can be dropped so that the model focuses only on the patches containing the relevant entities, as shown in the figure above.
- Left: BPR allows the model to be competitive with the dense one even at quite low capacities.
- Right: BPR outperforms vanilla routing at low capacity ratio C.
3.6. Model Analysis
- The above figure shows how many images of a given ImageNet class use each expert.
Experts specialize in discriminating between small sets of classes.
- In earlier MoE layers, this specialization is not observed. Experts may instead focus on aspects common to all classes (background, basic shapes, colors).