Brief Review — EfficientFormerV2: Rethinking Vision Transformers for MobileNet Size and Speed
Improved Components, Search Space and Search Algorithm
Rethinking Vision Transformers for MobileNet Size and Speed
EfficientFormerV2, by Snap Inc., Northeastern University, UC Berkeley
2023 ICCV (Sik-Ho Tsang @ Medium), Image Classification
- Can Transformer models run as fast as MobileNet and maintain a similar size?
- The design choices of ViTs are revisited and an improved supernet with low latency and high parameter efficiency is proposed.
- A fine-grained joint search strategy is introduced that can find efficient architectures by optimizing latency and number of parameters simultaneously.
- Finally, EfficientFormerV2 is formed.
Outline
- EfficientFormerV2: Improved Components
- EfficientFormerV2: Improved Search Space and Algorithm
- Results
1. EfficientFormerV2: Improved Components
The components are improved one by one, and each modification progressively boosts performance. They are described below.
1.1. Token Mixers vs. Feed Forward Network
- PoolFormer and EfficientFormer employ 3 × 3 average pooling layers (Fig. 2(a)) as the local token mixer.
Replacing these layers with depth-wise convolutions (DWCONV) of the same kernel size does not introduce latency overhead, while the performance is improved by 0.6% with negligible extra parameters (0.02M).
- Inspired by recent work [5, 21], it is also beneficial to inject local information modeling layers into the Feed Forward Network (FFN).
The explicit residual-connected local token mixer is removed and the depth-wise 3 × 3 CONV is moved into the FFN, to get a unified FFN (Fig. 2(b)) with locality enabled. This boosts the accuracy to 80.3% at the same latency (see Tab. 1).
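Below is a minimal PyTorch sketch of such a unified FFN; the 1 × 1 point-wise projections, the placement of normalization/activation, and the expansion ratio are illustrative assumptions rather than the authors' reference implementation.

```python
import torch.nn as nn

class UnifiedFFN(nn.Module):
    """FFN with a depth-wise 3x3 conv injected for local token mixing (sketch)."""
    def __init__(self, dim, expansion=4):
        super().__init__()
        hidden = dim * expansion
        self.pw1 = nn.Sequential(nn.Conv2d(dim, hidden, 1), nn.BatchNorm2d(hidden), nn.GELU())
        # The depth-wise 3x3 conv supplies the locality that the removed pooling mixer used to provide.
        self.dw = nn.Sequential(nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden),
                                nn.BatchNorm2d(hidden), nn.GELU())
        self.pw2 = nn.Sequential(nn.Conv2d(hidden, dim, 1), nn.BatchNorm2d(dim))

    def forward(self, x):  # x: (B, C, H, W)
        # Residual connection wraps the whole unified FFN (layer scale omitted here).
        return x + self.pw2(self.dw(self.pw1(x)))
```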
1.2. Search Space Refinement
The network depth (number of blocks in each stage) and width (number of channels) are varied, and it is found that a deeper and narrower network leads to better accuracy (0.2% improvement), fewer parameters (0.13M reduction), and lower latency (0.1ms acceleration), as in Tab. 1.
- This network is set as a new baseline (accuracy 80.5%) to validate subsequent design modifications, and enable a deeper supernet for architecture search.
- A 5-stage network (i.e., more stages) is also tried, but it does not provide a better accuracy-efficiency trade-off.
1.3. MHSA Improvements
- As shown in Fig. 2(c), two approaches are investigated for MHSA.
First, local information is injected into the Value matrix (V) by adding a depth-wise 3 × 3 CONV, which is also employed by [21, 64].
Second, communications between attention heads are enabled by adding fully connected layers across the head dimension [63], shown as Talking Head in Fig. 2(c).
- With these modifications, the performance is further boosted to 80.8% with similar parameters and latency compared to the baseline.
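A hedged PyTorch sketch of these two MHSA modifications is given below; the head configuration, the 1 × 1 convolutions used for the talking-head mixing, and the omission of the attention bias are assumptions for illustration only.

```python
import torch.nn as nn

class ImprovedMHSA(nn.Module):
    """MHSA sketch with (1) a depth-wise 3x3 conv on V and (2) talking-head mixing."""
    def __init__(self, dim, num_heads=8, head_dim=32):
        super().__init__()
        self.num_heads, self.head_dim = num_heads, head_dim
        inner = num_heads * head_dim
        self.q = nn.Conv2d(dim, inner, 1)
        self.k = nn.Conv2d(dim, inner, 1)
        self.v = nn.Conv2d(dim, inner, 1)
        self.v_local = nn.Conv2d(inner, inner, 3, padding=1, groups=inner)  # local info into V
        self.talk1 = nn.Conv2d(num_heads, num_heads, 1)  # mix heads before softmax
        self.talk2 = nn.Conv2d(num_heads, num_heads, 1)  # mix heads after softmax
        self.proj = nn.Conv2d(inner, dim, 1)

    def forward(self, x):  # x: (B, C, H, W)
        B, _, H, W = x.shape
        N = H * W
        q = self.q(x).reshape(B, self.num_heads, self.head_dim, N)
        k = self.k(x).reshape(B, self.num_heads, self.head_dim, N)
        v = self.v(x)
        v = (v + self.v_local(v)).reshape(B, self.num_heads, self.head_dim, N)
        attn = (q.transpose(-2, -1) @ k) * self.head_dim ** -0.5  # (B, heads, N, N)
        attn = self.talk2(self.talk1(attn).softmax(dim=-1))
        out = (v @ attn.transpose(-2, -1)).reshape(B, -1, H, W)
        return self.proj(out)
```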
1.4. Attention on Higher Resolution
To perform MHSA at the earlier stages of the network, the Queries, Keys, and Values are downsampled to a fixed spatial resolution (1/32), and the attention outputs are interpolated back to the original resolution before being fed into the next layer, as shown in Fig. 2(d) & (e). This method is referred to as Stride Attention.
- As in Tab. 1, this simple approximation significantly reduces the latency from 3.5ms to 1.5ms and preserves a competitive accuracy (81.5% vs. 81.7%).
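A minimal sketch of Stride Attention under the following assumptions: the input is downsampled once before projecting Q, K, and V (equivalent to downsampling all three), a strided depth-wise convolution serves as the downsampler, and nearest-neighbor interpolation restores the original resolution.

```python
import torch.nn as nn
import torch.nn.functional as F

class StrideAttention(nn.Module):
    """Run attention on a reduced grid, then upsample the output (sketch)."""
    def __init__(self, dim, attn: nn.Module, stride=2):
        super().__init__()
        # Strided depth-wise conv as the downsampler (pooling would also work).
        self.down = nn.Conv2d(dim, dim, 3, stride=stride, padding=1, groups=dim)
        self.attn = attn  # any spatial MHSA operating on (B, C, H, W), e.g. ImprovedMHSA above

    def forward(self, x):  # x: (B, C, H, W) at high resolution
        _, _, H, W = x.shape
        y = self.attn(self.down(x))  # attention computed at the reduced resolution
        return F.interpolate(y, size=(H, W), mode='nearest')  # back to the original resolution
```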
1.5. Attention Downsampling
- A combined strategy is used that wields both locality and global dependency.
To get downsampled Queries, pooling is used as static local downsampling, a 3 × 3 DWCONV is used as learnable local downsampling, and the results are combined and projected into the Query dimension.
In addition, the attention downsampling module is residual-connected to a regular strided CONV, forming a local-global downsampling, similar to the downsampling bottlenecks in ResNet or the inverted bottlenecks in MobileNetV2.
- As shown in Tab. 1, with slightly more parameters and latency overhead, the accuracy is further improved to 81.8% with attention downsampling.
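A sketch of such an attention downsampling block is shown below; the exact projections, normalization layers, and head configuration are assumptions, and the attention bias is omitted for brevity.

```python
import torch.nn as nn

class AttentionDownsampling(nn.Module):
    """Downsample by attention: pooled + strided-DWCONV queries attend to full-resolution
    K/V, and the result is residual-connected to a plain strided conv (sketch)."""
    def __init__(self, in_dim, out_dim, num_heads=8, head_dim=32):
        super().__init__()
        self.num_heads, self.head_dim = num_heads, head_dim
        inner = num_heads * head_dim
        self.pool_q = nn.AvgPool2d(2)                                                  # static local downsampling
        self.dw_q = nn.Conv2d(in_dim, in_dim, 3, stride=2, padding=1, groups=in_dim)   # learnable local downsampling
        self.proj_q = nn.Conv2d(in_dim, inner, 1)
        self.k = nn.Conv2d(in_dim, inner, 1)
        self.v = nn.Conv2d(in_dim, inner, 1)
        self.proj_out = nn.Conv2d(inner, out_dim, 1)
        # Regular strided conv path that the attention output is residual-connected to.
        self.local_down = nn.Sequential(nn.Conv2d(in_dim, out_dim, 3, stride=2, padding=1),
                                        nn.BatchNorm2d(out_dim))

    def forward(self, x):  # x: (B, C_in, H, W), H and W assumed even
        B, _, H, W = x.shape
        h, w = H // 2, W // 2
        q = self.proj_q(self.pool_q(x) + self.dw_q(x)).reshape(B, self.num_heads, self.head_dim, h * w)
        k = self.k(x).reshape(B, self.num_heads, self.head_dim, H * W)
        v = self.v(x).reshape(B, self.num_heads, self.head_dim, H * W)
        attn = ((q.transpose(-2, -1) @ k) * self.head_dim ** -0.5).softmax(dim=-1)  # (B, heads, hw, HW)
        out = (v @ attn.transpose(-2, -1)).reshape(B, -1, h, w)
        return self.local_down(x) + self.proj_out(out)
```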
1.6. EfficientFormerV2 Design
- A 4-stage hierarchical design is used which obtains feature sizes in {1/4, 1/8, 1/16, 1/32} of the input resolution.
Stems: EfficientFormerV2 starts with a small-kernel convolution stem to embed the input image, instead of the inefficient embedding of non-overlapping patches:
X_{1,1} = stem(X_0),  X_0 ∈ R^(B×3×H×W),  X_{1,1} ∈ R^(B×C_1×(H/4)×(W/4)),
- where B denotes the batch size, C_j refers to the channel dimension of stage j, H and W are the height and width of the feature, and X_{i,j} is the feature of block i in stage j.
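As an illustrative sketch (assuming, as in the EfficientFormer family, two stride-2 3 × 3 convolutions that embed the image at 1/4 resolution):

```python
import torch.nn as nn

def conv_stem(out_channels: int) -> nn.Sequential:
    """Small-kernel convolution stem: two stride-2 3x3 convs down to 1/4 resolution (sketch)."""
    return nn.Sequential(
        nn.Conv2d(3, out_channels // 2, 3, stride=2, padding=1),
        nn.BatchNorm2d(out_channels // 2), nn.GELU(),
        nn.Conv2d(out_channels // 2, out_channels, 3, stride=2, padding=1),
        nn.BatchNorm2d(out_channels), nn.GELU(),
    )
```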
Early Stages: The first two stages capture local information on high resolutions; thus, only the unified FFN is employed:
X_{i+1,j} = S_{i,j} · FFN_{C_j, E_{i,j}}(X_{i,j}) + X_{i,j},  for j ∈ {1, 2},
- where S_{i,j} is a learnable layer scale and E_{i,j} is the expansion ratio of the FFN.
Late Stages: In the last two stages, both local FFN and global MHSA blocks are used. The global blocks are defined as:
X_{i+1,j} = S_{i,j} · MHSA(X_{i,j}) + X_{i,j},  for j ∈ {3, 4},
- where Queries (Q), Keys (K), and Values (V) are linear projections of the input feature, and MHSA uses ab as a learnable attention bias for position encoding:
MHSA(Q, K, V) = Softmax(Q · K^T + ab) · V.
2. EfficientFormerV2: Improved Search Space and Algorithm
2.1. Jointly Optimizing Model Size and Speed
A Mobile Efficiency Score (MES) is newly defined:
MES = Score · Π_i (M_i / U_i)^(−α_i),
- where i ∈ {size, latency, …} and α_i ∈ (0, 1]. M_i and U_i represent the metric and its unit. Score is a pre-defined base score, set to 100 for simplicity.
- The size and speed of MobileNetV2 are used as the units. Specifically, U_size = 3M parameters, and U_latency = 1ms on iPhone 12 (iOS 16) deployed with CoreML Tools. To emphasize speed, α_latency = 1.0 and α_size = 0.5.
Decreasing size and latency leads to a higher MES, and the authors search for Pareto optimality on the MES-accuracy trade-off.
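Under the definitions above, MES can be computed with a small helper function (the example numbers are hypothetical):

```python
def mobile_efficiency_score(size_params, latency_ms,
                            unit_size=3e6, unit_latency=1.0,
                            alpha_size=0.5, alpha_latency=1.0, base_score=100):
    """MES = Score * prod_i (M_i / U_i)^(-alpha_i), with MobileNetV2's size (3M params)
    and latency (1 ms on iPhone 12) as the units."""
    return (base_score
            * (size_params / unit_size) ** (-alpha_size)
            * (latency_ms / unit_latency) ** (-alpha_latency))

# A 3M-parameter model running at 1 ms scores exactly the base score of 100.
print(mobile_efficiency_score(3e6, 1.0))  # -> 100.0
```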
2.2. Search Space and SuperNet
The search space consists of: (i) the depth of the network, measured by the number of blocks N_j per stage, (ii) the width of the network, i.e., the channel dimension C_j per stage, and (iii) the expansion ratio E_{i,j} of each FFN.
- Elastic depth can be naturally implemented through stochastic drop path augmentation [32].
- As for width and expansion ratio, authors follow Yu et al. [78] to construct switchable layers with shared weights but independent normalization layers.
- In these switchable layers, w_{:c} (slicing the first c filters of the shared weight matrix) yields the sub-network of width c, while each candidate width keeps its own normalization parameters and statistics.
- The supernet is pre-trained with the Sandwich Rule [78]: at each iteration, the largest subnet, the smallest subnet, and two randomly sampled subnets are trained (denoted as max, min, rand-1, and rand-2 in Algorithm 1); a minimal training-step sketch is given after this list.
Iterative search is then performed, as in Algorithm 1, to find architectures with the optimal size and speed.
- (Please feel free to read the paper for more details.)
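A minimal sketch of one Sandwich-Rule pre-training step; sample_max, sample_min, sample_random, and set_subnet are assumed helper names for configuring the supernet, not the authors' API.

```python
def sandwich_rule_step(supernet, sample_max, sample_min, sample_random,
                       images, labels, criterion, optimizer):
    """Train the largest, the smallest, and two random subnets in one iteration (sketch)."""
    optimizer.zero_grad()
    configs = [sample_max(), sample_min(), sample_random(), sample_random()]  # max, min, rand-1, rand-2
    for cfg in configs:
        supernet.set_subnet(cfg)   # assumed API: activate one (depth, width, expansion) choice
        loss = criterion(supernet(images), labels)
        loss.backward()            # gradients from all four subnets accumulate before one update
    optimizer.step()
```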
3. Results
3.1. ImageNet
With the proposed search, the models avoid the pitfall of achieving seemingly good performance on one metric while sacrificing too much on others. Instead, efficient mobile ViT backbones that are both light and fast are obtained.
3.2. MS COCO & ADE20K
MS COCO: With similar model size, EfficientFormerV2-S2 outperforms PoolFormer-S12 by 6.1 AP^box and 4.9 AP^mask. EfficientFormerV2-L outperforms EfficientFormer-L3 by 3.3 AP^box and 2.3 AP^mask.
ADE20K: EfficientFormerV2-S2 outperforms PoolFormer-S12 and EfficientFormer-L1 by 5.2 and 3.5 mIoU, respectively.
3.3. Ablation Studies
The search algorithm obtains models with similar parameters and latency as EfficientFormer yet with higher accuracy, demonstrating the effectiveness of fine-grained search and joint optimization of latency and size.