Review — EfficientFormer: Vision Transformers at MobileNet Speed
EfficientFormer, Low-Latency Vision Transformer
EfficientFormer: Vision Transformers at MobileNet Speed
EfficientFormer, by Snap Inc. and Northeastern University
2022 NeurIPS, Over 80 Citations (Sik-Ho Tsang @ Medium)
- Can Transformers run as fast as MobileNet while obtaining high performance?
- To answer this, authors first revisit the network architecture and operators used in ViT-based models and identify inefficient designs.
- Then a dimension-consistent pure Transformer (without MobileNet blocks) is introduced as a design paradigm.
- Finally, latency-driven slimming is performed to get a series of final models dubbed EfficientFormer.
- Later, EfficientFormerV2 is also designed.
Outline
- Latency Analysis
- EfficientFormer: Overall Architecture
- EfficientFormer: Latency Driven Slimming
- Results
1. Latency Analysis
- Observation 1: Patch embedding with large kernel and stride is a speed bottleneck on mobile devices.
- Observation 2: Consistent feature dimension is important for the choice of token mixer. MHSA is not necessarily a speed bottleneck.
- Observation 3: CONV-BN is more latency-favorable than LN (GN)-Linear and the accuracy drawback is generally acceptable.
- Observation 4: The latency of nonlinearity is hardware and compiler dependent.
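Observation 3 holds largely because BN is a purely linear operation at inference time and can be folded into the preceding convolution, eliminating its runtime cost entirely, whereas LN must compute statistics on the fly. A minimal PyTorch sketch of standard Conv-BN fusion (an illustration of the general technique, not code from the paper):

```python
import torch
import torch.nn as nn

def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    """Fold an inference-mode BatchNorm into the preceding Conv2d.

    Assumes groups=1 and dilation=1 for brevity.
    """
    fused = nn.Conv2d(conv.in_channels, conv.out_channels,
                      conv.kernel_size, conv.stride, conv.padding, bias=True)
    # BN at inference: y = gamma * (x - mean) / sqrt(var + eps) + beta
    scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)
    fused.weight.data = conv.weight * scale.reshape(-1, 1, 1, 1)
    conv_bias = conv.bias if conv.bias is not None else torch.zeros(conv.out_channels)
    fused.bias.data = (conv_bias - bn.running_mean) * scale + bn.bias
    return fused
```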
Based on the above observations, EfficientFormer is designed.
2. EfficientFormer: Overall Architecture
2.1. Overall Architecture
- The network consists of a patch embedding (PatchEmbed) and a stack of meta Transformer blocks, denoted as MB:
Y = MBm(…(MB1(PatchEmbed(X0)))),
- where X0 is the input image with batch size B and spatial size [H, W], Y is the desired output, and m is the total number of blocks (depth).
- MB consists of an unspecified token mixer (TokenMixer) followed by an MLP block:
Xi+1 = MBi(Xi) = MLP(TokenMixer(Xi)),
- where Xi (i > 0) is the intermediate feature forwarded into the ith MB.
- A Stage (or S) is defined as a stack of several MetaBlocks. The network includes 4 Stages. Between consecutive Stages, there is an embedding operation to project the embedding dimension and downsample the token length, denoted as Embedding in Fig. 3 above.
EfficientFormer is a fully Transformer-based model without integrating MobileNet structures.
2.2. Dimension-Consistent Design
- The network starts with the 4D partition, while the 3D partition is applied in the last Stages.
- First, input images are processed by a CONV stem with two 3 × 3 convolutions with stride 2 as patch embedding, producing a feature map of shape [B, C1, H/4, W/4]:
X1 = PatchEmbed(X0),
- where Cj is the channel number (width) of the jth Stage. Then the network starts with MB4D, with a simple Pool mixer to extract low-level features:
Ii = Pool(Xi) + Xi,
Xi+1 = ConvB(ConvB,G(Ii)) + Ii,
- where ConvB,G denotes a convolution followed by BN and GeLU, and ConvB a convolution followed by BN only.
- After processing all the MB4D blocks, a one-time reshaping is performed to transform the feature size and enter the 3D partition, as sketched below.
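As a concrete illustration, here is a minimal PyTorch sketch of the 4D partition: the CONV stem and one MB4D block following the equations above. The layer layout and expansion ratio are my assumptions for readability, not the official implementation:

```python
import torch
import torch.nn as nn

class ConvStem(nn.Module):
    """Patch embedding: two 3x3 stride-2 convs -> [B, C1, H/4, W/4]."""
    def __init__(self, in_ch=3, out_ch=48):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(in_ch, out_ch // 2, 3, stride=2, padding=1),
            nn.BatchNorm2d(out_ch // 2), nn.GELU(),
            nn.Conv2d(out_ch // 2, out_ch, 3, stride=2, padding=1),
            nn.BatchNorm2d(out_ch), nn.GELU(),
        )

    def forward(self, x):
        return self.stem(x)

class MB4D(nn.Module):
    """Pool token mixer plus Conv_{B,G} -> Conv_B MLP, each with a residual."""
    def __init__(self, dim, mlp_ratio=4):
        super().__init__()
        self.pool = nn.AvgPool2d(3, stride=1, padding=1, count_include_pad=False)
        hidden = dim * mlp_ratio
        self.mlp = nn.Sequential(
            nn.Conv2d(dim, hidden, 1), nn.BatchNorm2d(hidden), nn.GELU(),  # Conv_{B,G}
            nn.Conv2d(hidden, dim, 1), nn.BatchNorm2d(dim),                # Conv_B
        )

    def forward(self, x):
        i = self.pool(x) + x    # I_i = Pool(X_i) + X_i
        return self.mlp(i) + i  # X_{i+1} = Conv_B(Conv_{B,G}(I_i)) + I_i

# e.g., MB4D(48)(ConvStem()(torch.randn(1, 3, 224, 224))) -> [1, 48, 56, 56]
```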
- MB3D follows the conventional ViT block:
Ii = MHSA(LN(Xi)) + Xi,
Xi+1 = Linear(LinearG(LN(Ii))) + Ii,
- where LinearG denotes a Linear layer followed by GeLU, and MHSA is:
MHSA(Q, K, V) = Softmax(Q·K^T / sqrt(Cj) + b) · V,
- where Q, K, V represent the query, key, and value matrices learned by linear projections, and b is a parameterized attention bias acting as position encoding.
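A minimal PyTorch sketch of MB3D with the parameterized attention bias b (one learned value per head per token pair); the head count, token count, and MLP ratio here are illustrative assumptions:

```python
import torch
import torch.nn as nn

class MHSAWithBias(nn.Module):
    """Multi-head self-attention with a learned attention bias b as position encoding."""
    def __init__(self, dim, num_heads=8, num_tokens=49):
        super().__init__()
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        # b: one bias per head per (query, key) token pair
        self.attn_bias = nn.Parameter(torch.zeros(num_heads, num_tokens, num_tokens))

    def forward(self, x):  # x: [B, N, C]
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)  # each: [B, heads, N, head_dim]
        attn = (q @ k.transpose(-2, -1)) * self.scale + self.attn_bias
        out = (attn.softmax(dim=-1) @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)

class MB3D(nn.Module):
    """Conventional ViT block: LN -> MHSA -> residual, then LN -> Linear_G -> Linear -> residual."""
    def __init__(self, dim, num_heads=8, num_tokens=49, mlp_ratio=4):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.attn = MHSAWithBias(dim, num_heads, num_tokens)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio), nn.GELU(),  # Linear_G
            nn.Linear(dim * mlp_ratio, dim),             # Linear
        )

    def forward(self, x):
        i = self.attn(self.norm1(x)) + x    # I_i = MHSA(LN(X_i)) + X_i
        return self.mlp(self.norm2(i)) + i  # X_{i+1} = Linear(Linear_G(LN(I_i))) + I_i
```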
After defining the overall architecture, an efficient architecture is searched for in the next step.
3. Latency Driven Slimming
3.1. Supernet
- First, a supernet is defined for searching efficient models.
- MetaPath (MP) is defined as the collection of possible blocks:
MPi,j ∈ {MB4Di, Ii} for j = 1, 2,
MPi,j ∈ {MB4Di, MB3Di, Ii} for j = 3, 4,
- where I represents the identity path, j denotes the Stage, and i denotes the block index.
In S1 and S2 of the supernet, each block can select from MB4D or I, while in S3 and S4, the block can be MB3D, MB4D, or I.
- There are two reasons for enabling MB3D only in the last two Stages. First, since the computation of MHSA grows quadratically with respect to token length, integrating it in early Stages would largely increase the computation cost (see the quick calculation below). Second, early Stages of the network capture low-level features, while late layers learn long-term dependencies.
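- To put the first reason in numbers: with a 224 × 224 input, Stage 1 runs at 56 × 56 = 3,136 tokens while Stage 4 runs at 7 × 7 = 49 tokens, so the quadratic Q·K^T term would be (3136/49)² = 4,096× more expensive in Stage 1 than in Stage 4.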
3.2. Searching Space
- The searching space includes Cj (the width of each Stage), Nj (the number of blocks in each Stage, i.e., depth), and the last N blocks to which MB3D is applied.
3.3. Searching Algorithm
- First, the supernet is trained with Gumbel Softmax sampling [72] to obtain an importance score for the blocks within each MP:
Xi+1 = Σn ( exp((αi^n + εi^n)/τ) / Σk exp((αi^k + εi^k)/τ) ) · MPi,n(Xi),
- where α evaluates the importance of each block in the MP, as it represents the probability of selecting that block; ε ∼ U(0, 1) ensures exploration, and τ is the temperature.
- n ∈ {4D, I} for S1 and S2, and n ∈ {4D, 3D, I} for S3 and S4.
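A minimal PyTorch sketch of the MetaPath forward pass during supernet training, following the formula above. The noise is taken as ε ∼ U(0, 1) per the paper's description (standard Gumbel noise would be -log(-log(u))); the block list and temperature are assumptions, and all blocks within one MetaPath are assumed to share the same tensor format:

```python
import torch
import torch.nn as nn

class MetaPath(nn.Module):
    """Soft-weighted sum over candidate blocks; alpha learns each block's importance."""
    def __init__(self, blocks):  # e.g. [MB4D(dim), nn.Identity()] for S1/S2
        super().__init__()
        self.blocks = nn.ModuleList(blocks)
        self.alpha = nn.Parameter(torch.zeros(len(blocks)))  # importance scores

    def forward(self, x, tau=1.0):
        eps = torch.rand_like(self.alpha)  # eps ~ U(0, 1) ensures exploration
        weights = torch.softmax((self.alpha + eps) / tau, dim=0)
        # X_{i+1} = sum_n softmax_n((alpha_n + eps_n) / tau) * MP_{i,n}(X_i)
        return sum(w * blk(x) for w, blk in zip(weights, self.blocks))
```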
- Then, a latency lookup table is built by collecting the on-device latency of MB4D and MB3D with different widths (multiples of 16).
With the importance score, an action space is defined that includes three options: 1) select I for the least important MP; 2) remove the first MB3D; and 3) reduce the width of the least important Stage (by multiples of 16). Then, the resulting latency of each action is calculated through the lookup table, and the accuracy drop of each action is evaluated. Lastly, the action is chosen based on accuracy drop per unit of latency saved (-%/ms). This process is performed iteratively until the target latency is achieved, as sketched below. (More details are in the paper's Appendix.)
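Below is a pure-Python sketch of this greedy loop. The config layout and the callables latency_of (backed by the on-device lookup table) and acc_of (the supernet accuracy estimate, which implicitly encodes block importance) are hypothetical stand-ins for the paper's procedure:

```python
from copy import deepcopy

def slim(config, latency_of, acc_of, target_ms, width_step=16):
    """Greedily apply the action with the best accuracy-drop-per-ms until the target latency.

    config: {"depths": [N1..N4], "widths": [C1..C4], "n_mb3d": int}
    """
    def candidates(cfg):
        if cfg["n_mb3d"] > 0:  # action 2: remove the first MB3D
            c = deepcopy(cfg); c["n_mb3d"] -= 1; yield c
        for j in range(4):
            if cfg["depths"][j] > 1:  # action 1: select I for a block (drop it)
                c = deepcopy(cfg); c["depths"][j] -= 1; yield c
            if cfg["widths"][j] > width_step:  # action 3: shrink width by 16
                c = deepcopy(cfg); c["widths"][j] -= width_step; yield c

    while latency_of(config) > target_ms:
        base_lat, base_acc = latency_of(config), acc_of(config)
        cands = [c for c in candidates(config) if latency_of(c) < base_lat]
        if not cands:
            break  # no remaining action reduces latency
        # smallest accuracy drop per ms saved, i.e. best -%/ms
        config = min(cands, key=lambda c: (base_acc - acc_of(c)) / (base_lat - latency_of(c)))
    return config
```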
4. Results
4.1. ImageNet
[Comparison to CNNs] Compared with the widely used CNN-based models, EfficientFormer achieves a better trade-off between accuracy and latency.
[Comparison to ViTs] Conventional ViTs still underperform CNNs in terms of latency. EfficientFormer-L3 achieves 1% higher top-1 accuracy than PoolFormer-S36, while being 3× faster on an Nvidia A100 GPU, 2.2× faster on the iPhone NPU, and 6.8× faster on the iPhone CPU.
[Comparison to Hybrid Designs] EfficientFormer-L1 has 4.4% higher top-1 accuracy than MobileViT-XS and runs much faster across different hardware and compilers.
4.2. MS COCO & ADE20K
[MS COCO] EfficientFormers consistently outperform CNN (ResNet) and Transformer (PoolFormer) backbones.
[ADE20K] EfficientFormer consistently outperforms CNN- and Transformer-based backbones by a large margin under a similar computation budget.