Review — LeViT: a Vision Transformer in ConvNet’s Clothing for Faster Inference
LeViT, 2021 ICCV, Over 190 Citations (Sik-Ho Tsang @ Medium)
1.1. Filter Visualizations
- ViT’s patch extractor is a 16×16 convolution with stride 16.
- As shown above, the filters exhibit the typical patterns inherent to convolutional architectures: attention heads specialize in specific patterns (low-frequency colors / high-frequency grayscale levels), and the patterns are similar to Gabor filters.
- In convolutions whose masks overlap significantly, the spatial smoothness comes from the overlap: nearby pixels receive approximately the same gradient.
For ViT’s patch convolutions there is no overlap, so the smoothness of the masks is likely caused by the data augmentation.
Therefore, in spite of the absence of “inductive bias” in Transformer architectures, the training does produce filters that are similar to traditional convolutional layers.
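The observation that ViT’s patch extractor is just a convolution whose stride equals its kernel size can be made concrete with a small sketch: a 16×16 / stride-16 convolution is equivalent to cutting the image into non-overlapping 16×16 patches and applying one shared linear projection (all dimensions below are illustrative, not the paper’s).

```python
import numpy as np

rng = np.random.default_rng(0)
C_in, P = 3, 16                               # input channels, patch size
img = rng.standard_normal((C_in, 64, 64))     # toy image
W = rng.standard_normal((8, C_in * P * P))    # shared projection (= conv kernel)

def patch_embed(img, W, P):
    """A PxP convolution with stride P, written as a per-patch linear map."""
    C, H, Wd = img.shape
    out = np.empty((W.shape[0], H // P, Wd // P))
    for i in range(H // P):
        for j in range(Wd // P):
            patch = img[:, i*P:(i+1)*P, j*P:(j+1)*P].reshape(-1)
            out[:, i, j] = W @ patch          # kernel dot non-overlapping patch
    return out

tokens = patch_embed(img, W, P)
print(tokens.shape)  # (8, 4, 4): one embedding per non-overlapping patch
```

Because the patches never overlap, neighboring output positions share no input pixels, which is why the smoothness of the learned masks cannot come from overlapping gradients.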
- The grafting combines a ResNet-50 and a DeiT-Small such that the ResNet acts as a feature extractor for the Transformer layers.
- This grafted architecture produces better results than both DeiT and ResNet-50 alone.
A hypothesis is that convolutional layers can learn representations of low-level information in the earlier layers more efficiently thanks to their strong inductive biases, notably their translation invariance. It appears that, in a runtime-controlled regime, it is beneficial to insert convolutional stages below a Transformer.
- Discounting the role of the classification embedding, ViT is a stack of layers that processes activation maps. Indeed, the intermediate “token” embeddings can be seen as the traditional C×H×W activation maps in FCN architectures.
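This token/activation-map duality is just a reshape: a sequence of N = H×W tokens of width C can be viewed as a C×H×W activation map and back, losslessly (sizes below match LeViT-256’s Transformer input but are otherwise illustrative).

```python
import numpy as np

C, H, W = 256, 14, 14
tokens = np.arange(C * H * W, dtype=np.float32).reshape(H * W, C)  # (N, C)

act_map = tokens.T.reshape(C, H, W)   # token sequence -> CxHxW activation map
back = act_map.reshape(C, H * W).T    # activation map -> token sequence
assert np.array_equal(tokens, back)   # the two views carry the same data
```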
2.1. Patch Embedding
In LeViT, 4 layers of 3×3 convolutions (stride 2) are applied to the input to perform the resolution reduction. The number of channels goes C = 3, 32, 64, 128, 256.
The patch extractor for LeViT-256 transforms the image shape (3, 224, 224) into (256, 14, 14) with 184 MFLOPs. For comparison, the first 10 layers of a ResNet-18 perform the same dimensionality reduction with 1042 MFLOPs.
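The quoted shape and FLOP figures can be reproduced by walking the four stride-2 convolutions (assuming "same" padding, counting one multiply-add per FLOP entry):

```python
# Channel progression of LeViT's convolutional patch embedding.
channels = [3, 32, 64, 128, 256]
H = 224
macs = 0
for c_in, c_out in zip(channels, channels[1:]):
    H //= 2                               # each stride-2 conv halves resolution
    macs += H * H * c_out * c_in * 3 * 3  # output positions x 3x3 kernel mult-adds
print(H, macs / 1e6)  # 14, ~184.2 MFLOPs -- matching the figure quoted above
```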
2.2. No Classification Token
Average pooling is applied on the last activation map instead of using a classification token, which produces an embedding used in the classifier.
For distillation during training, separate heads are trained for the classification and distillation tasks. At test time, the outputs from the two heads are averaged.
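The two-head inference step is a simple average of the heads’ logits; a minimal sketch (head weights and dimensions are illustrative, not the trained model’s):

```python
import numpy as np

rng = np.random.default_rng(0)
embedding = rng.standard_normal(256)        # output of the average pooling
W_cls = rng.standard_normal((1000, 256))    # classification head (hypothetical weights)
W_dist = rng.standard_normal((1000, 256))   # distillation head (hypothetical weights)

# Test time: average the predictions of the two heads, then take the argmax.
logits = (W_cls @ embedding + W_dist @ embedding) / 2
pred = int(np.argmax(logits))
```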
2.3. Normalization Layers and Activations
- For LeViT, each convolution is followed by a batch normalization. The batch normalization can be merged with the preceding convolution for inference, which is a runtime advantage over layer normalization.
- While DeiT uses the GELU function, all of LeViT’s non-linear activations are Hardswish (MobileNetV3).
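The BatchNorm-folding trick mentioned above works because, at inference, BN is a fixed per-channel affine map that can be absorbed into the convolution’s weights. A sketch with a 1×1 convolution (which acts as a per-pixel matmul; all values are synthetic):

```python
import numpy as np

rng = np.random.default_rng(0)
C_out, C_in = 4, 3
w = rng.standard_normal((C_out, C_in))       # 1x1 conv, assumed bias-free
gamma, beta = rng.standard_normal(C_out), rng.standard_normal(C_out)
mu = rng.standard_normal(C_out)              # BN running mean
var, eps = rng.random(C_out) + 0.1, 1e-5     # BN running variance

# Fold BN into the conv: scale each output channel, fold the shift into a bias.
scale = gamma / np.sqrt(var + eps)
w_fused = w * scale[:, None]
b_fused = beta - mu * scale

x = rng.standard_normal(C_in)                # one pixel's channel vector
y_two_ops = (w @ x - mu) * scale + beta      # conv followed by BN
y_fused = w_fused @ x + b_fused              # single fused conv
assert np.allclose(y_two_ops, y_fused)
```

LayerNorm cannot be folded this way because its statistics depend on each input sample, which is the runtime advantage the review refers to.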
2.4. Multi-Resolution Pyramid
Between the LeViT stages, a shrinking attention block reduces the size of the activation map: a subsampling is applied before the Q transformation and then propagates to the output of the soft attention.
- This attention block is used without a residual connection.
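A minimal single-head sketch of this shrinking attention, assuming stride-2 subsampling of the queries while keys and values see the full-resolution map (sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
H = W = 4
D = 8
x = rng.standard_normal((H * W, D))                  # full-resolution tokens
x_sub = x.reshape(H, W, D)[::2, ::2].reshape(-1, D)  # stride-2 subsampling

Wq, Wk, Wv = (rng.standard_normal((D, D)) for _ in range(3))
Q, K, V = x_sub @ Wq, x @ Wk, x @ Wv                 # only Q is subsampled

A = Q @ K.T / np.sqrt(D)
A = np.exp(A - A.max(axis=1, keepdims=True))
A /= A.sum(axis=1, keepdims=True)                    # softmax over all keys
out = A @ V
print(out.shape)  # (4, 8): the output has (H/2) * (W/2) tokens
```

The subsampled query grid is what halves the output resolution, which also explains why a residual connection is not used: input and output shapes no longer match.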
2.6. Attention Bias Instead of a Positional Embedding
- The positional embedding in Transformer architectures is a location-dependent trainable parameter vector. However, positional embeddings are included only at the input to the sequence of attention blocks.
- Therefore, the goal is to provide positional information within each attention block, and to explicitly inject relative position information in the attention mechanism.
An attention bias is simply added to the attention maps. The scalar attention value between two pixels (x, y) ∈ [H]×[W] and (x′, y′) ∈ [H]×[W] for one head h ∈ [N] is calculated as:

A^h_{(x,y),(x′,y′)} = Q_{(x,y),:} · K_{(x′,y′),:} + B^h_{|x−x′|,|y−y′|}
The first term is the classical attention. The second is the translation-invariant attention bias.
- Each head has H×W parameters corresponding to different pixel offsets. Symmetrizing the differences, i.e., using |x−x′| and |y−y′|, encourages the model to train with flip invariance.
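The bias lookup can be sketched as follows: each head keeps an H×W table indexed by absolute pixel offsets, which is gathered into an N×N matrix and added to the attention logits (one head, illustrative sizes):

```python
import numpy as np

rng = np.random.default_rng(0)
H = W = 4
bias_table = rng.standard_normal((H, W))   # HxW offset parameters for this head

coords = [(x, y) for x in range(H) for y in range(W)]
N = len(coords)
bias = np.empty((N, N))
for i, (x, y) in enumerate(coords):
    for j, (x2, y2) in enumerate(coords):
        # translation-invariant lookup by absolute offset
        bias[i, j] = bias_table[abs(x - x2), abs(y - y2)]

logits = rng.standard_normal((N, N)) + bias  # added to the Q.K attention logits
```

Because the lookup uses absolute offsets, swapping the two pixels gives the same bias, so the bias matrix is symmetric.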
2.7. Smaller Keys
Since the bias term reduces the pressure on the keys to encode location information, the size of the key matrices K and Q is reduced relative to the values matrix V.
2.8. Attention Activation
- A Hardswish activation is applied to the product AhV before the regular linear projection is used to combine the output of the different heads.
2.9. Reducing the MLP Blocks
- The MLP residual block in ViT is a linear layer that increases the embedding dimension by a factor of 4, applies a non-linearity, and projects it back. The MLP is usually more expensive in terms of runtime and parameters than the attention block.
- For LeViT, the “MLP” is a 1×1 convolution, followed by the usual batch normalization. To reduce the computational cost of that phase, the expansion factor of the convolution is reduced from 4 to 2.
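The saving from reducing the expansion factor is easy to quantify: the block is a pair of 1×1 convolutions C → e·C → C, so halving e from 4 to 2 halves its multiply-adds (dimensions below are illustrative).

```python
def mlp_macs(C, H, W, e):
    """Mult-adds of an MLP block built from two 1x1 convs: C -> e*C -> C."""
    return H * W * (C * e * C + e * C * C)

C, H, W = 256, 14, 14
# Expansion 4 (ViT/DeiT) costs exactly twice as much as expansion 2 (LeViT).
assert mlp_macs(C, H, W, 4) == 2 * mlp_macs(C, H, W, 2)
```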
2.10. LeViT Model Variants
- The LeViT models span a range of speed-accuracy trade-offs, obtained by varying the size of the computation stages.
e.g., LeViT-256 has 256 channels at the input of the Transformer stage.
3.1. Speed-Accuracy Trade-Off
The LeViT architecture largely outperforms both the Transformer and convolutional variants.
- LeViT-384 is on-par with DeiT-Small in accuracy but uses half the number of FLOPs. The gap widens for faster operating points: LeViT-128S is on par with DeiT-Tiny and uses 4× fewer FLOPs.
The runtime measurements closely follow these trends.
- For example, LeViT-192 and LeViT-256 have about the same accuracies as EfficientNet B2 and B3 but are 5× and 7× faster on CPU, respectively.
- On the ARM platform, the float32 operations are not as well optimized compared to Intel. However, the speed-accuracy trade-off remains in LeViT’s favor.
3.2. SOTA Comparisons
All Token-to-Token ViT (T2T-ViT) variants take around 5× more FLOPs and more parameters than LeViT-384 for comparable accuracies.
Bottleneck transformers (BoT) and Visual Transformers (VT) are about 5× slower than LeViT-192 at a comparable accuracy.
- The same holds for the pyramid vision transformer (PVT) (not reported in the table) but its design objectives are different.
3.3. Ablation Study
All changes degrade the accuracy.