Review — LeViT: a Vision Transformer in ConvNet’s Clothing for Faster Inference

LeViT, Fast Inference Model, Faster Than EfficientNet & DeiT

Sik-Ho Tsang
6 min read · Apr 4, 2023
Speed-accuracy operating points. Left: on 1 CPU core; Right: on 1 GPU.

LeViT: a Vision Transformer in ConvNet’s Clothing for Faster Inference, LeViT, 2021 ICCV, Over 190 Citations (Sik-Ho Tsang @ Medium)

Image Classification
1989 … 2022 [ConvNeXt] [PVTv2] [ViT-G] [AS-MLP] [ResTv2] [CSWin Transformer] [Pale Transformer] [Sparse MLP] [MViTv2] [S²-MLP] [CycleMLP] [MobileOne] [GC ViT] [VAN] [ACMix] [CVNets] [MobileViT] [RepMLP] [RepLKNet] [ParNet] 2023 [Vision Permutator (ViP)]
==== My Other Paper Readings Are Also Over Here ====

  • The application of convolutional network design principles to Transformers is revisited.
  • An attention bias is introduced as a new way to integrate positional information in ViTs. LeViT, a hybrid neural network, is proposed for fast-inference image classification.

Outline

  1. Motivations
  2. LeViT
  3. Results

1. Motivations

1.1. Filter Visualizations

Patch-based convolutional masks in the pre-trained DeiT-base model.
  • ViT’s patch extractor is a 16x16 convolution with stride 16.
  • As shown above, the learned filters exhibit the typical patterns inherent to convolutional architectures: attention heads specialize in specific patterns (low-frequency colors / high-frequency grayscale levels), and the patterns are similar to Gabor filters.
  • In convolutions whose masks overlap significantly, the overlap produces spatial smoothness: nearby pixels receive approximately the same gradient.

For ViT’s patch-extraction convolution there is no overlap. The smoothness of the masks is likely caused by the data augmentation.

Therefore, in spite of the absence of “inductive bias” in Transformer architectures, the training does produce filters that are similar to traditional convolutional layers.

1.2. Grafting

DeiT architecture grafted on top of a truncated ResNet-50 convolutional architecture.
Models with convolutional layers show faster convergence in the early stages compared to their DeiT counterparts.
  • The grafting combines a ResNet-50 and a DeiT-Small such that the ResNet acts as a feature extractor for the Transformer layers.
  • This grafted architecture produces better results than both DeiT and ResNet-50 alone.

A hypothesis is that the convolutional layers learn representations of low-level information in the earlier layers more efficiently due to their strong inductive biases, notably their translation invariance. It appears that in a runtime-controlled regime it is beneficial to insert convolutional stages below a Transformer.

2. LeViT

Block diagram of the LeViT-256 architecture. (Direction from left to right)
  • Discounting the role of the classification embedding, ViT is a stack of layers that processes activation maps. Indeed, the intermediate “token” embeddings can be seen as the traditional C×H×W activation maps in FCN architectures.

2.1. Patch Embedding

In LeViT, 4 layers of 3×3 convolutions (stride 2) are applied to the input to perform the resolution reduction. The number of channels goes C = 3, 32, 64, 128, 256.

The patch extractor for LeViT-256 transforms the image shape (3, 224, 224) into (256, 14, 14) with 184 MFLOPs. For comparison, the first 10 layers of a ResNet-18 perform the same dimensionality reduction with 1042 MFLOPs.
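As a rough illustration, a minimal PyTorch sketch of such a convolutional patch embedding is shown below (assumption: BatchNorm and Hardswish after each convolution; the exact stem layout of the official code may differ).

```python
import torch
import torch.nn as nn

def patch_embedding(channels=(3, 32, 64, 128, 256)):
    # Four stride-2 3x3 convolutions: 224x224 -> 112 -> 56 -> 28 -> 14.
    layers = []
    for c_in, c_out in zip(channels[:-1], channels[1:]):
        layers += [
            nn.Conv2d(c_in, c_out, kernel_size=3, stride=2, padding=1, bias=False),
            nn.BatchNorm2d(c_out),
            nn.Hardswish(),
        ]
    return nn.Sequential(*layers)

x = torch.randn(1, 3, 224, 224)
print(patch_embedding()(x).shape)  # torch.Size([1, 256, 14, 14])
```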

2.2. No Classification Token

Average pooling is applied to the last activation map instead of using a classification token, producing an embedding used by the classifier.

For distillation during training, separate heads are trained for the classification and distillation tasks. At test time, the outputs from the two heads are averaged.
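A minimal sketch of this head, assuming a simplified layout (the name `LeViTHead` and the plain linear heads are illustrative, not the official module):

```python
import torch
import torch.nn as nn

class LeViTHead(nn.Module):
    def __init__(self, dim, num_classes):
        super().__init__()
        self.cls_head = nn.Linear(dim, num_classes)   # supervised classification head
        self.dist_head = nn.Linear(dim, num_classes)  # distillation head

    def forward(self, x):
        # x: (B, N, dim) token embeddings from the last stage.
        x = x.mean(dim=1)  # average pooling instead of a class token
        if self.training:
            return self.cls_head(x), self.dist_head(x)       # two separate losses
        return (self.cls_head(x) + self.dist_head(x)) / 2    # averaged at test time
```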

2.3. Normalization Layers and Activations

  • For LeViT, each convolution is followed by a batch normalization. The batch normalization can be merged with the preceding convolution for inference (a minimal fusion sketch follows this list), which is a runtime advantage over layer normalization.
  • While DeiT uses the GELU function, all of LeViT’s non-linear activations are Hardswish (MobileNetV3).
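A minimal sketch of the standard conv–BN folding used for inference (assumption: this is the generic fusion formula, not LeViT’s own utility):

```python
import torch
import torch.nn as nn

def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    fused = nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                      stride=conv.stride, padding=conv.padding, bias=True)
    # BN(y) = gamma * (y - mean) / sqrt(var + eps) + beta
    scale = bn.weight.data / torch.sqrt(bn.running_var + bn.eps)
    fused.weight.data = conv.weight.data * scale.reshape(-1, 1, 1, 1)
    bias = conv.bias.data if conv.bias is not None else torch.zeros_like(bn.running_mean)
    fused.bias.data = (bias - bn.running_mean) * scale + bn.bias.data
    return fused
```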

2.4. Multi-Resolution Pyramid

LeViT integrates the ResNet stages within the Transformer architecture.

2.5. Downsampling

Between the LeViT stages, a shrinking attention block reduces the size of the activation map: a subsampling is applied before the Q transformation, then propagated to the output of the soft attention.

  • This attention block is used without a residual connection (see the sketch below).
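A minimal single-head sketch of this shrinking attention, under the assumption of a simplified layout (the real block is multi-head, uses batch-normalized linear layers, a Hardswish on the attention output, and the attention bias described next):

```python
import torch
import torch.nn as nn

class ShrinkAttention(nn.Module):
    def __init__(self, dim, key_dim, out_dim):
        super().__init__()
        self.q = nn.Linear(dim, key_dim)
        self.k = nn.Linear(dim, key_dim)
        self.v = nn.Linear(dim, out_dim)
        self.proj = nn.Linear(out_dim, out_dim)
        self.scale = key_dim ** -0.5

    def forward(self, x, H, W):
        # x: (B, H*W, dim). Queries come from a 2x-subsampled grid, so the
        # output activation map has (H/2)*(W/2) positions; no residual is used.
        B, N, C = x.shape
        x_sub = x.view(B, H, W, C)[:, ::2, ::2, :].reshape(B, -1, C)
        q = self.q(x_sub)                         # (B, H/2*W/2, key_dim)
        k = self.k(x)                             # (B, H*W,     key_dim)
        v = self.v(x)                             # (B, H*W,     out_dim)
        attn = (q @ k.transpose(-2, -1)) * self.scale
        return self.proj(attn.softmax(dim=-1) @ v)
```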

2.6. Attention Bias Instead of a Positional Embedding

The LeViT attention blocks, using similar notations to Non-Local Neural Network [39]. Left: regular version, Right: with 1/2 reduction of the activation map.
  • The positional embedding in Transformer architectures is a location-dependent trainable parameter vector. However, positional embeddings are included only on input to the sequence of attention blocks.
  • Therefore, the goal is to provide positional information within each attention block, and to explicitly inject relative position information in the attention mechanism.

An attention bias is simply added to the attention maps. The scalar attention value between two pixels (x, y) ∈ [H]×[W] and (x′, y′) ∈ [H]×[W] for one head h ∈ [N] is calculated as:

A^h_{(x,y),(x′,y′)} = Q_{(x,y),:} · K_{(x′,y′),:} + B^h_{|x−x′|,|y−y′|}

The first term is the classical attention. The second is the translation-invariant attention bias.

  • Each head has H×W parameters corresponding to different pixel offsets. Symmetrizing the differences x − x′ and y − y′ encourages the model to train with flip invariance (see the sketch below).
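A minimal sketch of such a per-head, offset-indexed bias added to the pre-softmax attention logits (assumption: the indexing scheme below is illustrative; the official implementation differs in detail):

```python
import torch
import torch.nn as nn

class AttentionBias(nn.Module):
    def __init__(self, num_heads, H, W):
        super().__init__()
        # H*W trainable scalars per head, one per absolute offset (|dx|, |dy|).
        self.bias = nn.Parameter(torch.zeros(num_heads, H * W))
        ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
        pos = torch.stack([xs.flatten(), ys.flatten()])       # (2, H*W) pixel coords
        rel = (pos[:, :, None] - pos[:, None, :]).abs()       # |x-x'| and |y-y'|
        self.register_buffer("idx", rel[1] * W + rel[0])      # (H*W, H*W) offset index

    def forward(self, attn):
        # attn: (B, num_heads, H*W, H*W) pre-softmax attention logits.
        return attn + self.bias[:, self.idx]
```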

2.7. Smaller Keys

Since the bias term reduces the pressure on the keys to encode location information, the size of the key matrices K and Q is reduced relative to the value matrix V.

2.8. Attention Activation

  • A Hardswish activation is applied to the product AhV before the regular linear projection is used to combine the output of the different heads.

2.9. Reducing the MLP Blocks

  • The MLP residual block in ViT is a linear layer that increases the embedding dimension by a factor 4. The MLP is usually more expensive in terms of runtime and parameters than the attention block.
  • For LeViT, the “MLP” is a 1×1 convolution, followed by the usual batch normalization. To reduce the computational cost of that phase, the expansion factor of the convolution is reduced from 4 to 2 (see the sketch below).
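A minimal sketch of this reduced MLP block, assuming a 1×1 conv → BN → Hardswish → 1×1 conv → BN layout (illustrative, not the official module):

```python
import torch.nn as nn

def levit_mlp(dim, expansion=2):
    hidden = dim * expansion  # expansion factor 2 instead of ViT's 4
    return nn.Sequential(
        nn.Conv2d(dim, hidden, kernel_size=1, bias=False),
        nn.BatchNorm2d(hidden),
        nn.Hardswish(),
        nn.Conv2d(hidden, dim, kernel_size=1, bias=False),
        nn.BatchNorm2d(dim),
    )
```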

2.10. LeViT Model Variants

LeViT Model Family.
  • The LeViT models can spawn a range of speed-accuracy tradeoffs by varying the size of the computation stages.

e.g. LeViT-256 has 256 channels on input to the Transformer stage.

3. Results

3.1. Speed Accuracy Trade-Off

Characteristics of LeViT w.r.t. two strong families of competitors: DeiT and EfficientNet.

The LeViT architecture largely outperforms both the Transformer and convolutional variants.

  • LeViT-384 is on-par with DeiT-Small in accuracy but uses half the number of FLOPs. The gap widens for faster operating points: LeViT-128S is on par with DeiT-Tiny and uses 4× fewer FLOPs.

The runtime measurements closely follow these trends.

  • For example, LeViT-192 and LeViT-256 have about the same accuracies as EfficientNet B2 and B3 but are 5× and 7× faster on CPU, respectively.
  • On the ARM platform, the float32 operations are not as well optimized compared to Intel. However, the speed-accuracy trade-off remains in LeViT’s favor.

3.2. SOTA Comparisons

Comparison with the recent state of the art in the high-throughput regime.

All Token-to-Token ViT (T2T-ViT) variants require around 5× more FLOPs and more parameters than LeViT-384 for comparable accuracy.

Bottleneck transformers (BoT) [46] and Visual Transformers (VT) [47] are about 5× slower than LeViT-192 at a comparable accuracy.

  • The same holds for the pyramid vision transformer (PVT) (not reported in the table) but its design objectives are different.

3.3. Ablation Study

Ablation of various components w.r.t. the baseline LeViT-128S.

All changes degrade the accuracy.
