Review — hMLP: Three Things Everyone Should Know About Vision Transformers

Fine-Tuning Attention is All You Need. hMLP is Proposed.

Sik-Ho Tsang
5 min read · Mar 31


Three Things Everyone Should Know About Vision Transformers,
hMLP, by Meta AI, Sorbonne University, and Inria,
2022 ECCV (Sik-Ho Tsang @ Medium)

Image Classification
1989 … 2022 [ConvNeXt] [PVTv2] [ViT-G] [AS-MLP] [ResTv2] [CSWin Transformer] [Pale Transformer] [Sparse MLP] [MViTv2] [S²-MLP] [CycleMLP] [MobileOne] [GC ViT] [VAN] [ACMix] [CVNets] [MobileViT] [RepMLP] [RepLKNet] [ParNet] [MetaFormer, PoolFormer] [Swin Transformer V2] 2023 [Vision Permutator (ViP)]

  • Three insights are offered for ViT:
  1. The residual layers of ViTs, which are usually processed sequentially, can to some extent be processed efficiently in parallel without noticeably affecting the accuracy.
  2. Fine-tuning the weights of the attention layers is sufficient to adapt ViTs to a higher resolution and to other classification tasks. This saves compute, reduces the peak memory consumption at fine-tuning time, and allows sharing the majority of weights across tasks.
  3. Adding MLP-based patch pre-processing layers, i.e. the hierarchical MLP (hMLP) stem, improves BERT-like self-supervised training based on patch masking.


  1. First Thing: Parallel ViT
  2. Second Thing: Fine-Tuning Attention is All You Need
  3. Third Thing: Patch preprocessing for BERT-like Self-Supervised Learning (hMLP)

1. First Thing: Parallel ViT

Top: Sequential, Bottom: Parallel
  • Sequential (Top): Consider a sequence of transformer blocks defined by the functions mhsa_l(·), ffn_l(·), mhsa_{l+1}(·) and ffn_{l+1}(·). In the usual implementation, the input x_l is processed sequentially in four steps, each one a residual update applied to the output of the previous step.
  • Parallel (Bottom): Two parallel operations are proposed to replace the above composition. The two MHSA blocks read the same input and their outputs are summed, then the two FFN blocks do the same.
  • This halves the number of layers for a given number of MHSA and FFN blocks. Conversely, there is twice the amount of processing in parallel.
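Written out, the two compositions can be sketched as follows. This is a reconstruction from the description above, not a copy of the paper's typesetting; the intermediate symbols x'_{l+1}, x'_{l+2} and the branch indices (l,1), (l,2) are notational assumptions.

```latex
% Sequential: four residual updates
\begin{align}
x'_{l+1} &= x_l + \mathrm{mhsa}_l(x_l), &
x_{l+1}  &= x'_{l+1} + \mathrm{ffn}_l(x'_{l+1}), \\
x'_{l+2} &= x_{l+1} + \mathrm{mhsa}_{l+1}(x_{l+1}), &
x_{l+2}  &= x'_{l+2} + \mathrm{ffn}_{l+1}(x'_{l+2}).
\end{align}

% Parallel: two residual updates, each with two branches reading the same input
\begin{align}
x_{l+1} &= x_l + \mathrm{mhsa}_{l,1}(x_l) + \mathrm{mhsa}_{l,2}(x_l), \\
x_{l+2} &= x_{l+1} + \mathrm{ffn}_{l,1}(x_{l+1}) + \mathrm{ffn}_{l,2}(x_{l+1}).
\end{align}
```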

The intuition behind this parallelization is as follows: as networks get deeper, the contribution of any single residual block r(·), be it mhsa(·) or ffn(·), becomes increasingly small relative to the overall function.

  • Therefore, the approximation r′(x + r(x)) ≈ r′(x), for any two residual blocks r and r′, becomes increasingly satisfactory.
  • This modification is neutral with respect to parameter count and compute.
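To get a feel for why small residual contributions make the parallel form a reasonable substitute, here is a toy numerical sketch. It is not a ViT: `make_branch` is a hypothetical stand-in (a random linear map followed by tanh) for mhsa/ffn, and `scale` is an assumed knob that controls how large a branch is relative to its input.

```python
import math
import random


def make_branch(rng, dim, scale):
    """Toy residual branch r(x) = scale * tanh(W x); a stand-in for mhsa/ffn."""
    w = [[rng.gauss(0.0, 1.0) / math.sqrt(dim) for _ in range(dim)]
         for _ in range(dim)]

    def branch(x):
        return [scale * math.tanh(sum(w_ij * x_j for w_ij, x_j in zip(row, x)))
                for row in w]

    return branch


def add(*vectors):
    """Element-wise sum of equally sized vectors."""
    return [sum(values) for values in zip(*vectors)]


def sequential(x, r1, r2):
    """Apply r1 as a residual update, then r2 on the updated activation."""
    h = add(x, r1(x))
    return add(h, r2(h))


def parallel(x, r1, r2):
    """Both branches read the same input x; their outputs are summed."""
    return add(x, r1(x), r2(x))


def relative_gap(scale, dim=32, seed=0):
    """Relative L2 distance between the sequential and parallel outputs."""
    rng = random.Random(seed)  # same seed: same input and weights for every scale
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    r1 = make_branch(rng, dim, scale)
    r2 = make_branch(rng, dim, scale)
    seq, par = sequential(x, r1, r2), parallel(x, r1, r2)
    diff = math.sqrt(sum((a - b) ** 2 for a, b in zip(seq, par)))
    return diff / math.sqrt(sum(a * a for a in seq))


gaps = {scale: relative_gap(scale) for scale in (1.0, 0.1, 0.01)}
for scale, gap in gaps.items():
    print(f"branch scale {scale:>5}: relative gap {gap:.2e}")
```

The gap between the two wirings shrinks as the branches get smaller, which is the regime the intuition above appeals to. Whether a trained ViT actually sits in that regime is an empirical question that the paper's experiments address.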
Comparison with authors’ baseline with previous training procedures.
  • The authors' ViT baselines are already on par with, or better than, previous training procedures.
Baseline models and their performance on ImageNet1k-val top1 accuracy at resolution 224×224.
  • LayerScale (LS), as in CaiT, significantly improves the performance.
Performance of parallel ViTs
  • Notation: ViT-B12 has 12 pairs of MHSA and FFN layers. Among the proposed parallel models, ViT-B12×2 has twice as many residual modules as ViT-B12; the ×2 suffix denotes two parallel branches.
  • Figure 1: The best performance is obtained with two parallel branches for all tested model capacities.
  • Figure 2: ViT-L12×2 is stronger than its sequential counterpart, which is more difficult to optimize even though LS is used.
  • Figure 3: The parallel version is more helpful for the deeper and higher capacity models that are more difficult to optimize.
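The naming scheme can be made concrete with a small helper. `parse_vit_name` and `residual_pairs` are hypothetical names introduced here for illustration; they only encode the convention described above (depth, then an optional ×branches suffix).

```python
import re


def parse_vit_name(name):
    """Split a name such as 'ViT-B12×2' into (size, depth, branches)."""
    match = re.fullmatch(r"ViT-([A-Za-z]+)(\d+)(?:[×x](\d+))?", name)
    if match is None:
        raise ValueError(f"unrecognised model name: {name!r}")
    size = match.group(1)
    depth = int(match.group(2))
    branches = int(match.group(3) or 1)  # no suffix means a sequential model
    return size, depth, branches


def residual_pairs(name):
    """Total number of MHSA+FFN pairs: depth times parallel branches."""
    _, depth, branches = parse_vit_name(name)
    return depth * branches


for model in ("ViT-B12", "ViT-B12×2", "ViT-L12×2"):
    print(model, "->", residual_pairs(model), "MHSA/FFN pairs")
```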

Table 2: For models big enough and with proper optimization, sequential and parallel ViTs are roughly equivalent.

  • Table 3: The sequential and parallel models yield substantially higher accuracy than the models with larger working dimensionality. The sequential and parallel models are comparable with 36 blocks. The parallel model is better in the case of 48 blocks.
  • Table 4: A significant speed-up is observed for the parallel model in the case of per-sample processing. However, this does NOT hold for large batch sizes, where exploiting the parallelism may need specific hardware or kernels.

2. Second Thing: Fine-Tuning Attention is All You Need

Fine-tuning the weights of the self-attention layer only (middle panel) leads to savings during fine-tuning in peak memory usage and computational cost.

Instead of fine-tuning the whole model, the authors propose fine-tuning only the attention layers, since the FFN layers are heavy.

Comparison of full finetuning of all weights (full), finetuning of the MHSA layer weights only (attn) and of the FFN layer weights only (ffn) when adapting models to resolution 384 × 384 on ImageNet-1k from a model pre-trained at 224 × 224.
  • First, the fine-tuning stage requires 10% less memory on the GPU.
  • The training is also 10% faster, as fewer gradients are computed.
  • The attention weights correspond to approximately one third of the weights, so about 66% of the storage is saved for each additional fine-tuned model.
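The "one third" figure can be checked with back-of-the-envelope arithmetic. The sketch below assumes a standard ViT block (joint qkv projection, output projection, FFN with expansion ratio 4, two LayerNorms) and ignores the patch embedding, class token, positional embedding and classifier head; the parameter names are hypothetical and do not follow any particular library.

```python
def block_parameter_counts(dim, mlp_ratio=4):
    """Parameter counts (weights and biases) for one assumed standard ViT block."""
    hidden = mlp_ratio * dim
    return {
        "attn.qkv": dim * 3 * dim + 3 * dim,   # joint query/key/value projection
        "attn.proj": dim * dim + dim,          # attention output projection
        "ffn.fc1": dim * hidden + hidden,      # FFN expansion
        "ffn.fc2": hidden * dim + dim,         # FFN contraction
        "norm1": 2 * dim,                      # LayerNorm scale and shift
        "norm2": 2 * dim,
    }


def attention_fraction(dim, mlp_ratio=4):
    """Share of a block's parameters that belongs to the attention layer."""
    counts = block_parameter_counts(dim, mlp_ratio)
    attn = sum(n for name, n in counts.items() if name.startswith("attn."))
    return attn / sum(counts.values())


def trainable_names(counts, mode):
    """Names that would be updated under 'full', 'attn' or 'ffn' fine-tuning."""
    if mode == "full":
        return list(counts)
    return [name for name in counts if name.startswith(mode + ".")]


# Widths 384, 768 and 1024 are the usual ViT-S, ViT-B and ViT-L embedding sizes.
for label, dim in (("ViT-S", 384), ("ViT-B", 768), ("ViT-L", 1024)):
    print(f"{label}: attention share of a block = {attention_fraction(dim):.3f}")

print(trainable_names(block_parameter_counts(768), "attn"))
```

Under these assumptions the attention share is close to one third regardless of width, so storing only the attention weights per task leaves roughly two thirds of the model shared.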
Transfer learning experiments: we compare full finetuning, finetuning of attention only and finetuning with ffn only
  • First, for the smallest datasets, namely CARS and Flower, fine-tuning only the MHSA layers is an excellent strategy.
  • With the largest datasets, in particular iNaturalist, a significant gap is observed between full fine-tuning and the proposed solution for ViT-S. This could be expected: in this case there are more images to learn from.
  • This limitation tends to disappear with the larger ViT-L models, for which the capacity of the MHSA is much larger.

3. Third Thing: Patch preprocessing for BERT-like Self-Supervised Learning (hMLP)

Design of hMLP-stem: we start from subpatches and progressively merge them with linear layers interleaved by GELU non-linearities.
  • Conventionally, a simple patch projection or a convolutional stem is used. Yet, there is no work addressing the problem of their compatibility with self-supervised methods based on patch masking, and in particular on BERT-like auto-encoders such as BEiT.

The hierarchical MLP (hMLP) stem is proposed, as shown above.

  • All patches are processed independently with linear layers interleaved with non-linearities and renormalization.
  • The motivation is to remove any interaction between the different 16×16 patches during the pre-processing stage.
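One way to picture the progressive merging is to track how many tokens remain and how many pixels each token covers after every stage. The stage sizes below (4×4 sub-patches merged 2×2 twice) are an assumption made for illustration, chosen so that the final footprint is a 16×16 patch; `trace_stem` is a hypothetical helper, not the authors' code.

```python
# Assumed (kernel, stride) per merging stage; kernel == stride means no overlap.
ASSUMED_STAGES = [(4, 4), (2, 2), (2, 2)]


def trace_stem(image_size, stages):
    """Return (token grid size, pixel footprint of one token) after each stage."""
    grid, footprint, rows = image_size, 1, []
    for kernel, stride in stages:
        if kernel != stride:
            raise ValueError("overlapping kernels would mix neighbouring patches")
        if grid % stride != 0:
            raise ValueError(f"grid of size {grid} is not divisible by {stride}")
        grid //= stride        # fewer, coarser tokens per side
        footprint *= stride    # each token now covers this many pixels per side
        rows.append((grid, footprint))
    return rows


for grid, footprint in trace_stem(224, ASSUMED_STAGES):
    print(f"{grid}x{grid} tokens, each covering {footprint}x{footprint} pixels")
```

Because every stage uses non-overlapping windows, the footprint of a final token never extends beyond its own patch, which is the property the stem is designed around.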

With hMLP, masking patches before or after the patch-processing stage is equivalent. The stem is also equivalent to a convolutional stem in which the size of each convolutional kernel matches its stride.
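The equivalence can be illustrated on a toy one-dimensional "image". Both stems below are hypothetical stand-ins: `per_patch_stem` embeds each patch from its own pixels only (the property hMLP is built to have), while `overlapping_stem` lets each token also see its neighbours, as a stem with overlapping kernels would. Blanking masked pixels with zeros and using 0.0 as the mask token are simplifying assumptions.

```python
import math

MASK_TOKEN = 0.0  # placeholder for a learned mask embedding


def per_patch_stem(patches):
    """Toy hMLP-like stem: each token depends on its own patch only."""
    def embed(patch):
        # Merge neighbouring sub-patches pairwise, with a non-linearity in between.
        merged = [math.tanh(a + b) for a, b in zip(patch[0::2], patch[1::2])]
        return sum(merged)
    return [embed(p) for p in patches]


def overlapping_stem(patches):
    """Toy overlapping stem: each token also sees its left and right neighbours."""
    own = [sum(p) for p in patches]
    return [sum(own[max(i - 1, 0): i + 2]) for i in range(len(own))]


def mask_after(stem, patches, masked):
    """Embed the full image, then replace masked tokens."""
    tokens = stem(patches)
    return [MASK_TOKEN if i in masked else t for i, t in enumerate(tokens)]


def mask_before(stem, patches, masked):
    """Blank masked patches in pixel space, embed, then replace masked tokens."""
    blanked = [[0.0] * len(p) if i in masked else p for i, p in enumerate(patches)]
    tokens = stem(blanked)
    return [MASK_TOKEN if i in masked else t for i, t in enumerate(tokens)]


patches = [[0.1 * (i + j) for j in range(4)] for i in range(6)]
masked = {1, 4}

for stem in (per_patch_stem, overlapping_stem):
    same = mask_before(stem, patches, masked) == mask_after(stem, patches, masked)
    print(f"{stem.__name__}: masking before == masking after? {same}")
```

With the per-patch stem, the visible tokens are unaffected by what happens to the masked patches, so the order of masking and embedding does not matter. With the overlapping stem, visible tokens next to a masked patch change, so the two orders disagree.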

  • The hMLP design does not significantly increase the compute requirement. For instance, ViT-B requires 17.73 GFLOPs with the hMLP design, which is less than 1% more compute than with the usual linear projection stem.
Patch pre-processing: Performance in top-1 accuracy for a ViT-B12.

hMLP stem obtains a comparable performance but with lower complexity, and without any interaction between the 16×16 patches.

Performance of patch pre-processing in the supervised and BEiT+FT settings.
  • The interest of hMLP in the context of masked self-supervised learning is clear in the above figure.


