Review — VAN: Visual Attention Network

VAN, Large Kernel Convolution by Decomposition

Results of different models on ImageNet-1K validation set.
  • Applying self-attention in computer vision is challenging: (1) Treating images as 1D sequences neglects their 2D structures. (2) The quadratic complexity is too expensive for high-resolution images. (3) It only captures spatial adaptability but ignores channel adaptability.
  • In this paper, a novel attention mechanism named Large Kernel Attention (LKA) is proposed to enable the self-adaptive and long-range correlations found in self-attention while avoiding its shortcomings.
  • Visual Attention Network (VAN) is constructed using LKA.


  1. Large Kernel Attention (LKA)
  2. Visual Attention Network (VAN)
  3. Experimental Results

1. Large Kernel Attention (LKA)

Decomposition diagram of large-kernel convolution.
  • A large kernel convolution operation is decomposed to capture long-range relationships.
  • Specifically, a K×K convolution is decomposed into a ⌈K/d⌉×⌈K/d⌉ depth-wise dilation convolution with dilation d, a (2d−1)×(2d−1) depth-wise convolution, and a 1×1 convolution.
  • Through this decomposition, long-range relationships are captured at only a slight cost in computation and parameters.
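The kernel sizes implied by a given K and d can be worked out directly. A small sketch (the receptive-field composition formula for stacked convolutions is standard, not from the paper):

```python
import math

def decomposition_kernels(K, d):
    """Kernel sizes when decomposing a KxK convolution (VAN's scheme)."""
    k_local = 2 * d - 1            # depth-wise convolution kernel
    k_dilated = math.ceil(K / d)   # depth-wise dilated convolution kernel
    return k_local, k_dilated

def receptive_field(K, d):
    """RF of the k_local conv followed by the k_dilated conv with dilation d."""
    k_local, k_dilated = decomposition_kernels(K, d)
    return k_local + d * (k_dilated - 1)

print(decomposition_kernels(21, 3))  # (5, 7)
print(receptive_field(21, 3))        # 23 — covers the original 21x21 window
```

For K=21 and d=3 this recovers the 5×5 and 7×7 kernels discussed below, with a combined receptive field at least as large as the original 21×21 kernel.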
  • The LKA module can be written as:

    Attention = Conv_{1×1}(DW-D-Conv(DW-Conv(F)))
    Output = Attention ⊗ F

  where F is the input feature and ⊗ denotes element-wise product.
The structure of different modules: (a) the proposed Large Kernel Attention (LKA); (b) non-attention module; (c) LKA with multiplication replaced by addition; (d) self-attention. It is worth noting that (d) is designed for 1D sequences.
  • The LKA module is as shown above.
Desirable properties belonging to convolution, self-attention and LKA.
  • LKA does not require an additional normalization function such as sigmoid or softmax.
  • Furthermore, LKA achieves adaptability not only in the spatial dimension but also in the channel dimension. It is worth noting that different channels often represent different objects.
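As a toy illustration of the attention computation, the following is a single-channel, pure-Python sketch. It assumes the 1×1 convolution reduces to a scalar weight of 1 in the single-channel case; this is for intuition only, not the paper's implementation:

```python
def conv2d(x, kernel, dilation=1):
    """'Same'-padded 2D convolution on a single-channel list-of-lists."""
    H, W, K = len(x), len(x[0]), len(kernel)
    out = [[0.0] * W for _ in range(H)]
    for i in range(H):
        for j in range(W):
            s = 0.0
            for u in range(K):
                for v in range(K):
                    ii = i + dilation * (u - K // 2)
                    jj = j + dilation * (v - K // 2)
                    if 0 <= ii < H and 0 <= jj < W:
                        s += kernel[u][v] * x[ii][jj]
            out[i][j] = s
    return out

def lka(x, k_local, k_dilated, dilation):
    # Attention map: DW-Conv -> DW-D-Conv (the 1x1 conv is identity here).
    a = conv2d(x, k_local)
    a = conv2d(a, k_dilated, dilation=dilation)
    # Output: element-wise product of attention map and input feature.
    return [[a[i][j] * x[i][j] for j in range(len(x[0]))]
            for i in range(len(x))]
```

With identity kernels (a single 1 at the centre), the attention map equals the input, so the output is the element-wise square of the input, which makes the gating behaviour easy to check.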

2. Visual Attention Network (VAN)

A stage of VAN. d denotes depth-wise convolution. k×k denotes a k×k convolution.
  • VAN has a simple hierarchical structure, i.e., a sequence of four stages with decreasing output spatial resolution. As the resolution decreases, the number of output channels increases.
  • For each stage, the input is first downsampled using a strided convolution.
  • Then, L groups of batch normalization, 1×1 convolution, GELU activation, Large Kernel Attention and feed-forward network (FFN) are stacked in sequence to extract features.
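The resulting feature-pyramid resolutions can be sketched numerically. The per-stage strides of 4, 2, 2, 2 used here are an assumption consistent with common four-stage pyramid backbones; see the paper's configuration table for the exact settings:

```python
def stage_shapes(h, w, strides=(4, 2, 2, 2)):
    """Spatial resolution after each of the four downsampling stages."""
    shapes = []
    for s in strides:
        h, w = h // s, w // s
        shapes.append((h, w))
    return shapes

print(stage_shapes(224, 224))  # [(56, 56), (28, 28), (14, 14), (7, 7)]
```

For a 224×224 input this yields the familiar H/4, H/8, H/16, H/32 pyramid.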
The detailed setting for different versions of the VAN.
  • Seven architectures are designed: VAN-B0, VAN-B1, VAN-B2, VAN-B3, VAN-B4, VAN-B5 and VAN-B6.
  • The number of parameters P(K, d) and FLOPs F(K, d) of the decomposition can be denoted (ignoring bias terms) as:

    P(K, d) = C(⌈K/d⌉² + (2d − 1)² + C)
    F(K, d) = P(K, d) × H × W

  where C is the number of channels and H×W is the feature-map resolution.
  • K=21 is adopted by default. For K=21, the above equation takes its minimum at d=3, which corresponds to a 5×5 depth-wise convolution and a 7×7 depth-wise convolution with dilation 3.
Number of parameters for different forms of a 21×21 convolution.
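Under the bias-free count P(K, d) = C(⌈K/d⌉² + (2d − 1)² + C), the optimal dilation for K = 21 can be found by brute force. A sketch (C is a free choice here; it only scales the count and does not change the argmin):

```python
import math

def params(K, d, C):
    """Bias-free parameter count of the decomposed KxK convolution."""
    return C * (math.ceil(K / d) ** 2 + (2 * d - 1) ** 2 + C)

C = 32
print(params(21, 3, C))   # 3392  — decomposed form
print(C * C * 21 * 21)    # 451584 — plain dense 21x21 convolution

best_d = min(range(1, 22), key=lambda d: params(21, d, C))
print(best_d)  # 3
```

The brute-force search confirms d=3 as the minimum, matching the 5×5 plus 7×7 (dilation 3) configuration above, at a small fraction of the dense convolution's parameters.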

3. Experimental Results

3.1. Ablation Study

Ablation study of different modules in LKA.
  • VAN-B0 is used as baseline.
Left: Throughput of Swin Transformer and VAN on RTX 3090. Right: Accuracy-Throughput Diagram.
Ablation study of different kernel size K in LKA.
  • When decomposing an even larger 28×28 convolution, the gain is not obvious.

3.2. Image Classification

Left: Comparison with state-of-the-art methods on the ImageNet validation set. Right: All models are pretrained on the ImageNet-22K dataset.
  • Right: Pretrained on ImageNet-22K, VAN achieves 87.8% Top-1 accuracy with 200M parameters, surpassing similarly sized ViT, Swin Transformer, EfficientNetV2 and ConvNeXt at different resolutions, which demonstrates its strong ability to benefit from large-scale pretraining.
Visualization results using Grad-CAM.

3.3. Overview of Visual Tasks

Comparison with similarly sized PVT, Swin Transformer, and ConvNeXt on various tasks, including image classification, object detection, semantic segmentation, instance segmentation and pose estimation.
  • The above figure clearly reveals the improvement of VAN.
  • (There are tables for individual tasks in the paper, please feel free to read the paper directly if you’re interested.)


[2022 arXiv v5] [VAN]
Visual Attention Network

