Review — VAN: Visual Attention Network

VAN, Large Kernel Convolution by Decomposition

Sik-Ho Tsang
4 min read · Jan 30, 2023
Results of different models on ImageNet-1K validation set.

Visual Attention Network,
VAN, by Tsinghua University, Nankai University, and Fitten Tech,
2022 arXiv v5, Over 80 Citations (Sik-Ho Tsang @ Medium)
Image Classification

  • Applying self-attention in computer vision is challenging: (1) Treating images as 1D sequences neglects their 2D structures. (2) The quadratic complexity is too expensive for high-resolution images. (3) It only captures spatial adaptability but ignores channel adaptability.
  • In this paper, a novel linear attention named large kernel attention (LKA) is proposed to enable self-adaptive and long-range correlations in self-attention while avoiding its shortcomings.
  • Visual Attention Network (VAN) is constructed using LKA.

Outline

  1. Large Kernel Attention (LKA)
  2. Visual Attention Network (VAN)
  3. Experimental Results

1. Large Kernel Attention (LKA)

Decomposition diagram of large-kernel convolution.
  • A large kernel convolution operation is decomposed to capture long-range relationships.
  • Specifically, a K×K convolution is decomposed into a ⌈K/d⌉×⌈K/d⌉ depth-wise dilation convolution with dilation d, a (2d−1)×(2d−1) depth-wise convolution, and a 1×1 convolution.
  • Through this decomposition, long-range relationships are captured at only a slight cost in computation and parameters.
  • The LKA module can be written as:
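With F as the input feature, the attention map and output of LKA are:

Attention = Conv_{1×1}(DW-D-Conv(DW-Conv(F)))
Output = Attention ⊗ F

where DW-Conv is the depth-wise convolution, DW-D-Conv is the depth-wise dilation convolution, and ⊗ denotes element-wise multiplication.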
The structure of different modules: (a) the proposed Large Kernel Attention (LKA); (b) non-attention module; (c) replacing multiplication in LKA with addition; (d) self-attention. It is worth noting that (d) is designed for 1D sequences.
  • The LKA module is as shown above.
Desirable properties belonging to convolution, self-attention and LKA.
  • LKA does not require an additional normalization function such as sigmoid or softmax.
  • Furthermore, LKA achieves adaptability not only in the spatial dimension but also in the channel dimension. It is worth noting that different channels often represent different objects.

LKA combines the advantages of convolution and self-attention.
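Below is a minimal PyTorch sketch of the LKA module as described above, assuming K=21 and d=3 (so a 5×5 depth-wise convolution, a 7×7 depth-wise convolution with dilation 3, and a 1×1 convolution); the class and layer names are illustrative, not the official implementation.

```python
import torch
import torch.nn as nn

class LKA(nn.Module):
    """Sketch of Large Kernel Attention for K=21, d=3."""
    def __init__(self, dim):
        super().__init__()
        # (2d-1)x(2d-1) = 5x5 depth-wise convolution: local context
        self.dw_conv = nn.Conv2d(dim, dim, kernel_size=5, padding=2, groups=dim)
        # ceil(K/d) x ceil(K/d) = 7x7 depth-wise convolution with dilation 3: long-range context
        self.dw_d_conv = nn.Conv2d(dim, dim, kernel_size=7, padding=9, dilation=3, groups=dim)
        # 1x1 convolution: relationship in the channel dimension
        self.pw_conv = nn.Conv2d(dim, dim, kernel_size=1)

    def forward(self, x):
        attn = self.pw_conv(self.dw_d_conv(self.dw_conv(x)))
        return attn * x  # element-wise multiplication: the attention map re-weights the input

x = torch.randn(1, 64, 56, 56)
print(LKA(64)(x).shape)  # torch.Size([1, 64, 56, 56])
```

Note that there is no sigmoid/softmax normalization on the attention map, and the 1×1 convolution provides the channel adaptability mentioned above.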

2. Visual Attention Network (VAN)

A stage of VAN. d means depth-wise convolution. k×k denotes k×k convolution.
  • VAN has a simple hierarchical structure, i.e., a sequence of four stages with decreasing output spatial resolution. As the resolution decreases, the number of output channels increases.
  • For each stage, the input is first downsampled, with the stride controlling the downsampling rate.
  • Then, L groups of batch normalization, 1×1 Conv, GELU activation, large kernel attention, and feed-forward network (FFN) are stacked in sequence to extract features.
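A rough sketch of one such group is given below, combining a residual attention part (batch normalization, 1×1 Conv, GELU, LKA, 1×1 Conv) with a residual FFN. It reuses the LKA class from the sketch in Section 1, and names such as proj1/proj2 and the mlp_ratio default are illustrative assumptions; the official block contains further details not described here.

```python
import torch.nn as nn

class VANBlock(nn.Module):
    """Sketch of one VAN group: BN -> 1x1 Conv -> GELU -> LKA -> 1x1 Conv (residual), then BN -> FFN (residual)."""
    def __init__(self, dim, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.BatchNorm2d(dim)
        self.proj1 = nn.Conv2d(dim, dim, kernel_size=1)
        self.act = nn.GELU()
        self.lka = LKA(dim)  # large kernel attention, as sketched in Section 1
        self.proj2 = nn.Conv2d(dim, dim, kernel_size=1)
        self.norm2 = nn.BatchNorm2d(dim)
        self.ffn = nn.Sequential(  # simple convolutional feed-forward network
            nn.Conv2d(dim, dim * mlp_ratio, kernel_size=1),
            nn.GELU(),
            nn.Conv2d(dim * mlp_ratio, dim, kernel_size=1),
        )

    def forward(self, x):
        x = x + self.proj2(self.lka(self.act(self.proj1(self.norm1(x)))))  # attention part
        x = x + self.ffn(self.norm2(x))                                    # FFN part
        return x
```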
The detailed setting for different versions of the VAN.
  • Seven architectures are designed: VAN-B0, VAN-B1, VAN-B2, VAN-B3, VAN-B4, VAN-B5, and VAN-B6.
  • The number of parameters P(K, d) and FLOPs F(K, d) can be denoted as shown in the formula below.
  • K=21 is adopted by default. For K=21, the parameter count takes its minimum value when d=3, which corresponds to a 5×5 depth-wise convolution and a 7×7 depth-wise convolution with dilation 3.
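As a hedged reconstruction from the decomposition in Section 1 (counting convolution weights only, biases ignored, for an input of size H×W with C channels):

P(K, d) = C (⌈K/d⌉² + (2d − 1)² + C)
F(K, d) = P(K, d) × H × W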
Number of parameters for different forms of a 21×21 convolution.

As shown above, the decomposition offers significant advantages over a standard large-kernel convolution in terms of parameters and FLOPs.
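To make this concrete, here is a small Python check of the counting above; C = 64 is an arbitrary illustrative channel count, not a value from the paper's table.

```python
from math import ceil

# Parameter counts (weights only, biases ignored) for a 21x21 receptive field over C channels.
C, K = 64, 21

standard = C * C * K * K        # ordinary 21x21 convolution
depthwise = C * K * K + C * C   # 21x21 depth-wise convolution + 1x1 convolution

def decomposed(d):              # LKA-style decomposition with dilation d
    return C * (ceil(K / d) ** 2 + (2 * d - 1) ** 2 + C)

print(standard, depthwise, decomposed(3))  # 1806336 32320 8832

# The same counting also shows why d = 3 is the default for K = 21:
for d in range(1, 8):
    print(d, decomposed(d))                # minimum at d = 3
```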

3. Experimental Results

3.1. Ablation Study

Ablation study of different modules in LKA.
  • VAN-B0 is used as baseline.

All components in LKA are indispensable to improve performance.

Left: Throughput of Swin Transformer and VAN on RTX 3090. Right: Accuracy-Throughput Diagram.

VAN achieves a better accuracy-throughput trade-off than Swin Transformer.

Ablation study of different kernel size K in LKA.

Decomposing a 21×21 convolution works better than decomposing a 7×7 convolution, which demonstrates that a large kernel is critical for visual tasks.

  • When decomposing an even larger 28×28 convolution, the gain is not obvious.

3.2. Image Classification

Left: Compare with the state-of-the-art methods on ImageNet validation set. Right: All models are pretrained on ImageNet-22K dataset.

Left: VAN outperforms common CNNs (ResNet, ResNeXt, ConvNeXt, etc.), ViTs (DeiT, PVT and Swin Transformer, etc.) and MLPs (MLP-Mixer, ResMLP, gMLP, etc.) with similar parameters and computational cost.

  • Right: With ImageNet-22K pretraining, VAN achieves 87.8% Top-1 accuracy with 200M parameters and surpasses same-level ViT, Swin Transformer, EfficientNetV2, and ConvNeXt at different resolutions, which demonstrates its strong capability to adapt to large-scale pretraining.
Visualization results using Grad-CAM.

VAN-B2 can clearly focus on the target objects.

3.3. Overview of Visual Tasks

Comparison with similar-level PVT, Swin Transformer, and ConvNeXt on various tasks, including image classification, object detection, semantic segmentation, instance segmentation, and pose estimation.
  • The above figure clearly reveals the improvement of VAN.
  • (There are tables for individual tasks in the paper, please feel free to read the paper directly if you’re interested.)

Reference

[2022 arXiv v5] [VAN]
Visual Attention Network

1.1. Image Classification

1989–2021 … 2022 [ConvNeXt] [PVTv2] [ViT-G] [AS-MLP] [ResTv2] [CSWin Transformer] [Pale Transformer] [Sparse MLP] [MViTv2] [S²-MLP] [CycleMLP] [MobileOne] [GC ViT] [VAN] 2023 [Vision Permutator (ViP)]

==== My Other Previous Paper Readings ====
