Review — VAN: Visual Attention Network
Visual Attention Network,
VAN, by Tsinghua University, Nankai University, and Fitten Tech,
2022 arXiv v5, Over 80 Citations (Sik-Ho Tsang @ Medium)
- Applying self-attention in computer vision is challenging: (1) Treating images as 1D sequences neglects their 2D structures. (2) The quadratic complexity is too expensive for high-resolution images. (3) It only captures spatial adaptability but ignores channel adaptability.
- In this paper, a novel linear attention named large kernel attention (LKA) is proposed to enable self-adaptive and long-range correlations in self-attention while avoiding its shortcomings.
- Visual Attention Network (VAN) is constructed using LKA.
- Large Kernel Attention (LKA)
- Visual Attention Network (VAN)
- Experimental Results
1. Large Kernel Attention (LKA)
- A large kernel convolution operation is decomposed to capture long-range relationships.
- Specifically, a K×K convolution is decomposed into a ⌈K/d⌉×⌈K/d⌉ depth-wise dilation convolution with dilation d, a (2d−1)×(2d−1) depth-wise convolution, and a 1×1 convolution.
- Through this decomposition, long-range relationships are captured with only a slight cost in computation and parameters.
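As a quick sanity check on the decomposition, the receptive field of the stacked depth-wise convolutions can be computed to confirm it covers the original large kernel (a minimal sketch; the helper below is hypothetical, not from the paper):

```python
def receptive_field(layers):
    """Receptive field (side length) of stacked 2D convolutions.

    `layers` is a list of (kernel, dilation) pairs; each layer adds
    (kernel - 1) * dilation to the receptive field (stride 1 assumed).
    """
    rf = 1
    for kernel, dilation in layers:
        rf += (kernel - 1) * dilation
    return rf

# K=21, d=3: a 5x5 depth-wise conv followed by a 7x7 depth-wise conv
# with dilation 3 (the 1x1 conv does not enlarge the receptive field).
rf = receptive_field([(5, 1), (7, 3)])
print(rf)  # 23, which covers the original 21x21 kernel
```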
- The LKA module can be written as:
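Following the paper's formulation, with F the input feature, DW-Conv the depth-wise convolution, DW-D-Conv the depth-wise dilation convolution, and ⊗ the element-wise product:

```latex
\mathrm{Attention} = \mathrm{Conv}_{1\times 1}\bigl(\text{DW-D-Conv}\bigl(\text{DW-Conv}(F)\bigr)\bigr)
\mathrm{Output} = \mathrm{Attention} \otimes F
```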
- The LKA module is as shown above.
- LKA does not require an additional normalization function such as sigmoid or softmax.
- Furthermore, LKA achieves adaptability not only in the spatial dimension but also in the channel dimension. It is worth noting that different channels often represent different objects.
LKA combines the advantages of convolution and self-attention.
2. Visual Attention Network (VAN)
- VAN has a simple hierarchical structure, i.e., a sequence of four stages with decreasing output spatial resolution. As the resolution decreases, the number of output channels increases.
- For each stage, the input is first downsampled using a strided convolution.
- Then, L groups of batch normalization, 1×1 convolution, GELU activation, large kernel attention and feed-forward network (FFN) are stacked in sequence to extract features.
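The per-stage spatial resolutions can be sketched as follows (a hypothetical helper; the 4×, 2×, 2×, 2× stride schedule is the common hierarchical design used by such four-stage backbones):

```python
def stage_shapes(h, w, strides=(4, 2, 2, 2)):
    """Spatial resolution after each downsampling stage (integer stride division)."""
    shapes = []
    for s in strides:
        h, w = h // s, w // s
        shapes.append((h, w))
    return shapes

# For a 224x224 input, the four stages produce H/4, H/8, H/16, H/32 feature maps.
print(stage_shapes(224, 224))  # [(56, 56), (28, 28), (14, 14), (7, 7)]
```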
- Seven architectures are designed: VAN-B0, VAN-B1, VAN-B2, VAN-B3, VAN-B4, VAN-B5, and VAN-B6.
- The number of parameters P(K, d) and FLOPs F(K, d) can be denoted as:
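Ignoring bias terms, the cost of the decomposition for a C×H×W input can be reconstructed from its three components (the 1×1 convolution contributes the C² term; this is a reconstruction, not copied verbatim from the paper):

```latex
P(K, d) \approx C\left(\left\lceil \tfrac{K}{d} \right\rceil^{2} + (2d-1)^{2}\right) + C^{2}
F(K, d) \approx P(K, d) \times H \times W
```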
- K=21 is adopted by default. For K=21, the above equation takes the minimum value when d=3, which corresponds to 5×5 depth-wise convolution and 7×7 depth-wise convolution with dilation 3.
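The choice d=3 can be verified numerically with a minimal sketch using the per-channel kernel cost ⌈K/d⌉² + (2d−1)² (the shared 1×1 term and any biases do not depend on d, so they can be dropped from the search):

```python
import math

def kernel_params(K, d):
    """Per-channel parameter count of the two depth-wise kernels for kernel size K, dilation d."""
    return math.ceil(K / d) ** 2 + (2 * d - 1) ** 2

K = 21
best_d = min(range(1, K + 1), key=lambda d: kernel_params(K, d))
print(best_d, kernel_params(K, best_d))  # 3 74
```

For K=21 the minimum is indeed at d=3 (a 5×5 depth-wise kernel plus a 7×7 dilated kernel), matching the default configuration described above.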
As shown above, the decomposition offers significant advantages over a standard large kernel convolution in terms of parameters and FLOPs.
3. Experimental Results
3.1. Ablation Study
- VAN-B0 is used as baseline.
All components in LKA are indispensable to improve performance.
VAN achieves a better accuracy-throughput trade-off than Swin Transformer.
Decomposing a 21×21 convolution works better than decomposing a 7×7 convolution, which demonstrates that a large kernel is critical for visual tasks.
- When decomposing an even larger 28×28 convolution, the gain is not obvious.
3.2. Image Classification
Left: VAN outperforms common CNNs (ResNet, ResNeXt, ConvNeXt, etc.), ViTs (DeiT, PVT and Swin Transformer, etc.) and MLPs (MLP-Mixer, ResMLP, gMLP, etc.) with similar parameters and computational cost.
- Right: When pretrained on ImageNet-22K, VAN achieves 87.8% Top-1 accuracy with 200M parameters and surpasses ViT, Swin Transformer, EfficientNetV2 and ConvNeXt at the same scale across different resolutions, which demonstrates its strong capability to adapt to large-scale pretraining.
VAN-B2 can clearly focus on the target objects.
3.3. Overview of Visual Tasks
- The above figure clearly reveals the improvement of VAN.
- (There are tables for individual tasks in the paper, please feel free to read the paper directly if you’re interested.)