Review — PVTv2: Improved Baselines with Pyramid Vision Transformer

Outperforms PVT/PVTv1, Swin Transformer, Twins

PVTv2: Improved Baselines with Pyramid Vision Transformer, by Nanjing University, The University of Hong Kong, Nanjing University of Science and Technology, IIAI, and SenseTime Research
2022 CVMJ, Over 90 Citations (Sik-Ho Tsang @ Medium)
Image Classification, Vision Transformer, ViT, PVT/PVTv1, Swin Transformer


1. Limitations in PVT/PVTv1

PVTv1 treats an image as a sequence of non-overlapping patches, which loses local continuity; its fixed-size position encoding is inflexible for inputs of arbitrary resolution; and its computational cost remains high on high-resolution inputs. PVTv2 addresses these limitations with the three designs below.
2. PVTv2

2.1. Linear Spatial Reduction Attention (Linear SRA)

Linear SRA in PVTv2
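
SRA in PVTv1 shrinks the key/value map with a strided convolution, so the attention cost still grows quadratically with the number of tokens; Linear SRA instead average-pools keys/values to a fixed 7×7 size, making the cost linear in the input resolution. A minimal sketch of the two cost formulas (the helper names are mine, not from the paper's code release):

```python
def sra_cost(h, w, c, r):
    """Attention cost (multiply-adds, up to constants) of PVTv1's SRA:
    h*w queries attend to (h*w)/r^2 keys -> still quadratic in h*w."""
    n = h * w
    return n * (n // (r * r)) * c

def linear_sra_cost(h, w, c, p=7):
    """Linear SRA pools keys/values to a fixed p*p grid, so the cost
    is linear in the number of tokens h*w."""
    return h * w * p * p * c
```

On a 112×112 stage-1 feature map with reduction ratio r=8, the pooled variant is 4× cheaper, and the gap widens as the input resolution grows.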

2.2. Overlapping Patch Embedding (OPE)

Overlapping Patch Embedding in PVTv2
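
Unlike PVTv1's non-overlapping patches, OPE tokenizes the image with a strided convolution whose kernel is larger than its stride, so adjacent patches overlap and local continuity is preserved. A sketch of the resulting token-grid size, assuming the commonly used setting of kernel 2S−1 with zero-padding S−1 for stride S (the function name is mine):

```python
def ope_grid(h, w, stride):
    """Output token-grid size of an overlapping patch embedding
    implemented as a convolution: kernel 2*stride-1, padding stride-1."""
    k, p = 2 * stride - 1, stride - 1
    def out(x):
        return (x + 2 * p - k) // stride + 1
    return out(h), out(w)
```

For a 224×224 input at stride 4 (a 7×7 kernel with padding 3) this yields the same 56×56 token grid as PVTv1's 4×4 non-overlapping patches, but each token now sees a 7×7 window.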

2.3. Convolutional Feed-Forward Network

Convolutional Feed-Forward Network in PVTv2
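
The convolutional FFN inserts a 3×3 depthwise convolution between the two linear layers of the standard MLP; because the convolution uses zero padding, it leaks positional information, which lets PVTv2 drop the fixed-size positional embedding. A minimal NumPy sketch under assumed weight shapes (not the paper's actual implementation):

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def dwconv3x3(y, w_dw):
    # y: (h, w, c); w_dw: (c, 3, 3), one 3x3 filter per channel
    h, w, c = y.shape
    yp = np.pad(y, ((1, 1), (1, 1), (0, 0)))  # zero padding encodes position
    out = np.zeros_like(y)
    for i in range(3):
        for j in range(3):
            out += yp[i:i + h, j:j + w] * w_dw[:, i, j]
    return out

def conv_ffn(x, h, w, w1, w_dw, w2):
    # x: (h*w, c_in) token sequence; FC -> 3x3 DWConv -> GELU -> FC
    y = x @ w1
    y = dwconv3x3(y.reshape(h, w, -1), w_dw).reshape(h * w, -1)
    return gelu(y) @ w2
```

The depthwise convolution adds only c·(3·3) weights per hidden channel, so the extra cost over a plain MLP is small.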

2.4. PVTv2 Variants

PVTv2 Variants

3. Experimental Results

3.1. ImageNet

Image classification performance on the ImageNet validation set

PVTv2 is the state-of-the-art method on ImageNet-1K classification. Compared to PVT, PVTv2 has similar FLOPs and parameter counts, but its image classification accuracy is greatly improved.

Compared to other recent counterparts, the PVTv2 series also shows clear advantages in accuracy and model size.

3.2. COCO

Object detection and instance segmentation on COCO val2017

PVTv2 significantly outperforms PVTv1 on both one-stage and two-stage object detectors of similar model size.

Comparison with Swin Transformer on object detection

PVTv2 obtains much better AP than Swin Transformer across all detectors, showing its stronger feature representation ability.

PVTv2-Li reduces the computation from 258 to 194 GFLOPs while sacrificing only a little performance.

3.3. ADE20K

Semantic segmentation performance of different backbones on the ADE20K validation set

PVTv2 consistently outperforms PVTv1 and other counterparts.

3.4. Ablation Study

Ablation experiments of PVTv2

3.5. Computational Complexity

Models’ GFLOPs under different input scales

PVTv2-Li successfully addresses the high computational overhead caused by the attention layer at large input scales.

3.6. Qualitative Results

Qualitative results of object detection and instance segmentation on COCO val2017, and semantic segmentation on ADE20K
