Review — CPVT: Conditional Positional Encodings for Vision Transformers
CPVT, Conditional Position Encodings (CPE) Instead of Absolute Position Encodings
Conditional Positional Encodings for Vision Transformers
CPVT, by Meituan Inc., and The University of Adelaide
2021 arXiv v2, Over 100 Citations (Sik-Ho Tsang @ Medium)
Image Classification, Vision Transformer, ViT
- Unlike previous fixed or learnable positional encodings, which are predefined and independent of the input tokens, Conditional Positional Encodings (CPE) are dynamically generated and conditioned on the local neighborhood of the input tokens.
- As a result, CPE can easily generalize to input sequences longer than those the model has seen during training.
- CPE can keep the desired translation-invariance, improving accuracy.
- (For a quick read, please read Sections 1–4.)
1. Problems in ViT/DeiT
1.1. Results of DeiT with Different Positional Encodings
- Row 1: Removing the positional encodings dramatically degrades DeiT-tiny’s performance on ImageNet from 72.2% to 68.2% (compared with the original learnable positional encodings).
- Row 4: Relative positional encoding cannot provide any absolute position information. The model with relative position encodings has inferior performance (70.5% vs. 72.2%).
- Also, in DeiT, when the image is larger than the default size, the position encodings need to be interpolated so that their length matches the longer token sequence, as sketched below.
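Below is a minimal sketch of what this interpolation workaround looks like in practice, assuming DeiT-style learnable positional embeddings stored as a 1×N×C tensor; the helper name `resize_pos_embed` is hypothetical, and the class-token embedding, if present, would be handled separately.

```python
import torch
import torch.nn.functional as F

def resize_pos_embed(pos_embed, old_hw, new_hw):
    """Bicubically resize learnable positional embeddings of shape (1, N, C)
    from an old_hw patch grid to a new_hw patch grid, as DeiT-style models
    must do when the test resolution differs from the training resolution."""
    _, N, C = pos_embed.shape
    assert N == old_hw[0] * old_hw[1]
    # (1, N, C) -> (1, C, H, W) so that spatial interpolation can be applied
    grid = pos_embed.reshape(1, old_hw[0], old_hw[1], C).permute(0, 3, 1, 2)
    grid = F.interpolate(grid, size=new_hw, mode="bicubic", align_corners=False)
    # back to (1, N_new, C)
    return grid.permute(0, 2, 3, 1).reshape(1, new_hw[0] * new_hw[1], C)

# 224x224 training (14x14 patches) -> 384x384 testing (24x24 patches)
pe_224 = torch.randn(1, 14 * 14, 192)
pe_384 = resize_pos_embed(pe_224, (14, 14), (24, 24))
print(pe_384.shape)  # torch.Size([1, 576, 192])
```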
1.2. Requirements of a Successful Positional Encoding
- Authors argue that a successful positional encoding for vision tasks should meet the following requirements:
- Making the input sequence permutation-variant but translation-invariant.
- Being inductive and able to handle the sequences longer than the ones during training.
- Having the ability to provide the absolute position to a certain degree. This is important to the performance.
2. Conditional Positional Encodings (CPE) Using Positional Encoding Generator (PEG)
- To condition on the local neighbors, the flattened input sequence X of DeiT, with size B×N×C, is first reshaped back to X’ with size B×H×W×C in the 2-D image space.
- Then, a function F is repeatedly applied to the local patches in X’ to produce the conditional positional encodings E with size B×H×W×C.
- PEG can be efficiently implemented with a 2-D convolution with kernel k (k≥3) and (k−1)/2 zero paddings. Note that the zero paddings here are important to make the model aware of the absolute positions, and F can take various forms such as separable convolutions and many others.
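Below is a minimal PyTorch sketch of a PEG along these lines, assuming F is a 3×3 depthwise convolution and that the generated encodings are added back to the tokens (as described in Section 3); the class name and layer choices are illustrative, not the authors’ official implementation.

```python
import torch
import torch.nn as nn

class PEG(nn.Module):
    """Positional Encoding Generator (sketch): a depthwise k x k convolution
    with (k-1)/2 zero paddings over the 2-D token grid; its output serves as
    the conditional positional encoding and is added to the input tokens."""
    def __init__(self, dim, k=3):
        super().__init__()
        # groups=dim makes the convolution depthwise, keeping it lightweight
        self.proj = nn.Conv2d(dim, dim, k, stride=1,
                              padding=(k - 1) // 2, groups=dim)

    def forward(self, x, H, W):
        # x: (B, N, C) flattened patch tokens (class token, if any, excluded)
        B, N, C = x.shape
        feat = x.transpose(1, 2).reshape(B, C, H, W)   # back to the 2-D grid
        pos = self.proj(feat)                          # conditional encodings E
        return x + pos.flatten(2).transpose(1, 2)      # add E to the tokens

# tokens of a 224x224 image with 16x16 patches -> 14x14 grid, dim 192
tokens = torch.randn(2, 14 * 14, 192)
print(PEG(192)(tokens, 14, 14).shape)  # torch.Size([2, 196, 192])
```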
3. Conditional Positional Encoding Vision Transformers (CPVT)
- There are three model sizes: CPVT-Ti, CPVT-S, and CPVT-B.
- Similar to the original positional encodings in DeiT, the conditional positional encodings (CPE) are also added to the input sequence.
- Both DeiT and ViT utilize an extra learnable class token to perform classification (i.e., cls token shown in Fig. (a) and (b)). The class token is not translation-invariant although it can learn to be translation-invariant.
- A simple alternative is to directly replace it with global average pooling (GAP), which is inherently translation-invariant. Therefore, CPVT-GAP is proposed, where the class token is replaced with a global average pooling.
- Together with the translation-invariant positional encodings, CPVT-GAP is fully translation-invariant and achieves better performance.
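Below is a minimal sketch of such a GAP classification head, assuming a simple LayerNorm followed by a linear classifier; the names are illustrative.

```python
import torch
import torch.nn as nn

class GAPHead(nn.Module):
    """Classification head in the CPVT-GAP style: average all patch tokens
    (translation-invariant) instead of reading a dedicated class token."""
    def __init__(self, dim, num_classes):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.fc = nn.Linear(dim, num_classes)

    def forward(self, tokens):                    # tokens: (B, N, C), no class token
        pooled = self.norm(tokens).mean(dim=1)    # global average pooling over N
        return self.fc(pooled)

tokens = torch.randn(2, 196, 192)                 # encoder output for a 14x14 grid
print(GAPHead(192, 1000)(tokens).shape)           # torch.Size([2, 1000])
```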
4. Experimental Results
4.1. Generalization to Higher Resolutions
- With 384×384 input images, DeiT-tiny with learnable positional encodings degrades from 72.2% to 71.2%. When equipped with sine encodings, the tiny model degrades from 72.2% to 70.8%.
- In contrast, the proposed PEG can directly generalize to larger image sizes without any fine-tuning. CPVT-Ti’s performance is boosted from 73.4% to 74.2% when applied to 384×384 images, and the gap between DeiT-tiny and CPVT-Ti is further enlarged to 3.0% (see the snippet below).
- While DeiT-small obtains 81.5% top-1 accuracy, CPVT-B obtains 82.4% without any interpolations.
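To make this concrete, the same convolutional PEG (re-sketched here from Section 2, hypothetical as before) can be applied to a longer token sequence with no interpolation at all.

```python
import torch
import torch.nn as nn

# The hypothetical PEG from the Section 2 sketch: a depthwise 3x3 convolution
# over the token grid whose output is added to the tokens.
class PEG(nn.Module):
    def __init__(self, dim, k=3):
        super().__init__()
        self.proj = nn.Conv2d(dim, dim, k, stride=1,
                              padding=(k - 1) // 2, groups=dim)

    def forward(self, x, H, W):
        B, N, C = x.shape
        feat = x.transpose(1, 2).reshape(B, C, H, W)
        return x + self.proj(feat).flatten(2).transpose(1, 2)

peg = PEG(192)
# 224x224 training resolution -> 14x14 grid; 384x384 testing -> 24x24 grid.
# The same PEG handles both sequence lengths, unlike a learnable 14x14 table
# of positional embeddings that has to be resized first.
print(peg(torch.randn(1, 14 * 14, 192), 14, 14).shape)  # torch.Size([1, 196, 192])
print(peg(torch.randn(1, 24 * 24, 192), 24, 24).shape)  # torch.Size([1, 576, 192])
```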
4.2. CPVT with Global Average Pooling (GAP)
- Using GAP can boost CPVT by at least 1%.
- For example, equipping CPVT-Ti with GAP obtains 74.9% top-1 accuracy on ImageNet validation dataset, which outperforms DeiT-tiny by a large margin (+2.7%).
- Moreover, it even exceeds DeiT-tiny model with distillation (74.5%). In contrast, DeiT with GAP cannot gain so much improvement (only 0.4%).
4.3. Complexity of PEG
- DeiT-tiny utilizes learnable position encodings with 192×14×14 = 37,632 parameters. CPVT-Ti introduces only 38,592 parameters, i.e., merely 960 more, which is negligible compared to the 5.7M model parameters of DeiT-tiny.
4.4. Comparison with SOTA
- Compared with DeiT, CPVT models have much better top-1 accuracy with similar throughputs.
- Noticeably, the proposed GAP version marks a new state of the art for Vision Transformers.
- Using RegNetY-160 as the teacher, CPVT obtains 75.9% top-1 accuracy, exceeding DeiT-tiny by 1.4%.
4.5. Object Detection
- PEG is applied onto DETR. Compared to the original DETR model, only the positional encoding strategy of its encoder part is changed, which originally uses the absolute 2D sine and cosine positional encodings.
- If the positional encodings are removed, the mAP of DETR degrades from 33.7% to 32.8%.
- PEG improves the performance to 33.9%, which is even better than DETR with the original positional encodings.
- The same results hold for Deformable DETR [36]. (Hope I can review Deformable DETR in the future.)
5. Ablation Study
5.1. Positions of PEG in CPVT
- The input of the first encoder block is denoted as index -1, so index 0 refers to the output of the first block.
- Placing the PEG at position 0 gives much better performance than placing it at -1.
5.2. Single PEG vs. Multiple PEGs
- By inserting PEGs at five positions (0–4), the top-1 accuracy of the tiny model is further improved to 73.4%, which surpasses DeiT-tiny by 1.2%.
- Similarly, CPVT-S achieves a new state of the art (80.5%).
5.3. Comparisons with Other Positional Encodings
- If a single-layer PEG is added to each of the first five blocks, 73.4% top-1 accuracy is obtained, which indicates that it is better to add the positional encodings to more levels of the encoder.
5.4. Importance of Zero Paddings
- Zero paddings can provide some absolute position information to the model, as illustrated below.
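A small illustration (not from the paper) of why zero padding injects absolute position: a convolution that simply sums each 3×3 neighborhood produces smaller values near the zero-padded borders, so its output reveals how far each location is from the image boundary.

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(1, 1, kernel_size=3, padding=1, bias=False)
nn.init.constant_(conv.weight, 1.0)       # sum over each 3x3 neighborhood

x = torch.ones(1, 1, 5, 5)                # a constant (translation-invariant) input
with torch.no_grad():
    y = conv(x)[0, 0]
print(y)
# Interior positions see nine ones (9.0), while edges and corners see fewer
# (6.0 and 4.0) because of the zero padding, so the output encodes how far
# each position is from the boundary, i.e., coarse absolute position.
```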
5.5. PEG on PVT
- PEG can significantly boost PVT-tiny by 2.1% on ImageNet.
Later on, the authors published another model, Twins, in 2021 NeurIPS, which also uses CPVT.
Reference
[2021 arXiv v2] [CPVT]
Conditional Positional Encodings for Vision Transformers
Image Classification
2021 [Learned Resizer] [Vision Transformer, ViT] [ResNet Strikes Back] [DeiT] [EfficientNetV2] [MLP-Mixer] [T2T-ViT] [Swin Transformer] [CaiT] [ResMLP] [ResNet-RS] [NFNet] [PVT, PVTv1] [CvT] [HaloNet] [TNT] [CoAtNet] [Focal Transformer] [TResNet] [CPVT] 2022 [ConvNeXt]