Review — CPVT: Conditional Positional Encodings for Vision Transformers

CPVT, Conditional Position Encodings (CPE) Instead of Absolute Position Encodings

Comparison of CPVT and DeiT models under various configurations
  • Unlike previous fixed or learnable positional encodings, which are predefined and independent of the input tokens, Conditional Positional Encodings (CPE) are dynamically generated and conditioned on the local neighborhood of the input tokens.
  • As a result, CPE can easily generalize to input sequences that are longer than those seen during training.
  • CPE can keep the desired translation-invariance, improving accuracy.
  • (For a quick read, please see Sections 1–4.)

Outline

  1. Problems in ViT/DeiT
  2. Conditional Positional Encodings (CPE) Using Positional Encoding Generator (PEG)
  3. Conditional Positional Encoding Vision Transformers (CPVT)
  4. Experimental Results
  5. Ablation Study

1. Problems in ViT/DeiT

1.1. Results of DeiT with Different Positional Encodings

Comparison of various positional encoding (PE) strategies tested on ImageNet validation set in terms of the top-1 accuracy
  • Row 1: When the positional encodings are removed, DeiT-tiny’s performance on ImageNet degrades dramatically from 72.2% to 68.2% (compared with the original learnable positional encodings).
  • Row 4: Relative positional encoding cannot provide any absolute position information. The model with relative position encodings has inferior performance (70.5% vs. 72.2%).
  • Also, in DeiT, when the input image is larger than the default size, the position encodings have to be interpolated to match the longer token sequence, roughly as sketched below.
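
For concreteness, the following is a minimal sketch of this kind of 2-D interpolation of learnable position embeddings (hypothetical function name and shapes, not DeiT’s actual code), assuming a /16 patch embedding and a leading class token:

```python
import torch
import torch.nn.functional as F

def interpolate_pos_embed(pos_embed, new_hw, old_hw=(14, 14)):
    """Resize learnable position embeddings of shape (1, 1 + H*W, C), with a
    leading class-token embedding, from old_hw to new_hw via bicubic interpolation."""
    cls_pe, patch_pe = pos_embed[:, :1], pos_embed[:, 1:]                # split cls / patch parts
    C = patch_pe.shape[-1]
    patch_pe = patch_pe.reshape(1, old_hw[0], old_hw[1], C).permute(0, 3, 1, 2)  # (1, C, H, W)
    patch_pe = F.interpolate(patch_pe, size=new_hw, mode="bicubic", align_corners=False)
    patch_pe = patch_pe.permute(0, 2, 3, 1).reshape(1, -1, C)            # back to (1, H'*W', C)
    return torch.cat([cls_pe, patch_pe], dim=1)

# 224×224 with /16 patches -> 14×14 grid; 384×384 -> 24×24 grid
pe_224 = torch.randn(1, 1 + 14 * 14, 192)
print(interpolate_pos_embed(pe_224, new_hw=(24, 24)).shape)              # torch.Size([1, 577, 192])
```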

1.2. Requirements of a Successful Positional Encoding

  • The authors argue that a successful positional encoding for vision tasks should meet the following requirements:
  1. Making the input sequence permutation-variant but translation-invariant.
  2. Being inductive and able to handle sequences longer than those seen during training.
  3. Having the ability to provide absolute position information to a certain degree, which is important to the performance.

2. Conditional Positional Encodings (CPE) Using Positional Encoding Generator (PEG)

Schematic illustration of Positional Encoding Generator (PEG). d is the embedding size, N is the number of tokens. The function F can be a depth-wise convolution, a separable convolution, or other more complicated blocks.
  • To condition on the local neighbors, the flattened input sequence X of size B×N×C (as in DeiT) is first reshaped back to X′ of size B×H×W×C in the 2-D image space.
  • Then, a function F is repeatedly applied to the local patches in X′ to produce the conditional positional encodings E of size B×H×W×C.
  • PEG can be efficiently implemented with a 2-D convolution with kernel size k (k≥3) and (k−1)/2 zero padding. Note that the zero padding here is important to make the model aware of the absolute positions, and F can take various forms such as separable convolutions and many others.
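
Below is a minimal PyTorch sketch of such a PEG (a single 3×3 depth-wise convolution with zero padding 1). The way the class token is set aside here is a simplification of the idea, not the authors’ exact implementation:

```python
import torch
import torch.nn as nn

class PEG(nn.Module):
    """Minimal Positional Encoding Generator: a k×k depth-wise convolution with
    (k-1)//2 zero padding, applied to the tokens reshaped back onto the 2-D grid."""
    def __init__(self, dim=192, k=3):
        super().__init__()
        # groups=dim makes the convolution depth-wise; the padding keeps H×W unchanged
        self.proj = nn.Conv2d(dim, dim, k, stride=1, padding=(k - 1) // 2, groups=dim)

    def forward(self, x, H, W):
        # x: (B, 1 + H*W, C) with a leading class token, as in DeiT
        cls_tok, patch_tok = x[:, :1], x[:, 1:]
        B, N, C = patch_tok.shape
        feat = patch_tok.transpose(1, 2).reshape(B, C, H, W)   # back to the 2-D image space
        pos = self.proj(feat)                                  # conditional positional encoding E
        pos = pos.flatten(2).transpose(1, 2)                   # (B, H*W, C)
        return torch.cat([cls_tok, patch_tok + pos], dim=1)    # add CPE to the input sequence

x = torch.randn(2, 1 + 14 * 14, 192)   # 224×224 input, /16 patches
print(PEG()(x, 14, 14).shape)          # torch.Size([2, 197, 192])
```

Because the convolution weights are shared across positions and the zero padding at the borders is the only absolute reference, the resulting encoding moves with the image content while still conveying some absolute position, which matches the requirements listed in Section 1.2.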

3. Conditional Positional Encoding Vision Transformers (CPVT)

  • There are three model sizes: CPVT-Ti, CPVT-S, and CPVT-B.
(a) ViT with explicit 1-D learnable positional encodings (PE). (b) CPVT with conditional positional encodings from the proposed Position Encoding Generator (PEG) plugin, which is the default choice. (c) CPVT-GAP without the class token (cls), but with global average pooling (GAP).
  • Similar to the original positional encodings in DeiT, the conditional positional encodings (CPE) are also added to the input sequence.
  • Both DeiT and ViT utilize an extra learnable class token to perform classification (i.e., cls token shown in Fig. (a) and (b)). The class token is not translation-invariant although it can learn to be translation-invariant.
  • A simple alternative is to directly replace it with global average pooling (GAP), which is inherently translation-invariant. Therefore, CPVT-GAP is proposed, where the class token is replaced with global average pooling over all the patch tokens (a minimal sketch of such a head follows).
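
A minimal sketch of a GAP-based classification head along these lines (the exact placement of the normalization is an assumption, not the authors’ implementation):

```python
import torch
import torch.nn as nn

class GAPHead(nn.Module):
    """Classification head for a CPVT-GAP-style model: instead of reading a class
    token, average all patch tokens (translation-invariant) and classify."""
    def __init__(self, dim=192, num_classes=1000):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.fc = nn.Linear(dim, num_classes)

    def forward(self, tokens):             # tokens: (B, N, C), no class token required
        x = self.norm(tokens).mean(dim=1)  # global average pooling over the token sequence
        return self.fc(x)

tokens = torch.randn(2, 14 * 14, 192)      # output tokens of the last encoder block
print(GAPHead()(tokens).shape)             # torch.Size([2, 1000])
```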

4. Experimental Results

4.1. Generalization to Higher Resolutions

Direct evaluation on higher resolutions without fine-tuning. A simple PEG with a single layer of 3×3 depth-wise convolution is used here
  • With the 384×384 input images, the DeiT-tiny with learnable positional encodings degrades from 72.2% to 71.2%. When equipped with sine encoding, the tiny model degrades from 72.2% to 70.8%.
  • While DeiT-small obtains 81.5% top-1 accuracy, CPVT-B obtains 82.4% without any interpolation (illustrated below).
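
No interpolation is needed because the PEG’s convolution is defined over local neighborhoods rather than absolute token indices, so the same weights apply to any grid size. A tiny illustration (shapes only):

```python
import torch
import torch.nn as nn

peg = nn.Conv2d(192, 192, 3, padding=1, groups=192)   # 3×3 depth-wise conv PEG
for hw in (14, 24):                                    # 224×224 and 384×384 inputs with /16 patches
    tokens_2d = torch.randn(1, 192, hw, hw)
    print(peg(tokens_2d).shape)                        # (1, 192, 14, 14) then (1, 192, 24, 24)
```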

4.2. CPVT with Global Average Pooling (GAP)

Performance comparison of Class Token (CLT) and global average pooling (GAP) on ImageNet
  • For example, equipping CPVT-Ti with GAP obtains 74.9% top-1 accuracy on the ImageNet validation set, which outperforms DeiT-tiny by a large margin (+2.7%).
  • Moreover, it even exceeds the DeiT-tiny model trained with distillation (74.5%). In contrast, DeiT with GAP does not gain nearly as much (only +0.4%).

4.3. Complexity of PEG

  • DeiT-tiny utilizes learnable position encodings with 192×14×14 = 37,632 parameters. CPVT-Ti’s PEG introduces only 38,592 parameters, i.e., just 960 more, which is negligible compared to the 5.7M model parameters of DeiT-tiny (the arithmetic is reproduced below).
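
These counts can be reproduced under the assumption (mine, not stated explicitly above) that the PEG being counted is a 3×3 separable convolution, i.e., a depth-wise 3×3 followed by a point-wise 1×1 projection, without biases:

```python
d, k, grid = 192, 3, 14                   # embedding dim, kernel size, 14×14 patch grid
learnable_pe = d * grid * grid            # DeiT-tiny's learnable position embedding table
peg_separable = k * k * d + d * d         # depth-wise 3×3 + point-wise 1×1, no biases (assumed)
print(learnable_pe, peg_separable, peg_separable - learnable_pe)   # 37632 38592 960
```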

4.4. Comparison with SOTA

Comparison with ConvNets and Transformers on ImageNet
  • Notably, the proposed GAP version marks a new state of the art for Vision Transformers.
  • Using RegNetY-160 as the teacher for distillation, CPVT obtains 75.9% top-1 accuracy, exceeding DeiT-tiny by 1.4%.

4.5. Object Detection

Results on COCO 2017 val set
  • PEG is applied onto DETR. Compared to the original DETR model, only the positional encoding strategy of its encoder part is changed, which originally uses the absolute 2D sine and cosine positional encodings.
  • If the positional encodings are removed, the mAP of DETR degrades from 33.7% to 32.8%.
  • The same results hold for Deformable DETR [36]. (Hope I can review Deformable DETR in the future.)

5. Ablation Study

5.1. Positions of PEG in CPVT

Performance of different plugin positions using the architecture of DeiT-tiny on ImageNet
  • The input of the first encoder is denoted as index -1.
  • Placing the PEG at index 0 gives much better performance than placing it at -1 (a toy sketch of the insertion position is shown below).
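
To make the indexing concrete, here is a toy, self-contained sketch (not the authors’ code; generic nn.TransformerEncoderLayer blocks stand in for the DeiT blocks, and the class token is omitted) of inserting a single PEG at a configurable position:

```python
import torch
import torch.nn as nn

class MiniCPVT(nn.Module):
    """Toy sketch: a stack of encoder blocks with one depth-wise-conv PEG inserted at a
    configurable position. peg_pos=-1 applies the PEG to the input of the first block;
    peg_pos=0 applies it to the output of the first block, and so on."""
    def __init__(self, dim=192, depth=12, heads=3, peg_pos=0):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
            for _ in range(depth))
        self.peg = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)
        self.peg_pos = peg_pos

    def forward(self, x, H, W):                      # x: (B, H*W, C) patch tokens
        B, N, C = x.shape
        for i, blk in enumerate(self.blocks):
            if i == self.peg_pos + 1:                # -1 -> before block 0, 0 -> after block 0, ...
                grid = x.transpose(1, 2).reshape(B, C, H, W)
                x = x + self.peg(grid).flatten(2).transpose(1, 2)
            x = blk(x)
        return x

x = torch.randn(2, 14 * 14, 192)
print(MiniCPVT(depth=4, peg_pos=0)(x, 14, 14).shape)   # torch.Size([2, 196, 192])
```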

5.2. Single PEG vs. Multiple PEGs

CPVT’s sensitivity to the number of plugin positions
  • With multiple PEGs inserted, CPVT-S achieves a new state of the art (80.5%).

5.3. Comparisons with Other Positional Encodings

Comparison of various encoding strategies. LE: learnable encoding. RPE: relative positional encoding

5.4. Importance of Zero Paddings

ImageNet Performance w/ or w/o zero paddings

5.5. PEG on PVT

PEG on PVT
  • PEG can significantly boost PVT-tiny by 2.1% on ImageNet.
