Review — CPVT: Conditional Positional Encodings for Vision Transformers

CPVT, Conditional Position Encodings (CPE) Instead of Absolute Position Encodings

Comparison of CPVT and DeiT models under various configurations
  • Unlike previous fixed or learnable positional encodings, which are predefined and independent of the input tokens, Conditional Positional Encodings (CPE) are dynamically generated and conditioned on the local neighborhood of the input tokens.
  • As a result, CPE can easily generalize to input sequences that are longer than those seen during training.
  • CPE can keep the desired translation-invariance, improving accuracy.
  • (For a quick read, please see Sections 1–4.)

Outline

  1. Problems in ViT/DeiT
  2. Conditional Positional Encodings (CPE) Using Positional Encoding Generator (PEG)
  3. Conditional Positional Encoding Vision Transformers (CPVT)
  4. Experimental Results
  5. Ablation Study

1. Problems in ViT/DeiT

1.1. Results of DeiT with Different Positional Encodings

Comparison of various positional encoding (PE) strategies tested on ImageNet validation set in terms of the top-1 accuracy
  • Row 1: When the positional encodings are removed, DeiT-tiny’s performance on ImageNet degrades dramatically from 72.2% to 68.2% (compared with the original learnable positional encodings).
  • Row 4: Relative positional encoding cannot provide any absolute position information. The model with relative position encodings has inferior performance (70.5% vs. 72.2%).
  • Also, in DeiT, when the input image is larger than the default size, the position encodings have to be interpolated to match the longer token sequence, roughly as sketched below.
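
For concreteness, the following is a minimal sketch of this kind of 2-D interpolation of learnable position embeddings (hypothetical function name and shapes, not DeiT’s actual code), assuming a /16 patch embedding and a leading class token:

```python
import torch
import torch.nn.functional as F

def interpolate_pos_embed(pos_embed, new_hw, old_hw=(14, 14)):
    """Resize learnable position embeddings of shape (1, 1 + H*W, C), with a
    leading class-token embedding, from old_hw to new_hw via bicubic interpolation."""
    cls_pe, patch_pe = pos_embed[:, :1], pos_embed[:, 1:]                # split cls / patch parts
    C = patch_pe.shape[-1]
    patch_pe = patch_pe.reshape(1, old_hw[0], old_hw[1], C).permute(0, 3, 1, 2)  # (1, C, H, W)
    patch_pe = F.interpolate(patch_pe, size=new_hw, mode="bicubic", align_corners=False)
    patch_pe = patch_pe.permute(0, 2, 3, 1).reshape(1, -1, C)            # back to (1, H'*W', C)
    return torch.cat([cls_pe, patch_pe], dim=1)

# 224×224 with /16 patches -> 14×14 grid; 384×384 -> 24×24 grid
pe_224 = torch.randn(1, 1 + 14 * 14, 192)
print(interpolate_pos_embed(pe_224, new_hw=(24, 24)).shape)              # torch.Size([1, 577, 192])
```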

1.2. Requirements of a Successful Positional Encoding

  • The authors argue that a successful positional encoding for vision tasks should meet the following requirements:
  1. Making the input sequence permutation-variant but translation-invariant.
  2. Being inductive and able to handle sequences longer than those seen during training.
  3. Having the ability to provide absolute position information to a certain degree, which is important to the performance.

2. Conditional Positional Encodings (CPE) Using Positional Encoding Generator (PEG)

Schematic illustration of Positional Encoding Generator (PEG). d is the embedding size, N is the number of tokens. The function F can be a depth-wise convolution, a separable convolution, or other more complicated blocks.
  • To condition on the local neighbors, the flattened input sequence X of size B×N×C (as in DeiT) is first reshaped back to X′ of size B×H×W×C in the 2-D image space.
  • Then, a function F is repeatedly applied to the local patches in X′ to produce the conditional positional encodings E of size B×H×W×C.
  • PEG can be efficiently implemented with a 2-D convolution with kernel size k (k≥3) and (k−1)/2 zero padding. Note that the zero padding here is important to make the model aware of the absolute positions, and F can take various forms such as separable convolutions and many others.
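
Below is a minimal PyTorch sketch of such a PEG (a single 3×3 depth-wise convolution with zero padding 1). The way the class token is set aside here is a simplification of the idea, not the authors’ exact implementation:

```python
import torch
import torch.nn as nn

class PEG(nn.Module):
    """Minimal Positional Encoding Generator: a k×k depth-wise convolution with
    (k-1)//2 zero padding, applied to the tokens reshaped back onto the 2-D grid."""
    def __init__(self, dim=192, k=3):
        super().__init__()
        # groups=dim makes the convolution depth-wise; the padding keeps H×W unchanged
        self.proj = nn.Conv2d(dim, dim, k, stride=1, padding=(k - 1) // 2, groups=dim)

    def forward(self, x, H, W):
        # x: (B, 1 + H*W, C) with a leading class token, as in DeiT
        cls_tok, patch_tok = x[:, :1], x[:, 1:]
        B, N, C = patch_tok.shape
        feat = patch_tok.transpose(1, 2).reshape(B, C, H, W)   # back to the 2-D image space
        pos = self.proj(feat)                                  # conditional positional encoding E
        pos = pos.flatten(2).transpose(1, 2)                   # (B, H*W, C)
        return torch.cat([cls_tok, patch_tok + pos], dim=1)    # add CPE to the input sequence

x = torch.randn(2, 1 + 14 * 14, 192)   # 224×224 input, /16 patches
print(PEG()(x, 14, 14).shape)          # torch.Size([2, 197, 192])
```

Because the convolution weights are shared across positions and the zero padding at the borders is the only absolute reference, the resulting encoding moves with the image content while still conveying some absolute position, which matches the requirements listed in Section 1.2.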

3. Conditional Positional Encoding Vision Transformers (CPVT)

  • There are three model sizes: CPVT-Ti, CPVT-S, and CPVT-B.
(a) ViT with explicit 1-D learnable positional encodings (PE). (b) CPVT with conditional positional encodings from the proposed Position Encoding Generator (PEG) plugin, which is the default choice. (c) CPVT-GAP without the class token (cls), but with global average pooling (GAP).
  • Similar to the original positional encodings in DeiT, the conditional positional encodings (CPE) are also added to the input sequence.
  • Both DeiT and ViT utilize an extra learnable class token to perform classification (i.e., cls token shown in Fig. (a) and (b)). The class token is not translation-invariant although it can learn to be translation-invariant.
  • A simple alternative is to directly replace it with global average pooling (GAP), which is inherently translation-invariant. Therefore, CPVT-GAP is proposed, where the class token is replaced with global average pooling over all the patch tokens (a minimal sketch of such a head follows).
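
A minimal sketch of a GAP-based classification head along these lines (the exact placement of the normalization is an assumption, not the authors’ implementation):

```python
import torch
import torch.nn as nn

class GAPHead(nn.Module):
    """Classification head for a CPVT-GAP-style model: instead of reading a class
    token, average all patch tokens (translation-invariant) and classify."""
    def __init__(self, dim=192, num_classes=1000):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.fc = nn.Linear(dim, num_classes)

    def forward(self, tokens):             # tokens: (B, N, C), no class token required
        x = self.norm(tokens).mean(dim=1)  # global average pooling over the token sequence
        return self.fc(x)

tokens = torch.randn(2, 14 * 14, 192)      # output tokens of the last encoder block
print(GAPHead()(tokens).shape)             # torch.Size([2, 1000])
```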

4. Experimental Results

4.1. Generalization to Higher Resolutions

Direct evaluation on higher resolutions without fine-tuning. A simple PEG with a single layer of 3×3 depth-wise convolution is used here
  • With the 384×384 input images, the DeiT-tiny with learnable positional encodings degrades from 72.2% to 71.2%. When equipped with sine encoding, the tiny model degrades from 72.2% to 70.8%.
  • While DeiT-small obtains 81.5% top-1 accuracy, CPVT-B obtains 82.4% without any interpolation (illustrated below).
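
No interpolation is needed because the PEG’s convolution is defined over local neighborhoods rather than absolute token indices, so the same weights apply to any grid size. A tiny illustration (shapes only):

```python
import torch
import torch.nn as nn

peg = nn.Conv2d(192, 192, 3, padding=1, groups=192)   # 3×3 depth-wise conv PEG
for hw in (14, 24):                                    # 224×224 and 384×384 inputs with /16 patches
    tokens_2d = torch.randn(1, 192, hw, hw)
    print(peg(tokens_2d).shape)                        # (1, 192, 14, 14) then (1, 192, 24, 24)
```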

4.2. CPVT with Global Average Pooling (GAP)

Performance comparison of Class Token (CLT) and global average pooling (GAP) on ImageNet
  • For example, equipping CPVT-Ti with GAP obtains 74.9% top-1 accuracy on the ImageNet validation set, which outperforms DeiT-tiny by a large margin (+2.7%).
  • Moreover, it even exceeds the DeiT-tiny model trained with distillation (74.5%). In contrast, DeiT with GAP does not gain nearly as much (only +0.4%).

4.3. Complexity of PEG

  • DeiT-tiny utilizes learnable position encodings with 192×14×14 = 37,632 parameters. CPVT-Ti’s PEG introduces only 38,592 parameters, i.e., just 960 more, which is negligible compared to the 5.7M model parameters of DeiT-tiny (the arithmetic is reproduced below).
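
These counts can be reproduced under the assumption (mine, not stated explicitly above) that the PEG being counted is a 3×3 separable convolution, i.e., a depth-wise 3×3 followed by a point-wise 1×1 projection, without biases:

```python
d, k, grid = 192, 3, 14                   # embedding dim, kernel size, 14×14 patch grid
learnable_pe = d * grid * grid            # DeiT-tiny's learnable position embedding table
peg_separable = k * k * d + d * d         # depth-wise 3×3 + point-wise 1×1, no biases (assumed)
print(learnable_pe, peg_separable, peg_separable - learnable_pe)   # 37632 38592 960
```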

4.4. Comparison with SOTA

Comparison with ConvNets and Transformers on ImageNet
  • Notably, the proposed GAP version marks a new state of the art for Vision Transformers.
  • Using RegNetY-160 as the teacher for distillation, CPVT obtains 75.9% top-1 accuracy, exceeding DeiT-tiny by 1.4%.

4.5. Object Detection

Results on COCO 2017 val set
  • PEG is applied onto DETR. Compared to the original DETR model, only the positional encoding strategy of its encoder part is changed, which originally uses the absolute 2D sine and cosine positional encodings.
  • If the positional encodings are removed, the mAP of DETR degrades from 33.7% to 32.8%.
  • The same results hold for Deformable DETR [36]. (Hope I can review Deformable DETR in the future.)

5. Ablation Study

5.1. Positions of PEG in CPVT

Performance of different plugin positions using the architecture of DeiT-tiny on ImageNet
  • The input of the first encoder is denoted as index -1.
  • Placing the PEG at index 0 gives much better performance than placing it at -1 (a toy sketch of the insertion position is shown below).
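
To make the indexing concrete, here is a toy, self-contained sketch (not the authors’ code; generic nn.TransformerEncoderLayer blocks stand in for the DeiT blocks, and the class token is omitted) of inserting a single PEG at a configurable position:

```python
import torch
import torch.nn as nn

class MiniCPVT(nn.Module):
    """Toy sketch: a stack of encoder blocks with one depth-wise-conv PEG inserted at a
    configurable position. peg_pos=-1 applies the PEG to the input of the first block;
    peg_pos=0 applies it to the output of the first block, and so on."""
    def __init__(self, dim=192, depth=12, heads=3, peg_pos=0):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
            for _ in range(depth))
        self.peg = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)
        self.peg_pos = peg_pos

    def forward(self, x, H, W):                      # x: (B, H*W, C) patch tokens
        B, N, C = x.shape
        for i, blk in enumerate(self.blocks):
            if i == self.peg_pos + 1:                # -1 -> before block 0, 0 -> after block 0, ...
                grid = x.transpose(1, 2).reshape(B, C, H, W)
                x = x + self.peg(grid).flatten(2).transpose(1, 2)
            x = blk(x)
        return x

x = torch.randn(2, 14 * 14, 192)
print(MiniCPVT(depth=4, peg_pos=0)(x, 14, 14).shape)   # torch.Size([2, 196, 192])
```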

5.2. Single PEG vs. Multiple PEGs

CPVT’s sensitivity to the number of plugin positions
  • With multiple PEGs inserted, CPVT-S achieves a new state of the art (80.5%).

5.3. Comparisons with Other Positional Encodings

Comparison of various encoding strategies. LE: learnable encoding. RPE: relative positional encoding

5.4. Importance of Zero Paddings

ImageNet Performance w/ or w/o zero paddings

5.5. PEG on PVT

PEG on PVT
  • PEG can significantly boost PVT-tiny by 2.1% on ImageNet.
