Review — CaiT: Going Deeper with Image Transformers

Outperforms ViT, T2T-ViT, DeiT, FixEfficientNet, EfficientNet

Mar 13, 2022

Going Deeper with Image Transformers,
CaiT, by Facebook AI, and Sorbonne University
2021 ICCV, Over 100 Citations (Sik-Ho Tsang @ Medium)
Image Classification, Transformer, Vision Transformer, ViT

CaiT (Class-Attention in Image Transformers) is proposed.
LayerScale significantly facilitates the convergence and improves the accuracy of image transformers at larger depths.
Layers with specific class-attention offer a more effective processing of the class embedding.

Outline

1. Deeper Image Transformers with LayerScale
2. Specializing Layers for Class Attention
3. Experimental Results

1. Deeper Image Transformers with LayerScale

**From (a) ViT, to (d) ViT Using Proposed LayerScale**

(a) Vision Transformer (ViT): instantiates a particular form of residual architecture. After casting the input image into a set $x_0$ of vectors, the network alternates self-attention layers (SA) with feed-forward networks (FFN), as:

$$x'_l = x_l + \mathrm{SA}(\eta(x_l)), \qquad x_{l+1} = x'_l + \mathrm{FFN}(\eta(x'_l))$$

where η is the layer normalization.
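
The pre-norm residual pattern in (a) can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: it assumes a single attention head, a ReLU in place of GELU, a layer normalization without learned affine parameters, and hypothetical names (`vit_block`, `init_params`, the weight keys) chosen only for this example.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """eta: normalize each token vector to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(x, wq, wk, wv, wo):
    """Single-head self-attention over a (tokens, dim) array."""
    q, k, v = x @ wq, x @ wk, x @ wv
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return (attn @ v) @ wo

def ffn(x, w1, w2):
    """Two-layer MLP; ReLU stands in for GELU (assumption for brevity)."""
    return np.maximum(x @ w1, 0.0) @ w2

def vit_block(x, p):
    """x'_l = x_l + SA(eta(x_l)); x_{l+1} = x'_l + FFN(eta(x'_l))."""
    x = x + self_attention(layer_norm(x), p["wq"], p["wk"], p["wv"], p["wo"])
    x = x + ffn(layer_norm(x), p["w1"], p["w2"])
    return x

def init_params(dim, hidden, rng, scale=0.02):
    """Random weights for one block (illustrative initialization only)."""
    shapes = {"wq": (dim, dim), "wk": (dim, dim), "wv": (dim, dim),
              "wo": (dim, dim), "w1": (dim, hidden), "w2": (hidden, dim)}
    return {name: rng.normal(scale=scale, size=s) for name, s in shapes.items()}
```

Calling `vit_block(x, init_params(8, 32, np.random.default_rng(0)))` on an array `x` of shape `(tokens, 8)` returns an array of the same shape, since both branches are added back onto the residual stream.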
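(b) Fixup [75], ReZero [2] and SkipInit [16]: introduce a learnable scalar weighting $\alpha_l$ on the output of residual blocks, while removing the pre-normalization and the warmup:

$$x'_l = x_l + \alpha_l\,\mathrm{SA}(x_l), \qquad x_{l+1} = x'_l + \alpha'_l\,\mathrm{FFN}(x'_l)$$
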
The empirical observation in this paper is that removing the warmup and the layer normalization is what makes training unstable in Fixup and T-Fixup.
(c) Both Layer Norm and Learnable Scalar Weighting: When initialized at a small value, this choice does help the convergence when increasing the depth.
(d) LayerScale: is a per-channel multiplication of the vector produced by each residual block, as opposed to a single scalar.
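The objective is to group the updates of the weights associated with the same output channel. Formally, LayerScale is a multiplication by a diagonal matrix on the output of each residual block:

$$x'_l = x_l + \mathrm{diag}(\lambda_{l,1},\dots,\lambda_{l,d})\,\mathrm{SA}(\eta(x_l)), \qquad x_{l+1} = x'_l + \mathrm{diag}(\lambda'_{l,1},\dots,\lambda'_{l,d})\,\mathrm{FFN}(\eta(x'_l))$$

where the parameters $\lambda_{l,i}$ and $\lambda'_{l,i}$ are learnable weights and $d$ is the number of channels.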

LayerScale offers more diversity in the optimization than just adjusting the whole layer by a single learnable scalar.
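
To make the per-channel idea concrete, here is a small NumPy sketch of LayerScale applied to a residual branch. The class name `LayerScale`, the helper `residual_with_layerscale`, and the initial value `1e-4` are assumptions for illustration; the text above only states that the weights are learnable and that a small initialization helps convergence. The point of the sketch is that multiplying by a diagonal matrix is the same as an elementwise product over the channel axis, and that a single scalar weighting is the special case where all channel weights are equal.

```python
import numpy as np

class LayerScale:
    """Per-channel scaling of a residual branch output (illustrative helper)."""

    def __init__(self, dim, init_value=1e-4):
        # One weight per channel; init_value is an assumed small constant.
        self.lam = np.full(dim, init_value)

    def __call__(self, branch_out):
        # branch_out has shape (tokens, dim). Broadcasting over the last axis
        # is equivalent to right-multiplying by diag(lam).
        return branch_out * self.lam

def residual_with_layerscale(x, branch, scale):
    """x + diag(lam) * branch(x), where branch already includes its pre-norm."""
    return x + scale(branch(x))
```

With a small initial value, each block starts close to the identity map, and training is then free to grow each channel's weight independently rather than scaling the whole branch by one number.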

2. Specializing Layers for Class Attention

(Left) ViT: The class embedding (CLS) is inserted along with the patch embeddings.
(Middle): Inserting the CLS token later in the network improves the performance.
(Right) CaiT: Further proposes to freeze the patch embeddings when inserting CLS to save compute, so that the last part of the network (typically 2 layers) is fully devoted to summarizing the information to be fed to the linear classifier.
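
A class-attention step along these lines can be sketched as follows. This is a simplified reading, not the paper's code: it assumes a single head, omits layer normalization, LayerScale and the FFN, and assumes that the query comes from the CLS token alone while keys and values come from the CLS token concatenated with the frozen patch embeddings. The function name `class_attention` and its weight arguments are hypothetical.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def class_attention(cls, patches, wq, wk, wv, wo):
    """Update only the class embedding.

    cls:     (1, dim) class embedding
    patches: (n, dim) patch embeddings, read but never modified here
    Returns the updated class embedding and the (1, n + 1) attention weights.
    """
    z = np.concatenate([cls, patches], axis=0)       # (n + 1, dim)
    q = cls @ wq                                     # query from CLS only
    k, v = z @ wk, z @ wv                            # keys/values from all tokens
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))   # (1, n + 1)
    new_cls = cls + (attn @ v) @ wo                  # residual update of CLS
    return new_cls, attn
```

Because only one query is used, the attention map has shape `(1, n + 1)` instead of `(n + 1, n + 1)`, which is where the compute saving in these last layers comes from in this sketch.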