Review — CaiT: Going Deeper with Image Transformers

Outperforms ViT, T2T-ViT, DeiT, FixEfficientNet, EfficientNet

Sik-Ho Tsang
4 min read · Mar 13, 2022

Going Deeper with Image Transformers,
CaiT, by Facebook AI, and Sorbonne University
2021 ICCV, Over 100 Citations (Sik-Ho Tsang @ Medium)
Image Classification, Transformer, Vision Transformer, ViT

  • CaiT (Class-Attention in Image Transformers) is proposed.
  • LayerScale significantly facilitates the convergence and improves the accuracy of image transformers at larger depths.
  • Layers with specific class-attention offer a more effective processing of the class embedding.


  1. Deeper Image Transformers with LayerScale
  2. Specializing Layers for Class Attention
  3. Experimental Results

1. Deeper Image Transformers with LayerScale

From (a) ViT, to (d) ViT Using Proposed LayerScale
  • (a) Vision Transformer (ViT): instantiates a particular form of residual architecture. After casting the input image into a set x_0 of vectors, the network alternates self-attention layers (SA) with feed-forward networks (FFN):
      x'_l = x_l + SA(η(x_l))
      x_{l+1} = x'_l + FFN(η(x'_l))
  • where η is the layer normalization.
  • (b) Fixup [75], ReZero [2] and SkipInit [16]: introduce learnable scalar weightings α_l and α'_l on the output of residual blocks, while removing the pre-normalization and the warmup:
      x'_l = x_l + α_l SA(x_l)
      x_{l+1} = x'_l + α'_l FFN(x'_l)
  • The empirical observation in this paper is that removing the warmup and the layer normalization is what makes training unstable in Fixup and T-Fixup.
  • (c) Both Layer Norm and Learnable Scalar Weighting: When initialized at a small value, this choice does help the convergence when increasing the depth.
  • (d) LayerScale: is a per-channel multiplication of the vector produced by each residual block, as opposed to a single scalar.
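  • The objective is to group the updates of the weights associated with the same output channel. Formally, LayerScale is a multiplication by a diagonal matrix on the output of each residual block:
      x'_l = x_l + diag(λ_{l,1}, …, λ_{l,d}) × SA(η(x_l))
      x_{l+1} = x'_l + diag(λ'_{l,1}, …, λ'_{l,d}) × FFN(η(x'_l))
  • where the parameters λ_{l,i} and λ'_{l,i} are learnable weights and d is the number of channels.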

LayerScale offers more diversity in the optimization than just adjusting the whole layer by a single learnable scalar.
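To make the per-channel scaling concrete, here is a minimal NumPy sketch of one residual branch with LayerScale. It is an illustration under stated assumptions, not the authors' implementation: `toy_branch` is a hypothetical stand-in for SA or FFN, the layer normalization has no learned affine parameters, and the initial value of 1e-4 is chosen only for the demo.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each token vector to zero mean and unit variance (no learned affine)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def residual_with_layerscale(x, branch, lam):
    """Compute x + diag(lam) * branch(layer_norm(x)) for every token.

    x:      (num_tokens, d) token embeddings
    branch: callable standing in for SA or FFN, mapping (n, d) -> (n, d)
    lam:    (d,) per-channel scales, i.e. the diagonal of the LayerScale matrix
    """
    # Broadcasting lam over the token axis is the same as applying diag(lam)
    # to each token vector.
    return x + lam * branch(layer_norm(x))

rng = np.random.default_rng(0)
d = 8
x = rng.normal(size=(4, d))
W = rng.normal(size=(d, d))

def toy_branch(h):
    """Hypothetical stand-in for an SA or FFN block."""
    return np.tanh(h @ W)

lam = np.full(d, 1e-4)  # small initial value, assumed here for illustration
y = residual_with_layerscale(x, toy_branch, lam)
```

With a small `lam`, the block starts out close to the identity (`y` is nearly equal to `x`), and each channel can then learn its own scale, whereas variants (b) and (c) share a single scalar across all channels.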

2. Specializing Layers for Class Attention

CLS Token Places and Interactions
  • (Left) ViT: The class embedding (CLS) is inserted along with the patch embeddings.
  • (Middle): Inserting the CLS token later in the network improves the performance.
  • (Right) CaiT: Further proposes to freeze the patch embeddings when inserting CLS to save compute, so that the last part of the network (typically 2 layers) is fully devoted to summarizing the information to be fed to the linear classifier.
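The class-attention stage can be sketched roughly as follows. This is a simplified single-head NumPy illustration, not the paper's code: it assumes the query comes from the CLS token only, while keys and values are computed over the CLS token concatenated with the frozen patch embeddings. Layer normalization, the FFN, the output projection and LayerScale are omitted, and the weights are random placeholders.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def class_attention(cls_tok, patches, Wq, Wk, Wv):
    """Single-head class attention: only the CLS token issues a query.

    cls_tok: (1, d) class embedding
    patches: (n, d) patch embeddings, read but never updated
    Returns the CLS update (1, d) and the attention weights (1, n + 1).
    """
    z = np.concatenate([cls_tok, patches], axis=0)   # (n + 1, d)
    q = cls_tok @ Wq                                 # query from CLS only
    k, v = z @ Wk, z @ Wv                            # keys/values from CLS + patches
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))   # (1, n + 1)
    return attn @ v, attn

rng = np.random.default_rng(0)
n, d = 16, 8
patches = rng.normal(size=(n, d))    # output of the self-attention stage, now frozen
cls_tok = rng.normal(size=(1, d))    # CLS inserted late (learnable in practice)
patches_before = patches.copy()

for _ in range(2):                   # "typically 2 layers"
    # Random placeholder weights; a trained model would learn these.
    Wq, Wk, Wv = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))
    update, attn = class_attention(cls_tok, patches, Wq, Wk, Wv)
    cls_tok = cls_tok + update       # residual update of the CLS token only
```

Because only one query is formed, each class-attention layer costs time linear in the number of patches, and `patches` is left untouched throughout the stage.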

3. Experimental Results

3.1. LayerScale

Improving convergence at depth on ImageNet-1k

LayerScale outperforms other weighting variants and baselines.

3.2. Class-Attention Stage

Variations on CLS with DeiT-Small (no LayerScale)
  • Using late CLS insertion obtains better results.

With the class-attention stage, further improvement is observed.

3.3. CaiT Model Variants

CaiT Model Variants

CaiT model variants are constructed from XXS-24 to M-36.

3.4. SOTA Comparison

SOTA Comparison
  • CaiT can go deeper with better performance.

CaiT obtains higher accuracy compared with others.

Results in transfer learning

CaiT obtains better performance after being fine-tuned on downstream tasks.

Ablation path from DeiT-S to the CaiT models

Besides the CaiT-specific techniques, techniques from other papers, such as the distillation used in DeiT, are also applied.

Illustration of the regions of focus of a CaiT-XXS model, according to the response of the first class-attention layer (only some of the examples are shown here)


