Review — CaiT: Going Deeper with Image Transformers

  • CaiT (Class-Attention in Image Transformers) is proposed.
  • LayerScale significantly facilitates the convergence and improves the accuracy of image transformers at larger depths.
  • Layers with specific class-attention offer a more effective processing of the class embedding.


  1. Deeper Image Transformers with LayerScale
  2. Specializing Layers for Class Attention
  3. Experimental Results

1. Deeper Image Transformers with LayerScale

From (a) ViT, to (d) ViT Using Proposed LayerScale
  • (a) Vision Transformer (ViT) instantiates a particular form of residual architecture: after casting the input image into a set x_0 of vectors, the network alternates self-attention layers (SA) with feed-forward networks (FFN), as:
        x'_l = x_l + SA(η(x_l))
        x_{l+1} = x'_l + FFN(η(x'_l))
  • where η is the layer normalization.
  • (b) Fixup [75], ReZero [2] and SkipInit [16] introduce a learnable scalar weighting α_l on the output of residual blocks, while removing the pre-normalization and the warmup:
        x'_l = x_l + α_l SA(x_l)
        x_{l+1} = x'_l + α'_l FFN(x'_l)
  • The empirical observation in this paper is that removing the warmup and the layer normalization is what makes training unstable in Fixup and T-Fixup.
  • (c) Both Layer Norm and Learnable Scalar Weighting: keeping the layer normalization and adding the learnable scalar, initialized at a small value, does help convergence when increasing the depth.
  • (d) LayerScale is a per-channel multiplication of the vector produced by each residual block, as opposed to a single scalar.
  • The objective is to group the updates of the weights associated with the same output channel. Formally, LayerScale is a multiplication by a diagonal matrix on the output of each residual block:
        x'_l = x_l + diag(λ_{l,1}, …, λ_{l,d}) × SA(η(x_l))
        x_{l+1} = x'_l + diag(λ'_{l,1}, …, λ'_{l,d}) × FFN(η(x'_l))
  • where the parameters λ_{l,i} and λ'_{l,i} are learnable weights and d is the number of channels.
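
To make the per-channel idea concrete, here is a minimal NumPy sketch of one LayerScale residual branch. It is only an illustration under stated assumptions, not the authors' implementation: the sub-layer is a stand-in random linear map rather than a real SA or FFN, and the sizes and the initial value init_value = 1e-5 are hypothetical choices made for this example.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each token vector over its channels (the η in the text).
    The learnable gain and bias of a real LayerNorm are omitted here."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def layerscale_residual(x, sublayer, lam):
    """Compute x + diag(lam) applied to sublayer(η(x)), token by token.

    Multiplying by a diagonal matrix is the same as an elementwise product
    with the vector `lam`, which broadcasts over the token axis.
    """
    return x + lam * sublayer(layer_norm(x))

rng = np.random.default_rng(0)
num_tokens, dim = 4, 8                    # hypothetical sizes
x = rng.normal(size=(num_tokens, dim))

# Stand-in for SA or FFN (assumption): a fixed random linear map.
w = rng.normal(size=(dim, dim))
sublayer = lambda h: h @ w

init_value = 1e-5                         # illustrative small initial value
lam = np.full(dim, init_value)            # one learnable weight per channel

out = layerscale_residual(x, sublayer, lam)

# With small lam, the block starts out close to the identity mapping.
max_change = np.abs(out - x).max()

# Same computation written with the explicit diagonal matrix.
out_diag = x + sublayer(layer_norm(x)) @ np.diag(lam)
```

Variant (c) would correspond to replacing the vector lam by a single scalar shared across all channels. LayerScale instead keeps d weights per residual branch, so each output channel can be rescaled independently.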

2. Specializing Layers for Class Attention

CLS Token Places and Interactions
  • (Left) ViT: The class embedding (CLS) is inserted along with the patch embeddings.
  • (Middle) Late insertion: inserting the CLS token at a later layer, rather than at the input, improves the performance.
  • (Right) CaiT: additionally freezes the patch embeddings once the CLS token is inserted, which saves compute. The last part of the network (typically 2 layers) is then fully devoted to summarizing the information to be fed to the linear classifier.
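
The class-attention stage can be sketched in a few lines of NumPy. This is a simplified reading, not the paper's code: it uses a single head, no biases, and no LayerNorm, FFN or LayerScale, and it assumes that keys and values are computed from the CLS token together with the frozen patch embeddings while only the CLS token issues a query. All function names, sizes and random weights are hypothetical.

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def class_attention(x_cls, x_patches, w_q, w_k, w_v, w_o):
    """Single-head class-attention: only the CLS token issues a query.

    x_cls:     (1, d) class embedding
    x_patches: (p, d) patch embeddings, read but never modified
    """
    z = np.concatenate([x_cls, x_patches], axis=0)    # (1 + p, d)
    q = x_cls @ w_q                                   # (1, d), query from CLS only
    k = z @ w_k                                       # (1 + p, d)
    v = z @ w_v                                       # (1 + p, d)
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))    # (1, 1 + p)
    return (attn @ v) @ w_o, attn                     # (1, d) update for CLS

rng = np.random.default_rng(0)
num_patches, dim = 6, 8                               # hypothetical sizes
x_patches = rng.normal(size=(num_patches, dim))       # output of the self-attention stage
x_cls = np.zeros((1, dim))                            # CLS token inserted late

# Two class-attention layers, each with its own random projection weights.
layers = [[rng.normal(size=(dim, dim)) / np.sqrt(dim) for _ in range(4)]
          for _ in range(2)]

patches_before = x_patches.copy()
for w_q, w_k, w_v, w_o in layers:
    update, attn = class_attention(x_cls, x_patches, w_q, w_k, w_v, w_o)
    x_cls = x_cls + update                            # residual update of CLS only
```

Because there is a single query, the attention map has one row of length 1 + p, so the cost of this stage grows linearly with the number of patches, and the patch embeddings are left untouched.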

3. Experimental Results

3.1. LayerScale

Improving convergence at depth on ImageNet-1k

3.2. Class-Attention Stage

Variations on CLS with DeiT-Small (no LayerScale)
  • Late CLS insertion obtains better results than inserting CLS at the input.

3.3. CaiT Model Variants

CaiT Model Variants

3.4. SOTA Comparison

SOTA Comparison
  • CaiT can go deeper with better performance.
Results in transfer learning
Ablation path from DeiT-S to our CaiT models
Illustration of the regions of focus of a CaiT-XXS model, according to the response of the first class-attention layer (only some of the examples are shown here)


