Review — Patches Are All You Need?

ConvMixer, Using Depthwise Convolution Instead of Self-Attention or MLP

Sik-Ho Tsang
5 min read · May 12, 2023

Patches Are All You Need?
ConvMixer, by Carnegie Mellon University, and Bosch Center for AI
2023 TMLR, Over 140 Citations (Sik-Ho Tsang @ Medium)

Image Classification: 1989 … 2022 [ConvNeXt] [PVTv2] [ViT-G] [AS-MLP] [ResTv2] [CSWin Transformer] [Pale Transformer] [Sparse MLP] [MViTv2] [S²-MLP] [CycleMLP] [MobileOne] [GC ViT] [VAN] [ACMix] [CVNets] [MobileViT] [RepMLP] [RepLKNet] [MetaFormer, PoolFormer] [Swin Transformer V2] [hMLP] [DeiT III] [GhostNetV2] 2023 [Vision Permutator (ViP)]
==== My Other Paper Readings Are Also Over Here ====

  • A question is raised in this paper:

Is the performance of ViTs due to the inherently-more-powerful Transformer architecture, or is it at least partly due to using patches as the input representation?

  • The authors present some evidence for the latter: specifically, they propose ConvMixer, an extremely simple model similar in spirit to ViT and the even-more-basic MLP-Mixer in that it operates directly on patches as input, separates the mixing of spatial and channel dimensions, and maintains equal size and resolution throughout the network.
  • In contrast, however, the ConvMixer uses only standard convolutions to achieve the mixing steps.

Outline

  1. ConvMixer
  2. Results
  3. Results (Appendix)

1. ConvMixer

ConvMixer

ConvMixer consists of a patch embedding layer followed by repeated applications of a simple fully-convolutional block.

1.1. Patch Embedding

  • The spatial structure of the patch embeddings is maintained, as illustrated in the above figure.

Patch embeddings with patch size p and embedding dimension h can be implemented as a convolution with c_in input channels, h output channels, kernel size p, and stride p:
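In the paper this step is written as z₀ = BN(σ(Conv_{c_in→h}(X, stride=p, kernel size=p))). Below is a minimal PyTorch sketch of the layer (illustrative names, not the authors' code; the GELU activation is carried over from the block description in the next subsection):

```python
import torch
import torch.nn as nn

# Patch embedding as a strided convolution: c_in input channels, h output
# channels, kernel size p, stride p, followed by an activation and BatchNorm.
def patch_embedding(c_in: int, h: int, p: int) -> nn.Sequential:
    return nn.Sequential(
        nn.Conv2d(c_in, h, kernel_size=p, stride=p),
        nn.GELU(),
        nn.BatchNorm2d(h),
    )

x = torch.randn(1, 3, 224, 224)      # one 224x224 RGB image
z = patch_embedding(3, 768, 7)(x)    # -> (1, 768, 32, 32): a 32x32 grid of patch embeddings
print(z.shape)
```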

1.2. ConvMixer Block

The ConvMixer block itself consists of depthwise convolution (i.e., grouped convolution with groups equal to the number of channels, h) followed by pointwise (i.e., kernel size 1×1) convolution.

  • ConvMixers work best with unusually large kernel sizes for the depthwise convolution. Each of the convolutions is followed by an activation and post-activation BatchNorm:
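A minimal PyTorch sketch of one block (illustrative, not the authors' code), assuming GELU activations and the residual connection around the depthwise convolution shown in the paper's figure:

```python
import torch.nn as nn

class Residual(nn.Module):
    """Adds the input back to the output of the wrapped module."""
    def __init__(self, fn: nn.Module):
        super().__init__()
        self.fn = fn

    def forward(self, x):
        return self.fn(x) + x

def convmixer_block(h: int, k: int) -> nn.Sequential:
    return nn.Sequential(
        # Depthwise convolution (groups = h) with a large kernel k, wrapped in a
        # residual connection; "same" padding keeps the spatial resolution unchanged.
        Residual(nn.Sequential(
            nn.Conv2d(h, h, kernel_size=k, groups=h, padding="same"),
            nn.GELU(),
            nn.BatchNorm2d(h),
        )),
        # Pointwise (1x1) convolution to mix channels.
        nn.Conv2d(h, h, kernel_size=1),
        nn.GELU(),
        nn.BatchNorm2d(h),
    )
```

The large-kernel depthwise convolution mixes spatial information within each channel, while the 1×1 pointwise convolution mixes information across channels at each spatial location.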

1.3. Output

  • After many applications of the ConvMixer block, we perform global pooling to get a feature vector of size h, which is passed to a softmax classifier.

1.4. Model Variants

  • An instantiation of ConvMixer depends on four parameters:
  1. The “width” or hidden dimension h (i.e., the dimension of the patch embeddings);
  2. The depth d, or the number of repetitions of the ConvMixer layer;
  3. The patch size p which controls the internal resolution of the model;
  4. The kernel size k of the depthwise convolutional layer.

ConvMixer-h/d: Different ConvMixer variants are named by their hidden dimension h and depth d.
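Putting the four parameters together, here is a minimal PyTorch sketch of a full ConvMixer (illustrative, not the authors' code); instantiating it with h=1536, d=20, p=7, k=9 gives a parameter count consistent with the ~52M reported in the next section:

```python
import torch.nn as nn

class Residual(nn.Module):
    """y = f(x) + x : residual connection around the depthwise convolution."""
    def __init__(self, fn: nn.Module):
        super().__init__()
        self.fn = fn
    def forward(self, x):
        return self.fn(x) + x

def conv_mixer(h: int, d: int, p: int = 7, k: int = 9,
               c_in: int = 3, n_classes: int = 1000) -> nn.Sequential:
    """ConvMixer-h/d: patch embedding, d ConvMixer blocks, global average
    pooling, and a linear (softmax) classifier."""
    return nn.Sequential(
        nn.Conv2d(c_in, h, kernel_size=p, stride=p),           # patch embedding
        nn.GELU(),
        nn.BatchNorm2d(h),
        *[nn.Sequential(
            Residual(nn.Sequential(                            # depthwise (spatial mixing)
                nn.Conv2d(h, h, kernel_size=k, groups=h, padding="same"),
                nn.GELU(),
                nn.BatchNorm2d(h))),
            nn.Conv2d(h, h, kernel_size=1),                    # pointwise (channel mixing)
            nn.GELU(),
            nn.BatchNorm2d(h),
        ) for _ in range(d)],
        nn.AdaptiveAvgPool2d((1, 1)),                          # global pooling -> vector of size h
        nn.Flatten(),
        nn.Linear(h, n_classes),
    )

model = conv_mixer(1536, 20, p=7, k=9)                         # ConvMixer-1536/20
print(sum(t.numel() for t in model.parameters()) / 1e6)        # ≈ 51.6 (million), i.e. ~52M
```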

2. Results

2.1. ImageNet

ImageNet

A ConvMixer-1536/20 with 52M parameters can achieve 81.4% top-1 accuracy on ImageNet, and a ConvMixer-768/32 with 21M parameters achieves 80.2%.

  • Wider ConvMixers seem to converge in fewer epochs, but are memory- and compute-hungry.
  • They also work best with large kernel sizes: ConvMixer-1536/20 lost ≈1% accuracy when reducing the kernel size from k=9 to k=3.
  • ConvMixers with smaller patches are substantially better. The authors believe larger patches may require deeper ConvMixers.
  • With everything held equal except increasing the patch size from 7 to 14, ConvMixer-1536/20 achieves 78.9% top-1 accuracy but is around 4× faster, since for 224×224 inputs the internal resolution drops from 32×32 to 16×16, i.e., 4× fewer spatial positions per layer.
  • A ConvMixer trained with ReLU instead of GELU shows that GELU isn’t necessary.
ImageNet

ConvMixers achieve competitive accuracies for a given parameter budget:

ConvMixer-1536/20 outperforms both ResNet-152 and ResMLP-B24 despite having substantially fewer parameters and is competitive with DeiT-B.

ConvMixer-768/32 uses just a third of the parameters of ResNet-152, but is similarly accurate.

2.2. CIFAR-10

On smaller-scale CIFAR-10, ConvMixers achieve over 96% accuracy with as few as 0.7M parameters, demonstrating the data efficiency of the convolutional inductive bias.

2.3. Comments on Paper Length

Comments on Paper Length in TMLR
  • The authors also comment on the paper length.
Comments on Paper Length in an Older Version

In an older version, the authors made even more direct comments. Please feel free to read them above.

3. Results (Appendix)

3.1. Detailed Comparisons on ImageNet

ImageNet
  • Detailed Comparisons are shown in the paper appendix.

Although ConvMixer has a lower parameter count, its inference is not fast and its throughput is not as high as expected. This is because of the depthwise convolution. (This fact is also well known in the literature.)

3.2. Ablation Studies on ImageNet

Ablation Studies

Table 3: Larger patch sizes result in lower accuracy, while smaller patches increase accuracy.

Table 4: Patches are a good choice of input representation, and may even improve the performance of existing models.

Table 5: 9×9 kernels strongly outperform 3×3 kernels.

Table 6: The choice of activation function (ReLU vs. GELU) and norm layer (BatchNorm vs. LayerNorm) does not have a large impact on performance.

Figure 4: There is rapid growth of inference time for kernel sizes 7 and 9 compared to 3 and 5. Yet, ConvMixers can handle variable input sizes with no modifications.
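Variable input sizes work because every layer is convolutional and the classifier sits on top of adaptive global average pooling, so the patch grid can be any size. A tiny self-contained check (illustrative layer sizes, not the paper's models):

```python
import torch
import torch.nn as nn

# Resolution independence: the patch grid grows or shrinks with the input,
# and adaptive global pooling always reduces it to a single feature vector.
net = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=7, stride=7),                    # patch embedding, p = 7
    nn.Conv2d(64, 64, kernel_size=9, groups=64, padding="same"),  # large-kernel depthwise conv
    nn.AdaptiveAvgPool2d((1, 1)),
    nn.Flatten(),
    nn.Linear(64, 1000),
)

for size in (224, 288, 384):
    x = torch.randn(1, 3, size, size)
    print(size, net(x).shape)          # torch.Size([1, 1000]) for every input size
```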

(There are also detailed results on CIFAR-10, and kernel visualizations. Please feel free to read the paper directly.)
