Review — Are Convolutional Neural Networks or Transformers More Like Human Vision?

Shape Bias Analysis on ViT and ResNet

Sik-Ho Tsang
5 min read · Apr 17, 2023
Bird’s eye view of (a) convolutional and (b) attention-based networks.

Are Convolutional Neural Networks or Transformers More Like Human Vision?,
Tuli CogSci21, by Princeton University, DeepMind, and UC Berkeley,
2021 CogSci, Over 80 Citations (Sik-Ho Tsang @ Medium)

Image Classification
1989 … 2022 [ConvNeXt] [PVTv2] [ViT-G] [AS-MLP] [ResTv2] [CSWin Transformer] [Pale Transformer] [Sparse MLP] [MViTv2] [S²-MLP] [CycleMLP] [MobileOne] [GC ViT] [VAN] [ACMix] [CVNets] [MobileViT] [RepMLP] [RepLKNet] [ParNet] 2023 [Vision Permutator (ViP)]
==== My Other Paper Readings Are Also Over Here ====

  • In this paper, the authors investigate whether ViTs give more human-like representations than CNNs by measuring error consistency.
  • Shape bias is also analyzed for the ViT and CNN models presented in the paper.

Outline

  1. Error Consistency
  2. Shape Bias Results
  3. Results After Fine-Tuning

1. Error Consistency

To investigate whether Transformers give more human-like representations than CNNs, a common approach in ML, AI, cognitive science, and behavioral neuroscience is to check whether two decision makers (be they humans or AI models) use the same strategy to solve a given task.

1.1. Error Overlap

  • The observed error overlap between two systems i and j over n trials is given by:

    c_obs = e_ij / n

  • where e_ij is how often the two systems “agree”, i.e., how often they both classify correctly or both classify incorrectly (a code sketch follows below).
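
Below is a minimal sketch (not code from the paper) of the observed error overlap, assuming we have per-trial correctness vectors for two systems; the toy arrays `human` and `model` are made up for illustration:

```python
import numpy as np

def observed_error_overlap(correct_i, correct_j):
    """c_obs = e_ij / n: fraction of the n trials on which the two systems
    "agree", i.e. both classify correctly or both classify incorrectly."""
    correct_i = np.asarray(correct_i, dtype=bool)
    correct_j = np.asarray(correct_j, dtype=bool)
    return (correct_i == correct_j).mean()

# Toy per-trial correctness vectors for two systems (8 trials)
human = np.array([1, 1, 0, 1, 0, 1, 1, 0], dtype=bool)
model = np.array([1, 0, 0, 1, 0, 1, 1, 1], dtype=bool)
print(observed_error_overlap(human, model))  # 6 agreements / 8 trials = 0.75
```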

1.2. Cohen’s κ

  • Say each system i has an accuracy p_i, the probability that it makes a correct decision.
  • The expected overlap is calculated by treating i and j as independent binomial observers with their accuracies as the respective probabilities:

    c_exp = p_i · p_j + (1 − p_i)(1 − p_j)

  • The expected overlap can then be used to normalize the observed error overlap, giving a measure of error consistency known as Cohen’s κ:

    κ = (c_obs − c_exp) / (1 − c_exp)

  • Yet Cohen’s κ does not take into account which class a system misclassifies an image as when it makes an error. A sketch of the computation follows below.
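
Continuing the same toy setup, a sketch of Cohen’s κ as defined above (again illustrative, not the authors’ code):

```python
import numpy as np

def cohens_kappa(correct_i, correct_j):
    """Error consistency (Cohen's kappa): observed overlap corrected for
    the overlap expected from two independent binomial observers."""
    correct_i = np.asarray(correct_i, dtype=bool)
    correct_j = np.asarray(correct_j, dtype=bool)
    c_obs = (correct_i == correct_j).mean()        # observed error overlap
    p_i, p_j = correct_i.mean(), correct_j.mean()  # accuracies of i and j
    c_exp = p_i * p_j + (1 - p_i) * (1 - p_j)      # expected overlap
    return (c_obs - c_exp) / (1 - c_exp)

# Same toy correctness vectors as in the previous sketch
human = np.array([1, 1, 0, 1, 0, 1, 1, 0], dtype=bool)
model = np.array([1, 0, 0, 1, 0, 1, 1, 1], dtype=bool)
print(cohens_kappa(human, model))  # ~0.47: they agree more often than chance predicts
```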

1.3. Confusion Matrix Comparison

  • A probability distribution of errors over C classes can be generated by counting how many times elements from each class are misclassified and normalizing by the total number of errors made.
  • However, this distribution is very sparse; one solution is to cluster the classes into higher-level categories, giving 16 so-called “entry-level” categories and producing a 16×16 confusion matrix for each model.
  • The error term for every class is:

    p_i = e_i / Σ_c e_c

  • where e_i is the count of errors on class i for a given system.
  • The Jensen-Shannon (JS) distance between these error distributions can then be computed:

    JSD(p, q) = √( (D(p ‖ m) + D(q ‖ m)) / 2 )

  • where m is the point-wise mean of the two probability distributions p and q (i.e., m_i = (p_i + q_i)/2, p and q being the probability distributions of errors of the two systems), and D is the KL divergence:

    D(p ‖ q) = Σ_i p_i log(p_i / q_i)

  • The JS distance is a symmetrized and smoothed version of the Kullback–Leibler divergence. A lower JS distance implies higher error consistency between the classifiers (see the sketch after this list).
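
A sketch of the JS distance, assuming p and q are the two systems’ error distributions; SciPy’s jensenshannon (natural-log base by default) is printed only as a cross-check:

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

def js_distance(p, q):
    """Jensen-Shannon distance: square root of the mean KL divergence of
    p and q to their point-wise mean m = (p + q) / 2."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    p, q = p / p.sum(), q / q.sum()   # make sure both are distributions
    m = (p + q) / 2

    def kl(a, b):                     # D(a || b), with 0 * log(0/x) = 0
        mask = a > 0
        return np.sum(a[mask] * np.log(a[mask] / b[mask]))

    return np.sqrt((kl(p, m) + kl(q, m)) / 2)

p = [0.5, 0.3, 0.2, 0.0]              # error distribution of system 1
q = [0.4, 0.4, 0.1, 0.1]              # error distribution of system 2
print(js_distance(p, q))
print(jensenshannon(p, q))            # SciPy gives the same value
```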

Columns (predicted labels) of the confusion matrix are collapsed, and the accumulated error for each of the 16 true classes is:

    e_i = Σ_{j ≠ i} CM[i, j]

  • where CM is the confusion matrix (rows indexing the true class, columns the predicted class).
  • In this context, the class-wise JS distance compares which classes were misclassified, for a given number of output classes (16 in this case).

Using the off-diagonal entries of the confusion matrix directly (16×16 − 16 = 240 entries) as the error distribution instead gives the inter-class JS distance, which also compares what each class was misclassified as. A sketch of how both distributions can be built is given below.
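
As a sketch, assuming `cm` is a 16×16 confusion matrix with rows indexing the true class and columns the predicted class, the two error distributions could be built as follows (`error_distributions` is an illustrative helper, not from the paper):

```python
import numpy as np

def error_distributions(cm):
    """Build the class-wise (16 bins) and inter-class (240 bins) error
    distributions from a confusion matrix (rows = true, cols = predicted)."""
    cm = np.asarray(cm, dtype=float)
    n = cm.shape[0]
    off_diag = cm * (1 - np.eye(n))                 # keep only misclassifications

    class_wise = off_diag.sum(axis=1)               # collapse predicted labels
    class_wise = class_wise / class_wise.sum()      # p_i = e_i / sum_c e_c

    inter_class = off_diag[~np.eye(n, dtype=bool)]  # 16*16 - 16 = 240 entries
    inter_class = inter_class / inter_class.sum()
    return class_wise, inter_class

# Usage with the js_distance() sketch above, given two 16x16 confusion matrices:
# cw_h, ic_h = error_distributions(cm_human)
# cw_m, ic_m = error_distributions(cm_model)
# classwise_jsd  = js_distance(cw_h, cw_m)
# interclass_jsd = js_distance(ic_h, ic_m)
```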

2. Shape Bias Results

2.1. Models

  • The ViT and ResNet models used were pre-trained on ImageNet-21K (also known as the “Full ImageNet, Fall 2011 release”) and ILSVRC-2012 datasets (Russakovsky et al., 2015).
  • The ViT models used are ViT-B/16, ViT-B/32, ViT-L/16, and ViT-L/32, and the ResNet model used is BiT-M-R50x1.

2.2. Datasets

(left) Original image from ImageNet, and (right) a textured transform.
  • These models are tested on a specially designed diagnostic dataset, the Stylized ImageNet (SIN) dataset, where cue conflicts between texture and shape are generated by texture-based style transfer.

2.3. Results

Error consistency results on Stylized ImageNet (SIN) dataset.
  • A higher Cohen’s κ and a lower JS distance each indicate greater error consistency.

For Cohen’s κ and the inter-class JS distance, ViT is more consistent with humans than ResNet.

Shape bias for different networks for the Stylized ImageNet (SIN) dataset. Vertical lines indicate averages.
  • Small bar plots on the right indicate accuracy (the answer corresponds to either the correct texture or the correct shape category).
  • It can be seen that ViT has a higher shape bias than traditional CNNs. A sketch of how shape bias is computed on such cue-conflict images follows below.
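
The post does not spell out the shape-bias formula, but following the common cue-conflict definition it can be sketched as: among images where the prediction matches either the shape or the texture label, take the fraction that match the shape label. All names below are illustrative:

```python
import numpy as np

def shape_bias(preds, shape_labels, texture_labels):
    """Shape bias on cue-conflict images: fraction of "decisive" answers
    (prediction matches the shape OR the texture label) that match the shape."""
    preds = np.asarray(preds)
    shape_hit = preds == np.asarray(shape_labels)
    texture_hit = preds == np.asarray(texture_labels)
    decisive = shape_hit | texture_hit   # answer matched either cue
    return shape_hit[decisive].mean()

# Toy example over the 16 entry-level categories (6 cue-conflict images)
preds    = ["cat", "clock", "dog", "clock", "cat",  "boat"]
shapes   = ["cat", "dog",   "dog", "clock", "bird", "bottle"]
textures = ["car", "car",   "cat", "bird",  "cat",  "boat"]
print(shape_bias(preds, shapes, textures))  # 3 shape matches / 5 decisive = 0.6
```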

3. Results After Fine-Tuning

Error consistency results for Stylized ImageNet (SIN) dataset before and after fine-tuning.
  • Fine-tuning made ResNet less human-like in terms of error consistency (significant differences in Cohen’s κ and the inter-class JS distance, and a non-significant trend in the class-wise JS distance).

On the other hand, the authors report that ViT does not change significantly in its error consistency with fine-tuning, and in fact trends (not statistically significantly) in the opposite direction from ResNet, namely towards improved error consistency.

Shape bias for ResNet and ViT before and after fine-tuning. Vertical lines indicate averages.
  • Augmentations presented in SimCLR and Hermann et al. (2020) are applied to the ImageNet dataset to fine-tune the models (a sketch of such a pipeline follows this list):
  • Rotation (±90° or 180°, chosen randomly), random Cutout (rectangles from 2×2 px up to half the image width), Sobel filtering, Gaussian blur (3×3 px kernel), color distortion (color jitter with probability 80% and color drop with probability 20%), and Gaussian noise (standard deviation 0.196 on the normalized image).
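
A minimal torchvision-style sketch of such an augmentation pipeline; the composition order, application probabilities, and jitter strengths below are assumptions, and Sobel filtering is omitted since torchvision has no built-in transform for it:

```python
import random
import torch
from torchvision import transforms
from torchvision.transforms import functional as TF

# Discrete rotation by -90, +90, or 180 degrees, applied to the PIL image
rotate = transforms.Lambda(lambda img: TF.rotate(img, random.choice([-90, 90, 180])))

# Color distortion: color jitter with prob. 0.8, color drop (grayscale) with prob. 0.2
color_distort = transforms.Compose([
    transforms.RandomApply([transforms.ColorJitter(0.8, 0.8, 0.8, 0.2)], p=0.8),
    transforms.RandomGrayscale(p=0.2),
])

augment = transforms.Compose([
    transforms.RandomApply([rotate], p=0.5),
    color_distort,
    transforms.GaussianBlur(kernel_size=3),                        # 3x3 blur kernel
    transforms.ToTensor(),
    transforms.RandomErasing(p=0.5, scale=(0.02, 0.25)),           # Cutout-style rectangles
    transforms.Lambda(lambda t: t + 0.196 * torch.randn_like(t)),  # Gaussian noise, sigma = 0.196
])
```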

We see that both ResNet and ViT increase their shape bias after fine-tuning.

Shape bias and ImageNet accuracy of ViT and ResNet with fine-tuning over augmented data.
  • Training on augmented data increases shape bias and decreases ImageNet accuracy slightly, as is corroborated by previous works (Hermann et al., 2020).
  • The decrease in accuracy for ResNet is more significant than for ViT.

