# Review — Are Convolutional Neural Networks or Transformers More Like Human Vision?

Are Convolutional Neural Networks or Transformers More Like Human Vision?, Tuli et al., by Princeton University, DeepMind, and UC Berkeley, 2021 CogSci, Over 80 Citations (Sik-Ho Tsang @ Medium)



# Outline

1. **Error Consistency**
2. **Shape Bias Results**
3. **Shape Bias Results After Fine-Tuning**

# 1. Error Consistency

To investigate whether Transformers give more human-like representations than CNNs, it is common in ML, AI, cognitive science, and behavioral neuroscience to examine whether two decision makers (be they humans or AI models) use the same strategy to solve a given task.

## 1.1. Error Overlap

The **observed error overlap** is given by:

$$c_{obs,i,j} = \frac{e_{i,j}}{n}$$

- where *e_{i,j}* is how often the two systems “agree” on the *n* trials, i.e., **how often they both classify correctly or both classify incorrectly**.

## 1.2. Cohen’s k

- Let a system *i* have an accuracy *p_i*, the **probability of making a correct (vs. incorrect) decision**.
- The **expected overlap** is calculated by **comparing independent binomial observers** *i* and *j* with their accuracies as the respective probabilities:

$$c_{exp,i,j} = p_i p_j + (1 - p_i)(1 - p_j)$$

- The expected overlap can be used to **normalize the observed error overlap**, giving a **measure of error consistency** known as **Cohen’s k** (κ):

$$\kappa_{i,j} = \frac{c_{obs,i,j} - c_{exp,i,j}}{1 - c_{exp,i,j}}$$

- Yet it does not take into account what the system misclassifies an image as when making an error.
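As a minimal sketch (not the authors’ code), the observed overlap, expected overlap, and Cohen’s k above can be computed from per-trial correctness records of two systems:

```python
def cohens_kappa(correct_i, correct_j):
    """Error-consistency kappa from per-trial correctness (1/0 or True/False)."""
    n = len(correct_i)
    # observed error overlap: fraction of trials where both are right or both are wrong
    c_obs = sum(a == b for a, b in zip(correct_i, correct_j)) / n
    # accuracies p_i and p_j of the two systems
    p_i = sum(correct_i) / n
    p_j = sum(correct_j) / n
    # expected overlap of two independent binomial observers with those accuracies
    c_exp = p_i * p_j + (1 - p_i) * (1 - p_j)
    # normalize the observed overlap by the expected overlap
    return (c_obs - c_exp) / (1 - c_exp)
```

Two systems with identical decision patterns give k = 1, while two systems that agree only as often as chance predicts give k ≈ 0.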

## 1.3. Confusion Matrix Comparison

- **A probability distribution of errors over** *C* **classes** can be generated by computing the number of times elements from each class are misclassified and normalizing by the total number of errors made.
- However, this distribution is very sparse; one solution is to **cluster the classes into higher-level categories**, e.g. into **16 so-called “entry-level” categories**, producing a **16×16 confusion matrix** for each model.
- The **error term for every class** is normalized into a probability:

$$p_i = \frac{e_i}{\sum_j e_j}$$

- where *e_i* is **a count of errors** defined for a given system. The **Jensen-Shannon (JS) distance between these distributions** can then be computed:

$$JSD(p \,\|\, q) = \sqrt{\tfrac{1}{2} D(p \,\|\, m) + \tfrac{1}{2} D(q \,\|\, m)}$$

- where *m* is the point-wise mean of the two probability distributions *p* and *q* (i.e., *m_i* = (*p_i* + *q_i*)/2, with *p* and *q* being the **probability distributions of errors of the two systems**), and *D* is the KL divergence:

$$D(p \,\|\, m) = \sum_i p_i \log \frac{p_i}{m_i}$$

- The JS distance is a symmetrized and smoothed version of the Kullback-Leibler divergence. A **lower JS distance** implies classifiers with **higher error consistency**.
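The JS distance above can be sketched in pure Python as follows (an illustration, assuming *p* and *q* are already normalized error distributions):

```python
from math import log, sqrt

def kl_divergence(p, m):
    # Kullback-Leibler divergence D(p || m); terms with p_i = 0 contribute zero
    return sum(pi * log(pi / mi) for pi, mi in zip(p, m) if pi > 0)

def js_distance(p, q):
    # m is the point-wise mean of the two error distributions
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    # JS distance = square root of the symmetrized, smoothed KL divergence
    return sqrt(0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m))
```

Identical distributions give a distance of 0; disjoint distributions give the maximum value of √(log 2) in natural-log units.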

Columns (predicted labels) of the confusion matrix are collapsed, and the accumulated error for the 16 true classes is:

$$e_i = \sum_{j \ne i} CM_{i,j}$$

- where **CM** is the **confusion matrix**.
- In this context, the **class-wise JS distance compares which classes were misclassified**, for a given number of output classes (16 in this case).

For **off-diagonal** counting (16×16 − 16 = 240 entries), the error counts are the off-diagonal entries themselves, *e_{i,j}* = *CM_{i,j}* for *i* ≠ *j*, and the resulting measure is the **inter-class JS distance**, which compares which specific confusions the two systems make.
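The class-wise and inter-class error distributions can be sketched from a confusion matrix as below (illustrative pure Python; the convention that rows are true labels and columns are predicted labels is an assumption):

```python
def classwise_error_counts(cm):
    # collapse predicted-label columns: errors accumulated per true class
    # (row sum minus the diagonal, i.e. minus the correct classifications)
    return [sum(row) - row[i] for i, row in enumerate(cm)]

def interclass_error_counts(cm):
    # keep every off-diagonal entry separately: C*C - C entries (240 for C = 16)
    n = len(cm)
    return [cm[i][j] for i in range(n) for j in range(n) if i != j]

def to_distribution(errors):
    # normalize error counts by the total number of errors made
    total = sum(errors)
    return [e / total for e in errors]
```

Feeding either distribution into the JS distance gives the class-wise or inter-class variant, respectively.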

# 2. Shape Bias Results

## 2.1. Models

- The **ViT** and **ResNet** models used were **pre-trained** on the ImageNet-21k (also known as the “Full ImageNet, Fall 2011 release”) and ILSVRC-2012 datasets (Russakovsky et al., 2015).
- The ViT models used are **ViT-B/16, ViT-B/32, ViT-L/16 and ViT-L/32**, and the ResNet model used is **BiT-M-R50x1**.

## 2.2. Datasets

- These models are **tested on a specially designed diagnostic dataset, the Stylized ImageNet (SIN) dataset**, where cue conflicts between texture and shape are generated by texture-based style transfer.

## 2.3. Results

A **higher Cohen’s k** and a **lower JS distance** each indicate **greater error consistency**.

For Cohen’s k and the inter-class JS distance, **ViT** is more consistent with humans than **ResNet**.

- Small bar plots on the right indicate accuracy (answer corresponds to either correct texture or shape category).
- It can be seen that
**ViT****has a higher shape bias than traditional CNNs**.

# 3. Shape Bias Results After Fine-Tuning

**Fine-tuning made ResNet less human-like** in terms of error consistency (significant differences in Cohen’s k and the inter-class JS distance, and a non-significant trend in the class-wise JS distance).

On the other hand, the authors report that **ViT** does not change significantly in its error consistency with fine-tuning, and in fact trends (not statistically significantly) in the opposite direction to ResNet, namely towards **improved error consistency**.

**Augmentations** presented in SimCLR and Hermann et al. (2020) are **applied to the ImageNet dataset** to **fine-tune the models**:

- Rotation (±90°, 180°, randomly), random Cutout (rectangles from 2×2 px up to half the image width), Sobel filtering, Gaussian blur (kernel size = 3×3 px), color distortion (color jitter with probability 80% and color drop with probability 20%), and Gaussian noise (standard deviation of 0.196 for the normalized image).

We see that both **ResNet** and **ViT** increase their shape bias after fine-tuning.