Review — Are Convolutional Neural Networks or Transformers More Like Human Vision?
Are Convolutional Neural Networks or Transformers More Like Human Vision?,
Tuli CogSci’21, by Princeton University, DeepMind, and UC Berkeley,
2021 CogSci, Over 80 Citations (Sik-Ho Tsang @ Medium)
- Error Consistency
- Shape Bias Results
- Shape Bias Results After Fine-Tuning
1. Error Consistency
To investigate whether Transformers give more human-like representations than CNNs, the authors follow a common approach in ML, AI, cognitive science, and behavioral neuroscience: checking whether two decision makers (be they humans or AI models) use the same strategy to solve a given task.
1.1. Error Overlap
- Observed error overlap is given by:
c_obs,ij = e_ij / n
- where e_ij is how often the two systems “agree”, i.e., how often they both classify correctly or both classify incorrectly, out of n trials in total.
1.2. Cohen’s κ
- Let system i have an accuracy p_i, the probability of making a correct decision.
- The expected overlap is calculated by comparing independent binomial observers i and j with their accuracies as the respective probabilities:
c_exp,ij = p_i · p_j + (1 − p_i)(1 − p_j)
- The expected overlap can be used to normalize the observed error overlap, giving a measure of error consistency known as Cohen’s κ:
κ_ij = (c_obs,ij − c_exp,ij) / (1 − c_exp,ij)
- Yet Cohen’s κ does not take into account which class a system misclassifies an image as when it makes an error.
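Putting Sections 1.1 and 1.2 together, error consistency can be computed from the trial-wise correctness of two systems. A minimal NumPy sketch (the function name and toy inputs are mine, not from the paper):

```python
import numpy as np

def error_consistency(correct_i, correct_j):
    """Cohen's kappa over trial-wise correctness, as defined above.

    correct_i, correct_j: boolean arrays, True where each system
    classified the corresponding trial correctly.
    """
    correct_i = np.asarray(correct_i, dtype=bool)
    correct_j = np.asarray(correct_j, dtype=bool)

    # Observed error overlap: fraction of trials where the two systems
    # "agree" -- both correct or both incorrect.
    c_obs = np.mean(correct_i == correct_j)

    # Expected overlap for two independent binomial observers with
    # accuracies p_i and p_j.
    p_i, p_j = correct_i.mean(), correct_j.mean()
    c_exp = p_i * p_j + (1 - p_i) * (1 - p_j)

    # Cohen's kappa: observed overlap normalized by the expected overlap.
    return (c_obs - c_exp) / (1 - c_exp)
```

Two systems with identical correctness patterns give κ = 1, while systems that disagree on every trial give a negative κ.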
1.3. Confusion Matrix Comparison
- A probability distribution of errors over C classes can be generated by computing the number of times elements from each class are misclassified and normalizing with the net number of errors made.
- However, this distribution is very sparse. One solution is to cluster the classes into higher-level categories, e.g. mapping the ImageNet classes into 16 so-called “entry-level” categories, producing a 16×16 confusion matrix for each model.
- The error probability for every class i is:
p_i = e_i / Σ_j e_j
- where e_i is the count of errors on class i for a given system.
- The Jensen–Shannon (JS) distance between these distributions can be computed as:
JSD(p, q) = √((D(p‖m) + D(q‖m)) / 2)
- where m is the point-wise mean of the two probability distributions p and q (i.e., m_i = (p_i + q_i)/2, p and q being the probability distributions of errors of the two systems), and D is the KL divergence:
D(p‖q) = Σ_i p_i log(p_i / q_i)
- The JS distance is a symmetrized and smoothed version of the Kullback–Leibler divergence. A lower JS distance implies classifiers with higher error consistency.
- Columns (predicted labels) of the confusion matrix are collapsed, and the accumulated error for each of the 16 true classes is:
e_i = Σ_{j≠i} CM_ij
- where CM is the confusion matrix.
- In this context, the class-wise JS distance compares which classes were misclassified, for a given number of output classes (16 in this case).
- When the off-diagonal entries of the confusion matrix (16×16 − 16 = 240 entries) are compared directly instead, the resulting measure is the inter-class JS distance, which also accounts for what the images were misclassified as.
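The class-wise and inter-class JS distances above can be sketched as follows; a toy 3×3 confusion matrix stands in for the paper's 16×16 one, and the function names are mine rather than the paper's:

```python
import numpy as np

def kl(p, q):
    # KL divergence D(p || q), summing only where p > 0.
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

def js_distance(p, q):
    # Jensen-Shannon distance: square root of the JS divergence,
    # with m the point-wise mean of p and q.
    m = (p + q) / 2
    return np.sqrt((kl(p, m) + kl(q, m)) / 2)

def classwise_error_dist(cm):
    # Collapse the predicted-label columns: per-class error counts are
    # the off-diagonal row sums, normalized into a distribution.
    e = cm.sum(axis=1) - np.diag(cm)
    return e / e.sum()

def interclass_error_dist(cm):
    # Keep all C*C - C off-diagonal entries as one distribution,
    # so the measure also reflects *what* each class was confused with.
    off = cm[~np.eye(cm.shape[0], dtype=bool)]
    return off / off.sum()
```

Comparing two systems then amounts to `js_distance(classwise_error_dist(cm_a), classwise_error_dist(cm_b))`, and likewise for the inter-class variant.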
2. Shape Bias Results
- The ViT and ResNet models used were pre-trained on ImageNet-21K (also known as the “Full ImageNet, Fall 2011 release”) and ILSVRC-2012 datasets (Russakovsky et al., 2015).
- The ViT models used include ViT-B/16, ViT-B/32, ViT-L/16 and ViT-L/32, and the ResNet model used is BiT-M-R50×1.
- These models are tested on a specially designed diagnostic dataset, the Stylized ImageNet (SIN) dataset, where cue conflict between texture and shape is generated by texture-based style transfer.
- A higher Cohen’s κ and a lower JS distance each indicate greater error consistency.
- Small bar plots on the right indicate accuracy (answer corresponds to either correct texture or shape category).
- It can be seen that ViT has a higher shape bias than traditional CNNs.
3. Results After Fine-Tuning
- Fine-tuning made ResNet less human-like in terms of error consistency (significant differences in Cohen’s κ and the inter-class JS distance, and a non-significant trend in the class-wise JS distance).
- On the other hand, the authors report that ViT’s error consistency does not change significantly with fine-tuning, and in fact trends (not statistically significantly) in the opposite direction from ResNet, i.e., towards improved error consistency.
- Augmentations presented in SimCLR and Hermann et al. (2020) are applied to the ImageNet dataset to fine-tune the models:
- Rotation (±90°, 180°, chosen randomly), random Cutout (rectangles of size from 2×2 px up to half the image width), Sobel filtering, Gaussian blur (kernel size = 3×3 px), color distortion (color jitter with probability 80% and color drop with probability 20%), and Gaussian noise (standard deviation of 0.196 for a normalized image).
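A few of these augmentations can be sketched in NumPy; this is an illustrative approximation of the listed transforms, not the authors’ actual pipeline (SimCLR and Hermann et al. have their own implementations):

```python
import numpy as np

rng = np.random.default_rng(0)

def random_rotation(img):
    # Rotate by +90, 180, or -90 degrees, chosen at random
    # (k = number of counter-clockwise 90-degree turns).
    k = rng.choice([1, 2, 3])
    return np.rot90(img, k=k, axes=(0, 1))

def random_cutout(img, max_frac=0.5):
    # Zero out a random square patch, from 2x2 px up to half the
    # image width (assumes images at least ~4 px wide).
    img = img.copy()
    h, w = img.shape[:2]
    size = rng.integers(2, int(w * max_frac) + 1)
    y = rng.integers(0, h - size + 1)
    x = rng.integers(0, w - size + 1)
    img[y:y + size, x:x + size] = 0
    return img

def gaussian_noise(img, std=0.196):
    # Additive Gaussian noise for an image normalized to [0, 1],
    # clipped back into the valid range.
    return np.clip(img + rng.normal(0, std, img.shape), 0.0, 1.0)
```

Each function takes and returns an H×W×C float array in [0, 1]; in practice such transforms would be composed and applied on the fly during fine-tuning.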