Brief Review — An Intriguing Failing of Convolutional Neural Networks and the CoordConv Solution

CoordConv: Incorporating Pixel Positions into the Conv Layer

Sik-Ho Tsang
3 min read · Jul 29, 2022

An Intriguing Failing of Convolutional Neural Networks and the CoordConv Solution, CoordConv, by Uber AI Labs, and Uber Technologies,
2018 NeurIPS, Over 500 Citations (Sik-Ho Tsang @ Medium)
Image Classification

  • CoordConv is proposed, which feeds pixel positions into the conv layer as extra input channels.

Outline

  1. CoordConv
  2. Not-so-Clevr Dataset & Results
  3. Other Results

1. CoordConv

Comparison of 2D convolutional and CoordConv layers
  • A CoordConv layer has 2 to 3 more input channels than a standard conv layer.
  • These channels contain hard-coded coordinates, the most basic version of which is one channel for the i coordinate and one for the j coordinate, as shown above.
  • E.g., the i coordinate channel has its first row filled with 0’s, its second row with 1’s, its third with 2’s, and so on.
  • Other derived coordinates may be input as well, such as the radius coordinate r (distance from the image center) used in the ImageNet experiments.
  • Finally, the coordinate values are linearly scaled to fall in the range [−1, 1].
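The channel construction described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function name add_coord_channels, the (C, H, W) layout, and the choice to compute r from the already scaled coordinates are all assumptions made here. A standard convolution would then be applied to the augmented tensor.

```python
import numpy as np

def add_coord_channels(x, with_r=False):
    """Append i, j (and optionally r) coordinate channels to x of shape (C, H, W)."""
    _, h, w = x.shape
    # Integer grids: i is constant along each row, j along each column.
    i, j = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    # Linear scaling so both coordinates fall in [-1, 1].
    i = 2.0 * i / max(h - 1, 1) - 1.0
    j = 2.0 * j / max(w - 1, 1) - 1.0
    channels = [x, i[None], j[None]]
    if with_r:
        # Assumption: radius measured from the image center in scaled units.
        channels.append(np.sqrt(i ** 2 + j ** 2)[None])
    return np.concatenate(channels, axis=0).astype(np.float32)

x = np.zeros((3, 4, 5), dtype=np.float32)
print(add_coord_channels(x).shape)               # (5, 4, 5)
print(add_coord_channels(x, with_r=True).shape)  # (6, 4, 5)
```

The input keeps its 3 channels and gains 2 (or 3) hard-coded ones, which matches the "2 to 3 more channels" mentioned above.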

2. Not-so-Clevr Dataset & Results

The Not-so-Clevr dataset
  • Not-so-Clevr consists of 9×9 squares placed on a 64×64 canvas.
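To make the dataset concrete, here is a rough sketch of how such images could be generated. It assumes that only centers keeping the square fully inside the canvas are used; the names render_square and centers are hypothetical and not taken from the paper's code.

```python
import numpy as np

CANVAS, SQUARE = 64, 9
HALF = SQUARE // 2

def render_square(ci, cj):
    """Return a 64x64 canvas with a 9x9 square of ones centered at (ci, cj)."""
    canvas = np.zeros((CANVAS, CANVAS), dtype=np.uint8)
    canvas[ci - HALF:ci + HALF + 1, cj - HALF:cj + HALF + 1] = 1
    return canvas

# Assumption: only centers that keep the square fully inside the canvas are used.
centers = [(i, j)
           for i in range(HALF, CANVAS - HALF)
           for j in range(HALF, CANVAS - HALF)]
images = np.stack([render_square(i, j) for i, j in centers])
print(images.shape)  # (3136, 64, 64)
```

Under this assumption there are 56×56 = 3136 possible square positions.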
Toy tasks considered in this paper
  • In the Supervised Coordinate Classification task, the network takes an (i, j) coordinate as input and must output a 64×64 map with only the pixel at that position turned on.
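For the coordinate classification task, the input/target pairing can be pictured as below. This is only a sketch under the assumption that the target is a one-hot 64×64 map and that a prediction counts as correct when the highest-scoring pixel matches the input coordinate; one_hot_pixel and predicted_coordinate are illustrative names.

```python
import numpy as np

def one_hot_pixel(i, j, size=64):
    """Target map for coordinate (i, j): all zeros except one pixel set to 1."""
    target = np.zeros((size, size), dtype=np.float32)
    target[i, j] = 1.0
    return target

def predicted_coordinate(scores):
    """Recover the (i, j) of the highest-scoring pixel in a (size, size) map."""
    idx = np.unravel_index(np.argmax(scores), scores.shape)
    return tuple(int(k) for k in idx)

target = one_hot_pixel(10, 42)
print(predicted_coordinate(target))  # (10, 42)
```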
Performance of convolution and CoordConv on Supervised Coordinate Classification
  • However, conventional convolutional models never achieve more than about 86% accuracy, and training is slow.
  • CoordConv models learn several hundred times faster, attaining perfect accuracy in seconds.

3. Other Results

3.1. ImageNet Classification

  • As might be expected for tasks requiring straightforward translation invariance, CoordConv does not help significantly when tested with image classification.
  • Adding a single extra 1×1 CoordConv layer with 8 output channels improves ResNet-50 Top-5 accuracy by a meager 0.04% averaged over five runs for each treatment; however, this difference is not statistically significant. It is at least reassuring that CoordConv doesn’t hurt the performance since it can always learn to ignore coordinates.
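The remark that CoordConv "can always learn to ignore coordinates" can be checked numerically: a 1×1 CoordConv whose weights on the coordinate channels are zero computes exactly the same output as a plain 1×1 convolution. The sketch below is a NumPy illustration only; the 3-channel input, the layer placement, and the helper names are assumptions, not details confirmed by the paper.

```python
import numpy as np

def coord_channels(h, w):
    """i and j coordinate channels of shape (2, h, w), scaled to [-1, 1]."""
    i, j = np.meshgrid(np.linspace(-1.0, 1.0, h),
                       np.linspace(-1.0, 1.0, w), indexing="ij")
    return np.stack([i, j])

def conv1x1(x, weight, bias):
    """1x1 convolution as a per-pixel linear map. x: (C, H, W), weight: (O, C)."""
    return np.einsum("oc,chw->ohw", weight, x) + bias[:, None, None]

def coordconv1x1(x, weight, bias):
    """1x1 CoordConv: conv1x1 on x with i, j appended. weight: (O, C + 2)."""
    _, h, w = x.shape
    return conv1x1(np.concatenate([x, coord_channels(h, w)]), weight, bias)

rng = np.random.default_rng(0)
x = rng.normal(size=(3, 16, 16))   # RGB-like input (assumed shape)
weight = rng.normal(size=(8, 5))   # 8 output channels, 3 + 2 input channels
bias = rng.normal(size=8)
weight[:, 3:] = 0.0                # zero the weights on the i and j channels

same = np.allclose(coordconv1x1(x, weight, bias),
                   conv1x1(x, weight[:, :3], bias))
print(same)  # True
```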

3.2. Object Detection

  • On a simple problem of detecting MNIST digits scattered on a canvas, the test intersection-over-union (IoU) of a Faster R-CNN network improves by 24% when using CoordConv.
  • (The authors provide no figures or tables for this part.)
  • CoordConv can therefore be useful for localization problems such as object detection.
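For readers unfamiliar with the metric, IoU for two axis-aligned boxes can be computed as follows. This is the standard definition written as a small sketch; the (x1, y1, x2, y2) box format and the function name box_iou are choices made here, not something specified in the paper.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2) with x1 < x2 and y1 < y2."""
    # Width and height of the overlap region, clipped at zero when disjoint.
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Two 2x2 boxes overlapping in a 1x1 region: IoU = 1 / 7
print(box_iou((0, 0, 2, 2), (1, 1, 3, 3)))
```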
