Brief Review — An Intriguing Failing of Convolutional Neural Networks and the CoordConv Solution

CoordConv Incorporates Positions into the Conv Layer

Jul 29, 2022

An Intriguing Failing of Convolutional Neural Networks and the CoordConv Solution (CoordConv), by Uber AI Labs and Uber Technologies,
2018 NeurIPS, Over 500 Citations (Sik-Ho Tsang @ Medium)
Image Classification

CoordConv is proposed, which gives a conv layer access to pixel positions by feeding it coordinate channels as extra input.

Outline

1. CoordConv
2. Not-so-Clevr Dataset & Results
3. Other Results

1. CoordConv

**Comparison of 2D convolutional and CoordConv layers**

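A CoordConv layer has 2 to 3 more input channels than a standard conv layer: the extra channels are concatenated to the input before an ordinary convolution is applied.
These channels contain hard-coded coordinates, the most basic version of which is one channel for the i coordinate and one for the j coordinate, as shown above.
For example, the i coordinate channel has its first row filled with 0’s, its second row with 1’s, its third with 2’s, and so on; the j coordinate channel is analogous, with columns filled in place of rows.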
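Other derived coordinates may be input as well, like the radius coordinate r used in the ImageNet experiments, which measures the distance of each pixel from the image center for an input of height h and width w:

r = √((i − h/2)² + (j − w/2)²)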

Finally, the i and j coordinate values are linearly scaled to fall in the range [−1, 1].
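
The construction of the coordinate channels can be sketched in a few lines of NumPy. This is only a minimal illustration, not the authors' implementation: the function name `add_coord_channels` and the channel-first `(C, H, W)` layout are assumptions, and the optional radius channel is taken as the distance from the image center `(h/2, w/2)` and left unscaled here.

```python
import numpy as np

def add_coord_channels(x, with_r=False):
    """Append i, j (and optionally r) coordinate channels to x of shape (C, H, W)."""
    _, h, w = x.shape
    # i holds the row index of every pixel, j holds the column index.
    i, j = np.meshgrid(np.arange(h, dtype=np.float32),
                       np.arange(w, dtype=np.float32), indexing="ij")
    # Linearly scale the raw indices so that they fall in [-1, 1].
    i_scaled = 2.0 * i / max(h - 1, 1) - 1.0
    j_scaled = 2.0 * j / max(w - 1, 1) - 1.0
    channels = [x.astype(np.float32), i_scaled[None], j_scaled[None]]
    if with_r:
        # Radius channel: distance from the image center (unscaled; an assumption).
        r = np.sqrt((i - h / 2.0) ** 2 + (j - w / 2.0) ** 2)
        channels.append(r[None])
    return np.concatenate(channels, axis=0)

# A 3-channel 4x5 input gains two coordinate channels plus the radius channel.
out = add_coord_channels(np.zeros((3, 4, 5)), with_r=True)
print(out.shape)
```

The output of this function would then be passed to an ordinary convolution, which is what makes CoordConv a drop-in replacement for a conv layer.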

2. Not-so-Clevr Dataset & Results

Not-so-Clevr consists of 9×9 squares placed on a 64×64 canvas.

In the Supervised Coordinate Classification task, the network takes an (i, j) coordinate as input and must output a 64×64 image in which only the pixel at that position is highlighted.
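
To make the coordinate classification task concrete, the target for one sample can be sketched as a one-hot 64×64 image. This is an assumption about how the targets are built, based on the task description; the helper names `one_hot_canvas` and `is_correct` are hypothetical.

```python
import numpy as np

def one_hot_canvas(i, j, size=64):
    """Target image: all zeros except a single 1 at pixel (i, j)."""
    canvas = np.zeros((size, size), dtype=np.float32)
    canvas[i, j] = 1.0
    return canvas

def is_correct(pred_logits, target):
    """A prediction counts as correct if its highest-scoring pixel is the target pixel."""
    return bool(np.argmax(pred_logits) == np.argmax(target))

target = one_hot_canvas(3, 5)
print(np.unravel_index(np.argmax(target), target.shape))
```

Under this framing, accuracy is the fraction of input coordinates for which the network lights up the right pixel.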

**Performance of convolution and CoordConv on Supervised Coordinate Classification**

However, conventional convolutional models never achieve more than about 86% accuracy, and training is slow.
CoordConv models learn several hundred times faster, attaining perfect accuracy in seconds.

3. Other Results

3.1. ImageNet Classification

As might be expected for tasks requiring straightforward translation invariance, CoordConv does not help significantly when tested with image classification.
Adding a single extra 1×1 CoordConv layer with 8 output channels improves ResNet-50 Top-5 accuracy by a meager 0.04% averaged over five runs for each treatment; however, this difference is not statistically significant. It is at least reassuring that CoordConv doesn’t hurt the performance since it can always learn to ignore coordinates.
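
The remark that CoordConv can learn to ignore coordinates can be illustrated with a small NumPy sketch of a 1×1 CoordConv layer. It is not the ResNet-50 setup from the paper: the 8 output channels follow the text, but the function name `coordconv_1x1`, the 16 input channels, and the 7×7 spatial size are arbitrary assumptions.

```python
import numpy as np

def coordconv_1x1(x, weight, bias):
    """1x1 CoordConv: x is (C, H, W), weight is (C_out, C + 2), bias is (C_out,)."""
    _, h, w = x.shape
    # Coordinate channels already scaled to [-1, 1].
    i, j = np.meshgrid(np.linspace(-1.0, 1.0, h),
                       np.linspace(-1.0, 1.0, w), indexing="ij")
    x_aug = np.concatenate([x, i[None], j[None]], axis=0)  # (C + 2, H, W)
    # A 1x1 convolution is a per-pixel linear map over the channel axis.
    return np.einsum("oc,chw->ohw", weight, x_aug) + bias[:, None, None]

rng = np.random.default_rng(0)
x = rng.normal(size=(16, 7, 7))
weight = rng.normal(size=(8, 16 + 2))
bias = np.zeros(8)

# Zeroing the weights on the two coordinate channels recovers a plain 1x1 convolution.
weight_ignore = weight.copy()
weight_ignore[:, -2:] = 0.0
plain = np.einsum("oc,chw->ohw", weight[:, :16], x)
out = coordconv_1x1(x, weight_ignore, bias)
print(np.allclose(out, plain))
```

Because the coordinate weights are ordinary trainable parameters, driving them to zero is always available to the optimizer, which is consistent with CoordConv not hurting ImageNet accuracy.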

3.2. Object Detection

On a simple problem of detecting MNIST digits scattered on a canvas, the test intersection-over-union (IoU) of a Faster R-CNN network is found to improve by 24% when using CoordConv.
(The authors do not provide any figures or tables for this part.)
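
For reference, the IoU metric mentioned above can be computed for two axis-aligned boxes as follows. This is a generic sketch of the metric, not the paper's evaluation code; the `(x1, y1, x2, y2)` box format and the function name `box_iou` are assumptions.

```python
def box_iou(box_a, box_b):
    """IoU of two boxes given as (x1, y1, x2, y2) with x1 < x2 and y1 < y2."""
    # Corners of the intersection rectangle.
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    # Width and height are clamped at zero when the boxes do not overlap.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Two 2x2 boxes overlapping in a 1x1 region: IoU = 1 / (4 + 4 - 1).
print(box_iou((0, 0, 2, 2), (1, 1, 3, 3)))
```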

CoordConv can therefore be useful for localization problems such as object detection.

Reference

[2018 NeurIPS] [CoordConv]
An Intriguing Failing of Convolutional Neural Networks and the CoordConv Solution

Image Classification

1989 … 2018 … [CoordConv] … 2021 [Learned Resizer] [Vision Transformer, ViT] [ResNet Strikes Back] [DeiT] [EfficientNetV2] [MLP-Mixer] [T2T-ViT] [Swin Transformer] [CaiT] [ResMLP] [ResNet-RS] [NFNet] [PVT, PVTv1] [CvT] [HaloNet] [TNT] [CoAtNet] [Focal Transformer] [TResNet] [CPVT] [Twins] 2022 [ConvNeXt] [PVTv2]
