Brief Review — RCNN: Recurrent Convolutional Neural Network for Object Recognition

RCNN, CNN+RNN for Image Classification

Sik-Ho Tsang
Aug 26, 2022

Recurrent Convolutional Neural Network for Object Recognition, by Tsinghua University
2015 CVPR, Over 1000 Citations (Sik-Ho Tsang @ Medium)
Image Classification, CNN, RNN

  • A Recurrent CNN (RCNN) is proposed for image classification, in which the same convolutional layer is applied recurrently over multiple time steps with shared weights.


  1. Recurrent CNN (RCNN)
  2. Results

1. Recurrent CNN (RCNN)

RCNN Overall Network Architecture

1.1. Recurrent Convolutional Layer (RCL)

  • The key module of RCNN is the recurrent convolutional layer (RCL).
  • For a unit located at (i, j) on the kth feature map in an RCL, its net input z_ijk(t) at time step t is given by: z_ijk(t) = (w_k^f)ᵀ u^(i,j)(t) + (w_k^r)ᵀ x^(i,j)(t−1) + b_k, where u^(i,j)(t) is the feed-forward input from the previous layer, x^(i,j)(t−1) is the recurrent input from the previous time step, w_k^f and w_k^r are the vectorized feed-forward and recurrent weights (shared across positions and time steps), and b_k is the bias.
  • The activity or state of this unit is a function of its net input: x_ijk(t) = g(f(z_ijk(t))),
  • where f is the rectified linear (ReLU) activation: f(z_ijk(t)) = max(z_ijk(t), 0),
  • and g is the local response normalization (LRN) function used in AlexNet, which normalizes f_ijk by the squared activities of the N adjacent feature maps at the same position: g(f_ijk) = f_ijk / (1 + (α/N) Σ_{k′=max(0, k−N/2)}^{min(K−1, k+N/2)} (f_ijk′)²)^β.
  • It is claimed that LRN helps prevent the states from exploding over the repeated recurrent iterations.
  • Left: an RCL unfolded for T=3 time steps. When t=0, only the feed-forward input is present; the recurrent input contributes from t=1 onward.
  • The final gradient of a shared weight is the sum of its gradients over all time steps.
  • Right: To save computation, layer 1 is the standard feed-forward convolutional layer without recurrent connections, followed by max pooling. On top of this, four RCLs are used with a max pooling layer in the middle. Both pooling operations have stride 2 and size 3.
  • The output of the fourth RCL follows a global max pooling, yielding a feature vector representing the image.
  • Finally, a softmax layer classifies the feature vector into C categories; its output is given by: y_k = exp(w_kᵀ x) / Σ_{k′=1}^{C} exp(w_{k′}ᵀ x), where x is the global-max-pooled feature vector.
  • The cross-entropy loss function is used for training.
  • If we unfold the recurrent connections for T time steps, the model becomes a very deep feed-forward network with 4(T+1)+2 parameterized layers, where T+1 is the depth of each RCL.
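To make the unfolding concrete, here is a minimal NumPy sketch of one RCL (feed-forward plus recurrent 3×3 convolutions with shared weights, ReLU, then LRN), followed by global max pooling and a softmax. Shapes, initialization, and the AlexNet-style LRN constants (N, α, β) are assumptions for illustration; this is not the authors' implementation, which would use GPU convolution libraries.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def lrn(f, N=5, alpha=1e-4, beta=0.75):
    """AlexNet-style local response normalization across feature maps.
    f: (K, H, W). N, alpha, beta are assumed defaults."""
    K = f.shape[0]
    out = np.empty_like(f)
    for k in range(K):
        lo, hi = max(0, k - N // 2), min(K - 1, k + N // 2)
        denom = (1.0 + (alpha / N) * np.sum(f[lo:hi + 1] ** 2, axis=0)) ** beta
        out[k] = f[k] / denom
    return out

def conv2d_same(x, w):
    """Naive 'same' 3x3 convolution (cross-correlation, as usual in deep
    learning). x: (C, H, W), w: (K, C, 3, 3) -> (K, H, W)."""
    C, H, W = x.shape
    K = w.shape[0]
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((K, H, W))
    for k in range(K):
        for c in range(C):
            for di in range(3):
                for dj in range(3):
                    out[k] += w[k, c, di, dj] * xp[c, di:di + H, dj:dj + W]
    return out

def rcl_forward(u, w_f, w_r, b, T=3):
    """Unfold one recurrent convolutional layer for T time steps.
    u: static feed-forward input (C, H, W); w_f, w_r, b are shared
    across all time steps, so unfolding adds depth but no parameters."""
    ff = conv2d_same(u, w_f)       # feed-forward term, identical at every step
    x = lrn(relu(ff + b))          # t = 0: only the feed-forward input
    for _ in range(1, T + 1):      # t = 1 .. T: add the recurrent term
        z = ff + conv2d_same(x, w_r) + b
        x = lrn(relu(z))
    return x

def classify(x, w_s):
    """Global max pooling over each feature map, then softmax.
    x: (K, H, W), w_s: (C, K) -> class probabilities (C,)."""
    feat = x.max(axis=(1, 2))            # (K,) image feature vector
    logits = w_s @ feat
    e = np.exp(logits - logits.max())    # stabilized softmax
    return e / e.sum()
```

Because `w_f`, `w_r`, and `b` are reused at every step, setting T=3 gives each RCL an effective depth of T+1 = 4 without increasing the parameter count, which is exactly why the unfolded network reaches 4(T+1)+2 parameterized layers.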

2. Results

2.1. CIFAR-10

Comparison with existing models on CIFAR-10
  • Three models with different numbers of feature maps K were tested: RCNN-96, RCNN-128 and RCNN-160. The number of recurrent iterations was set to 3.

All of them outperformed existing models such as Maxout and NIN, and the performance steadily improved with more feature maps.

2.2. CIFAR-100

Comparison with existing models on CIFAR-100
  • Again RCNN-96 outperformed the state-of-the-art models with fewer parameters, and the performance kept improving by increasing K.

2.3. MNIST

Comparison with existing models on MNIST
  • RCNN-64 outperformed other models using only 0.30 million parameters.

2.4. SVHN

Comparison with existing models on SVHN
  • RCNN-128 had fewer parameters than NIN (1.19 million versus 1.98 million), and increasing K kept improving the accuracy.

A similar idea is used in the PolyInception modules of PolyNet, which was the 2nd runner-up in ILSVRC 2016 image classification.


