Review: Deep Image — A Big Data Solution for Image Recognition in ILSVRC 2015

Sik-Ho Tsang
Sep 12, 2018

In this story, Deep Image [1] is reviewed. Deep Image achieves a 4.58% top-5 error rate, which surpasses human-level performance in ILSVRC 2015.

However, Baidu violated the ILSVRC rules at that time. They created 30 accounts and made at least 200 submissions in total, including more than 40 submissions over the 5 days from 15 March 2015 to 19 March 2015. ILSVRC allows only 2 submissions per week, so submitting this frequently broke the rules.

Nevertheless, they proposed something new: a custom-built supercomputer called Minwa, comprising 36 server nodes. With more GPUs available, the batch size can be increased substantially. Thanks to the efficient parallelism provided by the large number of GPUs, aggressive data augmentation is also proposed. (Sik-Ho Tsang @ Medium)

It is an arXiv paper, titled Deep Image: Scaling up Image Recognition, with hundreds of citations. It is a big data solution that uses scaling-out techniques rather than scaling up as the paper title suggests. It does NOT focus on innovations within the convolutional neural network (CNN) itself, or on better regularization terms for the loss function. It is about how to scale out to achieve better performance, which makes it particularly suitable for large commercial companies and government authorities. Thus, it is also worth discussing.

What is covered

  1. Supercomputer Setup
  2. Data Parallelism
  3. Data Augmentation
  4. Experimental Results

1. Supercomputer Setup

As mentioned, Minwa, a custom-built supercomputer with 36 server nodes, is used. Each node has 2 six-core Intel Xeon E5-2620 processors, 4 Nvidia Tesla K40m GPUs with 12GB of memory each, and 1 FDR InfiniBand connection providing 56Gb/s interconnect speed.

In total, it has 6.9TB host memory and 1.7TB device memory.
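The device memory total can be checked from the per-node figures above. This is only a quick arithmetic sketch; the variable names are mine, and it assumes decimal units (1 TB = 1000 GB).

```python
# Hardware figures quoted in the text above.
NODES = 36
GPUS_PER_NODE = 4
GPU_MEMORY_GB = 12

total_gpus = NODES * GPUS_PER_NODE
total_device_memory_gb = total_gpus * GPU_MEMORY_GB
# Assumption: decimal units, i.e. 1 TB = 1000 GB.
total_device_memory_tb = total_device_memory_gb / 1000

print(total_gpus)                         # 144
print(total_device_memory_gb)             # 1728
print(round(total_device_memory_tb, 1))   # 1.7
```

So the 1.7TB of device memory corresponds to 144 GPUs with 12GB each.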

As we can see, it is actually scaling out rather than scaling up!

With more GPUs, a larger batch size can be used. A larger batch size is crucial when training a deep learning network; for comparison, AlexNet used only 2 GPUs. It is difficult for a single computer to hold more GPUs because of limited PCI-E slots, power supply, and heat. The best solution is to scale out!

With a batch size of 1024 and 64 GPUs, the speedup can be up to 47x (the green curve in the paper's speedup figure).

Convergence is also much faster: training to 80% accuracy takes only 8.6 hours with 32 GPUs, while 212 hours are needed with only 1 GPU.

If training your model currently takes weeks, scaling out can cut that time. It then becomes possible to collect all of the current day's new data, train the model at midnight, and deploy the trained model before the next business day starts, or to do so on a weekly basis!
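To put these numbers in perspective, here is a rough calculation of parallel efficiency (speedup divided by number of GPUs). The helper name `parallel_efficiency` is my own. Note that the 64-GPU figure is a reported speedup while the 32-GPU figure is derived from time to reach a target accuracy, so the two are not directly comparable.

```python
def parallel_efficiency(speedup, num_gpus):
    """Fraction of the ideal linear speedup that is actually achieved."""
    return speedup / num_gpus


# Reported speedup of up to 47 with 64 GPUs.
eff_64 = parallel_efficiency(47, 64)

# Speedup implied by the training times: 212 hours (1 GPU) vs 8.6 hours (32 GPUs).
time_speedup_32 = 212 / 8.6
eff_32 = parallel_efficiency(time_speedup_32, 32)

print(round(eff_64, 2))           # 0.73
print(round(time_speedup_32, 1))  # 24.7
print(round(eff_32, 2))           # 0.77
```

In both cases roughly three quarters of the ideal linear speedup is retained.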

2. Data Parallelism

Each GPU is responsible for 1/N of a mini-batch, where N is the number of GPUs. During backpropagation, every GPU computes the gradient on its local training data, then the GPUs exchange gradients and each one updates its local copy of the weights.
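The idea can be simulated on a single machine. The sketch below is NOT the paper's implementation: it uses a toy linear model with a mean-squared-error loss, plain Python lists instead of GPUs, and simple gradient averaging in place of whatever communication scheme Minwa actually uses. All function names are hypothetical, and it assumes the mini-batch splits evenly across workers.

```python
def gradient(weights, batch):
    """Mean-squared-error gradient of a linear model over one batch."""
    grad = [0.0] * len(weights)
    for features, target in batch:
        prediction = sum(w * x for w, x in zip(weights, features))
        error = prediction - target
        for i, x in enumerate(features):
            grad[i] += 2.0 * error * x / len(batch)
    return grad


def data_parallel_step(weights, mini_batch, num_workers, lr):
    """One synchronous step: shard, compute local gradients, average, update."""
    shard_size = len(mini_batch) // num_workers  # assumes an even split
    shards = [mini_batch[k * shard_size:(k + 1) * shard_size]
              for k in range(num_workers)]
    # Each "GPU" sees only its own 1/N share of the mini-batch.
    local_grads = [gradient(weights, shard) for shard in shards]
    # "Exchange": every worker ends up with the same averaged gradient.
    avg_grad = [sum(g[i] for g in local_grads) / num_workers
                for i in range(len(weights))]
    # Every replica applies the identical update, so the copies stay in sync.
    return [w - lr * g for w, g in zip(weights, avg_grad)]


mini_batch = [((1.0, 2.0), 5.0), ((2.0, 1.0), 4.0),
              ((3.0, 0.0), 3.0), ((0.0, 3.0), 6.0)]
weights = [0.0, 0.0]
single = data_parallel_step(weights, mini_batch, num_workers=1, lr=0.01)
parallel = data_parallel_step(weights, mini_batch, num_workers=4, lr=0.01)
print(single, parallel)
```

With equally sized shards, the averaged gradient matches the gradient of the whole mini-batch, so 4 workers produce the same update as 1 worker (up to floating-point rounding), just with the work split four ways.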

3. Data Augmentation

Aggressive data augmentation is proposed.

  1. Color casting: Add a random integer from -20 to +20 to R, G, B channels.
  2. Vignetting: Make the periphery of an image darker, with two random parameters, area to add the effect, and how much brightness is reduced.
  3. Lens Distortion: Horizontal and vertical stretching.
  4. Rotation, flipping and cropping: The same as in prior work.
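A minimal sketch of the first two augmentations, using only the Python standard library and an image stored as rows of (R, G, B) tuples. The details are my assumptions, not the paper's: one shift per channel applied to the whole image, a linear brightness falloff for the vignette, and the parameter names `start_radius` and `strength`. In practice the two vignette parameters would be sampled randomly per image.

```python
import math
import random


def color_cast(image, rng, max_shift=20):
    """Add one random integer in [-max_shift, max_shift] to each channel."""
    shifts = [rng.randint(-max_shift, max_shift) for _ in range(3)]
    return [[tuple(min(255, max(0, c + s)) for c, s in zip(pixel, shifts))
             for pixel in row] for row in image]


def vignette(image, start_radius, strength):
    """Darken the periphery of an image.

    start_radius in [0, 1): normalized distance where darkening begins.
    strength in [0, 1]: brightness reduction at the farthest corner.
    """
    height, width = len(image), len(image[0])
    cy, cx = (height - 1) / 2, (width - 1) / 2
    max_dist = math.hypot(cy, cx) or 1.0  # avoid dividing by zero for 1x1
    out = []
    for y, row in enumerate(image):
        new_row = []
        for x, pixel in enumerate(row):
            d = math.hypot(y - cy, x - cx) / max_dist
            falloff = max(0.0, (d - start_radius) / (1.0 - start_radius))
            scale = 1.0 - strength * falloff
            new_row.append(tuple(int(c * scale) for c in pixel))
        out.append(new_row)
    return out


rng = random.Random(0)
image = [[(128, 128, 128)] * 5 for _ in range(5)]  # 5x5 mid-gray test image
cast = color_cast(image, rng)
dark = vignette(image, start_radius=0.5, strength=0.6)
print(cast[0][0], dark[2][2], dark[0][0])
```

The center pixel keeps its brightness, while the corners are darkened the most.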

4. Experimental Results

VGGNet [2] is used. Multi-scale training is also used. The model is pre-trained on ILSVRC and then fine-tuned on the new dataset.

Model Architecture

4.1 CUB-200-2011

200 bird species recognition, 11,788 images.

4.2 Oxford 102 Flowers

102 different categories of flowers, 8,189 images.

4.3 Oxford-IIIT Pets

37 classes, 7,349 images.

4.4 FGVC-Aircraft

100 aircraft variants, 10,000 images.

4.5 MIT-67 Indoor Scenes

67 indoor scenes, 15,620 images.


ILSVRC (ImageNet): 1000 images in each of 1000 categories. The best error rate is indeed 4.58%, but the results were eventually withdrawn from ILSVRC due to the rule violation.

Single Model


