Review: DeconvNet — Unpooling Layer (Semantic Segmentation)

Sik-Ho Tsang
Towards Data Science
Oct 8, 2018


In this story, DeconvNet is briefly reviewed. The deconvolution network (DeconvNet) is composed of deconvolution and unpooling layers.

For the conventional FCN, the output is obtained by high-ratio (32×, 16×, and 8×) upsampling, which can produce a coarse segmentation output (label map). In DeconvNet, the output label map is instead obtained by gradual deconvolution and unpooling. The paper was published in 2015 ICCV and had more than 1000 citations when I was writing this story. (Sik-Ho Tsang @ Medium)

What Is Covered

  1. Unpooling and Deconvolution
  2. Instance-wise Segmentation
  3. Two-Stage Training
  4. Results

1. Unpooling and Deconvolution

The following is the overall architecture of DeconvNet:

DeconvNet Architecture

As we can see, it uses VGGNet as the backbone. The first part is a convolution network, as in FCN, with conv and pooling layers. The second part is the deconvolution network, which is the novel part of this paper.

Remember positions during Pooling (Left), Reuse the position information during Unpooling (Right)

To perform unpooling, we need to remember the position of each maximum activation value when doing max pooling, as shown above (left). The remembered positions (the pooling switches) are then used to place each value back at its original location during unpooling (right).
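To make the pooling-switch idea concrete, below is a minimal PyTorch sketch (not the authors' implementation; the tensor shapes are illustrative only), where max pooling returns the switch locations and unpooling reuses them:

```python
import torch
import torch.nn as nn

# Max pooling that also returns the positions (switches) of the maxima.
pool = nn.MaxPool2d(kernel_size=2, stride=2, return_indices=True)
unpool = nn.MaxUnpool2d(kernel_size=2, stride=2)

x = torch.randn(1, 64, 224, 224)           # an example feature map
pooled, switches = pool(x)                  # pooled: 1x64x112x112, switches: argmax positions

# In the deconvolution network, the switches place each value back at its
# original location; all other positions are filled with zeros.
unpooled = unpool(pooled, switches, output_size=x.size())
print(unpooled.shape)                       # torch.Size([1, 64, 224, 224])
```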

Convolution reduces the input to a smaller size (Left); Deconvolution maps the input back to a larger size (Right)

Deconvolution (transposed convolution) simply convolves the input back to a larger size. (If interested, please read my FCN review for details.)
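As a rough illustration (the layer sizes below are my own example, not the paper's), a transposed convolution in PyTorch that maps a feature map back to twice the spatial size:

```python
import torch
import torch.nn as nn

# A transposed convolution ("deconvolution"). With kernel 4, stride 2 and
# padding 1, the spatial size is doubled: out = (in - 1)*2 - 2*1 + 4 = 2*in.
deconv = nn.ConvTranspose2d(in_channels=512, out_channels=256,
                            kernel_size=4, stride=2, padding=1)

x = torch.randn(1, 512, 14, 14)
y = deconv(x)
print(y.shape)   # torch.Size([1, 256, 28, 28])
```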

An example of Deconvolution and Unpooling

The above figure is an example: (b) is the output of the 14×14 deconv layer, (c) is the output after unpooling, and so on. In (j), we can see that the bicycle is reconstructed at the last 224×224 deconv layer, which shows that the learned filters can capture class-specific shape information.

Input Image (Left), FCN-8s (Middle), DeconvNet (Right)

Other examples are shown above, and they show that DeconvNet reconstructs object shapes better than FCN-8s.

2. Instance-wise Segmentation

Bad Examples of Semantic Segmentation Without Using Region Proposals

As shown above, objects that are substantially larger or smaller than the receptive field may be fragmented or mislabeled. Small objects are often ignored and classified as background.

Semantic segmentation is posed as an instance-wise segmentation problem. First, the top 50 of about 2000 region proposals (bounding boxes) are detected by an object detection approach, EdgeBox. Then, DeconvNet is applied to each proposal, and the outputs of all proposals are aggregated back onto the original image. By using proposals, objects at various scales can be handled effectively.
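A sketch of the aggregation step is below; the function name, the box format, and the pixel-wise maximum rule are my own illustrative assumptions rather than the exact procedure from the paper:

```python
import numpy as np

def aggregate_proposals(image_hw, proposals, score_maps, num_classes):
    """Paste each proposal's class-score map into a full-size map and
    aggregate overlapping proposals by pixel-wise maximum.

    proposals  : list of (y0, x0, y1, x1) boxes in image coordinates
    score_maps : list of arrays, each of shape (num_classes, y1 - y0, x1 - x0)
    """
    H, W = image_hw
    full = np.zeros((num_classes, H, W), dtype=np.float32)
    for (y0, x0, y1, x1), scores in zip(proposals, score_maps):
        region = full[:, y0:y1, x0:x1]
        full[:, y0:y1, x0:x1] = np.maximum(region, scores)
    # Final label map: the class with the highest aggregated score per pixel.
    return full.argmax(axis=0)
```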

3. Two-Stage Training

First-Stage Training

Crop the object instances using ground-truth annotations so that the object is centered in the cropped window, and then perform training. This helps to reduce the variations in object location and size.
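For illustration only, such a crop might look like the sketch below; the helper name, the (y0, x0, y1, x1) box format, and the enlargement margin are assumptions, not values from the paper:

```python
import numpy as np

def center_crop_on_box(image, box, margin=1.2):
    """Crop a window centered on a ground-truth box, slightly enlarged,
    so the object sits in the middle of the training example."""
    H, W = image.shape[:2]
    y0, x0, y1, x1 = box
    cy, cx = (y0 + y1) / 2.0, (x0 + x1) / 2.0
    half = max(y1 - y0, x1 - x0) * margin / 2.0
    top, bottom = int(max(0, cy - half)), int(min(H, cy + half))
    left, right = int(max(0, cx - half)), int(min(W, cx + half))
    return image[top:bottom, left:right]
```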

Second-Stage Training

More challenging examples are used. These examples are generated by cropping with proposals that overlap the ground-truth segmentation.

Some Other Details

  • Batch Normalization is used.
  • The conv part is initialized using the weights in VGGNet.
  • The deconv part is initialized with zero-mean Gaussians (see the sketch after this list).
  • 64 samples per batch.
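
A minimal sketch of these initialization choices in PyTorch is below; the function name, the Gaussian standard deviation, and the use of torchvision's pretrained VGG-16 are my assumptions for illustration:

```python
import torch.nn as nn
import torchvision

def init_deconvnet(conv_part, deconv_part, std=0.01):
    """Copy VGG-16 conv weights into the conv part and give the deconv
    part zero-mean Gaussian weights (the std value is an assumption)."""
    vgg = torchvision.models.vgg16(weights="IMAGENET1K_V1")
    vgg_convs = [m for m in vgg.features if isinstance(m, nn.Conv2d)]
    own_convs = [m for m in conv_part.modules() if isinstance(m, nn.Conv2d)]
    for src, dst in zip(vgg_convs, own_convs):
        dst.weight.data.copy_(src.weight.data)
        dst.bias.data.copy_(src.bias.data)
    # Zero-mean Gaussian initialization for the deconvolution part.
    for m in deconv_part.modules():
        if isinstance(m, (nn.ConvTranspose2d, nn.Conv2d)):
            nn.init.normal_(m.weight, mean=0.0, std=std)
            if m.bias is not None:
                nn.init.zeros_(m.bias)
```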

4. Results

Mean IoU results

  • FCN-8s: only 64.4% mean IoU.
  • DeconvNet: 69.6%
  • DeconvNet+CRF: 70.5% (CRF is a post-processing step)
  • EDeconvNet: 71.5% (EDeconvNet means the results are ensembled with FCN-8s)
  • EDeconvNet+CRF: 72.5%, the highest mean IoU.

Benefits of Instance-wise Segmentation

As shown in the above figure, instance-wise segmentation builds up the result gradually, instance by instance, rather than segmenting all instances at once.

It should be noted that the gain of DeconvNet does not come only from the gradual deconvolution and unpooling; it may also come from the instance-wise segmentation and the two-stage training.

Some Visualization Results

EDeconvNet+CRF usually produces good results, even in cases where FCN-8s does poorly.

Reference

  1. [2015 ICCV] [DeconvNet]
    Learning Deconvolution Network for Semantic Segmentation

My Reviews

[FCN] [VGGNet]
