Review: DSSD — Deconvolutional Single Shot Detector (Object Detection)

Deconvolution Layers: Introduce additional large-scale context, improve accuracy for small objects

Sik-Ho Tsang
Towards Data Science

--

This time, DSSD (Deconvolutional Single Shot Detector) is reviewed. DSSD improves the previous SSD with a deconvolutional path. It is a 2017 arXiv technical report with over 100 citations. (Sik-Ho Tsang @ Medium).

  • Gradual deconvolution to enlarge the feature maps
  • Feature Combination from convolution path and deconvolution path

What Are Covered

  1. Overall Architecture
  2. Deconvolution Module
  3. Prediction Module
  4. Some Training Details
  5. Results

1. Overall Architecture

SSD (Top) DSSD (Bottom)
  • Convs in white color: They are the VGGNet or ResNet backbone used for feature extraction.
  • Convs in blue color: They are the original SSD part, which removes the fully connected layers of the original VGGNet/ResNet and adds conv layers that use atrous/dilated convolutions (originating from wavelets, also used by DeepLab and DilatedNet). (Please visit SSD if interested.)
  • Remaining Convs: They form the deconvolution modules and prediction modules, which are described in detail later on (a sketch of the top-down pass follows this list).
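
To make the data flow concrete, below is a hedged Python (PyTorch-style) sketch of the top-down pass: the deconvolution path starts from the smallest SSD feature map, walks back over the earlier feature maps, combines each of them through a deconvolution module, and feeds every combined map to a prediction module. The function and argument names are illustrative, not from the paper's code.

```python
# Hedged sketch of the DSSD top-down pass. `ssd_feature_maps` is the list of
# feature maps produced along the SSD convolution path (largest first),
# `deconv_modules[i]` combines the current deconvolution-path map with
# ssd_feature_maps[i], and `prediction_modules[i]` produces class scores and
# box offsets. All names are illustrative.
def dssd_forward(ssd_feature_maps, deconv_modules, prediction_modules):
    outputs = []
    x = ssd_feature_maps[-1]                      # smallest SSD feature map
    outputs.append(prediction_modules[-1](x))
    # Walk back from the second-smallest map to the largest one
    for i in range(len(ssd_feature_maps) - 2, -1, -1):
        x = deconv_modules[i](x, ssd_feature_maps[i])   # upsample and combine
        outputs.append(prediction_modules[i](x))
    return outputs
```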

2. Deconvolution Module

Deconvolution Module
  • The feature maps on the deconvolution path are upsampled by Deconv2×2 and then go through Conv3×3+BN.
  • On the other hand, the corresponding same-size feature maps from the convolution path go through Conv3×3+BN+ReLU+Conv3×3+BN.
  • Then the two branches are element-wise multiplied (Eltw Product), passed through ReLU, and fed to the Prediction Module (see the sketch after this list).
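
A minimal PyTorch sketch of this module is given below; the layer sequence follows the description above, while the output channel width (512 here) and the class name are assumptions for illustration.

```python
import torch.nn as nn

class DeconvolutionModule(nn.Module):
    """Sketch of the DSSD deconvolution module (channel widths are assumed)."""
    def __init__(self, deconv_channels, conv_channels, out_channels=512):
        super().__init__()
        # Deconvolution path: upsample by 2 (Deconv2x2), then Conv3x3 + BN
        self.deconv_branch = nn.Sequential(
            nn.ConvTranspose2d(deconv_channels, out_channels, kernel_size=2, stride=2),
            nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_channels),
        )
        # Convolution path: Conv3x3 + BN + ReLU + Conv3x3 + BN on the same-size map
        self.conv_branch = nn.Sequential(
            nn.Conv2d(conv_channels, out_channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, deconv_feat, conv_feat):
        # Element-wise product of the two branches, then ReLU
        return self.relu(self.deconv_branch(deconv_feat) * self.conv_branch(conv_feat))
```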

3. Prediction Module

Various Prediction Module
  • Various Prediction Modules are tested.
  • (a): It is the most basic one, used in SSD, which directly predicts the object class and performs bounding box regression.
  • (b): Additional Conv1×1 layers are applied to the feature maps to increase the dimension, and there is also a skip connection with element-wise addition.
  • (c): Same as (b), except that an additional Conv1×1 is applied on the skip connection path (see the sketch after this list).
  • (d): Two of (c) are cascaded.
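
Below is a hedged PyTorch sketch of Prediction Module (c): a residual-style block with Conv1×1 layers on the main path, an extra Conv1×1 on the skip connection, element-wise addition, and then the class-score and box-regression heads. The channel widths and the 3×3 head convolutions are assumptions for illustration.

```python
import torch.nn as nn

class PredictionModuleC(nn.Module):
    """Sketch of Prediction Module (c); channel widths are assumed."""
    def __init__(self, in_channels, mid_channels, out_channels, num_priors, num_classes):
        super().__init__()
        # Main path: Conv1x1 layers that increase the feature dimension
        self.main = nn.Sequential(
            nn.Conv2d(in_channels, mid_channels, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(mid_channels, mid_channels, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(mid_channels, out_channels, kernel_size=1),
        )
        # Skip connection with its own Conv1x1 (the difference from module (b))
        self.skip = nn.Conv2d(in_channels, out_channels, kernel_size=1)
        self.relu = nn.ReLU(inplace=True)
        # Per-location heads: class scores and bounding-box offsets
        self.cls_head = nn.Conv2d(out_channels, num_priors * num_classes, kernel_size=3, padding=1)
        self.box_head = nn.Conv2d(out_channels, num_priors * 4, kernel_size=3, padding=1)

    def forward(self, x):
        feat = self.relu(self.main(x) + self.skip(x))   # element-wise addition
        return self.cls_head(feat), self.box_head(feat)
```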

4. Some Training Details

Two-stage Training

  • A well-trained SSD, initialized from an ImageNet pre-trained model, is used.
  • For the first stage, only the deconvolution side is trained.
  • For the second stage, the entire network is fine-tuned (a sketch of this schedule is given below).
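
A hedged sketch of this schedule in PyTorch terms is shown below; the submodule names ssd_part and deconv_part are hypothetical and only illustrate which parameters are frozen in each stage.

```python
import torch.nn as nn

def configure_stage(model: nn.Module, stage: int):
    """Set requires_grad for the two-stage DSSD training schedule.

    Assumes the model exposes hypothetical submodules `ssd_part` (the
    pre-trained SSD) and `deconv_part` (deconvolution + prediction modules).
    """
    for p in model.ssd_part.parameters():
        p.requires_grad = (stage == 2)   # frozen in stage 1, fine-tuned in stage 2
    for p in model.deconv_part.parameters():
        p.requires_grad = True           # the new layers are always trained
```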

Others

  • Extensive data augmentation, including random cropping, flipping, and random photometric distortion, is also used.
  • After an analysis using K-means clustering, a prior box aspect ratio of 1.6 is added, i.e. {1.6, 2.0, 3.0} are used (a small sizing sketch is given below).
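
As a rough illustration, the snippet below shows how the prior-box widths and heights would follow from these aspect ratios under the standard SSD rule w = s·√ar, h = s/√ar; the scale value and the handling of inverse ratios are assumptions, and SSD's extra box for aspect ratio 1 is omitted.

```python
from math import sqrt

def prior_box_sizes(scale, aspect_ratios=(1.6, 2.0, 3.0)):
    """Widths/heights of prior boxes for one scale (illustrative only)."""
    sizes = [(scale, scale)]                                 # aspect ratio 1
    for ar in aspect_ratios:
        sizes.append((scale * sqrt(ar), scale / sqrt(ar)))   # ratio ar
        sizes.append((scale / sqrt(ar), scale * sqrt(ar)))   # inverse ratio 1/ar
    return sizes

print(prior_box_sizes(0.2))
```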

Based on all of the above, an ablation study is performed on PASCAL VOC 2007:

Results on PASCAL VOC 2007
  • SSD 321: Original SSD with input 321×321, 76.4% mAP.
  • SSD 321 + PM(c): Original SSD using Prediction Module (c), 77.1% mAP, which is better than the variants using PM (b) and PM (d).
  • SSD 321 + PM(c) + DM(Eltw-prod): DM means Deconvolution Module, so this is DSSD using PM(c) with element-wise product for feature combination, 78.6% mAP. It slightly outperforms the variant using element-wise addition.
  • SSD 321 + PM(c) + DM(Eltw-prod) + Stage 2: With two-stage training, the performance decreases.

5. Results

5.1. PASCAL VOC 2007

Results on PASCAL VOC 2007 Test

SSD and DSSD are trained on the union of 2007 trainval and 2012 trainval.

  • SSD300* and SSD512* (* means the new data augmentation tricks are used): With *, the original SSD already outperforms other state-of-the-art approaches except R-FCN.
  • SSD 321 and SSD 513: With ResNet as the backbone, the performances are already similar to SSD300* and SSD512*.
  • DSSD 321 and DSSD 513: With the deconvolution path, they outperform SSD 321 and SSD 513 respectively.
  • Particularly, DSSD513 outperforms R-FCN.

5.2. PASCAL VOC 2012

Results on PASCAL VOC 2012 Test

VOC2007 trainval+test and VOC2012 trainval are used for training. Since two-stage training is found not to help, one-stage training is used here.

  • DSSD 513 outperforms the others with 80.0% mAP, and it is trained without using extra training data from the COCO dataset.

5.3. MS COCO

Results on MS COCO Test

Again, no two-stage training.

  • SSD300* is already better than Faster R-CNN and ION.
  • DSSD321 has better AP on small objects: 7.4% compared with only 6.2% for SSD321.
  • For the larger model, DSSD513 obtains 33.2% mAP, which is better than R-FCN's 29.9% mAP, and it is already competitive with Faster R-CNN+++. (+++ means VOC2007 and VOC2012 are also used for training.)

5.4. Inference Time

To simplify the network during testing, each BN layer is removed and merged into the preceding conv layer.

In brief, the BN effect is merged into the conv layer's weight and bias so that the network is simplified. This improves the speed by 1.2 to 1.5 times and reduces memory usage by up to three times. A sketch of this folding is given below.
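
The sketch below uses the standard folding relation ŵ = γ·w/√(σ² + ε) and b̂ = γ·(b − μ)/√(σ² + ε) + β; PyTorch is used here only for illustration (stride and padding are copied, other conv options are ignored), and the function name is hypothetical.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def fold_bn_into_conv(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    """Return a single conv layer equivalent to conv followed by bn (inference only)."""
    fused = nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                      stride=conv.stride, padding=conv.padding, bias=True)
    scale = bn.weight / torch.sqrt(bn.running_var + bn.eps)   # gamma / sqrt(var + eps)
    fused.weight.copy_(conv.weight * scale[:, None, None, None])
    conv_bias = conv.bias if conv.bias is not None else torch.zeros(conv.out_channels)
    fused.bias.copy_((conv_bias - bn.running_mean) * scale + bn.bias)
    return fused
```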

Speed & Accuracy on PASCAL VOC 2007 Test
  • SSD 513 has similar speed (8.7 fps) and accuracy compared with R-FCN (9 fps). With BN layers removed and merged into the conv layers, 11.0 fps is obtained, which is faster.
  • DSSD513 has better accuracy than R-FCN but is slightly slower.
  • DSSD321 has lower accuracy than R-FCN but is faster.

5.5. Qualitative Results

SSD (Left) DSSD (Right)

Due to the small input size, SSD does not work well on small objects. With the deconvolution path, DSSD shows an obvious improvement.
