Review: DSSD — Deconvolutional Single Shot Detector (Object Detection)
Deconvolution Layers: Introduce additional large-scale context, improve accuracy for small objects
This time, DSSD (Deconvolutional Single Shot Detector) is reviewed. DSSD improves on the previous SSD by adding a deconvolutional path. It is a 2017 arXiv technical report with over 100 citations. (Sik-Ho Tsang @ Medium).
- Gradual deconvolution to enlarge the feature maps
- Feature Combination from convolution path and deconvolution path
What Are Covered
- Overall Architecture
- Deconvolution Module
- Prediction Module
- Some Training Details
- Results
1. Overall Architecture
- Convs in white color: The backbone, which can be VGGNet or ResNet, used for feature extraction.
- Convs in blue color: The original SSD part, which removes the fully connected layers of the original VGGNet/ResNet and adds conv layers using atrous/dilated convolutions (originated from wavelets, used by DeepLab and DilatedNet). (Please visit SSD if interested.)
- Remaining Convs: The deconvolution modules and prediction modules, which will be described in detail later on.
2. Deconvolution Module
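Each deconvolution module fuses a deep (small) feature map with the corresponding shallower SSD feature map: the deep map is upsampled by a 2×2 deconvolution, both paths pass through conv + BN layers, and the two are combined by element-wise product followed by ReLU. A minimal PyTorch sketch of this structure (the channel widths here are illustrative assumptions, not necessarily the paper's exact configuration):

```python
import torch
import torch.nn as nn

class DeconvolutionModule(nn.Module):
    """Sketch of the DSSD deconvolution module: upsample the deeper map
    by a 2x2 deconvolution, run both paths through conv+BN, then fuse
    them by element-wise product followed by ReLU."""
    def __init__(self, deep_ch, lateral_ch, out_ch=512):
        super().__init__()
        # deconvolution path: upsample the deeper feature map by 2x
        self.deconv = nn.Sequential(
            nn.ConvTranspose2d(deep_ch, out_ch, kernel_size=2, stride=2),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
        )
        # lateral path: project the SSD feature map to the same width
        self.lateral = nn.Sequential(
            nn.Conv2d(lateral_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, deep, lateral):
        # element-wise product worked best in the ablation (DM(Eltw-prod))
        return self.relu(self.deconv(deep) * self.lateral(lateral))

m = DeconvolutionModule(deep_ch=256, lateral_ch=512)
deep = torch.randn(1, 256, 5, 5)    # deeper, smaller feature map
lat = torch.randn(1, 512, 10, 10)   # shallower map, 2x larger spatially
out = m(deep, lat)
print(tuple(out.shape))  # (1, 512, 10, 10)
```

The output has the spatial size of the shallower map, so the module can be applied repeatedly to gradually enlarge the feature maps.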
3. Prediction Module
- Various Prediction Modules are tested.
- (a): The most basic one, used in SSD, which directly predicts the object class and performs bounding box regression.
- (b): Additional Conv1×1 layers are applied to the feature maps to increase the dimension, and there is also a skip connection with element-wise addition.
- (c): The same as (b), except that an additional Conv1×1 is applied on the skip connection path.
- (d): Two copies of (c) cascaded.
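Variant (c) above can be sketched as a residual block whose skip connection also carries a 1×1 conv, with the fused feature feeding the class and box heads. A hedged PyTorch sketch (the channel widths, number of priors, and class count are illustrative assumptions):

```python
import torch
import torch.nn as nn

class PredictionModuleC(nn.Module):
    """Sketch of prediction module variant (c): 1x1 convs on the main
    path, an extra 1x1 conv on the skip path, element-wise addition,
    then per-location class / box predictions."""
    def __init__(self, in_ch, mid_ch=256, out_ch=1024,
                 num_priors=6, num_classes=21):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, 1), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, mid_ch, 1), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, out_ch, 1),
        )
        self.skip = nn.Conv2d(in_ch, out_ch, 1)  # the extra 1x1 conv of (c)
        self.relu = nn.ReLU(inplace=True)
        # per-location predictions: class scores and 4 box offsets per prior
        self.cls = nn.Conv2d(out_ch, num_priors * num_classes, 3, padding=1)
        self.box = nn.Conv2d(out_ch, num_priors * 4, 3, padding=1)

    def forward(self, x):
        f = self.relu(self.body(x) + self.skip(x))  # element-wise addition
        return self.cls(f), self.box(f)

pm = PredictionModuleC(in_ch=512)
scores, offsets = pm(torch.randn(1, 512, 10, 10))
print(tuple(scores.shape), tuple(offsets.shape))
```

Variant (d) would simply stack two such blocks before the heads.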
4. Some Training Details
Two-stage Training
- A well-trained SSD, initialized from an ImageNet-pretrained model, is used.
- In the first stage, only the deconvolution side is trained.
- In the second stage, the entire network is fine-tuned.
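The two-stage schedule above can be sketched by toggling `requires_grad` on parameter groups; the module names here ("backbone", "deconv", "pred") are hypothetical stand-ins, not the authors' code:

```python
import torch.nn as nn

# Toy model standing in for a trained SSD plus the new deconvolution side.
model = nn.ModuleDict({
    "backbone": nn.Conv2d(3, 16, 3),                     # pretrained SSD part
    "deconv":   nn.ConvTranspose2d(16, 16, 2, stride=2), # new deconv side
    "pred":     nn.Conv2d(16, 8, 3),                     # new prediction side
})

def set_stage(model, stage):
    for name, p in model.named_parameters():
        # stage 1: train only the newly added modules;
        # stage 2: fine-tune everything
        p.requires_grad = (stage == 2) or name.startswith(("deconv", "pred"))

set_stage(model, stage=1)
assert not model["backbone"].weight.requires_grad   # SSD part frozen
set_stage(model, stage=2)
assert model["backbone"].weight.requires_grad       # everything trainable
```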
Others
- Extensive data augmentation, including random cropping, flipping, and random photometric distortion, is also used.
- After a K-means clustering analysis of the training boxes, a prior box aspect ratio of 1.6 is added, i.e. aspect ratios {1.6, 2.0, 3.0} are used.
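The aspect-ratio choice can be reproduced in spirit with a simple 1-D k-means over the width/height ratios of the training boxes. The helper below is an illustrative sketch on synthetic data, not the paper's clustering code:

```python
import numpy as np

def kmeans_1d(values, k, iters=100):
    """Plain k-means on scalar aspect ratios (width / height)."""
    # deterministic init: spread initial centers across the data quantiles
    centers = np.quantile(values, np.linspace(0.1, 0.9, k))
    for _ in range(iters):
        # assign each ratio to its nearest center
        labels = np.abs(values[:, None] - centers[None, :]).argmin(axis=1)
        new = np.array([values[labels == j].mean() if (labels == j).any()
                        else centers[j] for j in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return np.sort(centers)

# synthetic box aspect ratios clustered around 1.6, 2.0 and 3.0
rng = np.random.default_rng(1)
ratios = np.concatenate([rng.normal(1.6, 0.05, 300),
                         rng.normal(2.0, 0.05, 300),
                         rng.normal(3.0, 0.05, 300)])
print(kmeans_1d(ratios, k=3).round(2))  # roughly recovers 1.6, 2.0, 3.0
```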
Based on all the above, an ablation study is performed on PASCAL VOC 2007:
- SSD 321: Original SSD with input 321×321, 76.4% mAP.
- SSD 321 + PM(c): Original SSD using Prediction Module (c), 77.1% mAP, which is better than the variants using PM (b) and PM (d).
- SSD 321 + PM(c) + DM(Eltw-prod): DM means Deconvolution Module; thus, this is DSSD using PM(c) with element-wise product for feature combination, 78.6% mAP. It slightly outperforms the variant using element-wise addition.
- SSD 321 + PM(c) + DM(Eltw-prod) + Stage 2: With two-stage training, the performance decreases.
5. Results
5.1. PASCAL VOC 2007
SSD and DSSD are trained on the union of 2007 trainval and 2012 trainval.
- SSD300* and SSD512* (* means the new data augmentation tricks are used): With *, the original SSD already outperforms other state-of-the-art approaches except R-FCN.
- SSD 321 and SSD 513: With ResNet as backbone, the performances are already similar to those of SSD300* and SSD512*.
- DSSD 321 and DSSD 513: With the deconvolutional path, they outperform SSD 321 and SSD 513 respectively.
- Particularly, DSSD513 outperforms R-FCN.
5.2. PASCAL VOC 2012
VOC2007 trainval+test and 2012 trainval are used for training. Since two-stage training was found to be unhelpful, one-stage training is used here.
- DSSD 513 outperforms others, with 80.0% mAP. And it is trained without using extra training data from COCO dataset.
5.3. MS COCO
Again, no two-stage training.
- SSD300* is already better than Faster R-CNN and ION.
- DSSD321 has better AP on small objects, with 7.4% compared with SSD321's 6.2%.
- For the larger model, DSSD513 obtains 33.2% mAP, which is better than R-FCN's 29.9% mAP, and is already competitive with Faster R-CNN+++. (+++ means VOC2007 and VOC2012 are also used for training.)
5.4. Inference Time
To simplify the network during testing, BN is removed and folded into the conv layers: with BN running mean μ, variance σ², scale γ, and shift β, the conv weight is rescaled to w' = (γ / √(σ² + ε))·w and the bias becomes b' = γ(b − μ)/√(σ² + ε) + β.
In brief, the BN effect is merged into the conv layer's weight and bias calculation so that the network is simplified. This improves the speed by 1.2 to 1.5 times and reduces memory usage by up to three times.
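The folding can be verified numerically. In the sketch below, `fold_bn` is an illustrative helper, and a 1×1 conv is modeled as a plain matrix multiply for simplicity:

```python
import numpy as np

def fold_bn(w, b, gamma, beta, mean, var, eps=1e-5):
    """Fold a BatchNorm layer (gamma, beta, running mean/var) into the
    preceding conv's weight w [out_ch, ...] and bias b [out_ch]."""
    scale = gamma / np.sqrt(var + eps)
    # scale each output channel of the weight, adjust the bias
    w_f = w * scale.reshape(-1, *([1] * (w.ndim - 1)))
    b_f = (b - mean) * scale + beta
    return w_f, b_f

# check: conv followed by BN equals the single folded conv
rng = np.random.default_rng(0)
w = rng.normal(size=(4, 3))           # 1x1 "conv" as a matrix
b = rng.normal(size=4)
gamma, beta = rng.normal(size=4), rng.normal(size=4)
mean, var = rng.normal(size=4), rng.uniform(0.5, 2.0, size=4)

x = rng.normal(size=3)
y_bn = gamma * ((w @ x + b) - mean) / np.sqrt(var + 1e-5) + beta
w_f, b_f = fold_bn(w, b, gamma, beta, mean, var)
y_fold = w_f @ x + b_f
print(np.allclose(y_bn, y_fold))  # True
```

At test time the BN layer can then be dropped entirely, which is where the speed and memory savings come from.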
Due to the small input size, SSD does not work well on small objects. With the deconvolution path, DSSD shows obvious improvement.
References
[2017 arXiv] [DSSD]
DSSD: Deconvolutional Single Shot Detector
My Related Reviews
Image Classification
[LeNet] [AlexNet] [ZFNet] [VGGNet] [SPPNet] [PReLU-Net] [DeepImage] [GoogLeNet / Inception-v1] [BN-Inception / Inception-v2] [Inception-v3] [Inception-v4] [Xception] [MobileNetV1] [ResNet] [Pre-Activation ResNet] [RiR] [RoR] [Stochastic Depth] [WRN] [FractalNet] [Trimps-Soushen] [PolyNet] [ResNeXt] [DenseNet]
Object Detection
[OverFeat] [R-CNN] [Fast R-CNN] [Faster R-CNN] [DeepID-Net] [R-FCN] [YOLOv1] [SSD] [YOLOv2 / YOLO9000]
Semantic Segmentation
[FCN] [DeconvNet] [DeepLabv1 & DeepLabv2] [ParseNet] [DilatedNet] [PSPNet]