[Review] BTBNet: Multi-Stream Bottom-Top-Bottom Fully Convolutional Network (Defocus Blur Detection)

U-Net-Like BTBNet, Plus Recurrent Refinement Network, Outperforms Park CVPR’17

A challenging example for defocus blur detection

In this story, Defocus Blur Detection via Multi-Stream Bottom-Top-Bottom Fully Convolutional Network (BTBNet), by Dalian University of Technology and Chinese Academy of Sciences, is presented. In this paper:

  • A multi-stream bottom-top-bottom fully convolutional network (BTBNet) is proposed, which is the first end-to-end deep network for defocus blur detection. Low-level cues are integrated with high-level semantic information.
  • Multi-stream BTBNets are used to handle input images with different scales.
  • A fusion and recurrent reconstruction network is also designed to recurrently refine the preceding blur detection maps.
  • A new challenging dataset is also constructed.

This is a paper in 2018 CVPR with over 30 citations. (Sik-Ho Tsang @ Medium)

Outline

  1. Bottom-Top-Bottom Fully Convolutional Network (BTBNet)
  2. Multi-Stream BTBNet
  3. Fusion and Recurrent Reconstruction Network, FRRNet (FNet+RRNet)
  4. Model Training
  5. Experimental Results

1. Bottom-Top-Bottom Fully Convolutional Network (BTBNet)

Bottom-Top-Bottom Fully Convolutional Network (BTBNet)
  • ImageNet pre-trained VGG16 is used, but with the top three fully connected layers and the five pooling layers removed.
  • Thus, the output resolution of the transformed VGG network is the same as the original input resolution, as shown above.
  • A step-wise feedback process is used. Between each block of the bottom-top backbone network, the feedback information is combined with forward information step by step.
  • The integration of feedback and forward information is achieved by element-wise addition.
  • Before the information integration in each step, an extra convolutional (Conv) layer with ReLU is attached on both the bottom-top and top-bottom streams. The extra layers have 3×3 kernels and 256, 128, 64, and 1 channels, respectively.
  • The final output is a DBD map.
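The step-wise top-bottom feedback described above can be sketched in a few lines. This is a minimal illustrative sketch, not the paper's Caffe implementation: `conv_relu` here is a stand-in for a real 3×3 Conv + ReLU layer (it uses a fixed random 1×1 channel projection), and only the element-wise-addition fusion structure and the 256/128/64/1 channel widths come from the paper.

```python
import numpy as np

def conv_relu(x, out_channels, seed=0):
    """Stand-in for a 3x3 Conv + ReLU: a fixed random channel
    projection followed by ReLU (illustrative only)."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal((out_channels, x.shape[0])) * 0.1
    return np.maximum(np.tensordot(w, x, axes=1), 0.0)

def btb_fuse(forward_feats):
    """Top-bottom pass: step by step, the feedback information is
    fused with the forward feature of each block by element-wise
    addition, from the deepest block back to the shallowest."""
    channels = [256, 128, 64, 1]  # extra-layer widths from the paper
    feedback = conv_relu(forward_feats[-1], channels[0])
    for i, f in enumerate(reversed(forward_feats[:-1]), start=1):
        side = conv_relu(f, feedback.shape[0])   # match channels for addition
        feedback = conv_relu(feedback + side, channels[i])
    return feedback  # final 1-channel DBD map
```

With four forward feature maps, the feedback path narrows 256 → 128 → 64 → 1, so the output is a single-channel DBD map at the input resolution.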

2. Multi-Stream BTBNet

An example of multi-scale blur perception
  • The blur confidence is highly related to scales.
  • A patch that looks clear at one scale (e.g., scale 1) can be regarded as blurry at another scale (e.g., scale 3), and vice versa.
Top: Multi-Stream BTBNet, Before going into FRRNet (FNet+RRNet).
  • The designed BTBNet is replicated, with one replica per scale.
  • Specifically, an input image is resized to multiple different scales. Each scale In (n = 1, 2, …, N) of the input image passes through one of these replicated BTBNets, and a DBD map Mn at the same resolution as scale In is produced.
  • Then, these DBD maps are resized to the same resolution as the raw input image using bilinear interpolation.
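The per-scale pipeline can be sketched as below. This is an illustrative sketch only: `btbnet` is a placeholder for a trained BTBNet stream, and the hand-rolled `bilinear_resize` simply mimics the bilinear interpolation the paper uses to bring each DBD map back to the raw input resolution.

```python
import numpy as np

def bilinear_resize(img, out_h, out_w):
    """Minimal bilinear interpolation for a 2-D map (illustrative)."""
    h, w = img.shape
    ys = np.linspace(0, h - 1, out_h)
    xs = np.linspace(0, w - 1, out_w)
    y0 = np.floor(ys).astype(int); y1 = np.minimum(y0 + 1, h - 1)
    x0 = np.floor(xs).astype(int); x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[:, None]; wx = (xs - x0)[None, :]
    top = img[y0][:, x0] * (1 - wx) + img[y0][:, x1] * wx
    bot = img[y1][:, x0] * (1 - wx) + img[y1][:, x1] * wx
    return top * (1 - wy) + bot * wy

def multi_stream_maps(image, scales, btbnet):
    """Resize the input to each scale, run one BTBNet replica per
    scale, and resize every DBD map back to the input resolution."""
    h, w = image.shape
    maps = []
    for s in scales:
        sh, sw = max(1, round(h * s)), max(1, round(w * s))
        scaled = bilinear_resize(image, sh, sw)
        m = btbnet(scaled)                      # DBD map Mn at scale s
        maps.append(bilinear_resize(m, h, w))   # back to raw input size
    return maps
```

All resulting maps share the raw input resolution, which is what allows FNet to concatenate them channel-wise in the next step.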

3. Fusion and Recurrent Reconstruction Network, FRRNet (FNet+RRNet)

FRRNet (FNet+RRNet)
  • The FRRNet consists of two sub-networks, namely, fusion network (FNet) and recurrent reconstruction network (RRNet).
  • FNet merges the DBD maps generated by the multi-stream BTBNet, yielding a DBD map Mf with improved spatial coherence.
  • Then, RRNet recursively refines the DBD map Mf to obtain the final DBD map Mfinal.

3.1. FNet

  • The multi-stream DBD maps (M1, M2, …, MN) and the source image I1 are first concatenated into a single (N+3)-channel feature map F0.
  • Then, this map is fed to a series of Conv and ReLU layers.
  • The Conv layers have 3×3 kernels and 64, 128, 64, and 1 channels, respectively. The final output after the Conv layers is the fused DBD map Mf.
Comparison of multi-stream DBD map fusion results
  • FNet nonlinearly integrates the multi-stream DBD maps and exploits the dense spatial information of the source image.

FNet can generate smoother results while maintaining pixel-wise accuracy.
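The channel-wise concatenation and the 64/128/64/1 Conv stack can be sketched as follows. Again a minimal sketch: `conv_relu` is a stand-in for a real 3×3 Conv + ReLU layer, while the (N+3)-channel input and the channel widths come from the paper.

```python
import numpy as np

def conv_relu(x, out_channels, seed=0):
    """Stand-in for a 3x3 Conv + ReLU: a fixed random channel
    projection followed by ReLU (illustrative only)."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal((out_channels, x.shape[0])) * 0.1
    return np.maximum(np.tensordot(w, x, axes=1), 0.0)

def fnet(dbd_maps, image):
    """FNet sketch: concatenate the N single-channel DBD maps with
    the 3-channel source image into an (N+3)-channel map F0, then
    pass it through Conv+ReLU layers of 64, 128, 64, 1 channels."""
    x = np.concatenate([np.stack(dbd_maps, axis=0), image], axis=0)
    assert x.shape[0] == len(dbd_maps) + 3     # (N+3)-channel F0
    for c in [64, 128, 64, 1]:
        x = conv_relu(x, c)
    return x[0]                                # fused DBD map Mf
```

Because the source image is part of the input, FNet can exploit its dense spatial information rather than merely averaging the stream outputs.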

3.2. RRNet

  • Although FNet improves the spatial coherence of the fused DBD map, noise inevitably occurs when the input image has low-contrast foreground or cluttered background.
  • A recurrent reconstruction network (RRNet) is used for refinement.
  • RRNet has the same architecture as FNet but with different parameters. In each iteration, both the source image and the input DBD map are fed forward through the RRNet to obtain the refined DBD map, which in turn serves as the input DBD map in the next iteration.
  • Let R denote the function modeled by one recursion; the final DBD map Mfinal is obtained by applying R repeatedly.
  • The proposed RRNet can refine the DBD map by correcting its previous mistakes until the final DBD map is produced in the last iteration.
  • In practice, it is enough to use three recurrent steps for achieving satisfactory performance.
Comparison of DBD results generated from the proposed method without (w/o) and with RRNet.

RRNet can reconstruct lost information in the foreground and suppress unexpected noise in the background.
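The recursion is easy to express in code. In this sketch, `refine_step` is a placeholder for the trained RRNet forward pass (which shares the FNet architecture with its own parameters); the loop structure and the three-step default come from the paper.

```python
def rrnet_refine(refine_step, image, m_f, steps=3):
    """RRNet recursion sketch: in each iteration, both the source
    image and the current DBD map are fed through the (shared)
    network, and the refined map becomes the next iteration's input."""
    m = m_f
    for _ in range(steps):          # three steps suffice per the paper
        m = refine_step(image, m)   # M_{t+1} = R(I, M_t)
    return m                        # final DBD map Mfinal
```

Because the same network is reused at every step, the recursion adds no extra parameters beyond one FNet-sized refiner.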

4. Model Training

4.1. Loss Function

  • The multi-stream BTBNet and FRRNet are jointly trained.
  • The pixelwise loss function between the network output Md and the ground truth Gd is:
  • To boost the performance of our model, an auxiliary loss is applied at the output of each stream BTBNet.
  • Thus, the final loss function combining main and auxiliary losses can be written as follows:
  • where N is the number of streams for BTBNet, and αn is a trade-off parameter, set to 1 to balance all losses.
  • Caffe is used for implementation, with a batch size of 1.
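The main-plus-auxiliary structure of the final loss can be sketched as below. This is a structural sketch only: `pixel_loss` stands in for whatever pixelwise loss the paper uses (its equation is not reproduced in this review), while the per-stream auxiliary terms and αn = 1 come from the paper.

```python
def total_loss(pixel_loss, m_final, aux_maps, gt, alphas=None):
    """Combined loss sketch: the main pixelwise loss on the final
    DBD map plus one auxiliary loss per BTBNet stream output,
    each weighted by a trade-off parameter alpha_n (set to 1)."""
    alphas = alphas or [1.0] * len(aux_maps)
    loss = pixel_loss(m_final, gt)              # main loss
    for a, m in zip(alphas, aux_maps):
        loss += a * pixel_loss(m, gt)           # auxiliary losses
    return loss
```

The auxiliary terms give each stream its own supervision signal, which eases training of the deep multi-stream network.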

4.2. Datasets

4.2.1. Shi’s Dataset (704 Images)

  • Shi’s dataset consists of 704 partially defocus blurred images and manually annotated ground truths.
  • It is divided into two parts: 604 images for training and the remaining 100 images for testing.
  • With data augmentation, the training set is enlarged to 9,664 images by rotating each image to 8 different orientations and horizontally flipping at each orientation.
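The augmentation arithmetic checks out: 8 rotations × 2 (with and without horizontal flip) give a ×16 factor. A one-line sanity check:

```python
def augmented_size(n_images, n_rotations=8, flip=True):
    """Images after rotating to n_rotations orientations and
    (optionally) horizontally flipping each orientation."""
    return n_images * n_rotations * (2 if flip else 1)

# 604 training images -> 604 * 8 * 2 = 9,664 augmented images
```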

4.2.2. New DBD Dataset (500 Images)

  • A new DBD dataset is constructed, consisting of 500 images with pixel-wise annotations.
  • It is very challenging since numerous images contain homogeneous regions, low-contrast focal regions and background clutter.

4.2.3. Simulated Image Dataset (2000 Images)

  • A simulated image dataset is also used.
  • 2000 clear images from the Berkeley segmentation dataset [1] and uncompressed colour image dataset [19] are collected.
  • A Gaussian filter for each image is used to smooth half of the image as the out-of-focus blur region, and the remaining half as the in-focus region.
  • Then, four blurred versions can be obtained by smoothing regions of different positions (up, down, left and right) for each image.
  • For each blurred version, a Gaussian filter with a standard deviation of 2 and a window of 7 × 7 is used to repeatedly blur the image five times.
  • Thus, 20 simulated images (four blurred versions × five blurring levels) are obtained for each clear image, i.e., 40,000 images in total. With the data augmentation above, the pre-training dataset contains 640K images.
  • The model is first pre-trained on the simulated dataset, then fine-tuned on Shi's training set.
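The simulated-dataset counts can be verified with the same kind of arithmetic. Note the ×16 augmentation factor is an assumption (the same 8 rotations × 2 flips as for Shi's dataset); it is consistent with the stated 640K total.

```python
def simulated_dataset_size(n_clear=2000, n_positions=4, n_levels=5,
                           aug_factor=16):
    """Simulated images: four blurred versions (up/down/left/right)
    times five Gaussian-blurring levels per clear image, then a x16
    augmentation factor (assumed: 8 rotations x 2 flips)."""
    base = n_clear * n_positions * n_levels      # 2000 * 4 * 5
    return base, base * aug_factor               # with augmentation
```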

5. Experimental Results

5.1. Qualitative Evaluation

Visual comparison
  • The proposed method performs well in various challenging cases (e.g., homogeneous regions, low-contrast in-focus regions, and cluttered background), yielding DBD maps closest to the ground truth maps.

5.2. Quantitative Evaluation

PR Curves
F-Measure
  • PR curves and F-measure values are shown as above. The proposed method achieves the top performance over both datasets and all evaluation metrics.
Quantitative comparison of F-measure and MAE scores
  • Especially for the F-measure metric, our method improves the second best one (LBP [31]) by 10.2% and 5.8% over the Shi’s dataset and our dataset, respectively.
  • Of course, BTBNet outperforms Park CVPR’17 (DHCF).

5.3. Ablation Studies

Ablation analysis using F-measure and MAE values.
  • These models are as follows: one-stream BTBNet with input image scale s1 = {1}, BTBNet(1S); two-stream BTBNet with input image scales s2 = {1, 0.8}, BTBNet(2S); three-stream BTBNet with input image scales s3 = {1, 0.8, 0.6}, BTBNet(3S); and four-stream BTBNet with input image scales s4 = {1, 0.8, 0.6, 0.4}, BTBNet(4S).
  • The multi-stream mechanism effectively improves the detection performance and three-stream BTBNet achieves the best performance.
  • By comparing the last two rows in the table, we can see that the model with RRNet performs much better than that without RRNet on both datasets.
  • (Note that the ablation study is performed on the test set since there is no validation set; this risks overfitting the design choices to the test data.)

PhD, Researcher. I share what I've learnt and done. :) My LinkedIn: https://www.linkedin.com/in/sh-tsang/, My Paper Reading List: https://bit.ly/33TDhxG
