[Review] BTBCRL (BTBNet + CRLNet): Defocus Blur Detection via Multi-Stream Bottom-Top-Bottom Network (Blur Detection)

Outperforms Park CVPR’17 & BTBNet (CVPR’18)

New defocus blur detection (DBD) dataset of 1100 Challenging Images

In this story, Defocus Blur Detection via Multi-Stream Bottom-Top-Bottom Network, BTBCRL (BTBNet + CRLNet), by Dalian University of Technology and the Chinese Academy of Sciences, is presented.

  • This paper is the extension of BTBNet in 2018 CVPR. Thus, mainly the major differences from that version are mentioned in this story.
  • Instead of using FRRNet, a cascaded DBD map residual learning network (CRLNet) is proposed to gradually restore finer structures from the small scale to the large scale, which can accurately distinguish homogeneous regions and suppress background clutter.
  • BTBNet combined with the new CRLNet forms the full network, called BTBCRL.

This is a paper in 2020 TPAMI, where TPAMI has a high impact factor of 17.861. (Sik-Ho Tsang @ Medium)

Outline

  1. BTBCRL (BTBNet + CRLNet): Overall Framework
  2. CRLNet: Network Architecture
  3. Experimental Results

1. BTBCRL (BTBNet + CRLNet): Overall Framework

BTBNet: Overall End-to-End Fully Convolutional Framework

1.1. BTBNet (Top)

  • The input image is resized to different scales and then fed into the BTBNet streams as shown above. This part is similar to the BTBNet in 2018 CVPR.
BTBNet, the Details of the Top Part
  • The architecture is slightly different from the BTBNet (2018 CVPR) one: there are conv_1×1 layers at Step 1 to Step 3 as well.
  • The loss is the sum of the losses at each level/step.

1.2. CRLNet (Bottom)

  • Then, the output of each BTBNet stream is fed into the CRLNet, which gradually restores finer structures from the small scale to the large scale.
  • More details are described in the next section.
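The multi-scale input preparation above can be sketched in a few lines of PyTorch. This is a minimal illustration, not the paper's code; the scale set {1, 0.8, 0.6} follows the ablation settings reported later in the review, and the streams themselves are omitted:

```python
import torch
import torch.nn.functional as F

def multi_stream_inputs(image, scales=(1.0, 0.8, 0.6)):
    """Resize the input image to each scale before feeding it to the
    per-scale BTBNet streams (scale values are illustrative)."""
    h, w = image.shape[-2:]
    return [
        F.interpolate(image, size=(round(h * s), round(w * s)),
                      mode='bilinear', align_corners=False)
        for s in scales
    ]

# Example: a 320x320 RGB image produces three resized copies.
img = torch.rand(1, 3, 320, 320)
xs = multi_stream_inputs(img)
# [(320, 320), (256, 256), (192, 192)]
sizes = [tuple(x.shape[-2:]) for x in xs]
```

Each resized copy would then go through its own BTBNet stream to produce a scale-specific DBD map.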

2. CRLNet: Network Architecture

2.1. CRLNet

CRLNet: Network Architecture
  • The FRRNet used in BTBNet (2018 CVPR) has two drawbacks:
  1. Directly upsampling reduces the resolution of DBD maps, which blurs the boundary of the transition from the focused area to the unfocused region.
  2. Processing these DBD maps simultaneously is inevitably disturbed by the background noise of each map.
  • In the proposed CRLNet, the DBD map and the image at scale n (n = 1, 2, …, N) are first concatenated into a single 4-channel feature map. Then, this map is fed to a series of Conv and ReLU layers.
  • The output DBD map and the residue of the current RLNet are integrated with the tail and middle of the next RLNet by element-wise addition operation, making the next step learn the residual.
  • Interpolation is used for resolution matching.
  • For step 1, the RLNet refines the full DBD map.
  • Then, the other steps learn the residual map with shared parameters.
  • Cascaded RLNets reconstruct the output to the original resolution from the small scale to the large scale, step by step. The nth output refinement map M̃^n_final can be obtained as follows:
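The equation itself is not reproduced in this review. From the description (element-wise addition of the interpolated previous output with the residue predicted by the current RLNet), a plausible form is the following, where the notation is assumed: I(·) denotes the interpolation for resolution matching and R^n the residue predicted by the nth RLNet:

```latex
\tilde{M}^{n}_{final} = \mathcal{I}\!\left(\tilde{M}^{\,n-1}_{final}\right) + R^{n},
\qquad n = 2, \dots, N,
```

with the first step refining the full DBD map directly rather than a residual.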
  • The CRLNet model can not only produce a cleaner background but also better preserve the boundaries of the transition region with pixel-wise accuracy.
  • The pixel-wise loss function between the network output S^d and the ground truth G^d is defined as follows:
  • where S^d_{i,j} and G^d_{i,j} indicate the dth network output and the ground truth at pixel (i, j), respectively.
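The loss equation is not reproduced in this review. A plausible per-output form, assuming a squared-error loss over the W×H pixels (the paper may use a cross-entropy variant instead):

```latex
\mathcal{L}\left(S^{d}, G^{d}\right)
  = \sum_{i=1}^{H} \sum_{j=1}^{W} \left( S^{d}_{i,j} - G^{d}_{i,j} \right)^{2}
```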
  • To boost the performance of the model, the losses are applied at the DBD map of each step in BTBNet and the output of each stream BTBNet and each step residual learning. Thus, the final loss function can be written as follows:
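The final-loss equation is likewise not reproduced. Since the per-output losses are simply applied at every supervised output (each BTBNet step map, each stream output, and each residual-learning step), a plausible aggregate form, with d indexing the set D of all supervised outputs, is:

```latex
\mathcal{L}_{final} = \sum_{d \in \mathcal{D}} \mathcal{L}\left(S^{d}, G^{d}\right)
```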
  • where N is the number of streams for BTBNet and J is the number of steps for residual learning. In this paper, N = J.
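The RLNet step and the small-to-large cascade described above can be sketched as follows in PyTorch. This is a simplified stand-in, not the paper's architecture: the layer count and width are illustrative, and the exact points where the previous output is injected (middle and tail of the next RLNet) are collapsed into a single addition:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RLNet(nn.Module):
    """One residual-learning step (simplified stand-in for the paper's
    RLNet). The source image and the current DBD map are concatenated
    into a 4-channel input and passed through Conv+ReLU layers to
    predict a residue, which is added back to the map at the tail."""

    def __init__(self, width=16):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(4, width, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(width, width, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(width, 1, 3, padding=1),
        )

    def forward(self, image, dbd_map):
        residue = self.body(torch.cat([image, dbd_map], dim=1))
        return dbd_map + residue


def crlnet_refine(images, stream_maps, rlnet):
    """Cascade a shared RLNet from the smallest to the largest scale.
    `images` and `stream_maps` are lists ordered small -> large; bilinear
    interpolation matches resolutions between steps, and element-wise
    addition of the previous output makes each later step learn only a
    residual."""
    refined = None
    for img, m in zip(images, stream_maps):
        if refined is not None:
            m = m + F.interpolate(refined, size=m.shape[-2:],
                                  mode='bilinear', align_corners=False)
        refined = rlnet(img, m)
    return refined
```

For example, with per-scale images and stream outputs of sizes 192, 256, and 320, `crlnet_refine` returns a single refined DBD map at the largest (320×320) resolution. Sharing one `RLNet` instance across steps mirrors the shared-parameter setting that the ablation below finds preferable.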

2.2. CRLNet Variants: MSCRLNet

MSCRLNet
  • In MSCRLNet, the guiding source image is removed, and only the multi-scale DBD maps are used as input for residual learning, yielding a multi-scale cascaded residual learning network (MSCRLNet).

Without the source image guide, MSCRLNet produces a noisy foreground.

2.3. CRLNet Variants: SCCRLNet

SCCRLNet
  • In SCCRLNet, only the single scale s = 1 is considered and the interpolation operations are removed, yielding a single-scale cascaded DBD map residual learning network.

Single-scale cascaded DBD map residual learning also results in noise and the loss of foreground information.

  • CRLNet (at Section 2.1) achieves the best DBD results by jointly considering the source image guide and multi-scale cascaded residual learning.

3. Experimental Results

3.1. Datasets

  • Shi’s Dataset: The dataset used in Park CVPR’17, which consists of 704 partially defocus-blurred images with manually annotated ground truths.
Sampled images with labeled ground truths in the New defocus blur detection (DBD) dataset
  • New defocus blur detection (DBD) dataset: consists of 1100 images with pixel-wise annotations; 600 challenging images with pixel-level annotations are added compared with BTBNet CVPR’18. Three volunteers annotate the images, and their results are averaged to obtain the final masks.
Representative images for generating simulated defocus images.
Simulated defocus images with manually defined ground truths
  • Simulated image dataset: consisting of 40K images with pixel-wise annotations for pre-training. (Same as BTBNet CVPR’18).

3.2. Qualitative Evaluation

Visual Comparison of DBD Maps
  • BTBFRR is the BTBNet in 2018 CVPR.
  • The proposed methods, (g) and (h), perform well in various challenging cases (e.g., homogeneous regions, low-contrast in-focus regions, and cluttered background), yielding DBD maps closest to the ground truth maps.

3.3. Quantitative Evaluation

Comparison of precision-recall curves
Comparison of precision, recall and F-measure
  • The proposed methods achieve the top performance over both datasets and all evaluation metrics, outperforming Park CVPR’17 (DHCF) and the BTBNet in 2018 CVPR (BTBFRR).
Quantitative Comparison of F-Measure and MAE Scores
  • The BTBCRL model lowers the MAE achieved by the best-performing previous version (BTBFRR, the BTBNet in 2018 CVPR) by 30.5 and 40.6 percent over Shi’s dataset and the new dataset, respectively.
  • Moreover, BTBCRL improves the F-measure by 2.5 and 8.7 percent on the two datasets, respectively.
  • CRLNet gradually locates the boundary of the focused area and unfocused region from the small scale to the large scale, which overcomes FRRNet’s shortcomings of directly upsampling and simultaneously processing DBD maps.
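For reference, the two evaluation metrics reported above can be sketched as follows. This assumes the β² = 0.3 weighting and simple 0.5 binarization commonly used in saliency/DBD evaluation; the paper's exact thresholding protocol may differ:

```python
import numpy as np

def f_measure(pred, gt, beta2=0.3, thresh=0.5):
    """F-measure of a binarized DBD map against a binary ground truth.
    beta2 = 0.3 emphasizes precision, as is common in this literature."""
    p = pred >= thresh
    g = gt >= 0.5
    tp = np.logical_and(p, g).sum()
    precision = tp / max(p.sum(), 1)
    recall = tp / max(g.sum(), 1)
    denom = beta2 * precision + recall
    return (1 + beta2) * precision * recall / denom if denom > 0 else 0.0

def mae(pred, gt):
    """Mean absolute error between the continuous DBD map and ground truth."""
    return np.abs(pred - gt).mean()
```

A perfect prediction gives an F-measure of 1.0 and an MAE of 0.0; the percentage improvements quoted above are relative changes in these scores.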

3.4. Ablation Studies Using Different Network Architectures

Effectiveness Analysis of BTBNet
  • VGGNet(FC): The top three fully connected layers of VGG16 are removed.
  • BTBNet(1S): VGGNet(FC) with one-stream BTBNet.
  • Using MAE, BTBNet(1S) lowers the score of the VGGNet(FC) method by 33.1 and 9.05 percent over Shi’s dataset and the new dataset, respectively.
  • Moreover, BTBNet(1S) improves the F-measure scores on both datasets.
Effectiveness Analysis of CRLNet
  • MSJF: A non-deep-learning method.
  • FRRNet: The one in BTBNet (2018 CVPR).
  • ARLNet: Directly averaging the output of RLNet at each step.
  • CRLNet (unshare): CRLNet with unshared parameters.
  • These networks use the scales s³ = {1, 0.8, 0.6}. The training strategy fixes BTBNet while training CRLNet.
  • CRLNet (share) achieves competitive or higher performance than CRLNet (unshare). In particular, CRLNet (share) reduces the MAE by 11.1 and 9.5 percent over Shi’s and our datasets, respectively, because CRLNet (unshare) has more parameters, which results in convergence difficulty.
Visual comparison of multi-step cascaded DBD refinement
Architecture analysis of CRLNet on the Shi’s Dataset
Architecture analysis of CRLNet on the New Dataset
  • 1-step CRLNet: scale s¹ = {0.6}.
  • 2-step CRLNet: scale s² = {0.8, 0.6}.
  • 3-step CRLNet: scale s³ = {1, 0.8, 0.6}.
  • The 3-step CRLNet exhibits the best performance.

3.5. Ablation Studies Using Different Training Strategies

  • Four methods are used to train BTBNet:
  1. directly training BTBNet with Shi’s dataset (DT-ShD)
  2. pre-training BTBNet with simulated dataset (PT-SD)
  3. fine-tuning BTBNet with Shi’s dataset (FT-(SD+ShD)), and
  4. fine-tuning BTBNet with Shi’s and our datasets (FT-(SD+ShD+OD)).
  • The PT-SD scores are inferior to the DT-ShD scores.
  • However, after fine-tuning, the pre-training mechanism-based FT-(SD+ShD) improves the F-measure achieved by DT-ShD.
  • Furthermore, adding the per-pixel annotated dataset to fine-tune BTBNet improves the performance of our model. FT-(SD+ShD+OD) improves FT-(SD+ShD).
  • The PT-SDUS scores are worse than those of PT-SD.
  • PT-SDUS-based fine-tuning BTBNet with Shi’s and the new datasets (FT-(SDUS+ShD+OD)) achieves 0.827 and 0.768 F-measure values and 0.147 and 0.204 MAE scores on the aforementioned datasets, respectively. In comparison, FT-(SD+ShD+OD) achieves better performance.
Visual comparison of different methods for training BTBNet. (a)-(f) are source image, PT-SD, DT-ShD, FT-(SD+ShD), FT-(SD+ShD+OD) and ground truth, respectively.
  • (b): Pre-training BTBNet with the simulated dataset enables learning general features (e.g., scene textures) to distinguish in-focus and out-of-focus regions.
  • (d): The pre-training mechanism-based FT-(SD+ShD) produces better results (e.g., sharp DBD boundaries) than those of the direct training method DT-ShD (c).
  • (e): Finally, the results achieved by adding our per-pixel annotated dataset to fine-tune the network highlight the focused area and locate the boundary effectively.
