Review — BR²Net: Defocus Blur Detection via a Bidirectional Channel Attention Residual Refining Network (Blur Detection)

Fusing Low and High Level Features Using CAM, Outperforms DHDE and BTBNet

Some challenging cases for defocus blur detection

In this story, BR²Net: Defocus Blur Detection via a Bidirectional Channel Attention Residual Refining Network (BR²Net), is reviewed. In this paper:

  • A residual learning and refining module (RLRM) is designed to correct the prediction errors in the intermediate defocus blur map.
  • A bidirectional residual feature refining network with two branches is constructed by embedding multiple RLRMs into it, which recurrently combine and refine the residual features.
  • The outputs of the two branches are fused to obtain the final results.

This is a paper in 2020 TMM where TMM has a high impact factor of 6.051. (Sik-Ho Tsang @ Medium)

Outline

  1. BR²Net: Network Architecture
  2. Residual Learning and Refining Module (RLRM)
  3. Channel Attention Module (CAM)
  4. Defocus Map Fusion
  5. Experimental Results
  6. Ablation Study

1. BR²Net: Network Architecture

BR²Net: Network Architecture
  • The low-level features work well in refining the sparse and irregular detection regions.
  • On the other hand, the high-level semantic features work well in locating the blurry regions and suppressing background clutter.
  • The ResNeXt structure is used as the backbone feature extraction network, and the pretrained ResNeXt model is deployed for network initialization, producing five basic feature extraction layers: conv1, conv2_x, conv3_x, conv4_x, and conv5_x.
  • A bidirectional feature refining network captures the different levels of information from different layers along two directional pathways: one pathway goes from the shallow layers to the deep layers (denoted by L2H), while the other pathway goes in the opposite direction (denoted by H2L).
  • A residual learning and refining module (RLRM) is designed and multiple RLRMs are embedded into the two directional feature-refining network.
  • Suppose that the feature maps extracted from the pretrained ResNeXt model are denoted by F1, F2, F3, F4, F5 from the shallow layers to the deep layers.
  • For the L2H pathway, let OLH^1 represent the output obtained from F1. The output of the t-th recurrent step is then obtained as: OLH^t = F(Cat(OLH^(t-1), F_t)),
  • where Cat(·) denotes the cross-channel concatenation operation, and F(·) is the mapping function of the RLRM module.
  • Similarly, the output of the t-th recurrent step in the H2L pathway is: OHL^t = F(Cat(OHL^(t-1), F_(6-t))), where OHL^1 represents the output obtained from F5.
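The two recurrent refining pathways described above can be sketched as a pair of loops over the backbone features. This is a hypothetical NumPy illustration (not the authors' implementation): `rlrm` is a placeholder for the RLRM mapping F(·), and the up/downsampling needed to align feature maps of different spatial resolutions is ignored.

```python
import numpy as np

def rlrm(prev_out, feat):
    """Stand-in for the RLRM mapping F(Cat(., .)): the learned residual
    is mocked here by a fixed function of the concatenated input."""
    cat = np.concatenate([prev_out, feat], axis=0)  # cross-channel concat
    residual = cat.mean(axis=0, keepdims=True)      # placeholder for conv+ReLU
    return prev_out + residual                      # residual refinement

# Five backbone feature maps F1..F5 as (channels, H, W) arrays, toy sizes
feats = [np.random.rand(1, 8, 8) for _ in range(5)]

# L2H pathway: start from the shallowest feature, refine with deeper ones
o_lh = feats[0]
for f in feats[1:]:
    o_lh = rlrm(o_lh, f)

# H2L pathway: start from the deepest feature, refine with shallower ones
o_hl = feats[-1]
for f in reversed(feats[:-1]):
    o_hl = rlrm(o_hl, f)

print(o_lh.shape, o_hl.shape)  # (1, 8, 8) (1, 8, 8)
```

The two loops visit the same five features in opposite orders, which is what makes the refinement bidirectional.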

2. Residual Learning and Refining Module (RLRM)

The detailed structure of the proposed RLRM
  • The refined prediction of an RLRM unit is obtained by adding the learned residual map and the previous result.

Regarding the L2H pathway, the initial output contains rich low-level fine details but lacks high-level semantic information. The feature maps from the higher layers are used to refine the semantic information of the intermediate outputs in a recurrent manner.

Similarly, the initial output of the H2L pathway contains rich high-level semantic information but lacks low-level fine details. The feature maps from the lower layers are used to refine the details of the intermediate outputs.

  • Taking the L2H pathway as an example, let RLH^t represent the residual map learned at the t-th recurrent step: RLH^t = Φ(Cat(OLH^(t-1), F_t)),
  • where Φ represents a mapping function that consists of a series of convolution and ReLU operations. Then, the output of the current recurrent step can be obtained by adding RLH^t and OLH^(t-1) in an elementwise manner: OLH^t = OLH^(t-1) + RLH^t.
  • The output of each recurrent step in the H2L pathway can be obtained in a similar way.
  • In addition, the supervision signal is imposed on each RLRM to improve residual learning at each recurrent step during the training process.
  • There are at least three advantages of proposing and embedding multiple RLRMs into BR²Net.
  1. First, better prediction results are obtained than using a traditional plain network.
  2. Second, the RLRM can easily integrate deep features extracted from different layers to refine the residual learning process step by step.
  3. Third, faster convergence at the early stages can be obtained. Both the time cost and training error can be effectively reduced.
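A single RLRM recurrent step can be illustrated as follows. This is a toy NumPy sketch under stated assumptions: `phi` is only a placeholder for the conv+ReLU stack Φ, and the feature maps are assumed to share the same spatial size.

```python
import numpy as np

def phi(x):
    # Placeholder for the conv+ReLU stack that learns the residual map
    return np.maximum(x.mean(axis=0, keepdims=True) - 0.5, 0.0)

def rlrm_step(o_prev, feat):
    """One RLRM recurrent step in the L2H pathway:
    residual r = phi(Cat(O^{t-1}, F_t)), output O^t = O^{t-1} + r."""
    r = phi(np.concatenate([o_prev, feat], axis=0))
    return o_prev + r  # elementwise residual addition

o = np.zeros((1, 4, 4))          # previous intermediate prediction
o = rlrm_step(o, np.ones((1, 4, 4)))
print(o.shape)  # (1, 4, 4)
```

Because only the residual map is learned, each step corrects the previous prediction rather than regenerating it from scratch, which is the intuition behind the faster early convergence noted above.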

3. Channel Attention Module (CAM)

Channel Attention Module (CAM)
  • A channel attention module (CAM) is used to learn the weights for adaptively rescaling the channel-wise features of each feature extraction layer.
  • The channel-wise global spatial information is first converted into channel descriptors z by leveraging global average pooling.
  • To capture the nonlinear interactions and the non-mutually-exclusive relationships between different channels, a simple sigmoid-induced gating mechanism is used: w = f(W2 δ(W1 z)),
  • where f(·) is the sigmoid gating function and δ(·) is the ReLU activation.
  • Then, the final weighted channel-wise feature maps are obtained by rescaling each channel of the input features with its learned weight.

CAM can effectively learn different weights for different feature channels, which can strengthen the role of some important feature channels as well as weaken the influence of some useless channels.

(If interested, please feel free to read SENet for CAM; there are other variants of CAM, such as BAM and CBAM.)

4. Defocus Map Fusion

4.1. Final Output

  • The final defocus blur map is generated by fusing the predictions from the outputs of the two pathways (denoted by OLH and OHL).
  • Specifically, the outputs OLH and OHL are first concatenated, and then a convolution layer with a ReLU activation function is applied to the concatenated maps to obtain the final output defocus blur map B: B = ReLU(Conv(Cat(OLH, OHL))).
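The fusion step above can be sketched as a 1×1 convolution over the two concatenated pathway outputs. This is a hypothetical NumPy illustration; the fusion weights `w` are hand-picked here, whereas in BR²Net they are learned:

```python
import numpy as np

def fuse(o_lh, o_hl, w):
    """Fuse the two pathway outputs: B = ReLU(Conv1x1(Cat(O_LH, O_HL))).
    w: (1, 2) weights of a 1x1 convolution over the 2 concatenated channels."""
    cat = np.stack([o_lh, o_hl], axis=0)      # concatenation -> (2, H, W)
    b = np.tensordot(w, cat, axes=([1], [0])) # 1x1 conv -> (1, H, W)
    return np.maximum(b, 0.0)                 # ReLU

o_lh = np.random.rand(6, 6)  # L2H pathway prediction (toy data)
o_hl = np.random.rand(6, 6)  # H2L pathway prediction (toy data)
B = fuse(o_lh, o_hl, np.array([[0.5, 0.5]]))
print(B.shape)  # (1, 6, 6)
```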

4.2. Training

  • For each intermediate output, the cross-entropy loss is used for training.
  • Specifically, for the L2H pathway at the t-th recurrent step, the pixelwise cross-entropy loss LLH^t is computed between OLH^t and the ground-truth blur mask G.
  • Similarly, a loss LHL^t is computed for each recurrent step of the H2L pathway.
  • Finally, the overall loss function is the summation of the losses of all the intermediate predictions and the final loss: L = Σt LLH^t + Σt LHL^t + Lf,
  • where Lf denotes the loss of the final fusion layer. All the loss weights are simply set to 1 without further tuning.
  • The network is initialized by the ResNeXt network pretrained on ImageNet, then fine-tuned on part of Shi’s dataset (604 images).
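The loss summation above can be sketched as follows. This is a toy NumPy illustration of the training objective, assuming binary cross-entropy as the pixelwise loss and random stand-in predictions:

```python
import numpy as np

def bce(pred, gt, eps=1e-7):
    """Pixelwise binary cross-entropy between a predicted blur map and mask."""
    p = np.clip(pred, eps, 1 - eps)
    return -np.mean(gt * np.log(p) + (1 - gt) * np.log(1 - p))

# Ground-truth mask, intermediate predictions of both pathways, fused output
gt = (np.random.rand(8, 8) > 0.5).astype(float)
preds_lh = [np.random.rand(8, 8) for _ in range(4)]  # OLH^t, toy data
preds_hl = [np.random.rand(8, 8) for _ in range(4)]  # OHL^t, toy data
fused = np.random.rand(8, 8)                          # final map B, toy data

# L = sum_t LLH^t + sum_t LHL^t + Lf, with all weights set to 1
total = (sum(bce(p, gt) for p in preds_lh)
         + sum(bce(p, gt) for p in preds_hl)
         + bce(fused, gt))
print(total > 0)  # True
```

Supervising every intermediate output (deep supervision) is what forces each RLRM to learn a useful residual at its own step, rather than deferring all correction to the final layer.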

5. Experimental Results

5.1. Quantitative Comparison

The comparison of different methods in terms of the MAE, F-measure and AUC scores
  • BR²Net consistently performs favorably against the other methods, such as DHDE and BTBNet, on the evaluated datasets.
Comparison of the precision-recall curves, F-measure curves and ROC curves of the different methods on the Shi’s
Comparison of the precision-recall curves, F-measure curves and ROC curves of the different methods on the DUT
Comparison of the precision-recall curves, F-measure curves and ROC curves of the different methods on the CTCUG
  • The PR curves, F-measure curves and ROC curves of different methods on the different datasets are shown above.
  • The results again demonstrate that BR²Net consistently outperforms the other methods.

5.2. Qualitative Comparison

Visual comparison of the detected defocus blur maps generated from the different methods
  • BR²Net can obtain more accurate defocus blur maps when the input image contains in-focus smooth regions and background clutter.
  • In addition, BR²Net can preserve the boundary information of the in-focus objects well.
  • When the background is in focus and the foreground regions are blurred, BR²Net also works well.

5.3. Running Efficiency Comparison

Running Platform and Average Running Time (seconds)
  • The whole training process of BR²Net takes only approximately 0.75 hours.
  • When BR²Net is well trained, it is faster than all of the other methods.
  • For BTBNet, according to their paper, they used 5 days for training, and approximately 25 seconds is required to generate the defocus blur map for an input image with 320×320 pixels.

6. Ablation Study

Ablation Study

6.1. Effectiveness of the RLRM

  • BR²Net_no_RLRM: All of the RLRMs are removed, and the intermediate side outputs are directly refined without residual learning.
  • BR²Net with the RLRM performs significantly better than BR²Net_no_RLRM.
  • BR²Net with residual learning is superior to the case without residual learning.
The training loss of BR²Net with and without the RLRM
  • As shown above, residual learning can ease the optimization process, and faster convergence at the early stages can be achieved.

6.2. Effectiveness of the Final Defocus Blur Map Fusion Step

  • The final outputs of the two pathways are represented by OLH and OHL.
  • As shown in the above table, it can be observed that the fusing mechanism effectively improves the final results.
The intermediate outputs of the two feature-refining pathways
  • Some visual results of the outputs generated from the two pathways are shown as above.

6.3. Effectiveness of the Different Backbone Network Architectures

  • BR²Net_VGG16: VGG16 is used as the backbone instead of ResNeXt.
  • As shown in the table above, BR²Net_VGG16 also achieves an impressive performance.

6.4. Failure Cases

Failure cases generated by using the proposed method. Left: Input, Middle: GT, Right: Predicted Results
  • For the L2H pathway, fine details can be lost during feature refinement, as shown in the blue boxes above.
  • For the H2L pathway, some semantic information is also erased, as shown in the red boxes above.
  • As future work, the authors mention that they may add edge and segmentation loss functions to supervise the network training.
