Review — Kim JCGF’18: Defocus and Motion Blur Detection with Deep Contextual Features (Blur Detection)

U-Net-Like Architecture. Outperforms Park CVPR’17 and DBM

Jan 10, 2021

In this story, Defocus and Motion Blur Detection with Deep Contextual Features (Kim JCGF’18), by POSTECH and DGIST, is reviewed. In this paper:

A deep encoder-decoder U-Net-like network is proposed, with long residual skip-connections and a multi-scale reconstruction loss, to exploit high-level contextual features as well as low-level structural features.
A synthetic dataset is constructed that consists of complex scenes with both motion and defocus blur.

This is a paper in 2018 JCGF, Computer Graphics Forum. (Sik-Ho Tsang @ Medium)

Training with only the CUHK dataset makes a network less effective at distinguishing the different types of blur in an image (motion and defocus blur).
A new synthetic dataset is constructed whose images have both types of blur together with pixel-wise annotations.
The synthetic dataset is generated using the CUHK dataset and a salient object detection dataset.
The salient object detection dataset consists of 4,000 images and their corresponding binary masks indicating salient objects.
The above algorithm summarizes our process of generating the synthetic dataset.
Object motion blur is usually modeled by locally linear blur kernels as it is usually caused by fast moving objects like cars.

Finally, 8,460 images are obtained that have both defocus and motion blur in our synthetic dataset.
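
As a rough illustration of the compositing idea, the sketch below builds a linear motion kernel, blurs a salient object together with its binary mask, and pastes the result onto a background image. The function names (`linear_motion_kernel`, `filter2d`, `composite`), the kernel length and angle, and the alpha-blending step are all assumptions made for this example; the paper's exact generation procedure is not reproduced in this review.

```python
import numpy as np

def linear_motion_kernel(length, angle_deg):
    """Normalized linear blur kernel of odd size `length` (illustrative)."""
    k = np.zeros((length, length), dtype=np.float64)
    c = length // 2
    theta = np.deg2rad(angle_deg)
    # Rasterize a line segment passing through the kernel center.
    for t in np.linspace(-c, c, 4 * length):
        x = int(round(c + t * np.cos(theta)))
        y = int(round(c - t * np.sin(theta)))
        k[y, x] = 1.0
    return k / k.sum()

def filter2d(img, kernel):
    """Correlate a 2-D float array with an odd-sized kernel (edge padding)."""
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(img, ((ph, ph), (pw, pw)), mode="edge")
    out = np.zeros(img.shape, dtype=np.float64)
    for i in range(kh):
        for j in range(kw):
            out += kernel[i, j] * padded[i:i + img.shape[0], j:j + img.shape[1]]
    return out

def composite(background, obj, mask, kernel):
    """Paste a motion-blurred object onto a background.

    background, obj: (H, W, 3) float images; mask: (H, W) binary salient mask.
    Returns the composite image and the blurred mask (soft motion-blur region).
    Assumption: simple alpha blending with the blurred mask.
    """
    m = mask.astype(np.float64)
    alpha = filter2d(m, kernel)
    fg = np.stack(
        [filter2d(obj[..., ch] * m, kernel) for ch in range(obj.shape[-1])],
        axis=-1,
    )
    image = fg + (1.0 - alpha[..., None]) * background
    return image, alpha

# Usage with toy data: a white square moving horizontally over a gray background.
kernel = linear_motion_kernel(5, 0)
background = np.full((8, 8, 3), 0.5)
obj = np.ones((8, 8, 3))
mask = np.zeros((8, 8))
mask[2:6, 2:6] = 1
image, alpha = composite(background, obj, mask, kernel)
```

In a pipeline like the one described, the background would be a defocus-blurred CUHK image, so the blurred mask could serve as the motion-blur annotation while the remaining pixels keep their original defocus or no-blur labels.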

The network takes a 3-channel RGB image as input, and outputs a 3-channel blur map, each channel of which corresponds to motion blur, defocus blur, or no-blur.
The network is built upon an encoder-decoder framework with long skip-connections, similarly to U-Net architecture.
The encoder network consists of 4 max-pooling and 16 convolution layers, similarly to VGG-19. An ImageNet pre-trained VGG-19 is used to initialize the encoder.
The decoder network consists of 4 deconvolution and 15 convolution layers to upsample features and generate an output of the same size as the input.
Batch-normalization is not used for our network as we found that it decreased the accuracy in the experiments.
4 long symmetric skip-connections are added with a 1×1 convolution layer in the middle.
In contrast to U-Net, element-wise summation is adopted, instead of concatenation. With concatenation, the network may merely use the structural features for reconstructing the final blur map. Element-wise summation is used to forcibly combine both contextual features and structural features passed by skip-connections, strengthening the influence of contextual features to better deal with homogeneous regions.
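
The difference between the two kinds of skip-connection can be sketched with plain NumPy arrays. This is a minimal illustration, not the authors' implementation: the helper names (`conv1x1`, `skip_sum`, `skip_concat`), the `(channels, height, width)` layout, and the toy weights are assumptions chosen for the example.

```python
import numpy as np

def conv1x1(x, weight, bias):
    """1x1 convolution on a (C_in, H, W) map; weight has shape (C_out, C_in)."""
    return np.einsum("oc,chw->ohw", weight, x) + bias[:, None, None]

def skip_sum(decoder_feat, encoder_feat, weight, bias):
    """Summation skip: encoder features pass through a 1x1 convolution and are
    added to the decoder features, so the channel count stays the same."""
    return decoder_feat + conv1x1(encoder_feat, weight, bias)

def skip_concat(decoder_feat, encoder_feat):
    """U-Net-style skip: features are stacked along the channel axis."""
    return np.concatenate([decoder_feat, encoder_feat], axis=0)

# Toy feature maps with 4 channels on a 2x2 grid.
rng = np.random.default_rng(0)
decoder_feat = rng.normal(size=(4, 2, 2))   # contextual features (upsampled)
encoder_feat = rng.normal(size=(4, 2, 2))   # structural features (from encoder)
weight = rng.normal(size=(4, 4))
bias = np.zeros(4)

summed = skip_sum(decoder_feat, encoder_feat, weight, bias)   # shape (4, 2, 2)
stacked = skip_concat(decoder_feat, encoder_feat)             # shape (8, 2, 2)
```

With summation, every output channel necessarily mixes both sources. With concatenation, the following convolution is free to put most of its weight on the encoder half of the channels.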

In the multi-scale loss, b_s is a blur map with three channels at scale s, and p and c are pixel and channel indices, respectively.
The cross-entropy loss at the finest scale helps detailed reconstruction of a blur map, whereas the one at the coarsest scale helps coarse approximation.
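
A multi-scale cross-entropy of this kind might look like the sketch below. Several details are assumptions for illustration only, since the review does not state them: the scales differ by a factor of 2, the ground truth is downsampled by nearest-neighbour striding, each scale is averaged over its own pixels, and all scales are weighted equally. The function names are hypothetical.

```python
import numpy as np

def one_hot(labels, num_classes=3):
    """(H, W) integer labels -> (num_classes, H, W) one-hot map."""
    return np.eye(num_classes)[labels].transpose(2, 0, 1)

def cross_entropy(pred, target, eps=1e-12):
    """Mean over pixels p of -sum_c target[c, p] * log(pred[c, p])."""
    return float(-(target * np.log(pred + eps)).sum(axis=0).mean())

def multi_scale_loss(preds, labels):
    """Sum of per-scale cross-entropies.

    preds:  list of (3, H / 2**s, W / 2**s) probability maps, finest first.
    labels: (H, W) integer ground truth (0, 1, 2 for the three classes).
    """
    total = 0.0
    for s, pred in enumerate(preds):
        step = 2 ** s
        target = one_hot(labels[::step, ::step])  # assumed nearest-neighbour
        total += cross_entropy(pred, target)
    return total

# Usage: uniform predictions at two scales on a 4x4 label map.
labels = np.array([[0, 0, 1, 1],
                   [0, 0, 1, 1],
                   [2, 2, 1, 1],
                   [2, 2, 1, 1]])
uniform = [np.full((3, 4, 4), 1 / 3), np.full((3, 2, 2), 1 / 3)]
loss = multi_scale_loss(uniform, labels)
```

A single-scale loss, as used by the baseline in the ablation, corresponds to passing only the finest prediction in `preds`.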

The first model (baseline) is trained from scratch using a single-scale loss instead of the multi-scale loss.
The second one (multi loss) is also trained from scratch using the multi-scale loss.
The third one (multi loss + VGG) is trained from pre-trained VGG-19 using the multi-scale loss.
The fourth is the final model, which uses the multi-scale loss and pre-trained VGG-19 parameters, and is trained using both CUHK and our synthetic datasets.
The results show that each component effectively increases the performance.

As shown in the figure above, the baseline model has many errors in homogeneous regions such as the sky and the paper.
The second model, which uses the multi-scale loss, shows much less error than the baseline, although a considerable amount remains. The multi-scale loss encourages the network to use the large contextual information captured by coarse-level features more effectively.
Finally, the results of our third (multi loss + VGG) and final models have only a very small amount of error. This shows that the well-trained features of pre-trained VGG-19 are very effective for detecting and propagating blurriness.

The proposed method clearly outperforms all the others, such as Park CVPR’17 and DBM [MFL*18], by a large margin in terms of accuracy, F-measure, and mIoU.
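
For reference, the three metrics can be computed from a confusion matrix over the predicted and ground-truth label maps. The sketch below uses the standard definitions; the function names are hypothetical, and the F-measure weight `beta2` defaults to 1 as an assumption, since the value used in the paper is not given in this review.

```python
import numpy as np

def confusion_matrix(pred, gt, num_classes=3):
    """Rows index ground-truth classes, columns index predicted classes."""
    idx = gt.ravel() * num_classes + pred.ravel()
    counts = np.bincount(idx, minlength=num_classes ** 2)
    return counts.reshape(num_classes, num_classes)

def accuracy(cm):
    """Fraction of pixels labeled correctly."""
    return cm.trace() / cm.sum()

def mean_iou(cm):
    """Mean intersection-over-union over the classes that occur."""
    tp = np.diag(cm)
    union = cm.sum(axis=0) + cm.sum(axis=1) - tp
    valid = union > 0
    return (tp[valid] / union[valid]).mean()

def f_measure(cm, cls, beta2=1.0):
    """F-measure for one class; beta2 weights precision against recall."""
    tp = cm[cls, cls]
    precision = tp / max(cm[:, cls].sum(), 1)
    recall = tp / max(cm[cls, :].sum(), 1)
    denom = beta2 * precision + recall
    return 0.0 if denom == 0 else (1 + beta2) * precision * recall / denom

# Toy label maps; the class indices 0, 1, 2 are arbitrary here.
gt = np.array([[0, 0, 1], [1, 2, 2]])
pred = np.array([[0, 1, 1], [1, 2, 0]])
cm = confusion_matrix(pred, gt)
```

The label map itself would come from taking the per-pixel argmax over the three output channels of the network.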

The proposed method has only one fixed precision-recall (PR) value. It also shows a low variance in accuracy, implying high stability in handling various input images.

Previous defocus estimation methods fail in homogeneous regions such as human faces, whose edges are too weak.

The model trained using only the CUHK dataset cannot separate motion blur from the defocus-blurred background, as that dataset has no images with both types of blur together.
In contrast, the model trained using both datasets accurately detects both types of blur.