Review — DMENet: Deep Defocus Map Estimation using Domain Adaptation (Blur Detection)

Domain Adaptation with a Synthetically Blurred Dataset, Outperforming Park CVPR’17

Jan 9, 2021

In this story, Deep Defocus Map Estimation using Domain Adaptation (DMENet), by POSTECH, Sungkyunkwan University, and DGIST, is reviewed. In this paper:

A novel depth-of-field (DOF) dataset, SYNDOF, is produced where each image is synthetically blurred with a ground-truth depth map.
As the feature characteristics of images in SYNDOF can differ from those of real defocused photos, domain adaptation is used to transfer the features of real defocused photos into those of synthetically blurred ones.
DMENet consists of four subnetworks: blur estimation, domain adaptation, content preservation, and sharpness calibration networks. The subnetworks are connected to each other and jointly trained.

This is a 2019 CVPR paper with 8 citations so far. (Sik-Ho Tsang @ Medium)

Outline

The SYNDOF Dataset
DMENet: Network Architecture (B+D+C+S)
Blur Estimation (B)
Domain Adaptation (D)
Content Preservation (C)
Sharpness Calibration (S)
Experimental Results

1. The SYNDOF Dataset

1.1. Data Collection

**Collection summary of our SYNDOF dataset.**

Images are from the MPI Sintel Flow (MPI) [35], SYNTHIA [27], and Middlebury Stereo 2014 (Middlebury) [28] datasets.
The MPI dataset is a collection of game scene renderings, the SYNTHIA dataset contains synthetic road views, and the Middlebury dataset consists of real indoor scene images with accurate depth measurements.
By keeping only dissimilar images, 2,006 distinct sample images remain.
An image from the sample set is randomly selected to generate a defocused image with random sampling of camera parameters and the focal distance.
The total number of defocused images we generated is 8,231.

1.2. Thin Lens Model

The thin-lens model [25], a standard for defocus blur in computer graphics, is used.
Let the focal length be F (mm), the object-space focal distance S1 (mm), and the f-number N.
The image-space focal distance is f1=F×S1/(S1-F).
The aperture diameter is D=F/N.
A defocus map contains the amount of defocus blur or the size of circle of confusion (COC) per pixel for a defocus-blurred (in short, defocused) image.
A pixel at object boundary with depth discontinuity contains a mixture of different COCs in a defocused image.
Then, the image-space COC diameter c(x) of a 3D point located at the object distance x is defined as c(x) = α×|x−S1|/x, where α = (f1/S1)×D.
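As a rough numerical sketch of the thin-lens quantities above, assuming the usual form c(x) = α×|x−S1|/x with α = (f1/S1)×D. The function names and the example camera values are illustrative only and do not come from the paper:

```python
def image_space_focal_distance(F, S1):
    """f1 = F * S1 / (S1 - F); all lengths in mm."""
    if S1 <= F:
        raise ValueError("focal distance S1 must exceed focal length F")
    return F * S1 / (S1 - F)


def coc_diameter(x, F, S1, N):
    """Image-space circle-of-confusion diameter for an object at distance x (mm).

    Assumed thin-lens form: c(x) = alpha * |x - S1| / x, alpha = (f1 / S1) * D.
    """
    D = F / N                                  # aperture diameter
    f1 = image_space_focal_distance(F, S1)     # image-space focal distance
    alpha = (f1 / S1) * D                      # abstracts the camera parameters
    return alpha * abs(x - S1) / x


# Hypothetical camera: 50 mm lens at f/2, focused at 2 m.
in_focus = coc_diameter(2000.0, F=50.0, S1=2000.0, N=2.0)
behind = coc_diameter(4000.0, F=50.0, S1=2000.0, N=2.0)
in_front = coc_diameter(1000.0, F=50.0, S1=2000.0, N=2.0)
```

A point at the focal distance has zero COC, and at the same absolute offset from S1 a nearer point is blurred more than a farther one because of the 1/x factor.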

1.3. Defocused Image Generation

The minimum and maximum depth bounds, xnear and xfar, are extracted from the depth map. Then, S1 is randomly sampled from the range [xnear, xfar].
When computing c(x) using the above equation, only α is needed, since it abstracts the physical camera parameters. In practice, x is not near zero.
The COC size is limited to at most cmax. Thereby, the upper bound of α, denoted by αup, is

α is randomly sampled within [0, αup]. Gaussian blur is applied to the image with kernel standard deviation σ, where σ(x) = c(x)/4 empirically.
To blur an image based on the computed COC sizes, the image is first decomposed into discrete layers, where the maximum number of layers is limited to 350.
Gaussian blur with σ(x) is applied to each layer, blurring both the image and the mask of the layer.
Then, the blurred layer images are alpha-blended in back-to-front order, using the blurred masks as alpha values.
Labels are then generated.
This SYNDOF dataset enables a network to accurately estimate a defocus map.
The defocus map is densely (per-pixel and not binary) labeled.
The dense labels respect the scene structure, including object boundaries and depth discontinuities, and resolve ambiguities in homogeneous regions.
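The layered rendering described in this section can be sketched on a 1D signal as follows. This is a toy illustration under several assumptions that are not from the paper: a 1D "image", uniform depth bins, the mid-depth of each bin used to pick σ, edge replication at the borders, and a small default layer count (the paper allows up to 350 layers). It assumes the thin-lens form c(x) = α×|x−S1|/x and the σ(x) = c(x)/4 rule quoted above.

```python
import math


def gaussian_kernel(sigma):
    """Normalized 1D Gaussian weights; a single 1.0 when sigma is ~0 (in focus)."""
    if sigma < 1e-6:
        return [1.0]
    radius = max(1, math.ceil(3 * sigma))
    weights = [math.exp(-(i * i) / (2.0 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def blur(signal, sigma):
    """Gaussian-blur a 1D list, replicating edge samples at the borders."""
    kernel = gaussian_kernel(sigma)
    radius = len(kernel) // 2
    last = len(signal) - 1
    return [
        sum(w * signal[min(max(i + k - radius, 0), last)]
            for k, w in enumerate(kernel))
        for i in range(len(signal))
    ]


def render_defocus(image, depth, S1, alpha, num_layers=8):
    """Blur each depth layer and its mask, then alpha-blend back to front."""
    x_near, x_far = min(depth), max(depth)
    step = (x_far - x_near) / num_layers or 1.0   # avoid a zero-width bin
    layer_of = [min(int((d - x_near) / step), num_layers - 1) for d in depth]
    out = [0.0] * len(image)
    for layer in range(num_layers - 1, -1, -1):   # farthest layer first
        mask = [1.0 if idx == layer else 0.0 for idx in layer_of]
        if not any(mask):
            continue
        x_mid = x_near + (layer + 0.5) * step
        sigma = alpha * abs(x_mid - S1) / x_mid / 4.0   # sigma = c(x) / 4
        color = blur([p * m for p, m in zip(image, mask)], sigma)
        a = blur(mask, sigma)                      # blurred mask acts as alpha
        out = [c + (1.0 - ai) * o for c, ai, o in zip(color, a, out)]
    return out
```

With α = 0 every layer has σ = 0 and the input is returned unchanged; with α > 0 the blurred masks let nearer layers bleed over farther ones at depth discontinuities, which is the mixture of COCs mentioned earlier.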

2. DMENet: Network Architecture (B+D+C+S)

2.1. Overview

The network consists of four subnetworks: blur estimation (B), domain adaptation (D), content preservation (C), and sharpness calibration networks (S).
The blur estimation network B is the main component of DMENet and is supervised with ground-truth synthetic defocus maps from the SYNDOF dataset to predict blur amounts.
To enable network B to measure the blur amounts on real defocused images, the domain adaptation network D is attached to it, which minimizes domain differences between synthetic and real features.
The content preservation network C supplements network B to avoid a blurry output.
The sharpness calibration network S allows real domain features to induce correct sharpness in a defocus map by informing network B whether the given real domain feature corresponds to a sharp or blurred pixel.

2.2. Training

Networks B, D, and S, parametrized by θB, θD, and θS, are jointly trained.
θB and θS are trained with a loss Lg, and θD is trained with a loss Ld; the two are optimized alternately.
Three kinds of images are used for training:
IS: synthetic defocused images with ground-truth defocus maps.
IR: real defocused images with no labels.
IB: real defocused images with ground-truth binary blur maps.
Following the common practice of adversarial training, the loss Lg is:

where LB, LC, LS, and Ladv are blur map loss, content preservation loss, sharpness calibration loss, and adversarial loss, respectively.
The loss Ld is:

where LD is discriminator loss.
During training, the networks D, C, and S differently affect B depending on the domain of the input.
In the case of synthetically blurred images IS with ground-truth defocus maps y, the difference between y and the predicted defocus map B(IS) is minimized using the blur map loss LB, which measures the mean squared error (MSE).
The content preservation loss LC is also minimized, using the network C, to reduce blurriness in the prediction B(IS).
Real defocused images IB with binary blur maps are used to calibrate the sharpness measurement from domain-transferred features; they guide the network S by minimizing the sharpness calibration loss LS.
To minimize domain difference between features extracted from synthetic and real defocused images, the discriminator loss LD and the adversarial loss Ladv are minimized in an adversarial way, in which network D is trained to correctly classify the domains of features from different inputs, while training network B to confuse D.
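Based on the loss names above and the weights λadv, λD, λC, and λS listed in the experimental section, the two objectives plausibly take the following weighted-sum form. This is a reconstruction under that assumption (the arguments of each term and any averaging over the batch are omitted), not a copy of the paper's equations:

```latex
\begin{aligned}
L_g &= L_B + \lambda_C L_C + \lambda_S L_S + \lambda_{adv} L_{adv}, \\
L_d &= \lambda_D L_D .
\end{aligned}
```

Lg updates θB and θS while θD is held fixed, and Ld updates θD while θB is held fixed, which is the alternating scheme described above.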

3. Blur Estimation (B)

The blur estimation network B is the core module in our DMENet.
An FCN based on U-Net is used, with a pretrained VGG19 as the encoder.
Scale-wise auxiliary loss at each up-sampling layer is used to guide multi-scale prediction of a defocus map.
This structure induces our network not only to be robust on various object scales, but also to consider global and local contexts with large receptive fields.
After the last up-sampling layer of the decoder, convolution blocks with short skip connections are attached to refine domain adapted features.
MSE is used for LB. Given a synthetic defocused image IS:

where the scale-wise auxiliary loss is:

where each auxiliary network A consists of two convolutional layers to produce the defocus map at each up-sampling level of network B.
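A minimal sketch of how an MSE blur map loss with scale-wise auxiliary terms could be computed, assuming LB = MSE at full resolution plus λaux times the sum of MSEs at coarser scales. The 2×2 average pooling used to bring the ground truth to each scale, and all function names, are assumptions for illustration:

```python
def mse(pred, target):
    """Mean squared error between two 2D lists of equal shape."""
    flat_p = [v for row in pred for v in row]
    flat_t = [v for row in target for v in row]
    if len(flat_p) != len(flat_t):
        raise ValueError("prediction and target must have the same size")
    return sum((p - t) ** 2 for p, t in zip(flat_p, flat_t)) / len(flat_p)


def downsample2x(grid):
    """2x2 average pooling; assumes even height and width."""
    h, w = len(grid), len(grid[0])
    return [
        [(grid[i][j] + grid[i][j + 1] + grid[i + 1][j] + grid[i + 1][j + 1]) / 4.0
         for j in range(0, w, 2)]
        for i in range(0, h, 2)
    ]


def blur_map_loss(final_pred, aux_preds, target, lambda_aux=1.0):
    """Full-resolution MSE plus weighted MSEs of the auxiliary outputs.

    aux_preds is ordered from the finest auxiliary scale (half resolution)
    to the coarsest; the target is pooled once per scale to match.
    """
    loss = mse(final_pred, target)
    scaled_target = target
    aux_total = 0.0
    for aux_pred in aux_preds:
        scaled_target = downsample2x(scaled_target)
        aux_total += mse(aux_pred, scaled_target)
    return loss + lambda_aux * aux_total
```

In the actual network each auxiliary prediction would come from the two-layer auxiliary network A at the corresponding up-sampling level; here they are simply passed in as lists.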

4. Domain Adaptation (D)

4.1. Discriminator Loss

The domain adaptation network D compares the features of real and synthetic defocused images captured by the blur estimation network B.
In principle, D is a discriminator in the GAN. It makes the characteristics of the captured features of real and synthetic defocused images indistinguishable.
D is a CNN with four convolution layers, each followed by BN and Leaky ReLU.
D is a discriminator to classify features from synthetic and real domains and trained with the discriminator loss LD:

where z is a label indicating whether the feature comes from a real or a synthetic defocused image.
z = 0 if the feature is real and z = 1 otherwise.

4.2. Adversarial Loss

B is trained so that it treats real and synthetic defocused images as if they were from the same domain.

As the domain adaptation network D becomes a stronger domain classifier, network B has to generate more indistinguishable features for the real and synthetic domains, which minimizes the domain difference between features of synthetic and real defocused images.
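The interplay between the two losses can be sketched with plain binary cross-entropy, assuming the usual GAN formulation. Here D is assumed to output the probability that a feature is synthetic (matching z = 1 for synthetic and z = 0 for real), and the adversarial loss is assumed to flip the label on real features only; the paper's exact expressions may differ:

```python
import math


def bce(prob, label, eps=1e-12):
    """Binary cross-entropy for one probability and a 0/1 label."""
    prob = min(max(prob, eps), 1.0 - eps)   # guard against log(0)
    return -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))


def discriminator_loss(d_real, d_synth):
    """D should output 0 on real features (z = 0) and 1 on synthetic ones (z = 1)."""
    losses = [bce(p, 0) for p in d_real] + [bce(p, 1) for p in d_synth]
    return sum(losses) / len(losses)


def adversarial_loss(d_real):
    """B is rewarded when D mistakes real features for synthetic ones."""
    return sum(bce(p, 1) for p in d_real) / len(d_real)
```

When D separates the domains well, `discriminator_loss` is small and `adversarial_loss` is large, so B receives a strong signal to make real features look like synthetic ones.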

5. Content Preservation (C)

**The Content Preservation Network (C)**

The blur estimation loss LB is an MSE loss and tends to produce blurry outputs.
To counter this, a content preservation loss that measures the distance in a feature space φ is used.
The content preservation network C is the pre-trained VGG19. During training, network B is optimized to minimize:

where φl(·) at the last convolution layer in the l-th max pooling block of VGG19.
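A feature-space distance of this kind is commonly written as a squared L2 difference between VGG19 feature maps of the prediction and the ground truth. The form below is an assumed sketch; the normalization by the feature map size, and the symbols W_l, H_l, and C_l for the width, height, and channel count at level l, are introduced here for illustration:

```latex
L_C = \frac{1}{W_l H_l C_l}
      \left\| \phi_l\big(B(I_S)\big) - \phi_l(y) \right\|_2^2
```

Because φl responds to edges and textures rather than per-pixel values, minimizing this term penalizes predictions that are numerically close to y but overly smooth.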

6. Sharpness Calibration (S)

D concentrates on modulating the overall distributions, and it does not specifically align the amounts of blurs corresponding to the features between the two domains.

In other words, the blur amounts learned by our blur estimation network B for synthetic defocused images cannot be readily applied to real defocused images, and we need to calibrate the estimated blur amounts for the two domains.

For a given real defocused image from the dataset, network S is trained to classify the output of network B in terms of the correctness of estimated blurriness.
The prediction is considered to be correct, only when a pixel estimated as sharp belongs to a sharp region in the input image.
The network S consists of 1×1 convolutional layers with BN and Leaky ReLU.
Sigmoid cross entropy loss is used:

1×1 conv is used in order to not affect the receptive field of network B. A larger kernel for S eventually leads B to generate a smudged defocus map.
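A sigmoid cross-entropy loss over per-pixel logits can be sketched as below, using the standard numerically stable formulation. The label convention (1 for a blurred pixel, 0 for a sharp one) and the function names are assumptions; the paper defines correctness in terms of pixels estimated as sharp falling inside sharp regions:

```python
import math


def sigmoid_cross_entropy(logit, label):
    """Stable form of -[z*log(s(x)) + (1-z)*log(1-s(x))] with s = sigmoid."""
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))


def sharpness_calibration_loss(logits, binary_blur_map):
    """Mean sigmoid cross-entropy between S's per-pixel logits and the binary map."""
    total, count = 0.0, 0
    for logit_row, label_row in zip(logits, binary_blur_map):
        for logit, label in zip(logit_row, label_row):
            total += sigmoid_cross_entropy(logit, label)
            count += 1
    return total / count
```

Since S only uses 1×1 convolutions, each logit depends on a single spatial location of B's output, so this loss calibrates per-pixel sharpness without widening the receptive field.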

7. Experimental Results

λadv = 1e−3, λD = 1.0, λC = 1e−4, λS = 2e−2, and λaux = 1.0. l=4 for φ.
For synthetic defocused images IS, the SYNDOF dataset is used, with cmax=28.
For real defocused images IR for domain adaptation, 2,200 real defocused images collected from Flickr and 504 images from the CUHK blur detection dataset are used.
For sharpness calibration, the same 504 images from the CUHK dataset are used as the real defocused images IB, which require binary blur maps.
For evaluation, 200 images of the CUHK dataset and 22 images of the RTF dataset are used.

7.1. Evaluation on Subnetworks

**Outputs generated with incremental additions of subnetworks in our network**

(b): For a real defocused image, the sole use of subnetwork B for blur estimation fails, confirming there is significant domain difference between features of synthetic and real defocused images.
(c): With our domain adaptation, DMENetBD starts to recognize the degree of blur for a real image to some extent, yet with blurry output.
(d): Adding content preservation subnetwork (DMENetBDC) effectively removes blur artifacts.
(e): DMENetBDCS without the auxiliary module generates a less clear and inaccurate defocus map.
(f): Finally, with the sharpness calibration subnetwork S, DMENetBDCS correctly classifies real-domain features corresponding to blurry or sharp regions.

7.2. Evaluation on CUHK and RTF Datasets

DMENet significantly outperforms the previous methods in accuracy.

**Precision-Recall comparison on CUHK dataset**

Precision-recall curves also show superiority of DMENet in detecting blurred regions, with different levels of τ, the threshold for binarization.
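One point on such a curve can be computed by binarizing the estimated defocus map at a threshold τ and comparing it with the binary ground truth. The sketch below assumes that values above τ count as blurred, that ground-truth label 1 means blurred, and that precision and recall default to 1.0 when their denominator is zero; these conventions are illustrative and not taken from the paper:

```python
def precision_recall(defocus_map, gt_blur, tau):
    """Precision and recall of the 'blurred' class after thresholding at tau."""
    tp = fp = fn = 0
    for pred_row, gt_row in zip(defocus_map, gt_blur):
        for value, gt in zip(pred_row, gt_row):
            predicted_blurred = value > tau
            if predicted_blurred and gt == 1:
                tp += 1
            elif predicted_blurred and gt == 0:
                fp += 1
            elif not predicted_blurred and gt == 1:
                fn += 1
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 1.0
    return precision, recall


# Tiny hypothetical example: sweep tau to trace the curve.
defocus = [[0.1, 0.4], [0.6, 0.9]]
truth = [[0, 0], [1, 1]]
curve = [precision_recall(defocus, truth, tau) for tau in (0.3, 0.5, 0.7)]
```

Lowering τ marks more pixels as blurred, which raises recall at the cost of precision; raising τ does the opposite.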

**(a) Input and the defocus maps estimated by (b) Zhou et al. [40], (c) Shi et al. [30], (d) Park et al. [24], (e) Karaali et al. [13], (f) ours, and (g) ground-truth**

Defocus maps by DMENet show a more continuous spectrum compared to others.
In the first row, the DMENet result exhibits less noise and smoother transitions with depth changes, and estimates more accurate blur for objects (e.g., human, sky).
In the second row, DMENet shows coherently labeled blur amounts while clearly respecting object boundaries.
In the third row, DMENet estimates consistent blur amounts both for the box surface and the symbol.
Lastly, DMENet is more robust in homogeneous regions. In the second and fourth rows, the DMENet results show a little smudginess around some objects, but they are still accurate in terms of relative depths.

**Left: Input, Middle: defocus map estimated by [38] and Right: DMENet**

The implementation of [38] has not been released yet, but the above figure shows that DMENet can handle a wider depth range of a scene.

DMENet shows the state-of-the-art accuracy on the dataset.

7.3. Applications

**Depth from our defocus map estimated by DMENet**

There are potential applications for blur detection, as shown above.

Though domain adaptation is used, some blur detection results are still inaccurate, as shown in a later blur detection paper whose model is named DeFusionNet. Hope I can write the story about it later as well.

Reference

[2019 CVPR] [DMENet]
Deep Defocus Map Estimation using Domain Adaptation

Blur Detection / Defocus Map Estimation

2017 [Park CVPR’17 / DHCF / DHDE] 2018 [Purohit ICIP’18] [BDNet] [DBM] [BTBNet] 2019 [Khajuria ICIIP’19] [Zeng TIP’19] [PM-Net] [CENet] [DMENet] 2020 [BTBCRL (BTBNet + CRLNet)]

Generative Adversarial Network (GAN)

Image Synthesis [GAN] [CGAN] [LAPGAN] [DCGAN] [Pix2Pix]
Super Resolution [SRGAN & SRResNet] [EnhanceNet] [ESRGAN]
Blur Detection [DMENet]
Video Coding [VC-LAPGAN]