[Review] Park CVPR’17 / DHCF / DHDE: Multi-scale Deep and Hand-crafted Features for Defocus Estimation (Blur Detection)

Deep and Hand-Crafted Features Together for Blur Detection

Sik-Ho Tsang
9 min read · Dec 10, 2020

In this story, A Unified Approach of Multi-scale Deep and Hand-crafted Features for Defocus Estimation (Park CVPR’17), by KAIST and Tencent YouTu Lab, is presented. Some papers refer to this method as DHCF or DHDE. In this paper:

  • Hand-crafted features and a simple but efficient deep feature from a convolutional neural network (CNN) are extracted.
  • A neural network classifier followed by a probability-joint bilateral filter is used to generate the final defocus map.

This is a paper in 2017 CVPR with over 40 citations. (Sik-Ho Tsang @ Medium)


  1. Hand-Crafted Features
  2. Deep Feature Using CNN
  3. Defocus Feature
  4. Training
  5. Experimental Results

1. Hand-Crafted Features

  • Three hand-crafted features related to the frequency domain power distribution, the gradient distribution, and the singular values of a grayscale image patch are proposed.
  • Depending on the patch size, a sharp image patch can be mistaken for a blurry one, and vice versa. To avoid this patch-scale dependency, multi-scale patches are extracted depending on the strength of the edges.
  • In natural images, strong edges are ordinarily more likely to be in focus than blurry. Therefore, during the patch extraction step, patches on strong edges are assumed to be in focus and patches on weak edges are assumed to be blurry.

1.1. DCT Feature

Hand-Crafted Features: Average DCT, DFT, DST features from sharp (dotted) and blurry (solid) patches.
  • A grayscale image patch PI is transformed to the frequency domain to analyze its power distribution.
  • Discrete cosine transform (DCT) offers strong energy compaction.
  • When an image is more detailed, more non-zero DCT coefficients are needed to preserve the information.
  • Because an in-focus image has more high-frequency components than an out-of-focus image, the ratio of high-frequency components in an image patch can be a good measure of the blurriness of an image.
  • The DCT feature fD is constructed using the power distribution ratio of the frequency components over nD radial bands.
  • In its definition, |·|, P(ρ, θ), ρk, Sk, WD and nD denote the absolute operator, the discrete cosine transformed image patch in polar coordinates, the k-th boundary of the radial coordinate, the area enclosed by ρk and ρk+1, a normalization factor that makes the feature sum to unity, and the dimensions of the feature, respectively.

The above figure: The absolute difference between sharp and blurry features can be a measure of discriminative power. The DCT feature has the best discriminative power because its absolute difference between sharp and blurry features is greater than those of the other transformations.
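
As a rough illustration of the idea, here is a minimal sketch of a DCT feature of this kind. It assumes equally spaced radial bands, a log(1 + x) compression of the mean magnitude in each band, and a hand-written orthonormal DCT-II; the function names dct2 and dct_feature are hypothetical and do not come from the paper.

```python
import numpy as np

def dct2(patch):
    """Orthonormal 2-D DCT-II of a square patch, built from the DCT basis matrix."""
    n = patch.shape[0]
    k = np.arange(n)[:, None]          # frequency index (rows of the basis)
    x = np.arange(n)[None, :]          # sample index (columns of the basis)
    basis = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * x + 1) * k / (2 * n))
    basis[0, :] = np.sqrt(1.0 / n)     # the DC row uses a different scale
    return basis @ patch @ basis.T

def dct_feature(patch, n_bins):
    """Mean |DCT| per radial band, log-compressed, normalized to sum to 1."""
    mag = np.abs(dct2(patch.astype(float)))
    n = patch.shape[0]
    u, v = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    rho = np.hypot(u, v)               # radial coordinate in the DCT plane
    # Assumption: band boundaries rho_k are equally spaced.
    edges = np.linspace(0.0, rho.max() + 1e-9, n_bins + 1)
    feat = np.zeros(n_bins)
    for i in range(n_bins):
        band = (rho >= edges[i]) & (rho < edges[i + 1])
        if band.any():
            # Divide the summed magnitude by the band area (the role of S_k).
            feat[i] = np.log1p(mag[band].sum() / band.sum())
    return feat / feat.sum()           # the role of W_D: unit-sum normalization
```

With a 15×15 patch and n_bins = 13, a noisy (detailed) patch puts noticeably more mass in the last few bands than a smooth ramp does, which is the behavior the figure describes.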

1.2. Gradient Feature

Hand-Crafted Features: Average gradient feature from sharp (dotted) and blurry (solid) patches.
  • The gradients of PI are calculated using Sobel filtering to obtain a gradient patch PG. Typically, there are more strong gradients in a sharp image patch than in a blurry one.
  • Therefore, the ratio of the strong gradient components in an image patch can be another measure of the sharpness of the image.
  • The normalized histogram of PG is used as the second component of the defocus feature, giving the gradient feature fG.
  • In its definition, HG, WG and nG denote the histogram of PG, the normalization factor and the dimensions of the feature, respectively.

The above figure shows a comparison of sharp and blurry gradient features. Despite its simplicity, the gradient feature shows quite effective discriminative power.
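
A minimal sketch of such a gradient feature is given below. It assumes intensities in [0, 1], a histogram of the Sobel gradient magnitude with equal-width bins, and a log(1 + x) compression before normalization; these choices and the names filter3x3 and gradient_feature are illustrative assumptions, not the paper's exact definition.

```python
import numpy as np

SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
SOBEL_Y = SOBEL_X.T

def filter3x3(img, kernel):
    """'Valid' 3x3 correlation written with plain slicing (no SciPy needed)."""
    h, w = img.shape
    out = np.zeros((h - 2, w - 2))
    for dy in range(3):
        for dx in range(3):
            out += kernel[dy, dx] * img[dy:dy + h - 2, dx:dx + w - 2]
    return out

def gradient_feature(patch, n_bins, max_mag=4.0 * np.sqrt(2.0)):
    """Log-compressed histogram of Sobel gradient magnitudes, summing to 1.

    max_mag is the largest magnitude possible for intensities in [0, 1]
    (each Sobel response is bounded by 4 in absolute value).
    """
    img = patch.astype(float)
    mag = np.hypot(filter3x3(img, SOBEL_X), filter3x3(img, SOBEL_Y))
    hist, _ = np.histogram(mag, bins=n_bins, range=(0.0, max_mag))
    feat = np.log1p(hist)              # assumption: log(1 + H_G) compression
    return feat / feat.sum()           # the role of W_G: unit-sum normalization
```

A hard step edge fills one of the high-magnitude bins, whereas a gentle ramp only fills the lowest bins, so the upper part of the feature separates the two.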

1.3. SVD Feature

Hand-Crafted Features: Average SVD feature from sharp (dotted) and blurry (solid) patches.
  • Singular value decomposition (SVD) has many useful applications in signal processing. One such application of SVD is the low-rank matrix approximation of an image.
  • The factorization of an m×n real matrix A can be written as A = UΛVᵀ = Σ_{k=1..N} λk uk vkᵀ,
  • where Λ, N, λk, uk and vk denote the m×n diagonal matrix, the number of non-zero singular values of A, the k-th singular value, and the k-th columns of the real unitary matrices U and V, respectively.
  • If we construct a matrix Ã = Σ_{k=1..n} λk uk vkᵀ that keeps only the first n ≤ N terms, we can approximate the given matrix A with Ã.
  • In the case of image reconstruction, low-rank matrix approximation discards small details in the image, and the amount of lost detail is inversely proportional to n.
Low-rank matrix approximation of an image
  • The above figure shows an example of the low-rank matrix approximation of an image.
  • An SVD feature is extracted based on low-rank matrix approximation. Because more non-zero singular values are needed to preserve the details in an image, a sharp image patch tends to have more non-zero singular values than a blurry image patch; i.e., a non-zero λk with large k is a clue to the amount of detail. The scaled singular values define the last hand-crafted feature, fS.

The above curve figure shows a comparison of sharp and blurry SVD features. The long tail of the sharp feature implies that more details are preserved in an image patch.
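
The low-rank approximation and an SVD feature of this kind can be sketched as follows. The truncation uses the standard rank-n SVD reconstruction; the log(1 + x) scaling and unit-sum normalization of the singular values are assumptions, and the names low_rank and svd_feature are hypothetical.

```python
import numpy as np

def low_rank(a, n):
    """Rank-n approximation: keep the n largest singular values of a."""
    u, s, vt = np.linalg.svd(a, full_matrices=False)
    return (u[:, :n] * s[:n]) @ vt[:n]

def svd_feature(patch, n_s):
    """First n_s singular values, log-compressed and normalized to sum to 1."""
    s = np.linalg.svd(patch.astype(float), compute_uv=False)  # sorted descending
    feat = np.log1p(s[:n_s])           # assumption: log(1 + lambda_k) scaling
    return feat / feat.sum()
```

The reconstruction error of low_rank shrinks as n grows, and a detailed patch keeps a longer tail of non-negligible singular values than a smooth one.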

2. Deep Feature Using CNN

CNN for Deep Feature
  • The deep feature fC is extracted from a color image patch using a CNN.
  • The feature extraction network consists of convolutional, ReLU and max pooling layers. The stride is set to 1 for convolution and to 3 for max pooling.
  • The deep feature compensates for the lack of color and cross-channel information in the hand-crafted features.
The average activations with sharp, intermediate and blurry patches.
  • The above figure shows the average outputs from our feature extraction network with sharp, intermediate and blurry image patches. The activations are proportional to the sharpness of the input image.

3. Defocus Feature

3.1. Concatenations of All Features

  • All of the extracted features are concatenated to construct the final defocus feature: fB = [fH, fC] = [[fD, fG, fS], fC],
  • where [·] denotes concatenation and fH is the concatenated hand-crafted feature.
Classification accuracies. Note that the accuracy of a random guess is 9.09%.
  • The above table shows the classification accuracy of each feature.
  • The classification tolerance is set to an absolute difference of 0.15 from the standard deviation σ of the ground-truth blur kernel.
  • Neural networks with the same architecture are trained on each of those features individually and tested on 576,000 features of 11 classes. (The details of the neural network are presented below.)

The deep feature, fC, has the most discriminative power.

When all hand-crafted and deep features are concatenated (fB), the performance is even more enhanced.

Removing one of the hand-crafted features drops the performance by approximately 1–3%.

  • For example, the classification accuracies of [fD, fS, fC] and [fG, fS, fC] are 93.25% and 91.10%, respectively.
  • The performance of fB with only single-scale patch extraction also decreases to 91.00%.
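
One way to read the tolerance-based accuracy is sketched below. It assumes that the 11 labels map to σ values starting at σmin = 0.5 in steps of σinter = 0.15 (the values given in the Training section), and that a prediction counts as correct when its σ is within 0.15 of the ground truth, inclusive. Both points are assumptions for illustration, and tolerant_accuracy is a hypothetical name.

```python
SIGMA_MIN, SIGMA_INTER, NUM_LABELS = 0.5, 0.15, 11

# Assumed label-to-sigma mapping: 0.5, 0.65, ..., 2.0
SIGMAS = [SIGMA_MIN + l * SIGMA_INTER for l in range(NUM_LABELS)]

def tolerant_accuracy(pred_labels, true_labels, tol=0.15, eps=1e-9):
    """Fraction of predictions whose sigma is within tol of the true sigma.

    eps absorbs floating-point rounding so that adjacent labels
    (exactly one step apart) are still counted when tol equals the step.
    """
    hits = sum(
        abs(SIGMAS[p] - SIGMAS[t]) <= tol + eps
        for p, t in zip(pred_labels, true_labels)
    )
    return hits / len(true_labels)
```

Under these assumptions, a prediction that is off by one label is still counted as correct, while tol=0.0 reduces to exact-match accuracy.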

3.2. Neural Network Classifier

  • The classifier network consists of three fully connected layers (300, 150 and 11 neurons, respectively) with ReLU and Dropout layers. A softmax classifier is used for the last layer.
  • Using this classification network, the labels and probabilities of features are obtained, after which the labels are converted to the corresponding σ values of the Gaussian kernel, which describe the amount of defocus.
  • Subsequently, the sparse defocus map IS is constructed using the σ values and the confidence map IC using the probabilities.
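
The conversion from classifier outputs to the two sparse maps can be sketched as follows. This is a minimal illustration that assumes the confidence is the softmax probability of the predicted label and that pixels without a patch stay at zero; the function name sparse_maps and its arguments are hypothetical.

```python
import numpy as np

def sparse_maps(shape, positions, probs, sigmas):
    """Build the sparse defocus map I_S and the confidence map I_C.

    shape     : (height, width) of the image
    positions : list of (y, x) patch centers
    probs     : array of shape (num_patches, num_labels) with softmax outputs
    sigmas    : sigma value for each label
    """
    i_s = np.zeros(shape)
    i_c = np.zeros(shape)
    labels = probs.argmax(axis=1)      # most probable label per patch
    for (y, x), label, p in zip(positions, labels, probs):
        i_s[y, x] = sigmas[label]      # label converted to its sigma value
        i_c[y, x] = p[label]           # assumption: confidence = max probability
    return i_s, i_c
```

Both maps are defined only at the patch centers, which is why a filtering and propagation step is needed afterwards to obtain a dense defocus map.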

3.3. Defocus Map

  • The sparse defocus map is filtered with a probability-joint bilateral filter and then propagated into the full defocus map using the matting Laplacian algorithm. Please feel free to read the paper for more details.

4. Training

  • For the classification network training, 300 sharp images are randomly selected from the ILSVRC training data.
  • Approximately 1M multi-scale image patches are extracted on strong edges and regarded as sharp patches. After that, each sharp patch PSI is convolved with synthetic blur kernels to generate blurry patches PBI, one per label: PBI,l = hσl ∗ PSI for l = 1, …, L,
  • where h, ∗ and L denote the Gaussian blur kernel with a zero mean and variance σ², the convolution operator and the number of labels, respectively.
  • L is set to 11. σmin = 0.5 and σinter = 0.15.
  • For the training of the deep feature, fC, the feature extraction network and the classifier network are directly connected to train the deep feature and classifier simultaneously.
  • The same method is applied when using the concatenated feature, fH, for training.
  • For the training of fB, the classifier connected to the feature extraction network is first trained with fC only, after which the classifier is fine-tuned with the hand-crafted features, fH, added.
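
The synthetic blurring step can be sketched as follows. It assumes that the label σ values are evenly spaced, σl = σmin + (l − 1)·σinter, so that L = 11 labels cover 0.5 to 2.0; the kernel radius of 3σ and the edge padding are also illustrative choices, and all function names are hypothetical.

```python
import numpy as np

SIGMA_MIN, SIGMA_INTER, NUM_LABELS = 0.5, 0.15, 11

def gaussian_kernel(sigma):
    """1-D zero-mean Gaussian kernel, truncated at about 3 sigma, summing to 1."""
    r = max(1, int(np.ceil(3 * sigma)))
    x = np.arange(-r, r + 1)
    k = np.exp(-x**2 / (2 * sigma**2))
    return k / k.sum()

def blur(patch, sigma):
    """Separable Gaussian blur with edge padding; output has the input shape."""
    k = gaussian_kernel(sigma)
    r = len(k) // 2
    padded = np.pad(patch.astype(float), r, mode="edge")
    conv = lambda v: np.convolve(v, k, mode="valid")
    rows = np.apply_along_axis(conv, 1, padded)   # blur along x
    return np.apply_along_axis(conv, 0, rows)     # blur along y

def blur_levels(sharp_patch):
    """Return [(sigma_l, blurred patch)] for every label l (assumed spacing)."""
    sigmas = [SIGMA_MIN + l * SIGMA_INTER for l in range(NUM_LABELS)]
    return [(s, blur(sharp_patch, s)) for s in sigmas]
```

Each sharp patch thus yields 11 training samples, one per label, and the label index is the classification target.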
Feature scale encoding scheme
  • The classifier is trained with features from small and large patches together.
  • If a feature is from a large-scale patch, the small-scale positions are filled with zeros, and vice versa.
  • 15×15 patches on strong edges and 27×27 patches on weak edges are extracted.
  • nD = nG = nS = 25 for large patches, nD = nG = nS = 13 for small patches. σs = σr = 100.0, σc = 1.0, ε = 1e−5 and γ = 0.005 for all experiments.
  • A simple thresholding method is applied to the full defocus map to obtain a binary blur mask. The threshold value τ is controlled by a parameter α, which is set empirically to 0.3 in the experiments.
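
The scale encoding with zero filling can be sketched as below. The ordering of the slots (small-scale first, large-scale second) is an assumption, and encode_scale is a hypothetical helper; with the hand-crafted dimensions above, the small slot would hold 3 × 13 = 39 values and the large slot 3 × 25 = 75 values.

```python
def encode_scale(feature, scale, small_dim, large_dim):
    """Place a feature into a fixed-length vector with one slot per scale.

    The slot of the other scale is filled with zeros, so the classifier
    always sees a vector of length small_dim + large_dim.
    """
    feature = list(feature)
    if scale == "small":
        if len(feature) != small_dim:
            raise ValueError("small-scale feature has the wrong length")
        return feature + [0.0] * large_dim
    if scale == "large":
        if len(feature) != large_dim:
            raise ValueError("large-scale feature has the wrong length")
        return [0.0] * small_dim + feature
    raise ValueError("scale must be 'small' or 'large'")
```

For example, encode_scale([1.0] * 39, "small", 39, 75) gives a vector of length 114 whose last 75 entries are zero.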

5. Experimental Results

Segmentation accuracies (top) and Precision-Recall comparison (bottom)
  • The segmentation accuracies are obtained from the ratio of the number of pixels correctly classified to the total number of pixels.
  • Precision-Recall curves can be calculated by adjusting τ from σ1 to σL.
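
The thresholding and the segmentation accuracy can be sketched as follows. The threshold is assumed here to interpolate between the minimum and maximum of the defocus map, τ = α·vmax + (1 − α)·vmin, and pixels above τ are assumed to be labeled as blurred; the function names are hypothetical.

```python
import numpy as np

def threshold_value(defocus_map, alpha=0.3):
    """Assumed form: tau = alpha * v_max + (1 - alpha) * v_min."""
    return alpha * defocus_map.max() + (1 - alpha) * defocus_map.min()

def segmentation_accuracy(defocus_map, gt_blur_mask, alpha=0.3):
    """Ratio of correctly classified pixels to the total number of pixels."""
    pred_blur = defocus_map > threshold_value(defocus_map, alpha)
    return float((pred_blur == gt_blur_mask).mean())
```

Sweeping the threshold over the label values from σ1 to σL instead of using a fixed α gives the points of a Precision-Recall curve.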
(a) Input images. (b) Results of [30]. (c) Results of [31]. (d) Results of [32] (Inverted for visualization). (e) Results of [42]. (f) Proposed defocus maps and (g) corresponding binary masks. (h) Ground truth binary masks.
  • (c) & (d): The results of [31] and [32] show some erroneous classifications in homogeneous regions.
  • (e) & (f): [42] and the proposed algorithm avoid this problem.
  • (f) & (g): An edge-preserving smoothed color image is adopted as a propagation prior in the proposed algorithm, which gives better results.
Defocus maps from each feature
  • (b), (c) & (d): Single hand-crafted features give unsatisfactory results.
  • (e): Surprisingly, the deep feature alone works quite well, but its result is slightly worse than that of the concatenated features. (Solid blue)
  • (f): The hand-crafted feature alone also works nicely but there are several misclassifications. (Dashed red)
  • (g): Certain misclassifications due to the hand-crafted feature are well handled by the deep feature, and the discriminative power of the deep feature was strengthened with the aid of the hand-crafted features.


