Review — Khajuria ICIIP’19: Blur Detection in Identity Images (Blur Detection)

Outperforms BRISQUE Using SVM

Jan 2, 2021

In this story, Blur Detection in Identity Images Using Convolutional Neural Network, Khajuria ICIIP’19, by Centre for Development of Advanced Computing (C-DAC), is reviewed. In this paper:

There are 250,000 identity images without ground-truth labels. Some are blurred and some are clear.
Apart from naturally blurred images, five types of artificial blur are synthesized from the clear images.
A CNN is proposed to detect whether an image is blurred or clear. It achieves 98.05% accuracy on about 113,000 test images, outperforming BRISQUE with an SVM.

This is a paper in 2019 ICIIP. (Sik-Ho Tsang @ Medium)

Outline

Data Preparation
Blur Synthesis
Proposed CNN: Network Architecture
Experimental Results

1. Data Preparation

The data comprises 250,000 images, all unlabeled.
First, BRISQUE, a statistical IQA (image quality assessment) method, is used to give each image a rough quality rating.
BRISQUE extracts 36 features from each image.
These feature vectors are then fed to an SVR, which regresses a score for each image.
Images with lower scores were generally clear, and images with higher scores were generally blurred.
By observation, two threshold values, 40 and 75, are identified.
Images with a score less than 40 were labeled clear.
Images with a score greater than 75 were labeled blur.
Only 26% of the images fall in these two intervals, i.e. a score less than 40 or more than 75. (The remaining samples, nearly 74% of the data, are still ambiguous.)
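
The thresholding rule above can be sketched in a few lines of Python. This is only an illustration: the thresholds 40 and 75 come from the paper, while the function name, the label strings and the example scores are made up here, and the scores are assumed to have already been produced by a BRISQUE + SVR pipeline.

```python
# Thresholds reported in the paper; everything else here is a hypothetical sketch.
CLEAR_MAX = 40.0  # scores below this are labeled clear
BLUR_MIN = 75.0   # scores above this are labeled blur


def label_from_score(score):
    """Map a BRISQUE-style quality score to 'clear', 'blur', or None (ambiguous)."""
    if score < CLEAR_MAX:
        return "clear"
    if score > BLUR_MIN:
        return "blur"
    return None  # scores in [40, 75] stay unlabeled


# Made-up scores, for illustration only.
scores = [12.5, 40.0, 58.3, 75.0, 91.2]
labels = [label_from_score(s) for s in scores]
```

Images that map to None correspond to the roughly 74% of ambiguous samples that are left out of training.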

66163 images were labeled clear and the other 1010 images were labeled blur. These images now have labels and are used for training the CNN. However, the labeled images are few, so data augmentation is needed before training, since a small amount of data increases the probability of “over-fitting”.

2. Blur Synthesis

The degradation in an image can be mathematically modeled as:

g(x, y) = f(x, y) * h(x, y) + n(x, y)

where * denotes 2D convolution, and g(x, y), f(x, y), h(x, y) and n(x, y) represent the degraded image, the original image, the point spread function (PSF) and the noise respectively.
The additive noise n(x, y) is ignored in this paper.
Five variants of the PSF are applied.
For each PSF, two different kernel sizes (K), i.e. K1 and K2, are chosen.
Thus, 10 kinds of blur are synthesized.
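
To make the noise-free degradation model concrete, here is a minimal pure-Python sketch that convolves a grayscale image with a PSF. The function name `convolve2d` and the edge handling (replicating border pixels) are assumptions for illustration; the paper does not say how borders are treated, and a real pipeline would normally use an image library instead.

```python
def convolve2d(image, psf):
    """Blur a grayscale image (list of rows) with a PSF, replicating edge pixels."""
    h, w = len(image), len(image[0])
    kh, kw = len(psf), len(psf[0])
    ch, cw = kh // 2, kw // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for i in range(kh):
                for j in range(kw):
                    # Flipped kernel indices: this is convolution, not correlation.
                    yy = min(max(y + ch - i, 0), h - 1)
                    xx = min(max(x + cw - j, 0), w - 1)
                    acc += psf[i][j] * image[yy][xx]
            out[y][x] = acc
    return out


# Convolving a single bright pixel with a PSF reproduces the PSF around it.
impulse = [[0.0] * 5 for _ in range(5)]
impulse[2][2] = 1.0
psf = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
blurred = convolve2d(impulse, psf)
```

The impulse example is a handy sanity check: the output around the bright pixel is exactly the PSF, which is why h(x, y) is called the point spread function.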

2.1. Gaussian Blur

A Gaussian PSF is used, with K1 = 11 and K2 = 13.
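
A normalized Gaussian kernel of size K can be sketched as below. The kernel sizes 11 and 13 come from the paper, but the standard deviation is not given in this review, so `sigma=2.0` is purely an assumed value, and `gaussian_psf` is a hypothetical helper name.

```python
import math


def gaussian_psf(k, sigma):
    """Return a k x k Gaussian kernel whose entries sum to 1."""
    c = k // 2  # kernel center
    raw = [
        [math.exp(-((x - c) ** 2 + (y - c) ** 2) / (2.0 * sigma ** 2)) for x in range(k)]
        for y in range(k)
    ]
    total = sum(map(sum, raw))
    return [[v / total for v in row] for row in raw]


# sigma is an assumption; only the kernel sizes are taken from the paper.
psf_k1 = gaussian_psf(11, sigma=2.0)
psf_k2 = gaussian_psf(13, sigma=2.0)
```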

2.2. Out Of Focus Blur

The PSF is defined in terms of R, the “radius” of the blur, and C, the “center” of the PSF.
K1 = 7 and K2 = 9.
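
Out-of-focus blur is commonly modeled as a uniform disk, so one plausible reading of the PSF is sketched below. This is an assumption rather than the paper's exact formula: the helper name `disk_psf` is hypothetical, the center C is taken as the middle of the kernel, and the radius R defaults to half the kernel extent.

```python
def disk_psf(k, radius=None):
    """Return a k x k uniform disk kernel (entries sum to 1)."""
    c = (k - 1) / 2.0                       # center C of the PSF
    r = c if radius is None else radius     # radius R, assumed to span the kernel
    mask = [
        [1.0 if (x - c) ** 2 + (y - c) ** 2 <= r ** 2 else 0.0 for x in range(k)]
        for y in range(k)
    ]
    total = sum(map(sum, mask))
    return [[v / total for v in row] for row in mask]


disk_k1 = disk_psf(7)
disk_k2 = disk_psf(9)
```

Every pixel inside the disk gets the same weight and everything outside is zero, which is what gives defocused images their flat, washed-out look.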

2.3. Horizontal Motion Blur

This blur simulates horizontal motion using a horizontal motion filter, with kernel sizes K1 = 7 and K2 = 9.

2.4. Vertical Motion Blur

This blur simulates vertical motion using a vertical motion filter, with kernel sizes K1 = 9 and K2 = 11.
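
The two motion filters can be sketched as kernels with a single line of equal weights, a row for horizontal motion and a column for vertical motion. The helper names are hypothetical and uniform weights are an assumption; only the kernel sizes are taken from the paper.

```python
def horizontal_motion_psf(k):
    """k x k kernel with uniform weights along the middle row only."""
    mid = k // 2
    return [[1.0 / k if y == mid else 0.0 for _ in range(k)] for y in range(k)]


def vertical_motion_psf(k):
    """k x k kernel with uniform weights along the middle column only."""
    mid = k // 2
    return [[1.0 / k if x == mid else 0.0 for x in range(k)] for _ in range(k)]


h_k1, h_k2 = horizontal_motion_psf(7), horizontal_motion_psf(9)
v_k1, v_k2 = vertical_motion_psf(9), vertical_motion_psf(11)
```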

2.5. Box Blur

This is the averaging filter, which smooths the image, with K1 = 7 and K2 = 9.
val(x, y) gives the pixel value for kernel.
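
As a small sketch, a box kernel simply gives every position the same weight 1/K². The snippet also collects the ten (blur type, kernel size) pairs listed in Sections 2.1 to 2.5. The names `box_psf` and `BLUR_CONFIGS` are hypothetical; the sizes are the ones stated above.

```python
def box_psf(k):
    """k x k averaging kernel: every entry equals 1 / k^2."""
    return [[1.0 / (k * k)] * k for _ in range(k)]


# The ten blur variants described in this section: (blur type, kernel size K).
BLUR_CONFIGS = [
    ("gaussian", 11), ("gaussian", 13),
    ("out_of_focus", 7), ("out_of_focus", 9),
    ("horizontal_motion", 7), ("horizontal_motion", 9),
    ("vertical_motion", 9), ("vertical_motion", 11),
    ("box", 7), ("box", 9),
]

box_k1 = box_psf(7)
box_k2 = box_psf(9)
```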

2.6. Data Augmentation

The clear images (66163) are divided into 10 parts, one for each of the 10 blur variants.
This results in 6616 images per degradation type.
Now there are 67170 blur images (66160 artificial blur images and 1010 natural blur images) and 66163 images in the clear category.
In total, there are 133,333 images.
To train the CNN, images are resized to a standard size of 320×240 using bicubic interpolation.
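
The dataset sizes above follow from simple integer arithmetic, which can be checked with a short sketch (the variable names are illustrative):

```python
n_clear = 66163        # images labeled clear
n_natural_blur = 1010  # images labeled blur by the BRISQUE thresholds
n_blur_kinds = 10      # 5 PSFs x 2 kernel sizes

per_kind = n_clear // n_blur_kinds       # clear images blurred per variant
n_synthetic = per_kind * n_blur_kinds    # artificial blur images
n_blur = n_synthetic + n_natural_blur    # all blur images
n_total = n_blur + n_clear               # full dataset

print(per_kind, n_synthetic, n_blur, n_total)
```

The integer division leaves 3 clear images without a blurred counterpart, which is why there are 66160 rather than 66163 artificial blur images.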

3. Proposed CNN: Network Architecture

The first convolutional layer takes a 320×240×3 image as input.
It has 32 kernels of size 3×3 with a stride of 1.
It outputs 318×238×32, followed by ReLU.
Max pooling of size 2×2 with a stride of 2 pixels is then applied, giving 159×119×32.
Dropout with a drop rate of 0.25 is applied after the first convolutional layer, i.e. 25% of these connections are dropped.
The second convolutional layer has 64 kernels of size 3×3 with a stride of 1. Dropout with a drop rate of 0.5 is applied.
It outputs 157×117×64, followed by ReLU.
Max pooling of size 2×2 with a stride of 2 pixels is then applied, giving 78×58×64.
The result is fed to an FC layer consisting of 250 units with ReLU and Dropout with a drop rate of 0.5.
The output of the FC layer is fed to a Softmax function.

There are 72,404,144 parameters to train, as shown in the table above, which is economical compared to many SOTA architectures.
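
The layer shapes and the parameter count can be reproduced with a short calculation. This sketch assumes unpadded ("valid") convolutions, pooling that rounds down, and a final 2-unit dense layer (blur / clear) in front of the softmax. The output layer size is not stated explicitly in this review, but it is the choice that reproduces the reported total.

```python
def conv_out(size, kernel=3, stride=1):
    """Output length of an unpadded convolution along one axis."""
    return (size - kernel) // stride + 1


def pool_out(size, pool=2, stride=2):
    """Output length of max pooling along one axis (rounds down)."""
    return (size - pool) // stride + 1


w, h, c = 320, 240, 3
params = 0
for filters in (32, 64):
    params += (3 * 3 * c + 1) * filters      # 3x3 weights per input channel + bias
    w, h, c = conv_out(w), conv_out(h), filters
    w, h = pool_out(w), pool_out(h)

flat = w * h * c                 # flattened feature size fed to the FC layer
params += (flat + 1) * 250       # FC layer with 250 units
params += (250 + 1) * 2          # assumed 2-unit output layer before softmax

print((w, h, c), flat, params)
```

Almost all of the parameters (about 72.4 million) sit in the first FC layer, because the flattened 78×58×64 feature map is large.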

4. Experimental Results

The training data is drawn by stratified sampling from the whole set of 133,333 images. Around 12% and 3% of the data are used for training and validation, i.e. around 15,000 and 5,000 images respectively.
Thus, the remaining 113,333 images are used for testing.
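
Stratified sampling keeps the blur/clear ratio the same in every split. The sketch below shows one simple way to do this in plain Python. The function name, the seed and the per-class rounding are assumptions, since the paper does not describe its sampling procedure in detail.

```python
import random
from collections import defaultdict


def stratified_split(labels, n_train, n_val, seed=0):
    """Return (train, val, test) index lists with class proportions preserved."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, lab in enumerate(labels):
        by_class[lab].append(idx)
    total = len(labels)
    train, val, test = [], [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        # Each class contributes in proportion to its share of the dataset.
        k_train = round(n_train * len(idxs) / total)
        k_val = round(n_val * len(idxs) / total)
        train += idxs[:k_train]
        val += idxs[k_train:k_train + k_val]
        test += idxs[k_train + k_val:]
    return train, val, test


# Class counts from Section 2.6; indices stand in for the actual images.
labels = ["blur"] * 67170 + ["clear"] * 66163
train, val, test = stratified_split(labels, n_train=15000, n_val=5000)
print(len(train), len(val), len(test))
```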

Performance Comparison

BRISQUE: The feature vector has 36 dimensions. A linear support vector classifier is trained on these extracted features. This classifier achieved an accuracy of 94.51% on the test samples, i.e. 113,333 images.
The proposed CNN model is tested on the same 113,333 images and achieves an accuracy of 98.05%.
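
To put the two accuracies in perspective, the sketch below converts them into approximate error counts on the 113,333 test images. These counts are implied by the reported accuracies and rounded to the nearest image. They are not figures quoted from the paper.

```python
n_test = 113333
acc_brisque = 94.51  # % accuracy, BRISQUE features + linear SVC
acc_cnn = 98.05      # % accuracy, proposed CNN

# Approximate number of misclassified test images implied by each accuracy.
brisque_errors = round(n_test * (100 - acc_brisque) / 100)
cnn_errors = round(n_test * (100 - acc_cnn) / 100)

# Relative reduction in error rate when moving from BRISQUE + SVC to the CNN.
error_reduction = 1 - (100 - acc_cnn) / (100 - acc_brisque)

print(brisque_errors, cnn_errors, f"{error_reduction:.1%}")
```

In other words, a gain of about 3.5 percentage points in accuracy corresponds to cutting the number of misclassified images by roughly two thirds.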