[Paper] DeepVQUE: Deep Video QUality Evaluator (Video Quality Assessment, VQA)
Competitive Performance With C3D, Outperforms MOVIE, VQM & FLOSIM (BA)
4 min read · Oct 31, 2020
In this story, “Full-Reference Video Quality Assessment Using Deep 3D Convolutional Neural Networks” (DeepVQUE), by the Indian Institute of Technology, is briefly presented. This paper was introduced to me by a colleague while I was studying VQA. In this paper:
- Deep 3D ConvNet models are used for feature extraction. 3D ConvNets can extract spatio-temporal features of a video, whereas most existing full-reference (FR) VQA approaches (at that time) operated on the spatial and temporal domains independently, followed by pooling.
- Spatial quality is estimated using an off-the-shelf full-reference image quality assessment (FRIQA) method, namely MS-SSIM.
- Overall video quality is estimated using support vector regression (SVR) applied to the spatio-temporal and spatial quality estimates.
This is a paper in 2019 NCC. (Sik-Ho Tsang @ Medium)
Outline
- DeepVQUE: Framework
- Spatial Quality Estimation
- Spatio-Temporal Quality Estimation
- Overall Quality Estimation
- Experimental Results
1. DeepVQUE: Framework
- The framework comprises two parts.
- One performs spatial quality estimation of the video on a frame-by-frame basis using MS-SSIM, and the other performs spatio-temporal quality estimation using a 3D convolutional network that takes video volumes as input.
- Then, video quality pooling is performed using SVR to predict the final overall video quality, as sketched below.
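To make the flow concrete, here is a minimal Python sketch of the pipeline. The helpers spatial_quality, st_quality, encoder, and the trained svr are my illustrative names, sketched in the sections below; this is not the paper's reference implementation.

```python
# High-level sketch of the DeepVQUE pipeline. spatial_quality, st_quality,
# encoder, and the trained svr are illustrative helpers sketched in the
# sections below, not the paper's reference implementation; in practice
# each branch consumes its own preprocessed representation of the video.
def deepvque_score(pristine_video, distorted_video, encoder, svr):
    q_s = spatial_quality(pristine_video, distorted_video)       # frame-wise MS-SSIM branch
    q_st = st_quality(pristine_video, distorted_video, encoder)  # 3D ConvNet feature branch
    return svr.predict([[q_s, float(q_st)]])[0]                  # SVR pooling -> final quality
```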
2. Spatial Quality Estimation
- Let the spatial quality score of the i-th test video frame relative to the reference be denoted by Si. Then the overall spatial quality Qs of the video is defined as the average of the frame-wise scores:

Qs = (1/N) · Σ_{i=1}^{N} Si

- where N is the total number of frames in the video.
- The MS-SSIM index is used to estimate the spatial quality Si of each frame.
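As a rough illustration of this branch, here is a sketch that averages a per-frame similarity score; it uses single-scale SSIM from scikit-image as a stand-in for the MS-SSIM index the paper actually uses.

```python
import numpy as np
from skimage.metrics import structural_similarity

# Sketch of the spatial branch: Qs is the mean frame-wise similarity.
# structural_similarity is single-scale SSIM, used here as a stand-in
# for the MS-SSIM index the paper employs.
def spatial_quality(pristine_video, distorted_video):
    # videos: iterables of (H, W) grayscale frames as floats in [0, 1]
    scores = [
        structural_similarity(p, d, data_range=1.0)
        for p, d in zip(pristine_video, distorted_video)
    ]
    return float(np.mean(scores))  # Qs = (1/N) * sum of Si over N frames
```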
3. Spatio-Temporal Quality Estimation
- A 3D convolutional autoencoder (CAE) architecture is used, whose weights are learned in an unsupervised setting using videos from the KoNViD-1k database.
- The features at the bottleneck are used for spatio-temporal quality estimation. The proposed framework is trained and tested with different input video volume sizes.
- Let p and d denote the pristine and distorted videos, respectively, which are fed forward through the 3D ConvNet model Z.
- The feature vectors at the intermediate layers of the pristine and distorted videos are denoted by Vp = [v1p, v2p, …, vNp]^T and Vd = [v1d, v2d, …, vNd]^T, respectively.
- The spatio-temporal quality of the distorted video is estimated using a distance measure d(Vp, Vd).
- Distance measures such as the l1 norm and l2 norm can be used, as in the sketch below.
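Here is a minimal PyTorch sketch of this idea, assuming a small 3D convolutional encoder as a stand-in for the paper's 3D CAE bottleneck (the exact architecture and the KoNViD-1k-trained weights are not reproduced here), with the l2 norm as the distance measure.

```python
import torch
import torch.nn as nn

# Stand-in for the bottleneck of the paper's 3D CAE: a small 3D conv
# encoder producing one global descriptor per video volume.
class Encoder3D(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(3, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv3d(16, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),  # bottleneck-style global descriptor
        )

    def forward(self, x):  # x: (B, 3, T, H, W) video volume
        return self.net(x).flatten(1)

def st_quality(pristine, distorted, encoder, p=2):
    """Spatio-temporal quality as an lp distance between bottleneck features."""
    with torch.no_grad():
        vp = encoder(pristine)
        vd = encoder(distorted)
    return torch.norm(vp - vd, p=p, dim=1)  # one score per volume in the batch

# Toy usage on a random 16-frame volume.
encoder = Encoder3D().eval()
pristine = torch.rand(1, 3, 16, 112, 112)
distorted = pristine + 0.1 * torch.randn_like(pristine)
print(st_quality(pristine, distorted, encoder).item())
```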
4. Overall Quality Estimation
- The overall quality of video is estimated using support vector regression (SVR).
- The spatial quality score and spatio-temporal quality score are used to train the SVR against the DMOS scores of the VQA datasets.
- The standard procedure followed in the literature is to split the dataset into 80% training samples and the remaining 20% for testing.
- Three models are trained using different input sizes.
- Finally, Model-2 is used in the experiments.
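A minimal scikit-learn sketch of this pooling stage on toy synthetic data is shown below; in the paper the two input features are the MS-SSIM-based spatial score and the 3D-CAE feature distance, regressed against DMOS.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

# Sketch of the SVR pooling stage on toy synthetic data; the scores and
# DMOS targets below are fabricated for illustration only.
rng = np.random.default_rng(0)
q_s = rng.uniform(0.5, 1.0, 200)   # spatial quality per video (toy)
q_st = rng.uniform(0.0, 5.0, 200)  # spatio-temporal score per video (toy)
dmos = 80 - 40 * q_s + 5 * q_st + rng.normal(0, 2, 200)  # toy DMOS targets

X = np.column_stack([q_s, q_st])
X_tr, X_te, y_tr, y_te = train_test_split(
    X, dmos, test_size=0.2, random_state=0  # the standard 80%/20% split
)

svr = SVR(kernel="rbf", C=10.0, epsilon=0.5).fit(X_tr, y_tr)
print("held-out R^2:", svr.score(X_te, y_te))
```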
5. Experimental Results
5.1. Localization of spatio-temporal quality of video
- Working on spatio-temporal video volumes gives the flexibility to localize distortions in space-time, in particular at the (x, y, t) location, where x and y are the spatial coordinates and t is the time coordinate.
- The distortion intensity at a particular pixel location is estimated from the spatio-temporal feature difference between the distorted video and the corresponding pristine video, with a 70-pixel overlap in the spatial direction and no overlap in the temporal direction, as sketched below.
- The localization of distortions in space-time for sample high- and low-quality videos is shown in the figure above.
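Here is a sketch of this localization procedure, reusing the Encoder3D and st_quality helpers from the earlier sketch; the 112-pixel window and 16-frame temporal length are my illustrative choices, while the 70-pixel spatial overlap and non-overlapping temporal windows follow the paper.

```python
import torch

# Sketch: distortion intensities over (x, y, t) by comparing bottleneck
# features of co-located video sub-volumes. Window size 112 and temporal
# length 16 are illustrative; the 70-pixel spatial overlap and the
# non-overlapping temporal windows follow the paper's description.
def distortion_map(pristine, distorted, encoder, win=112, overlap=70, t_len=16):
    # pristine/distorted: (3, T, H, W) tensors
    _, T, H, W = pristine.shape
    stride = win - overlap  # 70-pixel spatial overlap -> stride of 42
    scores = []
    for t0 in range(0, T - t_len + 1, t_len):      # no temporal overlap
        for y0 in range(0, H - win + 1, stride):
            for x0 in range(0, W - win + 1, stride):
                p = pristine[:, t0:t0 + t_len, y0:y0 + win, x0:x0 + win]
                d = distorted[:, t0:t0 + t_len, y0:y0 + win, x0:x0 + win]
                dist = st_quality(p[None], d[None], encoder).item()
                scores.append(((x0, y0, t0), dist))
    return scores  # list of ((x, y, t), distortion intensity)
```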
5.2. VQA Performance
- The above datasets are used for evaluation.
- At that time, the only freely available pre-trained 3D ConvNet model that could be used to validate the proposed framework was C3D. C3D accepts video volumes of size 171×128×16 as input and was trained on over 1 million YouTube sports videos.
- The competitive performance of DeepVQUE justifies the choice of spatio-temporal features for the VQA task.
- Also, existing FR-VQA techniques are computationally expensive, whereas the proposed approach involves only simple feedforward operations at test time, so it can be used in real-time applications.
Reference
[2019 NCC] [DeepVQUE]
Full-Reference Video Quality Assessment Using Deep 3D Convolutional Neural Networks
Video Quality Assessment
[3D-CNN+LSTM] [DeepVQUE]