[Paper] DeepVQUE: Deep Video QUality Evaluator (Video Quality Assessment, VQA)

Competitive Performance With C3D, Outperforms MOVIE, VQM & FLOSIM (BA)

In this story, “Full-Reference Video Quality Assessment Using Deep 3D Convolutional Neural Networks” (DeepVQUE), by the Indian Institute of Technology, is briefly presented. This paper was introduced to me by a colleague while I was studying VQA. In this paper:

  • Deep 3D ConvNet models are used for feature extraction. 3D ConvNets can extract spatio-temporal features of the video, whereas most existing Full-Reference (FR) VQA approaches (at that time) operated on the spatial and temporal domains independently, followed by pooling.
  • Spatial quality is estimated using off-the-shelf full-reference image quality assessment (FRIQA) methods, i.e. MS-SSIM.
  • Overall video quality is estimated using support vector regression (SVR) applied to the spatio-temporal and spatial quality estimates.

This is a paper in 2019 NCC. (Sik-Ho Tsang @ Medium)


  1. DeepVQUE: Framework
  2. Spatial Quality Estimation
  3. Spatio-Temporal Quality Estimation
  4. Overall Quality Estimation
  5. Experimental Results

1. DeepVQUE: Framework

DeepVQUE framework for FRVQA
  • It comprises two parts.
  • One for spatial quality estimation of the video on a frame-by-frame basis using MS-SSIM, and the other for spatio-temporal quality estimation using a 3D convolutional network that takes video volumes as input.
  • Then, video quality pooling is performed using SVR to predict the overall final video quality.

2. Spatial Quality Estimation

  • Let the spatial quality score of the i-th test video frame relative to the reference be denoted by Si. Then the overall spatial quality of the video, Qs, is defined as the average of the per-frame scores: Qs = (1/N) Σ Si for i = 1, …, N, where N is the total number of frames in the video.
  • MS-SSIM index is used to estimate spatial quality.
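The pooling step above can be sketched as follows. The per-frame scores here are made-up illustrative values; in practice they would come from an off-the-shelf MS-SSIM implementation:

```python
import numpy as np

def spatial_quality(ms_ssim_scores):
    """Pool per-frame MS-SSIM scores S_i into the overall
    spatial quality Q_s by averaging over the N frames."""
    return float(np.mean(ms_ssim_scores))

# Illustrative per-frame MS-SSIM scores for a 5-frame video
scores = [0.95, 0.93, 0.96, 0.94, 0.92]
q_s = spatial_quality(scores)  # ≈ 0.94
```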

3. Spatio-Temporal Quality Estimation

3D CAE model summary
  • The 3D convolutional autoencoder (CAE) architecture is used where weights are learned by using videos from the KoNViD-1k database in an unsupervised setting.
  • The features at the bottleneck are used for spatio-temporal quality estimation. The proposed framework is trained and tested with different input video volume sizes.
Spatio-temporal VQA algorithm
  • Let p and d denote the pristine and distorted videos, respectively, which are fed forward through the 3D ConvNet model Z.
  • The feature vectors at the intermediate layers are denoted by Vp = [v1p, v2p, …, vNp]T and Vd = [v1d, v2d, …, vNd]T of both pristine and distorted videos respectively.
  • The spatio-temporal quality of the distorted video is estimated using a distance measure d(Vp, Vd); the l1 norm and the l2 norm can be used as the distance measure.
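With hypothetical feature vectors standing in for Vp and Vd, the distance computation amounts to:

```python
import numpy as np

def spatio_temporal_distance(v_p, v_d, norm=2):
    """Distance d(Vp, Vd) between pristine and distorted
    feature vectors; norm=1 gives the l1 distance, norm=2
    the l2 distance. Smaller means closer to pristine."""
    diff = np.asarray(v_p) - np.asarray(v_d)
    return float(np.linalg.norm(diff, ord=norm))

# Illustrative (made-up) feature vectors
v_p = np.array([0.2, 0.5, 0.1])
v_d = np.array([0.1, 0.7, 0.1])
d1 = spatio_temporal_distance(v_p, v_d, norm=1)  # ≈ 0.3
d2 = spatio_temporal_distance(v_p, v_d, norm=2)  # ≈ 0.2236
```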

4. Overall Quality Estimation

  • The overall quality of video is estimated using support vector regression (SVR).
  • The spatial quality score and spatiotemporal quality score are used to train the SVR against DMOS scores of the VQA datasets.
  • The standard procedure followed in the literature is to split the dataset into 80% training samples, with the remaining 20% for testing.
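The regression step can be sketched with scikit-learn. The data below is synthetic (randomly generated quality scores and DMOS targets) purely to show the 80/20 split and the SVR fit; the hyperparameters are assumptions, not the paper's:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

# Synthetic stand-in data: each video is described by its
# spatial quality Q_s and spatio-temporal distance d(Vp, Vd);
# the target is a synthetic DMOS value.
rng = np.random.default_rng(0)
X = rng.random((100, 2))
dmos = 30 + 40 * X[:, 0] + 20 * X[:, 1] + rng.normal(0, 1, 100)

# Standard 80/20 train/test split used in the VQA literature
X_tr, X_te, y_tr, y_te = train_test_split(
    X, dmos, test_size=0.2, random_state=0)
model = SVR(kernel="rbf", C=100).fit(X_tr, y_tr)
pred = model.predict(X_te)
```

Performance would then be reported as the correlation (e.g. LCC/SROCC) between pred and the held-out DMOS values.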
Linear correlation coefficient (LCC) for different input sizes
  • Three models are trained using different input sizes.
  • Finally, Model-2 is used in the experiments.

5. Experimental Results

5.1. Localization of spatio-temporal quality of video

Localization of spatio-temporal quality of videos using Model-2
  • Working on spatio-temporal video volumes gives the flexibility to localize distortions in space-time, in particular at the (x, y, t) location, where x, y are the spatial coordinates and t is the time coordinate.
  • The distortion intensity at a particular pixel location is estimated from the difference between the spatio-temporal features of the distorted video and the corresponding pristine video, with a 70-pixel overlap in the spatial direction and no overlap in the temporal direction.
  • The localization of distortions in space-time for sample high- and low-quality videos is shown above.

5.2. VQA Performance

Details of existing video quality assessment datasets.
  • The above datasets are used for evaluation.
Performance on the LIVE-SD and EPFL PoliMI datasets
Performance on the LIVE Mobile dataset
  • At the time, C3D was the only freely available pre-trained model with which to validate the proposed framework. C3D accepts a video volume as input at a resolution of 171×128×16 and was trained on over 1 million YouTube sports videos.
  • The competitive performance of DeepVQUE justifies the choice of spatio-temporal features for the VQA task.
  • Also, existing FRVQA techniques are computationally expensive, whereas the proposed approach involves only a simple feedforward operation during the testing phase, so it can be used in real-time applications.




PhD, Researcher. I share what I've learnt and done. :) My LinkedIn: https://www.linkedin.com/in/sh-tsang/, My Paper Reading List: https://bit.ly/33TDhxG
