[Paper] DeepVQUE: Deep Video QUality Evaluator (Video Quality Assessment, VQA)
Competitive Performance With C3D, Outperforms MOVIE, VQM & FLOSIM (BA)
4 min read · Oct 31, 2020
In this story, “Full-Reference Video Quality Assessment Using Deep 3D Convolutional Neural Networks” (DeepVQUE), by the Indian Institute of Technology, is briefly presented. This paper was introduced to me by a colleague while I was studying VQA. In this paper:
- Deep 3D ConvNet models are used for feature extraction. 3D ConvNets can extract spatio-temporal features of a video, whereas most existing full-reference (FR) VQA approaches (at that time) operated on the spatial and temporal domains independently, followed by pooling.
- Spatial quality is estimated using an off-the-shelf full-reference image quality assessment (FRIQA) method, namely MS-SSIM.
- Overall video quality is estimated using support vector regression (SVR) applied to the spatio-temporal and spatial quality estimates.
This is a paper in 2019 NCC. (Sik-Ho Tsang @ Medium)
Outline
- DeepVQUE: Framework
- Spatial Quality Estimation
- Spatio-Temporal Quality Estimation
- Overall Quality Estimation
- Experimental Results
1. DeepVQUE: Framework
- The framework comprises two parts.
- One performs spatial quality estimation of the video on a frame-by-frame basis using MS-SSIM, and the other performs spatio-temporal quality estimation using a 3D convolutional network that takes video volumes as input.
- Then, video quality pooling is performed using SVR to predict the final overall video quality, as sketched below.
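To make the flow concrete, here is a minimal Python sketch of the pipeline. The helpers spatial_quality, st_quality, encoder, and the trained svr are my illustrative names, sketched in the sections below; this is not the paper's reference implementation.

```python
# High-level sketch of the DeepVQUE pipeline. spatial_quality, st_quality,
# encoder, and the trained svr are illustrative helpers sketched in the
# sections below, not the paper's reference implementation; in practice
# each branch consumes its own preprocessed representation of the video.
def deepvque_score(pristine_video, distorted_video, encoder, svr):
    q_s = spatial_quality(pristine_video, distorted_video)       # frame-wise MS-SSIM branch
    q_st = st_quality(pristine_video, distorted_video, encoder)  # 3D ConvNet feature branch
    return svr.predict([[q_s, float(q_st)]])[0]                  # SVR pooling -> final quality
```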
2. Spatial Quality Estimation
- Let the spatial quality score of the i-th test video frame relative to the reference be denoted by Si. Then the overall spatial quality Qs of the video is defined as the average of the frame-wise scores:

Qs = (1/N) · Σ_{i=1}^{N} Si

- where N is the total number of frames in the video.
- The MS-SSIM index is used to estimate the spatial quality Si of each frame.
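As a rough illustration of this branch, here is a sketch that averages a per-frame similarity score; it uses single-scale SSIM from scikit-image as a stand-in for the MS-SSIM index the paper actually uses.

```python
import numpy as np
from skimage.metrics import structural_similarity

# Sketch of the spatial branch: Qs is the mean frame-wise similarity.
# structural_similarity is single-scale SSIM, used here as a stand-in
# for the MS-SSIM index the paper employs.
def spatial_quality(pristine_video, distorted_video):
    # videos: iterables of (H, W) grayscale frames as floats in [0, 1]
    scores = [
        structural_similarity(p, d, data_range=1.0)
        for p, d in zip(pristine_video, distorted_video)
    ]
    return float(np.mean(scores))  # Qs = (1/N) * sum of Si over N frames
```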
3. Spatio-Temporal Quality Estimation
- A 3D convolutional autoencoder (CAE) architecture is used, whose weights are learned in an unsupervised setting using videos from the KoNViD-1k database.
- The features at the bottleneck are used for spatio-temporal quality estimation. The proposed framework is trained and tested with different input video volume sizes.
- Let p and d denote the pristine and distorted videos, respectively, which are fed forward through the 3D ConvNet model Z.
- The feature vectors at the intermediate layers of the pristine and distorted videos are denoted by Vp = [v1p, v2p, …, vNp]^T and Vd = [v1d, v2d, …, vNd]^T, respectively.
- The spatio-temporal quality of the distorted video is estimated using a distance measure d(Vp, Vd).
- Distance measures such as the l1 norm and l2 norm can be used, as in the sketch below.
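Here is a minimal PyTorch sketch of this idea, assuming a small 3D convolutional encoder as a stand-in for the paper's 3D CAE bottleneck (the exact architecture and the KoNViD-1k-trained weights are not reproduced here), with the l2 norm as the distance measure.

```python
import torch
import torch.nn as nn

# Stand-in for the bottleneck of the paper's 3D CAE: a small 3D conv
# encoder producing one global descriptor per video volume.
class Encoder3D(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(3, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv3d(16, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),  # bottleneck-style global descriptor
        )

    def forward(self, x):  # x: (B, 3, T, H, W) video volume
        return self.net(x).flatten(1)

def st_quality(pristine, distorted, encoder, p=2):
    """Spatio-temporal quality as an lp distance between bottleneck features."""
    with torch.no_grad():
        vp = encoder(pristine)
        vd = encoder(distorted)
    return torch.norm(vp - vd, p=p, dim=1)  # one score per volume in the batch

# Toy usage on a random 16-frame volume.
encoder = Encoder3D().eval()
pristine = torch.rand(1, 3, 16, 112, 112)
distorted = pristine + 0.1 * torch.randn_like(pristine)
print(st_quality(pristine, distorted, encoder).item())
```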
4. Overall Quality Estimation
- The overall quality of video is estimated using support vector regression (SVR).
- The spatial quality score and spatio-temporal quality score are used to train the SVR against the DMOS scores of the VQA datasets.
- The standard procedure followed in the literature is to split the dataset into 80% training samples and the remaining 20% for testing.
- Three models are trained using different input sizes.
- Finally, Model-2 is used in the experiments.
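A minimal scikit-learn sketch of this pooling stage on toy synthetic data is shown below; in the paper the two input features are the MS-SSIM-based spatial score and the 3D-CAE feature distance, regressed against DMOS.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

# Sketch of the SVR pooling stage on toy synthetic data; the scores and
# DMOS targets below are fabricated for illustration only.
rng = np.random.default_rng(0)
q_s = rng.uniform(0.5, 1.0, 200)   # spatial quality per video (toy)
q_st = rng.uniform(0.0, 5.0, 200)  # spatio-temporal score per video (toy)
dmos = 80 - 40 * q_s + 5 * q_st + rng.normal(0, 2, 200)  # toy DMOS targets

X = np.column_stack([q_s, q_st])
X_tr, X_te, y_tr, y_te = train_test_split(
    X, dmos, test_size=0.2, random_state=0  # the standard 80%/20% split
)

svr = SVR(kernel="rbf", C=10.0, epsilon=0.5).fit(X_tr, y_tr)
print("held-out R^2:", svr.score(X_te, y_te))
```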
5. Experimental Results
5.1. Localization of spatio-temporal quality of video
- Working on spatio-temporal video volumes gives the flexibility to localize distortions in space-time, in particular at the (x, y, t) location, where x and y are the spatial coordinates and t is the time coordinate.
- The distortion intensity at a particular pixel location is estimated from the spatio-temporal feature difference between the distorted video and the corresponding pristine video, with a 70-pixel overlap in the spatial direction and no overlap in the temporal direction, as sketched below.
- The localization of distortions in space-time for sample high- and low-quality videos is shown in the figure above.
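Here is a sketch of this localization procedure, reusing the Encoder3D and st_quality helpers from the earlier sketch; the 112-pixel window and 16-frame temporal length are my illustrative choices, while the 70-pixel spatial overlap and non-overlapping temporal windows follow the paper.

```python
import torch

# Sketch: distortion intensities over (x, y, t) by comparing bottleneck
# features of co-located video sub-volumes. Window size 112 and temporal
# length 16 are illustrative; the 70-pixel spatial overlap and the
# non-overlapping temporal windows follow the paper's description.
def distortion_map(pristine, distorted, encoder, win=112, overlap=70, t_len=16):
    # pristine/distorted: (3, T, H, W) tensors
    _, T, H, W = pristine.shape
    stride = win - overlap  # 70-pixel spatial overlap -> stride of 42
    scores = []
    for t0 in range(0, T - t_len + 1, t_len):      # no temporal overlap
        for y0 in range(0, H - win + 1, stride):
            for x0 in range(0, W - win + 1, stride):
                p = pristine[:, t0:t0 + t_len, y0:y0 + win, x0:x0 + win]
                d = distorted[:, t0:t0 + t_len, y0:y0 + win, x0:x0 + win]
                dist = st_quality(p[None], d[None], encoder).item()
                scores.append(((x0, y0, t0), dist))
    return scores  # list of ((x, y, t), distortion intensity)
```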
5.2. VQA Performance
- The above datasets are used for evaluation.
- At that time, the only freely available pre-trained 3D ConvNet model that could be used to validate the proposed framework was C3D. C3D accepts video volumes of size 171×128×16 as input and was trained on over 1 million YouTube sports videos.
- The competitive performance of DeepVQUE justifies the choice of spatio-temporal features for the VQA task.
- Also, existing FR-VQA techniques are computationally expensive, whereas the proposed approach involves only simple feedforward operations at test time, so it can be used in real-time applications.
Reference
[2019 NCC] [DeepVQUE]
Full-Reference Video Quality Assessment Using Deep 3D Convolutional Neural Networks
Video Quality Assessment
[3D-CNN+LSTM] [DeepVQUE]