Review: Wang VCIP’19 — Multi-QP SSIM Prediction Model (HEVC Prediction)

U-Net-like CNN-Based SSIM Prediction, Outperforms Xu VCIP’17

Sik-Ho Tsang
5 min read · Apr 28, 2020

In this story, SSIM Prediction for H.265/HEVC based on Convolutional Neural Network (Wang VCIP’19), by Wuhan University, is reviewed. I read this because I work on video coding research. This is a paper in 2019 VCIP. (Sik-Ho Tsang @ Medium)

Outline

  1. Distortion & Structural Similarity (SSIM)
  2. Network Architecture
  3. Experimental Results

1. Distortion & Structural Similarity (SSIM)

An Example of Structural Similarity (SSIM) Map

1.1. Distortion

  • To compress/encode a video with different bitrates, a quantization parameter (QP) should be tuned.
  • Lower QP, higher bitrate (rate), higher SSIM, higher video quality, i.e. smaller distortion.
  • Higher QP, lower bitrate (rate), lower SSIM, lower video quality, i.e. larger distortion.

1.2. SSIM

  • As in the example shown above, after JPEG compression the image has distortion around the edges, and there are also contouring artifacts in the sky.
  • The absolute error map only shows large error values at the sharp edges.
  • The SSIM map, however, reveals the low SSIM values (low means poor quality) at the sky contours, and it also shows the distortion on the trees (whose edges are softer than the building’s) much more clearly, because SSIM considers structural similarity as well.
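To make the SSIM idea concrete, here is a minimal numpy sketch of the SSIM formula (Wang et al.’s original index) computed globally over a pair of grayscale images. The real SSIM map is computed with a sliding window; this single-window simplification is only for illustration.

```python
import numpy as np

def ssim_global(x, y, data_range=255.0):
    """Global SSIM between two grayscale images (single-window
    simplification of the sliding-window SSIM map)."""
    C1 = (0.01 * data_range) ** 2  # stabilising constants from the
    C2 = (0.03 * data_range) ** 2  # original SSIM definition
    x = x.astype(np.float64)
    y = y.astype(np.float64)
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    num = (2 * mu_x * mu_y + C1) * (2 * cov_xy + C2)
    den = (mu_x**2 + mu_y**2 + C1) * (var_x + var_y + C2)
    return num / den
```

An identical pair of images gives SSIM = 1; any distortion (e.g. a brightness shift) pushes the value below 1, which is exactly what the SSIM map visualizes per window.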

1.3. Relationship Between SSIM & Video Compression/Encoding

  • If we can control the rate and distortion, we can achieve a constant bitrate for stable streaming/transmission, or a constant video quality for viewing.
  • However, without encoding, we do not know the actual rate and distortion.
  • In this paper, a CNN is used to predict the distortion without any encoding, where the distortion is measured by the Structural Similarity (SSIM).
  • With an accurate prediction of SSIM, we can adjust the rate for our purpose: e.g., we can improve the quality by increasing the coding rate.
  • Or we can reduce the required transmission rate by decreasing the coding rate when we know that the SSIM is already high enough.
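As a sketch of this use case: once we have a (here hypothetical) `predict_ssim(qp)` stand-in for the trained CNN, we can pick the largest QP (i.e. the lowest bitrate) whose predicted SSIM still meets a quality target, without running the encoder at every QP.

```python
def choose_qp(predict_ssim, target_ssim, qp_min=35, qp_max=44):
    """Pick the largest QP (lowest bitrate) whose predicted SSIM
    still meets the target. Assumes SSIM is non-increasing in QP,
    and that predict_ssim is a stand-in for the trained CNN."""
    best = qp_min  # fall back to the lowest QP if nothing qualifies
    for qp in range(qp_min, qp_max + 1):
        if predict_ssim(qp) >= target_ssim:
            best = qp  # higher QP, still acceptable quality
        else:
            break      # SSIM only degrades further from here
    return best
```

For example, with a mock predictor where SSIM drops by 0.01 per QP step, a target of 0.95 selects the highest QP that still reaches it.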

2. Network Architecture

2.1. Overall Architecture

Multi-QP SSIM Prediction Model for H.265/HEVC
  • Symmetrical Design: With an exactly symmetrical structure, the network can fuse the features of symmetrical layers easily since they have the same feature size. We don’t have to scale the feature to the target size for feature fusion. It is just like a U-Net architecture.
  • Layer Details: Generally, a small filter size of 3×3 with stride 1×1 is used. For the last convolutional layer (Conv9 in the figure), in order to get a larger receptive field for better SSIM map generation, the filter size is extended to 5×5 with a stride of 2×2. The down-sampling and the two up-sampling layers use a kernel size of 2×2 and a stride of 2×2. Residual blocks are also used.
  • Activation Functions: All the convolutional layers are followed by a ReLU activation function except the last one, because the output values of the last convolutional layer are not all positive.
  • Feature Fusion: Feature fusion is applied on both the symmetrical convolutional layers and the residual blocks, using concatenation and skip-connection operations.

2.2. Residual Block

Architecture of residual block
  • It contains two convolutional layers with filter size 3×3, padding 1×1 and stride 1×1. A ReLU activation function is applied inside the residual block.
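The residual block above can be sketched in plain numpy. This is a single-channel toy version (the real block operates on multi-channel feature maps with learned weights), shown only to make the conv → ReLU → conv → skip structure explicit; padding 1 keeps the spatial size unchanged, so the identity skip can be added directly.

```python
import numpy as np

def conv3x3(x, w):
    """Single-channel 3x3 convolution, stride 1, zero padding 1,
    so the output has the same spatial size as the input."""
    p = np.pad(x, 1)
    h, wd = x.shape
    out = np.zeros((h, wd), dtype=np.float64)
    for i in range(h):
        for j in range(wd):
            out[i, j] = np.sum(p[i:i + 3, j:j + 3] * w)
    return out

def residual_block(x, w1, w2):
    """Two 3x3 convs with a ReLU in between, plus the identity
    skip connection, mirroring the paper's residual block."""
    y = np.maximum(conv3x3(x, w1), 0.0)  # conv1 + ReLU
    y = conv3x3(y, w2)                   # conv2
    return x + y                         # skip connection
```

With all-zero weights the block reduces to the identity, which is the property that makes residual blocks easy to train.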

2.3. SSIM Map Generation and Normalization

  • Original images are first encoded through H.265/HEVC to get the reconstructed images, then the actual SSIM maps are calculated between the original images and the reconstructed ones.
  • The original images and the actual SSIM maps are normalized with their mean and standard deviation, i.e. by subtracting the mean and dividing by the standard deviation.
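Assuming the standard zero-mean, unit-variance form (the paper shows the formula only as a figure), the normalization is:

```python
import numpy as np

def normalize(x):
    """Zero-mean, unit-variance normalisation: (x - mean) / std.
    Applied to both the original images and the actual SSIM maps."""
    x = np.asarray(x, dtype=np.float64)
    return (x - x.mean()) / x.std()
```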

2.4. QP Labels

  • The QP label is a single-channel feature map with the same size as the original image. It is concatenated with the original image as the input of the CNN model during training. Ten QPs ranging from 35 to 44 are used in the experiments.
  • The QP label values are obtained by normalizing the QP value into the range (−1, 1).
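The exact formula appears only as a figure in the paper, so the linear min–max mapping below is an assumption: it sends the QP range [35, 44] onto [−1, 1], with the midpoint QP mapping to 0.

```python
def qp_label(qp, qp_min=35, qp_max=44):
    """Linearly map a QP in [qp_min, qp_max] to [-1, 1].
    The paper's exact normalisation formula is shown only as a
    figure; this linear mapping is an assumption."""
    mid = (qp_min + qp_max) / 2.0   # 39.5 for QPs 35..44
    half = (qp_max - qp_min) / 2.0  # 4.5
    return (qp - mid) / half
```

The scalar label is then broadcast into a single-channel map of the image size (e.g. 416×240) and concatenated with the image channels.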

2.5. Loss Function

  • The loss function is the standard MSE between the predicted and the actual SSIM maps.
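In numpy form, the MSE loss over a predicted and an actual (normalized) SSIM map is simply:

```python
import numpy as np

def mse_loss(pred, target):
    """Mean squared error between the predicted SSIM map and the
    actual (ground-truth) SSIM map."""
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    return np.mean((pred - target) ** 2)
```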

3. Experimental Results

  • Microsoft Common Objects in Context (MS COCO) Dataset is used for experiments.
  • All the training samples are 416×240, the original image size of MS COCO dataset.
  • 40000 images are selected from MS COCO dataset as the training dataset and 4000 images as testing dataset.
  • HM 16.9 platform is adopted for image coding. Ten QPs ranging from 35 to 44 are used for the coding process.
  • Thus, after image coding, the training dataset has 40000×10 images and the testing dataset has 4000×10 images.
Examples of SSIM map prediction
  • The prediction results of our CNN model on the test dataset are visualized as above.
  • The predicted SSIM maps are almost the same as the actual ones. It’s hard to distinguish them visually.
Samples of SSIM value prediction
  • The above figure shows a few random samples of predicted SSIM values.
  • We can observe that the predicted SSIM values are very close to the actual SSIM values.
Prediction Error Using SSIM for Different QPs
  • Compared with Bin Xu et al. [9] (i.e. Xu VCIP’17, which I read before), which needs to train 10 separate CNN models for the 10 QPs, this single CNN model can simply change the QP label value to obtain the distortion map for each QP.
  • And the prediction error obtained by the proposed approach is much smaller than that of Xu VCIP’17.

During the days of coronavirus, I hope to write 30 stories this month to give myself a small challenge. This is the 30th story this month. Thanks for visiting my story… Mission Accomplished!!!

3 Days left for this month. How about 35 stories within this month…?
