Review: Xu VCIP’17 — CNN Based Rate Distortion Modeling (HEVC Intra Prediction)
U-Net-Like Network Structure, Modeling the Rate and Distortion Without Pre-Encoding
In this story, CNN-Based Rate-Distortion Modeling for H.265/HEVC (Xu VCIP’17), by Wuhan University, is reviewed. A CNN is used to predict the rate and distortion without any encoding, for rate control purposes. I read this because I work on video coding research. This is a paper in 2017 VCIP. (Sik-Ho Tsang @ Medium)
Outline
- Rate Distortion (RD)
- Network Architecture
- Experimental Results
1. Rate Distortion (RD)
- To compress/encode a video with different bitrates, a quantization parameter (QP) should be tuned.
- Lower QP, higher bitrate (rate), higher video quality, i.e. smaller distortion.
- Higher QP, lower bitrate (rate), lower video quality, i.e. larger distortion.
- If we can control the rate and distortion, we can have a constant bitrate for stable streaming/transmission, or a constant video quality for viewing.
- However, without encoding, we do not know the actual rate and distortion.
- In this paper, CNN is used to predict the rate and distortion without any encoding.
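The rate-control use case above can be sketched as follows. This is a minimal sketch, not the paper's method: `predict_rd` is a hypothetical stub standing in for the CNN (its monotone QP model is made up), and `pick_qp` simply selects the smallest QP whose predicted rate fits the bitrate budget, with no encoding performed.

```python
# Hypothetical sketch: rate control driven by predicted R-D values,
# with no pre-encoding. predict_rd() stands in for the paper's CNN;
# its numbers are made up, but follow the QP trends described above.

def predict_rd(frame, qp):
    # Stub model: lower QP => higher rate and higher SSIM (lower distortion).
    rate_kbps = 8000.0 * (0.88 ** qp)
    ssim = 1.0 - 0.004 * qp
    return rate_kbps, ssim

def pick_qp(frame, rate_budget_kbps, qp_range=range(10, 52)):
    """Choose the smallest QP whose predicted rate fits the budget."""
    for qp in qp_range:  # ascending QP => descending predicted rate
        rate, ssim = predict_rd(frame, qp)
        if rate <= rate_budget_kbps:
            return qp, rate, ssim
    qp = max(qp_range)  # fall back to the coarsest QP
    rate, ssim = predict_rd(frame, qp)
    return qp, rate, ssim

qp, rate, ssim = pick_qp(frame=None, rate_budget_kbps=1500.0)
```

With an accurate learned predictor in place of the stub, this selection needs no trial encodes, which is the point of the paper.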
2. Network Architecture
- The network has 2 paths.
- Lower (Left) path (Distortion Prediction): predicts the SSIM (Structural SIMilarity) map, where higher SSIM means lower distortion, and vice versa.
- Upper (Right) path (Rate Prediction): predicts the rate.
2.1. Distortion (D) Prediction
- The network takes the original image as input and outputs the SSIM map.
- All the convolutional layers in the network are designed with stride 1×1.
- Two max pooling layers with size 2×2 and stride 2×2 are placed at two different stages to better extract information.
- Correspondingly, two upsampling layers are added to compensate for the size reduction after pooling.
- A skip-connection strategy aggregates multi-level features, and a final convolutional layer fuses all the features and determines the output size.
- Thus, the network is similar to U-Net, built as a fully convolutional network (FCN).
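The pooling/upsampling bookkeeping above can be checked with a small numpy sketch. The block size and the nearest-neighbour upsampling here are my assumptions for illustration, not details from the paper; the point is that two 2×2/stride-2 pooling stages halve the spatial size twice, and two upsampling stages restore it, so the SSIM map can match the input size and skip connections can aggregate features at full resolution.

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max pooling with stride 2 on an (H, W) array."""
    h, w = x.shape
    return x[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

def upsample_2x(x):
    """Nearest-neighbour 2x upsampling, undoing the pooling size reduction."""
    return np.repeat(np.repeat(x, 2, axis=0), 2, axis=1)

x = np.random.rand(64, 64)          # a luma block (size is illustrative)
p1 = max_pool_2x2(x)                # (32, 32) after the first pooling stage
p2 = max_pool_2x2(p1)               # (16, 16) after the second pooling stage
u = upsample_2x(upsample_2x(p2))    # back to (64, 64) after two upsamplings
skip = np.stack([x, u], axis=0)     # skip connection: stack multi-level features
```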
- The loss function is the MSE between the predicted and actual SSIM maps.
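As a minimal numpy sketch of this loss (the function name and map shapes are illustrative, not from the paper):

```python
import numpy as np

def ssim_map_mse(pred_map, actual_map):
    """Pixel-wise MSE between predicted and actual SSIM maps."""
    return np.mean((pred_map - actual_map) ** 2)

# Example: a predicted map that is uniformly 0.1 below the actual map.
loss = ssim_map_mse(np.full((4, 4), 0.9), np.ones((4, 4)))
```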
2.2. Rate (R) Prediction
- Since the rate, which indicates the resource consumption after compression, is a scalar, a different network is designed to predict it.
- The rate values at several different QPs are combined into one fixed-size vector, which serves as the output of the network, with the original image as input.
- The early layers of this network are the same as those in the left part for predicting SSIM.
- Several convolutional layers and fully connected layers are then added to better extract information.
- The loss function is, similarly, the MSE between the predicted and actual rate vectors.
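The fixed-size rate target and its loss can be sketched as below. The QP set and the exact vector layout are my assumptions for illustration; the paper only states that rates at several QPs are combined into one fixed-size output vector.

```python
import numpy as np

QPS = [22, 27, 32, 37]  # illustrative QP set, not from the paper

def rate_vector(rates_by_qp):
    """Stack per-QP rates into one fixed-size target vector."""
    return np.array([rates_by_qp[qp] for qp in QPS], dtype=float)

def rate_mse(pred_vec, target_vec):
    """MSE loss between predicted and actual rate vectors."""
    return np.mean((pred_vec - target_vec) ** 2)

target = rate_vector({22: 5200.0, 27: 3100.0, 32: 1800.0, 37: 950.0})
pred = target + np.array([10.0, -10.0, 10.0, -10.0])  # toy prediction error
loss = rate_mse(pred, target)
```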
3. Experimental Results
3.1. SSIM Map Prediction
- The first row indicates the original images in the luminance channel.
- The second row indicates the actual SSIM maps with QP 35.
- The third row indicates the predicted SSIM maps with QP 35. (All the SSIM maps are squared for visibility.)
- The predicted SSIM maps are quite similar to the actual ones.
3.2. Prediction Error of SSIM and Rate
- The above table shows the prediction errors between the predicted and actual values under different QPs.
- Most prediction results are acceptable.
- The SSIM and rate of different images under QP 34 are shown above. Again, the predicted values are quite similar to the actual ones.
3.3. SOTA Comparison
- The R-SSIM relationship is built for the proposed approach and compared with [26] and [27].
- The approaches in [26] and [27] fit their models using the actual rate and SSIM data, which means multi-pass encoding is necessary. (They need to encode at least once to obtain the SSIM and rate.)
- The R-SSIM curve of the proposed CNN approach is much closer to the actual one.
During the days of coronavirus, I hope to write 30 stories in this month to give myself a small challenge. This is the 17th story in this month. Thanks for visiting my story…
Reference
[2017 VCIP] [Xu VCIP’17]
CNN-Based Rate-Distortion Modeling for H.265/HEVC
Codec Prediction
[CNNIF] [Xu VCIP’17] [IPCNN] [IPFCN] [NNIP] [Li TCSVT’18]