Review: Xu VCIP’17 — CNN Based Rate Distortion Modeling (HEVC Intra Prediction)
U-Net-Like Network Structure, Model the Rate and Distortion Without Pre-Encoding
In this story, CNN-Based Rate-Distortion Modeling for H.265/HEVC (Xu VCIP’17), by Wuhan University, is reviewed. A CNN is used to predict the rate and distortion without any encoding, for rate control purposes. I read this because I work on video coding research. This is a paper in 2017 VCIP. (Sik-Ho Tsang @ Medium)
- Rate Distortion (RD)
- Network Architecture
- Experimental Results
1. Rate Distortion (RD)
- To compress/encode a video with different bitrates, a quantization parameter (QP) should be tuned.
- Lower QP, higher bitrate (rate), higher video quality, i.e. smaller distortion.
- Higher QP, lower bitrate (rate), lower video quality, i.e. larger distortion.
- If we can control the rate and distortion, we can have a constant bitrate for stable streaming/transmission, or a constant video quality for viewing.
- However, without encoding, we do not know the actual rate and distortion.
- In this paper, a CNN is used to predict the rate and distortion without any encoding.
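To make the QP bullets above a bit more concrete, here is a minimal sketch of the commonly quoted H.264/HEVC approximation in which the quantization step size roughly doubles for every increase of 6 in QP. The function name `qstep` is my own, and this is only the textbook approximation, not something taken from the paper.

```python
def qstep(qp: int) -> float:
    """Approximate quantization step size for a given QP.

    Uses the common H.264/HEVC rule of thumb Qstep ~ 2^((QP - 4) / 6):
    a larger step means coarser quantization, so lower rate and larger distortion.
    """
    return 2.0 ** ((qp - 4) / 6.0)


for qp in (22, 28, 34, 40):
    print(f"QP {qp}: Qstep ~ {qstep(qp):.1f}")
# QP 22: Qstep ~ 8.0
# QP 28: Qstep ~ 16.0
# QP 34: Qstep ~ 32.0
# QP 40: Qstep ~ 64.0
```

The mapping from QP to the actual rate and distortion still depends on the content, which is why they are not known without encoding.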
2. Network Architecture
- There are 2 paths for the network.
- Lower (left) path (Distortion Prediction): predicts the SSIM (Structural SIMilarity) map, where a higher SSIM means lower distortion, and vice versa.
- Upper (right) path (Rate Prediction): predicts the rate.
2.1. Distortion (D) Prediction
- The network takes the original images as input and outputs the SSIM maps.
- All the convolutional layers in the network are designed with stride 1×1.
- Two max pooling layers with size 2×2 and stride 2×2 are constructed in two different stages to extract information better.
- Correspondingly, two upsampling layers are added to compensate for the size reduction after pooling.
- Skip connections are used to aggregate multi-level features, and then a convolutional layer processes all the features and determines the size of the output.
- Thus, the network is a fully convolutional network (FCN) similar to U-Net.
- The loss function is the MSE (mean squared error) between the predicted and actual SSIM maps.
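The size bookkeeping and the loss described above can be sketched in plain Python. This is only an illustration under my own assumptions: the upsampling is taken to be nearest-neighbour (the type is not stated here), the loss is taken to be a plain per-pixel mean squared error, and the helper names `max_pool_2x2`, `upsample_2x` and `mse_loss` are hypothetical. The convolutions and skip connections are left out.

```python
def max_pool_2x2(x):
    """2x2 max pooling with stride 2 on a 2-D list (height and width must be even)."""
    return [[max(x[i][j], x[i][j + 1], x[i + 1][j], x[i + 1][j + 1])
             for j in range(0, len(x[0]), 2)]
            for i in range(0, len(x), 2)]


def upsample_2x(x):
    """Nearest-neighbour 2x upsampling (assumed; the upsampling type is not stated)."""
    out = []
    for row in x:
        wide = [v for v in row for _ in range(2)]  # repeat each value horizontally
        out.append(wide)
        out.append(list(wide))                     # repeat each row vertically
    return out


def mse_loss(pred, target):
    """Mean squared error between two same-sized 2-D maps."""
    n = len(pred) * len(pred[0])
    return sum((p - t) ** 2
               for pred_row, target_row in zip(pred, target)
               for p, t in zip(pred_row, target_row)) / n


# Toy 8x8 input: two poolings shrink it to 2x2, two upsamplings restore 8x8,
# so the output map can be compared pixel by pixel with an 8x8 SSIM map.
image = [[(i * 8 + j) / 63.0 for j in range(8)] for i in range(8)]
small = max_pool_2x2(max_pool_2x2(image))
restored = upsample_2x(upsample_2x(small))
print(len(small), len(small[0]))        # 2 2
print(len(restored), len(restored[0]))  # 8 8
```

Because each pooling halves the width and height and each upsampling doubles them, the output ends up the same size as the input, which is what allows a per-pixel loss against the actual SSIM map.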
2.2. Rate (R) Prediction
- As the rate information indicating the resource consumption after compression is a scalar, a different network is designed to predict it.
- The rate values at several different QPs are combined into one fixed-size vector, which is used as the output of the network, with the original images as input.
- The earlier layers of this network are the same as the left part that predicts SSIM.
- Several convolutional layers and fully connected layers are added to extract information better.
- The loss function is:
3. Experimental Results
3.1. SSIM Map Prediction
- The first row shows the luminance channel of the original images.
- The second row shows the actual SSIM maps at QP 35.
- The third row shows the predicted SSIM maps at QP 35. (All the SSIM maps are squared for visibility.)
- The predicted SSIM maps are quite similar to the actual ones.
3.2. Prediction Error of SSIM and Rate
- The table above shows the error between the predicted and actual values under different QPs.
- Most prediction results are acceptable.
- The SSIM and rate of different images at QP 34 are shown above. Again, the predicted values are quite close to the actual ones.
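As a rough illustration of what a "prediction error" between predicted and actual values can mean, here is a minimal sketch with two common choices, mean absolute error and mean relative error. The exact metric used in the paper's table is not stated here, so both functions and the toy numbers below are my own assumptions, not the paper's results.

```python
def mean_abs_error(pred, actual):
    """Average absolute difference between predicted and actual values."""
    return sum(abs(p - a) for p, a in zip(pred, actual)) / len(pred)


def mean_rel_error(pred, actual):
    """Average absolute difference relative to the actual value (actual must be non-zero)."""
    return sum(abs(p - a) / abs(a) for p, a in zip(pred, actual)) / len(pred)


# Made-up toy values, only to show the calls (not data from the paper).
predicted_ssim = [0.90, 0.80]
actual_ssim = [1.00, 0.80]
print(mean_abs_error(predicted_ssim, actual_ssim))
print(mean_rel_error(predicted_ssim, actual_ssim))
```

A relative error is usually the more readable choice for rate, since the rate varies over a much wider range across QPs than SSIM does.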
3.3. SOTA Comparison
- An R-SSIM relationship is built for the proposed approach and compared with two prior approaches.
- The two prior approaches rely on the actual rate and SSIM data, which means multi-pass encoding is necessary. (They need to encode at least once to obtain the SSIM and rate.)
- The R-SSIM relationship given by the proposed CNN approach is much closer to the actual one.
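One simple way to picture an R-SSIM relationship built from the two network outputs is to pair, for each QP, the predicted rate with the mean of the predicted SSIM map. The sketch below does only that; how the paper actually builds its curve is not described here, so the pairing, the function names `mean_ssim` and `r_ssim_points`, and the toy inputs are all my own assumptions.

```python
def mean_ssim(ssim_map):
    """Scalar SSIM score taken as the mean of a 2-D SSIM map."""
    values = [v for row in ssim_map for v in row]
    return sum(values) / len(values)


def r_ssim_points(pred_rates, pred_ssim_maps):
    """Return (rate, mean SSIM) points sorted by rate.

    Both inputs are dicts keyed by QP: predicted rate and predicted SSIM map.
    """
    points = [(pred_rates[qp], mean_ssim(pred_ssim_maps[qp])) for qp in pred_rates]
    return sorted(points)


# Made-up toy predictions for two QPs (not data from the paper).
rates = {30: 2.0, 35: 1.0}
ssim_maps = {30: [[0.9, 1.0]], 35: [[0.8, 0.8]]}
for rate, ssim in r_ssim_points(rates, ssim_maps):
    print(f"rate={rate:.2f}  SSIM={ssim:.2f}")
```

Since both quantities come from the network, no encoding pass is needed to obtain these points, which is the contrast with the multi-pass approaches above.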