Reading: RSR — Residue Super-Resolution Mode for Video Coding (HEVC Inter Prediction)

Down-Sample > Encode > Up-Sample, 4.0% & 2.8% BD-Rate Reduction Using Low Delay P & Low Delay B Configurations Respectively

5 min readMay 25, 2020

In this story, Residue Super-Resolution (RSR) Mode for video coding, by University of Science and Technology of China, is presented. I read this because I work on video coding research. In this paper:

After inter prediction, the residue is down-sampled and it is compressed at low resolution such that fewer bits can be used for compression.
This LR residue is super-resolved using a trained CNN model.

This is a paper in 2018 VCIP. (Sik-Ho Tsang @ Medium)

Outline

Overall Framework
CNN Network Architecture
HEVC Implementation
Experimental Results

1. Overall Framework

**Overall Framework (The original figure is unclear…)**

After inter prediction, with the motion compensated predicted block, we can obtain the residue which is the difference between the predicted block and the current block.
And this residue block can go through either one path:
One is the conventional path without any down-sampling and up-sampling.
Another one is the proposed path, i.e. RSR mode, in which down-sampling is performed first to have a LR residue, then CNN is used to up-sample it, such that fewer bits can be used to encode the LR residue.
For the RSR mode, traditional up-sampling method can be used. However, there are 2 drawback:
The spatial correlation between adjacent pixel positions is relatively weak since it is already the difference between the predicted block and the current block.
And it is difficult for traditional methods to introduce extra information during the up-sampling to improve the reconstruction quality.
Thus, in this paper, CNN is used for up-sampling.

2. CNN Network Architecture

2.1. Simplified Dense Block (SDB)

Similar Dense Block (SDB) structure consisting of 2 convolutional (Conv) layers, 1 rectified linear unit (ReLU) layer and 1 concatenation layer, as shown above.
Specifically, each SDB has two layers of {32×3×3} (32 feature maps, kernel size is 3×3) and {32×1×1} with zero padding.
The second convolutional layer linearly fuses the features among channels, whose output will be concatenated with the input of SDB to get the output of SDB.
Actually, it feels like a residual block in ResNet, as shown above.

2.2. Network Architecture

Two inputs are used including reconstructed low-resolution residue and high-resolution prediction.
The high-resolution prediction is processed by 8 SDBs and 1 convolutional layer to explore the correlation between prediction information.
The output at the middle layer (i.e. after 4 SDBs) is branched out to be used for improving the processing of residue information.
The low-resolution residue is processed by 4 SDBs to extract residue features. Then a deconvolutional layer is used to enlarge the feature maps to the original resolution.
A concatenation layer is used to combine the information from the prediction and the residue, followed by another SDB and a convolutional layer.
Finally, the outputs of the prediction path and the residue path are directly summed up pixel by pixel to give the super-resolved residue signal.

3. HEVC Implementation

3.1. Coding Parameter Setting

Since a lot of high-frequency information is already lost during the down-sampling process, if we compress the down-sampled residue at high quantization parameter (QP), this mode is hardly competitive.
Thus, the QP for low-resolution coding is the QP for full-resolution coding minus 6 (QP-6).

3.2. Mode Decision

Two modes are at CTU level, instead of smaller block level because smaller block level has 2 drawbacks:
One is increasing overheads. Another one is the increasing blocking artifacts.
One extra signalling bit is needed.

4. Experimental Results

4.1. Training Set

87 high definition (HD) videos from the consumer digital video library (CDVL), 11 videos at 4K resolution from SJTU, and other 20 videos from the Internet, are used.
The resolution of these videos includes 1080 (Not 1280? lol)×720, 1920×1080, 2560×1440, up to 4K (4096×2160).
Multi-resolution and multi-scene videos are used to ensure the robustness of the trained model.
The raw videos are encoded by HM-12.1 under low delay P (LDP) configuration using several QP settings {27, 32, 37, 42} in which 800,000 CTU-level training samples are collected for each QP.

4.2. Testing Set

HEVC Testing Sequences are used with QP settings {27, 32, 37, 42}.
The use of higher QPs because the approach is more beneficial to low bit-rate coding.
First 32 frames are encoded.

4.3. BD-Rate

**BD-Rate (%) Under LDP and LDB Configurations**

4.0% and 2.8% average BD-rate reductions are obtained using LDP and LDB configurations respectively.

4.4. Hit Ratio

Hitting ratios are also measured as shown above, where #NRSR is the amount of CTUs using the residue SR mode, #NTotal is the total amount of CTUs, and PHitting represents the ratio of choosing the residue SR mode.
The ratio of choosing the residue SR mode increases with the increase of QP value, which confirms that the residue SR mode is more suitable for lower bit rates.

4.5. RD Curves

**RD Curves for 4 Sequences (The original figure is unclear…)**

Rate-distortion curves of several typical sequences with different sizes are shown in the above figure.
The coding gain is higher at lower bit rates for most of the test sequences.

During the days of coronavirus, A challenge of writing 30 stories again for this month has been accomplished. A new target of 35 stories is set by now. This is the 35th story in this month.. Let me challenge 40 stories!! Thanks for visiting my story..

Reference

[2018] [RSR]
Convolutional Neural Network-Based Residue Super-Resolution for Video Coding

Codec Inter Prediction

HEVC [Zhang VCIP’17] [NNIP] [Ibrahim ISM’18] [VI-CNN] [FRUC+DVRF][FRUC+DVRF+VECNN] [RSR] [Zhao ISCAS’18 & TCSVT’19] [Ma ISCAS’19] [ES]
VVC [FRUC+DVRF+VECNN]