Reading: CNN-SR & CNN-UniSR & CNN-BiSR — Block Upsampling (HEVC Inter Prediction)

Modified EDSR, 3.8%, 2.6%, 3.5% BD-Rate Reduction Under RA, LDB & LDP Configurations

Sik-Ho Tsang
5 min read · May 30, 2020
Overall Framework

In this story, “Convolutional Neural Network-Based Block Up-Sampling for HEVC” (CNN-SR & CNN-UniSR & CNN-BiSR) is presented. I read this because I work on video coding research. This paper extends the idea of Li TCSVT’18, another TCSVT paper published in 2018. In this paper:

  • The coding block is down-sampled before encoding to save bits, so that coding efficiency can be improved.
  • The overall framework is similar to Li TCSVT’18 except that there is no second-stage upsampling in this paper. (Thus, I will not cover the framework here.)
  • Also, depending on the frame type, different networks with different inputs are used. And they are called CNN-SR, CNN-UniSR and CNN-BiSR.

This is a paper in 2019 TCSVT, where TCSVT has a high impact factor of 4.046. (Sik-Ho Tsang @ Medium)


  1. Single-Frame Up-Sampling CNN (CNN-SR)
  2. Multi-Frame Up-Sampling CNN (CNN-UniSR & CNN-BiSR)
  3. Experimental Results

1. Single-Frame Up-Sampling CNN (CNN-SR)

Single-Frame Up-Sampling CNN

1.1. Network Architecture

  • EDSR is revised to become CNN-SR. The entire CNN-SR can be divided into four functional units.
  • The first unit, from Conv1 to Conv2 and Sum, performs feature extraction and enhancement at low resolution.
  • There are 6 residual-learning blocks (ResBlocks), each of which consists of two convolutional layers separated by a ReLU function, plus a Sum (skip) layer.
  • The second unit, i.e. the Deconv layer, performs resolution change.
  • The third unit, from Conv3 to Conv4 and Sum, performs feature enhancement at high resolution, including 8 ResBlocks herein.
  • The fourth unit, Conv5, performs the reconstruction from feature maps to pixels.
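Based on the four units described above, CNN-SR can be sketched as follows in PyTorch. This is a minimal illustrative sketch: the kernel sizes, the ×2 scaling factor, and single-channel (luma) input are my assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Two 3x3 convolutions separated by a ReLU, plus a skip (Sum) connection."""
    def __init__(self, ch=32):
        super().__init__()
        self.conv1 = nn.Conv2d(ch, ch, 3, padding=1)
        self.conv2 = nn.Conv2d(ch, ch, 3, padding=1)
    def forward(self, x):
        return x + self.conv2(torch.relu(self.conv1(x)))

class CNNSR(nn.Module):
    """Hypothetical sketch of the four functional units of CNN-SR."""
    def __init__(self, ch=32):
        super().__init__()
        # Unit 1: feature extraction/enhancement at low resolution (Conv1..Conv2 + Sum)
        self.conv1 = nn.Conv2d(1, ch, 3, padding=1)
        self.lr_blocks = nn.Sequential(*[ResBlock(ch) for _ in range(6)])
        self.conv2 = nn.Conv2d(ch, ch, 3, padding=1)
        # Unit 2: resolution change by a deconvolution (transposed convolution)
        self.deconv = nn.ConvTranspose2d(ch, ch, kernel_size=4, stride=2, padding=1)
        # Unit 3: feature enhancement at high resolution (Conv3..Conv4 + Sum), 8 ResBlocks
        self.conv3 = nn.Conv2d(ch, ch, 3, padding=1)
        self.hr_blocks = nn.Sequential(*[ResBlock(ch) for _ in range(8)])
        self.conv4 = nn.Conv2d(ch, ch, 3, padding=1)
        # Unit 4: reconstruction from feature maps to pixels (Conv5)
        self.conv5 = nn.Conv2d(ch, 1, 3, padding=1)
    def forward(self, x):
        f = self.conv1(x)
        f = f + self.conv2(self.lr_blocks(f))   # Sum of unit 1
        f = self.deconv(f)                      # 2x spatial up-sampling
        g = self.conv3(f)
        g = g + self.conv4(self.hr_blocks(g))   # Sum of unit 3
        return self.conv5(g)

# A 32x32 down-sampled luma block is up-sampled to 64x64 (one CTU).
y = CNNSR()(torch.zeros(1, 1, 32, 32))
```

Note that the resolution change happens mid-network, so the 6 ResBlocks run on small feature maps while only the later 8 ResBlocks pay the cost of high resolution.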

1.2. Differences from EDSR

  • First, a deconvolution layer is adopted to fulfill the resolution change, which is slightly better than the convolution + pixel-shuffle layer in EDSR.
  • Second, the resolution change unit is moved to the middle of the entire network.
  • Third, the number of convolutional filters, in each convolutional layer, is decreased from 64 to 32.

2. Multi-Frame Up-Sampling CNN (CNN-UniSR & CNN-BiSR)

2.1. Multi-Frame Up-Sampling CNN (CNN-UniSR)

Multi-Frame Up-Sampling CNN (CNN-UniSR)
  • CNN-UniSR is similar to CNN-SR but with more inputs.
  • Specifically, in addition to the low-resolution reconstructed CTU, its collocated CTU in the reference frame, and the down-sampled version of the collocated CTU, are also input to CNN-UniSR.
  • A convolutional layer is used to extract features from the down-sampled version of the collocated CTU.
  • A convolutional layer and 3 ResBlocks are used to extract and enhance features of the collocated CTU, as shown above.
  • Thus, the problem is different from video SR whose input is merely low-resolution frames.
  • Feature combination at both low and high resolution is helpful due to the multi-scale exploitation.
  • There is no motion compensation for the collocated CTU.
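The three-input design, with features combined at both low and high resolution, can be sketched as follows. Fusion by channel concatenation and all layer sizes here are my assumptions, not the paper's exact structure.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """3x3 conv -> ReLU -> 3x3 conv, with a skip (Sum) connection."""
    def __init__(self, ch=32):
        super().__init__()
        self.body = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(),
                                  nn.Conv2d(ch, ch, 3, padding=1))
    def forward(self, x):
        return x + self.body(x)

class CNNUniSRSketch(nn.Module):
    """Hypothetical sketch of CNN-UniSR's three inputs and two fusion points."""
    def __init__(self, ch=32):
        super().__init__()
        self.lr_in   = nn.Conv2d(1, ch, 3, padding=1)    # LR reconstructed CTU
        self.ref_ds  = nn.Conv2d(1, ch, 3, padding=1)    # down-sampled collocated CTU
        self.ref_hr  = nn.Sequential(nn.Conv2d(1, ch, 3, padding=1),
                                     *[ResBlock(ch) for _ in range(3)])  # collocated CTU
        self.fuse_lr = nn.Conv2d(2 * ch, ch, 3, padding=1)
        self.up      = nn.ConvTranspose2d(ch, ch, 4, stride=2, padding=1)
        self.fuse_hr = nn.Conv2d(2 * ch, ch, 3, padding=1)
        self.out     = nn.Conv2d(ch, 1, 3, padding=1)
    def forward(self, lr_ctu, ref_ctu, ref_ctu_ds):
        # Low-resolution fusion: current LR CTU + down-sampled reference features.
        f = self.fuse_lr(torch.cat([self.lr_in(lr_ctu), self.ref_ds(ref_ctu_ds)], dim=1))
        f = self.up(f)   # resolution change
        # High-resolution fusion: add features of the full-resolution collocated CTU.
        f = self.fuse_hr(torch.cat([f, self.ref_hr(ref_ctu)], dim=1))
        return self.out(f)

net = CNNUniSRSketch()
y = net(torch.zeros(1, 1, 32, 32),   # LR reconstructed CTU
        torch.zeros(1, 1, 64, 64),   # collocated CTU in the reference frame
        torch.zeros(1, 1, 32, 32))   # its down-sampled version
```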

2.2. Multi-Frame Up-Sampling CNN (CNN-BiSR)

Multi-Frame Up-Sampling CNN (CNN-BiSR)
  • There are two reference lists for bi-directional prediction, list0 & list1.
  • CNN-BiSR is similar to CNN-UniSR but with more inputs, i.e. the collocated CTU in the reference frame of list1 and the down-sampled version of this collocated CTU.

2.3. Others

  • For chroma, the upsampling CNN in Li TCSVT’18 is used.
  • There are also some enhancements in the codec, which relate to merge candidates and motion vector scaling. (But I will not talk about these here since they are non-CNN parts.)
  • QP is adjusted to QP−6 to prefer higher quality and higher bitrate.
  • Training set: 84 sequences from the CDVI database and 10 from SJTU, yielding 1,500,000 samples.
  • HM-12.1 is used.

3. Experimental Results

3.1. BD-Rate

BD-Rate (%) Compared to HEVC (S stands for Y-SSIM)
  • For Classes A–E, 3.8%, 2.6%, and 3.5% BD-rate (Y) reductions are obtained under the RA, LDB, and LDP configurations, respectively.
  • For SDR, even higher BD-rate (Y) reductions of 5.1% to 6.8% are obtained across the RA, LDB, and LDP configurations.
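BD-rate, used throughout these tables, is computed with the standard Bjøntegaard method: fit log-rate as a cubic function of PSNR for each codec over four rate-distortion points, then average the gap over the overlapping quality range. A minimal sketch, with made-up RD points rather than the paper's data:

```python
import numpy as np

def bd_rate(rate_anchor, psnr_anchor, rate_test, psnr_test):
    """Bjontegaard delta rate: average bitrate difference (%) at equal quality."""
    # Cubic fit of log-rate as a function of PSNR for both codecs.
    pa = np.polyfit(psnr_anchor, np.log(rate_anchor), 3)
    pt = np.polyfit(psnr_test, np.log(rate_test), 3)
    # Integrate both curves over the overlapping PSNR interval.
    lo = max(min(psnr_anchor), min(psnr_test))
    hi = min(max(psnr_anchor), max(psnr_test))
    ia = np.polyval(np.polyint(pa), hi) - np.polyval(np.polyint(pa), lo)
    it = np.polyval(np.polyint(pt), hi) - np.polyval(np.polyint(pt), lo)
    # Average log-rate gap, converted back to a percentage.
    return (np.exp((it - ia) / (hi - lo)) - 1) * 100

# Hypothetical RD points (kbps, dB): the test codec spends fewer bits at
# similar PSNR, so the BD-rate is negative (a bitrate saving).
anchor = ([1000, 2000, 4000, 8000], [34.0, 36.0, 38.0, 40.0])
test   = ([950, 1900, 3800, 7600], [34.1, 36.1, 38.1, 40.1])
delta = bd_rate(anchor[0], anchor[1], test[0], test[1])
print(delta)   # negative: the test codec needs fewer bits for the same quality
```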

3.2. Hitting Ratios

Hitting ratio
  • The hitting ratio (the fraction of CTUs that select the proposed down-/up-sampling mode) leads to the same conclusion: more CTUs use the proposed method than in Classes A–D.

3.3. Ablation Study

Ablation Study
  • When only CNN-SR is used, 6.3% and 4.6% BD-rate reduction is obtained for RA and LDB configurations, respectively.
  • When CNN-UniSR/CNN-BiSR is further used, only a little additional BD-rate reduction is obtained for the RA and LDB configurations.

3.4. Time Analysis

Time Analysis
  • Since the CNN is not optimized for computational efficiency, both the encoding and decoding times increase substantially.

3.5. Model Size & Memory

  • When stored on disk, the CNN-SR, CNN-UniSR, CNN-BiSR, and chroma up-sampling CNN models occupy 1.35 MB, 1.68 MB, 2.01 MB, and 0.49 MB, respectively.
  • When running the CNN-based up-sampling, without any optimization, the extra memory is 212.5 MB, 257.1 MB, 293.3 MB, and 16 MB for CNN-SR, CNN-UniSR, CNN-BiSR, and the chroma up-sampling CNN, respectively.

There are still a lot of details and results not yet shown here. Please feel free to read the paper if interested.

During the days of coronavirus, the challenge of writing 30/35/40 stories for this month has been accomplished. Let me challenge 45 stories!! This is the 44th story in this month. Thanks for visiting my story.



Sik-Ho Tsang

PhD, Researcher. I share what I learn. :) Linktree: for Twitter, LinkedIn, etc.