Reading: CNN-SR & CNN-UniSR & CNN-BiSR — Block Upsampling (HEVC Inter Prediction)
Modified EDSR, 3.8%, 2.6%, 3.5% BD-Rate Reduction Under RA, LDB & LDP Configurations
In this story, “Convolutional Neural Network-Based Block Up-Sampling for HEVC” (CNN-SR & CNN-UniSR & CNN-BiSR) is presented. I read this because I work on video coding research. This paper extends the idea of an earlier 2018 TCSVT paper, Li TCSVT’18. In this paper:
- The coding block is downsampled before encoding to save more bits so that coding efficiency can be improved.
- The overall framework is similar to Li TCSVT’18 except that there is no second-stage upsampling in this paper. (Thus, I will not cover the framework here.)
- Also, depending on the frame type, different networks with different inputs are used. They are called CNN-SR, CNN-UniSR and CNN-BiSR.
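To make the "different networks with different inputs" idea concrete, here is a minimal sketch. The mapping from the number of usable reference lists to a network is my own assumption (the story only says the choice depends on the frame type), and the function names are hypothetical, not from the paper or from HM.

```python
def select_upsampling_network(num_reference_lists: int) -> str:
    """Pick an up-sampling CNN from the number of usable reference lists.

    Assumed mapping (not spelled out in this story):
      0 lists (no reference frame)  -> CNN-SR
      1 list  (uni-prediction)      -> CNN-UniSR
      2 lists (bi-prediction)       -> CNN-BiSR
    """
    networks = {0: "CNN-SR", 1: "CNN-UniSR", 2: "CNN-BiSR"}
    if num_reference_lists not in networks:
        raise ValueError("expected 0, 1 or 2 reference lists")
    return networks[num_reference_lists]


def network_inputs(name: str) -> list:
    """List the inputs each network receives, following the descriptions in this story."""
    inputs = ["low-resolution reconstructed CTU"]
    if name in ("CNN-UniSR", "CNN-BiSR"):
        inputs += ["collocated CTU (list0)", "down-sampled collocated CTU (list0)"]
    if name == "CNN-BiSR":
        inputs += ["collocated CTU (list1)", "down-sampled collocated CTU (list1)"]
    return inputs


for lists in (0, 1, 2):
    name = select_upsampling_network(lists)
    print(name, len(network_inputs(name)), "input(s)")
```

Each step from CNN-SR to CNN-UniSR to CNN-BiSR adds one collocated CTU and its down-sampled version.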
This is a 2019 TCSVT paper, and TCSVT has a high impact factor of 4.046. (Sik-Ho Tsang @ Medium)
- Single-Frame Up-Sampling CNN (CNN-SR)
- Multi-Frame Up-Sampling CNN (CNN-UniSR & CNN-BiSR)
- Experimental Results
1. Single-Frame Up-Sampling CNN (CNN-SR)
1.1. Network Architecture
- EDSR is revised to become CNN-SR. The entire CNN-SR can be divided into four functional units.
- The first unit, from Conv1 to Conv2 and Sum, performs feature extraction and enhancement at low resolution.
- There are 6 residue-learning blocks (ResBlocks), each of which consists of two convolutional layers separated by a ReLU function and a Sum layer.
- The second unit, i.e. the Deconv layer, performs resolution change.
- The third unit, from Conv3 to Conv4 and Sum, performs feature enhancement at high resolution, including 8 ResBlocks herein.
- The fourth unit, Conv5, performs the reconstruction from feature maps to pixels.
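The four functional units can be followed by tracing tensor shapes through the network. This is only a sketch under assumptions that are not stated in this story: a 64x64 CTU down-sampled by a factor of 2 (so a 32x32 input), a single luma channel, and convolutions that keep the spatial size.

```python
def trace_cnn_sr_shapes(low_res_size: int = 32, filters: int = 32, scale: int = 2):
    """Return (stage, height, width, channels) after each functional unit.

    Assumptions: 32x32 low-resolution input, one luma channel, and
    'same' padding so that only the Deconv layer changes the resolution.
    """
    high_res_size = low_res_size * scale
    return [
        ("input: low-res reconstructed CTU", low_res_size, low_res_size, 1),
        ("unit 1: Conv1, 6 ResBlocks, Conv2, Sum", low_res_size, low_res_size, filters),
        ("unit 2: Deconv", high_res_size, high_res_size, filters),
        ("unit 3: Conv3, 8 ResBlocks, Conv4, Sum", high_res_size, high_res_size, filters),
        ("unit 4: Conv5", high_res_size, high_res_size, 1),
    ]


for stage, height, width, channels in trace_cnn_sr_shapes():
    print(f"{stage:42s} {height}x{width}x{channels}")
```

The only place where the resolution changes is the second unit, which sits in the middle of the network.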
1.2. Differences from EDSR
- First, a deconvolution layer is adopted for the resolution change, which performs slightly better than the convolution-shuffle layer in EDSR.
- Second, the resolution change unit is moved to the middle of the entire network.
- Third, the number of convolutional filters, in each convolutional layer, is decreased from 64 to 32.
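To see what halving the filter count buys, here is a quick parameter count for one convolutional layer. The 3x3 kernel size is my assumption, since only the filter counts are given here.

```python
def conv_params(kernel: int, in_channels: int, out_channels: int) -> int:
    """Weights plus biases of one 2-D convolutional layer."""
    return kernel * kernel * in_channels * out_channels + out_channels


# Assumed 3x3 kernels; only the filter counts (64 vs 32) come from the story.
wide = conv_params(3, 64, 64)    # layer with 64 filters, as in EDSR
narrow = conv_params(3, 32, 32)  # layer with 32 filters, as in CNN-SR
print(wide, narrow, round(wide / narrow, 2))  # 36928 9248 3.99
```

Because both the input and output channels are halved, each inner layer shrinks by roughly a factor of 4, not 2.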
2. Multi-Frame Up-Sampling CNN (CNN-UniSR & CNN-BiSR)
2.1. Multi-Frame Up-Sampling CNN (CNN-UniSR)
- CNN-UniSR is similar to CNN-SR but with more inputs.
- Specifically, in addition to the low-resolution reconstructed CTU, its collocated CTU in the reference frame, and the down-sampled version of the collocated CTU, are also input to CNN-UniSR.
- A convolutional layer is used to extract features from the down-sampled version of the collocated CTU.
- A convolutional layer and 3 ResBlocks are used to extract and enhance features of the collocated CTU, as shown above.
- Thus, the problem is different from video SR whose input is merely low-resolution frames.
- Feature combination at both low and high resolution is helpful due to the multi-scale exploitation.
- There is no motion compensation for the collocated CTU.
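The input preparation for CNN-UniSR can be sketched as below. Since there is no motion compensation, the collocated CTU is simply cropped at the same position in the reference frame. The 2x2 averaging is only a stand-in for the actual down-sampling filter, which is not described here, and all function names are hypothetical.

```python
def crop_block(frame, x, y, size):
    """Crop a size x size block whose top-left corner is (x, y)."""
    return [row[x:x + size] for row in frame[y:y + size]]


def downsample_2x(block):
    """2x2 average pooling; a stand-in for the real down-sampling filter."""
    half = len(block) // 2
    return [
        [
            (block[2 * r][2 * c] + block[2 * r][2 * c + 1]
             + block[2 * r + 1][2 * c] + block[2 * r + 1][2 * c + 1]) / 4.0
            for c in range(half)
        ]
        for r in range(half)
    ]


def unisr_inputs(low_res_ctu, reference_frame, x, y, ctu_size):
    """Gather the three CNN-UniSR inputs for the CTU located at (x, y)."""
    collocated = crop_block(reference_frame, x, y, ctu_size)  # same position, no MC
    return {
        "low_res_ctu": low_res_ctu,
        "collocated": collocated,
        "collocated_down": downsample_2x(collocated),
    }


# Toy 8x8 reference frame and a 4x4 "CTU" at position (4, 4).
reference = [[r * 8 + c for c in range(8)] for r in range(8)]
inputs = unisr_inputs([[0, 0], [0, 0]], reference, 4, 4, 4)
print(inputs["collocated"][0], inputs["collocated_down"][0])
```

CNN-BiSR would repeat the same two extra inputs for the reference frame of list1.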
2.2. Multi-Frame Up-Sampling CNN (CNN-BiSR)
- There are two reference lists for bi-directional prediction, list0 & list1.
- CNN-BiSR is similar to CNN-UniSR but with more inputs, i.e. the collocated CTU in the reference frame of list1, and the down-sampled version of this collocated CTU.
- For chroma, the upsampling CNN in Li TCSVT’18 is used.
- There are also some enhancements in the codec, which are related to merge candidates and motion vector scaling. (But I will not talk about these here since they are not CNN-related.)
- QP is adjusted to QP-6 to prefer higher quality and higher bitrate.
- Training set: 84 sequences from the CDVL database and 10 from SJTU, which give 1,500,000 samples.
- HM-12.1 is used.
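The QP-6 adjustment is easier to appreciate with the HEVC quantization step size, which doubles for every increase of 6 in QP. A small sketch follows. The function names are hypothetical, and applying the offset to the down-sampled coding path is my reading of the bullet above.

```python
def quantization_step(qp: int) -> float:
    """Approximate HEVC quantization step size: it doubles every 6 QP."""
    return 2.0 ** ((qp - 4) / 6.0)


def low_res_qp(base_qp: int) -> int:
    """QP used for the down-sampled CTU (assumed: base QP minus 6, floored at 0)."""
    return max(0, base_qp - 6)


base_qp = 32
adjusted_qp = low_res_qp(base_qp)
ratio = quantization_step(base_qp) / quantization_step(adjusted_qp)
print(adjusted_qp, round(ratio, 3))  # 26 2.0
```

So QP-6 halves the quantization step, spending more bits per sample on the low-resolution block to keep quality after up-sampling.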
3. Experimental Results
3.1. BD-Rate
- For Class A-E, 3.8%, 2.6% and 3.5% BD-rate (Y) reduction is obtained for RA, LDB and LDP configurations, respectively.
- For SDR, an even higher BD-rate (Y) reduction of 5.1% to 6.8% is obtained across the RA, LDB and LDP configurations.
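For readers new to BD-rate, below is a minimal sketch of the usual Bjøntegaard calculation with four rate points per curve: interpolate log-rate as a cubic in PSNR, integrate both curves over the shared PSNR range, and convert the average gap to a percentage. It is not the script used in the paper, and the sample numbers are made up for illustration.

```python
import math


def _lagrange(xs, ys, x):
    """Evaluate the polynomial passing through (xs[i], ys[i]) at x."""
    total = 0.0
    for i in range(len(xs)):
        term = ys[i]
        for j in range(len(xs)):
            if j != i:
                term *= (x - xs[j]) / (xs[i] - xs[j])
        total += term
    return total


def _integral(xs, ys, lo, hi):
    """Integrate the interpolant on [lo, hi]; Simpson's rule is exact for cubics."""
    mid = 0.5 * (lo + hi)
    return (hi - lo) / 6.0 * (
        _lagrange(xs, ys, lo) + 4.0 * _lagrange(xs, ys, mid) + _lagrange(xs, ys, hi)
    )


def bd_rate(anchor, test):
    """BD-rate in percent from four (bitrate, psnr) points per curve.

    Negative values mean the test codec needs fewer bits at equal PSNR.
    """
    if len(anchor) != 4 or len(test) != 4:
        raise ValueError("this sketch expects exactly four points per curve")
    a_psnr = [p for _, p in anchor]
    t_psnr = [p for _, p in test]
    a_log = [math.log10(r) for r, _ in anchor]
    t_log = [math.log10(r) for r, _ in test]
    lo = max(min(a_psnr), min(t_psnr))  # shared PSNR range
    hi = min(max(a_psnr), max(t_psnr))
    avg_gap = (_integral(t_psnr, t_log, lo, hi) - _integral(a_psnr, a_log, lo, hi)) / (hi - lo)
    return (10.0 ** avg_gap - 1.0) * 100.0


# Made-up curves: the test codec uses 10% fewer bits at every PSNR.
anchor = [(1000.0, 32.0), (2000.0, 35.0), (4000.0, 38.0), (8000.0, 41.0)]
test = [(0.9 * rate, psnr) for rate, psnr in anchor]
print(round(bd_rate(anchor, test), 2))  # -10.0
```

A "3.8% BD-rate reduction" therefore corresponds to a value of about -3.8 from such a calculation on the luma (Y) PSNR.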
3.2. Hitting Ratios
- The hitting ratio leads to the same conclusion: more CTUs use the proposed method than in Class A-D.
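As a small illustration, and assuming the hitting ratio means the fraction of CTUs that end up coded with the proposed down-/up-sampling mode, it can be computed from per-CTU flags like this (the flag list is hypothetical):

```python
def hitting_ratio(ctu_uses_low_res) -> float:
    """Fraction of CTUs that chose the down-/up-sampling coding mode."""
    flags = list(ctu_uses_low_res)
    if not flags:
        return 0.0
    return sum(1 for flag in flags if flag) / len(flags)


# Hypothetical mode decisions for four CTUs of one frame.
print(hitting_ratio([True, False, True, True]))  # 0.75
```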
3.3. Ablation Study
- When only CNN-SR is used, BD-rate reductions of 6.3% and 4.6% are obtained for RA and LDB configurations, respectively.
- When CNN-BiSR/CNN-UniSR is also used, only a little further BD-rate reduction is obtained for RA and LDB configurations.
3.4. Time Analysis
- Since the CNN is not optimized for computational efficiency, both the encoding and decoding times increase substantially.
3.5. Model Size & Memory
- When stored on disk, the CNN-SR, CNN-UniSR, CNN-BiSR, and chroma up-sampling CNN models occupy 1.35M, 1.68M, 2.01M, and 0.49M bytes, respectively.
- When running the CNN-based up-sampling without any optimization, the extra memory needed is 212.5M, 257.1M, 293.3M, and 16M bytes for CNN-SR, CNN-UniSR, CNN-BiSR, and the chroma up-sampling CNN, respectively.
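A rough back-of-the-envelope reading of these numbers is sketched below. It assumes "M" means 10^6 bytes and that weights are stored as 32-bit floats. Neither is stated here, so treat the parameter counts as order-of-magnitude estimates only.

```python
# (disk size, runtime memory) in millions of bytes, copied from the two bullets above.
MODELS = {
    "CNN-SR": (1.35, 212.5),
    "CNN-UniSR": (1.68, 257.1),
    "CNN-BiSR": (2.01, 293.3),
    "chroma CNN": (0.49, 16.0),
}


def approx_params(disk_mbytes: float, bytes_per_weight: int = 4) -> int:
    """Estimated weight count, assuming float32 storage and no file overhead."""
    return round(disk_mbytes * 1e6 / bytes_per_weight)


for name, (disk, runtime) in MODELS.items():
    print(f"{name:10s} ~{approx_params(disk):,} weights, "
          f"runtime/disk ratio ~{runtime / disk:.0f}x")
```

The runtime memory is dominated by intermediate feature maps, not by the weights, which is why it is far larger than the model files.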
There are still a lot of details and results not yet shown here. Please feel free to read the paper if interested.
During the days of coronavirus, a challenge of writing 30/35/40 stories again for this month has been accomplished. Let me challenge 45 stories!! This is the 44th story in this month. Thanks for visiting my story.
[2019 TCSVT] [CNN-SR & CNN-UniSR & CNN-BiSR]
Convolutional Neural Network-Based Block Up-Sampling for HEVC