Reading: CNN-SR & CNN-UniSR & CNN-BiSR — Block Upsampling (HEVC Inter Prediction)

Modified EDSR, 3.8%, 2.6%, 3.5% BD-Rate Reduction Under RA, LDB & LDP Configurations

Sik-Ho Tsang
5 min read · May 30, 2020
Overall Framework

In this story, “Convolutional Neural Network-Based Block Up-Sampling for HEVC” (CNN-SR & CNN-UniSR & CNN-BiSR) is presented. I read this because I work on video coding research. This paper extends the idea of Li TCSVT’18, another TCSVT paper published in 2018. In this paper:

  • The coding block is down-sampled before encoding to save bits, so that coding efficiency can be improved.
  • The overall framework is similar to Li TCSVT’18 except that there is no second-stage upsampling in this paper. (Thus, I will not cover the framework here.)
  • Also, depending on the frame type, different networks with different inputs are used. And they are called CNN-SR, CNN-UniSR and CNN-BiSR.

This is a paper in 2019 TCSVT, where TCSVT has a high impact factor of 4.046. (Sik-Ho Tsang @ Medium)


  1. Single-Frame Up-Sampling CNN (CNN-SR)
  2. Multi-Frame Up-Sampling CNN (CNN-UniSR & CNN-BiSR)
  3. Experimental Results

1. Single-Frame Up-Sampling CNN (CNN-SR)

Single-Frame Up-Sampling CNN

1.1. Network Architecture

  • EDSR is revised to become CNN-SR. The entire CNN-SR can be divided into four functional units.
  • The first unit, from Conv1 to Conv2 and Sum, performs feature extraction and enhancement at low resolution.
  • There are 6 residual-learning blocks (ResBlocks), each of which consists of two convolutional layers separated by a ReLU function, plus a Sum (skip) layer.
  • The second unit, i.e. the Deconv layer, performs resolution change.
  • The third unit, from Conv3 to Conv4 and Sum, performs feature enhancement at high resolution, including 8 ResBlocks herein.
  • The fourth unit, Conv5, performs the reconstruction from feature maps to pixels.
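Based on the four units described above, CNN-SR can be sketched as follows in PyTorch. This is a minimal illustrative sketch: the kernel sizes, the ×2 scaling factor, and single-channel (luma) input are my assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Two 3x3 convolutions separated by a ReLU, plus a skip (Sum) connection."""
    def __init__(self, ch=32):
        super().__init__()
        self.conv1 = nn.Conv2d(ch, ch, 3, padding=1)
        self.conv2 = nn.Conv2d(ch, ch, 3, padding=1)
    def forward(self, x):
        return x + self.conv2(torch.relu(self.conv1(x)))

class CNNSR(nn.Module):
    """Hypothetical sketch of the four functional units of CNN-SR."""
    def __init__(self, ch=32):
        super().__init__()
        # Unit 1: feature extraction/enhancement at low resolution (Conv1..Conv2 + Sum)
        self.conv1 = nn.Conv2d(1, ch, 3, padding=1)
        self.lr_blocks = nn.Sequential(*[ResBlock(ch) for _ in range(6)])
        self.conv2 = nn.Conv2d(ch, ch, 3, padding=1)
        # Unit 2: resolution change by a deconvolution (transposed convolution)
        self.deconv = nn.ConvTranspose2d(ch, ch, kernel_size=4, stride=2, padding=1)
        # Unit 3: feature enhancement at high resolution (Conv3..Conv4 + Sum), 8 ResBlocks
        self.conv3 = nn.Conv2d(ch, ch, 3, padding=1)
        self.hr_blocks = nn.Sequential(*[ResBlock(ch) for _ in range(8)])
        self.conv4 = nn.Conv2d(ch, ch, 3, padding=1)
        # Unit 4: reconstruction from feature maps to pixels (Conv5)
        self.conv5 = nn.Conv2d(ch, 1, 3, padding=1)
    def forward(self, x):
        f = self.conv1(x)
        f = f + self.conv2(self.lr_blocks(f))   # Sum of unit 1
        f = self.deconv(f)                      # 2x spatial up-sampling
        g = self.conv3(f)
        g = g + self.conv4(self.hr_blocks(g))   # Sum of unit 3
        return self.conv5(g)

# A 32x32 down-sampled luma block is up-sampled to 64x64 (one CTU).
y = CNNSR()(torch.zeros(1, 1, 32, 32))
```

Note that the resolution change happens mid-network, so the 6 ResBlocks run on small feature maps while only the later 8 ResBlocks pay the cost of high resolution.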

1.2. Differences from EDSR

  • First, a deconvolution layer is adopted to fulfill the resolution change, which is slightly better than the convolution + pixel-shuffle layer in EDSR.
  • Second, the resolution change unit is moved to the middle of the entire network.
  • Third, the number of convolutional filters, in each convolutional layer, is decreased from 64 to 32.

2. Multi-Frame Up-Sampling CNN (CNN-UniSR & CNN-BiSR)

2.1. Multi-Frame Up-Sampling CNN (CNN-UniSR)

Multi-Frame Up-Sampling CNN (CNN-UniSR)
  • CNN-UniSR is similar to CNN-SR but with more inputs.
  • Specifically, in addition to the low-resolution reconstructed CTU, its collocated CTU in the reference frame, and the down-sampled version of the collocated CTU, are also input to CNN-UniSR.
  • A convolutional layer is used to extract features from the down-sampled version of the collocated CTU.
  • A convolutional layer and 3 ResBlocks are used to extract and enhance features of the collocated CTU, as shown above.
  • Thus, the problem is different from video SR whose input is merely low-resolution frames.
  • Feature combination at both low and high resolution is helpful due to the multi-scale exploitation.
  • There is no motion compensation for the collocated CTU.
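The three-input design, with features combined at both low and high resolution, can be sketched as follows. Fusion by channel concatenation and all layer sizes here are my assumptions, not the paper's exact structure.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """3x3 conv -> ReLU -> 3x3 conv, with a skip (Sum) connection."""
    def __init__(self, ch=32):
        super().__init__()
        self.body = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(),
                                  nn.Conv2d(ch, ch, 3, padding=1))
    def forward(self, x):
        return x + self.body(x)

class CNNUniSRSketch(nn.Module):
    """Hypothetical sketch of CNN-UniSR's three inputs and two fusion points."""
    def __init__(self, ch=32):
        super().__init__()
        self.lr_in   = nn.Conv2d(1, ch, 3, padding=1)    # LR reconstructed CTU
        self.ref_ds  = nn.Conv2d(1, ch, 3, padding=1)    # down-sampled collocated CTU
        self.ref_hr  = nn.Sequential(nn.Conv2d(1, ch, 3, padding=1),
                                     *[ResBlock(ch) for _ in range(3)])  # collocated CTU
        self.fuse_lr = nn.Conv2d(2 * ch, ch, 3, padding=1)
        self.up      = nn.ConvTranspose2d(ch, ch, 4, stride=2, padding=1)
        self.fuse_hr = nn.Conv2d(2 * ch, ch, 3, padding=1)
        self.out     = nn.Conv2d(ch, 1, 3, padding=1)
    def forward(self, lr_ctu, ref_ctu, ref_ctu_ds):
        # Low-resolution fusion: current LR CTU + down-sampled reference features.
        f = self.fuse_lr(torch.cat([self.lr_in(lr_ctu), self.ref_ds(ref_ctu_ds)], dim=1))
        f = self.up(f)   # resolution change
        # High-resolution fusion: add features of the full-resolution collocated CTU.
        f = self.fuse_hr(torch.cat([f, self.ref_hr(ref_ctu)], dim=1))
        return self.out(f)

net = CNNUniSRSketch()
y = net(torch.zeros(1, 1, 32, 32),   # LR reconstructed CTU
        torch.zeros(1, 1, 64, 64),   # collocated CTU in the reference frame
        torch.zeros(1, 1, 32, 32))   # its down-sampled version
```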

2.2. Multi-Frame Up-Sampling CNN (CNN-BiSR)

Multi-Frame Up-Sampling CNN (CNN-BiSR)
  • There are two reference lists for bi-directional prediction, list0 & list1.
  • CNN-BiSR is similar to CNN-UniSR but with more inputs, i.e. the collocated CTU in the reference frame of list1 and the down-sampled version of this collocated CTU.

2.3. Others

  • For chroma, the upsampling CNN in Li TCSVT’18 is used.
  • There are also some enhancements in the codec, which relate to merge candidates and motion vector scaling. (But I will not talk about these here since they are non-CNN parts.)
  • QP is adjusted to QP−6 to prefer higher quality and higher bitrate.
  • Training set: 84 sequences from the CDVI database and 10 from SJTU, yielding 1,500,000 samples.
  • HM-12.1 is used.

3. Experimental Results

3.1. BD-Rate

BD-Rate (%) Compared to HEVC (S stands for Y-SSIM)
  • For Classes A–E, 3.8%, 2.6%, and 3.5% BD-rate (Y) reductions are obtained under the RA, LDB, and LDP configurations, respectively.
  • For SDR, even higher BD-rate (Y) reductions of 5.1% to 6.8% are obtained across the RA, LDB, and LDP configurations.
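BD-rate, used throughout these tables, is computed with the standard Bjøntegaard method: fit log-rate as a cubic function of PSNR for each codec over four rate-distortion points, then average the gap over the overlapping quality range. A minimal sketch, with made-up RD points rather than the paper's data:

```python
import numpy as np

def bd_rate(rate_anchor, psnr_anchor, rate_test, psnr_test):
    """Bjontegaard delta rate: average bitrate difference (%) at equal quality."""
    # Cubic fit of log-rate as a function of PSNR for both codecs.
    pa = np.polyfit(psnr_anchor, np.log(rate_anchor), 3)
    pt = np.polyfit(psnr_test, np.log(rate_test), 3)
    # Integrate both curves over the overlapping PSNR interval.
    lo = max(min(psnr_anchor), min(psnr_test))
    hi = min(max(psnr_anchor), max(psnr_test))
    ia = np.polyval(np.polyint(pa), hi) - np.polyval(np.polyint(pa), lo)
    it = np.polyval(np.polyint(pt), hi) - np.polyval(np.polyint(pt), lo)
    # Average log-rate gap, converted back to a percentage.
    return (np.exp((it - ia) / (hi - lo)) - 1) * 100

# Hypothetical RD points (kbps, dB): the test codec spends fewer bits at
# similar PSNR, so the BD-rate is negative (a bitrate saving).
anchor = ([1000, 2000, 4000, 8000], [34.0, 36.0, 38.0, 40.0])
test   = ([950, 1900, 3800, 7600], [34.1, 36.1, 38.1, 40.1])
delta = bd_rate(anchor[0], anchor[1], test[0], test[1])
print(delta)   # negative: the test codec needs fewer bits for the same quality
```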

3.2. Hitting Ratios

Hitting ratio
  • The hitting ratio (the fraction of CTUs that select the proposed down-/up-sampling mode) leads to the same conclusion: more CTUs use the proposed method than in Classes A–D.

3.3. Ablation Study

Ablation Study
  • When only CNN-SR is used, 6.3% and 4.6% BD-rate reduction is obtained for RA and LDB configurations, respectively.
  • When CNN-UniSR/CNN-BiSR is further used, only a little additional BD-rate reduction is obtained for the RA and LDB configurations.

3.4. Time Analysis

Time Analysis
  • Since the CNN is not optimized for computational efficiency, both the encoding and decoding times increase substantially.

3.5. Model Size & Memory

  • When stored on disk, the CNN-SR, CNN-UniSR, CNN-BiSR, and chroma up-sampling CNN models occupy 1.35 MB, 1.68 MB, 2.01 MB, and 0.49 MB, respectively.
  • When running the CNN-based up-sampling, without any optimization, the extra memory is 212.5 MB, 257.1 MB, 293.3 MB, and 16 MB for CNN-SR, CNN-UniSR, CNN-BiSR, and the chroma up-sampling CNN, respectively.

There are still a lot of details and results not yet shown here. Please feel free to read the paper if interested.

During the days of coronavirus, the challenge of writing 30/35/40 stories for this month has been accomplished. Let me challenge 45 stories!! This is the 44th story in this month. Thanks for visiting my story.



Sik-Ho Tsang

PhD, Researcher. I share what I learn. :) Linktree: for Twitter, LinkedIn, etc.