Review: Li TCSVT’18 — CNN Upsampling for HEVC Intra Coding (HEVC Intra Prediction)

Average 5.5% BD-rate reduction on common test sequences and average 9.0% BD-rate reduction on ultrahigh definition test sequences

6 min read · Apr 13, 2020

In this story, a CNN-based up-sampling approach for HEVC intra coding is briefly reviewed. Down-sampling a block before compression saves bitrate, while CNN-based up-sampling preserves quality. The approach thus brings CNN concepts and knowledge from super resolution into video coding. The paper was published in 2018 TCSVT, a journal with a high impact factor of 4.046. (Sik-Ho Tsang @ Medium)

Outline

Overall Framework
CNN for Luma Up-Sampling
CNN for Chroma Up-Sampling
Experimental Results

1. Overall Framework

The above figure depicts the flowchart of the proposed intra frame coding scheme.
An input frame is divided into blocks, and the best coding mode is decided for each block. The block unit handled by the CNN is the Coding Tree Unit (CTU), which consists of 64 × 64 luma samples (Y) and two channels of 32 × 32 chroma samples (U and V, or Cb and Cr) because of the YUV 4:2:0 format.
The overall framework has two stages.

1.1. First Stage

Each CTU has two candidate paths: a low-resolution path and a full-resolution path.
On the low-resolution path, each CTU is down-sampled using fixed filters. Each down-sampled CTU can then be up-sampled either by the CNN or by the fixed discrete cosine transform based interpolation filters (DCTIF). (DCTIF is already adopted in HEVC for fractional-pixel interpolation in motion compensation.)
DCTIF is used for smooth regions, whereas the CNN, which is much more complex than DCTIF, is expected to deal with complex image regions such as structures.
There are two mode decisions at the first stage.
First, the up-sampling method is decided by comparing the up-sampled results of both methods (DCTIF and CNN) with the original CTU, and choosing the result with less distortion.
The second mode decision is to choose low-resolution coding or full-resolution coding for each CTU, which is performed by comparing the rate distortion (R-D) costs of both coding modes.
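The two decisions above can be sketched in a few lines. This is a minimal illustration, not the authors' encoder code: the function names, the flat-list block representation, the tie-breaking rule, and the Lagrangian cost J = D + λ·R with a caller-supplied `lam` are all assumptions made for the example (the paper summary only states that distortions and R-D costs are compared).

```python
def mse(block_a, block_b):
    """Mean squared error between two equally sized blocks (flat lists of samples)."""
    assert len(block_a) == len(block_b)
    return sum((a - b) ** 2 for a, b in zip(block_a, block_b)) / len(block_a)


def choose_upsampler(original, up_dctif, up_cnn):
    """Decision 1: keep the up-sampled result that is closer to the original CTU."""
    d_dctif = mse(original, up_dctif)
    d_cnn = mse(original, up_cnn)
    # Ties go to DCTIF (the cheaper filter); this tie rule is an assumption.
    return ("CNN", d_cnn) if d_cnn < d_dctif else ("DCTIF", d_dctif)


def rd_cost(distortion, bits, lam):
    """Lagrangian rate-distortion cost J = D + lambda * R (assumed form)."""
    return distortion + lam * bits


def choose_coding_mode(d_low, bits_low, d_full, bits_full, lam):
    """Decision 2: low-resolution vs. full-resolution coding by R-D cost."""
    j_low = rd_cost(d_low, bits_low, lam)
    j_full = rd_cost(d_full, bits_full, lam)
    return "low" if j_low < j_full else "full"


# Toy example with made-up numbers.
original = [10, 12, 14, 16]
up_dctif = [10, 11, 15, 16]
up_cnn = [10, 12, 14, 17]
method, d_low = choose_upsampler(original, up_dctif, up_cnn)
mode = choose_coding_mode(d_low, bits_low=40, d_full=0.05, bits_full=100, lam=0.01)
print(method, mode)
```

In this toy case the CNN result has the lower distortion, and the low-resolution path wins because its bit saving outweighs its extra distortion at the chosen `lam`.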

1.2. Second Stage

**Left**: For the first stage, bottom and right boundaries are not available during up-sampling. **Right**: For the second stage, all boundaries are available for up-sampling.

In the first stage, up-sampling can use the top and left boundaries but cannot use the bottom and right ones, as those have not been compressed yet.
The second stage refines the region of each up-sampled CTU around its bottom and right boundaries.
The second stage of up-sampling is performed for only the CTUs that have chosen the low-resolution coding mode, and the up-sampling method (CNN-based or DCTIF) is already decided in the first stage. The up-sampling result of the second stage just replaces that of the first stage.

2. CNN for Luma Up-Sampling

A five-layer CNN is used for up-sampling. It is more complex than SRCNN (to deal with coding distortion) but much simpler than VDSR (to reduce computational cost).
Multi-Scale Feature Extraction: There are two layers designed to extract multi-scale features from the input LR block.
Deconvolution: The deconvolution layer enlarges the multi-scale feature maps, and the enlarged features are then used to reconstruct the HR image. This is why the layer sits in the middle of the network.
Multi-Scale Reconstruction: The fourth layer, similar to the second, performs multi-scale fusion by using two sets of convolutional kernels with different sizes.
This layer takes into account both long- and short-range contextual information for reconstruction.
Residual Learning: The down-sampled block is up-sampled by a fixed interpolation filter (DCTIF), and the result is added to the reconstruction produced by the five-layer CNN.
The CNN is thus expected to learn the difference between an original block and its degraded version.
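The residual arrangement described above can be sketched as follows. This is a structural illustration under stated assumptions, not the paper's network: `interpolate_2x` is a nearest-neighbour stand-in for DCTIF, the 2x factor is assumed, and `oracle_branch` is a hypothetical placeholder for the five-layer CNN.

```python
import numpy as np


def interpolate_2x(lr_block):
    """Stand-in for the fixed filter: nearest-neighbour 2x up-sampling (NOT the real DCTIF)."""
    return np.repeat(np.repeat(lr_block, 2, axis=0), 2, axis=1)


def residual_target(original_block, lr_block):
    """What the CNN branch is trained to predict: original minus the fixed-filter result."""
    return original_block - interpolate_2x(lr_block)


def reconstruct(lr_block, cnn_branch):
    """Final output = fixed-filter up-sampling + residual predicted by the CNN branch."""
    return interpolate_2x(lr_block) + cnn_branch(lr_block)


# Toy 4x4 "original" block and its 2x2 low-resolution version (2x2 block means).
original = np.arange(16, dtype=np.float64).reshape(4, 4)
lr = original.reshape(2, 2, 2, 2).mean(axis=(1, 3))


def oracle_branch(lr_block):
    """Hypothetical 'perfect' CNN branch that returns exactly the training target."""
    return residual_target(original, lr_block)


recon = reconstruct(lr, oracle_branch)
print(np.allclose(recon, original))
```

The point of the design is that the CNN only has to model the difference left by the fixed filter, which is typically a smaller, easier signal than the full high-resolution block.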

3. CNN for Chroma Up-Sampling

Incorporating Luma Information: There is still correlation between Y and Cb/Cr. The luma component is further down-sampled to the same size as chroma to simplify the network design.
Then, cross-channel features can be extracted by the first layer, and processed by the following layers sequentially.
Joint Training of Cb and Cr: It is believed that the high similarity between Cb and Cr can help reduce the number of required models.
Specifically, the CNN outputs the reconstructed Cb and Cr simultaneously.
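The input arrangement for the chroma network can be sketched as below. Several details are assumptions for illustration only: a low-resolution CTU with 32 × 32 luma and 16 × 16 chroma samples (i.e. 2x down-sampling of the 64 × 64 CTU), 2 × 2 averaging to bring luma to chroma size (the filter is not specified here), and the Y, Cb, Cr channel order.

```python
import numpy as np


def downsample_2x_mean(plane):
    """Halve a plane in each dimension by averaging 2x2 neighbourhoods (assumed filter)."""
    h, w = plane.shape
    return plane.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))


def build_chroma_input(y_lr, cb_lr, cr_lr):
    """Stack size-matched luma with Cb and Cr into one (3, H, W) network input."""
    y_small = downsample_2x_mean(y_lr)
    assert y_small.shape == cb_lr.shape == cr_lr.shape
    return np.stack([y_small, cb_lr, cr_lr], axis=0)


# Random stand-ins for the planes of one low-resolution CTU.
rng = np.random.default_rng(0)
y_lr = rng.random((32, 32))
cb_lr = rng.random((16, 16))
cr_lr = rng.random((16, 16))

net_input = build_chroma_input(y_lr, cb_lr, cr_lr)
print(net_input.shape)  # (3, 16, 16)
```

With joint training, a single network would map this three-channel input to a two-channel output holding the up-sampled Cb and Cr together, instead of running one model per chroma component.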

4. Experimental Results

4.1. BD-rate

**BD-rate (%) for each sequence using HM 12.1**

Significant bit savings are achieved compared with the HEVC anchor, especially at low bit rates, leading to an average 5.5% BD-rate reduction on common test sequences and an average 9.0% BD-rate reduction on ultrahigh definition test sequences.
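For readers unfamiliar with the metric, BD-rate (Bjøntegaard delta rate) is the average bitrate difference between two codecs at equal quality, so a negative value means bitrate saving. Below is a minimal sketch of the commonly used cubic-fit computation. It is not the exact tool used in the paper, and the rate/PSNR points are made up for illustration.

```python
import numpy as np


def bd_rate(rate_anchor, psnr_anchor, rate_test, psnr_test):
    """Average bitrate difference (%) of test vs. anchor at equal PSNR."""
    log_ra = np.log(np.asarray(rate_anchor, dtype=float))
    log_rt = np.log(np.asarray(rate_test, dtype=float))
    pa = np.asarray(psnr_anchor, dtype=float)
    pt = np.asarray(psnr_test, dtype=float)

    # Fit log-rate as a cubic polynomial of PSNR for each R-D curve.
    poly_a = np.polyfit(pa, log_ra, 3)
    poly_t = np.polyfit(pt, log_rt, 3)

    # Integrate both fits over the PSNR interval that the two curves share.
    lo = max(pa.min(), pt.min())
    hi = min(pa.max(), pt.max())
    int_a = np.polyval(np.polyint(poly_a), hi) - np.polyval(np.polyint(poly_a), lo)
    int_t = np.polyval(np.polyint(poly_t), hi) - np.polyval(np.polyint(poly_t), lo)

    # Mean log-rate gap, converted back to a percentage.
    avg_log_diff = (int_t - int_a) / (hi - lo)
    return (np.exp(avg_log_diff) - 1.0) * 100.0


# Made-up R-D points: four QPs per codec, rates in kbps, PSNR in dB.
psnr = [32.0, 35.0, 38.0, 41.0]
anchor_rate = [1000.0, 2000.0, 4000.0, 8000.0]
test_rate = [0.9 * r for r in anchor_rate]  # 10% fewer bits at every PSNR point

print(round(bd_rate(anchor_rate, psnr, test_rate, psnr), 2))
```

Because the toy test curve uses exactly 10% fewer bits at each PSNR point, the computed BD-rate is approximately -10%. A reported "5.5% BD-rate reduction" corresponds to a value of about -5.5% from this kind of computation.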

4.2. Visualized Results

CTUs marked with a green block are coded at low resolution and up-sampled using the CNN, CTUs marked with a red block are also coded at low resolution but up-sampled using DCTIF, and all other CTUs are coded at full resolution.
As we can see, there are a lot of green blocks, meaning that many CTUs use the proposed CNN up-sampling approach.

4.3. Luma Up-Sampling Variants

The PSNR obtained by the proposed CNN for luma up-sampling is higher than that obtained by VDSR.

4.4. Chroma Up-Sampling Variants

The PSNR obtained by the proposed CNN for chroma up-sampling is higher than that obtained by DCTIF.
The PSNR obtained by the proposed CNN with luma as an additional input for chroma up-sampling is higher than that obtained without luma as input.

4.5. Second Stage Up-Sampling

**Percentage of CTUs that benefit from the second stage up-sampling and average MSE**

75% of the CTUs benefit from the second stage up-sampling, and the average MSE is smaller.

0.8% BD-rate reduction is achieved by the second stage up-sampling.

4.6. Complexity

One drawback of CNN-based up-sampling methods is the high computational complexity compared to simple interpolation filters such as DCTIF.
In the current implementation, the CNN is not optimized for computational speed, and thus the encoding/decoding time is much longer than that of the highly optimized HEVC anchor.

In summary, deep learning based super resolution techniques are brought into video compression and applied to improve coding efficiency.

Reference

[2018 TCSVT] [Li TCSVT’18]
Convolutional Neural Network-Based Block Up-Sampling for Intra Frame Coding