Review: VRCNN-ext — Multiple Variable-filter-size Residue-learning blocks (Codec Filtering)

Outperforms VRCNN With 9.2%, 9.6% & 7.4% BD-Rate Reductions Under All-Intra, Low-Delay B & Random Access Configurations

Sik-Ho Tsang
6 min read · Apr 24, 2020

In this story, VRCNN-ext, a network of multiple variable-filter-size residue-learning blocks used as an in-loop filter for artifact reduction, by the University of Science and Technology of China, is reviewed.

In-loop filter is used to enhance the video frame quality before the video frame is used for viewing or prediction. With higher quality, better prediction can be obtained for the next frame. Bitrate can also be reduced due to the better prediction. (To know more about in-loop filter, please read DRN.)

This is a paper in 2018 VCIP. (Sik-Ho Tsang @ Medium)


  1. Flowchart of Proposed Multi-Level Scheme
  2. VRCNN-ext Network Architecture
  3. Experimental Results

1. Flowchart of Proposed Multi-Level Scheme

Figure: Flowchart of Proposed Multi-Level Scheme
  • There are different types of frames/slices in video coding.
  • I slice (Intra): is encoded without utilizing any information from any other slices.
  • P slice (Predictive): is encoded utilizing information from previous slices/frames.
  • B slice (Bi-directional): is encoded utilizing information from both previous and future slices/frames.
  • If it is an I slice, the whole frame is filtered by the proposed network.
  • If it is a P/B slice, for each non-overlapping 64×64 CTU (Coding Tree Unit), a decision is made and a flag is coded. If the flag is on, the CTU is filtered by the proposed network.
  • If the flag is off, for each coding unit (CU) within the CTU, a CU classification is performed to decide whether the CU is filtered by the proposed network. This classification is embedded at both encoder and decoder, so no additional flag is needed to indicate the use of filtering. (For more information about CU, please read Section 2 in IPCNN.)
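The multi-level decision logic above can be sketched in plain Python. This is an illustrative sketch, not the paper's code; `cu_classifier` stands in for the decision-tree-based CU classification described later, and the frame/CTU data structures are simplified.

```python
# Hypothetical sketch of the multi-level filtering decision.
# Names (filter_frame, cu_classifier) are illustrative, not from the paper.

def filter_frame(frame, slice_type, ctu_flags, cu_classifier):
    """Return a list of (level, index) pairs where the CNN filter is applied."""
    if slice_type == "I":
        # I slice: the whole frame is filtered by the network.
        return [("frame", None)]
    decisions = []
    for ctu_idx, ctu in enumerate(frame):
        if ctu_flags[ctu_idx]:
            # CTU-level flag signalled in the bitstream: filter the whole CTU.
            decisions.append(("ctu", ctu_idx))
        else:
            # Flag off: classify each CU. The classifier runs identically at
            # encoder and decoder, so no extra bits are needed at CU level.
            for cu_idx, cu in enumerate(ctu):
                if cu_classifier(cu):
                    decisions.append(("cu", (ctu_idx, cu_idx)))
    return decisions
```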

2. VRCNN-ext Network Architecture

Figure: VRCNN-ext, which is composed of multiple Variable-filter-size Residue-learning blocks (VR blocks)
Table: VR Block Configuration

2.1. VR Blocks

  • A VRCNN-like network is used.
  • Except for the first layer and the last layer, which are both convolutional layers, the middle part of VRCNN-ext is composed of multiple variable-filter-size residue-learning blocks (VR blocks).
  • The VR block is similar to the residual block in ResNet: by introducing residue learning into small blocks, the entire network is much easier to train.
  • Each VR block consists of two convolutional layers, and each layer has filters of two different kernel sizes, as shown in the table above.
  • In this paper, 10 VR blocks are used, resulting in a CNN with 22 layers.
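As a minimal PyTorch sketch of this structure, each VR block below runs two kernel sizes in parallel per layer and adds the block input back. The specific kernel sizes (5×5 and 3×3, borrowed from VRCNN) and the channel counts are assumptions; the exact configuration is given by the paper's table.

```python
import torch
import torch.nn as nn

class VRBlock(nn.Module):
    """One variable-filter-size residue-learning (VR) block: two conv layers,
    each with filters of two kernel sizes in parallel (outputs concatenated),
    plus a residual connection. Kernel sizes/channels are assumptions."""

    def __init__(self, channels=64):
        super().__init__()
        half = channels // 2
        # Layer 1: two parallel kernel sizes.
        self.conv1_5 = nn.Conv2d(channels, half, 5, padding=2)
        self.conv1_3 = nn.Conv2d(channels, half, 3, padding=1)
        # Layer 2: same variable-filter-size structure.
        self.conv2_5 = nn.Conv2d(channels, half, 5, padding=2)
        self.conv2_3 = nn.Conv2d(channels, half, 3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        y = self.relu(torch.cat([self.conv1_5(x), self.conv1_3(x)], dim=1))
        y = torch.cat([self.conv2_5(y), self.conv2_3(y)], dim=1)
        return self.relu(y + x)  # residue learning: add the block input back

class VRCNNExt(nn.Module):
    """First conv + 10 VR blocks (2 layers each) + last conv = 22 layers."""

    def __init__(self, channels=64, num_blocks=10):
        super().__init__()
        self.head = nn.Conv2d(1, channels, 3, padding=1)
        self.blocks = nn.Sequential(*[VRBlock(channels) for _ in range(num_blocks)])
        self.tail = nn.Conv2d(channels, 1, 3, padding=1)

    def forward(self, x):
        # Global residual: the network predicts the artifact to remove.
        return x + self.tail(self.blocks(self.head(x)))
```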

2.2. Why VR Blocks Are Applied for Artifact Reduction

  • Since the compression artifacts in HEVC are caused by quantization of transformed coefficients, and the transform is block-wise, the quantization error of one coefficient affects only the pixels in the same block.
  • HEVC adopts 4×4, 8×8, 16×16, up to 32×32 DCT, and also allows a discrete sine transform (DST) at 4×4; these varying transform sizes should be taken into account to reduce the quantization error.
  • Therefore, the authors propose to adopt variable filter sizes in the residue block.

2.3. Some Training Details

  • DIV2K dataset is used for training.
  • Multiple models are trained for different Quantization Parameters (QPs), since different QPs yield different video qualities/bitrates.
  • The HEVC reference software HM is used to compress the original images, under all-intra configuration, with DF and SAO turned off, at four QPs: 22, 27, 32, and 37. For each QP, a separate network is trained.
  • But the same model is used for I/P/B slices.
  • The model for QP 27 is trained first; then the models for QP 22, QP 32, and QP 37 are initialized from the trained QP 27 model and fine-tuned.
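The QP-transfer schedule can be sketched as follows. This is an illustrative sketch: `train_fn` and the model object are placeholders, assuming any framework whose models can be deep-copied.

```python
import copy

def train_qp_models(base_model_qp27, train_fn, qps=(22, 32, 37)):
    """Warm-start each QP-specific model from the trained QP 27 model,
    then fine-tune it on data compressed at that QP."""
    models = {27: base_model_qp27}
    for qp in qps:
        model = copy.deepcopy(base_model_qp27)  # initialize from QP 27 weights
        train_fn(model, qp)                     # fine-tune at this QP
        models[qp] = model
    return models
```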

2.4. In-Loop Filtering in I Slices

  • A validation set is built, which consists of 1000 natural images from the UCID dataset.
Table: Percentage of CUs for which the CNN-based filter improves quality, for I slices
  • For luma, more than 80% of CUs are refined by the CNN-based filter. For chroma, the percentage is lower but still exceeds 50%, possibly because the training data contain only luma.
  • It is safe to apply the trained filter on the entire I slice.

2.5. In-Loop Filtering in P/B Slices

  • Applying the CNN-based filter on the entire inter slice could not ensure quality improvement; possibly because inter slices contain fewer artifacts, the CNN-based filter produced over-smoothed results in some regions. Thus, filtering is applied adaptively on a CTU and CU basis.
  • If the CTU flag is on, CNN-based filtering is applied on the whole CTU. If the flag is off, filtering is adaptively turned on/off at the CU level according to the CU classification by decision trees.
  • Features input to the decision trees:
  • Basic information: base QP (usually the QP of intra slices), CU size, and luma/chroma (luma and chroma are decided separately).
  • Motion information: the largest motion vector inside the CU.
  • Residue information: its minimal and maximal values, range, mean, and variance.
  • Reconstruction information: its minimal and maximal values, range, mean, and variance.
  • 26 video sequences are used for training.
  • Three kinds of decision trees, i.e. random forest, extra trees, and gradient boosting trees, are used to train three classifiers. An ensemble of the three predicts the use of filtering.
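Such a three-classifier ensemble can be sketched with scikit-learn on synthetic features. The feature layout, labels, and majority-vote rule here are assumptions; the paper only names the three tree families and an ensemble.

```python
import numpy as np
from sklearn.ensemble import (RandomForestClassifier, ExtraTreesClassifier,
                              GradientBoostingClassifier)

# Synthetic stand-in for the CU features (QP, CU size, motion, residue and
# reconstruction statistics) and a synthetic "filter helps" label.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 12))
y = (X[:, 0] + X[:, 3] > 0).astype(int)

# Train the three tree-based classifiers named in the paper.
clfs = [
    RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y),
    ExtraTreesClassifier(n_estimators=50, random_state=0).fit(X, y),
    GradientBoostingClassifier(random_state=0).fit(X, y),
]

def apply_filter(cu_features):
    """Majority vote of the three classifiers (assumed voting rule)."""
    votes = sum(int(c.predict(cu_features.reshape(1, -1))[0]) for c in clfs)
    return votes >= 2
```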

3. Experimental Results

  • HM-16.0 is used.
  • For intra slices, DF and SAO are turned off, and the CNN-based filter is applied to the entire slice.
  • For inter slices, DF and SAO are turned on, and the CNN-based filter is placed between DF and SAO, with the proposed CTU-level and CU-level control.
  • AI: All slices/frames are encoded as I-slices/frames.
  • LDB/RA: Most slices/frames are encoded as B-slices/frames; the main difference between LDB and RA is the coding order of the slices/frames.
  • On average, 9.2%, 9.6%, and 7.4% BD-rate reductions are achieved under the AI, LDB, and RA configurations, respectively.
  • Under the same condition, VRCNN achieves only 4.6% BD-rate reduction on average (Y, AI).
  • However, the encoding/decoding time increases substantially when operating on a CPU.
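BD-rate, the metric reported above, is conventionally computed with the Bjøntegaard method: fit a cubic through log(bitrate) as a function of PSNR for each codec, then average the gap over the overlapping PSNR range. A minimal NumPy sketch:

```python
import numpy as np

def bd_rate(rate_anchor, psnr_anchor, rate_test, psnr_test):
    """Bjøntegaard delta rate in percent; negative means bitrate saving."""
    lr_a, lr_t = np.log(rate_anchor), np.log(rate_test)
    # Cubic fit of log-rate as a function of PSNR for each codec.
    p_a = np.polyfit(psnr_anchor, lr_a, 3)
    p_t = np.polyfit(psnr_test, lr_t, 3)
    # Integrate over the overlapping PSNR interval.
    lo = max(min(psnr_anchor), min(psnr_test))
    hi = min(max(psnr_anchor), max(psnr_test))
    int_a = np.polyval(np.polyint(p_a), hi) - np.polyval(np.polyint(p_a), lo)
    int_t = np.polyval(np.polyint(p_t), hi) - np.polyval(np.polyint(p_t), lo)
    avg_diff = (int_t - int_a) / (hi - lo)
    return (np.exp(avg_diff) - 1) * 100
```

For example, a codec that matches the anchor's PSNR at 10% lower bitrate everywhere yields a BD-rate of about -10%.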
Figure: RD (Rate Distortion) Curves
  • The RD curves also show that HEVC with the proposed VRCNN-ext improves over conventional HEVC by a large margin.

During the days of coronavirus, I hope to write 30 stories in this month to give myself a small challenge. This is the 23rd story in this month. Thanks for visiting my story…


