Review — Yue VCIP’20: A Mixed Appearance-based and Coding Distortion-based CNN Fusion Approach for In-loop Filtering in Video Coding (HEVC Filtering)

Outperforms VRCNN, MRRN and SEFCNN.

Sik-Ho Tsang
Mar 27, 2021

In this story, A Mixed Appearance-based and Coding Distortion-based CNN Fusion Approach for In-loop Filtering in Video Coding (Yue VCIP’20), by University of Electronic Science & Technology of China and Shandong University, is reviewed. In this paper:

  • The appearance-based CNN filter, built on a U-Net-like structure, extracts global appearance information, whereas the coding distortion-based filter focuses on local information.
  • Both are then fused together to restore the reconstructed image.

This is a paper in 2020 VCIP. (Sik-Ho Tsang @ Medium)


  1. Overall Framework
  2. Global appearance-based CNN filtering
  3. Local coding distortion-based CNN filtering
  4. Mixed Fusion
  5. Experimental Results

1. Overall Framework

Framework of the proposed mixed global appearance-based and local coding distortion-based CNN fusion for in-loop filtering
  • Because of the quantization in lossy video coding, the reconstructed frames contain noise, so restoring them can be treated as image denoising. This view uses global appearance information to restore the unnatural, noisy frames.
  • On the other hand, the reconstructed frames before in-loop filtering are all produced by the same fixed pipeline of operations, including block partitioning, intra/inter prediction, transform, etc.
  • The distortions of different blocks in different frames and videos share similar characteristics such as blocking artifacts. Accordingly, CNNs can be applied to learn such shared local features and restore the undistorted frames.

The proposed approach, as shown above, consists of two CNN branches and an extra skip connection.

2. Global appearance-based CNN filtering

Global appearance-based CNN filtering
  • A U-Net-like CNN is used.
  • It consists of four down-scale operations with pooling to increase the receptive field in order to capture the global information, and four up-scale operations with up-sampling to produce the pixel-level information.
  • At each down-scale, max pooling of 2×2 is used.
  • At each up-scale, the opposite operation is applied, up-sampling the feature resolution. (It is not clear whether deconvolution or simple bilinear interpolation is used.)
  • The skip connections concatenate the features obtained in the down-scale path with the up-scale features of the same resolution.
  • The filter size of the convolutional layer is 3×3.
  • The channel number of the first convolutional layer is 64, and then increased or decreased according to the down-scale and up-scale operations.
  • ReLU and batch normalization are used.
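To make the four down-scale and four up-scale steps concrete, the sketch below traces how the feature-map size could evolve through such a network. It is only an illustration: the helper `unet_shape_trace` is hypothetical, and the rule that channels double at each pooling and halve at each up-sampling is an assumption (the paper summary only says the channel number is "increased or decreased").

```python
def unet_shape_trace(height, width, base_channels=64, levels=4):
    """Trace (stage, channels, height, width) through a U-Net-like network.

    Assumptions (not confirmed by the paper): channels double at each 2x2
    pooling and halve at each up-sampling; height and width are divisible
    by 2**levels so every pooling step divides evenly.
    """
    if height % (2 ** levels) or width % (2 ** levels):
        raise ValueError("height and width must be divisible by 2**levels")

    c, h, w = base_channels, height, width
    trace = [("first conv", c, h, w)]

    # Down-scale path: each 2x2 max pooling halves the spatial resolution.
    for i in range(1, levels + 1):
        c, h, w = c * 2, h // 2, w // 2
        trace.append((f"down {i}", c, h, w))

    # Up-scale path: each up-sampling doubles the resolution again.
    for i in range(1, levels + 1):
        c, h, w = c // 2, h * 2, w * 2
        trace.append((f"up {i}", c, h, w))

    return trace


for stage, c, h, w in unet_shape_trace(64, 64):
    print(f"{stage:>10}: {c:4d} channels, {h}x{w}")
```

Under these assumptions, a 64×64 input patch shrinks to 4×4 at the bottleneck, which is why a 3×3 filter there covers a large part of the original patch.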

3. Local coding distortion-based CNN filtering

Local coding distortion-based CNN filtering
  • 20 convolutional layers are used.
  • The filter size of each layer is 3×3 and the channel number of every layer is 64.
  • ReLU and batch normalization are used.
  • This network focuses on learning the local distortion introduced by the coding process instead of the global image appearance.
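One way to see why this branch is "local" is to compute its receptive field. The sketch below uses the standard recurrence for stacked convolutions; it assumes stride 1 and no dilation, which the review does not state explicitly, and the function name `receptive_field` is just illustrative.

```python
def receptive_field(num_layers, kernel_size=3, stride=1):
    """Receptive field (pixels along one axis) of stacked identical conv layers.

    Assumes no dilation. With stride 1, each 3x3 layer adds 2 pixels.
    """
    rf = 1      # a single output pixel sees itself
    jump = 1    # distance in input pixels between adjacent feature positions
    for _ in range(num_layers):
        rf += (kernel_size - 1) * jump
        jump *= stride
    return rf


print(receptive_field(20))  # 41
```

So, under these assumptions, each output pixel of the 20-layer branch depends on a 41×41 neighbourhood only, regardless of the frame size, whereas the pooling in the U-Net-like branch enlarges the context much faster.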

4. Mixed Fusion

  • Because the first two convolutional layers of the left and right branches are identical, they are shared by both branches to reduce complexity. Likewise, the last two convolutional layers are identical and are also shared.
  • Denote the input reconstructed frame by X. First, X is processed by the two shared convolutional layers f1(): X1 = f1(X).
  • Then the extracted features X1 are processed by the left branch L() and the right branch R(): XL = L(X1), XR = R(X1).
  • The features XL and XR from the two branches are then concatenated by Cat() and processed by the last two shared convolutional layers f2(): Re = f2(Cat(XL, XR)).
  • The final output of the network, Re, is the expected residual (distortion).
  • Re is added to the input X through the extra skip connection shown in the first figure, producing the filtered output X + Re.
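The data flow of the fusion can be sketched in a few lines. This is not the authors' implementation: the real f1(), L(), R() and f2() are convolutional sub-networks, while the stand-ins below are hypothetical toy functions on flat lists of pixel values, chosen only so that the wiring (shared head, two branches, concatenation, shared tail, skip connection) can be run and inspected.

```python
def fuse(x, f1, left, right, f2):
    """Wire up the mixed fusion: returns (filtered frame, predicted residual)."""
    x1 = f1(x)            # two shared convolutional layers at the start
    xl = left(x1)         # left branch L()
    xr = right(x1)        # right branch R()
    re = f2(xl + xr)      # Cat(): list concatenation stands in for channel concat
    out = [a + b for a, b in zip(x, re)]  # extra skip connection: X + Re
    return out, re


# Hypothetical stand-ins, NOT real CNN layers.
def toy_f1(x):
    return list(x)

def toy_left(x1):
    return [0.5 for _ in x1]

def toy_right(x1):
    return [-1.5 for _ in x1]

def toy_f2(features):
    # Average the two concatenated halves back to one value per pixel.
    n = len(features) // 2
    return [(features[i] + features[n + i]) / 2 for i in range(n)]


frame = [100.0, 102.0, 98.0]
filtered, residual = fuse(frame, toy_f1, toy_left, toy_right, toy_f2)
print(residual)  # [-0.5, -0.5, -0.5]
print(filtered)  # [99.5, 101.5, 97.5]
```

Because the network only has to predict the residual Re, an all-zero output already reproduces the unfiltered frame, which is the usual motivation for this kind of skip connection.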

5. Experimental Results

  • HM-16.9 is used with the all-intra configuration.
  • The first 50 frames of every sequence in Classes A to E are used for testing.
  • For training, the DIV2K dataset is used, containing 900 images.
  • The images are first converted to the YUV420 file format, the same as the test sequences, and then coded with HM (in-loop filters turned off) to produce the training data. Only the Y component is used for training and testing.
  • Four models are trained, each for one QP.
  • The models for QP 22 and QP 27 are trained by transfer learning, starting from the model trained at QP 32.
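Since only the Y component is used, the luma plane has to be separated from the raw YUV420 data. The sketch below shows one way to do this; it assumes 8-bit planar YUV 4:2:0 (Y plane followed by quarter-size U and V planes) with even width and height, which the review does not spell out, and `read_y_planes` is a hypothetical helper name.

```python
def read_y_planes(data, width, height, num_frames=None):
    """Split raw 8-bit planar YUV 4:2:0 bytes into a list of Y planes.

    Each frame stores width*height luma bytes followed by two chroma
    planes of a quarter of that size each, so a frame takes
    width*height*3/2 bytes. Assumes even width and height.
    """
    y_size = width * height
    frame_size = y_size * 3 // 2
    total = len(data) // frame_size
    if num_frames is not None:
        total = min(total, num_frames)  # e.g. keep only the first 50 frames
    return [
        data[i * frame_size : i * frame_size + y_size]
        for i in range(total)
    ]


# Tiny synthetic example: two 4x2 frames (12 bytes each).
raw = bytes(range(24))
planes = read_y_planes(raw, width=4, height=2)
print(len(planes), list(planes[1]))
```

For a real sequence, `data` would be the contents of the `.yuv` file read in binary mode, and `num_frames=50` would match the test setting described above.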
BD-Rate (%)
  • The proposed method achieves 11.20%, 8.64%, 10.93%, 10.28%, 15.27% BD-rate savings against the in-loop filters in HM for Class A, B, C, D, E, respectively.
  • Compared with the existing VRCNN [19], MRRN [12] and SEFCNN [9], the proposed method achieves the best performance with 11.26% BD-rate saving on average.
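As a quick sanity check on the numbers above, the simple mean of the five per-class figures can be computed directly. Whether the paper averages over classes or over individual sequences is not stated in this review, so the agreement below is only a consistency check, not a reproduction of the paper's calculation.

```python
# Per-class BD-rate savings (%) quoted above.
class_savings = {"A": 11.20, "B": 8.64, "C": 10.93, "D": 10.28, "E": 15.27}

mean_saving = sum(class_savings.values()) / len(class_savings)
print(f"{mean_saving:.2f}%")  # 11.26%
```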


