Reading: Zhang ICME’20 — Enhancing VVC Through CNN-Based Post-Processing (VVC Codec Filtering)

Average 3.90% and 4.13% BD-Rate Reduction Using PSNR and VMAF Respectively

In this story, Enhancing VVC Through CNN-Based Post-Processing (Zhang ICME’20), by the University of Bristol, is presented. I read this because I work on video coding research. In this paper:

  • A Convolutional Neural Network (CNN)-based post-processing approach for video compression is applied at the decoder to improve reconstruction quality.

This is a paper in 2020 ICME. (Sik-Ho Tsang @ Medium)


  1. Network Architecture
  2. Some Training Details
  3. Experimental Results

1. Network Architecture

  • The network input is a compressed RGB image block with a spatial resolution of 96×96 and a bit depth of 10, while the target is the corresponding original colour block in the same format.
  • This CNN architecture has been employed in [23, 24] for spatial resolution and/or bit depth up-sampling, and was modified based on the generator (SRResNet) of SRGAN [29].
  • It contains 2N+2 convolutional layers, all of which have 3×3 kernels, 64 feature maps and a stride value of 1, except the last convolutional layer (with 3 feature maps instead).
  • Between the first and the last convolutional layers, there are N identical residual blocks, each of which contains two convolutional layers and a parametric ReLU (PReLU) activation function in between them.
  • Skip connections are employed:
  1. between the input of each residual block and the output of the second convolutional layer in the same residual block.
  2. between the input of the first residual block and the output of the Nth residual block.
  3. between the input of the CNN and the output of the last convolutional layer.
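The architecture described above can be sketched in PyTorch. This is a sketch under assumptions: N=16 is taken as a representative block count (the paper varies N in its complexity study), padding of 1 is assumed so the 96×96 resolution is preserved, and the class and variable names are mine, not the authors'.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Two 3x3 conv layers with a PReLU in between, plus an identity skip (skip connection 1)."""
    def __init__(self, channels=64):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, stride=1, padding=1)
        self.prelu = nn.PReLU()
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, stride=1, padding=1)

    def forward(self, x):
        return x + self.conv2(self.prelu(self.conv1(x)))

class PostProcessingCNN(nn.Module):
    """SRResNet-style post-processing network with 2N+2 conv layers in total."""
    def __init__(self, n_blocks=16, channels=64):
        super().__init__()
        self.head = nn.Conv2d(3, channels, kernel_size=3, stride=1, padding=1)   # first layer
        self.blocks = nn.Sequential(*[ResidualBlock(channels) for _ in range(n_blocks)])  # 2N layers
        self.tail = nn.Conv2d(channels, 3, kernel_size=3, stride=1, padding=1)   # last layer, 3 feature maps

    def forward(self, x):
        f = self.head(x)
        f = f + self.blocks(f)   # skip connection 2: first block's input to Nth block's output
        return x + self.tail(f)  # skip connection 3: network input to last layer's output

model = PostProcessingCNN(n_blocks=16)
y = model(torch.randn(1, 3, 96, 96))  # one 96x96 RGB block; output has the same shape
```

With this wiring, the network only has to learn the residual between the compressed block and the original, which is the usual motivation for the global input-to-output skip.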

2. Some Training Details

  • One hundred and eight source video sequences were used to train the employed CNN, selected from publicly available databases, including BVI-HFR [30], BVI-Texture [31], Harmonic 4K [32] and Netflix Chimera [33].
  • All were 10 bits per sample, YCbCr 4:2:0, raw video clips.
  • These sequences were truncated to a length of 64 frames, and down-sampled to three lower resolutions, 1920×1080, 960×540 and 480×270, to increase content diversity.
  • All of these 432 sequences (108 sources ×4 resolutions) were encoded by VVC VTM 4.0.1 [34] using the RA configuration of the JVET CTC.
  • Other coding parameters include: five base quantisation parameter (QP) values 22, 27, 32, 37 and 42; Main10 profile; and a fixed intra period of 64.
  • Videos are segmented into 96×96 colour image blocks (converted to the RGB space). This results in approximately 500,000 image block pairs in total for five QPs.
  • Block rotation has been applied to achieve data augmentation and model generalisation.
  • Five different CNN models (ModelQP22, ModelQP27, ModelQP32, ModelQP37 and ModelQP42) were obtained.
  • At evaluation time, each model is applied to its corresponding QP band, i.e. the model trained at the base QP closest to the test QP is selected.
  • L1 loss is used.
  • During testing, 96×96 overlapping blocks, with 4 pixels overlapped, are generated, and the final output is obtained by averaging.
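The overlapping-block inference in the last bullet can be sketched with NumPy. Only the 96×96 block size and the 4-pixel overlap come from the text; the border handling and the identity `process` placeholder (standing in for the CNN) are my assumptions.

```python
import numpy as np

def postprocess_overlapping(img, block=96, overlap=4, process=lambda b: b):
    """Run `process` on overlapping blocks and average the overlapped regions."""
    h, w, _ = img.shape
    stride = block - overlap
    acc = np.zeros_like(img, dtype=np.float64)      # sum of processed blocks
    cnt = np.zeros((h, w, 1), dtype=np.float64)     # how many blocks cover each pixel
    ys = list(range(0, h - block + 1, stride))
    xs = list(range(0, w - block + 1, stride))
    if ys[-1] != h - block:                         # cover the bottom/right borders
        ys.append(h - block)
    if xs[-1] != w - block:
        xs.append(w - block)
    for y in ys:
        for x in xs:
            acc[y:y+block, x:x+block] += process(img[y:y+block, x:x+block])
            cnt[y:y+block, x:x+block] += 1.0
    return acc / cnt                                # average where blocks overlap

img = np.random.rand(270, 480, 3)                   # e.g. one 480x270 frame
out = postprocess_overlapping(img)                  # identity process -> out equals img
```

With the identity `process`, averaging overlapping copies of the same pixels returns the input unchanged, which is a convenient sanity check; in the paper the CNN output replaces the identity.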

3. Experimental Results

3.1. BD-Rate

  • Average BD-rate reductions of 3.90% (using PSNR) and 4.13% (using VMAF) are obtained, outperforming two methods proposed in standardisation documents: N0254 and O0079.
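The reported gains use the standard Bjøntegaard delta-rate (BD-rate) metric: fit each rate-quality curve with a cubic polynomial of log-rate versus quality, integrate both over the overlapping quality interval, and report the average bitrate difference at equal quality. A minimal NumPy sketch; the function name and the rate/PSNR numbers below are illustrative, not taken from the paper.

```python
import numpy as np

def bd_rate(rate_anchor, psnr_anchor, rate_test, psnr_test):
    """Average % bitrate difference of test vs anchor at equal quality (negative = saving)."""
    # cubic fit of log-rate as a function of quality, per curve
    pa = np.polyfit(psnr_anchor, np.log(rate_anchor), 3)
    pt = np.polyfit(psnr_test, np.log(rate_test), 3)
    lo = max(min(psnr_anchor), min(psnr_test))      # overlapping quality interval
    hi = min(max(psnr_anchor), max(psnr_test))
    int_a = np.polyval(np.polyint(pa), hi) - np.polyval(np.polyint(pa), lo)
    int_t = np.polyval(np.polyint(pt), hi) - np.polyval(np.polyint(pt), lo)
    avg_log_diff = (int_t - int_a) / (hi - lo)
    return (np.exp(avg_log_diff) - 1.0) * 100.0

# illustrative numbers only (not from the paper)
rates_anchor = [1000.0, 1800.0, 3000.0, 5000.0]    # kbps at the four rate points
psnrs = [32.0, 34.0, 36.0, 38.0]                   # dB
rates_test = [r * 0.96 for r in rates_anchor]      # 4% bitrate saving at equal PSNR
delta = bd_rate(rates_anchor, psnrs, rates_test, psnrs)  # -> approximately -4.0
```

A uniform 4% rate saving at every quality level yields a BD-rate of about -4%, matching the intuition behind the paper's headline numbers.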

3.2. RD Curves

  • It is found that the proposed method performs better at low bit rates.
  • This is consistent with the BD-rate table, which shows larger BD-rate reductions for the high QP bands (H-QPs).

3.3. Subjective Quality

  • The proposed method exhibits fewer blocking artefacts.

3.4. Number of Residual Blocks

  • As N decreases, the average PSNR gain drops, and so does the relative complexity; the trade-off between the two is approximately linear.

This is the 4th story this month.


