Review: Lu CVPRW’19 — Multi-Scale Spatial Priors (Codec Filtering)

Multi-Scale Spatial Priors CNN for Versatile Video Coding (VVC)

Sik-Ho Tsang
4 min read · Mar 7, 2020

In this story, a post-processing module using Multi-Scale Spatial Priors for the Versatile Video Coding (VVC) standard, proposed by Nanjing University, is briefly reviewed. This is a 2019 CVPRW paper from the Workshop and Challenge on Learned Image Compression (CLIC) at CVPR 2019. (Sik-Ho Tsang @ Medium)

Outline

  1. Multi-Scale Spatial Priors
  2. Some Training Details
  3. Experimental Results

1. Multi-Scale Spatial Priors

1.1. Convolutional Neural Network (CNN) as Post Processing Module

  • First, the input RGB image is converted into YUV4:4:4 or YUV4:2:0 for encoding.
  • Then, the encoded bitstream is sent to the decoder (or end user).
  • After decoding, the decoded YUV is converted back into RGB.
  • Finally, the RGB image is enhanced by the post-processing module, which is a CNN, as sketched below.
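A minimal sketch of this pipeline is given below, assuming hypothetical `vvc_encode`/`vvc_decode` wrappers around the VVC reference software; the color conversions use the standard full-range YCbCr (BT.601) matrices, and 4:2:0 chroma subsampling is omitted for brevity.

```python
# Minimal sketch of the evaluation pipeline described above. `vvc_encode`
# and `vvc_decode` are hypothetical wrappers around the VVC reference
# software; the color conversions use standard full-range YCbCr (BT.601).
import numpy as np

def rgb_to_yuv(rgb: np.ndarray) -> np.ndarray:
    """Convert an HxWx3 float RGB image in [0, 1] to YUV4:4:4."""
    m = np.array([[ 0.299,  0.587,  0.114],
                  [-0.169, -0.331,  0.500],
                  [ 0.500, -0.419, -0.081]])
    yuv = rgb @ m.T
    yuv[..., 1:] += 0.5              # center the chroma channels at 0.5
    return yuv

def yuv_to_rgb(yuv: np.ndarray) -> np.ndarray:
    """Inverse of rgb_to_yuv."""
    yuv = yuv.copy()
    yuv[..., 1:] -= 0.5
    m = np.array([[1.0,  0.0,    1.402],
                  [1.0, -0.344, -0.714],
                  [1.0,  1.772,  0.0  ]])
    return yuv @ m.T

def evaluate(rgb, vvc_encode, vvc_decode, post_cnn):
    yuv = rgb_to_yuv(rgb)            # RGB -> YUV4:4:4 (4:2:0 subsampling omitted)
    bitstream = vvc_encode(yuv)      # encode and send to the decoder / end user
    decoded_yuv = vvc_decode(bitstream)
    decoded_rgb = yuv_to_rgb(decoded_yuv)
    return post_cnn(decoded_rgb)     # CNN enhancement in the RGB domain
```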

1.2. Multi-Scale Spatial Priors CNN Network Architecture

  • Scale-wise convolution kernel sizes are utilized to capture multi-scale spatial priors.
  • 3×3 Conv at 1/16 of the original image size, 5×5 Conv at 1/4 of the original image size, and 7×7 Conv at the original size.
  • The authors claim that this operation extracts features from different scales more precisely.
  • These convolutional kernel sizes also coincide with the variable-size Coding Units (CUs) in Versatile Video Coding (VVC).
  • Four modified Residual Blocks (originating from ResNet) are used at the different scales.
  • Each convolutional layer uses 256 output channels at 1/16 of the original dimension, 128 channels at 1/4 of the original dimension, and 64 channels at the original dimension (see the sketch after this list).
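Below is a minimal PyTorch sketch of this multi-scale design. The scale-wise kernel sizes (7×7, 5×5, 3×3) and channel widths (64, 128, 256) follow the bullets above; the exact block layout, the down/up-sampling operators, and the placement of the four residual blocks are my assumptions, not the paper's exact architecture.

```python
# Minimal PyTorch sketch of the multi-scale design, under the assumptions
# stated above. Input height and width are assumed divisible by 4.
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Modified residual block (ResNet-style, no batch norm)."""
    def __init__(self, ch, k):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, k, padding=k // 2), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, k, padding=k // 2))
    def forward(self, x):
        return x + self.body(x)

class MultiScalePrior(nn.Module):
    def __init__(self):
        super().__init__()
        self.head  = nn.Conv2d(3, 64, 7, padding=3)               # full res, 7x7, 64 ch
        self.down1 = nn.Conv2d(64, 128, 5, stride=2, padding=2)   # 1/4 area, 5x5, 128 ch
        self.down2 = nn.Conv2d(128, 256, 3, stride=2, padding=1)  # 1/16 area, 3x3, 256 ch
        self.blocks = nn.Sequential(*[ResBlock(256, 3) for _ in range(4)])
        self.up1  = nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1)
        self.up2  = nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1)
        self.tail = nn.Conv2d(64, 3, 7, padding=3)

    def forward(self, x):
        f0 = self.head(x)
        f1 = self.down1(f0)
        f2 = self.blocks(self.down2(f1))
        out = self.up1(f2) + f1       # fuse features across scales
        out = self.up2(out) + f0
        return x + self.tail(out)     # residual restoration of the decoded image
```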

1.3. Loss Function

  • MSE loss is used for initial training.
  • Then the L1 norm replaces MSE for fine-tuning, as in the sketch below.
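As a sketch, the two-stage schedule can be written as follows; the switch epoch is a hypothetical hyper-parameter, since the paper only states that MSE comes first and L1 is used for fine-tuning.

```python
# Two-stage loss schedule: MSE first, then L1 for fine-tuning.
# `switch_epoch` is a hypothetical hyper-parameter, not from the paper.
import torch.nn.functional as F

def training_loss(pred, target, epoch, switch_epoch=50):
    if epoch < switch_epoch:
        return F.mse_loss(pred, target)  # initial training with MSE
    return F.l1_loss(pred, target)       # fine-tuning with the L1 norm
```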

2. Some Training Details

  • The image restoration network is trained on the DIV2K dataset.
  • Images compressed by VVC intra coding serve as the inputs, and the original images serve as the labels.
  • Several QPs (e.g., 25, 30, 35) are adopted to cover different bit-rate segments.
  • Training runs on an i7-7700K CPU and an NVIDIA Quadro P5000 GPU, with the Adam optimizer and a batch size of 16.
  • The network is trained in a transfer learning manner: models for higher QPs are initialized from the parameters of lower-QP models.
  • e.g., the network parameters at QP 22 are used to derive the network model at QP 27, as sketched below.
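A minimal sketch of this QP-ladder transfer learning is shown below; the QP list and checkpoint paths are hypothetical, and `MultiScalePrior` refers to the architecture sketch above.

```python
# QP-wise transfer learning: each higher-QP model is warm-started from the
# trained lower-QP weights. QP values and paths here are hypothetical.
import torch

def train_qp_ladder(qps=(22, 27, 32, 37), train_one_qp=None):
    prev_ckpt = None
    for qp in qps:
        model = MultiScalePrior()                         # architecture sketched above
        if prev_ckpt is not None:
            model.load_state_dict(torch.load(prev_ckpt))  # init from lower-QP model
        train_one_qp(model, qp)                           # fine-tune on QP-specific pairs
        prev_ckpt = f"model_qp{qp}.pth"
        torch.save(model.state_dict(), prev_ckpt)
```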

3. Experimental Results

  • The test set consists of Datasets P and M, with 330 images in total, released by the Computer Vision Lab of ETH Zurich.
YUV4:4:4
  • The proposed network achieves about 0.3 dB gain at each bit-rate point and an average 6.5% BD-rate reduction over the default VVC intra coding (a sketch of the BD-rate computation is given at the end of this section).
  • In contrast, the rate-distortion curve of ARCNN almost overlaps with that of VVC intra.
YUV4:2:0
  • The proposed network achieves about 0.5 dB gain at each bit-rate point, corresponding to an average 12.2% BD-rate reduction.
Four image snapshots
  • For the above images, PSNR gains of 0.2 dB, 0.2 dB, 0.25 dB and 0.15 dB are achieved, respectively.
Four image snapshots
  • The BD-rate is reduced by 4.35%, 4.03%, 4.56% and 2.99%, respectively, against VVC with YUV4:4:4 input.
  • The corresponding challenge leaderboard is at: http://challenge.compression.cc/leaderboard/lowrate/valid/
  • It seems that many SOTA approaches have appeared since then, so their approach (team name: NJUVisionPSNR) can no longer be found on the leaderboard.
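The BD-rate figures quoted above follow the standard Bjøntegaard metric. Below is a minimal NumPy sketch of that computation (a cubic fit of log-rate against PSNR, averaged over the overlapping PSNR range); this is the standard definition, not the authors' own evaluation script.

```python
# Standard Bjontegaard delta rate (BD-rate) between two RD curves,
# each given as matching arrays of bit rates and PSNR values.
import numpy as np

def bd_rate(rate_anchor, psnr_anchor, rate_test, psnr_test):
    """Average bit-rate difference (%) of test vs. anchor at equal PSNR."""
    lr_a, lr_t = np.log(rate_anchor), np.log(rate_test)
    p_a = np.polyfit(psnr_anchor, lr_a, 3)   # log-rate as a cubic in PSNR
    p_t = np.polyfit(psnr_test, lr_t, 3)
    lo = max(min(psnr_anchor), min(psnr_test))
    hi = min(max(psnr_anchor), max(psnr_test))
    int_a = np.polyval(np.polyint(p_a), hi) - np.polyval(np.polyint(p_a), lo)
    int_t = np.polyval(np.polyint(p_t), hi) - np.polyval(np.polyint(p_t), lo)
    avg_diff = (int_t - int_a) / (hi - lo)
    return (np.exp(avg_diff) - 1) * 100      # negative => bit-rate saving
```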

Reference

[2019 CVPRW] [Lu CVPRW’19]
Learned Image Restoration for VVC Intra Coding

My Previous Reviews

Image Classification [LeNet] [AlexNet] [Maxout] [NIN] [ZFNet] [VGGNet] [Highway] [SPPNet] [PReLU-Net] [STN] [DeepImage] [SqueezeNet] [GoogLeNet / Inception-v1] [BN-Inception / Inception-v2] [Inception-v3] [Inception-v4] [Xception] [MobileNetV1] [ResNet] [Pre-Activation ResNet] [RiR] [RoR] [Stochastic Depth] [WRN] [ResNet-38] [Shake-Shake] [FractalNet] [Trimps-Soushen] [PolyNet] [ResNeXt] [DenseNet] [PyramidNet] [DRN] [DPN] [Residual Attention Network] [DMRNet / DFN-MR] [IGCNet / IGCV1] [MSDNet] [ShuffleNet V1] [SENet] [NASNet] [MobileNetV2]

Object Detection [OverFeat] [R-CNN] [Fast R-CNN] [Faster R-CNN] [MR-CNN & S-CNN] [DeepID-Net] [CRAFT] [R-FCN] [ION] [MultiPathNet] [NoC] [Hikvision] [GBD-Net / GBD-v1 & GBD-v2] [G-RMI] [TDM] [SSD] [DSSD] [YOLOv1] [YOLOv2 / YOLO9000] [YOLOv3] [FPN] [RetinaNet] [DCN]

Semantic Segmentation [FCN] [DeconvNet] [DeepLabv1 & DeepLabv2] [CRF-RNN] [SegNet] [ParseNet] [DilatedNet] [DRN] [RefineNet] [GCN] [PSPNet] [DeepLabv3] [ResNet-38] [ResNet-DUC-HDC] [LC] [FC-DenseNet] [IDW-CNN] [DIS] [SDN] [DeepLabv3+]

Biomedical Image Segmentation [CUMedVision1] [CUMedVision2 / DCAN] [U-Net] [CFS-FCN] [U-Net+ResNet] [MultiChannel] [V-Net] [3D U-Net] [M²FCN] [SA] [QSA+QNT] [3D U-Net+ResNet] [Cascaded 3D U-Net] [Attention U-Net] [RU-Net & R2U-Net] [VoxResNet] [DenseVoxNet][UNet++] [H-DenseUNet] [DUNet]

Instance Segmentation [SDS] [Hypercolumn] [DeepMask] [SharpMask] [MultiPathNet] [MNC] [InstanceFCN] [FCIS]

Super Resolution [SRCNN] [FSRCNN] [VDSR] [ESPCN] [RED-Net] [DRCN] [DRRN] [LapSRN & MS-LapSRN] [SRDenseNet] [SR+STN]

Human Pose Estimation [DeepPose] [Tompson NIPS’14] [Tompson CVPR’15] [CPM] [FCGN] [IEF]

Codec Post-Processing [ARCNN] [Lin DCC’16] [IFCNN] [Li ICME’17] [VRCNN] [DCAD] [DS-CNN] [Lu CVPRW’19]

Generative Adversarial Network [GAN]
