Review: Lu CVPRW’19 — Multi-Scale Spatial Priors (Codec Filtering)

Multi-Scale Spatial Priors CNN for Versatile Video Coding (VVC)

Sik-Ho Tsang
4 min read · Mar 7, 2020

In this story, a post-processing module using Multi-Scale Spatial Priors for the Versatile Video Coding (VVC) standard, proposed by Nanjing University, is briefly reviewed. This is a 2019 CVPRW paper from the Workshop and Challenge on Learned Image Compression (CLIC) at CVPR 2019. (Sik-Ho Tsang @ Medium)

Outline

  1. Multi-Scale Spatial Priors
  2. Some Training Details
  3. Experimental Results

1. Multi-Scale Spatial Priors

1.1. Convolutional Neural Network (CNN) as Post Processing Module

  • First, the input RGB image is converted into YUV4:4:4 or YUV4:2:0 for encoding.
  • Then, the encoded bitstream is sent to the decoder (or end user).
  • After decoding, the decoded YUV is converted back into RGB.
  • Finally, the RGB image is enhanced by the post-processing module, which is a CNN, as sketched below.
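A minimal sketch of this pipeline is given below, assuming hypothetical `vvc_encode`/`vvc_decode` wrappers around the VVC reference software; the color conversions use the standard full-range YCbCr (BT.601) matrices, and 4:2:0 chroma subsampling is omitted for brevity.

```python
# Minimal sketch of the evaluation pipeline described above. `vvc_encode`
# and `vvc_decode` are hypothetical wrappers around the VVC reference
# software; the color conversions use standard full-range YCbCr (BT.601).
import numpy as np

def rgb_to_yuv(rgb: np.ndarray) -> np.ndarray:
    """Convert an HxWx3 float RGB image in [0, 1] to YUV4:4:4."""
    m = np.array([[ 0.299,  0.587,  0.114],
                  [-0.169, -0.331,  0.500],
                  [ 0.500, -0.419, -0.081]])
    yuv = rgb @ m.T
    yuv[..., 1:] += 0.5              # center the chroma channels at 0.5
    return yuv

def yuv_to_rgb(yuv: np.ndarray) -> np.ndarray:
    """Inverse of rgb_to_yuv."""
    yuv = yuv.copy()
    yuv[..., 1:] -= 0.5
    m = np.array([[1.0,  0.0,    1.402],
                  [1.0, -0.344, -0.714],
                  [1.0,  1.772,  0.0  ]])
    return yuv @ m.T

def evaluate(rgb, vvc_encode, vvc_decode, post_cnn):
    yuv = rgb_to_yuv(rgb)            # RGB -> YUV4:4:4 (4:2:0 subsampling omitted)
    bitstream = vvc_encode(yuv)      # encode and send to the decoder / end user
    decoded_yuv = vvc_decode(bitstream)
    decoded_rgb = yuv_to_rgb(decoded_yuv)
    return post_cnn(decoded_rgb)     # CNN enhancement in the RGB domain
```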

1.2. Multi-Scale Spatial Priors CNN Network Architecture

  • Scale-wise convolution kernel sizes are utilized to capture multi-scale spatial priors.
  • 3×3 Conv at 1/16 of the original image size, 5×5 Conv at 1/4 of the original image size, and 7×7 Conv at the original size.
  • The authors claim that this operation extracts features from different scales more precisely.
  • These convolutional kernel sizes also coincide with the variable-size Coding Units (CUs) in Versatile Video Coding (VVC).
  • Four modified Residual Blocks (originating from ResNet) are used at the different scales.
  • Each convolutional layer uses 256 output channels at 1/16 of the original dimension, 128 channels at 1/4 of the original dimension, and 64 channels at the original dimension (see the sketch after this list).
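Below is a minimal PyTorch sketch of this multi-scale design. The scale-wise kernel sizes (7×7, 5×5, 3×3) and channel widths (64, 128, 256) follow the bullets above; the exact block layout, the down/up-sampling operators, and the placement of the four residual blocks are my assumptions, not the paper's exact architecture.

```python
# Minimal PyTorch sketch of the multi-scale design, under the assumptions
# stated above. Input height and width are assumed divisible by 4.
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Modified residual block (ResNet-style, no batch norm)."""
    def __init__(self, ch, k):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, k, padding=k // 2), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, k, padding=k // 2))
    def forward(self, x):
        return x + self.body(x)

class MultiScalePrior(nn.Module):
    def __init__(self):
        super().__init__()
        self.head  = nn.Conv2d(3, 64, 7, padding=3)               # full res, 7x7, 64 ch
        self.down1 = nn.Conv2d(64, 128, 5, stride=2, padding=2)   # 1/4 area, 5x5, 128 ch
        self.down2 = nn.Conv2d(128, 256, 3, stride=2, padding=1)  # 1/16 area, 3x3, 256 ch
        self.blocks = nn.Sequential(*[ResBlock(256, 3) for _ in range(4)])
        self.up1  = nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1)
        self.up2  = nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1)
        self.tail = nn.Conv2d(64, 3, 7, padding=3)

    def forward(self, x):
        f0 = self.head(x)
        f1 = self.down1(f0)
        f2 = self.blocks(self.down2(f1))
        out = self.up1(f2) + f1       # fuse features across scales
        out = self.up2(out) + f0
        return x + self.tail(out)     # residual restoration of the decoded image
```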

1.3. Loss Function

  • MSE loss is used for initial training.
  • Then the L1 norm replaces MSE for fine-tuning, as in the sketch below.
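As a sketch, the two-stage schedule can be written as follows; the switch epoch is a hypothetical hyper-parameter, since the paper only states that MSE comes first and L1 is used for fine-tuning.

```python
# Two-stage loss schedule: MSE first, then L1 for fine-tuning.
# `switch_epoch` is a hypothetical hyper-parameter, not from the paper.
import torch.nn.functional as F

def training_loss(pred, target, epoch, switch_epoch=50):
    if epoch < switch_epoch:
        return F.mse_loss(pred, target)  # initial training with MSE
    return F.l1_loss(pred, target)       # fine-tuning with the L1 norm
```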

2. Some Training Details

  • The image restoration network is trained on the DIV2K dataset.
  • Images compressed by VVC intra coding serve as the inputs, and the original images serve as the labels.
  • Several QPs (e.g., 25, 30, 35) are adopted to cover different bit-rate segments.
  • Training runs on an i7-7700K CPU and an NVIDIA Quadro P5000 GPU, with the Adam optimizer and a batch size of 16.
  • The network is trained in a transfer learning manner: models for higher QPs are initialized from the parameters of lower-QP models.
  • e.g., the network parameters at QP 22 are used to derive the network model at QP 27, as sketched below.
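A minimal sketch of this QP-ladder transfer learning is shown below; the QP list and checkpoint paths are hypothetical, and `MultiScalePrior` refers to the architecture sketch above.

```python
# QP-wise transfer learning: each higher-QP model is warm-started from the
# trained lower-QP weights. QP values and paths here are hypothetical.
import torch

def train_qp_ladder(qps=(22, 27, 32, 37), train_one_qp=None):
    prev_ckpt = None
    for qp in qps:
        model = MultiScalePrior()                         # architecture sketched above
        if prev_ckpt is not None:
            model.load_state_dict(torch.load(prev_ckpt))  # init from lower-QP model
        train_one_qp(model, qp)                           # fine-tune on QP-specific pairs
        prev_ckpt = f"model_qp{qp}.pth"
        torch.save(model.state_dict(), prev_ckpt)
```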

3. Experimental Results

  • The test set consists of Datasets P and M, with 330 images in total, released by the Computer Vision Lab of ETH Zurich.
YUV4:4:4
  • The proposed network achieves about 0.3 dB gain at each bit-rate point and an average 6.5% BD-rate reduction over the default VVC intra coding (a sketch of the BD-rate computation is given at the end of this section).
  • In contrast, the rate-distortion curve of ARCNN almost overlaps with that of VVC intra.
YUV4:2:0
  • The proposed network achieves about 0.5 dB gain at each bit-rate point, corresponding to an average 12.2% BD-rate reduction.
Four image snapshots
  • For the above images, PSNR gains of 0.2 dB, 0.2 dB, 0.25 dB and 0.15 dB are achieved, respectively.
Four image snapshots
  • The BD-rate is reduced by 4.35%, 4.03%, 4.56% and 2.99%, respectively, against VVC with YUV4:4:4 input.
  • The corresponding challenge leaderboard is at: http://challenge.compression.cc/leaderboard/lowrate/valid/
  • It seems that many SOTA approaches have appeared since then, so their approach (team name: NJUVisionPSNR) can no longer be found on the leaderboard.
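The BD-rate figures quoted above follow the standard Bjøntegaard metric. Below is a minimal NumPy sketch of that computation (a cubic fit of log-rate against PSNR, averaged over the overlapping PSNR range); this is the standard definition, not the authors' own evaluation script.

```python
# Standard Bjontegaard delta rate (BD-rate) between two RD curves,
# each given as matching arrays of bit rates and PSNR values.
import numpy as np

def bd_rate(rate_anchor, psnr_anchor, rate_test, psnr_test):
    """Average bit-rate difference (%) of test vs. anchor at equal PSNR."""
    lr_a, lr_t = np.log(rate_anchor), np.log(rate_test)
    p_a = np.polyfit(psnr_anchor, lr_a, 3)   # log-rate as a cubic in PSNR
    p_t = np.polyfit(psnr_test, lr_t, 3)
    lo = max(min(psnr_anchor), min(psnr_test))
    hi = min(max(psnr_anchor), max(psnr_test))
    int_a = np.polyval(np.polyint(p_a), hi) - np.polyval(np.polyint(p_a), lo)
    int_t = np.polyval(np.polyint(p_t), hi) - np.polyval(np.polyint(p_t), lo)
    avg_diff = (int_t - int_a) / (hi - lo)
    return (np.exp(avg_diff) - 1) * 100      # negative => bit-rate saving
```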

Reference

[2019 CVPRW] [Lu CVPRW’19]
Learned Image Restoration for VVC Intra Coding

My Previous Reviews

Image Classification [LeNet] [AlexNet] [Maxout] [NIN] [ZFNet] [VGGNet] [Highway] [SPPNet] [PReLU-Net] [STN] [DeepImage] [SqueezeNet] [GoogLeNet / Inception-v1] [BN-Inception / Inception-v2] [Inception-v3] [Inception-v4] [Xception] [MobileNetV1] [ResNet] [Pre-Activation ResNet] [RiR] [RoR] [Stochastic Depth] [WRN] [ResNet-38] [Shake-Shake] [FractalNet] [Trimps-Soushen] [PolyNet] [ResNeXt] [DenseNet] [PyramidNet] [DRN] [DPN] [Residual Attention Network] [DMRNet / DFN-MR] [IGCNet / IGCV1] [MSDNet] [ShuffleNet V1] [SENet] [NASNet] [MobileNetV2]

Object Detection [OverFeat] [R-CNN] [Fast R-CNN] [Faster R-CNN] [MR-CNN & S-CNN] [DeepID-Net] [CRAFT] [R-FCN] [ION] [MultiPathNet] [NoC] [Hikvision] [GBD-Net / GBD-v1 & GBD-v2] [G-RMI] [TDM] [SSD] [DSSD] [YOLOv1] [YOLOv2 / YOLO9000] [YOLOv3] [FPN] [RetinaNet] [DCN]

Semantic Segmentation [FCN] [DeconvNet] [DeepLabv1 & DeepLabv2] [CRF-RNN] [SegNet] [ParseNet] [DilatedNet] [DRN] [RefineNet] [GCN] [PSPNet] [DeepLabv3] [ResNet-38] [ResNet-DUC-HDC] [LC] [FC-DenseNet] [IDW-CNN] [DIS] [SDN] [DeepLabv3+]

Biomedical Image Segmentation [CUMedVision1] [CUMedVision2 / DCAN] [U-Net] [CFS-FCN] [U-Net+ResNet] [MultiChannel] [V-Net] [3D U-Net] [M²FCN] [SA] [QSA+QNT] [3D U-Net+ResNet] [Cascaded 3D U-Net] [Attention U-Net] [RU-Net & R2U-Net] [VoxResNet] [DenseVoxNet][UNet++] [H-DenseUNet] [DUNet]

Instance Segmentation [SDS] [Hypercolumn] [DeepMask] [SharpMask] [MultiPathNet] [MNC] [InstanceFCN] [FCIS]

Super Resolution [SRCNN] [FSRCNN] [VDSR] [ESPCN] [RED-Net] [DRCN] [DRRN] [LapSRN & MS-LapSRN] [SRDenseNet] [SR+STN]

Human Pose Estimation [DeepPose] [Tompson NIPS’14] [Tompson CVPR’15] [CPM] [FCGN] [IEF]

Codec Post-Processing [ARCNN] [Lin DCC’16] [IFCNN] [Li ICME’17] [VRCNN] [DCAD] [DS-CNN] [Lu CVPRW’19]

Generative Adversarial Network [GAN]
