Reading: Jia TIP’19 — Content-Aware Convolutional Neural Network for In-Loop Filtering (Codec Filtering)

Using Modified AlexNet as Discriminative Network. Outperforms VRCNN & VDSR. 4.1%, 6.0%, 4.7% & 6.0% BD-Rate Reduction Under AI, LD, LDP & RA Configurations Respectively.

Sik-Ho Tsang
6 min read · Jun 22, 2020

In this story, “Content-Aware Convolutional Neural Network for In-Loop Filtering in High Efficiency Video Coding” (Jia ‘19), by Peking University, City University of Hong Kong, University of the Chinese Academy of Sciences, and Hikvision Research Institute, is presented. I read this because I work on video coding research. In this paper:

  • A content-aware multi-model filtering mechanism is realized by restoring different regions with different CNN models under the guidance of a discriminative network.

This is a paper in 2019 TIP where TIP has a high impact factor of 6.79. (Sik-Ho Tsang @ Medium)

Outline

  1. Single CNN Model Network Architecture
  2. Discriminative Network with Multiple CNN Models
  3. Experimental Results

1. Single CNN Model Network Architecture

Fig. 1: Different Network Architectures
PSNR for Different Network Depth N
  • The BSDS-500 dataset is used: 380 images for training, 10 images for validation, while the remaining images are for testing.
  • Firstly, using the architecture in Fig. 1(a), it is found that N=9 obtains the highest PSNR. (I believe all connection blocks are standard 3×3 convolutions in this step.)
Fig. 2: Different connection blocks
  • Then, different connection blocks as shown above are put into Fig. 1(b).
  • ⊗ means concatenation here, and dropout has a probability of 1. (Does that mean there is no residual path for Fig. 2(a) and Fig. 2(b)?)
PSNR for Different Connection Blocks
  • It is found that the one in Fig. 2(a) has the best performance.
  • The above connection blocks are inspired by Inception.
  • Finally, the 4th dropout_3×3_5×5 unit in Fig. 1(b) is reduced to a single 3×3 conv. layer to further reduce the number of conv. kernel parameters, as in Fig. 1(c).
  • Negligible performance loss is observed.
  • The details of the network architecture are shown below:
Details of the network architecture
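
For intuition, below is a minimal PyTorch sketch of how the dropout_3×3_5×5 connection block and the trimmed single-model filter could be put together. The branch widths, the 1×1 skip convolution, the residual output and the overall depth are my own assumptions for illustration; only the 3×3/5×5 concatenation, the dropout-style skip and the reduced 4th unit come from the description above.

```python
import torch
import torch.nn as nn

class Dropout3x3_5x5(nn.Module):
    """Hypothetical sketch of the 'dropout_3x3_5x5' connection block of
    Fig. 2(a): parallel 3x3 and 5x5 conv branches whose outputs are
    concatenated (the '⊗' above), plus a skip path that is dropped with
    probability p_drop.  With p_drop = 1 the skip path never contributes,
    matching the blog's reading of 'dropout probability of 1'.
    Channel counts are assumptions."""

    def __init__(self, in_ch=64, branch_ch=32, p_drop=1.0):
        super().__init__()
        self.branch3 = nn.Conv2d(in_ch, branch_ch, 3, padding=1)
        self.branch5 = nn.Conv2d(in_ch, branch_ch, 5, padding=2)
        self.skip = nn.Conv2d(in_ch, 2 * branch_ch, 1)  # match channel count
        self.p_drop = p_drop
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        y = torch.cat([self.branch3(x), self.branch5(x)], dim=1)
        if self.p_drop >= 1.0:                # dropout probability of 1,
            return self.act(y)                # i.e. the skip path is never used
        if self.training and float(torch.rand(1)) < self.p_drop:
            return self.act(y)                # skip path randomly dropped
        return self.act(y + self.skip(x))     # skip path kept


class SingleCNNFilter(nn.Module):
    """Sketch of the trimmed single-model filter of Fig. 1(c): a head conv,
    a short stack of connection blocks with the 4th unit replaced by a
    plain 3x3 conv, and a tail conv.  Depth, widths, the single-channel
    (luma) input and the residual output are illustrative assumptions."""

    def __init__(self, ch=64):
        super().__init__()
        self.head = nn.Conv2d(1, ch, 3, padding=1)
        self.body = nn.Sequential(
            Dropout3x3_5x5(ch, ch // 2),
            Dropout3x3_5x5(ch, ch // 2),
            Dropout3x3_5x5(ch, ch // 2),
            nn.Conv2d(ch, ch, 3, padding=1),  # 4th unit reduced to a 3x3 conv
            nn.ReLU(inplace=True),
        )
        self.tail = nn.Conv2d(ch, 1, 3, padding=1)

    def forward(self, x):
        return x + self.tail(self.body(self.head(x)))
```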

2. Discriminative Network with Multiple CNN Models

2.1. Motivation

Green CTUs: Improvement after filtering, Red CTUs: Performance loss after filtering
  • However, it is found that some CTUs show improvements after filtering while others suffer performance loss.
  • Thus, multiple CNN candidate models are used. As such, each CTU can adaptively select the optimal CNN model to achieve better restoration performance.

2.2. Discriminative Network for Model Selection

Discrimnet: Discriminative Network
Details of the network architecture
  • A light-weighted modification of AlexNet, called Discrimnet, is used.
  • There are 5 conv. layers (variable receptive fields, 11×11, 5×5 and 3×3) and 2 max-pooling layers with kernel size 3×3.
  • Batch normalization is used after each pooling layer for faster convergence.
  • The numbers of feature maps for the conv. layers are 96, 96, 192, 192 and 128.
  • ReLU is used as the activation function after all conv. layers and fully connected (fc) layers (except for the last fc layer).
  • The stride values for four of the conv. layers are 4, 2, 1 and 1.
  • With the CTU as input, the output is an N-dimensional vector, where N is the number of candidate CNN models.
  • The UCID dataset is used as training data: 1200 images for training and the remaining for validation.
  • Each CTU then goes through the CNN model corresponding to the highest activation in the N-dimensional vector (a sketch of Discrimnet and this selection is given after this list).
  • But how are the training labels obtained? Iterative training is used, as described in Section 2.3 below.
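
To make the layer list above concrete, here is a minimal PyTorch sketch of Discrimnet together with the per-CTU model selection. The feature-map counts (96, 96, 192, 192, 128), the strides (4, 2, 1, 1), the two 3×3 max-pooling layers and the BN-after-pooling placement follow the bullets above; the assignment of kernel sizes to individual layers, the pooling positions, the single-channel input and the hidden fc width are assumptions.

```python
import torch
import torch.nn as nn

class Discrimnet(nn.Module):
    """Sketch of the AlexNet-style discriminative network for per-CTU
    model selection.  Layer widths and strides follow the text; the rest
    of the layout is assumed."""

    def __init__(self, num_models=8, ctu_size=64, hidden=512):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 96, 11, stride=4, padding=5), nn.ReLU(inplace=True),
            nn.MaxPool2d(3, stride=2, padding=1), nn.BatchNorm2d(96),
            nn.Conv2d(96, 96, 5, stride=2, padding=2), nn.ReLU(inplace=True),
            nn.MaxPool2d(3, stride=2, padding=1), nn.BatchNorm2d(96),
            nn.Conv2d(96, 192, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(192, 192, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(192, 128, 3, padding=1), nn.ReLU(inplace=True),
        )
        with torch.no_grad():  # infer the flattened feature size for the fc layers
            feat_dim = self.features(torch.zeros(1, 1, ctu_size, ctu_size)).numel()
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(feat_dim, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, num_models),   # N-dimensional score vector, no ReLU
        )

    def forward(self, ctu):
        return self.classifier(self.features(ctu))


# Per-CTU selection at inference: pick the candidate model with the
# highest activation (`models` would be the N fine-tuned CNN filters,
# `ctu` a 1x1xHxW luma tensor):
#   idx = discrimnet(ctu).argmax(dim=1).item()
#   filtered_ctu = models[idx](ctu)
```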

2.3. Iterative Training

Iterative Training
  • Firstly, an initial single CNN model is trained with the training data from BSDS-500.
  • Then, the quality difference in terms of peak signal-to-noise ratio (PSNR) before and after single-CNN filtering, i.e. the PSNR difference, is recorded for all training CTUs.
  • All training samples are ranked in descending order of this PSNR difference.
  • The ranked training samples are equally partitioned into N folds.
  • Each fold of the partitioned training samples is utilized to fine-tune the single CNN model.
  • K iterations are performed for training, with K=2.
  • During inference, a CTU-level flag is added to switch CNN filtering on or off.
  • HM-16.9 is used.
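
The whole procedure can be summarized in a short Python sketch. Here train_cnn, fine_tune and apply_cnn are hypothetical caller-supplied callables standing in for the actual CNN training and inference code, and the way the labels are refreshed in the second iteration is my reading of the text rather than a detail confirmed by the paper.

```python
import numpy as np

def psnr(ref, rec, peak=255.0):
    """PSNR between an original CTU and a reconstructed/filtered CTU."""
    mse = np.mean((np.asarray(ref, dtype=np.float64) -
                   np.asarray(rec, dtype=np.float64)) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)

def iterative_training(rec_ctus, org_ctus, train_cnn, fine_tune, apply_cnn,
                       N=8, K=2):
    """Sketch of the iterative multi-model training described above."""
    # Step 1: one initial single CNN model trained on all CTUs (BSDS-500).
    base = train_cnn(rec_ctus, org_ctus)

    # Step 2: PSNR gain of every training CTU after single-CNN filtering.
    gains = [psnr(org, apply_cnn(base, rec)) - psnr(org, rec)
             for rec, org in zip(rec_ctus, org_ctus)]

    # Step 3: rank CTUs by gain in descending order and split into N folds.
    order = np.argsort(gains)[::-1]
    folds = np.array_split(order, N)

    # Step 4: fine-tune one candidate model per fold from the single model.
    models = [fine_tune(base,
                        [rec_ctus[i] for i in f],
                        [org_ctus[i] for i in f]) for f in folds]

    # Step 5 (iterations 2..K, K = 2 in the paper): relabel each CTU with
    # the model that restores it best, then fine-tune again.  How labels
    # are refreshed between iterations is an assumption.
    for _ in range(K - 1):
        labels = [int(np.argmax([psnr(org, apply_cnn(m, rec)) for m in models]))
                  for rec, org in zip(rec_ctus, org_ctus)]
        models = [fine_tune(models[j],
                            [rec_ctus[i] for i, l in enumerate(labels) if l == j],
                            [org_ctus[i] for i, l in enumerate(labels) if l == j])
                  for j in range(N)]
    # The final per-CTU labels would also serve as training targets for Discrimnet.
    return models
```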

3. Experimental Results

3.1. BD-Rate

Different Values of N
  • Single Model N=1: 3.0%, 3.9%, 3.7% and 3.9% BD-rate reductions can be achieved for AI, LDB, LDP and RA configurations, respectively.
  • As N increases, the BD-rate reduction becomes larger.
N=8
  • When N=8, 4.1%, 6.0%, 4.7% and 6.0% BD-rate reductions are achieved on average for the luma channel under the four coding configurations, respectively.
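
For readers unfamiliar with the metric, BD-rate numbers like these are obtained from the (bitrate, PSNR) points of the anchor and the tested codec using the standard Bjøntegaard cubic fit. A minimal NumPy sketch (my own, not code from the paper):

```python
import numpy as np

def bd_rate(rate_anchor, psnr_anchor, rate_test, psnr_test):
    """Bjøntegaard delta rate (BD-rate) in percent: the average bitrate
    change of the test codec relative to the anchor at equal quality.
    Negative values mean bitrate savings."""
    lr_a, lr_t = np.log(rate_anchor), np.log(rate_test)

    # Fit log-rate as a cubic polynomial of PSNR for each curve.
    p_a = np.polyfit(psnr_anchor, lr_a, 3)
    p_t = np.polyfit(psnr_test, lr_t, 3)

    # Integrate both fits over the overlapping PSNR interval.
    lo = max(min(psnr_anchor), min(psnr_test))
    hi = min(max(psnr_anchor), max(psnr_test))
    int_a = np.polyval(np.polyint(p_a), hi) - np.polyval(np.polyint(p_a), lo)
    int_t = np.polyval(np.polyint(p_t), hi) - np.polyval(np.polyint(p_t), lo)

    # Average log-rate difference converted to a percentage rate change.
    avg_diff = (int_t - int_a) / (hi - lo)
    return (np.exp(avg_diff) - 1) * 100
```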

3.2. RD Curves

RD Curves
  • It seems that the proposed approach works more efficiently in low-bitrate situations than in high-bitrate ones.

3.3. SOTA Comparison

SOTA Comparison Under RA configuration
  • The proposed approach outperforms VDSR and VRCNN.

3.4. Visual Quality

(a) Original; (b) VRCNN; (c) The proposed scheme (N = 8).
(a) Original; (b) VRCNN; (c) The proposed scheme (N = 8).
  • The proposed scheme can efficiently remove different kinds of compression artifacts.
  • Scrupulous observers may find that structures degraded during block-based coding can also be recovered by the proposed scheme, e.g., the floor texture and straight lines.

3.5. Complexity

Complexity
  • With GPU used for the CNN inference, the encoding complexity overhead is 113%, while the decoding overhead is 11656%.
  • The size of each CNN model is 1.38 MB while each Discrimnet model is 10.80 MB. Hence, it takes 14.94-20.6 MB to store the trained models for each QP interval.
  • Regarding VDSR and VRCNN, the model sizes are 2.54 MB and 0.21 MB, respectively.
  • As for GPU memory usage, 370-1428 MB of run-time GPU memory is needed when N ranges from 1 to 8.
  • VDSR consumes 1022 MB of run-time GPU memory, while that of VRCNN is 155 MB.

This is the 32nd story in this month!

