Reading: Jia TIP’19 — Content-Aware Convolutional Neural Network for In-Loop Filtering (Codec Filtering)

Using Modified AlexNet as Discriminative Network. Outperforms VRCNN & VDSR. 4.1%, 6.0%, 4.7% & 6.0% BD-Rate Reduction Under AI, LD, LDP & RA Configurations Respectively.

Sik-Ho Tsang
6 min read · Jun 22, 2020

In this story, “Content-Aware Convolutional Neural Network for In-Loop Filtering in High Efficiency Video Coding” (Jia ‘19), by Peking University, City University of Hong Kong, University of the Chinese Academy of Sciences, and Hikvision Research Institute, is presented. I read this because I work on video coding research. In this paper:

  • A content-aware multi-model filtering mechanism is realized by restoring different regions with different CNN models under the guidance of a discriminative network.

This is a paper in 2019 TIP where TIP has a high impact factor of 6.79. (Sik-Ho Tsang @ Medium)

Outline

  1. Single CNN Model Network Architecture
  2. Discriminative Network with Multiple CNN Models
  3. Experimental Results

1. Single CNN Model Network Architecture

Fig. 1: Different Network Architectures
PSNR for Different Network Depth N
  • The BSDS-500 dataset is used: 380 images for training, 10 images for validation, while the remaining images are for testing.
  • Firstly, using the architecture in Fig. 1(a), it is found that N=9 obtains the highest PSNR. (I believe all connection blocks are standard 3×3 convolutions in this step.)
Fig. 2: Different connection blocks
  • Then, different connection blocks as shown above are put into Fig. 1(b).
  • ⊗ means concatenation here, and dropout has a probability of 1. (Does that mean there is no residual path for Fig. 2(a) and Fig. 2(b)?)
PSNR for Different Connection Blocks
  • It is found that the one in Fig. 2(a) has the best performance.
  • The above connection blocks are inspired by Inception.
  • Finally, the 4th dropout_3×3_5×5 unit in Fig. 1(b) is reduced to a single 3×3 conv. layer to further reduce the number of conv. kernel parameters, as in Fig. 1(c).
  • Negligible performance loss is observed.
  • The details of the network architecture are shown below:
Details of the network architecture
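
For intuition, below is a minimal PyTorch sketch of how the dropout_3×3_5×5 connection block and the trimmed single-model filter could be put together. The branch widths, the 1×1 skip convolution, the residual output and the overall depth are my own assumptions for illustration; only the 3×3/5×5 concatenation, the dropout-style skip and the reduced 4th unit come from the description above.

```python
import torch
import torch.nn as nn

class Dropout3x3_5x5(nn.Module):
    """Hypothetical sketch of the 'dropout_3x3_5x5' connection block of
    Fig. 2(a): parallel 3x3 and 5x5 conv branches whose outputs are
    concatenated (the '⊗' above), plus a skip path that is dropped with
    probability p_drop.  With p_drop = 1 the skip path never contributes,
    matching the blog's reading of 'dropout probability of 1'.
    Channel counts are assumptions."""

    def __init__(self, in_ch=64, branch_ch=32, p_drop=1.0):
        super().__init__()
        self.branch3 = nn.Conv2d(in_ch, branch_ch, 3, padding=1)
        self.branch5 = nn.Conv2d(in_ch, branch_ch, 5, padding=2)
        self.skip = nn.Conv2d(in_ch, 2 * branch_ch, 1)  # match channel count
        self.p_drop = p_drop
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        y = torch.cat([self.branch3(x), self.branch5(x)], dim=1)
        if self.p_drop >= 1.0:                # dropout probability of 1,
            return self.act(y)                # i.e. the skip path is never used
        if self.training and float(torch.rand(1)) < self.p_drop:
            return self.act(y)                # skip path randomly dropped
        return self.act(y + self.skip(x))     # skip path kept


class SingleCNNFilter(nn.Module):
    """Sketch of the trimmed single-model filter of Fig. 1(c): a head conv,
    a short stack of connection blocks with the 4th unit replaced by a
    plain 3x3 conv, and a tail conv.  Depth, widths, the single-channel
    (luma) input and the residual output are illustrative assumptions."""

    def __init__(self, ch=64):
        super().__init__()
        self.head = nn.Conv2d(1, ch, 3, padding=1)
        self.body = nn.Sequential(
            Dropout3x3_5x5(ch, ch // 2),
            Dropout3x3_5x5(ch, ch // 2),
            Dropout3x3_5x5(ch, ch // 2),
            nn.Conv2d(ch, ch, 3, padding=1),  # 4th unit reduced to a 3x3 conv
            nn.ReLU(inplace=True),
        )
        self.tail = nn.Conv2d(ch, 1, 3, padding=1)

    def forward(self, x):
        return x + self.tail(self.body(self.head(x)))
```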

2. Discriminative Network with Multiple CNN Models

2.1. Motivation

Green CTUs: Improvement after filtering, Red CTUs: Performance loss after filtering
  • However, it is found that some CTUs show improvements after filtering while others suffer performance loss.
  • Thus, multiple CNN candidate models are used. As such, each CTU can adaptively select the optimal CNN model to achieve better restoration performance.

2.2. Discriminative Network for Model Selection

Discrimnet: Discriminative Network
Details of the network architecture
  • A light-weighted modification of AlexNet, called Discrimnet, is used.
  • There are 5 conv. layers (variable receptive fields, 11×11, 5×5 and 3×3) and 2 max-pooling layers with kernel size 3×3.
  • Batch normalization is used after each pooling layer for faster convergence.
  • The numbers of feature maps for the conv. layers are 96, 96, 192, 192 and 128.
  • ReLU is used as the activation function after all conv. layers and fully connected (fc) layers (except for the last fc layer).
  • The stride values for four of the conv. layers are 4, 2, 1 and 1.
  • With the CTU as input, the output is an N-dimensional vector, where N is the number of candidate CNN models.
  • The UCID dataset is used as training data: 1200 images for training and the remaining for validation.
  • Each CTU then goes through the CNN model corresponding to the highest activation in the N-dimensional vector (a sketch of Discrimnet and this selection is given after this list).
  • But how are the training labels obtained? Iterative training is used, as described in Section 2.3 below.
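
To make the layer list above concrete, here is a minimal PyTorch sketch of Discrimnet together with the per-CTU model selection. The feature-map counts (96, 96, 192, 192, 128), the strides (4, 2, 1, 1), the two 3×3 max-pooling layers and the BN-after-pooling placement follow the bullets above; the assignment of kernel sizes to individual layers, the pooling positions, the single-channel input and the hidden fc width are assumptions.

```python
import torch
import torch.nn as nn

class Discrimnet(nn.Module):
    """Sketch of the AlexNet-style discriminative network for per-CTU
    model selection.  Layer widths and strides follow the text; the rest
    of the layout is assumed."""

    def __init__(self, num_models=8, ctu_size=64, hidden=512):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 96, 11, stride=4, padding=5), nn.ReLU(inplace=True),
            nn.MaxPool2d(3, stride=2, padding=1), nn.BatchNorm2d(96),
            nn.Conv2d(96, 96, 5, stride=2, padding=2), nn.ReLU(inplace=True),
            nn.MaxPool2d(3, stride=2, padding=1), nn.BatchNorm2d(96),
            nn.Conv2d(96, 192, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(192, 192, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(192, 128, 3, padding=1), nn.ReLU(inplace=True),
        )
        with torch.no_grad():  # infer the flattened feature size for the fc layers
            feat_dim = self.features(torch.zeros(1, 1, ctu_size, ctu_size)).numel()
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(feat_dim, hidden), nn.ReLU(inplace=True),
            nn.Linear(hidden, num_models),   # N-dimensional score vector, no ReLU
        )

    def forward(self, ctu):
        return self.classifier(self.features(ctu))


# Per-CTU selection at inference: pick the candidate model with the
# highest activation (`models` would be the N fine-tuned CNN filters,
# `ctu` a 1x1xHxW luma tensor):
#   idx = discrimnet(ctu).argmax(dim=1).item()
#   filtered_ctu = models[idx](ctu)
```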

2.3. Iterative Training

Iterative Training
  • Firstly, an initial single CNN model is trained with the training data from BSDS-500.
  • Then, the quality difference in terms of peak signal-to-noise ratio (PSNR) before and after single-CNN filtering, i.e. the PSNR difference, is recorded for all training CTUs.
  • All training samples are ranked in descending order of this PSNR difference.
  • The ranked training samples are equally partitioned into N folds.
  • Each fold of the partitioned training samples is utilized to fine-tune the single CNN model.
  • K iterations are performed for training, with K=2.
  • During inference, a CTU-level flag is added to switch CNN filtering on or off.
  • HM-16.9 is used.
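
The whole procedure can be summarized in a short Python sketch. Here train_cnn, fine_tune and apply_cnn are hypothetical caller-supplied callables standing in for the actual CNN training and inference code, and the way the labels are refreshed in the second iteration is my reading of the text rather than a detail confirmed by the paper.

```python
import numpy as np

def psnr(ref, rec, peak=255.0):
    """PSNR between an original CTU and a reconstructed/filtered CTU."""
    mse = np.mean((np.asarray(ref, dtype=np.float64) -
                   np.asarray(rec, dtype=np.float64)) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)

def iterative_training(rec_ctus, org_ctus, train_cnn, fine_tune, apply_cnn,
                       N=8, K=2):
    """Sketch of the iterative multi-model training described above."""
    # Step 1: one initial single CNN model trained on all CTUs (BSDS-500).
    base = train_cnn(rec_ctus, org_ctus)

    # Step 2: PSNR gain of every training CTU after single-CNN filtering.
    gains = [psnr(org, apply_cnn(base, rec)) - psnr(org, rec)
             for rec, org in zip(rec_ctus, org_ctus)]

    # Step 3: rank CTUs by gain in descending order and split into N folds.
    order = np.argsort(gains)[::-1]
    folds = np.array_split(order, N)

    # Step 4: fine-tune one candidate model per fold from the single model.
    models = [fine_tune(base,
                        [rec_ctus[i] for i in f],
                        [org_ctus[i] for i in f]) for f in folds]

    # Step 5 (iterations 2..K, K = 2 in the paper): relabel each CTU with
    # the model that restores it best, then fine-tune again.  How labels
    # are refreshed between iterations is an assumption.
    for _ in range(K - 1):
        labels = [int(np.argmax([psnr(org, apply_cnn(m, rec)) for m in models]))
                  for rec, org in zip(rec_ctus, org_ctus)]
        models = [fine_tune(models[j],
                            [rec_ctus[i] for i, l in enumerate(labels) if l == j],
                            [org_ctus[i] for i, l in enumerate(labels) if l == j])
                  for j in range(N)]
    # The final per-CTU labels would also serve as training targets for Discrimnet.
    return models
```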

3. Experimental Results

3.1. BD-Rate

Different Values of N
  • Single Model N=1: 3.0%, 3.9%, 3.7% and 3.9% BD-rate reductions can be achieved for AI, LDB, LDP and RA configurations, respectively.
  • As N increases, the BD-rate reduction becomes larger.
N=8
  • When N=8, 4.1%, 6.0%, 4.7% and 6.0% BD-rate reductions are achieved on average for the luma channel under the four coding configurations, respectively.
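
For readers unfamiliar with the metric, BD-rate numbers like these are obtained from the (bitrate, PSNR) points of the anchor and the tested codec using the standard Bjøntegaard cubic fit. A minimal NumPy sketch (my own, not code from the paper):

```python
import numpy as np

def bd_rate(rate_anchor, psnr_anchor, rate_test, psnr_test):
    """Bjøntegaard delta rate (BD-rate) in percent: the average bitrate
    change of the test codec relative to the anchor at equal quality.
    Negative values mean bitrate savings."""
    lr_a, lr_t = np.log(rate_anchor), np.log(rate_test)

    # Fit log-rate as a cubic polynomial of PSNR for each curve.
    p_a = np.polyfit(psnr_anchor, lr_a, 3)
    p_t = np.polyfit(psnr_test, lr_t, 3)

    # Integrate both fits over the overlapping PSNR interval.
    lo = max(min(psnr_anchor), min(psnr_test))
    hi = min(max(psnr_anchor), max(psnr_test))
    int_a = np.polyval(np.polyint(p_a), hi) - np.polyval(np.polyint(p_a), lo)
    int_t = np.polyval(np.polyint(p_t), hi) - np.polyval(np.polyint(p_t), lo)

    # Average log-rate difference converted to a percentage rate change.
    avg_diff = (int_t - int_a) / (hi - lo)
    return (np.exp(avg_diff) - 1) * 100
```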

3.2. RD Curves

RD Curves
  • It seems that the proposed approach works more efficiently in low-bitrate situations than in high-bitrate ones.

3.3. SOTA Comparison

SOTA Comparison Under RA configuration
  • The proposed approach outperforms VDSR and VRCNN.

3.4. Visual Quality

(a) Original; (b) VRCNN; (c) The proposed scheme (N = 8).
(a) Original; (b) VRCNN; (c) The proposed scheme (N = 8).
  • The proposed scheme can efficiently remove different kinds of compression artifacts.
  • Scrupulous observers may find that structures degraded during block-based coding can also be recovered by the proposed scheme, e.g., the floor texture and straight lines.

3.5. Complexity

Complexity
  • With GPU used for the CNN inference, the encoding complexity overhead is 113%, while the decoding overhead is 11656%.
  • The size of each CNN model is 1.38 MB while each Discrimnet model is 10.80 MB. Hence, it takes 14.94-20.6 MB to store the trained models for each QP interval.
  • Regarding VDSR and VRCNN, the model sizes are 2.54 MB and 0.21 MB, respectively.
  • As for GPU memory usage, 370-1428 MB of run-time GPU memory is needed when N ranges from 1 to 8.
  • VDSR consumes 1022 MB of run-time GPU memory, while that of VRCNN is 155 MB.

This is the 32nd story in this month!

