Reading: Jia TIP’19 — Content-Aware Convolutional Neural Network for In-Loop Filtering (Codec Filtering)
Using Modified AlexNet as Discriminative Network. Outperforms VRCNN & VDSR. 4.1%, 6.0%, 4.7% & 6.0% BD-Rate Reductions Under AI, LD, LDP & RA Configurations, Respectively.
In this story, “Content-Aware Convolutional Neural Network for In-Loop Filtering in High Efficiency Video Coding” (Jia ‘19), by Peking University, City University of Hong Kong, University of the Chinese Academy of Sciences, and Hikvision Research Institute, is presented. I read this because I work on video coding research. In this paper:
- A content-aware multi-model filtering mechanism is realized by restoring different regions with different CNN models, under the guidance of a discriminative network.
This paper was published in 2019 TIP, a journal with a high impact factor of 6.79. (Sik-Ho Tsang @ Medium)
Outline
- Single CNN Model Network Architecture
- Discriminative Network with Multiple CNN Models
- Experimental Results
1. Single CNN Model Network Architecture
- The BSDS-500 dataset is used: 380 images for training, 10 for validation, and the remaining images for testing.
- Firstly, using the architecture in Fig. 1(a), it is found that N=9 obtains the highest PSNR. (I believe the connection blocks at this step are all standard 3×3 convolutions.)
- Then, different connection blocks as shown above are put into Fig. 1(b).
- ⊗ denotes concatenation here, and the dropout has a probability of 1 (does that mean there is no residual path for Fig. 2(a) and Fig. 2(b)?).
- It is found that the one in Fig. 2(a) has the best performance.
- The above connection blocks are inspired by Inception.
- Finally, the 4th dropout_3×3_5×5 unit in Fig. 1(b) is reduced to a single 3×3 conv. layer to further cut the number of convolution kernel parameters, as in Fig. 1(c).
- Negligible performance loss is observed.
- The details of the network architecture are shown below:
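The paper's exact architecture table is not reproduced here. As a rough stand-in, here is a minimal PyTorch sketch of my reading of the single-model filter: a stack of dropout_3×3_5×5 units, each concatenating a 3×3 branch and a 5×5 branch (the ⊗ above). The channel widths, the luma-only input, the global residual connection, and the ReLU placement are my assumptions, not values from the paper.

```python
import torch
import torch.nn as nn

class Dropout3x3_5x5Unit(nn.Module):
    """One dropout_3x3_5x5 connection block: parallel 3x3 and 5x5
    convolutions whose outputs are concatenated (the ⊗ in Fig. 2(a))."""
    def __init__(self, in_ch=64, branch_ch=32):
        super().__init__()
        self.conv3 = nn.Conv2d(in_ch, branch_ch, 3, padding=1)
        self.conv5 = nn.Conv2d(in_ch, branch_ch, 5, padding=2)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # Concatenate the two receptive-field branches along the channel axis.
        return self.relu(torch.cat([self.conv3(x), self.conv5(x)], dim=1))

class SingleCNNFilter(nn.Module):
    """Single-model in-loop filter sketch: head conv, a stack of connection
    blocks, and a tail conv with a global residual connection (assumed)."""
    def __init__(self, n_blocks=9, ch=64):
        super().__init__()
        self.head = nn.Conv2d(1, ch, 3, padding=1)   # luma input assumed
        self.body = nn.Sequential(*[Dropout3x3_5x5Unit(ch, ch // 2)
                                    for _ in range(n_blocks)])
        self.tail = nn.Conv2d(ch, 1, 3, padding=1)

    def forward(self, x):
        # Predict the restoration residual and add it back to the input.
        return x + self.tail(self.body(self.head(x)))
```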
2. Discriminative Network with Multiple CNN Models
2.1. Motivation
- However, it is found that while some CTUs improve with the single CNN model, others suffer a performance loss.
- Thus, multiple candidate CNN models are used, so that each CTU can adaptively select the optimal CNN model to achieve better restoration performance.
2.2. Discriminative Network for Model Selection
- A lightweight modification of AlexNet, called Discrimnet, is used.
- There are 5 conv. layers (variable receptive fields, 11×11, 5×5 and 3×3) and 2 max-pooling layers with kernel size 3×3.
- Batch normalization is used after each pooling layer for faster convergence.
- The numbers of feature maps for the conv. layers are 96, 96, 192, 192 and 128.
- ReLU is used as the activation function after all conv. layers and fully connected (fc) layers (except for the last fc layer).
- The stride values for four of the conv. layers are 4, 2, 1 and 1.
- With a CTU as input, the output is an N-dimensional vector, where N is the number of candidate CNN models.
- The UCID dataset is used as training data: 1200 images for training and the remaining images for validation.
- Each CTU then goes through the CNN model corresponding to the highest activation in the N-dimensional vector (a sketch of such a network follows this list).
- But how are the training labels obtained? Iterative training is used, as described in Section 2.3.
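As a concrete reading of the bullets above, here is a minimal PyTorch sketch of such a Discrimnet. The paddings, the stride of the fifth conv. layer, the fully connected width (512), and the single-channel (luma) input are my assumptions; only the kernel sizes, feature-map counts, pooling, and batch-normalization placement follow the description.

```python
import torch.nn as nn

class Discrimnet(nn.Module):
    """AlexNet-like model-selection network sketch: 5 conv layers, 2 max-
    pooling layers (3x3), BN after each pooling layer, and an N-way output."""
    def __init__(self, n_models=8):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 96, 11, stride=4, padding=5), nn.ReLU(inplace=True),
            nn.MaxPool2d(3, stride=2), nn.BatchNorm2d(96),
            nn.Conv2d(96, 96, 5, stride=2, padding=2), nn.ReLU(inplace=True),
            nn.MaxPool2d(3, stride=2), nn.BatchNorm2d(96),
            nn.Conv2d(96, 192, 3, stride=1, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(192, 192, 3, stride=1, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(192, 128, 3, stride=1, padding=1), nn.ReLU(inplace=True),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.LazyLinear(512), nn.ReLU(inplace=True),
            nn.Linear(512, n_models),   # last fc layer: no ReLU
        )

    def forward(self, ctu):
        return self.classifier(self.features(ctu))

# Model selection at inference: the CTU is filtered by the CNN model with
# the highest activation, e.g. model_idx = discrimnet(ctu).argmax(dim=1).
```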
2.3. Iterative Training
- Firstly, an initial single CNN model is trained with the training data from BSDS-500.
- Then, the quality difference in terms of peak signal-to-noise ratio (PSNR) before and after single-CNN filtering is recorded for all training CTUs.
- All training samples are ranked in descending order of this PSNR difference.
- The ranked training samples are equally partitioned into N folds.
- Each fold of the partitioned training samples is used to fine-tune the single CNN model, yielding N content-specific models.
- K iterations are performed for training, with K=2 (a sketch of this procedure follows this list).
- During inference, a CTU-level flag is signaled to switch CNN filtering on or off.
- HM-16.9 is used.
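Pulling this together, here is a hedged Python sketch of the iterative labeling and fine-tuning loop as I understand it. The helpers fine_tune and psnr are placeholders supplied by the caller, and re-ranking by the best gain over all current models in later iterations is my assumption; the paper only describes ranking by the single-model PSNR difference explicitly.

```python
import copy
import numpy as np

def iterative_multi_model_training(base_model, ctus, targets,
                                   fine_tune, psnr, n_models=8, k_iters=2):
    """Sketch of Sec. 2.3: rank training CTUs by PSNR gain, split the
    ranked list into N equal folds, and fine-tune one model per fold.
    `fine_tune(model, inputs, targets)` and `psnr(a, b)` are placeholder
    helpers, not functions from the paper."""
    models = [copy.deepcopy(base_model) for _ in range(n_models)]
    for _ in range(k_iters):
        # PSNR gain of each CTU under its best current model. In the first
        # iteration all models are identical copies, so this reduces to the
        # single-model PSNR difference described in the paper.
        gains = [max(psnr(m(c), t) for m in models) - psnr(c, t)
                 for c, t in zip(ctus, targets)]
        order = np.argsort(gains)[::-1]          # descending PSNR gain
        folds = np.array_split(order, n_models)  # equal partition into N folds
        for model, fold in zip(models, folds):
            fine_tune(model,
                      [ctus[i] for i in fold],
                      [targets[i] for i in fold])
    return models
```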
3. Experimental Results
3.1. BD-Rate
- With a single model (N=1), 3.0%, 3.9%, 3.7% and 3.9% BD-rate reductions are achieved under the AI, LDB, LDP and RA configurations, respectively.
- As N increases, the BD-rate reduction grows.
- With N=8, 4.1%, 6.0%, 4.7% and 6.0% BD-rate reductions are achieved on average for the luma channel under the four coding configurations, respectively (a sketch of the BD-rate computation follows this list).
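For readers new to the metric: BD-rate is the standard Bjøntegaard delta measurement of average bitrate difference at equal quality. Below is a minimal sketch of its usual computation (cubic polynomial fit of log-rate against PSNR, integrated over the overlapping PSNR range); this is the generic metric, not code from the paper.

```python
import numpy as np

def bd_rate(rate_anchor, psnr_anchor, rate_test, psnr_test):
    """Bjontegaard delta-rate in percent: the average bitrate difference of
    the test codec relative to the anchor at equal PSNR. Negative values
    correspond to BD-rate reductions (bitrate savings), as reported above."""
    lr_a = np.log10(rate_anchor)
    lr_t = np.log10(rate_test)
    # Fit cubic polynomials: log-rate as a function of PSNR.
    p_a = np.polyfit(psnr_anchor, lr_a, 3)
    p_t = np.polyfit(psnr_test, lr_t, 3)
    # Integrate both fits over the overlapping PSNR interval.
    lo = max(min(psnr_anchor), min(psnr_test))
    hi = min(max(psnr_anchor), max(psnr_test))
    int_a = np.polyval(np.polyint(p_a), hi) - np.polyval(np.polyint(p_a), lo)
    int_t = np.polyval(np.polyint(p_t), hi) - np.polyval(np.polyint(p_t), lo)
    avg_log_diff = (int_t - int_a) / (hi - lo)
    return (10 ** avg_log_diff - 1) * 100
```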
3.2. RD Curves
- The proposed scheme can efficiently remove different kinds of compression artifacts.
- Careful observers may find that structures degraded during block-based coding are also recovered by the proposed scheme, e.g., the texture of the floor and straight lines.
3.5. Complexity
- All measurements used a GPU; the encoding complexity overhead is 113%, while the decoding overhead is 11656%.
- The size of each CNN model is 1.38 MB, while each Discrimnet model is 10.80 MB. Hence, it takes 14.94 MB to 20.6 MB to store the trained models for each QP interval.
- For comparison, the model sizes of VDSR and VRCNN are 2.54 MB and 0.21 MB, respectively.
- As for GPU memory usage, 370 MB to 1428 MB of run-time GPU memory is needed as N ranges from 1 to 8.
- VDSR consumes 1022 MB of run-time GPU memory, while VRCNN consumes 155 MB.
This is the 32nd story in this month!
Reference
[2019 TIP] [Jia TIP’19]
Content-Aware Convolutional Neural Network for In-Loop Filtering in High Efficiency Video Coding
Codec Filtering
JPEG [ARCNN] [RED-Net] [DnCNN] [Li ICME’17] [MemNet] [MWCNN]
HEVC [Lin DCC’16] [IFCNN] [VRCNN] [DCAD] [MMS-net] [DRN] [Lee ICCE’18] [DS-CNN] [CNNF] [RHCNN] [VRCNN-ext] [S-CNN & C-CNN] [MLSDRN] [ARTN] [Double-Input CNN] [CNNIF & CNNMC] [B-DRRN] [Residual-VRN] [AResNet] [Liu PCS’19] [DIA_Net] [RRCNN] [QE-CNN] [Jia TIP’19] [EDCNN] [VRCNN-BN] [MACNN]
3D-HEVC [RSVE+POST]
AVS3 [Lin PCS’19]
VVC [Lu CVPRW’19] [Wang APSIPA ASC’19] [ADCNN]