Review: S-CNN & C-CNN — Adaptively Iterative In-loop Filter in HEVC (Codec Filtering)

SRCNN-like Network, With Residual Learning Introduced by ResNet, Up to 5.1% Bitrate Reduction is Obtained

Sik-Ho Tsang
6 min read · Apr 25, 2020

In this story, Simplified CNN (S-CNN) & Complicated CNN (C-CNN), by National Taiwan Normal University, National Sun Yat-sen University, National Dong Hwa University and Yuan Ze University, are reviewed.

The Proposed CNN Enhancement Mode is the In-loop Filter

In-loop filter is used to enhance the video frame quality before the video frame is used for viewing or prediction. With higher quality, better prediction can be obtained for the next frame. Bitrate can also be reduced due to the better prediction. (To know more about in-loop filter, please read DRN.)

  • For S-CNN, an early termination mechanism is proposed to further reduce the HEVC encoding complexity.
  • With regard to C-CNN, a GPU-based heterogeneous architecture is proposed to accelerate CNN processing.

This is a paper in 2018 ACCESS. This is an open-access journal with a high impact factor of 4.098. All papers can be downloaded free of charge even for non-IEEE members. (Sik-Ho Tsang @ Medium)

Outline

  1. Simplified CNN (S-CNN)
  2. Early Termination
  3. Complicated CNN (C-CNN)
  4. Experimental Results

1. Simplified CNN (S-CNN)

Simplified CNN (S-CNN)
  • The proposed S-CNN consists of only two operations: feature extraction and image reconstruction.
  • Feature extraction is applied to generate feature maps from the intra-coded frame. Image reconstruction is used to reconstruct a residual image for the final output.
  • Non-linear mapping, as used in SRCNN, is not included in the proposed S-CNN structure because the high-dimensional vector mapping would dominate the computational complexity as more fully convolutional layers are added.
  • The filter size is set to 5×5.
  • With Y as the input image, the feature extraction layer is F1(Y) = max(0, W1 * Y + B1).
  • W1 has a 5×5×32 vector size, where 32 denotes the number of filters. B1 is a 32-dimensional bias vector.
  • ReLU, i.e. the max(0, x) function, is used as the activation.
  • The image reconstruction layer is F2(Y) = W2 * F1(Y) + B2, where W2 has a 32×5×5 vector size and B2 is a single bias for image reconstruction.
  • Finally, F2(Y) denotes the output of the proposed S-CNN.
  • Another change from SRCNN is that residual learning (as in ResNet) is added to improve the CNN: H(Y) = F2(Y) + Y, where H(Y) is the final output of the forward pass of S-CNN.
  • The loss function is the mean squared error over the training set, L = (1/N) Σᵢ ||H(Yᵢ) − Xᵢ||², where N is the total number of training samples and Xᵢ is the corresponding original image.
  • After enhancement, if the CNN-enhanced CTU (Coding Tree Unit, 64×64 block) has a lower SSE than the conventionally reconstructed CTU, an extra bit, called cnnflag, is encoded in the bitstream to indicate that the CNN enhancement mode should be applied. (A minimal code sketch of the S-CNN structure is given after this list.)
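As a minimal sketch of the S-CNN structure described above (assuming single-channel luma input; the class and variable names are mine for illustration, not from the paper), in PyTorch:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SCNN(nn.Module):
    """Two-layer S-CNN: feature extraction + image reconstruction + residual learning."""
    def __init__(self, n_filters=32, kernel_size=5):
        super().__init__()
        pad = kernel_size // 2
        # Feature extraction: W1 (5x5, 32 filters) with bias B1.
        self.extract = nn.Conv2d(1, n_filters, kernel_size, padding=pad)
        # Image reconstruction: W2 (5x5) maps 32 feature maps to 1 channel; B2 is a single bias.
        self.reconstruct = nn.Conv2d(n_filters, 1, kernel_size, padding=pad)

    def forward(self, y):
        f1 = F.relu(self.extract(y))  # F1(Y) = max(0, W1 * Y + B1)
        f2 = self.reconstruct(f1)     # F2(Y) = W2 * F1(Y) + B2
        return f2 + y                 # H(Y) = F2(Y) + Y (residual learning)

# MSE training loss: L = (1/N) * sum ||H(Y_i) - X_i||^2
model = SCNN()
y = torch.rand(8, 1, 64, 64)  # batch of reconstructed 64×64 CTUs (toy data)
x = torch.rand(8, 1, 64, 64)  # corresponding original CTUs (toy data)
loss = F.mse_loss(model(y), x)
```

At encoding time, the encoder would then compare the SSE of the CNN-enhanced CTU against that of the conventional reconstruction and set cnnflag accordingly.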

2. Early Termination

Early Termination
  • With early termination, when the MSE between the reconstructed CTU and the original CTU is lower than a threshold T, the CNN enhancement mode is skipped.
  • T is adaptively updated. (But I do not want to focus on this in the story.)
  • For simplicity, if MSE is small enough, CTUs will not go through the CNN.
  • If MSE is large, CTUs will go through the CNN.
  • If MSE is very large, CTUs will go through the CNN multiple times. (A rough sketch of this logic is given after this list.)
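As a rough Python sketch of this adaptive iterative logic (the iteration cap max_iters and the threshold handling are my placeholders, not the paper's exact procedure):

```python
import numpy as np

def mse(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.mean((a - b) ** 2))

def enhance_ctu(recon_ctu, orig_ctu, cnn, threshold, max_iters=3):
    """Iteratively apply the CNN enhancement mode; skip CTUs that are already good enough."""
    ctu = recon_ctu
    for _ in range(max_iters):              # assumed cap on the number of passes
        if mse(ctu, orig_ctu) < threshold:  # early termination: quality is sufficient
            break
        ctu = cnn(ctu)                      # MSE still large: one more pass through the CNN
    return ctu
```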

3. Complicated CNN (C-CNN)

Complicated CNN (C-CNN)
  • A deeper network is used in C-CNN.
  • GPU is used in C-CNN.
Computational Complexity
  • The filter size f is set to 5, and the numbers of filters, n1 and n2, are set to 64 and 32, respectively. S is the spatial size of a CTU. (A back-of-the-envelope operation count is given below.)
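To see why C-CNN is so much heavier than S-CNN, here is a back-of-the-envelope multiply-accumulate count, assuming an SRCNN-like three-layer structure in which every layer uses filter size f (the per-layer filter sizes are my assumption):

```python
# Rough per-CTU multiply-accumulate (MAC) count for a three-layer C-CNN.
f, n1, n2, S = 5, 64, 32, 64
per_pixel = (f * f * 1 * n1      # feature extraction: 1 -> n1 maps
             + f * f * n1 * n2   # mapping layer: n1 -> n2 maps (dominates)
             + f * f * n2 * 1)   # reconstruction: n2 -> 1 map
total = per_pixel * S * S
print(f"{total:,} MACs per {S}×{S} CTU")  # 219,545,600
```

Under these assumptions, the mapping layer alone accounts for about 95% of the operations, which matches the observation below that the second convolutional layer dominates the complexity.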
C-CNN Computing Process
  • Thus, the GPU processes the convolution and activation operations in parallel to generate the n1 feature maps.
  • Furthermore, the n2 feature maps of the second convolutional layer (the mapping layer) are generated from all n1 feature maps.
  • C++ AMP is used to implement the proposed C-CNN in parallel. The C-CNN model is integrated into HM. (An illustrative sketch is given below.)
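The paper implements this GPU path with C++ AMP inside HM; purely to illustrate the idea of offloading the convolutions to the GPU, here is a PyTorch sketch of a C-CNN-like network (the residual connection and exact layer shapes are my assumptions):

```python
import torch
import torch.nn as nn

class CCNN(nn.Module):
    """Deeper SRCNN-like network: feature extraction -> mapping -> reconstruction."""
    def __init__(self, f=5, n1=64, n2=32):
        super().__init__()
        pad = f // 2
        self.net = nn.Sequential(
            nn.Conv2d(1, n1, f, padding=pad), nn.ReLU(),   # n1 feature maps
            nn.Conv2d(n1, n2, f, padding=pad), nn.ReLU(),  # mapping layer: n1 -> n2 maps
            nn.Conv2d(n2, 1, f, padding=pad),              # image reconstruction
        )

    def forward(self, y):
        return self.net(y) + y  # residual connection, assumed as in S-CNN

# The per-layer convolutions run in parallel on the GPU when one is available.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = CCNN().to(device).eval()
ctu = torch.rand(1, 1, 64, 64, device=device)  # one 64×64 CTU (toy data)
with torch.no_grad():
    enhanced = model(ctu)
```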

4. Experimental Results

  • 430 raw images from the RAISE dataset are used for CNN training.

4.1. S-CNN

BD-Rate (%) (BDBR) and Encoding Time (%) of S-CNN Compared to the Conventional HEVC HM-16.12
  • The proposed S-CNN-based method can achieve, on average, a 3.1% bitrate reduction with a 37% increase in time.
  • With the early termination mechanism enabled, the encoding time only increases by 23%, while the bitrate reduction slightly decreases to 2.8% BDBR.
Usage Rate (%) and Yield Rate (%) of S-CNN
  • The usage rate indicates the proportion of CTUs to which the CNN enhancement mode is applied.
  • The usage rates of the proposed method with early termination are, on average, 80.1%, 79.1%, 73.0%, and 67.7%, corresponding to QP {27, 32, 37, 42}.
  • Class C and D sequences achieve high usage rates, which suggests the proposed early termination mechanism is not well suited to low-resolution video encoding.
  • The yield rate shows the proportion of CTUs that actually benefit from the CNN enhancement mode.
  • About 75.3% to 84.6% of CTUs are improved by the CNN enhancement mode. When the early termination mechanism is applied, the yield rate slightly decreases to between 61.8% and 75.4%.

4.2. C-CNN

BD-Rate (%) (BDBR) and Encoding Time (%) of C-CNN Compared to the Conventional HEVC HM-16.12
  • The bitrate saving reaches 5.1% on average compared to the HEVC standard.
  • Meanwhile, the encoding time increases by 1254% on average due to the high computational complexity of the second convolutional layer.
  • With the use of GPU, the encoding time increase is reduced to 98%. The execution of C-CNN on GPU achieves about an 8.8–17.8× speed-up.
Overall Results for S-CNN & C-CNN

4.3. Visual Quality

Visual Quality Comparison. Top: BQSquare with QP = 32, Bottom: PartyScene with QP = 37.
  • The above figure shows the visual quality of the proposed intra coding method. We can observe that the proposed method visibly reduces ringing, blocking, and blurring artifacts.

During the days of coronavirus, I hope to write 30 stories this month to give myself a small challenge. This is the 25th story this month. Thanks for visiting my story…
