Review: RHCNN — Residual Highway Convolutional Neural Network (Codec Filtering)

Outperforms ARCNN

Sik-Ho Tsang
5 min read · Mar 17, 2020

In this story, RHCNN (Residual Highway Convolutional Neural Network), by Tsinghua University, the Chinese Academy of Sciences, and Peking University, is reviewed. I read this paper because I am working on video coding research.

RHCNN is a CNN that acts as a high-dimensional in-loop filter to improve the quality of the reconstructed frames. This is a 2018 TIP paper, where TIP has a high impact factor of 6.79. (Sik-Ho Tsang @ Medium)

Outline

  1. What is an In-loop Filter
  2. Choices of Basic Unit
  3. Entire Network Architecture
  4. Experimental Results

1. What is an In-loop Filter

The pipeline of hybrid video coding
  • The in-loop filter is applied to the reconstructed frame before that frame is stored as a reference for further prediction.
  • At the encoder side, the reconstructed frame after in-loop filtering has higher quality. This filtered frame is then used for predicting the remaining pixels within the same frame, or pixels of other frames.
  • Thus, with the use of the in-loop filter, the PSNR of the whole video increases.
  • At the decoder side, the same in-loop filter is applied so that the decoded frames are identical to the reconstructed frames at the encoder, i.e. there is no mismatch between encoder and decoder (see the sketch below).
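To make the data flow concrete, here is a minimal NumPy sketch of where the in-loop filter sits in the coding loop. The function names and the 8-bit clipping are my own illustration, not code from the paper or from the HEVC reference software:

```python
import numpy as np

def in_loop_filter(frame):
    # Placeholder for the learned filter (RHCNN in this paper);
    # a real implementation would run the CNN here.
    return frame

def reconstruct_and_filter(prediction, residual, reference_buffer):
    # Reconstruction: prediction plus the dequantized residual, clipped to 8-bit.
    reconstructed = np.clip(prediction + residual, 0, 255)
    # The in-loop filter is applied to the reconstructed frame...
    filtered = in_loop_filter(reconstructed)
    # ...and the *filtered* frame is stored as a reference for predicting
    # later frames, so encoder and decoder stay in sync.
    reference_buffer.append(filtered)
    return filtered
```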

2. Choices of Basic Unit

Types of highway units. (a) Proposed, (b) Constant Scaling, (c) Dropout Shortcut, (d) Convolution Shortcut (X: concatenation, +: addition)
  • In this paper, a CNN-based in-loop filter is proposed.
  • Different basic units are considered within the CNN, as shown above, and the optimal one is chosen based on the validation results.
  • (a) Highway Unit: The proposed unit, based on the concept of the Highway Network.
  • (b) Constant Scaling: A constant scaling of 0.5 is applied in the highway unit.
  • (c) Dropout Shortcut: Dropout with probability 0.1 is applied in the highway unit.
  • (d) Convolution Shortcut: A 3×3 convolution is applied on the shortcut instead of just a skip connection.
Gain of PSNR (dB) for (a) Proposed, (b) Constant Scaling, (c) Dropout Shortcut, (d) Convolution Shortcut
  • Six cascaded basic units are used within the network.
  • It is found that the proposed highway unit obtains the highest average and maximum PSNR gains among the four basic units.
  • Thus, the highway unit is used in the network; a rough sketch follows below.
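As an illustration only, such a basic unit could be sketched in PyTorch as below. The two 3×3 convolutions, the channel count, the activation placement, and the identity shortcut added to the output are my assumptions; the paper's figure should be consulted for the exact structure:

```python
import torch
import torch.nn as nn

class HighwayUnit(nn.Module):
    """Two 3x3 convolutions with an identity shortcut (sketch, details assumed)."""
    def __init__(self, channels=64):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.conv1(x))
        out = self.conv2(out)
        # Short skip connection: add the unit's input back to its output.
        return self.relu(out + x)
```

Under this reading, the other three variants would change only the shortcut path: scaling by a constant 0.5, applying dropout with probability 0.1, or replacing the identity with a 3×3 convolution.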

3. Entire Network Architecture

3.1. Network Variants

(a) 13-layer plain network. (b) 13-layer residual network. (c) 13-layer RHCNN.
  • (a) 13-layer plain network: a VGGNet-like network with 13 convolutional layers, as shown above.
  • (b) 13-layer residual network: a ResNet-like network that uses skip connections, as shown above, to alleviate the vanishing-gradient problem.
  • (c) 13-layer RHCNN: finally, the proposed RHCNN, built from highway units; besides the short skip connections, there is also a long skip connection.
Gain of PSNR (dB) for (a) 13-layer plain network. (b) 13-layer residual network. (c) 13-layer RHCNN.
  • As expected, RHCNN obtains the best result among the three variants; a rough sketch of this architecture follows.
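Reusing the HighwayUnit sketch above, the overall network might look roughly like the following. The head and tail convolutions and the exact layer count are my assumptions; what the text establishes is the cascade of six highway units (short skips) plus a long skip connection from input to output:

```python
class RHCNNSketch(nn.Module):
    """Cascaded highway units plus a long skip connection (sketch)."""
    def __init__(self, channels=64, num_units=6):
        super().__init__()
        self.head = nn.Conv2d(1, channels, kernel_size=3, padding=1)  # luma input
        self.units = nn.Sequential(*[HighwayUnit(channels) for _ in range(num_units)])
        self.tail = nn.Conv2d(channels, 1, kernel_size=3, padding=1)

    def forward(self, x):
        out = self.tail(self.units(torch.relu(self.head(x))))
        # Long skip connection: the network learns a residual correction
        # on top of the HEVC-reconstructed frame.
        return out + x

# Usage: filter a reconstructed 64x64 luma patch.
# net = RHCNNSketch()
# filtered = net(torch.randn(1, 1, 64, 64))
```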

3.2. Various Network Depths

Different Network Depths (Different Colors) Along Iterations.
  • As shown above, the 13-layer RHCNN converges the fastest and achieves the highest PSNR at the end of training.
  • The deepest, 17-layer RHCNN is not very stable and only reaches a medium PSNR value.
  • Besides, the 9-layer and 11-layer RHCNNs perform badly; they appear to converge to a local optimum.
Gain of PSNR (dB) for RHCNN with Various Depth
  • Consistent with the figure above, the 13-layer RHCNN obtains the highest PSNR improvement.

4. Experimental Results

4.1. Dataset

Training and Validation Sets
  • Training Set: 15 standard video sequences with varying resolutions. 50 frames are extracted by picking one frame out of every five.
  • Validation Set: 300 frames from another 6 video sequences, none of which appear in the training set.
  • Testing Set: 11 standard video sequences, none of which appears in the training or validation sets.
  • Loss Function: the basic mean squared error (MSE) between the network output and the original frame is used, as written below.
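The equation image from the original post is not reproduced here; in its standard form (notation mine, with Y_i the reconstructed patch, X_i the corresponding original patch, and F(·; Θ) the network), the MSE objective over N training samples is:

```latex
L(\Theta) = \frac{1}{N} \sum_{i=1}^{N} \left\| F(Y_i; \Theta) - X_i \right\|_2^2
```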

4.2. Combined Model or Separate Models for I, P & B Frames

  • Using a single combined model for P and B frames only achieves a 3 to 4% bitrate saving.
  • When separate models for P and B frames are used, a much better bitrate saving of 4 to 6% is achieved.

4.3. Comparison with ARCNN

Results for I frames
  • Using RHCNN, a BD-rate (BDBR) saving of 5.7% is achieved for I frames, while ARCNN only achieves 1.72%.
  • The results are similar for P frames (5.68% vs 2.00%) and B frames (4.35% vs 1.04%).

4.4. Qualitative Results

(a) Groundtruth, (b) Original HEVC, (c) HEVC with ARCNN, (d) HEVC with RHCNN
  • As shown in (d) above, RHCNN reconstructs an image closer to the (a) groundtruth than ARCNN does.

4.5. Time Measurement

Running Time in Seconds Per Frame
  • Even with GPU parallelization, RHCNN still adds considerable running time, though the total stays within 3× that of the original HEVC (HM 12).
