Review: RHCNN — Residual Highway Convolutional Neural Network (Codec Filtering)

Outperforms ARCNN

Sik-Ho Tsang
5 min read · Mar 17, 2020

In this story, RHCNN (Residual Highway Convolutional Neural Network), by Tsinghua University, the Chinese Academy of Sciences, and Peking University, is reviewed. I read this paper because I am working on video coding research.

RHCNN is a CNN that acts as a high-dimensional in-loop filter to improve the quality of the reconstructed frames. This is a 2018 TIP paper, where TIP has a high impact factor of 6.79. (Sik-Ho Tsang @ Medium)

Outline

  1. What is an In-loop Filter
  2. Choices of Basic Unit
  3. Entire Network Architecture
  4. Experimental Results

1. What is an In-loop Filter

The pipeline of hybrid video coding
  • The in-loop filter is applied to the reconstructed frame before that frame is stored as a reference for further prediction.
  • At the encoder side, the reconstructed frame after in-loop filtering has higher quality. This filtered frame is then used for predicting the remaining pixels within the same frame, or pixels of other frames.
  • Thus, with the use of the in-loop filter, the PSNR of the whole video increases.
  • At the decoder side, the same in-loop filter is applied so that the decoded frames are identical to the reconstructed frames at the encoder, i.e. there is no mismatch between encoder and decoder (see the sketch below).
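To make the data flow concrete, here is a minimal NumPy sketch of where the in-loop filter sits in the coding loop. The function names and the 8-bit clipping are my own illustration, not code from the paper or from the HEVC reference software:

```python
import numpy as np

def in_loop_filter(frame):
    # Placeholder for the learned filter (RHCNN in this paper);
    # a real implementation would run the CNN here.
    return frame

def reconstruct_and_filter(prediction, residual, reference_buffer):
    # Reconstruction: prediction plus the dequantized residual, clipped to 8-bit.
    reconstructed = np.clip(prediction + residual, 0, 255)
    # The in-loop filter is applied to the reconstructed frame...
    filtered = in_loop_filter(reconstructed)
    # ...and the *filtered* frame is stored as a reference for predicting
    # later frames, so encoder and decoder stay in sync.
    reference_buffer.append(filtered)
    return filtered
```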

2. Choices of Basic Unit

Types of highway units. (a) Proposed, (b) Constant Scaling, (c) Dropout Shortcut, (d) Convolution Shortcut (X: concatenation, +: addition)
  • In this paper, a CNN-based in-loop filter is proposed.
  • Different basic units are considered within the CNN, as shown above, and the optimal one is chosen based on the validation results.
  • (a) Highway Unit: The proposed unit, based on the concept of the Highway Network.
  • (b) Constant Scaling: A constant scaling of 0.5 is applied in the highway unit.
  • (c) Dropout Shortcut: Dropout with probability 0.1 is applied in the highway unit.
  • (d) Convolution Shortcut: A 3×3 convolution is applied on the shortcut instead of just a skip connection.
Gain of PSNR (dB) for (a) Proposed, (b) Constant Scaling, (c) Dropout Shortcut, (d) Convolution Shortcut
  • Six cascaded basic units are used within the network.
  • It is found that the proposed highway unit obtains the highest average and maximum PSNR gains among the four basic units.
  • Thus, the highway unit is used in the network; a rough sketch follows below.
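As an illustration only, such a basic unit could be sketched in PyTorch as below. The two 3×3 convolutions, the channel count, the activation placement, and the identity shortcut added to the output are my assumptions; the paper's figure should be consulted for the exact structure:

```python
import torch
import torch.nn as nn

class HighwayUnit(nn.Module):
    """Two 3x3 convolutions with an identity shortcut (sketch, details assumed)."""
    def __init__(self, channels=64):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.conv1(x))
        out = self.conv2(out)
        # Short skip connection: add the unit's input back to its output.
        return self.relu(out + x)
```

Under this reading, the other three variants would change only the shortcut path: scaling by a constant 0.5, applying dropout with probability 0.1, or replacing the identity with a 3×3 convolution.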

3. Entire Network Architecture

3.1. Network Variants

(a) 13-layer plain network. (b) 13-layer residual network. (c) 13-layer RHCNN.
  • (a) 13-layer plain network: a VGGNet-like network with 13 convolutional layers, as shown above.
  • (b) 13-layer residual network: a ResNet-like network that uses skip connections, as shown above, to alleviate the vanishing-gradient problem.
  • (c) 13-layer RHCNN: finally, the proposed RHCNN, built from highway units; besides the short skip connections, there is also a long skip connection.
Gain of PSNR (dB) for (a) 13-layer plain network. (b) 13-layer residual network. (c) 13-layer RHCNN.
  • As expected, RHCNN obtains the best result among the three variants; a rough sketch of this architecture follows.
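Reusing the HighwayUnit sketch above, the overall network might look roughly like the following. The head and tail convolutions and the exact layer count are my assumptions; what the text establishes is the cascade of six highway units (short skips) plus a long skip connection from input to output:

```python
class RHCNNSketch(nn.Module):
    """Cascaded highway units plus a long skip connection (sketch)."""
    def __init__(self, channels=64, num_units=6):
        super().__init__()
        self.head = nn.Conv2d(1, channels, kernel_size=3, padding=1)  # luma input
        self.units = nn.Sequential(*[HighwayUnit(channels) for _ in range(num_units)])
        self.tail = nn.Conv2d(channels, 1, kernel_size=3, padding=1)

    def forward(self, x):
        out = self.tail(self.units(torch.relu(self.head(x))))
        # Long skip connection: the network learns a residual correction
        # on top of the HEVC-reconstructed frame.
        return out + x

# Usage: filter a reconstructed 64x64 luma patch.
# net = RHCNNSketch()
# filtered = net(torch.randn(1, 1, 64, 64))
```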

3.2. Various Network Depths

Different Network Depths (Different Colors) Along Iterations.
  • As shown above, the 13-layer RHCNN converges the fastest and achieves the highest PSNR at the end of training.
  • The deepest, 17-layer RHCNN is not very stable and only reaches a medium PSNR value.
  • Besides, the 9-layer and 11-layer RHCNNs perform badly; they appear to converge to a local optimum.
Gain of PSNR (dB) for RHCNN with Various Depth
  • Consistent with the figure above, the 13-layer RHCNN obtains the highest PSNR improvement.

4. Experimental Results

4.1. Dataset

Training and Validation Sets
  • Training Set: 15 standard video sequences with varying resolutions. 50 frames are extracted by picking one frame out of every five.
  • Validation Set: 300 frames from another 6 video sequences, none of which appear in the training set.
  • Testing Set: 11 standard video sequences, none of which appears in the training or validation sets.
  • Loss Function: the basic mean squared error (MSE) between the network output and the original frame is used, as written below.
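The equation image from the original post is not reproduced here; in its standard form (notation mine, with Y_i the reconstructed patch, X_i the corresponding original patch, and F(·; Θ) the network), the MSE objective over N training samples is:

```latex
L(\Theta) = \frac{1}{N} \sum_{i=1}^{N} \left\| F(Y_i; \Theta) - X_i \right\|_2^2
```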

4.2. Combined Model or Separate Models for I, P & B Frames

  • Using a single combined model for P and B frames only achieves a 3 to 4% bitrate saving.
  • When separate models for P and B frames are used, a much better bitrate saving of 4 to 6% is achieved.

4.3. Comparison with ARCNN

Results for I frames
  • Using RHCNN, a BD-rate (BDBR) saving of 5.7% is achieved for I frames, while ARCNN only achieves 1.72%.
  • The results are similar for P frames (5.68% vs 2.00%) and B frames (4.35% vs 1.04%).

4.4. Qualitative Results

(a) Groundtruth, (b) Original HEVC, (c) HEVC with ARCNN, (d) HEVC with RHCNN
  • As shown in (d) above, RHCNN reconstructs an image closer to the (a) groundtruth than ARCNN does.

4.5. Time Measurement

Running Time in Seconds Per Frame
  • Even with GPU parallelization, RHCNN still adds considerable running time, though the total stays within 3× that of the original HEVC (HM 12).
