Review: RHCNN — Residual Highway Convolutional Neural Network (Codec Filtering)
In this story, RHCNN (Residual Highway Convolutional Neural Network), by Tsinghua University, Chinese Academy of Sciences and Peking University, is reviewed. I read this paper because I am working on video coding research.
RHCNN is a CNN which acts as a high-dimensional in-loop filter to improve the reconstructed image quality. This is a 2018 TIP paper where TIP has a high impact factor of 6.79. (Sik-Ho Tsang @ Medium)
- What is In-loop Filter
- Choices of Basic Unit
- Entire Network Architecture
- Experimental Results
1. What is In-loop Filter
- The in-loop filter is applied to each reconstructed frame, before that frame is stored and used as a reference for prediction.
- At the encoder side, the reconstructed frame has a higher quality after in-loop filtering. This filtered frame is then used for predicting the remaining pixels within the frame or the pixels of other frames.
- Thus, with the use of the in-loop filter, the PSNR of the whole video increases.
- At the decoder side, the same in-loop filter will be used so that the decoded frames are equivalent to the reconstructed frame, i.e. no mismatch between the encoder and the decoder.
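The encoder/decoder symmetry described above can be illustrated with a toy example. This is only a minimal sketch under stated assumptions: frames are 1-D lists of floats, prediction simply reuses the previous filtered frame, and `loop_filter` is a hypothetical 3-tap smoothing filter standing in for a real in-loop filter (it is not RHCNN, and none of these function names come from the paper or from HM).

```python
def quantize(values, step):
    """Uniform quantization: the lossy step that introduces coding artifacts."""
    return [step * round(v / step) for v in values]


def loop_filter(frame):
    """Hypothetical stand-in for an in-loop filter: 3-tap smoothing."""
    n = len(frame)
    out = []
    for i in range(n):
        left = frame[max(i - 1, 0)]
        right = frame[min(i + 1, n - 1)]
        out.append((left + 2 * frame[i] + right) / 4)
    return out


def encode(frames, step):
    """Return the quantized residuals and the encoder's filtered references."""
    reference = [0.0] * len(frames[0])
    residuals, references = [], []
    for frame in frames:
        residual = quantize([f - r for f, r in zip(frame, reference)], step)
        recon = [r + q for r, q in zip(reference, residual)]
        reference = loop_filter(recon)  # filtered frame becomes the next reference
        residuals.append(residual)
        references.append(reference)
    return residuals, references


def decode(residuals):
    """Rebuild frames from residuals, applying the same filter as the encoder."""
    reference = [0.0] * len(residuals[0])
    decoded = []
    for residual in residuals:
        recon = [r + q for r, q in zip(reference, residual)]
        reference = loop_filter(recon)
        decoded.append(reference)
    return decoded


frames = [[10.0, 12.0, 15.0, 11.0], [11.0, 13.0, 14.0, 10.0]]
residuals, enc_refs = encode(frames, step=2.0)
dec_frames = decode(residuals)
print(dec_frames == enc_refs)  # prints True: no encoder/decoder mismatch
```

Because both sides run the same filter on the same reconstructed data, the decoder's frames match the encoder's references exactly. If the decoder skipped the filter, its references would drift away from the encoder's.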
2. Choices of Basic Unit
- In this paper, a CNN-based in-loop filter is proposed.
- Different basic units are considered within the CNN, as shown above, and the optimal one is chosen based on the validation results.
- (a) Highway unit: This network is based on the concept of Highway Network.
- (b) Constant Scaling Unit: A constant scaling of 0.5 is applied to the highway unit.
- (c) Dropout Shortcut: Dropout with a probability of 0.1 is applied to the highway unit.
- (d) Convolution Shortcut: A 3×3 convolution is applied to the shortcut instead of a plain skip connection.
- Six cascaded basic units are used within the network.
- Finally, it is found that the proposed highway unit obtains the highest average and maximum PSNR gains among the four basic units.
- Thus, the highway unit is used in the network.
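The four shortcut variants listed above can be sketched in plain Python to make the differences concrete. This is a rough illustration, not the paper's implementation. The following are all assumptions: the non-shortcut branch is taken to be conv, ReLU, conv on a single-channel 2-D input; the 0.5 scaling is applied to both branches; and dropout acts on the shortcut. The function names are hypothetical.

```python
import random


def conv3x3(x, kernel):
    """Zero-padded 3x3 convolution (cross-correlation) on a 2-D list."""
    h, w = len(x), len(x[0])
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    ii, jj = i + di, j + dj
                    if 0 <= ii < h and 0 <= jj < w:
                        acc += kernel[di + 1][dj + 1] * x[ii][jj]
            out[i][j] = acc
    return out


def relu(x):
    return [[max(v, 0.0) for v in row] for row in x]


def add(a, b):
    return [[u + v for u, v in zip(ra, rb)] for ra, rb in zip(a, b)]


def scale(x, s):
    return [[s * v for v in row] for row in x]


def residual_branch(x, k1, k2):
    """F(x): conv -> ReLU -> conv (assumed layout of the non-shortcut path)."""
    return conv3x3(relu(conv3x3(x, k1)), k2)


def highway_unit(x, k1, k2):
    """(a) Identity shortcut: y = x + F(x)."""
    return add(x, residual_branch(x, k1, k2))


def constant_scaling_unit(x, k1, k2, s=0.5):
    """(b) Assumed form: y = s * x + s * F(x)."""
    return add(scale(x, s), scale(residual_branch(x, k1, k2), s))


def dropout_shortcut_unit(x, k1, k2, p=0.1, rng=random):
    """(c) Assumed form: dropout on the shortcut, y = dropout(x) + F(x)."""
    kept = [[0.0 if rng.random() < p else v / (1 - p) for v in row] for row in x]
    return add(kept, residual_branch(x, k1, k2))


def conv_shortcut_unit(x, k1, k2, k_short):
    """(d) 3x3 convolution on the shortcut: y = conv(x) + F(x)."""
    return add(conv3x3(x, k_short), residual_branch(x, k1, k2))


# Six cascaded units, as in the review; the kernels here are arbitrary examples.
x = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
k1 = [[0.0, 0.1, 0.0], [0.1, -0.4, 0.1], [0.0, 0.1, 0.0]]
k2 = [[0.0, 0.0, 0.0], [0.0, 0.5, 0.0], [0.0, 0.0, 0.0]]
y = x
for _ in range(6):
    y = highway_unit(y, k1, k2)
```

With an identity kernel on the shortcut, variant (d) reduces to variant (a), which is one way to see that the identity shortcut is the simplest of the four.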
3. Entire Network Architecture
3.1. Network Variants
- (a) 13-layer plain network: A VGGNet-like network with 13 convolutional layers, as shown above.
- (b) 13-layer residual network: A ResNet-like network that uses skip connections, as shown above, to avoid the vanishing gradient problem.
- (c) 13-layer RHCNN: The proposed RHCNN, built from highway units. Besides the short skip connections, there is also a long skip connection.
- Of course, RHCNN obtains the best result.
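The structural difference between the three variants can be sketched as follows. This is a toy illustration only: layers are arbitrary functions on lists of floats rather than convolutions, the long skip connection is assumed to run from the network input to its output, and the function names are hypothetical.

```python
def plain_network(x, layers):
    """(a) Plain: layers applied one after another, no skip connections."""
    for layer in layers:
        x = layer(x)
    return x


def residual_network(x, blocks):
    """(b) Residual: each block's output is added to its input (short skip)."""
    for block in blocks:
        x = [xi + bi for xi, bi in zip(x, block(x))]
    return x


def rhcnn_like(x, units, tail):
    """(c) Short skips inside the cascade plus one long skip around it."""
    feat = residual_network(x, units)
    # Long skip, assumed here to connect the input directly to the output.
    return [xi + ti for xi, ti in zip(x, tail(feat))]


def zero(v):
    """A layer that outputs zeros, e.g. a block at zero initialization."""
    return [0.0] * len(v)


def halve(v):
    return [0.5 * e for e in v]


x = [1.0, 2.0, 4.0]
print(plain_network(x, [halve, halve]))    # prints [0.25, 0.5, 1.0]
print(residual_network(x, [zero] * 6))     # prints [1.0, 2.0, 4.0]
print(rhcnn_like(x, [zero] * 6, zero))     # prints [1.0, 2.0, 4.0]
```

With skip connections, blocks that output zero leave the signal untouched, so the network only has to learn the correction to the reconstructed frame. The plain network has no such identity path.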
3.2. Various Network Depths
- As shown above, the 13-layer RHCNN converges the fastest and reaches the highest PSNR by the end of training.
- The deepest, 17-layer RHCNN seems not very stable, and reaches a medium PSNR value.
- Besides, the 9-layer and 11-layer RHCNNs perform badly. They appear to converge to a local optimum.
- The same observation as in the figure holds: the 13-layer RHCNN obtains the highest PSNR improvement.
4. Experimental Results
- Training Set: 15 standard video sequences with varying resolutions. 50 frames are extracted by picking one frame out of every five.
- Validation Set: 300 frames from another 6 video sequences, different from the frames in the training set, are used as validation images.
- Testing Set: 11 standard video sequences, none of which overlaps with the training or validation images.
- Loss Function: The basic mean squared error (MSE) loss is used.
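For reference, the MSE loss and the PSNR metric used throughout this review can be computed as in the sketch below. It uses the standard textbook definitions, not code from the paper. The helper names and the 8-bit peak value of 255 are assumptions.

```python
import math


def mse(pred, target):
    """Mean squared error between two equally sized 2-D images."""
    total, count = 0.0, 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            total += (p - t) ** 2
            count += 1
    return total / count


def psnr(pred, target, peak=255.0):
    """PSNR in dB: 10 * log10(peak^2 / MSE); infinite for identical images."""
    err = mse(pred, target)
    if err == 0:
        return float("inf")
    return 10 * math.log10(peak ** 2 / err)


filtered = [[52.0, 60.0], [61.0, 70.0]]   # e.g. a filtered reconstruction
original = [[50.0, 60.0], [63.0, 70.0]]   # the uncompressed frame
print(mse(filtered, original))            # prints 2.0
print(psnr(filtered, original))
```

Minimizing MSE during training directly maximizes PSNR, since PSNR is a monotonically decreasing function of MSE.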
4.2. Combined Model or Separate Models for I, P & B Frames
- Using a single combined model for P and B frames achieves only a 3 to 4% bitrate saving.
- When separate models for P and B frames are used, a much better bitrate saving of 4 to 6% can be achieved.
4.3. Comparison with ARCNN
- RHCNN achieves a 5.7% BD-rate (BDBR) reduction, while ARCNN achieves only 1.72%.
- Results are similar for P frames (5.68% vs 2.00%) and B frames (4.35% vs 1.04%).
4.4. Qualitative Results
- As in (d) above, the image reconstructed by RHCNN is closer to (a), the ground truth.
4.5. Time Measurement
- RHCNN is time-consuming even with GPU parallelization, though it takes less than 3 times the time of the original HEVC (HM 12).