Reading: Double-Input CNN — Mean-based Mask (MM)+Add-based Fusion (AF) (Codec Filtering)
In this story, Enhancing HEVC Compressed Videos with a Partition-masked Convolutional Neural Network (Double-Input CNN), by Shanghai Jiao Tong University and the University of Maryland, is described. I read this because I work on video coding research. (The name Double-Input CNN was coined by another paper; thus, I also call it Double-Input CNN here.)
In contrast to existing CNN-based approaches, which only take the decoded frame as the input to the CNN, the proposed approach considers the coding unit (CU) size information and combines it with the distorted decoded frame as input, such that the degradation introduced by HEVC is reduced more efficiently. This is a paper in 2018 ICIP. (Sik-Ho Tsang @ Medium)
Outline
- Framework Variants
- Double-Input CNN Network Architecture
- Experimental Results
1. Framework Variants
- Since block-wise transform and quantization are performed in HEVC during encoding, the quality degradation of compressed frames is highly related to the coding unit splitting. Thus, the partition information contains useful clues for eliminating the artifacts introduced during encoding.
- There are numerous ways to generate this information and fuse it with the decoded frame.
1.1. Mask Generation
- (b) Mean-based mask (MM): Each partition block in a frame is filled with the mean value of all decoded pixels inside this partition.
- (c) Boundary-based mask (BM): The boundary pixels between partitions are filled with the value 1, and the remaining non-boundary pixels are filled with the value 0. The width of the boundary is set to 2. (A minimal sketch of both masks follows this list.)
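For concreteness, here is a minimal NumPy sketch of both mask types, assuming the decoder's partition information is available as a hypothetical list of (y, x, h, w) CU rectangles (this representation is my assumption, not the paper's):

```python
import numpy as np

def mean_based_mask(frame: np.ndarray, blocks) -> np.ndarray:
    """Mean-based mask (MM): fill each CU partition with the mean of its decoded pixels."""
    mask = np.zeros_like(frame, dtype=np.float32)
    for y, x, h, w in blocks:
        mask[y:y + h, x:x + w] = frame[y:y + h, x:x + w].mean()
    return mask

def boundary_based_mask(shape, blocks, width: int = 1) -> np.ndarray:
    """Boundary-based mask (BM): mark pixels along partition edges with 1.
    With width=1 per block side, a boundary shared by two blocks becomes
    2 pixels wide, which is one reading of the paper's boundary width of 2."""
    mask = np.zeros(shape, dtype=np.float32)
    for y, x, h, w in blocks:
        mask[y:y + width, x:x + w] = 1.0          # top edge
        mask[y + h - width:y + h, x:x + w] = 1.0  # bottom edge
        mask[y:y + h, x:x + width] = 1.0          # left edge
        mask[y:y + h, x + w - width:x + w] = 1.0  # right edge
    return mask
```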
1.2. Mask-frame Fusion Strategies
- (a) Add-based fusion (AF): First extract the feature maps of the mask using a CNN, then combine them with the feature maps of the input frame using an element-wise add layer.
- (b) Concatenate-based fusion (CF): Concatenate the mask and frame into a two-channel image, which is fed to the CNN as input.
- (c) Early fusion (EF): First extract the features of the mask using only three convolutional layers, then integrate them into the network.
The above mask generation and fusion strategies are studied in the experimental results section; a sketch of the winning fusion variant follows.
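Here is a minimal PyTorch sketch of the add-based fusion (AF); the layer count and channel width of the mask branch are illustrative assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class AddFusion(nn.Module):
    """AF: a small CNN extracts mask features, which are merged into the
    frame features by an element-wise add."""
    def __init__(self, channels: int = 64):
        super().__init__()
        self.mask_branch = nn.Sequential(
            nn.Conv2d(1, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, frame_feats: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        return frame_feats + self.mask_branch(mask)

# Concatenate-based fusion (CF), by contrast, simply stacks the two inputs:
#   x = torch.cat([frame, mask], dim=1)  # two-channel input to the CNN
```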
2. Double-Input CNN Network Architecture
- This CNN contains two streams in the feature extracting stage so as to extract features for the decoded frame and its corresponding mask, respectively.
- Each residual block in the feature extracting stage has two convolutional layers with 3×3 kernels and 64 feature maps, followed by batch normalization and ReLU, as shown in the grey block at the bottom right of the figure (see the sketch after this list).
- Then, the feature maps of the mask and the decoded frame are fused by the add-based fusion strategy and fed to the remaining three convolutional layers.
- MSE is used as the loss function.
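A PyTorch sketch of such a residual block; the exact conv/BN/ReLU ordering and the identity skip connection are assumptions based on the description above:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Two 3x3 convolutions with 64 feature maps, each followed by batch
    normalization; ReLU activation; identity skip connection."""
    def __init__(self, channels: int = 64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.body(x)
```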
3. Experimental Results
3.1. Some Training Details
- The dataset is derived from 600 video clips with various resolutions, as shown above. (The authors did not explicitly mention which dataset they used for training.)
- All raw video clips are encoded by HM-16.0 under the Low-Delay P configuration at QP = 22, 27, 32, and 37.
- An individual CNN is trained for each QP: the QP 37 model is trained first, and the others are obtained by fine-tuning it (a toy sketch of this scheme follows).
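A runnable toy sketch of this per-QP scheme with the paper's MSE loss; the model, data, and learning rates below are stand-ins, not the paper's setup:

```python
import copy
import torch
import torch.nn as nn

def build_model() -> nn.Module:
    # Stand-in for the Double-Input CNN; a single conv keeps the sketch runnable.
    return nn.Conv2d(1, 1, 3, padding=1)

def train_one_qp(model: nn.Module, frames, targets, lr=1e-4, steps=100) -> nn.Module:
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()  # MSE loss, as in the paper
    for _ in range(steps):
        opt.zero_grad()
        loss_fn(model(frames), targets).backward()
        opt.step()
    return model

frames = torch.randn(4, 1, 64, 64)   # dummy decoded patches
targets = torch.randn(4, 1, 64, 64)  # dummy raw (uncompressed) patches

# Train the QP 37 model first, then fine-tune copies of it for the other QPs.
models = {37: train_one_qp(build_model(), frames, targets)}
for qp in (32, 27, 22):
    models[qp] = train_one_qp(copy.deepcopy(models[37]), frames, targets, lr=1e-5)
```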
3.2. Ablation Study
- 1-in: With no mask input, it obtains the lowest PSNR improvement.
- 2-in+BM+AF: It cannot provide a noticeable improvement (only 0.08 dB over 1-in). This is because merely marking boundary pixels in a mask is less effective at highlighting the partition modes in a frame.
- Comparatively, the concatenate-based fusion (2-in+MM+CF) and early fusion (2-in+MM+EF) strategies obtain only small gains, similar to 2-in+BM+AF. This is probably because these fusion strategies are less compatible with the CNN model used in this paper.
- In contrast, the mean-based mask with add-based fusion (2-in+MM+AF) obtains a more obvious PSNR improvement (0.15 dB over 1-in).
3.3. SOTA Comparison
- The full version of the approach (our+2-in+MM+AF) achieves the best performance among all the compared methods.
- Specifically, it obtains over 9.76% BD-rate reduction relative to standard HEVC and a 4% BD-rate reduction compared with the state-of-the-art QECNN (a sketch of the BD-rate metric follows this list).
- When the partition-mask strategy is integrated, VRCNN+MM+AF also obtains a 3% BD-rate improvement over the original VRCNN method.
- The baseline single-input method (our+1-in) also obtains satisfactory results compared with the existing methods (VRCNN, QECNN-P).
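As a side note on the metric: BD-rate is the Bjøntegaard delta rate, the average bitrate difference between two rate-distortion curves over their common quality range. A minimal NumPy sketch of the standard computation (cubic fit in the log-rate domain):

```python
import numpy as np

def bd_rate(rate_anchor, psnr_anchor, rate_test, psnr_test) -> float:
    """Average bitrate difference (%) of the test curve vs. the anchor,
    over their common PSNR range; negative means bitrate savings."""
    p_a = np.polyfit(psnr_anchor, np.log(rate_anchor), 3)
    p_t = np.polyfit(psnr_test, np.log(rate_test), 3)
    lo = max(min(psnr_anchor), min(psnr_test))
    hi = min(max(psnr_anchor), max(psnr_test))
    avg_a = np.polyval(np.polyint(p_a), hi) - np.polyval(np.polyint(p_a), lo)
    avg_t = np.polyval(np.polyint(p_t), hi) - np.polyval(np.polyint(p_t), lo)
    return (np.exp((avg_t - avg_a) / (hi - lo)) - 1) * 100

# e.g. bd_rate([1000, 1600, 2500, 4000], [32.1, 34.0, 35.8, 37.5],
#              [ 900, 1450, 2300, 3700], [32.3, 34.2, 36.0, 37.7])
```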
During the days of coronavirus, let me take on the challenge of writing 30 stories again for this month. Is it good? This is the 24th story in this month. Thanks for visiting my story!
Reference
[2018 ICIP] [Double-Input CNN]
Enhancing HEVC Compressed Videos with a Partition-masked Convolutional Neural Network
Codec Filtering
JPEG [ARCNN] [RED-Net] [DnCNN] [Li ICME’17] [MemNet] [MWCNN]
HEVC [Lin DCC’16] [IFCNN] [VRCNN] [DCAD] [MMS-net] [DRN] [Lee ICCE’18] [DS-CNN] [RHCNN] [VRCNN-ext] [S-CNN & C-CNN] [MLSDRN] [Double-Input CNN] [Liu PCS’19] [QE-CNN] [EDCNN]
3D-HEVC [RSVE+POST]
AVS3 [Lin PCS’19]
VVC [Lu CVPRW’19] [Wang APSIPA ASC’19]