Reading: CNN-CR — CNN for Image Compact-Resolution (HEVC Intra)

VDSR-Like Network, Outperforms EDSR & Li TCSVT’18

Sik-Ho Tsang
6 min read · Jun 23, 2020

In this story, Learning a Convolutional Neural Network for Image Compact-Resolution (CNN-CR), by the University of Science and Technology of China and the University of Missouri-Kansas City, is presented. I read this because I work on video coding research. In this paper:

  • Image CR provides a low-resolution version of a high-resolution image.
  • Two applications of image CR can be realized, i.e., low-bit-rate image compression and image retargeting.
  • Image/video compression can encode the CR image instead of the full resolution image to save the bitrate.
  • Image retargeting can retarget the CR image to different display devices with higher visual quality.

This is a paper in 2019 TIP, a journal with a high impact factor of 6.79. (Sik-Ho Tsang @ Medium)

Outline

  1. CNN-CR: Loss Function
  2. CNN-CR: Network Architecture
  3. Separate Training & Joint Training & Progressive Training
  4. Application Realizations
  5. Experimental Results
  6. Results for Image Retargeting
  7. Results for Image/Video Compression

1. CNN-CR: Loss Function

  • There is no “ground-truth” for the compact-resolved image, so two loss functions are defined instead.
  • Reconstruction Loss: Denote the original image as x, the mapping function of image CR as f, and the mapping function of up-scaling as g. The reconstruction loss is then L_rec = ‖x − g(f(x))‖².
  • Thus, f and g are learned jointly, i.e. joint learning.
  • Regularization Loss: It is used to ensure the visual quality of the compact-resolved image: the low-resolution image generated by f should be smooth and have no aliasing.
  • Bicubic down-sampling, denoted b, is used as the reference that f should approximate, giving L_reg = ‖f(x) − b(x)‖².
  • Combined Loss: L = L_rec + λ·L_reg,
  • where λ is a parameter that controls the relative weight of the regularization loss.
Different Values of λ
  • λ is set to 0.7, which achieves a good tradeoff between the visual quality of the compact-resolved image and the final reconstruction quality.

With the above loss function, image CR generates a low-resolution image that better preserves the high-frequency components of the original image, so that higher quality is obtained when the CR image is upsampled.
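As a concrete sketch of the combined loss, the snippet below computes L = ‖x − g(f(x))‖² + λ‖f(x) − b(x)‖² in NumPy. The 2× average pooling and nearest-neighbour up-sampling here are hypothetical stand-ins for the learned CR mapping f, the up-scaler g, and bicubic down-sampling b, not the paper's networks:

```python
import numpy as np

def down2(x):
    # Stand-in for the learned CR mapping f: 2x average pooling (hypothetical).
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def up2(x):
    # Stand-in for the up-scaling mapping g: nearest-neighbour 2x (hypothetical).
    return np.repeat(np.repeat(x, 2, axis=0), 2, axis=1)

def mse(a, b):
    return float(np.mean((a - b) ** 2))

def combined_loss(x, f, g, b, lam=0.7):
    y = f(x)            # compact-resolved image
    rec = mse(x, g(y))  # reconstruction loss ||x - g(f(x))||^2
    reg = mse(y, b(x))  # regularization loss ||f(x) - b(x)||^2
    return rec + lam * reg

x = np.arange(16.0).reshape(4, 4)
loss = combined_loss(x, down2, up2, down2, lam=0.7)
```

Since f and b coincide in this toy setup, the regularization term vanishes and the loss reduces to the reconstruction error; with a learned f the two terms pull against each other, which is what λ balances.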

2. CNN-CR: Network Architecture

  • CNN-CR consists of several convolutional layers, all of which except the first and the last are of the same configuration: 64 filters with kernel size 3×3, followed by ReLU.
  • The first layer operates on the input image and serves as a resolution decreasing layer. For example, in the case of 2× down-sizing, the filters in the first layer will be equipped with stride = 2.
  • The last layer generates the compact-resolved image and thus contains a single filter with kernel size 3×3.
  • It is similar to VDSR but with downsampling.
Different Network Depths
  • After trying different network depths, 10 layers are selected.
Different Downsizing
  • Downsizing using convolution with stride 2 is chosen instead of pooling, due to the better performance shown above.
Residual Learning
  • Residual learning is used, which gives more stable training and faster convergence.
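Under the configuration above (10 layers; 64 filters of 3×3 with ReLU; stride 2 in the first layer; a single filter in the last), the feature-map shapes can be traced as follows. This is a shape sketch only, not the trained network:

```python
def conv_out(size, kernel=3, stride=1, pad=1):
    # Spatial output size of a convolution: floor((size + 2*pad - kernel)/stride) + 1
    return (size + 2 * pad - kernel) // stride + 1

def cnn_cr_shapes(h, w, depth=10):
    # Layer 1: stride-2 resolution-decreasing layer with 64 filters.
    shapes = [(64, conv_out(h, stride=2), conv_out(w, stride=2))]
    # Layers 2 .. depth-1: 64 filters, 3x3, stride 1 (shape-preserving).
    for _ in range(depth - 2):
        shapes.append(shapes[-1])
    # Last layer: a single 3x3 filter producing the compact-resolved image.
    _, hh, ww = shapes[-1]
    shapes.append((1, hh, ww))
    return shapes

shapes = cnn_cr_shapes(64, 64)
```

For a 64×64 input, the first layer halves the resolution to 32×32, the middle layers keep it, and the last layer collapses the 64 channels into the single-channel 32×32 compact-resolved output.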

3. Separate Training, Joint Training & Progressive Training

3.1. Separate Training

  • In separate training, the up-scaling function g is fixed to bilinear interpolation, and only CNN-CR is trained.

3.2. Joint Training

  • EDSR is used as CNN-SR, so that CNN-CR and CNN-SR can be trained jointly.

3.3. Progressive Training

  1. First, CNN-SR is trained using Separate Training.
  2. Then, by fixing the parameters of CNN-SR, CNN-CR is trained.
  3. Finally, the entire end-to-end network is fine-tuned.
  • With progressive training, the performance is better than direct training, i.e. training the network end-to-end from the very beginning.
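The three stages above can be summarized as a simple schedule; the stage dictionaries and network names below are illustrative only:

```python
def progressive_schedule():
    # Stage 1: train CNN-SR alone (separate training).
    # Stage 2: freeze CNN-SR, train CNN-CR against the fixed CNN-SR.
    # Stage 3: fine-tune CNN-CR and CNN-SR end-to-end.
    return [
        {"stage": 1, "trainable": {"cnn_sr"}, "frozen": set()},
        {"stage": 2, "trainable": {"cnn_cr"}, "frozen": {"cnn_sr"}},
        {"stage": 3, "trainable": {"cnn_cr", "cnn_sr"}, "frozen": set()},
    ]

schedule = progressive_schedule()
```

In a real training loop, the "frozen" set would translate to disabling gradient updates for those parameters during that stage.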

4. Application Realizations

4.1. Image Retargeting

  • Retargeting in general refers to changing the resolution of an image to suit different display devices.
  • The only remaining issue is how to support an arbitrary output resolution in CNN-CR, which is solved by replacing the first layer of CNN-CR with a differentiable re-sampling layer.
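Such a re-sampling layer can be realized with bilinear interpolation to an arbitrary target size: every output pixel is a linear combination of input pixels, so the operation is differentiable. Below is a minimal NumPy sketch of the forward pass (an assumed illustration, not the paper's implementation):

```python
import numpy as np

def bilinear_resample(img, out_h, out_w):
    # Bilinear interpolation of a 2D image to an arbitrary (out_h, out_w).
    in_h, in_w = img.shape
    ys = np.linspace(0, in_h - 1, out_h)   # source row coordinates
    xs = np.linspace(0, in_w - 1, out_w)   # source column coordinates
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, in_h - 1)
    x1 = np.minimum(x0 + 1, in_w - 1)
    wy = (ys - y0)[:, None]                # vertical interpolation weights
    wx = (xs - x0)[None, :]                # horizontal interpolation weights
    top = img[np.ix_(y0, x0)] * (1 - wx) + img[np.ix_(y0, x1)] * wx
    bot = img[np.ix_(y1, x0)] * (1 - wx) + img[np.ix_(y1, x1)] * wx
    return top * (1 - wy) + bot * wy

small = bilinear_resample(np.full((8, 8), 5.0), 3, 5)
```

Because the weights are fixed linear coefficients for a given target size, gradients flow through this layer to the rest of CNN-CR, which is what allows arbitrary-resolution retargeting to be trained end-to-end.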

4.2. Image/Video Compression

(a) Frame-level down- and up-sampling coding scheme. (b) Block-level adaptive down- and up-sampling coding scheme
  • (a) Frame-level: Whole frame is downsized by CNN-CR and encoded. Then it is super-resolved by CNN-SR.
  • (b) Block-level: the CTU is used as the basic unit.
  • First, each CTU can be either down-sampled and coded, or directly coded at native resolution.
  • Second, if coded at low resolution, either CNN-CR or simple down-sampling filter can be used for down-sizing.
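The block-level scheme thus amounts to a per-CTU rate-distortion decision among three modes. A toy sketch follows, where the Lagrangian cost J = D + λR stands in for the encoder's RDO and the per-mode numbers are made up for illustration:

```python
def rd_cost(distortion, rate, lam):
    # Lagrangian rate-distortion cost: J = D + lambda * R
    return distortion + lam * rate

def choose_ctu_mode(candidates, lam):
    # candidates: {mode_name: (distortion, rate)}; pick the minimum-cost mode.
    return min(candidates, key=lambda m: rd_cost(*candidates[m], lam))

# Hypothetical per-CTU measurements (not from the paper):
ctu = {
    "native":      (10.0, 100.0),  # code directly at full resolution
    "cnn_cr_down": (14.0,  40.0),  # down-sample with CNN-CR, then code
    "filter_down": (18.0,  38.0),  # down-sample with a simple filter, then code
}
best = choose_ctu_mode(ctu, lam=0.2)
```

At small λ (rate is cheap) native coding wins; at larger λ the down-sampling modes take over, which mirrors why the block-level scheme stays competitive across the whole bitrate range.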

5. Experimental Results

  • The whole DIV2K dataset is used for training.

5.1. PSNR for CNN-CRSep (Sep means Separate Training)

PSNR on Test Sets
  • CNN-CRSep outperforms bicubic down-sampling, and achieves on average 1.25 dB improvement.
  • CNN-CRSep + bicubic up-sampling also performs better than bicubic down-sampling + bicubic up-sampling. This shows that the preserved information introduced by CNN-CRSep can boost the reconstruction quality.
  • (d): Bicubic down-sampling then bilinear up-sampling.
  • (e): CNN-CRSep then bilinear up-sampling, which yields a sharper image.

5.2. PSNR for CNN-CRJoint (Joint means Joint Training)

PSNR on Test Sets
  • CNN-CRJoint plus CNN-SR can outperform the EDSR one [4] by a considerable margin.

6. Results for Image Retargeting

CNN-CRJoint is compared against seam carving and bicubic down-sampling in a subjective study.
  • 30 subjects participate.
  • 5 discrete levels of scores are given: −2, −1, 0, 1, 2, standing for worse, slightly worse, indistinguishable, slightly better, and better, respectively.
  • CNN-CRJoint obtains higher scores than the representative retargeting method Seam Carving.

7. Results for Image/Video Compression

7.1. RD Curves

  • HM-12.1 is used.
  • Both frame-level and block-level approaches work well at low bitrates.
  • Frame-level one has large PSNR drop at high bitrate while block-level one still can maintain the coding performance at high bitrate due to adaptive mode switching based on rate distortion optimization (RDO).

7.2. BD-Rate

  • Frame-level method brings on average 7.0% and 3.1% BD-rate reduction for HEVC and UHD test sequences, respectively.
  • Block-level method brings on average 6.9% and 10.4% BD-rate reduction for HEVC and UHD test sequences, respectively.
  • Both outperform Li TCSVT’18 [41].

7.3. Hitting Ratios

  • A considerable portion of CTUs choose the proposed down-sampling modes for encoding.

7.4. Computational Complexity

Computational Complexity Using GPU
  • Even with a GPU, the encoding/decoding time is still increased by a large amount.

This is the 33rd story in this month!
