Reading: CNN-CR — CNN for Image Compact-Resolution (HEVC Intra)

VDSR-Like Network, Outperforms EDSR & Li TCSVT’18

Sik-Ho Tsang
6 min read · Jun 23, 2020

In this story, Learning a Convolutional Neural Network for Image Compact-Resolution (CNN-CR), by the University of Science and Technology of China and the University of Missouri-Kansas City, is presented. I read this because I work on video coding research. In this paper:

  • Image CR provides a low-resolution version of a high-resolution image.
  • Two applications of image CR can be realized, i.e., low-bit-rate image compression and image retargeting.
  • Image/video compression can encode the CR image instead of the full resolution image to save the bitrate.
  • Image retargeting can retarget the CR image to different display devices with higher visual quality.

This is a paper in 2019 TIP, a journal with a high impact factor of 6.79. (Sik-Ho Tsang @ Medium)

Outline

  1. CNN-CR: Loss Function
  2. CNN-CR: Network Architecture
  3. Separate Training & Joint Training & Progressive Training
  4. Application Realizations
  5. Experimental Results
  6. Results for Image Retargeting
  7. Results for Image/Video Compression

1. CNN-CR: Loss Function

  • There is no “ground-truth” for the compact-resolved image, so two loss functions are defined instead.
  • Reconstruction Loss: Denote the original image as x, the mapping function of image CR as f, and the mapping function of up-scaling as g. The reconstruction loss is then L_rec = ‖x − g(f(x))‖².
  • Thus, f and g are learned jointly, i.e. joint learning.
  • Regularization Loss: It is used to ensure the visual quality of the compact-resolved image: the low-resolution image generated by f should be smooth and have no aliasing.
  • Bicubic down-sampling, denoted b, is used as the reference that f should approximate, giving L_reg = ‖f(x) − b(x)‖².
  • Combined Loss: L = L_rec + λ·L_reg,
  • where λ is a parameter that controls the relative weight of the regularization loss.
Different Values of λ
  • λ is set to 0.7, which achieves a good tradeoff between the visual quality of the compact-resolved image and the final reconstruction quality.

With the above loss function, image CR generates a low-resolution image that better preserves the high-frequency components of the original image, so that higher quality is obtained when the CR image is upsampled.
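As a concrete sketch of the combined loss, the snippet below computes L = ‖x − g(f(x))‖² + λ‖f(x) − b(x)‖² in NumPy. The 2× average pooling and nearest-neighbour up-sampling here are hypothetical stand-ins for the learned CR mapping f, the up-scaler g, and bicubic down-sampling b, not the paper's networks:

```python
import numpy as np

def down2(x):
    # Stand-in for the learned CR mapping f: 2x average pooling (hypothetical).
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def up2(x):
    # Stand-in for the up-scaling mapping g: nearest-neighbour 2x (hypothetical).
    return np.repeat(np.repeat(x, 2, axis=0), 2, axis=1)

def mse(a, b):
    return float(np.mean((a - b) ** 2))

def combined_loss(x, f, g, b, lam=0.7):
    y = f(x)            # compact-resolved image
    rec = mse(x, g(y))  # reconstruction loss ||x - g(f(x))||^2
    reg = mse(y, b(x))  # regularization loss ||f(x) - b(x)||^2
    return rec + lam * reg

x = np.arange(16.0).reshape(4, 4)
loss = combined_loss(x, down2, up2, down2, lam=0.7)
```

Since f and b coincide in this toy setup, the regularization term vanishes and the loss reduces to the reconstruction error; with a learned f the two terms pull against each other, which is what λ balances.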

2. CNN-CR: Network Architecture

  • CNN-CR consists of several convolutional layers, all of which except the first and the last are of the same configuration: 64 filters with kernel size 3×3, followed by ReLU.
  • The first layer operates on the input image and serves as a resolution decreasing layer. For example, in the case of 2× down-sizing, the filters in the first layer will be equipped with stride = 2.
  • The last layer generates the compact-resolved image and thus contains a single filter with kernel size 3×3.
  • It is similar to VDSR but with downsampling.
Different Network Depths
  • After trying different network depths, 10 layers are selected.
Different Downsizing
  • Downsizing using convolution with stride 2 is chosen instead of pooling, due to the better performance shown above.
Residual Learning
  • Residual learning is used, which gives more stable training and faster convergence.
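Under the configuration above (10 layers; 64 filters of 3×3 with ReLU; stride 2 in the first layer; a single filter in the last), the feature-map shapes can be traced as follows. This is a shape sketch only, not the trained network:

```python
def conv_out(size, kernel=3, stride=1, pad=1):
    # Spatial output size of a convolution: floor((size + 2*pad - kernel)/stride) + 1
    return (size + 2 * pad - kernel) // stride + 1

def cnn_cr_shapes(h, w, depth=10):
    # Layer 1: stride-2 resolution-decreasing layer with 64 filters.
    shapes = [(64, conv_out(h, stride=2), conv_out(w, stride=2))]
    # Layers 2 .. depth-1: 64 filters, 3x3, stride 1 (shape-preserving).
    for _ in range(depth - 2):
        shapes.append(shapes[-1])
    # Last layer: a single 3x3 filter producing the compact-resolved image.
    _, hh, ww = shapes[-1]
    shapes.append((1, hh, ww))
    return shapes

shapes = cnn_cr_shapes(64, 64)
```

For a 64×64 input, the first layer halves the resolution to 32×32, the middle layers keep it, and the last layer collapses the 64 channels into the single-channel 32×32 compact-resolved output.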

3. Separate Training, Joint Training & Progressive Training

3.1. Separate Training

  • In separate training, the up-scaling function g is fixed to bilinear interpolation, and only CNN-CR is trained.

3.2. Joint Training

  • EDSR is used as CNN-SR, so that CNN-CR and CNN-SR can be trained jointly.

3.3. Progressive Training

  1. First, CNN-SR is trained using Separate Training.
  2. Then, by fixing the parameters of CNN-SR, CNN-CR is trained.
  3. Finally, the entire end-to-end network is fine-tuned.
  • With progressive training, the performance is better than direct training, i.e. training the network end-to-end from the very beginning.
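The three stages above can be summarized as a simple schedule; the stage dictionaries and network names below are illustrative only:

```python
def progressive_schedule():
    # Stage 1: train CNN-SR alone (separate training).
    # Stage 2: freeze CNN-SR, train CNN-CR against the fixed CNN-SR.
    # Stage 3: fine-tune CNN-CR and CNN-SR end-to-end.
    return [
        {"stage": 1, "trainable": {"cnn_sr"}, "frozen": set()},
        {"stage": 2, "trainable": {"cnn_cr"}, "frozen": {"cnn_sr"}},
        {"stage": 3, "trainable": {"cnn_cr", "cnn_sr"}, "frozen": set()},
    ]

schedule = progressive_schedule()
```

In a real training loop, the "frozen" set would translate to disabling gradient updates for those parameters during that stage.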

4. Application Realizations

4.1. Image Retargeting

  • Retargeting in general refers to changing the resolution of an image to suit different display devices.
  • The only remaining issue is how to support an arbitrary output resolution in CNN-CR, which is solved by replacing the first layer of CNN-CR with a differentiable re-sampling layer.
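Such a re-sampling layer can be realized with bilinear interpolation to an arbitrary target size: every output pixel is a linear combination of input pixels, so the operation is differentiable. Below is a minimal NumPy sketch of the forward pass (an assumed illustration, not the paper's implementation):

```python
import numpy as np

def bilinear_resample(img, out_h, out_w):
    # Bilinear interpolation of a 2D image to an arbitrary (out_h, out_w).
    in_h, in_w = img.shape
    ys = np.linspace(0, in_h - 1, out_h)   # source row coordinates
    xs = np.linspace(0, in_w - 1, out_w)   # source column coordinates
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, in_h - 1)
    x1 = np.minimum(x0 + 1, in_w - 1)
    wy = (ys - y0)[:, None]                # vertical interpolation weights
    wx = (xs - x0)[None, :]                # horizontal interpolation weights
    top = img[np.ix_(y0, x0)] * (1 - wx) + img[np.ix_(y0, x1)] * wx
    bot = img[np.ix_(y1, x0)] * (1 - wx) + img[np.ix_(y1, x1)] * wx
    return top * (1 - wy) + bot * wy

small = bilinear_resample(np.full((8, 8), 5.0), 3, 5)
```

Because the weights are fixed linear coefficients for a given target size, gradients flow through this layer to the rest of CNN-CR, which is what allows arbitrary-resolution retargeting to be trained end-to-end.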

4.2. Image/Video Compression

(a) Frame-level down- and up-sampling coding scheme. (b) Block-level adaptive down- and up-sampling coding scheme
  • (a) Frame-level: Whole frame is downsized by CNN-CR and encoded. Then it is super-resolved by CNN-SR.
  • (b) Block-level: the CTU is used as the basic unit.
  • First, each CTU can be either down-sampled and coded, or directly coded at native resolution.
  • Second, if coded at low resolution, either CNN-CR or simple down-sampling filter can be used for down-sizing.
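The block-level scheme thus amounts to a per-CTU rate-distortion decision among three modes. A toy sketch follows, where the Lagrangian cost J = D + λR stands in for the encoder's RDO and the per-mode numbers are made up for illustration:

```python
def rd_cost(distortion, rate, lam):
    # Lagrangian rate-distortion cost: J = D + lambda * R
    return distortion + lam * rate

def choose_ctu_mode(candidates, lam):
    # candidates: {mode_name: (distortion, rate)}; pick the minimum-cost mode.
    return min(candidates, key=lambda m: rd_cost(*candidates[m], lam))

# Hypothetical per-CTU measurements (not from the paper):
ctu = {
    "native":      (10.0, 100.0),  # code directly at full resolution
    "cnn_cr_down": (14.0,  40.0),  # down-sample with CNN-CR, then code
    "filter_down": (18.0,  38.0),  # down-sample with a simple filter, then code
}
best = choose_ctu_mode(ctu, lam=0.2)
```

At small λ (rate is cheap) native coding wins; at larger λ the down-sampling modes take over, which mirrors why the block-level scheme stays competitive across the whole bitrate range.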

5. Experimental Results

  • The whole DIV2K dataset is used for training.

5.1. PSNR for CNN-CRSep (Sep means Separate Training)

PSNR on Test Sets
  • CNN-CRSep outperforms bicubic down-sampling, and achieves on average 1.25 dB improvement.
  • CNN-CRSep + bicubic up-sampling also performs better than bicubic down-sampling + bicubic up-sampling. This shows that the preserved information introduced by CNN-CRSep can boost the reconstruction quality.
  • (d): Bicubic down-sampling then bilinear up-sampling.
  • (e): CNN-CRSep then bilinear up-sampling, which yields a sharper image.

5.2. PSNR for CNN-CRJoint (Joint means Joint Training)

PSNR on Test Sets
  • CNN-CRJoint plus CNN-SR can outperform the EDSR one [4] by a considerable margin.

6. Results for Image Retargeting

CNN-CRJoint is compared against seam carving and bicubic down-sampling in a subjective study.
  • 30 subjects participate.
  • 5 discrete levels of scores are given: −2, −1, 0, 1, 2, standing for worse, slightly worse, indistinguishable, slightly better, and better, respectively.
  • CNN-CRJoint obtains higher scores than the representative retargeting method Seam Carving.

7. Results for Image/Video Compression

7.1. RD Curves

  • HM-12.1 is used.
  • Both frame-level and block-level approaches work well at low bitrates.
  • Frame-level one has large PSNR drop at high bitrate while block-level one still can maintain the coding performance at high bitrate due to adaptive mode switching based on rate distortion optimization (RDO).

7.2. BD-Rate

  • Frame-level method brings on average 7.0% and 3.1% BD-rate reduction for HEVC and UHD test sequences, respectively.
  • Block-level method brings on average 6.9% and 10.4% BD-rate reduction for HEVC and UHD test sequences, respectively.
  • Both outperform Li TCSVT’18 [41].

7.3. Hitting Ratios

  • A considerable portion of CTUs choose the proposed down-sampling modes for encoding.

7.4. Computational Complexity

Computational Complexity Using GPU
  • Even with a GPU, the encoding/decoding time is still increased by a large amount.

This is the 33rd story in this month!
