Reading: Klopp TIP’20 — Low Complexity CNNs to Lift Non-Local Redundancies (HEVC Inter)

Outperforms Jia TIP’19, with up to 6.8% coding gain on Chroma and up to 14.4% on Luma

Sik-Ho Tsang
3 min read · Jul 11, 2020

In this story, Low Complexity CNNs to Lift Non-Local Redundancies in Video Coding (Klopp TIP’20), by National Taiwan University, is briefly presented. I read this because I work on video coding research. In this paper:

  • CNN is designed with a particular emphasis on low memory and computational footprint.
  • The parameters of those networks are trained on the fly, at encoding time, to predict the residual signal from the decoded video signal.
  • The model is then quantized, compressed, and sent to the decoder as well.
  • Therefore, it is a kind of online learning scheme in video coding.

This is a paper in 2020 TIP, where TIP has a high impact factor of 6.79. Here, I briefly talk about its concept. (Sik-Ho Tsang @ Medium).

Outline

  1. Overall Scheme
  2. Network Architecture
  3. Experimental Results
1. Overall Scheme
  • The steps are numbered as shown in the figure:
  1. The video signal is split into groups of pictures (GoP), the residuals of which are jointly predicted by a CNN that is trained on the fly.
  2. The CNN parameters are quantised.
  3. And the resulting CNN is tested for coding gains on the GoP.
  4. If the test is positive, its parameters are compressed before they are added to the bit stream of the underlying video codec. The dashed arrows/boxes indicate data transfer/operations that are only carried out in streaming scenarios where access to data signalled for previous frames is granted at the decoder.
  5. In such a streaming scenario, previously signalled parameters are first tested on the following GoP,
  6. before fine-tuning on that GoP commences.
  7. Quantisation of those fine-tuned parameters is followed by
  8. another test to check whether higher gains can be achieved.
  9. If this is the case, the difference between new and old parameters is compressed and added to the bit stream.
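The per-GoP decision logic in steps 1–4 above can be sketched as a toy loop: train on the GoP, quantise the weights, keep them only if the measured gain is positive. All helper functions here are illustrative stand-ins of my own, not the paper's actual components or API.

```python
def quantise(params, step=0.05):
    # uniform scalar quantisation of the CNN weights (step is illustrative)
    return [round(p / step) * step for p in params]

def coding_gain(params, gop):
    # stand-in metric: pretend gain is bitrate saved by the CNN
    # minus the overhead of signalling its parameters
    saving = sum(abs(p) for p in params) * len(gop)
    overhead = len(params)
    return saving - overhead

def encode_gop(gop, trained_params):
    # steps 2-4: quantise, test on the GoP, signal only if beneficial
    q = quantise(trained_params)
    if coding_gain(q, gop) > 0:
        return q   # would be compressed and added to the bit stream
    return None    # CNN is skipped for this GoP

params = [0.12, -0.31, 0.07]
print(encode_gop(list(range(8)), params))
```

In the streaming case (steps 5–9), the same test would first be run with the previously signalled parameters, and only the parameter difference would be compressed.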

2. Network Architecture

  • A MobileNet-style architecture, i.e. depthwise-separable convolutions, is used to factorise the convolutional layers. A very shallow network is used since the model needs to be compressed.
Pixel Packing
  • Each patch of size PH×PW of the input image is rearranged into a vector with PH·PW elements. PH=PW=1 means no packing.
  • Batch normalization is used.
  • During training, 32-bit floating point format is used.
  • During testing, the weights are quantized.
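Pixel packing as described above is essentially a space-to-depth rearrangement; a minimal NumPy sketch of my own (not the paper's code):

```python
import numpy as np

def pack_pixels(img, ph, pw):
    # Rearrange each ph x pw patch of an (H, W) image into a vector of
    # ph*pw channels, giving an array of shape (H//ph, W//pw, ph*pw).
    h, w = img.shape
    assert h % ph == 0 and w % pw == 0
    return (img.reshape(h // ph, ph, w // pw, pw)
               .transpose(0, 2, 1, 3)
               .reshape(h // ph, w // pw, ph * pw))

img = np.arange(16).reshape(4, 4)
packed = pack_pixels(img, 2, 2)
print(packed.shape)   # (2, 2, 4)
print(packed[0, 0])   # pixels of the top-left 2x2 patch: [0 1 4 5]
```

Packing reduces the spatial resolution the network operates on, which lowers the computational footprint per pixel.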

3. Experimental Results

SOTA Comparison
  • The proposed approach outperforms Jia TIP’19 [37] by a large margin in BD-rate gain. (I don’t know why they compare with a loop-filter approach, though both improve the coding gain.)
  • (There are a lot of ablation experiments, such as GoP size, pixel packing, and different configurations, in the paper. Please feel free to read the paper.)

This is the 6th story in this month.

Written by Sik-Ho Tsang

PhD, Researcher. I share what I learn. :) Linktree: https://linktr.ee/shtsang for Twitter, LinkedIn, etc.
