Reading: Klopp TIP’20 — Low Complexity CNNs to Lift Non-Local Redundancies (HEVC Inter)
Outperforms Jia TIP’19, with up to 6.8% coding gain on Chroma and up to 14.4% on Luma
In this story, Low Complexity CNNs to Lift Non-Local Redundancies in Video Coding (Klopp TIP’20), by National Taiwan University, is briefly presented. I read this because I work on video coding research. In this paper:
- The CNN is designed with a particular emphasis on a low memory and computational footprint.
- The parameters of those networks are trained on the fly, at encoding time, to predict the residual signal from the decoded video signal.
- The model is then quantized, compressed, and sent to the decoder as well.
- Therefore, it is a kind of online learning scheme in video coding.
This is a paper in 2020 TIP, where TIP has a high impact factor of 6.79. I just briefly talk about its concept here. (Sik-Ho Tsang @ Medium).
Outline
- Overall Scheme
- Network Architecture
- Experimental Results
1. Overall Scheme
- The steps are numbered as shown in the figure (a sketch of the encoding loop follows this list):
- The video signal is split into groups of pictures (GoP), the residuals of which are jointly predicted by a CNN that is trained on the fly.
- The CNN parameters are quantised.
- The resulting CNN is then tested for coding gains on the GoP.
- If the test is positive, its parameters are compressed before they are added to the bit stream of the underlying video codec. The dashed arrows/boxes indicate data transfer/operations that are only carried out in streaming scenarios where access to data signalled for previous frames is granted at the decoder.
- In such a streaming scenario, previously signalled parameters are first tested on the following GoP,
- before fine-tuning on that GoP commences.
- Quantisation of those fine-tuned parameters is followed by
- another test to check whether higher gains can be achieved.
- If this is the case, the difference between new and old parameters is compressed and added to the bit stream.
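To make the control flow concrete, here is a minimal Python sketch of the per-GoP encoding loop described by the steps above. All helper names (train_cnn, quantise, coding_gain, compress_params) and their dummy return values are my hypothetical stubs, not the paper’s code; the actual training, rate-distortion test, and entropy coder are left as placeholders.

```python
import numpy as np

# --- Hypothetical helpers (illustrative stubs, not the paper's code) ---

def train_cnn(init_params, decoded_gop, residual_gop):
    """Train (or fine-tune) the residual-prediction CNN on this GoP.
    Stub: returns zero weights / the previous weights unchanged."""
    if init_params is None:
        return np.zeros(1000, dtype=np.float32)
    return init_params.astype(np.float32)

def quantise(params):
    """Uniform quantisation of the float weights, e.g. to 8 bits."""
    scale = np.abs(params).max() / 127.0 + 1e-12
    return np.round(params / scale).astype(np.int8), scale

def coding_gain(q_params, scale, decoded_gop, residual_gop, overhead_bits):
    """Rate-distortion test: bits saved by the CNN's residual prediction
    minus the bits spent on signalling its parameters. Stub: dummy value."""
    return 0.0

def compress_params(q_params, reference=None):
    """Entropy-code the quantised parameters; in the streaming case only
    the difference to the previously signalled parameters is coded."""
    if reference is None:
        return q_params.tobytes()
    delta = q_params.astype(np.int16) - reference.astype(np.int16)
    return delta.tobytes()

# --- Per-GoP encoding loop, following the numbered steps above ---

def encode_gop(decoded_gop, residual_gop, prev_q_params=None, prev_scale=None):
    # Streaming scenario: first test the previously signalled parameters.
    baseline = 0.0
    if prev_q_params is not None:
        baseline = coding_gain(prev_q_params, prev_scale,
                               decoded_gop, residual_gop, overhead_bits=0)
    # Train / fine-tune on the fly, then quantise the parameters.
    params = train_cnn(prev_q_params, decoded_gop, residual_gop)
    q_params, scale = quantise(params)
    payload = compress_params(q_params, reference=prev_q_params)
    gain = coding_gain(q_params, scale, decoded_gop, residual_gop,
                       overhead_bits=8 * len(payload))
    # Signal the (difference of) parameters only if the test is positive.
    if gain > baseline:
        return payload, q_params, scale
    return b"", prev_q_params, prev_scale
```

The decisive detail is the test: the parameters (or, in streaming, their difference to the previously signalled ones) enter the bit stream only when the measured coding gain on the GoP outweighs their signalling overhead.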
2. Network Architecture
- A MobileNet-like architecture is used to factorise the convolutional layers into depthwise and pointwise convolutions. A very shallow network is used since the model needs to be compressed and signalled (a sketch follows this list).
- Each patch of size PH×PW of the input image is rearranged as a vector with PH·PW elements (pixel packing). PH=PW=1 means no packing.
- Batch normalization is used.
- During training, 32-bit floating point format is used.
- During testing, the weights are quantized.
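To make this concrete, here is a minimal PyTorch sketch of such a network: one depthwise-separable layer with batch normalisation, plus pixel packing implemented as space-to-depth. The class name, depth, channel width, and packing factor p = PH = PW are my illustrative assumptions, not the paper’s exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualPredictor(nn.Module):
    """Shallow depthwise-separable CNN. Depth, channel width, and the
    packing factor here are illustrative, not the paper's exact values."""
    def __init__(self, p=2, width=16):
        super().__init__()
        self.p = p                       # packing factor, p = PH = PW
        in_ch = p * p                    # one luma plane packed space-to-depth
        self.dw = nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch, bias=False)
        self.pw = nn.Conv2d(in_ch, width, 1, bias=False)
        self.bn = nn.BatchNorm2d(width)  # batch normalisation, as in the paper
        self.out = nn.Conv2d(width, p * p, 1)

    def forward(self, x):
        # Pixel packing: each p×p patch becomes p*p channels, shrinking
        # the spatial grid the convolutions have to process.
        x = F.pixel_unshuffle(x, self.p)
        y = F.relu(self.bn(self.pw(self.dw(x))))  # depthwise + pointwise + BN
        y = self.out(y)                           # predict the packed residual
        return F.pixel_shuffle(y, self.p)         # unpack to the input size

# Usage: predict the residual signal for a decoded luma frame.
frame = torch.rand(1, 1, 64, 64)                  # decoded signal in [0, 1]
residual_hat = ResidualPredictor(p=2)(frame)
print(residual_hat.shape)                         # torch.Size([1, 1, 64, 64])
```

With p = 1 the pixel shuffles are no-ops, matching the PH=PW=1 (no packing) case above.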
3. Experimental Results
- The proposed approach outperforms Jia TIP’19 [37] in BD-rate gain by a large margin. (I am not sure why they compare with a loop-filter approach, though both improve the coding gain.)
- (There are a lot of ablation experiments in the paper, such as GoP size, pixel packing, and different configurations. Please feel free to read the paper.)
This is the 6th story this month.
References
[2020 TIP] [Klopp TIP’20]
Utilising Low Complexity CNNs to Lift Non-Local Redundancies in Video Coding
Paper Website: https://video.ee.ntu.edu.tw/cnnvc/
Codec Inter Prediction
H.264 [DRNFRUC & DRNWCMC]
HEVC [CNNIF] [Zhang VCIP’17] [NNIP] [GVTCNN] [Ibrahim ISM’18] [VC-LAPGAN] [VI-CNN] [CNNMCR] [FRUC+DVRF] [FRUC+DVRF+VECNN] [RSR] [Zhao ISCAS’18 & TCSVT’19] [Ma ISCAS’19] [Xia ISCAS’19] [Zhang ICIP’19] [ES] [GVCNN] [FRCNN] [Pham ACCESS’19] [CNNInvIF / InvIF] [CNN-SR & CNN-UniSR & CNN-BiSR] [DeepFrame] [U+DVPN] [Multi-Scale CNN] [Klopp TIP’20]
VVC [FRUC+DVRF+VECNN] [ScratchCNN]