Reading: Puri EUSIPCO’17 — CNN-Based Transform Index Prediction (HEVC Intra Prediction)

LeNet-Like Network, 0.2% Average BD-Rate Gain & Up to 0.59% BD-Rate Gain

Sik-Ho Tsang
5 min read · Jun 12, 2020

In this story, “CNN-Based Transform Index Prediction in Multiple Transforms Framework to Assist Entropy Coding” (Puri EUSIPCO’17), by Technicolor, Université de Nantes, and IRCCyN, is briefly presented. I read this because I work on video coding research.

  • Since the codec (HEVC extended with multiple transforms) can choose among several transforms, a transform index must be encoded to indicate which one is used.
  • In this paper, a CNN is used to predict the most probable transform so as to reduce the bits spent on coding the transform index.

This is a paper in 2017 EUSIPCO. (Sik-Ho Tsang @ Medium)

Outline

  1. Conventional Transform Index Coding
  2. Proposed CNN-Based Transform Index Coding
  3. Experimental Results

1. Conventional Transform Index Coding

1.1. MDTC [5]

  • In [5], a multiple transform competition scheme (MDTC) is proposed.
  • A transform is selected among all transforms based on the minimum rate-distortion (RD) cost during RD optimization.
  • A transform index is coded to signal the choice amongst N+1 transforms to the decoder, so that the block can be reconstructed properly.
  • This is done by first coding a flag that indicates whether the DCT/DST transform is used or not.
  • If the flag indicates that it is not, one of the offline learned transforms is used and its index is coded with a fixed length code.
  • This scheme clearly favors the DCT/DST as it requires fewer bits to encode.
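
The flag-then-index signaling described above can be sketched as follows. This is only a minimal illustration, not the binarization from [5]: the function name `mdtc_bins`, the flag polarity ('0' for DCT/DST) and the index width of ceil(log2(N)) bits are assumptions made here.

```python
from math import ceil, log2

def mdtc_bins(index: int, n_learned: int) -> str:
    """Hypothetical MDTC-style binarization: a flag, then a fixed length index.

    index 0 stands for DCT/DST, indices 1..n_learned for the learned transforms.
    """
    if not 0 <= index <= n_learned:
        raise ValueError("index out of range")
    if index == 0:
        return "0"  # flag only (assumed polarity): DCT/DST is used
    # Assumed width: enough bits to tell n_learned learned transforms apart.
    width = ceil(log2(n_learned)) if n_learned > 1 else 0
    suffix = format(index - 1, f"0{width}b") if width else ""
    return "1" + suffix  # flag says "not DCT/DST", then the fixed length index

for idx in range(4):
    print(idx, mdtc_bins(idx, n_learned=3))
```

With N=3, DCT/DST costs a single bin while every learned transform costs three, which is the bias towards DCT/DST mentioned in the last bullet.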

1.2. Fixed Length Coding

  • An alternative way of signaling the transform choice is to directly binarize the transform index with a fixed length code, which indicates the N+1 transform candidates on b bits, where b = ⌈log2(N+1)⌉.
  • These bits are entropy coded using CABAC.
  • No preference is given to DCT/DST.
  • It is used as the baseline for comparison in the experimental results section.
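
As a quick check of the bit count, here is a small sketch that assumes the usual fixed length width b = ceil(log2(N+1)); the helper names `fixed_length_bits` and `fixed_length_bins` are made up for this example.

```python
from math import ceil, log2

def fixed_length_bits(n_learned: int) -> int:
    """Bits needed to index N+1 candidates (DCT/DST plus N learned transforms)."""
    return ceil(log2(n_learned + 1))

def fixed_length_bins(index: int, n_learned: int) -> str:
    """Binarize the transform index on b bits, most significant bit first."""
    b = fixed_length_bits(n_learned)
    return format(index, f"0{b}b")

print(fixed_length_bits(1), fixed_length_bits(3))  # the two settings tested later: N=1 and N=3
print([fixed_length_bins(i, 3) for i in range(4)])
```

Every candidate gets the same number of bins here, so DCT/DST is no longer cheaper to signal than a learned transform.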

2. Proposed CNN-Based Transform Index Coding

2.1. Overall Scheme

Overall Scheme
  • A 4×4 luma residual block X is input.
  • There are N+1 candidate transforms, T0 to TN.
  • T0 is DST while others are offline learned transforms (T1 to TN).
  • The block X is transformed by each candidate Ti and then quantized (Q).
  • The quantized transformed coefficients are input into CNN.
  • The CNN outputs a vector p of probabilities, where entry i is the predicted probability that transform index i was used.
  • The vector p is used to build a truncated unary code: the transform indices are sorted in decreasing order of probability, so that the most probable index costs the minimum of 1 bit and the least probable index costs the maximum of N bits.
  • For example, when N=3, there are 4 transforms. Suppose T2 is selected, and the CNN output is [0.15, 0.1, 0.45, 0.30].
  • T2 has the highest probability, so with the truncated unary code, 1 bit of ‘0’ is coded for the transform index.
  • As another example, suppose T0 is selected, and the CNN output is [0.30, 0.1, 0.45, 0.15]. T0 is now the second most probable transform.
  • Then, 2 bits of ‘10’ are coded for the transform index.
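
The two examples above can be reproduced with a short sketch. The function name `truncated_unary_bins` is hypothetical, and breaking ties by the lower index is an assumption, since the story does not say how equal probabilities are ordered.

```python
def truncated_unary_bins(probs, selected):
    """Code the selected transform index from its rank in the CNN output.

    The rank is the position of the selected index after sorting by decreasing
    probability. Rank r is written as r ones followed by a zero; the last rank
    drops the closing zero, so the longest code is N bits.
    """
    # sorted() is stable, so equal probabilities keep the lower index first.
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    rank = order.index(selected)
    last_rank = len(probs) - 1
    return "1" * rank + ("0" if rank < last_rank else "")

print(truncated_unary_bins([0.15, 0.1, 0.45, 0.30], selected=2))  # first example
print(truncated_unary_bins([0.30, 0.1, 0.45, 0.15], selected=0))  # second example
```

The better the CNN ranks the transform that was actually chosen by the encoder, the fewer bits are spent on the index.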

2.2. Network Architecture

Network Architecture
  • A 4×4 coefficient block is the input.
  • The first convolutional layer takes the 4×4 coefficient block and passes it through 32 filters of size 2×2 with a stride of one.
  • The second convolutional layer operates on the output of the first layer, using 64 filters of size 2×2 with a stride of one.
  • A max-pooling layer is used to reduce the size to 2×2×64.
  • This is then fed to a fully connected layer with 36 perceptrons.
  • The final softmax layer outputs the probabilities.
  • Keras is used.
Detailed Architecture
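
To see how the tensor sizes fit together, here is a rough shape and parameter trace in plain Python. Several things are assumptions on my side: 'same' padding in both convolutions and a 2×2 pooling window (otherwise the 2×2×64 size after pooling would not come out), a single input channel, and a softmax layer with N+1 outputs. The helper names are made up, and this is not the authors' Keras code.

```python
def conv2d_same(shape, filters, kernel):
    """Stride-1 convolution with 'same' padding (assumed): spatial size is kept."""
    h, w, c = shape
    params = kernel * kernel * c * filters + filters  # weights + biases
    return (h, w, filters), params

def max_pool(shape, pool):
    """Non-overlapping max-pooling: divides height and width by the pool size."""
    h, w, c = shape
    return (h // pool, w // pool, c), 0

def dense(shape, units):
    """Fully connected layer on the flattened input."""
    n_in = 1
    for d in shape:
        n_in *= d
    return (units,), n_in * units + units

def trace_network(num_transforms):
    shape = (4, 4, 1)  # one 4x4 block of quantized coefficients
    steps = [
        ("conv1 32@2x2", conv2d_same, (32, 2)),
        ("conv2 64@2x2", conv2d_same, (64, 2)),
        ("max-pool 2x2", max_pool, (2,)),  # assumed 2x2 window
        ("dense 36", dense, (36,)),
        ("softmax", dense, (num_transforms,)),  # assumed N+1 outputs
    ]
    trace = []
    for name, layer, args in steps:
        shape, params = layer(shape, *args)
        trace.append((name, shape, params))
    return trace

for name, shape, params in trace_network(num_transforms=4):
    print(f"{name:14s} -> {shape}, {params} parameters")
```

Under these assumptions the network is tiny, which fits the LeNet-like description in the title.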

3. Experimental Results

3.1. Training

  • HM-15.0 is used with All-Intra configuration.
  • Training Set: Zurich Building dataset [16] which contains over 1000 images in PNG format that are converted to a YUV format of resolution 640×480.
  • Only coefficient blocks with at least 3 non-zero coefficients are considered.
  • The coefficient blocks where the above and left samples are not available are not taken into account.
  • Imbalanced classes are avoided by manually balancing the number of coefficient blocks in each class.
  • Four CNN-models are trained on the four major intra-prediction modes (IPM), namely DC, Planar, Vertical and Horizontal.
  • A batch size of 32 is used and the number of passes over the data set (epochs) is set to 20.
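
The sample selection above could look roughly like this. It is only a sketch: the story says the classes are balanced manually without giving details, so the random undersampling to the smallest class is my assumption, and the names `keep_block` and `balance` are hypothetical.

```python
import random
from collections import defaultdict

def keep_block(coeffs, min_nonzero=3):
    """Keep a coefficient block only if it has at least min_nonzero non-zero values."""
    return sum(1 for row in coeffs for c in row if c != 0) >= min_nonzero

def balance(samples, seed=0):
    """Filter (block, transform_index) pairs, then undersample every class
    to the size of the smallest one (assumed balancing strategy)."""
    by_label = defaultdict(list)
    for block, label in samples:
        if keep_block(block):
            by_label[label].append(block)
    if not by_label:
        return []
    n = min(len(blocks) for blocks in by_label.values())
    rng = random.Random(seed)
    balanced = []
    for label, blocks in sorted(by_label.items()):
        balanced.extend((block, label) for block in rng.sample(blocks, n))
    rng.shuffle(balanced)
    return balanced

dense_block = [[5, -2, 1, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
sparse_block = [[1, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
samples = [(dense_block, 0)] * 5 + [(dense_block, 1)] * 2 + [(sparse_block, 1)] * 3
print(len(balance(samples)))  # two blocks per class survive in this toy case
```

The availability check on the above and left samples is left out, since it depends on the block position inside the picture.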

3.2. Loss Curves

  • Both the training loss and the validation loss decrease during training.

3.3. BD-Rate

  • Only the first frame is encoded.
BD-Rate (%) When N=1
  • EP: encodes the bits b equi-probably (bypass mode).
  • CTXT: utilizes entropy coding with CABAC context (regular mode) when coding the bits.
  • NoIndex: the index is not coded at all, which shows the upper bound on the achievable gain.
  • CNN: obtains the largest coding gain of 1.76%.
BD-Rate (%) When N=3
  • Similarly for N=3, CNN outperforms EP and CTXT.
  • An average gain of around 0.2% and a maximum gain up to 0.59% are achieved.

This is the 15th story in this month!
