Reading: Puri EUSIPCO’17 — CNN-Based Transform Index Prediction (HEVC Intra Prediction)

LeNet-Like Network, 0.2% Average BD-Rate Gain & Up to 0.59% BD-Rate Gain

Jun 12, 2020

In this story, “CNN-Based Transform Index Prediction in Multiple Transforms Framework to Assist Entropy Coding” (Puri EUSIPCO’17), by Technicolor, Université de Nantes, and IRCCyN, is briefly presented. I read this because I work on video coding research.

When multiple transforms are available on top of HEVC, a transform index must be encoded to indicate which transform is used.
In this paper, a CNN-based approach is proposed to predict the most probable transform so as to reduce the bits spent on coding the transform index.

This is a paper in 2017 EUSIPCO. (Sik-Ho Tsang @ Medium)

Outline

Conventional Transform Index Coding
Proposed CNN-Based Transform Index Coding
Experimental Results

1. Conventional Transform Index Coding

1.1. MDTC [5]

In [5], the mode-dependent transform competition (MDTC) scheme is proposed.
A transform is selected among all candidate transforms based on the minimum rate-distortion (RD) cost during RD optimization.
A transform index is coded to indicate the choice amongst N+1 transforms to the decoder for proper reconstruction of the block.
This is done by first coding a flag that indicates whether the DCT/DST transform is used or not.
If the flag indicates that it is not, one of the offline-learned transforms is used, and its index is signaled with a fixed-length code.
This scheme clearly favors the DCT/DST as it requires fewer bits to encode.

1.2. Fixed Length Coding

An alternative way of signaling the transform choice would be to directly binarize the transform index using a fixed-length code, indicating the N+1 transform candidates on b bits, where:

b = ⌈log2(N+1)⌉

These bits are entropy coded using CABAC.
This scheme does not favor DCT/DST.
It is used as the baseline for comparison in the experimental results section below.
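
To make the difference between the two signaling schemes concrete, here is a minimal sketch that counts the bins each one produces before CABAC. The function names are hypothetical, index 0 is assumed to be DCT/DST, and the flag-first variant assumes the learned transforms are indexed with ⌈log2(N)⌉ bins, which the text above does not spell out.

```python
def ceil_log2(n: int) -> int:
    """Integer ceil(log2(n)) for n >= 1, without floating point."""
    return (n - 1).bit_length()


def fixed_length_bins(n_learned: int) -> int:
    """Bins needed to index N+1 candidates with a fixed-length code."""
    return ceil_log2(n_learned + 1)


def flag_first_bins(index: int, n_learned: int) -> int:
    """Bins for a flag-first scheme (assumed layout, see lead-in).

    One flag bin says whether DCT/DST (index 0) is used; otherwise a
    fixed-length index among the N learned transforms follows.
    """
    if index == 0:
        return 1
    return 1 + ceil_log2(n_learned)


def fixed_length_codeword(index: int, n_learned: int) -> str:
    """Binary string of the index on b = ceil(log2(N+1)) bins."""
    b = fixed_length_bins(n_learned)
    return format(index, f"0{b}b")


if __name__ == "__main__":
    N = 3  # three learned transforms plus DCT/DST
    for idx in range(N + 1):
        print(idx, fixed_length_codeword(idx, N),
              fixed_length_bins(N), flag_first_bins(idx, N))
```

With N=3, the fixed-length code spends 2 bins on every index, while the flag-first layout spends 1 bin on DCT/DST and 3 bins on any learned transform, which is the bias described in Section 1.1. These are bin counts only; the actual bit cost depends on how CABAC codes each bin.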

2. Proposed CNN-Based Transform Index Coding

2.1. Overall Scheme

A 4×4 luma residual block X is the input.
Multiple transforms, T0 to TN, can be selected.
T0 is DST while the others (T1 to TN) are offline-learned transforms.
The block is transformed by a candidate Ti and then quantized (Q).
The quantized transform coefficients are input into the CNN.
The CNN outputs a vector p of probabilities, one for each transform index i.
The vector p is used to construct a truncated unary code: the probabilities in p are sorted in decreasing order, the transform index predicted with the highest probability gets the shortest codeword (1 bit), and the least probable transform index gets the longest codeword (N bits).
For example, when N=3, there are 4 transforms. Suppose T2 is selected, and the CNN output is [0.15, 0.1, 0.45, 0.30].

By using the truncated unary code, 1 bit of ‘0’ is coded for the transform index.
As another example, suppose T0 is selected, and the CNN output is [0.30, 0.1, 0.45, 0.15].

Then, 2 bits of ‘10’ are coded for the transform index.
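
The two examples above can be reproduced with a short sketch. The function name is hypothetical, and breaking ties in favor of the lower index is an assumption of this sketch, not something the paper is known to specify.

```python
def truncated_unary_codeword(probs, selected):
    """Codeword for `selected` given CNN probabilities `probs`.

    Indices are ranked by decreasing probability. Rank r is coded as
    r ones followed by a terminating zero; the zero is dropped for the
    last rank, so the longest codeword has len(probs) - 1 bits.
    """
    # sorted() is stable, so equal probabilities keep the lower index
    # first (a tie-breaking assumption of this sketch).
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    rank = order.index(selected)
    last_rank = len(probs) - 1
    return "1" * rank + ("0" if rank < last_rank else "")


if __name__ == "__main__":
    # First example: T2 is selected and is the most probable index.
    print(truncated_unary_codeword([0.15, 0.1, 0.45, 0.30], 2))
    # Second example: T0 is selected and is the second most probable.
    print(truncated_unary_codeword([0.30, 0.1, 0.45, 0.15], 0))
```

The better the CNN ranks the transform that was actually selected, the shorter the codeword, which is where the bit saving comes from.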

2.2. Network Architecture

A 4×4 coefficient block is the input.
The first convolutional layer takes the 4×4 coefficient block as input and applies 32 filters of size 2×2 with a stride of one.
The second convolutional layer operates on the output of the first layer, using 64 filters of size 2×2 with a stride of one.
A max-pooling layer is used to reduce the size to 2×2×64.
This is then fed to the fully connected layers with 36 perceptrons.
The final softmax layer outputs the probabilities.
Keras is used.
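
The padding and pooling size are not given above, so the following sketch simply traces the feature-map shapes under two assumptions (zero-padded "same" convolutions versus unpadded "valid" ones, both with 2×2 pooling) to see which one is consistent with the stated 2×2×64. The helper names are hypothetical.

```python
def conv_out(size, kernel, stride, padding):
    """Spatial output size of a convolution along one dimension."""
    if padding == "same":
        return -(-size // stride)  # ceil(size / stride)
    return (size - kernel) // stride + 1  # "valid": no padding


def trace_shapes(padding, pool=2):
    """Layer-by-layer (height, width, channels) for a 4x4 input."""
    size = 4
    shapes = [("input", (size, size, 1))]
    size = conv_out(size, kernel=2, stride=1, padding=padding)
    shapes.append(("conv1", (size, size, 32)))
    size = conv_out(size, kernel=2, stride=1, padding=padding)
    shapes.append(("conv2", (size, size, 64)))
    size = size // pool  # non-overlapping pool x pool max-pooling
    shapes.append(("maxpool", (size, size, 64)))
    return shapes


if __name__ == "__main__":
    for padding in ("same", "valid"):
        print(padding, trace_shapes(padding))
```

Under these assumptions, only the padded variant gives 2×2×64 after pooling; without padding, the second convolution already produces 2×2×64 and pooling would shrink it to 1×1×64. In Keras, the padded variant corresponds to passing `padding="same"` to the `Conv2D` layers.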

3. Experimental Results

3.1. Training

HM-15.0 is used with All-Intra configuration.
Training Set: Zurich Building dataset [16] which contains over 1000 images in PNG format that are converted to a YUV format of resolution 640×480.
Only coefficient blocks with at least 3 non-zero coefficients are considered.
The coefficient blocks where the above and left samples are not available are not taken into account.
Imbalanced classes are avoided by manually balancing the number of coefficient blocks in each class.
Four CNN models are trained, one for each of the four major intra-prediction modes (IPM), namely DC, Planar, Vertical and Horizontal.
A batch size of 32 is used, and the number of iterations over the data set (epochs) is set to 20.
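
The sample selection described above (keeping blocks with at least 3 non-zero coefficients and balancing the classes) could look roughly like the sketch below. The function name and the data layout are hypothetical, and balancing by randomly downsampling every class to the size of the smallest one is an assumption; the text only says the balancing was done manually.

```python
import random
from collections import defaultdict


def select_training_blocks(samples, min_nonzero=3, seed=0):
    """Filter and balance (coeffs, label) training samples.

    `coeffs` is a flat list of 16 quantized coefficients of a 4x4
    block and `label` is the transform index chosen by the encoder.
    """
    by_class = defaultdict(list)
    for coeffs, label in samples:
        # Keep only blocks with enough non-zero coefficients.
        if sum(1 for c in coeffs if c != 0) >= min_nonzero:
            by_class[label].append((coeffs, label))
    if not by_class:
        return []
    # Downsample every class to the smallest class size (assumption).
    target = min(len(blocks) for blocks in by_class.values())
    rng = random.Random(seed)
    balanced = []
    for label in sorted(by_class):
        balanced.extend(rng.sample(by_class[label], target))
    rng.shuffle(balanced)
    return balanced
```

Availability of the above and left samples would have to be checked on the encoder side before the blocks are exported, so it is left out of this sketch.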

3.2. Loss Curves

Both the training loss and the validation loss decrease during training.

3.3. BD-Rate

Only the first frame is encoded.

EP: encodes the b bits equi-probably (bypass mode).
CTXT: codes the bits using entropy coding with a CABAC context (regular mode).
NoIndex: the index is not coded, which shows the upper bound.
CNN: obtains the largest coding gain of 1.76%.