Reading: Wei VCIP'17 — LSTM Method for Predicting CU Splitting in H.264 to HEVC Transcoding (Fast Prediction)
62.39% Encoding Time Reduction, 0.0543 dB Decrease in PSNR
In this story, An LSTM Method for Predicting CU Splitting in H.264 to HEVC Transcoding (Wei VCIP’17), by Beihang University and the Collaborative Innovation Center of Geospatial Technology, is briefly presented.
H.264-to-HEVC transcoding decodes an H.264 bitstream and re-encodes it in HEVC format. There are many use cases: achieving a higher compression ratio with the newer codec, serving a receiver that only has an HEVC decoder, or transcoding in the cloud after an H.264 video bitstream is uploaded, etc. In this paper:
- The features of H.264, including residual, macroblock (MB) partition and bit allocation, are employed as the input to the LSTM (Long Short-Term Memory).
- The output is the CU splitting decision for HEVC.
This is a paper in 2017 VCIP. (Sik-Ho Tsang @ Medium)
Outline
- Correlation Analysis
- Hierarchical LSTM Method
- Experimental Results
1. Correlation Analysis
- 18 standard test video sequences of JCT-VC and 93 collected raw video sequences are encoded by H.264 (JM 19.0), at four QPs (22, 27, 32, 37).
- The features of H.264, including MV (Motion Vector), residual, MB (Macroblock) partition and bit allocation, are collected.
- The Correlation Coefficient (CC) for each feature at each QP is calculated as the Pearson correlation, CC = Σ_i (f_i − f̄)(g_i − ḡ) / √(Σ_i (f_i − f̄)² · Σ_i (g_i − ḡ)²), where f_i denotes the feature of H.264 corresponding to the i-th CU in HEVC, g_i means the ground truth of the i-th CU splitting pattern, and the barred symbols are the mean values. (A code sketch of this computation follows this list.)
- Features with high correlation can be used for machine/deep learning.
- CC values of bit allocation, MB partition and residual are larger than 0.4 for 64×64 CU.
- Besides, CC values are above 0.3 for bit allocation, MB partition and residual when the CU is 16×16.
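As a concrete illustration, here is a minimal Python sketch of the Pearson CC computation described above; the function name and the toy data are hypothetical, not from the paper.

```python
import numpy as np

def correlation_coefficient(f, g):
    """Pearson CC between an H.264 feature f and the HEVC splitting
    ground truth g, both arrays indexed by CU (as in Section 1)."""
    f = np.asarray(f, dtype=np.float64)
    g = np.asarray(g, dtype=np.float64)
    fc, gc = f - f.mean(), g - g.mean()
    return (fc * gc).sum() / np.sqrt((fc ** 2).sum() * (gc ** 2).sum())

# Toy usage: per-CU bit allocation vs. binary split labels (hypothetical data)
bits  = [120, 40, 95, 30, 88, 15]   # feature f_i for six CUs
split = [1, 0, 1, 0, 1, 0]          # ground truth g_i
print(correlation_coefficient(bits, split))  # high CC -> informative feature
```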
2. Hierarchical LSTM Method
2.1. Hierarchical LSTM Classifiers
- There are three different LSTM classifiers, corresponding to three levels of CU partition, i.e. 64×64, 32×32 and 16×16.
- The inputs to these LSTM classifiers are the H.264 features, i.e., bit allocation, residual and MB partition.
- If the time step of the LSTM is M, the inputs to an LSTM classifier are a sequence of features from M frames; M is set to 30 by validation.
- The outputs are the splitting decision of CUs in those frames.
- When the current CU is decided to be split, the next-level LSTM classifier is activated; otherwise, it is skipped to save computation (see the sketch below).
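The following PyTorch sketch illustrates this hierarchy and its early termination, assuming the input dimensions given in Section 2.3 (144, 36 and 9) and M = 30; the hidden size of 128 and all names are assumptions of this sketch, not the authors' code.

```python
import torch
import torch.nn as nn

class CULevelLSTM(nn.Module):
    """One LSTM classifier for one CU-partition level (sketch)."""
    def __init__(self, in_dim, hidden=128):  # hidden size: an assumption
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hidden, batch_first=True)
        self.fc = nn.Linear(hidden, 1)

    def forward(self, x):                     # x: (batch, M, in_dim)
        h, _ = self.lstm(x)
        return torch.sigmoid(self.fc(h)).squeeze(-1)  # per-frame split prob.

# Three levels for 64x64, 32x32 and 16x16 CUs (input dims from Section 2.3).
level1, level2, level3 = CULevelLSTM(144), CULevelLSTM(36), CULevelLSTM(9)

def decide_split(f64, f32, f16, th=0.5):
    """Early termination: a lower level runs only if the upper level splits.
    For brevity, one feature tensor per level; in practice each split CU
    spawns four sub-CUs at the next level."""
    p1 = level1(f64)[:, -1]          # decision for the 64x64 CU
    if (p1 <= th).all():
        return p1, None, None        # not split: levels 2 and 3 are skipped
    p2 = level2(f32)[:, -1]
    if (p2 <= th).all():
        return p1, p2, None
    return p1, p2, level3(f16)[:, -1]

M = 30  # time step, set by validation
print(decide_split(torch.randn(1, M, 144),
                   torch.randn(1, M, 36),
                   torch.randn(1, M, 9)))
```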
2.2. LSTM (Long Short-Term Memory)
- One LSTM unit consists of one cell and three gates (input, forget and output).
- The input gate brings new information to the whole network.
- The forget gate determines which information is retained and which is discarded in the network.
- The output gate decides which piece of message is sent to the next LSTM unit.
- Wi, Wf, and Wo are the weights of input, forget and output gates, and bi, bf and bo are their corresponding biases. σ(·) is a sigmoid function.
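For reference, the standard LSTM gate equations are given below, with x_t the input to the unit (F_t in this paper) and h_{t−1} the previous output; the cell-candidate weights W_c and b_c are implied, since the text only names the three gates.

```latex
\begin{aligned}
i_t &= \sigma(W_i \, [h_{t-1}, x_t] + b_i) &&\text{(input gate)}\\
f_t &= \sigma(W_f \, [h_{t-1}, x_t] + b_f) &&\text{(forget gate)}\\
o_t &= \sigma(W_o \, [h_{t-1}, x_t] + b_o) &&\text{(output gate)}\\
c_t &= f_t \odot c_{t-1} + i_t \odot \tanh(W_c \, [h_{t-1}, x_t] + b_c) &&\text{(cell state)}\\
h_t &= o_t \odot \tanh(c_t) &&\text{(unit output)}
\end{aligned}
```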
2.3. LSTM Used in this Paper
- F_t denotes the input H.264 features of the current CU in the t-th frame.
- h_{t−1} is the output (CU splitting pattern) from the (t−1)-th LSTM unit.
- The sigmoid cross-entropy loss is employed: L = −Σ_i [y_i log(a_i) + (1 − y_i) log(1 − a_i)].
- Here, y_i indicates the ground truth of the i-th CU, where y_i = 1 means that the i-th CU is to be split and y_i = 0 the opposite; a_i is the output modelled by the sigmoid function.
- For training the LSTM classifier of level 1, 144-dimensional feature vectors, consisting of 16 elements of bit allocation, 64 elements of MB partition and 64 elements of residual, are delivered to each LSTM unit as the inputs (a training sketch is given after this list).
- The residual feature is the sum of the absolute values of the residual in each 8×8 block.
- Similarly, the LSTM classifiers of level 2 and level 3 have 36-dimensional and 9-dimensional input feature vectors.
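Below is a minimal training sketch for the level-1 classifier under these definitions, reusing the CULevelLSTM class from the earlier sketch; the concatenation order of the features and all hyperparameters are assumptions, not from the paper.

```python
import torch
import torch.nn as nn

def level1_features(bit_alloc, mb_partition, residual):
    """Assemble the 144-D level-1 input for one 64x64 CU in one frame:
    16 bit-allocation, 64 MB-partition and 64 residual elements.
    The concatenation order is an assumption of this sketch."""
    return torch.cat([bit_alloc.flatten(),     # 16 values (one per 16x16 MB)
                      mb_partition.flatten(),  # 64 values
                      residual.flatten()])     # 64 abs-sums, one per 8x8 block

model = CULevelLSTM(144)                       # from the earlier sketch
loss_fn = nn.BCELoss()                         # sigmoid cross entropy over a_i, y_i
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.randn(8, 30, 144)                    # batch of 8 CUs, M = 30 frames
y = torch.randint(0, 2, (8, 30)).float()       # ground-truth split labels y_i
opt.zero_grad()
loss = loss_fn(model(x), y)                    # a_i is the sigmoid output
loss.backward()
opt.step()
```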
3. Experimental Results
- 956,555 samples from the 93 collected raw video sequences were divided into a non-overlapping training set (900,000 samples) and validation set (56,555 samples). M is set to 30.
- 18 standard test video sequences of JCT-VC are used for evaluation.
- JM 19.0 is used for H.264 and HM 16.0 is used for HEVC.
- The low-delay IPPP structure is used.
- ΔT (encoding time reduction) of the proposed LSTM method is 62.39%, which is higher than that of [12], a TCSVT paper, with only a 0.0543 dB decrease in PSNR.
- On average, the LSTM module takes only 0.15% of the original transcoder's time.
I have just discovered the journal (Transactions) version, “Fast H.264 to HEVC Transcoding: A Deep Learning Method”, after writing the whole story…lol. Perhaps I will write a new story for it in which only the main differences are mentioned…
During the days of coronavirus, a challenge of writing 30 stories again for this month has been accomplished. A new target of 35 stories is set by now. This is the 34th story in this month. Thanks for visiting my story.
Reference
[2017 VCIP] [Wei VCIP’17]
An LSTM Method for Predicting CU Splitting in H.264 to HEVC Transcoding
Codec Fast Prediction
- H.264 to HEVC: [Wei VCIP’17]
- HEVC: [Yu ICIP’15 / Liu ISCAS’16 / Liu TIP’16] [Laude PCS’16] [Katayama ICICT’18]
- VVC: [Jin VCIP’17] [Jin PCM’17] [Wang ICIP’18] [Pooling-Variable CNN]