Reading: H-LSTM — Hierarchical LSTM for Fast H.264 to HEVC Transcoding (Fast Codec Prediction)

59.60% Time Reduction With Only 1.158% Increase in BD-BR (BD-Rate)

Sik-Ho Tsang
6 min read · May 26, 2020
Transcoding Framework Using H-LSTM

In this story, “Fast H.264 to HEVC Transcoding: A Deep Learning Method” (H-LSTM), by Beihang University, is presented. I read this because I work on video coding research. This paper extends its conference version (i.e. Wei VCIP’17) that I presented last time:

  • First, a large-scale H.264 to HEVC transcoding database is built.
  • Second, the correlation between the HEVC CTU partition and H.264 features, and both temporal and spatial-temporal similarities of the CTU partition across video frames, are analyzed.
  • Third, a deep learning architecture of a hierarchical long short-term memory (H-LSTM) network is proposed to predict the CTU partition of HEVC.

This is a 2019 TMM paper (IEEE Transactions on Multimedia, a journal with a high impact factor). I will mainly describe what is new relative to Wei VCIP’17 in this story, so it is better to read Wei VCIP’17 first. (Sik-Ho Tsang @ Medium)

It is also a TMM featured article in the month of July 2019!!!
https://signalprocessingsociety.org/publications-resources/ieee-transactions-multimedia/fast-h264-hevc-transcoding-deep-learning-method

Outline

  1. A Large-Scale H.264 to HEVC Transcoding (HHT) Database
  2. H.264 Feature Analysis
  3. Proposed H-LSTM
  4. Experimental Results

1. A Large-Scale H.264 to HEVC Transcoding (HHT) Database

Details of Collected Video Frames
  • The HHT database contains the CTU partition data of 93 raw sequences compressed by inter-mode HEVC at Quantization Parameter (QP) = 22, 27, 32 and 37.
  • The above table shows that the resolutions of those raw video sequences are diverse, ranging from 352×240 to 2048×1080.
  • There are in total 33,042 frames in the HHT database.
  • The raw video sequences were encoded by the H.264 reference software JM 19.0 with the default configuration file of encoder_baseline.cfg at four QPs = {22, 27, 32, 37}. As a result, 372 compressed H.264 video streams were obtained.
  • Subsequently, all 372 H.264 video streams were decoded.
  • Four common features of H.264, including MV, residual, MB partition and bit allocation, were extracted from the H.264 video streams for the database.
  • The decoded streams were encoded by the HEVC reference software HM 16.0.
  • HEVC encoding uses the default configuration file encoder_low_delay_P_main.cfg at QP = {22, 27, 32, 37}.
  • During HEVC encoding, the CU, PU and TU partitions were recorded for the database and serve as the ground truth.
Splitting & Non-Splitting Samples
  • Finally, the HHT database contains a total of 268,640,788 CU samples, including 36.09% splitting samples and 63.91% non-splitting samples.
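The database figures above can be cross-checked with a few lines of arithmetic. This is only a sketch: the constant names are made up for illustration, and the split counts are approximate because the percentages in the story are rounded to two decimals.

```python
# Figures quoted in the story; variable names are hypothetical.
NUM_SEQUENCES = 93
QPS = (22, 27, 32, 37)
TOTAL_CU_SAMPLES = 268_640_788
SPLIT_RATIO = 0.3609  # 36.09% splitting samples (rounded in the story)

# One H.264 stream per (sequence, QP) pair.
num_streams = NUM_SEQUENCES * len(QPS)

# Approximate class counts, limited by the rounding of SPLIT_RATIO.
split_samples = round(TOTAL_CU_SAMPLES * SPLIT_RATIO)
non_split_samples = TOTAL_CU_SAMPLES - split_samples

print(f"H.264 streams: {num_streams}")
print(f"splitting CUs (approx.): {split_samples:,}")
print(f"non-splitting CUs (approx.): {non_split_samples:,}")
```

The roughly 36/64 class ratio is worth keeping in mind when reading the accuracy numbers later, since always predicting "non-split" would already be right for about 64% of the samples.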

2. H.264 Feature Analysis

2.1. Overall Correlation Coefficient (CC)

CC between H.264 features and CTU partition
  • The statistical values of correlation coefficient (CC) between H.264 features and CTU partition can be found in the above figure.
  • Baseline means the CC between H.264 features and randomly generated HEVC CTU partition pattern.
  • The CC values of the H.264 features are much higher than the baseline.
  • Also, the CC values of 64×64 and 32×32 CUs are larger than those of 16×16 CUs.
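To make the CC comparison above concrete, here is a minimal sketch of a Pearson correlation between one H.264 feature and the CU split labels, together with a random baseline. The toy bit counts and labels are invented for illustration and are not taken from the HHT database; how the paper aggregates features per CU is not described in this story.

```python
import math
import random

def pearson_cc(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined for a constant sequence
    return cov / math.sqrt(var_x * var_y)

# Toy data (assumption): H.264 bits spent in the area of each CU,
# and whether HEVC split that CU (1) or not (0).
bits_per_cu = [120, 15, 300, 40, 220, 10, 180, 25]
split_label = [1, 0, 1, 0, 1, 0, 1, 0]
feature_cc = pearson_cc(bits_per_cu, split_label)

# Baseline in the spirit of the story: the same kind of feature
# against a randomly generated partition pattern.
rng = random.Random(0)
random_feature = [rng.random() for _ in range(10_000)]
random_split = [rng.randint(0, 1) for _ in range(10_000)]
baseline_cc = pearson_cc(random_feature, random_split)

print(f"feature CC:  {feature_cc:.3f}")
print(f"baseline CC: {baseline_cc:.3f}")
```

In this toy setup the feature CC is strongly positive while the baseline stays near zero, which mirrors the qualitative gap the figure reports.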
Examples of temporal similarity
  • The blocks that have the same CTU partition as the previous reference frame are drawn in blue.

2.2. Temporal Similarity

The CC values of the CTU partition between two frames at various distances
  • The CC values of the CTU partition between two frames at various distances are obtained, ranging from 1 group of pictures (GOP) to 25 GOPs.
  • The CC values are obtained from the co-located units from two frames.
  • The CTU partition is correlated across HEVC video frames, and the correlation decays as the distance between the two frames increases. Thus, the CTU partition of previous HEVC frames can be used to predict the current CTU partition.
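The temporal analysis above can be sketched as follows, assuming each frame is stored as a flat list of split flags in a fixed CU order so that index i is the co-located unit in every frame. The toy sequence generator (each flag flips with a small probability per frame) is purely an assumption used to produce a decaying correlation, and the distance here counts frames of the toy sequence rather than GOPs.

```python
import math
import random

def pearson_cc(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)

def temporal_cc(frames, distance):
    """CC between co-located split flags of frames t and t + distance, pooled over t."""
    xs, ys = [], []
    for t in range(len(frames) - distance):
        xs.extend(frames[t])
        ys.extend(frames[t + distance])
    return pearson_cc(xs, ys)

def make_toy_sequence(num_frames=60, num_cus=200, flip_prob=0.05, seed=0):
    """Hypothetical data: every split flag flips with probability flip_prob per frame."""
    rng = random.Random(seed)
    frame = [rng.randint(0, 1) for _ in range(num_cus)]
    frames = [frame]
    for _ in range(num_frames - 1):
        frame = [1 - f if rng.random() < flip_prob else f for f in frame]
        frames.append(frame)
    return frames

frames = make_toy_sequence()
cc_by_distance = {d: temporal_cc(frames, d) for d in (1, 5, 10, 25)}
for d, cc in cc_by_distance.items():
    print(f"distance {d:2d}: CC = {cc:.3f}")
```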

2.3. Spatial-Temporal Similarity

The CC values of the CTU partition between two frames at various distances
  • The CC values are obtained between the CU at one frame and the eight neighboring CUs at the previous frame.
  • Similarly, the CC curves for spatial-temporal similarity decay as the distance between the two frames increases.
  • The CC values are all above 0.4 for the first GOP.
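A sketch of the spatial-temporal measurement described above: for each of the eight neighbor offsets, correlate the split flag of a CU in the current frame with the flag of the offset neighbor in the previous frame. The 2-D grid layout and the toy frames (left half split, right half not, plus random flips) are assumptions for illustration only.

```python
import math
import random

# The eight neighbors around the co-located position (0, 0).
NEIGHBOR_OFFSETS = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1)
                    if (dr, dc) != (0, 0)]

def pearson_cc(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)

def spatial_temporal_cc(prev_frame, cur_frame, offset):
    """CC between each CU of cur_frame and its neighbor at `offset` in prev_frame."""
    dr, dc = offset
    rows, cols = len(cur_frame), len(cur_frame[0])
    xs, ys = [], []
    for r in range(rows):
        for c in range(cols):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols:  # skip neighbors outside the frame
                xs.append(cur_frame[r][c])
                ys.append(prev_frame[nr][nc])
    return pearson_cc(xs, ys)

def make_toy_frame(rng, rows=24, cols=24, flip_prob=0.1):
    """Hypothetical frame: left half split (1), right half not (0), with random flips."""
    return [[(1 if c < cols // 2 else 0) ^ (rng.random() < flip_prob)
             for c in range(cols)] for _ in range(rows)]

rng = random.Random(0)
prev_frame = make_toy_frame(rng)
cur_frame = make_toy_frame(rng)
cc_per_neighbor = {off: spatial_temporal_cc(prev_frame, cur_frame, off)
                   for off in NEIGHBOR_OFFSETS}
for off, cc in cc_per_neighbor.items():
    print(f"offset {off}: CC = {cc:.3f}")
```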

3. Proposed H-LSTM

3.1. Overall Architecture

H-LSTM Structure
  • This part is very similar to the conference version, except that MV is now also used as an H.264 feature, so the feature vector differs as well.
  • Also, the figure is drawn much more nicely here.
  • (For more details, please read Wei VCIP’17.)

3.2. Bi-Threshold Decision Scheme

  • In the test stage, a bi-threshold decision scheme is proposed, which did not appear in Wei VCIP’17.
  • If the output probability from the network > Threshold 1 (the upper threshold), the CU is split.
  • If the output probability from the network < Threshold 2 (the lower threshold), the CU is not split.
  • Otherwise, if the output probability lies between the two thresholds, the conventional full RDO (Rate Distortion Optimization) is performed.
  • The threshold pairs, given as [lower, upper], used for 64×64, 32×32 and 16×16 CUs are [0.35, 0.65], [0.3, 0.7] and [0.2, 0.8], respectively.
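The decision rule above can be written as a small function. This is a minimal sketch using the threshold pairs quoted in the story; the names `Decision`, `THRESHOLDS` and `decide` are made up here, and treating a probability exactly equal to a threshold as "fall back to RDO" is an assumption, since the story only gives strict inequalities.

```python
from enum import Enum

class Decision(Enum):
    SPLIT = "split"
    NON_SPLIT = "non-split"
    FULL_RDO = "full RDO"

# (lower, upper) threshold pairs per CU size, as quoted in the story.
THRESHOLDS = {
    64: (0.35, 0.65),
    32: (0.30, 0.70),
    16: (0.20, 0.80),
}

def decide(split_probability, cu_size):
    """Map the network's split probability to a transcoder action."""
    lower, upper = THRESHOLDS[cu_size]
    if split_probability > upper:
        return Decision.SPLIT
    if split_probability < lower:
        return Decision.NON_SPLIT
    # Uncertain region: let the conventional RDO search decide.
    return Decision.FULL_RDO

print(decide(0.90, 64))  # confident split
print(decide(0.10, 16))  # confident non-split
print(decide(0.75, 16))  # inside [0.2, 0.8], so full RDO
print(decide(0.75, 32))  # above 0.7, so split
```

The uncertain band widens as the CU gets smaller (0.30 wide at 64×64, 0.60 wide at 16×16), so more of the small-CU decisions are left to RDO.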

4. Experimental Results

4.1. BD-Rate

4.1.1 Low Delay P

  • H-LSTM obtains 59.60% time reduction with only 1.158% BD-BR (BD-rate) increase.

4.1.2 Random Access

  • A BD-BR (BD-rate) of 1.528% is obtained, which is the lowest among the compared methods.
  • An average time reduction of 55.40% is obtained, which is the largest.
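For readers unfamiliar with BD-BR, here is a sketch of the commonly used Bjøntegaard procedure: fit a cubic to log10(bitrate) as a function of PSNR for each codec from four rate-distortion points, then average the gap over the shared PSNR range. The story does not say exactly how the paper computed BD-BR, so treat this as the usual recipe rather than the authors' script; the RD points below are invented.

```python
import math

def fit_cubic(xs, ys):
    """Coefficients [c0, c1, c2, c3] of the cubic through four (x, y) points."""
    n = 4
    # Augmented Vandermonde system, solved by Gaussian elimination with pivoting.
    a = [[x ** k for k in range(n)] + [y] for x, y in zip(xs, ys)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for k in range(col, n + 1):
                a[r][k] -= factor * a[col][k]
    coeffs = [0.0] * n
    for i in reversed(range(n)):
        rest = sum(a[i][k] * coeffs[k] for k in range(i + 1, n))
        coeffs[i] = (a[i][n] - rest) / a[i][i]
    return coeffs

def integrate_poly(coeffs, lo, hi):
    """Definite integral of the polynomial with the given coefficients."""
    def antiderivative(x):
        return sum(c * x ** (k + 1) / (k + 1) for k, c in enumerate(coeffs))
    return antiderivative(hi) - antiderivative(lo)

def bd_rate(rates_ref, psnr_ref, rates_test, psnr_test):
    """Average bitrate difference in percent of test vs. reference at equal PSNR."""
    # Shift PSNR by a common offset to keep the cubic fit well conditioned.
    offset = min(min(psnr_ref), min(psnr_test))
    x_ref = [p - offset for p in psnr_ref]
    x_test = [p - offset for p in psnr_test]
    poly_ref = fit_cubic(x_ref, [math.log10(r) for r in rates_ref])
    poly_test = fit_cubic(x_test, [math.log10(r) for r in rates_test])
    lo = max(min(x_ref), min(x_test))
    hi = min(max(x_ref), max(x_test))
    avg_log_diff = (integrate_poly(poly_test, lo, hi)
                    - integrate_poly(poly_ref, lo, hi)) / (hi - lo)
    return (10 ** avg_log_diff - 1) * 100

# Toy RD points (assumption): the fast transcoder needs 5% more bits at each PSNR.
psnr = [32.0, 35.0, 38.0, 41.0]
rates_full_rdo = [1000.0, 1800.0, 3200.0, 6000.0]
rates_fast = [r * 1.05 for r in rates_full_rdo]
print(f"BD-BR = {bd_rate(rates_full_rdo, psnr, rates_fast, psnr):.3f}%")
```

A positive BD-BR means the fast transcoder spends more bits for the same quality, which is the cost paid for the time reduction.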

4.2. DMOS

DMOS at QP=27
  • 15 non-expert subjects.
  • The average DMOS by H-LSTM (ours) is close to the original transcoder.

4.3. Prediction Accuracy

  • Without the bi-threshold scheme, 82.5% accuracy is obtained.
  • With the bi-threshold scheme, an even higher accuracy of 91.7% is obtained.
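One plausible reading of why accuracy rises with the bi-threshold scheme is that accuracy is then measured only on the CUs the network actually decides, while uncertain CUs go to full RDO. That reading is an assumption on my part, not something stated in this story. Under that assumption, a toy comparison looks like this (probabilities and labels are invented):

```python
def single_threshold_accuracy(probs, labels, threshold=0.5):
    """Accuracy when every CU is decided by comparing against one threshold."""
    correct = sum((p > threshold) == bool(y) for p, y in zip(probs, labels))
    return correct / len(labels)

def bi_threshold_accuracy(probs, labels, lower, upper):
    """Accuracy over confidently decided CUs, plus the fraction of CUs decided."""
    decided = [(p > upper, bool(y)) for p, y in zip(probs, labels)
               if p > upper or p < lower]
    if not decided:
        return None, 0.0
    accuracy = sum(pred == truth for pred, truth in decided) / len(decided)
    coverage = len(decided) / len(labels)
    return accuracy, coverage

# Toy split probabilities and ground-truth split labels (assumption).
probs = [0.95, 0.85, 0.60, 0.55, 0.45, 0.40, 0.10, 0.05]
labels = [1, 1, 0, 1, 1, 0, 0, 0]

acc_single = single_threshold_accuracy(probs, labels)
acc_bi, coverage = bi_threshold_accuracy(probs, labels, lower=0.3, upper=0.7)
print(f"single threshold: accuracy = {acc_single:.2f}")
print(f"bi-threshold:     accuracy = {acc_bi:.2f} on {coverage:.0%} of CUs")
```

The trade-off is visible in the coverage value: a wider uncertain band gives higher accuracy on the decided CUs but sends more CUs to the slow RDO path.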

4.4. Time Analysis

  • The running time of the H-LSTM model is less than 2% of the original transcoding time.
  • The H-LSTM model consumes 0.54% and 0.53% of the original transcoding time for 2560×1600 and 1920×1080, respectively.

4.5. Contribution of Features

Results of Using Only One Feature
  • The features of MV, MB partition, bit allocation and residual reduce transcoding complexity by 44.52%, 57.77%, 43.03% and 51.92%, respectively.
  • Meanwhile, the BDPSNR results of single features are all above −0.10 dB.
  • These results show that each feature contributes to H.264 to HEVC transcoding.

During the days of coronavirus, the challenges of writing 30 and 35 stories for this month have been accomplished again. Let me challenge 40 stories!! This is the 37th story this month. Thanks for visiting my story.
