Reading: ETH-CNN & ETH-LSTM — Reducing Complexity of HEVC (Fast HEVC Intra & Inter Prediction)

39.76% to 59.74% and 43.14% to 64.07% Time Reduction with Only 1.722% and 1.483% BD-Rate Increase for the LDB & RA Configurations Respectively, Outperforming Liu TIP’16 and Li ICME’17

Sik-Ho Tsang
5 min read · May 30, 2020

In this story, ETH-CNN & ETH-LSTM, by Beihang University and Imperial College London, are presented. I read this because I work on video coding research. This paper extends the conference paper Li ICME’17 by adding an LSTM to the network architecture, which is also a first attempt to use an LSTM for predicting the CU partition in HEVC. This is a paper in 2018 TIP. (Sik-Ho Tsang @ Medium)

Outline

  1. CPH-Inter Database
  2. ETH-CNN Network Architecture
  3. ETH-LSTM Network Architecture
  4. Experimental Results

1. CPH-Inter Database

  • CPH-Intra Database has been proposed in Li ICME’17.
  • In this paper, CPH-Inter database is proposed.
  • 111 raw video sequences were selected, consisting of 6 sequences at 1080p (1920 × 1080) from [37], 18 sequences of Classes A ∼ E from the Joint Collaborative Team on Video Coding (JCT-VC) standard test set [38], and 87 sequences from Xiph.org [39] at different resolutions.
  • Sequences longer than 10 seconds were clipped to 10 seconds.
  • They are divided into non-overlapping training (83 sequences), validation (10 sequences) and test (18 sequences) sets.
  • The sequences in the CPH-Inter database were encoded by HM 16.5 at four QPs {22, 27, 32, 37}, using the LDP, LDB and RA configurations.
  • 12 sub-databases were obtained for each configuration, corresponding to the four QPs and three CU sizes.
  • In total 307,831,288, 275,163,224 and 232,095,164 samples were collected for the LDP, LDB and RA configurations in the CPH-Inter database, respectively.

2. ETH-CNN Network Architecture

ETH-CNN Network Architecture

2.1. Network Architecture

  • Preprocessing layers: The raw CTU is preprocessed by mean removal and down-sampling in three parallel branches B1 to B3, corresponding to the three levels of the HCPM.
  • HCPM (Hierarchical CU Partition Map) is the label map at the output: each entry indicates whether the corresponding CU is split (split = 1) or not split (split = 0).
  • At branches B1 and B2, CTUs are down-sampled to 16×16 and 32×32, respectively.
  • Convolutional layers: The data are convolved with 4 × 4 kernels (16 filters) at the first convolutional layer to extract low-level features.
  • At the second and third layers, the feature maps are sequentially convolved with 2 × 2 kernels (24 filters at the second layer and 32 filters at the third layer) to generate higher-level features.
  • Concatenating layer: All feature maps at three branches, yielded from the second and third convolutional layers, are concatenated together and then flattened into a vector a.
  • Fully connected layers: All features in the concatenated vector a are processed in three branches. In each branch, the vectorized features of the concatenating layer pass through three fully connected layers, including two hidden layers and one output layer.
  • The output layer produces HCPM as the output of ETH-CNN.
  • ReLU activation is used in all layers except the output layer, where the sigmoid is used.
  • The cross-entropy loss over the HCPM is used for training, where H denotes the cross entropy between the ground-truth and predicted split labels. (A hedged sketch of the network and this loss is given after this list.)
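
Putting the bullets above together, here is a minimal, unofficial PyTorch sketch of the three-branch ETH-CNN and of a cross-entropy loss over the HCPM. The kernel sizes, filter counts and branch resolutions follow the description above; the fully connected widths, the use of average pooling for down-sampling, and the masking of invalid entries (CUs whose parent is not split) are my assumptions, and the QP side information used in the paper is omitted for brevity.

```python
# Unofficial sketch (not the authors' code); FC widths, average-pooling
# down-sampling and the loss masking are assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Branch(nn.Module):
    """One of the three parallel branches: optional down-sampling, then three convs."""
    def __init__(self, pool):
        super().__init__()
        self.pool = pool                               # 4 -> 16x16, 2 -> 32x32, 1 -> 64x64
        self.c1 = nn.Conv2d(1, 16, kernel_size=4, stride=4)   # low-level features
        self.c2 = nn.Conv2d(16, 24, kernel_size=2, stride=2)
        self.c3 = nn.Conv2d(24, 32, kernel_size=2, stride=2)

    def forward(self, x):
        if self.pool > 1:
            x = F.avg_pool2d(x, self.pool)             # down-sampling (assumed average)
        f1 = F.relu(self.c1(x))
        f2 = F.relu(self.c2(f1))
        f3 = F.relu(self.c3(f2))
        # Feature maps of the 2nd and 3rd conv layers are flattened for concatenation.
        return torch.cat([f2.flatten(1), f3.flatten(1)], dim=1)

class EthCnnSketch(nn.Module):
    def __init__(self):
        super().__init__()
        self.branches = nn.ModuleList([Branch(4), Branch(2), Branch(1)])
        feat = 24 * (2 * 2 + 4 * 4 + 8 * 8) + 32 * (1 + 2 * 2 + 4 * 4)  # = 2688 per 64x64 CTU
        self.heads = nn.ModuleList([
            nn.Sequential(nn.Linear(feat, h), nn.ReLU(),
                          nn.Linear(h, h // 2), nn.ReLU(),
                          nn.Linear(h // 2, n), nn.Sigmoid())    # sigmoid at the output
            for n, h in zip((1, 4, 16), (64, 128, 256))])        # HCPM levels 1..3

    def forward(self, ctu):                            # ctu: (B, 1, 64, 64), mean-removed
        a = torch.cat([b(ctu) for b in self.branches], dim=1)    # concatenated vector a
        return [head(a) for head in self.heads]        # split probabilities per level

def hcpm_loss(preds, labels, masks):
    """Cross entropy H summed over HCPM entries; the mask (an assumption here) skips
    entries whose parent CU is not split, so they do not contribute to the loss."""
    total = 0.0
    for p, y, m in zip(preds, labels, masks):          # one (B, n) tensor per level
        h = F.binary_cross_entropy(p, y, reduction="none")
        total = total + (h * m).sum() / m.sum().clamp(min=1.0)
    return total

# Usage: probs = EthCnnSketch()(torch.randn(8, 1, 64, 64))  # a batch of 8 CTUs
```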

2.2. Bi-Threshold Decision Scheme

  • For a better tradeoff between complexity and RD performance, a bi-threshold decision scheme is used.
  • If the predicted splitting probability P(U) of a CU exceeds the upper threshold α1, the CU is split; if P(U) ≤ α2 (the lower threshold, with α2 ≤ α1), it is not split; otherwise, the conventional full RDO search is performed (see the sketch below).
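
As a concrete illustration, the per-CU decision can be written as the tiny helper below; the threshold values are placeholders for illustration, not the values tuned in the paper.

```python
def cu_split_decision(p_split, alpha1=0.6, alpha2=0.4):
    """Bi-threshold early decision for one CU.

    p_split is the splitting probability predicted by ETH-CNN / ETH-LSTM;
    alpha1 (upper) and alpha2 (lower) are illustrative placeholder thresholds.
    """
    if p_split > alpha1:
        return "split"          # confidently split: skip the non-split RD check
    if p_split <= alpha2:
        return "not_split"      # confidently not split: skip checking sub-CUs
    return "rdo"                # uncertain region: fall back to the full RDO search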

3. ETH-LSTM Network Architecture

ETH-LSTM Network Architecture
  • For inter frames, the input is the residue of each CTU.
  • The features extracted by ETH-CNN from this residue are fed into ETH-LSTM.
  • There are 3 LSTM cells, each corresponding to one of the three levels of the HCPM.
  • At each level, two fully connected layers follow the LSTM cell; they also take the QP value and the order of frame t within its GOP as inputs.
  • The output of the second fully connected layer gives the probabilities of CU splitting, which are binarized to predict the HCPM.
  • The LSTM state at frame t is passed to the LSTM cell of the next frame, so that temporal dependency across frames is exploited (see the sketch after this list).
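
Below is a minimal, unofficial PyTorch sketch of the ETH-LSTM idea: one LSTM cell per HCPM level processes the per-frame ETH-CNN features, the QP and the frame order within the GOP are appended as side information, and the cell state is carried from frame to frame. All hidden sizes, FC widths and the way the side information is injected are assumptions for illustration.

```python
# Unofficial sketch (not the authors' code); sizes and wiring are assumptions.
import torch
import torch.nn as nn

class EthLstmSketch(nn.Module):
    def __init__(self, feat_dim=2688, hidden=128):
        super().__init__()
        n_out = (1, 4, 16)                              # split flags per HCPM level
        self.cells = nn.ModuleList(
            [nn.LSTMCell(feat_dim + 2, hidden) for _ in n_out])
        self.heads = nn.ModuleList(                     # two FC layers per level
            [nn.Sequential(nn.Linear(hidden + 2, 64), nn.ReLU(),
                           nn.Linear(64, n), nn.Sigmoid()) for n in n_out])

    def forward(self, feats_per_frame, qp, orders):
        """feats_per_frame: list of (B, feat_dim) ETH-CNN features, one per frame.
        qp: (B, 1) tensor; orders: list of (B, 1) frame orders within the GOP."""
        states = [None] * len(self.cells)
        outputs = []
        for feats, order in zip(feats_per_frame, orders):
            side = torch.cat([qp, order], dim=1)        # side information
            frame_probs = []
            for i, (cell, head) in enumerate(zip(self.cells, self.heads)):
                h, c = cell(torch.cat([feats, side], dim=1), states[i])
                states[i] = (h, c)                      # carried to the next frame
                frame_probs.append(head(torch.cat([h, side], dim=1)))
            outputs.append(frame_probs)                 # split probabilities per level
        return outputs
```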

4. Experimental Results

4.1. BD-Rate Under AI Configuration

BD-Rate Under AI Configuration on CPH-Intra Test Set
  • Using ETH-CNN, only a 1.386% BD-rate increase is incurred, with 64.01% to 70.52% time reduction.
BD-Rate Under AI Configuration on HEVC Test Set
  • Using ETH-CNN, a 2.247% BD-rate increase is incurred, with 56.92% to 66.47% time reduction.

4.2. BD-Rate Under LDP Configuration

BD-Rate Under LDP Configuration on HEVC Test Set
  • Using ETH-LSTM, a 1.495% BD-rate increase is incurred, with 43.84% to 62.94% time reduction.

4.3. BD-Rate Under LDB & RA Configurations

BD-Rate Under LDB & RA Configurations on HEVC Test Set
  • Again, ETH-LSTM obtains the lowest BD-rate increase while still achieving a large time reduction.

4.4. Ablation Study

Ablation Study
  • For AI, ETH-CNN outperforms Liu TIP’16 and Li ICME’17.
  • For LDP, ETH-CNN using residual CTUs as input outperforms ETH-CNN using original CTUs as input. With the further aid of the LSTM, ETH-LSTM performs best.

4.5. Running Time

Running time under (a) the AI configuration and (b) the LDP configuration.
  • Both ETH-CNN & ETH-LSTM consume less than 1% of the time required by the original HM.

Since this is a TIP journal paper, many details and results are skipped here. Please feel free to read the paper if interested.

During the days of coronavirus, a challenge of writing 30/35/40/45 stories again for this month has been accomplished. This is the 45th story in this month..!! Let me challenge 50 stories… or take a rest and watch Netflix first?? Thanks for visiting my story.
