Reading: ETH-CNN & ETH-LSTM — Reducing Complexity of HEVC (Fast HEVC Intra & Inter Prediction)
39.76% to 59.74%, and 43.14% to 64.07% Time Reduction with Only 1.722% and 1.483% BD-Rate Increase for LDB & RA Configurations Respectively, Outperforms Liu TIP’16 and Li ICME’17
In this story, ETH-CNN & ETH-LSTM, by Beihang University and Imperial College London, are presented. I read this because I work on video coding research. This paper extends the conference paper Li ICME’17 by introducing an LSTM into the network architecture, which is also a first attempt to use an LSTM to predict the CU partition in HEVC. This is a paper in 2018 TIP. (Sik-Ho Tsang @ Medium)
Outline
- CPH-Inter Database
- ETH-CNN Network Architecture
- ETH-LSTM Network Architecture
- Experimental Results
1. CPH-Inter Database
- CPH-Intra Database has been proposed in Li ICME’17.
- In this paper, CPH-Inter database is proposed.
- 111 raw video sequences were selected, consisting of 6 sequences at 1080p (1920 × 1080) from [37], 18 sequences of Classes A–E from the Joint Collaborative Team on Video Coding (JCT-VC) standard test set [38], and 87 sequences from Xiph.org [39] at various resolutions.
- Sequences longer than 10 seconds were clipped to 10 seconds.
- They were divided into non-overlapping training (83 sequences), validation (10 sequences) and test (18 sequences) sets.
- The sequences in the CPH-Inter database were encoded by HM 16.5 at four QPs {22, 27, 32, 37}, using the LDP, LDB and RA configurations.
- 12 sub-databases were obtained for each configuration, corresponding to the 4 QPs and 3 CU sizes (illustrated in the small sketch after this list).
- In total 307,831,288, 275,163,224 and 232,095,164 samples were collected for the LDP, LDB and RA configurations in the CPH-Inter database, respectively.
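As a side note, here is a tiny illustration of where the 12 sub-databases per configuration come from (4 QPs × 3 labeled CU sizes). The naming scheme below is purely hypothetical and only serves to show the combinatorics, not the database's actual file layout.

```python
from itertools import product

# 4 QPs x 3 CU sizes whose split decisions are labeled = 12 sub-databases per configuration.
QPS = (22, 27, 32, 37)
CU_SIZES = (64, 32, 16)

for config in ("LDP", "LDB", "RA"):
    sub_dbs = [f"{config}_QP{qp}_CU{cu}x{cu}" for qp, cu in product(QPS, CU_SIZES)]
    print(config, len(sub_dbs))   # -> LDP 12, LDB 12, RA 12
```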
2. ETH-CNN Network Architecture
2.1. Network Architecture
- Preprocessing layers: The raw CTU is preprocessed by mean removal and down-sampling in three parallel branches B1 to B3, corresponding to three levels of HCPM.
- Here, HCPM (Hierarchical CU Partition Map) is the label map at the output: each entry indicates whether the corresponding CU is split (1) or not split (0).
- At branches B1 and B2, CTUs are down-sampled to 16×16 and 32×32, respectively.
- Convolutional layers: The data are convolved with 4 × 4 kernels (16 filters in total) at the first convolutional layer to extract low-level features.
- At the second and third layers, the feature maps are sequentially convolved twice with 2 × 2 kernels (24 filters at the second layer and 32 filters at the third layer) to generate higher-level features.
- Concatenating layer: All feature maps at the three branches, produced by the second and third convolutional layers, are concatenated together and then flattened into a vector a.
- Fully connected layers: All features in the concatenated vector a are processed in three branches. In each branch, the vectorized features of the concatenating layer pass through three fully connected layers, including two hidden layers and one output layer.
- The output layer produces HCPM as the output of ETH-CNN.
- ReLU activation is used in all layers except the output layer, where the sigmoid function is applied. (A minimal sketch of the whole network is given after this list.)
- The cross-entropy loss over the HCPM is used for training, where H is the cross entropy between the ground-truth and predicted labels, summed over all entries at the three levels of the HCPM.
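To make the three-branch structure concrete, here is a minimal PyTorch sketch that follows the bullets above. The mean-removal and average-pooling preprocessing, the hidden-layer widths, and the omission of the QP input and dropout are my simplifications/assumptions, not taken verbatim from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_stack():
    # Non-overlapping convolutions (stride = kernel size): 4x4 kernels with 16 filters,
    # then two 2x2 convolutions with 24 and 32 filters, as described above.
    return nn.ModuleList([
        nn.Conv2d(1, 16, kernel_size=4, stride=4),
        nn.Conv2d(16, 24, kernel_size=2, stride=2),
        nn.Conv2d(24, 32, kernel_size=2, stride=2),
    ])

class ETHCNNSketch(nn.Module):
    def __init__(self, hidden=(128, 64)):            # hidden-layer widths are assumptions
        super().__init__()
        self.branches = nn.ModuleList([conv_stack() for _ in range(3)])
        # Length of the concatenated vector a: flattened 2nd- and 3rd-layer maps of all branches.
        feat_len = sum(s * s * 24 + (s // 2) * (s // 2) * 32 for s in (2, 4, 8))  # = 2688
        # One fully connected branch per HCPM level, with 1, 4 and 16 output entries.
        self.heads = nn.ModuleList([
            nn.Sequential(
                nn.Linear(feat_len, hidden[0]), nn.ReLU(),
                nn.Linear(hidden[0], hidden[1]), nn.ReLU(),
                nn.Linear(hidden[1], out_dim), nn.Sigmoid(),
            )
            for out_dim in (1, 4, 16)
        ])

    def forward(self, ctu):                          # ctu: (N, 1, 64, 64) luma CTU
        x = ctu - ctu.mean(dim=(2, 3), keepdim=True) # mean removal
        inputs = [F.avg_pool2d(x, 4),                # B1: down-sampled to 16x16
                  F.avg_pool2d(x, 2),                # B2: down-sampled to 32x32
                  x]                                 # B3: original 64x64
        feats = []
        for branch, inp in zip(self.branches, inputs):
            h1 = F.relu(branch[0](inp))
            h2 = F.relu(branch[1](h1))               # 2nd-layer feature maps
            h3 = F.relu(branch[2](h2))               # 3rd-layer feature maps
            feats += [h2.flatten(1), h3.flatten(1)]
        a = torch.cat(feats, dim=1)                  # concatenated vector a
        # HCPM split probabilities: level 1 (1 entry), level 2 (4), level 3 (16)
        return [head(a) for head in self.heads]

# Training sketch: sum the binary cross entropy over the three HCPM levels,
# with labels y1 (N,1), y2 (N,4), y3 (N,16), where 1 = split and 0 = not split:
#   loss = sum(F.binary_cross_entropy(p, y) for p, y in zip(model(ctu), labels))
```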
2.2. Bi-Threshold Decision Scheme
- For better tradeoff between complexity and performance, bi-threshold decision scheme is used.
- If P(U) > α1, the CU is directly split; if P(U) ≤ α2, it is not split; otherwise (when P(U) falls between the two thresholds), the conventional full RDO is performed to decide. A minimal sketch of this decision follows.
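Below is a minimal sketch of the bi-threshold decision for a single CU, assuming α1 is the upper threshold and α2 the lower one; the concrete threshold values are only illustrative, not the paper's.

```python
def cu_split_decision(p_split, alpha1=0.6, alpha2=0.4):
    """Bi-threshold decision: p_split is the predicted split probability P(U) of one CU."""
    if p_split > alpha1:
        return "split"       # confident split: skip checking the non-split mode
    if p_split <= alpha2:
        return "not_split"   # confident non-split: skip checking the four sub-CUs
    return "rdo"             # uncertain region: fall back to the conventional full RDO
```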
3. ETH-LSTM Network Architecture
- The input to ETH-LSTM is the residual CTU, i.e., the difference between the original CTU and its prediction.
- The features extracted by ETH-CNN are fed into ETH-LSTM.
- There are three LSTM cells, each corresponding to one of the three levels of the HCPM.
- At each level, two fully connected layers follow the LSTM cell; their input also includes the QP value and the order of frame t in the GOP.
- The output of the second fully connected layer is the probabilities of CU splitting, which are binarized to predict HCPM.
- The LSTM state at frame t is passed to the LSTM at the next frame, so that correlation across frames is exploited. (A minimal sketch of one HCPM level of ETH-LSTM follows this list.)
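To make the temporal part concrete, here is a minimal PyTorch sketch of one HCPM level of an ETH-LSTM-like predictor. The feature dimension, hidden sizes, and the exact way the QP and frame order are appended are my assumptions; the actual network has one such cell per HCPM level.

```python
import torch
import torch.nn as nn

class ETHLSTMLevelSketch(nn.Module):
    # One HCPM level; out_dim = 1, 4 or 16 depending on the level.
    def __init__(self, feat_dim=512, hidden_dim=128, out_dim=4):
        super().__init__()
        self.cell = nn.LSTMCell(feat_dim, hidden_dim)
        # Two fully connected layers; QP and the frame's order in the GOP are appended here.
        self.fc1 = nn.Linear(hidden_dim + 2, 64)
        self.fc2 = nn.Linear(64, out_dim)

    def forward(self, feats_per_frame, qp, orders):
        # feats_per_frame: list of (N, feat_dim) ETH-CNN features of the residual CTU,
        # one per frame t; qp: (N, 1); orders: list of (N, 1) frame orders within the GOP.
        state, probs = None, []
        for f, order in zip(feats_per_frame, orders):
            h, c = self.cell(f, state)        # LSTM state carried over from the previous frame
            state = (h, c)
            x = torch.relu(self.fc1(torch.cat([h, qp, order], dim=1)))
            probs.append(torch.sigmoid(self.fc2(x)))   # split probabilities at this level
        return probs  # binarize (e.g. > 0.5) to obtain the predicted HCPM per frame
```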
4. Experimental Results
4.1. BD-Rate Under AI Configuration
- Using ETH-CNN, a 1.386% BD-rate increase is obtained with 64.01% to 70.52% time reduction.
- Using ETH-CNN, a 2.247% BD-rate increase is obtained with 56.92% to 66.47% time reduction.
4.2. BD-Rate Under LDP Configuration
- Using ETH-LSTM, a 1.495% BD-rate increase is obtained with 43.84% to 62.94% time reduction.
4.3. BD-Rate Under LDB & RA Configurations
- Again, ETH-LSTM obtains the lowest BD-rate increase while achieving a large amount of time reduction.
4.4. Ablation Study
- For AI, ETH-CNN outperforms Liu TIP’16 and Li ICME’17.
- For LDP, ETH-CNN using residual CTUs as input outperforms ETH-CNN using original CTUs as input. With the further aid of the LSTM, ETH-LSTM performs best.
4.5. Running Time
- Both ETH-CNN & ETH-LSTM consume less than 1% of the time required by the original HM.
Since this is a TIP journal paper, many details and results are skipped here. Please feel free to read the paper if interested.
During the days of coronavirus, a challenge of writing 30/35/40/45 stories again for this month has been accomplished. This is the 45th story in this month..!! Let me challenge 50 stories… or take a rest and watch Netflix first?? Thanks for visiting my story..
Reference
[2018 TIP] [ETH-CNN & ETH-LSTM]
Reducing Complexity of HEVC: A Deep Learning Approach
Codec Fast Prediction
H.264 to HEVC [Wei VCIP’17] [H-LSTM]
HEVC [Yu ICIP’15 / Liu ISCAS’16 / Liu TIP’16] [Laude PCS’16] [Li ICME’17] [Katayama ICICT’18] [Chang DCC’18] [ETH-CNN & ETH-LSTM] [Zhang RCAR’19]
VVC [Jin VCIP’17] [Jin PCM’17] [Wang ICIP’18] [Pooling-Variable CNN]