Reading: ETH-CNN & ETH-LSTM — Reducing Complexity of HEVC (Fast HEVC Intra & Inter Prediction)
39.76% to 59.74%, and 43.14% to 64.07% Time Reduction with Only 1.722% and 1.483% BD-Rate Increase for LDB & RA Configurations Respectively, Outperforms Liu TIP’16 and Li ICME’17
In this story, ETH-CNN & ETH-LSTM, by Beihang University and Imperial College London, are presented. I read this because I work on video coding research. This paper extends the conference paper Li ICME’17 by introducing an LSTM into the network architecture, which is also a first attempt to use an LSTM to predict the CU partition in HEVC. This is a paper in 2018 TIP. (Sik-Ho Tsang @ Medium)
Outline
- CPH-Inter Database
- ETH-CNN Network Architecture
- ETH-LSTM Network Architecture
- Experimental Results
1. CPH-Inter Database
- CPH-Intra Database has been proposed in Li ICME’17.
- In this paper, CPH-Inter database is proposed.
- 111 raw video sequences were selected, consisting of 6 sequences at 1080p (1920 × 1080) from [37], 18 sequences of Classes A–E from the Joint Collaborative Team on Video Coding (JCT-VC) standard test set [38], and 87 sequences from Xiph.org [39] at various resolutions.
- Sequences longer than 10 seconds were clipped to 10 seconds.
- They were divided into non-overlapping training (83 sequences), validation (10 sequences) and test (18 sequences) sets.
- The sequences in the CPH-Inter database were encoded by HM 16.5 at four QPs {22, 27, 32, 37}, using the LDP, LDB and RA configurations.
- 12 sub-databases were obtained for each configuration, corresponding to the 4 QPs and 3 CU sizes (illustrated in the small sketch after this list).
- In total 307,831,288, 275,163,224 and 232,095,164 samples were collected for the LDP, LDB and RA configurations in the CPH-Inter database, respectively.
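As a side note, here is a tiny illustration of where the 12 sub-databases per configuration come from (4 QPs × 3 labeled CU sizes). The naming scheme below is purely hypothetical and only serves to show the combinatorics, not the database's actual file layout.

```python
from itertools import product

# 4 QPs x 3 CU sizes whose split decisions are labeled = 12 sub-databases per configuration.
QPS = (22, 27, 32, 37)
CU_SIZES = (64, 32, 16)

for config in ("LDP", "LDB", "RA"):
    sub_dbs = [f"{config}_QP{qp}_CU{cu}x{cu}" for qp, cu in product(QPS, CU_SIZES)]
    print(config, len(sub_dbs))   # -> LDP 12, LDB 12, RA 12
```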
2. ETH-CNN Network Architecture
2.1. Network Architecture
- Preprocessing layers: The raw CTU is preprocessed by mean removal and down-sampling in three parallel branches B1 to B3, corresponding to three levels of HCPM.
- Here, HCPM (Hierarchical CU Partition Map) is the label map at the output: each entry indicates whether the corresponding CU is split (1) or not split (0).
- At branches B1 and B2, CTUs are down-sampled to 16×16 and 32×32, respectively.
- Convolutional layers: The data are convolved with 4 × 4 kernels (16 filters in total) at the first convolutional layer to extract low-level features.
- At the second and third layers, the feature maps are sequentially convolved twice with 2 × 2 kernels (24 filters at the second layer and 32 filters at the third layer) to generate higher-level features.
- Concatenating layer: All feature maps at the three branches, produced by the second and third convolutional layers, are concatenated together and then flattened into a vector a.
- Fully connected layers: All features in the concatenated vector a are processed in three branches. In each branch, the vectorized features of the concatenating layer pass through three fully connected layers, including two hidden layers and one output layer.
- The output layer produces HCPM as the output of ETH-CNN.
- ReLU activation is used in all layers except the output layer, where the sigmoid function is applied. (A minimal sketch of the whole network is given after this list.)
- The cross-entropy loss over the HCPM is used for training, where H is the cross entropy between the ground-truth and predicted labels, summed over all entries at the three levels of the HCPM.
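To make the three-branch structure concrete, here is a minimal PyTorch sketch that follows the bullets above. The mean-removal and average-pooling preprocessing, the hidden-layer widths, and the omission of the QP input and dropout are my simplifications/assumptions, not taken verbatim from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_stack():
    # Non-overlapping convolutions (stride = kernel size): 4x4 kernels with 16 filters,
    # then two 2x2 convolutions with 24 and 32 filters, as described above.
    return nn.ModuleList([
        nn.Conv2d(1, 16, kernel_size=4, stride=4),
        nn.Conv2d(16, 24, kernel_size=2, stride=2),
        nn.Conv2d(24, 32, kernel_size=2, stride=2),
    ])

class ETHCNNSketch(nn.Module):
    def __init__(self, hidden=(128, 64)):            # hidden-layer widths are assumptions
        super().__init__()
        self.branches = nn.ModuleList([conv_stack() for _ in range(3)])
        # Length of the concatenated vector a: flattened 2nd- and 3rd-layer maps of all branches.
        feat_len = sum(s * s * 24 + (s // 2) * (s // 2) * 32 for s in (2, 4, 8))  # = 2688
        # One fully connected branch per HCPM level, with 1, 4 and 16 output entries.
        self.heads = nn.ModuleList([
            nn.Sequential(
                nn.Linear(feat_len, hidden[0]), nn.ReLU(),
                nn.Linear(hidden[0], hidden[1]), nn.ReLU(),
                nn.Linear(hidden[1], out_dim), nn.Sigmoid(),
            )
            for out_dim in (1, 4, 16)
        ])

    def forward(self, ctu):                          # ctu: (N, 1, 64, 64) luma CTU
        x = ctu - ctu.mean(dim=(2, 3), keepdim=True) # mean removal
        inputs = [F.avg_pool2d(x, 4),                # B1: down-sampled to 16x16
                  F.avg_pool2d(x, 2),                # B2: down-sampled to 32x32
                  x]                                 # B3: original 64x64
        feats = []
        for branch, inp in zip(self.branches, inputs):
            h1 = F.relu(branch[0](inp))
            h2 = F.relu(branch[1](h1))               # 2nd-layer feature maps
            h3 = F.relu(branch[2](h2))               # 3rd-layer feature maps
            feats += [h2.flatten(1), h3.flatten(1)]
        a = torch.cat(feats, dim=1)                  # concatenated vector a
        # HCPM split probabilities: level 1 (1 entry), level 2 (4), level 3 (16)
        return [head(a) for head in self.heads]

# Training sketch: sum the binary cross entropy over the three HCPM levels,
# with labels y1 (N,1), y2 (N,4), y3 (N,16), where 1 = split and 0 = not split:
#   loss = sum(F.binary_cross_entropy(p, y) for p, y in zip(model(ctu), labels))
```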
2.2. Bi-Threshold Decision Scheme
- For better tradeoff between complexity and performance, bi-threshold decision scheme is used.
- If P(U) > α1, the CU is directly split; if P(U) ≤ α2, it is not split; otherwise (when P(U) falls between the two thresholds), the conventional full RDO is performed to decide. A minimal sketch of this decision follows.
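Below is a minimal sketch of the bi-threshold decision for a single CU, assuming α1 is the upper threshold and α2 the lower one; the concrete threshold values are only illustrative, not the paper's.

```python
def cu_split_decision(p_split, alpha1=0.6, alpha2=0.4):
    """Bi-threshold decision: p_split is the predicted split probability P(U) of one CU."""
    if p_split > alpha1:
        return "split"       # confident split: skip checking the non-split mode
    if p_split <= alpha2:
        return "not_split"   # confident non-split: skip checking the four sub-CUs
    return "rdo"             # uncertain region: fall back to the conventional full RDO
```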
3. ETH-LSTM Network Architecture
- The input to ETH-LSTM is the residual CTU, i.e., the difference between the original CTU and its prediction.
- The features extracted by ETH-CNN are fed into ETH-LSTM.
- There are three LSTM cells, each corresponding to one of the three levels of the HCPM.
- At each level, two fully connected layers follow the LSTM cell; their input also includes the QP value and the order of frame t in the GOP.
- The output of the second fully connected layer is the probabilities of CU splitting, which are binarized to predict HCPM.
- The LSTM state at frame t is passed to the LSTM at the next frame, so that correlation across frames is exploited. (A minimal sketch of one HCPM level of ETH-LSTM follows this list.)
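To make the temporal part concrete, here is a minimal PyTorch sketch of one HCPM level of an ETH-LSTM-like predictor. The feature dimension, hidden sizes, and the exact way the QP and frame order are appended are my assumptions; the actual network has one such cell per HCPM level.

```python
import torch
import torch.nn as nn

class ETHLSTMLevelSketch(nn.Module):
    # One HCPM level; out_dim = 1, 4 or 16 depending on the level.
    def __init__(self, feat_dim=512, hidden_dim=128, out_dim=4):
        super().__init__()
        self.cell = nn.LSTMCell(feat_dim, hidden_dim)
        # Two fully connected layers; QP and the frame's order in the GOP are appended here.
        self.fc1 = nn.Linear(hidden_dim + 2, 64)
        self.fc2 = nn.Linear(64, out_dim)

    def forward(self, feats_per_frame, qp, orders):
        # feats_per_frame: list of (N, feat_dim) ETH-CNN features of the residual CTU,
        # one per frame t; qp: (N, 1); orders: list of (N, 1) frame orders within the GOP.
        state, probs = None, []
        for f, order in zip(feats_per_frame, orders):
            h, c = self.cell(f, state)        # LSTM state carried over from the previous frame
            state = (h, c)
            x = torch.relu(self.fc1(torch.cat([h, qp, order], dim=1)))
            probs.append(torch.sigmoid(self.fc2(x)))   # split probabilities at this level
        return probs  # binarize (e.g. > 0.5) to obtain the predicted HCPM per frame
```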
4. Experimental Results
4.1. BD-Rate Under AI Configuration
- Using ETH-CNN, a 1.386% BD-rate increase is obtained with 64.01% to 70.52% time reduction.
- Using ETH-CNN, a 2.247% BD-rate increase is obtained with 56.92% to 66.47% time reduction.
4.2. BD-Rate Under LDP Configuration
- Using ETH-LSTM, a 1.495% BD-rate increase is obtained with 43.84% to 62.94% time reduction.
4.3. BD-Rate Under LDB & RA Configurations
- Again, ETH-LSTM obtains the lowest BD-rate increase while achieving a large amount of time reduction.
4.4. Ablation Study
- For AI, ETH-CNN outperforms Liu TIP’16 and Li ICME’17.
- For LDP, ETH-CNN using residual CTUs as input outperforms ETH-CNN using original CTUs as input. With the further aid of the LSTM, ETH-LSTM performs best.
4.5. Running Time
- Both ETH-CNN & ETH-LSTM consume less than 1% of the time required by the original HM.
Since this is a TIP journal paper, many details and results are skipped here. Please feel free to read the paper if interested.
During the days of coronavirus, a challenge of writing 30/35/40/45 stories again for this month has been accomplished. This is the 45th story in this month..!! Let me challenge 50 stories… or take a rest and watch Netflix first?? Thanks for visiting my story..
Reference
[2018 TIP] [ETH-CNN & ETH-LSTM]
Reducing Complexity of HEVC: A Deep Learning Approach
Codec Fast Prediction
H.264 to HEVC [Wei VCIP’17] [H-LSTM]
HEVC [Yu ICIP’15 / Liu ISCAS’16 / Liu TIP’16] [Laude PCS’16] [Li ICME’17] [Katayama ICICT’18] [Chang DCC’18] [ETH-CNN & ETH-LSTM] [Zhang RCAR’19]
VVC [Jin VCIP’17] [Jin PCM’17] [Wang ICIP’18] [Pooling-Variable CNN]