Reading: Li ICME’17 — Three-Branch Deep CNN for Complexity Reduction on Intra-Mode HEVC (Fast HEVC Prediction)

62.25% and 69.06% Average Time Reduction With Negligible BD-rate of 2.12% and 1.38%, Outperforms Liu ISCAS’16

In this story, “A deep convolutional neural network approach for complexity reduction on intra-mode HEVC” (Li ICME’17), by Beihang University and Imperial College London, is presented. I read this because I work on video coding research. In this paper:

  • Firstly, a large-scale database with diversiform patterns of CTU partition is established.
  • Secondly, the partition problem is modelled as a three-level classification problem.
  • Lastly, a deep CNN structure with various sizes of convolutional kernels is developed.

This is a paper in 2017 ICME. (Sik-Ho Tsang @ Medium)


  1. CTU Partition of Intra-mode HEVC (CPIH) Database
  2. Network Architecture
  3. Experimental Results

1. CTU Partition of Intra-mode HEVC (CPIH) Database

  • To the best of the authors’ knowledge, this database is the first one on CTU partition patterns.
  • First, 2000 images at resolution 4928×3264 are selected from Raw Images Dataset (RAISE).
  • These 2000 images are randomly divided into training (1700 images), validation (100 images) and test (200 images) sets.
  • Furthermore, each set is equally divided into four subsets: one subset is with original resolution and the other three subsets are down-sampled to be 2880×1920, 1536×1024 and 768×512 to support different resolutions.
  • (For knowledge of video coding and HEVC, please feel free to read Sections 1 & 2 of IPCNN.)
  • All images are encoded by the HEVC reference software HM using four Quantization Parameters (QPs) of {22, 27, 32, 37}.
  • After encoding, the binary labels indicating splitting (=1) and non-splitting (=0) are obtained for all CUs.
  • Finally, 12 sub-databases are established according to QP and CU size, since 4 QPs are applied and CUs come in 3 different sizes (64×64, 32×32 and 16×16).
  • The above table shows the details. In total, 110,405,784 samples are gathered, ensuring the sufficiency of training data, and the percentages of splitting and non-splitting CUs are 49.2% and 50.8%, respectively.
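A minimal sketch of how the binary labels described above could be gathered for one 64×64 CTU. It assumes the encoder's final partition is available as three hypothetical containers of split flags (`split64`, `split32`, `split16`; these names are illustrative and come from neither the paper nor HM). It also assumes a child CU contributes a sample only when its parent CU was split, which follows from the quadtree structure but is not spelled out above.

```python
# The 12 sub-databases: one per (QP, CU size) pair.
SUB_DATABASES = [(qp, size) for qp in (22, 27, 32, 37) for size in (64, 32, 16)]


def cu_split_samples(split64, split32, split16):
    """Return (cu_size, x, y, label) samples for one 64x64 CTU.

    split64: bool, whether the 64x64 CU is split.
    split32: 2x2 nested list of bools, indexed [row][col], for the 32x32 CUs.
    split16: 4x4 nested list of bools, indexed [row][col], for the 16x16 CUs.
    label is 1 for splitting and 0 for non-splitting.
    Assumption: a child CU only exists (and yields a sample) if its parent split.
    """
    samples = [(64, 0, 0, int(split64))]
    if not split64:
        return samples
    for i in range(2):          # row of the 32x32 CU
        for j in range(2):      # column of the 32x32 CU
            samples.append((32, 32 * j, 32 * i, int(split32[i][j])))
            if not split32[i][j]:
                continue
            for di in range(2):
                for dj in range(2):
                    r, c = 2 * i + di, 2 * j + dj  # position in the 4x4 grid
                    samples.append((16, 16 * c, 16 * r, int(split16[r][c])))
    return samples
```

Each returned tuple would then be routed to the sub-database matching its CU size and the QP used for encoding.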

2. Network Architecture

Network Architecture
  • Three classifiers Sl are trained for different CU sizes of 64×64 (U), 32×32 (Ui) and 16×16 (Ui,j).
  • The only difference among the three separate CNN models is kernel sizes of the first convolutional layer, pertinent to different CU sizes.
  • Input layer: The input to each CNN model is a wl×wl matrix, where wl ∈ {64, 32, 16}.
  • Convolutional layers: For the 1st convolutional layer, three branches of filters C1−1, C1−2 and C1−3 with kernel sizes of wl/8×wl/8, wl/4×wl/4 and wl/2×wl/2 are applied in parallel to extract low-level features of CU splitting. The stride equals the kernel size, which makes them non-overlapping convolutions.
  • Following the 1st convolutional layer, feature maps are half-scaled by convolving with non-overlapping 2×2 kernels, until the size of the final feature maps reaches 2×2.
  • Other layers: All feature maps, yielded from the last convolutional layer, are concatenated together and then converted into a vector, through the concatenation layer.
  • This vector then goes through the fully-connected layers, including two hidden layers and one output layer, with 50% dropout.
  • ReLU is used for all layers except the output layer, where Sigmoid is used since the label is binary.
  • The details for three classifiers are as shown above.
  • It is noted that Liu ISCAS’16 has only 1,224 trainable parameters, which might cause underfitting, whereas the proposed networks have far more trainable parameters, as shown above.
  • Standard cross entropy loss is used where R is mini-batch size:
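The layer arithmetic and the loss described above can be sketched in plain Python. This is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the loss is assumed to be the usual binary cross entropy averaged over the R samples of a mini-batch, and the `eps` clipping is added here only for numerical safety.

```python
import math


def branch_feature_sizes(w):
    """Trace feature-map widths through each first-layer branch.

    For a w x w CU, the branches use kernels w/8, w/4 and w/2 with stride
    equal to the kernel size, so the first outputs are 8, 4 and 2 wide.
    Each following non-overlapping 2x2 convolution halves the width,
    stopping once it reaches 2.
    Returns {first_layer_kernel_size: [width after each conv layer]}.
    """
    if w not in (64, 32, 16):
        raise ValueError("CU width must be 64, 32 or 16")
    sizes = {}
    for divisor in (8, 4, 2):
        kernel = w // divisor
        width = w // kernel          # stride == kernel, no overlap
        trace = [width]
        while width > 2:
            width //= 2              # non-overlapping 2x2 convolution
            trace.append(width)
        sizes[kernel] = trace
    return sizes


def binary_cross_entropy(labels, predictions, eps=1e-12):
    """Mean binary cross entropy over a mini-batch of R samples."""
    if not labels or len(labels) != len(predictions):
        raise ValueError("need equally sized, non-empty batches")
    total = 0.0
    for y, p in zip(labels, predictions):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(labels)
```

For example, `branch_feature_sizes(64)` returns `{8: [8, 4, 2], 16: [4, 2], 32: [2]}`, so every branch ends with 2×2 feature maps before concatenation, and the same widths result for 32×32 and 16×16 CUs.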

3. Experimental Results

  • Test Sets: All 200 images of the test set of the CPIH database, and all 18 video sequences of the Joint Collaborative Team on Video Coding (JCT-VC) standard test set.
Video Test Set
  • An average time reduction of 60.91% to 67.20% is achieved with a 2.12% BD-rate increase, which outperforms the SVM approach [13] and Liu ISCAS’16 [21].
CPIH Image Test Set
  • Similar results are obtained for the CPIH image test set.
  • An average time reduction of 64.86% to 73.10% is achieved with a 1.38% BD-rate increase, which outperforms the SVM approach [13] and Liu ISCAS’16 [21].
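For readers unfamiliar with the time-reduction figures above, here is a small sketch of how such a percentage is commonly computed in fast-encoding work. It assumes the metric is the relative saving in encoding time against the unmodified encoder; this review does not state the exact definition used in the paper, and the function name is hypothetical.

```python
def time_reduction_percent(t_reference, t_fast):
    """Relative encoding-time saving, in percent.

    t_reference: encoding time of the unmodified encoder (assumed baseline).
    t_fast: encoding time with the fast partition decision enabled.
    """
    if t_reference <= 0:
        raise ValueError("reference time must be positive")
    return 100.0 * (t_reference - t_fast) / t_reference
```

With made-up timings of 200 s for the baseline and 80 s for the fast encoder, `time_reduction_percent(200.0, 80.0)` gives 60.0.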

During the days of coronavirus, a challenge of writing 30/35/40 stories again for this month has been accomplished. Let me challenge 45 stories!! This is the 42nd story in this month. Thanks for visiting my story.



