Reading: CNN-SENet — Fast Depth Intra Coding (Fast 3D-HEVC)

20.9% Encoding Time Reduction Without Any Significant Loss

Sik-Ho Tsang
3 min read · Jul 30, 2020


3D-HEVC (DIBR: Depth Image Based Rendering)

In this story, Fast Depth Intra Coding based on Layer-classification and CNN for 3D-HEVC (CNN-SENet), by Beijing University of Technology, is presented. I read this because I work on video coding research. This paper is only 1 page long, which means it does not contain many details. In this paper:

  • A convolutional neural network (CNN) scheme based on layer-classification for fast depth intra coding is designed to determine the smoothest depth map.
  • Then, a CNN network incorporating SENet (CNN-SENet) structure is designed and trained.
  • Finally, the layer-classification model and the CNN-SENet network are combined to predict the coding unit (CU) partition of all CUs in the depth map.

This is a paper in 2020 DCC. (Sik-Ho Tsang @ Medium)


  • There are 3 modules as mentioned in the paper.
  1. Module 1: Layer-Classification Model
  2. Module 2: Network Incorporating SENet
  3. Module 3: CTU Partition Decision Unit

1. Module 1: Layer-Classification Model

  • The input is a 64×64-pixel block that is preprocessed by mean removal and down-sampling.
  • The first hidden layer (C1 layer) is a convolution layer with 16 feature maps.
  • The second hidden layer (C2 layer) is a convolution layer with 24 feature maps of 8×8 size.
  • The third hidden layer (C3 layer) is a convolution layer with 32 feature maps of 4×4 size.
  • Then the last two hidden layers are fully connected layers.
  • When training the CNN, features after the two fully connected layers are randomly dropped out with probabilities of 50% and 20%, respectively.
  • Last, the output gives 16 prediction probabilities for the CTU partition decision in Module 3.
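The mean-removal and down-sampling preprocessing of the 64×64 input can be sketched as follows. The 4× averaging factor (giving a 16×16 input) is an assumption, since the paper does not state the down-sampling ratio:

```python
import numpy as np

def preprocess_block(block):
    """Sketch of the Module 1 input preprocessing: mean removal followed
    by down-sampling. The 4x4 averaging (64x64 -> 16x16) is an assumed
    ratio; the paper states only that both steps are applied."""
    block = block.astype(np.float32)
    block -= block.mean()  # mean removal: zero-mean depth block
    # down-sample by averaging non-overlapping 4x4 windows (assumed factor)
    h, w = block.shape
    return block.reshape(h // 4, 4, w // 4, 4).mean(axis=(1, 3))

# a 64x64 depth block becomes a zero-mean 16x16 input to the CNN
x = preprocess_block(np.random.randint(0, 256, (64, 64)))
```

Because the windows are non-overlapping and equal-sized, the down-sampled block keeps the zero mean produced by the mean-removal step.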

2. Module 2: Network Incorporating SENet

  • Module 2 represents the network structure of the SENet.
  • The branch of C3 represents the SENet structure.
  • In SENet, global average pooling is first applied to C3; this is called the squeeze process.
  • After that, the output goes through two fully connected layers, referred to as the excitation process.
  • Finally, a sigmoid limits the output to the range [0, 1], and these values are multiplied as channel-wise scales onto the 32 channels of C3 to form the input of the next level.
  • By controlling these scales, the SENet can enhance the important features and weaken the unimportant ones, giving the extracted features better directivity.
  • (Please feel free to read SENet if interested.)
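The squeeze-and-excitation steps above can be sketched in NumPy. Only the pool/two-FC/sigmoid/scale structure comes from the paper; the weight shapes and the reduction from 32 to 8 channels are assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def se_block(c3, w1, w2):
    """Sketch of the SE branch on C3 (32 channels of 4x4 feature maps).
    Weight shapes and the 32 -> 8 -> 32 bottleneck are assumptions; the
    paper only names the squeeze (global average pooling), excitation
    (two FC layers), and sigmoid scaling steps."""
    s = c3.mean(axis=(1, 2))          # squeeze: global average pool -> (32,)
    e = np.maximum(w1 @ s, 0.0)       # excitation FC 1 + ReLU -> (8,)
    scale = sigmoid(w2 @ e)           # excitation FC 2 + sigmoid -> (32,), in [0, 1]
    return c3 * scale[:, None, None]  # channel-wise rescaling of C3

# example: rescale a random 32-channel 4x4 feature tensor
c3 = np.random.randn(32, 4, 4)
w1 = np.random.randn(8, 32) * 0.1   # 32 -> 8 (assumed reduction)
w2 = np.random.randn(32, 8) * 0.1   # 8 -> 32
y = se_block(c3, w1, w2)
```

Since each scale lies in (0, 1), every channel of C3 is attenuated in proportion to its learned importance, which is exactly the "enhance important / weaken unimportant" behaviour described above.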

3. Module 3: CTU Partition Decision Unit

  • Module 3 is the CTU partition decision unit, which decides the partition according to the category and the 16 CUs’ prediction probabilities.
  • First, the 16 outputs from Module 1 are recorded in an output matrix.
  • Then, different calculation methods are used to compute the partition probability of each CU. (But no details about these calculation methods are given.)
  • Experimental results show that the proposed method reduces encoding time by 20.9% without any significant loss of 3D video quality.
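Since the paper does not disclose the actual calculation, the following is only one plausible reading of how 16 outputs could drive a quadtree decision: treat them as split probabilities of the 16 16×16 sub-blocks of a 64×64 CTU, and aggregate them by averaging for the larger CU sizes. Both the mapping and the threshold are assumptions, not the authors' method:

```python
import numpy as np

def ctu_partition(probs, thr=0.5):
    """Illustrative sketch only: the 16 CNN outputs are read as split
    probabilities of the 16 16x16 sub-blocks of a 64x64 CTU, and each
    larger CU's split probability is the mean over the 16x16 blocks it
    covers. The aggregation rule and threshold thr are assumptions."""
    p = np.asarray(probs).reshape(4, 4)
    split64 = p.mean() > thr                              # whole 64x64 CTU
    # 2x2 grid of 32x32 CUs, each covering a 2x2 quadrant of p
    split32 = p.reshape(2, 2, 2, 2).mean(axis=(1, 3)) > thr
    split16 = p > thr                                     # individual 16x16 CUs
    return split64, split32, split16
```

Under this reading, a smooth CTU with uniformly low probabilities is kept whole, which is where the encoding-time saving would come from.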

This is the 27th story in this month.


