Reading: Kuanar JCSSP’19 — Object Detection Approach for Video Coding (Fast HEVC Prediction)
Up to 66.89% Encoding Time Saving with a BD-Rate Loss as Low as 1.31%, Outperforming Li ICME’17
In this story, “Adaptive CU Mode Selection in HEVC Intra Prediction: A Deep Learning Approach” (Kuanar JCSSP’19), by the University of Texas at Arlington and the University of Texas at Dallas, is briefly reviewed. In this paper:
- CNN-based Object Detection Network is used which includes Region Proposal Network (RPN).
- Regions are classified as homogeneous object, big object, granular texture, or small object.
- According to the object type, different fast approaches are applied.
This is a paper in the 2019 JCSSP (Springer Journal of Circuits, Systems, and Signal Processing) with an impact factor of 1.922. (Sik-Ho Tsang @ Medium)
Outline
- Overall Framework
- Network Architecture
- Experimental Results
1. Overall Framework
- For each CTU, the CTU and its sub-CUs are input to the CNN-based classifier, as shown in the flowchart above.
- The CU is classified as homogeneous object (class 0), big object (class 1), granular texture (class 2), or small object (class 3).
- In conventional HEVC, there are 35 intra prediction modes in total. (Please read Sections 1 & 2 of the IPCNN story for HEVC intra coding, if interested.) According to the object type, a reduced set of intra prediction modes, much fewer than 35, is checked, as shown in Algorithm I (PUMS) above. Therefore, the encoding time can be reduced.
- For class 0, no further CU partitioning is applied since the CU is homogeneous, so a large part of the time saving also comes from this step (a minimal sketch of both ideas follows this list).
- To classify the CU, a CNN is used, which is described in the next section.
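To make this concrete, here is a minimal Python sketch of the two ideas above: class-dependent mode pruning and early termination of CU partitioning. The reduced candidate mode sets are illustrative placeholders (the actual sets are specified in Algorithm I (PUMS) of the paper), and `classify` / `rd_cost` are hypothetical stand-ins for the CNN classifier and the encoder’s rate-distortion cost.

```python
# Illustrative sketch of class-dependent intra mode pruning (PUMS-style).
HOMOGENEOUS, BIG_OBJECT, GRANULAR_TEXTURE, SMALL_OBJECT = 0, 1, 2, 3

# HEVC defines 35 intra modes: 0 = Planar, 1 = DC, 2..34 = angular
# (10 = horizontal, 26 = vertical).
ALL_MODES = list(range(35))

# Hypothetical reduced candidate sets per CU class; the exact sets are
# given in Algorithm I (PUMS) of the paper.
CANDIDATE_MODES = {
    HOMOGENEOUS:      [0, 1],                          # smooth: Planar/DC usually win
    BIG_OBJECT:       [0, 1, 10, 26],                  # plus horizontal/vertical angular
    GRANULAR_TEXTURE: [0, 1] + list(range(2, 35, 4)),  # sparse scan of angular modes
    SMALL_OBJECT:     ALL_MODES,                       # fall back to the full search
}

def encode_cu(cu, classify, rd_cost):
    """Pick the best intra mode for a CU, searching only the reduced set for
    the class predicted by the CNN, and skip further partitioning for
    homogeneous CUs (class 0)."""
    cu_class = classify(cu)                  # CNN-based classifier (Section 2)
    best_mode = min(CANDIDATE_MODES[cu_class], key=lambda m: rd_cost(cu, m))
    split_further = cu_class != HOMOGENEOUS  # early CU-partition termination
    return best_mode, split_further
```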
2. Network Architecture
2.1. Object Detection Network
- A modified ZFNet is used for the object detection network; it has 5 conv layers and 3 fully connected layers (a sketch follows this list).
- The layers learn a series of features, from low-level features (edges, blobs) to high-level features (object structures).
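As a reference point, a ZFNet-style backbone with 5 conv layers and 3 FC layers could look like the PyTorch sketch below. The filter counts follow the original ZFNet; the paper’s exact modifications are not detailed in this story, and the 1024-d FC-6 width is borrowed from the dimension quoted in Section 2.3.

```python
import torch
import torch.nn as nn

class ZFNetBackbone(nn.Module):
    """ZFNet-style backbone: 5 conv layers + 3 FC layers. A baseline sketch,
    not the paper's exact modified network."""
    def __init__(self, num_classes=4):  # 4 CU classes (Section 1)
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 96, kernel_size=7, stride=2, padding=1), nn.ReLU(),   # Conv1
            nn.MaxPool2d(3, stride=2),
            nn.Conv2d(96, 256, kernel_size=5, stride=2), nn.ReLU(),            # Conv2
            nn.MaxPool2d(3, stride=2),
            nn.Conv2d(256, 384, kernel_size=3, padding=1), nn.ReLU(),          # Conv3 (FV branch taps here, Section 2.3)
            nn.Conv2d(384, 384, kernel_size=3, padding=1), nn.ReLU(),          # Conv4
            nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(),          # Conv5 (ConvNet5, used by the RPN)
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.LazyLinear(1024), nn.ReLU(),    # FC-6 (1024-d, per Section 2.3)
            nn.Linear(1024, 1024), nn.ReLU(),  # FC-7
            nn.Linear(1024, num_classes),      # FC-8
        )

    def forward(self, x):
        return self.classifier(self.features(x))
```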
2.2. Region Proposal Network
- A two-stage object detection approach, just like that of Faster R-CNN, is used here.
- Anchor box scales of 16×16, 32×32, 64×64, and 128×128 are used, as shown above.
- Each anchor box scale is paired with four aspect ratios: {(0.5, 0.5), (1, 2), (2, 1), (1, 1)}.
- A 2×2 sliding window is used over all positions of the ConvNet5 feature maps in the object detection network.
- Each window is then mapped to a 128-dimensional vector, which comprises 64 positive anchors and 64 negative anchors.
- The above 128-d vectors are then connected to FC-6 and FC-7 layers.
- The last FC-7 layer is fed into (1) a box classification layer with two probability scores, and (2) a bounding box regression layer with four coordinates.
- The above process is repeated 16 times (4 box scales × 4 aspect ratios) at a single position, and continued for all 20×30 ConvNet5 feature locations.
- Overall, the prediction process generates 20×30×16 possible anchor boxes. The total number of anchor boxes is reduced to 1000 by using the non-maximum suppression (NMS) technique and by ignoring boxes that cross the image boundary, as sketched below.
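The anchor enumeration above is easy to reproduce in numpy. The 16-pixel stride that maps ConvNet5 cells back to the image is an assumption (it is not stated here); the scale/ratio grid, the 20×30×16 count, and the boundary filter follow the description above.

```python
import numpy as np

def generate_anchors(feat_h=20, feat_w=30, stride=16,
                     scales=(16, 32, 64, 128),
                     ratios=((0.5, 0.5), (1, 2), (2, 1), (1, 1))):
    """Enumerate 16 anchors (4 scales x 4 aspect ratios) at every ConvNet5
    feature location. The (w, h) multiplier pairs are taken literally from
    the aspect-ratio list above; the stride is an assumed value."""
    anchors = []
    for y in range(feat_h):
        for x in range(feat_w):
            cx, cy = (x + 0.5) * stride, (y + 0.5) * stride  # cell center in pixels
            for s in scales:
                for rw, rh in ratios:
                    w, h = s * rw, s * rh
                    anchors.append([cx - w / 2, cy - h / 2,
                                    cx + w / 2, cy + h / 2])
    return np.array(anchors)

anchors = generate_anchors()
print(anchors.shape)  # (9600, 4) == 20 x 30 x 16 possible anchor boxes

# Drop anchors that cross the (assumed) image boundary before scoring.
img_w, img_h = 30 * 16, 20 * 16
inside = ((anchors[:, 0] >= 0) & (anchors[:, 1] >= 0) &
          (anchors[:, 2] <= img_w) & (anchors[:, 3] <= img_h))
anchors = anchors[inside]
# The survivors would then be ranked by objectness score and pruned to
# ~1000 boxes with non-maximum suppression (NMS).
```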
2.3. Texture Feature Calculation
- Also, as seen in the Object Detection Network figure, there is an additional branch.
- Textures play an important role in characterizing many natural objects, particularly those that are best described by a pattern, such as granular, static, or dynamic textures.
- The Fisher Vector (FV) pools local features densely within the described regions, and is therefore well suited to describing textures.
- (FV has been widely used to aggregate the local descriptors of an image into a global representation in large-scale image retrieval.)
- So, on top of the CNN Conv3 layer, representations built with Fisher vector (FV) pooling are used, as is commonly done in the bag-of-words approach (where the FV densely pools local SIFT features in the same way), with a single scale of 26×26.
- 256-dimensional local features are generated for the Fisher vector computation and pooled into a representation with 32 Gaussian components.
- Finally, this results in a 16K-dimensional descriptor (2 × 32 × 256 = 16,384), which is far larger than the FC-6 layer (1024-d). So the FV is projected down to 1024-d using PCA and merged at the FC-6 layer, as sketched below.
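For reference, below is a generic numpy sketch of the improved Fisher vector encoding (gradients with respect to the GMM means and variances). With K = 32 Gaussians over D = 256-dimensional local features, this yields 2 × K × D = 16,384 dimensions, matching the 16K figure above. This is the standard FV formulation, not necessarily the paper’s exact pipeline.

```python
import numpy as np

def fisher_vector(X, means, covs, priors):
    """Improved Fisher vector with gradients w.r.t. GMM means and variances.
    X: (N, D) local features; means/covs: (K, D) diagonal GMM; priors: (K,)."""
    N = X.shape[0]
    # Posterior responsibilities gamma(n, k) under the diagonal GMM.
    log_prob = -0.5 * (((X[:, None, :] - means[None]) ** 2) / covs[None]
                       + np.log(2 * np.pi * covs[None])).sum(axis=2)
    log_w = np.log(priors)[None] + log_prob
    log_w -= log_w.max(axis=1, keepdims=True)
    gamma = np.exp(log_w)
    gamma /= gamma.sum(axis=1, keepdims=True)
    # Gradient statistics w.r.t. means (g_mu) and variances (g_sig).
    diff = (X[:, None, :] - means[None]) / np.sqrt(covs[None])   # (N, K, D)
    g_mu = (gamma[:, :, None] * diff).sum(0) / (N * np.sqrt(priors)[:, None])
    g_sig = ((gamma[:, :, None] * (diff ** 2 - 1)).sum(0)
             / (N * np.sqrt(2 * priors)[:, None]))
    fv = np.concatenate([g_mu.ravel(), g_sig.ravel()])           # 2*K*D dims
    # Power and L2 normalization (the "improved" FV).
    fv = np.sign(fv) * np.sqrt(np.abs(fv))
    return fv / (np.linalg.norm(fv) + 1e-12)

# Demo with random data standing in for dense 256-d local features:
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 256))
means = rng.normal(size=(32, 256)); covs = np.ones((32, 256))
priors = np.full(32, 1 / 32)
print(fisher_vector(X, means, covs, priors).shape)  # (16384,) -> PCA to 1024-d
```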
2.4. Loss Function
- The objective loss function includes both a feature classification term and a bounding box regression term, similar to other object detection networks, together with the FV loss.
- (There are more details about the loss function and the training procedure; please read the paper if interested. Here, I just want to focus on how the authors make use of an object detection network to speed up the video coding process. A hedged sketch of how the terms could combine is given below.)
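Since the exact formulation is deferred to the paper, the sketch below simply combines the three named terms in PyTorch under assumed weights. The form of the FV loss (taken here as a cross-entropy term) and the `lam_*` weights are assumptions, not the paper’s values.

```python
import torch
import torch.nn.functional as F

def detection_loss(cls_logits, cls_labels,
                   box_preds, box_targets, pos_mask,
                   fv_logits, fv_labels,
                   lam_box=1.0, lam_fv=1.0):
    """Hedged sketch of a Faster R-CNN-style multi-task objective with an
    extra FV term; lam_box/lam_fv and the FV term's form are assumptions."""
    # (1) Box/feature classification (e.g., object vs. background).
    loss_cls = F.cross_entropy(cls_logits, cls_labels)
    # (2) Bounding-box regression, applied to positive anchors only.
    loss_box = F.smooth_l1_loss(box_preds[pos_mask], box_targets[pos_mask])
    # (3) FV-branch loss on the texture representation (assumed classification).
    loss_fv = F.cross_entropy(fv_logits, fv_labels)
    return loss_cls + lam_box * loss_box + lam_fv * loss_fv
```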
3. Experimental Results
- Using the proposed approach, a 66.89% encoding time reduction is achieved with a BD-rate loss of only 1.31%, which outperforms Li ICME’17 [20].
- For Class F screen content sequences, the proposed approach achieves a 71.72% time reduction with a BD-rate loss of only 1.84%, again outperforming Li ICME’17 [20].
- Compared with conventional HEVC, the proposed approach reduces the number of checked modes by up to 48.325%.
Reference
[2019 JCSSP] [Kuanar JCSSP’19]
Adaptive CU Mode Selection in HEVC Intra Prediction: A Deep Learning Approach
Codec Fast Prediction
H.264 to HEVC [Wei VCIP’17] [H-LSTM]
HEVC [Yu ICIP’15 / Liu ISCAS’16 / Liu TIP’16] [Laude PCS’16] [Li ICME’17] [Katayama ICICT’18] [Chang DCC’18] [ETH-CNN & ETH-LSTM] [Zhang RCAR’19] [Kuanar JCSSP’19]
VVC [Jin VCIP’17] [Jin PCM’17] [Wang ICIP’18] [Pooling-Variable CNN]