Reading: Multi-Scale CNN — Deep Network-Based Frame Extrapolation (HEVC Inter Prediction)
5.3% and 2.8% BD-Rate Reductions Under LDP and LDB Configurations, Respectively
In this story, “Deep Network-Based Frame Extrapolation With Reference Frame Alignment” (Multi-Scale CNN), by University of Science and Technology of China and Peking University, is presented. (I just call it “Multi-Scale CNN” since they mentioned “our multi-scale CNN …” in the passage, … lol.) In this paper:
- Reference frames are aligned.
- The aligned frames are then fed into a trained deep network to extrapolate the current frame. With alignment, training difficulty is reduced and accuracy is improved.
This is a paper in 2020 TCSVT, which has a high impact factor of 4.046. (Sik-Ho Tsang @ Medium)
Outline
- Multi-Scale CNN: Network Architecture
- HEVC Implementation
- Experimental Results
1. Multi-Scale CNN: Network Architecture
1.1. Motion Estimation
- Suppose the current frame is It, and there are 4 previous frames from It-1 to It-4.
- For scheme 1, motion estimation (ME) is performed to find the matching block (red dots) in It-1 for the current block (red dots) in It.
- For scheme 2, the matching block (red dots) in It-1 is simply the colocated block of the current block (red dots) in It.
- The matching blocks (red dots) in It-2, It-3, and It-4 are found in a similar way as in scheme 1. The difference is that these motion vectors (MVs) do not need to be encoded, since It-1 to It-4 are available at both the encoder and decoder sides.
- In total, 4 frames are considered, and integer ME is performed using TZSearch. (A toy block-matching sketch is given after this list.)
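To make the alignment step concrete, here is a toy integer block-matching sketch in Python/NumPy. It is only an illustration: the function name, block size, and search range are made up, and a brute-force search stands in for HM's TZSearch. Scheme 2 (the colocated case) corresponds to skipping the search entirely, i.e. taking the block at (dy, dx) = (0, 0).

```python
import numpy as np

def align_block(cur, ref, y, x, bs=64, sr=8):
    # Find the best integer-pel match in `ref` for the bs x bs block of
    # `cur` at (y, x), searching +/- sr pixels with the SAD criterion.
    # Brute-force search here purely for illustration; HM uses TZSearch.
    target = cur[y:y+bs, x:x+bs].astype(np.float64)
    best_mv, best_sad = (0, 0), np.inf
    for dy in range(-sr, sr + 1):
        for dx in range(-sr, sr + 1):
            yy, xx = y + dy, x + dx
            if yy < 0 or xx < 0 or yy + bs > ref.shape[0] or xx + bs > ref.shape[1]:
                continue
            sad = np.abs(ref[yy:yy+bs, xx:xx+bs].astype(np.float64) - target).sum()
            if sad < best_sad:
                best_sad, best_mv = sad, (dy, dx)
    dy, dx = best_mv
    return ref[y+dy:y+dy+bs, x+dx:x+dx+bs], best_mv  # aligned block and its MV
```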
1.2. Network Architecture
- After alignment, the matching blocks with padding (red dots + blue dots) are input into the network, i.e. Xt-1 to Xt-4.
- This network is a fully convolutional network (FCN) with a multi-scale structure, residual learning, and deconvolution.
- Each downscaling operation halves the resolution.
- Multi-scale structure: each level captures the motion present at a particular scale, which helps generate frames in a coarse-to-fine fashion.
- Up-scaling module: used to bring adjacent scales to the same resolution. A trainable deconvolution layer is adopted, initialized by bilinear interpolation for easier training.
- Residual learning: the network learns the difference between the target frame and the past frames for extrapolation.
- ReLU is used everywhere except the last layer of each scale, which uses tanh to ensure the residual values are within (-1, +1). (A sketch of the overall structure follows this list.)
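Below is a minimal PyTorch sketch of such a structure, assuming 3 scales, tiny convolution stacks, and the 4 aligned past blocks stacked on the channel axis; the actual layer counts, channel widths, and connectivity in the paper differ. The bilinear initialization of the deconvolution layer follows the standard recipe.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def bilinear_kernel(ch, k=4):
    # (ch, ch, k, k) weights that make a stride-2 ConvTranspose2d start out
    # as exact bilinear upsampling (it remains trainable afterwards).
    factor = (k + 1) // 2
    center = factor - 1 if k % 2 == 1 else factor - 0.5
    og = torch.arange(k, dtype=torch.float32)
    filt = 1 - torch.abs(og - center) / factor
    w = torch.zeros(ch, ch, k, k)
    for c in range(ch):
        w[c, c] = filt[:, None] * filt[None, :]
    return w

class ScaleBranch(nn.Module):
    # One pyramid level: predicts a residual for its scale, tanh-bounded.
    def __init__(self, in_ch, feat=32):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, feat, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(feat, feat, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(feat, 1, 3, padding=1), nn.Tanh())  # residual in (-1, +1)

    def forward(self, x):
        return self.body(x)

class MultiScaleFCN(nn.Module):
    def __init__(self, n_ref=4):
        super().__init__()
        self.branches = nn.ModuleList([ScaleBranch(n_ref + 1) for _ in range(3)])
        self.up = nn.ConvTranspose2d(1, 1, 4, stride=2, padding=1, bias=False)
        self.up.weight.data.copy_(bilinear_kernel(1))  # bilinear init

    def forward(self, x):  # x: (B, 4, H, W) aligned past blocks in [0, 1]
        pyr = [F.avg_pool2d(x, 4), F.avg_pool2d(x, 2), x]  # coarse -> fine
        pred = pyr[0][:, :1]   # assume channel 0 is the aligned Xt-1 block
        outs = []
        for i, (lvl, branch) in enumerate(zip(pyr, self.branches)):
            res = branch(torch.cat([lvl, pred], dim=1))  # residual learning
            pred = (pred + res).clamp(0, 1)
            outs.append(pred)
            if i < len(pyr) - 1:
                pred = self.up(pred)  # reach the next scale's resolution
        return outs  # per-scale predictions; outs[-1] is full resolution
```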
1.3. Loss Function
- A multi-scale L1 loss (the L1 loss summed over all scales) is used with λ = 1, as sketched below.
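A minimal sketch of such a loss, matching the network sketch above. The per-scale ground truth is obtained here by average pooling the full-resolution target, which is an illustrative choice, not necessarily the paper's.

```python
import torch.nn.functional as F

def multi_scale_l1(preds, target, lam=1.0):
    # Sum of per-scale L1 losses; `preds` are the coarse-to-fine outputs of
    # the network, `target` the full-resolution ground truth block.
    # With lam = 1, every scale contributes equally.
    loss = 0.0
    for p in preds:
        gt = F.adaptive_avg_pool2d(target, p.shape[-2:])
        loss = loss + lam * F.l1_loss(p, gt)
    return loss
```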
2. HEVC Implementation
- There are 2 schemes for using the extrapolation in motion compensation.
- MCP: Replace the traditional MC with extrapolation-based prediction, i.e. take the extrapolation result as the motion-compensated prediction signal.
- REF: Perform ME and MC on the extrapolated frame, i.e. use the extrapolation result as a new reference frame.
- There are 2 schemes for alignment between the current frame It and the previous frame It-1.
- MEA: Use ME to align, at the cost of transmitting MVs.
- ColMEA: Use the colocated block only, which avoids the MV overhead. (A conceptual sketch of these scheme combinations is given after this list.)
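As a conceptual summary, the scheme combinations can be put into one pseudocode-level sketch. All names here (extrapolate_and_predict, stack_blocks, normal_me_mc, and the reuse of align_block from the earlier sketch) are hypothetical, and the alignment step is simplified; the real integration lives inside HM's C++ motion-compensation path.

```python
def extrapolate_and_predict(cur, refs, net, y, x, bs, scheme="REF", align="ColMEA"):
    # 1) Alignment of the past blocks (simplified to one choice per frame).
    if align == "MEA":
        blocks = [align_block(cur, r, y, x, bs)[0] for r in refs]  # MVs must be coded
    else:  # "ColMEA": colocated block, no MV overhead
        blocks = [r[y:y+bs, x:x+bs] for r in refs]
    # 2) Extrapolation by the trained network.
    extrapolated = net(stack_blocks(blocks))
    # 3) Two ways to use the result:
    if scheme == "MCP":
        return extrapolated                  # directly the MC prediction signal
    return normal_me_mc(cur, extra_ref=extrapolated)  # "REF": one more reference
```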
3. Experimental Results
- HM-12.0 is used.
- A specific model is trained for each QP.
- Sequences with complex motion, rich texture, and slight noise, including Marathon, Runners, and RushHour, are used for generating training data, under the LDP configuration with QPs of 22, 27, 32, and 37.
- 500,000 training samples are collected.
- During testing, the VVC reference software VTM-5.0 is also tried.
3.1. BD-Rate
- It is found that REF+ColMEA obtains the best performance, a 5.3% BD-rate reduction under the LDP configuration. (A sketch of how BD-rate is computed is given after this list.)
- Similarly for LDB, a 2.8% BD-rate reduction is achieved.
- The BD-rate reduction is smaller here because frame extrapolation is only used for uni-directional prediction; also, bi-directional prediction already reduces the potential benefit of the proposed approach.
- The BD-rate reduction is even lower for RA, because the frame differences are so much larger that the extrapolated frames are of lower quality.
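For reference, BD-rate compares two rate-distortion curves (here, four QP points each) by fitting a polynomial of log-rate versus PSNR and integrating over the overlapping PSNR range. A standard self-contained sketch:

```python
import numpy as np

def bd_rate(rate_anchor, psnr_anchor, rate_test, psnr_test):
    # Bjontegaard delta-rate: a negative result means the test codec saves
    # bitrate over the anchor at equal PSNR.
    la, lt = np.log10(rate_anchor), np.log10(rate_test)
    pa = np.polyfit(psnr_anchor, la, 3)  # cubic fit: PSNR -> log-rate
    pt = np.polyfit(psnr_test, lt, 3)
    lo = max(min(psnr_anchor), min(psnr_test))  # overlapping PSNR range
    hi = min(max(psnr_anchor), max(psnr_test))
    ia = np.polyval(np.polyint(pa), hi) - np.polyval(np.polyint(pa), lo)
    it = np.polyval(np.polyint(pt), hi) - np.polyval(np.polyint(pt), lo)
    avg_diff = (it - ia) / (hi - lo)
    return (10 ** avg_diff - 1) * 100  # BD-rate in percent
```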
3.2. RD Curves
- The gain is larger under high-bitrate conditions, because in areas of complex motion, large blocks are partitioned into smaller blocks to make the inter prediction more accurate, especially at higher bitrates.
- Since Multi-Scale CNN characterizes complex motion more accurately and provides better prediction, less block partitioning is needed.
3.3. New Mode Usages
- The extrapolated frame is selected for regions with rich motion and object edges.
3.4. VVC
- Only a 1.07% BD-rate reduction is obtained, since VVC already has many new tools, such as advanced block partitioning and affine motion estimation, that improve its coding efficiency.
3.5. Computational Time
- The encoding time ranges from 155% to 313% of the anchor.
- The decoding time ranges from 936% to 20344% of the anchor.
- Such high complexity is common when applying CNNs in video coding to improve coding efficiency.
There are still many results not yet presented. If interested, please read the paper. :)
This is the 6th story this month!
Reference
[2020 TCSVT] [Multi-Scale CNN]
Deep Network-Based Frame Extrapolation With Reference Frame Alignment
Codec Inter Prediction
H.264 [DRNFRUC & DRNWCMC]
HEVC [CNNIF] [Zhang VCIP’17] [NNIP] [Ibrahim ISM’18] [VI-CNN] [FRUC+DVRF] [FRUC+DVRF+VECNN] [RSR] [Zhao ISCAS’18 & TCSVT’19] [Ma ISCAS’19] [ES] [CNN-SR & CNN-UniSR & CNN-BiSR] [DeepFrame] [U+DVPN] [Multi-Scale CNN]
VVC [FRUC+DVRF+VECNN]