Reading: GVTCNN — Group Variational Transformation Convolutional Neural Network (HEVC Inter)
Outperforms CNNIF, 1.9% BD-rate Reduction Obtained Under LDP Configuration
In this story, Group Variational Transformation Convolutional Neural Network (GVTCNN), by Peking University, is briefly presented. I read this because I work on video coding research. In this paper:
- GVTCNN is used to interpolate the sub-pel pixels based on the integer-pel pixels for inter coding.
This is a paper in 2018 DCC. (Sik-Ho Tsang @ Medium)
Outline
- GVTCNN: Network Architecture
- Sample Collection
- Experimental Results
1. GVTCNN: Network Architecture
- The integer-position sample Il is the input of the network.
- All convolutional layers use 3×3 kernels, each followed by a PReLU activation.
- The standard MSE loss function is used for training.
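Just to make the training objective concrete, below is a minimal PyTorch-style sketch of one training step with the MSE loss; the names model, integer_sample and subpel_target are hypothetical placeholders, not from the paper.

```python
import torch.nn as nn

criterion = nn.MSELoss()  # standard MSE loss, as stated above

def training_step(model, optimizer, integer_sample, subpel_target):
    # model: a GVTCNN-like network (hypothetical placeholder)
    # integer_sample: integer-position input patch, shape (N, 1, H, W)
    # subpel_target: ground-truth sub-pixel position sample(s)
    optimizer.zero_grad()
    predicted = model(integer_sample)           # inferred sub-pixel sample(s)
    loss = criterion(predicted, subpel_target)  # MSE between prediction and label
    loss.backward()
    optimizer.step()
    return loss.item()
```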
1.1. Shared Feature Map Extraction Component
- A 48-channel feature map is first generated from the integer-position sample, followed by 8 lightweight convolutional layers with 10 channels each, which keeps the number of learned parameters small.
- The 10th layer then derives a 48-channel shared feature map.
- After 9 convolutional layers with 3×3 kernel size, the receptive field of each point in the shared feature map is 19×19.
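As a rough sketch of this shared component (my own PyTorch approximation based on the layer counts above, assuming a single-channel luma input), it could look like this:

```python
import torch.nn as nn

class SharedFeatureExtraction(nn.Module):
    """Sketch of the shared feature map extraction component:
    1 conv (48 ch) -> 8 lightweight convs (10 ch) -> 1 conv (48 ch),
    all with 3x3 kernels and PReLU activations."""
    def __init__(self):
        super().__init__()
        layers = [nn.Conv2d(1, 48, 3, padding=1), nn.PReLU()]  # initial 48-channel feature map
        in_ch = 48
        for _ in range(8):  # 8 lightweight 10-channel layers
            layers += [nn.Conv2d(in_ch, 10, 3, padding=1), nn.PReLU()]
            in_ch = 10
        layers += [nn.Conv2d(10, 48, 3, padding=1), nn.PReLU()]  # 10th layer: back to 48 channels
        self.body = nn.Sequential(*layers)

    def forward(self, integer_sample):
        # After 9 of these 3x3 layers the receptive field is 19x19, as noted above;
        # the output is the 48-channel shared feature map.
        return self.body(integer_sample)
```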
1.2. Group Variational Transformation
- The group variational transformation is then performed on the shared feature map, using a dedicated convolutional layer for each sub-pixel position sample.
- Different residual maps are thus generated, and the final inferred sub-pixel position samples are obtained by adding these residual maps to the integer-position sample.
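Here is a minimal sketch of how I read the GVT component (hypothetical PyTorch code, one dedicated 3×3 convolution per sub-pixel position producing a residual map):

```python
import torch.nn as nn

class GroupVariationalTransformation(nn.Module):
    """Sketch of the GVT component: a dedicated conv layer per sub-pixel
    position maps the shared feature map to a residual map, which is
    added to the integer-position sample."""
    def __init__(self, num_positions, feat_ch=48):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv2d(feat_ch, 1, 3, padding=1) for _ in range(num_positions)]
        )

    def forward(self, shared_feat, integer_sample):
        # one residual map per sub-pixel position, added to the integer sample
        return [integer_sample + branch(shared_feat) for branch in self.branches]

# Usage sketch: one branch per sub-pixel position to be inferred, e.g.
# gvt = GroupVariationalTransformation(num_positions=3)  # 3 half-pel positions
```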
2. Sample Collection
- The training data is first blurred with a Gaussian filter to simulate the correlations between the integer-position sample and sub-pixel position samples.
- Sub-pixel position samples are later sampled from the blurred image.
- For the input integer-position sample, an intermediate integer sample is first down-sampled from the raw image. This intermediate down-sampled version is then coded by HEVC.
- Two networks, GVTCNN-H and GVTCNN-Q, are trained separately for the 3 half-pixel position samples and the 12 quarter-pixel position samples, respectively, to better generate samples at the different sub-pixel levels.
- For GVTCNN-H, the 200 training images and 200 testing images of BSDS500 are used as training data. 3×3 Gaussian kernels with random standard deviations in the range [0.5, 0.6] are used for blurring.
- For GVTCNN-Q, 10 YUV sequences at resolutions of 1024×768 and 1920×1080 are used to extract 89 high-resolution frames to synthesize the training data. The standard deviations of the Gaussian kernels range over [0.7, 0.8].
- A model is trained for each QP.
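To illustrate my reading of this sample-collection procedure, below is a hypothetical NumPy/SciPy sketch for the half-pel case; the 2:1 phase-shifted sub-sampling for the half-pel labels is my assumption, and the HEVC coding step of the integer input is omitted.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def synthesize_half_pel_pair(raw_frame, sigma=0.55, phase=(0, 1)):
    """Sketch of half-pel training-sample synthesis: blur the raw frame with a
    Gaussian kernel, then take a phase-shifted 2:1 sub-sampling grid of the
    blurred frame as the sub-pixel label.
    phase = (dy, dx) in {(0, 1), (1, 0), (1, 1)} for the 3 half-pel positions;
    sigma is drawn from [0.5, 0.6] in the paper (fixed here for simplicity)."""
    blurred = gaussian_filter(raw_frame.astype(np.float32), sigma=sigma)
    dy, dx = phase
    subpel_label = blurred[dy::2, dx::2]    # sub-pixel position sample
    integer_input = raw_frame[0::2, 0::2]   # intermediate down-sampled sample;
    # in the paper this intermediate sample is additionally coded by HEVC
    return integer_input, subpel_label
```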
3. Experimental Results
- HM-16.15 is used under the LDP configuration.
- A 1.9% BD-rate reduction is obtained.
- Since CNNIF only interpolates half-pel pixels, only GVTCNN-H is used for a fair comparison.
- GVTCNN-H outperforms CNNIF with a 2.4% BD-rate reduction.
- When a model is trained separately for each sub-pel position, the BD-rate reduction is still similar to that of the GVT-based model. (The network architecture of this variant is not entirely clear to me.)
This is the 26th story this month!
Reference
[2018 DCC] [GVTCNN]
A Group Variational Transformation Neural Network for Fractional Interpolation of Video Coding
Codec Inter Prediction
H.264 [DRNFRUC & DRNWCMC]
HEVC [CNNIF] [Zhang VCIP’17] [NNIP] [GVTCNN] [Ibrahim ISM’18] [VC-LAPGAN] [VI-CNN] [FRUC+DVRF] [FRUC+DVRF+VECNN] [RSR] [Zhao ISCAS’18 & TCSVT’19] [Ma ISCAS’19] [Zhang ICIP’19] [ES] [FRCNN] [CNN-SR & CNN-UniSR & CNN-BiSR] [DeepFrame] [U+DVPN] [Multi-Scale CNN]
VVC [FRUC+DVRF+VECNN]