Reading: GVCNN — One-For-All Group Variation Convolutional Neural Network (HEVC Inter)
Outperforms CNNIF & Zhang VCIP’17. 2.2% Average BD-Rate Reduction Under LDP.
In this story, “One-for-All: Grouped Variation Network-Based Fractional Interpolation in Video Coding” (One-For-All GVCNN) is briefly presented, since most of its content has already been covered in the conference version, GVTCNN. I only found this transaction paper recently, and I read it because I work on video coding research. In this paper:
- GVCNN is used to interpolate the sub-pel pixels based on the integer-pel pixels for inter coding.
- One-For-All means that a single CNN interpolates the pixels of all 15 sub-pel positions from the pixels of the integer-pel positions.
This is a paper in 2019 TIP where TIP has a high impact factor of 6.79. (Sik-Ho Tsang @ Medium)
Outline
- Fractional Interpolation in HEVC
- GVCNN: Network Architecture
- Training Data Generation
- Experimental Results
1. Fractional Interpolation in HEVC
- $A_{i,j}$ represents the integer samples.
- $h^k_{i,j}$ ($k \in \{1, 2, 3\}$) and $q^k_{i,j}$ ($k \in \{1, 2, \dots, 12\}$) denote the half-pixel positions and quarter-pixel positions, respectively.
- Given a reference block $I_A$, whose pixels are regarded as integer samples ($A_{i,j}$), the half-pixel blocks $I_{h^k}$ and quarter-pixel blocks $I_{q^k}$ are interpolated from $I_A$.
- For a reference block $I_A$, the variations between its pixels and the half-pixel samples are given by:

$$\Delta I_{h^k} = I_{h^k} - I_A, \quad k \in \{1, 2, 3\}$$

- where $\Delta I_{h^k}$ denotes the variations of the half-pixels.
- Similarly, the variations of the quarter-pixels are constructed as:

$$\Delta I_{q^k} = I_{q^k} - I_A, \quad k \in \{1, 2, \dots, 12\}$$
- Thus, the mapping function to be learned targets the variations:

$$\{\Delta I_{h^k}\}_{k=1}^{3} = f_h(I_A), \qquad \{\Delta I_{q^k}\}_{k=1}^{12} = f_q(I_A)$$

$f_h(\cdot)$ and $f_q(\cdot)$ represent the learned mappings between integer pixels and the grouped variations of the half- and quarter-pixel positions, respectively.
And GVCNN attempts to estimate these mapping functions.
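To make the grouped-variation formulation concrete, here is a minimal NumPy sketch; the block size and the stand-in predictor are illustrative assumptions, not from the paper:

```python
import numpy as np

# Minimal sketch: recover sub-pixel blocks from learned grouped variations.
I_A = np.random.rand(8, 8).astype(np.float32)   # integer-position reference block

def f_h(block):
    """Stand-in for the learned mapping f_h(.): it should predict the 3
    grouped half-pixel variations from the integer block in one pass."""
    return np.zeros((3,) + block.shape, dtype=block.dtype)

delta_I_h = f_h(I_A)                  # Delta I_{h^k} = I_{h^k} - I_A, k = 1..3
I_h = I_A[None, :, :] + delta_I_h     # recover the half-pixel blocks I_{h^k}
print(I_h.shape)                      # (3, 8, 8)
```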
2. GVCNN: Network Architecture
- A 48-channel feature map is first generated from the integer-position sample, followed by 8 convolutional layers with only 10 channels each, which are lightweight and reduce the number of parameters to be stored.
- The 10th layer then derives a 48-channel shared feature map.
- After 9 convolutional layers with 3×3 kernel size, the receptive field of each point in the shared feature map is 19×19.
- The group variational transformation is further performed over the shared feature maps with a specific convolutional layer for each sub-pixel sample.
- Different residual maps are then generated, and the final inferred sub-pixel position samples are obtained by adding the residual maps to the integer-position sample (see the sketch after this list).
- The same network architecture is used in GVTCNN.
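Below is a minimal PyTorch sketch of the topology described above. ReLU activations and size-preserving zero padding are assumptions; the paper's exact hyperparameters may differ. GVCNN-H would use n_pos=3 heads and GVCNN-Q would use n_pos=12:

```python
import torch
import torch.nn as nn

class GVCNN(nn.Module):
    """Sketch of the one-for-all topology: a shared feature extractor plus
    one position-specific 3x3 conv head per sub-pixel sample (n_pos = 3 for
    GVCNN-H, 12 for GVCNN-Q). ReLU and zero padding are assumptions."""
    def __init__(self, n_pos=3):
        super().__init__()
        layers = [nn.Conv2d(1, 48, 3, padding=1), nn.ReLU(True)]  # layer 1: 48 ch
        in_ch = 48
        for _ in range(8):                                        # layers 2-9: 10 ch
            layers += [nn.Conv2d(in_ch, 10, 3, padding=1), nn.ReLU(True)]
            in_ch = 10
        layers += [nn.Conv2d(10, 48, 3, padding=1), nn.ReLU(True)]  # layer 10: 48 ch
        self.shared = nn.Sequential(*layers)
        # Group variational transformation: one conv per sub-pixel position,
        # each emitting a single-channel residual (variation) map.
        self.heads = nn.ModuleList(
            [nn.Conv2d(48, 1, 3, padding=1) for _ in range(n_pos)])

    def forward(self, x):  # x: integer-position sample, shape (N, 1, H, W)
        feat = self.shared(x)
        # Add each residual map back to the integer sample to obtain the
        # inferred sub-pixel position samples.
        return [x + head(feat) for head in self.heads]

# Usage: the half-pel model infers all 3 half-pixel blocks in a single pass.
outs = GVCNN(n_pos=3)(torch.randn(1, 1, 64, 64))
print(len(outs), outs[0].shape)  # 3 torch.Size([1, 1, 64, 64])
```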
3. Training Data Generation
- The training data is first blurred with a Gaussian filter to simulate the correlations between the integer-position sample and sub-pixel position samples.
- Sub-pixel position samples are later sampled from the blurred image.
- As for the input integer-position sample, an intermediate integer sample is first down-sampled from the raw image. This intermediate down-sampled version is then coded by HEVC, so that the input matches the reconstructed (coded) reference frames seen at inference time.
- The blurring operation reduces the aliasing effects caused by the down-sampling.
- Two networks, called GVCNN-H and GVCNN-Q, are trained separately for the 3 half-pixel position samples and the 12 quarter-pixel position samples, to better generate samples at the different sub-pixel levels.
- This part is the same as in GVTCNN (a sketch of the sampling pipeline follows below).
- (There is also a large portion of mathematical analysis about the performance bound for motion estimation. If interested, please read the paper.)
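A minimal sketch of this sampling pipeline for the half-pixel case (factor-2 grid). The Gaussian sigma and the phase-to-position layout are assumptions for illustration, and the HEVC coding step is only a placeholder here:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def hevc_encode_decode(img):
    # Placeholder: the paper codes the down-sampled sample with HEVC so the
    # network input matches reconstructed reference frames. A real pipeline
    # would round-trip img through an HEVC codec (e.g. the HM encoder).
    return img

def make_half_pel_pairs(raw, sigma=0.6):
    """Sketch of training-pair generation for the half-pixel case; sigma
    and the phase layout are illustrative assumptions."""
    blurred = gaussian_filter(raw.astype(np.float32), sigma)  # anti-aliasing blur
    integer = blurred[0::2, 0::2]        # integer-position sample, phase (0, 0)
    halves = [blurred[0::2, 1::2],       # h1: horizontal half-pel phase
              blurred[1::2, 0::2],       # h2: vertical half-pel phase
              blurred[1::2, 1::2]]       # h3: diagonal half-pel phase
    return hevc_encode_decode(integer), halves  # network input, 3 targets

# GVCNN-Q would use a factor-4 grid instead, keeping the 12 quarter-pel phases.
inp, targets = make_half_pel_pairs(np.random.rand(64, 64))
```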
4. Experimental Results
4.1. BD-Rate
- HM-16.4 is used.
- 2.2%, 1.2% and 0.9% average BD-rate savings are achieved under the LDP, LDB and RA conditions, respectively (a sketch of the BD-rate computation follows below).
- For UHD sequences, the gain is much smaller.
- The authors' explanation is that the sampling precision of the high-resolution test sequences is already high enough, so the signals of adjacent pixels are more continuous.
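For reference, the numbers above follow the standard Bjøntegaard delta-rate metric; a minimal sketch of its usual computation (cubic fit of log-rate versus PSNR, integrated over the overlapping quality range), with made-up example numbers:

```python
import numpy as np

def bd_rate(rate_anchor, psnr_anchor, rate_test, psnr_test):
    """Bjontegaard delta-rate: average bitrate difference (%) between two
    RD curves at equal PSNR; negative values mean bitrate savings."""
    # Fit log10(rate) as a cubic polynomial of PSNR for each curve.
    p_a = np.polyfit(psnr_anchor, np.log10(rate_anchor), 3)
    p_t = np.polyfit(psnr_test, np.log10(rate_test), 3)
    # Integrate both fits over the overlapping PSNR interval.
    lo = max(min(psnr_anchor), min(psnr_test))
    hi = min(max(psnr_anchor), max(psnr_test))
    int_a = np.polyval(np.polyint(p_a), hi) - np.polyval(np.polyint(p_a), lo)
    int_t = np.polyval(np.polyint(p_t), hi) - np.polyval(np.polyint(p_t), lo)
    avg_log_diff = (int_t - int_a) / (hi - lo)
    return (10 ** avg_log_diff - 1) * 100

# Example with made-up numbers (four rate/PSNR points per curve):
print(bd_rate([1000, 1800, 3200, 6000], [32.0, 34.1, 36.2, 38.3],
              [ 950, 1700, 3050, 5700], [32.1, 34.2, 36.3, 38.4]))
```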
4.2. RD Curves
- The proposed approach is more efficient at high bitrates than at low bitrates.
4.3. SOTA Comparison
- GVCNN outperforms CNNIF & Zhang VCIP’17.
4.4. GVCNN Separate Models
- GVCNN-Separate: separate models are trained for each sub-pel position, but only a small additional gain of 0.1% is obtained.
4.5. Blurring
- Without blurring during training sample collection, a 0.2% BD-rate loss is observed.
4.6. Hitting Ratios
- There are 3% to 25% of CUs choosing GVCNN.
- A visualization of the CUs choosing GVCNN is also shown in the paper.
This is the 30th story in this month!
Reference
[2019 TIP] [GVCNN]
One-for-All: Grouped Variation Network-Based Fractional Interpolation in Video Coding
Codec Inter Prediction
H.264 [DRNFRUC & DRNWCMC]
HEVC [CNNIF] [Zhang VCIP’17] [NNIP] [GVTCNN] [Ibrahim ISM’18] [VC-LAPGAN] [VI-CNN] [FRUC+DVRF] [FRUC+DVRF+VECNN] [RSR] [Zhao ISCAS’18 & TCSVT’19] [Ma ISCAS’19] [Zhang ICIP’19] [ES] [GVCNN] [FRCNN] [CNN-SR & CNN-UniSR & CNN-BiSR] [DeepFrame] [U+DVPN] [Multi-Scale CNN]
VVC [FRUC+DVRF+VECNN]