Review — DSepConv: Video Frame Interpolation via Deformable Separable Convolution (Video Frame Interpolation)

Using Deformable Convolution from DCNv1 and DCNv2, Outperforms AdaConv & SepConv

Left: SepConv produces a poor interpolation of the ball. Right: the proposed DSepConv considers more relevant pixels far away from the local grid (black rectangle) with a much smaller kernel size and performs better.

Outline

  1. Adaptive Deformable Separable Convolution
  2. DSepConv: Network Architecture
  3. Loss Function
  4. Experimental Results

1. Adaptive Deformable Separable Convolution

1.1. Conventional Kernel-based Methods

Baseline kernel-based methods
  • For each pixel Î(x, y) to be synthesized, a pair of convolution kernels, K1 and K2, is estimated to adaptively convolve the local patches P1(x, y) and P2(x, y) centered at (x, y) in I1 and I2, respectively.
  • The interpolation process is:

$\hat{I}(x, y) = K_1(x, y) * P_1(x, y) + K_2(x, y) * P_2(x, y)$

  • In AdaConv, n = 41, which induces a heavy computational load.
  • In SepConv, each 2D kernel is approximated by two 1D kernels, i.e. the separable convolution (illustrated in the sketch below):

$K_r(x, y) = k_{r,v}(x, y) \, k_{r,h}(x, y)^{\top}, \quad r = 1, 2$
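To make the baseline concrete, here is a minimal sketch (my own toy code, not the authors') of SepConv-style synthesis for a single output pixel; the kernel size and values are placeholders:

```python
import numpy as np

# Toy sketch of SepConv-style synthesis for one output pixel: each 2D kernel
# is the outer product of a vertical and a horizontal 1D kernel, and the
# contributions from both input frames are summed.
n = 5                                    # placeholder kernel size
P1 = np.random.rand(n, n)                # patch from I1 centered at (x, y)
P2 = np.random.rand(n, n)                # patch from I2 centered at (x, y)
k1_v, k1_h = np.random.rand(n), np.random.rand(n)   # 1D kernels estimated for I1
k2_v, k2_h = np.random.rand(n), np.random.rand(n)   # 1D kernels estimated for I2

K1 = np.outer(k1_v, k1_h)                # separable approximation of K1
K2 = np.outer(k2_v, k2_h)                # separable approximation of K2
pixel = (K1 * P1).sum() + (K2 * P2).sum()   # the synthesized pixel Î(x, y)
```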

1.2. Proposed Adaptive Deformable Separable Convolution

Proposed Adaptive Deformable Separable Convolution
  • Much smaller convolution kernels are used, together with additional offsets and masks, which allow the network to focus on fewer but more relevant pixels rather than on all the pixels in a large neighborhood.
  • In Adaptive Deformable Separable Convolution, a learnable offset Δpᵢ,ⱼ and a modulation scalar Δmᵢ,ⱼ are estimated for each sampling position pᵢ,ⱼ in each patch:

$\hat{I}(x, y) = \sum_{r=1}^{2} \sum_{i,j=1}^{n} K_r(i, j) \, \Delta m_{i,j} \, P_r(p_{i,j} + \Delta p_{i,j})$

where the fractional positions pᵢ,ⱼ + Δpᵢ,ⱼ are sampled via bilinear interpolation.
  • Similar to SepConv, 1D separable kernels are used to approximate the 2D kernels Kr, as in the sketch below.
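A minimal sketch of this operation for one pixel and one input frame, under my own assumptions about shapes (the real implementation is batched and vectorized on the GPU):

```python
import numpy as np

# Sketch (my assumptions, not the paper's code): deformable separable
# convolution for a single output pixel and a single input frame. Each of the
# n x n sampling positions carries a learned offset Δp = (dy, dx) and a
# modulation scalar Δm; fractional positions are read by bilinear interpolation.
def bilinear(img, y, x):
    """Bilinearly sample a 2D image at a fractional position (y, x)."""
    y = np.clip(y, 0, img.shape[0] - 1)
    x = np.clip(x, 0, img.shape[1] - 1)
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, img.shape[0] - 1), min(x0 + 1, img.shape[1] - 1)
    wy, wx = y - y0, x - x0
    return ((1 - wy) * (1 - wx) * img[y0, x0] + (1 - wy) * wx * img[y0, x1]
            + wy * (1 - wx) * img[y1, x0] + wy * wx * img[y1, x1])

def dsep_conv_pixel(img, cy, cx, k_v, k_h, offsets, masks):
    """Contribution of one frame to the output pixel at (cy, cx).
    k_v, k_h: 1D kernels of length n; offsets: (n, n, 2); masks: (n, n)."""
    n = len(k_v)
    half, out = n // 2, 0.0
    for i in range(n):
        for j in range(n):
            dy, dx = offsets[i, j]
            y = cy + (i - half) + dy      # deformed sampling position p_ij + Δp_ij
            x = cx + (j - half) + dx
            out += k_v[i] * k_h[j] * masks[i, j] * bilinear(img, y, x)
    return out

# Toy usage; the full method sums this contribution over both input frames.
img = np.random.rand(32, 32)
n = 5
val = dsep_conv_pixel(img, 16.0, 16.0, np.random.rand(n), np.random.rand(n),
                      np.random.rand(n, n, 2) - 0.5, np.random.rand(n, n))
```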

1.3. Comparison with Kernel-Based Methods and Flow-Based Methods

  • When Δp = 0 and Δm = 1, DSepConv degenerates exactly to SepConv.
  • On the other hand, when n = 1, the patches become single pixels sampled via bilinear interpolation. In this case, the interpolation becomes:

$\hat{I}(x, y) = K_1 \, \Delta m_1 \, I_1(p + \Delta p_1) + K_2 \, \Delta m_2 \, I_2(p + \Delta p_2)$

which resembles flow-based methods, with the offsets Δp playing the role of optical flow and the kernel and mask weights acting as blending/occlusion weights.

2. DSepConv: Network Architecture

DSepConv: Network Architecture
  • The whole network can be divided into the following submodules: the encoder-decoder architecture, kernel estimator, offset estimator and mask estimator as illustrated in the above figure.

2.1. Encoder-decoder Architecture

  • Given two input frames, the encoder-decoder architecture aims to extract deep features for estimating kernels, masks and offsets for each output pixel.

2.2. Kernel Estimator

  • The kernel estimator consists of four parallel sub-networks with analogous structure to estimate vertical and horizontal 1D kernels.
  • For each sub-network, three 3×3 convolution layers with ReLU, a bilinear upsampling layer and another 3×3 convolution layer are stacked, yielding a tensor with n channels (see the sketch below).
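A minimal PyTorch sketch of one such sub-network; the channel widths are my assumptions, since the description above only fixes the layer types and the n output channels:

```python
import torch.nn as nn

# Sketch of one kernel-estimator sub-network (channel widths are assumed).
def kernel_subnet(in_ch=64, mid_ch=64, n=5):
    return nn.Sequential(
        nn.Conv2d(in_ch, mid_ch, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(mid_ch, mid_ch, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(mid_ch, mid_ch, 3, padding=1), nn.ReLU(inplace=True),
        nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False),
        nn.Conv2d(mid_ch, n, 3, padding=1),   # n channels: one 1D kernel per output pixel
    )

# Four parallel sub-networks: vertical and horizontal kernels for each frame.
kernel_estimator = nn.ModuleList([kernel_subnet() for _ in range(4)])
```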

2.3. Offset Estimator

  • The offset estimator shares the same structure as the kernel estimator.
  • It contains four parallel sub-networks to learn the offsets in the two directions (vertical and horizontal) for each input frame.
  • With a specific kernel size n, there are n² pixels in each regular-grid patch, so offsets are estimated for all n² sampling positions.

2.4. Mask Estimator

  • The mask estimator has a similar design; the only difference is that its output channels are fed through a sigmoid layer, which constrains each mask value Δm to (0, 1). A joint sketch of the offset and mask sub-networks follows.
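Under the same assumptions as the kernel-estimator sketch, the offset and mask sub-networks could look like this (the n² output channels per sub-network and the mask sub-network count are my inference, not stated outright):

```python
import torch.nn as nn

# Sketch (assumed channel widths): offset and mask sub-networks mirror the
# kernel estimator. With kernel size n there are n*n sampling positions per
# patch, so each offset sub-network is assumed to emit n*n channels (one
# directional offset per position); the mask sub-network appends a sigmoid.
def estimator_subnet(out_ch, sigmoid=False, ch=64):
    layers = [
        nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
        nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False),
        nn.Conv2d(ch, out_ch, 3, padding=1),
    ]
    if sigmoid:
        layers.append(nn.Sigmoid())       # squashes masks Δm into (0, 1)
    return nn.Sequential(*layers)

n = 5
# four offset sub-networks: vertical/horizontal offsets for each input frame
offset_estimator = [estimator_subnet(n * n) for _ in range(4)]
# mask sub-networks (assumed: one per input frame)
mask_estimator = [estimator_subnet(n * n, sigmoid=True) for _ in range(2)]
```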

2.5. Deformable Convolution

  • (To learn more about deformable convolution, please feel free to read DCNv1 and DCNv2.)

3. Loss Function

  • There are two losses (both are sketched in code after this list).
  • The first loss measures the difference between the interpolated pixel color and the ground-truth color:

$\mathcal{L}_c = \rho\big(\hat{I}(x, y) - I_{gt}(x, y)\big)$

  • Here, ρ(·) represents the Charbonnier penalty function, ρ(x) = √(x² + ε²), a differentiable approximation of the L1 norm.
  • The second loss aims to sharpen the generated frame by penalizing differences between the gradients of the prediction and those of the ground truth:

$\mathcal{L}_g = \sum_{d \in \{u, d, l, r\}} \rho\big(\nabla_d \hat{I} - \nabla_d I_{gt}\big)$

  • The gradients ∇d are absolute differences between neighboring pixels along the up, down, left and right directions; the difference between the predicted and ground-truth gradients should be small in all four.
  • Finally, the total loss function is a weighted combination of the two:

$\mathcal{L} = \mathcal{L}_c + \lambda \mathcal{L}_g$
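A compact sketch of both terms; ε and the balance weight λ are placeholder constants, not the paper's exact values:

```python
import torch

# Sketch of the two training losses described above.
def charbonnier(x, eps=1e-6):
    # Charbonnier penalty ρ(x) = sqrt(x² + ε²), a smooth approximation of |x|
    return torch.sqrt(x * x + eps * eps)

def color_loss(pred, gt):
    # L_c: Charbonnier penalty on the per-pixel color difference
    return charbonnier(pred - gt).mean()

def gradient_loss(pred, gt):
    # L_g: the absolute differences toward up/down (and left/right) neighbors
    # coincide, so vertical and horizontal terms cover all four directions.
    gv = lambda t: (t[..., 1:, :] - t[..., :-1, :]).abs()   # vertical gradient
    gh = lambda t: (t[..., :, 1:] - t[..., :, :-1]).abs()   # horizontal gradient
    return (charbonnier(gv(pred) - gv(gt)).mean()
            + charbonnier(gh(pred) - gh(gt)).mean())

def total_loss(pred, gt, lambda_g=1.0):
    return color_loss(pred, gt) + lambda_g * gradient_loss(pred, gt)
```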

4. Experimental Results

4.1. Ablation Study

Quantitative evaluation of different network architectures: kernel size N×N, with masks (N×N + M) or without masks (N×N).
The effect of different network architectures
  • A larger kernel size enables the network to take more pixels into consideration. However, it inevitably increases computation.
  • Larger kernel sizes such as n = 7, 9 and 11 are not considered, as they would increase the FLOPs of the network relative to n = 5 by 12.8%, 69.0% and 173.8%, respectively.
  • When increasing the kernel size from 1×1 to 5×5, the network has a PSNR gain of 0.65 dB and 1.04 dB on UCF101 and Vimeo90K, respectively.

4.2. Comparison with SOTA Approaches

Quantitative comparisons on UCF101 and Vimeo90K (bold and underlined numbers denote the best and second-best performance, respectively).
  • Moreover, without relying on any extra complex information such as flow, context, edge or depth information, DSepConv performs on par with or even better than the other state-of-the-art methods.
Visual comparisons on the Vimeo90K dataset.
  • ToFlow produces a clear result on the man’s leg, but some information is lost in the skateboard.
  • In contrast, DSepConv reconstructs both well.
  • Also, the motion is continuous between the frames except for the subtitle “Snowy”. Both CyclicGen and DSepConv generate clear results there, while the other methods cannot handle the discontinuity well.
Visual comparisons on the UCF101 dataset.
Quantitative comparisons on Middlebury Evaluation dataset
  • DSepConv ranks 3rd among the more than 160 algorithms listed on the benchmark website.
Visual comparisons on the Middlebury dataset
