Review — DSepConv: Video Frame Interpolation via Deformable Separable Convolution (Video Frame Interpolation)

Using Deformable Convolution from DCNv1 and DCNv2, Outperforms AdaConv & SepConv

May 16, 2021

**Left**: **SepConv**, which interpolates the ball poorly. **Right**: **Proposed DSepConv**, which considers more relevant pixels far away from the local grid (black rectangle) with a much smaller kernel size and performs better.

In this story, Video Frame Interpolation via Deformable Separable Convolution (DSepConv), by Wuhan University, is reviewed.

Conventionally, when scene motion is larger than the predefined kernel size, kernel-based methods yield poor results.

In this paper:

Deformable separable convolution (DSepConv) is used to adaptively estimate kernels, offsets and masks, allowing the network to obtain information from far fewer but more relevant pixels.

This is a paper in 2020 AAAI. (Sik-Ho Tsang @ Medium)

Outline

Adaptive Deformable Separable Convolution
DSepConv: Network Architecture
Loss Function
Experimental Results

1. Adaptive Deformable Separable Convolution

1.1. Conventional Kernel-based Methods

Let I1 and I2 denote the two input frames, and let ˆI denote the frame to be interpolated at the temporal midpoint between them.
For each pixel ˆI(x, y) to be synthesized, a pair of convolution kernels, K1 and K2, is estimated to adaptively convolve the local patches P1(x, y) and P2(x, y) centered at (x, y) from I1 and I2, respectively.
The interpolation process is:

ˆI(x, y) = K1(x, y) ∗ P1(x, y) + K2(x, y) ∗ P2(x, y)

where K1 and K2 are the n×n 2D convolution kernels and ∗ denotes the local convolution of a kernel with its patch.
In AdaConv, n=41, which induces a heavy computational load.
In SepConv, each 2D kernel is approximated by two 1D kernels, one vertical and one horizontal, i.e. the separable convolution:

Thus, the number of kernel parameters is reduced from n² to 2n for each kernel.
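The saving is easy to check numerically. The following is a minimal sketch in plain Python, not code from the paper: the function names and the example 1D kernel values are made up for illustration, and only n = 41 comes from the text above.

```python
def outer_product(k_v, k_h):
    """Build the n x n 2D kernel implied by a vertical and a horizontal 1D kernel."""
    return [[v * h for h in k_h] for v in k_v]


def kernel_param_counts(n):
    """Return (full 2D count, separable count) for a single n x n kernel."""
    return n * n, 2 * n


# Illustrative 1D kernels (hypothetical values, each sums to 1).
k_v = [0.25, 0.5, 0.25]
k_h = [0.2, 0.6, 0.2]
K = outer_product(k_v, k_h)  # 3 x 3 kernel described by only 6 numbers

# n = 41 is the AdaConv kernel size quoted above.
full, separable = kernel_param_counts(41)
print(full, separable)  # 1681 82
```

Note that a separable kernel can only represent 2D kernels of this outer-product form, which is the approximation SepConv accepts in exchange for the smaller parameter count.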

Nonetheless, although thousands of pixels are considered, these methods are limited to motions of up to n pixels between the two input frames.

1.2. Proposed Adaptive Deformable Separable Convolution

As shown above, the proposed Adaptive Deformable Separable Convolution can obtain pixels (pink points) outside the local neighborhood with additional learnable offsets (purple arrows), which allows large motion to be handled better.
Much smaller convolution kernels are used with additional offsets and masks, which allows the network to focus on fewer but more relevant pixels rather than all the pixels in a large neighborhood.
In Adaptive Deformable Separable Convolution, a learnable offset Δpi,j and a modulation scalar Δmi,j are estimated for each pixel located at pi,j in each patch:

With Δpi,j and Δmi,j, relevant pixels that are far away can also be captured.
Consequently, objects with large motion can also be interpolated.

As the offsets are typically fractional, pixels located at non-integral coordinates are bilinearly sampled.
Similar to SepConv, 1D separable kernels are used to approximate the 2D kernels.
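The sampling described above can be sketched for a single output pixel and a single input frame. This is a plain-Python illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the taps are centered on (x, y), and out-of-range coordinates are clamped to the image border, which is a choice made here for simplicity. The full interpolation would add the same sum computed on the second frame.

```python
import math


def bilinear_sample(img, x, y):
    """Sample img[row][col] at fractional (x, y); coordinates are clamped to the border."""
    h, w = len(img), len(img[0])
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    fx, fy = x - x0, y - y0
    top = (1 - fx) * img[y0][x0] + fx * img[y0][x1]
    bottom = (1 - fx) * img[y1][x0] + fx * img[y1][x1]
    return (1 - fy) * top + fy * bottom


def deformable_separable_pixel(img, x, y, k_v, k_h, dx, dy, mask):
    """One frame's contribution to the output pixel at (x, y).

    k_v, k_h : 1D kernels of length n (their outer product is the 2D kernel)
    dx, dy   : n x n offsets added to each tap's regular grid position
    mask     : n x n modulation scalars
    """
    n = len(k_v)
    r = n // 2
    out = 0.0
    for i in range(n):      # vertical tap index
        for j in range(n):  # horizontal tap index
            sx = x + (j - r) + dx[i][j]
            sy = y + (i - r) + dy[i][j]
            out += k_v[i] * k_h[j] * mask[i][j] * bilinear_sample(img, sx, sy)
    return out


# Toy frame whose pixel value equals its column index.
ramp = [[float(c) for c in range(7)] for _ in range(5)]
delta = [0.0, 1.0, 0.0]  # 1D kernel that keeps only the center tap
zeros = [[0.0] * 3 for _ in range(3)]
ones = [[1.0] * 3 for _ in range(3)]
shift = [[2.5] * 3 for _ in range(3)]

# Without offsets the center tap reads column 2; with a horizontal offset
# of 2.5 it reads halfway between columns 4 and 5.
unshifted = deformable_separable_pixel(ramp, 2, 2, delta, delta, zeros, zeros, ones)
shifted = deformable_separable_pixel(ramp, 2, 2, delta, delta, shift, zeros, ones)
print(unshifted, shifted)  # 2.0 4.5
```

With a 3×3 kernel the regular grid only reaches one pixel from the center, yet the offset lets the tap read a location 2.5 pixels away, which is the point of the deformable formulation.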

(As deformable convolutions are inspired by DCNv1 and DCNv2, please feel free to read them if interested.)

1.3. Comparison with Kernel-Based Methods and Flow-Based Methods

When Δp=0 and Δm=1, DSepConv reduces to SepConv.
On the other hand, when n=1, the patches become single pixels via bilinear interpolation. In this case, the interpolation becomes:

which is a bi-directional warping function, where each offset Δx1, Δx2, Δy1, and Δy2 can be regarded as a component of optical flow, and k1, k2, Δm1, and Δm2 are occlusion masks.

Thus, kernel-based methods and conventional flow-based methods are specific instances of the proposed DSepConv.
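Both special cases can be written out from a general form. The block below is a sketch in assumed notation: the frame index t and the way K, Δp and Δm are subscripted are choices made here from the definitions in Section 1.2, not copied from the paper.

```latex
% General form (assumed notation): t indexes the two input frames.
\hat{I}(x, y) = \sum_{t=1}^{2} \sum_{i=1}^{n} \sum_{j=1}^{n}
    K_t(i, j)\, \Delta m_{t,i,j}\, P_t\!\left(p_{i,j} + \Delta p_{t,i,j}\right)

% Case 1: \Delta p = 0 and \Delta m = 1 give the kernel-based (SepConv) form.
\hat{I}(x, y) = \sum_{t=1}^{2} \sum_{i=1}^{n} \sum_{j=1}^{n} K_t(i, j)\, P_t(p_{i,j})
              = K_1(x, y) * P_1(x, y) + K_2(x, y) * P_2(x, y)

% Case 2: n = 1 leaves one bilinearly sampled pixel per frame (bi-directional warping).
\hat{I}(x, y) = k_1\, \Delta m_1\, I_1(x + \Delta x_1,\, y + \Delta y_1)
              + k_2\, \Delta m_2\, I_2(x + \Delta x_2,\, y + \Delta y_2)
```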

2. DSepConv: Network Architecture

DSepConv is a fully convolutional neural network similar to SepConv.
The whole network can be divided into the following submodules: the encoder-decoder architecture, kernel estimator, offset estimator and mask estimator as illustrated in the above figure.

2.1. Encoder-decoder Architecture

Given two input frames, the encoder-decoder architecture aims to extract deep features for estimating kernels, masks and offsets for each output pixel.

A U-Net architecture, the same as the one in AdaConv, is used.
Skip connections are employed.

2.2. Kernel Estimator

The kernel estimator consists of four parallel sub-networks with analogous structure to estimate vertical and horizontal 1D kernels.
For each sub-network, three 3×3 convolution layers with ReLU, a bilinear upsampling layer and another 3×3 convolution layer are stacked, yielding a tensor with n channels.

Subsequently, the estimated four 1D kernels are used to approximate two 2D kernels.

2.3. Offset Estimator

The offset estimator shares the same structure as the kernel estimator.
It contains four parallel sub-networks to learn the offsets in two directions (vertical and horizontal).
With a specific kernel size n, there are n² pixels in each regular grid patch.

Therefore, the number of output channels in each sub-network is set to n².

2.4. Mask Estimator

The design of the mask estimator is similar. The only difference is that the output channels are fed to a sigmoid layer.

There are two parallel sub-networks, each of which produces tensors with n² channels.
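The channel counts quoted for the three estimators can be tallied per output pixel. This is a small bookkeeping sketch in plain Python. The function names and dictionary layout are hypothetical, and the sub-network counts are taken from the descriptions above.

```python
import math


def estimator_output_channels(n):
    """Sub-network count and output channels per sub-network, as described in the text."""
    return {
        "kernel": {"sub_networks": 4, "channels_each": n},      # four 1D kernels
        "offset": {"sub_networks": 4, "channels_each": n * n},  # 2 directions x 2 frames
        "mask": {"sub_networks": 2, "channels_each": n * n},    # one mask per frame
    }


def total_coefficients_per_pixel(n):
    """Total number of values estimated for every output pixel."""
    spec = estimator_output_channels(n)
    return sum(s["sub_networks"] * s["channels_each"] for s in spec.values())


def sigmoid(z):
    """Squash a raw mask-estimator output into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


# For n = 5: 4*5 kernel + 4*25 offset + 2*25 mask values per output pixel.
print(total_coefficients_per_pixel(5))  # 170
print(sigmoid(0.0))                     # 0.5
```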

2.5. Deformable Convolution

The deformable convolution utilizes the estimated kernels, offsets and masks to adaptively convolve one input frame, yielding an intermediate interpolation result.

In the right part of the above figure, the intermediate results generated by the deformable convolution look dimmer than the final result, except in areas with occlusion (e.g. the area around the red ball). This suggests that the deformable convolution is effective at handling motion and occlusion.
(To know more about the deformable convolution, please feel free to read DCNv1 and DCNv2 if interested.)

3. Loss Function

There are two losses.
The first loss measures the difference between the interpolated pixel color and the ground-truth color with the function:

where IGT is the ground truth, ˆI is the interpolated frame, and ˆI’ is the temporally flipped interpolated frame.
ρ(·) represents the Charbonnier penalty function.
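The Charbonnier penalty is a smooth approximation of the absolute value, commonly written as ρ(x) = sqrt(x² + ε²). The sketch below applies it to one predicted frame and its ground truth. The value of ε, the use of a sum rather than a mean, and the function names are assumptions made here, and the way the flipped frame ˆI’ enters the loss is not reproduced.

```python
import math


def charbonnier(x, eps=1e-6):
    """Charbonnier penalty: behaves like |x| but stays differentiable at x = 0.

    eps = 1e-6 is an assumed value, not taken from the paper.
    """
    return math.sqrt(x * x + eps * eps)


def charbonnier_loss(pred, target, eps=1e-6):
    """Sum the penalty over all pixels of two same-sized 2D frames."""
    total = 0.0
    for row_p, row_t in zip(pred, target):
        for p, t in zip(row_p, row_t):
            total += charbonnier(p - t, eps)
    return total


pred = [[1.0, 2.0]]
target = [[1.0, 4.0]]
loss = charbonnier_loss(pred, target)  # close to |0| + |2| = 2
```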
The second loss function aims to sharpen the generated frame by penalizing the differences of frame gradient predictions:

As in the equation, the absolute difference between the predicted neighbor pixels and the absolute difference between the ground-truth neighbor pixels are first calculated.
The difference of these two absolute differences should be small, along up, down, left and right directions.
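The description above translates into a short loop. This is a plain-Python sketch under assumptions: the function name is hypothetical, neighbors outside the frame are skipped, and the terms are summed without any normalization, none of which is confirmed by the text.

```python
def gradient_difference_loss(pred, gt):
    """Compare neighbor differences of the prediction with those of the ground truth."""
    h, w = len(pred), len(pred[0])
    directions = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right
    total = 0.0
    for y in range(h):
        for x in range(w):
            for dy, dx in directions:
                ny, nx = y + dy, x + dx
                if 0 <= ny < h and 0 <= nx < w:  # skip neighbors outside the frame
                    g_pred = abs(pred[y][x] - pred[ny][nx])
                    g_gt = abs(gt[y][x] - gt[ny][nx])
                    total += abs(g_pred - g_gt)
    return total


# A sharp vertical edge in the ground truth and a fully blurred prediction.
gt = [[0.0, 1.0], [0.0, 1.0]]
blurred = [[0.5, 0.5], [0.5, 0.5]]
print(gradient_difference_loss(gt, gt))       # 0.0
print(gradient_difference_loss(blurred, gt))  # 4.0
```

The blurred prediction is penalized because it has lost the edge, even though each of its pixels is only 0.5 away from the ground truth. This is how the term pushes the network toward sharper output.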
Finally, the total loss function is:

4. Experimental Results

4.1. Ablation Study

**Quantitative evaluation on different network architecture: kernel size of N×N with (N×N + M) or without masks (N×N).**

**The effect of different network architectures**

For each pixel to be synthesized, the kernel size n indicates how many pixels in the non-regular grid augmented with offsets could be referenced.
A larger kernel size enables the network to take more pixels into consideration. However, it inevitably increases the computation.
Larger kernel sizes such as n = 7, 9, 11 are not considered, as they increase the FLOPs of the network relative to n = 5 by 12.8%, 69.0% and 173.8%, respectively.
When increasing the kernel size from 1×1 to 5×5, the network has a PSNR gain of 0.65 dB and 1.04 dB on UCF101 and Vimeo90K, respectively.
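For readers less familiar with the metric, PSNR is computed from the mean squared error (MSE), so a gain in dB corresponds to a multiplicative drop in MSE. The sketch below is plain Python with hypothetical function names. It shows the standard definition only and does not reproduce the paper's evaluation protocol.

```python
import math


def psnr(pred, gt, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two same-sized 2D frames."""
    squared_error = 0.0
    count = 0
    for row_p, row_g in zip(pred, gt):
        for p, g in zip(row_p, row_g):
            squared_error += (p - g) ** 2
            count += 1
    mse = squared_error / count
    if mse == 0.0:
        return float("inf")  # identical frames
    return 10.0 * math.log10(max_value ** 2 / mse)


def mse_ratio_for_gain(gain_db):
    """Factor by which the MSE shrinks for a given PSNR gain in dB."""
    return 10.0 ** (-gain_db / 10.0)


# An error of 0.1 on every pixel of a [0, 1] image gives an MSE of 0.01, i.e. about 20 dB.
gt_frame = [[0.0, 0.0], [0.0, 0.0]]
noisy = [[0.1, 0.1], [0.1, 0.1]]
value = psnr(noisy, gt_frame, max_value=1.0)
ratio = mse_ratio_for_gain(0.65)  # about 0.861
```

Under this definition, the 0.65 dB gain quoted above corresponds to roughly 14% lower MSE.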

Network with the mask estimator (M) gives a significant improvement in performance. This can be attributed to the capability of the modulation mechanism which adjusts offsets in perceiving input patches.

4.2. Comparison with SOTA Approaches

**Quantitative comparisons on UCF101 and Vimeo90K (The bold numbers and underlined numbers depict the best and the second best performances.)**

Sharing similar network structure and parameters, DSepConv outperforms the baseline SepConv-L1 by more than 0.3dB and 0.9dB (PSNR) on UCF101 and Vimeo90K datasets, respectively.
Moreover, without relying on any extra complex information such as flow, context, edge or depth information, DSepConv performs on par with or even better than the other state-of-the-art methods.

**Visual comparisons on the Vimeo90K dataset.**

The MEMC-Net∗ and DAIN methods generate obvious artifacts even though they use several sub-modules in their networks.
ToFlow produces a clear result for the man’s leg, but some information is lost in the skateboard.
In contrast, DSepConv reconstructs both well.
Also, the motion is continuous between the frames except for the subtitle “Snowy”. Both CyclicGen and DSepConv generate clear results, while the other methods cannot handle the discontinuity well.

**Visual comparisons on the UCF101 dataset.**

For UCF101, DSepConv can restore clear details of the football while the results generated by the other interpolation methods suffer from either artifacts or blur.

**Quantitative comparisons on Middlebury Evaluation dataset**

DSepConv performs favorably on most sequences and achieves the best average performance against the compared approaches.
DSepConv ranks 3rd among the more than 160 algorithms listed on the benchmark website.

**Visual comparisons on the Middlebury dataset**

DSepConv reconstructs the ball and the feet of the boy with a clear boundary while AdaConv, SepConv, ToFlow, MEMC-Net∗ and ADC suffer from some blur.

The proposed DSepConv is constrained by neither the kernel size nor the accuracy of optical flow. However, like other kernel-based methods, it can only generate a single in-between frame.

Reference

[2020 AAAI] [DSepConv]
Video Frame Interpolation via Deformable Separable Convolution

Video Frame Interpolation

2017 [AdaConv] [SepConv] 2020 [DSepConv]