# Review — DSepConv: Video Frame Interpolation via Deformable Separable Convolution (Video Frame Interpolation)

In this story, **Video Frame Interpolation via Deformable Separable Convolution**, (DSepConv), by Wuhan University, is reviewed.

Conventionally, when scene motion is larger than the predefined kernel size, kernel-based methods yield poor results.

In this paper:

**Deformable separable convolution (DSepConv)** is used to **adaptively estimate kernels, offsets and masks** to allow the network to **obtain information with much fewer but more relevant pixels.**

This is a paper in **2020 AAAI**. (Sik-Ho Tsang @ Medium)

# Outline

1. **Adaptive Deformable Separable Convolution**
2. **DSepConv: Network Architecture**
3. **Loss Function**
4. **Experimental Results**

# 1. **Adaptive Deformable Separable Convolution**

## 1.1. Conventional Kernel-based Methods

- Let *I*1 and *I*2 represent the **two input frames**, and let *Î* denote the **frame to be interpolated** temporally at the midpoint of the two frames.
- For **each pixel *Î*(*x*,*y*)** to be synthesized, **a pair of convolution kernels,** *K*1 and *K*2, are estimated to adaptively convolve the local patches *P*1(*x*,*y*) and *P*2(*x*,*y*) centered at (*x*,*y*) from *I*1 and *I*2, respectively.
- The interpolation process is:

- where *K*1 and *K*2 are the *n*×*n* 2D convolution kernels.
- In **AdaConv**, *n*=41, which induces a **heavy computational load**.
- In **SepConv**, the **2D kernel is approximated with two 1D kernels**, i.e. the **separable convolution**:

- Thus, the number of kernel parameters is reduced **from *n*² to 2*n*** for each kernel.

Nonetheless, although thousands of pixels are considered, these methods are limited to motions of up to *n* pixels between the two input frames.
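To make the kernel-based formulation concrete, here is a minimal pure-Python sketch for a single output pixel. It assumes the interpolation is the sum of the two patch convolutions, as the description above suggests. The helper names (`outer`, `convolve_patch`, `interpolate_pixel`) and the toy kernels are illustrative, not taken from the paper.

```python
def outer(kv, kh):
    """Build an n x n 2D kernel from a vertical and a horizontal 1D kernel."""
    return [[v * h for h in kh] for v in kv]


def convolve_patch(kernel, patch):
    """Weighted sum of a patch with a same-sized kernel (one output value)."""
    return sum(k * p
               for krow, prow in zip(kernel, patch)
               for k, p in zip(krow, prow))


def interpolate_pixel(k1, p1, k2, p2):
    """Assumed form: output pixel = K1 applied to P1 plus K2 applied to P2."""
    return convolve_patch(k1, p1) + convolve_patch(k2, p2)


n = 3
kv = [0.25, 0.5, 0.25]          # toy vertical 1D kernel
kh = [0.25, 0.5, 0.25]          # toy horizontal 1D kernel
k2d = outer(kv, kh)             # separable approximation of a 2D kernel
half = [[0.5 * w for w in row] for row in k2d]  # each frame contributes half

p1 = [[10.0] * n for _ in range(n)]  # constant patch from frame 1
p2 = [[20.0] * n for _ in range(n)]  # constant patch from frame 2
pixel = interpolate_pixel(half, p1, half, p2)   # 15.0, the average

# Parameter count per kernel: n^2 for a full 2D kernel, 2n when separable.
params_2d = 41 * 41   # 1681 for n = 41
params_sep = 2 * 41   # 82 for n = 41
```

Note that every sampled pixel still lies inside the *n*×*n* patch, which is why the reachable motion is bounded by the kernel size.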

## 1.2. Proposed **Adaptive Deformable Separable Convolution**

- As shown above, the proposed Adaptive Deformable Separable Convolution can obtain pixels (pink points) outside the local neighborhood by using additional learnable offsets (purple arrows), allowing it to better handle large motion.
- Much smaller convolution kernels are used with additional offsets and masks, which allows the network to focus on fewer but more relevant pixels rather than all the pixels in a large neighborhood.
- In Adaptive Deformable Separable Convolution, there are **learnable offsets Δ*p*i,j** and **modulation scalars Δ*m*i,j** that are estimated for each pixel located at *p*i,j in each patch:

With Δ*p*i,j and Δ*m*i,j, relevant pixels that are far away can also be captured. Consequently, objects with large motion can also be interpolated.

- As the offsets are typically fractional, pixels located at non-integral coordinates are bilinearly sampled.
- Similar to SepConv, 1D separable kernels are used to approximate 2D kernels.

- (As deformable convolutions are inspired by DCNv1 and DCNv2, please feel free to read them if interested.)
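The mechanism described in this subsection can be sketched for one output pixel and one input frame as follows. This is a simplified single-channel illustration, not the authors' implementation. The function names, the border clamping in `bilinear`, and the toy ramp image are all assumptions made for the example.

```python
import math


def bilinear(img, y, x):
    """Sample img (a list of rows) at fractional (y, x).

    Coordinates are clamped to the image border (an assumption)."""
    h, w = len(img), len(img[0])
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    fy, fx = y - y0, x - x0
    top = img[y0][x0] * (1 - fx) + img[y0][x1] * fx
    bot = img[y1][x0] * (1 - fx) + img[y1][x1] * fx
    return top * (1 - fy) + bot * fy


def dsepconv_pixel(img, y, x, kv, kh, dy, dx, m):
    """Contribution of one frame to the output pixel at (y, x).

    kv, kh: 1D kernels of length n (separable approximation of a 2D kernel).
    dy, dx: n x n learnable offsets; m: n x n modulation scalars."""
    n = len(kv)
    r = n // 2
    out = 0.0
    for i in range(n):
        for j in range(n):
            sy = y + (i - r) + dy[i][j]   # regular grid position plus offset
            sx = x + (j - r) + dx[i][j]
            out += kv[i] * kh[j] * m[i][j] * bilinear(img, sy, sx)
    return out


# Toy image: a horizontal ramp where the pixel value equals its column index.
img = [[float(c) for c in range(12)] for _ in range(8)]
kv = kh = [0.25, 0.5, 0.25]
zeros = [[0.0] * 3 for _ in range(3)]
ones = [[1.0] * 3 for _ in range(3)]
shift = [[2.5] * 3 for _ in range(3)]

plain = dsepconv_pixel(img, 4, 4, kv, kh, zeros, zeros, ones)    # 4.0
shifted = dsepconv_pixel(img, 4, 4, kv, kh, zeros, shift, ones)  # 6.5
```

With a horizontal offset of 2.5 the 3×3 kernel reads pixels centered on column 6.5, outside its regular neighborhood, which is how a small kernel can follow large motion.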

## 1.3. Comparison with Kernel-Based Methods and Flow-Based Methods

- **When Δ*p*=0 and Δ*m*=1**, it is actually **SepConv**.
- On the other hand, **when *n*=1**, the patches become single pixels via bilinear interpolation. In this case, the interpolation becomes:

- which is a **bi-directional warping function**, where the offsets Δ*x*1, Δ*x*2, Δ*y*1, and Δ*y*2 act as **optical flow**, and *k*1, *k*2, Δ*m*1, and Δ*m*2 act as **occlusion masks**.

Thus, kernel-based methods and conventional flow-based methods are specific instances of the proposed DSepConv.
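For reference, the *n*=1 special case can be written out in the notation of this section. This is a reconstruction from the description above, not the paper's typeset equation, so the exact arrangement of terms is an assumption.

```latex
% n = 1: each patch collapses to one bilinearly sampled pixel, so each frame
% is warped by its offset and weighted by its kernel value and mask.
\hat{I}(x, y) =
    k_1 \,\Delta m_1 \, I_1\!\left(x + \Delta x_1,\; y + \Delta y_1\right)
  + k_2 \,\Delta m_2 \, I_2\!\left(x + \Delta x_2,\; y + \Delta y_2\right)
```

Read this way, (Δ*x*, Δ*y*) plays the role of a per-pixel flow vector for each frame, while the products *k*·Δ*m* weight how much each warped frame contributes, as an occlusion mask would.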

# 2. **DSepConv: Network Architecture**

- A fully convolutional neural network which is similar to SepConv.
- The whole network can be divided into the following submodules: the **encoder-decoder architecture**, **kernel estimator**, **offset estimator** and **mask estimator**, as illustrated in the above figure.

## 2.1. Encoder-decoder Architecture

- Given two input frames, the encoder-decoder architecture aims to extract deep features for estimating kernels, masks and offsets for each output pixel.

A U-Net architecture, which is the same as the AdaConv one, is used. Skip connections are employed.

## 2.2. Kernel Estimator

- The kernel estimator consists of **four parallel sub-networks** with analogous structure to **estimate vertical and horizontal 1D kernels.**
- For each sub-network, **three 3×3 convolution layers with ReLU, a bilinear upsampling layer** and **another 3×3 convolution layer** are stacked, yielding **a tensor with *n* channels.**

Subsequently, the estimated four 1D kernels are used to approximate two 2D kernels.

## 2.3. Offset Estimator

- The offset estimator shares the same structure as the kernel estimator.
- It contains **four parallel sub-networks** to **learn two directional (vertical and horizontal) offsets.**
- With a specific kernel size *n*, there are *n*² pixels in each regular grid patch.

Therefore, the number of output channels in each sub-network is set to *n*².

## 2.4. Mask Estimator

- The design of the mask estimator is similar; the only difference is that the output channels are fed to a sigmoid layer.

There are two parallel sub-networks, each of which produces a tensor with *n*² channels.
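As a quick sanity check on the sizes described for the three estimators, the following sketch tallies the per-pixel output channels for a given kernel size *n*. The function names are illustrative, and only the sub-network counts and channel widths stated above are used.

```python
def estimator_channels(n):
    """Sub-network count and output channels per sub-network, as described above."""
    return {
        "kernel": {"subnets": 4, "channels_each": n},      # 1D kernels
        "offset": {"subnets": 4, "channels_each": n * n},  # vertical/horizontal offsets
        "mask": {"subnets": 2, "channels_each": n * n},    # modulation scalars
    }


def total_channels(n):
    """Total estimated values per output pixel across all estimators."""
    return sum(e["subnets"] * e["channels_each"]
               for e in estimator_channels(n).values())


per_pixel_n5 = total_channels(5)  # 4*5 + 4*25 + 2*25 = 170
```

The offset and mask outputs grow with *n*², which is one reason small kernel sizes are attractive here.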

## 2.5. Deformable Convolution

The deformable convolution utilizes the estimated kernels, offsets and masks to adaptively convolve one input frame, yielding an intermediate interpolation result.

- In the right part of the above figure, the intermediate results generated by deformable convolution look dimmer than the final result, except in areas with occlusion (e.g. the area around the red ball), suggesting that deformable convolution is effective at handling motion and occlusion.
- (To know more about the deformable convolution, please feel free to read DCNv1 and DCNv2 if interested.)

# 3. **Loss Function**

- There are two losses.
- **The first loss** measures **the difference between the interpolated pixel color and the ground-truth color** with the function:

- where *I*GT is the ground truth, *Î* is the interpolated frame, and *Î*’ is the temporally flipped interpolated frame. *ρ*(·) represents the Charbonnier penalty function.
- **The second loss** function aims to **sharpen the generated frame by penalizing the differences of frame gradient predictions**:

- As in the equation, the absolute difference between the predicted neighbor pixels and the absolute difference between the ground-truth neighbor pixels are first calculated.
- The difference of these two absolute differences should be small, along up, down, left and right directions.
- Finally, the **total loss function** is:
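As a rough illustration of the loss terms described in this section, the sketch below implements a Charbonnier penalty and a gradient-difference term on small single-channel images. The smoothing constant `EPS`, the weight `lam`, the function names, and the omission of the temporally flipped frame are all assumptions made for the example. The paper's exact formulation may differ.

```python
import math

EPS = 1e-6  # assumed smoothing constant for the Charbonnier penalty


def charbonnier(x, eps=EPS):
    """Charbonnier penalty: a smooth approximation of |x|."""
    return math.sqrt(x * x + eps * eps)


def color_loss(pred, gt):
    """Sum of Charbonnier penalties over per-pixel color differences."""
    return sum(charbonnier(p - g)
               for prow, grow in zip(pred, gt)
               for p, g in zip(prow, grow))


def gradient_loss(pred, gt):
    """Compare absolute neighbor differences of prediction and ground truth
    along the up, down, left and right directions."""
    h, w = len(gt), len(gt[0])
    total = 0.0
    for y in range(h):
        for x in range(w):
            for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                ny, nx = y + dy, x + dx
                if 0 <= ny < h and 0 <= nx < w:
                    dp = abs(pred[y][x] - pred[ny][nx])
                    dg = abs(gt[y][x] - gt[ny][nx])
                    total += abs(dp - dg)
    return total


def total_loss(pred, gt, lam=1.0):
    """Assumed combination: color term plus a weighted gradient term."""
    return color_loss(pred, gt) + lam * gradient_loss(pred, gt)


gt = [[0.0, 1.0], [0.0, 1.0]]        # sharp vertical edge
blurry = [[0.5, 0.5], [0.5, 0.5]]    # edge completely smoothed away
sharp_penalty = gradient_loss(gt, gt)       # 0.0
blur_penalty = gradient_loss(blurry, gt)    # 4.0
```

The blurry prediction is penalized by the gradient term because it loses the edge, which matches the stated goal of sharpening the generated frame.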

# 4. **Experimental Results**

## 4.1. Ablation Study

- For each pixel to be synthesized, **the kernel size *n*** indicates how many pixels in the non-regular grid augmented with offsets could be referenced.
- **Larger kernel size** enables the network to take **more pixels** into consideration. However, it inevitably introduces **an increase in computation.**
- Larger kernel sizes like *n* = 7, 9, 11 are not considered, as they increase the FLOPs of the network by 12.8%, 69.0% and 173.8%, respectively, relative to *n* = 5.
- When increasing the kernel size from 1×1 to 5×5, the network has a PSNR gain of 0.65 dB and 1.04 dB on UCF101 and Vimeo90K, respectively.

The network with the mask estimator (M) gives a significant improvement in performance. This can be attributed to the capability of the modulation mechanism, which adjusts offsets in perceiving input patches.
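For readers less familiar with the metric, the PSNR gains quoted in this section can be related to mean squared error with the standard definition below. The peak value of 255 and the helper names are assumptions for the example, since the evaluation code may use a different range.

```python
import math


def psnr(pred, gt, max_val=255.0):
    """Peak signal-to-noise ratio in dB between two same-sized images."""
    diffs = [(p - g) ** 2
             for prow, grow in zip(pred, gt)
             for p, g in zip(prow, grow)]
    mse = sum(diffs) / len(diffs)
    if mse == 0:
        return float("inf")
    return 10.0 * math.log10(max_val ** 2 / mse)


def mse_ratio_for_gain(db_gain):
    """Factor by which MSE shrinks for a given PSNR gain in dB."""
    return 10.0 ** (-db_gain / 10.0)


gt = [[100.0, 100.0], [100.0, 100.0]]
pred = [[110.0, 110.0], [110.0, 110.0]]   # constant error of 10, so MSE = 100
score = psnr(pred, gt)
ratio = mse_ratio_for_gain(0.65)          # MSE factor for a 0.65 dB gain
```

Under this definition, a gain of 0.65 dB corresponds to roughly a 14% reduction in mean squared error.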

## 4.2. Comparison with SOTA Approaches

- Sharing similar network structure and parameters, DSepConv outperforms the baseline SepConv-L1 by more than 0.3dB and 0.9dB (PSNR) on UCF101 and Vimeo90K datasets, respectively.
- Moreover, **without relying on any extra complex information** such as flow, context, edge or depth information, DSepConv shows strong performance **on par with or even better than the other state-of-the-art methods.**

- The MEMC-Net∗ and DAIN methods generate obvious artifacts even though they use several sub-modules in their networks.
- ToFlow can produce a clear result for the man’s leg, but some information is lost in the skateboard.
- In contrast, DSepConv reconstructs them well.
- Also, the motion is continuous between the frames except for the subtitle “Snowy”. Both CyclicGen and DSepConv can generate clear results, while the other methods cannot handle the discontinuity well.

- For UCF101, DSepConv can restore clear details of the football while the results generated by the other interpolation methods suffer from either artifacts or blur.

- **DSepConv performs favorably on most sequences** and **achieves the best average performance** against the compared approaches.
- DSepConv ranks 3rd among over 160 algorithms listed on the benchmark website.

The proposed DSepConv is constrained by neither the kernel size nor the accuracy of optical flow. However, like other kernel-based methods, it can only generate a single in-between frame.

## Reference

[2020 AAAI] [DSepConv]

Video Frame Interpolation via Deformable Separable Convolution

# Video Frame Interpolation

**2017** [AdaConv] [SepConv] **2020** [DSepConv]