Reading: STMC / VESPCN — Spatial Temporal Networks and Motion Compensation / Video ESPCN (Video Super Resolution)

Uses STN; outperforms SRCNN, ESPCN and VSRnet.

In this story, Real-Time Video Super-Resolution with Spatio-Temporal Networks and Motion Compensation (STMC / VESPCN), by Twitter, is briefly presented. Some papers call it STMC, while the authors themselves call it VESPCN in the paper. In this paper:

  • Spatio-temporal subpixel convolution networks are introduced that effectively exploit temporal redundancies and improve reconstruction accuracy while maintaining real-time speed.
  • A novel joint motion compensation and video super-resolution algorithm is proposed that is orders of magnitude more efficient than competing methods, relying on a fast multi-resolution spatial transformer module that is end-to-end trainable.

This is a 2017 CVPR paper with over 200 citations. (Sik-Ho Tsang @ Medium)


  1. VESPCN Overall Network Architecture
  2. Spatial Transformer Motion Compensation (STMC) Module
  3. Experimental Results

1. VESPCN Overall Network Architecture

VESPCN Overall Network Architecture
  • First, motion estimation is performed on the t-1 and t+1 frames, i.e. I^LR_{t-1} and I^LR_{t+1}, which are then warped. This part is done by the Spatial Transformer Motion Compensation (STMC) module.
  • After that, we have 2 motion-compensated frames. Together with the current frame I^LR_t, they are fed to the spatio-temporal ESPCN, i.e. ESPCN with a few frames as input, which super-resolves the current frame as I^SR_t.
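The ESPCN family produces its output with a sub-pixel convolution: the last layer emits r² low-resolution feature maps, which are then rearranged into a single image that is r times larger in each dimension. Below is a minimal pure-Python sketch of that rearrangement step only. The function name `pixel_shuffle`, the single-channel output, and the channel ordering (channel c goes to sub-pixel offset (c // r, c % r)) are assumptions made for illustration; the convolution layers themselves are omitted.

```python
def pixel_shuffle(channels, r):
    """Rearrange r*r feature maps of size H x W into one (r*H) x (r*W) image.

    channels: list of r*r feature maps, each an H x W list of lists.
    Assumed ordering: channel c fills sub-pixel offset (c // r, c % r).
    """
    assert len(channels) == r * r, "need exactly r*r feature maps"
    h, w = len(channels[0]), len(channels[0][0])
    out = [[0.0] * (w * r) for _ in range(h * r)]
    for c, fmap in enumerate(channels):
        dy, dx = divmod(c, r)  # sub-pixel offset inside each r x r output cell
        for y in range(h):
            for x in range(w):
                out[y * r + dy][x * r + dx] = fmap[y][x]
    return out


# Four 1 x 2 feature maps become one 2 x 4 image for r = 2.
maps = [[[0, 1]], [[10, 11]], [[20, 21]], [[30, 31]]]
upscaled = pixel_shuffle(maps, 2)
```

Because all convolutions run at low resolution and only this cheap rearrangement touches the high-resolution grid, the network avoids convolving an upsampled image.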

2. Spatial Transformer Motion Compensation (STMC) Module

Spatial Transformer Motion Compensation (STMC) Module
  • Spatial Transformer Networks (STN) were initially shown to facilitate image classification by transforming images onto the same frame of reference.
  • Recently, it has been shown how spatial transformers can encode optical flow features.
  • First, a ×4 coarse estimate of the flow is obtained by early fusing the two input frames and downscaling spatial dimensions with ×2 strided convolutions.
  • The estimated flow is upscaled with sub-pixel convolution, and the resulting coarse flow Δc is applied to warp the target frame, producing I'^c_{t+1}.
  • The warped image is then processed together with the coarse flow and the original images through a fine flow estimation module.
  • This uses a single strided convolution with stride ×2 and a final ×2 upscaling stage to obtain a finer flow map Δf.
  • The final motion compensated frame is obtained by warping the target frame with the total flow.
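  • To train the STMC module, the loss is the MSE between the motion-compensated frame and the current frame, plus a Huber loss on the flow.
  • The Huber loss, applied to the spatial gradients of the flow map, constrains the flow to behave smoothly in space. In practice, a smooth approximation of it is used.
  • The flow map is obtained by the above STMC module.
  • With STMC, the residual between the compensated frame and the current frame is much smaller.
  • Finally, motion compensation is trained jointly with super-resolution: the total loss adds the super-resolution MSE to the motion compensation loss of both neighbouring frames.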
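The warping step and the motion compensation loss described above can be sketched in a few lines of pure Python. This is only an illustration under stated assumptions, not the authors' implementation: the flow is taken as a backward flow (output pixel (y, x) reads the neighbouring frame at (y + flow_y, x + flow_x)), sampling is bilinear with border clamping, the smooth Huber approximation is assumed to be the square root of a small constant plus the summed squared flow gradients, and the values of `eps` and `lam` are illustrative. All function names are hypothetical.

```python
import math


def bilinear_sample(img, y, x):
    """Sample img (H x W list of lists) at real-valued (y, x), clamped to the border."""
    h, w = len(img), len(img[0])
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    wy, wx = y - y0, x - x0
    top = (1 - wx) * img[y0][x0] + wx * img[y0][x1]
    bottom = (1 - wx) * img[y1][x0] + wx * img[y1][x1]
    return (1 - wy) * top + wy * bottom


def warp(img, flow_x, flow_y):
    """Backward warp: output pixel (y, x) reads img at (y + flow_y, x + flow_x)."""
    h, w = len(img), len(img[0])
    return [[bilinear_sample(img, y + flow_y[y][x], x + flow_x[y][x])
             for x in range(w)] for y in range(h)]


def mse(a, b):
    """Mean squared error between two images of equal size."""
    h, w = len(a), len(a[0])
    return sum((a[y][x] - b[y][x]) ** 2 for y in range(h) for x in range(w)) / (h * w)


def huber_smoothness(flow_x, flow_y, eps=0.01):
    """Assumed smooth Huber approximation: sqrt(eps + sum of squared flow gradients)."""
    total = 0.0
    for comp in (flow_x, flow_y):
        h, w = len(comp), len(comp[0])
        for y in range(h):
            for x in range(w):
                if x + 1 < w:  # forward difference along x
                    total += (comp[y][x + 1] - comp[y][x]) ** 2
                if y + 1 < h:  # forward difference along y
                    total += (comp[y + 1][x] - comp[y][x]) ** 2
    return math.sqrt(eps + total)


def mc_loss(current, neighbour, flow_x, flow_y, lam=0.01):
    """MSE between the current frame and the warped neighbour, plus weighted smoothness."""
    warped = warp(neighbour, flow_x, flow_y)
    return mse(current, warped) + lam * huber_smoothness(flow_x, flow_y)


# Toy example: the neighbour is the current frame shifted right by one pixel.
current = [[0.0, 1.0, 2.0, 3.0, 4.0]] * 2
neighbour = [[0.0, 0.0, 1.0, 2.0, 3.0]] * 2
zero = [[0.0] * 5 for _ in range(2)]
one = [[1.0] * 5 for _ in range(2)]
loss_without_flow = mc_loss(current, neighbour, zero, zero)
loss_with_flow = mc_loss(current, neighbour, one, zero)
```

In this toy case a constant flow of one pixel along x undoes the shift everywhere except at the clamped border, so `loss_with_flow` is lower than `loss_without_flow`. In the network, the flow is predicted by the coarse and fine estimation stages rather than given.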

3. Experimental Results

Performance on Vid4 videos
Results for ×3 SR on Vid4.
  • CDVL database is used for training. Vid4 is used for evaluation.
  • A 5-layer, 3-frame network (5L-E3) and a 9-layer, 3-frame network with motion compensation (9L-E3-MC) are evaluated.
  • The metrics compared are PSNR, structural similarity (SSIM) and MOVIE indices.
  • The MOVIE index was designed as a metric measuring video quality that correlates with human perception and incorporates a notion of temporal consistency.
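Of the metrics above, PSNR is the simplest to compute: it is a log-scaled inverse of the mean squared error against the ground-truth frame. A minimal sketch follows, assuming 8-bit intensities (peak value 255) and a plain average of per-frame PSNR over a video; the exact evaluation protocol of the paper (colour space, border cropping, averaging) is not stated here, so those choices are assumptions, and the function names are hypothetical.

```python
import math


def psnr(reference, test, peak=255.0):
    """Peak signal-to-noise ratio in dB between two H x W frames."""
    h, w = len(reference), len(reference[0])
    mse = sum((reference[y][x] - test[y][x]) ** 2
              for y in range(h) for x in range(w)) / (h * w)
    if mse == 0:
        return math.inf  # identical frames
    return 10.0 * math.log10(peak * peak / mse)


def mean_psnr(reference_frames, test_frames, peak=255.0):
    """Average of per-frame PSNR over a video (assumed averaging scheme)."""
    values = [psnr(r, t, peak) for r, t in zip(reference_frames, test_frames)]
    return sum(values) / len(values)


# Toy frames: every pixel is off by 5 intensity levels, so the MSE is 25.
ref = [[100, 100], [100, 100]]
out = [[105, 105], [105, 105]]
score = psnr(ref, out)
```

A higher PSNR means a lower pixel-wise error. It says nothing about temporal consistency, which is why the MOVIE index is reported as well.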

3.1. Quality

  • VESPCN surpasses all other compared methods in PSNR and SSIM by a large margin.
  • The above figure shows the temporal profiles of the row highlighted by a dashed line across 25 consecutive frames, demonstrating the better temporal coherence of the proposed reconstruction.
  • The strong temporal coherence of VESPCN also explains the significant reduction in the MOVIE index.
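A temporal profile like the one described above is built by taking the same pixel row from each consecutive frame and stacking the rows, so that time runs along the vertical axis. A minimal sketch, with a hypothetical function name and toy data:

```python
def temporal_profile(frames, row):
    """Stack the given row of every frame: result[t] is that row at time t."""
    return [list(frame[row]) for frame in frames]


# Three toy 2 x 3 frames; row 1 is tracked over time.
frames = [
    [[0, 0, 0], [1, 2, 3]],
    [[0, 0, 0], [1, 2, 4]],
    [[0, 0, 0], [1, 3, 4]],
]
profile = temporal_profile(frames, 1)
```

In such an image, a temporally coherent reconstruction shows smooth vertical structures, while flicker between frames shows up as jagged edges.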

3.2. Computation

  • SRCNN and VSRnet upsample LR images before attempting to super-resolve them, which considerably increases the required number of operations.
  • ESPCN ×4 runs at 29 ms per frame on a K2 GPU.
  • The enhanced capabilities of spatio-temporal networks make it possible to reduce the number of operations of VESPCN relative to ESPCN while still matching its accuracy.
  • VESPCN with 5L-E3 reduces the number of operations by about 20% relative to ESPCN.
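To see why upsampling before super-resolving is expensive, it helps to count multiply-accumulate operations for one convolution layer: the cost grows with the number of pixels the layer is applied to, so running the same layer on a ×3 upsampled image costs 9 times more. The sketch below uses hypothetical layer and frame sizes chosen only for illustration; they are not the configurations of SRCNN, VSRnet or VESPCN.

```python
def conv_macs(height, width, kernel, c_in, c_out):
    """Multiply-accumulates of one stride-1 convolution that keeps the spatial size."""
    return height * width * kernel * kernel * c_in * c_out


# Hypothetical 3x3 layer with 32 input and 32 output channels.
lr_h, lr_w, scale = 120, 160, 3
macs_on_lr = conv_macs(lr_h, lr_w, 3, 32, 32)                   # applied to the LR frame
macs_on_hr = conv_macs(lr_h * scale, lr_w * scale, 3, 32, 32)   # applied after upsampling
ratio = macs_on_hr // macs_on_lr
```

Here `ratio` equals `scale ** 2`, which is the per-layer saving of working in low-resolution space and upscaling only at the end.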

There are also in-depth discussions about early fusion, slow fusion and 3D convolution. Please feel free to read the paper if interested. :)

This is the 8th story this month.
