Reading: STMC / VESPCN — Spatial Temporal Networks and Motion Compensation / Video ESPCN (Video Super Resolution)
In this story, Real-Time Video Super-Resolution with Spatio-Temporal Networks and Motion Compensation (STMC / VESPCN), by Twitter, is briefly presented. Some papers call it STMC, while the authors themselves call it VESPCN in the paper. In this paper:
- Spatio-temporal subpixel convolution networks are introduced that effectively exploit temporal redundancies and improve reconstruction accuracy while maintaining real-time speed.
- A novel joint motion compensation and video super-resolution algorithm is proposed that is orders of magnitude more efficient than competing methods, relying on a fast multi-resolution spatial transformer module that is end-to-end trainable.
This is a paper in 2017 CVPR with over 200 citations. (Sik-Ho Tsang @ Medium)
- VESPCN Overall Network Architecture
- Spatial Transformer Motion Compensation (STMC) Module
- Experimental Results
1. VESPCN Overall Network Architecture
- First, motion estimation is performed for the t-1 and t+1 frames, i.e. I^LR_{t-1} and I^LR_{t+1}, which are then warped towards frame t. This part is done by the Spatial Transformer Motion Compensation (STMC) module.
- After that, we get 2 motion-compensated frames. Together with the current frame I^LR_t, a spatio-temporal ESPCN, i.e. ESPCN taking a few frames as input, super-resolves the frame as I^SR_t.
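The fusion-and-upscaling step above can be sketched in PyTorch. This is my own simplification (layer widths and kernel sizes are assumptions, not the authors' exact configuration): the motion-compensated neighbour frames and the centre frame are early-fused by the first convolution, and upscaling happens only at the end via sub-pixel convolution (PixelShuffle), as in ESPCN.

```python
import torch
import torch.nn as nn

class SpatioTemporalESPCN(nn.Module):
    """Sketch of an early-fusion spatio-temporal ESPCN (assumed layer sizes)."""
    def __init__(self, scale=4, n_frames=3):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(n_frames, 64, 5, padding=2), nn.ReLU(),  # early fusion of the frames
            nn.Conv2d(64, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, scale * scale, 3, padding=1),        # channels for sub-pixel conv
            nn.PixelShuffle(scale),                            # rearrange channels into HR pixels
        )

    def forward(self, frames):  # frames: (B, n_frames, H, W), greyscale
        return self.body(frames)

net = SpatioTemporalESPCN(scale=4, n_frames=3)
lr = torch.randn(1, 3, 32, 32)  # warped I_{t-1}, I_t, warped I_{t+1}
sr = net(lr)
print(sr.shape)  # torch.Size([1, 1, 128, 128])
```

Because the network operates on LR-sized tensors until the final PixelShuffle, it stays cheap enough for real-time use.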
2. Spatial Transformer Motion Compensation (STMC) Module
- Spatial Transformer Networks (STN) were initially shown to facilitate image classification by transforming images onto the same frame of reference.
- Recently, it has been shown how spatial transformers can encode optical flow features.
- First, a coarse flow estimate at ×4 lower resolution is obtained by early-fusing the two input frames and downscaling the spatial dimensions with ×2 strided convolutions.
- The estimated flow is upscaled with sub-pixel convolution, and the result Δc is applied to warp the target frame, producing I'^c_{t+1}.
- The warped image is then processed together with the coarse flow and the original images by a fine flow estimation module.
- This uses a single ×2 strided convolution and a final ×2 upscaling stage to obtain a finer flow map Δf.
- The final motion compensated frame is obtained by warping the target frame with the total flow.
- To train the STMC, a loss combining the MSE between the current frame and the warped frame with a Huber penalty on the flow is used: L = ||I^LR_t − I'^LR_{t+1}||² + λ·H(∂x Δ, ∂y Δ).
- The Huber loss constrains the flow to behave smoothly in space. It is approximated as H(x) = √(ε + Σᵢ xᵢ²).
- The flow map is obtained by the above STMC module.
- With STMC, the residual between the compensated frame and the current frame is much smaller.
- Finally, combining with the super-resolution problem, the joint loss becomes the SR reconstruction error ||I^HR_t − I^SR_t||² plus the motion-compensation losses of both neighbouring frames.
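The warping and loss terms above can be sketched as follows. This is a minimal illustration under assumed shapes and an assumed ε = 0.01, not the authors' implementation: the target frame is warped with a dense flow via bilinear sampling (`grid_sample`), and the loss is the MSE warping error plus the approximated Huber smoothness penalty on the flow gradients.

```python
import torch
import torch.nn.functional as F

def warp(img, flow):
    """Warp img (B,1,H,W) by a dense flow (B,2,H,W) given in pixels."""
    B, _, H, W = img.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    # displace the pixel grid by the flow, then normalise to [-1, 1] for grid_sample
    gx = 2 * (xs + flow[:, 0]) / (W - 1) - 1
    gy = 2 * (ys + flow[:, 1]) / (H - 1) - 1
    grid = torch.stack((gx, gy), dim=-1)          # (B, H, W, 2)
    return F.grid_sample(img, grid, align_corners=True)

def huber_smoothness(flow, eps=0.01):
    """Approximated Huber loss sqrt(eps + sum of squared flow gradients)."""
    dx = flow[..., :, 1:] - flow[..., :, :-1]     # horizontal gradients
    dy = flow[..., 1:, :] - flow[..., :-1, :]     # vertical gradients
    return torch.sqrt(eps + (dx ** 2).sum() + (dy ** 2).sum())

def stmc_loss(frame_t, frame_next, flow, lam=0.01):
    """MSE between current frame and warped neighbour, plus smoothness term."""
    warped = warp(frame_next, flow)
    return F.mse_loss(frame_t, warped) + lam * huber_smoothness(flow)
```

For joint training, the SR loss on the super-resolved output would simply be added to `stmc_loss` evaluated for both neighbouring frames.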
3. Experimental Results
- CDVL database is used for training. Vid4 is used for evaluation.
- A 5 layer 3 frame network (5L-E3) and a 9 layer 3 frame network with motion compensation (9LE3-MC) are evaluated.
- The metrics compared are PSNR, structural similarity (SSIM) and MOVIE indices.
- The MOVIE index was designed as a metric measuring video quality that correlates with human perception and incorporates a notion of temporal consistency.
- VESPCN surpasses the other methods in PSNR and SSIM by a large margin.
- The above figure shows the temporal profiles on the row highlighted by a dashed line through 25 consecutive frames, demonstrating better temporal coherence of the proposed reconstruction.
- The great temporal coherence of VESPCN also explains the significant reduction in the MOVIE index.
- SRCNN and VSRnet upsample LR images before attempting to super-resolve them, which considerably increases the required number of operations.
- ESPCN ×4 runs at 29ms per frame on a K2 GPU.
- The enhanced capabilities of spatio-temporal networks make it possible to reduce the network operations of VESPCN relative to ESPCN while still matching its accuracy.
- For example, VESPCN with 5L-E3 reduces the number of operations by about 20% relative to ESPCN.
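For reference, the PSNR metric used in these comparisons reduces to a one-line formula; a quick numpy sketch (assuming images normalised to [0, 1]):

```python
import numpy as np

def psnr(ref, img, peak=1.0):
    """Peak signal-to-noise ratio in dB between two images in [0, peak]."""
    mse = np.mean((ref - img) ** 2)
    return 10 * np.log10(peak ** 2 / mse)

a = np.zeros((8, 8))
b = np.full((8, 8), 0.1)      # constant error of 0.1 -> MSE = 0.01
print(round(psnr(a, b), 2))   # 20.0
```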
There are also in-depth discussions about early fusion, slow fusion and 3D convolution. Please feel free to read the paper if interested. :)
This is the 8th story in this month.
[2017 CVPR] [STMC / VESPCN]
Real-Time Video Super-Resolution with Spatio-Temporal Networks and Motion Compensation
[SRCNN] [FSRCNN] [VDSR] [ESPCN] [RED-Net] [DnCNN] [DRCN] [DRRN] [LapSRN & MS-LapSRN] [MemNet] [IRCNN] [WDRN / WavResNet] [MWCNN] [SRDenseNet] [SRGAN & SRResNet] [SelNet] [CNF] [EDSR & MDSR] [MDesNet] [RDN] [SRMD & SRMDNF] [DBPN & D-DBPN] [RCAN] [ESRGAN] [SR+STN]