Review: SR+STN — Super-Resolution Based on Geometric Similarity

Using Spatial Transformer Network (STN) for Super-Resolution (SR), Outperforms SRCNN & VDSR

Sep 22, 2019

In this story, SR+STN, by Dalian University of Technology and Dalian University, is reviewed. By finding similar patches within the same image, the image is super-resolved with better quality.

**Similar patches recur within the same scale of a single image**

**HR patches similar to a given LR patch can often be found in a single image**

First, similar patches are found by k-Nearest Neighbor (kNN).
Then, these similar patches are well-aligned by Spatial Transformer Network (STN).
Finally, the high-resolution (HR) image will be predicted gradually according to the complementary information provided by these aligned patches.

This was published in 2019 in JSPIC (Journal on Signal Processing: Image Communication). (Sik-Ho Tsang @ Medium)

Outline

1. Finding Similar Patches Using kNN
2. Network Architecture
3. Experimental Results

1. Finding Similar Patches Using kNN

First, some observations from prior work motivate the approach.
According to the experiment in Ref. [42] in the paper, when a small patch size (e.g., 5 × 5) is used, on average more than 90% of the patches in an image have 9 or more similar patches in the same image at the original image scale.
And more than 80% of the input patches have 9 or more similar patches at 1.25⁻⁴ (≈ 0.41) of the input scale.
Thus, for each patch in the Low Resolution (LR) image, its k nearest patch neighbors are found in the same image, i.e. for the source patch (red) in (a), the similar patch found at the same scale is shown in the blue rectangle.
To find similar patches at larger scales, the input LR image 𝐼0 is downsampled to 𝐼1⋯𝐼3, i.e. (b) to (d). The downsampling ratio between adjacent layers of the image pyramid is set to 0.8.
The most similar patches of the source patch 𝑃 at different scales (green) in the same image are also found.
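The search described above can be sketched in a few lines of plain Python. This is only a minimal illustration, not the authors' implementation: the function names (`downsample`, `build_pyramid`, `knn_patches`), the nearest-neighbour resampling, the brute-force search, and the squared L2 patch distance are all assumptions made for the example. Only the 5 × 5 patch size, the 0.8 ratio, the three extra pyramid levels and k = 9 echo numbers quoted in the text.

```python
def downsample(img, ratio):
    """Nearest-neighbour resize of a 2-D list image (a stand-in for proper resampling)."""
    h, w = len(img), len(img[0])
    nh, nw = max(1, round(h * ratio)), max(1, round(w * ratio))
    return [[img[min(h - 1, int(y / ratio))][min(w - 1, int(x / ratio))]
             for x in range(nw)] for y in range(nh)]


def build_pyramid(img, levels=3, ratio=0.8):
    """Return [I0, I1, ..., I_levels]; each level is `ratio` times the previous one."""
    pyramid = [img]
    for _ in range(levels):
        pyramid.append(downsample(pyramid[-1], ratio))
    return pyramid


def extract_patch(img, top, left, size):
    """Flatten the size x size patch whose top-left corner is (top, left)."""
    return [img[top + dy][left + dx] for dy in range(size) for dx in range(size)]


def knn_patches(pyramid, top, left, size=5, k=9):
    """Brute-force search for the k patches closest to the source patch in I0.

    Returns a list of (squared_distance, level, top, left), nearest first.
    """
    source = extract_patch(pyramid[0], top, left, size)
    candidates = []
    for level, img in enumerate(pyramid):
        h, w = len(img), len(img[0])
        for y in range(h - size + 1):
            for x in range(w - size + 1):
                if level == 0 and (y, x) == (top, left):
                    continue  # skip the source patch itself
                cand = extract_patch(img, y, x, size)
                dist = sum((a - b) ** 2 for a, b in zip(source, cand))
                candidates.append((dist, level, y, x))
    candidates.sort()
    return candidates[:k]


# Toy image with a texture that repeats every 4 pixels, so exact matches exist.
image = [[(x % 4) * 10 + (y % 4) * 3 for x in range(20)] for y in range(20)]
neighbours = knn_patches(build_pyramid(image), top=0, left=0)
```

A real implementation would use a proper resampling filter and an approximate nearest-neighbour index, since the exhaustive search here scales poorly with image size.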

2. Network Architecture

2.1. STN

Since a similar patch may not be well aligned with the source patch, STN is used to align them, as shown above.

With the learned parameters θ, an affine transform conditioned on the input itself is applied to the input. As the transform is differentiable, it allows end-to-end training.
(STN is a well-known learnable module, originally demonstrated on image classification, which applies a learned affine transform to handle rotation, zooming, etc. If interested, please read my review on STN.)
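To make the role of θ concrete, here is a small sketch of the sampling step an STN performs once θ is known: each output pixel is mapped through the 2 × 3 affine matrix to a source location, which is then read with bilinear interpolation. The function name `affine_warp`, the [-1, 1] coordinate normalisation and the zero padding outside the patch are assumptions for illustration. In the real network θ is predicted by a small localisation sub-network rather than supplied by hand.

```python
import math


def affine_warp(patch, theta):
    """Warp a 2-D list `patch` (at least 2 x 2) with a 2x3 affine matrix `theta`.

    Target coordinates are normalised to [-1, 1], mapped to source coordinates
    by theta, and sampled bilinearly. Samples outside the patch count as zero.
    """
    h, w = len(patch), len(patch[0])

    def pixel(y, x):
        return patch[y][x] if 0 <= y < h and 0 <= x < w else 0.0

    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            # Normalised target coordinates in [-1, 1].
            xt = 2.0 * j / (w - 1) - 1.0
            yt = 2.0 * i / (h - 1) - 1.0
            # Affine map to normalised source coordinates.
            xs = theta[0][0] * xt + theta[0][1] * yt + theta[0][2]
            ys = theta[1][0] * xt + theta[1][1] * yt + theta[1][2]
            # Convert back to pixel coordinates.
            px = (xs + 1.0) * (w - 1) / 2.0
            py = (ys + 1.0) * (h - 1) / 2.0
            x0, y0 = math.floor(px), math.floor(py)
            fx, fy = px - x0, py - y0
            # Bilinear interpolation over the four surrounding pixels.
            out[i][j] = ((1 - fx) * (1 - fy) * pixel(y0, x0)
                         + fx * (1 - fy) * pixel(y0, x0 + 1)
                         + (1 - fx) * fy * pixel(y0 + 1, x0)
                         + fx * fy * pixel(y0 + 1, x0 + 1))
    return out


patch = [[float(5 * r + c) for c in range(5)] for r in range(5)]
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
unchanged = affine_warp(patch, identity)   # identity theta leaves the patch as is
```

Because every operation above is a sum or product of the inputs, gradients can flow back to θ, which is what makes end-to-end training possible.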

2.2. Progressive SR via Deconvolution Pyramid

As shown in the figure above, a pyramid of deconvolutional layers is used to improve the spatial resolution of the input image layer by layer.
For example, for 4× magnification, a 4-layer pyramid is used to enlarge the LR image gradually. The pyramid is concatenated to the back end of the network. Therefore, the whole network includes 3 input layers for patch extraction and representation, 4 STN layers for spatial transformation, and 4 deconvolutional layers for enlargement, making the model an 11-layer deep network.
Finally, the loss function is the standard squared Euclidean distance between the super-resolved image and the original high-resolution image.
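Written out, a squared Euclidean loss of this kind typically takes the following form. The notation is an assumption for illustration and may differ from the paper's: F is the network with parameters Θ, X_i is the i-th LR input, Y_i the corresponding HR ground truth, and N the number of training samples. The 1/N averaging is also assumed.

```latex
L(\Theta) = \frac{1}{N} \sum_{i=1}^{N} \left\| F(X_i; \Theta) - Y_i \right\|_2^2
```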

3. Experimental Results

3.1. Dataset

The train-91 and urban-100 datasets are used for training.
For testing, 519 HR images are collected from different databases, namely, 300 facial images selected randomly from the LFW database and 219 other images from standard test image databases: Set5, Set14 and BSD200. These images come from different categories, such as face images, natural images, and indoor and outdoor scenes, to ensure the algorithm is fully tested.
For both training and testing, the proposed method is applied only to the Y channel of the YCbCr color space, whereas the Cb and Cr channels are upscaled using bicubic interpolation.
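The luminance-only pipeline can be sketched as below. The conversion uses the full-range BT.601 (JFIF) coefficients, which is an assumption: SR papers often use the studio-range variant instead, and the review does not say which one applies here. The names `split_ycbcr` and `super_resolve_color` are hypothetical, and the two upscalers are passed in as callables, standing in for the SR network and for bicubic interpolation.

```python
def rgb_to_ycbcr(r, g, b):
    """Full-range BT.601 (JFIF) conversion of one 8-bit RGB pixel."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128.0 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128.0 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return y, cb, cr


def split_ycbcr(rgb_image):
    """Split rows of (r, g, b) tuples into three 2-D planes: Y, Cb, Cr."""
    y_plane, cb_plane, cr_plane = [], [], []
    for row in rgb_image:
        converted = [rgb_to_ycbcr(r, g, b) for r, g, b in row]
        y_plane.append([p[0] for p in converted])
        cb_plane.append([p[1] for p in converted])
        cr_plane.append([p[2] for p in converted])
    return y_plane, cb_plane, cr_plane


def super_resolve_color(rgb_image, upscale_y, upscale_chroma):
    """Upscale Y with one method (e.g. the SR model) and Cb/Cr with another.

    `upscale_y` and `upscale_chroma` are caller-supplied functions that take
    a 2-D plane and return an enlarged 2-D plane.
    """
    y, cb, cr = split_ycbcr(rgb_image)
    return upscale_y(y), upscale_chroma(cb), upscale_chroma(cr)
```

Treating chroma with a cheap interpolator is a common choice because the eye is less sensitive to chroma detail than to luminance detail.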

3.2. PSNR and SSIM

The PSNR and SSIM comparisons are shown above, where the proposed approach outperforms SCN (Sparse Coding-based Network), SRCNN, and VDSR.
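For reference, PSNR is computed from the mean squared error between the ground-truth and reconstructed images. The sketch below assumes 8-bit pixel values (peak of 255) stored as 2-D lists. The function name `psnr` is chosen for illustration, and the paper's exact evaluation protocol (for example, any border cropping) is not covered here.

```python
import math


def psnr(reference, estimate, peak=255.0):
    """PSNR in dB between two equally sized 2-D lists of pixel values."""
    squared_errors = [(a - b) ** 2
                      for ref_row, est_row in zip(reference, estimate)
                      for a, b in zip(ref_row, est_row)]
    mse = sum(squared_errors) / len(squared_errors)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(peak ** 2 / mse)
```

Higher is better: halving the MSE raises PSNR by about 3 dB.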

3.3. Computational Time

Although it is only slightly better than VDSR in reconstruction performance, the proposed method is relatively fast.
Since the STNs are used in parallel, whereas VDSR stacks all layers in series, the parallel structure of the proposed model can be computed quickly on a GPU during feed-forward propagation.

3.4. Visualization

Repetitive similar regions in an image are used to supply the high-frequency detail required by the reconstructed patch and obtain good image quality.