Reading: ZSSR — “Zero-Shot” Super-Resolution Using Deep Internal Learning (Super Resolution)

Outperforms EDSR+, SRGAN, VDSR & SRCNN

Sik-Ho Tsang
6 min read · Jul 20, 2020

In this story, “Zero-Shot” Super-Resolution Using Deep Internal Learning (ZSSR), by the Weizmann Institute of Science and the Institute for Advanced Study, is presented. In this paper:

  • “Zero-Shot” SR exploits the power of Deep Learning, but does not rely on prior training.
  • A small image-specific CNN is trained at test time, on examples extracted solely from the input image itself.

This is a paper in 2018 CVPR with over 100 citations. (Sik-Ho Tsang @ Medium)

Outline

  1. Principle and Contributions of ZSSR
  2. ZSSR: Overall Scheme
  3. ZSSR: Network Architecture
  4. Experimental Results
  5. Visual Comparison for Different Types of Images

1. Principle and Contributions of ZSSR

1.1. Principle

Internal predictive power of image-specific information.
  • Fundamental to the approach is the fact that natural images have strong internal data repetition. For example, small image patches (e.g., 5×5, 7×7) were shown to repeat many times inside a single image, both within the same scale and across different image scales.
  • This observation was empirically verified by [5, 24].
  • The above figure shows an example of a simple single-image SR based on internal patch recurrence.

In fact, the only evidence to the existence of these tiny handrails exists internally, inside this image, at a different location and different scale. It cannot be found in any external database of examples, no matter how large this dataset is!

  • Thus, SOTA SR methods fail to recover this image-specific information when relying on external training data.
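The cross-scale recurrence idea can be checked with a toy experiment: for each small patch of an image, search for its nearest patch inside a downscaled copy of the same image. The sketch below is purely illustrative (the function name and parameters are mine, not from the paper's code):

```python
import numpy as np
from scipy.ndimage import zoom

def mean_cross_scale_patch_distance(img, patch=5, scale=0.5):
    """For each patch of a grayscale image, find the L2 distance to its
    nearest patch in a downscaled copy of the SAME image. Small average
    distances indicate strong cross-scale recurrence.
    (Brute force and slow; for illustration only.)"""
    small = zoom(img, scale)
    # gather all candidate patches from the downscaled image
    cands = np.array([small[i:i + patch, j:j + patch]
                      for i in range(small.shape[0] - patch + 1)
                      for j in range(small.shape[1] - patch + 1)])
    dists = []
    for i in range(0, img.shape[0] - patch + 1, patch):
        for j in range(0, img.shape[1] - patch + 1, patch):
            p = img[i:i + patch, j:j + patch]
            dists.append(np.min(np.sum((cands - p) ** 2, axis=(1, 2))))
    return float(np.mean(dists))
```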

1.2. Contributions

  • In this paper, the contributions are therefore several-fold:
  1. This is the first unsupervised CNN-based SR method.
  2. It can handle non-ideal imaging conditions, and a wide variety of images and data types (even if encountered for the first time).
  3. It does not require pretraining and can be run with modest amounts of computational resources.
  4. It can be applied to SR of any size and, theoretically, of any aspect ratio.
  5. It can be adapted to known as well as unknown imaging conditions (at test time).
  6. It provides SOTA SR results on images taken in ‘non-ideal’ conditions, and competitive results on the ‘ideal’ conditions on which SOTA supervised methods were trained.

2. ZSSR: Overall Scheme

Left: Standard Supervised CNN, Right: ZSSR Overall Scheme

2.1. Differences Between Supervised CNN & ZSSR

  • (a) Left, Standard Supervised CNN: such networks train on a large and diverse external collection of LR-HR image examples, and must capture in their learned weights the large diversity of all possible LR-HR relations.
  • These networks tend to be extremely deep and very complex.
  • (b) Right, ZSSR: In contrast, the diversity of the LR-HR relations within a single image is significantly smaller, hence can be encoded by a much smaller and simpler image-specific network.

2.2. Steps

  • The examples are obtained by downscaling the LR test image I to generate a lower-resolution version of itself, I↓s (where s is the desired SR scale factor).
  • A relatively light CNN is trained to reconstruct the test image I from its lower-resolution version I↓s (top part of (b)).
  • Then, the resulting trained CNN is applied to the test image I, now using I as the LR input to the network, in order to construct the desired HR output I↑s (bottom of (b)). A minimal sketch of this loop follows.
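Here is a minimal sketch of that test-time loop, assuming a single father–son pair rather than the paper's many random crops, and a hypothetical `net` such as the one sketched in Section 3; it is not the authors' code (the L1 loss and Adam optimizer do follow the paper):

```python
import torch
import torch.nn.functional as F

def zssr_train_and_apply(I, s, net, steps=1000):
    """I: LR test image as a (1, C, H, W) float tensor; s: SR scale factor.
    Train `net` to map (I downscaled by s) back to I, then apply it to I."""
    opt = torch.optim.Adam(net.parameters(), lr=1e-3)
    # LR "son": downscale the test image by the SR factor s
    I_son = F.interpolate(I, scale_factor=1 / s, mode='bicubic',
                          align_corners=False)
    # the net operates on the bicubically upsampled input (residual learning)
    net_in = F.interpolate(I_son, size=I.shape[-2:], mode='bicubic',
                           align_corners=False)
    for _ in range(steps):
        opt.zero_grad()
        loss = F.l1_loss(net(net_in), I)  # reconstruct the HR "father" I
        loss.backward()
        opt.step()
    # now use I itself as the LR input to predict the HR output
    big = F.interpolate(I, scale_factor=s, mode='bicubic',
                        align_corners=False)
    with torch.no_grad():
        return net(big)
```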

2.3. Data Augmentation

  • The augmentation is done by downscaling the test image I to many smaller versions of itself. These play the role of the HR supervision and are called “HR fathers”. Each of the HR fathers is then downscaled by the desired SR scale-factor s to obtain the “LR sons”, which form the input training instances. The resulting training set consists of many image-specific LR-HR example pairs.
  • The training set is further enriched by transforming each LR-HR pair using 4 rotations (0°, 90°, 180°, 270°) and their mirror reflections in the vertical and horizontal directions. This adds ×8 more image-specific training examples (sketched below).
  • For large SR scale factors, s is reached gradually through intermediate scales si. At each intermediate scale, the generated SR image HRi and its downscaled/rotated versions are added to the gradually growing training set as new HR fathers.
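The 8 geometric variants can be generated as below (a minimal sketch; `img` is assumed to be an H×W×C array):

```python
import numpy as np

def eight_variants(img):
    """Return the 8 rotation/flip variants of an image: 4 rotations
    (0, 90, 180, 270 degrees) and the mirror reflection of each."""
    out = []
    for k in range(4):
        rot = np.rot90(img, k)      # rotate by k * 90 degrees
        out.append(rot)
        out.append(np.fliplr(rot))  # mirror reflection
    return out
```

The same 8 transforms reappear at test time in the geometric self-ensemble described in Section 3.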

3. ZSSR: Network Architecture

  • A simple, fully convolutional network with 8 hidden layers, each with 64 channels. ReLU is used. Residual learning is used between the interpolated LR input and its HR parent, so the architecture is just like VDSR's (see the sketch after this list).
  • Geometric self-ensemble is used, which generates 8 different outputs for the 8 rotations+flips of the test image I, and then combines them.
  • Although training is done at test time, the average runtime for SR×2 is only 9 sec on a Tesla V100 GPU or 54 sec on a K-80 (average taken over the BSD100 dataset).
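A minimal sketch of such an image-specific network (the depth, width, and residual connection follow the paper; everything else, including the class name, is illustrative):

```python
import torch.nn as nn

class ZSSRNet(nn.Module):
    """Fully convolutional: 8 hidden layers of 64 channels with ReLU,
    plus a residual connection from the interpolated LR input (as in VDSR)."""
    def __init__(self, channels=3, width=64, depth=8):
        super().__init__()
        layers = [nn.Conv2d(channels, width, 3, padding=1),
                  nn.ReLU(inplace=True)]
        for _ in range(depth - 1):
            layers += [nn.Conv2d(width, width, 3, padding=1),
                       nn.ReLU(inplace=True)]
        layers.append(nn.Conv2d(width, channels, 3, padding=1))
        self.body = nn.Sequential(*layers)

    def forward(self, x):
        # predict only the residual between the interpolated LR input
        # and its HR target
        return x + self.body(x)
```

The training loop sketched in Section 2 would then be invoked with `net = ZSSRNet()`.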

4. Experimental Results

4.1. The ‘Ideal’ Case: Bicubic Downsampling

Comparison of SR results for the ’ideal’ case (bicubic downscaling).
  • ZSSR is significantly better than the older SRCNN, and in some cases achieves comparable or better results than VDSR (which had been the SOTA until a year before the paper).
  • SRGAN is also trained for the ideal case; it tends to hallucinate visually pleasing information and hence scores numerically worse than ZSSR.
  • Within the unsupervised-SR regime, ZSSR outperforms the leading method SelfExSR by a large margin.
  • Moreover, in images with very strong internal repetitive structures, such as above, ZSSR tends to surpass VDSR, and sometimes also EDSR+, even though these LR images were generated using the ‘ideal’ supervised setting.

4.2. The ‘Non-Ideal’ Case

SR in the presence of unknown downscaling kernels.
  • Each LR image was subsampled by a different random kernel.
  • Two cases are considered for applying ZSSR:
  1. The more realistic scenario of an unknown downscaling kernel. For this mode, the method of [15] is used to estimate the kernel directly from the test image and feed it to ZSSR.
  2. ZSSR is applied with the true downscaling kernel used to create the LR image. This scenario is potentially useful for images obtained by sensors with known specs.
  • ZSSR outperforms SOTA methods by a large margin: +1 dB for unknown (estimated) kernels, and +2 dB when provided the true kernels.

This shows that an accurate downscaling model is more important than sophisticated image priors, and that using the wrong downscaling kernel leads to oversmoothed SR results.

4.3. Poor-Quality LR Images

  • For each image in BSD100, one of 3 degradations is chosen at random: (i) Gaussian noise (0.05), (ii) speckle noise (0.05), (iii) JPEG compression [quality = 45 (by MATLAB standard)]. A rough sketch of these degradations follows this list.
  • ZSSR is robust to unknown degradation types, while such degradations typically damage supervised SR methods to the point where bicubic interpolation outperforms current SOTA SR methods!
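A rough sketch of the three degradations, interpreting 0.05 as the noise variance per MATLAB's imnoise convention (an assumption on my part, as is the Gaussian approximation of speckle noise):

```python
import io
import numpy as np
from PIL import Image

def degrade(img, kind, rng):
    """img: float array in [0, 1]; rng: np.random.Generator.
    Apply one of the three degradations from the experiment."""
    if kind == 'gaussian':      # additive Gaussian noise, variance 0.05
        out = img + rng.normal(0.0, np.sqrt(0.05), img.shape)
    elif kind == 'speckle':     # multiplicative noise, variance 0.05
        out = img * (1.0 + rng.normal(0.0, np.sqrt(0.05), img.shape))
    else:                       # JPEG compression at quality 45
        buf = io.BytesIO()
        Image.fromarray((img * 255).astype(np.uint8)).save(
            buf, 'JPEG', quality=45)
        out = np.asarray(Image.open(buf)).astype(np.float64) / 255.0
    return np.clip(out, 0.0, 1.0)
```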

5. Visual Comparison for Different Types of Images

  • The above examples are not contrived cases, but rather occur often when dealing with real LR images — images downloaded from the internet, images taken by an iPhone, old historic images, etc. In those ‘non-ideal’ cases, SOTA SR methods often produce poor results.
  • When the LR image is generated with a non-ideal (non-bicubic) downscaling kernel, or contains aliasing effects, or simply contains sensor noise or compression artifacts, EDSR+ performs poorly while ZSSR does well.
  • In these non-ideal cases, SRGAN also produces very poor visual quality.

This is the 17th story this month.
