Review: LapSRN & MS-LapSRN — Laplacian Pyramid Super-Resolution Network (Super Resolution)

Progressively Reconstructs Residuals, Charbonnier Loss, Parameter Sharing, Local Residual Learning, Outperforms SRCNN, VDSR, DRCN, DRRN

Sik-Ho Tsang
Towards Data Science


32×, 16×, 8×, 4× and 2× SR

In this story, LapSRN (Laplacian Pyramid Super-Resolution Network) and MS-LapSRN (Multi-Scale Laplacian Pyramid Super-Resolution Network) are reviewed. By progressively reconstructing the sub-band residuals and using the Charbonnier loss function, LapSRN outperforms SRCNN, FSRCNN, VDSR, and DRCN. With parameter sharing, local residual learning (LRL), and multi-scale training, MS-LapSRN even outperforms DRRN. LapSRN was published in 2017 CVPR with more than 200 citations, and MS-LapSRN in 2018 TPAMI with tens of citations. (Sik-Ho Tsang @ Medium)

Since MS-LapSRN is an extension of LapSRN, I will present only the material from the MS-LapSRN paper, which covers the approaches and results of both LapSRN and MS-LapSRN.

Outline

  1. Problems in Previous Approaches
  2. LapSRN: Architecture
  3. LapSRN: Charbonnier Loss Function
  4. MS-LapSRN: Parameter Sharing
  5. MS-LapSRN: Local Residual Learning (LRL)
  6. MS-LapSRN: Multi-Scale Training
  7. Ablation Study
  8. Comparison with State-of-the-art Results

1. Problems in Previous Approaches

  • There are three issues with the previous approaches:

1.1. Bicubic Interpolation

Bicubic Interpolation
  • Bicubic interpolation is used to upscale an input LR image before it is fed into the network. However, this pre-upsampling step increases unnecessary computational cost and does not provide additional high-frequency information for reconstructing HR images.

1.2. L2 Loss

  • Existing methods optimize the networks with an L2 loss (i.e., mean squared error loss).
  • Since the same LR patch may have multiple corresponding HR patches and the L2 loss fails to capture the underlying multi-modal distributions of HR patches, the reconstructed HR images are often over-smoothed and inconsistent with human visual perception of natural images.

1.3. One-Step Upsampling

One-Step Upsampling
  • This one-step upsampling does not super-resolve the fine structures well, which makes learning mapping functions for large scaling factors (e.g., 8×) more difficult.

2. LapSRN: Architecture

LapSRN / MS-LapSRN Architecture
  • In contrast to one-step upsampling, the network progressively reconstructs the sub-band residuals of high-resolution images at multiple pyramid levels, specifically at log2(S) levels where S is the scale factor (i.e., 2, 4, 8).
  • Without bicubic pre-upsampling, it directly extracts features from the low-resolution input image, thereby keeping the computational load low.
  • (Laplacian pyramids have been used for decades. The network is called a Laplacian pyramid because the feature extraction branch outputs one residual image at each level. If interested, see the Wikipedia article on Pyramid (image processing), especially the Gaussian and Laplacian pyramids.)
  • There are two branches: Feature extraction & Image Reconstruction.

2.1. Feature extraction

  • At level s, there are d convolutional layers and one transposed convolutional layer (or deconvolutional layer) to upsample the extracted features by a scale of 2.
  • The output of each transposed convolutional layer is connected to two different layers: (1) a convolutional layer for reconstructing a residual image at level s, and (2) a convolutional layer for extracting features at the finer level s+1.
  • The feature representations at lower levels are shared with higher levels, and thus can increase the non-linearity of the network to learn complex mappings at the finer levels.

2.2. Image reconstruction

  • At level s, the input image is upsampled by a scale of 2 with a transposed convolutional (upsampling) layer. This layer is initialized with the bilinear kernel.
  • The upsampled image is then combined (using element-wise summation) with the predicted residual image from the feature extraction branch to produce a high-resolution output image.
  • The output HR image at level s is then fed into the image reconstruction branch of level s+1. (A code sketch of one level is given after this list.)
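The two-branch design can be illustrated with a minimal PyTorch sketch of a single pyramid level below. This is only a sketch under my own assumptions (channel width, a single-channel luminance input, module names); the authors' official implementation is in MatConvNet.

```python
import torch.nn as nn

class LapSRNLevel(nn.Module):
    """One pyramid level: the feature extraction branch predicts a sub-band
    residual, while the image reconstruction branch upsamples the image by 2x
    and adds the residual. Layer counts and sizes are illustrative assumptions."""
    def __init__(self, channels=64, d=10):
        super().__init__()
        layers = []
        for _ in range(d):  # d convolutional layers at this level
            layers += [nn.Conv2d(channels, channels, 3, padding=1),
                       nn.LeakyReLU(0.2, inplace=True)]
        self.feature = nn.Sequential(*layers)
        # transposed conv that upsamples the extracted features by 2x
        self.up_feat = nn.ConvTranspose2d(channels, channels, 4, stride=2, padding=1)
        self.to_residual = nn.Conv2d(channels, 1, 3, padding=1)  # residual image at this level
        # image reconstruction branch: 2x upsampling (bilinear-initialized in the paper)
        self.up_img = nn.ConvTranspose2d(1, 1, 4, stride=2, padding=1)

    def forward(self, feat, img):
        feat = self.up_feat(self.feature(feat))  # features passed on to the finer level s+1
        residual = self.to_residual(feat)        # predicted sub-band residual
        img = self.up_img(img) + residual        # element-wise sum -> HR output at level s
        return feat, img
```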

3. LapSRN: Charbonnier Loss Function

L(ŷ, y; θ) = (1/N) Σ_{i=1..N} Σ_{s=1..L} ρ(ŷ_s^(i) − y_s^(i)), with ρ(x) = √(x² + ε²)

  • Instead of the standard MSE loss function, LapSRN uses the loss above. ρ is the Charbonnier penalty function (a differentiable variant of the L1 norm), which is robust to outliers.
  • N is the number of samples in a batch, L is the number of pyramid levels, ŷ_s^(i) and y_s^(i) are the predicted and ground-truth HR images at level s, and ε is set to 1e-3.
  • This deep supervision guides the network training to predict sub-band residual images at different levels and produce multi-scale output images.
  • The 8× model can produce 2×, 4× and 8× super-resolution results in one feed-forward pass.
  • The above Laplacian pyramid network architecture and Charbonnier loss function are used in both LapSRN and MS-LapSRN. (A code sketch of the loss is given after this list.)
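For concreteness, here is a minimal sketch of the Charbonnier penalty as it might be written in PyTorch; the mean reduction over the batch is my assumption, and the official code is in MatConvNet.

```python
import torch

def charbonnier(pred, target, eps=1e-3):
    """Charbonnier penalty rho(x) = sqrt(x^2 + eps^2), a differentiable variant
    of the L1 norm that is more robust to outliers than the L2 loss."""
    diff = pred - target
    return torch.sqrt(diff * diff + eps * eps).mean()
```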

4. MS-LapSRN: Parameter Sharing

4.1. Parameter Sharing ACROSS Pyramid Levels

Parameter Sharing ACROSS Pyramid Levels
  • As we can see in the figure, the parameters of the feature embedding sub-network, upsampling layers, and the residual prediction layers are shared across all the pyramid levels.
  • As a result, the number of network parameters is independent of the upsampling scale. (See the sketch after this list.)
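A rough PyTorch sketch of sharing across levels, reusing the LapSRNLevel module sketched in Section 2 so that the same weights serve every pyramid level (module and parameter names are hypothetical, not the authors' code):

```python
import math
import torch.nn as nn

class MSLapSRN(nn.Module):
    """Sketch: ONE level module (feature embedding, upsampling layers, and
    residual prediction) is reused at every pyramid level, so the parameter
    count does not grow with the upsampling scale."""
    def __init__(self, channels=64, d=10):
        super().__init__()
        self.first_conv = nn.Conv2d(1, channels, 3, padding=1)  # conv applied to the LR input
        self.level = LapSRNLevel(channels, d)  # shared across levels (see the Section 2 sketch)

    def forward(self, lr, scale=4):
        feat, img, outputs = self.first_conv(lr), lr, []
        for _ in range(int(math.log2(scale))):  # log2(S) levels, all sharing weights
            feat, img = self.level(feat, img)
            outputs.append(img)                 # 2x, 4x, ..., Sx predictions
        return outputs
```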

4.2. Parameter Sharing WITHIN Pyramid Levels

Parameter Sharing WITHIN Pyramid Levels Using Recursive Blocks for the Feature Embedding Subnetwork
  • The feature embedding subnetwork has R recursive blocks. Each recursive block has D distinct convolutional layers, which controls the number of parameters in the entire model.
  • A pre-activation structure (as in Pre-Activation ResNet), but without the batch normalization layer, is used in the recursive block.
  • Thus, the total depth becomes:
depth = (D × R + 1) × L + 2;
L = log2(S)
  • S is the upsample scaling factor.
  • The 1 inside the bracket represents the transposed convolutional layer at each level.
  • The 2 at the end represents the first convolutional layer applied on the input images and the last convolutional layer applied on the residuals.
  • Therefore, the feature embedding sub-network is extended using deeply recursive layers to effectively increase the network depth without increasing the number of parameters. (A sketch follows this list.)
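As a worked example of the depth formula: D = 5, R = 8 and S = 4 (so L = 2) give depth = (5 × 8 + 1) × 2 + 2 = 84, matching the 84-layer LapSRN_SS-D5R8 mentioned in Section 8. Below is a rough sketch of such a recursive feature embedding sub-network (pre-activation, no batch normalization); it reflects my reading of the description rather than the released code, and the LRL skip connections of Section 5 are omitted here.

```python
import torch.nn as nn

class RecursiveEmbedding(nn.Module):
    """Sketch: D distinct pre-activation conv layers (LeakyReLU before conv,
    no batch norm), applied recursively R times with shared weights."""
    def __init__(self, channels=64, D=5, R=8):
        super().__init__()
        self.R = R
        self.block = nn.Sequential(*[
            nn.Sequential(nn.LeakyReLU(0.2, inplace=True),
                          nn.Conv2d(channels, channels, 3, padding=1))
            for _ in range(D)])  # the D distinct convolutional layers

    def forward(self, x):
        for _ in range(self.R):  # the same block is reused R times
            x = self.block(x)
        return x
```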

5. MS-LapSRN: Local Residual Learning (LRL)

Different Kinds of Local Residual Learning (LRL)
  • Since LRL is an effective component, MS-LapSRN also tests different variants of it, as shown above.
  • (a) LapSRN_NS: no skip connection.
  • (b) LapSRN_DS: uses skip connections whose source is the output of the previous recursive block, i.e., distinct-source skip connections.
  • (c) LapSRN_SS: uses skip connections whose source is the output at the very beginning of the level, i.e., shared-source skip connections. (Both variants are sketched after this list.)
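The two skip-connection variants can be sketched as plain functions around any recursive embedding block (such as the one sketched in Section 4); naming and structure here are my own illustration.

```python
def embed_distinct_source(block, x, R):
    """(b) LapSRN_DS: each recursion adds back its own input, i.e. the output
    of the previous recursion (distinct-source skip connection)."""
    for _ in range(R):
        x = x + block(x)
    return x


def embed_shared_source(block, x, R):
    """(c) LapSRN_SS: every recursion adds back the SAME source, namely the
    features entering the level (shared-source skip connection)."""
    source = x
    for _ in range(R):
        x = source + block(x)
    return x
```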

6. MS-LapSRN: Multi-Scale Training

  • Multi-scale Charbonnier loss is used.
  • For example, with a 3-level LapSRN, the Charbonnier losses from all three scales are summed to form the total loss. (A small sketch follows this list.)
  • Note that the scale augmentation is limited to 2^n× SR; arbitrary upsampling rates are not supported.
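Assuming the charbonnier helper sketched in Section 3, the total loss for a 3-level (8×) model might simply sum the per-level losses, roughly as follows (a sketch, not the authors' code):

```python
def total_loss(outputs, targets, eps=1e-3):
    """outputs/targets: matching lists of 2x, 4x and 8x predictions and ground
    truths, one pair per pyramid level; the Charbonnier losses are summed."""
    return sum(charbonnier(o, t, eps) for o, t in zip(outputs, targets))
```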

7. Ablation Study

7.1. Some Details

  • There are 64 filters in all convolutional layers except the first layer applied to the input image, the layers for predicting the residuals, and the image upsampling layer.
  • The filter sizes of the convolutional and transposed convolutional layers are 3×3 and 4×4, respectively.
  • Leaky ReLUs with a slope of 0.2 are used.
  • Training set: 291 images, with 91 images from Yang et al. and 200 images from the Berkeley Segmentation Dataset.
  • The batch size is 64 and HR patches are cropped to 128×128.
  • An epoch has 1000 iterations.
  • Data augmentation: (1) random downscaling of images by a factor in [0.5, 1.0]; (2) random rotation of 90, 180 or 270 degrees; (3) random horizontal flipping with probability 0.5.
  • LR training patches are generated by bicubic downsampling. (A rough sketch of this preparation is given after this list.)
  • MatConvNet toolbox is used.
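A possible sketch of the training-pair preparation along the lines of the recipe above; the ordering, the cropping details, and the use of torch.nn.functional.interpolate for bicubic resampling are my assumptions, not the MatConvNet code.

```python
import random
import torch
import torch.nn.functional as F

def make_training_pair(hr, scale=4, patch=128):
    """hr: HR image tensor of shape (C, H, W). Returns a bicubic-downsampled LR
    patch and the corresponding 128x128 HR patch, with the augmentations listed
    above (random downscale, rotation, horizontal flip)."""
    # (1) randomly downscale the HR image by a factor in [0.5, 1.0]
    s = random.uniform(0.5, 1.0)
    h, w = hr.shape[-2:]
    new_h, new_w = max(patch, int(h * s)), max(patch, int(w * s))
    hr = F.interpolate(hr.unsqueeze(0), size=(new_h, new_w),
                       mode='bicubic', align_corners=False).squeeze(0)
    # crop a random 128x128 HR patch
    top = random.randint(0, new_h - patch)
    left = random.randint(0, new_w - patch)
    hr = hr[..., top:top + patch, left:left + patch]
    # (2) random rotation of 90, 180 or 270 degrees, (3) random horizontal flip
    hr = torch.rot90(hr, k=random.choice([1, 2, 3]), dims=(-2, -1))
    if random.random() < 0.5:
        hr = torch.flip(hr, dims=[-1])
    # LR training patch generated by bicubic downsampling of the HR patch
    lr = F.interpolate(hr.unsqueeze(0), scale_factor=1.0 / scale,
                       mode='bicubic', align_corners=False).squeeze(0)
    return lr, hr
```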

7.2. Pyramid Structure

LapSRN with Different Components (here, "residual learning" refers to GRL, i.e., the image reconstruction branch, not LRL)
  • Here, the network used 10 convolutional layers, and PSNR on Set14 for 4× SR is measured.
  • The pyramid structure leads to a considerable performance improvement, with gains of 0.7 dB on Set5 and 0.4 dB on Set14 over the variant without the pyramid structure (i.e., a network similar to FSRCNN; the brown curve).

7.3. Global Residual Learning (GRL)

  • The image reconstruction branch is removed and the HR images are predicted directly at each level (blue curve).
  • The blue curve converges slowly and fluctuates significantly during training.
  • On the other hand, the full LapSRN (red curve) outperforms SRCNN within 10 epochs.

7.4. Charbonnier Loss Function

  • LapSRN using the regular L2 loss function (green curve) is inferior to the one using the Charbonnier loss function (red curve).

7.5. Parameter Sharing ACROSS Pyramid Levels

Parameter Sharing Across Pyramid Levels for 4× SR model
  • LapSRN 4× model has 812k parameters.
  • By sharing parameters across pyramid levels, the number of parameters is reduced to 407k.
  • Without LRL, this model has D = 10 convolutional layers and R = 1 recursive block, and is called LapSRN_NS-D10R1.
  • LapSRN_NS-D10R1 achieves comparable performance with LapSRN while using half of the network parameters.

7.6. Parameter Sharing WITHIN Pyramid Levels

  • LapSRN_NS-D5R2 and LapSRN_NS-D2R5 share parameters within the pyramid levels as well, and have 222k and 112k parameters respectively. However, the performance drops.
  • This is because there is no LRL used within the pyramid levels.

7.7. Local Residual Learning (LRL)

Different Kinds of Local Residual Learning (LRL) Using LapSRN-D5R5 on Set5 for 4× SR
Different Kinds of Local Residual Learning (LRL) Using Different Models on URBAN100 for 4× SR
  • As shown in the figure and table, LapSRN_SS, which uses the shared-source skip connection, has the highest PSNR.

7.8. Study of D and R

Study of D and R
  • Different values of D and R are tested.
  • D2R5, D5R2 and D10R1 perform comparably.
  • D4R8 achieves the best reconstruction accuracy, with a network depth of more than 80.

7.9. Multi-Scale Training

  • MS-LapSRN supports multi-scale training with training sample scale combinations of {2×}, {4×}, {8×}, {2×, 4×}, {2×, 8×}, {4×, 8×} and {2×, 4×, 8×}.
  • MS-LapSRN trained with multi-scale of {2×, 4×, 8×} yields the best results.

8. Comparison with State-of-the-art Results

8.1. PSNR, SSIM, IFC

Quantitative Results
  • Five datasets are tested: Set5, Set14, BSDS100, URBAN100, and MANGA109.
  • LapSRN_SS-D5R2 has a depth similar to VDSR, DRCN and LapSRN.
  • LapSRN_SS-D5R5 has the same depth as DRRN.
  • LapSRN_SS-D5R8 has 84 layers for 4× SR.
  • MS means trained with multi-scale training using scales of {2×, 4×, 8×}.
  • LapSRN performs favorably especially on 4× and 8× SR.
  • LapSRN does not use any 3× SR samples for training but still generates results comparable to DRRN.

8.2. Qualitative Results

4× SR on BSDS100, URBAN100 and MANGA109
8× SR on BSDS100, URBAN100 and MANGA109
  • As shown above, MS-LapSRN accurately reconstructs parallel straight lines, grid patterns, and texts.
  • For 8× SR, prior methods that use bicubic pre-upsampling or one-step upsampling do not super-resolve the fine structures well.

8.3. Execution Time

4× SR on URBAN100
  • A 3.4 GHz Intel i7 CPU (32 GB RAM) with an NVidia Titan Xp GPU (12 GB memory) is used.
  • SRCNN and FSRCNN originally run on CPU; the authors rebuilt them to run on GPU.
  • For 4× SR on URBAN100, MS-LapSRN-D5R2 is faster than all the existing methods except the FSRCNN.
  • MS-LapSRN-D5R8 outperforms DRRN.
2×, 4× and 8× SR
  • The complexity of SRCNN and VDSR depends on the output image size, so for 4× and 8× SR their running time increases relatively more than that of MS-LapSRN.
  • Though FSRCNN is the fastest, MS-LapSRN has much higher SR quality.

8.4. Model Parameters

4× SR on URBAN100
  • MS-LapSRN has about 73% fewer parameters than LapSRN, 66% fewer than VDSR, 87% fewer than DRCN, and 25% fewer than DRRN.

8.5. Real-World Photos (JPEG Compression Artifacts)

4× SR
  • MS-LapSRN reconstructs sharper and more accurate images.

8.6. Comparison with LAPGAN

LAPGAN Architecture
Comparison with LAPGAN
  • The authors also compare with LAPGAN.
  • The original purpose of LAPGAN is not super-resolution; it is a generative model for image synthesis. Here, the authors train LAPGAN using the same training data and settings as LapSRN.
  • Finally, LapSRN performs better.

8.7. Adversarial Training

4× SR
  • LapSRN is extended as a generative network, and a discriminative network is built using the discriminator of DCGAN.
  • The network with the adversarial training generates more plausible details on regions of irregular structures, e.g., grass, and feathers.

8.8. Limitations

8× SR Failure Case
  • While LapSRN generates clean and sharp HR images for large upsampling scales, e.g., 8×, it does not “hallucinate” fine details.

The progressive upsampling makes me think of the gradual deconvolution or upsampling approaches for object detection (e.g. DSSD, FPN, RetinaNet) or semantic segmentation (e.g. DeconvNet).
