Review: SRGAN & SRResNet — Photo-Realistic Super Resolution (GAN & Super Resolution)

Using a Generative Adversarial Network (GAN), a More Photo-Realistic Super-Resolved Image Can Be Obtained, Albeit with Lower PSNR

Sik-Ho Tsang
6 min read · Apr 22, 2020
Figure: super-resolved image (left) is almost indistinguishable from the original (right); 4× upscaling.

In this paper, SRGAN, a generative adversarial network (GAN) for image super-resolution (SR) by Twitter, is reviewed. The same network without the GAN component is SRResNet. Super-resolved images can obtain high peak signal-to-noise ratios (PSNRs), yet they often lack high-frequency details and are perceptually unsatisfying. One of the reasons is the use of MSE as the loss function.

A deep residual generative adversarial network (SRGAN) is optimized for a loss more sensitive to human perception:
  • SRGAN is the first framework capable of inferring photo-realistic natural images for 4× upscaling factors.
  • A perceptual loss function consisting of an adversarial loss and a content loss is proposed for SR. The content loss uses the high-level feature maps of the VGG network, which are more invariant to changes in pixel space.
  • SRResNet also obtained new state-of-the-art results in terms of PSNR and SSIM at that moment.
  • An extensive mean opinion score (MOS) test on images from three public benchmark datasets also shows that SRGAN was the new state-of-the-art approach by a large margin at that moment.

This is a paper in 2017 CVPR with over 3300 citations. (Sik-Ho Tsang @ Medium)

Outline

  1. SR Formulation
  2. Adversarial Network Architecture
  3. Perceptual Loss Function
  4. Experimental Results

1. SR Formulation

  • In Single Image Super-Resolution (SISR), the aim is to estimate a high-resolution, super-resolved image I^SR from a low-resolution input image I^LR.
  • Here I^LR is the low-resolution version of its high-resolution counterpart I^HR. The high-resolution images are only available during training.
  • In training, I^LR is obtained by applying a Gaussian filter to I^HR followed by a downsampling operation with downsampling factor r (see the sketch after this list).
  • For an image with C color channels, I^LR is described by a real-valued tensor of size W×H×C, and I^HR, I^SR by rW×rH×C respectively.
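
To make the setup concrete, here is a minimal sketch of how such a training pair could be produced, assuming NumPy/SciPy inputs; the blur strength sigma is a hypothetical choice, since the paper only specifies a Gaussian filter and r = 4:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def make_lr(hr: np.ndarray, r: int = 4, sigma: float = 1.0) -> np.ndarray:
    """Create a low-resolution input I^LR from a high-resolution image I^HR.

    hr:    rW x rH x C float array in [0, 1]
    r:     downsampling factor (the paper uses r = 4)
    sigma: Gaussian blur strength (hypothetical; not fixed in the paper)
    """
    # Blur only the spatial axes, not the channel axis.
    blurred = gaussian_filter(hr, sigma=(sigma, sigma, 0))
    # Downsample by factor r via subsampling, giving a W x H x C tensor.
    return blurred[::r, ::r, :]
```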

2. Adversarial Network Architecture

  • The general idea behind this formulation is that it allows one to train a generative model G with the goal of fooling a differentiable discriminator D that is trained to distinguish super-resolved images from real images.
  • With this approach, the generator can learn to create solutions that are highly similar to real images and thus difficult for D to classify.
  • The min-max problem used to train D and G is reproduced below.
  • (For more information, please read GAN if interested.)
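
For completeness, the min-max objective as given in the paper, where θ_G and θ_D are the parameters of G and D:

```latex
\min_{\theta_G} \max_{\theta_D} \;
\mathbb{E}_{I^{HR} \sim p_{\mathrm{train}}(I^{HR})}\!\left[\log D_{\theta_D}(I^{HR})\right]
+ \mathbb{E}_{I^{LR} \sim p_{G}(I^{LR})}\!\left[\log\!\left(1 - D_{\theta_D}\!\big(G_{\theta_G}(I^{LR})\big)\right)\right]
```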

2.1. Generator Network G

Figure: generator network G architecture.
  • There are B residual blocks (B = 16), following the design of ResNet.
  • Within the residual block, two convolutional layers are used, with small 3×3 kernels and 64 feature maps followed by batch-normalization layers and ParametricReLU as the activation function.
  • The resolution of the input image is increased with two trained sub-pixel convolution layers.
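
A minimal PyTorch sketch of one residual block and one sub-pixel upsampling stage as described above; module names are my own, while the 3×3 kernels and 64 feature maps follow the paper's figure:

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """One of the B = 16 generator blocks: conv-BN-PReLU-conv-BN plus a skip."""
    def __init__(self, channels: int = 64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.PReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):
        return x + self.body(x)  # identity skip connection, as in ResNet

class UpsampleBlock(nn.Module):
    """Sub-pixel convolution: conv to 4x channels, then PixelShuffle(2).
    The generator stacks two of these for the overall 4x upscaling."""
    def __init__(self, channels: int = 64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels * 4, kernel_size=3, padding=1),
            nn.PixelShuffle(2),  # rearranges channels into a 2x larger grid
            nn.PReLU(),
        )

    def forward(self, x):
        return self.body(x)
```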

2.2. Discriminator Network D

Figure: discriminator network D architecture.
  • LeakyReLU activation (α = 0.2) is used, and max-pooling is avoided throughout the network.
  • The discriminator network is trained to solve the maximization problem.
  • The network contains eight convolutional layers with an increasing number of 3×3 filter kernels, increasing by a factor of 2 from 64 to 512 kernels as in the VGG network.
  • Strided convolutions are used to reduce the image resolution each time the number of features is doubled.
  • The resulting 512 feature maps are followed by two dense layers and a final sigmoid activation function to obtain a probability for sample classification.
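
A corresponding PyTorch sketch of the discriminator; nn.LazyLinear is my own convenience so the first dense layer adapts to the HR crop size (the paper's figure shows fixed dense(1024) and dense(1) layers):

```python
import torch.nn as nn

def conv_block(in_ch: int, out_ch: int, stride: int) -> nn.Sequential:
    """conv-BN-LeakyReLU unit repeated throughout the discriminator."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=stride, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.LeakyReLU(0.2),
    )

class Discriminator(nn.Module):
    """Eight conv layers, 64 -> 512 filters, strided convs instead of pooling."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=3, padding=1),  # first layer: no BN
            nn.LeakyReLU(0.2),
            conv_block(64, 64, stride=2),    # stride-2 convs halve resolution
            conv_block(64, 128, stride=1),
            conv_block(128, 128, stride=2),
            conv_block(128, 256, stride=1),
            conv_block(256, 256, stride=2),
            conv_block(256, 512, stride=1),
            conv_block(512, 512, stride=2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.LazyLinear(1024),  # dense layer; input size depends on crop size
            nn.LeakyReLU(0.2),
            nn.Linear(1024, 1),
            nn.Sigmoid(),         # probability that the input is a real HR image
        )

    def forward(self, x):
        return self.classifier(self.features(x))
```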

3. Perceptual Loss Function

  • The perceptual loss is the weighted sum of a content loss l^SR_X and an adversarial loss component:
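
In the paper's notation, with the adversarial term weighted by 10^-3:

```latex
l^{SR} = \underbrace{l^{SR}_{X}}_{\text{content loss}}
       + \underbrace{10^{-3}\, l^{SR}_{Gen}}_{\text{adversarial loss}}
```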

3.1. Content Loss

  • Instead of the pixel-wise MSE loss, SRGAN relies on the VGG loss (both are written out below this list).
  • φ_{i,j} indicates the feature map obtained by the j-th convolution (after activation) before the i-th max-pooling layer within the VGG19 network.
  • W_{i,j} and H_{i,j} describe the dimensions of the respective feature maps.
  • This VGG loss is the Euclidean distance between the feature representations of a reconstructed image G_{θG}(I^LR) and the reference image I^HR.
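
Both losses as defined in the paper, the pixel-wise MSE loss and the VGG loss on feature maps φ_{i,j}:

```latex
l^{SR}_{MSE} = \frac{1}{r^{2}WH} \sum_{x=1}^{rW} \sum_{y=1}^{rH}
\left( I^{HR}_{x,y} - G_{\theta_G}(I^{LR})_{x,y} \right)^{2}

l^{SR}_{VGG/i.j} = \frac{1}{W_{i,j} H_{i,j}}
\sum_{x=1}^{W_{i,j}} \sum_{y=1}^{H_{i,j}}
\left( \phi_{i,j}(I^{HR})_{x,y} - \phi_{i,j}\big(G_{\theta_G}(I^{LR})\big)_{x,y} \right)^{2}
```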

The content loss is motivated by perceptual similarity instead of similarity in pixel space.
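
A minimal PyTorch sketch of such a VGG feature loss, assuming torchvision's pretrained VGG19; the slice index 36 corresponds to the feature map after the activation of conv5_4, i.e. the φ_{5,4} used by SRGAN-VGG54 later in this review:

```python
import torch.nn.functional as F
from torchvision.models import vgg19

# Fixed, pretrained VGG19 feature extractor up to relu5_4 (phi_{5,4}).
vgg_features = vgg19(pretrained=True).features[:36].eval()
for p in vgg_features.parameters():
    p.requires_grad_(False)  # VGG stays frozen; only the generator is trained

def vgg_loss(sr, hr):
    """Euclidean distance between VGG feature maps of the SR and HR images."""
    return F.mse_loss(vgg_features(sr), vgg_features(hr))
```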

3.2. Adversarial Loss

  • The generative loss l^SR_Gen is defined based on the probabilities of the discriminator D_{θD}(G_{θG}(I^LR)) over all training samples (see below).
  • D_{θD}(G_{θG}(I^LR)) is the probability that the reconstructed image G_{θG}(I^LR) is a natural HR image.
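
The paper minimizes -log D rather than log(1 - D) for better gradient behavior:

```latex
l^{SR}_{Gen} = \sum_{n=1}^{N} -\log D_{\theta_D}\!\big(G_{\theta_G}(I^{LR})\big)
```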

The adversarial loss pushes our solution to the natural image manifold using a discriminator network that is trained to differentiate between the super-resolved images and original photo-realistic images.

4. Experimental Results

  • Three datasets are tested: Set5, Set14 and BSD100, with a scale factor of 4×, i.e. 16× reduction in image pixels.
  • All networks are trained on a NVIDIA Tesla M40 GPU using a random sample of 350 thousand images from the ImageNet database.
  • The MSE-based SRResNet network is trained first and used as initialization for the generator when training the actual GAN, to avoid undesired local optima.
  • The generator and discriminator networks are updated alternately, which is equivalent to k = 1 as in the original GAN; a sketch of one such step follows this list.
  • At test time, batch-normalization updates are turned off so that the output depends deterministically only on the input.
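
A minimal sketch of one such alternating update, assuming a generator G, discriminator D, and their optimizers already exist, and reusing the vgg_loss helper sketched above; the binary cross-entropy form of the discriminator update is the standard GAN recipe rather than a detail spelled out in the paper:

```python
import torch

bce = torch.nn.BCELoss()

def train_step(lr_img, hr_img, G, D, opt_G, opt_D):
    """One alternating update (k = 1): discriminator first, then generator."""
    real = torch.ones(hr_img.size(0), 1)
    fake = torch.zeros(hr_img.size(0), 1)

    # --- Discriminator update: separate real HR images from SR images ---
    opt_D.zero_grad()
    sr_img = G(lr_img).detach()        # no gradients into G for this step
    d_loss = bce(D(hr_img), real) + bce(D(sr_img), fake)
    d_loss.backward()
    opt_D.step()

    # --- Generator update: content (VGG) loss + 1e-3 * adversarial loss ---
    opt_G.zero_grad()
    sr_img = G(lr_img)
    adv = -torch.log(D(sr_img) + 1e-8).mean()   # epsilon for stability
    g_loss = vgg_loss(sr_img, hr_img) + 1e-3 * adv
    g_loss.backward()
    opt_G.step()
    return d_loss.item(), g_loss.item()
```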

4.1. Mean Opinion Score (MOS) Testing

  • 26 raters are asked to assign an integer score from 1 (bad quality) to 5 (excellent quality) to the super-resolved images.
Table: PSNR, SSIM and MOS of SRResNet and SRGAN.
  • As shown above, SRResNet obtains higher PSNR and SSIM.
  • But SRGAN obtains a much higher MOS, thanks to its more photo-realistic super-resolved images.
Figure: MOS scores on BSD100.
Table: SOTA comparison.
  • SRResNet obtains new SOTA results in terms of PSNR and SSIM, outperforming CNN approaches such as SRCNN, DRCN and ESPCN.
  • SRGAN obtains new SOTA results in terms of MOS, again outperforming the other CNN approaches.

4.2. Qualitative Results

  • SRGAN-MSE: the adversarial network using MSE as the content loss.
  • SRGAN-VGG22: the VGG loss on lower-level features (φ_{2,2}).
  • SRGAN-VGG54: the VGG loss on higher-level features (φ_{5,4}), which gives the most photo-realistic results.

During the days of coronavirus, I hope to write 30 stories this month to give myself a small challenge. This is the 20th story this month. Thanks for visiting my story…

Written by Sik-Ho Tsang

PhD, Researcher. I share what I learn. :) Linktree: https://linktr.ee/shtsang for Twitter, LinkedIn, etc.